Sunday, June 16, 2024

AI Ideology VI: Existential Risk Critique

I come to you with the final installment in my series on AI-related ideology and politics. In Part V, I tried to briefly lay out the argument for existential risk from AI, along with what I consider the weaker counterpoints. Today I will conclude the series with a discussion of the counterarguments I find more interesting.

The Alignment Problem does not strike me as intractable

All the dangers laid out in the previous article are associated with misaligned AI agents - that is, agents that do not (in a broad sense) want what humans want. If we could produce an agentive superintelligence that did want what we want, it would pursue our goals just as aggressively as hostile superintelligence is expected to work against them. So all the fear of doom evaporates if the Alignment Problem is feasible to solve, at or before the time when AGI first comes on the scene.

Even though Yudkowsky and his followers have had two decades or so to think about the Problem, he insists that "We are not prepared. We are not on course to be prepared in any reasonable time window. There is no plan. Progress in AI capabilities is running vastly, vastly ahead of progress in AI alignment ..." [1] My own intuitions about alignment don't match up with this. To me it seems like a potentially difficult problem, but not any harder than the capability problem, i.e. the puzzle of how to create any AGI at all. The foundations of human values are somewhat obscure to us for the same reasons the mechanisms of our own intelligence are obscure; if we can discover one, we can discover the other. How can it be accurate to say that nobody has a plan for this?

It's possible that I feel this way because the work I'm doing, as well as my personal ideas of the best path to AGI, have little to do with ANNs and ML. A fair bit of the hand-wringing about AI alignment reads to me like this: "Heaven forbid that we *design* an AI to fulfill our complex desires - that would be too much work. No, we have to stick to these easy processes that draw trained models from the primordial ooze without any need for us to understand or directly architect them. This lazy approach won't reliably produce aligned AIs! OH NO!"

Since all of my systems are designed on purpose and the code is human-intelligible, they already have the level of transparency that ANN-builders dream of getting. I don't have to worry about whether some subsystem I've built just happens to contain an agent that's trying to optimize for a thing I never wanted, because none of my subsystems are auto-generated black boxes. I don't do haphazard emergent stuff, and I think that's one reason I feel more comfortable with my AI than some of these people feel with the mainstream approaches.

A selection of articles pulled from Alignment Forum provides evidence that many issues Existential Risk Guardians have identified are tied to particular techniques:

"In general, we have no way to use RL to actually interpret and implement human wishes, rather than to optimize some concrete and easily-calculated reward signal." [2]

"For our purposes, the key characteristic of this research paradigm is that agents are optimized for success at particular tasks. To the extent that they learn particular decision-making strategies, those are learned implicitly. We only provide external supervision, and it wouldn't be entirely wrong to call this sort of approach 'recapitulating evolution', even if this isn't exactly what is going on most of the time.

As many people have pointed out, it could be difficult to become confident that a system produced through this sort of process is aligned - that is, that all its cognitive work is actually directed towards solving the tasks it is intended to help with. The reason for this is that alignment is a property of the decision-making process (what the system is 'trying to do'), but that is unobserved and only implicitly controlled." [3]

"Traditional ML algorithms optimize a model or policy to perform well on the training distribution. These models can behave arbitrarily badly when we move away from the training distribution. Similarly, they can behave arbitrarily badly on a small part of the training distribution ... If we understood enough about the structure of our model (for example if it reflected the structure of the underlying data-generating process), we might be confident that it will generalize correctly. Very few researchers are aiming for a secure/competitive/scalable solution along these lines, and finding one seems almost (but not completely) hopeless to me." [4]

We could undercut a lot of these problems by taking alternate paths that do a better job of truly replicating human intelligence, and permit easier examination of how the system is doing its work.

Members of the Existential Risk Guardian/Doomer faction also like to frame all goal-directed agency in terms of "maximizing expected utility." In other words, you figure out a mathematical function that represents the sum of all you desire, and then you order your behavior in a way that maximizes this function's output. This idea fits in well with the way current mainstream AI works, but there are also game-theoretic reasons for it, apparently. If you can frame your goals as a utility function and behave in a way that maximizes it, your results will be mathematically optimal, and other agents will be unable to take advantage of you in certain ways when making bets or deals. Obviously we humans don't usually implement our goals this way. But since this is, in some theoretical sense, the "best" way to think about goals, the Doomers assume that any superintelligence would eventually self-improve into thinking this way. If its goals were not initially specified as a utility function, it would distill them into one. [5]
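To make the "expected utility" framing concrete, here's a tiny illustrative sketch in Python. Everything in it - the actions, the probabilities, the utility numbers - is invented for the example; the only point is the decision rule itself:

```python
# A toy "expected utility maximizer." The utility function, the available
# actions, and the outcome probabilities are all made up for illustration;
# the point is only the decision rule: pick whichever action maximizes
# the probability-weighted sum of utilities.

def utility(outcome):
    # Stand-in for "a mathematical function that represents the sum of
    # all you desire." Here it is just a lookup table.
    return {"cake": 10.0, "bread": 4.0, "nothing": 0.0}[outcome]

# Each action leads to outcomes with some probability.
actions = {
    "bake_carefully": [("cake", 0.6), ("bread", 0.3), ("nothing", 0.1)],
    "bake_quickly":   [("cake", 0.2), ("bread", 0.5), ("nothing", 0.3)],
}

def expected_utility(action):
    return sum(p * utility(o) for o, p in actions[action])

best = max(actions, key=expected_utility)
print(best, expected_utility(best))  # -> bake_carefully (expected utility of about 7.2)
```

The Doomer worry, stated in these terms, is that everything humans care about would have to be packed into something like that little utility table before the maximization step gets scaled up to superintelligent levels.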

Hence the Doomers think we must reduce the Alignment Problem to finding a utility function which, when maximized, yields a world that humans would find congenial. And the big fear arises from our not knowing how to do this. Our preferences and interests don't seem to be natively given as a mathematical function, and it is difficult to imagine transforming them into one without a large possibility for error.

Many struggles can be avoided by modeling human values in a more "natural" way: looking for methods of grounding concepts like deprivation/satisfaction, empathy, reciprocity, and fairness, instead of trying to reduce everything to a function. Technically it is possible to view any agent as maximizing some utility function, regardless of how it models its goals internally [6], but this is not necessarily the most useful or transparent way to frame the situation!

And I consider it safe to model goals in the more "natural" way, because an ASI on a self-improvement quest would also recognize that 1) framing goals in terms of utility maximization, while theoretically optimal, is not always practical and 2) transforming goals from one specification into another carries potential for error. Since one of the convergent instrumental goals of any such self-aware agent is goal preservation, the ASI would be just as wary of these transformation errors as we are!

The alignment concern that seems most directly relevant to the work I'm personally doing is the possibility of oversimplified goal specification. But there are a number of strategies for managing this that I consider sound:

* Treat the AI's value system as a system - with all the potential for complexity that implies - instead of expecting just a few objectives or rules to carry all the weight.

* Embed into the AI some uncertainty about the quality of its own goal specifications, and a propensity to accept feedback on its actions or even adjustment of the goal specs. This is a form of "corrigibility" - keeping the AI sensitive to human opinions of its performance after it is complete and operational. (A toy sketch of these first two strategies appears after this list.)

* Specify goals in an indirect manner so that the AI's concept of the goals will advance as the AI's general skill advances. For instance, provide goals in natural language and associate the AI's objective with their "real meaning," rather than trying to translate the goals into code, so that the AI's understanding of the goals can be enriched by improved understanding of human communication.
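As a purely hypothetical illustration of the first two strategies - a value *system* made of multiple components, plus built-in uncertainty and a willingness to take feedback - here is a toy sketch. Every drive name and number is made up; this is not a description of my actual architecture or anyone else's:

```python
# Toy sketch: a value system made of several weighted "drives," each with a
# confidence score the agent treats as revisable. Human feedback nudges the
# confidence (and hence the influence) of a drive rather than being ignored.
# All names and numbers here are hypothetical.

from dataclasses import dataclass

@dataclass
class Drive:
    name: str
    weight: float       # how much this drive counts in decisions
    confidence: float   # agent's estimate that its spec of this drive is right

class ValueSystem:
    def __init__(self, drives):
        self.drives = {d.name: d for d in drives}

    def score(self, appraisals):
        # appraisals: how well a proposed action satisfies each drive, in [0, 1].
        # Low-confidence drives contribute less, so a badly specified goal
        # cannot quietly dominate the decision.
        return sum(d.weight * d.confidence * appraisals.get(d.name, 0.0)
                   for d in self.drives.values())

    def accept_feedback(self, drive_name, approved):
        # Corrigibility hook: human disapproval lowers confidence in that
        # drive's current specification; approval raises it, capped at 1.
        d = self.drives[drive_name]
        d.confidence = min(1.0, d.confidence + 0.1) if approved else max(0.1, d.confidence - 0.2)

values = ValueSystem([
    Drive("satisfy_user_request", weight=1.0, confidence=0.9),
    Drive("avoid_harm",           weight=2.0, confidence=0.8),
    Drive("be_honest",            weight=1.5, confidence=0.9),
])

plan = {"satisfy_user_request": 0.9, "avoid_harm": 0.4, "be_honest": 1.0}
print(values.score(plan))
values.accept_feedback("satisfy_user_request", approved=False)
```

The design point is simply that no single mis-specified objective gets to dominate, and human disapproval has a standing channel through which to reduce a drive's influence.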

In short, I don't think the Doomers have proven that agentive AI is universally dangerous. Their arguments so far are focused on a subset of possible design pathways, none of which I am following.

This should go some way toward explaining why I'm not terribly worried about my own work (even if it gets anywhere near becoming AGI, and I make no claims that it will). But what about all the mainstream AI efforts that are rushing ahead as I write? Those don't manage to frighten me much either, but for different reasons.

I'm skeptical of intelligence explosion hazards

As noted in the previous article, one pillar of Doomer fears is the notion that AGI will probably become ASI, perhaps very quickly. A misaligned AGI has the power level of a bad human, and we already have plenty of those, so it is nothing to be especially worried about. Real danger requires a path from AGI to ASI. Let's think a little more about the most frightening type of ASI: qualitative superintelligence. Recall that this variety of supermind would use modes of thinking so exotic, and so much better than ours, that we couldn't even grasp how it thinks.

The usual assumption is that human engineers will not produce qualitative ASI directly. Instead, an AGI will bootstrap itself to that level by re-engineering its own mental processes. Is this really plausible? Can an agent just make itself smarter in a vacuum?

Imagine for a moment that you are a jumping spider, and a member of the Salticid Intelligence Acceleration Consortium. Other spiders bring you food so that you can do nothing but sit around and Think Really Hard about how to be smarter. You'd like to invent abstract logic, meta-knowledge, long-term planning, and all the other cool cognitive abilities that certain big mammals have. Except ... you don't even have names for these things. You don't even have concepts for these things. If you knew what they were - if you even knew which direction to be going to improve toward them - you'd already have them. So how, exactly, are you going to think your way there? How are you to think about the thoughts you cannot think?

"Bootstrap" is actually a rather ironic expression[7], because pulling on your own bootstraps won't lift you anywhere. And spending a long time thinking at a low level won't necessarily get you to a higher level.

If you set out to design an AI deliberately, you're using your own intelligence to produce intelligence in a machine. If you train an ANN on data that was produced or labeled by humans, that's also a way of putting human intelligence into a machine. Large language models derive their smarts (such as they are) from all the knowledge people have previously encoded in the huge piles of text used as their training data. Supervised reinforcement learners also benefit from the intelligence of the human supervisor poking the reward button. Even evolutionary algorithms can glean intelligence from the design of the evaluator that determines "fitness." [8] So none of these approaches are really conjuring intelligence out of nothing; they're descendants of pre-existing human intelligence (perhaps in an awkward, incomplete, or flawed way).

So then: what could be smart enough to write a program smarter than itself? And from where shall our future AGIs get the dataset to train a superintelligent ANN? Doesn't it stand to reason that you might need data produced by superintelligences? (From what we've seen so far, training a new generation of ML AI on a previous generation's output can actually make the new generation worse. [9]) When humans try to develop AGI, they're making something from something. The idea of AGI developing qualitative ASI emits the uncomfortable odor of a "something from nothing" fantasy. "Just stir the giant vat of math around enough, and superintelligence will crawl out! It won't even take millions of years!" Heh. I'll believe that one when I see it.

Programs like AlphaStar, which learn by playing against themselves, are the one example I can think of that seems to develop intelligence without much human input beyond the learning algorithm itself. But they are still utilizing a resource, namely practice; they learn from experience. Their advantage lies in their ability to practice very, very fast. Video games lend themselves well to that sort of thing, but is it possible to practice general reasoning in the same fashion? It's harder to iterate rapidly if you have to learn about doing anything in the physical world, or learn about the psychology of humans. You'd need a high-fidelity simulator (which, by itself, would take a lot of work to develop). And even then, the AGI wouldn't discover anything that humans and existing AGIs don't already know about the universe, because nobody would be able to build those unknown properties into the simulation.
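For readers who haven't seen self-play up close, here's a deliberately tiny sketch of the idea - a toy game and a simple regret-matching learner rather than anything AlphaStar-scale; all of it is illustrative only:

```python
import random

# Minimal self-play sketch (illustrative only): two copies of the same learner
# play rock-paper-scissors against each other many times, using regret
# matching. No human data is involved; the "resource" being consumed is cheap,
# fast practice.

ACTIONS = 3  # 0 = rock, 1 = paper, 2 = scissors

def payoff(a, b):
    # +1 win, 0 tie, -1 loss for the first player.
    return 0 if a == b else (1 if (a - b) % 3 == 1 else -1)

def strategy(regrets):
    # Play each action in proportion to its accumulated positive regret.
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def sample(probs):
    return random.choices(range(ACTIONS), weights=probs)[0]

regrets = [[0.0] * ACTIONS, [0.0] * ACTIONS]
strategy_sum = [[0.0] * ACTIONS, [0.0] * ACTIONS]

for _ in range(200_000):
    strats = [strategy(regrets[0]), strategy(regrets[1])]
    moves = [sample(strats[0]), sample(strats[1])]
    for i in range(2):
        opp = moves[1 - i]
        base = payoff(moves[i], opp)
        for a in range(ACTIONS):
            regrets[i][a] += payoff(a, opp) - base  # regret for not playing a
            strategy_sum[i][a] += strats[i][a]

avg = [s / sum(strategy_sum[0]) for s in strategy_sum[0]]
print(avg)  # drifts toward the 1/3, 1/3, 1/3 equilibrium
```

Note what the learner does and doesn't get out of this: it converges on good play for this one closed game purely by burning cycles, but nothing in the loop teaches it a single fact about the world outside the game.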

The one thing an AGI might get by sitting around and putting billions of cycles into thinking, would be new branches of philosophy and mathematics. And some of those might lead to methods for idealized formal reasoning, in the same way Game Theory does. But were our own past advances in those areas enough to make us superintelligent relative to the generations of humans who came before them?

Even if Qualitative Superintelligence is unlikely, that leaves Speed Superintelligence and Collective Superintelligence on the table. And both of these are much more straightforward to obtain, given that they only require scaling. But they also make the Alignment Problem easier. Now our AI researcher isn't faced with the need to analyze or supervise an entity smarter than himself, or the need to prevent his AGI from mutating into such an entity with possible loss of fidelity. He only has to align an AGI which is as smart as himself, or perhaps slightly less. Copying and speedup multiply the abilities of the seed AGI *without* modifying it. So if the seed AGI is well-aligned, the resulting Collective or Speed ASI should be as well.

Note that we already have collective general intelligences, in the form of corporations, governments, and other organizations which contain large numbers of humans working toward a common cause. Some of these are even misaligned. For example, the desired goal of a corporation is "produce goods and services beneficial to potential customers," but the actual goal is "maximize profit returned to the owners or shareholders." Profit functions as a proxy for public benefit in an idealized free market, but frequently diverges from it in the real world, and I'm sure I don't need to go into the multitude of problems this can cause. And yet, despite those problems ... here we still are. Despite the large number of human collectives that have formed and dispersed over the course of history, not one of them has developed fanatic optimizer tendencies and proceeded to rapidly and utterly destroy the world.

Note also that AGI's ability to scale will probably not be unlimited. Increasing the number of copies working together increases coordination overhead. Increasing the total amount of data processed increases the challenges of memory management to store and retrieve results. There are hard physical limits on the speed and density of computational hardware that we will hit eventually.
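One conventional way to put a number on the coordination-overhead point - admittedly a simplification borrowed from parallel computing rather than anything specific to AGI - is Amdahl's law: if some fraction of the work is inherently serial, adding more workers hits sharply diminishing returns. A quick illustrative calculation:

```python
# Illustrative only: Amdahl's law, a standard parallel-computing formula,
# applied to the "more copies working together" idea. If a fraction s of the
# work is inherently serial (coordination, waiting on shared results), the
# speedup from n copies is capped at 1/s no matter how large n gets.

def amdahl_speedup(n_copies, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_copies)

for n in (10, 100, 1000, 1_000_000):
    print(n, round(amdahl_speedup(n, serial_fraction=0.05), 1))
# With even 5% serial coordination work, the ceiling is a 20x speedup:
# 10 -> 6.9x, 100 -> 16.8x, 1000 -> 19.6x, 1,000,000 -> ~20x
```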

I'm skeptical of spontaneously emerging harmful properties

The type of AGI that Doomers expect to balloon into a hostile ASI is actually pretty specific. I agree that, given the use of mainstream ML methods, it is reasonably easy to accidentally produce a trained model that is maximizing some function other than the one you want. However, the deadly scenario also requires that this model 1) be an agent, capable of generalizing to very open-ended ways of maximizing its function, and 2) have embedded situational awareness. That is, it must have a sense of self, knowing that it is an agent in an environment, and knowing that the environment includes a whole world outside its computer for it to operate upon. It is only this combination of properties that can give an AI ideas like "I should repave the whole Earth with computronium."

The corporate AI systems provided to the public as tools do not, so far, have these properties. For something like ChatGPT, the whole world consists of its prompts and the text it generates to complete them. No matter how intelligently future GPT iterations complete text, there's no reason to think that something in there is strategizing about how to manipulate its users into doing things in the human world that will make GPT even better at completing text. GPT's internal architecture simply doesn't provide for that. It doesn't have an operational concept of the human world as an environment and itself as a distinct actor therein. It just knows a whole awful lot about text. Asking an LLM to simulate a particular kind of speaking character can produce at least the appearance of self-aware agency, but this agency is with respect to whatever scenario the user has created in the prompt, not with respect to the "real world" that we inhabit.
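To make "the whole world consists of its prompts and the text it generates" concrete, here's a heavily simplified sketch of the generation loop such systems run. The "model" is a toy bigram table, not a real LLM, and none of the names are a real vendor API; the shape of the loop is the only point:

```python
# Heavily simplified, purely illustrative sketch of autoregressive text
# generation. The "model" here is a toy bigram table built from a few words;
# real LLMs are vastly larger, but the shape of the loop is what matters:
# the system's entire "environment" is the token sequence it is extending.

import random
from collections import defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Training": count which word follows which.
follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

def generate(prompt_word, max_new_tokens=8):
    tokens = [prompt_word]
    for _ in range(max_new_tokens):
        candidates = follows.get(tokens[-1])
        if not candidates:                        # no known continuation: stop
            break
        tokens.append(random.choice(candidates))  # predict the next token
    return " ".join(tokens)

print(generate("the"))
# The loop never consults or acts on anything outside `tokens`; scaling the
# model up changes the quality of the predictions, not the shape of the loop.
```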

So if OpenAI and Google and Meta keep churning out systems that follow this same design pattern, where's the cause for alarm? It seems Doomers are worried that self-aware, situationally-aware agents will be produced spontaneously during machine learning processes - even without deliberate effort to train for, select for, or reward them - just because they enable the most extreme maximization of any objective.

This bothers me in much the same way the "superintelligence by bootstraps" argument bothers me. Where would these properties or systems come from? Why would they just pop out of nowhere, unasked for? José Luis Ricón describes the conclusion of a discussion he had about this issue, and gathers that the disagreement comes down to differences of intuition about how ML processes work, and what they can reasonably be expected to produce. [10] Doomers expect that an unwanted level of situational awareness would just appear unaided. I do not.

The Doomer counter to this kind of argument is, "But if you can't guarantee (e.g. by a formal proof) that it won't happen by chance, you should still be worried. Misaligned ASI would be so terrible that even a remote possibility of it should justify policy changes." No matter how speculative their nightmare scenario is, they use the severity of the consequences to push the burden of proof onto their opponents. Is this reasonable? You decide.

I have doubts that the path mainstream AI is on will get us to AGI anyway

If the state-of-the-art but still sloppy approaches that are currently in vogue don't succeed in producing AGI, the various teams working in the field will have to abandon or reform them ... hopefully for methods that make the Alignment Problem easier. Despite some remarkable recent progress, I suspect the abilities of present-day AI systems and their nearness to AGI have been over-hyped. I don't have enough time to go into a detailed discussion of that here, so let's just say I'm far from the only person with this opinion. [11][12][13][14][15][16]

This means that I have a "long timeline" - that is, I don't think AGI is coming very soon, so I expect we'll get more time to work on the Alignment Problem. But it also means that I expect the difficulty of the Problem to drop as AI development is driven toward sounder methods.

Thus ends (for now) my discussion of politics, ideology, and risk perception in the AI enthusiast subcultures. Whatever opinions you come away with, I hope this has been informative and left you with a better sense of the landscape of current events.

Until the next cycle,
Jenny

[1] Yudkowsky, Eliezer. "Pausing AI Developments Isn’t Enough. We Need to Shut it All Down." TIME Magazine.  https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough/

[2] Christiano, Paul. "Prosaic AI Alignment." Alignment Forum. https://www.alignmentforum.org/s/EmDuGeRw749sD3GKd/p/YTq4X6inEudiHkHDF

[3] Stuhlmüller, Andreas. "Factored Cognition." Alignment Forum. https://www.alignmentforum.org/s/EmDuGeRw749sD3GKd/p/DFkGStzvj3jgXibFG

[4] Christiano, Paul. "Directions and desiderata for AI alignment." Alignment Forum. https://www.alignmentforum.org/s/EmDuGeRw749sD3GKd/p/kphJvksj5TndGapuh

[5] Shah, Rohin. "Coherence arguments do not entail goal-directed behavior." Alignment Forum. https://www.alignmentforum.org/s/4dHMdK5TLN6xcqtyc/p/NxF5G6CJiof6cemTw

[6] Shah, Rohin. "Conclusion to the sequence on value learning." Alignment Forum. https://www.alignmentforum.org/s/4dHMdK5TLN6xcqtyc/p/TE5nJ882s5dCMkBB8

[7] Bologna, Caroline. "Why The Phrase 'Pull Yourself Up By Your Bootstraps' Is Nonsense: The interpretation of the phrase as we know it today is quite different from its original meaning." The Huffington Post. https://www.huffpost.com/entry/pull-yourself-up-by-your-bootstraps-nonsense_n_5b1ed024e4b0bbb7a0e037d4

[8] Dembski, William A. "Conservation of Information - The Idea." Evolution News & Science Today. https://evolutionnews.org/2022/06/conservation-of-information-the-idea/

[9] Dupré, Maggie Harrison. "AI Loses Its Mind After Being Trained on AI-Generated Data." Futurism. https://futurism.com/ai-trained-ai-generated-data

[10] Ricón, José Luis. "The situational awareness assumption in AI risk discourse, or why people should chill." Nintil (2023-07-01). https://nintil.com/situational-awareness-agi/.

[11] Marcus, Gary. "AGI by 2027? Fun with charts." Marcus on AI. https://garymarcus.substack.com/p/agi-by-2027

[12] Brooks, Rodney. "Predictions Scorecard, 2024 January 01." Rodney Brooks: Robots, AI, and other stuff.  https://rodneybrooks.com/predictions-scorecard-2024-january-01/

[13] Bender, Emily M. "On NYT Magazine on AI: Resist the Urge to be Impressed." Medium blog of user @emilymenonbender. https://medium.com/@emilymenonbender/on-nyt-magazine-on-ai-resist-the-urge-to-be-impressed-3d92fd9a0edd

[14] Piekniewski, Filip. "AI Psychosis." Piekniewski's blog. https://blog.piekniewski.info/2023/02/07/ai-psychosis/

[15] Moore, Jennifer. "Losing the imitation game." Jennifer++. https://jenniferplusplus.com/losing-the-imitation-game/

[16] Castor, Amy and Gerard, David. "Pivot to AI: Pay no attention to the man behind the curtain." Amy Castor (personal website/blog). https://amycastor.com/2023/09/12/pivot-to-ai-pay-no-attention-to-the-man-behind-the-curtain/
