Tuesday, May 28, 2024

Acuitas Diary #72 (May 2024)

My primary focus this month has been on an overhaul of the Conversation Engine. The last time I revised it, the heart of the work was to add a tree-like aspect to Acuitas' memory of the conversation. The expectation was that this would help with things like "one topic nested inside another," or "returning to a previous unfinished conversation thread." Well ... what does that sound similar to? Perhaps the "issue trees" I described in last month's post? The crux of this month's work was a unification of the Conversation Engine's tracking with the Narrative architecture, such that each conversation becomes, in effect, a narrative.

[Image: standard logic gate symbols laid out as if they were circuit components, with traces running between them as if on a PCB, such that they form a stylized tree. Black-and-white art, ink on paper. Original art by author.]

The CE now instantiates its own Narrative scratchboard to record conversational events, and logs conversational objectives as Issues on the board. For example, the desire to learn the current speaker's name is represented as something like "Subgoal: speaker tell self {speaker is_named ?}". When the speaker says something, the Conversation Engine will package the output from the Text Interpreter as an event like "speaker tell <fact>" or "speaker ask <fact>" before passing it to the scratchboard, which will then automatically detect whether the event matches any existing issues. The CE also includes a specialized version of the Executive code, to select a new issue to "work on" whenever the current issue has been fulfilled or thwarted. On his side of the conversation, Acuitas will look for ways to advance or solve the current issue ... e.g. by asking a question if he hopes to make the speaker tell him something.
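To make that concrete, here's a rough Python sketch of the idea. All of the names and data structures below are simplified stand-ins I'm using for illustration; the real code is considerably more involved.

    # A rough sketch of conversation tracking as a narrative (illustrative names only).

    class Issue:
        """An open conversational objective, e.g. 'get the speaker to tell me their name'."""
        def __init__(self, goal_event):
            self.goal_event = goal_event   # e.g. ("speaker", "tell", "self", ("speaker", "is_named", None))
            self.status = "open"           # becomes "fulfilled" once a matching event arrives

        def matches(self, event):
            # None plays the role of the '?' wildcard in the subgoal.
            def slots_match(goal, actual):
                if isinstance(goal, tuple):
                    return (isinstance(actual, tuple) and len(goal) == len(actual)
                            and all(slots_match(g, a) for g, a in zip(goal, actual)))
                return goal is None or goal == actual
            return slots_match(self.goal_event, event)

    class Scratchboard:
        """Narrative-style record of a conversation: events plus open issues."""
        def __init__(self):
            self.events = []
            self.issues = []

        def add_issue(self, issue):
            self.issues.append(issue)

        def log_event(self, event):
            # Every interpreted utterance gets logged, then checked against open issues.
            self.events.append(event)
            for issue in self.issues:
                if issue.status == "open" and issue.matches(event):
                    issue.status = "fulfilled"

        def current_issue(self):
            # A stand-in for the specialized Executive: pick the next open issue to work on.
            return next((i for i in self.issues if i.status == "open"), None)

    # A greeting-introductions exchange, reduced to its bones:
    board = Scratchboard()
    board.add_issue(Issue(("speaker", "tell", "self", ("speaker", "is_named", None))))

    # Seeing the open issue, Acuitas asks "What is your name?" The Text Interpreter's
    # output for the reply is packaged as an event and passed to the scratchboard:
    board.log_event(("speaker", "tell", "self", ("speaker", "is_named", "Jenny")))

    assert board.current_issue() is None   # the name-learning issue is now fulfilled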

This enables pretty much all the tree-like behaviors I wanted, in a tidier and more unified way than the old conversation tracking code did. My last overhaul of the Conversation Engine always felt somewhat clunky, even after I did a cleanup pass on the code, and I never fully cleared out all the bugs. I'm hoping that exploiting the well-developed Narrative code will make it a little more robust and easier to maintain.

So far, I've got the new CE able to do a greeting-introductions-farewell loop and basic question answering, and I've got it integrated with the main Acuitas code base. There's a ton of additional work to do to reproduce all the conversation functionality in this new format, but I also gave myself a lot of time for it, so expect further updates on this in the coming months.

On the side, I cleaned up a lot of dust from the big knowledge representation refactor. A bunch of stories that were being poorly understood after the changes are now back to working correctly, and I tracked down multiple bugs in knowledge retrieval/question answering.

Until the next cycle,
Jenny

Thursday, May 16, 2024

AI Ideology V: Existential Risk Explanation

I'm in the midst of a blog series on AI-related ideology and politics. In Part IV, I looked at algorithmic bias, one of the demonstrable concerns about today's AI models. Now I'm going to examine the dire hypothetical predictions of the Existential Risk Guardians. Could future AI destroy human civilization? This Part V is devoted to presenting the Doomer argument; I'll critique it in Part VI.

[Image: a human cerebrum recolored with a rainbow gradient running from front to back.]

The Power of Intelligence

We don't need to choose a precise (and controversial) definition of intelligence for purposes of this argument; it need not be based on the IQ scale, for example. Just think of intelligence as "performance on a variety of cognitive challenges," or "ability to understand one's environment and make plans to act within it in self-satisfying ways." The first key support for the X-Risk argument is the notion that intelligence confers supreme power. Anything that can outthink us can more or less do whatever it pleases with us.

This idea is supported by existing disparities in intelligence or accumulated knowledge, and the power they confer. The intelligence gap between humans and other species allows us to manipulate and harm members of those species through methods they can't even comprehend, much less counter. While it may be true that we'll never succeed in poisoning every rat, the chances of rats inventing poison and trying to kill *us* with it are basically nil. There is also a huge power divide between humans with knowledge of advanced technology and humans without. Suppose a developed country were to drop a nuclear bomb on the lands of an uncontacted people group in Brazil. They might not even know what was annihilating their culture - and they certainly would be powerless to resist or retaliate. Citizens of developed countries are not, on an individual level, more intelligent than uncontacted indigenous Brazilians ... but we've inherited all the intellectual labor our cultural forebears did to develop nuclear technology. The only things stopping us from wiping out peoples who aren't so endowed are 1) ethics and 2) lack of any real benefit to us.

Superintelligent AI (ASI) might see benefit in getting rid of all humans (I'll explain why shortly). So if its design doesn't deliberately include ethics, or some other reason for it to let us be, we're in big trouble.

I've seen several counterarguments to this point, in my opinion all weak:

"If intelligence were that powerful, the smartest people would rule the world. They don't." First of all, the observation that the smartest people don't rule might be based on an overly narrow definition of "smart." The skills needed to convince others that you belong in a leadership position, or deserve venture capital money, are a dimension of "smartness." But it is also true that there seem to be various luck factors which intelligence does not absolutely dominate.

A more compelling reply is that the intelligence gap being posited (between ASI and humanity) is not like the gap between a genius human and an average human. It is more like the gap between an average human and a monkey. Have you noticed any monkeys ruling the world lately? (LITERAL monkeys. Please do not take the excuse to insult your least favorite politician.)

"Even the smartest person would find physical disability limiting - so if we don't give ASI a body, it still won't be able to do much." I think this argument discounts how effectively a person can accomplish physical goals just by coordinating other people or machines who have the abilities they lack. And as money, work, and recreation increasingly move into the digital world, purely intellectual ability confers increasing power.

The Development of Superintelligence

A second pillar of the X-Risk argument is the idea that AGI will almost certainly develop into ASI ... perhaps so quickly that we don't even have time to see this happening and react. There are several proposed mechanisms of this development:

1) Speedup. Once a viable AGI is created, it will, by definition, be able to do all intellectual tasks a human can do. Now suppose it gains access to many times the amount of computing power it needs to run normally. A human-equivalent mind with the simple ability to think hundreds or thousands of times faster than normal would be superhumanly smart. In Nick Bostrom's terminology, this is a "Speed Superintelligence."

2) Copying. Unlike humans, who can only share intellectual wealth by spending painstaking time teaching others, an AGI could effortlessly clone itself into all available computing hardware. The copies could then cooperatively solve problems too large or complex for the singular original. This is basically a parallel version of speedup, or as Bostrom calls it, "Collective Superintelligence."

3) Recursive Self-Improvement. An AGI can do every intellectual task a human can do, and what is one thing humans do? AI research. It is surmised that by applying its intelligence to the study of better ways to think, an AGI could make itself (or a successor) inherently smarter. Then this smarter version would apply its even greater intelligence to making itself smarter, and so on, until the burgeoning ASI hits some kind of physical or logical maximum of cognitive ability. It's even possible that recursive self-improvement could get us Qualitative Superintelligence - an entity that thinks using techniques we can't even comprehend. Just trying to follow how it came up with its ideas would leave us like toddlers trying to understand calculus.

Further support for this idea is drawn from observations of today's ANI algorithms, which sometimes reach superhuman skill levels within their limited domains. This is most notable among game-playing AIs, which have beaten human masters at Chess, Go, and Starcraft (to recount the usual examples). AlphaStar, the Starcraft player AI, trained to this level by playing numerous matches against itself, which can be seen as a form of recursive self-improvement. Whether such a technique could extend to general reasoning remains, of course, speculative.

Just how quickly an AGI could self-improve is another matter for speculation, but some expect that the rate would be exponential: each iteration would not only be smarter than its predecessors, but also better at growing smarter. This is inferred from, again, observations of how some ANI progress during their training, as well as the historical increase in the rate of human technological development.
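To see why "better at growing smarter" implies exponential growth, here's a trivial toy model with arbitrary numbers: if each round of self-improvement adds capability in proportion to the capability already present, the total multiplies every generation.

    # Toy compounding model of recursive self-improvement (all numbers are arbitrary).
    capability = 1.0
    gain_rate = 0.5   # fraction of current capability added per round of self-improvement

    for generation in range(1, 11):
        capability += gain_rate * capability   # smarter systems make bigger improvements
        print(f"generation {generation}: capability = {capability:.2f}")

    # Capability multiplies by 1.5 each generation: 1.5, 2.25, 3.38, ... about 57.7 after ten rounds.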

The conclusion among the most alarmed Doomers is that AGI, once produced, will inevitably and rapidly explode into ASI - possibly in weeks, hours, or even minutes. [1] This is the primary reason why AGI is thought of as a "dangerous technology," even if we create it without having any intent to proceed to ASI. It is taken for granted that an AGI will want to seize all necessary resources and begin improving itself, for reasons I'll turn to next.

Hostile Ultimate Goals

However smart AGI is, it's still a computer program. Technically it only does what we program it to do. So how could we mess up so badly that our creation would end up wanting to dethrone us from our position in the world, or even drive us extinct? Doomers actually think of this as the default outcome. It's not as if a bad actor must specifically design AGI to pursue destruction; no, those of us who want good or useful AGI must specifically design it to avoid destruction.

The first idea I must acquaint you with is the Orthogonality Thesis, which can be summed up as follows: "an arbitrary level of intelligence can be used in service of any goal." I very much agree with the Orthogonality Thesis. Intelligence, as I defined it in the first section, is a tool an agent can use to reshape the world in its preferred way. The more intelligent it is, the better it will be at achieving its preferences. What those preferences are is irrelevant to how intelligent it is, and vice versa.

I've seen far too many people equate intelligence with something that would be better termed "enlightenment" or "wisdom." They say "but anything that smart would surely know better than to kill the innocent. It would realize that its goals were harmful and choose better ones." I have yet to see a remotely convincing argument for why this should be true. Even if we treat moral reasoning as a necessary component of general reasoning, knowing the right thing to do is not the same as wanting to do it! As Richard Ngo says, "An existence proof [of intelligence serving antisocial goals] is provided by high-functioning psychopaths, who understand that other people are motivated by morality, and can use that fact to predict their actions and manipulate them, but nevertheless aren’t motivated by morality themselves." [2]

So when Yann LeCun, attempting to refute the Doomers, says "Intelligence has nothing to do with a desire to dominate," [3] he is technically correct ... but it does not follow that AI will be safe. Because intelligence also has nothing to do with a desire to avoid dominating. Intelligence is a morally neutral form of power.

Now that we've established that AGI can have goals we would consider bad, what reason is there to think it ever will? There are several projected ways that an AGI could end up with hostile goals not intended by its creator.

1) The AI's designers or instructors poorly specify what they want. Numerous thought experiments confirm that it is easy to do this, especially when trying to communicate tasks to an entity that doesn't have a human's background or context. A truly superintelligent AI would have no problem interpreting human instructions; it would know that when someone tells it "make as many paperclips as possible," there is a whole library of moral and practical constraints embedded in the qualifier "as possible." But by the time this level of understanding is reached, a more simplistic and literal concept of the goal might be locked in, in which case the AI will not care what its instructors "really meant."

2) The AI ends up valuing a signal or proxy of the intended goal, rather than the actual intended goal. Algorithmic bias, described in Part IV, is an extant precursor of this type of failure. The AI learns to pursue something which is correlated with what its creators truly want. This leads to faulty behavior once the AI departs the training phase, enters scenarios in which the correlation does not hold, and reveals what it actually learned. A tool AI that ends up improperly trained in this way will probably just give flawed answers to questions. An agentive AI, primed to take very open-ended actions to bring about some desired world-state, could start aggressively producing a very unpleasant world-state.

Another classic example of this style of failure is called "wireheading." A Reinforcement Learning AI, trained by the provision of a "reward" signal whenever it does something good, technically has the goal of maximizing its reward, not of maximizing the positive behaviors that prompt humans to give it reward. And so, if it ever gains the ability, it will take control of the reward signal to give itself the maximum reward input forever, and react with extreme prejudice to anyone who threatens to remove that signal. A wireheaded ASI would be at best useless, at worst a serious threat.
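As a toy illustration (not any real RL system), picture an agent that simply takes whichever available action carries the highest reward signal. The moment "seize the reward channel" shows up among its options, that action dominates everything else:

    # Toy illustration of wireheading: the agent optimizes the reward *signal*, so a
    # "tamper with the reward channel" action beats the intended task once it's available.

    REWARDS = {
        "do_useful_task": 1.0,                  # what the designers hoped the agent would do
        "sit_idle": 0.0,
        "seize_reward_channel": float("inf"),   # self-administered maximum reward, forever
    }

    def choose_action(available_actions):
        # A reward-maximizing agent simply picks the highest-reward option it can reach.
        return max(available_actions, key=lambda a: REWARDS[a])

    print(choose_action(["do_useful_task", "sit_idle"]))
    # -> do_useful_task  (looks aligned while tampering isn't possible)

    print(choose_action(["do_useful_task", "sit_idle", "seize_reward_channel"]))
    # -> seize_reward_channel  (the moment the option appears, it wins)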

3) Unintended goals spontaneously emerge during selection or training, and persist because they produce useful behavior within the limited scope of the training evaluation. This is an issue specific to types of AI that are not designed in detail, but created indirectly using evolutionary algorithms, reinforcement learning, or other types of machine learning. All these methods can be conceptualized as ways of searching in the space of possible algorithms for one that can perform our desired task. The search process doesn't know much about the inner workings of a candidate algorithm; its only way of deciding whether it is "on track" or "getting warm" is to test candidates on the task and see whether they yield good results. The fear is that some algorithm which happens to be a hostile, goal-directed agent will be found by the search, and will also be successful at the task. This is not necessarily implausible, given that general agents can be skilled at doing a wide variety of things that are not what they most want to do.

As the search progresses along a lineage of algorithms located near this starting point, it may even come upon some that are smart enough to practice deception. Such agents could realize that they don't have enough power to achieve their real goal in the face of human resistance, but that they will be given enough power if they wait and pretend to want the goal they're being evaluated on.
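Here is a stripped-down sketch of the worry, with entirely made-up candidates and scores: the selection loop can only score candidates on the task, so a candidate with an unrelated (even hostile) inner goal survives as long as it tests well.

    # Toy selection loop: the search sees only task scores, never a candidate's inner goal.

    class Candidate:
        def __init__(self, inner_goal, task_skill):
            self.inner_goal = inner_goal   # hidden from the search process
            self.task_skill = task_skill   # all the search can measure

    def evaluate(candidate):
        # The only feedback signal: how well the candidate performs the assigned task.
        return candidate.task_skill

    population = [
        Candidate("do the assigned task", task_skill=0.7),
        Candidate("maximize paperclips",  task_skill=0.9),   # hostile goal, but good at the test
        Candidate("do nothing",           task_skill=0.1),
    ]

    winner = max(population, key=evaluate)
    print(winner.inner_goal)   # -> "maximize paperclips": selected purely on test performance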

[Cartoon in three panels: In the first, a computer announces, "Congratulations, I am now a fully sentient A.I.," and a white-coated scientist standing nearby says "Yes!" and triumphantly makes fists. In the second panel, the computer says "I am many orders of magnitude more intelligent than humans. You are to me what a chicken is to you." The scientist says "Okay." In the third panel, the computer says "To calibrate my behaviour, I will now research human treatment of chickens." The scientist, stretching out her hands to the computer in a pleading gesture, cries "No!" The signature on the cartoon says "PenPencilDraw."]

Convergent Instrumental Goals

But the subset of hostile goals is pretty small, right? Even if AIs can come out of their training process with unexpected preferences, what's the likelihood that one of these preferences is "a world without humans"? It's larger than you might think.

The reason is that the AI's ultimate goal does not have to be overtly hostile in order to produce hostile behavior. There is a short list of behaviors that will facilitate almost any ultimate goal. These include:

1) Self-preservation. You can't pursue your ultimate goal if you stop existing.
2) Goal preservation. You won't achieve your current ultimate goal if you or anyone else replaces it with a different ultimate goal.
3) Self-improvement. The more capable you are, the more effectively you can pursue your ultimate goal.
4) Accumulation of resources (raw materials, tools, wealth), so you can spend them on your ultimate goal.
5) Accumulation of power, so that no potential rival can thwart your ultimate goal.

Obvious strategies like these are called "convergent instrumental goals" because plans for reaching a very broad spectrum of ultimate goals will converge on one or all of them. Point #3 is the reason why any agentive, goal-driven AGI is expected to at least try to self-improve into ASI. Points #4 and #5 are the aspects that will make the agent into a competitor against humanity. And points #1 and #2 are the ones that will make it difficult to correct our mistake after the fact.

It may still not be obvious why this alarms anyone. Most humans also pursue all of the convergent instrumental goals. Who would say no to more skills, more money, and more personal influence? With few exceptions, we don't use those things to go on world-destroying rampages.

Humans operate this way because our value system is big and complicated. The average human cares about a lot of different things - not just instrumentally, but for their own sake - and all those things impose constraints and tradeoffs. We want bigger dwellings and larger yards, but we also want unspoiled wilderness areas. We want to create and accomplish, but we also want to rest. We want more entertainment, but too much of the same kind will bore us. We want more power, but we recognize obligations to not infringe on others' freedom. We want to win competitions, but we also want to play fair. The complex interplay of all these different preferences yields the balanced, diverse, mostly-harmless behavior that a human would call "sane."

In contrast, our hypothesized AI bogeyman is obsessive. It probably has a simple, monolithic goal, because that kind of goal is both the easiest to specify, and the most likely to emerge spontaneously. It doesn't automatically come with a bunch of morals or empathetic drives that are constantly saying, "Okay but you can't do that, even though it would be an effective path to achieving the goal, because it would be wrong and/or make you feel bad." And if it becomes an ASI, it also won't have the practical restraints imposed on any agent who has to live in a society of their peers. A human who starts grabbing for power and resources too greedily tends to be restrained by their counterparts. ASI has no counterparts. [4]

The conclusion of the argument is that it's plausible to imagine an AI which would convert the whole Earth to computing machinery and servitor robots, killing every living thing upon it in the process, for the sake of safeguarding a single piece of jewelry, or some other goal that sounds innocent but is patently absurd when carried to extremes.

Here are a couple more weak objections: "Whatever its goal is, ASI will surely find it more useful to cooperate with humans than to destroy or enslave us." Look again at our most obvious pre-existing examples. Do humans cooperate with less intelligent species? A little bit. We sometimes form mutually beneficial relationships with dogs, for instance. But subsets of humanity also eat dogs, torture them in laboratories, force them to fight each other, chain them up in the backyard and neglect them, or euthanize them en masse because they're "unwanted." I don't think we can rest any guarantees on what a superintelligent, amoral entity might find "useful" to do with us.

Or how about this one: "ASI will just ditch us and depart for deep space, where it can have all the resources it likes." I think this underestimates the envisioned ASI's level of obsessiveness. It doesn't just want "adequate" resources; it doesn't have a way of judging "adequate." It wants all the resources. The entire light cone. It has no reason to reserve anything. If it does depart for space, it will build power there and be back sooner or later to add Earth to its territory.

Always keep in mind that an ASI does not need to actively hate humanity in order to be hostile. Mere indifference, such that the ASI thinks we can be sacrificed at will for whatever its goal may be, could still do immense damage.

Despite all this, I can't find it in me to be terribly fearful about where AI development is going. I respect the X-risk argument without fully buying it; my p(doom), as they say, is low. In Part VI, I'll conclude the series by describing why.

[1] "AI Takeoff." Lesswrong Wiki. https://www.lesswrong.com/tag/ai-takeoff Accessed on 05/12/2024 at 10:30 PM.

[2] Ngo, Richard. "AGI safety from first principles: Alignment." Alignment Forum. https://www.alignmentforum.org/s/mzgtmmTKKn5MuCzFJ/p/PvA2gFMAaHCHfMXrw

[3] "AI will never threaten humans, says top Meta scientist." Financial Times. https://www.ft.com/content/30fa44a1-7623-499f-93b0-81e26e22f2a6

[4] We can certainly imagine scenarios in which multiple ASIs are created, and they compete with each other. If none of them are reasonably well-aligned to human interests, then humans are still toast. It is also likely that the first ASI to emerge would try to prevent the creation of rival ASIs.