Paper:
Existential Risk Analysis for AI Research
1.
Weaponization: Some are concerned that weaponizing AI may be an onramp to more dangerous outcomes. In recent years, deep RL algorithms can outperform humans at aerial combat [18], AlphaFold has discovered new chemical weapons [66], researchers have been developing AI systems for automated cyberattacks [11, 14], military leaders have discussed having AI systems have decisive control over nuclear silos [41], and superpowers of the world have declined to sign agreements banning autonomous weapons. Additionally, an automated retaliation system accident could rapidly escalate and give rise to a major war. Looking forward, we note that since the nation with the most intelligent AI systems could have a strategic advantage, it may be challenging for nations not to build increasingly powerful weaponized AI systems.
Even if AI alignment is solved and all superpowers agree not to build destructive AI technologies, rogue actors still could use AI to cause significant harm. Easy access to powerful AI systems increases the risk of unilateral, malicious usage. As with nuclear and biological weapons, only one irrational or malevolent actor is sufficient to unilaterally cause harm on a massive scale. Unlike previous weapons, stealing and widely proliferating powerful AI systems could just be a matter of copy and pasting.
2.
Enfeeblement: As AI systems encroach on human-level intelligence, more and more aspects of human labor will become faster and cheaper to accomplish with AI. As the world accelerates, organizations may voluntarily cede control to AI systems in order to keep up. This may cause humans to become economically irrelevant, and once AI automates aspects of many industries, it may be hard for displaced humans to reenter them. In this world, humans could have few incentives to gain knowledge or skills. These trends could lead to human enfeeblement and reduce human flourishing, leading to a world that is undesirable. Furthermore, along this trajectory, humans would have less control of the future.
3.
Eroded epistemics: States, parties, and organizations use technology to influence and convince others of their political beliefs, ideologies, and narratives. Strong AI may bring this use-case into a new era and enable personally customized disinformation campaigns at scale. Additionally, AI itself could generate highly persuasive arguments that invoke primal human responses and inflame crowds. Together these trends could undermine collective decision-making, radicalize individuals, derail moral progress, or erode consensus reality.
4.
Proxy misspecification: AI agents are directed by goals and objectives. Creating general-purpose objectives that capture human values could be challenging. As we have seen, easily measurable objectives such as watch time and click rates often trade off with our actual values, such as wellbeing [43]. For instance, well-intentioned AI objectives have unexpectedly caused people to fall down conspiracy theory rabbit holes. This demonstrates that organizations have deployed models with flawed objectives and that creating objectives which further human values is an unsolved problem. Since goal-directed AI systems need measurable objectives, by default our systems may pursue simplified proxies of human values. The result could be suboptimal or even catastrophic if a sufficiently powerful AI successfully optimizes its flawed objective to an extreme degree.
5.
Value lock-in: Strong AI imbued with particular values may determine the values that are propagated into the future. Some argue that the exponentially increasing compute and data barriers to entry make AI a centralizing force. As time progresses, the most powerful AI systems may be designed by and available to fewer and fewer stakeholders. This may enable, for instance, regimes to enforce narrow values through pervasive surveillance and oppressive censorship. Overcoming such a regime could be unlikely, especially if we come to depend on it. Even if creators of these systems know their systems are self-serving or harmful to others, they may have incentives to reinforce their power and avoid distributing control. The active collaboration among many groups with varying goals may give rise to better goals [20], so locking in a small group’s value system could curtail humanity’s long-term potential.
6.
Emergent functionality: Capabilities and novel functionality can spontaneously emerge in today’s AI systems [26, 57], even though these capabilities were not anticipated by system designers. If we do not know what capabilities systems possess, systems become harder to control or safely deploy. Indeed, unintended latent capabilities may only be discovered during deployment. If any of these capabilities are hazardous, the effect may be irreversible.
New system objectives could also emerge. For complex adaptive systems, including many AI agents, goals such as self-preservation often emerge [30]. Goals can also undergo qualitative shifts through the emergence of intrasystem goals [25, 33]. In the future, agents may break down difficult long-term goals into smaller subgoals. However, breaking down goals can distort the objective, as the true objective may not be the sum of its parts. This distortion can result in misalignment. In more extreme cases, the intrasystem goals could be pursued at the expense of the overall goal. For example, many companies create intrasystem goals and have different specializing departments pursue these distinct subgoals. However, some departments, such as bureaucratic departments, can capture power and have the company pursue goals unlike its original goals. Even if we correctly specify our high-level objectives, systems may not operationally pursue our objectives [38]. This is another way in which systems could fail to optimize human values.
7.
Deception: Future AI systems could conceivably be deceptive not out of malice, but because deception can help agents achieve their goals. It may be more efficient to gain human approval through deception than to earn human approval legitimately. Deception also provides optionality: systems that have the capacity to be deceptive have strategic advantages over restricted, honest models. Strong AIs that can deceive humans could undermine human control.
AI systems could also have incentives to bypass monitors. Historically, individuals and organizations have had incentives to bypass monitors. For example, Volkswagen programmed their engines to reduce emissions only when being monitored. This allowed them to achieve performance gains while retaining purportedly low emissions. Future AI agents could similarly switch strategies when being monitored and take steps to obscure their deception from monitors. Once deceptive AI systems are cleared by their monitors or once such systems can overpower them, these systems could take a “treacherous turn” and irreversibly bypass human control.
8.
Power-seeking behavior: Agents that have more power are better able to accomplish their goals. Therefore, it has been shown that agents have incentives to acquire and maintain power [65]. AIs that acquire substantial power can become especially dangerous if they are not aligned with human values. Power- seeking behavior can also incentivize systems to pretend to be aligned, collude with other AIs, overpower monitors, and so on. On this view, inventing machines that are more powerful than us is playing with fire. Building power-seeking AI is also incentivized because political leaders see the strategic advantage in having the most intelligent, most powerful AI systems. For example, Vladimir Putin has said “Whoever becomes the leader in [AI] will become the ruler of the world.”