Nine Ways to Bias Open Source AGI Toward Friendliness

While it seems unlikely that any method of guaranteeing human-friendliness (“Friendliness”) on the part of advanced Artificial General Intelligence (AGI) systems will be possible, this doesn’t mean the only alternatives are throttling AGI development to safeguard humanity, or plunging recklessly into the complete unknown. Without denying the presence of a certain irreducible uncertainty in such matters, it is still sensible to explore ways of biasing the odds in a favorable way, such that newly created AI systems are significantly more likely than not to be Friendly. Several potential methods of effecting such biasing are explored here, with a particular but non-exclusive focus on those that are relevant to open-source AGI projects, and with illustrative examples drawn from the OpenCog open-source AGI project. Issues regarding the relative safety of open versus closed approaches to AGI are discussed and then nine techniques for biasing AGIs in favor of Friendliness are presented:

1. Engineer the capability to acquire integrated ethical knowledge.
2. Provide rich ethical interaction and instruction, respecting developmental stages.
3. Develop stable, hierarchical goal systems.
4. Ensure that the early stages of recursive self-improvement occur relatively slowly and with rich human involvement.
5. Tightly link AGI with the Global Brain.
6. Foster deep, consensus-building interactions between divergent viewpoints.
7. Create a mutually supportive community of AGIs.
8. Encourage measured co-advancement of AGI software and AGI ethics theory.
9. Develop advanced AGI sooner not later.

In conclusion, and related to the final point, we advise the serious co-evolution of functional AGI systems and AGI-related ethical theory as soon as possible, before we have so much technical infrastructure that parties relatively unconcerned with ethics are able to rush ahead with brute force approaches to AGI development.

Introduction to Artificial General Intelligence and Friendly AI

Artificial General Intelligence (AGI), like any technology, carries both risks and rewards. One science fiction film after another has highlighted the potential dangers of AGI, lodging the issue deep in our cultural awareness. Hypothetically, an AGI with superhuman intelligence and capability could dispense with humanity altogether and thus pose an “existential risk” (Bostrom 2002). In the worst case, an evil but brilliant AGI, programmed by some cyber Marquis de Sade, could consign humanity to unimaginable tortures (perhaps realizing a modern version of the medieval Christian imagery of hell). On the other hand, the potential benefits of powerful AGI also go literally beyond human imagination. An AGI with massively superhuman intelligence and a positive disposition toward humanity could provide us with truly dramatic benefits, through the application of superior intellect to scientific and engineering challenges that befuddle us today. Such benefits could include a virtual end to material scarcity via advancement of molecular manufacturing, and also force us to revise our assumptions about the inevitability of disease and aging (Drexler1986). Advanced AGI could also help individual humans grow in a variety of directions, including directions leading beyond our biological legacy, leading to massive diversity in human experience, and hopefully a simultaneous enhanced capacity for openmindedness and empathy.

Eliezer Yudkowsky introduced the term “Friendly AI” to refer to advanced AGI systems that act with human benefit in mind (Yudkowsky 2001). Exactly what this means has not been specified precisely, though informal interpretations abound. Goertzel (2006a) has sought to clarify the notion in terms of three core values of “Joy, Growth and Freedom.” In this view, a Friendly AI would be one that advocates individual and collective human joy and growth, while respecting the autonomy of human choice.

Some (for example, De Garis 2005) have argued that Friendly AI is essentially an impossibility, in the sense that the odds of a dramatically superhumanly intelligent mind worrying about human benefit are vanishingly small, drawing parallels with humanity’s own exploitation of less intelligent systems. Indeed, in our daily life, questions such as the nature of consciousness in animals, plants, and larger ecological systems are generally considered merely philosophical, and only rarely lead to individuals making changes in outlook, lifestyle or diet. If Friendly AI is impossible for this reason, then the best options for the human race would presumably be to avoid advanced AGI development altogether, or else to fuse with AGI before the disparity between its intelligence and humanity’s becomes too large, so that beings-originated-as-humans can enjoy the benefits of greater intelligence and capability. Some may consider sacrificing their humanity an undesirable cost. The concept of humanity, however, is not a stationary one, and can be viewed as sacrificed only from our contemporary perspective of what humanity is. With our cell phones, massively connected world, and the inability to hunt, it’s unlikely that we’d seem the same species to the humanity of the past. Just like an individual’s self, the self of humanity will inevitably change, and as we do not usually mourn losing our identity of a decade ago to our current self, our current concern for what we may lose may seem unfounded in retrospect.

Others, such as Waser (2008), have argued that Friendly AI is essentially inevitable, linking greater intelligence with greater cooperation. Waser adduces evidence from evolutionary and human history in favor of this point, along with more abstract arguments such as the economic viability of cooperation over not cooperating.

Omohundro (2008) has argued that any advanced AI system will very likely demonstrate certain “basic AI drives,” such as desiring to be rational, to self-protect, to acquire resources, and to preserve and protect its utility function and avoid counterfeit utility; these drives, he suggests, must be taken carefully into account in formulating approaches to Friendly AI.

Yudkowsky (2006) discusses the possibility of creating AGI architectures that are in some sense “provably Friendly” – either mathematically, or else by very tight lines of rational verbal argument. However, several possibly insurmountable challenges face such an approach. First, proving mathematical results of this nature would likely require dramatic advances in multiple branches of mathematics. Second, such a proof would require a formalization of the goal of “Friendliness,” which is a subtler matter than it might seem (Legg 2006; Legg 2006a), as formalization of human morality has vexed moral philosophers for quite some time. Finally, it is unclear the extent to which such a proof could be created in a generic, environment-independent way – but if the proof depends on properties of the physical environment, then it would require a formalization of the environment itself, which runs up against various problems related to the complexity of the physical world, not to mention the current lack of a complete, consistent theory of physics.

The problem of formally or at least very carefully defining the goal of Friendliness has been considered from a variety of perspectives. Among a list of fourteen objections to the Friendly AI concept, with suggested answers to each, Sotala (2011) includes the issue of friendliness being a vague concept. A primary contender for this role is the concept of Coherent Extrapolated Volition (CEV) suggested by Yudkowsky (2004), which roughly equates to the extrapolation of the common values shared by all people when at their best. Many subtleties arise in specifying this concept – e.g. if Bob Jones is often possessed by a strong desire to kill all Martians, but he deeply aspires to be a nonviolent person, then the CEV approach would not rate “killing Martians” as part of Bob’s contribution to the CEV of humanity. Resolving inconsistencies in aspirations and desires, and the different temporal scales involved for each, is another non-trivial problem.

One of the authors, Goertzel (2010), has proposed a related notion of Coherent Aggregated Volition (CAV), which eschews some subtleties of extrapolation, and instead seeks a reasonably compact, coherent, and consistent set of values that is close to the collective value-set of humanity. In the CAV approach, “killing Martians” would be removed from humanity’s collective value-set because it’s assumedly uncommon and not part of the most compact/coherent/consistent overall model of human values, rather than because of Bob Jones’s aspiration to nonviolence.

More recently we have considered that the core concept underlying CAV might be better thought of as CBV or Coherent Blended Volition. CAV seems to be easily misinterpreted as meaning the average of different views, which was not the original intention. The CBV terminology clarifies that the CBV of a diverse group of people should not be thought of as an average of their perspectives, but as something more analogous to a “conceptual blend” (Fauconnier and Turner 2002) – incorporating the most essential elements of their divergent views into a whole that is overall compact, elegant and harmonious. The subtlety here (to which we shall return below) is that for a CBV blend to be broadly acceptable, the different parties whose views are being blended must agree to some extent that enough of the essential elements of their own views have been included.

Multiple attempts at axiomatization of human values have also been attempted. In one case, this is done with a view toward providing near-term guidance to military robots (from Arkin (2009)’s excellent though chillingly-titled book Governing Lethal Behavior in Autonomous Robots). However, there are reasonably strong arguments that human values (and similarly a human’s language and perceptual classification) are too complex and multifaceted to be captured in any compact set of formal logical rules. Wallach and Allen (2010) have made this point eloquently, and argued for the necessity of fusing top-down (e.g. formal logic based) and bottom-up (e.g. self-organizing learning based) approaches to machine ethics.