With this sequence, we (Sam + Jan) want to provide a principled derivation of the natural abstractions hypothesis (which we will introduce in depth in later posts) by motivating it with insights from computational neuroscience.
Goals for this sequence are:
- show why we expect natural abstractions to emerge in biological brains,
- provide empirical evidence and a mechanistic explanation for the emergence of natural abstractions in biology, and
- spell out implications of the emergence of natural abstractions in biology.
Author’s note: This is currently my (Sam’s) main research project, but my first nonetheless. Happy to receive any feedback! Some of the original ideas and guidance come from Jan. I don’t expect you, the reader, to have solid background knowledge in any of the discussed topics. So, whenever you get lost, I will try to get you back on board by providing a more high-level summary of what I said.
As alignment researchers, the higher-level problem we are trying to solve is: ‘How do we teach an AI what we value?’. To simplify the question, we assume that we already know what we value¹. Now, we’ve got to teach an AI that we value “things out there in the world”, e.g. trees. Specifying “trees” should be easy, right?
… oh no …
You: Yea, so… we value trees. These for example. (gestures wildly)
AGI: Of course! So you only value these trees in your backyard? Okay, I will just quickly measure the constellation of molecules in…
You: No, no, no! Stop, that’s not what I meant! Not only those, but you know… the idea behind the trees? I also value trees that aren’t present here, and certainly not their constituent molecules. Look, let’s go over to my neighbor’s backyard. These are trees too.
AGI: Huh, this tree seems to be made out of papier-mâché.
You: Ahhh oops… Yea, humans are fallible. Don’t treat my best-guess estimates as unfailing sources of truth. Please note that I care about the territory, not just the map in my head that might be mistaken.
AGI: Wh- What? How should I know about the relation between your map and the territory?
In a similar vein, John Wentworth spells out the pointers problem²: ‘An AI should optimize for the real-world things I value, not just my estimates of those things’. He formalizes the problem as follows:
“What functions of what variables (if any) in the environment and/or another world model correspond to the latent variables in the agent’s world model”.
Let’s apply JW’s definition to the trees example from above.
- The concept of a ‘tree’ (that’s what JW calls a ‘latent variable’) is not directly observable and is only clear in your head.
- To make the concept observable, you have to specify the correct relationship between the concept in your head and the observable objects in the world (that’s what JW calls ‘functions of […] variables (if any) in the environment and/or another world model’).
Eventually, you want to transfer something (the concept of a tree) in your map to the AGI’s map (that’s what JW calls the ‘agent’s world model’).
There is also an analogous view on the pointers problem from classic philosophy, called the ‘problem of universals’. Similar to our story, we face issues when pointing towards the ‘universal of a tree’. We will explain the problem of universals using an example:
Let’s consider the thought process of the divine mind³ when creating trees. Trees come in very different shapes and branching patterns. Now, who imposed on them that they are ‘trees’? Did the divine mind already have a category in mind and create trees from an existing idea of trees, universalia ante rem (‘universals before the thing’)? Is the idea of a tree realized in trees, universalia in re (‘universals in the thing’)? Or did we create the idea of a tree by examining trees and throwing away unnecessary properties? This would mean that we formed something like an abstraction, universalia post rem (‘universals after the thing’).
As we will see, the natural part in ‘natural abstractions hypothesis’ suggests that we should expect universals in the things⁴.
After having read the previous sections, we want to keep in mind that:
- ‘Problems’ with universals only come up when you happen to study philosophy. Normally, we don’t face issues when communicating concepts. You compensate for the imprecision in my expressions. Since our maps are roughly similar, you can fill in the necessary gaps. Concepts like ‘tree’ don’t confuse you, even if we have differing cultural upbringings. I mean the green, leaf-y thing that is not a cabbage. The concept of a ‘tree’ formed in our heads by mere observation.
- The discrepancy between the theoretical pointers problem and the situation in the real world has a curious implication: An AGI that derives its concepts in the same way that we do might have a much easier time understanding what we mean when we specify trees.
- To belabor the point further, we of course don’t primarily care about trees; we care about the concept ‘human values’. We need a formal specification of the relationship between the concept in our heads and observables in the world. A superintelligent AGI without such a specification will misunderstand what we mean by, e.g., a ‘happy person’⁵.
So, how is it that our maps are so similar? Is this the way things generally have to be? Does the emergence of concepts like ‘tree’ involve things like genetics? Should we expect aliens or artificial intelligence⁶ to share the same understanding of trees? And what happens when they don’t understand us?
All these questions have something to do with our brains and how they learn. With this sequence, we want to explore exactly these questions using the brain as a working example. We choose the simple⁷ and plausible Hebbian learning rule as a model of how the brain learns in an unsupervised way. The goal of this sequence is to provide empirical evidence for the so-called natural abstraction hypothesis in real intelligent agents and to give a more mechanistic explanation for how abstractions emerge. We supplement John Wentworth’s information-theoretic perspective with our perspective from neuroscience/biology.
If you want to dig deeper into John Wentworth’s perspective and the relevance of abstractions, we refer you to John Wentworth's posts.
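To make ‘Hebbian learning’ a bit more tangible before the later posts, here is a minimal sketch. It is not necessarily the exact rule we will use in this sequence. As an assumption for illustration, it uses Oja’s rule, a well-known normalized variant of Hebbian learning, and a made-up toy dataset in which two observed inputs are driven by one hidden cause. The function names, learning rate, and data are all hypothetical choices.

```python
import math
import random


def oja_update(w, x, lr=0.01):
    """One step of Oja's rule (assumed here as a stand-in Hebbian rule)."""
    # Postsynaptic activity: weighted sum of the inputs.
    y = sum(wi * xi for wi, xi in zip(w, x))
    # Hebbian growth (y * x) minus a decay term (y^2 * w) that keeps |w| bounded.
    return [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]


def make_toy_data(n=500, seed=0):
    """Hypothetical toy data: one hidden cause drives two noisy observables."""
    rng = random.Random(seed)
    direction = (0.6, 0.8)  # how the hidden cause shows up in each input
    data = []
    for _ in range(n):
        latent = rng.gauss(0.0, 1.0)
        data.append([latent * d + rng.gauss(0.0, 0.1) for d in direction])
    return data


def train(inputs, lr=0.01, epochs=20, seed=1):
    """Start from random unit-length weights and apply the rule sample by sample."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in inputs[0]]
    norm = math.sqrt(sum(wi * wi for wi in w))
    w = [wi / norm for wi in w]
    for _ in range(epochs):
        for x in inputs:
            w = oja_update(w, x, lr)
    return w


if __name__ == "__main__":
    weights = train(make_toy_data())
    print("learned weights:", weights)
```

In this toy setting, the weights should end up pointing (up to sign) along the direction in which the hidden cause shows up in the inputs, without any labels being provided. That is the rough intuition behind asking whether an unsupervised rule can pick out a shared ‘concept’ from raw observations.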
Future posts will talk about the mathematics behind 'Hebbian' Natural Abstractions and the empirical background.
1) To convince yourself that this is already hard enough, read the sequences, The Hidden Complexity of Wishes, Value is Fragile and Thou Art Godshatter
2) Actually, I first read about the pointers problem in Abram Demski’s posts. In a vague sense, other authors talked about the problem at hand as well, though John Wentworth is the first to spell it out in this way. If you want to dig deeper into the pointers problem, read John Wentworth’s post, this, this and this. The issue goes back to the wireheading problem. Here, we want to prevent a generally intelligent agent from realizing that it can stimulate its sensors so that it receives the greatest reward all the time. To solve this issue, we have to tell the agent that it should optimize for our intended outcome, ‘the idea behind it’. Natural abstractions are meant to be a way to specify exactly what we value.
3) Or whichever other being or event may have created them.
4) Natural here means that we should expect a variety of intelligent agents to converge on finding the same abstractions. Thus, abstractions are somehow embedded in things; otherwise we would expect agents to find different abstractions, or universals.
5) Definitely not a rictus grin, but a genuine smile. It does not understand what we are pointing at. But you do.
6) ‘Aliens’ or artificial intelligence might have completely different computational limitations compared to humans. How do abstractions behave then?
7) The model is imperfect, but a suitable abstraction (huh) to talk about the topic at hand.