With this sequence, we (Sam + Jan) want to provide a principled derivation of the natural abstractions hypothesis (which we will introduce in depth in later posts) by motivating it with insights from computational neuroscience.
Goals for this sequence are:
- show why we expect natural abstractions to emerge in biological brains,
- provide empirical evidence and a mechanistic explanation for the emergence of natural abstractions in biology, and
- spell out implications of the emergence of natural abstractions in biology.
Author’s note: This is currently my (Sam’s) main research project - and also my first one, so I’m happy to receive any feedback! Some of the original ideas and guidance come from Jan. I don’t expect you, the reader, to have solid background knowledge in any of the discussed topics, so whenever you get lost, I will try to get you back on board with a more high-level summary of what was said.
As alignment researchers, the higher-level problem we are trying to solve is: ‘How do we teach an AI what we value?’. To simplify the question, we assume that we already know what we value¹. Now, we’ve got to teach an AI that we value “things out there in the world”, e.g. trees. Specifying “trees” should be easy, right?
… oh no …
You: Yea, so… we value trees. These for example. *gestures wildly*

AGI: Of course! So you only value these trees in your backyard? Okay, I will just quickly measure the constellation of molecules in—

You: No, no, no! Stop, that’s not what I meant! Not only those, but you know… the idea behind the trees? I also value trees that aren’t present here and certainly not their constituent molecules. Look, let’s go over to my neighbor’s backyard. These are trees too.

AGI: Huh, this tree seems to be made out of papier-mâché.

You: Ahhh, oops… Yea, humans are fallible. Don’t treat my best-guess estimations as unfailing sources of truth. Please note that I care about the territory - not just the map in my head that might be mistaken.

AGI: Wh- What? How should I know about the relation between your map and the territory?
In a similar vein, John Wentworth spells out the pointers problem²: ‘An AI should optimize for the real-world things I value, not just my estimates of those things’. He formalizes the problem as follows:
“What functions of what variables (if any) in the environment and/or another world model correspond to the latent variables in the agent’s world model”.
Let’s apply JW’s definition to the trees example from above.
- The concept of a ‘tree’ (that’s what JW calls a ‘latent variable’) is not directly observable and is only clear in your head.
- To make the concept observable, you have to specify the correct relationship between the concept in your head and the observable objects in the world (that’s what JW calls ‘functions of […] variables (if any) in the environment and/or another world model’).
- Eventually, you want to transfer something (the concept of a tree) from your map to the AGI’s map (that’s what JW calls the ‘agent’s world model’).
There is also an analogous view on the pointers problem from classical philosophy, called the ‘problem of universals’. Similar to our story, we face issues when pointing towards the ‘universal of a tree’. We will explain the problem of universals using an example:
Let’s consider the thought process of the divine mind³ when creating trees. Trees come in very different shapes and branching patterns. Now, who imposed on them that they are ‘trees’? Did the divine mind already have a category in mind and create trees from an existing idea of trees, universalia ante rem (‘universals before the thing’)? Is the idea of a tree realized in trees, universalia in re (’universals in the thing’)? Or did we create the idea of a tree by examining trees and throwing away unnecessary properties? This would mean that we formed something like an abstraction, universalia post rem (’universals after the thing’).
As we will see, the natural part in ‘natural abstractions hypothesis’ suggests that we should expect universals in the things⁵.
From the previous sections, we want to keep in mind that:
- ‘Problems’ with universals only arise when you happen to study philosophy. Normally, we don’t face issues when communicating concepts. You balance out the imprecision in my expressions. Since our maps are roughly similar, you can fill in the necessary gaps. Concepts like ‘tree’ don’t confuse you, even if we have differing cultural upbringings. I mean the green, leaf-y thing that is not a cabbage. The concept of a ‘tree’ formed in our heads by mere observation.
- The discrepancy between the theoretical pointers problem and the situation in the real world has a curious implication: an AGI that derives its concepts in the same way that we do might have a much easier time understanding what we mean when we specify trees.
- To belabor the point further: of course we don’t primarily care about trees, we care about the concept ‘human values’. We need a formal specification of the relationship between the concept in our heads and observables in the world. A superintelligent AGI without such a specification will misunderstand what we mean by e.g. a ‘happy person’⁴.
So, how is it that our maps are so similar? Is this the way things generally have to be? Does the emergence of concepts like ‘tree’ involve things like genetics? Should we expect aliens or artificial intelligence⁶ to share the same understanding of trees? And what happens when they don’t understand us?
All these questions have something to do with our brains and how they learn. With this sequence, we want to explore exactly these questions using the brain as a working example. We choose the simple⁷ and plausible Hebbian learning rule, which models how the brain learns in an unsupervised way. The goal of this sequence is to provide empirical evidence for the so-called natural abstractions hypothesis in real intelligent agents and to give a more mechanistic explanation for how abstractions emerge. We supplement John Wentworth’s information-theoretic perspective with our perspective from neuroscience/biology.
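As a rough preview, the plain Hebbian rule (‘neurons that fire together, wire together’) can be sketched in a few lines: for a single linear neuron with output y = w·x, each weight changes by Δw = η·y·x. The toy below is only a minimal sketch of this plain rule on made-up data, not the exact model used later in the sequence.

```python
import numpy as np

def hebbian_step(w, x, eta=0.005):
    """One Hebbian update: a weight grows when its input and the
    neuron's output are active together ("fire together, wire together")."""
    y = np.dot(w, x)          # postsynaptic activity
    return w + eta * y * x    # Hebbian weight change, dw = eta * y * x

# Toy inputs whose two components are strongly correlated -- the kind of
# shared structure an unsupervised neuron can latch onto.
patterns = [np.array(p) for p in ([1.0, 0.9], [-1.0, -1.1],
                                  [0.5, 0.6], [-0.4, -0.5])]

w = np.array([0.1, 0.0])      # arbitrary small initial weights
for _ in range(100):
    for x in patterns:
        w = hebbian_step(w, x)

# The weight vector aligns with the correlated direction (both weights
# end up with the same sign) and grows in magnitude -- the plain rule has
# no normalization, which is why variants like Oja's rule add a decay term.
print(w)
```

Note how the rule is purely local and unsupervised: each weight change depends only on the activity of its own input and the neuron’s output, with no teacher signal - the structure the neuron picks up is whatever correlations the input statistics contain.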
If you want to dig deeper into John Wentworth’s perspective and the relevance of abstractions, we refer to John Wentworth's posts.
The mathematical backbone of “Hebbian” Natural Abstractions:
The empirical background for “Hebbian” Natural Abstractions:
Criticism for the natural abstractions hypothesis:
1) To convince yourself that this is already hard enough, read The Sequences, The Hidden Complexity of Wishes, Value is Fragile and Thou Art Godshatter.
2) Actually, I first read about the pointers problem in Abram Demski’s posts. Other authors have talked about the problem at hand in a vague sense as well, though John Wentworth is the first to spell it out in this way. If you want to dig deeper into the pointers problem, read John Wentworth’s post, this, this and this. The issue goes back to the wireheading problem: here, we want to prevent a generally intelligent agent from realizing that it can stimulate its own sensors so that it receives the greatest reward all the time. To solve this issue, we have to tell the agent that it should optimize for our intended outcome, ‘the idea behind it’. Natural abstractions are meant to be a way to specify exactly what we value.
3) Or which other being or event may have created them.
4) Definitely not a rictus grin, but a genuine smile. It does not understand what we are pointing at. But you do.
5) Natural here means that we should expect a variety of intelligent agents to converge on finding the same abstractions. Thus, abstractions are somehow embedded in things; otherwise we would expect agents to find different abstractions, or universals.
6) ‘Aliens’ or artificial intelligence might have completely different computational limitations compared to humans. How do abstractions behave then?
7) The model is imperfect, but a suitable abstraction (huh) to talk about the topic at hand.
Whatever the answer is, we all ended up with some idea of a tree. This is because our brains all face similar computational limitations, which puts us under similar epistemic pressure to form categories and abstractions in order to build a useful world model.
To avoid the pointers problem, we are looking for a way to ‘translate’ a concept in our heads into a form that is comprehensible to a non-human agent.
You might have noticed that (as is common with problems of philosophy) we almost never encounter the pointers problem when talking to our fellow humans. When I say ‘tree’, you get it. I mean the green, leaf-y thing that is not a cabbage. But is this really the way things have to be? Should we expect aliens or artificial intelligence to share the same understanding of trees? And what happens when they don’t understand us?
If we face the pointers problem, I will have difficulty describing to you what a happy person is. I might say that the corners of their mouth are usually pulled upwards, or that the person speaks in a cheerful manner. This is itself still high-level - I don’t care about the exact position of the corners of their mouth, or even the exact positions of the molecules in their face. But it is lower-level than “a happy person”. At the latest when I try to describe what the internal state of a happy person looks like, I run up against my own sensory limits.
But we probably won’t face these difficulties. You will understand what a happy person² is. You know what I point to in the real world when I present the concept “a happy person” to you. This is because we roughly share the same world-model. A tree corresponds to a similar idea in my world-model as it does in your world-model. Thus, the solution we are looking for is a way to translate abstractions between agents. This will enable us to tell an AI that we value happy people, without having to rely on us estimating whether somebody is happy, and without having to define all lower-level things in the world that constitute happy people.
We are already able to do that between humans, so naturally we can ask: how does the brain acquire these abstractions? How are these abstractions “natural” in humans? How can it be that our abstractions are shared across millions and millions of people? It must have something to do with how the brain learns. Thus, with this sequence, I want to explore exactly these questions using the brain as a working example. The goal of this sequence is then to provide empirical evidence for the natural abstractions hypothesis (which we will look at in greater detail in later posts) in real intelligent agents and to give a more mechanistic explanation for how abstractions form.
The idea of a tree corresponds to a latent variable - not directly observable - which itself corresponds to something in the real world. But we usually don’t mean precisely what it refers to: I don’t care about the exact constellation of molecules in a tree, but about some abstract idea of it.
On the other hand, when I tell an AI that I value humans and the things that humans value, I have to provide some insight into the world I would like to see, i.e. which characteristics the world that I want should have. I then don’t want the AI to optimize for my estimation of the value of a possible world; I want it to optimize towards the actual state of the world. E.g., I want people to actually be happy, not just appear happy to me. This also depends on things I can never (or at least currently cannot) sense - e.g. the internal states of you, the reader.
So the pointers problem is two-fold:
- I have to tell the AI that I care about high-level things, like happy people, but at the same time,
- I want the AI to optimize not for my estimation of whether somebody is happy, but the person actually being happy.