Shared with: Nele, Konstantin
TL;DR: Variants of Hebbian learning shape structures in the brain so that they resemble natural abstractions. In particular, these variants
- find the first principal component of the received input and encode it in the weight vector of the connections leading to a neuron. We show that this direction is the direction with the highest variance of the input and thus, presumably, the direction with the highest amount of information from the input dataset.
- lead several structures in the brain to learn the correlational structure of the “world out there”.
In this post, we want to do the following:
- connect Oja’s rule and Hebbian learning, as biologically plausible learning rules, with the natural abstractions hypothesis.
By connecting how the brain presumably learns with natural abstractions, we want to provide backing for the “natural” part of John Wentworth’s natural abstractions hypothesis. We want to show that we should expect a wide variety of cognitive systems (including biological brains) to converge on using natural abstractions. This post will also serve as the mathematical backbone of the whole argument presented in this sequence.
In later posts, we want to further explore the implications of connecting how the brain learns with natural abstractions, by providing excursions into neuroscience and related topics.
Note: This post is quite mathy. We will provide an interpretation at the end of the post.
- Natural abstractions: often, we can find abstractions, i.e. lower-dimensional, high-level summaries of information, that are relevant “further ahead (causally, but also in other senses)” for prediction.
- They are natural in the sense that a wide selection of intelligent agents are expected to converge on them.
- Hebbian learning is often summarized as “Cells that fire together, wire together”: an increase in synaptic efficacy arises from a presynaptic cell's repeated and persistent stimulation of a postsynaptic cell.
- This means if a neuron A causes another neuron B to activate, the weight between them is strengthened.
- Mathematically described by $\Delta \mathbf{w} = \eta\,\mathbf{x}\,y$, with $\Delta \mathbf{w}$ (in vector notation) being the change in synaptic strength between $\mathbf{x}$, the pre-synaptic input, and $y$, the post-synaptic output of the neuron, with a small learning rate $\eta$ (in continuous time, $\frac{d\mathbf{w}}{dt}$ is the rate of change of the synaptic weight with respect to time).
- Oja’s rule is a variant of the Hebbian learning rule that tries to minimize some physiological implausibilities (such as the fact that in Hebbian learning, the weights grow indefinitely).
- Mathematically formalized by $\Delta \mathbf{w} = \eta\,(\mathbf{x}\,y - y^2\,\mathbf{w})$ (the change in the weights over time depends on the learning rate times the input vector times the output, minus the forgetting term, which is composed of the squared output and the current weight configuration). The main features of Oja’s rule are a) an implicit normalization of the weight vector (this means that the weight vector's length converges to one) and b) a forgetting term that grows with the squared output of the neuron, thus preventing unrealistic, unlimited growth of the weight vector.
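The two update rules described in the list above can be sketched as single-step functions. This is a minimal illustration, assuming a single linear neuron with output y = w · x; the function names and the example values for `w`, `x`, and `eta` are hypothetical and chosen only to make the difference visible.

```python
import numpy as np

def hebbian_step(w, x, eta=0.01):
    """One pure Hebbian update: dw = eta * x * y, with y = w . x."""
    y = w @ x
    return w + eta * x * y

def oja_step(w, x, eta=0.01):
    """One Oja update: dw = eta * (x * y - y**2 * w), with y = w . x."""
    y = w @ x
    return w + eta * (x * y - y**2 * w)

# Hypothetical example values, chosen only for illustration.
w = np.array([0.6, 0.8])   # unit-length weight vector
x = np.array([1.0, 2.0])   # one input pattern

w_hebb = hebbian_step(w, x)
w_oja = oja_step(w, x)

print("norm after Hebbian step:", np.linalg.norm(w_hebb))
print("norm after Oja step:    ", np.linalg.norm(w_oja))
```

For a unit-length weight vector, the Oja update is orthogonal to the current weights, so to first order in the learning rate it only rotates the vector. The Hebbian update always lengthens it whenever the output is non-zero.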
As stated above, Hebbian learning for a single neuron can be reduced to $\Delta \mathbf{w} = \eta\,\mathbf{x}\,y$. Thus, the change in weights between some input neurons and a given neuron depends on the learning rate $\eta$, the firing rates of the input neurons $\mathbf{x}$, and the output of the given neuron $y$.
Unfortunately, this simple algorithm is physiologically implausible, primarily because the numerical values of the weights grow indefinitely under it (see the Appendix for a visual explanation). This leads to Oja’s rule, a variant of Hebbian learning, introduced as:

$$\Delta \mathbf{w} = \eta\,(\mathbf{x}\,y - y^2\,\mathbf{w}) \qquad (1)$$
Now, if we set this equation to zero, we find the ‘steady-state’ solution for the weight vector. This means that the change of the weight vector goes to zero. This is the point of convergence of the weight vector. In the following, I will show that this point of convergence leads to the weight vector pointing in the direction of the largest variance of the input data.
From information theory we know that “largest variance” is usually synonymous with “the largest amount of information in the input”.
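One standard case in which this link can be written down exactly is a Gaussian input. This is an assumption added here for illustration; the post itself does not fix a distribution. If the input along some direction is a Gaussian random variable $X$ with variance $\sigma^2$, its differential entropy is

```latex
h(X) = \tfrac{1}{2} \ln\left( 2 \pi e \, \sigma^2 \right)
```

which increases monotonically with $\sigma^2$. For non-Gaussian inputs the relation need not be this clean, which is why “usually” is the appropriate hedge.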
Let’s consider two cases that should make it clearer what we are talking about:
Take, for one, a dataset with everything that we have ever experienced. Basically, a large set of sensory inputs, including images of trees. Now, this dataset is pretty vast. Therefore, it makes sense for a system to “make sense” of this input sequence. Secondly, let’s take a dataset with lots of sensory data of trees. We’ve scanned several thousand examples of trees.
Now, what would this look like if applied to what we are trying to show?
We said that “largest variance” is usually synonymous with “the largest amount of information in the input”. This seems desirable for our first dataset, since we want to extract meaningful abstractions from the input. E.g. we want to find a concept of a tree that, when we perceive something, lets us better decide whether it is a tree or not. We can do that by looking for a property that maximizes the amount of variance between trees and everything else, but minimizes it among trees, i.e. a property that is the same for all trees, but different for everything else.
But if we look at our second dataset, we want to find properties of trees that vary strongly between them. This makes sense if we want to learn as much as possible about trees: we don’t care about the fact that every tree does photosynthesis, since that doesn’t tell us a lot about how trees differ. Instead, we care about all the properties that vary widely between them, e.g. branching patterns.
Let’s look at the mathematical implementation of how Oja’s rule converges to finding the direction of the largest variance (as we will later see, that’s the first principal component of the input) by considering a simple, single neuron with several inputs:
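Here, we can see that the neuron calculates the weighted sum of the inputs $\mathbf{x}$. With this, let’s introduce $y$ as

$$y = \sum_i w_i x_i = \mathbf{w}^T \mathbf{x}$$

Now, when we put this into (1), we get:

$$\Delta \mathbf{w} = \eta\,(\mathbf{x}\mathbf{x}^T \mathbf{w} - \mathbf{w}^T \mathbf{x}\mathbf{x}^T \mathbf{w}\,\mathbf{w})$$

Now, we want to look at the steady-state solution, i.e. the point of convergence, where the weight change is $\Delta \mathbf{w} = 0$ (this means that when we train the neuron again, it won’t update its weights again). For this, we average over $\mathbf{x}$, the input vector, and assume that the weights stay constant. Dividing out $\eta$, this gives:

$$0 = \langle \mathbf{x}\mathbf{x}^T \rangle\,\mathbf{w} - \mathbf{w}^T \langle \mathbf{x}\mathbf{x}^T \rangle\,\mathbf{w}\,\mathbf{w}$$

Furthermore, assuming that $\langle \mathbf{x} \rangle = 0$, we can equate $\langle \mathbf{x}\mathbf{x}^T \rangle$ with the covariance matrix, or second moment matrix, of the inputs, $C$:

$$0 = C\mathbf{w} - \mathbf{w}^T C \mathbf{w}\,\mathbf{w}$$

Since $\mathbf{w}^T C \mathbf{w}$ is a scalar, we can now substitute this with $\lambda$:

$$C\mathbf{w} = \lambda\,\mathbf{w}$$

This should remind you of the typical eigenvector equation for a linear transformation. Namely, this shows that $\lambda$ is an eigenvalue of $C$ and the weight vector $\mathbf{w}$ is one of the eigenvectors of $C$. It can be shown that Oja’s rule converges to the first principal component, i.e. the eigenvector belonging to the largest eigenvalue of the covariance matrix $C$.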
In other words: a simple neuron equipped with Oja’s learning rule will naturally converge towards the first principal component of the input data. The learned weight vector will thus point in the direction of the largest variance, where the greatest amount of information lies.
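The convergence claim can be checked numerically. The following is a minimal sketch, assuming a single linear neuron trained online with Oja’s rule on zero-mean 2-D Gaussian data. The covariance matrix, learning rate, number of epochs, and the helper name `oja_train` are all hypothetical choices for illustration. It compares the learned weight vector with the leading eigenvector of the sample covariance matrix.

```python
import numpy as np

def oja_train(X, eta=0.01, epochs=20, seed=0):
    """Train one linear neuron with Oja's rule on the zero-mean rows of X."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)                  # start from a random unit vector
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            y = w @ x                       # neuron output y = w^T x
            w += eta * (x * y - y**2 * w)   # Oja's rule
    return w

# Hypothetical 2-D input data with correlated components.
rng = np.random.default_rng(42)
C_true = np.array([[3.0, 1.5], [1.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=C_true, size=2000)
X -= X.mean(axis=0)                         # enforce <x> = 0

w = oja_train(X)

# First principal component from the sample covariance matrix.
C = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues in ascending order
pc1 = eigvecs[:, -1]

alignment = abs(w @ pc1)                    # sign of an eigenvector is arbitrary
print("alignment with PC1:", alignment)
print("norm of w:         ", np.linalg.norm(w))
```

With a small learning rate, the alignment should come out close to 1 and the norm of `w` close to 1. Because the updates are stochastic, the match is approximate rather than exact.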
The weight vector learns to encode the most correlated features of the input. This means that, when given inputs, the neuron learns to summarize the input data (reducing higher-dimensional concepts to lower-dimensional concepts) in a way that picks out related properties while still maximizing information. This holds for a single output neuron. If several such neurons are simply linked together, they will compete for the same principal components, but it can be shown that with some architectural tweaks, a larger network with Oja’s rule will converge to the principal components in descending order.
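One known variant of this kind is Sanger’s rule (the generalized Hebbian algorithm). Whether this is the exact “architectural tweak” meant above is an assumption. In it, each output neuron applies Oja’s rule to the input after subtracting what the earlier neurons already explain. The sketch below uses hypothetical 3-D data and hypothetical hyperparameters.

```python
import numpy as np

def sanger_train(X, k, eta=0.002, epochs=10, seed=0):
    """Train k linear neurons with Sanger's rule on the zero-mean rows of X."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(k, X.shape[1]))   # one weight row per neuron
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            y = W @ x
            # The lower-triangular part makes neuron i "see" only neurons 0..i.
            W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

# Hypothetical data: independent Gaussians with variances 9, 2, 0.3,
# rotated by a random orthogonal matrix.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X = (rng.normal(size=(2000, 3)) * np.sqrt([9.0, 2.0, 0.3])) @ Q.T
X -= X.mean(axis=0)

W = sanger_train(X, k=2)

eigvals, eigvecs = np.linalg.eigh(X.T @ X / len(X))
pcs = eigvecs[:, ::-1].T        # rows: principal directions, descending variance

cosines = [abs(W[i] @ pcs[i]) / np.linalg.norm(W[i]) for i in range(2)]
print("cosine with PC1, PC2:", cosines)
```

The first row of `W` follows plain Oja’s rule. The second row is pushed away from the first, so it should settle on the second principal direction.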
Knowing this, we can plausibly assume that intelligent agents, including the brain, can learn abstractions through a learning rule similar to Oja’s rule.
You may ask, how plausible is it to assume that the brain learns similarly; or: how closely does this really resemble how the brain learns associatively? For example, why is a normalization needed?
It seems plausible that normalization reflects the competition of newly formed synaptic connections, between a cell and a target cell, for trophic factors. These trophic factors are provided in limited capacity, leading to programmed cell death in the early development of the brain. Thus, there is a strong selection pressure on newly established connections. Similarly, since the weights of the connections leading to the target neuron in an ANN with Oja’s rule are implicitly normalized to 1 (see the proof of implicit normalization below), weights have to compete for their share.
Thus, Oja’s rule is better suited to resemble the brain’s learning rule than pure Hebbian learning. Of course, as always, neuroscience is a mess of caveats: quite likely the brain learns differently. But it seems reasonable to assume some similarity between how the brain learns and how an ANN with Oja’s rule learns.
Next, I want to list and explain empirical evidence from neuroscience that shows how the brain learns the causal structure of the things out there, i.e. the abstractions we want to point to. You can find the next post here: [Insert link].
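Proof for implicit normalization:
With the steady-state condition $C\mathbf{w} = \lambda\,\mathbf{w}$, where $\lambda = \mathbf{w}^T C \mathbf{w}$, we can see that $\mathbf{w}^T\mathbf{w} = \lVert \mathbf{w} \rVert^2$ has to be 1, since:

$$\lambda = \mathbf{w}^T C \mathbf{w} = \mathbf{w}^T \lambda\,\mathbf{w} = \lambda\,\mathbf{w}^T\mathbf{w} = \lambda\,\lVert \mathbf{w} \rVert^2$$

If that weren’t true (for $\lambda \neq 0$), then $\lambda = \mathbf{w}^T C \mathbf{w}$ would be false. Thus, the weights in the steady-state solution are normalized to 1.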
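The normalization argument can be illustrated with a small numerical check. This is a sketch using a hypothetical 2×2 covariance matrix: rescaling an eigenvector keeps it an eigenvector, but only the unit-length version also satisfies the scalar condition that the quadratic form wᵀCw equals the eigenvalue.

```python
import numpy as np

# Hypothetical covariance matrix, chosen only for illustration.
C = np.array([[3.0, 1.5], [1.5, 1.0]])
eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
lam, v = eigvals[-1], eigvecs[:, -1]      # largest eigenvalue, unit eigenvector

results = {}
for scale in (0.5, 1.0, 2.0):
    w = scale * v                         # still an eigenvector of C
    results[scale] = w @ C @ w            # equals scale**2 * lam
    print(f"|w| = {scale}:  w^T C w = {results[scale]:.4f},  lambda = {lam:.4f}")
```

Only the row with `|w| = 1` has the quadratic form equal to the eigenvalue, which is the constraint that pins the steady-state weight vector to unit length.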
Why is pure hebbian learning biologically implausible?
See this graph, which shows how the output of the algorithm behaves and develops over 250 iterations with a learning rate of 0.1:
Similarly, here are the plotted weight vectors $\mathbf{w}$, also for 250 iterations:
This shows that, with enough iterations, the weights of neurons equipped with Hebbian learning grow indefinitely. There is no decay term or anything similar, just increase.
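The growth described above can be reproduced with a short simulation. This is a minimal sketch using the same 250 iterations and learning rate of 0.1. The random 2-D Gaussian inputs and the initial weights are assumptions, since the data behind the original plots is not given.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 2))     # hypothetical inputs, one row per iteration
eta = 0.1                         # learning rate from the text
w = np.array([0.1, 0.1])          # hypothetical initial weights

norms = []
for x in X:
    y = w @ x                     # neuron output
    w = w + eta * x * y           # pure Hebbian update, no decay term
    norms.append(np.linalg.norm(w))

print("weight norm after first step:", norms[0])
print("weight norm after last step: ", norms[-1])
```

Each update adds a non-negative amount to the squared length of the weight vector, so the norm can never shrink. This is the unbounded growth that the forgetting term in Oja’s rule removes.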
We want to show that the following equation is also sufficient for achieving a PCA analyzer:

$$\Delta w = \eta\,(x\,y - y^2\,w)$$

This is equal to standard Hebbian learning with a decay term that depends on the squared output of the neuron. For that, we have the following setup:
So, let’s derive the steady-state solution for this equation and see whether the weight also encodes the direction with the highest variance here. We say that $\Delta w = 0$, so the left side is zero. Note that $x$ and $y$ here stand for the recent firing rates of the pre-synaptic and the post-synaptic neuron.

$$0 = x\,y - y^2\,w$$

Now we assume that the weights stay constant while we average over both sides.

$$0 = \langle x\,y \rangle - \langle y^2 \rangle\,w$$

Assuming that $\langle x \rangle = \langle y \rangle = 0$, we can substitute $\langle x\,y \rangle$ with $\mathrm{Cov}(x, y)$ and $\langle y^2 \rangle$ with $\mathrm{Var}(y)$. So:

$$w = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(y)}$$

Thus, the weight between two neurons depends on the covariance of $x$ and $y$ divided by the variance of $y$.
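The closing claim, that the steady-state weight equals the covariance of input and output divided by the variance of the output, can be checked numerically. This is a minimal sketch under several assumptions: the update has the form Δw = η(xy − y²w) for a single synapse, x and y are treated as given zero-mean firing-rate samples, and the averages are taken while the weight is held fixed, as in the derivation. The data and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Hypothetical correlated firing rates of a pre- and post-synaptic neuron.
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)
x -= x.mean()                     # enforce <x> = 0
y -= y.mean()                     # enforce <y> = 0

mean_xy = np.mean(x * y)          # <x y>
mean_yy = np.mean(y ** 2)         # <y^2>

# Iterate the averaged update  dw = eta * (<x y> - <y^2> * w).
eta, w = 0.1, 0.0
for _ in range(500):
    w += eta * (mean_xy - mean_yy * w)

w_pred = np.cov(x, y, bias=True)[0, 1] / np.var(y)
print("simulated steady state:", w)
print("Cov(x, y) / Var(y):    ", w_pred)
```

The averaged update is a linear contraction towards ⟨xy⟩/⟨y²⟩, so the two printed values should agree up to floating-point precision.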