I just spent a week with neuroscience and vision people, and was pointed to a pair of relatively recent neuroscience papers (by Rodrigo Perin, Henry Markram, and Thomas K Berger) whose potential computational-level implications I want to think through. First, a disclaimer: I'm not a neuroscientist, and I can't personally evaluate the biological methodology. But, assuming their basic measurements and interpretations are correct, I think the work is really cool.
To start, let's talk a bit about the very basics of what neurons have to do with computation, and then turn to Hebbian learning. Neurons ``fire'' by sending out an electrical charge, and a neuron is more likely to fire when it receives electrical charges from those neurons it is connected to. However, not all neural connections are the same: some are strong, and some are weak, and a neuron may not fire if it receives a charge from many weak connections, and it may fire if it receives a charge from a few strong connections.
We can understand this better by imagining a very small artificial neural network with two ``input'' neurons, call them $x_1$ and $x_2$, that are both connected to one ``output'' neuron, $y$.
(This kind of network would actually not really be suitable for Hebbian learning; I'm just trying to show how different weights can encode different functions that might be useful for cognition.)
As a simplification, let's say that, at any given time, if an input neuron fires, it sends a 1 along the connection to the output neuron, otherwise it sends a 0. To model the relevance of strong and weak connections, whatever an input neuron sends out is multiplied by the weight of its connection to the output neuron, $w_1$ for the first input and $w_2$ for the second. So at any given time, the output neuron receives as activation whether the first input neuron fires times its weight, plus whether the second input neuron fires times its weight:
$$a = w_1 x_1 + w_2 x_2$$
We can then say that the output neuron fires if its incoming activation $a$ is greater than or equal to one. If we set $w_1$ and $w_2$ both to 1, then the output neuron fires if one or both input neurons fire: the network computes $x_1$ OR $x_2$. Alternatively, we can set $w_1$ and $w_2$ to 0.5, and the output neuron fires only if both input neurons fire: the network computes $x_1$ AND $x_2$. Finally, we can set $w_1$ to 1 and $w_2$ to -1, in which case the output neuron fires only if the first input neuron fires and the second does not: the network computes $x_1$ AND NOT $x_2$. So we can see that networks of neurons can in principle compute different functions of the input if all we do is change the strength of their connections. In fact, if we allow multiple layers of neurons, with output neurons that in turn connect to other neurons, this kind of network is capable of representing any function of the input with enough neurons.
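To make the threshold unit above concrete, here is a minimal Python sketch of it. The function names `fires` and `truth_table` are my own, and the threshold of 1 is the one assumed in the text; this is just an illustration, not anything taken from the papers.

```python
from itertools import product

def fires(inputs, weights, threshold=1.0):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation >= threshold else 0

def truth_table(weights):
    """Outputs of the two-input unit for inputs (0,0), (0,1), (1,0), (1,1)."""
    return [fires(pair, weights) for pair in product((0, 1), repeat=2)]

print(truth_table((1.0, 1.0)))   # OR:  [0, 1, 1, 1]
print(truth_table((0.5, 0.5)))   # AND: [0, 0, 0, 1]
print(truth_table((1.0, -1.0)))  # first AND NOT second: [0, 0, 1, 0]
```

The only thing that changes between the three calls is the pair of weights; the unit itself stays the same.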
Under the proposal that cognition is computation, then, this all speaks towards the view that the biological implementation of learning comes down to finding the right weights. Brains have immense numbers of neurons and astronomical numbers of connections: the usually-reported numbers are 100 billion neurons with about 10,000 connections each. Each. It is hard to imagine how these weights could be encoded in the comparatively small human genome, and so most work has looked for biologically-plausible learning rules that could apply throughout the brain.
Hebb proposed, appropriately, Hebbian learning. The intuition is simple: when two neurons fire at the same time, any connections between them become stronger. Computationally, this allows the network to encode correlations in the environment. More importantly, biological neurons behave in this way, with a telling wrinkle. In the simple intuition of Hebbian learning stated above, connections only become stronger. For biological neurons, however, there are physical limits on how strong connections can be. This leads to a ``saturation'' problem where, eventually, all connections are as strong as physically possible. It turns out, however, that biological neurons also weaken connections between neurons that rarely fire together. These two rules, one for strengthening and one for weakening, allow effective Hebbian learning within fixed limits on biological connections.
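Here is a rough sketch of what a strengthen-and-weaken rule with fixed limits might look like for a single connection. Everything specific in it is my assumption for illustration: the learning rate, the limits of 0 and 1, and especially the choice to weaken whenever exactly one of the two neurons fires. Real synapses are far more complicated than this.

```python
def hebbian_update(weight, pre, post, rate=0.1, w_min=0.0, w_max=1.0):
    """One bounded Hebbian step for a single connection.

    pre and post are 1 if the corresponding neuron fired, else 0.
    """
    if pre and post:
        weight += rate   # fired together: strengthen
    elif pre or post:
        weight -= rate   # only one fired (assumed weakening rule): weaken
    return max(w_min, min(w_max, weight))  # stay within the physical limits

def train(events, weight=0.5):
    """Apply the update to a sequence of (pre, post) firing events."""
    for pre, post in events:
        weight = hebbian_update(weight, pre, post)
    return weight

correlated = [(1, 1)] * 20             # the two neurons always fire together
uncorrelated = [(1, 0), (0, 1)] * 10   # the two neurons never fire together

print(train(correlated))    # 1.0, pinned at the upper limit
print(train(uncorrelated))  # 0.0, pinned at the lower limit
```

With only the strengthening branch, every connection that ever sees a coincidence would drift up to the upper limit and stay there, which is the saturation problem in miniature.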
Ok, on to the results of the actual papers that inspired this post. Most work has ignored the question of which neurons connect to which neurons. Computational models of neural networks typically have pretty simple structure, with every neuron either connected to every neuron (or maybe to every neuron in a different layer) or connected to random neurons. This tabula rasa approach assumes that the only interesting structure in neural connections is learned structure in the weights. The papers look for, and find, interesting structure in neural connections that is not learned (at least not in a Hebbian way).
Specifically, they take slices of brains from 14-day-old rats, and look at structure not in the strength of connections but whether or not any connection exists at all, finding a number of interesting predictors. First, as you would expect, neurons are, broadly, more likely to be connected if they are near each other. However, this trend holds only for distances greater than 100 micrometers; neurons that are 50 micrometers apart are less likely to be connected than neurons that are 100 micrometers apart, even though they are closer to each other. So distance matters, but something else also matters, at least for the first 150 micrometers or so.
That something else is whether the two neurons under consideration are also connected to other neurons. So, if we want to know if A and B are connected, then we should check how far apart they are, and also see how many other neurons they are both connected to. If they are connected to many other neurons, then there is a much higher chance that they are also connected to each other. Even more interestingly, it turns out that there are recurring patterns in the actual shape of the connections, which the paper calls motifs.
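The ``how many other neurons are they both connected to'' predictor is easy to state in code. The sketch below is purely illustrative: the toy wiring is made up, the function name `common_neighbors` is mine, and I ignore the direction of connections for simplicity.

```python
def common_neighbors(connections, a, b):
    """Count neurons other than a and b that are connected to both of them.

    connections maps each neuron to the set of neurons it touches,
    ignoring direction for simplicity.
    """
    return len((connections[a] & connections[b]) - {a, b})

# Hypothetical toy wiring, not data from the papers.
toy = {
    "A": {"C", "D", "E"},
    "B": {"C", "D", "F"},
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"A"},
    "F": {"B"},
}

print(common_neighbors(toy, "A", "B"))  # 2 (C and D)
print(common_neighbors(toy, "E", "F"))  # 0
```

On the finding described above, a pair like A and B would be a better bet for a direct connection than a pair like E and F at the same distance.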
These results together indicate that neurons do not form connections randomly but focus on forming these small (2-6 neuron) motif networks that then connect to each other. But here's the real kicker: remember that saturation problem with Hebbian learning, where connection strengths hit physical limits? According to the paper, connections within these motifs are saturated, so meaningful Hebbian learning only occurs between motifs.
The authors of these two papers talk about these motifs as ``Lego-like building blocks for memory and perception,'' with each motif encoding fundamentally different ways of integrating input activations, and with motifs joining together to form larger motifs. Now, I think the authors are speculating at this point, but they say that the stability of these motifs throughout a species provides a stable basic toolkit for high-level perception and memory, explaining why different members of a given species appear to perceive inputs in similar ways. That is, if perception and memory were strictly based on learning from the environment, we would expect enormous individual variation throughout a species. While some variation certainly exists, perception is remarkably stable throughout a species. They propose that this stability is the result of different individuals relying on similar, innately specified relative prevalences of these motifs.
Back to Language
So what might this have to do with language? At this point, it doesn't necessarily have anything to do with language. These motifs are extremely low-level structures, involving literally fewer than 10 cells, and it's entirely possible that the computations involved in language happen at a scale that is so much larger that this low-level detail is not relevant. However, this is my blog so I can do what I want, and right now I want to speculate wildly on what these motifs might potentially have to do with language.
As I mentioned in my Principles & Parameters (P&P) post, one of the central questions of modern linguistics, and specifically those parts that involve the acquisition, representation, and processing of syntax, has been what kinds of grammatical regularities exist across languages. For example, languages tend to put syntactic heads either at the beginning of syntactic phrases or at the ends of syntactic phrases. In my P&P post, I described how P&P tries to exploit these regularities by genetically specifying an auxiliary Parameter space that explicitly represents each Parameter, and then exploring this space (which is exponentially large in the number of parameters) by way of a random walk. I also described how data-driven approaches similarly exploit a smaller space, but, in this case, the smaller space is a subspace of the observed and unobserved variables whose shape is determined by both the observed data and the innate functional form of the assumed statistical dependencies.
I'm not sure how these motifs/lego blocks would fit into the P&P approach, but they look an awful lot like they could provide an inventory of innate functional forms for statistical dependencies. Hebbian learning algorithms, at least in the simple idealized forms I'm familiar with that don't worry too much about the neural hardware, implicitly rely on linear dependencies in a continuous space (i.e. Hebbian learning basically does Principal Component Analysis, which assumes linear components).
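One standard idealization that makes the link between Hebbian learning and PCA explicit is Oja's rule, a Hebbian update with a built-in decay term that keeps the weights bounded. The sketch below is my own toy illustration of that rule, not anything from the papers: the synthetic data, the learning rate, and the function names are all assumptions. The data is stretched along the (1, 1) diagonal, so the first principal component should point along that diagonal.

```python
import math
import random

def oja_first_component(samples, rate=0.01, start=(1.0, 0.0)):
    """Run Oja's rule over the samples and return the learned weight vector."""
    w = list(start)
    for x in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))  # output activation
        # Hebbian term rate * y * x, plus a decay term that bounds the weights.
        w = [wi + rate * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

def make_samples(n, seed=0):
    """Zero-mean 2D points with most variance along the (1, 1) diagonal."""
    rng = random.Random(seed)
    s = 1 / math.sqrt(2)
    samples = []
    for _ in range(n):
        major = rng.gauss(0, 1.0)  # spread along the (1, 1) diagonal
        minor = rng.gauss(0, 0.2)  # spread along the (1, -1) diagonal
        samples.append((s * (major + minor), s * (major - minor)))
    return samples

w = oja_first_component(make_samples(5000))
norm = math.hypot(*w)
# Absolute cosine between the learned weights and the (1, 1) diagonal.
alignment = abs(w[0] + w[1]) / (math.sqrt(2) * norm)
print(w, alignment)
```

The learned weights end up close to a unit vector along the diagonal (up to sign), which is the sense in which this kind of Hebbian learner only picks up linear structure.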
I see three basic ways to exploit non-linear dependencies in a high-dimensional space to find that useful lower-dimensional manifold. First, we can just fit lots of linear functions and locally approximate the manifold with just the linear functions. Second, we can move to more general ``manifold-embedding'' methods that build a mesh in the high-dimensional space based on the observed data points that should trace out the manifold on which they lie. Third, we can augment our original high-dimensional space with even more dimensions that are non-linear transforms of the original dimensions, and hope that the non-linear relationships become linear once we get these extra dimensions (this is the basic idea behind kernel methods). Motifs, and their compositions, might provide a good enough set of non-linear transforms to enable language learning and processing.
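The third option can be illustrated with the same kind of threshold unit used earlier in the post. XOR is the classic function that a single threshold unit cannot compute from the raw inputs, but it becomes computable once we add the product of the two inputs as an extra dimension. This is a minimal sketch with an explicit extra feature (the simplest version of the idea, not a full kernel method); the weights and names are my own choices.

```python
from itertools import product

def fires(inputs, weights, threshold=1.0):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    return 1 if sum(x * w for x, w in zip(inputs, weights)) >= threshold else 0

PAIRS = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)
XOR = [a ^ b for a, b in PAIRS]          # [0, 1, 1, 0]

# Raw inputs: no pair of weights on this coarse grid reproduces XOR.
grid = [i / 2 for i in range(-8, 9)]
raw_hits = [w for w in product(grid, repeat=2)
            if [fires(p, w) for p in PAIRS] == XOR]

# Augmented inputs: add the non-linear feature a * b as a third dimension.
augmented = [(a, b, a * b) for a, b in PAIRS]
lifted = [fires(p, (1.0, 1.0, -2.0)) for p in augmented]

print(raw_hits)       # []
print(lifted == XOR)  # True
```

The grid search is only a spot check, but the failure is general: the unit would need each weight to be at least 1 while their sum stays below 1. In the three-dimensional space, one linear threshold is enough.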
I think a fully worked-out model would concretely relate the shapes of these motifs to, say, syntactic subtrees, and compute parse probabilities in terms of these subtrees, perhaps relying on these subtrees in proportion to the prevalence of the corresponding motif in areas of the brain that process syntax. We could then evaluate variants of this neurally-inspired model by seeing to what extent it produced accurate parses and predicted measures of syntactic processing, such as reading times.
So what do y'all think of my wild speculation?