Why does linguistic structure exist?

Why does linguistic structure exist? The common-sense view of language is that it is an adaptation for communication, and all of the crazy nested structures (i.e. phrases inside of phrases) that we see in language don't seem necessary to communicate well: just look at Morse code! We could define an alternative, nested Morse code that had to be parsed with a context-free grammar to decode the message, but this seems not only unnecessary but wasteful. An alternative view comes from the Chomskyans, who hold that language is not about communication but about thought, and who would attribute the nested structure to the inherently hierarchical and recursive nature of thought.

In this post, I'm going to discuss an alternative resolution to this paradox that I've been playing around with. It started developing while writing a paper with Sharon Goldwater (which will appear soon in the Journal of Memory and Language). Briefly, several people have been exploring the proposal that human speech is adapted for communication that is efficient, in a way that is defined by information theory. Information theory defines certain limits on how quickly information can be transferred, and part of our motivation for the paper was the realization that, for natural language, there is a conflict between incremental speech (producing and comprehending speech one word or unit at a time) and information-theoretic efficiency.

The details of this realization were beyond the scope of that paper, but in this post I want to describe this conflict and explore some future directions that capitalize on it directly.

Entropy and Source Coding

Let's be more specific about what "efficient in a way that is defined by information theory" means. The basic set-up in information theory is that we have some message composed of message characters m_i, and we want to transfer it to our listener by translating the message into some signal composed of signal characters. For convenience, the signal characters are usually set to be just ones and zeros. A good code in information theory has two fundamental properties:

  1. It is short: each message has, on average, a short signal (few signal characters).
  2. It has a low error rate: the message the receiver (listener) interprets is almost always the same as the one the sender (talker) intended.

We'll ignore the error rate in this post and just consider the first point: looking for short signals. Let's adopt the following notation. All of the possible message characters are indexed by a subscript, m_i; if our alphabet of message characters is the English alphabet, then m_1 = A, m_2 = B, and so on. We'll denote the length of the signal assigned by code c to a message character by l_c(m_i). For example, if we decide to encode the message character "A" with 00111, then l_c(\mbox{A}) = 5.

We can compute the expected length of a code \mathcal{E}\left[l_c\right] using the usual formula for the expected value of a function:

\mathcal{E}\left[l_c\right] = \sum_{m_i} P(m_i)l_c(m_i)
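
To make this concrete, here is a minimal Python sketch of the expected-length formula, using a made-up three-character message alphabet and a hypothetical code c; both the probabilities and the codewords are assumptions chosen purely for illustration.

    # Minimal sketch of E[l_c] for a toy message alphabet and a hypothetical code c.
    # The probabilities and codewords below are invented for illustration.
    P = {"A": 0.5, "B": 0.25, "C": 0.25}       # P(m_i)
    code = {"A": "0", "B": "10", "C": "111"}   # binary signal assigned to each m_i

    def expected_length(P, code):
        """E[l_c] = sum over m_i of P(m_i) * l_c(m_i)."""
        return sum(p * len(code[m]) for m, p in P.items())

    print(expected_length(P, code))            # 0.5*1 + 0.25*2 + 0.25*3 = 1.75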

Now we can ask: what is the shortest possible code? I.e., what is the optimal set of length assignments l^* that minimizes the expected length?

l^* = \underset{l}{\mbox{argmin}}~\mathcal{E}\left[l\right]

It turns out that there is a single answer: l^*(m_i) = -\log_bP(m_i), where the base of the logarithm is the size of the signal alphabet (i.e., it is 2 if the signal alphabet is binary letters, it is about 40 if the signal alphabet is English phonemes, etc.). This quantity is variously called the surprisal or Shannon information of m_i, and the expected surprisal is the entropy of the probability distribution over m_i. Adjusting your code so that l_c(m_i) matches its Shannon information is called source coding. Intuitively, this means our signals will be shorter on average if we use our shorter signals for more frequent messages. For example, if the phonological form of a word is much longer than its Shannon information, we could improve our efficiency by making that word shorter. Conversely, if the phonological form of a word is much shorter than its Shannon information, we could improve our efficiency by lengthening that word, thereby freeing up a short phonological form to be used by a different, more frequent word.
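
Continuing the toy example from above, here is a sketch of the surprisal and entropy calculations, assuming a binary signal alphabet; note that the hypothetical code above (which spends three characters on C) falls short of this limit, and a code whose lengths match the surprisals would do better.

    import math

    # Surprisal and entropy for the same toy distribution, assuming b = 2.
    P = {"A": 0.5, "B": 0.25, "C": 0.25}

    def surprisal(p, b=2):
        """Shannon information of a message character: -log_b p."""
        return -math.log(p, b)

    print({m: surprisal(p) for m, p in P.items()})       # {'A': 1.0, 'B': 2.0, 'C': 2.0}

    entropy = sum(p * surprisal(p) for p in P.values())
    print(entropy)                                        # 1.5 bits: the lowest achievable E[l]

    # A code whose lengths match the surprisals, e.g. {"A": "0", "B": "10", "C": "11"},
    # achieves an expected length of 1.5, beating the 1.75 of the code above.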

Encoding Messages

At this point you may notice a problem: -\log_b P(m_i) will almost never be an integer, so we might have trouble approaching the Shannon information by adjusting the lengths of the codes for individual message components (such as their phonological forms), since lengths must be whole numbers. For example, if the Shannon information of some message component is 2.4, and we need to encode it twice, we'll have to pick a codeword that is 3 characters long, for a total of 6 characters when the theoretical limit is only 4.8.
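
Here is that arithmetic spelled out in code; the probability is chosen artificially so that the surprisal comes out to 2.4 bits.

    import math

    # The integer-length problem: a character whose surprisal is 2.4 bits still
    # needs a whole number of signal characters. Probability chosen artificially.
    p = 2 ** -2.4
    surprisal = -math.log2(p)              # approximately 2.4 bits

    per_char = math.ceil(surprisal)        # 3 bits: the best a single codeword can do
    print(2 * per_char)                    # 6 bits to encode the character twice
    print(2 * surprisal)                   # about 4.8 bits: the theoretical limit
    print(math.ceil(2 * surprisal))        # 5 bits: achievable with one codeword for the pair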

There are approaches to coding that get around this limit. The simplest is called "block coding." In a block code, we encode sequences of message characters instead of individual message characters. In the example above, instead of providing one codeword for each message character, we can provide one codeword, of length 5, for the pair of characters. This new codeword is much closer to the theoretical limit, and we can approach the theoretical limit arbitrarily well (in the limit of long messages) by picking longer blocks. Using a block code has a downside, however: we lose a degree of incrementality, since we have to wait until the end of the block to interpret the message characters in it.
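
Below is a sketch of why longer blocks help, assuming Shannon-style codeword lengths of ceil(-log2 P(block)) and a made-up two-character distribution: the rounding loss is less than one bit per block, so the per-character cost sits within 1/n bits of the entropy and shrinks as the block length n grows.

    import math
    from itertools import product

    # Expected per-character cost of a block code with codeword lengths
    # ceil(-log2 P(block)), for i.i.d. characters from an invented distribution.
    P = {"A": 0.7, "B": 0.3}
    entropy = -sum(p * math.log2(p) for p in P.values())   # about 0.881 bits per character

    for n in (1, 2, 4, 8):
        cost = 0.0
        for block in product(P, repeat=n):                 # every block of length n
            p_block = math.prod(P[c] for c in block)
            cost += p_block * math.ceil(-math.log2(p_block))
        print(n, cost / n)                                  # per-character cost: within 1/n bits of the entropy

    print(entropy)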

However, block coding is not the only way to circumvent this obstacle. Arithmetic coding, in fact, can be processed incrementally while approaching theoretical efficiency limits. Is it possible that natural language implements some other strategy, so there is no tension between incrementality and brevity?
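
For intuition, here is a rough sketch of the idea behind arithmetic coding (interval narrowing only, not a full encoder with bit output, and using an invented toy distribution): each character shrinks the current interval in proportion to its probability, so the final interval's width is the product of the probabilities, and pinning it down takes roughly the message's total surprisal in bits, while the narrowing itself proceeds one character at a time.

    import math

    # Interval-narrowing sketch of arithmetic coding (not a production encoder).
    P = {"A": 0.5, "B": 0.25, "C": 0.25}
    cum = {"A": 0.0, "B": 0.5, "C": 0.75}       # cumulative probability before each character

    def narrow(message):
        low, width = 0.0, 1.0
        for ch in message:                      # incremental: one character at a time
            low += width * cum[ch]
            width *= P[ch]
        return low, width

    low, width = narrow("ABCA")
    print(width)                                # 0.5 * 0.25 * 0.25 * 0.5 = 0.015625
    print(-math.log2(width))                    # 6.0 bits: the message's total surprisal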

Why language is probably not information-theoretically optimal

Now I will describe my reason for thinking that natural language does not implement any such strategy. First, observe that a substantial portion of linguistic meaning is compositional: the meaning of the whole is analyzed into parts, each of which is then encoded into a signal. For example, a \lambda-expression is divided into sub-expressions:

\lambda P Q . \exists x( Px \wedge Qx ),~~~~\mathit{cat}^\prime,~~~~\mathit{run}^\prime

and each sub-expression is encoded into a word: Some, cat, runs. So the length of the signal for the message is the sum of the lengths of the signals for the parts of the message.

To make this precise, we need to introduce just a touch more notation, so we can talk about entire messages that are composed of message characters. A complete message is denoted by boldface \mathbf{m}, and we will index locations within the message with superscripts: m^j is the j^{\mbox{th}} character of message \mathbf{m}. Then, we can see that because language is compositional:

l(\mathbf{m}) = \sum_{m^j \in \mathbf{m}} l(m^j)

Let's call this the "sum of lengths" equation.

Ok, there's just one more concept. In the Entropy section, we looked at how to compute optimal lengths, given a probability distribution. However, with a little algebra, we can see how to go the other way. If I am given a set of signal lengths, what is the probability distribution for which they would be optimal? We call this the implicit probability distribution Q that the code assumes is true. Since we take a log to compute lengths from probabilities, we exponentiate to compute probabilities from lengths. Assuming a binary signal alphabet:

Q(m_i) = \frac{2^{-l_c(m_i)}}{z}

(don't worry too much about the z term. It just ensures that Q sums to 1. z will be equal to 1 if our code is complete, i.e. there are no "accidental gaps" in the set of signals). Now, we can ask: What does natural language assume about the distribution over messages? For simplicity, let's assume a binary signal alphabet; we'll see the signal alphabet doesn't matter because it cancels. Because natural language is compositional, we can use the "sum of lengths" equation to show that natural language assumes that the probability of a message is a product of the probability of each part of the message:

\begin{array}{rcrr} Q^\prime( \mathbf{m} ) & = & \frac{2^{-l(\mathbf{m})}}{z^\prime} & \mbox{implicit distribution over messages} \\ & = & \frac{ 2^{\sum_{m^j \in \mathbf{m}} -l(m^j) } }{ z^\prime } & \mbox{``sum of lengths''} \\ & = & \frac{ 2^{\sum_{m^j \in \mathbf{m}} \log_2(zQ(m^j)) } }{ z^\prime } & \mbox{implicit distribution over components} \\ & = & \frac{ 2^{\log_2(\prod_{m^j \in \mathbf{m}} zQ(m^j)) } }{ z^\prime } \\ & \propto & \prod_{m^j \in \mathbf{m}} Q(m^j)\end{array}

In other words, because natural language is compositional, it assumes that different components of a message are statistically independent! However, different parts of linguistic messages are highly dependent. Thus, natural language implicitly relies on a family of probability distributions that is far too constrained to capture the true distribution over messages for optimal source coding.
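
To see what this independence assumption costs, here is a toy numerical example with a made-up joint distribution over two-part messages (none of the numbers come from a corpus): a compositional code pays the sum of the marginal entropies per message, an unconstrained code pays the entropy of the joint, and the gap between them is the mutual information between the parts.

    import math

    # Cost of the independence assumption for an invented two-part message distribution.
    joint = {("cat", "meows"): 0.4, ("cat", "barks"): 0.1,
             ("dog", "meows"): 0.1, ("dog", "barks"): 0.4}

    def entropy(dist):
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    marg1, marg2 = {}, {}
    for (a, b), p in joint.items():
        marg1[a] = marg1.get(a, 0.0) + p
        marg2[b] = marg2.get(b, 0.0) + p

    compositional = entropy(marg1) + entropy(marg2)   # 2.0 bits per message
    optimal = entropy(joint)                          # about 1.72 bits per message
    print(compositional, optimal)
    print(compositional - optimal)                    # about 0.28 bits wasted per message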

Speculations about language evolution and change

This view suggests that language is far from optimal in the sense of information theory. It does not, however, mean that there is no pressure in the direction of optimal behavior (indeed, in my paper with Sharon Goldwater we conclude that there is evidence of pressure in that direction). This view does make some suggestions about what such moves towards optimal behavior should look like. We could, for example, use the same block coding trick as before: instead of encoding grammatical elements as separate message components, encode ensembles of grammatical elements simultaneously. Now, however, we are dealing with a much larger space of message components, such as all the words in the language, so we'd want to get a lot of "bang for our buck": the increase in fit of our low-dimensional approximation must be very large relative to the additional parameter introduced by the larger block. A likely example here is the reduction of "going to" to "gonna." This happens only when "to" is part of an infinitive, never when "to" is part of a prepositional phrase:

  • I'm gonna school you in chess!
  • *I'm gonna school tomorrow.

In "gonna", we see message components that were previously encoded in two separate words encoded in one amalgamated and overall shorter block.

This is a process called grammaticalization, and Joan Bybee and others have long proposed that usage-based processes drive these kinds of changes. What does this extra information-theoretic framework give us? Well, first, it opens up a library of theoretically motivated measures for a broader range of data-driven tests. Second, I still have a lot of reading to do in the grammaticalization literature, but, as far as I can tell, the frequency effects are assumed to be given, as some kind of basic feature of how brains work. This framing suggests that the role of frequency effects in grammaticalization may itself reflect an adaptation towards some balance between incrementality and brevity.
