A response to Noam Chomsky on where artificial intelligence went wrong.

I've got marking to do along with a dissertation chapter to write, so naturally it's about time for another blog post! A+ discipline. I've been wanting to write a response to an interview with Noam Chomsky provocatively titled "Where Artificial Intelligence Went Wrong." Go ahead and look through the interview; I'll put my reaction below the fold.

The interview is best understood in the context of the historical development of the field of linguistics, and the rise of the field of cognitive science. Way back in the 1950s, the behaviorist pioneer B. F. Skinner wrote a book purporting to provide a psychological theory of language entitled "Verbal Behavior." As a behaviorist account, this book sought to explain language as the result of "stimulus-response" type behavior, just like Pavlov's dogs. Chomsky showed, in a scathing review of Verbal Behavior, that straightforward "stimulus-response" type accounts that do not appeal to some notion of linguistic structure are inadequate for language.

Here's the basic idea behind the kinds of arguments Chomsky made. People can tell sentences that are syntactically well-formed but meaningless apart from random sequences of words, as demonstrated by the (now-famous) example: "colorless green ideas sleep furiously." This sentence is acceptable in a way that "green sleep ideas furiously colorless" is not. However, the sentence doesn't have any clear meaning, and so it's not obvious how this stimulus could receive a response that is systematically different from the response to "green sleep ideas furiously colorless"; at least, prior to decades of linguistic commentary on this particular meaningless sentence ;) . Chomsky argued that "colorless green ideas sleep furiously" is acceptable because it has the right kinds of words in the right order to provide the subject and verb phrase that constitute a sentence, but "green sleep ideas furiously colorless" does not have the words in the kind of order that allows a sentence to be constructed. So the two word sequences can be distinguished only by reference to some kind of psychologically-real internal structure.

Behaviorism, however, expressly forbids theories from referencing psychologically-real internal structure. Behaviorists argued that internal psychological processes are not objectively measurable, and good science only uses data that are measurable, so psychological theories should not speculate on what happens inside people's heads. If "colorless green ideas sleep furiously" is different from "green sleep ideas furiously colorless" only by virtue of psychologically-real internal structure, then behaviorist approaches are inherently broken. Chomsky argued that, yes, behaviorism is completely hopeless as a paradigm for understanding language, and good linguists should abandon behaviorism and focus on elucidating linguistic structures. This rejection of behaviorism bled over into other parts of psychology, leading to the "Cognitive Revolution" in which several subfields of psychology started proposing, and testing experimentally, psychologically-real structures.

Artificial Intelligence is, broadly put, the endeavor to make computers capable of doing what people can do. AI projects are often classified as "strong" or "weak" AI. Strong AI projects seek to build systems that do the things people can do in the same way that people do them, while weak AI projects seek to build systems using whatever we can get to work. Thus, strong AI projects can be viewed as trying to implement a psychological theory, but weak AI is more of an engineering field. Because decades of research following up on the Cognitive Revolution have shown that we can make at least some empirically-supported, substantive claims about the internal structures and processes of human cognition, strong AI should presumably be looking to implement non-behaviorist theories.

In the interview, Chomsky is essentially arguing that behaviorism (or "associationism") has snuck back into strong AI in the form of statistical models, and, since behaviorism is a fundamentally flawed paradigm for human cognition, this is bad. Computational models for human language have increasingly relied on statistical cues, especially in the last twenty years. In fact, the term "Language Model" almost always refers to what's called an n-gram model, where you compute the probability of a sequence of words by looking at how common portions of that sequence are. So a trigram model (n-gram with n set to 3) would determine the goodness of "colorless green ideas sleep furiously" by seeing how common "colorless green ideas," "green ideas sleep," and "ideas sleep furiously" are in some dataset. As long as the dataset doesn't contain too many papers by linguists (har har), the trigram model should give the sentence a very low probability. Thus, the trigram model makes exactly the same kind of mistake as the behaviorist approach, for the same reason: it maintains no notion of linguistic structure. Moreover, the trigram model can be viewed as a kind of behaviorist theory, wherein the subject's response to a stimulus of "x" is "increase the probability of trigram x."
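To make the trigram argument concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the three-sentence toy corpus is made up, the function names are mine, add-one smoothing is just one convenient way to handle unseen trigrams, and I ignore sentence-boundary markers. It is not any particular published language model.

```python
from collections import Counter
import math

def trigrams(words):
    """Return the consecutive word triples in a list of words."""
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

def train(corpus):
    """Count trigrams and their two-word contexts in a list of sentences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        words = sentence.split()
        vocab.update(words)
        for t in trigrams(words):
            tri[t] += 1
            ctx[t[:2]] += 1
    return tri, ctx, vocab

def log_score(sentence, tri, ctx, vocab_size):
    """Sum of add-one-smoothed log P(w3 | w1, w2) over the sentence's trigrams."""
    return sum(
        math.log((tri[t] + 1) / (ctx[t[:2]] + vocab_size))
        for t in trigrams(sentence.split())
    )

# Hypothetical toy corpus, standing in for "some dataset".
corpus = [
    "the dog sleeps soundly",
    "new ideas appear slowly",
    "the dog sleeps soundly tonight",
]
tri, ctx, vocab = train(corpus)

good = log_score("colorless green ideas sleep furiously", tri, ctx, len(vocab))
bad = log_score("green sleep ideas furiously colorless", tri, ctx, len(vocab))
seen = log_score("the dog sleeps soundly", tri, ctx, len(vocab))

# Neither test sentence shares a trigram with the corpus, so the model
# has no basis for preferring the well-formed one over the scramble.
print(good, bad, seen)
```

The well-formed sentence and the scrambled one receive the same low score here, because the only thing the model knows about is which triples it has seen before.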

Chomsky explicitly targets Bayesian approaches. Remember from my last post that a Bayesian approach explains observations (the coin flips) by averaging over possible hypotheses (the bias of the coin). For natural language syntax, this comes down to explaining sequences of words (or morphemes, or speech sounds) by averaging over possible grammars and syntactic analyses. In the interview, Chomsky portrays Bayesian approaches as simply trying to regenerate the observations, and says that a system that simply regenerates what you saw does not improve your scientific understanding. I agree that a system that merely regenerates what you saw does not improve your scientific understanding, but most Bayesian approaches do not just regenerate the observations. A Bayesian approach that only regenerates our word strings would pick an n-gram model to be the likelihood function. Remember that a likelihood function is the term P( D | \theta ). If we've picked a trigram likelihood function, our observed data D would be triples of words, and \theta would just be a table that tells us the probability of each trigram. We would then compute the probability of our observed data by averaging over all possible settings of the table \theta, each weighted by its prior probability.
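As a rough illustration of what "averaging over \theta" means in practice, the sketch below computes that average for a single row of such a table (one distribution over outcomes). The symmetric Dirichlet prior and the function names are my assumptions for the example; nothing above commits to a particular prior. The second function is a brute-force check for the two-outcome (coin) case under a uniform prior.

```python
import math

def log_marginal(counts, alpha=1.0):
    """log P(D) for one observed sequence with the given outcome counts,
    averaging the likelihood P(D | theta) over every theta under an
    (assumed) symmetric Dirichlet(alpha) prior."""
    k = len(counts)
    n = sum(counts)
    result = math.lgamma(k * alpha) - math.lgamma(n + k * alpha)
    for c in counts:
        result += math.lgamma(c + alpha) - math.lgamma(alpha)
    return result

def grid_average(heads, tails, steps=100_000):
    """Brute-force version for a coin: average theta^heads * (1 - theta)^tails
    over an evenly spaced grid of theta values (a uniform prior)."""
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) / steps  # midpoint of the i-th grid cell
        total += theta ** heads * (1 - theta) ** tails
    return total / steps

# Two heads and one tail: the closed form and the grid should agree closely.
print(math.exp(log_marginal([2, 1])), grid_average(2, 1))
```

For a full trigram table you would add up one such term per two-word context, since each context has its own row of next-word probabilities.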

I agree with Chomsky that this is a boring model, but there's no reason to confine ourselves to such boring models. I like to work with a likelihood function called the Dependency Model with Valence, or DMV (originally devised by Dan Klein and Christopher Manning). It generates not just the observed words but also a tree structure for them:

(Unlabeled) dependency parse for "colorless green ideas sleep furiously."

This kind of likelihood function has two kinds of generated variables: observed variables w (the words) and hidden variables t (different parts of the tree), and the DMV is one definition for how to compute the likelihood function P(w, t | \theta). The hidden variables represent the psychologically-real internal structure the cognitive agent is proposed to be manipulating. Empirical evaluations of the trees from this kind of model show that higher probability trees tend to correspond more closely to linguists' intuitions about syntactic trees than trees produced by chance or by various baselines, bolstering the notion that they can be interpreted as syntactic trees. The probability of a tree and words under this kind of likelihood function will be lower than the probability of words under the trigram likelihood function, so this approach isn't just trying to reproduce noisy data, as Chomsky alleges. Instead, it uses results from linguistic theory, laboratory experiments, and corpus studies to guide the kind of hidden structure we build, and the form of the likelihood function we rely on.
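For readers who want to see the shape of such a likelihood function, here is a simplified sketch in the spirit of the DMV. It is not Klein and Manning's implementation: the table layout, the names, the uniform toy parameters, and the use of word classes rather than words as the generated symbols are all assumptions I am making for illustration. Each head decides, in each direction, whether to stop or to generate another dependent, and the stop decision depends on whether the head already has a dependent on that side (the "valence" part).

```python
import math
from collections import defaultdict

def dmv_log_prob(tags, heads, p_root, p_stop, p_choose):
    """log P(w, t | theta) for one sentence under a simplified DMV-style model.

    tags:     word classes, one per word
    heads:    heads[i] is the index of word i's head, or -1 for the root
    p_root:   p_root[tag]
    p_stop:   p_stop[(head_tag, direction, adjacent)]
    p_choose: p_choose[(head_tag, direction, dependent_tag)]
    """
    logp = 0.0
    for i, h in enumerate(heads):
        if h == -1:
            logp += math.log(p_root[tags[i]])
    for h, tag in enumerate(tags):
        for direction in ("left", "right"):
            deps = [
                i for i, hd in enumerate(heads)
                if hd == h and (i < h if direction == "left" else i > h)
            ]
            adjacent = True  # no dependent generated yet on this side
            for d in deps:
                logp += math.log(1.0 - p_stop[(tag, direction, adjacent)])
                logp += math.log(p_choose[(tag, direction, tags[d])])
                adjacent = False
            logp += math.log(p_stop[(tag, direction, adjacent)])
    return logp

# One plausible encoding of the parse pictured above: "sleep" is the root,
# "ideas" and "furiously" depend on it, and both adjectives depend on "ideas".
tags = ["ADJ", "ADJ", "NOUN", "VERB", "ADV"]
heads = [2, 2, 3, -1, 3]

# Hypothetical, uninformative parameters (uniform over four word classes).
p_root = {t: 0.25 for t in ("ADJ", "NOUN", "VERB", "ADV")}
p_stop = defaultdict(lambda: 0.5)
p_choose = defaultdict(lambda: 0.25)

print(dmv_log_prob(tags, heads, p_root, p_stop, p_choose))
```

The point of the sketch is only that the tree t enters the probability alongside the words, so different trees over the same words get different scores once \theta is no longer uniform.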

Chomsky's critique is not unreasonable, because there are people who focus on models that only generate or represent fairly superficial aspects of linguistic data. However, it is unreasonable to dismiss all Bayesian methods as efforts to fit noise.

Computational modeling of this kind is important, especially in the context of language acquisition. Elaborate morphological and syntactic structures are important to explain adult competence, but, during learning, a structure that has more parameters (entries in \theta) than are supported by the amount of data available will be too ambiguous to be useful, even if the structure is correct. Bayesian approaches allow us to measure the ambiguity of different kinds of assumed structure directly, by specifying different likelihood functions, regardless of the learning strategy the child actually uses.

Alright, back to marking. Thanks!
