Why are so many (recent) cognitive models Bayesian?

Computational cognitive science and artificial intelligence go through periods when different kinds of models are fashionable. Early on, it was all symbolic processing with rules and predicate logic, then people got excited about connectionism, then everyone started putting weights or probabilities on their rules. The current big fad in Cognitive Science is Bayesian modeling. I like this fad, so I'm going to talk about it.

In an earlier post about computational modeling, I promised to talk about how a computational model could improve our understanding of cognition without asserting that people actually do what the computational model does. The basic point is that we can use probabilistic modeling to explore the shape of our data, which in turn constrains what kinds of strategies a human learner could successfully use. In this post, I'll discuss how Bayesian modeling allows us to explore the statistical structure of data and characterize what any (decent) probabilistic algorithm would try to do with that data.

It's easiest to understand the motivation behind Bayesian modeling by comparing it with maximum-likelihood modeling.

Bayesian models of cognition propose that some aspect of cognitive behavior can be explained in terms of a probability distribution over what the agent has seen (the observed data) and what the agent believes about the observed data (the hypotheses). Concretely, if you flip a coin three times and get two heads, a Bayesian model would propose you are maintaining two kinds of beliefs. First, you have a belief about whether the coin is fair or biased toward one outcome; this is the belief about the hypothesis. Second, you have a belief about how likely you are to get two heads in three flips under each possible kind of coin; this is the belief about the data.

A Maximum-Likelihood model, by contrast, proposes that we only need to pay attention to the probability over the observed data. Here's an example:

Flipping a coin three times doesn't give us very much evidence about the coin. A perfectly fair coin gives this result 37.5% of the time (in three flips), and a coin that's biased against heads 2:1 gives this result just over 22% of the time. Moreover, we have an odd number of flips, so it's not actually possible to obtain the same number of heads and tails. Intuitively, most people would continue to believe that the coin is probably fair after these three flips.
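
If you want to check those numbers, here's a minimal sketch in Python (the helper name prob_heads is just for illustration) that computes the probability of exactly two heads in three flips under each coin:

```python
from math import comb

def prob_heads(k, n, p):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(prob_heads(2, 3, 1/2))  # fair coin: 0.375
print(prob_heads(2, 3, 1/3))  # coin biased 2:1 against heads: ~0.222
```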

What does the maximum-likelihood estimate (MLE) about the coin say? Well, remember that it only cares about the probability of the observed data. Specifically, it proposes that people believe the hypothesis that makes the data most probable. If we call our belief about the coin \theta (which is traditional), and our observations, evidence, or data D (which is just the outcome of the three flips), the MLE for \theta is \theta^*:

\theta^* = \underset{\theta}{\operatorname{argmax}} P( D | \theta )

So what is the \theta that makes two heads and one tail most likely? Well, it's a coin that comes up heads \frac{2}{3} of the time. The maximum-likelihood model proposes that people commit immediately to a single hypothesis, ignoring all but the best explanation of the data. In this case, that's not very realistic; we have so little data that other explanations are also pretty good.
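
You can see this numerically with a quick grid search over candidate values of \theta; this is just an illustrative sketch, since the argmax here can also be found analytically:

```python
import numpy as np

thetas = np.linspace(0, 1, 1001)       # candidate coin biases
likelihood = thetas**2 * (1 - thetas)  # P(D | theta) for 2 heads, 1 tail,
                                       # dropping the constant comb(3, 2)
theta_star = thetas[np.argmax(likelihood)]
print(theta_star)                      # 0.667: heads about 2/3 of the time
```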

A Bayesian model maintains uncertainty about the coin. Let's take a brief digression to establish two basic facts in probability theory:


First, let's talk about conditional probabilities. If A is a variable indicating whether I'm using an umbrella (A can be yes or no), and B is a variable indicating whether it's raining (B can also be yes or no), then P(A|B=\mbox{yes}) is the probability that I'm using an umbrella, given that it's raining, i.e. if we only consider times when it's raining. We can get the joint probability that I'm using an umbrella and it's raining by taking the probability that I'm using an umbrella given that it's raining and multiplying it by the probability that it's raining:

 P( A, B ) = P(A | B)P(B)

P(A|B) assumes we know B, but P(A,B) maintains uncertainty over both A and B.

The other important thing to understand is marginalization. This is a fancy word for using addition in a particular context. We might know the probability that I'm using an umbrella and it is raining, P(A, B=\mbox{yes}), and the probability that I'm using an umbrella and it is not raining, P(A, B=\mbox{no}). If we only care about the probability that I'm using an umbrella (perhaps you want to borrow it, and don't know if it's raining), then we can get a probability distribution over A alone by adding up the two probabilities: P(A) = P(A, B=\mbox{yes}) + P(A, B=\mbox{no}). In general, as long as we include all possible values of B, the following equation holds:

 P( A ) = \sum_B P(A,B)

If B is a continuous variable, like height, or the bias of a coin, then we have to integrate:

 P( A ) = \int P(A,B) dB

Putting these equations together, we have the rule of total probability for discrete variables:

P(A) = \sum_B P(A|B)P(B)

and continuous variables:

P(A) = \int P(A|B)P(B) dB
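
Here's a tiny numerical check of the discrete version, with made-up numbers for the rain and umbrella probabilities:

```python
# Made-up numbers: the chance of rain, and the chance I use an umbrella
# under each weather condition.
p_b = {"yes": 0.3, "no": 0.7}           # P(B): is it raining?
p_a_given_b = {"yes": 0.9, "no": 0.05}  # P(A = yes | B): umbrella use

# Rule of total probability: P(A) = sum over B of P(A | B) P(B)
p_umbrella = sum(p_a_given_b[b] * p_b[b] for b in ("yes", "no"))
print(p_umbrella)                       # 0.9*0.3 + 0.05*0.7 = 0.305
```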


Now, back to coin flipping. Remember that I said a Bayesian model maintains a probability distribution over both data D and hypotheses \theta. That just means we care about the joint distribution over these variables:

P( D, \theta ) = P(D | \theta)P(\theta)

The first term on the right-hand side is the same thing that the MLE approach above maximized, and is called the likelihood of the data (or evidence) under a particular hypothesis \theta (which is why the MLE approach is called "maximum-likelihood"). The second term is called the prior, and reflects a prior distribution over all hypotheses \theta. Normalizing the joint distribution by P(D) gives the posterior over hypotheses, P(\theta|D) = P(D|\theta)P(\theta)/P(D), which we'll need in a moment. Now, after seeing two heads in three flips, we've already pointed out that P(D|\theta) is pretty large for a wide range of \theta. Since the likelihood doesn't change much across possible values of \theta, the prior over \theta has a strong effect on the joint distribution. In our daily lives, the coins we encounter (when we bother to flip them) are pretty fair, so our prior P(\theta) will give much higher probability to the hypothesis that the coin is fair or nearly so.
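
Here's a sketch of that interaction, discretizing \theta on a grid and using a made-up bell-shaped prior peaked at a fair coin (any prior concentrated near \theta = 0.5 would make the same point):

```python
import numpy as np

thetas = np.linspace(0, 1, 1001)
likelihood = thetas**2 * (1 - thetas)  # P(D | theta): 2 heads, 1 tail

# Made-up prior: most of its mass near theta = 0.5 ("coins are usually fair").
prior = np.exp(-((thetas - 0.5) ** 2) / (2 * 0.05**2))
prior /= prior.sum()

posterior = likelihood * prior         # joint, then normalize by P(D)
posterior /= posterior.sum()
print(thetas[np.argmax(posterior)])    # ~0.505: barely budged from fair
```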

Accordingly, the MLE approach incorrectly predicts that people will feel a coin is strongly biased after seeing two heads in three flips, while the Bayesian approach both makes the correct prediction and provides an intuitively appealing separation between prior biases and evidence.

Now let's dig a little deeper. I have previously talked about David Marr's levels of analysis. What level of analysis do these two approaches target? We can probe people's beliefs about the coin and the data by asking what would happen if the coin were flipped again. So now we have D_{old}, which is our old data (the three flips we already saw), and D_{new}, which is the new data we want to predict: we're after P(D_{new}|D_{old}). Under the MLE approach, we assume people see three flips, fix their estimate \theta^*, and then make a guess about D_{new} purely on the basis of \theta^*:

\begin{aligned}\theta^* & = \underset{\theta}{\operatorname{argmax}} P( D_{old} | \theta ) \\ P( D_{new} | D_{old} ) & \approx P( D_{new} | \theta^* )\end{aligned}

That is, the MLE approach bases its guess about what happens next not on what happened before directly, but only on the best explanation of what happened before. Under the Bayesian approach, however, since we maintain uncertainty about \theta, we can model P( D_{new} | D_{old} ) exactly:

\begin{aligned}P( D_{new} | D_{old} ) & = \int P( D_{new}, \theta | D_{old} ) d\theta \\ & = \int P( D_{new} | \theta, D_{old} )P( \theta | D_{old} ) d\theta \end{aligned}

The first line is just the marginalization we talked about earlier, and the second line results from the definition of conditional probability discussed earlier. If we assume that the likelihood function doesn't change over time (and we have to make some assumption about how it does or doesn't change over time), we have P( D_{new} | \theta, D_{old} ) = P( D_{new} | \theta ), producing:

P( D_{new} | D_{old} ) =\int P( D_{new} | \theta )P( \theta | D_{old} ) d\theta

So the Bayesian approach bases its guess about what happens next on what happened before, averaging each hypothesis's prediction P( D_{new} | \theta ) weighted by that hypothesis's posterior probability P( \theta | D_{old} ).
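
To make the contrast concrete, here's a sketch that computes both predictions for a single new flip coming up heads, approximating the integral with a sum over the same grid and reusing the made-up prior from the sketch above:

```python
import numpy as np

thetas = np.linspace(0, 1, 1001)
likelihood_old = thetas**2 * (1 - thetas)  # P(D_old | theta)
prior = np.exp(-((thetas - 0.5) ** 2) / (2 * 0.05**2))
prior /= prior.sum()

posterior = likelihood_old * prior         # P(theta | D_old), normalized
posterior /= posterior.sum()

# Bayesian: average P(heads | theta) = theta over the posterior.
p_heads_bayes = np.sum(thetas * posterior)

# MLE plug-in: commit to theta* and ignore every other hypothesis.
theta_star = thetas[np.argmax(likelihood_old)]
p_heads_mle = theta_star

print(p_heads_bayes)  # ~0.505: still close to a fair coin
print(p_heads_mle)    # ~0.667: a strongly biased coin
```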

The contrast is important when understanding these kinds of models in the context of Marr's levels of analysis. The Bayesian approach characterizes P( D_{new} | D_{old} ) in terms of a likelihood function P( D | \theta ). As long as the likelihood function expresses the dependencies within D well, and P(\theta) represents prior knowledge well, the Bayesian approach characterizes what any probabilistic algorithm will be trying to compute. Accordingly, the Bayesian approach is squarely targeted at the computational level of analysis, and brings in a minimum of assumptions about the underlying process.

The MLE approach, however, approximates P( D_{new} | D_{old} ) with a single "best" hypothesis about the data, \theta^*, an approximation that is known to break down in the face of small data. Of course, it's possible that people actually adopt this strategy in some contexts. However, people still behave sensibly in the face of small data (by not changing their belief about the coin very much after only three flips, for example). Additionally, the MLE approach assumes that people actually adopt this strategy, and, in doing so, targets the algorithmic level of analysis at least to some extent.

So, while the Bayesian approach appears more complicated (omg integrals), it is actually less complex from a theoretical viewpoint because it imposes fewer assumptions about what people do. By abstaining from such assumptions, we can explore how different kinds of likelihood functions P(D | \theta) and prior biases P( \theta ) behave with the data that humans see. When we find a likelihood function and prior bias that behave similarly to humans, we have evidence that whatever algorithm humans use is taking advantage of the statistical structure assumed by that likelihood function and prior bias. For the coin example, the likelihood function is very simple: each flip is independent, and comes up heads with probability \theta. Figuring out the details of the prior is a little trickier, but its basic shape (lots of probability for a fair coin, little probability for a biased coin) is fairly clear. How do people store, represent, and access these probabilities? Well, that's a different question, and one a computational-level model does not try to answer.

Thanks,
