# More on probabilities and Bayesian modeling

Hey all, I realized I promised pictures but the last post had zero (0) pictures. So in this post, I'll sort of recap the previous post but put in some pictures.

In the previous post, we considered a situation where we see a coin flipped three times and get two heads. I tried to build the intuition that this result is fairly consistent both with coins that are fair and with coins that are biased for or against heads. It's easiest to see this by plotting the likelihood $P( \{H, H, T\} \mid \theta)$. If we assume that the coin flips are independent (that is, the outcome of one flip does not affect the next flip), then for the sequence H, H, T we have:

$$P( \text{H}, \text{H}, \text{T} \mid \theta ) = \theta \cdot \theta \cdot (1 - \theta) = \theta^2 (1 - \theta)$$

This is the probability of one particular sequence of flips. But I didn't tell you the order of the flips; maybe the one tail came first, maybe it was the second flip, or maybe the third. So to get the probability of two heads and one tail, ignoring order, we have to add up all three possibilities:

$$P( \{H, H, T\} \mid \theta ) = 3\,\theta^2 (1 - \theta)$$
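To make this concrete, here's a small Python sketch (not from the original post; names are mine) that evaluates the order-ignoring likelihood $3\,\theta^2(1-\theta)$ on a grid and locates its peak:

```python
import numpy as np

def likelihood_2h_1t(theta):
    # Probability of exactly two heads and one tail in three independent
    # flips: theta^2 * (1 - theta) for each of the 3 possible orderings.
    return 3 * theta**2 * (1 - theta)

# Evaluate on a fine grid of theta values and find the peak.
thetas = np.linspace(0, 1, 1001)
values = likelihood_2h_1t(thetas)
peak = thetas[np.argmax(values)]
print(peak)  # close to 2/3
```

The peak lands at $\theta = 2/3$, the observed fraction of heads.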

We can see this likelihood function in the next plot, where the horizontal axis is the probability of heads $\theta$ and the vertical axis is the likelihood of two heads and one tail for each value of $\theta$:

We can see that the peak of the likelihood is at $\theta=\frac{2}{3}$ (and we can see that I made a mistake in the previous post when I said this peak was probability 0.375; it's actually $3 \cdot (2/3)^2 \cdot (1/3) = 4/9 \approx 0.444$... *sigh*), but the likelihood at other values of $\theta$ is pretty high as well. This is because two heads in three flips is pretty likely for a coin of almost any bias. Let's see what happens if we take thirty flips and get twenty heads: the same ratio, but more total flips:

We can see that the peak is still at $2/3$, but a much tighter range of possible values of $\theta$ gets appreciable likelihood. For both of these sets of observations, though, the maximum-likelihood estimate (MLE) is the same: $\theta^* = \underset{\theta}{\operatorname{argmax}} P( D | \theta ) = \frac{2}{3}$.
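We can check that both datasets give the same MLE with a quick grid search over the binomial likelihood (a sketch I'm adding here, not code from the original post):

```python
from math import comb

import numpy as np

def binomial_likelihood(theta, heads, flips):
    # Probability of `heads` heads in `flips` independent flips, any order.
    return comb(flips, heads) * theta**heads * (1 - theta)**(flips - heads)

# Grid search for the maximum-likelihood estimate of theta.
thetas = np.linspace(0, 1, 10001)
mle_3  = thetas[np.argmax(binomial_likelihood(thetas, 2, 3))]
mle_30 = thetas[np.argmax(binomial_likelihood(thetas, 20, 30))]
# Both peak at theta = 2/3 (up to grid resolution); only the
# width of the peak around the MLE differs between the two datasets.
```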

In the previous post, I mentioned that, in a Bayesian approach, we can model our joint distribution over data $D$ and parameters $\theta$ in terms of a likelihood $P( D | \theta)$ and a prior $P(\theta)$: $P( D, \theta ) = P( D | \theta ) P( \theta )$. Let's have a look at what a sensible prior might look like. Remember, for a prior expectation of a fair coin, we want a prior that gives high probability to $\theta$ near $0.5$. Here's one possibility:

This prior concentrates probability mass on values of $\theta$ near 0.5. Technically, it is a Beta prior with both hyperparameters set to 10, but the details aren't important. Also, since $\theta$ is a continuous variable, plots of the probability of $\theta$ are really probability *densities*, and we can only get probabilities that $\theta$ is in a certain range by integrating the density over that range. Because of this, the plotted "probability of $\theta$" will sometimes go above 1 (this can happen any time $\theta$ is not on the right side of the vertical bar, that is, whenever we're looking at a density over $\theta$ rather than conditioning on it).
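Here's a small sketch (mine, not from the post) that makes the above-1 point concrete, assuming the prior really is the symmetric Beta(10, 10) mentioned above:

```python
from math import comb

def beta_pdf(x, a, b):
    # Beta(a, b) density for integer a, b. The normalizing constant is
    # 1/B(a, b) = (a + b - 1)! / ((a - 1)! * (b - 1)!).
    norm = comb(a + b - 2, a - 1) * (a + b - 1)
    return norm * x**(a - 1) * (1 - x)**(b - 1)

print(beta_pdf(0.5, 10, 10))  # about 3.52: a density can exceed 1
```

The density at $\theta = 0.5$ is about 3.52, which is perfectly fine for a density: what must not exceed 1 is the *integral* over a range of $\theta$, not the curve itself.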

Now let's look at the distribution over $\theta$ and our observations $P( \theta , \{H,H,T\} )$:

After three flips, the peak over $\theta$ of the joint distribution is at only 0.52! The three flips have barely changed our opinion of the fairness of the coin. Now let's see the joint distribution after thirty flips, using the second likelihood function from above:

We can see in this plot that the peak of the joint distribution, with respect to $\theta$, is at 0.6 after thirty flips, and almost all of the probability mass is to the right of a perfectly fair coin ($\theta=0.5$). So, as we gather more evidence, our opinion of the fairness of the coin depends less and less on the prior distribution.
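These two peaks can be checked in closed form, assuming the Beta(10, 10) prior from above (the sketch below is mine, not from the post). Since $P(\theta, D) \propto P(\theta \mid D)$, the joint-distribution peak over $\theta$ is the posterior mode, and a Beta prior plus a binomial likelihood gives another Beta distribution:

```python
def posterior_mode(a, b, heads, tails):
    # With a Beta(a, b) prior and a binomial likelihood, the posterior
    # over theta is Beta(a + heads, b + tails), whose mode (peak) is:
    return (a + heads - 1) / (a + b + heads + tails - 2)

print(posterior_mode(10, 10, 2, 1))    # 11/21, about 0.52
print(posterior_mode(10, 10, 20, 10))  # 29/48, about 0.60
```

These match the peaks of 0.52 and 0.6 read off the plots, and the formula makes the pull of the prior explicit: the hyperparameters act like 18 extra "pseudo-flips" that the real data must outweigh.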

In fact, in the limit of infinite evidence, the Bayesian distribution over $\theta$ becomes an infinitely thin spike at the maximum-likelihood estimate. Of course, we never have infinite evidence, and cognitive tasks involve events that are individually rare. This is relevant in two different ways. First, some researchers believe humans use strategies that compute Bayesian posteriors. For example, in a daily conversation (or blog post), you'll encounter sentences that you've never seen before; in fact, most of the sub-parts of those sentences will be rare as well. The proposal that humans use (approximations to) Bayesian strategies provides an intuitive and mathematically satisfying explanation of why people are able to deal with all these rare events.

The second way that Bayesian reasoning is relevant has to do with the intention of my previous post. Probabilistic computational-level models essentially come down to proposing that "this likelihood function $P(D|\theta)$ together with this prior distribution $P(\theta)$ recovers cognitively-relevant structure in the data." Maximum-likelihood estimates do not test the likelihood function or prior distribution directly; rather, they test the likelihood function subject to the requirement that every single one of its parameters is well supported by the data. Since we know data in cognitive domains is often sparse, this is a harmful requirement that interferes with our ability to assess the likelihood function per se. Bayesian approaches allow us to test the extent to which likelihood functions of interest recover cognitively-relevant structure directly.

Ok, I've just talked a lot about "likelihood functions of interest," but all we've talked about so far are coin flips. Coin flipping is pretty boring, I know. I promise to talk about more interesting likelihood functions soon.

Thanks,