Bayesian inference primer: an introduction

When I first came across the concept of Bayesian inference, I was interviewing for a job with Pinterest. The interviewer was familiar with recommendation systems and showed me how you can learn about a system by working with distributions of scores rather than simple point estimates. My surprise mini-lecture only took 45 minutes, and in retrospect it was kind of a weird interview, but it left an impression on me, even though I would never touch recommendation systems at the company.

That interviewer, by the way, had previously worked at Amazon, and it wouldn’t be until I went to work there too that I learned about Bayesian inference and how you can use it to do what would otherwise appear to be magic. The concepts are really quite beautiful, and as a former physicist, they captured my imagination. I couldn’t help but see the commonalities between probability distributions and quantum wave functions, between modes and quantum resonances. Of course, a lot of modern machine learning borrows ideas from statistical physics, but it was the way the same applied mathematical tricks show up in quantum mechanics that grabbed me.

So I’m writing up this little review of Bayesian inference because my exploration of the field was fraught with wrong turns and there just isn’t a lot of good pedagogical material out there. Practitioners play fast and loose with the nomenclature, probably because, like me, they came into the field from an applied perspective. But when you get a grasp of the concepts and how they work in the real world, it truly opens your eyes, and it opens up powerful new applications in the technology sector to make the most of our interactions with each other.

It still amazes me how many of the best ideas and algorithms in computer science came from Los Alamos National Laboratory in the 1950s, when electronic computers first came into their own. It also still amazes me that despite seventy years of continued research and a deep, deep desire to improve on the state of things… not much has changed. Variations of the core algorithms keep getting invented, published, and distributed, but they are often so unreliable that no one really uses them. As a Bayesian researcher at Harvard told me, most people in the field still end up writing their own software routines by hand, based on the same old algorithms from the 50s, and generally ignore the latest software being released.

Let’s begin by reviewing Bayes’ Theorem, which, along with the Bellman Equation, is one of the most important equations in all of machine learning. Why is it so important? Because if we want to understand how our beliefs about a model change when we observe data, we have to do this bit of mental gymnastics to get the right answer. Also, it relates easy-to-compute quantities that come from how we define a model to difficult-to-compute quantities about what that model actually looks like after we observe some data. It also doesn’t hurt that Bayes’ Theorem is beautiful, elegant, and clearly elucidates the biggest challenges in the field.

In short, Bayes’ Theorem can be stated as “the posterior is equal to the point-wise product of the prior and the likelihood, divided by the evidence”, or more succinctly as

posterior = (prior × likelihood) / evidence

Or mathematically as...

p(θ|X) = p(θ) p(X|θ) / p(X), where p(X) = ∫ p(θ′) p(X|θ′) dθ′

Let’s write this out briefly in English. The symbol θ refers to the parameters of the model, and the probability function of θ refers to our beliefs about those parameters — that is, their distribution. For a simple linear regression y = m*x + b, θ is a two-element vector that concatenates m and b. Unlike most applications people see with linear regression, however, we are focused not on point estimates, or single numerical values, but on probability distributions over these numbers. The most common distributions we encounter are Gaussian bell curves, and if the distribution covers multiple scalars (in our case, m and b), then we often deal with multivariate Gaussians — that is, Gaussians in multiple dimensions.
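
To make that concrete, here is a minimal sketch of what a distribution over θ looks like in practice. Everything here is illustrative: the zero means, unit variances, and random seed are made-up choices, not anything canonical.

```python
import numpy as np

# Illustrative prior over theta = (m, b) for y = m*x + b: a 2-D Gaussian.
# The means and covariance below are made-up values for this sketch.
prior_mean = np.array([0.0, 0.0])
prior_cov = np.array([[1.0, 0.0],
                      [0.0, 1.0]])   # independent, unit-variance beliefs about m and b

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(prior_mean, prior_cov, size=5)

# Each sample is one plausible line under our beliefs, not a single "best" line.
x = np.linspace(0.0, 1.0, 3)
for m, b in samples:
    print(f"m = {m:+.2f}, b = {b:+.2f}, y(x) = {m * x + b}")
```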

The symbol X refers to our data. Thus, the posterior, or p(θ|X), represents our beliefs about the model parameters given the data we have observed. The prior, p(θ), represents our beliefs about those parameters before we observed the data X. The likelihood, p(X|θ), represents how likely we would be to observe the data X given a particular setting of the model parameters θ. And lastly, the evidence or marginal likelihood, p(X), is a normalization constant so that the posterior integrates to one. The above equation writes out the evidence explicitly as an integral over all possible values of θ. The prior and posterior are probability distributions, the likelihood is a function of θ (more on that below), and the evidence is just a simple scalar.
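
For a toy problem small enough to brute-force, every term in the equation can be computed directly on a grid. The following sketch infers the unknown mean of a Gaussian with known unit variance; the data values and grid bounds are invented for illustration.

```python
import numpy as np

# Toy 1-D problem: infer theta, the mean of a unit-variance Gaussian,
# from three (invented) observations.
X = np.array([1.2, 0.8, 1.5])
theta = np.linspace(-3.0, 5.0, 2001)          # grid over the parameter space

prior = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)   # p(theta): standard normal

# p(X|theta): product of Gaussian densities, one factor per data point.
likelihood = np.prod(
    np.exp(-0.5 * (X[None, :] - theta[:, None]) ** 2) / np.sqrt(2 * np.pi),
    axis=1,
)

unnormalized = prior * likelihood             # numerator of Bayes' theorem
dtheta = theta[1] - theta[0]
evidence = unnormalized.sum() * dtheta        # the integral over all theta (Riemann sum)
posterior = unnormalized / evidence           # now integrates to one

print(evidence)                               # p(X), a simple scalar
print(posterior.sum() * dtheta)               # ~1.0, as a posterior should be
```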

The word “likelihood” may be unfamiliar to you in this context, and it is the biggest mental leap you have to take when learning Bayes’ theorem. While both the prior and posterior are functions of the model parameters θ alone, the likelihood is a function of both the model parameters and the data. Of course, we only observe the data that we have observed, so we generally fix the data in the likelihood’s functional form, leaving a function of θ alone. But that’s where it starts, and where the problems begin.
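
In code, this mental leap is just a function of two arguments that we later curry down to one. A sketch, reusing the toy Gaussian model from above (the observed values are again invented):

```python
from functools import partial

import numpy as np

def likelihood(theta, X):
    """p(X | theta): starts life as a function of BOTH the parameters and the data."""
    return np.prod(np.exp(-0.5 * (X - theta) ** 2) / np.sqrt(2 * np.pi))

# Once the data are observed, we fix them, leaving a function of theta alone.
X_observed = np.array([1.2, 0.8, 1.5])
likelihood_of_theta = partial(likelihood, X=X_observed)

print(likelihood_of_theta(1.0))   # evaluate at a single proposed theta
```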

The likelihood function we choose completely defines our model. For example, in a linear regression, we say that the likelihood is defined as a decreasing function of the sum of squared residuals, i.e., the squares of the vertical distances between the data points and the line estimate. This is where the term “least squares” comes from, since a linear regression attempts to maximize the likelihood by minimizing the squared residuals.
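
You can verify that equivalence numerically. In this sketch, minimizing the sum of squared residuals (ordinary least squares, via np.polyfit) and maximizing a Gaussian likelihood (here, via a coarse and deliberately naive grid search) land on the same line; the synthetic data and grid bounds are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from a known line with Gaussian noise (all values invented).
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=x.size)

def neg_log_likelihood(theta):
    m, b = theta
    residuals = y - (m * x + b)
    # Under Gaussian noise, the negative log-likelihood is, up to constants,
    # proportional to the sum of squared residuals.
    return 0.5 * np.sum(residuals**2)

# Minimizing squared residuals: ordinary least squares.
m_ols, b_ols = np.polyfit(x, y, deg=1)

# Maximizing the likelihood: here a coarse, deliberately naive grid search.
ms = np.linspace(0.0, 4.0, 401)
bs = np.linspace(-1.0, 3.0, 401)
grid = np.array([[neg_log_likelihood((m, b)) for b in bs] for m in ms])
i, j = np.unravel_index(grid.argmin(), grid.shape)

print(m_ols, b_ols)   # least-squares fit
print(ms[i], bs[j])   # maximum-likelihood fit: the same line, up to grid resolution
```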

The last piece, which I glossed over earlier, is the evidence, and it took me a while to figure out why it is the biggest bugaboo in all of Bayesian inference. I mean, it’s just a normalization constant, and those are easy to compute. Can’t you just ignore it?

Here’s the rub. If we want to know the scalar value of the posterior at a proposed point θ, we have to perform that entire integral. Relative values can be computed without performing the integral, but if we focus all our attention on a small range of θ, there is no way to tell whether that range is where most of the posterior’s probability actually lies.

If you think of the posterior as a heatmap or terrain where the mountain tops represent the most-likely values of θ and the valleys represent the least-likely ones, it is insufficient just to get proportional values. There might be a huge Mount Everest outside the area we are looking at, which means that we are actually in the foothills and don’t know it. Until we compute the integral, we have no idea whether that’s the case.

Oh, and it gets worse. While the above integral allows us to compute the evidence from the likelihood function, that function depends on the data as well as the model parameters. Outside of simple problems like linear regression, it could take any form, because the data points could interact with the model specification in all sorts of ways we can’t predict. What if there’s a Mount Everest that is infinitely narrow? There would be no way to find it numerically. And that happens a lot, especially when modeling physical systems defined by differential equations. In other words, outside of a few solved problems, there is no way to know.

The other issue is that for linear regression, I can express an analytical function that defines the likelihood, but for other problems, like physical systems defined by differential equations, we compute the likelihood by running a simulation. For example, I may be given a set of model parameters and initial conditions, run the simulation to predict an outcome, and then compare that prediction to the real-world results. In this case, the likelihood is again a decreasing function of the residuals (the difference between my prediction and the real-world result), just like with the linear regression, but to compute the residual, I had to run the simulation first, which may be computationally expensive. Now imagine doing that for all possible parameter values to compute the integral, and you can see why the problem becomes intractable.
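
Here is a sketch of what a simulation-based likelihood looks like, using exponential decay (dy/dt = -k*y) as a stand-in for a more expensive physical model; the observation times, noise level, and decay rate are all invented for the example.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Invented "real-world" measurements of an exponential decay y' = -k*y.
rng = np.random.default_rng(2)
t_obs = np.linspace(0.0, 5.0, 20)
y_obs = np.exp(-0.7 * t_obs) + 0.02 * rng.normal(size=t_obs.size)

def log_likelihood(k, noise_sigma=0.02):
    # Every likelihood evaluation requires running the simulation...
    sol = solve_ivp(lambda t, y: -k * y, t_span=(0.0, 5.0), y0=[1.0], t_eval=t_obs)
    residuals = y_obs - sol.y[0]
    # ...and, as with linear regression, it is a decreasing function of the
    # squared residuals.
    return -0.5 * np.sum(residuals**2) / noise_sigma**2

# One simulation per proposed k; computing the evidence means doing this for
# every point in parameter space, which is where the cost explodes.
for k in (0.5, 0.7, 0.9):
    print(k, log_likelihood(k))
```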

We will cover the state-of-the-art approaches to solving this problem in an upcoming article, but like I said at the beginning, they haven’t meaningfully evolved in seventy years, which I find pretty fascinating. What’s particularly interesting is that the most successful advance in the subject, called variational inference, is successful because it’s kind of bad but very cheap, allowing folks to use it on really thorny problems like deep learning, which can involve many thousands of model parameters. It’s not accurate, but it’s better than nothing, and with other safeguards in place, it can be very effective. Stay tuned to learn more.

Since the bulk of the work in Bayesian inference comes down to performing an integral over the entirety of the parameter space defining θ, a large number of algorithms have been devised to solve this problem. It’s not as simple as putting the integral into Mathematica, since the output of a model may come from a numerical simulation! And direct numerical integration comes with many caveats, specifically in the high-dimensional problems you encounter in Bayesian inference, as the quick sketch below shows. So let’s go through the primary methods of performing the integral, building our intuition as we go. Then, we will cover a variety of model types that have Bayesian inference baked right into them. Finally, we will cover important notes on the types of uncertainty we encounter in applications, and how Bayesian inference fits into them.
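
To see why direct numerical integration breaks down, count the likelihood evaluations a brute-force grid would need; even a coarse 50-point grid per parameter is hopeless beyond a handful of dimensions.

```python
# Grid integration scales exponentially with the number of parameters.
points_per_dim = 50
for dims in (1, 2, 5, 10, 20):
    print(f"{dims:>2} dimensions: {points_per_dim**dims:.3e} likelihood evaluations")
```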

Our outline is as follows:

1. The primary methods for performing the integral over the parameter space
2. Model types with Bayesian inference baked right into them
3. The types of uncertainty we encounter in applications, and how Bayesian inference fits into them

A note on nomenclature: There is a terminological distinction between the likelihood function, which we focus on considerably in Bayesian inference, and the a posteriori, that is, the posterior (because we always need a new pretentious way to say the same things, and hey, why not Latin). In classical machine learning models, we often compute the MLE, the maximum likelihood estimate. This is the point in our parameter space where the likelihood is maximized, and when you obtain a point estimate of the parameters, as you do, for example, in stochastic gradient descent, this is what you are computing. On the other hand, there’s the MAP, the maximum a posteriori, that is, the point in parameter space where the posterior is maximized. The distinction: the MAP is the maximum point of the pointwise product of the likelihood AND the prior, not just the likelihood.
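
The difference is easy to see in a sketch. With a zero-mean Gaussian prior on θ, the log-prior adds an L2 penalty to the least-squares objective, so the MAP estimate is exactly ridge regression; the data, noise level, and prior scale lam below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0 + 0.3 * rng.normal(size=x.size)
A = np.column_stack([x, np.ones_like(x)])     # design matrix for theta = (m, b)

# MLE: maximize the likelihood alone, i.e., ordinary least squares.
theta_mle, *_ = np.linalg.lstsq(A, y, rcond=None)

# MAP: maximize likelihood * prior. A zero-mean Gaussian prior on theta adds
# an L2 penalty to the objective, which is ridge regression. The strength lam
# encodes how tight the (assumed) prior is.
lam = 1.0
theta_map = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ y)

print(theta_mle)   # fits the data alone
print(theta_map)   # pulled toward the prior mean at zero
```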

Douglas Mason