Bayesian inference primer: random walk

Previously, we discussed how we could directly sample the likelihood and perform numerical integration to compute the evidence in Bayesian inference, and why that generally doesn’t work. Wouldn’t it be great if we could sample the likelihood in some fashion that represented its density, like someone walking along a mountain range and hopping from peak to peak? In fact, exactly that intuition is the basis of one of the most important approaches to Bayesian inference ever devised.

First invented in 1953 at Los Alamos National Laboratory and called the Metropolis algorithm (named after its inventor, not the classic film), the Markov Chain Monte Carlo (MCMC) class of algorithms is the gold standard for Bayesian inference, with many popular spin-offs like Gibbs sampling, Hamiltonian Monte Carlo, and Metropolis-Hastings (see https://arxiv.org/pdf/0808.2902.pdf for more).

Like so much of this field, the fancy names conceal some ingenious and simple intuitions. For example, if we translate “Markov Chain Monte Carlo” into English, it comes out as “a random walk”, and if you want to be really fancy, we could call it “a random walk without memory” (the memorylessness is the Markov part). Why the complicated names? Because acronyms are more fun!

Just as the name implies, this class of algorithms performs a random walk which, over time, traces out the density of the likelihood function. If we treat the steps of the random walk as weighted samples of the likelihood function, we can approximate the integral we need. Of course, this means we are adding many new hyperparameters (made concrete in the sketch after this list), such as:

  • how many steps do we take at first, before the walk converges to the target density (the burn-in)?

  • how many samples do we need to accurately describe the likelihood function?

  • how big should we make each step?

  • ...and many more!
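
To make those knobs concrete, here is a minimal sketch of a Metropolis-style random walk. The function names, the step sizes, and the toy Gaussian target are illustrative assumptions, not part of any library:

import numpy as np

def log_likelihood(theta):
    # toy target: a standard 1-D Gaussian (swap in your own)
    return -0.5 * theta ** 2

def metropolis_walk(n_steps, step_size, n_burn_in, start=0.0):
    theta = start
    samples = []
    for step in range(n_steps):
        # how big should we make each step? -> step_size
        proposal = theta + step_size * np.random.randn()
        # accept the move with probability min(1, L(proposal) / L(current))
        if np.log(np.random.rand()) < log_likelihood(proposal) - log_likelihood(theta):
            theta = proposal
        # how many steps until we trust the walk? -> n_burn_in
        if step >= n_burn_in:
            samples.append(theta)
    return np.array(samples)

# how many samples do we need? -> n_steps - n_burn_in of them
samples = metropolis_walk(n_steps=10_000, step_size=0.5, n_burn_in=1_000)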

Unlike regular or random sampling, MCMC can get stuck in a region of high probability for a long time, then suddenly jump over to another one. And there aren’t many reliable metrics to tell us whether to watch out for that, or whether we have converged to the density yet, since our first steps will likely be off the mark.
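
One rough diagnostic that does exist is the Gelman-Rubin statistic: run several independent chains, then compare the variance between their means to the average variance within each chain. Values near 1.0 suggest (but do not prove) convergence. Here is a minimal sketch, assuming the array chains holds one parameter’s samples from several chains:

import numpy as np

def gelman_rubin(chains):
    # chains: array of shape (n_chains, n_steps) for a single parameter
    n_chains, n_steps = chains.shape
    chain_means = chains.mean(axis=1)
    between = n_steps * chain_means.var(ddof=1)  # variance between chain means
    within = chains.var(axis=1, ddof=1).mean()   # average within-chain variance
    # pooled estimate of the posterior variance
    pooled = (n_steps - 1) / n_steps * within + between / n_steps
    return np.sqrt(pooled / within)              # ~1.0 suggests convergence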

There are really a psychotic number of perturbations, variations, and approaches folks have used to answer these questions. For example, what if we use gradient ascent to find a bunch of local maxima, then start a bunch of random walks from those peaks (sketched below)? Could information be passed from one walker to another, to help parallelize the calculation? Many careers have been made from these questions.
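
As a hedged sketch of that first idea (the optimizer choice, the toy Gaussian target, and all the numbers below are assumptions for illustration): climb to the peaks first, then scatter walkers around the best one.

import numpy as np
from scipy.optimize import minimize

def log_likelihood(theta):
    # toy target: a standard 2-D Gaussian (swap in your own)
    return -0.5 * np.sum(theta ** 2)

# find local maxima by minimizing the negative log-likelihood,
# restarting from several random points in case there are multiple peaks
starts = np.random.randn(5, 2)
peaks = [minimize(lambda t: -log_likelihood(t), s).x for s in starts]

# seed a batch of walkers with a little random scatter around the best peak
best_peak = max(peaks, key=log_likelihood)
walkers = best_peak + 1e-3 * np.random.randn(32, 2)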

And not just careers! The internet is flooded with many, many software libraries that amount to a set of methods for encoding your problem into a proprietary format, usually paired with one algorithm variant for performing the calculation. Which one should you use? No one knows, and there isn’t much in the way of peer review to confirm which libraries are best.

For the Python practitioner, of course, all you care about is which libraries are best for the job. The most common choices are:

  • PyStan

    • This adapts Stan, a C++-based library with its own proprietary programming language, to Python, and it’s a horror to install. However, this is the most respectable library of the bunch.

  • PyMC3

    • Probably the most common package you’ll come across, and the easiest to work with. It uses a variant of MCMC called NUTS (the No-U-Turn Sampler, a self-tuning form of Hamiltonian MCMC), which its authors promise is really good.

  • Pyro

    • Uber thought this approach was promising enough to build this library in-house at its AI Labs! However, the team was mostly attracted to its other approaches to Bayesian inference, which we cover in the next section.

  • Emcee

    • I like this library because it doesn’t require mastering a proprietary language to convert your problem into something it can understand; you just provide it with a likelihood function. However, I also didn’t find it to be very stable, and it was very slow.

  • TensorFlow Probability

    • This library may be overkill for your applications, especially if you don’t need to use neural networks or GPUs, but it comes from the great minds at Google.

There’s a lot to choose from, and a lot to frustrate you! But in the end, you’ll still have to tune all those hyperparameters, the program will fail most of the time, and you will have to encode your problem into a thorny proprietary system. No fun! Which is why you almost never see MCMC in a production environment.

So what do folks who design self-driving cars and other exciting applications of Bayesian inference use? This brings us to the next section.

Code sample for random walk

(Note that you can code your own MCMC sampler, and an example can be found in the method MCMC here. However, the core algorithm itself is pretty simple and a terrific exercise for the reader.)

Here is a runnable version of the emcee example. The toy Gaussian likelihood and the specific walker and step counts are stand-ins; swap in your own problem:

import numpy as np
import emcee

# toy log-likelihood: a standard 2-D Gaussian (an illustrative stand-in)
def log_likelihood_function(theta):
    return -0.5 * np.sum(theta ** 2)

n_params = 2     # dimensionality of the parameter space
n_walkers = 32   # number of walkers in the ensemble
n_steps = 5000   # steps each walker takes

# start each walker near the origin with a little random scatter
initial_parameters = 1e-3 * np.random.randn(n_walkers, n_params)

sampler = emcee.EnsembleSampler(n_walkers,
                                n_params,
                                log_likelihood_function)
sampler.run_mcmc(initial_parameters, n_steps, progress=True)

print('Autocorrelation:')
print(sampler.get_autocorr_time())

# discard the first 50% of steps as burn-in, then flatten the walkers
samples = sampler.get_chain(discard=n_steps // 2, flat=True)

# rows are samples, so tell np.cov the variables are the columns
covariance_matrix = np.cov(samples, rowvar=False)

Outline of the Bayesian inference primer

Douglas Mason