Bayesian inference primer: Bayesian-native models

No overview of Bayesian inference methods would be complete without a review of models that come with Bayesian inference baked straight into their construction. In this article, we will review the following models, with bootstrapping covered in an earlier article: Bayesian linear regression, Gaussian processes, and Bayesian neural nets.

Bayesian Linear Regression

The Bayesian linear regression is a Bayesian scientist’s power tool. Because closed-form equations give the Gaussian distribution of all of its parameters, it is as incontrovertible as it is useful. Even better, many of the proofs guaranteeing efficiency in reinforcement learning are based on this tool, and its ease of interpretation makes it a go-to in the field.

The Bayesian linear regression is based on the standard linear regression we all know and love, but instead of just providing point estimates, it also gives us the variances of our model parameters, based on the assumption that they are all Gaussian-distributed. Even better, we can get the distribution parameters of the underlying variance of the residuals as well, so the model itself has no hyperparameters to tune at all. Goddess gracious, I’m just salivating at the thought.

At Koyote Science, LLC, we’ve been disappointed with the implementations of linear regression (Bayesian or not) available in Python, so we have built our own library for our use. It has the added benefit of working in a sequential fashion -- at no point do we have to hold the entire dataset in memory. To learn more, check out our completely free, open-source library on GitHub, complete with usage examples.

https://github.com/KoyoteScience/BayesianLinearRegressor 
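To make the closed-form machinery concrete, here is a minimal sketch of a conjugate Bayesian linear regression with a Normal-Inverse-Gamma prior (written for this primer; it is not the API of the library above). Because the posterior depends on the data only through a handful of sufficient statistics, the model can absorb one observation at a time and never needs the full dataset in memory.

    import numpy as np

    # Conjugate Bayesian linear regression:
    #   w | s2 ~ N(m0, s2 * V0),   s2 ~ InverseGamma(a0, b0)
    # Only sufficient statistics (XtX, Xty, yty, n) are stored.
    class SequentialBLR:
        def __init__(self, n_features, a0=1.0, b0=1.0):
            self.V0_inv = np.eye(n_features)     # prior precision of the weights
            self.m0 = np.zeros(n_features)       # prior mean of the weights
            self.a0, self.b0 = a0, b0            # prior on the residual variance
            self.XtX = np.zeros((n_features, n_features))
            self.Xty = np.zeros(n_features)
            self.yty = 0.0
            self.n = 0

        def update(self, x, y):
            """Absorb a single (x, y) observation."""
            x = np.asarray(x, dtype=float)
            self.XtX += np.outer(x, x)
            self.Xty += x * y
            self.yty += y * y
            self.n += 1

        def posterior(self):
            """Closed-form posterior: Gaussian weights, Inverse-Gamma residual variance."""
            Vn_inv = self.V0_inv + self.XtX
            Vn = np.linalg.inv(Vn_inv)
            mn = Vn @ (self.V0_inv @ self.m0 + self.Xty)
            an = self.a0 + 0.5 * self.n
            bn = self.b0 + 0.5 * (self.yty + self.m0 @ self.V0_inv @ self.m0
                                  - mn @ Vn_inv @ mn)
            return mn, Vn, an, bn

    # Usage: stream data in, then read off parameter means and variances.
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    blr = SequentialBLR(n_features=2)
    for _ in range(500):
        x = rng.normal(size=2)
        blr.update(x, x @ true_w + rng.normal(scale=0.5))
    mn, Vn, an, bn = blr.posterior()
    print("weight means:      ", mn)
    print("weight std devs:   ", np.sqrt(np.diag(Vn) * bn / (an - 1)))
    print("residual variance: ", bn / (an - 1))   # posterior mean of s2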

Gaussian Processes

Let’s get the bad stuff out of the way first: Gaussian processes require a moderate number of hyperparameters to tune, and they scale among the worst of the algorithms available. For N data points, a Gaussian process’s computational complexity scales as O(N^3). By comparison, online/sequential/incremental methods scale as O(N), and batch algorithms like linear regression and random forests scale no worse than roughly O(N^2). This immediately makes Gaussian processes a poor choice for big-data operations, and sparse Gaussian process methods, which attempt to approximate the dataset with a smaller core set of points that still captures its most important qualities, are difficult to work with and require fine-tuning in application.

The other problem is that a Gaussian process with no observation-noise term will produce zero uncertainty for any prediction at an input that exactly matches one of its training inputs. By comparison, a linear regression will still report uncertainty at the same inputs used to train it. This means that Gaussian processes are best used for situations like hyperparameter tuning, where we assume there is essentially no uncertainty in our data labels.

As for hyperparameters, a Gaussian process requires a covariance function, or kernel. This means we have to choose a distance function that describes our data, and that requires domain knowledge. For example, the simplest choice is the squared-exponential (Gaussian) kernel, which assumes a Gaussian correlation between data points and has only one hyperparameter: the correlation length scale. How do you choose it? Well, you have to guess based on what you know.
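A toy sketch using scikit-learn (one reasonable off-the-shelf implementation, with made-up data and a guessed length scale) makes both points concrete: the length scale is something we supply up front, and with no noise term the predictive uncertainty collapses to zero at a training input while growing far from the data.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    # Toy 1-D dataset
    rng = np.random.default_rng(0)
    X_train = rng.uniform(-3, 3, size=(20, 1))
    y_train = np.sin(X_train).ravel()

    # The squared-exponential (RBF) kernel has one hyperparameter, the length
    # scale: our guess at how far apart two inputs can be and still be
    # strongly correlated. optimizer=None keeps it fixed at our guess.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                  alpha=1e-10,      # essentially no label noise
                                  optimizer=None)
    gp.fit(X_train, y_train)

    # Uncertainty is ~0 at a training input and grows away from the data.
    X_test = np.array([[X_train[0, 0]], [10.0]])
    _, std = gp.predict(X_test, return_std=True)
    print("std at a training input:", std[0])   # essentially zero
    print("std far from the data:  ", std[1])   # reverts to the prior scale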

With the bad stuff out of the way, here is the good stuff: the Gaussian process lets you work entirely through relationships between training points, rather than through an intermediary (the coefficients and basis functions of a Bayesian linear regression). In fact, the Gaussian process is a Bayesian linear regression in the limit of an infinite set of basis functions, with certain modifications. And because the Gaussian process has predictable, closed-form behavior, it is an excellent choice in settings where efficiency is less important than exactness.
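To give a flavor of that connection (the standard weight-space view, sketched here rather than derived in full): a Bayesian linear regression models the function as a weighted sum of basis functions, f(x) = w^T φ(x), with a Gaussian prior w ~ N(0, Σ_p) on the weights, and the implied covariance between any two function values depends only on the inputs, which is exactly a kernel:

    k(x, x') = \operatorname{Cov}\!\left[ f(x),\, f(x') \right] = \phi(x)^{\top} \Sigma_p \, \phi(x')

In other words, every Bayesian linear regression is already a Gaussian process with this kernel, and letting the number of (suitably scaled) Gaussian basis functions go to infinity recovers the squared-exponential kernel mentioned above.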

At Koyote Science, LLC, we have written up an extensive review of how Gaussian processes relate to Bayesian linear regression. It is an invaluable tool for learning the intricacies of Bayesian inference, and it shows how modern methods are built on a solid foundation. To learn more, visit:

https://github.com/KoyoteScience/GP-BLR

Bayesian Neural Nets

Since neural networks are really an extension of linear regression, it should come as no surprise that we can also train a neural network to carry distributions over its parameters instead of the single point estimates arrived at by some gradient method. However, due to the large number of parameters involved, and the fact that there are no closed-form solutions, this problem is particularly challenging.

You can tackle the problem directly, or you can use a variety of tricks that have yet to be fully vetted but may help you get where you need to go. The most highly regarded such approach, called Bayes by Backprop, comes out of Google DeepMind and is based on principles similar to the variational autoencoder. However, there are no standard implementations yet, so you’ll have to work with what folks have posted on GitHub.
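If you do roll your own, the core idea fits in a short PyTorch sketch: keep a mean and a (softplus-transformed) standard deviation for every weight, sample the weights with the reparameterization trick on each forward pass, and minimize a data-fit term plus the KL divergence to the prior. This is an illustration under simplifying assumptions (a single linear layer, known observation noise), not a vetted implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BayesianLinear(nn.Module):
        """Mean-field Gaussian posterior over the weights of one linear layer."""
        def __init__(self, in_features, out_features, prior_sigma=1.0):
            super().__init__()
            self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
            self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
            self.b_mu = nn.Parameter(torch.zeros(out_features))
            self.b_rho = nn.Parameter(torch.full((out_features,), -3.0))
            self.prior_sigma = prior_sigma

        def forward(self, x):
            # Reparameterization trick: w = mu + sigma * eps, eps ~ N(0, 1)
            w_sigma = F.softplus(self.w_rho)
            b_sigma = F.softplus(self.b_rho)
            w = self.w_mu + w_sigma * torch.randn_like(w_sigma)
            b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
            # KL between the factorized Gaussian posterior and the N(0, prior_sigma^2) prior
            self.kl = self._kl(self.w_mu, w_sigma) + self._kl(self.b_mu, b_sigma)
            return F.linear(x, w, b)

        def _kl(self, mu, sigma):
            return (torch.log(self.prior_sigma / sigma)
                    + (sigma ** 2 + mu ** 2) / (2 * self.prior_sigma ** 2) - 0.5).sum()

    # Toy regression: y = 3x + noise, fit by minimizing the negative ELBO.
    torch.manual_seed(0)
    x = torch.linspace(-1, 1, 100).unsqueeze(1)
    y = 3 * x + 0.1 * torch.randn_like(x)
    layer = BayesianLinear(1, 1)
    opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
    noise_var = 0.1 ** 2                                  # assumed known noise level
    for step in range(2000):
        opt.zero_grad()
        pred = layer(x)
        loss = 0.5 * ((pred - y) ** 2).sum() / noise_var + layer.kl
        loss.backward()
        opt.step()

    # Each forward pass samples fresh weights, so repeated calls give
    # samples from the (approximate) predictive distribution.
    with torch.no_grad():
        samples = torch.stack([layer(torch.tensor([[0.5]])) for _ in range(200)])
    print("predictive mean/std at x=0.5:", samples.mean().item(), samples.std().item())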

Example tutorials on how to implement your own BNN through various methods:

  • The Keras library, when working with the TensorFlow Probability library underneath, has an excellent tutorial. Once you are working with TensorFlow Probability, you can use any of the methods we’ve discussed (Markov chain Monte Carlo, variational inference, etc.).

  • There is also a good write-up using PyTorch in this article.

  • Monte Carlo Dropout original paper.

  • Get parameter distributions through the Adam optimizer, according to this article.

  • Bayes by Backprop original paper, with an example implementation here.

  • This article covers how you can add a Bayesian linear regression to a non-Bayesian neural net (also called an embedding method).

  • Applying Ian Osband’s randomized prior functions to a bootstrap ensemble of neural nets recovers desirable out-of-domain uncertainty properties. This tutorial covers the major details.

As for Koyote Science, LLC, we prefer the final approach presented above, which applies a randomized prior to each member of a bootstrap ensemble of neural nets. Why? Many of the other approaches listed above provide distributions, but those distributions don’t possess the major qualities we would like: (1) epistemic uncertainty should shrink in domains with more data, (2) it should grow to some pre-determined value in domains with no data, and (3) it should not mix with aleatoric uncertainty. In a similar vein, you can extract distributions from any bagging technique, such as random forests, to help regularize your predictions, but those distributions aren’t calibrated to accurately reflect epistemic uncertainty, which reduces their utility in bandit and reinforcement learning applications. Ian Osband’s empirical review of BNN methods is one of the most useful articles on the subject.
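To show what that looks like in practice, here is a minimal Keras sketch of the idea (layer sizes, ensemble size, and training settings are made up for illustration; it is not the tutorial’s code). Each ensemble member is a trainable network added to a frozen, randomly initialized “prior” network, each member is fit on its own bootstrap resample, and the spread across members serves as the epistemic uncertainty.

    import numpy as np
    from tensorflow import keras

    def make_member(input_dim, prior_scale=1.0):
        """One ensemble member: trainable net + frozen randomized prior net."""
        inputs = keras.Input(shape=(input_dim,))
        prior = keras.Sequential([keras.layers.Dense(16, activation="relu"),
                                  keras.layers.Dense(1)])
        prior.trainable = False           # the randomized prior is never trained
        trainable = keras.Sequential([keras.layers.Dense(16, activation="relu"),
                                      keras.layers.Dense(1)])
        outputs = trainable(inputs) + prior_scale * prior(inputs)
        model = keras.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="mse")
        return model

    def fit_ensemble(X, y, n_members=10, epochs=200):
        rng = np.random.default_rng(0)
        members = []
        for _ in range(n_members):
            idx = rng.integers(0, len(X), size=len(X))      # bootstrap resample
            member = make_member(X.shape[1])
            member.fit(X[idx], y[idx], epochs=epochs, verbose=0)
            members.append(member)
        return members

    def predict_with_uncertainty(members, X):
        preds = np.stack([m.predict(X, verbose=0).ravel() for m in members])
        # The spread shrinks where the resamples agree (dense data) and reverts
        # toward the disagreement of the random priors far from the data.
        return preds.mean(axis=0), preds.std(axis=0)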

Outline of the Bayesian inference primer

Douglas Mason