Bayesian inference primer: notes on uncertainty

Working with Bayesian inference requires a keen sense of the types of uncertainty we face in our models. Here we review the main distinctions, drawing on a few seminal articles on the subject.

There are two types of uncertainty you encounter in Bayesian inference:

  • Aleatoric Uncertainty

    • The uncertainty that remains in our predictions even with an infinite amount of data covering all possible domains. It encompasses both irreducible noise (the spread of repeated measurements with the same input) and uncertainty that emerges from crude models (also known as bias). For example, a linear regression fit to a relationship that is better described by a quadratic will report higher aleatoric uncertainty (here known as the residuals or the noise parameter), even though we could reduce that uncertainty by choosing a model with a better fit (a numerical sketch follows this list). Aleatoric uncertainty can be:

      • learned in a machine learning model as part of the parameters themselves (as in the learned noise parameter in Bayesian linear regression)

      • learned through hyperparameter tuning (as in linear regressions with a fixed noise parameter, or Gaussian processes with a white noise kernel)

      • learned through empirical calculations (as when we calculate residuals)

      • produced as predictions themselves

        • as in logistic regression or softmax classification, which predicts the probability of an input belonging to each possible class

        • or as in quantile regression, loess regression, and neural nets that explicitly predict the spread of the data through the loss function, for example through the pinball (quantile) loss or a Gaussian negative log-likelihood

        • in both cases, the aleatoric uncertainty is modeled through parameters that are simply less explicit than the noise parameter in a linear regression

  • Epistemic Uncertainty

    • The uncertainty in our predictions due to the limits of the model’s knowledge: we have only a finite sample of data, with gaps in many domains. A model with abundant data in a domain will report close to zero epistemic uncertainty there, while a model with little data in a domain will report large epistemic uncertainty there. As we collect an infinite amount of data covering all domains, the total epistemic uncertainty goes to zero. Epistemic uncertainty is calculated using the methods of Bayesian inference covered in this article series.
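
To make the distinction concrete, below is a minimal numerical sketch of a conjugate Bayesian linear regression with a fixed noise level (the one-dimensional data, the Gaussian prior, and the value of sigma are all assumptions chosen for illustration). Its predictive variance splits cleanly into an epistemic term that grows as we move away from the training data and a constant aleatoric term set by the noise parameter:

```python
import numpy as np

# Minimal sketch: conjugate Bayesian linear regression with a fixed
# (hyperparameter) noise level sigma, showing the epistemic/aleatoric split.
rng = np.random.default_rng(0)

# Synthetic training data confined to x in [-3, 0].
X_train = rng.uniform(-3, 0, 40)
y_train = 1.5 * X_train + rng.normal(0.0, 0.5, size=X_train.shape)

sigma = 0.5   # fixed aleatoric noise level (assumed known here)
alpha = 1.0   # prior precision on the weights

# Design matrix with a bias column.
Phi = np.column_stack([np.ones_like(X_train), X_train])

# Gaussian posterior over the weights: N(mean, cov).
A = alpha * np.eye(2) + (Phi.T @ Phi) / sigma**2
cov = np.linalg.inv(A)
mean = cov @ Phi.T @ y_train / sigma**2

# Predictive uncertainty at test points splits into two terms.
X_test = np.linspace(-4, 4, 9)
Phi_test = np.column_stack([np.ones_like(X_test), X_test])

epistemic_var = np.einsum("ij,jk,ik->i", Phi_test, cov, Phi_test)  # from finite data
aleatoric_var = sigma**2 * np.ones_like(X_test)                     # irreducible noise

for x, e, a in zip(X_test, epistemic_var, aleatoric_var):
    print(f"x={x:+.1f}  epistemic var={e:.3f}  aleatoric var={a:.3f}")
```

In the printout, the aleatoric column is constant everywhere, while the epistemic column is smallest near the bulk of the training data and grows as we extrapolate away from it, exactly the behavior described above.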

Note that in building bandits, reinforcement learning algorithms, or Bayesian optimization, we use the epistemic uncertainty to explore, not the aleatoric uncertainty. If our model is too simple, as in a Bayesian linear regression without a set of basis functions, the epistemic uncertainty reduces to zero quickly while leaving behind a large aleatoric uncertainty (i.e., we underfit, with high bias and low variance). On the other hand, if our model is very complex, as in a deep neural net, the epistemic uncertainty remains large while the aleatoric uncertainty reduces to zero (i.e., we overfit, with low bias and high variance). For this reason, complex models are poor choices for running optimization procedures: they take too long to reach the exploitation stage.
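
As an illustration of exploration driven by epistemic rather than aleatoric uncertainty, here is a minimal Thompson-sampling sketch for a Bernoulli bandit (the arm success rates and the number of rounds are hypothetical values chosen for illustration). Each round samples one plausible mean per arm from its Beta posterior, which narrows as data accumulates, while the Bernoulli outcome noise never shrinks:

```python
import numpy as np

# Thompson sampling on a 3-armed Bernoulli bandit: exploration is driven by
# the posterior over each arm's mean (epistemic), not by the outcome noise.
rng = np.random.default_rng(1)
true_rates = np.array([0.45, 0.50, 0.55])   # hypothetical arm means
successes = np.ones(3)                       # Beta(1, 1) priors
failures = np.ones(3)

for t in range(2000):
    # Sample one plausible mean per arm from its posterior (epistemic draw)...
    sampled_means = rng.beta(successes, failures)
    arm = int(np.argmax(sampled_means))
    # ...then observe a noisy Bernoulli outcome (aleatoric noise).
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

pulls = successes + failures - 2
print("pulls per arm:", pulls.astype(int))
print("posterior means:", np.round(successes / (successes + failures), 3))
```

Early on the posteriors overlap heavily and all arms get pulled; as the epistemic uncertainty collapses, the pulls concentrate on the best arm even though each individual outcome remains just as noisy as before.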

In applications, we often care only about the means predicted by our machine learning model. For example, the aleatoric uncertainty for a given input may be large compared to the difference in means between competing options in an optimization process. However, we want to optimize our results so that our means integrated over time, and therefore our future totals, are improved. Just because the sample noise is larger than the mean difference doesn’t mean we can’t resolve that difference and exploit it (see our review of the Bellman equation for more context). But if we do see a difference in our prediction means, we want to know whether that difference is significant, and we determine that using epistemic uncertainty. This tells us whether we should explore more because the apparent difference may be spurious, the result of training on too little data.
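
As a sketch of that significance check, the snippet below assumes we already have Gaussian posteriors over the two competing means (the numbers are made up for illustration) and estimates, from epistemic uncertainty alone, the probability that one mean truly exceeds the other:

```python
import numpy as np

# Is the gap between two predicted means real, or an artifact of too little
# data? Compare the posteriors over the means (epistemic), not the raw noise.
rng = np.random.default_rng(2)

# Assumed Gaussian posteriors over the two means (epistemic spread only).
mean_a, std_a = 0.52, 0.04
mean_b, std_b = 0.48, 0.05

draws_a = rng.normal(mean_a, std_a, size=100_000)
draws_b = rng.normal(mean_b, std_b, size=100_000)
prob_a_better = np.mean(draws_a > draws_b)

print(f"P(mean_A > mean_B) = {prob_a_better:.3f}")
# Near 0.5 -> the gap is not yet resolved: keep exploring.
# Near 1.0 -> the gap is resolved at the epistemic level: exploit.
```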

It is for this reason that training an ensemble of machine learning models and using the mean prediction can improve model accuracy: a single sample of the model parameters will likely miss the mean due to epistemic uncertainty, and we come closer to it when we average over many i.i.d. samples of those parameters. Similarly, if we add the aleatoric uncertainty back into our predictions, then combining that with a record of the input distribution renders almost any model a generative model.
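
Here is a minimal sketch of both points, using bootstrapped least-squares fits as a stand-in for i.i.d. samples of the model parameters (the synthetic data and ensemble size are arbitrary choices). Averaging the ensemble reduces the epistemic error of any single fit, and adding the estimated residual noise back in, together with a simple model of the input distribution, lets us sample new (x, y) pairs generatively:

```python
import numpy as np

# (1) Ensemble averaging to reduce epistemic error; (2) adding aleatoric
# noise plus an input distribution to turn the model into a generator.
rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=60)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=60)

# Ensemble of fits on bootstrap resamples (stand-in for parameter samples).
coefs = []
for _ in range(50):
    idx = rng.integers(0, len(x), size=len(x))
    coefs.append(np.polyfit(x[idx], y[idx], deg=1))
coefs = np.array(coefs)            # shape (50, 2): slope, intercept
mean_coef = coefs.mean(axis=0)     # ensemble-averaged parameters

# Residual spread of the mean model estimates the aleatoric noise.
residuals = y - np.polyval(mean_coef, x)
noise_std = residuals.std()

# Generative sampling: draw x from the recorded input distribution,
# predict the mean, then add aleatoric noise.
x_new = rng.normal(x.mean(), x.std(), size=5)
y_new = np.polyval(mean_coef, x_new) + rng.normal(0.0, noise_std, size=5)
print(np.column_stack([x_new, y_new]).round(2))
```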

Outline of the Bayesian inference primer

Douglas Mason