Bayesian inference primer: bootstrapping

The bootstrap approximation is elegant and extremely useful, but it also raises a number of questions, since the concept can be unintuitive at first. First published by Bradley Efron in 1979, it has prompted an enormous amount of study and debate. Some circles lambast the bootstrap while others swear by it, but the general consensus is that, under certain assumptions, it is a reliable and excellent tool.

First, let us briefly cover what the bootstrap is: an approach to producing distribution estimates of your model parameters (such as their means and variances) that works by perturbing your data and identifying the parameters that maximize the likelihood on each re-sampled dataset. Compare this to all the other approaches we have discussed so far, which keep the data constant, perturb your model parameters, and evaluate the likelihood function to obtain a weighted sample of the parameter distributions. Because it draws from the empirical distribution of the data rather than an assumed one, the bootstrap is sometimes referred to as the empirical bootstrap.

To drill a little deeper, the bootstrap resamples your dataset with replacement (this distinction is important) to keep the final size the same. In other words, it treats your dataset as a population that it draws from to create samples of the data distribution. In the limit of a very large dataset with good coverage over its domain, this can work excellently. However, keep in mind that the number of times each data point appears in a given resample is well approximated by a Poisson distribution with mean one, so any individual point is omitted from a resample with probability of roughly 1/e ≈ 0.37. This means that if you have a categorical feature with only a few samples of a particular value, it is possible to completely miss those observations (see this article to learn more). As a result, we need to “enhance” our variance for estimates about low-sample domains in our data.
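The resampling mechanics can be sketched in a few lines. The snippet below is a minimal illustration on synthetic data (the dataset and seed are arbitrary choices for this example), showing that roughly a third of the points are omitted from any single resample:

```python
import random

random.seed(0)
data = list(range(1000))  # stand-in for a real dataset

# Resample with replacement, keeping the final size the same
resample = [random.choice(data) for _ in range(len(data))]

# Each point's count in the resample is approximately Poisson(1),
# so a point is omitted entirely with probability about 1/e
omitted_fraction = 1 - len(set(resample)) / len(data)
print(round(omitted_fraction, 2))  # typically close to 0.37
```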

The bootstrap also offers engineering benefits. For example, we can use an “online” bootstrap that cycles through our dataset only once and assigns each data point to each bootstrap an integer number of times, where that integer is randomly drawn from a Poisson distribution with mean one. (Other distributions are viable as well; a particularly useful one is “double-or-nothing”: each data point is used with a 50/50 chance, and when we use it, we put it into the bucket twice.) Then, when we want to sample our model parameters or our model predictions, we can keep a trained model for each bootstrap and select just one of them at run-time to make the prediction. (We are currently avoiding the question of propensity scores, which are required for off-policy evaluation and which you can read about in our other articles.) To learn more, see this other article from Facebook researcher Dean Eckles. Another sampling strategy available to us when data is particularly sparse is to use the Dirichlet distribution instead of the Poisson or Bernoulli. This is called the Bayesian bootstrap, and a full explanation and demonstration can be found here.
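To give a flavor of the Bayesian bootstrap idea (this is a sketch of the general technique, not the implementation from the linked demonstration), the snippet below draws continuous weights from a flat Dirichlet distribution over a small synthetic dataset and uses the spread of the weighted means as the uncertainty estimate; the data, seed, and sizes are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=50)  # a sparse synthetic dataset

n_bootstraps = 1000
posterior_means = []
for _ in range(n_bootstraps):
    # Flat Dirichlet weights: continuous, sum to 1, replacing integer counts
    weights = rng.dirichlet(np.ones(len(data)))
    posterior_means.append(np.dot(weights, data))

# The spread of the weighted means approximates our uncertainty
# in the mean of the underlying distribution
print(np.mean(posterior_means), np.std(posterior_means))
```

Because the Dirichlet assigns every point a nonzero weight, no observation is ever dropped entirely, which is what makes this variant attractive for sparse data.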

So, to summarize, the pros and cons of using the bootstrap method are:

  • Pros

    • Bootstraps work for any distribution that has a defined mean

    • Bootstraps can incorporate any type of underlying model, including tree-based algorithms, which make feature engineering easier

    • Bootstraps can be elegantly engineered for efficient and fast real-time computation. This makes them ideal for bandit algorithms.

  • Cons

    • The bootstrap does not apply to distributions without a defined mean (such as power-law or other heavy-tailed distributions)

    • Bootstraps are only reliable for estimating the mean and the variance -- they are generally not used to estimate higher moments

    • The bootstrap is only correct in the limit of infinite data points and infinite bootstraps. 

      • In practice, only a few hundred bootstraps are needed for reasonable accuracy, but thousands may be expected when publishing results. 

      • For small data sizes, heuristics need to be applied to artificially increase the measured variance. This is also important to keep in mind if your dataset is unbalanced, with poor coverage in certain areas, such as a categorical variable that only has a few samples from a given value.
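To make the accuracy claim above concrete, here is a small sketch (using synthetic normal data and an arbitrary seed) comparing the bootstrap estimate of the standard error of the mean, from a few hundred resamples, against the analytic formula:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=0.0, scale=1.0, size=200)

# A few hundred resamples, each the same size as the original data
n_bootstraps = 300
bootstrap_means = []
for _ in range(n_bootstraps):
    resample = rng.choice(data, size=len(data), replace=True)
    bootstrap_means.append(resample.mean())

bootstrap_se = float(np.std(bootstrap_means))
analytic_se = float(data.std() / np.sqrt(len(data)))  # ~ sigma / sqrt(n)
print(bootstrap_se, analytic_se)  # the two estimates should be close
```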

And two excellent references are: 

Code sample for online bootstrap

(Note that the base model constructor must support the partial_fit method, i.e., it can train on data row-by-row rather than requiring all the data in memory at once. For algorithms within the scikit-learn library that support this method, see this page. Another good library for this tool is xgboost.)

import random

from sklearn.linear_model import SGDRegressor


class OnlineBootstrap:

    def __init__(self, n_bootstraps=100, base_model_constructor=None, **model_hyperparameters):
        self.n_bootstraps = n_bootstraps
        if base_model_constructor is None:
            self.base_model_constructor = SGDRegressor
        else:
            self.base_model_constructor = base_model_constructor
        self.model_hyperparameters = model_hyperparameters
        # One independently trained model per bootstrap
        self.bootstraps = []
        for _ in range(n_bootstraps):
            self.bootstraps.append(self.base_model_constructor(**self.model_hyperparameters))

    def partial_fit(self, X, Y):
        # "Double-or-nothing" resampling: each bootstrap trains on this
        # batch twice with probability 1/2, and skips it otherwise
        for bootstrap_index in range(self.n_bootstraps):
            if random.random() < 0.5:
                for _ in range(2):
                    self.bootstraps[bootstrap_index].partial_fit(X, Y)

    def predict(self, X, means=False):
        if not means:
            # Select one bootstrap at random to draw a sample from the
            # distribution of predictions
            bootstrap_index = random.choice(range(self.n_bootstraps))
            return self.bootstraps[bootstrap_index].predict(X)
        else:
            # Average over all bootstraps to estimate the mean prediction
            prediction_samples = []
            for bootstrap_index in range(self.n_bootstraps):
                prediction_samples.append(self.bootstraps[bootstrap_index].predict(X))
            return sum(prediction_samples) / len(prediction_samples)

Outline of the Bayesian inference primer

Douglas Mason