Exploration strategies in contextual bandits

It's kind of amazing, when you learn the details of contextual bandits, that the exploration strategies used by bandits (strategies optimized so we learn the most robust and best-performing policies with the least amount of data) don't appear much in other reinforcement learning papers, even though there are ghosts of these techniques all around. This is because most reinforcement learning papers are concerned with simulations rather than learning directly from the real world, and they often use very simple exploration strategies because their challenges lie in the super-huge state spaces, not data efficiency.

Epsilon-greedy is the most common one you'll encounter. It says that for some percentage of the time (epsilon), choose a random action; otherwise, choose the best action according to your predictions of summed future rewards. Easy-peasy!
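To make that concrete, here is a minimal sketch in Python. The names are purely illustrative, and `predicted_reward` stands in for whatever model you use to estimate summed future rewards:

```python
import random

def epsilon_greedy(actions, predicted_reward, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the best-scoring one.

    `actions` is a list of candidate actions and `predicted_reward` is any
    function returning our current point estimate of summed future reward.
    """
    if random.random() < epsilon:
        return random.choice(actions)          # explore
    return max(actions, key=predicted_reward)  # exploit
```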

But, oh no, how do we determine that percentage? And do we change it over time? These unavoidable questions add hyperparameters to tune, which we know kills real-world bandits.

The approach we use is a beautiful and elegant one called Thompson sampling. It says that we make a prediction of the distribution of future summed rewards for each action, NOT just a point estimate. This puts us into the realm of Bayesian inference, generative models, bootstrapping, and directly predicting the variance of our predictions. So many options! Even better, there are many software packages out there you can use right away if you want to see how the different approaches work (see below).
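As an example of one of those options, here is a rough sketch of bootstrapping: fit the same model on resampled copies of your data, and let the spread of the replicas' predictions stand in for the reward distribution. The `fit_model` callable is a placeholder for whatever learner you like, such as a gradient boosted tree:

```python
import numpy as np

def bootstrap_ensemble(X, y, fit_model, n_bootstraps=10, rng=None):
    """Fit the same model on bootstrap resamples of (X, y)."""
    rng = rng or np.random.default_rng()
    models = []
    for _ in range(n_bootstraps):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        models.append(fit_model(X[idx], y[idx]))
    return models

def predict_distribution(models, x):
    """One prediction per replica gives an empirical distribution, not a point estimate."""
    return np.array([m.predict(x.reshape(1, -1))[0] for m in models])
```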

Back to Thompson sampling, which says that we should sample from the distribution for each action we are considering, then choose the action whose sample has the highest predicted value. Actions whose performance is closer together than our data can resolve will be effectively interchangeable, while actions with large uncertainty will be explored in proportion to the likelihood that they might be better than our current winner. The best part about Thompson sampling? If we use an algorithm like Bayesian linear regression, or the empirical values for categorical features and actions, to get the distribution, there are no new hyperparameters to tune. And bootstrapping (which allows us to employ gradient boosted trees) only adds one hyperparameter, the number of bootstraps, which can be easily determined by answering a few questions. Huzzah!
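Here is a minimal sketch of that loop for the simplest case, discrete actions with binary rewards, where each action's posterior is just a Beta distribution over its empirical success rate. The data structures are placeholders, not any particular package's API:

```python
import numpy as np

def thompson_sample(counts, rng=None):
    """Sample each action's Beta posterior and take the argmax.

    `counts` maps each action to (successes, failures) observed so far.
    With a Beta(1, 1) prior there are no hyperparameters to tune.
    """
    rng = rng or np.random.default_rng()
    best_action, best_draw = None, -np.inf
    for action, (successes, failures) in counts.items():
        draw = rng.beta(1 + successes, 1 + failures)  # one posterior sample
        if draw > best_draw:
            best_action, best_draw = action, draw
    return best_action

# After observing a binary reward for the chosen action, update its counts:
# counts[action] = (successes + reward, failures + 1 - reward)
```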

For Bandito, we use Thompson sampling on a Bayesian linear regression (see our code and explanation here, free and open-source), because there are strong theoretical guarantees on its performance, the parameters are interpretable, and there is a great body of literature to help make this approach as computationally and memory-efficient as possible. For simpler systems, such as advertisement optimization, where our actions and contexts are defined by discrete categorical features, we can use the distributions of the empirical values for each category to make an even more robust and interpretable solution.
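This is not Bandito's implementation (see the linked code for that), but a bare-bones sketch of the idea: keep a Gaussian posterior over the regression weights, draw one plausible weight vector per decision, and score each candidate action's features with it. The noise and prior scales are fixed to 1 here purely for illustration; in practice they can be learned or given their own priors:

```python
import numpy as np

def posterior(X, y, noise_var=1.0, prior_var=1.0):
    """Gaussian posterior over weights for the model y ~ X @ w + noise."""
    d = X.shape[1]
    precision = X.T @ X / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov

def choose_action(X, y, action_features, rng=None):
    """Thompson sampling: draw one weight vector, score every candidate action."""
    rng = rng or np.random.default_rng()
    mean, cov = posterior(X, y)
    w = rng.multivariate_normal(mean, cov)  # one draw from the posterior
    scores = action_features @ w            # predicted reward per action
    return int(np.argmax(scores))
```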

Here are some open-source bandit APIs you can use, if you trust them. They tend to be displays of engineering prowess generating stupendous numbers of algorithmic variations. They all require setting up a server to persist state information, but they often have nice algorithmic features that avoid having to store the historical data in order to operate. This can be a feature or a bug, depending on whether you want to recover your data, but streaming algorithms play an important role in saving storage and computational costs:

And here are some ready-made APIs you can use, if you want to deal with their terrible user experience:


Meanwhile, did you notice that this article was readable, approachable, and knowledgeable? Wouldn’t it be nice to work with people who not only know the science, but also have the communication skills to help you succeed? Or how about people who built a bandit solution as customers rather than academics first, to make best-in-class products that really work and scale cheaply and reliably? If so, talk to us about using our bandit solution, called Bandito, in your product. It’s ridiculously easy to use, as extensible as you would like, and we have already built several implementations to get you started. Plus, we have the skills to manage engineering teams and answer your questions to keep you rolling.

Douglas Mason