Practical considerations when using a contextual bandit for your problem
Academics in the bandit and reinforcement learning literature are primarily interested in proving guarantees on regret -- in other words, in showing, in a mathematically exact way, that a particular bandit algorithm is more efficient and robust than the alternatives. That may sound like overkill, but it isn't. You'd be amazed how many common techniques, like bootstrapping, are based on intuition and have been used for seventy years even though we don't exactly know why they work. Building these proofs is an absolutely necessary endeavor for settling which algorithms should drive things under the hood.
However, when we apply contextual bandits in a product, the needs of the project almost always outweigh the benefits of one algorithm over another. We want to know what will work in the situations we are most likely to encounter. Just as Python has swept the programming-language landscape not because of its efficiency but because of its smart approach to common data structures and attention to design detail, putting a contextual bandit to work means relying on sturdy, robust solutions so you can save your labor for reward design and feature engineering.
The reality is that overly complex applications of contextual bandits are unnecessary. If you have to do substantial feature engineering or hyperparameter search, that means you are performing offline analysis and training offline models. Why not just implement those statically? That’s a lot easier to maintain! You would only use a bandit in this context if:
You want your model to be personalized to a small dataset (say, per user or per app). Note, however, that you will need an extremely small feature set for this to work.
You expect substantial drift in your data, so you need online learning to stay current, and the bandit implementation is cheaper and more reliable than an equivalent online-learning setup.
A contextual bandit must have the following qualities if you hope to integrate it into a product within your lifetime:
No hyperparameters to tune
Problem: You put a bandit into production but alas, you screwed up the hyperparameters! Now you have to train again from scratch. Worse, there are no truly parameter-free ways to use a bandit.
Solution: A truly parameter-free bandit is difficult to achieve, but we have found solutions that get you most of the way there in most applications. Our default engine is Thompson Sampling on a Bayesian linear regression (see our code and explanation here, free and open-source), and for simpler systems we offer an engine based on empirical values; both are about as parameter-free as you can get while keeping good guarantees on performance. After that, the few hyperparameters left over can be set with grounded guesses that rarely need tuning. For example, the size of the memory window for models that forget old data can be chosen by someone with domain knowledge of the application.
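To make the idea concrete, here is a minimal sketch of Thompson Sampling over a per-arm Bayesian linear regression. It is an illustrative toy rather than our production code; the class names and the unit Gaussian prior are assumptions. Note how little is left to tune: the prior precision is effectively the only knob.

```python
# Minimal sketch: Thompson Sampling on a Bayesian linear regression (toy, not production code).
import numpy as np

class LinearThompsonArm:
    """Bayesian linear regression posterior for one arm, unit Gaussian prior."""
    def __init__(self, n_features):
        self.A = np.eye(n_features)    # precision matrix: prior + sum of outer products x x^T
        self.b = np.zeros(n_features)  # sum of reward-weighted contexts

    def sample_expected_reward(self, context, rng):
        cov = np.linalg.inv(self.A)
        mean = cov @ self.b
        theta = rng.multivariate_normal(mean, cov)  # draw one posterior sample of the weights
        return context @ theta

    def update(self, context, reward):
        self.A += np.outer(context, context)
        self.b += reward * context

def choose_arm(arms, context, rng):
    # Pick the arm whose sampled expected reward is highest.
    return int(np.argmax([arm.sample_expected_reward(context, rng) for arm in arms]))

# Usage: two arms, three context features.
rng = np.random.default_rng(0)
arms = [LinearThompsonArm(3) for _ in range(2)]
context = np.array([1.0, 0.2, -0.5])
chosen = choose_arm(arms, context, rng)
arms[chosen].update(context, reward=1.0)  # reward observed from the product
```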
Easy feature engineering
Problem: Feature engineering is an inescapable bugaboo in machine learning. The more features you add, the better the model may perform down the road, but the longer it will take to learn. Add too many and the model may overfit and never learn at all. And with an online learner, every time we change our features we have to throw the model away and train from scratch. How do we balance these needs in an online setting?
Solution: We recommend customers start with as few features as possible and build up their feature set over time, stopping when they see diminishing returns. After all, a dumb model will still perform at least somewhat better than random chance, whereas an overly complex one may fail completely.
The gradient-boosted tree engine substantially reduces the need for feature engineering when you have a mixture of categorical and numerical data.
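To illustrate why, here is a minimal sketch using scikit-learn's HistGradientBoostingRegressor (a stand-in, not our engine): a tree-based model can consume integer-coded categorical columns and raw, unscaled numerical columns directly, with no one-hot encoding or feature scaling.

```python
# Minimal sketch: a gradient-boosted tree handling mixed categorical/numerical
# features without manual feature engineering (illustrative only).
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
n = 1000
device_type = rng.integers(0, 3, size=n)   # categorical: 0=phone, 1=tablet, 2=desktop
hour_of_day = rng.integers(0, 24, size=n)  # numerical, left unscaled on purpose
reward = (device_type == 1) * 0.3 + np.sin(hour_of_day / 24 * 2 * np.pi) + rng.normal(0, 0.1, n)

X = np.column_stack([device_type, hour_of_day])
model = HistGradientBoostingRegressor(categorical_features=[0])  # column 0 is categorical
model.fit(X, reward)
print(model.predict(X[:5]))
```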
Moreover, we log the data needed for off-policy evaluation. What does that mean? The data collected under your earlier models can be reused to evaluate and bootstrap your later, more-complex models. No data has to be lost!
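Concretely, as long as each logged record stores the context, the chosen action, the probability the logging policy assigned to that action, and the observed reward, you can estimate how a new policy would have performed on the old traffic. The sketch below uses inverse propensity scoring (IPS); the record field names are hypothetical, not our schema.

```python
# Minimal sketch: inverse-propensity-scoring (IPS) off-policy evaluation on logged bandit data.
import numpy as np

def ips_value(logged, new_policy_prob):
    """Estimate the average reward a new policy would have earned on old logs.

    logged: list of dicts with 'context', 'action', 'prob' (probability the
            logging policy gave the chosen action) and 'reward'.
    new_policy_prob: function (context, action) -> probability under the new policy.
    """
    estimates = []
    for rec in logged:
        weight = new_policy_prob(rec["context"], rec["action"]) / rec["prob"]
        estimates.append(weight * rec["reward"])
    return float(np.mean(estimates))

# Usage: score a uniform-random policy over 3 actions against two toy log records.
logs = [
    {"context": [0.1], "action": 2, "prob": 0.5, "reward": 1.0},
    {"context": [0.9], "action": 0, "prob": 0.25, "reward": 0.0},
]
print(ips_value(logs, lambda ctx, a: 1.0 / 3.0))
```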
Lastly, every bandit comes with a visual dashboard from which you can download a full JSON output describing the data distribution, model parameters, and historical performance, to help you debug and iterate. You can also train your own models offline and upload the data you used to train them to the service in a batch operation to speed up learning.
Robust prediction model underneath
Problem: A bandit running on a complex model like a neural network is famously difficult to debug.
Solution: Then don't use a neural network; use as simple and robust a model as possible. This is why we use linear regression and empirical value models.
Scale cheaply and reliably
Problem: You could implement your own bandit API, or you could take a pre-built one and host it yourself. Either way, you're now managing a codebase, and no one is there to help you.
Solution: We built the Bandito API on AWS, which means our solution is cheap, reliable, secure, and scales using cloud services. We have open-sourced the most important part of the code so you can inspect it, and we provide testing software so you can verify it yourself. Because we built Bandito as its own first customers, we've worked through many of the problems you're about to face, and we provide several working demonstrations you can test and use right now.