Contextual bandits in reinforcement learning: coming up to speed
The field of reinforcement learning is an exciting one that is constantly evolving. The best papers have all come out, literally, within the past eight years. There are few academic disciplines that are so enthralling!
However, there is a drawback to all this activity: fractured understanding. The best academic papers focus their attention on elaborate proofs, while the best application write-ups tend to have huge gaps in understanding and a poor ability to connect concepts across multiple sub-fields. The reality is, there is so much going on in reinforcement learning these days that practitioners working 8 hours a day (and honestly, far more) still can’t keep up!
If the field interests you, and you truly believe that the future of industry will be shaped by the concepts being built right now, I highly recommend carving out a window of a week to devote just to reading reinforcement learning concepts, writing down your questions, and working through them systematically. The reason this helps is that the field is thick with acronyms and jargon bridging concepts across the methods, and you want to be able to recall them quickly and connect each one to an intuitive understanding. It’s impossible to do this with more than a few days between reading sessions.
As far as literature goes, while Wikipedia is a good resource for quickly recalling a concept, it is a poor resource for learning it the first time. Here is the literature I strongly recommend. At first, treat it like many other machine learning applications: build up a self-supervised model in your mind by skimming the material and noticing which acronyms and jargon come up the most. Your mind is implicitly trying to predict what you read, so this will help you get the data structures set up. Then, as you dig into your questions, you can get more detailed with your approach… just like a reinforcement learning algorithm would!
The DeepMind Papers.
2013: Playing Atari with Q-Learning using a deep neural net.
https://deepmind.com/research/publications/playing-atari-deep-reinforcement-learning
Because of stability issues, no one knew at first how to get deep neural nets to do Q-learning. Discover the tricks that DeepMind used to overcome the hurdles and usher in huge interest in deep learning; the headline trick is sketched below.
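If you want the flavor before reading the paper, that headline trick is experience replay (the Nature follow-up adds a frozen target network on top). Here is a minimal sketch in plain Python; the class and names are mine for illustration, not DeepMind’s actual code.

```python
# Minimal sketch of experience replay, assuming nothing beyond the
# standard library. Class and variable names are illustrative.
import random
from collections import deque

class ReplayBuffer:
    """Store transitions and sample them uniformly at random, breaking the
    temporal correlations that destabilize naive online Q-learning."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# Usage: store every transition while acting, then train the Q-network on
# random minibatches instead of the most recent (highly correlated) steps.
buf = ReplayBuffer()
buf.push(state=0, action=1, reward=0.5, next_state=2, done=False)
batch = buf.sample(1)
```

The whole point is that last line: decorrelating the training batches from the agent’s recent trajectory is what made the deep net stop diverging.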
2016: Beating masters at Go with a really complicated algorithm.
https://deepmind.com/research/publications/mastering-game-go-deep-neural-networks-tree-search
Another compendium of tricks to get reinforcement learning to work in an application space that had previously been thought out of reach. This is genuinely a hard paper to read because the algorithm is so complicated.
2017: Beating masters at Go without being so complicated.
https://deepmind.com/research/publications/mastering-game-go-without-human-knowledge
They greatly simplify the algorithm from the previous paper, throw in more compute, and voilà: it does even better!
This is just an amazing paper everyone should read at some point in their lifetime. Pay special attention to the part about the Monte Carlo Tree Search, and think about how you could use a contextual bandit to replace it.
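To make that thought experiment concrete, here is what the bandit half might look like: a per-move LinUCB learner (Li et al., 2010) scoring candidate moves from board features instead of rolling out a tree search. This is a toy sketch with a made-up feature encoding, not anything from the paper.

```python
# A toy thought experiment: score candidate moves with per-move LinUCB
# contextual bandits rather than a tree search. Feature encodings and
# names here are invented for illustration.
import numpy as np

class LinUCBArm:
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)    # ridge-regression design matrix
        self.b = np.zeros(dim)  # accumulated reward-weighted features
        self.alpha = alpha      # exploration strength

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        # mean reward estimate plus an upper-confidence exploration bonus
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# Pick the legal move whose (hypothetical) board features score highest:
dim = 8
moves = [LinUCBArm(dim) for _ in range(4)]
features = [np.random.rand(dim) for _ in moves]
best = max(range(len(moves)), key=lambda i: moves[i].score(features[i]))
moves[best].update(features[best], reward=1.0)  # e.g., +1 for a win
```

Whether this could actually match a tree search is exactly the kind of question worth chewing on after reading the paper.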
These papers are truly inspiring, extremely well-written, and pitched at a high level. They are also brief, so they are good for your early skims and your later detailed inquisitions, but since they gloss over the details you’ll want to reference other resources to fill in the blanks.
Course Syllabi
Alekh Agarwal of Microsoft Research teaches contextual bandits and reinforcement learning.
This is a good overview, but every lecture devolves halfway through into ornate proofs that only a grad student would drool over. There are also a lot of pointless limitations introduced for academic precision. For example, you can totally train a contextual bandit on delayed or summed future rewards; we don’t have to limit ourselves to immediate rewards and pure MDPs just to make a thesis committee happy.
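To be concrete about that claim, here is a minimal sketch of the idea: record the (context, action) pair at decision time, then apply the update whenever the summed future reward finally arrives. Everything here (the linear model, the pending dict, the names) is illustrative, not from the course.

```python
# Minimal sketch: a contextual bandit trained on delayed, summed rewards.
# All names (pending, choose, resolve) are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 3, 5
weights = np.zeros((n_actions, dim))  # one linear reward model per action
pending = {}                          # decision_id -> (context, action)

def choose(decision_id, context, eps=0.1):
    """Epsilon-greedy action choice; remember the decision for later."""
    if rng.random() < eps:
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmax(weights @ context))
    pending[decision_id] = (context, a)
    return a

def resolve(decision_id, summed_reward, lr=0.05):
    """Called later, once the delayed/aggregated reward is known."""
    context, a = pending.pop(decision_id)
    error = summed_reward - weights[a] @ context
    weights[a] += lr * error * context  # simple SGD on squared error

choose("session-42", rng.random(dim))
resolve("session-42", summed_reward=2.5)  # e.g., total clicks over a week
```

Nothing in the bandit machinery cares that the reward showed up a week later; it only needs the reward eventually attributed back to the right context and action.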
Check out the companion book by Alex Slivkins, Introduction to Multi-Armed Bandits, in particular the chapter on Bayesian bandits and Thompson sampling.
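Thompson sampling is worth internalizing because it is so short. Here is the textbook Beta-Bernoulli version: sample a plausible reward rate for each arm from its posterior, play the argmax, and update the posterior with what you observe. The specific numbers below are made up.

```python
# Beta-Bernoulli Thompson sampling: the canonical Bayesian bandit.
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.3, 0.5, 0.7]  # unknown to the algorithm
alpha = np.ones(3)            # Beta posterior: 1 + successes per arm
beta = np.ones(3)             # Beta posterior: 1 + failures per arm

for _ in range(1000):
    samples = rng.beta(alpha, beta)  # one draw per arm from its posterior
    arm = int(np.argmax(samples))    # play the arm that looks best today
    reward = rng.random() < true_rates[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))  # posterior means home in on true_rates
```

The charm is that exploration falls out of the posterior for free: arms you know little about produce wide samples and keep getting tried, while arms you know are bad quietly stop winning the argmax.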
Sample-based reinforcement learning on Coursera
https://www.coursera.org/specializations/reinforcement-learning
A great resource if you want someone to speak proofs aloud to you, based on the Sutton and Barto book (see below). The best value here is in its syllabus outline and the different approach it takes to the material compared with the other resources. You can audit this course for free (don’t fall for Coursera’s confusing interface suggesting otherwise) and watch videos that drill directly into a concept you find confusing.
Sutton and Barto’s Reinforcement Learning: An Introduction
Note that this book is huge, detailed, and terrific. It is also a serious investment of time and can be completely overwhelming. Don’t let insecure nerds convince you otherwise: this book is difficult material to master! However, simply skimming the first five chapters can provide a terrific overview, even if you don’t get all the concepts. Or just read the table of contents and the introduction. They do a good job of putting the concepts into context, which helps cement them in your mind later when you come across more approachable explanations.