The fundamental theorem of reinforcement learning: the Bellman equation

When I first began thinking about how contextual bandits could be applied in an episodic setting (that is, a situation where the sequence of steps matters), I just applied some intuition. It turns out that after reading a ton of reinforcement learning literature, that intuition still holds true, but now I can connect it to a large field of research and terminology. So I will start by describing the intuition I used, which will probably make more sense to you, and then I will connect it to the field. I won't use math here because, really, this is a story to help you navigate, but feel free to dig into the math in more detail.

Say you have a setup where the only reward occurs at the end of the episode, which is a common situation in sign-up flows, purchase flows, and goal-oriented chatbots. The steps in the middle don't earn any immediate reward, but they affect the final step and whether you follow through or abandon early (the final reward). We would expect to be able to apply a bandit to the final step to choose the best action right before the reward comes in. Then, if we applied a bandit to the step before, we would want it to put us in a state where we are more likely to choose the best action thereafter.

Intuitively speaking, this is directly related to A/B testing. When you compare two possible presentations of an app to see which one drives better performance (via click-through rate, for example), you want the change you are measuring to have as immediate and direct an impact on the reward as possible; otherwise you won't be able to resolve any difference in impact between the two choices. I can A/B test whether a blue or a red button makes you click on it, but it becomes really hard to test whether presenting you with a long End-User License Agreement five minutes ago had any effect, even with tons of data.

Figure 1: Consider two actions, A in blue and B in red, whose predicted-value distribution densities are shown above. Given a result like (a), we have three options for increasing our ability to resolve the difference between the two actions. We can (b) reduce their variance, (c) increase the difference in their means, and/or (d), not shown, increase the number of samples. However, note that our resolving power only grows with the square root of the number of samples, so that last option can be expensive or even impossible.

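The square-root claim in the caption is worth seeing concretely. Here is a minimal sketch (my own illustration; the means, noise level, and sample sizes are invented numbers) showing that the standard error of the estimated difference between two actions only shrinks like one over the square root of the sample count, so halving the error costs four times as much data.

```python
import numpy as np

rng = np.random.default_rng(0)

def standard_error_of_difference(n, std=0.2):
    """Standard error of (mean_B - mean_A) with n samples per arm,
    estimated from simulated data with a made-up noise level."""
    a = rng.normal(0.50, std, size=n)  # hypothetical outcomes under A
    b = rng.normal(0.52, std, size=n)  # and under B
    return np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)

for n in [100, 400, 1600, 6400]:
    print(n, round(standard_error_of_difference(n), 4))
# Each 4x increase in samples only roughly halves the standard error,
# so our power to resolve A from B grows like the square root of n.
```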


In an episodic setting with rewards that only occur at the end, when comparing equivalent actions we expect the prediction models to give us results that resemble panel (b) of Figure 1 for the later steps, and results that resemble the broader, overlapping panel (a) for the earlier steps. This is because the sum of future rewards that we attempt to predict doesn't change throughout the episode, but predictions at earlier steps have to average over the mixing of states that happens during the intermediate steps, which spreads out the distribution of those rewards. Of course, if different actions are considered at earlier steps than at later steps, and if an earlier action has a direct (if long-term) impact on the final reward, that effect will dominate and reduce its prediction variance.

Consider Figure 1 again. If you perform an A/B test on each step in an episode and treat the steps as completely separate problems for predicting the final reward at the end of the episode, you will see that the farther you are from the final goal, the larger the variance in your predictions: that is, the more uncertain you are about how each choice affects the outcome, until all the choices overlap so much that you can't tell them apart. However, if you apply a contextual bandit to each step in the sequence, something different and extremely important happens: the contextual bandits improve over time, and the final ones improve the fastest.

This means that the bandit running the final step, with its smaller variance and higher resolution, makes better and better choices. So when we look at the bandit running the penultimate step, trained on the rewards observed after we reach the final step, we are collecting its data under a constantly improving bandit for the final choice. Once the final bandit learns, it reduces the variance in the penultimate bandit's predictions. Extrapolating, this means that information about the best choices to make propagates from the final step back to the earlier steps.
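To make that mechanism concrete, here is a minimal sketch of the strategy described above: one independent epsilon-greedy bandit per step, each trained on the same end-of-episode reward. It is my own toy example; the flow, the action names, and the success probabilities are all invented for illustration.

```python
import random

random.seed(0)
EPSILON = 0.1  # exploration rate (arbitrary choice)

class Bandit:
    """Tabular epsilon-greedy bandit that tracks the mean end-of-episode
    reward observed for each (state, action) pair."""
    def __init__(self, actions):
        self.actions = actions
        self.sums, self.counts = {}, {}

    def value(self, state, action):
        n = self.counts.get((state, action), 0)
        return self.sums.get((state, action), 0.0) / n if n else 0.0

    def choose(self, state):
        if random.random() < EPSILON:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.value(state, a))

    def update(self, state, action, reward):
        key = (state, action)
        self.sums[key] = self.sums.get(key, 0.0) + reward
        self.counts[key] = self.counts.get(key, 0) + 1

# Toy two-step flow: the first action decides which state the final step
# sees, and the final action's success rate depends on that state.
SUCCESS = {("engaged", "discount"): 0.8, ("engaged", "reminder"): 0.3,
           ("hesitant", "discount"): 0.2, ("hesitant", "reminder"): 0.5}
step1 = Bandit(["short_form", "long_form"])
step2 = Bandit(["discount", "reminder"])

def run_episode():
    a1 = step1.choose("start")
    state = "engaged" if a1 == "short_form" else "hesitant"
    a2 = step2.choose(state)
    reward = 1.0 if random.random() < SUCCESS[(state, a2)] else 0.0
    step1.update("start", a1, reward)  # both bandits learn from the
    step2.update(state, a2, reward)    # same end-of-episode reward

for _ in range(5000):
    run_episode()
print(step1.value("start", "short_form"), step1.value("start", "long_form"))
```

The point of the final print is that the first-step bandit's estimates only separate cleanly once the final-step bandit has already learned the right closing action for each state; that separation is the information propagating backwards.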

This is the heart of the Bellman equation, which is basically the most important equation in all of reinforcement learning, and it's a really intuitive one. It is also really a reformulation of dynamic programming, an approach most tech workers have to master for one interview or another. The idea of dynamic programming is that if you can break a large, complex problem into subproblems that each depend only on their own inputs and on the results of the other subproblems, you can greatly reduce the complexity of solving the whole thing. For example, if you want to compute the factorial of a number N, you can reuse the result for the factorial of N-1 and just multiply it by N.
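As a minimal illustration of that reuse-of-subproblems idea (my own sketch, not code from the post), here is the factorial example with its subproblem results cached, so each value is computed once and then simply looked up:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def factorial(n):
    """Each call only does its own step (multiply by n) and reuses the
    already-solved subproblem factorial(n - 1)."""
    return 1 if n <= 1 else n * factorial(n - 1)

print(factorial(10))  # 3628800
```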

The intuition behind the Bellman equation is that if you want to make the best choice in Step 1 of a 5-step process, you only need to concern yourself with the immediate reward you get during Step 1 and with what the best possible policy would give you for all the subsequent steps. You don't have to think of all the steps at once. In other words, the policy for the final step only cares about its immediate reward, the policy for the penultimate step only depends on its immediate reward and our expectations for the policy used to run the final step, and so on. This means we can break up the problem into smaller pieces, which is a huge relief!
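I said I wouldn't use math, but for readers who want to connect this to the standard notation, the usual statement of the Bellman optimality equation for the value of the state at step t is (with the discount factor gamma equal to 1 when only the final step pays out):

```latex
V^{*}(s_t) = \max_{a}\, \mathbb{E}\left[ r(s_t, a) + \gamma\, V^{*}(s_{t+1}) \right]
```

In words: the best you can do from here is the best immediate reward plus the expected best value of wherever that action lands you, which is exactly the decomposition described above.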

(After all, could you imagine having to learn the best policy by predicting the reward for each possible sequence of actions in an episode? The complexity would explode! Thank you, dynamic programming, even if you do have a horrible, pretentious name that doesn't make sense anymore.)

A great number of papers and algorithms in reinforcement learning revolve around making those predictions for the subsequent steps, in the same way that when you play chess, you mentally imagine how the rest of the game will go if you make a given move. But in a bandit setting, we are hoping to learn those predictions from a real-world application. Obviously, this means that bandits can't handle complex situations like chess, which have huge state spaces that require us to predict millions of future scenarios, because that would mean running those millions of future scenarios in the real world! But we can get good performance in the more modest applications we actually encounter, and we can guarantee our bandits will at least do better than random chance.

The Bellman equation is a game-changer. You might wonder, at first, how to handle the complexity of each step in an episode affecting all the later ones. The Bellman equation tells us to stop. Don't. Reasoning about the whole tangle at once doesn't work, and your confusion is correct. Instead, break the problem up into smaller pieces, and if you do it the right way, those pieces are guaranteed to work together to produce a stronger policy.

Note that when you train your bandits according to the strategy above, you are approximating Q-learning, which means you are only training prediction models to predict a scalar value function (the summed future rewards). You could, on the other hand, train models to predict the best action directly, if you only have discrete (categorical) actions rather than actions defined by continuous feature vectors. However, that approach is less robust and overkill for most real-world bandit applications, which is why you rarely see it in the bandit literature.
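For reference, here is a minimal sketch of the tabular Q-learning update that this strategy approximates. Everything in it (the action names, the constants, the example state labels) is a placeholder of my own rather than something from a particular library:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 1.0, 0.1  # learning rate, discount, exploration
ACTIONS = ["a0", "a1"]                 # placeholder action set

# Q[(state, action)] estimates the summed future reward, i.e. the scalar
# value function mentioned above.
Q = defaultdict(float)

def choose(state):
    """Epsilon-greedy action selection from the current Q estimates."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, done):
    """Core Q-learning step: move Q(s, a) toward the immediate reward plus
    the current estimate of the best value achievable from the next state.
    With a reward only at the end and GAMMA = 1, this bootstrapped target
    stands in for training each step's bandit on the eventual final reward."""
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
    target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

# A single recorded transition (all values are placeholders):
q_update("final_state", "a1", reward=1.0, next_state=None, done=True)
```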




Douglas Mason