Contextual bandits in reinforcement learning: core concepts

  • Supervised, Self-Supervised Learning, and Reinforcement Learning, oh my!

    • Supervised Learning

      • This is your classic prediction model. It ingests inputs (feature vectors) and renders predictions (scalars, vectors, or distributions, as in Bayesian models). Supervised learning requires a training data set of paired inputs and outputs, known as labelled data.

    • Self-Supervised Learning

      • Language models which learn embeddings, like word2vec and BERT, are trained on writing in the wild. Their input is a sequence of text and their output is the next (or masked-out) piece of text, which means we can generate labelled data automatically from any string of text. Other than the data preparation step, this is the same as a supervised learner.

      • Usually it’s not the output of self-supervised models that we desire, but the representations in their parameters, which we can use with other supervised learners to improve performance on novel tasks by taking a lot of the interpretation guesswork out. NLU models that use word embeddings and computer vision models that repurpose ImageNet-trained networks to perform new classifications are good examples of this application.

    • Unsupervised Learning

      • Clustering, autoencoders, and other unsupervised algorithms train on the input itself and don’t need any labelled data. Some people debate whether self-supervised learning counts as unsupervised learning, and it doesn’t matter.

    • Reinforcement Learning

      • Here we have some labelled data (when you reach the end of a sign-up sequence and successfully sign up), but also some unlabelled data (every action you took that led up to the final one that signed you up). How do we propagate the information from the labelled data to the unlabelled data? We reinforce it, using the Bellman equation (see below). And reinforcement learning shows us the gory details of how to do it.

      • While some methods employ approaches that are entirely separate from supervised learning, many of the algorithms we encounter in Q-learning and bandits are actually just ways of building labelled datasets that we feed to supervised learners.
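
To make that last point concrete, here is a minimal sketch in Python (the data, feature sizes, and ridge penalty are all made up for illustration): we take logged (context, action, reward) triples from a bandit, stack them into a labelled dataset, and fit an ordinary supervised regressor that predicts the reward of any context-action pair.

```python
import numpy as np

# Hypothetical logged bandit data: each row is one interaction.
# context: 2 user features; action: index into 3 candidate headlines; reward: click (0/1).
contexts = np.array([[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.9, 0.1]])
actions = np.array([0, 2, 1, 0])
rewards = np.array([1.0, 0.0, 1.0, 0.0])

n_actions = 3
action_onehot = np.eye(n_actions)[actions]

# The "labelled dataset": inputs are context ++ action features, labels are observed rewards.
X = np.hstack([contexts, action_onehot])
y = rewards

# Any supervised learner would do; here, ridge-regularized least squares in closed form.
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Predicted reward for a new context paired with each candidate action.
new_context = np.array([0.3, 0.7])
candidates = np.hstack([np.tile(new_context, (n_actions, 1)), np.eye(n_actions)])
print(candidates @ w)  # one predicted reward per action
```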

Figure 1: Diagram illustrating the core concepts around agent, environment, state, action, and reward within a Markov Decision Process (MDP)

  • Agent, State, Action, Reward, Context, Value, Episode, and… uh… MDP *sigh* (see Figure 1)

    • Agent

      • The protagonist of the story, the character you relate to. The agent is a personification of the system being used to make decisions (the policy, see below) and to learn from the consequences of those decisions (the reward, see below).

    • State (s)

      • Describes the situation where the bandit is being applied. Yes, that’s vague, and that’s the point. The state encompasses the person using the bandit, the application they’re using it in, and the details of their current situation within the app (for example, are you on page three of a five-page sign-up flow?).

    • Action (a)

      • The choices available to the bandit. This can be a categorical choice, like a chosen headline or advertisement, or it could be a vector of continuous numbers describing a choice, like a sentence embedding. The bandit can only rank a finite number of choices at a time.

    • Reward (r)

      • We are aiming to optimize something, and that something is the reward. What you choose to optimize falls under the realm of reward design. However, in most applications it’s pretty obvious what you use for the reward. Advertisements optimize for you clicking on the ad, and sign-up flows optimize for getting you to finally sign up.

      • Rewards can be binary like the examples above, or they can be continuous. For example, in Fret Ferret, the reward is the time-to-completion for each task. A downstream modification of the standard bandit approach lets us decide if we want to minimize, maximize, or aim for a particular value as a proxy for question difficulty.

    • Episode

      • The sequence of states, actions, and rewards that occurs with each use of an application. In a sign-up flow, this is the sequence of steps a user takes to sign up (or to ditch). In a chatbot, this is the sequence of questions and answers before the conversation ends. In an advertisement, the episode only has one step: an advertisement is shown and the user clicks on it or they don’t.

    • Value (V)

      • The sum of future rewards from a given state in its current episode. In Q-learning, we learn the Q-value: the sum of future rewards from the current state and action, assuming that we keep using the best possible policy thereafter.

    • Context (C)

      • A featurization of the state that we concatenate with the featurization of the action to feed into the value prediction model that powers a contextual bandit.

    • Policy (pi)

      • When we choose an action, how do we choose it? For categorical actions, this can be represented as a probability assigned to each action. For continuous-feature actions, we get scores for each candidate and sample from the candidate pool, giving more weight to the candidates with higher scores (see the sketch at the end of this list).

    • Markov Decision Process (MDP)

      • A pretentious phrase meaning that we have a state, and actions move us to a new state that is equivalent to the old one but with different values in the context vector describing it. Why even make this distinction? Because we could make a separate bandit for each step in an episode, or use one bandit for all steps. In the latter case, the situation is called an MDP. And if episodes are only ever one step long, we still call it an MDP.
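
To tie the Context and Policy entries together, here is a minimal sketch (the linear value model, softmax weighting, and temperature are illustrative stand-ins, not anything specific to Bandito): each candidate action’s features are concatenated with the context, scored by the value model, and an action is sampled with more weight going to higher-scoring candidates.

```python
import numpy as np

rng = np.random.default_rng(0)

def value_model(context_action, w):
    """Stand-in value prediction model: a linear scorer on context ++ action features."""
    return context_action @ w

def softmax_policy(context, action_features, w, temperature=1.0):
    """Score each candidate, then sample one with probability increasing in its score."""
    inputs = np.hstack([np.tile(context, (len(action_features), 1)), action_features])
    scores = value_model(inputs, w) / temperature
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    chosen = rng.choice(len(action_features), p=probs)
    return chosen, probs

# Hypothetical example: a 2-feature context and 3 candidate actions with 4-dim embeddings.
context = np.array([0.2, 0.8])
action_features = rng.normal(size=(3, 4))
w = rng.normal(size=2 + 4)

action, probs = softmax_policy(context, action_features, w)
print(action, probs)
```

Sampling rather than always taking the top score is what gives a bandit its exploration; the temperature controls how strongly the scores tilt the draw.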

  • On-policy vs. Off-policy.

    • In contextual bandits, we are working in an off-policy regime. That means we can make choices with our actions and use the data generated by those choices to train our models even as our policy changes. In Bandito, this is accomplished with a technique called Inverse Propensity Score Weighting (IPSW), although other techniques are available (a sketch of the idea follows below). Some reinforcement learning methods also use IPSW but call it the action ratio.

    • On-policy methods, however, are usually only slightly different from the off-policy ones in implementation, but we can’t use them to learn in real-world applications: they are too data inefficient, because they can’t reuse data from an older or different bandit or policy that chose different actions than we would have.
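
Here is a minimal sketch of the re-weighting idea behind IPSW (the numbers are invented, and Bandito’s actual implementation may differ): each logged reward is weighted by the ratio of the probability our current policy would assign to the logged action to the probability the logging policy actually used, which is the “action ratio” mentioned above.

```python
import numpy as np

# Hypothetical log: for each interaction we store the reward, the probability the old
# (logging) policy assigned to the chosen action, and the probability our new policy
# would assign to that same action in that same context.
rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
logging_probs = np.array([0.5, 0.2, 0.1, 0.7, 0.4])    # propensities recorded at decision time
new_policy_probs = np.array([0.3, 0.2, 0.6, 0.7, 0.1])  # what our updated policy would do

# Importance weights (the "action ratio"): up-weight events the new policy favors,
# down-weight events it would rarely choose.
weights = new_policy_probs / logging_probs

# Off-policy estimate of the new policy's expected reward, computed from old data.
ipsw_estimate = np.mean(weights * rewards)
print(ipsw_estimate)
```

In practice these weights are usually clipped or otherwise regularized so that a handful of rarely-logged actions can’t dominate the estimate.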

  • Q-Learning vs. Policy Gradients

    • You can train a model to predict the value after an action is taken (a scalar output), or you can train it to predict the probability of choosing each action (a vector output) at a given state. In the former case, the policy amounts to how we use the value predictions to choose our action; in the latter case, the model explicitly gives us a policy.

    • In the academic literature, contextual bandits are technically defined as predicting only the immediate reward, but if you train them to predict summed future rewards in an episode (like a sign-up flow or chatbot conversation), then they approximate Q-learning.

    • The simplest policy gradient algorithm is called REINFORCE. It applies stochastic gradient descent in a supervised learning setup, training against the summed future rewards, on a model that ingests only the context features and predicts a probability distribution over the discrete actions it can take. Q-learning, on the other hand, can (but does not have to) apply stochastic gradient descent in the same kind of setup against the summed future rewards, on a model that concatenates the action and context features and predicts a scalar (both are sketched after this list).

    • These aspects make Q-learning more flexible, but REINFORCE was a big deal ten years ago when we hadn’t learned what tricks were necessary to make Q-learning work in complex state spaces and with non-linear value prediction models (see the DeepMind Atari paper).
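
Here is a minimal sketch contrasting the two parameterizations, assuming linear models, one-hot categorical actions, and a single made-up return; it shows the shape of one update for each, not a production implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
context = rng.normal(size=4)          # featurized state
n_actions = 3
chosen = 1                            # action actually taken
G = 2.0                               # summed future reward observed for this episode
lr = 0.1

# --- Q-learning style: model(context ++ action) -> scalar value, trained by regression ---
w_q = np.zeros(4 + n_actions)
x = np.concatenate([context, np.eye(n_actions)[chosen]])
q_pred = x @ w_q
w_q += lr * (G - q_pred) * x          # gradient step on squared error toward the return

# --- REINFORCE style: model(context) -> distribution over actions, trained by policy gradient ---
W_pi = np.zeros((n_actions, 4))
logits = W_pi @ context
probs = np.exp(logits - logits.max())
probs /= probs.sum()
grad_logp = (np.eye(n_actions)[chosen] - probs)[:, None] * context[None, :]
W_pi += lr * G * grad_logp            # increase log-prob of the chosen action, scaled by the return
print(q_pred, probs)
```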

  • Training Concepts for Reinforcement Learning:

    • Temporal Difference (TD) Learning. The Bellman equation (see below) says that we should choose the action that gives us the most reward for the current step and the most reward for all future steps in the episode assuming we keep using the best policy thereafter. But where do we get the second part? If we don’t get that answer from real-world data, we can get it from our model’s own predictions, and bootstrap our training procedure that way, propagating information from the events that produce the strongest rewards back to the otherwise boring events that led up to them. However, this procedure is unstable, so it benefits us to have two models that update each other occasionally rather than one model training on its own outputs (a sketch follows at the end of this list).

    • Double Deep Q-Learning (DDQL) is similar in spirit to TD-Learning: it also aims to increase training stability and also uses two models, but the update equations are different.

    • Actor-Critic reinforcement learning is a branch dealing specifically with how to use a Q-learning algorithm and a Policy Gradient algorithm at the same time, and how they can be used to stabilize each other. Advantage Actor Critic (A2C) uses the advantage value, which is the difference between the Q-value (using an optimal policy) and the value we expect on average given our current policy. If you make it asynchronous, then you get A3C (get it? we have an extra A), but no one likes that algorithm because it doesn’t work as well as you’d expect.
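
Here is a minimal sketch of the bootstrapping step described in the TD-Learning bullet (the linear Q models, the transition, and the sync schedule are all made up): the training target for a step is its immediate reward plus the target model’s best estimate of the value available from the next state, and only the online model takes gradient steps while the target model is synchronized occasionally.

```python
import numpy as np

rng = np.random.default_rng(2)
n_actions, d = 3, 4
gamma = 0.99                        # discount factor; the plain sum of future rewards is gamma = 1

def q_values(w, context):
    """Stand-in Q model: one score per action from a linear map on the context features."""
    return w @ context

w_online = rng.normal(size=(n_actions, d))   # the model being trained
w_target = w_online.copy()                   # occasionally-synced copy used for bootstrapping

# One hypothetical transition: (state, action, reward, next state).
s, a, r, s_next = rng.normal(size=d), 1, 1.0, rng.normal(size=d)

# Bellman-style TD target: immediate reward plus the best bootstrapped value from the next state.
td_target = r + gamma * np.max(q_values(w_target, s_next))
td_error = td_target - q_values(w_online, s)[a]

# Gradient step on the online model only; the target model stays frozen between syncs.
lr = 0.05
w_online[a] += lr * td_error * s

# Every N updates (hypothetical schedule), copy the online weights into the target model.
w_target = w_online.copy()
```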

  • Entropy, Cross-entropy and KL-Divergence

    • A lot of the more confusing concepts in machine learning come down to the fact that when you combine two things, we expect some quantities to grow by addition and others by multiplication. For example, if you have two independent events, each with its own probability of happening, the information provided by those events should add, but the combined probability of both occurring is the product of their separate probabilities. When relating the quantities that multiply to the quantities that add, we use logarithms, and exponentials for the reverse.

    • For example, if you want to know how similar one distribution is to another, you can ask how much information you would need on average to describe an event from one distribution using a code built for the other. This is cross-entropy, which is used in many loss functions, and is defined as a weighted sum of log probabilities, where the weights come from the distribution you want to describe, and the logarithms are of the probabilities from the distribution you’re using to perform the description (a numeric check follows below).

    • The self-entropy answers the same question when a distribution describes itself: how much information is needed on average to describe an event from a distribution using that distribution’s own code. And the KL-divergence is a pretentious way of saying cross-entropy minus self-entropy. Everyone loves that term; it makes you look really smart and esoteric.
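
Here is a quick numeric check of that relationship with two made-up discrete distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # the distribution we want to describe
q = np.array([0.4, 0.4, 0.2])   # the distribution we use to do the describing

entropy = -np.sum(p * np.log(p))          # H(p): cost of describing p with its own code
cross_entropy = -np.sum(p * np.log(q))    # H(p, q): cost of describing p using q's code
kl_divergence = np.sum(p * np.log(p / q))  # KL(p || q)

# KL-divergence is just cross-entropy minus self-entropy.
assert np.isclose(kl_divergence, cross_entropy - entropy)
print(entropy, cross_entropy, kl_divergence)
```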

  • Model-free reinforcement learning

    • If we learn a function to approximate our summed future rewards, as we do with deep Q-learning, this is a model-free reinforcement learning solution, even if we use a neural network or linear regression to approximate Q. The “model” in “model-free” reinforcement learning refers to a prediction model of the environment: what the next state or reward will be when an action is taken. That is, “model-based” reinforcement learning refers to a simulation model, NOT just any function approximation. This is really confusing, because the function approximator is often called a prediction model, and in Q-learning you really are learning a prediction of sorts. Unless you want to go down the laborious and experimental path of building user simulators, stick with model-free reinforcement learning when working with real-world bandits, which is what the vast majority of the literature in this field does anyway. See this StackExchange post for a great write-up on the jargon.

Douglas Mason