Doom AI
Experience Replay
Our solution: create a “replay buffer.” This stores experience tuples while interacting with the environment, and then we sample small batches of tuples to feed our neural network.
Reducing correlation between experiences
We have another problem: we know that every action affects the next state, so interacting with the environment produces a sequence of experience tuples that can be highly correlated. Sampling randomly from the replay buffer breaks this correlation, as sketched below.
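Here is a minimal sketch of such a replay buffer; the capacity and batch size are illustrative choices, not values from the original text.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them uniformly."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old experiences are discarded automatically

    def add(self, state, action, reward, next_state, done):
        # Store one interaction with the environment as a tuple.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive experiences.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```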
Here is a quick summary of the Reinforcement Learning taxonomy:
On-policy vs. Off-Policy
This division is based on whether you update your Q values based on actions taken according to your current policy or not. Let's say your current policy is completely random. You're in state s and take an action a that leads you to state s′. Will you update Q(s, a) based on the best possible action you can take in s′, or based on an action chosen according to your current policy (a random action)? The first method is called off-policy and the latter on-policy. For example, Q-learning does the first and SARSA does the latter.
Policy-based vs. Value-based
In Policy-based methods we explicitly build a representation of a policy (a mapping π: s → a) and keep it in memory during learning.
In Value-based methods we don't store any explicit policy, only a value function. The policy here is implicit and can be derived directly from the value function (pick the action with the best value).
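As an illustrative contrast (the arrays and state/action counts below are assumptions for the sketch, not part of any particular algorithm): a value-based agent derives its action from a stored value function, while a policy-based agent stores and samples from the policy itself.

```python
import numpy as np

n_states, n_actions = 16, 4

# Value-based: only a value function is stored; the greedy policy is implicit.
Q = np.zeros((n_states, n_actions))               # learned action-value estimates
def value_based_action(state):
    return int(np.argmax(Q[state]))               # pick the action with the best value

# Policy-based: the policy itself is the object that is stored and learned.
policy_logits = np.zeros((n_states, n_actions))   # parameters of pi(a | s)
def policy_based_action(state, rng=np.random.default_rng()):
    probs = np.exp(policy_logits[state])
    probs /= probs.sum()
    return int(rng.choice(n_actions, p=probs))    # sample an action from pi(. | s)
```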
Actor-critic is a mix of the two.
Model-based vs. Model-free
The problem we're often dealing with in RL is that whenever you are in state s and take an action a, you don't necessarily know the next state s′ that you'll end up in (the environment influences the agent).
In the Model-based approach you either have access to the model (environment), so you know the probability distribution over the states you can end up in, or you first try to build such a model (often an approximation) yourself. This can be useful because it allows you to do planning (you can "think" about moves ahead without actually performing any actions).
In Model-free you're not given a model and you're not trying to explicitly figure out how it works. You just collect some experience and then derive a (hopefully) optimal policy from it.
You can have an on-policy RL algorithm that is value-based. An example of such an algorithm is SARSA, so not all value-based algorithms are off-policy. A value-based algorithm is simply one that estimates the policy by first estimating the associated value function.
To understand the difference between on-policy and off-policy, you need to understand that there are two phases of an RL algorithm: the learning (or training) phase and the inference (or behavior) phase (after the training phase). The distinction between on-policy and off-policy algorithms only concerns the training phase.
During the learning phase, the RL agent needs to learn an estimate of the optimal value (or policy) function. Given that it does not yet know the optimal policy, it often behaves sub-optimally, and it faces the exploration-exploitation dilemma. In the context of RL, exploration and exploitation are different concepts: exploration is the selection and execution (in the environment) of an action that is likely not optimal (according to the agent's current knowledge), and exploitation is the selection and execution of an action that is optimal according to the agent's current best estimate of the optimal policy. Both are needed during training: exploration is required to discover more about the optimal strategy, while exploitation is required to learn even more about the already visited and partially known states of the environment. The agent therefore can't just exploit the already visited states; it also needs to explore possibly unvisited states, which often means performing a sub-optimal action.
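A common way to balance the two during training is an epsilon-greedy rule. The sketch below assumes a tabular Q array and an epsilon parameter chosen by the user; both are illustrative, not taken from the text above.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng=np.random.default_rng()):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # exploration: possibly sub-optimal action
    return int(np.argmax(Q[state]))           # exploitation: current best estimate
```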
The difference between Q-learning (off-policy) and SARSA (on-policy) lies in their update rules: Q-learning uses the max operator over the actions available in the next state, while SARSA uses the action actually selected by the current policy, as sketched below.
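A side-by-side sketch of the two tabular updates; Q is a 2-D array of action values, and alpha and gamma are the usual learning-rate and discount parameters, chosen here only for illustration.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy: bootstrap from the best action in s_next (max operator),
    # regardless of what the behavior policy will actually do.
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstrap from the action a_next actually chosen
    # by the current (e.g. epsilon-greedy) policy.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```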
In the case of policy-based or policy-search algorithms (e.g. REINFORCE), the distinction between on-policy and off-policy is often not made because, in this context, there usually isn't a clear separation between a behavior policy (the policy followed during training) and a target policy (the policy to be estimated).
You can think of actor-critic algorithms as both value-based and policy-based because they use both a value function and a policy.
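A minimal one-step actor-critic sketch in those terms, assuming a tabular softmax policy (the actor) and a tabular state-value function (the critic); all names and learning rates here are illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 16, 4
actor_logits = np.zeros((n_states, n_actions))   # policy parameters (the actor)
critic_values = np.zeros(n_states)               # state-value estimates (the critic)

def actor_critic_update(s, a, r, s_next, done,
                        alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    # Critic: the TD error measures how much better or worse the outcome
    # was compared to the critic's current estimate.
    td_target = r + (0.0 if done else gamma * critic_values[s_next])
    td_error = td_target - critic_values[s]
    critic_values[s] += alpha_critic * td_error

    # Actor: move the softmax policy toward actions with positive TD error.
    probs = np.exp(actor_logits[s])
    probs /= probs.sum()
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0    # gradient of log pi(a|s) w.r.t. the logits for a softmax policy
    actor_logits[s] += alpha_actor * td_error * grad_log_pi
```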
The usual examples of model-based algorithms are value and policy iteration, which use the transition and reward functions (of the given Markov decision process) to estimate the value function. However, on-policy, off-policy, value-based, or policy-based algorithms can also be model-based, that is, they might use a model of the environment in some way.
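As a sketch of the model-based case, here is a compact value iteration assuming the transition probabilities P[s, a, s′] and expected rewards R[s, a] of the MDP are known; the array shapes and tolerance are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """P: (S, A, S) transition probabilities, R: (S, A) expected rewards."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup, possible only because the model (P, R) is known.
        Q = R + gamma * (P @ V)              # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)                # greedy policy derived from the values
    return V, policy
```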