Doom AI
Experience Replay 
Our solution: create a "replay buffer." This stores experience tuples while interacting with the environment, and then we sample a small batch of tuples to feed our neural network.
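A minimal sketch of such a replay buffer, assuming experiences are stored as plain (s, a, r, s') tuples and sampled uniformly at random. The class name `ReplayBuffer` and its method names are illustrative choices, not taken from any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest tuple once it is full.
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement; picking tuples at random
        # breaks up the order in which they were collected.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(step, 0, 1.0, step + 1)  # dummy integer states for illustration
minibatch = buffer.sample(2)
```

With a capacity of 3, only the three most recent tuples survive the five insertions, so the minibatch is drawn from states 2, 3 and 4.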
Reducing correlation between experiences
We have another problem: we know that every action affects the next state. This produces a sequence of experience tuples that can be highly correlated.
Initialize Doom Environment E
Initialize replay Memory M with capacity N (= finite capacity)
Initialize the DQN weights w
for episode in max_episode:
    s = Environment state
    for steps in max_steps:
        Choose action a from state s using epsilon greedy.
        Take action a, get r (reward) and s' (next state)
        Store experience tuple <s, a, r, s'> in M
        s = s' (state = new_state)
        Get random minibatch of exp tuples from M
        Set Q_target = r + γ max_a' Q(s', a')
        Update w = w + α (Q_target - Q_value) * ∇w Q_value
Here is a quick summary on the Reinforcement Learning taxonomy:
On-policy vs. Off-Policy
This division is based on whether you update your Q values based on actions taken according to your current policy or not. Let's say your current policy is a completely random policy. You're in state s and take an action a that leads you to state s'. Will you update your Q(s,a) based on the best possible action you can take in s', or based on an action chosen according to your current policy (a random action)? The first method is called off-policy and the latter on-policy. E.g. Q-learning does the first and SARSA does the latter.
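The two choices can be made concrete by comparing the bootstrap targets. This is a minimal sketch, assuming a tabular Q stored as a nested dict; the function names and the toy values are hypothetical.

```python
def q_learning_target(Q, reward, next_state, gamma):
    # Off-policy: bootstrap from the best action available in s',
    # regardless of which action the agent will actually take there.
    return reward + gamma * max(Q[next_state].values())


def sarsa_target(Q, reward, next_state, next_action, gamma):
    # On-policy: bootstrap from the action the current policy chose in s'.
    return reward + gamma * Q[next_state][next_action]


# Toy table: in state "s1" the action "right" looks better than "left".
Q = {"s1": {"left": 1.0, "right": 3.0}}

# Suppose the (random) current policy happened to pick "left" in s'.
off_policy = q_learning_target(Q, reward=1.0, next_state="s1", gamma=0.5)
on_policy = sarsa_target(Q, reward=1.0, next_state="s1", next_action="left", gamma=0.5)
print(off_policy, on_policy)  # 2.5 1.5
```

The targets differ only when the policy's chosen action in s' is not the greedy one, which is exactly when the agent is exploring.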
Policy-based vs. Value-based
In policy-based methods we explicitly build a representation of a policy (a mapping π: s → a) and keep it in memory during learning.
In value-based methods we don't store any explicit policy, only a value function. The policy is implicit and can be derived directly from the value function (pick the action with the best value).
Actor-critic is a mix of the two.
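To illustrate how a value-based method carries only an implicit policy, here is a small sketch that derives the greedy policy from a tabular Q function. The table layout (a dict of dicts) and the example values are assumptions made for illustration.

```python
def greedy_policy(Q):
    # For each state, pick the action with the highest estimated value.
    return {state: max(actions, key=actions.get) for state, actions in Q.items()}


Q = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 1.5, "right": -0.3},
}
print(greedy_policy(Q))  # {'s0': 'right', 's1': 'left'}
```

Nothing besides Q is stored; the policy is recomputed from the values whenever it is needed.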
Model-based vs. Model-free
The problem we're often dealing with in RL is that whenever you are in state s and take an action a, you might not necessarily know the next state s' that you'll end up in (the environment influences the agent).
In a model-based approach you either have access to the model (environment), so you know the probability distribution over the states you may end up in, or you first try to build a model (often an approximation) yourself. This can be useful because it allows you to do planning (you can "think" about making moves ahead without actually performing any actions).
In a model-free approach you're not given a model and you're not trying to explicitly figure out how it works. You just collect some experience and then derive a (hopefully) optimal policy.
# https://www.geeksforgeeks.org/sarsa-reinforcement-learning/
echo 'Making AI agents for playing games like Taxi-v3, Atari Space Invaders, Doom.
• Used RL concepts like policy-based methods, value-based methods and greedy policy simplification.
• Added deep learning to Q-learning to work with an environment with millions of states.
• Explored how LSTM and CNN networks work to make more efficient use of observed experience.
• Improved deep Q-learning using double DQNs (DDQN) and Prioritized Experience Replay.
• Implemented the Asynchronous Advantage Actor-Critic (A3C) algorithm in TensorFlow.'
You can have an on-policy RL algorithm that is value-based. An example of such an algorithm is SARSA, so not all value-based algorithms are off-policy. A value-based algorithm is just an algorithm that estimates the policy by first estimating the associated value function.
To understand the difference between on-policy and off-policy, you need to understand that there are two phases of an RL algorithm: the learning (or training) phase and the inference (or behavior) phase (after the training phase). The distinction between on-policy and off-policy algorithms only concerns the training phase.
During the learning phase, the RL agent needs to learn an estimate of the optimal value (or policy) function. Given that the agent still does not know the optimal policy, it often behaves sub-optimally. During training, the agent faces a dilemma: the exploration-exploitation dilemma. In the context of RL, exploration and exploitation are different concepts: exploration is the selection and execution (in the environment) of an action that is likely not optimal (according to the knowledge of the agent), and exploitation is the selection and execution of an action that is optimal according to the agent's knowledge (that is, according to the agent's current best estimate of the optimal policy).
During the training phase, the agent needs to both explore and exploit: exploration is required to discover more about the optimal strategy, but exploitation is also required to learn even more about the already visited and partially known states of the environment. The agent thus can't just exploit the already visited states; it also needs to explore possibly unvisited states, which often requires performing a sub-optimal action.
An off-policy algorithm is an algorithm that, during training, uses a behavior policy (that is, the policy it uses to select actions) that is different from the target policy it tries to estimate (the optimal policy). For example, Q-learning often uses an ε-greedy policy (with probability ε it chooses a random or exploratory action, and with probability 1−ε it chooses the action that is optimal according to its current best estimate of the optimal policy) to behave (that is, to exploit and explore the environment), while, in its update rule, because of the max operator, it assumes that the greedy action (that is, the current optimal action in a given state) is chosen.
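The ε-greedy behavior policy described above can be sketched in a few lines. This is an illustrative version, assuming the Q values of the current state are given as a list indexed by action; the function name and the optional `rng` parameter are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index for the current state."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest current estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9, 0.4]
action = epsilon_greedy(q_values, epsilon=0.1)
```

With `epsilon=0` the function always returns the greedy action (index 1 here), and with `epsilon=1` it always picks uniformly at random.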
An on-policy algorithm is an algorithm that, during training, chooses actions using a policy that is derived from the current estimate of the optimal policy, while the updates are also based on the current estimate of the optimal policy. For example, SARSA is an on-policy algorithm because it doesn't use the max operator in its update rule: it bootstraps from the action that its behavior policy actually selects in the next state.
The difference between Q-learning (off-policy) and SARSA (on-policy) is that the Q-learning update rule uses the max operator, while the SARSA update rule does not.
In the case of policy-based or policy search algorithms (e.g. REINFORCE), the distinction between on-policy and off-policy is often not made because, in this context, there isn't usually a clear separation between a behavior policy (the policy used to behave during training) and a target policy (the policy to be estimated).
You can think of actor-critic algorithms as both value-based and policy-based because they use both a value function and a policy function.
The usual examples of model-based algorithms are value iteration and policy iteration, which are algorithms that use the transition and reward functions (of the given Markov decision process) to estimate the value function. However, you can also have on-policy, off-policy, value-based or policy-based algorithms that are model-based, that is, that use a model of the environment in some way.
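As a rough illustration of planning with a known model, here is a sketch of value iteration on a made-up two-state MDP. The transition table `P`, the reward table `R`, the state names and the discount factor are all assumptions chosen so the result is easy to check by hand.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[s][a] is a list of (probability, next_state); R[s][a] is the reward."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup, using the known transition model.
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


# Hypothetical deterministic MDP: staying in B pays 2, moving from A to B pays 1.
P = {
    "A": {"stay": [(1.0, "A")], "go": [(1.0, "B")]},
    "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]},
}
R = {
    "A": {"stay": 0.0, "go": 1.0},
    "B": {"stay": 2.0, "go": 0.0},
}
V = value_iteration(P, R, gamma=0.5)
```

For this toy problem the values converge to roughly V(B) = 2 / (1 - 0.5) = 4 and V(A) = 1 + 0.5 * 4 = 3, and no interaction with an environment is needed to compute them.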