Temporal Difference (TD)

Simple Introduction

To discuss these algorithms I’m going to try to explain them in a simple way; feel free to reference the David Silver lectures or the Sutton and Barto book for more depth. Temporal difference learning is an agent learning from an environment through episodes, with no prior knowledge of the environment. In other words, temporal difference takes a model-free approach: you can think of it as learning from trial and error.

Let’s Learn Greek

You will notice some notation in this post, and we’ll discuss 3 algorithms: TD(0), TD(1) and TD(λ). I will show the equations for each of these and explain them, but let’s quickly define the notation for at least some of the hyperparameters (Greek letters that are sometimes intimidating).

  1. Gamma (γ): the discount rate. A value between 0 and 1. The higher the value, the less you discount future rewards.

  2. Lambda (λ): the credit assignment variable. A value between 0 and 1. The higher the value the more credit you can assign to further back states and actions.

  3. Alpha (α): the learning rate. How much of the error we accept and therefore how far we adjust our estimate toward it. A value between 0 and 1. A higher value adjusts aggressively, accepting more of the error; a smaller value adjusts conservatively, taking smaller steps toward the actual values.

  4. Delta (δ): a change or difference in value.

TD(1) Algorithm

The first algorithm we’ll start with is TD(1). TD(1) makes an update to our values in the same manner as Monte Carlo: at the end of an episode. So back to our random walk, going left or right randomly until landing in ‘A’ or ‘G’. Once the episode ends, the update is made to the prior states. As we mentioned above, the higher the lambda value, the further back credit can be assigned, and here we’re at the extreme with lambda equal to 1. This is an important distinction, because TD(1) and MC only work in episodic environments, meaning they need a ‘finish line’ to make an update.
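
Here’s a minimal sketch of such an episodic random walk in Python. The state labels ‘A’ through ‘G’ and the terminals come from the text; the start state ‘D’, the function name run_episode, and the reward of 1 only for finishing in ‘G’ are assumptions for illustration.

```python
import random

# Random walk over states 'A'..'G'; 'A' and 'G' are terminal.
STATES = ['A', 'B', 'C', 'D', 'E', 'F', 'G']

def run_episode(start='D'):
    """Walk left or right at random until landing in 'A' or 'G'.

    Returns the non-terminal states visited, in order, and the reward
    received after leaving each of them (1 only when we land in 'G').
    """
    i = STATES.index(start)
    visited, rewards = [], []
    while STATES[i] not in ('A', 'G'):
        visited.append(STATES[i])
        i += random.choice([-1, 1])                      # step left or right
        rewards.append(1.0 if STATES[i] == 'G' else 0.0)
    return visited, rewards
```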

Now let’s look at the algorithm and try to make sense of it. Gt (Figure 2) is the discounted sum of all the rewards seen in our episode. As we travel through our environment, we keep track of all the rewards and sum them together with a discount (γ):

Gt = Rt+1 + γRt+2 + γ²Rt+3 + … + γ^(T−1)RT

Reading this out loud: the immediate reward (R) at the next point in time (t+1), plus the discount (γ) times the following reward (Rt+2), and so on. You can see that we discount (γ) more heavily further into the future, up to γ^(T−1) on the final reward. So if γ = 0.2 and you’re discounting a reward 6 time steps away, its discount becomes γ^(6−1) = 0.2⁵ = 0.00032. Significantly smaller after just 6 time steps.

Figure 2: Sum of Discounted Rewards
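
As a quick sanity check, here’s a small helper that computes Gt from a list of rewards; the function name discounted_return is just for illustration.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# The worked example above: a reward 6 steps ahead is scaled by
# gamma^(6-1) = 0.2**5 = 0.00032.
print(0.2 ** 5)   # ~0.00032 (up to floating-point rounding)
```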

Now we’ll make an update to our value estimates V(S). It’s important to know that when starting out you don’t really have a good estimate. You initialize using either random values or all zeros and then make updates to that estimate. For our random walk we initialize the values of the non-terminal states ‘B’ through ‘F’ to zero, giving [0, 0, 0, 0, 0, 0, 1] for states ‘A’ through ‘G’. We leave the terminal states alone since those values are known; we’re only trying to predict the values of the non-terminal states.
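
A sketch of that initialization as a simple lookup table (the dictionary layout is just one convenient choice):

```python
# Non-terminal states 'B'..'F' start at zero; the terminal values
# V('A') = 0 and V('G') = 1 are known and are never updated.
V = {s: 0.0 for s in 'ABCDEF'}
V['G'] = 1.0
```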

We’ll take the sum of discounted rewards from above, Gt, that we observed during our episode and subtract our prior estimate from it. This is called the TD error: the observed return minus the previous estimate. Then we multiply by an alpha (α) term to decide how much of that error we want to update by. Lastly, we make the update by simply adding the adjusted TD error to our previous estimate V(St) (Figure 3):

V(St) ← V(St) + α[Gt − V(St)]

Figure 3: TD(1) Update Value toward Actual Return
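
In code, the whole update in Figure 3 is one line; the function name td1_update is illustrative.

```python
def td1_update(V, state, G, alpha):
    """Move V(S_t) toward the observed return G_t by a fraction alpha."""
    td_error = G - V[state]          # observed return minus prior estimate
    V[state] += alpha * td_error
    return V
```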

So that’s one episode. We did one random walk and accumulated rewards. We then took the return from each time step and compared it to our original estimate of the values (all zeros). We weight the difference and adjust our prior estimate, then start over again. You just learned the TD(1), or MC, update!
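
Putting the pieces together, one episode of this update looks roughly like the sketch below, reusing run_episode, discounted_return, td1_update, and the value table V from the snippets above. The hyperparameter values are assumptions, not anything fixed by the text.

```python
gamma, alpha = 1.0, 0.1          # assumed settings for the random walk

visited, rewards = run_episode()
for t, state in enumerate(visited):
    G = discounted_return(rewards[t:], gamma)   # return observed from time t
    td1_update(V, state, G, alpha)

print(V)   # visited states start to move away from zero toward their true values
```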
