These are the notes taken during the RL Course by David Silver.
Lecture 4 · Model-free prediction

Contents: Introduction · Monte-Carlo learning · First-visit Monte-Carlo policy evaluation · Every-visit Monte-Carlo policy evaluation · Incremental Monte-Carlo updates · Temporal Difference learning · MC and TD learning · Advantages and disadvantages · Bias-variance trade-off · Batch MC and TD · Certainty equivalence · Markov property · Bootstrapping and sampling · TD(λ) · λ-return · Eligibility trace · Backward view TD(λ)
In the last lecture we assumed that we knew a model of the environment. Now we drop that assumption: in this lecture we aim to estimate the value function of a given policy when we don't know how the environment works.
These methods learn directly from experience.
We use complete episodes to learn. A caveat is that all episodes must terminate.
MC estimates value as mean return: the expectation $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ is replaced by the empirical mean of the observed returns.
To evaluate state $s$, at the first time-step $t$ that state $s$ is visited in an episode: increment the counter $N(s) \leftarrow N(s) + 1$, add the return to the total $S(s) \leftarrow S(s) + G_t$, and estimate the value by the mean return $V(s) = S(s)/N(s)$.
By the law of large numbers, $V(s)$ approaches the true value $v_\pi(s)$ as $N(s) \rightarrow \infty$.
In every-visit MC, we consider every visit to state $s$ in an episode to estimate the value; all the other ideas and steps are the same.
In this way, the counter $N(s)$ can be incremented several times in the same episode.
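A minimal sketch of Monte-Carlo policy evaluation covering both variants, assuming episodes are recorded as lists of `(state, reward)` pairs generated by the policy being evaluated (the function name, episode format and `first_visit` flag are illustrative, not from the lecture):

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Monte-Carlo policy evaluation from complete episodes.

    episodes: iterable of episodes, each a list of (state, reward) pairs,
              where reward is the reward received on leaving that state.
    Returns a dict state -> V(s), the mean of the observed returns.
    """
    returns_sum = defaultdict(float)   # S(s): total return collected from s
    counts = defaultdict(int)          # N(s): number of counted visits to s
    V = {}

    for episode in episodes:
        # Compute the return G_t at every time step by scanning backwards.
        G = 0.0
        step_returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            step_returns.append((state, G))
        step_returns.reverse()         # back to chronological order

        seen = set()
        for state, G in step_returns:
            if first_visit and state in seen:
                continue               # first-visit MC: count only the first occurrence
            seen.add(state)
            counts[state] += 1
            returns_sum[state] += G
            V[state] = returns_sum[state] / counts[state]

    return V
```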
Incremental mean: the mean of a sequence $x_1, x_2, \ldots$ can be computed incrementally: $\mu_k = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$. This follows from writing $\mu_k = \frac{1}{k}\big(x_k + (k-1)\mu_{k-1}\big)$ and rearranging.
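A quick numerical check of the incremental mean formula (the numbers are arbitrary):

```python
xs = [4.0, 7.0, 1.0, 9.0]

mu = 0.0
for k, x in enumerate(xs, start=1):
    mu = mu + (x - mu) / k   # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k

assert abs(mu - sum(xs) / len(xs)) < 1e-12   # matches the batch mean
```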
We can incrementally update the mean in an MC learning scenario in a similar way: after each episode, for each state $S_t$ with return $G_t$, set $N(S_t) \leftarrow N(S_t) + 1$ and $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)$.
In non-stationary problems, we can track a running mean, i.e. forget old episodes: $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)$.
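A sketch of the same incremental update with a constant step size $\alpha$, which keeps forgetting old episodes and so suits non-stationary problems (same hypothetical episode format as above):

```python
def mc_prediction_incremental(episodes, V, gamma=1.0, alpha=0.1):
    """Every-visit MC with a constant step size: V(S_t) += alpha * (G_t - V(S_t))."""
    for episode in episodes:
        G = 0.0
        step_returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            step_returns.append((state, G))
        for state, G in reversed(step_returns):   # chronological order
            v = V.get(state, 0.0)
            V[state] = v + alpha * (G - v)
    return V
```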
TD methods learn from actual experience, like MC methods.
The difference is that TD can learn from incomplete episodes, by bootstrapping.
Bootstrapping: updating a guess towards a guess.
If we take the incremental Monte-Carlo update equation $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)$,
and replace the actual return $G_t$ by the estimated return $R_{t+1} + \gamma V(S_{t+1})$, also called the TD target,
we obtain TD(0): $V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)$.
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.
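A minimal TD(0) sketch; unlike the MC code above, it updates after every single step, so it does not need to wait for the episode to finish. The `(state, reward, next_state, done)` transition format is an assumption for illustration:

```python
def td0(V, transitions, gamma=1.0, alpha=0.1):
    """TD(0): V(S_t) += alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t))."""
    for state, reward, next_state, done in transitions:
        v_next = 0.0 if done else V.get(next_state, 0.0)
        td_target = reward + gamma * v_next            # R_{t+1} + gamma * V(S_{t+1})
        td_error = td_target - V.get(state, 0.0)       # delta_t
        V[state] = V.get(state, 0.0) + alpha * td_error
    return V
```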
The return $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$.
The true TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an unbiased estimate of $v_\pi(S_t)$.
The TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$.
However, the TD target has much lower variance than the return: the return depends on many random actions, transitions and rewards, whereas the TD target depends on only one random action, transition and reward.
MC has high variance, zero bias
TD has low variance, some bias
What if the experience was finite? Do these algorithms converge given a finite batch of experience?
Several techniques to estimate the value function:
Bootstrapping: the update involves an estimate (DP and TD bootstrap, MC does not).
Sampling: the update samples an expectation (MC and TD sample, DP does not).

The choice of algorithm is not just TD or MC. There is in fact a family of algorithms lying between these two cases, containing both TD(0) and MC as special cases.

We can obtain this using n-step returns.
Define the n-step return: $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$.
n-step temporal-difference learning: $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^{(n)} - V(S_t)\big)$.
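A sketch of the n-step return computed from a recorded episode; `n_step_return` is a hypothetical helper, with the episode stored as a list of states $S_0, \ldots, S_T$ and a list of rewards whose entry $k$ is $R_{k+1}$:

```python
def n_step_return(states, rewards, V, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + gamma^n * V(S_{t+n})."""
    T = len(rewards)
    end = min(t + n, T)
    # Discounted sum of the next n rewards, truncated at termination.
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    # Bootstrap from the state reached after n steps, unless the episode has ended.
    if t + n < T:
        G += gamma ** n * V.get(states[t + n], 0.0)
    return G
```

The corresponding n-step TD update is then `V[S_t] += alpha * (G - V[S_t])`.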
Averaging n-step returns: We can average n-step returns with different n (e.g. average 2-step and 4-step returns)
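For example, reusing the hypothetical `n_step_return` helper above on a toy episode:

```python
states  = ["A", "B", "C", "D", "terminal"]   # S_0 .. S_T
rewards = [0.0, 0.0, 1.0, 2.0]               # R_1 .. R_T
V = {"A": 0.0, "B": 0.5, "C": 1.0, "D": 1.5}

# Equal-weight average of the 2-step and 4-step returns from t = 0.
G_avg = (0.5 * n_step_return(states, rewards, V, t=0, n=2)
         + 0.5 * n_step_return(states, rewards, V, t=0, n=4))
```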
TD(λ) combines all n-step returns $G_t^{(n)}$, weighting them by $(1-\lambda)\lambda^{n-1}$; this gives the λ-return $G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$.
Update: $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^{\lambda} - V(S_t)\big)$.
This is called forward-view TD(λ), because we look forward in time towards future rewards; like MC, it can only be computed from complete episodes.
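A forward-view sketch that builds on `n_step_return`: for a terminated episode, the truncated n-step returns get weights $(1-\lambda)\lambda^{n-1}$ and the remaining weight $\lambda^{T-t-1}$ falls on the full (Monte-Carlo) return:

```python
def lambda_return(states, rewards, V, t, lam, gamma=1.0):
    """G_t^lambda = (1 - lambda) * sum_n lambda^(n-1) * G_t^(n) for an episodic task."""
    T = len(rewards)
    G_lam = 0.0
    # n-step returns that still bootstrap (n = 1 .. T - t - 1)
    for n in range(1, T - t):
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(states, rewards, V, t, n, gamma)
    # The remaining weight goes to the full return up to termination.
    G_lam += lam ** (T - t - 1) * n_step_return(states, rewards, V, t, T - t, gamma)
    return G_lam
```

The forward-view update is then `V[S_t] += alpha * (G_lam - V[S_t])`.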
An eligibility trace combines the frequency heuristic (assign credit to the most frequently visited states) and the recency heuristic (assign credit to the most recently visited states): $E_0(s) = 0$, $E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$. In backward-view TD(λ) we keep an eligibility trace for every state and update every state's value in proportion to the TD error $\delta_t$ and its trace: $V(s) \leftarrow V(s) + \alpha\,\delta_t\,E_t(s)$.
When $\lambda = 0$, only the current state is updated, which is exactly the TD(0) update. When $\lambda = 1$, credit is deferred until the end of the episode and the total (offline) update is equivalent to every-visit MC.
Theorem: the sum of offline updates is identical for forward-view and backward-view TD(λ): $\sum_{t=1}^{T} \alpha\,\delta_t\,E_t(s) = \sum_{t=1}^{T} \alpha\big(G_t^{\lambda} - V(S_t)\big)\mathbf{1}(S_t = s)$.
The equality can be proved by expanding the λ-return into a (telescoping) sum of discounted TD errors.
This result has been extended to online updates as well.
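A backward-view TD(λ) sketch with eligibility traces, run online over the same hypothetical `(state, reward, next_state, done)` transition stream used in the TD(0) sketch:

```python
from collections import defaultdict

def td_lambda_backward(V, transitions, lam=0.9, gamma=1.0, alpha=0.1):
    """Backward-view TD(lambda): every state is updated in proportion to the
    TD error delta_t and its eligibility trace E_t(s)."""
    E = defaultdict(float)                     # eligibility traces, E_0(s) = 0
    for state, reward, next_state, done in transitions:
        v_next = 0.0 if done else V.get(next_state, 0.0)
        delta = reward + gamma * v_next - V.get(state, 0.0)   # TD error delta_t
        E[state] += 1.0                        # E_t(s) = gamma*lam*E_{t-1}(s) + 1(S_t = s)
        for s in list(E):
            V[s] = V.get(s, 0.0) + alpha * delta * E[s]
            E[s] *= gamma * lam                # decay every trace for the next step
        if done:
            E.clear()                          # traces are reset between episodes
    return V
```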
