These are the notes taken during the RL Course by David Silver.
Lecture 4 · Model-free prediction

Contents: Introduction · Monte-Carlo learning · First-visit Monte-Carlo policy evaluation · Every-visit Monte-Carlo policy evaluation · Incremental Monte-Carlo updates · Temporal Difference learning · MC and TD learning · Advantages and disadvantages · Bias-variance trade-off · Batch MC and TD · Certainty equivalence · Markov property · Bootstrapping and sampling · TD(λ) · λ-return · Eligibility trace · Backward view TD(λ)
In the last lecture we assumed that we knew a model of the environment. Now we drop that assumption: in this lecture we aim to estimate the value function of a given policy when we don't know how the environment works.
These methods learn directly from experience.
We use complete episodes to learn. A caveat is that all episodes must terminate.
MC estimates value as mean return: the expectation $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ is replaced by the empirical mean of the observed returns.
To evaluate state $s$, at the first time-step $t$ that state $s$ is visited in an episode: increment the counter $N(s) \leftarrow N(s) + 1$, add the return to the total $S(s) \leftarrow S(s) + G_t$, and estimate the value by the mean return $V(s) = S(s)/N(s)$.
By the law of large numbers, $V(s)$ approaches the true value $v_\pi(s)$ as $N(s) \rightarrow \infty$.
In every-visit MC, we consider every visit to state $s$ in an episode to estimate the value; all the other ideas and steps are the same.
In this way, the counter $N(s)$ can be incremented several times in the same episode.
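A minimal sketch of Monte-Carlo policy evaluation covering both variants, assuming episodes are recorded as lists of `(state, reward)` pairs generated by the policy being evaluated (the function name, episode format and `first_visit` flag are illustrative, not from the lecture):

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Monte-Carlo policy evaluation from complete episodes.

    episodes: iterable of episodes, each a list of (state, reward) pairs,
              where reward is the reward received on leaving that state.
    Returns a dict state -> V(s), the mean of the observed returns.
    """
    returns_sum = defaultdict(float)   # S(s): total return collected from s
    counts = defaultdict(int)          # N(s): number of counted visits to s
    V = {}

    for episode in episodes:
        # Compute the return G_t at every time step by scanning backwards.
        G = 0.0
        step_returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            step_returns.append((state, G))
        step_returns.reverse()         # back to chronological order

        seen = set()
        for state, G in step_returns:
            if first_visit and state in seen:
                continue               # first-visit MC: count only the first occurrence
            seen.add(state)
            counts[state] += 1
            returns_sum[state] += G
            V[state] = returns_sum[state] / counts[state]

    return V
```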
Incremental mean: the mean of a sequence $x_1, x_2, \ldots$ can be computed incrementally: $\mu_k = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$. This follows from writing $\mu_k = \frac{1}{k}\big(x_k + (k-1)\mu_{k-1}\big)$ and rearranging.
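A quick numerical check of the incremental mean formula (the numbers are arbitrary):

```python
xs = [4.0, 7.0, 1.0, 9.0]

mu = 0.0
for k, x in enumerate(xs, start=1):
    mu = mu + (x - mu) / k   # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k

assert abs(mu - sum(xs) / len(xs)) < 1e-12   # matches the batch mean
```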
We can incrementally update the mean in an MC learning scenario in a similar way: after each episode, for each state $S_t$ with return $G_t$, set $N(S_t) \leftarrow N(S_t) + 1$ and $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)$.
In non-stationary problems, we can track a running mean, i.e. forget old episodes: $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)$.
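A sketch of the same incremental update with a constant step size $\alpha$, which keeps forgetting old episodes and so suits non-stationary problems (same hypothetical episode format as above):

```python
def mc_prediction_incremental(episodes, V, gamma=1.0, alpha=0.1):
    """Every-visit MC with a constant step size: V(S_t) += alpha * (G_t - V(S_t))."""
    for episode in episodes:
        G = 0.0
        step_returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            step_returns.append((state, G))
        for state, G in reversed(step_returns):   # chronological order
            v = V.get(state, 0.0)
            V[state] = v + alpha * (G - v)
    return V
```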
TD methods learn from actual experience, like MC methods.
The difference is that TD can learn from incomplete episodes, by bootstrapping.
Bootstrapping: updating a guess towards a guess.
If we take the incremental Monte-Carlo update equation $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)$,
and replace the actual return $G_t$ by the estimated return $R_{t+1} + \gamma V(S_{t+1})$, also called the TD target,
we obtain TD(0): $V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)$.
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.
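A minimal TD(0) sketch; unlike the MC code above, it updates after every single step, so it does not need to wait for the episode to finish. The `(state, reward, next_state, done)` transition format is an assumption for illustration:

```python
def td0(V, transitions, gamma=1.0, alpha=0.1):
    """TD(0): V(S_t) += alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t))."""
    for state, reward, next_state, done in transitions:
        v_next = 0.0 if done else V.get(next_state, 0.0)
        td_target = reward + gamma * v_next            # R_{t+1} + gamma * V(S_{t+1})
        td_error = td_target - V.get(state, 0.0)       # delta_t
        V[state] = V.get(state, 0.0) + alpha * td_error
    return V
```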
The return $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$.
The true TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an unbiased estimate of $v_\pi(S_t)$.
The TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$.
However, the TD target has much lower variance than the return: the return depends on many random actions, transitions and rewards, whereas the TD target depends on only one random action, transition and reward.
MC has high variance, zero bias
TD has low variance, some bias
What if the experience was finite? Do these algorithms converge given a finite batch of experience?
Several techniques to estimate the value function:
Bootstrapping: the update involves an estimate (DP and TD bootstrap, MC does not).
Sampling: the update samples an expectation (MC and TD sample, DP does not).

The choice of algorithm is not just TD or MC. There is in fact a family of algorithms lying between these two cases, containing both TD(0) and MC as special cases.

We can obtain this using n-step returns.
Define the n-step return: $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$.
n-step temporal-difference learning: $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^{(n)} - V(S_t)\big)$.
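A sketch of the n-step return computed from a recorded episode; `n_step_return` is a hypothetical helper, with the episode stored as a list of states $S_0, \ldots, S_T$ and a list of rewards whose entry $k$ is $R_{k+1}$:

```python
def n_step_return(states, rewards, V, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + gamma^n * V(S_{t+n})."""
    T = len(rewards)
    end = min(t + n, T)
    # Discounted sum of the next n rewards, truncated at termination.
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    # Bootstrap from the state reached after n steps, unless the episode has ended.
    if t + n < T:
        G += gamma ** n * V.get(states[t + n], 0.0)
    return G
```

The corresponding n-step TD update is then `V[S_t] += alpha * (G - V[S_t])`.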
Averaging n-step returns: We can average n-step returns with different n (e.g. average 2-step and 4-step returns)
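For example, reusing the hypothetical `n_step_return` helper above on a toy episode:

```python
states  = ["A", "B", "C", "D", "terminal"]   # S_0 .. S_T
rewards = [0.0, 0.0, 1.0, 2.0]               # R_1 .. R_T
V = {"A": 0.0, "B": 0.5, "C": 1.0, "D": 1.5}

# Equal-weight average of the 2-step and 4-step returns from t = 0.
G_avg = (0.5 * n_step_return(states, rewards, V, t=0, n=2)
         + 0.5 * n_step_return(states, rewards, V, t=0, n=4))
```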
TD(λ) combines all n-step returns $G_t^{(n)}$, weighting them by $(1-\lambda)\lambda^{n-1}$; this gives the λ-return $G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$.
Update: $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^{\lambda} - V(S_t)\big)$.
This is called forward-view TD(λ), because we look forward in time towards future rewards; like MC, it can only be computed from complete episodes.
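A forward-view sketch that builds on `n_step_return`: for a terminated episode, the truncated n-step returns get weights $(1-\lambda)\lambda^{n-1}$ and the remaining weight $\lambda^{T-t-1}$ falls on the full (Monte-Carlo) return:

```python
def lambda_return(states, rewards, V, t, lam, gamma=1.0):
    """G_t^lambda = (1 - lambda) * sum_n lambda^(n-1) * G_t^(n) for an episodic task."""
    T = len(rewards)
    G_lam = 0.0
    # n-step returns that still bootstrap (n = 1 .. T - t - 1)
    for n in range(1, T - t):
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(states, rewards, V, t, n, gamma)
    # The remaining weight goes to the full return up to termination.
    G_lam += lam ** (T - t - 1) * n_step_return(states, rewards, V, t, T - t, gamma)
    return G_lam
```

The forward-view update is then `V[S_t] += alpha * (G_lam - V[S_t])`.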
An eligibility trace combines the frequency heuristic (assign credit to the most frequently visited states) and the recency heuristic (assign credit to the most recently visited states): $E_0(s) = 0$, $E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$. In backward-view TD(λ) we keep an eligibility trace for every state and update every state's value in proportion to the TD error $\delta_t$ and its trace: $V(s) \leftarrow V(s) + \alpha\,\delta_t\,E_t(s)$.
When $\lambda = 0$, only the current state is updated, which is exactly the TD(0) update. When $\lambda = 1$, credit is deferred until the end of the episode and the total (offline) update is equivalent to every-visit MC.
Theorem: the sum of offline updates is identical for forward-view and backward-view TD(λ): $\sum_{t=1}^{T} \alpha\,\delta_t\,E_t(s) = \sum_{t=1}^{T} \alpha\big(G_t^{\lambda} - V(S_t)\big)\mathbf{1}(S_t = s)$.
The equality can be proved by expanding the λ-return into a (telescoping) sum of discounted TD errors.
This result has been extended to online updates as well.
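A backward-view TD(λ) sketch with eligibility traces, run online over the same hypothetical `(state, reward, next_state, done)` transition stream used in the TD(0) sketch:

```python
from collections import defaultdict

def td_lambda_backward(V, transitions, lam=0.9, gamma=1.0, alpha=0.1):
    """Backward-view TD(lambda): every state is updated in proportion to the
    TD error delta_t and its eligibility trace E_t(s)."""
    E = defaultdict(float)                     # eligibility traces, E_0(s) = 0
    for state, reward, next_state, done in transitions:
        v_next = 0.0 if done else V.get(next_state, 0.0)
        delta = reward + gamma * v_next - V.get(state, 0.0)   # TD error delta_t
        E[state] += 1.0                        # E_t(s) = gamma*lam*E_{t-1}(s) + 1(S_t = s)
        for s in list(E):
            V[s] = V.get(s, 0.0) + alpha * delta * E[s]
            E[s] *= gamma * lam                # decay every trace for the next step
        if done:
            E.clear()                          # traces are reset between episodes
    return V
```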
