
Reinforcement Learning [Answers]

Kai · Apr 12, 2019 · 9 mins read

Here’s a list of answers.

Model Free Prediction

  1. Definitions
    1. MC is model-free. It learns from complete episodes with no bootstrapping (i.e. it never uses estimates of future rewards), so it can only be applied to episodic (terminating) MDPs. The algorithm simply simulates many episodes and, for each state, averages either the first-visit return or the every-visit return (see the MC prediction sketch after this list).
    2. $\mu_{k} = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$. Instead of $\frac{1}{k}$, a constant step size $\alpha$ is sometimes used.
    3. TD is model-free. It learns from incomplete episodes by bootstrapping.
    4. MC updates toward $G_t$; TD updates toward $R_{t+1} + \gamma V(S_{t+1})$. TD can learn online, before the final outcome is known, and even without it. The TD target is biased but has lower variance. MC converges to the minimum mean-squared-error solution, while TD(0) converges to the solution of the maximum-likelihood Markov model.
    5. $G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1}R_{t+n} + \gamma^n V(S_{t+n})$. When $n = 1$ it is TD(0); when $n = \infty$ it is MC.
    6. We can combine all $G_{t}^{(n)}$ by computing $G_{t}^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty}\lambda^{n-1}G_{t}^{(n)}$. This is called forward-view TD($\lambda$), and it needs complete episodes.
    7. The eligibility trace combines two heuristics, frequency and recency: $E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$.
    8. At each step, calculate the regular TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ and update every state by $\Delta V(s) = \alpha \delta_t E_t(s)$; this is backward-view TD($\lambda$) (see the second sketch after this list).
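
A minimal first-visit MC prediction sketch of the idea above. `sample_episode(policy)` is a hypothetical helper that returns one terminated episode as a list of `(state, reward)` pairs; the incremental-mean update from answer 2 is used in place of storing all returns.

```python
from collections import defaultdict

def mc_prediction(sample_episode, policy, num_episodes, gamma=1.0):
    V = defaultdict(float)       # value estimates V(s)
    counts = defaultdict(int)    # visit counters N(s)
    for _ in range(num_episodes):
        episode = sample_episode(policy)   # [(s_0, r_1), (s_1, r_2), ...]
        # Work backwards so G accumulates the discounted return from each state.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state in seen:              # first-visit: count each state once per episode
                continue
            seen.add(state)
            counts[state] += 1
            # Incremental mean: mu_k = mu_{k-1} + (1/k)(x_k - mu_{k-1})
            V[state] += (G - V[state]) / counts[state]
    return V
```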
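And a minimal backward-view TD($\lambda$) sketch with eligibility traces, assuming a hypothetical `env` with `reset() -> state` and `step(action) -> (next_state, reward, done)`, plus a `policy(state) -> action` function. With `lam=0` it reduces to TD(0).

```python
from collections import defaultdict

def td_lambda(env, policy, num_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)          # eligibility traces, reset each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Regular TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
            target = reward + (0.0 if done else gamma * V[next_state])
            delta = target - V[state]
            # Frequency + recency: decay all traces, bump the current state.
            for s in list(E):
                E[s] *= gamma * lam
            E[state] += 1.0
            # Update every state in proportion to its trace.
            for s, e in E.items():
                V[s] += alpha * delta * e
            state = next_state
    return V
```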

Model Free Control

  1. On-policy means learning about policy $\pi$ from experience sampled from $\pi$. Off-policy means learning about the target policy $\pi$ from experience sampled from a different behaviour policy $\mu$.
  2. Greedy policy improvement over $V(s)$ requires a model of the MDP, but greedy improvement over $Q(s, a)$ is model-free.
  3. $\epsilon$-greedy action selection: with probability $1 - \epsilon$ take the greedy action, otherwise take a random action.
  4. Greedy in the Limit with Infinite Exploration (GLIE). It has two conditions:
    1. All state-action pairs are explored infinitely many times
    2. The policy converges on a greedy policy.
    3. For example, $\epsilon$-greedy is GLIE if $\epsilon_k = \frac{1}{k}$ (see the sketch after this list).
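
A minimal $\epsilon$-greedy selection sketch with the GLIE schedule $\epsilon_k = \frac{1}{k}$. `Q` is assumed to be a dict mapping `(state, action)` to a value and `actions` the list of available actions; both are placeholders for illustration.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def glie_epsilon(k):
    # Epsilon decays as 1/k over episodes, so exploration never stops entirely
    # but the policy becomes greedy in the limit.
    return 1.0 / k

# Example usage (hypothetical Q-table):
# Q = {("s0", "left"): 0.1, ("s0", "right"): 0.5}
# a = epsilon_greedy(Q, "s0", ["left", "right"], glie_epsilon(k=10))
```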

Value Function Approximation

  1. We either use $\hat{v}(s, w) \approx v(s)$ or $\hat{q}(s, a, w) \approx q(s, a)$ (a linear-approximation sketch appears after this list).
    1. [todo: screenshot for three types]
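
A minimal sketch of one common choice of $\hat{v}(s, w)$: a linear function of features, $\hat{v}(s, w) = x(s)^\top w$, trained with a semi-gradient TD(0) update. The update rule and the `features(state)` extractor (returning a NumPy array) are assumptions for illustration, not something spelled out in the answer above.

```python
import numpy as np

def semi_gradient_td0_update(w, features, state, reward, next_state, done,
                             alpha=0.01, gamma=0.99):
    """One update of the weight vector w (a NumPy array) from a single transition."""
    x = features(state)
    v_hat = x @ w                                    # v_hat(s, w) ~= v(s)
    if done:
        target = reward
    else:
        target = reward + gamma * (features(next_state) @ w)
    # For a linear approximator the gradient w.r.t. w is just the feature vector x.
    w += alpha * (target - v_hat) * x
    return w
```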