
Reinforcement Learning [Answers]

Kai · Apr 12, 2019 · 9 mins read

Here’s a list of answers.

Model Free Prediction

  1. Definitions
    1. MC is model-free. It learns from complete episodes with no bootstrapping (i.e. it never uses estimates of future rewards), so it can only be applied to episodic (terminating) MDPs. The algorithm simply simulates many episodes and, for each state, averages either the first-visit return or the every-visit return (see the MC prediction sketch after this list).
    2. $\mu_{k} = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})$. Instead of $\frac{1}{k}$, a constant step size $\alpha$ is sometimes used.
    3. TD is model-free. It learns from incomplete episodes by bootstrapping.
    4. MC updates toward $G_t$; TD updates toward $R_{t+1} + \gamma V(S_{t+1})$. TD can learn online, before the final outcome is known, and even without it. The TD target is biased but has lower variance. MC converges to the minimum mean-squared-error solution, while TD(0) converges to the solution of the maximum-likelihood Markov model.
    5. $G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1}R_{t+n} + \gamma^n V(S_{t+n})$. When $n = 1$ it is TD(0); when $n = \infty$ it is MC.
    6. We can combine all $G_{t}^{(n)}$ by computing $G_{t}^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty}\lambda^{n-1}G_{t}^{(n)}$. This is called forward-view TD($\lambda$), and it needs complete episodes.
    7. The eligibility trace combines two heuristics, frequency and recency: $E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$.
    8. At each step, calculate the regular TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ and update every state by $\Delta V(s) = \alpha \delta_t E_t(s)$; this is backward-view TD($\lambda$) (see the second sketch after this list).
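
A minimal first-visit MC prediction sketch of the idea above. `sample_episode(policy)` is a hypothetical helper that returns one terminated episode as a list of `(state, reward)` pairs; the incremental-mean update from answer 2 is used in place of storing all returns.

```python
from collections import defaultdict

def mc_prediction(sample_episode, policy, num_episodes, gamma=1.0):
    V = defaultdict(float)       # value estimates V(s)
    counts = defaultdict(int)    # visit counters N(s)
    for _ in range(num_episodes):
        episode = sample_episode(policy)   # [(s_0, r_1), (s_1, r_2), ...]
        # Work backwards so G accumulates the discounted return from each state.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state in seen:              # first-visit: count each state once per episode
                continue
            seen.add(state)
            counts[state] += 1
            # Incremental mean: mu_k = mu_{k-1} + (1/k)(x_k - mu_{k-1})
            V[state] += (G - V[state]) / counts[state]
    return V
```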
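And a minimal backward-view TD($\lambda$) sketch with eligibility traces, assuming a hypothetical `env` with `reset() -> state` and `step(action) -> (next_state, reward, done)`, plus a `policy(state) -> action` function. With `lam=0` it reduces to TD(0).

```python
from collections import defaultdict

def td_lambda(env, policy, num_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)          # eligibility traces, reset each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Regular TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
            target = reward + (0.0 if done else gamma * V[next_state])
            delta = target - V[state]
            # Frequency + recency: decay all traces, bump the current state.
            for s in list(E):
                E[s] *= gamma * lam
            E[state] += 1.0
            # Update every state in proportion to its trace.
            for s, e in E.items():
                V[s] += alpha * delta * e
            state = next_state
    return V
```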

Model Free Control

  1. On-policy means learning about policy $\pi$ from experience sampled from $\pi$. Off-policy means learning about the target policy $\pi$ from experience sampled from a different behaviour policy $\mu$.
  2. Greedy policy improvement over $V(s)$ requires a model of the MDP, but greedy improvement over $Q(s, a)$ is model-free.
  3. $\epsilon$-greedy action selection: with probability $1 - \epsilon$ take the greedy action, otherwise take a random action.
  4. Greedy in the Limit with Infinite Exploration (GLIE). It has two conditions:
    1. All state-action pairs are explored infinitely many times
    2. The policy converges on a greedy policy.
    3. For example, $\epsilon$-greedy is GLIE if $\epsilon_k = \frac{1}{k}$ (see the sketch after this list).
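
A minimal $\epsilon$-greedy selection sketch with the GLIE schedule $\epsilon_k = \frac{1}{k}$. `Q` is assumed to be a dict mapping `(state, action)` to a value and `actions` the list of available actions; both are placeholders for illustration.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def glie_epsilon(k):
    # Epsilon decays as 1/k over episodes, so exploration never stops entirely
    # but the policy becomes greedy in the limit.
    return 1.0 / k

# Example usage (hypothetical Q-table):
# Q = {("s0", "left"): 0.1, ("s0", "right"): 0.5}
# a = epsilon_greedy(Q, "s0", ["left", "right"], glie_epsilon(k=10))
```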

Value Function Approximation

  1. We either use $\hat{v}(s, w) \approx v(s)$ or $\hat{q}(s, a, w) \approx q(s, a)$ (a linear-approximation sketch appears after this list).
    1. [todo: screenshot for three types]
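
A minimal sketch of one common choice of $\hat{v}(s, w)$: a linear function of features, $\hat{v}(s, w) = x(s)^\top w$, trained with a semi-gradient TD(0) update. The update rule and the `features(state)` extractor (returning a NumPy array) are assumptions for illustration, not something spelled out in the answer above.

```python
import numpy as np

def semi_gradient_td0_update(w, features, state, reward, next_state, done,
                             alpha=0.01, gamma=0.99):
    """One update of the weight vector w (a NumPy array) from a single transition."""
    x = features(state)
    v_hat = x @ w                                    # v_hat(s, w) ~= v(s)
    if done:
        target = reward
    else:
        target = reward + gamma * (features(next_state) @ w)
    # For a linear approximator the gradient w.r.t. w is just the feature vector x.
    w += alpha * (target - v_hat) * x
    return w
```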