Here’s a list of answers.
Model Free Prediction
- Definitions
- MC is model-free: it learns directly from complete episodes of experience, with no bootstrapping (i.e. it does not use estimates of future return). MC can only be applied to episodic (terminating) MDPs. The algorithm is simply to simulate many episodes and average the returns, counting either only the first visit to each state (first-visit MC) or every visit (every-visit MC); see the MC sketch after this list.
- Incrementally, $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$. Instead of $\frac{1}{N(S_t)}$, sometimes just use a constant step size $\alpha$.
- TD is model-free. It learns from incomplete episodes by bootstrapping.
- MC updates $V(S_t)$ toward the actual return $G_t$; TD updates $V(S_t)$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$. TD can learn before, and without, the final outcome. TD is biased but has lower variance. MC converges to the minimum mean-squared-error solution; TD(0) converges to the solution of the maximum-likelihood Markov model.
- The $n$-step return is $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$. When $n = 1$ it is TD(0); when $n = \infty$ it is MC.
- We can combine all $n$-step returns by computing the $\lambda$-return $G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$. This is called forward-view TD($\lambda$), and it needs complete episodes.
- Backward-view TD($\lambda$) uses an eligibility trace, which combines two heuristics: frequency (credit to the most frequently visited states) and recency (credit to the most recently visited states): $E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$.
- At each step $t$, calculate the regular TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ and update every state in proportion to its trace: $V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)$. See the TD($\lambda$) sketch after this list.
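
A minimal sketch of first-visit MC prediction with the incremental-mean update above, under assumptions: `sample_episode` is a hypothetical helper that rolls out one complete episode under the policy being evaluated and returns `(state, reward)` pairs.

```python
from collections import defaultdict

def mc_prediction(sample_episode, num_episodes, gamma=1.0):
    """First-visit Monte-Carlo prediction with incremental mean updates."""
    V = defaultdict(float)   # value estimates V(s)
    N = defaultdict(int)     # visit counts N(s)

    for _ in range(num_episodes):
        episode = sample_episode()   # [(S_0, R_1), (S_1, R_2), ...] for one episode
        # Compute the return G_t for every step by scanning backwards.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        seen = set()
        for state, G in returns:
            if state in seen:        # first-visit: count only the first occurrence
                continue
            seen.add(state)
            N[state] += 1
            # Incremental mean: V(s) <- V(s) + 1/N(s) * (G - V(s));
            # a constant alpha instead of 1/N(s) gives the running-mean variant.
            V[state] += (G - V[state]) / N[state]
    return V
```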
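
And a sketch of backward-view TD($\lambda$) using the eligibility-trace and TD-error updates above; `run_step` is a hypothetical one-step simulator of the policy being evaluated, and with `lam=0` the loop reduces to plain TD(0).

```python
from collections import defaultdict

def td_lambda(run_step, start_state, num_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    """Backward-view TD(lambda) prediction.

    `run_step(state)` is assumed to return (reward, next_state, done) by
    following the evaluated policy for one step from `state`.
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)          # eligibility traces, reset each episode
        state, done = start_state, False
        while not done:
            reward, next_state, done = run_step(state)
            # Regular TD error: delta = R + gamma * V(S') - V(S).
            target = reward + (0.0 if done else gamma * V[next_state])
            delta = target - V[state]
            # Decay all traces, then bump the current state's trace
            # (recency + frequency heuristics).
            for s in list(E):
                E[s] *= gamma * lam
            E[state] += 1.0
            # Update every state in proportion to its trace.
            for s, e in E.items():
                V[s] += alpha * delta * e
            state = next_state
    return V
```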
Model Free Control
- On-policy learning means learning about policy $\pi$ from experience sampled from $\pi$. Off-policy learning means learning about policy $\pi$ from experience sampled from a different behaviour policy $\mu$.
- Greedy policy improvement over $V(s)$ requires a model of the MDP, but greedy policy improvement over $Q(s, a)$ is model-free.
- $\epsilon$-greedy action selection: with probability $1-\epsilon$ choose the greedy action, with probability $\epsilon$ choose an action at random (see the sketch after this list).
- Greedy in the Limit with Infinite Exploration (GLIE). It requires two things:
- All state-action pairs are explored infinitely many times.
- The policy converges on a greedy policy.
- For example, $\epsilon$-greedy is GLIE if $\epsilon_k = \frac{1}{k}$, where $k$ is the episode number.
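
A sketch of $\epsilon$-greedy action selection together with the GLIE schedule $\epsilon_k = 1/k$ from the bullet above; the Q-table keyed by `(state, action)` and the explicit `actions` list are assumptions for illustration.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def glie_epsilon(episode_k):
    """GLIE schedule: epsilon_k = 1/k decays to zero while every action
    keeps being explored, so the policy becomes greedy in the limit."""
    return 1.0 / episode_k
```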
Value Function Approximation
- We either approximate the state-value function, $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$, or the action-value function, $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$; a linear example is sketched after this list.
- [todo: screenshot for three types]
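
As one assumed concrete instance of $\hat{v}(s, \mathbf{w})$ (these notes don't fix a function class), here is a sketch of a linear approximator $\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w}$ with a standard semi-gradient TD(0) weight update; the feature map that produces `x` is assumed.

```python
import numpy as np

def v_hat(w, x):
    """Linear value function approximation: v_hat(s, w) = x(s)^T w."""
    return float(np.dot(x, w))

def td0_update(w, x, reward, x_next, alpha=0.01, gamma=1.0, done=False):
    """Semi-gradient TD(0) update for the linear case.

    w      : weight vector
    x      : feature vector x(S_t)      (assumed supplied by a feature map)
    x_next : feature vector x(S_{t+1})
    """
    target = reward + (0.0 if done else gamma * v_hat(w, x_next))
    delta = target - v_hat(w, x)
    # For a linear v_hat, the gradient w.r.t. w is just the feature vector x.
    return w + alpha * delta * x
```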