By 이태호 in Study — Nov 2, 2023

RL-5. Model-free Prediction

저번 강까지는 known MDP에서 dynamic programming을 통한 planning을 다뤘다. 이번 강부터는 unknown MDP에서 model-free prediction을 다룬다.

Monte Carlo Algorithms

샘플링을 통해 학습할 수 있다
Monte Carlo: direct sampling of episodes
model-free: MDP에 대한 knowledge가 필요하지 않음. 샘플링만으로 해결함.

대표적인 예시로 2강에서 다뤘던 multi-armed bandit이 있다.

Contextual Bandits: bandit with state

Introduction Function Approximation

Value Function Approximation

lookup tables
- 모든 state s는 entry v(s)를 갖고 있다
- 혹은 모든 state-action pair s, a는 entry q(s, a)를 갖고 있다
large MDP로 인해 발생하는 문제점
- 메모리 상에 너무 많은 state를 들고 있어야 함
- 각각의 state는 fully observable하지 않음
function approximation을 통해 value function을 추정해야 함
- parameter $w$를 업데이트함
agent state
- $S_t=u_w(S_{t-1}, A_{t-1}, O_t)$
- $O_t$가 새로 들어가게 된 이유: partial observable하기 때문에

Linear Function Approximation

제일 간단한 예시 중 하나가 linear 함수를 이용하는 것이다. agent state에서 feature로 가는 linear 함수가 된다.

$$v_w(s) = w^T x(s) = \sum_{j=1} ^n x_j(s)w_j$$

objective function is quadratic in w

$$ L(w) = \mathbb{E}_{S \sim d}[(v_\pi(S) - w^T x(S))^2]$$

update는 일반적인 gradient descent와 동일하다.

Model-Free Prediction: MC

Monte-Carlo Policy Evaluation

$$v_\pi(s) = \mathbb{E}[G_t|S_t=s,\pi]$$

로 표현되는데 expected return을 sample된 average return으로 대체함

이것이 monte-carlo policy evaluation

Disadvantages of Monte-Carlo Learning

MC algorithms은 value prediction을 학습하는데에 사용할 수 있다
그러나 episode가 너무 길면 학습이 느리다

Temporal Difference Learning

TD Learning by Sampling Bellman Equations

$G_t$는 t가 무한일 때까지 전부 계산하지만 TD는 그걸 다 알 수 없는 환경이기에 다음 스텝만을 계산한다.

Bootstrapping and Sampling

Bootstrapping: update involves an estimate
1. MC X
2. DP O
3. TD O
Sampling: update samples an expectation
1. MC O
2. DP X
3. TD O

SARSA

action-value에서도 똑같은 알고리즘을 적용할 수 있다

TD is model-free(no knowledge of MDP) and learn directly from experience
TD can learn from incomplete episodes, by bootstrapping
TD can learn during each episode

TD vs MC

TD can learn before knowing the final outcome
TD can learn without the final outcome
TD is independent of the temporal span of the prediction
TD target is a biased estimate of $v_\pi(S_t)$
TD target has lower variance
- return depends on many random actions, transitions, rewards
- TD target depends on one random action, transition, reward

Unified View of RL

너비와 깊이를 전체 다 하면 전체 탐색
너비만 전체 다 하면 DP
깊이만 전체 다 하면 Monte-Carlo
너비, 깊이 모두 샘플링하면 TD

Multi-Step Updates

TD는 최종 보상을 보지 않고 업데이트하기에 부정확할 수 있다
정보의 back prop이 천천히 된다
MC는 정보가 빠르게 propagate되지만 업데이트가 더 노이지하다

TD와 MC 사이를 골라보자!

MC-1step이 원래 MC, MC-무한step이 TD가 된다. 적당한 step 수를 찾아 TD와 MC의 장점을 모두 노려보자.

Mixing multi-step returns

multi-step과 1step 을 섞어보자!
1step, 2step, ..., n step을 섞어보자!

$lambda$가 0이면 1step TD, 1이면 MC가 된다.

TD는 bias가 있고 MC는 variance가 있는데 그 사이의 장점을 취할 수 있게 된다.

Independence of temporal span

TD는 즉시 업데이트하기에 수명과 독립적
MC 혹은 multi-step returns은 수명에 의존적

Eligibility traces

어렵다...

다음 주에는 Model-free Control에 대해 배울 예정이다.

RL-5. Model-free Prediction

Monte Carlo Algorithms

Introduction Function Approximation

Linear Function Approximation