Reinforcement Learning
Policy gradient
- Play an Atari game (Pong) using deep reinforcement learning
- Algorithm (a code sketch follows after the result below):
Update per step $t$: $\theta \leftarrow \theta+\alpha\nabla_\theta \log \pi_\theta (s_t, a_t) v_t$
$s_t$: state, $a_t$: action, $v_t$: long-term value, $\pi_\theta(s,a) = P[a|s, \theta]$
- Result: Baseline: 3 points
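A minimal REINFORCE-style sketch of the update $\theta \leftarrow \theta+\alpha\nabla_\theta \log \pi_\theta (s_t, a_t) v_t$ in PyTorch. The state dimension, action count, network size, and learning rate below are illustrative assumptions; a real Pong agent would use a convolutional network over stacked frames.

```python
import torch
import torch.nn as nn

# Assumed toy sizes: 4-dim state, 2 actions (not the Pong setup).
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """states: (T, 4) float tensor, actions: (T,) long tensor, returns: (T,) values v_t."""
    log_probs = torch.log_softmax(policy(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(a_t|s_t)
    # Ascend sum_t log pi_theta(a_t|s_t) * v_t, i.e. descend its negative.
    loss = -(chosen * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```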
Deep Q Learning
- Environment: Breakout
- Algorithm (a code sketch follows after the results below):
- Take some action $a_i$ and observe $(s_i, a_i, s_i', r_i)$, add it to $\mathcal{B}$ (replay buffer)
- Sample a mini-batch $\{s_j, a_j, s_j', r_j\}$ from $\mathcal{B}$ uniformly
- Compute $y_j = r_j+\gamma\max_{a_j'}Q_{\phi'}(s_j',a_j')$ using target network $Q_{\phi'}$
- $\phi\leftarrow\phi-\alpha\sum_j\frac{dQ_{\phi}}{d\phi}(s_j, a_j)(Q_\phi (s_j, a_j)-y_j)$
- Update $\phi'$: copy $\phi$ every $N$ steps
- Result: Baseline: 40 points
- Comparison between DQN and DDQN
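A minimal sketch of the DQN update with a uniform replay buffer and a periodically synced target network, following the steps above. The network architecture, buffer size, and hyperparameters are illustrative assumptions; a real Breakout agent would use a conv net over stacked frames. The `(1 - done)` mask is the standard way to zero out the bootstrap term at episode ends.

```python
import random
from collections import deque
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 4, 0.99  # assumed toy sizes
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
buffer = deque(maxlen=100_000)  # replay buffer B of (s, a, s', r, done) tuples

def dqn_update(batch_size=32):
    # Sample a mini-batch {s_j, a_j, s_j', r_j} uniformly from B.
    batch = random.sample(buffer, batch_size)
    s, a, s2, r, done = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    s2 = torch.tensor(s2, dtype=torch.float32)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)
    # y_j = r_j + gamma * max_a' Q_phi'(s_j', a'), computed with the frozen target network.
    with torch.no_grad():
        y = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q_phi(s_j, a_j)
    loss = ((q - y) ** 2).mean()                       # gradient flows through Q_phi only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # Every N steps, copy phi into phi'.
    target_net.load_state_dict(q_net.state_dict())
```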
Actor-Critic
- Environment: Pong, Breakout
- Algorithm (a code sketch follows after the results below):
- Take an action $a \sim \pi_\theta(a|s)$, get $(s,a,s',r)$
- Update $\hat{V}_\phi^{\pi}$ using target $r+\gamma\hat{V}_\phi^{\pi}(s')$
- Evaluate $\hat{A}^{\pi}(s,a)=r(s,a)+\gamma\hat{V}_\phi^{\pi}(s')-\hat{V}_\phi^{\pi}(s)$
- $\nabla_\theta J(\theta)\approx \nabla_\theta\log \pi_\theta (a|s) \hat{A}^\pi (s,a)$
- $\theta\leftarrow \theta + \alpha \nabla_\theta J(\theta)$
- Result:
- A2C: Pong
- A2C: Breakout
- ACKTR: Pong
- ACKTR: Breakout
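A minimal one-step actor-critic sketch of the loop above (separate actor and critic networks, critic regressed to the bootstrapped target, actor updated with the advantage-weighted log-probability gradient). The state dimension, network sizes, and learning rates are illustrative assumptions, not the settings used for Pong/Breakout or for ACKTR's natural-gradient update.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99  # assumed toy sizes
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s2, done):
    """One transition (s, a, s', r); a is an int action index, done marks episode end."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    # Critic: regress V_phi(s) toward the target r + gamma * V_phi(s').
    with torch.no_grad():
        target = r + gamma * (0.0 if done else critic(s2).item())
    v = critic(s).squeeze()
    critic_loss = (v - target) ** 2
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # Advantage estimate A(s, a) = r + gamma * V(s') - V(s), using the updated critic.
    with torch.no_grad():
        adv = target - critic(s).item()
    # Actor: ascend grad log pi_theta(a|s) * A(s, a), i.e. descend its negative.
    logp = torch.log_softmax(actor(s), dim=-1)[a]
    actor_loss = -logp * adv
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```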