Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew Barto: link
Introduction
Reinforcement Learning: An area of machine learning concerned with learning what to do, i.e. how to map situations to actions, so as to maximize cumulative reward.
Agents and Environments
At each timestep $t$:
- Agents: Receive Reward $R_t$ and Observation $O_t$, execute Action $A_t$
- Environments: Receive Action $A_t$, then emit Reward $R_{t+1}$ and Next Observation $O_{t+1}$
The sequence of observations, actions, and rewards is called the history.
$$\begin{aligned} H_t = O_1, R_1, A_1, … , A_{t-1}, O_t, R_t \end{aligned}$$
State $(S_t)$: the information used to determine what happens next. Both the agent and the environment have their own state, and the two need not agree with each other.
- Fully observable environments: $O_t = S_t$. The agent directly observes the environment state. This setting is formalized as a Markov Decision Process (MDP).
- Partially observable environments: $O_t \neq S_t$. The agent only observes part of the environment state. This setting is formalized as a Partially Observable Markov Decision Process (POMDP).
Markov Decision Processes (MDP)
A Markov Decision Process (MDP) is an environment that can be defined as a 5-tuple:
$$\begin{aligned} (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma) \end{aligned}$$
where
$\mathcal{S}$: state space (set of states)
$\mathcal{A}$: action space (set of actions)
$\mathcal{P}$: state transition probability matrix, $\mathcal{P}_{ss'}^{a}=\mathbb{P}[S_{t+1}=s'|S_t=s, A_t=a]$
$\mathcal{R}$: reward function, $\mathcal{R}_{s}^{a} = \mathbb{E}[R_{t+1}|S_t=s, A_t=a]$
$\gamma$: discount factor $\gamma \in [0, 1]$
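As a concrete illustration, a small finite MDP can be stored as plain arrays. A minimal sketch in Python, where the two-state layout and the names `P`, `R`, `gamma` are assumptions made up for illustration:

```python
import numpy as np

# A toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
n_states, n_actions = 2, 2

# P[s, a, s']: probability of moving to s' after taking action a in state s.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1
])

# R[s, a]: expected immediate reward for taking action a in state s.
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

gamma = 0.9  # discount factor

# Each row P[s, a, :] must be a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)
```

The tabular sketches later in these notes assume arrays of this form.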
Markov Property
$$\begin{aligned} \mathbb{P}[S_{t+1}|S_t] = \mathbb{P}[S_{t+1}|S_1, S_2, … , S_t] \end{aligned}$$
Policy
$$\begin{aligned} \pi(a|s) = \mathbb{P}[A_{t} = a|S_t = s] \end{aligned}$$
Return
$$\begin{aligned} G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^\infty \gamma^k R_{t+k+1} \end{aligned}$$
where $\gamma \in [0, 1]$
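A short sketch of how a return is computed from a finite list of sampled rewards (the reward values and discount factor below are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... for the first timestep of `rewards`."""
    g = 0.0
    # Accumulate backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```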
Value function & Action-value function (Q-value function)
$$\begin{aligned} v_\pi(s) = \mathbb{E_\pi}[G_t|S_t = s] \end{aligned}$$
$$\begin{aligned} q_\pi(s, a) = \mathbb{E_\pi}[G_t|S_t = s, A_t = a] \end{aligned}$$
Bellman equation
Let’s derive the Bellman expectation equation for $v_\pi(s)$.
$$ \begin{aligned} v_{\pi}(s) &= \mathbb{E_\pi}[G_t \mid S_t = s] \\ &= \mathbb{E_\pi} \left[\sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t = s \right] \\ &= \mathbb{E_\pi} \left[R_{t+1} + \gamma \sum_{k=0}^\infty \gamma^k R_{t+k+2} \mid S_t = s \right] \\ &= \mathbb{E_\pi} \left[R_{t+1} + \gamma G_{t+1} \mid S_t = s \right] \\ &= \mathbb{E_\pi} \left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right] \end{aligned} $$
The action-value function satisfies the analogous recursion:
$$ \begin{aligned} q_\pi(s, a) &= \mathbb{E_\pi}[R_{t+1} + \gamma q_\pi (S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] \end{aligned} $$
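The Bellman expectation equation can be turned into an iterative backup for evaluating a fixed policy. A minimal sketch, assuming tabular arrays `P[s, a, s']`, `R[s, a]`, `gamma` as in the earlier MDP sketch and a stochastic policy table `pi[s, a]`:

```python
import numpy as np

def policy_evaluation(P, R, gamma, pi, tol=1e-8):
    """Iterate v(s) <- sum_a pi(a|s) * (R(s,a) + gamma * sum_s' P(s'|s,a) v(s')) to convergence."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v            # q[s, a]: one-step lookahead under the current v
        v_new = (pi * q).sum(axis=1)     # average over actions according to pi(a|s)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```

For example, `policy_evaluation(P, R, gamma, pi=np.full((2, 2), 0.5))` evaluates the uniform-random policy on the toy MDP above.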
Optimal value functions
$$\begin{aligned} \pi \geq \pi' \text{ iff } v_\pi (s) \geq v_{\pi'} (s) \text{ for all } s \in \mathcal{S} \end{aligned}$$
$$\begin{aligned} v_*(s) = \max_{\pi} v_\pi (s) \end{aligned}$$
$$\begin{aligned} q_*(s, a) = \max_{\pi} q_\pi (s, a) \end{aligned}$$
for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$.
$$\begin{aligned} v_\ast(s) = \max_{a} q_\ast (s, a) \end{aligned}$$
$$\begin{aligned} v_\ast(s) = \max_{a} \mathbb{E}[R_{t+1} + \gamma v_\ast (S_{t+1})|S_t = s, A_t = a] \end{aligned}$$
$$\begin{aligned} q_\ast(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_\ast (S_{t+1}, a')|S_t = s, A_t = a] \end{aligned}$$
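These relations mean that once $q_\ast$ is known, an optimal policy follows by acting greedily with respect to it. A minimal sketch (the `q_star` table below is a made-up example):

```python
import numpy as np

# Hypothetical optimal action-value table for a 2-state, 2-action MDP.
q_star = np.array([
    [3.1, 2.4],
    [1.0, 4.2],
])

v_star = q_star.max(axis=1)             # v_*(s) = max_a q_*(s, a)  ->  [3.1, 4.2]
greedy_policy = q_star.argmax(axis=1)   # optimal action in each state  ->  [0, 1]
```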
Dynamic Programming
What is Dynamic Programming?
Dynamic programming (DP): a collection of algorithms that can be used to compute optimal policies given a perfect model such as a Markov Decision Process (MDP).
DP can solve problems that have two properties:
- Overlapping subproblems
  - The problem can be broken down into subproblems that recur many times
  - Solutions to subproblems can be cached and reused
- Optimal substructure
  - An optimal solution to the problem can be constructed from optimal solutions to its subproblems (the principle of optimality applies)
MDPs satisfy both properties: the Bellman equation gives the recursive decomposition, and the value function stores and reuses the solutions to subproblems.
Policy Iteration
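A minimal sketch of policy iteration, assuming tabular arrays `P[s, a, s']` and `R[s, a]` as above; it alternates exact policy evaluation (here via a linear solve) with greedy policy improvement:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)      # arbitrary initial deterministic policy
    while True:
        # Evaluation: solve (I - gamma * P_pi) v = R_pi for the current policy.
        P_pi = P[np.arange(n_states), policy]   # P_pi[s, s'] = P[s, policy[s], s']
        R_pi = R[np.arange(n_states), policy]   # R_pi[s]     = R[s, policy[s]]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Improvement: act greedily with respect to q_pi(s, a) = R(s,a) + gamma * sum_s' P v(s').
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):  # stable policy -> optimal
            return new_policy, v
        policy = new_policy
```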
Value Iteration
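A minimal sketch of value iteration under the same tabular assumptions; it applies the Bellman optimality backup repeatedly and reads off a greedy policy at the end:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate v(s) <- max_a (R(s,a) + gamma * sum_s' P(s'|s,a) v(s')) to convergence."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v                    # one-step lookahead q[s, a]
        v_new = q.max(axis=1)                    # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)       # value estimate and greedy policy
        v = v_new
```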
Model-Free Prediction
Model-free prediction algorithms aim to estimate the value function of a certain policy without knowing the MDP.
Monte-Carlo Prediction
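A minimal sketch of every-visit Monte-Carlo prediction with a constant step size; the episode format (a list of `(state, reward)` pairs, with `reward` being $R_{t+1}$ for $S_t$) is an assumption for illustration:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma, alpha=0.1):
    """Every-visit Monte-Carlo prediction from complete episodes."""
    V = defaultdict(float)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards so that g always holds the return G_t for step t.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            V[state] += alpha * (g - V[state])   # move V(S_t) toward the sampled return
    return V
```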
Temporal-Difference Learning
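A minimal sketch of TD(0) prediction under similar assumptions, where each episode is a list of `(state, reward, next_state, done)` transitions:

```python
from collections import defaultdict

def td0_prediction(episodes, gamma, alpha=0.1):
    """TD(0) prediction: bootstrap on the current estimate of V(S_{t+1})."""
    V = defaultdict(float)
    for episode in episodes:
        for state, reward, next_state, done in episode:
            target = reward + (0.0 if done else gamma * V[next_state])  # TD target
            V[state] += alpha * (target - V[state])                     # TD(0) update
    return V
```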
Difference between Monte-Carlo (MC) and Temporal-Difference (TD)
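In brief (standard points from the book, summarized here):
- MC learns from complete episodes and does not bootstrap; TD learns online from incomplete episodes by bootstrapping on $V(S_{t+1})$.
- The MC target (the return $G_t$) is an unbiased but high-variance estimate of $v_\pi(S_t)$; the TD target is biased but has much lower variance.
- TD exploits the Markov property and is usually more efficient in Markov environments; MC does not rely on it.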
Model-Free Control
Model-free control algorithms can be divided into two groups: on-policy control and off-policy control.
On-policy control: Learn about policy $\pi$ from experience sampled from $\pi$
Off-policy control: Learn about policy $\pi$ from experience sampled from policy $\mu \neq \pi$
Both types of control algorithms are widely utilized in RL.
On-policy Monte-Carlo Control
On-policy Temporal-Difference Control
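The canonical on-policy TD control algorithm is SARSA: the update bootstraps on $Q(S_{t+1}, A_{t+1})$ for the action actually chosen by the behaviour policy. A minimal sketch, assuming a Gymnasium-style discrete environment (`env.reset()` returning `(obs, info)` and `env.step(a)` returning a 5-tuple); all other names are illustrative:

```python
import random
from collections import defaultdict

def sarsa(env, n_episodes, gamma=0.99, alpha=0.1, epsilon=0.1):
    """On-policy TD control: update Q(s, a) toward r + gamma * Q(s', a')."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return env.action_space.sample()
        return max(range(env.action_space.n), key=lambda act: Q[(s, act)])

    for _ in range(n_episodes):
        s, _ = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a2 = eps_greedy(s2)
            target = r + (0.0 if terminated else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```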
Off-policy Temporal-Difference Control
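The canonical off-policy TD control algorithm is Q-learning: behaviour is still $\epsilon$-greedy, but the update bootstraps on the greedy target $\max_{a'} Q(S_{t+1}, a')$. A minimal sketch under the same environment assumptions as the SARSA sketch above:

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes, gamma=0.99, alpha=0.1, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, update toward the greedy target."""
    Q = defaultdict(float)
    n_actions = env.action_space.n

    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Behaviour policy mu: epsilon-greedy with respect to Q.
            if random.random() < epsilon:
                a = env.action_space.sample()
            else:
                a = max(range(n_actions), key=lambda act: Q[(s, act)])
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Target policy pi: greedy with respect to Q.
            best_next = max(Q[(s2, act)] for act in range(n_actions))
            target = r + (0.0 if terminated else gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```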
Policy Gradient Methods
In policy gradient methods, the policy parameter $\theta$ is updated using the gradient of some scalar performance measure $\mathcal{J}(\theta)$ with respect to the policy parameter. To maximize policy performance, the updates approximate gradient ascent in $\mathcal{J}$:
$$\begin{align} \theta_{t+1} = \theta_t + \alpha \nabla \mathcal{J}(\theta_t) \end{align}$$
where $\alpha$ is a step-size parameter.
Then what can serve as the performance measure $\mathcal{J}(\theta)$ for an MDP policy over finite episodes?
In the episodic case, following $\pi_\theta$ generates a trajectory $\tau$, and we can define $\mathcal{J}(\theta)$ as the value obtained under the policy:
$$\begin{align} \mathcal{J}(\theta) = \sum_{s \in \mathcal{S}} d^\pi(s)v_\pi(s) = \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} \pi_\theta(a|s) q_\pi(s, a) \end{align}$$
where $d^\pi$ is the stationary distribution of the Markov chain induced by $\pi_\theta$.
Then the gradient of $\mathcal{J}$ can be reformatted as the following:
$$\begin{aligned} \nabla_\theta \mathcal{J}(\theta) &= \nabla_\theta \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} \pi_\theta(a|s) q_\pi(s, a) & \\ &\propto \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} q_\pi(s, a) \nabla_\theta \pi_\theta(a|s) \\ &= \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} \pi_\theta(a|s) q_\pi(s, a) \dfrac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} \\ &= \mathbb{E_\pi} [q_\pi(s,a) \nabla_\theta \log \pi_\theta(a|s)] \end{aligned}$$
where $\mathbb{E_\pi}$ refers to $\mathbb{E}_{s \sim d^\pi, a \sim \pi_\theta}$ and $q_\pi$ denotes the true action-value function under policy $\pi$.
Policy Based Reinforcement Learning
Policy-based RL: RL algorithms that learn the policy directly from experience rather than deriving it from a value function.
Monte-Carlo Policy Gradient
Since the return $G_t$ is an unbiased sample of the action value,
$$\begin{align} \mathbb{E_\pi} [G_t|s_t, a_t] = q_\pi (s_t, a_t), \end{align}$$
REINFORCE (Monte-Carlo policy gradient) uses $G_t$ in place of $q_\pi$ in the gradient update.
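A minimal sketch of a REINFORCE update for a tabular softmax policy; the episode format (`(state, action, reward)` triples) and parameter shapes are assumptions for illustration:

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(a|s) for a tabular softmax policy with parameters theta[s, a]."""
    prefs = theta[s] - theta[s].max()
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce_update(theta, episode, gamma, alpha):
    """Update theta from one episode using the sampled return G_t in place of q_pi."""
    g = 0.0
    # Walk backwards so g holds the return G_t for each visited step.
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        g = r + gamma * g
        probs = softmax_policy(theta, s)
        grad_log_pi = -probs                  # grad of log pi(a|s) w.r.t. theta[s, :]
        grad_log_pi[a] += 1.0                 # ... equals one_hot(a) - probs
        theta[s] += alpha * g * grad_log_pi   # stochastic gradient ascent step
    return theta
```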
Actor-Critic Policy Gradient
However, REINFORCE suffers from high variance in its gradient estimates.
Thus, to reduce variance, a critic $q_\phi$ can be used instead of the return to estimate $q_\pi$:
$$\begin{align} q_\phi (s, a) \approx q_\pi(s, a) \end{align}$$
The policy gradient and the actor update then become
$$\begin{aligned} \nabla_\theta \mathcal{J}(\theta) &\approx \mathbb{E_\pi} [q_\phi(s,a) \nabla_\theta \log \pi_\theta(a|s)] \\ \theta &\leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta (a_t|s_t) \, q_\phi(s_t, a_t) \end{aligned}$$
Critic can be updated by
- Monte-Carlo evaluation
- TD(0)
- TD($\lambda$)
For example, if we use TD(0), then in timestep $t$ the critic will be updated as
$$\begin{align} \phi \leftarrow \phi + \beta (r_t + \gamma q_\phi(s_{t+1}, a_{t+1}) - q_\phi(s_t, a_t)) \nabla_\phi q_\phi(s_t, a_t) \end{align}$$
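A minimal sketch of one online actor-critic step with a tabular softmax actor $\theta$ and a tabular critic standing in for $q_\phi$ (for a table, $\nabla_\phi q_\phi$ reduces to an indicator, so the critic update is a plain TD(0) step); the transition format and names are assumptions for illustration:

```python
import numpy as np

def actor_critic_step(theta, q, transition, gamma, alpha, beta):
    """One update from a transition (s, a, r, s2, a2, done).

    theta[s, a]: softmax policy parameters (actor)
    q[s, a]:     tabular action-value estimates (critic, standing in for q_phi)
    """
    s, a, r, s2, a2, done = transition

    # Actor: theta <- theta + alpha * q(s, a) * grad log pi(a|s).
    prefs = theta[s] - theta[s].max()
    probs = np.exp(prefs) / np.exp(prefs).sum()
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha * q[s, a] * grad_log_pi

    # Critic: TD(0) update toward r + gamma * q(s2, a2).
    target = r if done else r + gamma * q[s2, a2]
    q[s, a] += beta * (target - q[s, a])
    return theta, q
```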
Model-based Reinforcement Learning
Model-based RL: learn a model of the environment from experience, then plan a value function and/or policy using the model.
What is a Model?
A model $\mathcal{M}$ is a representation of an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$ parametrized by $\eta$. Assuming the state space $\mathcal{S}$ and action space $\mathcal{A}$ are known, a model $\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ represents the transition and reward dynamics:
$$\begin{align} S_{t+1} &\sim \mathcal{P}_\eta(S_{t+1}|S_t, A_t) \\ R_{t+1} &\sim \mathcal{R}_\eta(R_{t+1}|S_t, A_t) \end{align}$$
The model $\langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ is learned from experience $\{S_1, A_1, R_2, … , S_T\}$.
A model can be parametrized in various ways: from lookup table models to deep neural networks.
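A minimal sketch of a table-lookup model that estimates transition probabilities by empirical frequencies and the reward function by the mean observed reward; the update interface is an assumption for illustration:

```python
from collections import defaultdict

class TableLookupModel:
    """Estimate P(s'|s,a) by counting transitions and R(s,a) by averaging rewards."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                 # (s, a) -> summed reward
        self.visits = defaultdict(int)                       # (s, a) -> visit count

    def update(self, s, a, r, s2):
        """Record one observed transition (S_t, A_t, R_{t+1}, S_{t+1})."""
        self.counts[(s, a)][s2] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def transition_probs(self, s, a):
        n = self.visits[(s, a)]
        return {s2: c / n for s2, c in self.counts[(s, a)].items()}

    def expected_reward(self, s, a):
        return self.reward_sum[(s, a)] / self.visits[(s, a)]
```

Such a model can then be used for planning, for example by sampling imagined transitions from it and applying the model-free methods above to the sampled experience.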