RL as Adaptive Optimal Control
Slide: Explore reinforcement learning (RL) by comparing the Markov Chain and the Markov Decision Process (MDP). Understand how RL functions as a direct adaptive optimal control method through the example of Q-Learning. Inspired by Sutton's paper "Reinforcement Learning is Direct Adaptive Optimal Control" and Pieter Abbeel's lecture "Foundations of Deep RL".
Recap
Problem Formulation
Consider the deterministic discrete-time optimal control problem:
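A minimal sketch of the standard formulation (the symbols $x_k$, $u_k$, $f$, $\ell$, $\ell_F$ are assumed here, not taken from the original slide):

```latex
\[
\min_{u_0, \dots, u_{N-1}} \; \sum_{k=0}^{N-1} \ell(x_k, u_k) + \ell_F(x_N)
\qquad \text{s.t.} \quad x_{k+1} = f(x_k, u_k), \quad x_0 \ \text{given}.
\]
```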
The first-order necessary conditions for optimality can be derived using:
- The Lagrangian framework (special case of KKT conditions)
- Pontryagin's Minimum Principle (PMP)
Lagrangian Formulation
Form the Lagrangian:
Define the Hamiltonian:
Rewrite the Lagrangian using the Hamiltonian:
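A sketch of these three steps in assumed notation consistent with the problem statement above (multipliers $\lambda_{k+1}$ attached to the dynamics constraints):

```latex
\[
L = \sum_{k=0}^{N-1} \Big[ \ell(x_k, u_k) + \lambda_{k+1}^\top \big( f(x_k, u_k) - x_{k+1} \big) \Big] + \ell_F(x_N)
\]
\[
H_k(x_k, u_k, \lambda_{k+1}) = \ell(x_k, u_k) + \lambda_{k+1}^\top f(x_k, u_k)
\]
\[
L = H_0 + \sum_{k=1}^{N-1} \big[ H_k - \lambda_k^\top x_k \big] - \lambda_N^\top x_N + \ell_F(x_N)
\]
```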
Optimality Conditions
Take derivatives of the Lagrangian with respect to the states, the costates (multipliers), and the controls, and set them to zero. For the terminal state $x_N$, this yields the boundary condition on the final costate $\lambda_N$.
Summary of Necessary Conditions
The first-order necessary conditions can be summarized as:
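In the assumed notation above, these read:

```latex
\[
x_{k+1} = f(x_k, u_k), \qquad
\lambda_k = \frac{\partial H_k}{\partial x_k}, \qquad
\frac{\partial H_k}{\partial u_k} = 0, \qquad
\lambda_N = \frac{\partial \ell_F}{\partial x_N}.
\]
```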
In continuous time, these become:
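A corresponding continuous-time sketch, with Hamiltonian $H(x, u, \lambda) = \ell(x, u) + \lambda^\top f(x, u)$:

```latex
\[
\dot{x} = \frac{\partial H}{\partial \lambda}, \qquad
\dot{\lambda} = -\frac{\partial H}{\partial x}, \qquad
\frac{\partial H}{\partial u} = 0, \qquad
\lambda(T) = \frac{\partial \ell_F}{\partial x}\bigg|_{x(T)}.
\]
```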
Application to LQR Problems
For LQR problems with quadratic cost and linear dynamics:
The necessary conditions simplify to:
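A sketch in assumed notation, with cost matrices $Q$, $R$, $Q_F$ and dynamics matrices $A$, $B$:

```latex
\[
\min \; \tfrac{1}{2} x_N^\top Q_F x_N + \sum_{k=0}^{N-1} \tfrac{1}{2} \big( x_k^\top Q x_k + u_k^\top R u_k \big)
\qquad \text{s.t.} \quad x_{k+1} = A x_k + B u_k
\]
\[
\lambda_k = Q x_k + A^\top \lambda_{k+1}, \qquad
u_k = -R^{-1} B^\top \lambda_{k+1}, \qquad
x_0 \ \text{given}, \quad \lambda_N = Q_F x_N.
\]
```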
This forms a linear two-point boundary value problem.
MDP & RL
Bridging Optimal Control and RL
Markov Chains
- State space
- Action space
- System dynamics
- Cost function (stage cost and terminal cost)
Find a feedback law that minimizes the accumulated cost, subject to the system dynamics (a sketch in assumed notation follows below).
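A minimal sketch of this formulation, with assumed symbols:

```latex
\[
\min_{\pi} \; \sum_{k=0}^{N-1} \ell\big(x_k, \pi(x_k)\big) + \ell_F(x_N)
\qquad \text{s.t.} \quad x_{k+1} = f\big(x_k, \pi(x_k)\big), \quad x_0 \ \text{given}.
\]
```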
Markov (Decision) Process
- State space
- Action space
- Transition dynamics
- Reward function
Find a policy that maximizes the expected (discounted) return, subject to the transition dynamics (sketched below).
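A minimal sketch of the MDP objective, with assumed symbols ($p$ for transitions, $r$ for rewards, $\gamma$ for the discount factor):

```latex
\[
\max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right]
\qquad \text{s.t.} \quad a_t \sim \pi(\cdot \mid s_t), \quad s_{t+1} \sim p(\cdot \mid s_t, a_t).
\]
```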
RL is an adaptive method for solving an MDP in the absence of model knowledge.
Value Function and Action-Value Function
Optimal Control:
- Value Function: the optimal cost-to-go from a given state.
- Action-Value Function: the cost of applying a given control now and then acting optimally.
Reinforcement Learning:
- Value Function: the expected (discounted) return from a given state under a policy.
- Action-Value Function: the expected (discounted) return from taking a given action in a state and then following the policy (see the sketch after this list).
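A sketch of the standard definitions in assumed notation (cost-to-go on the control side, expected return in Bellman optimality form on the RL side):

```latex
% Optimal control (cost-to-go)
\[
V(x) = \min_{u} \big[ \ell(x, u) + V\big(f(x, u)\big) \big],
\qquad
Q(x, u) = \ell(x, u) + V\big(f(x, u)\big).
\]
% Reinforcement learning (expected return)
\[
V^*(s) = \max_{a} \, \mathbb{E}_{s'}\big[ r(s, a) + \gamma V^*(s') \big],
\qquad
Q^*(s, a) = \mathbb{E}_{s'}\big[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \big].
\]
```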
Q-Learning
The Scalability Challenge
For discrete, low-dimensional problems with a known model, Optimal Control and Model-based RL can be solved exactly using Dynamic Programming (DP). But what if...
- The model (the dynamics or the transition probabilities) is unknown?
- The state or action space is too large or continuous, making the DP loops intractable?
- The system is too complex to model accurately?
We need model-free, stochastic, and approximate methods.
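For intuition, a minimal value-iteration sketch with an explicit tabular model; the array names `P` and `R`, their shapes, and the hyperparameters are illustrative assumptions. The nested sweeps over all states and actions, and the reliance on a known `P`, are exactly what breaks when the model is unknown or the spaces grow.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, iters=500):
    """Tabular value iteration with a known model.

    P: transition probabilities, shape (S, A, S).
    R: expected rewards,         shape (S, A).
    Returns the optimal state values V, shape (S,).
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        V_new = np.empty_like(V)
        for s in range(n_states):                   # sweep over every state ...
            q_best = -np.inf
            for a in range(n_actions):              # ... and every action
                q = R[s, a] + gamma * P[s, a] @ V   # Bellman optimality backup
                q_best = max(q_best, q)
            V_new[s] = q_best
        V = V_new
    return V
```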
(Tabular) Q-Learning
(Tabular) Q-Learning replaces the expectation with samples:
- For a state-action pair $(s, a)$, receive the reward $r$ and the next state $s'$.
- Consider the old estimate $Q(s, a)$.
- Consider the new sample estimate: $r + \gamma \max_{a'} Q(s', a')$.
- Incorporate the new estimate into a running average: $Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') \big]$.
Q-learning converges to the optimal policy even if you act suboptimally while collecting data; this is called off-policy learning. Convergence requires sufficient exploration and a learning rate that decays to zero, but not too quickly.
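A minimal tabular sketch of this update loop, assuming an environment object whose `reset()` returns an integer state and whose `step(a)` returns `(next_state, reward, done)`; this interface and the hyperparameters are illustrative:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular, off-policy Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else act greedily.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Sample-based Bellman backup: running average of old estimate and new sample.
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```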
Approximate Q-Learning
Instead of a table, we use a parametrized Q-function $Q_\theta(s, a)$, e.g., a neural network.
- Learning rule: form the target $y = r + \gamma \max_{a'} Q_\theta(s', a')$ and take a gradient step on the Bellman error, $\theta \leftarrow \theta - \alpha \big( Q_\theta(s, a) - y \big) \nabla_\theta Q_\theta(s, a)$.
- Practical details:
- Use the Huber loss instead of the squared loss on the Bellman error (quadratic for small errors, linear for large ones, hence less sensitive to outliers).
- Use RMSProp instead of vanilla SGD.
- It is beneficial to anneal the exploration rate over time.
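A minimal sketch of one approximate Q-learning gradient step in PyTorch, using the Huber loss (`SmoothL1Loss`) and RMSProp as suggested above; the network architecture, tensor shapes, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative problem sizes.
state_dim, n_actions, gamma = 4, 2, 0.99

# Parametrized Q-function: a small fully connected network mapping states to action values.
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-3)
huber = nn.SmoothL1Loss()  # Huber loss on the Bellman error

def q_update(s, a, r, s_next, done):
    """One gradient step on a batch of transitions.

    s, s_next: float tensors (B, state_dim); a: long tensor (B,);
    r, done:   float tensors (B,).
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q_theta(s, a)
    with torch.no_grad():                                   # target treated as a constant
        target = r + gamma * q_net(s_next).max(dim=1).values * (1.0 - done)
    loss = huber(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```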
Conclusion
RL as Adaptive Optimal Control
Common Core:
- Value Function
- Bellman Equation
- Sequential Decision Making
The Evolution:
- Expectation → Samples
- Table → Function Approximation
- Exact Solution → Stochastic Optimization
A powerful and adaptive optimal control framework.