RL as Adaptive Optimal Control
Slide: Explore reinforcement learning (RL) by comparing the Markov Chain and the Markov Decision Process (MDP). Understand how RL functions as a direct adaptive optimal control method through the example of Q-Learning. Inspired by Sutton's paper "Reinforcement Learning is Direct Adaptive Optimal Control" and Pieter Abbeel's lecture "Foundations of Deep RL".
Recap
Problem Formulation
Consider the deterministic discrete-time optimal control problem:
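A minimal sketch of the standard formulation (the symbols $x_k$, $u_k$, $f$, $\ell$, $\ell_F$ are assumed here, not taken from the original slide):

```latex
\[
\min_{u_0, \dots, u_{N-1}} \; \sum_{k=0}^{N-1} \ell(x_k, u_k) + \ell_F(x_N)
\qquad \text{s.t.} \quad x_{k+1} = f(x_k, u_k), \quad x_0 \ \text{given}.
\]
```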
The first-order necessary conditions for optimality can be derived using:
- The Lagrangian framework (special case of KKT conditions)
- Pontryagin's Minimum Principle (PMP)
Lagrangian Formulation
Form the Lagrangian:
Define the Hamiltonian:
Rewrite the Lagrangian using the Hamiltonian:
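A sketch of these three steps in assumed notation consistent with the problem statement above (multipliers $\lambda_{k+1}$ attached to the dynamics constraints):

```latex
\[
L = \sum_{k=0}^{N-1} \Big[ \ell(x_k, u_k) + \lambda_{k+1}^\top \big( f(x_k, u_k) - x_{k+1} \big) \Big] + \ell_F(x_N)
\]
\[
H_k(x_k, u_k, \lambda_{k+1}) = \ell(x_k, u_k) + \lambda_{k+1}^\top f(x_k, u_k)
\]
\[
L = H_0 + \sum_{k=1}^{N-1} \big[ H_k - \lambda_k^\top x_k \big] - \lambda_N^\top x_N + \ell_F(x_N)
\]
```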
Optimality Conditions
Take derivatives of the Lagrangian with respect to the states, the costates (multipliers), and the controls, and set them to zero. For the terminal state $x_N$, this yields the boundary condition on the final costate $\lambda_N$.
Summary of Necessary Conditions
The first-order necessary conditions can be summarized as:
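In the assumed notation above, these read:

```latex
\[
x_{k+1} = f(x_k, u_k), \qquad
\lambda_k = \frac{\partial H_k}{\partial x_k}, \qquad
\frac{\partial H_k}{\partial u_k} = 0, \qquad
\lambda_N = \frac{\partial \ell_F}{\partial x_N}.
\]
```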
In continuous time, these become:
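A corresponding continuous-time sketch, with Hamiltonian $H(x, u, \lambda) = \ell(x, u) + \lambda^\top f(x, u)$:

```latex
\[
\dot{x} = \frac{\partial H}{\partial \lambda}, \qquad
\dot{\lambda} = -\frac{\partial H}{\partial x}, \qquad
\frac{\partial H}{\partial u} = 0, \qquad
\lambda(T) = \frac{\partial \ell_F}{\partial x}\bigg|_{x(T)}.
\]
```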
Application to LQR Problems
For LQR problems with quadratic cost and linear dynamics:
The necessary conditions simplify to:
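A sketch in assumed notation, with cost matrices $Q$, $R$, $Q_F$ and dynamics matrices $A$, $B$:

```latex
\[
\min \; \tfrac{1}{2} x_N^\top Q_F x_N + \sum_{k=0}^{N-1} \tfrac{1}{2} \big( x_k^\top Q x_k + u_k^\top R u_k \big)
\qquad \text{s.t.} \quad x_{k+1} = A x_k + B u_k
\]
\[
\lambda_k = Q x_k + A^\top \lambda_{k+1}, \qquad
u_k = -R^{-1} B^\top \lambda_{k+1}, \qquad
x_0 \ \text{given}, \quad \lambda_N = Q_F x_N.
\]
```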
This forms a linear two-point boundary value problem.
MDP & RL
Bridging Optimal Control and RL
Markov Chains
- State space
- Action space
- System dynamics
- Cost function (stage cost and terminal cost)
Find a feedback law that minimizes the accumulated cost, subject to the system dynamics (a sketch in assumed notation follows below).
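A minimal sketch of this formulation, with assumed symbols:

```latex
\[
\min_{\pi} \; \sum_{k=0}^{N-1} \ell\big(x_k, \pi(x_k)\big) + \ell_F(x_N)
\qquad \text{s.t.} \quad x_{k+1} = f\big(x_k, \pi(x_k)\big), \quad x_0 \ \text{given}.
\]
```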
Markov (Decision) Process
- State space
- Action space
- Transition dynamics
- Reward function
Find a policy that maximizes the expected (discounted) return, subject to the transition dynamics (sketched below).
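A minimal sketch of the MDP objective, with assumed symbols ($p$ for transitions, $r$ for rewards, $\gamma$ for the discount factor):

```latex
\[
\max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right]
\qquad \text{s.t.} \quad a_t \sim \pi(\cdot \mid s_t), \quad s_{t+1} \sim p(\cdot \mid s_t, a_t).
\]
```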
RL is an adaptive method for solving an MDP in the absence of model knowledge.
Value Function and Action-Value Function
Optimal Control:
- Value Function: the optimal cost-to-go from a given state.
- Action-Value Function: the cost of applying a given control now and then acting optimally.
Reinforcement Learning:
- Value Function: the expected (discounted) return from a given state under a policy.
- Action-Value Function: the expected (discounted) return from taking a given action in a state and then following the policy (see the sketch after this list).
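A sketch of the standard definitions in assumed notation (cost-to-go on the control side, expected return in Bellman optimality form on the RL side):

```latex
% Optimal control (cost-to-go)
\[
V(x) = \min_{u} \big[ \ell(x, u) + V\big(f(x, u)\big) \big],
\qquad
Q(x, u) = \ell(x, u) + V\big(f(x, u)\big).
\]
% Reinforcement learning (expected return)
\[
V^*(s) = \max_{a} \, \mathbb{E}_{s'}\big[ r(s, a) + \gamma V^*(s') \big],
\qquad
Q^*(s, a) = \mathbb{E}_{s'}\big[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \big].
\]
```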
Q-Learning
The Scalability Challenge
For discrete, low-dimensional problems with a known model, Optimal Control and Model-based RL can be solved exactly using Dynamic Programming (DP). But what if...
- The model (the dynamics or the transition probabilities) is unknown?
- The state or action space is too large or continuous, making the DP loops intractable?
- The system is too complex to model accurately?
We need model-free, stochastic, and approximate methods.
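For intuition, a minimal value-iteration sketch with an explicit tabular model; the array names `P` and `R`, their shapes, and the hyperparameters are illustrative assumptions. The nested sweeps over all states and actions, and the reliance on a known `P`, are exactly what breaks when the model is unknown or the spaces grow.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, iters=500):
    """Tabular value iteration with a known model.

    P: transition probabilities, shape (S, A, S).
    R: expected rewards,         shape (S, A).
    Returns the optimal state values V, shape (S,).
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        V_new = np.empty_like(V)
        for s in range(n_states):                   # sweep over every state ...
            q_best = -np.inf
            for a in range(n_actions):              # ... and every action
                q = R[s, a] + gamma * P[s, a] @ V   # Bellman optimality backup
                q_best = max(q_best, q)
            V_new[s] = q_best
        V = V_new
    return V
```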
(Tabular) Q-Learning
(Tabular) Q-Learning replaces the expectation with samples:
- For a state-action pair $(s, a)$, receive the reward $r$ and the next state $s'$.
- Consider the old estimate $Q(s, a)$.
- Consider the new sample estimate: $r + \gamma \max_{a'} Q(s', a')$.
- Incorporate the new estimate into a running average: $Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') \big]$.
Q-learning converges to the optimal policy even if you act suboptimally while collecting data; this is called off-policy learning. Convergence requires sufficient exploration and a learning rate that decays to zero, but not too quickly.
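A minimal tabular sketch of this update loop, assuming an environment object whose `reset()` returns an integer state and whose `step(a)` returns `(next_state, reward, done)`; this interface and the hyperparameters are illustrative:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular, off-policy Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else act greedily.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Sample-based Bellman backup: running average of old estimate and new sample.
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```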
Approximate Q-Learning
Instead of a table, we use a parametrized Q-function $Q_\theta(s, a)$, e.g., a neural network.
- Learning rule: form the target $y = r + \gamma \max_{a'} Q_\theta(s', a')$ and take a gradient step on the Bellman error, $\theta \leftarrow \theta - \alpha \big( Q_\theta(s, a) - y \big) \nabla_\theta Q_\theta(s, a)$.
- Practical details:
- Use the Huber loss instead of the squared loss on the Bellman error (quadratic for small errors, linear for large ones, hence less sensitive to outliers).
- Use RMSProp instead of vanilla SGD.
- It is beneficial to anneal the exploration rate over time.
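A minimal sketch of one approximate Q-learning gradient step in PyTorch, using the Huber loss (`SmoothL1Loss`) and RMSProp as suggested above; the network architecture, tensor shapes, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative problem sizes.
state_dim, n_actions, gamma = 4, 2, 0.99

# Parametrized Q-function: a small fully connected network mapping states to action values.
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-3)
huber = nn.SmoothL1Loss()  # Huber loss on the Bellman error

def q_update(s, a, r, s_next, done):
    """One gradient step on a batch of transitions.

    s, s_next: float tensors (B, state_dim); a: long tensor (B,);
    r, done:   float tensors (B,).
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q_theta(s, a)
    with torch.no_grad():                                   # target treated as a constant
        target = r + gamma * q_net(s_next).max(dim=1).values * (1.0 - done)
    loss = huber(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```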
Conclusion
RL as Adaptive Optimal Control
Common Core:
- Value Function
- Bellman Equation
- Sequential Decision Making
The Evolution:
- Expectation → Samples
- Table → Function Approximation
- Exact Solution → Stochastic Optimization
A powerful and adaptive optimal control framework.