
RL as Adaptive Optimal Control

Slide: Explore reinforcement learning (RL) by comparing the Markov Chain and the Markov Decision Process (MDP). Understand how RL functions as a direct adaptive optimal control method through the example of Q-Learning. Inspired by Sutton's paper "Reinforcement Learning is Direct Adaptive Optimal Control" and Pieter Abbeel's lecture "Foundations of Deep RL".

Recap

Problem Formulation

Consider the deterministic discrete-time optimal control problem:

$$
\min_{x_{1:N},\, u_{1:N-1}} \;\; \sum_{n=1}^{N-1} l(x_n, u_n) + l_F(x_N)
\qquad \text{s.t.} \quad x_{n+1} = f(x_n, u_n), \quad u_n \in \mathcal{U}
$$

The first-order necessary conditions for optimality can be derived using:

  • The Lagrangian framework (special case of KKT conditions)
  • Pontryagin's Minimum Principle (PMP)

Lagrangian Formulation

Form the Lagrangian:

$$
L = \sum_{n=1}^{N-1} \Big[ l(x_n, u_n) + \lambda_{n+1}^\top \big( f(x_n, u_n) - x_{n+1} \big) \Big] + l_F(x_N)
$$

Define the Hamiltonian:

$$
H(x_n, u_n, \lambda_{n+1}) = l(x_n, u_n) + \lambda_{n+1}^\top f(x_n, u_n)
$$

Rewrite the Lagrangian using the Hamiltonian:

$$
L = H(x_1, u_1, \lambda_2) + \left[ \sum_{n=2}^{N-1} H(x_n, u_n, \lambda_{n+1}) - \lambda_n^\top x_n \right] + l_F(x_N) - \lambda_N^\top x_N
$$
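
This regrouping is just an index shift on the multiplier terms; spelled out with the same notation:

$$
\begin{aligned}
L &= \sum_{n=1}^{N-1} H(x_n, u_n, \lambda_{n+1}) - \sum_{n=1}^{N-1} \lambda_{n+1}^\top x_{n+1} + l_F(x_N) \\
  &= \sum_{n=1}^{N-1} H(x_n, u_n, \lambda_{n+1}) - \sum_{n=2}^{N} \lambda_n^\top x_n + l_F(x_N) \\
  &= H(x_1, u_1, \lambda_2) + \sum_{n=2}^{N-1} \Big[ H(x_n, u_n, \lambda_{n+1}) - \lambda_n^\top x_n \Big] + l_F(x_N) - \lambda_N^\top x_N .
\end{aligned}
$$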

Optimality Conditions

Take derivatives with respect to x and λ:

$$
\begin{aligned}
\frac{\partial L}{\partial \lambda_{n+1}} &= \frac{\partial H}{\partial \lambda_{n+1}} - x_{n+1} = f(x_n, u_n) - x_{n+1} = 0 \\[4pt]
\frac{\partial L}{\partial x_n} &= \frac{\partial H}{\partial x_n} - \lambda_n^\top = \frac{\partial l}{\partial x_n} + \lambda_{n+1}^\top \frac{\partial f}{\partial x_n} - \lambda_n^\top = 0 \\[4pt]
\frac{\partial L}{\partial x_N} &= \frac{\partial l_F}{\partial x_N} - \lambda_N^\top = 0
\end{aligned}
$$

For u, we write the minimization explicitly to handle constraints:

$$
u_n = \arg\min_{u} \; H(x_n, u, \lambda_{n+1}) \quad \text{s.t. } u \in \mathcal{U}
$$

Summary of Necessary Conditions

The first-order necessary conditions can be summarized as:

$$
\begin{aligned}
x_{n+1} &= \nabla_\lambda H(x_n, u_n, \lambda_{n+1}) \\
\lambda_n &= \nabla_x H(x_n, u_n, \lambda_{n+1}) \\
u_n &= \arg\min_{u} \; H(x_n, u, \lambda_{n+1}) \quad \text{s.t. } u \in \mathcal{U} \\
\lambda_N &= \frac{\partial l_F}{\partial x_N}
\end{aligned}
$$

In continuous time, these become:

$$
\begin{aligned}
\dot{x} &= \nabla_\lambda H(x, u, \lambda) \\
\dot{\lambda} &= -\nabla_x H(x, u, \lambda) \\
u &= \arg\min_{\tilde{u}} \; H(x, \tilde{u}, \lambda) \quad \text{s.t. } \tilde{u} \in \mathcal{U} \\
\lambda(t_F) &= \frac{\partial l_F}{\partial x}
\end{aligned}
$$

Application to LQR Problems

For LQR problems with quadratic cost and linear dynamics:

$$
\begin{aligned}
l(x_n, u_n) &= \tfrac{1}{2}\big( x_n^\top Q_n x_n + u_n^\top R_n u_n \big) \\
l_F(x_N) &= \tfrac{1}{2} x_N^\top Q_N x_N \\
f(x_n, u_n) &= A_n x_n + B_n u_n
\end{aligned}
$$

The necessary conditions simplify to:

$$
\begin{aligned}
x_{n+1} &= A_n x_n + B_n u_n \\
\lambda_n &= Q_n x_n + A_n^\top \lambda_{n+1} \\
\lambda_N &= Q_N x_N \\
u_n &= -R_n^{-1} B_n^\top \lambda_{n+1}
\end{aligned}
$$

This forms a linear two-point boundary value problem.
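
In practice, one standard way to solve this linear two-point boundary value problem is the backward Riccati recursion. Below is a minimal NumPy sketch under the simplifying assumption of time-invariant $A$, $B$, $Q$, $R$, $Q_N$; the function names and the double-integrator numbers are illustrative, not from the slides.

```python
import numpy as np

def lqr_gains(A, B, Q, R, QN, N):
    """Backward Riccati recursion for the finite-horizon LQR problem.

    Returns feedback gains K_1, ..., K_{N-1} such that u_n = -K_n x_n
    satisfies the necessary conditions above (with lambda_n = P_n x_n).
    """
    P = QN                      # terminal condition: lambda_N = Q_N x_N
    gains = []
    for _ in range(N - 1):
        # K_n = (R + B^T P_{n+1} B)^{-1} B^T P_{n+1} A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # P_n = Q + A^T P_{n+1} (A - B K_n)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]          # reorder as K_1, ..., K_{N-1}

def rollout(A, B, gains, x1):
    """Simulate x_{n+1} = A x_n + B u_n under u_n = -K_n x_n."""
    xs, us = [x1], []
    for K in gains:
        u = -K @ xs[-1]
        us.append(u)
        xs.append(A @ xs[-1] + B @ u)
    return xs, us

# Illustrative example: double integrator with dt = 0.1
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R, QN = np.eye(2), 0.1 * np.eye(1), 10 * np.eye(2)
xs, us = rollout(A, B, lqr_gains(A, B, Q, R, QN, N=50), x1=np.array([1.0, 0.0]))
```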

MDP & RL

Bridging Optimal Control and RL

Markov Chains

  • State space $\mathcal{X}$
  • Action space $\mathcal{U}$
  • System dynamics $f(x_n, u_n)$
  • Cost functions $l(x, u)$ and $l_F(x)$

Find feedback $u = K(x)$ to minimize $J(x_0, u)$:

$$
J(x_0, u) = \mathbb{E}\left[ \sum_{n=0}^{N-1} l(x_n, u_n) + l_F(x_N) \right]
$$

subject to $x_{n+1} \sim f(x_n, u_n)$.

Markov (Decision) Process

  • State space $\mathcal{S}$
  • Action space $\mathcal{A}$
  • Transition dynamics $P(s' \mid s, a)$
  • Reward function $R(s, a, s')$

Find policy $\pi(a \mid s)$ to maximize $V(s_0, \pi)$:

$$
V(s_0, \pi) = \mathbb{E}\left[ \sum_{n=0}^{H} \gamma^n R(s_n, a_n, s_{n+1}) \right]
$$

subject to $s_{n+1} \sim P(\cdot \mid s_n, a_n)$ and $a_n \sim \pi(\cdot \mid s_n)$, where $\gamma \in [0, 1)$ is the discount factor.

RL is an adaptive method for solving MDPs in the absence of model knowledge.

Value Function and Action-Value Function

Optimal Control:

  • Value Function: $V(x) = \min_{u} \mathbb{E}\left[ \sum_{n=0}^{N-1} l(x_n, u_n) + l_F(x_N) \,\middle|\, x_0 = x \right]$
  • Action-Value Function: $Q(x, u) = \mathbb{E}\left[ l(x, u) + V(x') \,\middle|\, x' \sim f(x, u) \right]$

Reinforcement Learning:

  • Value Function: $V^\pi(s) = \mathbb{E}\left[ \sum_{n=0}^{H} \gamma^n R(s_n, a_n, s_{n+1}) \,\middle|\, s_0 = s,\; a_n \sim \pi(\cdot \mid s_n) \right]$
  • Action-Value Function: $Q^\pi(s, a) = \mathbb{E}\left[ R(s, a, s') + \gamma V^\pi(s') \,\middle|\, s' \sim P(\cdot \mid s, a) \right]$
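
With a known model and a small state space, these definitions can be evaluated exactly. Here is a minimal NumPy sketch of tabular policy evaluation: it solves the linear Bellman expectation equation for $V^\pi$ and then reads off $Q^\pi$. The two-state MDP, its numbers, and all names are illustrative assumptions, not from the slides.

```python
import numpy as np

# Illustrative 2-state, 2-action MDP: P[s, a, s'] and R[s, a, s'].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 3.0]]])
gamma = 0.95
pi = np.array([[0.5, 0.5], [0.2, 0.8]])   # pi[s, a]: stochastic policy

# Policy-averaged transition matrix and expected one-step reward.
P_pi = np.einsum("sa,sat->st", pi, P)          # P^pi[s, s']
R_pi = np.einsum("sa,sat,sat->s", pi, P, R)    # E[R | s, policy pi]

# V^pi solves (I - gamma * P^pi) V = R^pi  (Bellman expectation equation).
V = np.linalg.solve(np.eye(len(P)) - gamma * P_pi, R_pi)

# Q^pi(s, a) = E_{s'}[ R(s, a, s') + gamma * V^pi(s') ].
Q = np.einsum("sat,sat->sa", P, R) + gamma * P @ V

print(V, Q, sep="\n")
```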

Q-Learning

The Scalability Challenge

For discrete, low-dimensional problems with a known model, Optimal Control and Model-based RL can be solved exactly using Dynamic Programming (DP). But what if...

  • The model $f(x, u)$, $l(x, u)$ or $P(s' \mid s, a)$, $R(s, a, s')$ is unknown?
  • The state or action space is too large or continuous, making the DP loops intractable?
  • The system is too complex to model accurately?

We need model-free, stochastic, and approximate methods.

(Tabular) Q-Learning

(Tabular) Q-Learning replaces the expectation with samples:

  • For a state-action pair $(s, a)$, receive a sample $s' \sim P(s' \mid s, a)$
  • Consider the old estimate $Q_k(s, a)$
  • Consider the new sample estimate: $\text{target}(s') = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')$
  • Incorporate the new estimate into a running average: $Q_{k+1}(s, a) \leftarrow (1 - \alpha)\, Q_k(s, a) + \alpha \cdot \text{target}(s')$

Q-learning converges to the optimal policy even while acting suboptimally; this is called off-policy learning. Convergence requires sufficient exploration and a learning rate $\alpha$ that decays appropriately:

$$
\sum_{t=0}^{\infty} \alpha_t(s, a) = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2(s, a) < \infty
$$
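
A minimal tabular Q-learning sketch in NumPy. The environment interface (`reset()` returning a state index, `step(a)` returning `(next_state, reward, done)`), the epsilon-greedy exploration, and the $1/N(s,a)$ step-size decay are illustrative assumptions rather than a specific library's API.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, gamma=0.95, eps=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))      # for the decaying step size
    rng = np.random.default_rng(0)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection (sufficient exploration).
            a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)

            # alpha_t = 1 / N(s, a) satisfies the Robbins-Monro conditions
            # sum(alpha) = inf, sum(alpha^2) < inf.
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]

            # target(s') = R + gamma * max_a' Q(s', a'); no future value at terminal states.
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
    return Q
```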

Approximate Q-Learning

Instead of a table, we use a parametrized Q-function $Q_\theta(s, a)$ as the approximation:

  • Learning rule (sketched in code after this list): $\text{target}(s') = R(s, a, s') + \gamma \max_{a'} Q_{\theta_k}(s', a')$, then update $\theta_{k+1} \leftarrow \theta_k - \alpha \nabla_\theta \left[ \tfrac{1}{2} \big( Q_\theta(s, a) - \text{target}(s') \big)^2 \right] \Big|_{\theta = \theta_k}$
  • Practical details:
    • Use the Huber loss instead of the squared loss on the Bellman error: $L_\delta(a) = \begin{cases} \tfrac{1}{2} a^2 & \text{for } |a| \le \delta \\ \delta \left( |a| - \tfrac{1}{2}\delta \right) & \text{otherwise} \end{cases}$
    • Use RMSProp instead of vanilla SGD.
    • It is beneficial to anneal the exploration rate over time.
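
A minimal sketch of this semi-gradient update with a linear approximation $Q_\theta(s, a) = \theta_a^\top \phi(s)$ and the Huber loss on the Bellman error. The feature map, the plain gradient step (standing in for RMSProp), and all names are illustrative assumptions.

```python
import numpy as np

def huber_grad(a, delta=1.0):
    """Derivative of the Huber loss L_delta at the Bellman error a."""
    return a if abs(a) <= delta else delta * np.sign(a)

def q_value(theta, phi_s, a):
    """Linear approximation Q_theta(s, a) = theta[a] . phi(s)."""
    return theta[a] @ phi_s

def approx_q_update(theta, phi_s, a, r, phi_s_next, done,
                    gamma=0.99, alpha=1e-3, n_actions=4):
    """One semi-gradient Q-learning step on the parameters theta.

    theta has shape (n_actions, n_features); the target is held fixed
    (no gradient flows through it), as in the update rule above.
    """
    target = r + (0.0 if done else gamma * max(q_value(theta, phi_s_next, b)
                                               for b in range(n_actions)))
    bellman_error = q_value(theta, phi_s, a) - target
    # d/dtheta[a] of 1/2 * error^2 is error * phi(s); Huber clips the outer error term.
    theta[a] -= alpha * huber_grad(bellman_error) * phi_s
    return theta
```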

Conclusion

RL as Adaptive Optimal Control

Common Core:

  • Value Function
  • Bellman Equation
  • Sequential Decision Making

The Evolution:

  • Expectation → Samples
  • Table → Function Approximation
  • Exact Solution → Stochastic Optimization

A powerful and adaptive optimal control framework.