Zhu Y, Zhao D (2018) Comprehensive comparison of online ADP algorithms for continuous-time optimal control. Artif Intell Rev 49(4):531–547
1 Optimal control and HJB equation
The continuous-time system considered here is described by
$$\dot{x}(t) = f(x(t)) + g(x(t))\,u(t).$$
The subject of interest is to find a state-feedback control policy that minimizes a prescribed performance criterion. For a policy $u = u(x(t))$, its value function is defined as an infinite-horizon integral cost
$$V^{u}(x(t)) = \int_{t}^{\infty} r\bigl(x(\tau), u(\tau)\bigr)\,d\tau, \qquad r(x,u) = Q(x) + u^{\mathsf T} R u.$$
An infinitesimal equivalent of the value function definition is the Bellman equation
$$r(x, u) + \bigl(\nabla V^{u}\bigr)^{\mathsf T}\bigl(f(x) + g(x)u\bigr) = 0, \qquad V^{u}(0) = 0.$$
Define the Hamiltonian function as
$$H(x, u, \nabla V) = r(x, u) + (\nabla V)^{\mathsf T}\bigl(f(x) + g(x)u\bigr).$$
The sufficient condition for optimality is provided by the famous Hamilton–Jacobi–Bellman (HJB) equation
$$\min_{u} H\bigl(x, u, \nabla V^{*}\bigr) = 0.$$
According to the stationarity condition $\partial H/\partial u = 0$, the optimal policy is constructed from the optimal value function in the form
$$u^{*}(x) = -\tfrac{1}{2} R^{-1} g^{\mathsf T}(x)\,\nabla V^{*}(x).$$
With the system dynamics and the optimal policy inserted, the HJB equation becomes
$$Q(x) + \bigl(\nabla V^{*}\bigr)^{\mathsf T} f(x) - \tfrac{1}{4}\bigl(\nabla V^{*}\bigr)^{\mathsf T} g(x) R^{-1} g^{\mathsf T}(x)\,\nabla V^{*} = 0.$$
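To make the HJB concrete, consider the linear-quadratic special case $f(x) = Ax$, $g(x) = B$, $Q(x) = x^{\mathsf T}Qx$, for which $V^{*}(x) = x^{\mathsf T}Px$ and the HJB reduces to the algebraic Riccati equation. The short sketch below is illustrative only; the matrices $A$, $B$ and the use of SciPy are assumptions, not part of the surveyed papers.

```python
# Minimal sketch of the LQR special case: f(x) = A x, g(x) = B, Q(x) = x^T Q x.
# With V*(x) = x^T P x, the HJB reduces to the algebraic Riccati equation
#   A^T P + P A + Q - P B R^{-1} B^T P = 0,
# and u*(x) = -1/2 R^{-1} g^T grad V* = -R^{-1} B^T P x.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, -2.0]])   # example drift matrix (assumed)
B = np.array([[0.0], [1.0]])               # example input matrix (assumed)
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)       # solves the reduced HJB (the ARE)
K = np.linalg.solve(R, B.T @ P)            # optimal gain: u*(x) = -K x

# Check that the ARE residual is (numerically) zero.
residual = A.T @ P + P @ A + Q - P @ B @ np.linalg.solve(R, B.T @ P)
print("ARE residual norm:", np.linalg.norm(residual))
print("Optimal gain K:", K)
```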
An efficient approach is the policy iteration (PI) method, which involves a two-step iteration. Given an admissible policy $u^{(i)}$, calculate the value of the current policy in the policy evaluation step with
$$r\bigl(x, u^{(i)}\bigr) + \bigl(\nabla V^{(i)}\bigr)^{\mathsf T}\bigl(f(x) + g(x)u^{(i)}\bigr) = 0,$$
and update in the policy improvement step to produce a new policy with
$$u^{(i+1)}(x) = -\tfrac{1}{2} R^{-1} g^{\mathsf T}(x)\,\nabla V^{(i)}(x)$$
(a linear-quadratic instance of these two steps is sketched at the end of this section). In ADP, a critic NN is constructed to approximate the value function, while an actor NN is constructed to approximate the policy. On the basis of the PI method, numerous online algorithms have been proposed to solve the optimal control of CT systems using ADP. The critic and the actor are tuned based on observations of online trajectories.
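For intuition, in the linear-quadratic special case the two PI steps take an explicit form: policy evaluation becomes a Lyapunov equation and policy improvement becomes a gain update (Kleinman's iteration). The sketch below is an illustration under that assumption, not the general nonlinear procedure; the example matrices are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Policy iteration for the LQR special case (Kleinman's iteration), as a
# concrete instance of the two-step PI scheme described above.
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = np.zeros((1, 2))          # initial admissible policy u = -K x (A is Hurwitz here)
for i in range(20):
    Ac = A - B @ K
    # Policy evaluation: (A - B K)^T P + P (A - B K) + Q + K^T R K = 0
    P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
    # Policy improvement: u^{(i+1)}(x) = -R^{-1} B^T P x
    K_new = np.linalg.solve(R, B.T @ P)
    if np.linalg.norm(K_new - K) < 1e-10:
        break
    K = K_new

print("PI gain:     ", K)
print("ARE gain:    ", np.linalg.solve(R, B.T @ solve_continuous_are(A, B, Q, R)))
```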
2 Synchronous policy iteration algorithm
Vamvoudakis KG, Lewis FL (2010) Online actor–critic algorithm to
solve the continuous-time infinite horizon optimal control problem.
Automatica 46(5):878–888
In the algorithm, the value function is approximated by the critic NN with
$$\hat{V}(x) = \hat{W}_{c}^{\mathsf T}\,\phi(x),$$
where $\phi(x)$ is a vector of activation (basis) functions and $\hat{W}_{c}$ is the critic weight estimate. The corresponding policy is formulated by
$$\hat{u}_{1}(x) = -\tfrac{1}{2} R^{-1} g^{\mathsf T}(x)\,\nabla\phi^{\mathsf T}(x)\,\hat{W}_{c},$$
but another NN is used to represent the actor, with
$$\hat{u}_{2}(x) = -\tfrac{1}{2} R^{-1} g^{\mathsf T}(x)\,\nabla\phi^{\mathsf T}(x)\,\hat{W}_{a}.$$
In the algorithm, the actor is applied to the system to produce online trajectories. The critic is tuned using the following normalized gradient-descent updating law, which minimizes the squared Hamiltonian (Bellman) error:
$$\dot{\hat{W}}_{c} = -a_{c}\,\frac{\sigma}{\bigl(\sigma^{\mathsf T}\sigma + 1\bigr)^{2}}\Bigl[\sigma^{\mathsf T}\hat{W}_{c} + Q(x) + \hat{u}_{2}^{\mathsf T} R \hat{u}_{2}\Bigr], \qquad \sigma = \nabla\phi(x)\bigl(f(x) + g(x)\hat{u}_{2}\bigr).$$
A second updating law is designed for the actor so that $\hat{W}_{a}$ is driven toward $\hat{W}_{c}$ while additional terms guarantee closed-loop stability. The system states and the critic/actor weight errors are proved to be uniformly ultimately bounded (UUB). To guarantee convergence, a persistency of excitation (PE) condition is necessary, requiring that the system be persistently excited, so probing noise is added to the control input to excite the system.
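A minimal simulation sketch of this synchronous tuning idea is given below for a scalar system. The plant $\dot x = -x + u$, the gains, the single quadratic basis, and the simplified actor law (which only pulls the actor weight toward the critic weight, omitting the paper's stabilizing terms) are assumptions, and how closely the weights approach the analytic value depends on the gains and the probing noise.

```python
import numpy as np

# Sketch of synchronous critic/actor tuning on the scalar example system
#   xdot = -x + u,  r(x, u) = x^2 + u^2   (system and gains are assumptions).
# Critic: V_hat(x) = Wc * x^2 (phi(x) = x^2); actor: u_hat(x) = -x * Wa.
# The critic uses the normalized-gradient law; the actor law below simply
# pulls Wa toward Wc, a simplification of the paper's stabilizing actor law.
f = lambda x: -x
g = lambda x: 1.0

dt, a_c, a_a = 1e-3, 5.0, 2.0
x, Wc, Wa = 1.0, 0.0, 0.0

for k in range(int(200 / dt)):
    t = k * dt
    noise = 0.2 * (np.sin(t) + np.sin(3 * t) + np.sin(7 * t))  # probing noise for PE
    u = -x * Wa + noise                      # actor output applied to the system
    sigma = 2 * x * (f(x) + g(x) * u)        # grad(phi) * xdot
    e = sigma * Wc + x**2 + u**2             # Hamiltonian (Bellman) error
    Wc += dt * (-a_c * sigma / (sigma**2 + 1)**2 * e)  # critic update
    Wa += dt * (-a_a * (Wa - Wc))                      # simplified actor update
    x += dt * (f(x) + g(x) * u)                        # Euler step of the plant

# Analytic optimum for this scalar LQR case: Wc* = Wa* = sqrt(2) - 1 ≈ 0.414.
print("Wc =", Wc, " Wa =", Wa)
```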
3 Actor-critic-identifier SPI algorithm
Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon
WE (2013) A novel actor-critic-identifier architecture for approximate
optimal control of uncertain nonlinear systems. Automatica
49(1):82–92
In the updating laws of the SPI algorithm, the system dynamics $f(x)$ and $g(x)$ are supposed to be known. The following multi-layer dynamic neural network (MLDNN) identifier is used to approximate the system:
$$\dot{\hat{x}} = \hat{W}_{f}^{\mathsf T}\,\sigma\bigl(\hat{V}_{f}^{\mathsf T}\hat{x}\bigr) + g(x)u + \mu,$$
where $\hat{W}_{f}$ and $\hat{V}_{f}$ are weight estimates, and $\mu$ is the robust integral of the sign of the error (RISE) feedback term.
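A much-simplified identifier sketch is given below. It uses a single hidden layer with fixed inner weights and a plain proportional correction standing in for the RISE feedback, so it only illustrates the general structure of the actor-critic-identifier idea, not the paper's MLDNN/RISE design; the plant, gains, and sizes are assumptions.

```python
import numpy as np

# Simplified sketch of an online NN identifier for xdot = f(x) + g(x) u with
# known g:  xhat_dot = W_f^T sigma(V_f^T xhat) + g(x) u + k * (x - xhat).
# Single hidden layer with fixed inner weights; the proportional correction
# stands in for the RISE feedback term.
rng = np.random.default_rng(0)
n, hidden = 2, 10
V_f = rng.normal(size=(n, hidden))      # fixed inner-layer weights (assumption)
W_f = np.zeros((hidden, n))             # adapted outer-layer weights
k, gamma, dt = 5.0, 10.0, 1e-3

def f_true(x):                          # unknown drift, used only to simulate data
    return np.array([x[1], -x[0] - x[1] + 0.5 * np.sin(x[0])])

g = np.array([[0.0], [1.0]])
x = np.array([1.0, 0.0])
xhat = np.zeros(2)

for step in range(int(20 / dt)):
    u = np.array([0.5 * np.sin(0.5 * step * dt)])        # exciting input
    s = np.tanh(V_f.T @ xhat)                            # hidden-layer features
    err = x - xhat                                       # state estimation error
    xhat_dot = W_f.T @ s + (g @ u).ravel() + k * err     # identifier dynamics
    W_f += dt * gamma * np.outer(s, err)                 # gradient-type weight update
    x = x + dt * (f_true(x) + (g @ u).ravel())           # true system (Euler)
    xhat = xhat + dt * xhat_dot

print("final state estimation error:", np.linalg.norm(x - xhat))
```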
4 Integral reinforcement learning SPI algorithm
Vamvoudakis KG, Vrabie D, Lewis FL (2014) Online adaptive algorithm
for optimal control with integral reinforcement learning. Int J Robust
Nonlinear Control 24(17):2686–2710
The identification process additionally increases the computational complexity and extends the learning time. It is more desirable to develop direct online ADP algorithms that learn the critic and the actor using online trajectories. The paper combines integral reinforcement learning (IRL) with their SPI algorithm and proposes the algorithm which we denote as SPI-IRL.
Reviewing the Hamiltonian error defined as
$$e_{H} = Q(x) + \hat{u}^{\mathsf T} R \hat{u} + \hat{W}_{c}^{\mathsf T}\,\nabla\phi(x)\bigl(f(x) + g(x)\hat{u}\bigr),$$
along the system evolution $\dot{x} = f(x) + g(x)\hat{u}$ we have $\nabla\phi(x)\,\dot{x} = \dot{\phi}(x)$. After integrating both sides over the interval $[t-T,\,t]$, the integral Hamiltonian error is defined as
$$e_{IH} = \int_{t-T}^{t}\bigl(Q(x(\tau)) + \hat{u}^{\mathsf T} R \hat{u}\bigr)\,d\tau + \hat{W}_{c}^{\mathsf T}\bigl[\phi(x(t)) - \phi(x(t-T))\bigr].$$
Based on this error, the critic can be updated using the gradient-descent method
$$\dot{\hat{W}}_{c} = -a_{c}\,\frac{\Delta\phi}{\bigl(\Delta\phi^{\mathsf T}\Delta\phi + 1\bigr)^{2}}\,e_{IH}, \qquad \Delta\phi = \phi(x(t)) - \phi(x(t-T)).$$
The actor updating law resembles that of the original SPI algorithm with minor adjustments. It is also proved that, with these critic/actor structures and their updating laws, the system states and the critic/actor errors are UUB.
After the combination of IRL and SPI, the internal drift dynamics $f(x)$ is no longer needed, but the input gain matrix $g(x)$ is still necessary to define the actor.
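The sketch below illustrates, under assumed basis functions and placeholder trajectory samples, how the integral Hamiltonian error and one normalized-gradient critic step could be computed from stored data; none of the numbers come from the surveyed paper.

```python
import numpy as np

# Sketch: forming the integral Hamiltonian error e_IH over a window [t-T, t]
# from sampled trajectory data and taking one normalized gradient step on the
# critic weights. The basis phi and the placeholder samples are assumptions.
def phi(x):
    return np.array([x[0]**2, x[0] * x[1], x[1]**2])

def stage_cost(x, u, R=1.0):
    return float(x @ x + R * u**2)

dt = 0.01
taus = np.arange(0.0, 0.5 + dt, dt)
xs = np.stack([np.array([np.exp(-t), -np.exp(-t)]) for t in taus])  # placeholder x(tau)
us = np.array([-0.5 * x[1] for x in xs])                            # placeholder u(tau)

Wc = np.zeros(3)                                    # critic weights
a_c = 1.0

rho = dt * sum(stage_cost(x, u) for x, u in zip(xs, us))    # integral reward term
dphi = phi(xs[-1]) - phi(xs[0])                     # phi(x(t)) - phi(x(t - T))
e_IH = rho + Wc @ dphi                              # integral Hamiltonian error
Wc = Wc - a_c * dphi / (dphi @ dphi + 1)**2 * e_IH  # normalized gradient step

print("e_IH =", e_IH, " updated Wc =", Wc)
```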
5 Integral reinforcement learning and experience replay SPI algorithm
Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement
learning and experience replay for adaptive optimal control of
partially-unknown constrained-input continuous-time systems. Automatica
50(1):193–202
Reviewing the integral Hamiltonian error, it uses only the current observation to define the error; past observations are also capable of defining errors. Suppose the past data are stored in a history stack with time intervals $[t_{j}-T,\,t_{j}]$, $j = 1,\dots,N$. For the past time $t_{j}$, together with the current critic and actor coefficients, its error is defined by
$$e_{j} = \int_{t_{j}-T}^{t_{j}}\bigl(Q(x(\tau)) + \hat{u}^{\mathsf T} R \hat{u}\bigr)\,d\tau + \hat{W}_{c}^{\mathsf T}\bigl[\phi(x(t_{j})) - \phi(x(t_{j}-T))\bigr].$$
We then have the experience-replay based gradient-descent updating law for the critic:
$$\dot{\hat{W}}_{c} = -a_{c}\,\frac{\Delta\phi(t)}{\bigl(\Delta\phi(t)^{\mathsf T}\Delta\phi(t) + 1\bigr)^{2}}\,e_{IH} \;-\; a_{c}\sum_{j=1}^{N}\frac{\Delta\phi(t_{j})}{\bigl(\Delta\phi(t_{j})^{\mathsf T}\Delta\phi(t_{j}) + 1\bigr)^{2}}\,e_{j}, \qquad \Delta\phi(t_{j}) = \phi(x(t_{j})) - \phi(x(t_{j}-T)).$$
In comparison with the original SPI-IRL algorithm, in which only the current observation defines the updating law, the ER-based algorithm improves data utilization.
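The following sketch illustrates the experience-replay idea with a small hypothetical history stack: the critic step sums the normalized error terms of the current window and of the stored past windows. The stack entries are placeholder numbers, not data from the paper.

```python
import numpy as np

# Sketch of the experience-replay critic update: the gradient step sums the
# normalized error term of the current window and of the N stored past windows.
# Each stack entry holds (rho_j, dphi_j), where rho_j is the integral reward of
# the window and dphi_j = phi(x(t_j)) - phi(x(t_j - T)); values are placeholders.
a_c = 1.0
Wc = np.zeros(3)

current = (0.8, np.array([0.3, -0.1, 0.2]))                 # (rho, dphi) of [t-T, t]
stack = [(0.5, np.array([0.2, 0.0, 0.1])),
         (1.1, np.array([-0.1, 0.4, 0.3])),
         (0.9, np.array([0.25, -0.2, 0.15]))]               # stored past windows

def normalized_grad(rho, dphi, Wc):
    e = rho + Wc @ dphi                                     # integral Hamiltonian error
    return dphi / (dphi @ dphi + 1)**2 * e

step = normalized_grad(*current, Wc) + sum(normalized_grad(r, d, Wc) for r, d in stack)
Wc = Wc - a_c * step
print("updated critic weights:", Wc)
```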
6 Robust ADP algorithm
Jiang Y, Jiang ZP (2014) Robust adaptive dynamic programming and
feedback stabilization of nonlinear systems. IEEE Trans Neural Netw
Learn Syst 25(5):882–893
Review the $i$-th policy iteration. Consider an arbitrary control input $u$ and execute it on the system to produce solutions $x(t)$. Differentiate the value function $V^{(i)}$ along the solutions and utilize the relationship
$$\dot{x} = f(x) + g(x)u^{(i)} + g(x)\bigl(u - u^{(i)}\bigr);$$
we have
$$\dot{V}^{(i)}(x) = \bigl(\nabla V^{(i)}\bigr)^{\mathsf T}\bigl(f + g u^{(i)}\bigr) + \bigl(\nabla V^{(i)}\bigr)^{\mathsf T} g\,\bigl(u - u^{(i)}\bigr) = -r\bigl(x, u^{(i)}\bigr) - 2\bigl(u^{(i+1)}\bigr)^{\mathsf T} R\,\bigl(u - u^{(i)}\bigr).$$
After applying the IRL technique and making some manipulations, we get the following equation over an arbitrary interval $[t,\,t+T]$:
$$V^{(i)}\bigl(x(t+T)\bigr) - V^{(i)}\bigl(x(t)\bigr) = -\int_{t}^{t+T}\Bigl[r\bigl(x, u^{(i)}\bigr) + 2\bigl(u^{(i+1)}\bigr)^{\mathsf T} R\,\bigl(u - u^{(i)}\bigr)\Bigr]\,d\tau.$$
Note that by solving the above equation we get the value function $V^{(i)}$ and the improved policy $u^{(i+1)}$ in one calculation. In addition, computing the equation needs no knowledge of the dynamics, making it completely model-free. After getting the new policy $u^{(i+1)}$, we continue with the next iteration until the converged optimal policy $u^{*}$ is found.
When using the actor-critic structure to approximate the value and the policy functions, the algorithm first collects online data from the system and then calculates the NN coefficients based on the data. The critic and the actor are approximated by NNs independently with
$$\hat{V}^{(i)}(x) = \hat{W}_{c}^{\mathsf T}\,\phi(x), \qquad \hat{u}^{(i+1)}(x) = \hat{W}_{a}^{\mathsf T}\,\psi(x).$$
After inserting them into the above equation, a new error is defined as
$$e(t) = \hat{W}_{c}^{\mathsf T}\bigl[\phi(x(t+T)) - \phi(x(t))\bigr] + \int_{t}^{t+T}\Bigl[r\bigl(x, u^{(i)}\bigr) + 2\bigl(\hat{W}_{a}^{\mathsf T}\psi(x)\bigr)^{\mathsf T} R\,\bigl(u - u^{(i)}\bigr)\Bigr]\,d\tau.$$
By using the Kronecker product, the equation is rewritten into a linear form
$$e(t) = \theta^{\mathsf T}(t)\,\hat{W} - y(t),$$
where $\hat{W}$ stacks the critic and actor coefficients, $\theta(t)$ collects the corresponding data-dependent regressors, and $y(t)$ collects the terms independent of the coefficients. Given a sequence of online data with time intervals $[t_{k},\,t_{k}+T]$, $k = 1,\dots,N$, a vector of errors is defined by stacking the $e(t_{k})$. Based on the least-squares principle, the coefficients are determined by
$$\hat{W} = \bigl(\Theta^{\mathsf T}\Theta\bigr)^{-1}\Theta^{\mathsf T} Y,$$
where $\Theta$ and $Y$ stack $\theta^{\mathsf T}(t_{k})$ and $y(t_{k})$ over the data set. Even though the algorithm is model-free, its implementation is not real-time: the critic and the actor are not tuned along the system evolution; their coefficients are computed and updated in a batch process with a group of online data.
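A sketch of the batch least-squares computation is given below. The basis functions, the scalar input, R = 1, and the synthetic data windows are assumptions; the point is only the structure of building one regression row per data window and solving for the critic and actor coefficients jointly.

```python
import numpy as np

# Sketch of the batch least-squares step: every data window [t_k, t_k + T]
# contributes one row of a linear system in the stacked coefficients
# theta = [Wc; Wa], which is then solved by least squares.
def phi(x):
    return np.array([x[0]**2, x[0] * x[1], x[1]**2])    # critic basis (assumed)

def psi(x):
    return np.array([x[0], x[1]])                       # actor basis (assumed)

R, dt = 1.0, 0.01

def window_row(xs, us, ui):
    """One regression row from samples xs (states), us (applied input) and
    ui (current policy u^(i) evaluated along the window)."""
    dphi = phi(xs[-1]) - phi(xs[0])
    # coefficients of Wa: integral of 2 R (u - u^(i)) psi(x)^T
    actor_part = dt * sum(2 * R * (u - v) * psi(x) for x, u, v in zip(xs, us, ui))
    # right-hand side: minus the integral of the stage cost under u^(i)
    cost = dt * sum(float(x @ x) + R * v**2 for x, v in zip(xs, ui))
    return np.concatenate([dphi, actor_part]), -cost

rows, rhs = [], []
rng = np.random.default_rng(1)
for _ in range(10):                                     # ten placeholder data windows
    xs = rng.normal(size=(50, 2))
    us = rng.normal(size=50)
    ui = -0.5 * xs[:, 1]                                # current policy along the window
    row, b = window_row(xs, us, ui)
    rows.append(row)
    rhs.append(b)

theta, *_ = np.linalg.lstsq(np.vstack(rows), np.array(rhs), rcond=None)
Wc, Wa = theta[:3], theta[3:]
print("critic coefficients:", Wc)
print("actor coefficients:", Wa)
```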
7 Off-policy SPI algorithm
If $u^{(i)}$ equals $u^{*}$, then $V^{(i)} = V^{*}$ and $u^{(i+1)} = u^{*}$. The equation becomes the following integral off-policy HJB equation:
$$V^{*}\bigl(x(t+T)\bigr) - V^{*}\bigl(x(t)\bigr) = -\int_{t}^{t+T}\Bigl[r\bigl(x, u^{*}\bigr) + 2\bigl(u^{*}\bigr)^{\mathsf T} R\,\bigl(u - u^{*}\bigr)\Bigr]\,d\tau.$$
Similarly, we define the critic NN and the actor NN for $V^{*}$ and $u^{*}$ with
$$\hat{V}(x) = \hat{W}_{c}^{\mathsf T}\,\phi(x), \qquad \hat{u}(x) = \hat{W}_{a}^{\mathsf T}\,\psi(x).$$
Then, substituting the NNs into the integral off-policy HJB equation yields a residual error $e$. Under the Kronecker-product representation, $e$ is rewritten in a form that is linear in the critic and actor coefficients. Under the gradient-descent method, the updating laws for the critic and the actor are obtained by minimizing the squared error $e^{2}$. In this way, an off-policy based online algorithm without system dynamics is proposed. We term it the SPI-IRL-OffPo algorithm to indicate that it is an SPI algorithm based on IRL and off-policy techniques.
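To illustrate the resulting updating laws, the sketch below performs one gradient step on the squared integral off-policy error with respect to both the critic and the actor weights. The bases, the scalar input, R = 1, and the placeholder window samples are assumptions for illustration.

```python
import numpy as np

# Sketch of one gradient-descent step on the squared integral off-policy error
#   e = Wc^T [phi(x(t)) - phi(x(t-T))]
#       + int_{t-T}^{t} [ x^T x + R*u_hat^2 + 2*R*u_hat*(u - u_hat) ] dtau,
# with target policy u_hat = Wa^T psi(x) and applied (behavior) input u.
def phi(x):
    return np.array([x[0]**2, x[0] * x[1], x[1]**2])

def psi(x):
    return np.array([x[0], x[1]])

R, dt, a_c, a_a = 1.0, 0.01, 1.0, 0.5
Wc = np.zeros(3)
Wa = np.zeros(2)

rng = np.random.default_rng(2)
xs = rng.normal(size=(50, 2))           # placeholder states over the window
us = rng.normal(size=50)                # placeholder behavior (applied) input

u_hat = np.array([Wa @ psi(x) for x in xs])            # target policy along the window
integrand = (xs**2).sum(axis=1) + R * u_hat**2 + 2 * R * u_hat * (us - u_hat)
dphi = phi(xs[-1]) - phi(xs[0])
e = Wc @ dphi + dt * integrand.sum()                   # integral off-policy error

grad_Wc = dphi * e                                     # gradient of e^2/2 w.r.t. Wc
grad_Wa = dt * sum(2 * R * (u - uh) * psi(x)
                   for x, u, uh in zip(xs, us, u_hat)) * e
Wc = Wc - a_c * grad_Wc / (dphi @ dphi + 1)**2         # normalized critic step
Wa = Wa - a_a * grad_Wa                                # actor step
print("Wc =", Wc, " Wa =", Wa)
```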