Comprehensive comparison of online ADP algorithms for continuous-time optimal control
Zhao Zhao

Zhu, Y., & Zhao, D. (2018). Comprehensive comparison of online ADP algorithms for continuous-time optimal control. Artificial Intelligence Review, 49(4), 531-547.

1 Optimal control and HJB equation

The continuous-time system considered here is described by the affine nonlinear dynamics

$\dot{x} = f(x) + g(x)u,$

where $x \in \mathbb{R}^n$ is the state and $u \in \mathbb{R}^m$ is the control input. The subject of interest is to find a state-feedback control policy that minimizes a prescribed performance criterion. For a policy $u = u(x(t))$, its value function is defined as an infinite-horizon integral cost

$V^u(x(t)) = \int_t^{\infty} \big[ Q(x(\tau)) + u^T(\tau) R u(\tau) \big] \, \mathrm{d}\tau,$

with $Q(x)$ positive definite and $R \succ 0$. An infinitesimal equivalent to the value function definition is the Bellman equation

$Q(x) + u^T R u + (\nabla V^u)^T \big( f(x) + g(x)u \big) = 0, \qquad V^u(0) = 0.$

Define the Hamiltonian function as

$H(x, u, \nabla V) = Q(x) + u^T R u + (\nabla V)^T \big( f(x) + g(x)u \big).$

The sufficient condition for optimality is provided by the famous Hamilton–Jacobi–Bellman (HJB) equation

$\min_u H(x, u, \nabla V^*) = 0.$

According to the stationarity condition $\partial H / \partial u = 0$, the optimal policy is constructed from the optimal value function in the form

$u^*(x) = -\tfrac{1}{2} R^{-1} g^T(x) \nabla V^*(x).$

With the system and the optimal policy, the HJB equation becomes

$Q(x) + (\nabla V^*)^T f(x) - \tfrac{1}{4} (\nabla V^*)^T g(x) R^{-1} g^T(x) \nabla V^*(x) = 0.$

The HJB equation is a nonlinear partial differential equation that is in general intractable to solve directly. An efficient approach is the policy iteration (PI) method, which involves a two-step iteration. Given an admissible policy $u^{(i)}$, the policy evaluation step calculates the value of the current policy from

$Q(x) + (u^{(i)})^T R u^{(i)} + (\nabla V^{(i)})^T \big( f(x) + g(x) u^{(i)} \big) = 0,$

and the policy improvement step produces a new policy

$u^{(i+1)}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \nabla V^{(i)}(x).$

In ADP, a critic NN is constructed to approximate the value function, while an actor NN is constructed to approximate the policy. On the basis of the PI method, numerous online algorithms have been proposed to solve the optimal control of CT systems using ADP; the critic and the actor are tuned based on observations of online trajectories.
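For the linear-quadratic special case, policy evaluation reduces to a Lyapunov equation and the PI above coincides with Kleinman's algorithm, which gives a convenient sanity check of the iteration against the algebraic Riccati equation. The following is a minimal offline sketch; the system matrices A, B and the weights Q, R are illustrative assumptions, not taken from the paper.

```python
# Offline policy iteration for the LQR special case (Kleinman's algorithm).
# Policy evaluation is a Lyapunov solve, policy improvement is K = R^{-1} B^T P.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, -2.0]])   # assumed drift dynamics (stable, so K = 0 is admissible)
B = np.array([[0.0], [1.0]])               # assumed input gain
Q = np.eye(2)
R = np.eye(1)

K = np.zeros((1, 2))                       # initial admissible policy u = -K x
for i in range(20):
    Ac = A - B @ K
    # Policy evaluation: (A - B K)^T P + P (A - B K) + Q + K^T R K = 0
    P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
    # Policy improvement: u^{(i+1)} = -R^{-1} B^T P x
    K_new = np.linalg.solve(R, B.T @ P)
    if np.linalg.norm(K_new - K) < 1e-10:
        break
    K = K_new

print("PI gain:  ", K)
print("ARE gain: ", np.linalg.solve(R, B.T @ solve_continuous_are(A, B, Q, R)))
```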

2 Synchronous policy iteration algorithm

Vamvoudakis KG, Lewis FL (2010) Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 46(5):878–888

In the algorithm, the value function is approximated by the critic NN $\hat{V}(x) = \hat{W}_c^T \phi(x)$, where $\phi(x)$ is a vector of basis functions and $\hat{W}_c$ are the critic weights. The corresponding policy is formulated from the critic as $-\tfrac{1}{2} R^{-1} g^T(x) \nabla\phi^T(x) \hat{W}_c$, but a second NN with its own weights $\hat{W}_a$ is used to represent the actor, $\hat{u}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \nabla\phi^T(x) \hat{W}_a$. The actor is applied to the system to produce online trajectories. The critic is tuned by a normalized gradient-descent law that drives the Bellman (Hamiltonian) error to zero, while the actor law drives the actor weights toward the critic weights and contains additional terms that guarantee stability. With these updating laws, the system states and the critic/actor weight errors are uniformly ultimately bounded (UUB). To guarantee convergence, a persistence of excitation (PE) condition is necessary, requiring that the system be persistently excited, so probing noise is added to the control input.
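The sketch below illustrates the synchronous tuning loop on a simple linear system. It is a simplified reading of the algorithm, not the paper's exact laws: the critic uses the normalized gradient on the Bellman error, while the actor law is reduced to driving $\hat{W}_a$ toward $\hat{W}_c$ and omits the stabilizing terms; the dynamics, features, gains and probing noise are all illustrative assumptions.

```python
# Simplified synchronous actor-critic tuning on an assumed linear system.
import numpy as np

A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.eye(1); Rinv = np.linalg.inv(R)

phi  = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]**2])      # quadratic critic features
dphi = lambda x: np.array([[2*x[0], 0.0],
                           [x[1],   x[0]],
                           [0.0,    2*x[1]]])                  # Jacobian of the features

def policy(x, Wa):
    # u = -1/2 R^{-1} g^T(x) dphi^T(x) W_a
    return -0.5 * Rinv @ B.T @ dphi(x).T @ Wa

dt, a_c, a_a = 1e-3, 50.0, 5.0
x  = np.array([1.0, -1.0])
Wc = np.ones(3); Wa = np.ones(3)

for k in range(200_000):
    t = k * dt
    noise = 0.2 * np.sin(5*t) * np.cos(1.3*t)                  # probing noise for PE
    u = policy(x, Wa) + noise
    xdot = A @ x + B @ u
    sigma = dphi(x) @ xdot                                     # sigma = dphi * (f + g u)
    delta = Wc @ sigma + x @ Q @ x + float(u @ R @ u)          # Bellman (Hamiltonian) error
    Wc += dt * (-a_c * sigma / (1.0 + sigma @ sigma)**2 * delta)  # normalized critic gradient
    Wa += dt * (-a_a * (Wa - Wc))                              # simplified actor law
    x  += dt * xdot

print("critic weights:", Wc)  # for this LQR example the ideal weights are [P11, 2*P12, P22]
```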

3 Actor-critic-identifier SPI algorithm

Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2013) A novel actor- critic-identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica 49(1):82–92

In the updating laws of the SPI algorithm, the system dynamics $f(x)$ and $g(x)$ are supposed to be known. To remove the dependence on the drift dynamics, the following multi-layer dynamic neural network (MLDNN) identifier is introduced to approximate the system:

$\dot{\hat{x}} = \hat{W}_f^T \sigma\big( \hat{V}_f^T \hat{x} \big) + g(x)\hat{u} + \mu,$

where $\hat{W}_f$ and $\hat{V}_f$ are weight estimates of the output and hidden layers, and $\mu$ is the robust integral of the sign of the error (RISE) feedback term.
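A minimal sketch of such an online identifier is given below. It is only in the spirit of the actor-critic-identifier architecture: the RISE feedback is replaced by a simple proportional error-feedback term, the hidden-layer weights are kept fixed, and the output-layer update is a plain gradient law; the dynamics, input signal and gains are assumptions.

```python
# Simplified online NN identifier for the drift dynamics (RISE term replaced by
# proportional error feedback; only the output layer is tuned).
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [-1.0, -2.0]]); B = np.array([[0.0], [1.0]])
f = lambda x: A @ x            # true (unknown) drift, used only to simulate the plant
g = lambda x: B

n_hidden, dt, k_fb, gamma = 8, 1e-3, 10.0, 20.0
V = rng.normal(size=(2, n_hidden))     # fixed random hidden-layer weights
W = np.zeros((n_hidden, 2))            # tuned output-layer weights

x = np.array([1.0, -1.0]); xhat = np.zeros(2)
for k in range(100_000):
    u = np.array([0.3 * np.sin(0.5 * k * dt)])               # exploratory input
    sig = np.tanh(V.T @ x)                                    # hidden-layer features
    xhat_dot = W.T @ sig + g(x) @ u + k_fb * (x - xhat)       # identifier dynamics
    W += dt * gamma * np.outer(sig, x - xhat)                 # gradient-type weight update
    x    += dt * (f(x) + g(x) @ u)
    xhat += dt * xhat_dot

print("final identification error:", np.linalg.norm(x - xhat))
```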

4 Integral reinforcement learning SPI algorithm

Vamvoudakis KG, Vrabie D, Lewis FL (2014) Online adaptive algorithm for optimal control with integral reinforcement learning. Int J Robust Nonlinear Control 24(17):2686–2710

The identification process additionally increases the computational complexity and extends the learning time. It is more desirable to develop direct online ADP algorithms that learn the critic and the actor using online trajectories. The paper combines integral reinforcement learning (IRL) with the SPI algorithm and proposes an algorithm that we denote as SPI-IRL.

Reviewing the Hamiltonian error along the system evolution and integrating both sides over an interval $[t-T, t]$, the integral Hamiltonian error is defined as

$e(t) = \int_{t-T}^{t} \big[ Q(x) + \hat{u}^T R \hat{u} \big] \, \mathrm{d}\tau + \hat{W}_c^T \Delta\phi(t), \qquad \Delta\phi(t) = \phi(x(t)) - \phi(x(t-T)).$

Based on this error, the critic can be updated using the normalized gradient-descent method

$\dot{\hat{W}}_c = -a_c \frac{\Delta\phi(t)}{\big( 1 + \Delta\phi^T(t) \Delta\phi(t) \big)^2} e(t).$
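The following sketch makes the two ingredients of this critic update concrete: the integral Bellman error built from the integrated running cost and the critic values at the window endpoints, and the normalized gradient step. The quadratic feature map, the gain and the sample data are illustrative assumptions.

```python
# IRL critic update: the drift dynamics f never appears; only the integrated cost
# over the window and the critic features at the two endpoints are needed.
import numpy as np

phi = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]**2])   # critic features, V_hat = Wc^T phi(x)

def irl_error(Wc, x_prev, x_now, integral_cost):
    """e = int_{t-T}^{t} (Q(x) + u^T R u) dtau + V_hat(x(t)) - V_hat(x(t-T))."""
    return integral_cost + Wc @ (phi(x_now) - phi(x_prev))

def critic_update(Wc, x_prev, x_now, integral_cost, a_c=1.0):
    """One normalized gradient-descent step on e^2 / 2 with respect to Wc."""
    dphi = phi(x_now) - phi(x_prev)
    e = irl_error(Wc, x_prev, x_now, integral_cost)
    return Wc - a_c * dphi / (1.0 + dphi @ dphi)**2 * e

# usage with one assumed observed window of length T
Wc = critic_update(np.ones(3), x_prev=np.array([1.0, -1.0]),
                   x_now=np.array([0.8, -0.6]), integral_cost=0.15)
print(Wc)
```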

The actor updating law resembles that of the original SPI algorithm, with only minor adjustments. It is also proved that, with these critic/actor structures and updating laws, the system states and the critic/actor errors are UUB.

After combining IRL with SPI, the internal drift dynamics $f(x)$ is no longer needed, but the input gain matrix $g(x)$ is still necessary to define the actor.

5 Integral reinforcement learning and experience replay SPI algorithm

Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems. Automatica 50(1):193–202

Reviewing the integral Hamiltonian error, it uses only the instantaneous observation to define the error, yet past observations are also capable of defining errors. Suppose the past data are stored in a history stack over the time intervals $[t_k - T, t_k]$, $k = 1, \dots, N$. For a past time $t_k$, together with the current critic and actor coefficients, its error is defined by

$e_k = \int_{t_k - T}^{t_k} \big[ Q(x) + \hat{u}^T R \hat{u} \big] \, \mathrm{d}\tau + \hat{W}_c^T \Delta\phi_k, \qquad \Delta\phi_k = \phi(x(t_k)) - \phi(x(t_k - T)).$

The experience-replay-based gradient-descent updating law for the critic then sums the normalized gradient contributions of the current observation and of all stored samples:

$\dot{\hat{W}}_c = -a_c \frac{\Delta\phi(t)}{\big( 1 + \Delta\phi^T(t) \Delta\phi(t) \big)^2} e(t) - a_c \sum_{k=1}^{N} \frac{\Delta\phi_k}{\big( 1 + \Delta\phi_k^T \Delta\phi_k \big)^2} e_k.$

In comparison to the original SPI-IRL algorithm, in which only the current observation defines the updating law, the ER-based algorithm improves the data utilization.
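A sketch of the replayed critic step follows, reusing the quadratic feature map from the previous sketch. Each history-stack entry stores the two endpoint states and the integrated cost of its window; the stack contents and the gain are assumptions for illustration.

```python
# Experience-replay critic step: the current window and all stored windows each
# contribute a normalized gradient term on their own integral Bellman error.
import numpy as np

phi = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]**2])

def er_critic_update(Wc, current, stack, a_c=1.0):
    """current and each stack entry are tuples (x_prev, x_now, integral_cost)."""
    step = np.zeros_like(Wc)
    for x_prev, x_now, rho in [current] + stack:
        dphi = phi(x_now) - phi(x_prev)
        e = rho + Wc @ dphi
        step -= a_c * dphi / (1.0 + dphi @ dphi)**2 * e
    return Wc + step

stack = [(np.array([1.0, -1.0]), np.array([0.8, -0.6]), 0.15),
         (np.array([0.8, -0.6]), np.array([0.7, -0.3]), 0.09)]
current = (np.array([0.7, -0.3]), np.array([0.6, -0.2]), 0.05)
print(er_critic_update(np.ones(3), current, stack))
```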

6 Robust ADP algorithm

Jiang Y, Jiang ZP (2014) Robust adaptive dynamic programming and feedback stabilization of nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(5):882–893

Review the $i$-th policy iteration. Consider an arbitrary control input $u$ and execute it on the system to produce solutions $x(t)$. Differentiating the value function $V^{(i)}$ along these solutions and utilizing the relationships $(\nabla V^{(i)})^T \big( f + g u^{(i)} \big) = -Q(x) - (u^{(i)})^T R u^{(i)}$ and $(\nabla V^{(i)})^T g = -2 (u^{(i+1)})^T R$,

we have

$\dot{V}^{(i)} = -Q(x) - (u^{(i)})^T R u^{(i)} - 2 (u^{(i+1)})^T R \big( u - u^{(i)} \big).$

After applying the IRL technique, i.e. integrating over an arbitrary interval $[t, t+T]$, we get the following equation:

$V^{(i)}(x(t+T)) - V^{(i)}(x(t)) = \int_t^{t+T} \Big[ -Q(x) - (u^{(i)})^T R u^{(i)} - 2 (u^{(i+1)})^T R \big( u - u^{(i)} \big) \Big] \, \mathrm{d}\tau.$

Note that by solving the above equation, we get the value function $V^{(i)}$ and the improved policy $u^{(i+1)}$ in one calculation. In addition, computing the equation needs no knowledge of the dynamics, making it completely model-free. After getting the new policy $u^{(i+1)}$, we continue to the next iteration until the converged optimal policy $u^*$ is found.

When using the actor-critic structure to approximate the value and the policy functions, the algorithm first collects online data from the system and then calculates the NN coefficients based on those data. The critic and the actor are approximated by independent NNs, $\hat{V}(x) = \hat{W}_c^T \phi(x)$ and $\hat{u}(x) = \hat{W}_a^T \psi(x)$. After inserting them into the above equation, a new error is defined for each data interval. By using the Kronecker product, the equation is rewritten into a form that is linear in the stacked unknowns $\big[ \hat{W}_c ;\ \mathrm{vec}(\hat{W}_a) \big]$. Given a sequence of online data over the time intervals $[t_k, t_k + T]$, $k = 1, \dots, N$, a vector of errors is defined, and based on the least-squares principle the coefficients are determined by minimizing its norm. Even though the algorithm is model-free, its implementation is not in a real-time manner: the critic and the actor are not tuned along the system evolution; their coefficients are computed and updated in a batch process with a group of online data.
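The batch step can be sketched as an ordinary least-squares solve once the per-window integrals have been accumulated from online data. The shapes, the helper name and the toy numbers below are illustrative assumptions; the point is only the structure: one row per window, stacked unknowns $[\hat{W}_c;\ \mathrm{vec}(\hat{W}_a)]$.

```python
# Batch least-squares step of the (assumed) Kronecker-product rewriting:
# each data window contributes one linear equation in [W_c ; vec(W_a)].
import numpy as np

def radp_batch_solve(dphi_rows, kron_rows, cost_integrals):
    """
    dphi_rows[k]      : phi(x(t_k + T)) - phi(x(t_k))                  (critic part)
    kron_rows[k]      : 2 * integral of psi(x) kron R (u - u_i) dtau   (actor part)
    cost_integrals[k] : integral of -Q(x) - u_i^T R u_i dtau           (right-hand side)
    Returns least-squares estimates of W_c and vec(W_a).
    """
    Theta = np.hstack([np.asarray(dphi_rows), np.asarray(kron_rows)])
    Xi = np.asarray(cost_integrals)
    w, *_ = np.linalg.lstsq(Theta, Xi, rcond=None)
    n_c = np.asarray(dphi_rows).shape[1]
    return w[:n_c], w[n_c:]

# toy usage: 6 windows, 3 critic features, 2 actor unknowns (random placeholders)
rng = np.random.default_rng(1)
Wc, Wa_vec = radp_batch_solve(rng.normal(size=(6, 3)),
                              rng.normal(size=(6, 2)),
                              rng.normal(size=6))
print(Wc, Wa_vec)
```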

7 Off-policy SPI algorithm

If $u^{(i)}$ equals the optimal policy $u^*$, then $V^{(i)} = V^*$ and $u^{(i+1)} = u^*$. The equation in Sect. 6 then becomes the following integral off-policy HJB equation:

$V^*(x(t)) - V^*(x(t-T)) = \int_{t-T}^{t} \Big[ -Q(x) - (u^*)^T R u^* - 2 (u^*)^T R \big( u - u^* \big) \Big] \, \mathrm{d}\tau.$

Similarly, we define the critic NN and the actor NN for $V^*$ and $u^*$ as $\hat{V}(x) = \hat{W}_c^T \phi(x)$ and $\hat{u}(x) = \hat{W}_a^T \psi(x)$. Inserting them into the off-policy HJB equation yields an integral error $e$ for each interval. Under the Kronecker-product representation, $e$ is rewritten into a form that is linear in $\hat{W}_c$ and $\mathrm{vec}(\hat{W}_a)$, and the gradient-descent method gives the updating laws for the critic and the actor. In this way an off-policy online algorithm that requires no knowledge of the system dynamics is obtained. We term it the SPI-IRL-OffPo algorithm, to indicate that it is an SPI algorithm based on the IRL and off-policy techniques.
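As a concrete illustration, the sketch below performs one gradient-descent step on the squared off-policy integral error for both the critic and the actor, assuming the standard quadratic running cost, linear-in-parameter critic/actor NNs and logged window data sampled at a fixed step; normalization factors and any stabilizing terms of the published tuning laws are omitted.

```python
# One (simplified) off-policy gradient step: the behavior input u generates the data,
# while the actor output u_hat = Wa^T psi(x) is the target policy being learned.
import numpy as np

phi = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]**2])   # critic features
psi = lambda x: np.array([x[0], x[1]])                    # actor features
Q = np.eye(2); R = np.eye(1)

def offpolicy_step(Wc, Wa, xs, us, dt, a_c=1.0, a_a=0.5):
    """xs: (N, 2) states over one window, us: (N, 1) behavior inputs, dt: sampling step."""
    integral_cost = 0.0
    grad_a = np.zeros(Wa.size)
    for x, u in zip(xs, us):
        uh = Wa.T @ psi(x)                                          # actor (target-policy) output
        integral_cost += dt * (x @ Q @ x + 2 * uh @ R @ u - uh @ R @ uh)
        grad_a += dt * 2 * np.kron(psi(x), R @ (u - uh))            # d e / d vec(Wa)
    e = Wc @ (phi(xs[-1]) - phi(xs[0])) + integral_cost             # off-policy integral error
    Wc_new = Wc - a_c * e * (phi(xs[-1]) - phi(xs[0]))
    Wa_new = (Wa.ravel() - a_a * e * grad_a).reshape(Wa.shape)
    return Wc_new, Wa_new

# usage with an assumed logged window of 50 samples
rng = np.random.default_rng(2)
xs, us = rng.normal(size=(50, 2)), rng.normal(size=(50, 1))
Wc, Wa = offpolicy_step(np.ones(3), np.zeros((2, 1)), xs, us, dt=0.01)
print(Wc, Wa.ravel())
```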