Adaptive Dynamic Programming for Control
Zhao Zhao

Liu, D., Xue, S., Zhao, B., Luo, B., & Wei, Q. (2020). Adaptive dynamic programming for control: A survey and recent advances. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 51(1), 142-160.

A. ADP for Optimal State Regulation

The optimal state regulation problem is to keep the state near the equilibrium point while minimizing the value function of the system.

Nonlinear dynamic systems can generally be divided into affine and nonaffine systems, described by

$$\dot{x}(t) = f\bigl(x(t)\bigr) + g\bigl(x(t)\bigr)u(t)$$

and

$$\dot{x}(t) = F\bigl(x(t), u(t)\bigr),$$

respectively, where $x(t)$ is the state and $u(t)$ is the control input. The general state regulation problem can be divided into the finite-time state regulator and the infinite-time state regulator. The value function $V$ of the finite-time state regulator can be written as

$$V\bigl(x(t)\bigr) = \int_{t}^{t_{f}} U\bigl(x(\tau), u(\tau)\bigr)\,\mathrm{d}\tau,$$

where $U(\cdot,\cdot)$ is the utility function. When the terminal time $t_{f}$ approaches infinity, i.e., $t_{f} \to \infty$, the value function of the infinite-time state regulator is obtained:

$$V\bigl(x(t)\bigr) = \int_{t}^{\infty} U\bigl(x(\tau), u(\tau)\bigr)\,\mathrm{d}\tau.$$
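
As a quick numerical illustration of the infinite-time value function, the sketch below evaluates $V(x_0)$ for a fixed admissible policy by integrating the utility along the closed-loop trajectory and truncating the horizon; the scalar dynamics, utility, and policy are assumed examples, not taken from the survey.

```python
# Numerically approximate V(x0) = ∫_0^∞ U(x, u) dτ for an assumed scalar affine system
# x_dot = f(x) + g(x) u under a fixed admissible policy u(x). The dynamics, utility,
# and policy below are illustrative assumptions, not the survey's example.
from scipy.integrate import solve_ivp

f = lambda x: -x + 0.5 * x**3      # assumed drift dynamics f(x)
g = lambda x: 1.0                  # assumed input gain g(x)
policy = lambda x: -2.0 * x        # a fixed admissible (stabilizing) control u(x)
U = lambda x, u: x**2 + u**2       # quadratic utility U(x, u)

def closed_loop(t, z):
    x, _ = z                       # z = [state, accumulated cost]
    u = policy(x)
    return [f(x) + g(x) * u, U(x, u)]

x0 = 0.8
sol = solve_ivp(closed_loop, [0.0, 50.0], [x0, 0.0], rtol=1e-8, atol=1e-10)
print("approximate infinite-horizon value V(x0):", sol.y[1, -1])
```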

Problem 1: For affine/nonaffine systems with finite/infinite-time value functions, the optimal state regulation problem is to design a learning control structure that gradually explores the optimal control policy which minimizes the value function and stabilizes the closed-loop system.

The Hamiltonian of the affine system is designed as

$$H\bigl(x, u, \nabla V\bigr) = U(x, u) + \bigl(\nabla V(x)\bigr)^{\top}\bigl(f(x) + g(x)u\bigr).$$

Then, the HJB equation is presented as

$$\min_{u} H\bigl(x, u, \nabla V^{*}\bigr) = 0,$$

where $V^{*}$ is the optimal value of $V$. Then, the optimal control function can be obtained by

$$u^{*}(x) = \arg\min_{u} H\bigl(x, u, \nabla V^{*}\bigr).$$

For the affine system, when the control energy function is quadratic, i.e., $U(x, u) = Q(x) + u^{\top}Ru$ with $R \succ 0$, the stationarity condition $\partial H/\partial u = 0$ yields

$$u^{*}(x) = -\frac{1}{2}R^{-1}g^{\top}(x)\nabla V^{*}(x).$$

PI Algorithm: The PI algorithm for the continuous-time system is shown in Algorithm 1.

Algorithm 1: PI for the Continuous-Time System

Step 1 Initialization:

Set $i = 0$. Select an initial admissible control policy $u^{(0)}$.

Step 2 Evaluation:

The value function $V^{(i)}$ under the control $u^{(i)}$ is obtained according to

$$U\bigl(x, u^{(i)}(x)\bigr) + \bigl(\nabla V^{(i)}(x)\bigr)^{\top}\bigl(f(x) + g(x)u^{(i)}(x)\bigr) = 0, \qquad V^{(i)}(0) = 0.$$

Step 3 Improvement:

The updated control policy is obtained according to

$$u^{(i+1)}(x) = \arg\min_{u} H\bigl(x, u, \nabla V^{(i)}\bigr).$$

More specifically, for the quadratic control energy function,

$$u^{(i+1)}(x) = -\frac{1}{2}R^{-1}g^{\top}(x)\nabla V^{(i)}(x).$$

Step 4 Judgment:

If preset conditions for convergence are not met, set $i = i + 1$ and go back to Step 2.

Step 5 Stop:

Obtain the optimal control policy $u^{*}$ and the optimal value function $V^{*}$.

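For the linear quadratic special case $\dot{x} = Ax + Bu$ with $U(x, u) = x^{\top}Qx + u^{\top}Ru$, Algorithm 1 reduces to Kleinman's policy iteration: the evaluation of Step 2 becomes a Lyapunov equation and the improvement of Step 3 becomes a gain update. The sketch below runs this reduction on an assumed system (all matrices and the initial gain are illustrative, not taken from the survey) and checks the result against the Riccati solution.

```python
# Policy iteration (Algorithm 1) specialized to the linear quadratic case:
# x_dot = A x + B u,  U(x, u) = x'Qx + u'Ru,  V_i(x) = x'P_i x,  u_i = -K_i x.
# Evaluation (Step 2) is a Lyapunov equation; improvement (Step 3) is K_{i+1} = R^{-1} B' P_i.
# The system below is a hypothetical example, not taken from the survey.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, 2.0]])   # hypothetical unstable plant
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = np.array([[0.0, 5.0]])                # initial admissible (stabilizing) gain
for i in range(20):
    Ak = A - B @ K                        # closed-loop matrix under u = -K x
    # Step 2 (evaluation): Ak' P + P Ak + Q + K'RK = 0
    P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
    # Step 3 (improvement): u_{i+1} = -R^{-1} B' P x
    K_new = np.linalg.solve(R, B.T @ P)
    if np.linalg.norm(K_new - K) < 1e-9:  # Step 4 (judgment)
        break
    K = K_new

P_opt = solve_continuous_are(A, B, Q, R)  # check against the ARE solution
print("PI gain:", K, " ARE gain:", np.linalg.solve(R, B.T @ P_opt))
```
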
Integral Reinforcement Learning: Traditional policy evaluation requires the knowledge of system dynamics. In the case of unknown internal dynamics, many algorithms cannot be used directly. To deal with this problem, IRL was developed.

For time $t$ and time interval $T > 0$, it can be found that

$$V\bigl(x(t)\bigr) = \int_{t}^{t+T} U\bigl(x(\tau), u(\tau)\bigr)\,\mathrm{d}\tau + V\bigl(x(t+T)\bigr),$$

so the value function can be evaluated from data measured over $[t, t+T]$ without using the drift dynamics $f(x)$.

Off-Policy IRL: Off-policy learning is often employed when the accurate system model is unknown. Compared with on-policy learning, it uses system data generated by an arbitrary behavior control $u$ to solve the HJB equation. The affine system can be rewritten as

$$\dot{x} = f(x) + g(x)u^{(i)}(x) + g(x)\bigl(u - u^{(i)}(x)\bigr),$$

where $u^{(i)}$ is the target policy to be evaluated. Combining the policy-evaluation equation of the PI algorithm with this rewritten dynamics, the derivative of the value function $V^{(i)}$ along the system trajectory is

$$\dot{V}^{(i)}(x) = \bigl(\nabla V^{(i)}(x)\bigr)^{\top}\bigl(f(x) + g(x)u^{(i)}\bigr) + \bigl(\nabla V^{(i)}(x)\bigr)^{\top}g(x)\bigl(u - u^{(i)}\bigr) = -U\bigl(x, u^{(i)}\bigr) - 2\bigl(u^{(i+1)}\bigr)^{\top}R\bigl(u - u^{(i)}\bigr).$$

By integrating both sides on the interval $[t, t+T]$, the following equation is used in the off-policy IRL scheme:

$$V^{(i)}\bigl(x(t+T)\bigr) - V^{(i)}\bigl(x(t)\bigr) = \int_{t}^{t+T}\Bigl[-U\bigl(x, u^{(i)}\bigr) - 2\bigl(u^{(i+1)}\bigr)^{\top}R\bigl(u - u^{(i)}\bigr)\Bigr]\mathrm{d}\tau.$$

It is found that the mathematical system model is not explicitly included; it is implicitly contained in the measured data.
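
To make this concrete, the sketch below applies the IRL evaluation idea to a hypothetical linear system with a fixed stabilizing gain: the parameters of the quadratic value function $V(x) = x^{\top}Px$ are recovered by least squares from sampled states and integrated utilities, while the system matrix $A$ is used only to simulate the data, never in the regression. All numerical values are illustrative assumptions.

```python
# IRL policy evaluation from data: for a fixed stabilizing policy u = -K x, fit P in
# V(x) = x'Px from the integral Bellman relation
#   x(t)'P x(t) - x(t+T)'P x(t+T) = ∫_t^{t+T} (x'Qx + u'Ru) dτ.
# A is used only to simulate trajectories; it never enters the regression.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[0.0, 1.0], [-2.0, -1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
K = np.array([[1.0, 1.0]])               # fixed stabilizing gain to be evaluated

def sim(t, z):                            # closed-loop state + running utility
    x = z[:2]
    u = -K @ x
    return np.concatenate([A @ x + B @ u, [x @ Q @ x + u @ R @ u]])

rows, rhs, T = [], [], 0.5
rng = np.random.default_rng(0)
for _ in range(20):                       # 20 data segments from random initial states
    x0 = rng.uniform(-1, 1, 2)
    sol = solve_ivp(sim, [0.0, T], np.concatenate([x0, [0.0]]), rtol=1e-9)
    x1, cost = sol.y[:2, -1], sol.y[2, -1]
    rows.append(np.outer(x0, x0).ravel() - np.outer(x1, x1).ravel())
    rhs.append(cost)

p_vec, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
P_data = p_vec.reshape(2, 2)
P_data = 0.5 * (P_data + P_data.T)        # symmetrize the least-squares estimate

Ak = A - B @ K                            # model-based check via the Lyapunov equation
P_model = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
print("data-based P:\n", P_data, "\nmodel-based P:\n", P_model)
```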

Function Approximation Based on NN: NNs have been used for function approximation in the implementation of ADP algorithms.

Assumption: A continuous function $h(x)$ can be expressed by an NN as $h(x) = W^{\top}\sigma(x) + \varepsilon(x)$, where $W$ is the ideal weight vector, $\sigma(x)$ is the activation (basis) function vector, and $\varepsilon(x)$ is the bounded reconstruction error. According to this assumption, the optimal value function and the optimal control are expressed as

$$V^{*}(x) = W_{c}^{\top}\sigma_{c}(x) + \varepsilon_{c}(x), \qquad u^{*}(x) = W_{a}^{\top}\sigma_{a}(x) + \varepsilon_{a}(x).$$

Employing a critic NN and an actor NN to approximate the optimal value function and the optimal control, we have

$$\hat{V}(x) = \hat{W}_{c}^{\top}\sigma_{c}(x), \qquad \hat{u}(x) = \hat{W}_{a}^{\top}\sigma_{a}(x).$$

The weight estimation errors of the critic NN and the actor NN are given as

$$\tilde{W}_{c} = W_{c} - \hat{W}_{c}, \qquad \tilde{W}_{a} = W_{a} - \hat{W}_{a}.$$

According to the relationship between the optimal value function and the optimal control derived from the HJB equation, the actor network can be omitted for affine nonlinear systems, and only a single-critic network is used. This structure is called the single-network adaptive critic (SNAC). In this case, the control is computed directly from the critic as

$$\hat{u}(x) = -\frac{1}{2}R^{-1}g^{\top}(x)\nabla\sigma_{c}(x)^{\top}\hat{W}_{c},$$

and only the critic weight estimation error $\tilde{W}_{c} = W_{c} - \hat{W}_{c}$ remains. A single network simplifies the analysis and reduces the amount of computation. However, when complete knowledge of the system is unavailable, the actor network needs to be retained.

The residual error is defined as

$$e_{c} = \hat{W}_{c}^{\top}\nabla\sigma_{c}(x)\bigl(f(x) + g(x)\hat{u}(x)\bigr) + U\bigl(x, \hat{u}(x)\bigr),$$

i.e., the amount by which the approximate value function fails to satisfy the Bellman/HJB equation. The objective function to be minimized is

$$E_{c} = \frac{1}{2}e_{c}^{2}.$$

A critic NN update rule based on least squares is given by

$$\hat{W}_{c} = -\Bigl(\textstyle\sum_{k}\rho_{k}\rho_{k}^{\top}\Bigr)^{-1}\sum_{k}\rho_{k}\,U\bigl(x_{k}, \hat{u}(x_{k})\bigr), \qquad \rho_{k} = \nabla\sigma_{c}(x_{k})\bigl(f(x_{k}) + g(x_{k})\hat{u}(x_{k})\bigr).$$

An actor NN learning algorithm based on gradient descent is given by

$$\hat{W}_{a} \leftarrow \hat{W}_{a} - \alpha_{a}\,\sigma_{a}(x)\,e_{a}^{\top}, \qquad e_{a} = \hat{W}_{a}^{\top}\sigma_{a}(x) + \frac{1}{2}R^{-1}g^{\top}(x)\nabla\sigma_{c}(x)^{\top}\hat{W}_{c},$$

where $\alpha_{a} > 0$ is the learning rate.
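
A minimal sketch of these two update rules, assuming a scalar affine system, polynomial bases, and an initial admissible policy (all illustrative choices, not the survey's example): the critic weights come from a batch least-squares fit of the HJB residual, and the actor weights follow gradient steps toward the control indicated by the critic.

```python
# One cycle of critic/actor updates for an assumed scalar affine system
# x_dot = f(x) + g(x) u with U(x, u) = x^2 + u^2; bases and the initial
# admissible policy are illustrative choices, not the survey's.
import numpy as np

f = lambda x: -x + 0.25 * x**3
g = lambda x: 1.0
u0 = lambda x: -x                               # initial admissible policy
R_inv = 1.0                                     # R = 1 in U(x, u) = x^2 + u'Ru

sigma_c  = lambda x: np.array([x**2, x**4])     # critic basis, V_hat = Wc' sigma_c(x)
dsigma_c = lambda x: np.array([2*x, 4*x**3])    # its gradient w.r.t. x
sigma_a  = lambda x: np.array([x, x**3])        # actor basis, u_hat = Wa' sigma_a(x)

xs = np.linspace(-1.0, 1.0, 41)                 # sampled states for the batch update

# Critic update (least squares on the residual e_c = Wc' dsigma_c (f + g u) + U):
rho = np.array([dsigma_c(x) * (f(x) + g(x) * u0(x)) for x in xs])
util = np.array([x**2 + u0(x)**2 for x in xs])
Wc, *_ = np.linalg.lstsq(rho, -util, rcond=None)

# Actor update (gradient descent toward the critic-indicated control
# u_c(x) = -1/2 R^{-1} g(x) dsigma_c(x)' Wc):
Wa, lr = np.zeros(2), 0.1
for _ in range(200):
    grad = np.zeros(2)
    for x in xs:
        e_a = Wa @ sigma_a(x) + 0.5 * R_inv * g(x) * (dsigma_c(x) @ Wc)
        grad += sigma_a(x) * e_a
    Wa -= lr * grad / len(xs)

print("critic weights:", Wc, " actor weights:", Wa)
```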

B. ADP for Optimal Output Regulation

Problem 2: The optimal output regulation problem is to design a control input such that the output approaches zero while minimizing the value function.

Optimal Output Regulation for Linear Systems:

Take a linear continuous-time system as an example:

$$\dot{x}(t) = Ax(t) + Bu(t), \qquad y(t) = Cx(t).$$

For a time interval $T > 0$ and any time $t$, an online learning algorithm for a suboptimal output-feedback controller based on the IRL technique is presented in Algorithm 2.

Algorithm 2: LQR Based on IRL for the Linear System

Step 1 Initialization:

Set $i = 0$.

Select an initial stabilizing gain $K_{0}$.

Step 2 Evaluation:

The matrix $P_{i}$ of the quadratic value function $V^{(i)}(x) = x^{\top}P_{i}x$ under the control $u^{(i)} = -K_{i}x$ is obtained from measured data according to

$$x^{\top}(t)P_{i}x(t) = \int_{t}^{t+T}\bigl(y^{\top}Q_{y}y + u^{(i)\top}Ru^{(i)}\bigr)\mathrm{d}\tau + x^{\top}(t+T)P_{i}x(t+T).$$

Step 3 Improvement:

The updated control gain is obtained according to

$$K_{i+1} = R^{-1}B^{\top}P_{i}.$$

Step 4 Judgment:

If preset conditions for convergence are not met, set $i = i + 1$ and go back to Step 2.

Step 5 Stop:

Obtain the optimal control gain $K^{*}$ and the optimal value function $V^{*}$.

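As a compact instance of Algorithm 2, the sketch below iterates the data-based evaluation and the gain-update steps on an assumed linear system. As is common in IRL formulations, the input matrix $B$ is assumed known for the improvement step, while $A$ is used only to simulate the measured data; all matrices, the integration horizon, and the initial gain are illustrative.

```python
# A compact instance of Algorithm 2: IRL-based policy iteration for an assumed
# linear system. Each evaluation fits P_i from trajectory data, each improvement
# uses K_{i+1} = R^{-1} B' P_i; B is assumed known, A only generates the data.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [1.0, -0.5]])          # hypothetical plant (unknown to the learner)
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])

def evaluate(K, T=0.4, segments=30, seed=0):
    """Fit P of V(x) = x'Px for the policy u = -Kx from integral Bellman data."""
    rng = np.random.default_rng(seed)
    rows, rhs = [], []
    def sim(t, z):
        x = z[:2]
        u = -K @ x
        return np.concatenate([A @ x + B @ u, [x @ Q @ x + u @ R @ u]])
    for _ in range(segments):
        x0 = rng.uniform(-1, 1, 2)
        sol = solve_ivp(sim, [0.0, T], np.concatenate([x0, [0.0]]), rtol=1e-9)
        x1, cost = sol.y[:2, -1], sol.y[2, -1]
        rows.append(np.outer(x0, x0).ravel() - np.outer(x1, x1).ravel())
        rhs.append(cost)
    p, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    P = p.reshape(2, 2)
    return 0.5 * (P + P.T)

K = np.array([[2.0, 2.0]])                        # Step 1: initial stabilizing gain
for i in range(10):
    P = evaluate(K)                               # Step 2: data-based evaluation
    K_new = np.linalg.solve(R, B.T @ P)           # Step 3: improvement
    if np.linalg.norm(K_new - K) < 1e-6:          # Step 4: judgment
        break
    K = K_new

print("IRL gain:", K)
print("ARE gain:", np.linalg.solve(R, B.T @ solve_continuous_are(A, B, Q, R)))
```
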
The method was applied to both the linear quadratic regulator (LQR) and the linear quadratic tracking (LQT) problem by using a discounted value function of the form

$$V\bigl(x(t)\bigr) = \int_{t}^{\infty} e^{-\gamma(\tau - t)}\,U\bigl(x(\tau), u(\tau)\bigr)\,\mathrm{d}\tau, \qquad \gamma > 0.$$

ADP for Optimal Tracking Control:

The optimal tracking problem has attracted increasing attention in the control field. By constructing an augmented system from the tracking error and the desired trajectory, the optimal tracking control problem is transformed into an optimal regulation problem.

Problem 3: The optimal tracking control problem is to design a control policy to make the actual output of the system track the desired trajectory and minimize the preset value function.

Consider the general value function

$$V(t) = \int_{t}^{\infty} e^{-\gamma(\tau - t)}\,U\bigl(e(\tau), u(\tau)\bigr)\,\mathrm{d}\tau,$$

where $\gamma \ge 0$ is a discount factor and $e$ is the tracking error defined below. The desired reference trajectory dynamics is defined as

$$\dot{x}_{d}(t) = \psi\bigl(x_{d}(t)\bigr).$$

The tracking error is

$$e(t) = x(t) - x_{d}(t).$$

Then, we have

$$\dot{e}(t) = f\bigl(e + x_{d}\bigr) + g\bigl(e + x_{d}\bigr)u - \psi(x_{d}).$$

The augmented system state is defined as $X = [e^{\top}, x_{d}^{\top}]^{\top}$, and the augmented system dynamics are further obtained as

$$\dot{X} = F(X) + G(X)u, \qquad F(X) = \begin{bmatrix} f(e + x_{d}) - \psi(x_{d}) \\ \psi(x_{d}) \end{bmatrix}, \quad G(X) = \begin{bmatrix} g(e + x_{d}) \\ 0 \end{bmatrix}.$$

The value function associated with the augmented system is $V\bigl(X(t)\bigr)$, written in terms of the augmented state. The Hamiltonian is given by

$$H\bigl(X, u, \nabla V\bigr) = U(e, u) - \gamma V(X) + \bigl(\nabla V(X)\bigr)^{\top}\bigl(F(X) + G(X)u\bigr).$$

The HJB equation of the tracking control problem is derived based on the Bellman principle of optimality:

$$\min_{u} H\bigl(X, u, \nabla V^{*}\bigr) = 0.$$

According to the stationarity condition, the relationship between the optimal control and the optimal value function can be obtained. For the quadratic energy function, i.e., $U(e, u) = e^{\top}Qe + u^{\top}Ru$, we have

$$u^{*}(X) = -\frac{1}{2}R^{-1}G^{\top}(X)\nabla V^{*}(X).$$

In the optimal tracking control problem, it is shown that the discount factor $\gamma$ in the value function is needed. Since the control input includes feedforward and feedback components, the feedforward control input may make the value function $V$ unbounded when the reference trajectory does not converge to zero. The employment of the discount factor ensures that the value function $V$ is bounded, thereby effectively avoiding this problem.
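
As an illustration of the augmentation and of why the discount factor is needed, the sketch below builds the augmented system for a hypothetical linear plant tracking a sinusoidal reference generated by a marginally stable exosystem: without discounting the tracking cost would not converge, while with $\gamma > 0$ the discounted Riccati equation of the augmented system yields a combined feedback/feedforward gain. All matrices and the value of $\gamma$ are assumed; the shift of the augmented matrix by $-\gamma/2\,I$ is the standard equivalence for the discounted case.

```python
# Discounted LQT via the augmented system for an assumed linear plant tracking a
# sinusoidal reference. The discounted ARE of the augmented dynamics is solved by
# shifting A_aug by -gamma/2 * I; all matrices and gamma are illustrative choices.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, -1.0]])      # plant: x_dot = A x + B u
B = np.array([[0.0], [1.0]])
S = np.array([[0.0, 1.0], [-1.0, 0.0]])       # reference generator: r_dot = S r (sinusoid)
C = np.array([[1.0, 0.0]])                    # plant output y = C x
F = np.array([[1.0, 0.0]])                    # reference output y_d = F r

# Augmented state X = [x; r]; tracking error e = y - y_d = [C, -F] X.
A_aug = np.block([[A, np.zeros((2, 2))], [np.zeros((2, 2)), S]])
B_aug = np.vstack([B, np.zeros((2, 1))])
E = np.hstack([C, -F])
Q_aug = E.T @ E                               # penalize the tracking error e' e
R = np.array([[1.0]])
gamma = 0.2                                   # discount factor keeps V bounded

# Discounted ARE: (A_aug - gamma/2 I)'P + P(A_aug - gamma/2 I) - P B R^-1 B'P + Q = 0
P = solve_continuous_are(A_aug - 0.5 * gamma * np.eye(4), B_aug, Q_aug, R)
K_aug = np.linalg.solve(R, B_aug.T @ P)       # u = -K_aug X combines feedback and feedforward
print("tracking gain on [x; r]:", K_aug)
```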