Echo State Networks
Zhao Zhao

A Practical Guide to Applying Echo State Networks

1 The Basic Model

ESNs are applied to supervised temporal ML tasks where for a given training input signal $\mathbf{u}(n) \in \mathbb{R}^{N_u}$ a desired target output signal $\mathbf{y}^{\mathrm{target}}(n) \in \mathbb{R}^{N_y}$ is known. Here $n = 1, \ldots, T$ is the discrete time and $T$ is the number of data points in the training dataset. The task is to learn a model with output $\mathbf{y}(n) \in \mathbb{R}^{N_y}$, where $\mathbf{y}(n)$ matches $\mathbf{y}^{\mathrm{target}}(n)$ as well as possible, minimizing an error measure $E(\mathbf{y}, \mathbf{y}^{\mathrm{target}})$. The error measure is typically a Mean-Square Error (MSE), for example the Root-Mean-Square Error (RMSE)

$$E(\mathbf{y}, \mathbf{y}^{\mathrm{target}}) = \frac{1}{N_y}\sum_{i=1}^{N_y}\sqrt{\frac{1}{T}\sum_{n=1}^{T}\left(y_i(n) - y_i^{\mathrm{target}}(n)\right)^2}, \tag{1}$$

which is here also averaged over the $N_y$ dimensions of the output.

ESNs use an RNN type with leaky-integrated discrete-time continuous-value units. The typical update equations are

$$\tilde{\mathbf{x}}(n) = \tanh\left(\mathbf{W}^{\mathrm{in}}[1; \mathbf{u}(n)] + \mathbf{W}\mathbf{x}(n-1)\right), \tag{2}$$

$$\mathbf{x}(n) = (1-\alpha)\,\mathbf{x}(n-1) + \alpha\,\tilde{\mathbf{x}}(n), \tag{3}$$

where $\mathbf{x}(n) \in \mathbb{R}^{N_x}$ is a vector of reservoir neuron activations and $\tilde{\mathbf{x}}(n)$ is its update, all at time step $n$, $\tanh(\cdot)$ is applied element-wise, and $[\cdot;\cdot]$ denotes vertical vector concatenation. $\mathbf{W}^{\mathrm{in}} \in \mathbb{R}^{N_x \times (1+N_u)}$ and $\mathbf{W} \in \mathbb{R}^{N_x \times N_x}$ are the input and recurrent weight matrices respectively, and $\alpha \in (0, 1]$ is the leaking rate. The model is also sometimes used without the leaky integration, which is the special case of $\alpha = 1$ and thus $\mathbf{x}(n) \equiv \tilde{\mathbf{x}}(n)$.

The linear readout layer is defined as

$$\mathbf{y}(n) = \mathbf{W}^{\mathrm{out}}[1; \mathbf{u}(n); \mathbf{x}(n)], \tag{4}$$

where $\mathbf{y}(n) \in \mathbb{R}^{N_y}$ is the network output and $\mathbf{W}^{\mathrm{out}} \in \mathbb{R}^{N_y \times (1+N_u+N_x)}$ is the output weight matrix.
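As a minimal sketch of Equations (2)–(4) in NumPy (function and variable names are illustrative, not from the original text):

```python
import numpy as np

def esn_step(x, u, W_in, W, alpha):
    """One leaky-integrated reservoir update, following Eqs. (2)-(3)."""
    u_ext = np.concatenate(([1.0], u))          # [1; u(n)], with bias
    x_tilde = np.tanh(W_in @ u_ext + W @ x)     # Eq. (2)
    return (1 - alpha) * x + alpha * x_tilde    # Eq. (3)

def esn_readout(x, u, W_out):
    """Linear readout y(n) = W_out [1; u(n); x(n)], Eq. (4)."""
    return W_out @ np.concatenate(([1.0], u, x))
```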

The original method of RC introduced with ESNs was to:

  1. generate a large random reservoir RNN ($\mathbf{W}^{\mathrm{in}}$, $\mathbf{W}$, $\alpha$);
  2. run it using the training input $\mathbf{u}(n)$ and collect the corresponding reservoir activation states $\mathbf{x}(n)$;
  3. compute the linear readout weights $\mathbf{W}^{\mathrm{out}}$ from the reservoir using linear regression, minimizing the MSE between $\mathbf{y}(n)$ and $\mathbf{y}^{\mathrm{target}}(n)$;
  4. use the trained network on new input data $\mathbf{u}(n)$, computing $\mathbf{y}(n)$ by employing the trained output weights $\mathbf{W}^{\mathrm{out}}$.
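Putting these four steps together, a minimal end-to-end sketch (toy one-step-delay task, no washout of the initial transient, all names and sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
N_u, N_x, N_y, T = 1, 100, 1, 1000          # illustrative sizes
alpha, rho, a, beta = 0.3, 0.9, 1.0, 1e-8   # leaking rate, spectral radius, input scaling, regularization

# 1. Generate a random reservoir (W_in, W, alpha).
W_in = rng.uniform(-a, a, (N_x, 1 + N_u))
W = rng.uniform(-0.5, 0.5, (N_x, N_x))
W *= rho / max(abs(np.linalg.eigvals(W)))   # rescale to the desired spectral radius

# 2. Run it on the training input and collect the extended states [1; u(n); x(n)].
u = rng.uniform(-1, 1, (T, N_u))            # placeholder input signal
y_target = np.roll(u, 1, axis=0)            # placeholder target (1-step delay task)
X = np.zeros((1 + N_u + N_x, T))
x = np.zeros(N_x)
for n in range(T):
    u_ext = np.concatenate(([1.0], u[n]))
    x = (1 - alpha) * x + alpha * np.tanh(W_in @ u_ext + W @ x)
    X[:, n] = np.concatenate((u_ext, x))

# 3. Compute W_out by (ridge) linear regression on the collected states.
Y_t = y_target.T
W_out = Y_t @ X.T @ np.linalg.inv(X @ X.T + beta * np.eye(1 + N_u + N_x))

# 4. Apply the trained readout; new input data would be run through the same loop.
y_pred = W_out @ X
```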

2 Producing a Reservoir

2.1 Function of the Reservoir

In practice it is important to keep in mind that the reservoir simultaneously acts (i) as a nonlinear expansion of the input and (ii) as a memory of the input.

Regarding (ii), the reservoir serves as a memory, providing temporal context. This is a crucial reason for using RNNs in the first place. In tasks where memory is not necessary, non-temporal ML techniques implementing functional mappings from current input to current output should be used.

2.2 Global parameters of the Reservoir

The reservoir is defined by the tuple ($\mathbf{W}^{\mathrm{in}}$, $\mathbf{W}$, $\alpha$). The input and recurrent connection matrices $\mathbf{W}^{\mathrm{in}}$ and $\mathbf{W}$ are generated randomly according to some parameters discussed later, and the leaking rate $\alpha$ is selected as a free parameter itself.

In analogy to other ML, and especially NN, approaches, what we call “parameters” here could as well be called “meta-parameters” or “hyper-parameters”, as they are not concrete connection weights but parameters governing their distributions. We will call them “global parameters” to better reflect their nature, or simply “parameters” for brevity.

The defining global parameters of the reservoir are: the size $N_x$, the sparsity, the distribution of nonzero elements, and the spectral radius $\rho(\mathbf{W})$ of $\mathbf{W}$; the scaling(-s) of $\mathbf{W}^{\mathrm{in}}$; and the leaking rate $\alpha$.

2.2.1 Size of the Reservoir

A lower bound for the reservoir size $N_x$ can roughly be estimated by considering the number of independent real values that the reservoir must remember from the input to successfully accomplish the task.

2.2.2 Sparsity of the Reservoir

In the original ESN publications it is recommended to make the reservoir connections sparse, i.e., to make most of the elements of $\mathbf{W}$ equal to zero. In our practical experience, sparse connections also tend to give slightly better performance. In general, the sparsity of the reservoir does not affect the performance much and this parameter has a low priority to be optimized. However, sparsity enables fast reservoir updates if sparse matrix representations are used.
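For instance, a sparse $\mathbf{W}$ can be generated and applied cheaply with SciPy's sparse matrices (a sketch; the density and the value distribution are illustrative choices):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
N_x, density = 1000, 0.01                        # roughly 10 nonzeros per row (illustrative)

# Sparse reservoir matrix with uniformly distributed nonzero elements.
W = sp.random(N_x, N_x, density=density, random_state=rng,
              data_rvs=lambda k: rng.uniform(-0.5, 0.5, k), format="csr")

x = np.zeros(N_x)
u_contrib = np.zeros(N_x)                        # stands in for W_in @ [1; u(n)]
x_tilde = np.tanh(u_contrib + W @ x)             # sparse matrix-vector product is O(nnz)
```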

2.2.3 Distribution of Nonzero Elements

The matrix $\mathbf{W}$ is typically generated sparse, with nonzero elements having either a symmetrical uniform, discrete bi-valued, or normal distribution centered around zero. We usually prefer a uniform distribution for its continuity of values and boundedness.

The input matrix $\mathbf{W}^{\mathrm{in}}$ is usually generated according to the same type of distribution as $\mathbf{W}$, but is typically dense.

2.2.4 Spectral Radius

One of the most central global parameters of an ESN is the spectral radius of the reservoir connection matrix $\mathbf{W}$, i.e., the maximal absolute eigenvalue of this matrix, denoted $\rho(\mathbf{W})$. It scales the matrix $\mathbf{W}$, or, viewed alternatively, scales the width of the distribution of its nonzero elements.

Typically a random sparse $\mathbf{W}$ is generated; its spectral radius $\rho(\mathbf{W})$ is computed; then $\mathbf{W}$ is divided by $\rho(\mathbf{W})$ to yield a matrix with unit spectral radius; this is then conveniently scaled with the ultimate spectral radius to be determined in a tuning procedure.
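A sketch of this normalize-then-scale procedure (the sparse eigenvalue routine and the target value 0.9 are illustrative):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigs

rng = np.random.default_rng(1)
N_x = 1000
W = sp.random(N_x, N_x, density=0.01, random_state=rng,
              data_rvs=lambda k: rng.uniform(-0.5, 0.5, k), format="csr")

# Largest-magnitude eigenvalue of the raw matrix.
rho_W = abs(eigs(W, k=1, which="LM", return_eigenvectors=False)[0])

W_unit = W / rho_W          # unit spectral radius
rho_target = 0.9            # tuned on validation data in practice
W_scaled = rho_target * W_unit
```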

As a guiding principle, $\rho(\mathbf{W})$ should be set greater for tasks that require a more extensive history of the input, and smaller for tasks where the current output $\mathbf{y}(n)$ depends more on the recent history of $\mathbf{u}(n)$. The spectral radius determines how fast the influence of an input dies out in the reservoir with time, and how stable the reservoir activations are. ==The spectral radius should be greater in tasks requiring longer memory of the input.==

2.2.5 Input Scaling

The scaling of the input weight matrix $\mathbf{W}^{\mathrm{in}}$ is another key parameter to optimize in an ESN. For a uniformly distributed $\mathbf{W}^{\mathrm{in}}$ we usually define the input scaling $a$ as the range of the interval $[-a, a]$ from which the values of $\mathbf{W}^{\mathrm{in}}$ are sampled; for normally distributed input weights one may take the standard deviation as a scaling measure.

Scale the whole $\mathbf{W}^{\mathrm{in}}$ uniformly to keep the number of global parameters of the ESN small. However, to increase the performance (see the sketch after this list):

  • scale the first column of $\mathbf{W}^{\mathrm{in}}$ (i.e., the bias inputs) separately;
  • scale other columns of $\mathbf{W}^{\mathrm{in}}$ separately if channels of $\mathbf{u}(n)$ contribute differently to the task.
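A sketch of such column-wise scaling (all scale values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N_x, N_u = 100, 3
W_in = rng.uniform(-1, 1, (N_x, 1 + N_u))   # columns: [bias | u_1 | u_2 | u_3]

bias_scale = 0.2                            # illustrative values, tuned in practice
input_scales = np.array([1.0, 0.5, 2.0])    # one scale per input channel

W_in[:, 0] *= bias_scale                    # scale the bias column separately
W_in[:, 1:] *= input_scales                 # scale each input channel separately
```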

It is advisable to normalize the data, and it may help to keep the inputs $\mathbf{u}(n)$ bounded to avoid outliers (e.g., apply $\tanh(\cdot)$ to $\mathbf{u}(n)$ if it is unbounded).

Input scaling determines how nonlinear the reservoir responses are. The amount of nonlinearity the task requires is not easy to judge. Finding a proper setting benefits from experience and intuitive insight into nonlinear dynamics. But also the masters of RC (if there are such) use trial and error to tune this characteristic.

Looking at (2), it is clear that the scaling of $\mathbf{W}^{\mathrm{in}}$, together with the scaling of $\mathbf{W}$ (i.e., $\rho(\mathbf{W})$), determines the proportion of how much the current state $\tilde{\mathbf{x}}(n)$ depends on the current input $\mathbf{u}(n)$ and how much on the previous state $\mathbf{x}(n-1)$, respectively. The respective sizes $N_u$ and $N_x$ should also be taken into account.

The input scaling regulates:

  • the amount of nonlinearity of the reservoir representation $\mathbf{x}(n)$ (also increasing with $\rho(\mathbf{W})$);
  • the relative effect of the current input on $\mathbf{x}(n)$ as opposed to the history (in proportion to $\rho(\mathbf{W})$).

2.2.6 Leaking Rate

The leaking rate $\alpha$ of the reservoir nodes in (3) can be regarded as the speed of the reservoir update dynamics discretized in time. We can describe the reservoir update dynamics in continuous time as an Ordinary Differential Equation (ODE)

$$\dot{\mathbf{x}} = -\mathbf{x} + \tanh\left(\mathbf{W}^{\mathrm{in}}[1; \mathbf{u}] + \mathbf{W}\mathbf{x}\right).$$

If we make an Euler discretization of this ODE in time with step size $\Delta t$, taking

$$\mathbf{x}(n) = \mathbf{x}(n-1) + \Delta t\left(-\mathbf{x}(n-1) + \tanh\left(\mathbf{W}^{\mathrm{in}}[1; \mathbf{u}(n)] + \mathbf{W}\mathbf{x}(n-1)\right)\right),$$

we recover (2) and (3) with $\alpha = \Delta t$. Thus $\alpha$ can be regarded as the time interval in the continuous world between two consecutive time steps in the discrete realization. The leaking rate can even be adapted online to deal with time warping of the signals.

The version (3) has emerged as preferred, because it guarantees that $\mathbf{x}(n)$ never goes outside the $(-1, 1)$ interval.

Set the leaking rate $\alpha$ to match the speed of the dynamics of $\mathbf{u}(n)$ and/or $\mathbf{y}^{\mathrm{target}}(n)$.

When the task requires modeling the time-series-producing dynamical system on multiple time scales, it might be useful to set different leaking rates for different units (making $\alpha$ a vector $\boldsymbol{\alpha} \in \mathbb{R}^{N_x}$), with the possible downside of having more parameters to optimize.
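A sketch of per-unit leaking rates (the log-spaced range is an illustrative choice, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(3)
N_x = 100
# One leaking rate per reservoir unit, log-spaced to cover several time scales.
alphas = np.logspace(-2, 0, N_x)

x = np.zeros(N_x)
x_tilde = np.tanh(rng.uniform(-1, 1, N_x))   # stands in for Eq. (2)
x = (1 - alphas) * x + alphas * x_tilde      # element-wise leaky integration, Eq. (3)
```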

In some cases setting a small $\alpha$, and thus inducing slow dynamics of $\mathbf{x}(n)$, can dramatically increase the duration of the short-term memory in the ESN.

3 Training Readouts

3.1 Ridge Regression

Since readouts from an ESN are typically linear and feed-forward, Equation (4) can be written in matrix notation as

$$\mathbf{Y} = \mathbf{W}^{\mathrm{out}}\mathbf{X},$$

where $\mathbf{Y} \in \mathbb{R}^{N_y \times T}$ are all $\mathbf{y}(n)$ and $\mathbf{X} \in \mathbb{R}^{(1+N_u+N_x) \times T}$ are all $[1; \mathbf{u}(n); \mathbf{x}(n)]$ produced by presenting the training input, both collected column by column over the training period $n = 1, \ldots, T$. Finding the optimal weights $\mathbf{W}^{\mathrm{out}}$ that minimize the squared error between $\mathbf{y}(n)$ and $\mathbf{y}^{\mathrm{target}}(n)$ amounts to solving a typically overdetermined system of linear equations

$$\mathbf{Y}^{\mathrm{target}} = \mathbf{W}^{\mathrm{out}}\mathbf{X}. \tag{7}$$

Probably the most universal and stable solution to (7) in this context is ridge regression, also known as regression with Tikhonov regularization:

$$\mathbf{W}^{\mathrm{out}} = \mathbf{Y}^{\mathrm{target}}\mathbf{X}^{\mathsf{T}}\left(\mathbf{X}\mathbf{X}^{\mathsf{T}} + \beta\mathbf{I}\right)^{-1},$$

where $\beta$ is a regularization coefficient and $\mathbf{I}$ is the identity matrix.
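A small sketch of this solution using a linear solver instead of an explicit inverse (function name and argument shapes are illustrative):

```python
import numpy as np

def ridge_readout(X, Y_target, beta):
    """Solve W_out = Y_target X^T (X X^T + beta I)^(-1) without forming the inverse.

    X        : (1 + N_u + N_x, T) collected extended states
    Y_target : (N_y, T) training targets
    beta     : regularization coefficient
    """
    A = X @ X.T + beta * np.eye(X.shape[0])
    B = Y_target @ X.T
    # W_out A = B  <=>  A^T W_out^T = B^T, solved with a standard linear solver.
    return np.linalg.solve(A.T, B.T).T
```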

3.2 Regularization

  • Extremely large values of $\mathbf{W}^{\mathrm{out}}$ may be an indication of a very sensitive and unstable solution.
  • Use regularization whenever there is a danger of overfitting or feedback instability.

The optimal values of $\beta$ can vary by many orders of magnitude, depending on the exact instance of the reservoir and the length of the training data. If doing a simple exhaustive search, it is advisable to search on a logarithmic grid.
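A sketch of such a search on a logarithmic grid (the data here are random placeholders; in practice the matrices come from harvesting reservoir states on training and validation splits):

```python
import numpy as np

rng = np.random.default_rng(4)
n_feat, T = 101, 1000
X_train, X_valid = rng.standard_normal((n_feat, T)), rng.standard_normal((n_feat, 200))
Y_train, Y_valid = rng.standard_normal((1, T)), rng.standard_normal((1, 200))   # placeholders

best_beta, best_rmse = None, np.inf
for beta in np.logspace(-9, 1, 11):          # grid spanning many orders of magnitude
    W_out = Y_train @ X_train.T @ np.linalg.inv(X_train @ X_train.T + beta * np.eye(n_feat))
    rmse = np.sqrt(np.mean((W_out @ X_valid - Y_valid) ** 2))
    if rmse < best_rmse:
        best_beta, best_rmse = beta, rmse
```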

3.4 Online Learning

Some applications require online model adaptation. In such cases the process generating the data is often not assumed to be stationary and is tracked by the constantly adapting model. $\mathbf{W}^{\mathrm{out}}$ here acts as an adaptive linear combiner.

The simplest way to train $\mathbf{W}^{\mathrm{out}}$ online is the method known as the Least Mean Squares (LMS) algorithm. It is a stochastic gradient descent algorithm which at every time step $n$ changes $\mathbf{W}^{\mathrm{out}}$ in the direction of minimizing the instantaneous squared error $\|\mathbf{y}^{\mathrm{target}}(n) - \mathbf{y}(n)\|^2$. LMS is a first-order gradient descent method, locally approximating the error surface with a hyperplane. This approximation is poor when the curvature of the error surface is very different in different directions, which is signified by a large eigenvalue spread of $\mathbf{X}\mathbf{X}^{\mathsf{T}}$. In such a situation the convergence performance of LMS is unfortunately severely impaired.
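As a hedged sketch (the learning rate and variable names are illustrative), one LMS update on the extended state vector could look like:

```python
import numpy as np

def lms_step(W_out, z, y_target, lr=1e-3):
    """One LMS update of the readout on the extended state z = [1; u(n); x(n)].

    The learning rate lr is an illustrative value and must be tuned in practice.
    """
    y = W_out @ z                          # current prediction, Eq. (4)
    error = y_target - y                   # instantaneous error
    W_out += lr * np.outer(error, z)       # gradient step on the squared error
    return W_out
```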

An alternative to LMS for learning linear readouts, known in linear signal processing as the ==Recursive Least Squares (RLS)== algorithm, is insensitive to the detrimental effects of eigenvalue spread and boasts a much faster convergence. At each time step $n$ it explicitly minimizes a squared error that is exponentially discounted going back in time,

$$E_i(n) = \sum_{j=1}^{n} \gamma^{\,n-j}\left(y_i(j) - y_i^{\mathrm{target}}(j)\right)^2,$$

where $\gamma \le 1$ is the forgetting factor.

The downside of RLS is that it is computationally more expensive (quadratic in the number of weights instead of linear as in LMS) and notorious for numerical stability issues.
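A sketch of the corresponding RLS update in its usual inverse-correlation-matrix formulation (initialization constant, forgetting factor, and names are illustrative):

```python
import numpy as np

def rls_init(n_feat, delta=1.0):
    """P approximates the inverse of the (discounted) state correlation matrix."""
    return np.eye(n_feat) / delta

def rls_step(W_out, P, z, y_target, forgetting=0.999):
    """One RLS update on the extended state z = [1; u(n); x(n)].

    forgetting (gamma <= 1) controls the exponential discounting of past errors.
    """
    Pz = P @ z
    k = Pz / (forgetting + z @ Pz)          # gain vector
    error = y_target - W_out @ z
    W_out += np.outer(error, k)             # weight update
    P -= np.outer(k, Pz)                    # rank-1 downdate of P
    P /= forgetting
    return W_out, P
```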


Echo State Gaussian Process

Let us again consider the readout expression of an ESN comprising $N_x$ reservoir neurons. Let the network output consist of $N_y$ component responses, i.e. $\mathbf{y}(n) = [y_1(n), \ldots, y_{N_y}(n)]^{\mathsf{T}}$. Then, we have

$$y_i(n) = \mathbf{w}_i^{\mathsf{T}}\mathbf{x}(n), \quad i = 1, \ldots, N_y.$$

Let us now impose a spherical Gaussian prior over the readout weights $\mathbf{w}_i$, such that

$$p(\mathbf{w}_i) = \mathcal{N}\!\left(\mathbf{w}_i \mid \mathbf{0}, \sigma_w^2\mathbf{I}\right).$$

Under this setting, we have for the mean and covariance of the readout component responses

$$\mathbb{E}\left[y_i(n)\right] = \mathbf{x}(n)^{\mathsf{T}}\,\mathbb{E}\left[\mathbf{w}_i\right] = 0,$$

$$\mathbb{E}\left[y_i(n)\,y_i(m)\right] = \mathbf{x}(n)^{\mathsf{T}}\,\mathbb{E}\left[\mathbf{w}_i\mathbf{w}_i^{\mathsf{T}}\right]\mathbf{x}(m) = \sigma_w^2\,\mathbf{x}(n)^{\mathsf{T}}\mathbf{x}(m).$$

Thus, under the Gaussian prior assumption, it turns out that $y_i(n)$ and $y_i(m)$ are jointly Gaussian with zero mean and covariance given by the dot product $\sigma_w^2\,\mathbf{x}(n)^{\mathsf{T}}\mathbf{x}(m)$, for any $n$ and $m$. In other words, under our Bayesian approach, the distributions of the ESN readouts turn out to yield a GP of the form

$$p(\mathbf{y}_i \mid \mathbf{X}) = \mathcal{N}\!\left(\mathbf{y}_i \mid \mathbf{0}, \mathbf{K}\right),$$

where $\mathbf{X} = [\mathbf{x}(1), \ldots, \mathbf{x}(T)]^{\mathsf{T}}$ is the design matrix and $\mathbf{K}$ is given by $\mathbf{K} = \sigma_w^2\,\mathbf{X}\mathbf{X}^{\mathsf{T}}$, with the kernels of the obtained GPs being functions of the reservoir state vectors, in the form

$$k\!\left(\mathbf{x}(n), \mathbf{x}(m)\right) = \sigma_w^2\,\mathbf{x}(n)^{\mathsf{T}}\mathbf{x}(m).$$

Generalizing the above results to allow for the utilization of kernels of any kind, the paper introduces a novel Bayesian treatment of ESNs, namely, the ESGP. For example, in case a Gaussian radial basis function (RBF) kernel is considered, the definition of the ESGP model yields a prior distribution of the form

$$p(\mathbf{y}_i \mid \mathbf{X}) = \mathcal{N}\!\left(\mathbf{y}_i \mid \mathbf{0}, \mathbf{K}\right),$$

with its kernel function (reservoir kernel) given by

$$k\!\left(\mathbf{x}(n), \mathbf{x}(m)\right) = \exp\!\left(-\frac{\|\mathbf{x}(n) - \mathbf{x}(m)\|^2}{2\lambda^2}\right).$$

Let us consider an ESGP with $N_x$ reservoir neurons and $N_y$ readout signals. To endow the model with increased robustness to observation noise, we additionally assume that the target signals of an observed phenomenon modeled using the postulated ESGP model consist of a latent function of the input signals, which is learnable by the considered ESGP model, superimposed on an independent white Gaussian noise signal; that is, we adopt the hypothesis that the available training target signals are given by

$$t_i(n) = y_i(n) + \epsilon_i(n), \quad \epsilon_i(n) \sim \mathcal{N}\!\left(0, \sigma^2\right).$$

The predictive density of the postulated ESGP model at a test time point $n_*$ yields

$$p\!\left(y_{i*} \mid \mathbf{x}(n_*), \mathbf{X}, \mathbf{t}_i\right) = \mathcal{N}\!\left(y_{i*} \mid \mu_{i*}, \sigma_*^2\right),$$

where

$$\mu_{i*} = \mathbf{k}_*^{\mathsf{T}}\left(\mathbf{K} + \sigma^2\mathbf{I}\right)^{-1}\mathbf{t}_i,$$

$$\sigma_*^2 = k\!\left(\mathbf{x}(n_*), \mathbf{x}(n_*)\right) - \mathbf{k}_*^{\mathsf{T}}\left(\mathbf{K} + \sigma^2\mathbf{I}\right)^{-1}\mathbf{k}_*,$$

and $\mathbf{k}_* = [k(\mathbf{x}(1), \mathbf{x}(n_*)), \ldots, k(\mathbf{x}(T), \mathbf{x}(n_*))]^{\mathsf{T}}$. Similarly, the model evidence will be given by

$$p\!\left(\mathbf{t}_i \mid \mathbf{X}\right) = \mathcal{N}\!\left(\mathbf{t}_i \mid \mathbf{0}, \mathbf{K} + \sigma^2\mathbf{I}\right).$$
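For orientation, a minimal sketch of the predictive equations above with the RBF reservoir kernel (hyperparameter values, names, and the single-output restriction are illustrative simplifications; a full ESGP implementation would also fit these hyperparameters via the model evidence):

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Reservoir kernel k(x(n), x(m)) = exp(-||x(n) - x(m)||^2 / (2 lambda^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * length_scale ** 2))

def esgp_predict(X_states, t, x_star, noise_var=1e-2, length_scale=1.0):
    """GP predictive mean and variance at a test reservoir state x_star.

    X_states : (T, N_x) reservoir states collected while driving the ESN
    t        : (T,) noisy training targets for one readout component
    """
    K = rbf_kernel(X_states, X_states, length_scale)
    k_star = rbf_kernel(X_states, x_star[None, :], length_scale)[:, 0]
    A = K + noise_var * np.eye(len(t))
    alpha = np.linalg.solve(A, t)
    mean = k_star @ alpha
    var = rbf_kernel(x_star[None, :], x_star[None, :], length_scale)[0, 0] \
          - k_star @ np.linalg.solve(A, k_star)
    return mean, var
```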