ESNs are applied to supervised temporal ML tasks where for a given training input signal $\mathbf{u}(n) \in \mathbb{R}^{N_u}$ a desired target output signal $\mathbf{y}^{\text{target}}(n) \in \mathbb{R}^{N_y}$ is known. Here $n = 1, \dots, T$ is the discrete time and $T$ is the number of data points in the training dataset. The task is to learn a model with output $\mathbf{y}(n) \in \mathbb{R}^{N_y}$, where $\mathbf{y}(n)$ matches $\mathbf{y}^{\text{target}}(n)$ as well as possible, minimizing an error measure $E(\mathbf{y}, \mathbf{y}^{\text{target}})$. The error measure $E$ is typically a Mean-Square Error (MSE), for example the Root-Mean-Square Error (RMSE)

$$E(\mathbf{y}, \mathbf{y}^{\text{target}}) = \frac{1}{N_y}\sum_{i=1}^{N_y}\sqrt{\frac{1}{T}\sum_{n=1}^{T}\left(y_i(n) - y_i^{\text{target}}(n)\right)^2}, \qquad (1)$$

which here is also averaged over the $N_y$ dimensions of the output.
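For concreteness, a minimal NumPy sketch of this error measure; the `(N_y, T)` array layout (one output dimension per row) is our assumption:

```python
import numpy as np

def rmse(Y, Y_target):
    """RMSE between outputs and targets, averaged over the N_y output
    dimensions; Y and Y_target have shape (N_y, T)."""
    per_dim = np.sqrt(np.mean((Y - Y_target) ** 2, axis=1))  # one RMSE per output dimension
    return np.mean(per_dim)                                  # average over the N_y dimensions

# toy usage with random data
Y = np.random.randn(2, 100)
Y_target = Y + 0.1 * np.random.randn(2, 100)
print(rmse(Y, Y_target))
```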
ESNs use an RNN type with leaky-integrated discrete-time continuous-value units. The typical update equations are

$$\tilde{\mathbf{x}}(n) = \tanh\left(\mathbf{W}^{\text{in}}[1; \mathbf{u}(n)] + \mathbf{W}\mathbf{x}(n-1)\right), \qquad (2)$$

$$\mathbf{x}(n) = (1-\alpha)\,\mathbf{x}(n-1) + \alpha\,\tilde{\mathbf{x}}(n), \qquad (3)$$

where $\mathbf{x}(n) \in \mathbb{R}^{N_x}$ is a vector of reservoir neuron activations and $\tilde{\mathbf{x}}(n)$ is its update, all at time step $n$, $\tanh(\cdot)$ is applied element-wise, $[\cdot;\cdot]$ denotes vertical vector concatenation, $\mathbf{W}^{\text{in}} \in \mathbb{R}^{N_x \times (1+N_u)}$ and $\mathbf{W} \in \mathbb{R}^{N_x \times N_x}$ are the input and recurrent weight matrices respectively, and $\alpha \in (0, 1]$ is the leaking rate. The model is also sometimes used without the leaky integration, which is a special case of $\alpha = 1$ and thus $\mathbf{x}(n) \equiv \tilde{\mathbf{x}}(n)$.
The linear readout layer is defined as

$$\mathbf{y}(n) = \mathbf{W}^{\text{out}}[1; \mathbf{u}(n); \mathbf{x}(n)], \qquad (4)$$

where $\mathbf{y}(n) \in \mathbb{R}^{N_y}$ is the network output and $\mathbf{W}^{\text{out}} \in \mathbb{R}^{N_y \times (1+N_u+N_x)}$ is the output weight matrix.
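A minimal NumPy sketch of one time step according to (2)–(4); the function name and the assumption that the weight matrices are already given are ours:

```python
import numpy as np

def esn_step(x_prev, u, W_in, W, W_out, alpha):
    """One ESN time step: leaky-integrated state update (2)-(3) and linear readout (4).
    x_prev: (N_x,), u: (N_u,), W_in: (N_x, 1+N_u), W: (N_x, N_x),
    W_out: (N_y, 1+N_u+N_x), alpha: leaking rate in (0, 1]."""
    u1 = np.concatenate(([1.0], u))                # [1; u(n)]
    x_tilde = np.tanh(W_in @ u1 + W @ x_prev)      # (2)
    x = (1 - alpha) * x_prev + alpha * x_tilde     # (3)
    y = W_out @ np.concatenate(([1.0], u, x))      # (4)
    return x, y
```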
The original method of RC introduced with ESNs was to:

1. generate a large random reservoir RNN ($\mathbf{W}^{\text{in}}$, $\mathbf{W}$, $\alpha$);
2. run it using the training input $\mathbf{u}(n)$ and collect the corresponding reservoir activation states $\mathbf{x}(n)$;
3. compute the linear readout weights $\mathbf{W}^{\text{out}}$ from the reservoir using linear regression, minimizing the MSE between $\mathbf{y}(n)$ and $\mathbf{y}^{\text{target}}(n)$;
4. use the trained network on new input data $\mathbf{u}(n)$, computing $\mathbf{y}(n)$ by employing the trained output weights $\mathbf{W}^{\text{out}}$.
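Putting the four steps together, a minimal end-to-end sketch in NumPy might look as follows; the toy one-step-ahead prediction task, all sizes and scalings, and the washout (discarding of the initial transient) are our illustrative assumptions, while reservoir generation and the ridge solver are discussed in the sections below:

```python
import numpy as np

rng = np.random.default_rng(42)
N_u, N_x, T, washout, alpha, beta = 1, 200, 2000, 100, 0.3, 1e-8

# toy task: predict u(n) one step ahead from a sine input (illustrative only)
U = np.sin(np.arange(T + 1) / 4.0).reshape(1, -1)
Y_target = U[:, 1:]
U = U[:, :-1]

# 1. generate a random reservoir (W_in, W, alpha); scaling choices here are arbitrary
W_in = (rng.random((N_x, 1 + N_u)) - 0.5) * 1.0
W = rng.random((N_x, N_x)) - 0.5
W *= 0.9 / max(abs(np.linalg.eigvals(W)))            # set spectral radius to 0.9

# 2. run it on the training input and collect states x(n)
X = np.zeros((1 + N_u + N_x, T - washout))
x = np.zeros(N_x)
for n in range(T):
    u = U[:, n]
    x_tilde = np.tanh(W_in @ np.concatenate(([1.0], u)) + W @ x)
    x = (1 - alpha) * x + alpha * x_tilde
    if n >= washout:                                  # discard the initial transient
        X[:, n - washout] = np.concatenate(([1.0], u, x))

# 3. compute readout weights by (ridge) linear regression
Yt = Y_target[:, washout:]
W_out = Yt @ X.T @ np.linalg.inv(X @ X.T + beta * np.eye(X.shape[0]))

# 4. use the trained output weights: y(n) = W_out [1; u(n); x(n)]
y = W_out @ X
print("train RMSE:", np.sqrt(np.mean((y - Yt) ** 2)))
```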
2 Producing a Reservoir
2.1 Function of the Reservoir
In practice it is important to keep in mind that the reservoir acts (i) as a nonlinear high-dimensional expansion of the input $\mathbf{u}(n)$ and (ii) as a memory of the input at the same time. As a memory, the reservoir provides temporal context, which is a crucial reason for using RNNs in the first place. In tasks where memory is not necessary, non-temporal ML techniques implementing functional mappings from current input to current output should be used.
2.2 Global Parameters of the Reservoir
The reservoir is defined by the tuple ($\mathbf{W}^{\text{in}}$, $\mathbf{W}$, $\alpha$). The input and recurrent connection matrices $\mathbf{W}^{\text{in}}$ and $\mathbf{W}$ are generated randomly according to some parameters discussed later, and the leaking rate $\alpha$ is selected as a free parameter itself.
In analogy to other ML, and especially NN, approaches, what we call
“parameters” here could as well be called “meta-parameters” or
“hyper-parameters”, as they are not concrete connection weights but
parameters governing their distributions. We will call them “global
parameters” to better reflect their nature, or simply “parameters” for
brevity.
The defining global parameters of the reservoir are: the size $N_x$; the sparsity, distribution of nonzero elements, and spectral radius $\rho(\mathbf{W})$ of $\mathbf{W}$; the scaling(-s) of $\mathbf{W}^{\text{in}}$; and the leaking rate $\alpha$.
2.2.1 Size of the Reservoir
A lower bound for the reservoir size can roughly be estimated by
considering the number of independent real values that the reservoir
must remember from the input to successfully accomplish the task.
2.2.2 Sparsity of the Reservoir
In the original ESN publications it is recommended to make the reservoir connections sparse, i.e., to make most elements of $\mathbf{W}$ equal to zero. In our practical experience, sparse connections also tend to give slightly better performance. In general, the sparsity of the reservoir does not affect the performance much and this parameter has a low priority for optimization. However, sparsity enables fast reservoir updates if sparse matrix representations are used.
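A small sketch of the last point, assuming SciPy's sparse matrices are available; the reservoir size and the roughly 1% density are arbitrary illustrative choices:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
N_x = 2000

# sparse reservoir matrix with ~1% nonzero elements, uniform in [-0.5, 0.5]
W_sparse = sparse.random(N_x, N_x, density=0.01, random_state=0,
                         data_rvs=lambda n: rng.uniform(-0.5, 0.5, n)).tocsr()
x = rng.standard_normal(N_x)

# the reservoir update W @ x now touches only the nonzero entries,
# costing O(nnz) instead of O(N_x^2) as for a dense matrix
z = W_sparse @ x
```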
2.2.3 Distribution of Nonzero
Elements
The matrix $\mathbf{W}$ is typically generated sparse, with the nonzero elements having either a symmetrical uniform, discrete bi-valued, or normal distribution centered around zero. We usually prefer a uniform distribution for its continuity of values and boundedness.
The input matrix $\mathbf{W}^{\text{in}}$ is usually generated according to the same type of distribution as $\mathbf{W}$, but is typically dense.
2.2.4 Spectral Radius
One of the most central global parameters of an ESN is the spectral radius of the reservoir connection matrix $\mathbf{W}$, i.e., the maximal absolute eigenvalue of this matrix, denoted $\rho(\mathbf{W})$. It scales the matrix $\mathbf{W}$, or, viewed alternatively, scales the width of the distribution of its nonzero elements.
Typically a random sparse $\mathbf{W}$ is generated; its spectral radius $\rho(\mathbf{W})$ is computed; then $\mathbf{W}$ is divided by $\rho(\mathbf{W})$ to yield a matrix with a unit spectral radius; this is then conveniently scaled with the ultimate spectral radius to be determined in a tuning procedure.
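The rescaling procedure in NumPy might look like the following sketch; the uniform nonzero distribution, the density, and the target spectral radius of 0.9 are example choices, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N_x, density, rho_desired = 500, 0.1, 0.9

# sparse W with uniformly distributed nonzeros centered around zero
W = rng.uniform(-0.5, 0.5, (N_x, N_x))
W[rng.random((N_x, N_x)) > density] = 0.0

rho_W = max(abs(np.linalg.eigvals(W)))   # spectral radius: largest |eigenvalue|
W /= rho_W                               # unit spectral radius
W *= rho_desired                         # scale to the spectral radius being tuned
```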
As a guiding principle, $\rho(\mathbf{W})$ should be set greater for tasks where a more extensive history of the input is required to perform them, and smaller for tasks where the current output $\mathbf{y}(n)$ depends more on the recent history of $\mathbf{u}(n)$. The spectral radius determines how fast the influence of an input dies out in the reservoir with time, and how stable the reservoir activations are.
==The spectral radius should be greater in tasks requiring longer memory of the input.==
2.2.5 Input Scaling
The scaling of the input weight matrix $\mathbf{W}^{\text{in}}$ is another key parameter to optimize in an ESN. For uniformly distributed $\mathbf{W}^{\text{in}}$ we usually define the input scaling $a$ as the range of the interval $[-a, a]$ from which the values of $\mathbf{W}^{\text{in}}$ are sampled; for normally distributed input weights one may take the standard deviation as a scaling measure.
Scale the whole $\mathbf{W}^{\text{in}}$ uniformly to have few global parameters in the ESN. However, to increase the performance:
scale the first column of $\mathbf{W}^{\text{in}}$ (i.e., the bias inputs) separately;
scale other columns of $\mathbf{W}^{\text{in}}$ separately if channels of $\mathbf{u}(n)$ contribute differently to the task.
It is advisable to normalize the data, and it may help to keep the inputs bounded, avoiding outliers (e.g., apply $\tanh(\cdot)$ to $\mathbf{u}(n)$ if it is unbounded).
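A sketch of the column-wise scaling of $\mathbf{W}^{\text{in}}$ described in the list above, together with the tanh squashing of an unbounded input; all particular scaling values are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(2)
N_x, N_u = 500, 3

# W_in columns: [bias | input channel 1 ... input channel N_u]
W_in = rng.uniform(-0.5, 0.5, (N_x, 1 + N_u))
W_in[:, 0] *= 0.1                          # scale the bias column separately
W_in[:, 1:] *= np.array([1.0, 0.5, 2.0])   # per-channel scalings if channels differ in importance

# keep an unbounded input bounded before feeding it to the reservoir
u = rng.standard_normal(N_u) * 10.0
u_bounded = np.tanh(u)
```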
Input scaling determines how nonlinear the reservoir responses are.
The amount of nonlinearity the task requires is not easy to judge.
Finding a proper setting benefits from experience and an intuitive insight into nonlinear dynamics. Even the masters of RC (if there are such) use trial and error to tune this characteristic.
Looking at (2), it is clear that the scaling of $\mathbf{W}^{\text{in}}$, together with the scaling of $\mathbf{W}$ (i.e., $\rho(\mathbf{W})$), determines the proportion of how much the current state $\mathbf{x}(n)$ depends on the current input $\mathbf{u}(n)$ and how much on the previous state $\mathbf{x}(n-1)$, respectively. The respective sizes $N_u$ and $N_x$ should also be taken into account.
The input scaling regulates:
the amount of nonlinearity of the reservoir representation $\mathbf{x}(n)$ (also increasing with $\rho(\mathbf{W})$);
the relative effect of the current input on $\mathbf{x}(n)$ as opposed to the history (in proportion to $\rho(\mathbf{W})$).
2.2.6 Leaking Rate
The leaking rate $\alpha$ of the reservoir nodes in (3) can be regarded as the speed of the reservoir update dynamics discretized in time. We can describe the reservoir update dynamics in continuous time as an Ordinary Differential Equation (ODE)

$$\dot{\mathbf{x}} = -\mathbf{x} + \tanh\left(\mathbf{W}^{\text{in}}[1; \mathbf{u}] + \mathbf{W}\mathbf{x}\right). \qquad (5)$$

If we make an Euler discretization of this ODE in time with a step size $\delta$, taking $\alpha = \delta$, we obtain

$$\mathbf{x}(n) = (1-\alpha)\,\mathbf{x}(n-1) + \alpha\tanh\left(\mathbf{W}^{\text{in}}[1; \mathbf{u}(n)] + \mathbf{W}\mathbf{x}(n-1)\right), \qquad (6)$$

which coincides with (2) and (3). Thus $\alpha$ can be regarded as the time interval in the continuous world between two consecutive time steps in the discrete realization. The leaking rate $\alpha$ can even be adapted online to deal with time warping of the signals.
The version (3) has emerged as preferred, because it guarantees that $\mathbf{x}(n)$ never goes outside the $(-1, 1)$ interval.
Set the leaking rate $\alpha$ to match the speed of the dynamics of $\mathbf{u}(n)$ and/or $\mathbf{y}^{\text{target}}(n)$.
When the task requires modeling a time series produced by a dynamical system on multiple time scales, it might be useful to set different leaking rates for different units (making $\alpha$ a vector), with a possible downside of having more parameters to optimize.
In some cases setting a small $\alpha$, and thus inducing slow dynamics of $\mathbf{x}(n)$, can dramatically increase the duration of the short-term memory in the ESN.
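A minimal sketch of the element-wise version of (3) with per-unit leaking rates, as mentioned above; the log-spaced range of rates and the stand-in for the state update are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
N_x = 500

# per-unit leaking rates, e.g. log-spaced between slow (0.01) and fast (1.0) units
alphas = np.logspace(-2, 0, N_x)

x = np.zeros(N_x)
x_tilde = np.tanh(rng.standard_normal(N_x))   # stand-in for tanh(W_in [1; u(n)] + W x(n-1))
x = (1 - alphas) * x + alphas * x_tilde       # element-wise leaky integration, vector alpha
```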
3 Training Readouts
3.1 Ridge Regression
Since readouts from an ESN are typically linear and feed-forward, Equation (4) can be written in matrix notation as

$$\mathbf{Y} = \mathbf{W}^{\text{out}}\mathbf{X},$$

where $\mathbf{Y} \in \mathbb{R}^{N_y \times T}$ are all $\mathbf{y}(n)$ and $\mathbf{X} \in \mathbb{R}^{(1+N_u+N_x) \times T}$ are all $[1; \mathbf{u}(n); \mathbf{x}(n)]$, collected column-wise over the training period $n = 1, \dots, T$. Finding the optimal weights $\mathbf{W}^{\text{out}}$ that minimize the squared error between $\mathbf{y}(n)$ and $\mathbf{y}^{\text{target}}(n)$ amounts to solving a typically overdetermined system of linear equations

$$\mathbf{Y}^{\text{target}} = \mathbf{W}^{\text{out}}\mathbf{X}. \qquad (7)$$

Probably the most universal and stable solution to (7) in this context is ridge regression, also known as regression with Tikhonov regularization:

$$\mathbf{W}^{\text{out}} = \mathbf{Y}^{\text{target}}\mathbf{X}^{\mathsf{T}}\left(\mathbf{X}\mathbf{X}^{\mathsf{T}} + \beta\mathbf{I}\right)^{-1}, \qquad (8)$$

where $\beta$ is a regularization coefficient and $\mathbf{I}$ is the identity matrix.
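A sketch of Eq. (8) in NumPy; using a linear solve instead of the explicit inverse is a common numerical-stability choice on our part, not something prescribed above:

```python
import numpy as np

def train_readout_ridge(X, Y_target, beta=1e-8):
    """Ridge-regression readout, Eq. (8): W_out = Y_target X^T (X X^T + beta I)^-1.
    X: (1+N_u+N_x, T) collected extended states, Y_target: (N_y, T)."""
    n = X.shape[0]
    # solve (X X^T + beta I) W_out^T = X Y_target^T rather than inverting explicitly
    A = X @ X.T + beta * np.eye(n)
    W_out = np.linalg.solve(A, X @ Y_target.T).T
    return W_out
```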
3.2 Regularization
Extremely large $\mathbf{W}^{\text{out}}$ values may be an indication of a very sensitive and unstable solution.
Use regularization whenever there is a danger of overfitting or
feedback instability.
The optimal values of $\beta$ can vary by many orders of magnitude, depending on the exact instance of the reservoir and the length of the training data. If doing a simple exhaustive search, it is advisable to search on a logarithmic grid.
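A sketch of such a logarithmic grid search, assuming a held-out validation split; the grid bounds and the inlined ridge helper (as in Sec. 3.1 above) are illustrative:

```python
import numpy as np

def ridge_readout(X, Y, beta):
    A = X @ X.T + beta * np.eye(X.shape[0])
    return np.linalg.solve(A, X @ Y.T).T

def select_beta(X_train, Y_train, X_val, Y_val):
    """Pick the regularization coefficient beta on a logarithmic grid
    by validation RMSE (grid bounds are arbitrary examples)."""
    best_beta, best_rmse = None, np.inf
    for beta in np.logspace(-12, 2, 15):          # 1e-12 ... 1e2, log-spaced
        W_out = ridge_readout(X_train, Y_train, beta)
        rmse = np.sqrt(np.mean((W_out @ X_val - Y_val) ** 2))
        if rmse < best_rmse:
            best_beta, best_rmse = beta, rmse
    return best_beta
```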
3.4 Online Learning
Some applications require online model adaptation. In such cases the process generating the data is often not assumed to be stationary and is tracked by the constantly adapting model. $\mathbf{W}^{\text{out}}$ here acts as an adaptive linear combiner.
The simplest way to train $\mathbf{W}^{\text{out}}$ is the method known as the Least Mean Squares (LMS) algorithm. It is a stochastic gradient descent algorithm which at every time step $n$ changes $\mathbf{W}^{\text{out}}$ in the direction of minimizing the instantaneous squared error $\|\mathbf{y}^{\text{target}}(n) - \mathbf{y}(n)\|^2$.
LMS is a first-order gradient descent method, locally approximating the error surface with a hyperplane. This approximation is poor when the curvature of the error surface is very different in different directions, which is signified by a large eigenvalue spread of $\mathbf{X}\mathbf{X}^{\mathsf{T}}$. In such a situation the convergence performance of LMS is unfortunately severely impaired.
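A minimal sketch of one LMS step; the learning-rate value and the name `z` for the extended state $[1; \mathbf{u}(n); \mathbf{x}(n)]$ are our choices:

```python
import numpy as np

def lms_update(W_out, z, y_target, mu=1e-3):
    """One LMS step: move W_out along the negative gradient of the
    instantaneous squared error ||y_target - W_out z||^2.
    W_out: (N_y, 1+N_u+N_x), z: (1+N_u+N_x,), y_target: (N_y,)."""
    error = y_target - W_out @ z
    return W_out + mu * np.outer(error, z)
```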
An alternative to LMS for learning linear readouts, known in linear signal processing as the ==Recursive Least Squares (RLS)== algorithm, is insensitive to the detrimental effects of eigenvalue spread and boasts a much faster convergence. It explicitly minimizes, at each time step $n$, a squared error that is exponentially discounted going back in time. The downside of RLS is that it is computationally more expensive (quadratic in the number of weights instead of linear like LMS) and notorious for numerical stability issues.
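A sketch of one standard RLS step with an exponential forgetting factor; the initialization hint and the parameter values are conventional choices, not taken from the text above:

```python
import numpy as np

def rls_update(W_out, P, z, y_target, lam=0.999):
    """One RLS step for the readout with forgetting factor lam.
    P approximates the inverse of the (discounted) correlation matrix of z."""
    Pz = P @ z
    k = Pz / (lam + z @ Pz)                 # gain vector
    error = y_target - W_out @ z
    W_out = W_out + np.outer(error, k)
    P = (P - np.outer(k, Pz)) / lam         # update the inverse correlation matrix
    return W_out, P

# typical initialization: P = np.eye(dim_z) / delta with a small delta (hypothetical value)
```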
Echo State Gaussian Process
Let us again consider the readout expression of an ESN comprising $N_x$ reservoir neurons. Let the network output $\mathbf{y}(n)$ consist of $N_y$ component responses, i.e., $\mathbf{y}(n) = [y_1(n), \dots, y_{N_y}(n)]^{\mathsf{T}}$. Then, we have

$$y_i(n) = \mathbf{w}_i^{\mathsf{T}}\mathbf{x}(n), \quad i = 1, \dots, N_y.$$

Let us now impose a spherical Gaussian prior over the readout weights $\mathbf{w}_i$, such that

$$p(\mathbf{w}_i) = \mathcal{N}\left(\mathbf{w}_i \mid \mathbf{0}, \sigma_w^2\mathbf{I}\right).$$

Under this setting, we have for the mean and covariance of the readout component responses

$$\mathbb{E}\left[y_i(n)\right] = \mathbf{x}(n)^{\mathsf{T}}\,\mathbb{E}\left[\mathbf{w}_i\right] = 0,$$

$$\mathbb{E}\left[y_i(n)\,y_i(n')\right] = \mathbf{x}(n)^{\mathsf{T}}\,\mathbb{E}\left[\mathbf{w}_i\mathbf{w}_i^{\mathsf{T}}\right]\mathbf{x}(n') = \sigma_w^2\,\mathbf{x}(n)^{\mathsf{T}}\mathbf{x}(n').$$

Thus, under the Gaussian prior assumption, it turns out that $y_i(n)$ and $y_i(n')$ are jointly Gaussian with zero mean and covariance given by the dot product $\sigma_w^2\,\mathbf{x}(n)^{\mathsf{T}}\mathbf{x}(n')$, for any $n$, $n'$, and $i$. In other words, under our Bayesian approach, the distributions of the ESN readouts turn out to yield a GP of the form

$$p(\mathbf{y}_i \mid \mathbf{X}) = \mathcal{N}\left(\mathbf{y}_i \mid \mathbf{0}, \mathbf{K}\right),$$

where $\mathbf{X}$ is the design matrix of reservoir state vectors and $\mathbf{K}$ is given by $[\mathbf{K}]_{nn'} = k\left(\mathbf{x}(n), \mathbf{x}(n')\right)$, with the kernels of the obtained GPs being functions of the reservoir state vectors, in the form

$$k\left(\mathbf{x}(n), \mathbf{x}(n')\right) = \sigma_w^2\,\mathbf{x}(n)^{\mathsf{T}}\mathbf{x}(n').$$
Generalizing the above results to allow for the utilization of kernels of any kind, in this paper we introduce a novel Bayesian treatment of ESNs, namely, the ESGP. For example, in case a Gaussian radial basis function (RBF) kernel is considered, the definition of the ESGP model yields a prior distribution of the form

$$p(\mathbf{y}_i \mid \mathbf{X}) = \mathcal{N}\left(\mathbf{y}_i \mid \mathbf{0}, \mathbf{K}\right),$$

with its kernel function (reservoir kernel) given by

$$k\left(\mathbf{x}(n), \mathbf{x}(n')\right) = \sigma^2\exp\left(-\frac{\left\|\mathbf{x}(n) - \mathbf{x}(n')\right\|^2}{2\lambda^2}\right).$$

Let us consider an ESGP with $N_x$ reservoir neurons and $N_y$ readout signals. To endow our model with increased robustness to observation noise, we additionally assume that the target signals of an observed phenomenon modeled using the postulated ESGP model consist of a latent function of the input signals, which is learnable by the considered ESGP model, superimposed on an independent white Gaussian noise signal; that is, we adopt the hypothesis that the available training target signals are given by

$$y_i^{\text{target}}(n) = f_i\left(\mathbf{x}(n)\right) + \varepsilon_i(n), \qquad \varepsilon_i(n) \sim \mathcal{N}\left(0, \sigma_n^2\right).$$

The predictive density of the postulated ESGP model at a test time point with reservoir state $\mathbf{x}_*$ yields

$$p\left(y_{i*} \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}_i\right) = \mathcal{N}\left(y_{i*} \mid \mu_{i*}, \sigma_*^2\right),$$

where

$$\mu_{i*} = \mathbf{k}_*^{\mathsf{T}}\left(\mathbf{K} + \sigma_n^2\mathbf{I}\right)^{-1}\mathbf{y}_i,$$

$$\sigma_*^2 = k\left(\mathbf{x}_*, \mathbf{x}_*\right) - \mathbf{k}_*^{\mathsf{T}}\left(\mathbf{K} + \sigma_n^2\mathbf{I}\right)^{-1}\mathbf{k}_* + \sigma_n^2,$$

and $\mathbf{k}_* = \left[k\left(\mathbf{x}(1), \mathbf{x}_*\right), \dots, k\left(\mathbf{x}(T), \mathbf{x}_*\right)\right]^{\mathsf{T}}$.
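A sketch of these predictive equations with an RBF reservoir kernel over collected reservoir states, for a single readout signal; the kernel hyperparameters and noise variance are placeholders, and hyperparameter learning (e.g., by marginal-likelihood maximization) is omitted:

```python
import numpy as np

def rbf_kernel(A, B, sigma2=1.0, length=1.0):
    """Gaussian RBF reservoir kernel between state collections A: (T, N_x), B: (M, N_x)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sigma2 * np.exp(-d2 / (2 * length**2))

def esgp_predict(X_train, y_train, x_star, noise_var=1e-2):
    """GP predictive mean and variance at a test reservoir state x_star: (N_x,).
    X_train: (T, N_x) training reservoir states, y_train: (T,) one target signal."""
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    k_star = rbf_kernel(X_train, x_star[None, :])[:, 0]
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # (K + noise I)^-1 y
    mean = k_star @ alpha
    v = np.linalg.solve(L, k_star)
    var = rbf_kernel(x_star[None, :], x_star[None, :])[0, 0] - v @ v + noise_var
    return mean, var
```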