Stulp, F., & Sigaud, O. (2015). Many regression algorithms, one unified model: A review. Neural Networks, 69, 60–79.
1. Overview
Model: mixture of linear models
LWR: Locally weighted regression
RFWR: Receptive field weighted regression
LWPR: Locally weighted projection regression
XCSF: XCS for functions
GMR: Gaussian mixture regression
M5: Model trees
Model: weighted sum of basis functions
RBFNs: Radial basis function networks
KRR: Kernel ridge regression
GPR: Gaussian process regression
SVR: Support vector regression
iRFRLS: Incremental random features regularized least squares
I-SSGPR: Incremental sparse spectrum Gaussian process regression
CART: Regression trees
ELM: Extreme learning machine
BPROP: Backpropagation
2. Linear Least Squares (LLS)
In the case of linear regression, the family of model functions is linear, which means that the model is represented as a line (1D), a plane (2D) or a hyperplane (≥ 3D). Formally, if the data is assumed centered so that the hyperplane goes through the origin, the underlying model is
$f(\mathbf{x}) = \mathbf{a}^\top\mathbf{x}$.
A popular method for determining $\mathbf{a}$ is the linear least squares (LLS) algorithm. It uses the vertical distances between the observed values $y_n$ and the predictions $f(\mathbf{x}_n)$, which are known as the residuals:
$r_n = y_n - \mathbf{a}^\top\mathbf{x}_n$.
In the LLS case, the sum of the squared residuals is minimized:
$S(\mathbf{a}) = \sum_{n=1}^{N} r_n^2 = \|\mathbf{y} - \mathbf{X}\mathbf{a}\|_2^2, \quad (5)$
where $\mathbf{X}$ is the $N \times D$ design matrix whose rows are the inputs $\mathbf{x}_n^\top$ and $\mathbf{y}$ is the vector of targets. Therefore, our target is to minimize $S(\mathbf{a})$, which is a continuously differentiable unconstrained optimization problem. It is solved by determining the $\mathbf{a}^*$ for which the derivative $\partial S/\partial\mathbf{a}$ is zero, which yields the closed-form solution
$\mathbf{a}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}. \quad (6)$
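As a quick illustration (not from the paper), this closed-form solution can be computed directly with NumPy; the synthetic data and the true slope vector below are assumptions made for the example.

```python
import numpy as np

# Synthetic centered data: N = 100 samples, D = 3 input dimensions (illustrative).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
a_true = np.array([1.5, -2.0, 0.5])
y = X @ a_true + 0.1 * rng.standard_normal(100)

# Closed-form LLS solution a* = (X^T X)^{-1} X^T y.
# np.linalg.solve is preferred over an explicit matrix inverse for numerical stability.
a_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(a_hat)  # should be close to a_true
```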
2.1. Least Squares and Least Deviations
The $\ell_p$-norm of a $D$-dimensional vector $\mathbf{v}$ is defined as
$\|\mathbf{v}\|_p = \left(\sum_{d=1}^{D} |v_d|^p\right)^{1/p}$.
Least squares minimizes the $\ell_2$-norm of the residual vector, whereas least absolute deviations minimizes its $\ell_1$-norm, which is less sensitive to outliers but does not admit a closed-form solution.
2.2 Regularized Least Squares
Potential singularities in $\mathbf{X}^\top\mathbf{X}$ may make it difficult to invert this matrix, which can result in very large values in the vector $\mathbf{a}$. A solution to this issue is to explicitly penalize large weights, which results in Regularized Linear Least Squares (RGLS), where (5) becomes
$\mathbf{a}^* = \arg\min_{\mathbf{a}} \sum_{n=1}^{N} (y_n - \mathbf{a}^\top\mathbf{x}_n)^2 + \lambda\,\Omega(\mathbf{a})$,
where $\lambda$ is a parameter that determines the trade-off between small values in $\mathbf{a}$ and small residuals. Using the $\ell_2$-norm for $\Omega(\mathbf{a})$, i.e. $\Omega(\mathbf{a}) = \|\mathbf{a}\|_2^2$, is called Tikhonov regularization. RGLS with Tikhonov regularization is known as Ridge Regression (RR). Using the same derivation as for (6), the analytical solution to this least squares minimization problem is
$\mathbf{a}^* = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$.
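A sketch of this ridge solution under the same kind of synthetic-data assumptions as above; the value lam = 0.1 is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

lam = 0.1  # illustrative regularization strength (trade-off parameter lambda)
# Ridge regression: a* = (X^T X + lam * I)^{-1} X^T y
a_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(a_ridge)
```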
2.3 Adding offsets to the linear model
If the data is not centered, an offset $b$ should be added, so that the linear model becomes
$f(\mathbf{x}) = \mathbf{a}^\top\mathbf{x} + b$.
In this case, the design matrix is extended with a column of ones, so that the offset is estimated as just another weight:
$\tilde{\mathbf{X}} = [\,\mathbf{X}\;\;\mathbf{1}\,]$.
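A small sketch of this trick of absorbing the offset into the design matrix by appending a column of ones; the data is again synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + 0.1 * rng.standard_normal(100)  # non-zero offset

# Append a column of ones so the offset b is learned as one extra weight.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
sol = np.linalg.solve(X1.T @ X1, X1.T @ y)
a_hat, b_hat = sol[:-1], sol[-1]
print(a_hat, b_hat)  # b_hat should be close to 4.0
```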
2.4 Multivariate regression
In multivariate regression, the output space is multi-dimensional ($\mathbf{y} \in \mathbb{R}^{M}$ with $M > 1$), where the output is thus a vector, $\mathbf{a}$ becomes a $D \times M$ matrix $\mathbf{A}$, and the linear model is
$f(\mathbf{x}) = \mathbf{A}^\top\mathbf{x}$.
The design matrix of inputs is the same matrix $\mathbf{X}$, but the targets are now stored in an $N \times M$ matrix $\mathbf{Y}$, and the least squares solution becomes
$\mathbf{A}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}$.
This solution is decoupled between the different output variables, and corresponds to $M$ separate univariate least squares solutions, one per column of $\mathbf{Y}$.
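A sketch showing that the multivariate solution coincides with per-output univariate solutions; the sizes N, D, M and the data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, M = 200, 3, 2          # samples, input dimensions, output dimensions
X = rng.standard_normal((N, D))
A_true = rng.standard_normal((D, M))
Y = X @ A_true + 0.05 * rng.standard_normal((N, M))

# A* = (X^T X)^{-1} X^T Y ; each column of A* is an independent univariate solution.
A_hat = np.linalg.solve(X.T @ X, X.T @ Y)
A_cols = np.column_stack([np.linalg.solve(X.T @ X, X.T @ Y[:, m]) for m in range(M)])
print(np.allclose(A_hat, A_cols))  # True: the outputs decouple
```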
3. From linear least squares to non-linear regression
3.1 Locally weighted regression (LWR)
For each training example $(\mathbf{x}_n, y_n)$, we define a corresponding weight $w_n$. With these weights, the sum of residuals is now computed as
$S(\mathbf{a}) = \sum_{n=1}^{N} w_n\,(y_n - \mathbf{a}^\top\mathbf{x}_n)^2. \quad (16)$
Defining $\mathbf{W}$ to be an $N \times N$ diagonal weight matrix with $W_{nn} = w_n$, and rewriting (16) in matrix form as $S(\mathbf{a}) = (\mathbf{y}-\mathbf{X}\mathbf{a})^\top\mathbf{W}(\mathbf{y}-\mathbf{X}\mathbf{a})$, the weighted linear least squares solution becomes
$\mathbf{a}^* = (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{y}$.
The weights for each sample are typically defined as a function of the input space through a weighting function $\phi(\mathbf{x};\boldsymbol{\theta})$, parameterized with $\boldsymbol{\theta}$. A commonly used weighting function is the multivariate Gaussian:
$\phi(\mathbf{x}; \mathbf{c}, \boldsymbol{\Sigma}) = \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mathbf{c})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\mathbf{c})\right)$.
Locally Weighted Regression (LWR) is an extension of weighted linear least squares, in which $E$ independent weighted regressions are performed on the same data (in the design matrix $\mathbf{X}$), but with different weight matrices $\mathbf{W}_e$. Performing weighted least squares $E$ times for the different weighting functions $\phi_e$ leads to $E$ local linear models $\mathbf{a}_e$. Each model is mainly responsible for the region of the input space where its weighting function is large; this local responsibility is the reason for the name Locally Weighted Regression.
The resulting model is a weighted sum of these linear models, where the weights are determined by the functions $\phi_e$:
$f(\mathbf{x}) = \sum_{e=1}^{E} \phi_e(\mathbf{x})\,(\mathbf{a}_e^\top\mathbf{x}). \quad (22)$
Because each of the $E$ local least squares regressions is performed independently of the others, the summation in (22) requires the weighting functions to be normalized, i.e. $\sum_{e=1}^{E}\phi_e(\mathbf{x}) = 1$ for all $\mathbf{x}$. For instance, the normalized Gaussian weighting function is:
$\phi_e(\mathbf{x}) = \dfrac{\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mathbf{c}_e)^\top\boldsymbol{\Sigma}_e^{-1}(\mathbf{x}-\mathbf{c}_e)\right)}{\sum_{e'=1}^{E}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mathbf{c}_{e'})^\top\boldsymbol{\Sigma}_{e'}^{-1}(\mathbf{x}-\mathbf{c}_{e'})\right)}$.
In summary, the LWR algorithm performs several independent weighted linear least squares regressions on the same data, each with a different weighting function, usually localized around a different part of the input space. The result is a weighted sum of linear models, where the weights are determined by the (normalized) weighting functions.
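The following sketch implements this recipe on a 1-D toy problem, assuming isotropic Gaussian weighting functions on a fixed grid of centers; the helper names and all values are illustrative, not from the paper.

```python
import numpy as np

def gaussian_weights(X, center, sigma):
    """Unnormalized Gaussian weighting function with isotropic covariance."""
    d2 = np.sum((X - center) ** 2, axis=1)
    return np.exp(-0.5 * d2 / sigma**2)

def lwr_fit_predict(X, y, centers, sigma, x_query):
    """Fit E independent weighted least squares models and blend their predictions."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # bias column: each local model is affine
    xq = np.append(x_query, 1.0)
    preds, weights = [], []
    for c in centers:
        w = gaussian_weights(X, c, sigma)
        W = np.diag(w)
        a_e = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)   # weighted LLS for model e
        preds.append(xq @ a_e)
        weights.append(np.exp(-0.5 * np.sum((x_query - c) ** 2) / sigma**2))
    weights = np.array(weights)
    weights /= weights.sum()                    # normalized weighting functions
    return np.dot(weights, preds)

# 1-D toy example
rng = np.random.default_rng(3)
X = np.linspace(0, 10, 200)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
centers = np.linspace(0, 10, 8)[:, None]
print(lwr_fit_predict(X, y, centers, sigma=1.0, x_query=np.array([5.0])))  # ~ sin(5)
```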
3.2 Regression with radial basis function networks (RBFNs)
Instead of weighting each input sample differently, as in weighted linear least squares and LWR, an alternative extension to LLS is to project each input sample into a higher-dimensional feature space using a set of basis functions:
$f(\mathbf{x}) = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}) = \sum_{e=1}^{E} w_e\,\phi_e(\mathbf{x})$.
When the basis functions are radial, i.e. they depend only on the distance to a center, $\phi_e(\mathbf{x}) = \phi(\|\mathbf{x}-\mathbf{c}_e\|)$, the model is called a Radial Basis Function Network (RBFN). To apply LLS to optimize $\mathbf{w}$, we define the projected version of the design matrix, the feature matrix, which is $N \times E$ (one column for each basis function feature):
$\Theta_{n,e} = \phi_e(\mathbf{x}_n)$,
and we get the least squares solution
$\mathbf{w}^* = (\boldsymbol{\Theta}^\top\boldsymbol{\Theta})^{-1}\boldsymbol{\Theta}^\top\mathbf{y}$.
In summary, non-linear regression can be achieved by using a set of basis functions to project the input space into a feature space, and performing LLS in that feature space. The result is a weighted sum of basis functions, where the weights are determined by a single linear regression.
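A minimal RBFN sketch (not from the paper), assuming Gaussian basis functions on a fixed grid of centers; the number of centers, the width sigma and the data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.linspace(-3, 3, 150)[:, None]
y = np.sinc(X[:, 0]) + 0.05 * rng.standard_normal(150)

# Radial (Gaussian) basis functions centered on a fixed grid.
centers = np.linspace(-3, 3, 12)
sigma = 0.5
Theta = np.exp(-0.5 * (X - centers[None, :]) ** 2 / sigma**2)   # N x E feature matrix

# Single linear regression in feature space: w* = (Theta^T Theta)^{-1} Theta^T y
w = np.linalg.solve(Theta.T @ Theta, Theta.T @ y)

# Prediction at new inputs
X_new = np.array([[0.0], [1.5]])
Theta_new = np.exp(-0.5 * (X_new - centers[None, :]) ** 2 / sigma**2)
print(Theta_new @ w)
```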
4. Model: mixture of linear models
4.1 Algorithm: locally weighted regression (LWR)
Although proposed almost two decades ago, LWR is still a competitive, widely-used algorithm, due to its simplicity. Fitting results may, however, become brittle if the input space dimensionality is high and not enough data is available.
4.2 Algorithm: receptive field weighted regression (RFWR)
The RFWR algorithm is the incremental variant of LWR, and it automates the choice of several model parameters. The main differences from LWR are:
RFWR is incremental, rather than batch, and uses Recursive Least Squares (RLS) instead of a batch least squares algorithm;
New linear models and weighting functions – called receptive fields
– are added automatically by the algorithm when needed, i.e. when some
input point is not covered by any receptive field.
The centers and covariance matrices of the receptive fields are also
adapted.
This flexibility comes at the cost of having to tune several
meta-parameters, such as thresholds for removing or adding
receptive fields as well as first and/or second order learning
rates.
4.3 Algorithm: locally weighted projection regression (LWPR)
The LWPR algorithm is an extension of RFWR, in which a low-dimensional affine projection is performed before fitting the linear models. LWPR has many of the same meta-parameters as RFWR, and thus the same difficulties in tuning these meta-parameters apply.
4.4 Algorithm: XCSF
4.5 Algorithm: Gaussian mixture regression (GMR)
The underlying assumption in Gaussian Mixture Regression (GMR) is that the data in the joint input × output space can be well represented by a set of Gaussians, which is known as a Gaussian Mixture Model (GMM).
Training: unsupervised learning. A notable feature of GMR is that the training phase consists of unsupervised learning. It is performed by fitting a Gaussian Mixture Model (GMM) to the data with the Expectation–Maximization (EM) algorithm. $k$-means clustering is commonly used to provide a first initialization of the centers of the Gaussians.
Consider the input $\mathbf{x}_n$ and target $y_n$; they are concatenated into one vector $\mathbf{z}_n = [\mathbf{x}_n^\top\; y_n]^\top$. The GMM represents a model of the density of the vectors $\mathbf{z}_n$ as a weighted sum of Gaussian functions:
$p(\mathbf{z}) = \sum_{e=1}^{E} \pi_e\,\mathcal{N}(\mathbf{z}\mid\boldsymbol{\mu}_e, \boldsymbol{\Sigma}_e)$.
The EM algorithm adjusts the priors $\pi_e$ and the parameters of the Gaussian functions (the means $\boldsymbol{\mu}_e$ and covariance matrices $\boldsymbol{\Sigma}_e$) that define this model. The only meta-parameter of the training phase is $E$, the number of Gaussians.
Regression: conditioning on the input. In regression, we are interested in predicting $y$ given $\mathbf{x}$. To do so, each Gaussian can be separated into input and output components as follows:
$\boldsymbol{\mu}_e = \begin{bmatrix}\boldsymbol{\mu}_{e,X}\\ \boldsymbol{\mu}_{e,Y}\end{bmatrix}, \qquad \boldsymbol{\Sigma}_e = \begin{bmatrix}\boldsymbol{\Sigma}_{e,X} & \boldsymbol{\Sigma}_{e,XY}\\ \boldsymbol{\Sigma}_{e,YX} & \boldsymbol{\Sigma}_{e,Y}\end{bmatrix}$.
For instance, if the dimensionality of the input space is 2, and that of the output space is 1, then $\boldsymbol{\Sigma}_e$ would be of size 3 × 3, being the covariance matrix of the full input × output space.
Then, the expected output $\hat{y}$ given an input $\mathbf{x}$ is computed by conditioning each Gaussian on $\mathbf{x}$ and blending the results with the responsibilities $h_e(\mathbf{x})$:
$\hat{y} = \sum_{e=1}^{E} h_e(\mathbf{x})\left(\boldsymbol{\mu}_{e,Y} + \boldsymbol{\Sigma}_{e,YX}\boldsymbol{\Sigma}_{e,X}^{-1}(\mathbf{x}-\boldsymbol{\mu}_{e,X})\right)$, with $h_e(\mathbf{x}) = \dfrac{\pi_e\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_{e,X},\boldsymbol{\Sigma}_{e,X})}{\sum_{e'=1}^{E}\pi_{e'}\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_{e',X},\boldsymbol{\Sigma}_{e',X})}$.
As a Bayesian method, an advantage of GMR over the other methods of this family is that a variance can also be computed along with the estimate of $\hat{y}$, from the conditional covariances $\boldsymbol{\Sigma}_{e,Y} - \boldsymbol{\Sigma}_{e,YX}\boldsymbol{\Sigma}_{e,X}^{-1}\boldsymbol{\Sigma}_{e,XY}$ weighted by the responsibilities $h_e(\mathbf{x})$.
Incremental GMR. A fast and incremental version of GMR, named Incremental Local Online GMR (ILO-GMR), has also been proposed.
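A sketch of GMR that uses scikit-learn's GaussianMixture for the EM training phase (which uses a k-means initialization by default) and manual conditioning for the regression phase; the helper gmr_predict and all values are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
x = np.linspace(0, 2 * np.pi, 300)
y = np.sin(x) + 0.1 * rng.standard_normal(300)
Z = np.column_stack([x, y])                       # joint input-output samples z_n

# Unsupervised training: fit a GMM with E = 5 Gaussians via EM.
gmm = GaussianMixture(n_components=5, covariance_type='full', random_state=0).fit(Z)

def gmr_predict(x_query, gmm, d_in=1):
    """Condition each Gaussian on the input and blend with responsibilities h_e."""
    xq = np.atleast_1d(x_query)
    mus, covs, priors = gmm.means_, gmm.covariances_, gmm.weights_
    h = np.array([p * multivariate_normal.pdf(xq, m[:d_in], c[:d_in, :d_in])
                  for p, m, c in zip(priors, mus, covs)])
    h /= h.sum()
    y_hat = 0.0
    for he, m, c in zip(h, mus, covs):
        cond_mean = m[d_in:] + c[d_in:, :d_in] @ np.linalg.solve(c[:d_in, :d_in], xq - m[:d_in])
        y_hat += he * cond_mean
    return y_hat

print(gmr_predict(np.pi / 2, gmm))   # should be close to sin(pi/2) = 1
```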
5. Model: basis function network
In this section we turn to algorithms that yield a weighted sum of
basis functions.
5.1 Least squares for basis function networks
For all algorithms in this section, least squares regression can be used to learn the weight associated with each basis function. The standard and regularized solutions to the least squares problem are
$\mathbf{w}^* = (\boldsymbol{\Theta}^\top\boldsymbol{\Theta})^{-1}\boldsymbol{\Theta}^\top\mathbf{y}$ and $\mathbf{w}^* = (\boldsymbol{\Theta}^\top\boldsymbol{\Theta} + \lambda\mathbf{I})^{-1}\boldsymbol{\Theta}^\top\mathbf{y}$.
We now describe three forms that $\boldsymbol{\Theta}$ may take: the design matrix, the feature matrix, or the Gram matrix.
5.1.1 Design matrix
In linear least squares, the matrix is simply the design matrix $\mathbf{X}$ of size $N \times D$, i.e. each row contains one input example, with each column representing a different dimension of the input space. Informally, $\mathbf{X}$ may be considered as containing the 'raw' input data.
5.1.2 Feature matrix
Projecting the input space into a higher-dimensional feature space with $E$ basis functions yields an $N \times E$ feature matrix $\boldsymbol{\Theta}$ (one column for each basis function feature).
5.1.3 Gram matrix (for kernel functions)
A kernel is a special type of basis function which relates two elements of the input space to each other. In this case, the matrix is of size $N \times N$, because the kernel is evaluated on all combinations of input examples:
$K_{i,j} = k(\mathbf{x}_i, \mathbf{x}_j). \quad (34)$
This matrix is known as the Gram matrix. Note that the same kernel function $k$ is used for all entries of the Gram matrix.
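A small sketch of building a Gram matrix, assuming a Gaussian kernel; the sizes and the kernel width are illustrative.

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    """Gaussian kernel relating two elements of the input space."""
    return np.exp(-0.5 * np.sum((xi - xj) ** 2) / sigma**2)

X = np.random.default_rng(6).standard_normal((5, 2))   # 5 examples, 2 input dimensions
# Gram matrix: the same kernel evaluated on all pairs of input examples (N x N).
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
print(K.shape)   # (5, 5)
```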
5.2 Algorithm: regression with radial basis function networks (revisited)
5.3 Algorithm: kernel ridge regression (KRR)
In Kernel Ridge Regression (KRR), also called Kernel Regularized Least Squares, the basis functions are generated from a kernel function $k(\mathbf{x}, \mathbf{x}')$, which takes two vectors from the input space as input. Using these kernel functions, regression consists in finding the weights of the function
$f(\mathbf{x}) = \sum_{n=1}^{N} w_n\,k(\mathbf{x}, \mathbf{x}_n)$.
The Gram matrix $\mathbf{K}$ is given as in (34), and the regularized least squares solution is
$\mathbf{w}^* = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}$.
Thus, while an RBFN allows arbitrarily placed basis function centers, KRR centers one basis function around each data point.
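A minimal KRR sketch, assuming a Gaussian kernel and an illustrative regularization strength lam; data and kernel width are synthetic choices.

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.linspace(-1, 1, 50)[:, None]
y = X[:, 0] ** 3 + 0.05 * rng.standard_normal(50)

def gaussian_kernel(A, B, sigma=0.3):
    """Pairwise Gaussian kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / sigma**2)

lam = 1e-3
K = gaussian_kernel(X, X)                         # N x N Gram matrix
w = np.linalg.solve(K + lam * np.eye(len(X)), y)  # w* = (K + lam I)^{-1} y

# One basis function is centered on every training point:
x_new = np.array([[0.5]])
print(gaussian_kernel(x_new, X) @ w)              # prediction f(x_new), roughly 0.125
```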
5.4 Algorithm: Gaussian process regression (GPR)
In Gaussian Process Regression (GPR), the key idea is that the output data $\mathbf{y}$, or any subset of the output data, can be thought of as one sample from a multivariate ($N$-variate) Gaussian distribution. Often, it is assumed that the mean of this distribution is $\mathbf{0}$, i.e. the Gaussian Process is represented as
$\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$.
The covariance matrix $\mathbf{K}$ is determined by the covariance function or kernel. A typical choice is the Gaussian kernel, as for smooth functions we in general expect higher covariance between points that are closer.
For a given covariance function $k$ and training points $\mathbf{x}_{1:N}$, the corresponding Gaussian Process is
$\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}(\mathbf{X}, \mathbf{X}))$,
where $\mathbf{K}(\mathbf{X}, \mathbf{X})$ is the Gram matrix.
Predicting the output $y^*$ for a novel input $\mathbf{x}^*$ is done by assuming that the novel output is also sampled from a multivariate Gaussian, jointly with the training outputs:
$\begin{bmatrix}\mathbf{y}\\ y^*\end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix}\mathbf{K}(\mathbf{X},\mathbf{X}) & \mathbf{k}^*\\ \mathbf{k}^{*\top} & k(\mathbf{x}^*,\mathbf{x}^*)\end{bmatrix}\right)$, with $\mathbf{k}^* = [\,k(\mathbf{x}^*,\mathbf{x}_1)\;\cdots\;k(\mathbf{x}^*,\mathbf{x}_N)\,]^\top$.
Conditioning this joint distribution to predict $y^*$ given $\mathbf{y}$ yields another Gaussian, which predicts the probability of observing a certain $y^*$:
$y^* \mid \mathbf{y} \sim \mathcal{N}\!\left(\mathbf{k}^{*\top}\mathbf{K}^{-1}\mathbf{y},\; k(\mathbf{x}^*,\mathbf{x}^*) - \mathbf{k}^{*\top}\mathbf{K}^{-1}\mathbf{k}^*\right)$.
Thus, the best estimate for $y^*$ is the mean $\hat{y}^* = \mathbf{k}^{*\top}\mathbf{K}^{-1}\mathbf{y}$, and the uncertainty in $y^*$ is captured by the variance $\operatorname{var}(y^*) = k(\mathbf{x}^*,\mathbf{x}^*) - \mathbf{k}^{*\top}\mathbf{K}^{-1}\mathbf{k}^*$.
Rather than tuning the kernel hyperparameters by hand, it is customary to tune them automatically by maximizing the marginal (log) likelihood of the training data.
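A minimal GPR prediction sketch with a Gaussian kernel; the observation-noise term added to the diagonal of the Gram matrix is a common practical assumption not spelled out above, and the hyperparameters are fixed by hand rather than optimized.

```python
import numpy as np

rng = np.random.default_rng(8)
X = np.linspace(0, 5, 30)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

def kernel(A, B, length=1.0, signal=1.0):
    """Gaussian covariance function between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return signal**2 * np.exp(-0.5 * d2 / length**2)

noise = 0.1
K = kernel(X, X) + noise**2 * np.eye(len(X))    # Gram matrix plus observation noise

x_star = np.array([[2.5]])
k_star = kernel(x_star, X)                      # covariances with the novel input (1 x N)
alpha = np.linalg.solve(K, y)

mean = k_star @ alpha                                                  # best estimate
var = kernel(x_star, x_star) - k_star @ np.linalg.solve(K, k_star.T)  # predictive variance
print(mean, var)
```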
Non-zero mean function. Prior knowledge about a function may be introduced in the mean of the Gaussian process. The mean itself may depend on the input, so we write $m(\mathbf{x})$, which leads to the Gaussian Process
$\mathbf{y} \sim \mathcal{N}(\mathbf{m}(\mathbf{X}), \mathbf{K}(\mathbf{X},\mathbf{X}))$,
which, when conditioned on $\mathbf{y}$, yields the predictive mean
$\hat{y}^* = m(\mathbf{x}^*) + \mathbf{k}^{*\top}\mathbf{K}^{-1}(\mathbf{y} - \mathbf{m}(\mathbf{X}))$
(the variance estimate stays the same).
Sparse and online versions of GPR. Given the cubic computational cost in the number of examples due to the inversion of $\mathbf{K}$, GPR is often not applicable online.
The sparse approach consists in limiting the number of training points that are kept, by adequately choosing the points to remember and forgetting the rest.
The mixture of experts approach rather consists in splitting the
global domain of the function into smaller regions, as done in the LWR
family of methods.
5.5 Algorithm: support vector regression (SVR)
In Support Vector Regression (SVR), any residual that is smaller than a given threshold $\epsilon$ is not penalized. The region in which the residuals are zero is called the $\epsilon$-insensitive tube. This leads to the $\epsilon$-insensitive loss function:
$|r|_\epsilon = \max(0,\,|r| - \epsilon)$.
This loss function cannot be minimized analytically, and the problem is rather stated as a constrained optimization problem,
$\min_{\mathbf{a},\,\boldsymbol{\xi},\,\boldsymbol{\xi}^*}\ \tfrac{1}{2}\|\mathbf{a}\|^2 + C\sum_{n=1}^{N}(\xi_n + \xi_n^*)\ \ \text{s.t.}\ \ y_n - \mathbf{a}^\top\mathbf{x}_n \le \epsilon + \xi_n,\ \ \mathbf{a}^\top\mathbf{x}_n - y_n \le \epsilon + \xi_n^*,\ \ \xi_n, \xi_n^* \ge 0, \quad (47)$
which uses the slack variables $\xi_n$ and $\xi_n^*$, representing the distance above or below the $\epsilon$-insensitive tube.
This constrained optimization problem is solved by defining a Lagrange function that incorporates both the term to be minimized and the constraints. The final result for linear SVR, in which the underlying model is $f(\mathbf{x}) = \mathbf{a}^\top\mathbf{x} + b$, is:
$\mathbf{a} = \sum_{n=1}^{N} (\alpha_n - \alpha_n^*)\,\mathbf{x}_n$,
where $\alpha_n$ and $\alpha_n^*$ are the Lagrangian multipliers in the Lagrange function for the first two constraints in (47). With this solution, the linear function becomes:
$f(\mathbf{x}) = \sum_{n=1}^{N} (\alpha_n - \alpha_n^*)\,\mathbf{x}_n^\top\mathbf{x} + b$.
The slopes $\mathbf{a}$ are thus represented as a linear combination of the training examples $\mathbf{x}_n$.
When applying the basis function approach to SVR, i.e. using basis functions to project into the feature space (and using $\mathbf{w}$ rather than $\mathbf{a}$ to denote weights instead of slopes), the solution becomes
$f(\mathbf{x}) = \sum_{n=1}^{N} (\alpha_n - \alpha_n^*)\,k(\mathbf{x}_n, \mathbf{x}) + b$.
The meta-parameters of SVR are the tolerance parameter $\epsilon$ and the trade-off parameter $C$.
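Rather than re-deriving the dual solver, the sketch below uses scikit-learn's SVR to illustrate the two meta-parameters; the data and the chosen values of C and epsilon are illustrative.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(9)
X = np.linspace(0, 4, 120)[:, None]
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(120)

# epsilon: half-width of the insensitive tube; C: residual/flatness trade-off.
model = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(X, y)
print(model.predict(np.array([[1.0]])))   # roughly sin(2)

# The fitted function is a kernel expansion over the support vectors only:
print(len(model.support_), 'of', len(X), 'training points are support vectors')
```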
5.6 Algorithm: iRFRLS
iRFRLS is an incremental version of Random Features Regularized Least Squares (RFRLS). With respect to RBFNs, the key idea behind iRFRLS is that any function can be approximated arbitrarily well as a weighted sum of cosine functions. Thus, instead of using Gaussian kernels as in RBFNs, $\boldsymbol{\phi}(\mathbf{x})$ is a set of randomly drawn cosine functions:
$\phi_e(\mathbf{x}) = \cos(\boldsymbol{\omega}_e^\top\mathbf{x} + \psi_e)$,
where the frequencies $\boldsymbol{\omega}_e$ and phases $\psi_e$ are sampled at random, and the weights of these features are obtained with regularized least squares.
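A batch (non-incremental) sketch of the random-cosine-feature idea followed by regularized least squares; the sampling distributions for the frequencies and phases, and all constants, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)
X = np.linspace(-3, 3, 300)[:, None]
y = np.abs(X[:, 0]) + 0.05 * rng.standard_normal(300)

E, lam = 200, 1e-2
# Random cosine features: frequencies drawn from a Gaussian, phases uniform in [0, 2*pi).
Omega = rng.standard_normal((X.shape[1], E)) * 2.0
psi = rng.uniform(0, 2 * np.pi, E)
Phi = np.cos(X @ Omega + psi)                    # N x E feature matrix

w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(E), Phi.T @ y)   # regularized least squares

x_new = np.array([[1.0]])
print(np.cos(x_new @ Omega + psi) @ w)           # roughly |1.0| = 1.0
```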
5.7 Algorithm: I-SSGPR
I-SSGPR also inherits from the GPR framework the capability to optimize most of its meta-parameters using the marginal likelihood, so the only meta-parameter that remains to be tuned is the number of random features.
5.8 Algorithm: CART (regression trees)
5.9 Algorithm: extreme learning machine (ELM)
The Extreme Learning Machine (ELM) is a batch regression algorithm for training the weights of a Single-Hidden Layer Feedforward Network (SLFN). An SLFN is an Artificial Neural Network (ANN) that has one input layer, one hidden layer, and one output layer, represented as
$f(\mathbf{x}) = \sum_{e=1}^{E} w_e\,g(\boldsymbol{\theta}_e^\top\mathbf{x} + b_e)$.
The activation function $g$ is often chosen to be a sigmoid or hyperbolic tangent function.
In the ELM approach, the basis function parameters (the hidden-layer weights $\boldsymbol{\theta}_e$ and biases $b_e$) are first set randomly, and are then used to generate the feature matrix $\boldsymbol{\Theta}$. Least squares is then used to determine the output weights $\mathbf{w}$.
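A minimal ELM sketch with a sigmoid hidden layer; the number of hidden units and the random-weight distribution are illustrative choices, not prescribed by the notes above.

```python
import numpy as np

rng = np.random.default_rng(11)
X = np.linspace(-2, 2, 200)[:, None]
y = np.tanh(3 * X[:, 0]) + 0.05 * rng.standard_normal(200)

E = 50                                            # number of hidden units (illustrative)
# Hidden-layer weights and biases are drawn at random and never trained.
W_in = rng.standard_normal((X.shape[1], E))
b_in = rng.standard_normal(E)
Theta = 1.0 / (1.0 + np.exp(-(X @ W_in + b_in)))  # sigmoid activations, N x E feature matrix

# Output weights are obtained by (pseudo-inverse) least squares.
w_out, *_ = np.linalg.lstsq(Theta, y, rcond=None)

x_new = np.array([[0.5]])
h_new = 1.0 / (1.0 + np.exp(-(x_new @ W_in + b_in)))
print(h_new @ w_out)   # roughly tanh(1.5)
```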
5.10 Algorithm: backpropagation
6. Summary
Legend of the summary table in the paper. Gray: meta-parameters of the algorithms (partial model specifications in columns 3–5 and further algorithmic parameters in column 6) which the user must set. White: model parameters that the algorithm determines itself. Abbreviations: adap. = adapted, indiv. = from individual training examples, rand. = randomly sampled, CO = constrained optimization.