Stulp, F., & Sigaud, O. (2015). Many regression algorithms, one unified model: A review. Neural Networks, 69, 60–79.
1. Overview
Model: mixture of linear models
LWR: Locally weighted regression
RFWR: Receptive field weighted regression
LWPR: Locally weighted projection regression
XCSF: XCS for functions
GMR: Gaussian mixture regression
M5: Model trees
Model: weighted sum of basis functions
RBFNs: Radial basis function networks
KRR: Kernel ridge regression
GPR: Gaussian process regression
SVR: Support vector regression
iRFRLS: Incremental random features regularized least squares
I-SSGPR: Incremental sparse spectrum Gaussian process regression
CART: Regression trees
ELM: Extreme learning machine
BPROP: Backpropagation
2. Linear Least Squares (LLS)
In the case of linear regression, the family of model functions is linear, which means that the model is represented as a line (1D), a plane (2D) or a hyperplane (≥ 3D). Formally, if the data is assumed centered so that the hyperplane goes through the origin, the underlying model is
$f(\mathbf{x}) = \mathbf{a}^\top\mathbf{x}$.
A popular method for determining $\mathbf{a}$ is the linear least squares (LLS) algorithm. It uses the vertical distances between the observed values $y_n$ and the predictions $f(\mathbf{x}_n)$, which are known as the residuals:
$r_n = y_n - \mathbf{a}^\top\mathbf{x}_n$.
In the LLS case, the sum of the squared residuals is minimized:
$S(\mathbf{a}) = \sum_{n=1}^{N} r_n^2 = \|\mathbf{y} - \mathbf{X}\mathbf{a}\|_2^2, \quad (5)$
where $\mathbf{X}$ is the $N \times D$ design matrix whose rows are the inputs $\mathbf{x}_n^\top$ and $\mathbf{y}$ is the vector of targets. Therefore, our target is to minimize $S(\mathbf{a})$, which is a continuously differentiable unconstrained optimization problem. It is solved by determining the $\mathbf{a}^*$ for which the derivative $\partial S/\partial\mathbf{a}$ is zero, which yields the closed-form solution
$\mathbf{a}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}. \quad (6)$
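As a quick illustration (not from the paper), this closed-form solution can be computed directly with NumPy; the synthetic data and the true slope vector below are assumptions made for the example.

```python
import numpy as np

# Synthetic centered data: N = 100 samples, D = 3 input dimensions (illustrative).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
a_true = np.array([1.5, -2.0, 0.5])
y = X @ a_true + 0.1 * rng.standard_normal(100)

# Closed-form LLS solution a* = (X^T X)^{-1} X^T y.
# np.linalg.solve is preferred over an explicit matrix inverse for numerical stability.
a_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(a_hat)  # should be close to a_true
```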
2.1. Least Squares and Least Deviations
The $\ell_p$-norm of a $D$-dimensional vector $\mathbf{v}$ is defined as
$\|\mathbf{v}\|_p = \left(\sum_{d=1}^{D} |v_d|^p\right)^{1/p}$.
Least squares minimizes the $\ell_2$-norm of the residual vector, whereas least absolute deviations minimizes its $\ell_1$-norm, which is less sensitive to outliers but does not admit a closed-form solution.
2.2 Regularized Least Squares
Potential singularities in $\mathbf{X}^\top\mathbf{X}$ may make it difficult to invert this matrix, which can result in very large values in the vector $\mathbf{a}$. A solution to this issue is to explicitly penalize large weights, which results in Regularized Linear Least Squares (RGLS), where (5) becomes
$\mathbf{a}^* = \arg\min_{\mathbf{a}} \sum_{n=1}^{N} (y_n - \mathbf{a}^\top\mathbf{x}_n)^2 + \lambda\,\Omega(\mathbf{a})$,
where $\lambda$ is a parameter that determines the trade-off between small values in $\mathbf{a}$ and small residuals. Using the $\ell_2$-norm for $\Omega(\mathbf{a})$, i.e. $\Omega(\mathbf{a}) = \|\mathbf{a}\|_2^2$, is called Tikhonov regularization. RGLS with Tikhonov regularization is known as Ridge Regression (RR). Using the same derivation as for (6), the analytical solution to this least squares minimization problem is
$\mathbf{a}^* = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$.
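A sketch of this ridge solution under the same kind of synthetic-data assumptions as above; the value lam = 0.1 is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

lam = 0.1  # illustrative regularization strength (trade-off parameter lambda)
# Ridge regression: a* = (X^T X + lam * I)^{-1} X^T y
a_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(a_ridge)
```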
2.3 Adding offsets to the linear model
If the data is not centered, an offset $b$ should be added, so that the linear model becomes
$f(\mathbf{x}) = \mathbf{a}^\top\mathbf{x} + b$.
In this case, the design matrix is extended with a column of ones, so that the offset is estimated as just another weight:
$\tilde{\mathbf{X}} = [\,\mathbf{X}\;\;\mathbf{1}\,]$.
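A small sketch of this trick of absorbing the offset into the design matrix by appending a column of ones; the data is again synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + 0.1 * rng.standard_normal(100)  # non-zero offset

# Append a column of ones so the offset b is learned as one extra weight.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
sol = np.linalg.solve(X1.T @ X1, X1.T @ y)
a_hat, b_hat = sol[:-1], sol[-1]
print(a_hat, b_hat)  # b_hat should be close to 4.0
```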
2.4 Multivariate regression
In multivariate regression, the output space is multi-dimensional ($\mathbf{y} \in \mathbb{R}^{M}$ with $M > 1$), where the output is thus a vector, $\mathbf{a}$ becomes a $D \times M$ matrix $\mathbf{A}$, and the linear model is
$f(\mathbf{x}) = \mathbf{A}^\top\mathbf{x}$.
The design matrix of inputs is the same matrix $\mathbf{X}$, but the targets are now stored in an $N \times M$ matrix $\mathbf{Y}$, and the least squares solution becomes
$\mathbf{A}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}$.
This solution is decoupled between the different output variables, and corresponds to $M$ separate univariate least squares solutions, one per column of $\mathbf{Y}$.
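A sketch showing that the multivariate solution coincides with per-output univariate solutions; the sizes N, D, M and the data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, M = 200, 3, 2          # samples, input dimensions, output dimensions
X = rng.standard_normal((N, D))
A_true = rng.standard_normal((D, M))
Y = X @ A_true + 0.05 * rng.standard_normal((N, M))

# A* = (X^T X)^{-1} X^T Y ; each column of A* is an independent univariate solution.
A_hat = np.linalg.solve(X.T @ X, X.T @ Y)
A_cols = np.column_stack([np.linalg.solve(X.T @ X, X.T @ Y[:, m]) for m in range(M)])
print(np.allclose(A_hat, A_cols))  # True: the outputs decouple
```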
3. From linear least squares to non-linear regression
3.1 Locally weighted regression (LWR)
For each training example $(\mathbf{x}_n, y_n)$, we define a corresponding weight $w_n$. With these weights, the sum of residuals is now computed as
$S(\mathbf{a}) = \sum_{n=1}^{N} w_n\,(y_n - \mathbf{a}^\top\mathbf{x}_n)^2. \quad (16)$
Defining $\mathbf{W}$ to be an $N \times N$ diagonal weight matrix with $W_{nn} = w_n$, and rewriting (16) in matrix form as $S(\mathbf{a}) = (\mathbf{y}-\mathbf{X}\mathbf{a})^\top\mathbf{W}(\mathbf{y}-\mathbf{X}\mathbf{a})$, the weighted linear least squares solution becomes
$\mathbf{a}^* = (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{y}$.
The weights for each sample are typically defined as a function of the input space through a weighting function $\phi(\mathbf{x};\boldsymbol{\theta})$, parameterized with $\boldsymbol{\theta}$. A commonly used weighting function is the multivariate Gaussian:
$\phi(\mathbf{x}; \mathbf{c}, \boldsymbol{\Sigma}) = \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mathbf{c})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\mathbf{c})\right)$.
Locally Weighted Regression (LWR) is an extension of weighted linear least squares, in which $E$ independent weighted regressions are performed on the same data (in the design matrix $\mathbf{X}$), but with different weight matrices $\mathbf{W}_e$. Performing weighted least squares $E$ times for the different weighting functions $\phi_e$ leads to $E$ local linear models $\mathbf{a}_e$. Each model is mainly responsible for the region of the input space where its weighting function is large; this local responsibility is the reason for the name Locally Weighted Regression.
The resulting model is a weighted sum of these linear models, where the weights are determined by the functions $\phi_e$:
$f(\mathbf{x}) = \sum_{e=1}^{E} \phi_e(\mathbf{x})\,(\mathbf{a}_e^\top\mathbf{x}). \quad (22)$
Because each of the $E$ local least squares regressions is performed independently of the others, the summation in (22) requires the weighting functions to be normalized, i.e. $\sum_{e=1}^{E}\phi_e(\mathbf{x}) = 1$ for all $\mathbf{x}$. For instance, the normalized Gaussian weighting function is:
$\phi_e(\mathbf{x}) = \dfrac{\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mathbf{c}_e)^\top\boldsymbol{\Sigma}_e^{-1}(\mathbf{x}-\mathbf{c}_e)\right)}{\sum_{e'=1}^{E}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mathbf{c}_{e'})^\top\boldsymbol{\Sigma}_{e'}^{-1}(\mathbf{x}-\mathbf{c}_{e'})\right)}$.
In summary, the LWR algorithm performs several independent weighted linear least squares regressions on the same data, each with a different weighting function, usually localized around a different part of the input space. The result is a weighted sum of linear models, where the weights are determined by the (normalized) weighting functions.
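The following sketch implements this recipe on a 1-D toy problem, assuming isotropic Gaussian weighting functions on a fixed grid of centers; the helper names and all values are illustrative, not from the paper.

```python
import numpy as np

def gaussian_weights(X, center, sigma):
    """Unnormalized Gaussian weighting function with isotropic covariance."""
    d2 = np.sum((X - center) ** 2, axis=1)
    return np.exp(-0.5 * d2 / sigma**2)

def lwr_fit_predict(X, y, centers, sigma, x_query):
    """Fit E independent weighted least squares models and blend their predictions."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # bias column: each local model is affine
    xq = np.append(x_query, 1.0)
    preds, weights = [], []
    for c in centers:
        w = gaussian_weights(X, c, sigma)
        W = np.diag(w)
        a_e = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)   # weighted LLS for model e
        preds.append(xq @ a_e)
        weights.append(np.exp(-0.5 * np.sum((x_query - c) ** 2) / sigma**2))
    weights = np.array(weights)
    weights /= weights.sum()                    # normalized weighting functions
    return np.dot(weights, preds)

# 1-D toy example
rng = np.random.default_rng(3)
X = np.linspace(0, 10, 200)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
centers = np.linspace(0, 10, 8)[:, None]
print(lwr_fit_predict(X, y, centers, sigma=1.0, x_query=np.array([5.0])))  # ~ sin(5)
```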
3.2 Regression with radial basis function networks (RBFNs)
Instead of weighting each input sample differently, as in weighted linear least squares and LWR, an alternative extension to LLS is to project each input sample into a higher-dimensional feature space using a set of basis functions:
$f(\mathbf{x}) = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}) = \sum_{e=1}^{E} w_e\,\phi_e(\mathbf{x})$.
When the basis functions are radial, i.e. they depend only on the distance to a center, $\phi_e(\mathbf{x}) = \phi(\|\mathbf{x}-\mathbf{c}_e\|)$, the model is called a Radial Basis Function Network (RBFN). To apply LLS to optimize $\mathbf{w}$, we define the projected version of the design matrix, the feature matrix, which is $N \times E$ (one column for each basis function feature):
$\Theta_{n,e} = \phi_e(\mathbf{x}_n)$,
and we get the least squares solution
$\mathbf{w}^* = (\boldsymbol{\Theta}^\top\boldsymbol{\Theta})^{-1}\boldsymbol{\Theta}^\top\mathbf{y}$.
In summary, non-linear regression can be achieved by using a set of basis functions to project the input space into a feature space, and performing LLS in that feature space. The result is a weighted sum of basis functions, where the weights are determined by a single linear regression.
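A minimal RBFN sketch (not from the paper), assuming Gaussian basis functions on a fixed grid of centers; the number of centers, the width sigma and the data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.linspace(-3, 3, 150)[:, None]
y = np.sinc(X[:, 0]) + 0.05 * rng.standard_normal(150)

# Radial (Gaussian) basis functions centered on a fixed grid.
centers = np.linspace(-3, 3, 12)
sigma = 0.5
Theta = np.exp(-0.5 * (X - centers[None, :]) ** 2 / sigma**2)   # N x E feature matrix

# Single linear regression in feature space: w* = (Theta^T Theta)^{-1} Theta^T y
w = np.linalg.solve(Theta.T @ Theta, Theta.T @ y)

# Prediction at new inputs
X_new = np.array([[0.0], [1.5]])
Theta_new = np.exp(-0.5 * (X_new - centers[None, :]) ** 2 / sigma**2)
print(Theta_new @ w)
```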
4. Model: mixture of linear models
4.1 Algorithm: locally weighted regression (LWR)
Although proposed almost two decades ago, LWR is still a competitive, widely-used algorithm, due to its simplicity. Fitting results may, however, become brittle if the input space dimensionality is high and not enough data is available.
4.2 Algorithm: receptive field weighted regression (RFWR)
The RFWR algorithm is the incremental variant of LWR, and it automates the choice of several model parameters. The main differences from LWR are:
RFWR is incremental, rather than batch, and uses Recursive Least Squares (RLS) instead of a batch least squares algorithm;
New linear models and weighting functions – called receptive fields
– are added automatically by the algorithm when needed, i.e. when some
input point is not covered by any receptive field.
The centers and covariance matrices of the receptive fields are also
adapted.
This flexibility comes at the cost of having to tune several
meta-parameters, such as thresholds for removing or adding
receptive fields as well as first and/or second order learning
rates.
4.3 Algorithm: locally weighted projection regression (LWPR)
The LWPR algorithm is an extension of RFWR, in which a low-dimensional affine projection is performed before fitting the linear models. LWPR has many of the same meta-parameters as RFWR, and thus the same difficulties in tuning these meta-parameters apply.
4.4 Algorithm: XCSF
4.5 Algorithm: Gaussian mixture regression (GMR)
The underlying assumption in Gaussian Mixture Regression (GMR) is that the data in the joint input × output space can be well represented by a set of Gaussians, which is known as a Gaussian Mixture Model (GMM).
Training: unsupervised learning. A notable feature of GMR is that the training phase consists of unsupervised learning. It is performed by fitting a Gaussian Mixture Model (GMM) to the data with the Expectation–Maximization (EM) algorithm. $k$-means clustering is commonly used to provide a first initialization of the centers of the Gaussians.
Consider the input $\mathbf{x}_n$ and target $y_n$; they are concatenated into one vector $\mathbf{z}_n = [\mathbf{x}_n^\top\; y_n]^\top$. The GMM represents a model of the density of the vectors $\mathbf{z}_n$ as a weighted sum of Gaussian functions:
$p(\mathbf{z}) = \sum_{e=1}^{E} \pi_e\,\mathcal{N}(\mathbf{z}\mid\boldsymbol{\mu}_e, \boldsymbol{\Sigma}_e)$.
The EM algorithm adjusts the priors $\pi_e$ and the parameters of the Gaussian functions (the means $\boldsymbol{\mu}_e$ and covariance matrices $\boldsymbol{\Sigma}_e$) that define this model. The only meta-parameter of the training phase is $E$, the number of Gaussians.
Regression: conditioning on the input. In regression, we are interested in predicting $y$ given $\mathbf{x}$. To do so, each Gaussian can be separated into input and output components as follows:
$\boldsymbol{\mu}_e = \begin{bmatrix}\boldsymbol{\mu}_{e,X}\\ \boldsymbol{\mu}_{e,Y}\end{bmatrix}, \qquad \boldsymbol{\Sigma}_e = \begin{bmatrix}\boldsymbol{\Sigma}_{e,X} & \boldsymbol{\Sigma}_{e,XY}\\ \boldsymbol{\Sigma}_{e,YX} & \boldsymbol{\Sigma}_{e,Y}\end{bmatrix}$.
For instance, if the dimensionality of the input space is 2, and that of the output space is 1, then $\boldsymbol{\Sigma}_e$ would be of size 3 × 3, being the covariance matrix of the full input × output space.
Then, the expected output $\hat{y}$ given an input $\mathbf{x}$ is computed by conditioning each Gaussian on $\mathbf{x}$ and blending the results with the responsibilities $h_e(\mathbf{x})$:
$\hat{y} = \sum_{e=1}^{E} h_e(\mathbf{x})\left(\boldsymbol{\mu}_{e,Y} + \boldsymbol{\Sigma}_{e,YX}\boldsymbol{\Sigma}_{e,X}^{-1}(\mathbf{x}-\boldsymbol{\mu}_{e,X})\right)$, with $h_e(\mathbf{x}) = \dfrac{\pi_e\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_{e,X},\boldsymbol{\Sigma}_{e,X})}{\sum_{e'=1}^{E}\pi_{e'}\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_{e',X},\boldsymbol{\Sigma}_{e',X})}$.
As a Bayesian method, an advantage of GMR over the other methods of this family is that a variance can also be computed along with the estimate of $\hat{y}$, from the conditional covariances $\boldsymbol{\Sigma}_{e,Y} - \boldsymbol{\Sigma}_{e,YX}\boldsymbol{\Sigma}_{e,X}^{-1}\boldsymbol{\Sigma}_{e,XY}$ weighted by the responsibilities $h_e(\mathbf{x})$.
Incremental GMR. A fast and incremental version of GMR, named Incremental Local Online GMR (ILO-GMR), has also been proposed.
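A sketch of GMR that uses scikit-learn's GaussianMixture for the EM training phase (which uses a k-means initialization by default) and manual conditioning for the regression phase; the helper gmr_predict and all values are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
x = np.linspace(0, 2 * np.pi, 300)
y = np.sin(x) + 0.1 * rng.standard_normal(300)
Z = np.column_stack([x, y])                       # joint input-output samples z_n

# Unsupervised training: fit a GMM with E = 5 Gaussians via EM.
gmm = GaussianMixture(n_components=5, covariance_type='full', random_state=0).fit(Z)

def gmr_predict(x_query, gmm, d_in=1):
    """Condition each Gaussian on the input and blend with responsibilities h_e."""
    xq = np.atleast_1d(x_query)
    mus, covs, priors = gmm.means_, gmm.covariances_, gmm.weights_
    h = np.array([p * multivariate_normal.pdf(xq, m[:d_in], c[:d_in, :d_in])
                  for p, m, c in zip(priors, mus, covs)])
    h /= h.sum()
    y_hat = 0.0
    for he, m, c in zip(h, mus, covs):
        cond_mean = m[d_in:] + c[d_in:, :d_in] @ np.linalg.solve(c[:d_in, :d_in], xq - m[:d_in])
        y_hat += he * cond_mean
    return y_hat

print(gmr_predict(np.pi / 2, gmm))   # should be close to sin(pi/2) = 1
```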
5. Model: basis function network
In this section we turn to algorithms that yield a weighted sum of
basis functions.
5.1 Least squares for basis function networks
For all algorithms in this section, least squares regression can be used to learn the weight associated with each basis function. The standard and regularized solutions to the least squares problem are
$\mathbf{w}^* = (\boldsymbol{\Theta}^\top\boldsymbol{\Theta})^{-1}\boldsymbol{\Theta}^\top\mathbf{y}$ and $\mathbf{w}^* = (\boldsymbol{\Theta}^\top\boldsymbol{\Theta} + \lambda\mathbf{I})^{-1}\boldsymbol{\Theta}^\top\mathbf{y}$.
We now describe three forms that $\boldsymbol{\Theta}$ may take: the design matrix, the feature matrix, or the Gram matrix.
5.1.1 Design matrix
In linear least squares, the matrix is simply the design matrix $\mathbf{X}$ of size $N \times D$, i.e. each row contains one input example, with each column representing a different dimension of the input space. Informally, $\mathbf{X}$ may be considered as containing the 'raw' input data.
5.1.2 Feature matrix
Projecting the input space into a higher-dimensional feature space with $E$ basis functions yields an $N \times E$ feature matrix $\boldsymbol{\Theta}$ (one column for each basis function feature).
5.1.3 Gram matrix (for kernel functions)
A kernel is a special type of basis function which relates two elements of the input space to each other. In this case, the matrix is of size $N \times N$, because the kernel is evaluated on all combinations of input examples:
$K_{i,j} = k(\mathbf{x}_i, \mathbf{x}_j). \quad (34)$
This matrix is known as the Gram matrix. Note that the same kernel function $k$ is used for all entries of the Gram matrix.
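A small sketch of building a Gram matrix, assuming a Gaussian kernel; the sizes and the kernel width are illustrative.

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    """Gaussian kernel relating two elements of the input space."""
    return np.exp(-0.5 * np.sum((xi - xj) ** 2) / sigma**2)

X = np.random.default_rng(6).standard_normal((5, 2))   # 5 examples, 2 input dimensions
# Gram matrix: the same kernel evaluated on all pairs of input examples (N x N).
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
print(K.shape)   # (5, 5)
```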
5.2 Algorithm: regression with radial basis function networks (revisited)
5.3 Algorithm: kernel ridge regression (KRR)
In Kernel Ridge Regression (KRR), also called Kernel Regularized Least Squares, the basis functions are generated from a kernel function $k(\mathbf{x}, \mathbf{x}')$, which takes two vectors from the input space as input. Using these kernel functions, regression consists in finding the weights of the function
$f(\mathbf{x}) = \sum_{n=1}^{N} w_n\,k(\mathbf{x}, \mathbf{x}_n)$.
The Gram matrix $\mathbf{K}$ is given as in (34), and the regularized least squares solution is
$\mathbf{w}^* = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}$.
Thus, while an RBFN allows arbitrarily placed basis function centers, KRR centers one basis function around each data point.
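A minimal KRR sketch, assuming a Gaussian kernel and an illustrative regularization strength lam; data and kernel width are synthetic choices.

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.linspace(-1, 1, 50)[:, None]
y = X[:, 0] ** 3 + 0.05 * rng.standard_normal(50)

def gaussian_kernel(A, B, sigma=0.3):
    """Pairwise Gaussian kernel matrix between the rows of A and B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / sigma**2)

lam = 1e-3
K = gaussian_kernel(X, X)                         # N x N Gram matrix
w = np.linalg.solve(K + lam * np.eye(len(X)), y)  # w* = (K + lam I)^{-1} y

# One basis function is centered on every training point:
x_new = np.array([[0.5]])
print(gaussian_kernel(x_new, X) @ w)              # prediction f(x_new), roughly 0.125
```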
5.4 Algorithm: Gaussian process regression (GPR)
In Gaussian Process Regression (GPR), the key idea is that the output data $\mathbf{y}$, or any subset of the output data, can be thought of as one sample from a multivariate ($N$-variate) Gaussian distribution. Often, it is assumed that the mean of this distribution is $\mathbf{0}$, i.e. the Gaussian Process is represented as
$\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$.
The covariance matrix $\mathbf{K}$ is determined by the covariance function or kernel. A typical choice is the Gaussian kernel, as for smooth functions we in general expect higher covariance between points that are closer.
For a given covariance function $k$ and training points $\mathbf{x}_{1:N}$, the corresponding Gaussian Process is
$\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}(\mathbf{X}, \mathbf{X}))$,
where $\mathbf{K}(\mathbf{X}, \mathbf{X})$ is the Gram matrix.
Predicting the output $y^*$ for a novel input $\mathbf{x}^*$ is done by assuming that the novel output is also sampled from a multivariate Gaussian, jointly with the training outputs:
$\begin{bmatrix}\mathbf{y}\\ y^*\end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix}\mathbf{K}(\mathbf{X},\mathbf{X}) & \mathbf{k}^*\\ \mathbf{k}^{*\top} & k(\mathbf{x}^*,\mathbf{x}^*)\end{bmatrix}\right)$, with $\mathbf{k}^* = [\,k(\mathbf{x}^*,\mathbf{x}_1)\;\cdots\;k(\mathbf{x}^*,\mathbf{x}_N)\,]^\top$.
Conditioning this joint distribution to predict $y^*$ given $\mathbf{y}$ yields another Gaussian, which predicts the probability of observing a certain $y^*$:
$y^* \mid \mathbf{y} \sim \mathcal{N}\!\left(\mathbf{k}^{*\top}\mathbf{K}^{-1}\mathbf{y},\; k(\mathbf{x}^*,\mathbf{x}^*) - \mathbf{k}^{*\top}\mathbf{K}^{-1}\mathbf{k}^*\right)$.
Thus, the best estimate for $y^*$ is the mean $\hat{y}^* = \mathbf{k}^{*\top}\mathbf{K}^{-1}\mathbf{y}$, and the uncertainty in $y^*$ is captured by the variance $\operatorname{var}(y^*) = k(\mathbf{x}^*,\mathbf{x}^*) - \mathbf{k}^{*\top}\mathbf{K}^{-1}\mathbf{k}^*$.
Rather than tuning the kernel hyperparameters by hand, it is customary to tune them automatically by maximizing the marginal (log) likelihood of the training data.
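A minimal GPR prediction sketch with a Gaussian kernel; the observation-noise term added to the diagonal of the Gram matrix is a common practical assumption not spelled out above, and the hyperparameters are fixed by hand rather than optimized.

```python
import numpy as np

rng = np.random.default_rng(8)
X = np.linspace(0, 5, 30)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

def kernel(A, B, length=1.0, signal=1.0):
    """Gaussian covariance function between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return signal**2 * np.exp(-0.5 * d2 / length**2)

noise = 0.1
K = kernel(X, X) + noise**2 * np.eye(len(X))    # Gram matrix plus observation noise

x_star = np.array([[2.5]])
k_star = kernel(x_star, X)                      # covariances with the novel input (1 x N)
alpha = np.linalg.solve(K, y)

mean = k_star @ alpha                                                  # best estimate
var = kernel(x_star, x_star) - k_star @ np.linalg.solve(K, k_star.T)  # predictive variance
print(mean, var)
```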
Non-zero mean function. Prior knowledge about a function may be introduced in the mean of the Gaussian process. The mean itself may depend on the input, so we write $m(\mathbf{x})$, which leads to the Gaussian Process
$\mathbf{y} \sim \mathcal{N}(\mathbf{m}(\mathbf{X}), \mathbf{K}(\mathbf{X},\mathbf{X}))$,
which, when conditioned on $\mathbf{y}$, yields the predictive mean
$\hat{y}^* = m(\mathbf{x}^*) + \mathbf{k}^{*\top}\mathbf{K}^{-1}(\mathbf{y} - \mathbf{m}(\mathbf{X}))$
(the variance estimate stays the same).
Sparse and online versions of GPR. Given the cubic computational cost in the number of examples due to the inversion of $\mathbf{K}$, GPR is often not applicable online.
The sparse approach consists in limiting the number of training points that are kept, by adequately choosing the points to remember and forgetting the rest.
The mixture of experts approach rather consists in splitting the
global domain of the function into smaller regions, as done in the LWR
family of methods.
5.5 Algorithm: support vector regression (SVR)
In Support Vector Regression (SVR), any residual that is smaller than a given threshold $\epsilon$ is not penalized. The region in which the residuals are zero is called the $\epsilon$-insensitive tube. This leads to the $\epsilon$-insensitive loss function:
$|r|_\epsilon = \max(0,\,|r| - \epsilon)$.
This loss function cannot be minimized analytically, and the problem is rather stated as a constrained optimization problem,
$\min_{\mathbf{a},\,\boldsymbol{\xi},\,\boldsymbol{\xi}^*}\ \tfrac{1}{2}\|\mathbf{a}\|^2 + C\sum_{n=1}^{N}(\xi_n + \xi_n^*)\ \ \text{s.t.}\ \ y_n - \mathbf{a}^\top\mathbf{x}_n \le \epsilon + \xi_n,\ \ \mathbf{a}^\top\mathbf{x}_n - y_n \le \epsilon + \xi_n^*,\ \ \xi_n, \xi_n^* \ge 0, \quad (47)$
which uses the slack variables $\xi_n$ and $\xi_n^*$, representing the distance above or below the $\epsilon$-insensitive tube.
This constrained optimization problem is solved by defining a Lagrange function that incorporates both the term to be minimized and the constraints. The final result for linear SVR, in which the underlying model is $f(\mathbf{x}) = \mathbf{a}^\top\mathbf{x} + b$, is:
$\mathbf{a} = \sum_{n=1}^{N} (\alpha_n - \alpha_n^*)\,\mathbf{x}_n$,
where $\alpha_n$ and $\alpha_n^*$ are the Lagrangian multipliers in the Lagrange function for the first two constraints in (47). With this solution, the linear function becomes:
$f(\mathbf{x}) = \sum_{n=1}^{N} (\alpha_n - \alpha_n^*)\,\mathbf{x}_n^\top\mathbf{x} + b$.
The slopes $\mathbf{a}$ are thus represented as a linear combination of the training examples $\mathbf{x}_n$.
When applying the basis function approach to SVR, i.e. using basis functions to project into the feature space (and using $\mathbf{w}$ rather than $\mathbf{a}$ to denote weights instead of slopes), the solution becomes
$f(\mathbf{x}) = \sum_{n=1}^{N} (\alpha_n - \alpha_n^*)\,k(\mathbf{x}_n, \mathbf{x}) + b$.
The meta-parameters of SVR are the tolerance parameter $\epsilon$ and the trade-off parameter $C$.
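Rather than re-deriving the dual solver, the sketch below uses scikit-learn's SVR to illustrate the two meta-parameters; the data and the chosen values of C and epsilon are illustrative.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(9)
X = np.linspace(0, 4, 120)[:, None]
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(120)

# epsilon: half-width of the insensitive tube; C: residual/flatness trade-off.
model = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(X, y)
print(model.predict(np.array([[1.0]])))   # roughly sin(2)

# The fitted function is a kernel expansion over the support vectors only:
print(len(model.support_), 'of', len(X), 'training points are support vectors')
```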
5.6 Algorithm: iRFRLS
iRFRLS is an incremental version of Random Features Regularized Least Squares (RFRLS). With respect to RBFNs, the key idea behind iRFRLS is that any function can be approximated arbitrarily well as a weighted sum of cosine functions. Thus, instead of using Gaussian kernels as in RBFNs, $\boldsymbol{\phi}(\mathbf{x})$ is a set of randomly drawn cosine functions:
$\phi_e(\mathbf{x}) = \cos(\boldsymbol{\omega}_e^\top\mathbf{x} + \psi_e)$,
where the frequencies $\boldsymbol{\omega}_e$ and phases $\psi_e$ are sampled at random, and the weights of these features are obtained with regularized least squares.
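A batch (non-incremental) sketch of the random-cosine-feature idea followed by regularized least squares; the sampling distributions for the frequencies and phases, and all constants, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)
X = np.linspace(-3, 3, 300)[:, None]
y = np.abs(X[:, 0]) + 0.05 * rng.standard_normal(300)

E, lam = 200, 1e-2
# Random cosine features: frequencies drawn from a Gaussian, phases uniform in [0, 2*pi).
Omega = rng.standard_normal((X.shape[1], E)) * 2.0
psi = rng.uniform(0, 2 * np.pi, E)
Phi = np.cos(X @ Omega + psi)                    # N x E feature matrix

w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(E), Phi.T @ y)   # regularized least squares

x_new = np.array([[1.0]])
print(np.cos(x_new @ Omega + psi) @ w)           # roughly |1.0| = 1.0
```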
5.7 Algorithm: I-SSGPR
I-SSGPR also inherits from the GPR framework the capability to optimize most of its meta-parameters using the marginal likelihood, so the only meta-parameter that remains to be tuned is the number of random features.
5.8 Algorithm: CART (regression trees)
5.9 Algorithm: extreme learning machine (ELM)
The Extreme Learning Machine (ELM) is a batch regression algorithm for training the weights of a Single-Hidden Layer Feedforward Network (SLFN). An SLFN is an Artificial Neural Network (ANN) that has one input layer, one hidden layer, and one output layer, represented as
$f(\mathbf{x}) = \sum_{e=1}^{E} w_e\,g(\boldsymbol{\theta}_e^\top\mathbf{x} + b_e)$.
The activation function $g$ is often chosen to be a sigmoid or hyperbolic tangent function.
In the ELM approach, the basis function parameters (the hidden-layer weights $\boldsymbol{\theta}_e$ and biases $b_e$) are first set randomly, and are then used to generate the feature matrix $\boldsymbol{\Theta}$. Least squares is then used to determine the output weights $\mathbf{w}$.
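A minimal ELM sketch with a sigmoid hidden layer; the number of hidden units and the random-weight distribution are illustrative choices, not prescribed by the notes above.

```python
import numpy as np

rng = np.random.default_rng(11)
X = np.linspace(-2, 2, 200)[:, None]
y = np.tanh(3 * X[:, 0]) + 0.05 * rng.standard_normal(200)

E = 50                                            # number of hidden units (illustrative)
# Hidden-layer weights and biases are drawn at random and never trained.
W_in = rng.standard_normal((X.shape[1], E))
b_in = rng.standard_normal(E)
Theta = 1.0 / (1.0 + np.exp(-(X @ W_in + b_in)))  # sigmoid activations, N x E feature matrix

# Output weights are obtained by (pseudo-inverse) least squares.
w_out, *_ = np.linalg.lstsq(Theta, y, rcond=None)

x_new = np.array([[0.5]])
h_new = 1.0 / (1.0 + np.exp(-(x_new @ W_in + b_in)))
print(h_new @ w_out)   # roughly tanh(1.5)
```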
5.10 Algorithm: backpropagation
6. Summary
Legend of the summary table in the paper. Gray: meta-parameters of the algorithms (partial model specifications in columns 3–5 and further algorithmic parameters in column 6) which the user must set. White: model parameters that the algorithm determines itself. Abbreviations: adap. = adapted, indiv. = from individual training examples, rand. = randomly sampled, CO = constrained optimization.