First of all, why use a Gaussian Process for regression? Or even, what is regression? Regression is a common machine learning task that can be described as follows: given some observed data points (a training dataset), find a function that represents the dataset as closely as possible, then use that function to make predictions at new data points. Regression can be conducted with polynomials, and it is common for more than one possible function to fit the observed data. Besides getting predictions from the function, we also want to know how certain these predictions are. Moreover, quantifying uncertainty is very valuable for achieving an efficient learning process: the areas with the least certainty should be explored more.
In a word, a GP can be used to make predictions at new data points and can tell us how certain these predictions are.
II. Basics
A. Gaussian (Normal) Distribution
Let's talk about the Gaussian. A random variable $X$ is said to be normally distributed with mean $\mu$ and variance $\sigma^2$ if its probability density function (PDF) is

$P_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$

Here, $X$ represents the random variable and $x$ is the real argument. The Gaussian or Normal distribution of $X$ is usually represented by $P_X(x) \sim \mathcal{N}(\mu, \sigma^2)$.
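As a quick sanity check of this formula (this snippet is not part of the original tutorial; the mean and standard deviation values are arbitrary), we can compare scipy's norm.pdf against the expression above.

import numpy as np
from scipy.stats import norm

mu, sigma = 0.5, 2.0                 # arbitrary mean and standard deviation
x = np.linspace(-5, 5, 11)

pdf_scipy = norm.pdf(x, loc=mu, scale=sigma)                                    # PDF from scipy
pdf_manual = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)  # PDF from the formula

print(np.allclose(pdf_scipy, pdf_manual))   # True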
A Gaussian PDF is plotted below. We generate m random sample points from a 1-D Gaussian distribution and show them on the x axis.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.stats import norm

# Plot 1-D gaussian
n = 1        # n number of independent 1-D gaussians
m = 1000     # m points in each 1-D gaussian
f_random = np.random.normal(size=(n, m))
# more information about 'size': https://www.sharpsightlabs.com/blog/numpy-random-normal/
print(f_random.shape)

for i in range(n):
    #sns.distplot(f_random[i], hist=True, rug=True, vertical=True, color="orange")
    sns.distplot(f_random[i], hist=True, rug=True, fit=norm, kde=False, color="r", vertical=False)
    # sns.distplot(f_random[i], hist=True, rug=True)

plt.title('1000 random samples from a 1-D Gaussian distribution')
plt.xlabel(r'$x$', fontsize = 16)
plt.ylabel(r'$P_X(x)$', fontsize = 16)
plt.show()
We generated data points that follow the normal distribution. Conversely, we can model data points by assuming they are Gaussian, treat them as a function, and do regression with that function. As shown above, a kernel density estimate and a histogram of the generated points were plotted. The kernel density estimate looks like a normal distribution because there are plenty (m = 1000) of observation points to produce this Gaussian-looking PDF. In regression, even if we don't have that many observation data points, we can still model the data as a function that follows a normal distribution if we assume a Gaussian prior.
We have the randomly generated dataset above. We sampled the generated dataset and got a Gaussian bell curve.
Now, let's project all the points on the x-axis to another space. In this space, we treat all the points as a vector $X_1$, and plot $X_1$ on the new $X$ axis at $X = 0$.
n = 1        # n number of independent 1-D gaussians
m = 1000     # m points in each 1-D gaussian
f_random = np.random.normal(size=(n, m))

Xshow = np.linspace(0, 1, n).reshape(-1,1)   # n number of test points in the range of (0, 1)
Let's do something interesting. Let's connect the points of $X_1$ and $X_2$ by lines. For now, we only generate 10 random points for each of $X_1$ and $X_2$, and then join them up as 10 lines. Keep in mind, these 10 randomly generated points are Gaussian.
n = 2
m = 10
f_random = np.random.normal(size=(n, m))   # 2*10

Xshow = np.linspace(0, 1, n).reshape(-1,1)   # n number of test points in the range of (0, 1)

# plotting lines between the two point sets (reconstructed; the original plotting code was lost)
plt.plot(Xshow, f_random, '-o', linewidth=2, markersize=4)
plt.show()
Going back to thinking about regression: these lines look like functions for each pair of points. On the other hand, the plot also looks like we are sampling the region $[0, 1]$ with 10 linear functions, even though there are only two points on each line. From the sampling perspective, the domain $[0, 1]$ is our region of interest, i.e. the specific region where we do our regression. This sampling looks even clearer if we generate more independent Gaussians and connect the points in order by lines.
n = 20
m = 10
f_random = np.random.normal(size=(n, m))

Xshow = np.linspace(0, 1, n).reshape(-1,1)   # n number of test points in the range of (0, 1)

# plotting (reconstructed; the original plotting code was lost)
plt.plot(Xshow, f_random, '-o', linewidth=1, markersize=3)
plt.show()
Even though these lines look like functions, they are too noisy. If $[0, 1]$ is our input space, these functions are meaningless for the regression task; we can make no predictions with them. The functions should be smoother, meaning input points that are close to each other should have similar function values.
Thus, functions built by connecting independent Gaussians are not proper for regression; we need Gaussians that are correlated to each other. How do we describe jointly Gaussian variables? With the multivariate Gaussian.
B. Multivariate Normal Distribution (MVN)
In some situations, a system (set of data) has to be described by more than one feature variable $(x_1, x_2, \ldots, x_D)$, and these variables are correlated. If we want to model all the data in one go as Gaussian, we need a multivariate Gaussian. Consider the 2-D Gaussian as an example: it can be visualized as a 3-D bell curve with the heights representing probability density.
Formally, the multivariate Gaussian is expressed as

$P(x; \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right).$

For the 2-D case, the mean vector $\mu$ is a 2-D vector $(\mu_1, \mu_2)$, whose entries are the independent means of the variables $x_1$ and $x_2$.
The covariance matrix of a 2-D Gaussian is $\begin{pmatrix}\sigma_{11} & \sigma_{12}\\ \sigma_{21} & \sigma_{22}\end{pmatrix}$. The diagonal terms $\sigma_{11}$ and $\sigma_{22}$ are the independent variances of the variables $x_1$ and $x_2$. The off-diagonal terms represent correlations between the two variables; a correlation component represents how much one variable is related to another.
A 2-D Gaussian can thus be expressed as

$\begin{pmatrix}x_1\\ x_2\end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}, \begin{pmatrix}\sigma_{11} & \sigma_{12}\\ \sigma_{21} & \sigma_{22}\end{pmatrix}\right).$

When we have an $n$-D Gaussian, the covariance matrix $\Sigma$ is $n \times n$ and its $(i, j)$ element is $\Sigma_{ij} = \mathrm{cov}(x_i, x_j)$. $\Sigma$ is a symmetric matrix that stores the pairwise covariances of all the jointly modeled random variables.
Play around with the covariance matrix to see the correlations between the two Gaussians.
Besides the joint probability, we are more interested in the conditional probability. If we cut a slice through the 3-D bell curve or draw a line on the ellipse contour, we get the conditional probability distribution $P(x_1 \mid x_2)$. The conditional distribution is also Gaussian.
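To make this concrete, here is a small sketch (not part of the original code; the mean and covariance values are chosen arbitrarily) that samples a correlated 2-D Gaussian and evaluates the conditional distribution of $x_1$ given $x_2$ using the standard MVN conditioning formula.

import numpy as np

# an arbitrary 2-D Gaussian with correlated components
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.5]])

# draw samples; changing the off-diagonal term changes how elongated the point cloud is
samples = np.random.multivariate_normal(mu, Sigma, size=5000)

# conditional distribution p(x1 | x2 = a) is Gaussian with
#   mean     mu1 + Sigma12 / Sigma22 * (a - mu2)
#   variance Sigma11 - Sigma12**2 / Sigma22
a = 2.0
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (a - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1]**2 / Sigma[1, 1]
print(cond_mean, cond_var)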
C. Kernels
We want to smooth the sampled functions by defining covariance functions. Consider the fact that when two vectors are similar, their dot product is large. This is clear from the dot product equation $\mathbf{a}\cdot\mathbf{b} = \lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert\cos\theta$, where $\theta$ is the angle between the two vectors.
If an algorithm is defined solely in terms of inner products in input space, then it can be lifted into feature space by replacing occurrences of those inner products with $k(x, x')$; we call $k$ a kernel function.
A popular covariance function (aka kernel function) is the squared exponential kernel, also called the radial basis function (RBF) kernel or Gaussian kernel, defined as

$k(x_i, x_j) = \sigma_f^2 \exp\left(-\frac{(x_i - x_j)^2}{2l^2}\right).$
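The code later in this tutorial calls a function named kernel, whose definition did not survive extraction. Below is a minimal squared exponential kernel in that spirit; the default length scale (l = 0.5) and signal variance (sigma_f = 1) are assumptions, not necessarily the values used to produce the original figures.

import numpy as np

def kernel(a, b, l=0.5, sigma_f=1.0):
    # squared exponential (RBF) kernel: k(a, b) = sigma_f^2 * exp(-|a - b|^2 / (2 l^2))
    # a: (n, d) array, b: (m, d) array -> (n, m) covariance matrix
    sqdist = np.sum(a**2, axis=1).reshape(-1, 1) + np.sum(b**2, axis=1) - 2 * np.dot(a, b.T)
    return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)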
Let's re-plot 20 independent Gaussians, connecting the points in order by lines. But instead of using the identity covariance matrix (which is what generating 20 independent Gaussians amounts to), we now plot a 20-D Gaussian whose covariance matrix is defined by the kernel.
n = 20    # n number of independent 1-D gaussians (20 dimensions)
m = 10    # m points in each 1-D gaussian (10 points per dimension)

Xshow = np.linspace(0, 1, n).reshape(-1,1)   # n number of test points in the range of (0, 1)

# The definitions of f_prior and f_prior2 were lost in extraction; they are assumed here to be
# two independent sets of samples from the kernelized prior, drawn via the Cholesky factor of
# the kernel matrix (the same approach as the Cholesky sampling code later in this section).
K_ = kernel(Xshow, Xshow)
L_ = np.linalg.cholesky(K_ + 1e-6*np.eye(n))
f_prior = np.dot(L_, np.random.normal(size=(n, m)))
f_prior2 = np.dot(L_, np.random.normal(size=(n, m)))

plt.plot(Xshow, f_prior, '-o', linewidth=1)
plt.plot(Xshow, f_prior2, '-o', linewidth=1)
plt.title('10 samples of the 20-D gaussian kernelized prior')
plt.show()
We get much smoother lines, which look even more like functions. When the dimension of the Gaussian gets larger, there is no need to connect the points. When the dimension becomes infinite, there is a point to represent every possible input. Let's plot m = 200 samples of an n = 200 dimensional Gaussian to get a feeling for functions with infinitely many parameters.
n = 200
m = 200
Xshow = np.linspace(0, 1, n).reshape(-1,1)   # n number of test points in the range of (0, 1)

# sampling lines reconstructed (assumed), as in the Cholesky sampling code later in this section
K_ = kernel(Xshow, Xshow)
L = np.linalg.cholesky(K_ + 1e-6*np.eye(n))
f_prior = np.dot(L, np.random.normal(size=(n, m)))

plt.clf()
plt.figure(figsize=(16,8))
plt.plot(Xshow, f_prior, 'o', linewidth=1, markersize=2, markeredgewidth=1)
plt.title('200 samples of the 200-D gaussian kernelized prior')
#plt.axis([0, 1, -3, 3])
plt.show()
As we can see above, when we increase the dimension of the Gaussian towards infinity, we can sample all the possible points in our region of interest.
Here we talk a little bit about parametric and nonparametric models. You can skip this part without compromising your understanding of Gaussian processes.
Parametric models assume that the data distribution can be modeled in terms of a finite set of parameters. For regression, we have some data points, and we would like to predict the value of $y$ at a specific $x$. If we assume a linear regression model, $y = \theta_1 + \theta_2 x$, we need to find the parameters $\theta_1$ and $\theta_2$ to define the line. In many cases the linear model assumption doesn't hold, and a polynomial model with more parameters, such as $y = \theta_1 + \theta_2 x + \theta_3 x^2$, is needed. We use the training dataset $D$ of $n$ observations, $D = [(x_i, y_i) \mid i = 1, \ldots, n]$, to train the model, i.e. to map $x$ to $y$ through the parameters $\boldsymbol{\theta}$. After the training process, we assume all the information in the data is captured by the parameters $\boldsymbol{\theta}$; thus the prediction is independent of the training data $D$. This can be expressed as $P(f_* \mid X_*, \boldsymbol{\theta}, D) = P(f_* \mid X_*, \boldsymbol{\theta})$, in which $f_*$ is the prediction made at an unobserved point $X_*$.
Thus, when conducting regression with a parametric model, the complexity or flexibility of the model is limited by the number of parameters. It is natural to consider a model whose number of parameters grows with the size of the dataset, and that is a Bayesian non-parametric model. Bayesian non-parametric models do not imply that there are no parameters, but rather that there are infinitely many parameters. A small sketch illustrating the parametric case is given below.
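This sketch (not code from the tutorial; the toy data and polynomial degree are arbitrary) fits a fixed-size parametric model once; predictions then use only the learned parameters, not the training data.

import numpy as np

# toy training data from a noisy sine
rng = np.random.default_rng(0)
x_train = rng.uniform(-5, 5, 20)
y_train = np.sin(0.9 * x_train) + 0.1 * rng.standard_normal(20)

# fit a fixed-size parametric model: a cubic polynomial (4 parameters theta)
theta = np.polyfit(x_train, y_train, deg=3)

# prediction at a new point depends only on theta, not on (x_train, y_train)
x_star = 1.5
y_star = np.polyval(theta, x_star)
print(theta, y_star)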
To generate correlated normally distributed random samples, one can first generate uncorrelated samples and then multiply them by a matrix $L$ such that $LL^{\top} = K$, where $K$ is the desired covariance matrix. $L$ can be created, for example, by the Cholesky decomposition of $K$.
n = 20
m = 10
Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)
K_ = kernel(Xshow, Xshow)
L = np.linalg.cholesky(K_ + 1e-6*np.eye(n))
f_prior = np.dot(L, np.random.normal(size=(n,m)))
plt.clf()
plt.plot(Xshow, f_prior, '-o')
plt.title('10 samples of the 20-D gaussian kernelized prior')
plt.show()
III. Math
First, again, let's go back to our regression task. There is a function $f$ we are trying to model given a set of data points $D$ (training data/existing observed data) from the unknown function $f$. Traditional non-linear regression machine learning methods typically give a single function that is considered to fit the observations best. But, as shown at the beginning, there can be more than one function that fits the observations equally well.
Second, let's review what we got from the MVN. We got the feeling that when the dimension of the Gaussian is infinite, we can sample the whole region of interest with random functions. These infinite random functions are MVN because that is our assumption (prior). More formally, the prior distribution of these infinite random functions is MVN. The prior distribution represents the kind of outputs that we expect to see over some inputs before observing any data.
When we have observation points, instead of infinite random functions we only keep the functions that fit these points. Now we have our posterior, the current belief based on the existing observations. When we get more observation points, we use our previous posterior as our prior and use the new observations to update the posterior.
This is the Gaussian process.
A Gaussian process is a probability distribution over possible functions that fit a set of points.
Because we have the probability distribution over all possible functions, we can calculate the mean as the function, and calculate the variance to show how confident we are when making predictions with it.
Keep in mind:
The functions (posterior) update with new observations.
The mean calculated from the posterior distribution of the possible functions is the function used for regression.
The functions are modeled by a multivariate Gaussian as

$P(\mathbf{f} \mid X) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, K),$

where $X = [x_1, \ldots, x_n]$, $\mathbf{f} = [f(x_1), \ldots, f(x_n)]$, $\boldsymbol{\mu} = [m(x_1), \ldots, m(x_n)]$, and $K_{ij} = k(x_i, x_j)$. $m$ is the mean function, and it is common to use $m(x) = 0$ as GPs are flexible enough to model the mean arbitrarily well. $k$ is a positive definite kernel function or covariance function. Thus, a Gaussian process is a distribution over functions whose shape (smoothness, ...) is defined by $K$. If points $x_i$ and $x_j$ are considered to be similar by the kernel, the function values at these points, $f(x_i)$ and $f(x_j)$, can be expected to be similar too.
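The following sketch (added for illustration; it is an alternative to the Cholesky sampling shown earlier and relies on the kernel function defined above) builds $\boldsymbol{\mu}$ and $K$ for a handful of inputs and draws functions from $P(\mathbf{f} \mid X) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, K)$ directly with numpy.

import numpy as np

X = np.linspace(0, 1, 20).reshape(-1, 1)    # inputs in the region of interest
mu = np.zeros(len(X))                       # zero mean function m(x) = 0
K = kernel(X, X) + 1e-8 * np.eye(len(X))    # covariance from the kernel, with jitter for stability

# draw 5 functions from the GP prior P(f|X) = N(f|mu, K)
f_samples = np.random.multivariate_normal(mu, K, size=5)
print(f_samples.shape)                      # (5, 20)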
So, we have observations, and we have estimated functions $\mathbf{f}$ from these observations. Now say we have some new points $X_*$ where we want to predict $f(X_*)$.
The joint distribution of $\mathbf{f}$ and $\mathbf{f}_*$ can be modeled as

$\begin{pmatrix}\mathbf{f}\\ \mathbf{f}_*\end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix}m(X)\\ m(X_*)\end{pmatrix}, \begin{pmatrix}K & K_*\\ K_*^{\top} & K_{**}\end{pmatrix}\right),$

where $K = K(X, X)$, $K_* = K(X, X_*)$ and $K_{**} = K(X_*, X_*)$. And we take $m(X) = m(X_*) = \mathbf{0}$.
This is modeling a joint distribution $P(\mathbf{f}, \mathbf{f}_* \mid X, X_*)$, but we want the conditional distribution over $\mathbf{f}_*$ only, which is $P(\mathbf{f}_* \mid \mathbf{f}, X, X_*)$. The derivation from the joint distribution to the conditional uses the marginal and conditional distribution theorem of the MVN.
We get

$\mathbf{f}_* \mid \mathbf{f}, X, X_* \sim \mathcal{N}\left(K_*^{\top} K^{-1} \mathbf{f},\; K_{**} - K_*^{\top} K^{-1} K_*\right).$
In realistic modelling situations we do not have access to the function values themselves, but only to noisy versions thereof, $y = f(x) + \varepsilon$. Assuming additive independent identically distributed Gaussian noise $\varepsilon$ with variance $\sigma_n^2$, the prior on the noisy observations becomes $\mathrm{cov}(\mathbf{y}) = K(X, X) + \sigma_n^2 I$. The joint distribution of the observed target values and the function values at the test locations under the prior is

$\begin{pmatrix}\mathbf{y}\\ \mathbf{f}_*\end{pmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{pmatrix}K(X, X) + \sigma_n^2 I & K(X, X_*)\\ K(X_*, X) & K(X_*, X_*)\end{pmatrix}\right).$

Deriving the conditional distribution from this joint prior, we get the predictive equations for Gaussian process regression,

$\bar{\mathbf{f}}_* = K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1}\mathbf{y},$

$\mathrm{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1} K(X, X_*).$
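Before the Cholesky-based implementation in the next section, here is a direct, naive translation of these predictive equations (a sketch for clarity only; it assumes the kernel function defined earlier, and the explicit matrix inverse is less stable than the Cholesky approach used below).

import numpy as np

def gp_predict_naive(X, y, Xtest, noise_var=1e-5):
    # K(X,X) + sigma_n^2 I, K(X,X*), K(X*,X*)
    K = kernel(X, X) + noise_var * np.eye(len(X))
    K_s = kernel(X, Xtest)
    K_ss = kernel(Xtest, Xtest)

    K_inv = np.linalg.inv(K)            # explicit inverse, for exposition only
    mean = K_s.T @ K_inv @ y            # predictive mean
    cov = K_ss - K_s.T @ K_inv @ K_s    # predictive covariance
    return mean, cov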
IV. Simple Implementation Example
We do the regression example between -5 and 5. The observation data points (training dataset) are generated from a uniform distribution between -5 and 5, meaning any point value within the interval [-5, 5] is equally likely to be drawn. The functions are evaluated at n evenly spaced points between -5 and 5. We do this to show a continuous function for regression in our region of interest [-5, 5]. This is a simple example of GP regression; it assumes a zero-mean GP prior.
Dr. Jie Wang, [An Intuitive Tutorial to Gaussian Processes Regression]
The inputs are $X$ (inputs), $\mathbf{y}$ (targets), $k$ (covariance function), $\sigma_n^2$ (noise level), and $X_*$ (test input). The outputs are $\bar{f}_*$ (mean), $\mathbb{V}[f_*]$ (variance), and $\log p(\mathbf{y} \mid X)$ (log marginal likelihood).
Note: Cholesky Decomposition
The Cholesky decomposition of a symmetric, positive definite matrix $A$ decomposes $A$ into a product of a lower triangular matrix $L$ and its transpose,

$LL^{\top} = A,$

where $L$ is called the Cholesky factor. The Cholesky decomposition is useful for solving linear systems with a symmetric, positive definite coefficient matrix $A$. To solve $A\mathbf{x} = \mathbf{b}$ for $\mathbf{x}$, first solve the triangular system $L\mathbf{z} = \mathbf{b}$ by forward substitution and then the triangular system $L^{\top}\mathbf{x} = \mathbf{z}$ by back substitution. Using the backslash operator, we write the solution as $\mathbf{x} = L^{\top}\backslash(L\backslash\mathbf{b})$, where the notation $A\backslash\mathbf{b}$ is the vector $\mathbf{x}$ which solves $A\mathbf{x} = \mathbf{b}$. Both the forward and backward substitution steps require $n^2/2$ operations when $A$ is of size $n \times n$. The computation of the Cholesky factor $L$ is considered numerically extremely stable and takes time $n^3/6$, so it is the method of choice when it can be applied.
Note also that the determinant of a positive definite symmetric matrix can be calculated efficiently by

$|A| = \prod_{i=1}^{n} L_{ii}^2, \qquad \log|A| = 2\sum_{i=1}^{n}\log L_{ii},$

where $L$ is the Cholesky factor of $A$.
If $K + \sigma_n^2 I = LL^{\top}$, we have $(K + \sigma_n^2 I)^{-1} = L^{-\top}L^{-1}$.
Therefore the predictive mean can be computed as

$\bar{\mathbf{f}}_* = K_*^{\top}(K + \sigma_n^2 I)^{-1}\mathbf{y} = (L^{-1}K_*)^{\top}(L^{-1}\mathbf{y}).$

Therefore the predictive variance becomes

$\mathrm{cov}(\mathbf{f}_*) = K_{**} - K_*^{\top}(K + \sigma_n^2 I)^{-1}K_* = K_{**} - (L^{-1}K_*)^{\top}(L^{-1}K_*).$
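A small numerical check of these identities (a sketch using scipy's triangular solver; the matrix and right-hand side are arbitrary):

import numpy as np
from scipy.linalg import solve_triangular

# an arbitrary symmetric positive definite matrix A and right-hand side b
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)
b = rng.standard_normal(5)

L = np.linalg.cholesky(A)                       # A = L L^T

# x = L^T \ (L \ b): forward substitution then back substitution
z = solve_triangular(L, b, lower=True)
x = solve_triangular(L.T, z, lower=False)
print(np.allclose(A @ x, b))                    # True

# log|A| = 2 * sum(log(diag(L)))
print(np.allclose(np.linalg.slogdet(A)[1], 2 * np.sum(np.log(np.diag(L)))))  # True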
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# This is the true unknown function we are trying to approximate
f = lambda x: np.sin(0.9*x).flatten()   # flatten returns a copy of the array collapsed into one dimension
#f = lambda x: (0.25*(x**2)).flatten()
x = np.arange(-5, 5, 0.1)

# Sample some input points and noisy versions of the function evaluated at
# these points.
N = 20          # number of existing observation points (training points).
n = 200         # number of test points.
s = 0.00005     # noise variance.

X = np.random.uniform(-5, 5, size=(N,1))   # N training points
y = f(X) + s*np.random.randn(N)

K = kernel(X, X)
L = np.linalg.cholesky(K + s*np.eye(N))    # line 1: L*L^T = K + \sigma_n^2 * I (Cholesky decomposition)

# points we're going to make predictions at.
Xtest = np.linspace(-5, 5, n).reshape(-1,1)

# compute the mean at our test points.
Lk = np.linalg.solve(L, kernel(X, Xtest))  # k_star = kernel(X, Xtest), calculating v := L \ k_star, so Lk = v
mu = np.dot(Lk.T, np.linalg.solve(L, y))   # np.linalg.solve(L, y) ---> solves L x = y

# compute the variance at our test points.
K_ = kernel(Xtest, Xtest)                  # k(x_star, x_star)
s2 = np.diag(K_) - np.sum(Lk**2, axis=0)   # variance
s = np.sqrt(s2)                            # standard deviation

# draw samples from the posterior at our test points.
L = np.linalg.cholesky(K_ + 1e-6*np.eye(n) - np.dot(Lk.T, Lk))
f_post = mu.reshape(-1,1) + np.dot(L, np.random.normal(size=(n,40)))   # size=(n, m), m sets how many posterior samples to draw

plt.figure(3)
plt.clf()
plt.figure(figsize=(10,4))
plt.plot(X, y, 'k+', markersize=20, markeredgewidth=3)
plt.plot(Xtest, mu, 'r--', linewidth=3)
plt.plot(Xtest, f_post, linewidth=0.8)
plt.title('40 samples from the GP posterior, mean prediction function and observation points')
plt.show()
We plotted m = 40 samples from the Gaussian process posterior together with the mean function for prediction and the observation data points (training dataset). It's clear that all the posterior functions collapse at the observation points.
Log-likelihood
First, let's define the shorthand $\hat{K} = K + \sigma_n^2 I$. We now have

$\log p(\mathbf{y} \mid X) = -\frac{1}{2}\mathbf{y}^{\top}\hat{K}^{-1}\mathbf{y} - \frac{1}{2}\log|\hat{K}| - \frac{n}{2}\log 2\pi.$

For now, we will ignore hyperparameters affecting the mean function, so we only consider hyperparameters affecting $\hat{K}$. Let's consider the derivative with respect to some general hyperparameter $\theta$.
For any invertible matrix $A$ and any parameter $\theta$, the derivative of $A^{-1}$ with respect to $\theta$ equals

$\frac{\partial A^{-1}}{\partial\theta} = -A^{-1}\frac{\partial A}{\partial\theta}A^{-1}.$

For any invertible matrix $A$, the derivative of $|A|$ is given by

$\frac{\partial |A|}{\partial\theta} = |A|\,\mathrm{tr}\!\left(A^{-1}\frac{\partial A}{\partial\theta}\right).$

A nice consequence of the above theorem is that

$\frac{\partial \log|A|}{\partial\theta} = \frac{1}{|A|}\frac{\partial |A|}{\partial\theta} = \mathrm{tr}\!\left(A^{-1}\frac{\partial A}{\partial\theta}\right).$

With the above equations, we can find that

$\frac{\partial \log p(\mathbf{y} \mid X)}{\partial\theta} = \frac{1}{2}\mathbf{y}^{\top}\hat{K}^{-1}\frac{\partial\hat{K}}{\partial\theta}\hat{K}^{-1}\mathbf{y} - \frac{1}{2}\mathrm{tr}\!\left(\hat{K}^{-1}\frac{\partial\hat{K}}{\partial\theta}\right).$

We start by defining $\boldsymbol{\alpha} = \hat{K}^{-1}\mathbf{y}$. Then we also take the trace of the quadratic (scalar) term. This allows us to cycle the order of multiplication, so we wind up with

$\frac{\partial \log p(\mathbf{y} \mid X)}{\partial\theta} = \frac{1}{2}\mathrm{tr}\!\left(\left(\boldsymbol{\alpha}\boldsymbol{\alpha}^{\top} - \hat{K}^{-1}\right)\frac{\partial\hat{K}}{\partial\theta}\right).$
The squared exponential covariance function is now given by

$k(x_p, x_q) = \sigma_f^2 \exp\!\left(-\frac{1}{2}(x_p - x_q)^{\top}\Lambda^{-1}(x_p - x_q)\right),$

so that $\hat{K}_{pq} = k(x_p, x_q) + \sigma_n^2\delta_{pq}$, with $\Lambda$ a diagonal matrix of squared length scales. We will start with the derivative with respect to $\sigma_n^2$, where we assume that the measurement noise has the same strength for each measurement. We now have

$\frac{\partial\hat{K}}{\partial\sigma_n^2} = I.$

Next, we take the derivative with respect to $\sigma_f^2$. Here we have

$\frac{\partial\hat{K}_{pq}}{\partial\sigma_f^2} = \frac{k(x_p, x_q)}{\sigma_f^2}.$

The next derivative is the hardest one. We will take the derivative with respect to $\Lambda_{ii}$, which is the squared length scale in the direction of input $i$. Note that we have one such derivative for each of the input dimensions. We will find these derivatives element-wise. So,

$\frac{\partial\hat{K}_{pq}}{\partial\Lambda_{ii}} = \sigma_f^2\exp\!\left(-\frac{1}{2}(x_p - x_q)^{\top}\Lambda^{-1}(x_p - x_q)\right)\frac{\partial}{\partial\Lambda_{ii}}\!\left(-\frac{1}{2}(x_p - x_q)^{\top}\Lambda^{-1}(x_p - x_q)\right).$

Using the chain rule and the derivative of a matrix inverse given above, we can find that the above equals

$k(x_p, x_q)\,\frac{1}{2}(x_p - x_q)^{\top}\Lambda^{-1}\frac{\partial\Lambda}{\partial\Lambda_{ii}}\Lambda^{-1}(x_p - x_q).$

It is important to consider what the derivative $\frac{\partial\Lambda}{\partial\Lambda_{ii}}$ looks like. It is a matrix filled with zeros, except for a single one, which is on row $i$ and in column $i$. Keeping this in mind, we can simplify the above to

$\frac{\partial\hat{K}_{pq}}{\partial\Lambda_{ii}} = k(x_p, x_q)\,\frac{(x_p - x_q)_i^2}{2\Lambda_{ii}^2}.$

Note that the term $(x_p - x_q)_i$ is element $i$ of the vector $x_p - x_q$.
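To tie these derivatives back to code, here is a hedged sketch (not part of the original tutorial; it assumes the kernel function defined earlier and uses the noise variance as the example hyperparameter) that evaluates the log marginal likelihood and its gradient with respect to $\sigma_n^2$ using the trace formula above.

import numpy as np

def log_marginal_likelihood_and_grad(X, y, noise_var, l=0.5, sigma_f=1.0):
    n = len(X)
    K_hat = kernel(X, X, l, sigma_f) + noise_var * np.eye(n)

    L = np.linalg.cholesky(K_hat)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # alpha = K_hat^{-1} y

    # log p(y|X) = -1/2 y^T alpha - sum(log diag(L)) - n/2 log(2 pi)
    lml = -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)

    # d log p / d sigma_n^2 = 1/2 tr((alpha alpha^T - K_hat^{-1}) dK_hat/d sigma_n^2),
    # with dK_hat/d sigma_n^2 = I
    K_hat_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(n)))
    grad_noise = 0.5 * np.trace(np.outer(alpha, alpha) - K_hat_inv)
    return lml, grad_noise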