Gaussian Process
Zhao Zhao

I.   Motivation

First of all, why use a Gaussian Process to do regression? And what is regression, anyway? Regression is a common machine learning task that can be described as follows: given some observed data points (a training dataset), find a function that represents the dataset as closely as possible, then use that function to make predictions at new data points. Regression can be conducted with polynomials, and there is commonly more than one possible function that fits the observed data. Besides getting predictions from the function, we also want to know how certain these predictions are. Moreover, quantifying uncertainty is extremely valuable for achieving an efficient learning process: the areas with the least certainty should be explored more.

In a word, a Gaussian Process (GP) can be used to make predictions at new data points and can tell us how certain those predictions are.

II. Basics

A.  Gaussian (Normal) Distribution

Let's first talk about the Gaussian distribution.

A random variable $X$ is said to be normally distributed with mean $\mu$ and variance $\sigma^2$ if its probability density function (PDF) is

$$P_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

Here, $X$ represents the random variable and $x$ is the real argument. The Gaussian or normal distribution of $X$ is usually denoted by $X \sim \mathcal{N}(\mu, \sigma^2)$.

A Gaussian PDF is plotted below. We generate m = 1000 random sample points from a 1-D Gaussian distribution and show them on the x axis.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


from scipy.stats import norm

# Plot 1-D gaussian
n = 1 # n number of independent 1-D gaussian
m= 1000 # m points in 1-D gaussian
f_random = np.random.normal(size=(n, m))
# more information about 'size': https://www.sharpsightlabs.com/blog/numpy-random-normal/
print(f_random.shape)

for i in range(n):
    #sns.distplot(f_random[i], hist=True, rug=True, vertical=True, color="orange")
    sns.distplot(f_random[i], hist=True, rug=True, fit=norm, kde=False, color="r", vertical=False)
    # sns.distplot(f_random[i], hist=True, rug=True)

#plt.title('1000 random samples by a 1-D Gaussian')

plt.title('1000 random samples from a 1-D Gaussian distribution')
plt.xlabel(r'$x$', fontsize = 16)
plt.ylabel(r'$P_X(x)$', fontsize = 16)
plt.show()

We generated data points that follow the normal distribution. Conversely, we can also model observed data points: assume the points are Gaussian, model them as a function, and do regression with it. As shown above, a kernel density estimate and a histogram of the generated points were plotted. The kernel density estimate looks like a normal distribution because there are plenty (m = 1000) of observation points. In regression, even when we don't have that many observations, we can still model the data as a function that follows a normal distribution if we assume a Gaussian prior.

We have a randomly generated dataset stored in f_random. Sampling the generated dataset gave us a Gaussian bell curve.

Now, let's project all the points on the x-axis into another space. In this space, we treat all the points as a vector $f_1$, and plot $f_1$ on a new axis at $X = 0$.

n = 1         # n number of independent 1-D gaussian 
m= 1000 # m points in 1-D gaussian
f_random = np.random.normal(size=(n, m))

Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)

plt.clf()
plt.plot(Xshow, f_random, 'o', linewidth=1, markersize=1, markeredgewidth=2)
plt.xlabel('$X$')
plt.ylabel('$f(X)$')
plt.show()

It's clear that the vector $f_1$ is Gaussian. It looks like we did nothing but vertically plot the points of the vector $f_1$.

Next, we can plot multiple independent Gaussians in the same coordinates. For example, put the vector $f_1$ at $X = 0$ and another vector $f_2$ at $X = 1$.

n = 2          
m = 1000
f_random = np.random.normal(size=(n, m))

Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)

plt.clf()
plt.plot(Xshow, f_random, 'o', linewidth=1, markersize=1, markeredgewidth=2)

plt.xlabel('$X$', fontsize = 16)
plt.ylabel('$f(X)$', fontsize = 16)
plt.show()

Keep in mind that both vectors $f_1$ and $f_2$ are Gaussian.

Let's do something interesting: connect the points of $f_1$ and $f_2$ with lines. For now, we only generate 10 random points for each of $f_1$ and $f_2$, and join them up as 10 lines. Keep in mind that these 10 randomly generated points are Gaussian.

n = 2          
m = 10
f_random = np.random.normal(size=(n, m)) # 2*10

Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)

plt.clf()
plt.plot(Xshow, f_random, '-o', linewidth=2, markersize=4, markeredgewidth=2)
plt.xlabel(r'$X$', fontsize = 16)
plt.ylabel(r'$f(X)$', fontsize = 16)

Going back to regression: these lines look like functions for each pair of points. On the other hand, the plot also looks like we are sampling the region [0, 1] with 10 linear functions, even though there are only two points on each line. From the sampling perspective, the domain [0, 1] is our region of interest, i.e. the specific region where we do our regression. This sampling becomes even clearer if we generate more independent Gaussians and connect the points in order with lines.

n = 20          
m = 10
f_random = np.random.normal(size=(n, m))

Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)

plt.clf()
plt.plot(Xshow, f_random, '-o', linewidth=1, markersize=3, markeredgewidth=2)

plt.xlabel(r'$X$', fontsize = 16)
plt.ylabel(r'$f(X)$', fontsize = 16)
plt.show()

Even though these lines look like functions, they are too noisy. If $X$ is our input space, these functions are meaningless for the regression task: we can make no prediction with them. The functions should be smoother, meaning that input points which are close to each other should have similar function values.

Thus, functions obtained by connecting independent Gaussians are not suitable for regression; we need Gaussians that are correlated with each other. How do we describe jointly Gaussian variables? With the multivariate Gaussian.

B. Multivariate Normal Distribution (MVN)

In some situations, a system (a set of data) has to be described by more than one feature variable $(x_1, x_2, \ldots, x_D)$, and these variables are correlated. If we want to model all the data in one go as Gaussian, we need a multivariate Gaussian. Here is an example with two variables.

A 2-D Gaussian can be visualized as a 3-D bell curve, with the heights representing the probability density.

Formally, the multivariate Gaussian is expressed as

$$p(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right).$$

For the 2-D case, the mean vector $\boldsymbol{\mu}$ is a 2-D vector $(\mu_1, \mu_2)$, containing the means of the two variables $x_1$ and $x_2$.

The covariance matrix of the 2-D Gaussian is $\begin{pmatrix}\sigma_1^2 & \sigma_{12}\\ \sigma_{21} & \sigma_2^2\end{pmatrix}$. The diagonal terms are the independent variances of each variable, $\sigma_1^2$ and $\sigma_2^2$. The off-diagonal terms represent the correlations between the two variables: a correlation component describes how much one variable is related to the other.

A 2-D Gaussian can thus be expressed as

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}, \begin{pmatrix}\sigma_1^2 & \sigma_{12}\\ \sigma_{21} & \sigma_2^2\end{pmatrix}\right).$$

When we have an $n$-dimensional Gaussian, the covariance matrix $\Sigma$ is $n \times n$ and its element is $\Sigma_{ij} = \mathrm{cov}(x_i, x_j)$. $\Sigma$ is a symmetric matrix that stores the pairwise covariances of all the jointly modeled random variables.

Play around with the covariance matrix to see the correlations between the two Gaussians.

import pandas as pd
import seaborn as sns

mean, cov = [0., 0.], [(1., -0.6), (-0.6, 1.)]
data = np.random.multivariate_normal(mean, cov, 1000)
df = pd.DataFrame(data, columns=["x1", "x2"])
g = sns.jointplot(data=df, x="x1", y="x2", kind="kde")

#(sns.jointplot("x1", "x2", data=df).plot_joint(sns.kdeplot))

g.plot_joint(plt.scatter, c="g", s=30, linewidth=1, marker="+")

#g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("$x1$", "$x2$");
# g.ax_joint.legend_.remove()
plt.show()

Besides the joint probability, we are more interested in the conditional probability. If we cut a slice through the 3-D bell curve, or draw a line on the ellipse contour, we get the conditional probability distribution $p(x_1 \mid x_2)$. The conditional distribution is also Gaussian.
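For reference, this is the standard MVN conditioning identity (stated here without derivation): for the 2-D Gaussian above, conditioning on a value of $x_2$ gives

$$x_1 \mid x_2 \;\sim\; \mathcal{N}\!\left(\mu_1 + \frac{\sigma_{12}}{\sigma_2^2}\,(x_2 - \mu_2),\;\; \sigma_1^2 - \frac{\sigma_{12}^2}{\sigma_2^2}\right).$$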

C. Kernels

We want to smooth the sampled functions by defining covariance functions. Recall that when two vectors are similar, their dot product is large. This is clear from the dot product equation $\mathbf{a}\cdot\mathbf{b} = \|\mathbf{a}\|\,\|\mathbf{b}\|\cos\theta$, where $\theta$ is the angle between the two vectors. If an algorithm is defined solely in terms of inner products in input space, then it can be lifted into feature space by replacing the occurrences of those inner products with $k(\mathbf{x}, \mathbf{x}')$; we call $k$ a kernel function.

A popular covariance function (aka kernel function) is the squared exponential kernel, also called the radial basis function (RBF) kernel or Gaussian kernel, defined (here with unit length scale and unit signal variance) as

$$k(x_i, x_j) = \exp\left(-\tfrac{1}{2}\,(x_i - x_j)^2\right).$$

Let's re-plot the 10 samples of 20 Gaussians connected by lines. But instead of generating 20 independent 1-D Gaussians as before, we now draw from a 20-D Gaussian with an identity covariance matrix.

n = 20      # n number of independent 1-D gaussians (20 dimensions)
m = 10      # m sample points for each gaussian (10 points per dimension)


mean = np.zeros(n)
cov = np.eye(n)

f_prior = np.random.multivariate_normal(mean, cov, m).T

plt.clf()

#plt.plot(Xshow, f_prior, '-o')
Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)

for i in range(m):
    plt.plot(Xshow, f_prior[:, i], '-o', linewidth=1)

plt.title('10 samples of the 20-D gaussian prior')
plt.show()

We got exactly the same plot as expected. Now let's kernelize our functions by using the RBF kernel as our covariance function.

def kernel(a, b):
    # (a-b)^2 = a^T*a + b^T*b - 2*a*b^T, computed pairwise for all rows of a and b
    sqdist = np.sum(a**2, axis=1).reshape(-1,1) + np.sum(b**2, axis=1) - 2*np.dot(a, b.T)
    # np.sum(..., axis=1) sums along each row; .reshape(-1, 1) turns shape (n,) into (n,1)
    return np.exp(-.5 * sqdist)
n = 20
m = 10

Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1), size (20,1)

K_ = kernel(Xshow, Xshow) # k(x_star, x_star), size (20,20)

mean = np.zeros(n)
cov = np.eye(n)

f_prior = np.random.multivariate_normal(mean, K_, m).T   # kernelized prior samples, size (20,10)
f_prior2 = np.random.multivariate_normal(mean, cov, m).T # identity-covariance samples, kept for comparison
plt.clf()

for i in range(m):
    plt.plot(Xshow, f_prior[:, i], '-o', linewidth=1)
    # plt.plot(Xshow, f_prior2[:, i], '-o', linewidth=1)  # uncomment to compare with the identity covariance
plt.title('10 samples of the 20-D gaussian kernelized prior')
plt.show()

We get much smoother lines, which look even more like functions. As the dimension of the Gaussian gets larger, there is no need to connect the points anymore. When the dimension becomes infinite, there is a point representing every possible input. Let's plot m = 200 samples of an n = 200 dimensional Gaussian to get a feeling for functions with infinitely many parameters.

n = 200         
m = 200

Xshow = np.linspace(0, 1, n).reshape(-1,1)

K_ = kernel(Xshow, Xshow) # k(x_star, x_star)

mean = np.zeros(n)
cov = np.eye(n)

f_prior = np.random.multivariate_normal(mean, K_, m).T

plt.clf()
#plt.plot(Xshow, f_prior, '-o')
Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)

plt.figure(figsize=(16,8))
#for i in range(m):
plt.plot(Xshow, f_prior, 'o', linewidth=1, markersize=2, markeredgewidth=1)

plt.title('200 samples of the 200-D gaussian kernelized prior')
#plt.axis([0, 1, -3, 3])
plt.show()

As we can see above, as we increase the dimension of the Gaussian towards infinity, we can sample all the possible points in our region of interest.

Here we talk a little bit about parametric and nonparametric models. You can skip this part without compromising your understanding of Gaussian Processes.

Parametric models assume that the data distribution can be modeled in terms of a finite number of parameters $\theta$. For regression, we have some data points, and we would like to predict the value of $y$ for a specific $x$. If we assume a linear regression model, $y = \theta_1 + \theta_2 x$, we need to find the parameters $\theta_1$ and $\theta_2$ to define the line. In many cases the linear model assumption doesn't hold, and a polynomial model with more parameters, such as $y = \theta_1 + \theta_2 x + \theta_3 x^2$, is needed. We use a training dataset $D$ of $n$ observations, $D = \{(x_i, y_i) \mid i = 1, \ldots, n\}$, to train the model, i.e. to map $x$ to $y$ through the parameters $\theta$. After the training process, we assume all the information in the data is captured by the parameters $\theta$, so predictions are independent of the training data $D$. This can be expressed as $p(f_* \mid x_*, \theta, D) = p(f_* \mid x_*, \theta)$, in which $f_*$ is the prediction made at an unobserved point $x_*$.

Thus, when conducting regression with a parametric model, the complexity or flexibility of the model is limited by the number of parameters. It is natural to consider a model whose number of parameters grows with the size of the dataset: this is a Bayesian non-parametric model. Bayesian non-parametric models do not imply that there are no parameters, but rather that there are infinitely many parameters.
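To make the contrast concrete, here is a minimal sketch (not part of the original tutorial; the toy data and the degree-3 choice are just for illustration) of a parametric fit, where all the training information is compressed into a few coefficients:

import numpy as np

# Hypothetical toy data, for illustration only
x_train = np.linspace(-3, 3, 15)
y_train = np.sin(x_train) + 0.1 * np.random.randn(15)

# A parametric model: a degree-3 polynomial, i.e. 4 parameters theta
theta = np.polyfit(x_train, y_train, 3)

# After training, predictions depend only on theta, not on the training data
x_star = 1.5
y_star = np.polyval(theta, x_star)
print(theta, y_star)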

To generate correlated normally distributed random samples, one can first generate uncorrelated samples and then multiply them by a matrix $L$ such that $L L^{\top} = K$, where $K$ is the desired covariance matrix. $L$ can be created, for example, by using the Cholesky decomposition of $K$.

n = 20      
m = 10

Xshow = np.linspace(0, 1, n).reshape(-1,1) # n number test points in the range of (0, 1)

K_ = kernel(Xshow, Xshow)

L = np.linalg.cholesky(K_ + 1e-6*np.eye(n))

f_prior = np.dot(L, np.random.normal(size=(n,m)))

plt.clf()
plt.plot(Xshow, f_prior, '-o')
plt.title('10 samples of the 20-D gaussian kernelized prior')
plt.show()

III. Math

First, again, let's go back to our regression task. There is a function $f$ we are trying to model, given a set of data points (training data / existing observed data) from the unknown function $f$. Traditional non-linear regression methods typically give one function that they consider to fit the observations best. But, as shown at the beginning, there can be more than one function that fits the observations equally well.

Second, let's review what we got from the MVN. We saw that when the dimension of a Gaussian is infinite, we can sample the whole region of interest with random functions. These infinite random functions are MVN because that is our assumption (prior). More formally, the prior distribution of these infinite random functions is MVN. The prior distribution represents the kind of outputs that we expect to see over some inputs before observing any data.

When we have observation points, instead of infinitely many random functions, we keep only the functions that fit these points. Now we have our posterior, the current belief based on the existing observations. When we get more observation points, we use our previous posterior as the prior and use the new observations to update the posterior.

This is Gaussian process.

A Gaussian process is a probability distribution over possible functions that fit a set of points.

Because we have the probability distribution over all possible functions, we can calculate the mean as the function, and calculate the variance to show how confident we are when making predictions with it.

Keep in mind,

  • The functions (the posterior) update with new observations.

  • The mean calculated from the posterior distribution of the possible functions is the function used for regression.

The function is modeled by a multivariate Gaussian as

$$p(\mathbf{f} \mid X) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, K),$$

where $X = [x_1, \ldots, x_n]$, $\mathbf{f} = [f(x_1), \ldots, f(x_n)]$, $\boldsymbol{\mu} = [m(x_1), \ldots, m(x_n)]$ and $K_{ij} = k(x_i, x_j)$. $m$ is the mean function, and it is common to use $m(x) = 0$ as GPs are flexible enough to model the mean arbitrarily well. $k$ is a positive definite kernel function or covariance function. Thus, a Gaussian process is a distribution over functions whose shape (smoothness, ...) is defined by $K$. If points $x_i$ and $x_j$ are considered to be similar by the kernel, the function values at these points, $f(x_i)$ and $f(x_j)$, can be expected to be similar too.

So, we have observations, and we have estimated function values $\mathbf{f}$ at these observations. Now say we have some new points $X_*$ where we want to predict $f(X_*)$.

The joint distribution of $\mathbf{f}$ and $\mathbf{f}_*$ can be modeled as

$$\begin{pmatrix} \mathbf{f} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \boldsymbol{\mu} \\ \boldsymbol{\mu}_* \end{pmatrix}, \begin{pmatrix} K & K_* \\ K_*^{\top} & K_{**} \end{pmatrix} \right),$$

where $K = K(X, X)$, $K_* = K(X, X_*)$ and $K_{**} = K(X_*, X_*)$. And under the zero-mean assumption, $(\boldsymbol{\mu}, \boldsymbol{\mu}_*)^{\top} = \mathbf{0}$.

This is modeling a joint distribution $p(\mathbf{f}, \mathbf{f}_* \mid X, X_*)$, but we want the conditional distribution over $\mathbf{f}_*$ only, which is $p(\mathbf{f}_* \mid \mathbf{f}, X, X_*)$. The derivation from the joint distribution to the conditional uses the theorem on marginal and conditional distributions of the MVN.

We get

$$\mathbf{f}_* \mid \mathbf{f}, X, X_* \sim \mathcal{N}\left(K_*^{\top} K^{-1} \mathbf{f},\; K_{**} - K_*^{\top} K^{-1} K_*\right).$$

In realistic modeling situations, we do not have access to the function values themselves, but only to noisy versions thereof, $y = f(x) + \varepsilon$. Assuming additive independent identically distributed Gaussian noise $\varepsilon$ with variance $\sigma_n^2$, the prior on the noisy observations becomes $\mathrm{cov}(\mathbf{y}) = K + \sigma_n^2 I$. The joint distribution of the observed target values and the function values at the test locations under the prior is

$$\begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{pmatrix} K + \sigma_n^2 I & K_* \\ K_*^{\top} & K_{**} \end{pmatrix} \right).$$

Deriving the conditional distribution as before (eqn. 2.19 in the GPML textbook), we get the predictive equations for Gaussian process regression:

$$\mathbf{f}_* \mid X, \mathbf{y}, X_* \sim \mathcal{N}\left(\bar{\mathbf{f}}_*, \mathrm{cov}(\mathbf{f}_*)\right),$$

where

$$\bar{\mathbf{f}}_* = K_*^{\top}\left(K + \sigma_n^2 I\right)^{-1}\mathbf{y}, \qquad \mathrm{cov}(\mathbf{f}_*) = K_{**} - K_*^{\top}\left(K + \sigma_n^2 I\right)^{-1} K_*.$$

IV. Simple Implementation Example

We do the regression example in the interval between -5 and 5. The observation data points (training dataset) are generated from a uniform distribution between -5 and 5, which means any point value within the interval [-5, 5] is equally likely to be drawn. The functions are evaluated at n evenly spaced test points between -5 and 5, so that we can show a continuous function for regression in our region of interest [-5, 5]. This is a simple example of GP regression; it assumes a zero-mean GP prior.

The algorithm executed follows:

  1. The textbook GPML, p. 19 (Algorithm 2.1).

input: $X$ (inputs), $\mathbf{y}$ (targets), $k$ (covariance function), $\sigma_n^2$ (noise level), $X_*$ (test inputs)

return: $\bar{\mathbf{f}}_*$ (mean), $\mathbb{V}[\mathbf{f}_*]$ (variance), $\log p(\mathbf{y} \mid X)$ (log marginal likelihood).

  2. Dr. Jie Wang, An Intuitive Tutorial to Gaussian Processes Regression [2].

The inputs are $X$ (inputs), $\mathbf{y}$ (targets), $k$ (covariance function), $\sigma_n^2$ (noise level), and $X_*$ (test inputs). The outputs are $\bar{\mathbf{f}}_*$ (mean), $\mathbb{V}[\mathbf{f}_*]$ (variance), and $\log p(\mathbf{y} \mid X)$ (log marginal likelihood).
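For reference, the steps of Algorithm 2.1 in the GPML textbook, which the implementation below follows, are:

$$\begin{aligned}
L &= \mathrm{cholesky}(K + \sigma_n^2 I)\\
\boldsymbol{\alpha} &= L^{\top} \backslash (L \backslash \mathbf{y})\\
\bar{f}_* &= \mathbf{k}_*^{\top}\boldsymbol{\alpha}\\
\mathbf{v} &= L \backslash \mathbf{k}_*\\
\mathbb{V}[f_*] &= k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{v}^{\top}\mathbf{v}\\
\log p(\mathbf{y} \mid X) &= -\tfrac{1}{2}\mathbf{y}^{\top}\boldsymbol{\alpha} - \textstyle\sum_i \log L_{ii} - \tfrac{n}{2}\log 2\pi
\end{aligned}$$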

Note: Cholesky Decomposition

The Cholesky decomposition of a symmetric, positive definite matrix $A$ decomposes $A$ into a product of a lower triangular matrix $L$ and its transpose,

$$A = L L^{\top},$$

where $L$ is called the Cholesky factor. The Cholesky decomposition is useful for solving linear systems with a symmetric, positive definite coefficient matrix $A$.

To solve $A\mathbf{x} = \mathbf{b}$ for $\mathbf{x}$, first solve the triangular system $L\mathbf{z} = \mathbf{b}$ by forward substitution and then the triangular system $L^{\top}\mathbf{x} = \mathbf{z}$ by back substitution. Using the backslash operator, we write the solution as $\mathbf{x} = L^{\top} \backslash (L \backslash \mathbf{b})$, where the notation $A \backslash \mathbf{b}$ is the vector $\mathbf{x}$ which solves $A\mathbf{x} = \mathbf{b}$. Both the forward and backward substitution steps require $n^2/2$ operations when $A$ is of size $n \times n$. The computation of the Cholesky factor $L$ is considered numerically extremely stable and takes time $n^3/6$, so it is the method of choice when it can be applied.

Note also that the determinant of a positive definite symmetric matrix can be calculated efficiently by

$$|A| = \prod_{i=1}^{n} L_{ii}^2, \qquad \log|A| = 2\sum_{i=1}^{n}\log L_{ii},$$

where $L$ is the Cholesky factor from $A$.
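A minimal NumPy sketch of this solve pattern, using a small symmetric positive definite matrix made up for illustration:

import numpy as np

# A small SPD system, for illustration only
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

L = np.linalg.cholesky(A)        # A = L L^T
z = np.linalg.solve(L, b)        # forward substitution:  L z = b
x = np.linalg.solve(L.T, z)      # back substitution:     L^T x = z

print(np.allclose(A @ x, b))                                                 # True
print(np.isclose(np.log(np.linalg.det(A)), 2 * np.sum(np.log(np.diag(L)))))  # True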

If we write $K + \sigma_n^2 I = L L^{\top}$, we have

$$\left(K + \sigma_n^2 I\right)^{-1} = \left(L L^{\top}\right)^{-1} = L^{-\top} L^{-1}.$$

Therefore

$$\bar{\mathbf{f}}_* = K_*^{\top}\left(K + \sigma_n^2 I\right)^{-1} \mathbf{y} = \left(L^{-1} K_*\right)^{\top}\left(L^{-1} \mathbf{y}\right).$$

Therefore

$$\mathbb{V}[\mathbf{f}_*] = K_{**} - K_*^{\top}\left(K + \sigma_n^2 I\right)^{-1} K_* = K_{**} - \left(L^{-1} K_*\right)^{\top}\left(L^{-1} K_*\right).$$

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# This is the true unknown function we are trying to approximate
f = lambda x: np.sin(0.9*x).flatten() # .flatten() returns the array collapsed into one dimension
#f = lambda x: (0.25*(x**2)).flatten()
x = np.arange(-5, 5, 0.1)

plt.plot(x, f(x))
plt.axis([-5, 5, -3, 3])
plt.show()

We use a general squared exponential kernel, also called the radial basis function (RBF) kernel or Gaussian kernel:

$$k(x_i, x_j) = \sigma_f^2 \exp\left(-\frac{(x_i - x_j)^2}{2\,l^2}\right),$$

where $\sigma_f$ and $l$ are hyperparameters (in the code below, `kernelParameter_sigma` plays the role of $\sigma_f^2$ and `kernelParameter_l` the role of $l^2$). More information about the hyperparameters can be found after the code.

# Define the kernel with hyperparameters
def kernel(a, b):
    kernelParameter_l = 0.1      # plays the role of l^2, the squared length scale
    kernelParameter_sigma = 1.0  # plays the role of sigma_f^2, the signal variance
    sqdist = np.sum(a**2, axis=1).reshape(-1,1) + np.sum(b**2, axis=1) - 2*np.dot(a, b.T)
    # np.sum(..., axis=1) sums along each row; .reshape(-1, 1) turns shape (n,) into (n,1)
    return kernelParameter_sigma * np.exp(-.5 * (1/kernelParameter_l) * sqdist)
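To get a feel for the hyperparameters, here is a small sketch (not part of the original tutorial; `kernel_with_l` is a helper defined just for this demo) that draws prior samples for a few values of the squared length scale; smaller values give wigglier functions, larger values give smoother ones:

# A small demo: prior samples for different (squared) length scales
def kernel_with_l(a, b, l2=0.1, sigma2=1.0):
    sqdist = np.sum(a**2, axis=1).reshape(-1,1) + np.sum(b**2, axis=1) - 2*np.dot(a, b.T)
    return sigma2 * np.exp(-.5 * sqdist / l2)

Xdemo = np.linspace(-5, 5, 100).reshape(-1, 1)
plt.figure(figsize=(12, 4))
for i, l2 in enumerate([0.01, 0.1, 1.0]):
    Kdemo = kernel_with_l(Xdemo, Xdemo, l2=l2)
    # small jitter on the diagonal keeps the covariance numerically positive definite
    samples = np.random.multivariate_normal(np.zeros(100), Kdemo + 1e-8*np.eye(100), 3).T
    plt.subplot(1, 3, i + 1)
    plt.plot(Xdemo, samples)
    plt.title('l^2 = %.2f' % l2)
plt.show()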
# Sample some input points and noisy versions of the function evaluated at
# these points.
N = 20 # number of existing observation points (training points).
n = 200 # number of test points.
s = 0.00005 # noise variance.

X = np.random.uniform(-5, 5, size=(N,1)) # N training points
y = f(X) + s*np.random.randn(N)

K = kernel(X, X)
L = np.linalg.cholesky(K + s*np.eye(N)) # line 1: L*L^T = K + sigma_n^2*I (Cholesky decomposition)

# points we're going to make predictions at.
Xtest = np.linspace(-5, 5, n).reshape(-1,1)

# compute the mean at our test points.
Lk = np.linalg.solve(L, kernel(X, Xtest)) # k_star = kernel(X, Xtest), calculating v := L\k_star Lk = v
mu = np.dot(Lk.T, np.linalg.solve(L, y)) # np.linalg.solve(L, y) --->Lx = y

# compute the variance at our test points.
K_ = kernel(Xtest, Xtest) # k(x_star, x_star)
s2 = np.diag(K_) - np.sum(Lk**2, axis=0) # predictive variance at the test points
s = np.sqrt(s2)                          # predictive standard deviation

# PLOTS:
plt.figure(1)
plt.clf()
plt.plot(X, y, 'k+', ms=18)
plt.plot(Xtest, f(Xtest), 'b-')
plt.gca().fill_between(Xtest.flat, mu-2*s, mu+2*s, color="#dddddd")
plt.plot(Xtest, mu, 'r--', lw=2)
#plt.savefig('predictive.png', bbox_inches='tight', dpi=300)
plt.title('Mean prediction plus/minus 2 standard deviations')
plt.show()
#plt.axis([-5, 5, -3, 3])
# draw samples from the posterior at our test points.
L = np.linalg.cholesky(K_ + 1e-6*np.eye(n) - np.dot(Lk.T, Lk))
f_post = mu.reshape(-1,1) + np.dot(L, np.random.normal(size=(n,40))) # size=(n, m): m is how many posterior samples to draw
plt.figure(3)
plt.clf()
plt.figure(figsize=(10,4))
plt.plot(X, y, 'k+', markersize=20, markeredgewidth=3)
plt.plot(Xtest, mu, 'r--', linewidth=3)
plt.plot(Xtest, f_post, linewidth=0.8)
plt.title('40 samples from the GP posterior, mean prediction function and observation points')
plt.show()

We plotted m = 40 samples from the Gaussian process posterior together with the mean prediction function and the observation data points (training dataset). It's clear that all the posterior functions collapse onto the observation points.
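The algorithm in this section also lists the log marginal likelihood as an output, which the code above does not compute. A minimal sketch of that last step, reusing `X`, `y`, `N` and `kernel` from above (we rebuild the Cholesky factor because `L` was overwritten by the posterior-sampling cell; 0.00005 is the same noise level `s` used earlier):

# Log marginal likelihood, following the last line of GPML Algorithm 2.1
L_train = np.linalg.cholesky(kernel(X, X) + 0.00005*np.eye(N))
alpha = np.linalg.solve(L_train.T, np.linalg.solve(L_train, y))
log_marginal = -0.5*np.dot(y, alpha) - np.sum(np.log(np.diag(L_train))) - 0.5*N*np.log(2*np.pi)
print('log marginal likelihood:', log_marginal)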

Reference:

[1] Seeger, Matthias. "Gaussian processes for machine learning." International journal of neural systems 14.02 (2004): 69-106.

[2] Wang, Jie. "An intuitive tutorial to Gaussian processes regression." arXiv preprint arXiv:2009.10862 (2020).


OPTIMIZING THE LOG-LIKELIHOOD

Log-likelihood. First, let's define the shorthand $K_y = K + \sigma_n^2 I$. We now have

$$\log p(\mathbf{y} \mid X) = -\frac{1}{2}\mathbf{y}^{\top} K_y^{-1}\mathbf{y} - \frac{1}{2}\log|K_y| - \frac{n}{2}\log 2\pi.$$

For now, we will ignore hyperparameters affecting the mean function, so we only consider hyperparameters affecting $K_y$. Let's consider the derivative with respect to some general hyperparameter $\theta$.

For any invertible matrix $K$ and any parameter $\theta$, the derivative of $K^{-1}$ with respect to $\theta$ equals

$$\frac{\partial K^{-1}}{\partial\theta} = -K^{-1}\frac{\partial K}{\partial\theta}K^{-1}.$$

For any invertible matrix $K$, the derivative of $|K|$ is given by

$$\frac{\partial |K|}{\partial\theta} = |K|\,\mathrm{tr}\!\left(K^{-1}\frac{\partial K}{\partial\theta}\right).$$

A nice consequence of the above theorem is that

$$\frac{\partial \log|K|}{\partial\theta} = \mathrm{tr}\!\left(K^{-1}\frac{\partial K}{\partial\theta}\right).$$

With the above equations, we can find that

$$\frac{\partial \log p(\mathbf{y}\mid X)}{\partial\theta} = \frac{1}{2}\mathbf{y}^{\top} K_y^{-1}\frac{\partial K_y}{\partial\theta}K_y^{-1}\mathbf{y} - \frac{1}{2}\mathrm{tr}\!\left(K_y^{-1}\frac{\partial K_y}{\partial\theta}\right).$$

We start by defining $\boldsymbol{\alpha} = K_y^{-1}\mathbf{y}$. Then we also take the trace of the (scalar) leftmost term. This allows us to cycle the order of multiplication, so we wind up with

$$\frac{\partial \log p(\mathbf{y}\mid X)}{\partial\theta} = \frac{1}{2}\mathrm{tr}\!\left(\left(\boldsymbol{\alpha}\boldsymbol{\alpha}^{\top} - K_y^{-1}\right)\frac{\partial K_y}{\partial\theta}\right).$$

The squared exponential covariance function (including the noise term) is now given by

$$\left[K_y\right]_{pq} = \sigma_f^2\exp\!\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_{p,d}-x_{q,d})^2}{\ell_d^2}\right) + \sigma_n^2\,\delta_{pq}.$$

We will start with the derivative with respect to $\sigma_n^2$, where we assume that the measurement noise has the same strength for each measurement. We now have

$$\frac{\partial K_y}{\partial \sigma_n^2} = I.$$

Next, we take the derivative with respect to $\sigma_f^2$. Here we have

$$\frac{\partial K_y}{\partial \sigma_f^2} = \frac{1}{\sigma_f^2}\left(K_y - \sigma_n^2 I\right).$$

The next derivative is the hardest one. We will take the derivative with respect to $\ell_d^2$, which is the squared length scale in the direction of input $d$. Note that we have one such derivative for each of the $D$ input dimensions. We will find these derivatives element-wise. So, writing $\Lambda = \mathrm{diag}(\ell_1^2, \ldots, \ell_D^2)$,

$$\frac{\partial \left[K_y\right]_{pq}}{\partial \ell_d^2} = \frac{\partial}{\partial \ell_d^2}\,\sigma_f^2\exp\!\left(-\frac{1}{2}(\mathbf{x}_p-\mathbf{x}_q)^{\top}\Lambda^{-1}(\mathbf{x}_p-\mathbf{x}_q)\right).$$

Using the chain rule, we can find that the above equals

$$\frac{\partial \left[K_y\right]_{pq}}{\partial \ell_d^2} = -\frac{\sigma_f^2}{2}\exp\!\left(-\frac{1}{2}(\mathbf{x}_p-\mathbf{x}_q)^{\top}\Lambda^{-1}(\mathbf{x}_p-\mathbf{x}_q)\right)(\mathbf{x}_p-\mathbf{x}_q)^{\top}\frac{\partial \Lambda^{-1}}{\partial \ell_d^2}(\mathbf{x}_p-\mathbf{x}_q).$$

It is important to consider what the derivative $\partial\Lambda/\partial\ell_d^2$ looks like. It is a matrix filled with zeros, except for a single one, which is on row $d$ and in column $d$. Since $\partial\Lambda^{-1}/\partial\ell_d^2 = -\Lambda^{-1}\left(\partial\Lambda/\partial\ell_d^2\right)\Lambda^{-1}$, keeping this in mind we can simplify the above to

$$\frac{\partial \left[K_y\right]_{pq}}{\partial \ell_d^2} = \left(\left[K_y\right]_{pq} - \sigma_n^2\delta_{pq}\right)\frac{(x_{p,d}-x_{q,d})^2}{2\,\ell_d^4}.$$

Note that the term $(x_{p,d}-x_{q,d})$ is element $d$ from the vector $\mathbf{x}_p - \mathbf{x}_q$.
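These derivatives can be fed to a gradient-based optimizer. As a minimal sketch (not part of the original code), one can also simply hand the negative log marginal likelihood to a generic optimizer such as `scipy.optimize.minimize` and let it approximate the gradients numerically; the parameter names `sigma_f2`, `l2` and `sigma_n2` below are illustrative:

import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    # Work with log-parameters so the optimizer can search over positive values freely
    sigma_f2, l2, sigma_n2 = np.exp(log_params)
    sqdist = np.sum(X**2, axis=1).reshape(-1, 1) + np.sum(X**2, axis=1) - 2*np.dot(X, X.T)
    K_y = sigma_f2 * np.exp(-0.5 * sqdist / l2) + sigma_n2 * np.eye(len(y))
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # negative of: log p(y|X) = -1/2 y^T alpha - sum(log L_ii) - n/2 log(2*pi)
    return 0.5*np.dot(y, alpha) + np.sum(np.log(np.diag(L))) + 0.5*len(y)*np.log(2*np.pi)

# Toy data, for illustration only
X = np.random.uniform(-5, 5, size=(20, 1))
y = np.sin(0.9*X).flatten() + 0.1*np.random.randn(20)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(X, y), method='L-BFGS-B')
sigma_f2, l2, sigma_n2 = np.exp(res.x)
print('sigma_f^2 =', sigma_f2, ' l^2 =', l2, ' sigma_n^2 =', sigma_n2)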

We start by defining . Then we also take the trace of the rightmost term. This allows us to cycle the order of multiplication so we wind up with The squared exponential covariance function is now given by We will start with the derivative with respect to , where we assume that the measurement noise has the same strength for each measurement. We now have Next, we take the derivative with respect to . Here we have The next derivative is the hardest one. We will take the derivative with respect to , which is the squared length scale in the direction of input . Note that we have one such derivative for each of the input dimensions. We will find these derivatives element-wise. So, Using the chain rule and (33) we can find that the above equals It is important to consider what the derivative looks like. It is a matrix filled with zeros, except for a single one, which is on row and in column . Keeping this in mind, we can simplify the above to Note that the term is element from the vector .