Sum Rule, Product Rule, Bayes' Theorem, and Gaussian Distribution
Zhao Zhao

Sum Rule, Product Rule, and Bayes' Theorem

$p(x, y)$ is the joint distribution of the two random variables $x, y$. The distributions $p(x)$ and $p(y)$ are the corresponding marginal distributions, and $p(y \mid x)$ is the conditional distribution of $y$ given $x$.

The sum rule (marginalization property):

$$p(x) = \begin{cases} \sum_{y \in \mathcal{Y}} p(x, y) & \text{if } y \text{ is discrete} \\[4pt] \int_{\mathcal{Y}} p(x, y)\, dy & \text{if } y \text{ is continuous} \end{cases}$$

The product rule relates the joint distribution to the conditional distribution via

$$p(x, y) = p(y \mid x)\, p(x).$$

In machine learning and Bayesian statistics, we are often interested in making inferences about unobserved (latent) random variables given that we have observed other random variables. Let us assume we have some prior knowledge $p(x)$ about an unobserved random variable $x$ and some relationship $p(y \mid x)$ between $x$ and a second random variable $y$, which we can observe. If we observe $y$, we can use Bayes' theorem to draw conclusions about $x$ given the observed values of $y$. Bayes' theorem (also Bayes' rule or Bayes' law) states

$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)},$$

where $p(x)$ is the prior, $p(y \mid x)$ is the likelihood, and $p(x \mid y)$ is the posterior. According to the product rule, we have $p(x, y) = p(x \mid y)\, p(y)$ and $p(x, y) = p(y \mid x)\, p(x)$, so that equating the two expressions and dividing by $p(y)$ yields the theorem. The quantity

$$p(y) = \int p(y \mid x)\, p(x)\, dx$$

is the marginal likelihood/evidence. The marginal likelihood is independent of $x$ because $x$ is integrated out, and it ensures that the posterior is normalized.
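As a minimal numerical sketch (with made-up prior and likelihood values for a two-state $x$ and a binary observation $y$), the posterior can be computed directly from the formula above:

```python
import numpy as np

# Hypothetical two-state example: x is latent, y is an observed binary outcome.
prior = np.array([0.7, 0.3])            # p(x)
likelihood = np.array([[0.9, 0.1],      # p(y | x = 0) for y = 0, 1
                       [0.4, 0.6]])     # p(y | x = 1) for y = 0, 1

y_obs = 1
evidence = likelihood[:, y_obs] @ prior              # p(y) = sum_x p(y|x) p(x)
posterior = likelihood[:, y_obs] * prior / evidence  # Bayes' theorem
print(posterior)   # p(x | y = 1) = [0.28, 0.72]; normalized by construction
```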

Means and Covariances

(Expected Value). The expected value of a function $g: \mathbb{R} \rightarrow \mathbb{R}$ of a univariate continuous random variable $X \sim p(x)$ is given by

$$\mathbb{E}_X[g(x)] = \int_{\mathcal{X}} g(x)\, p(x)\, dx.$$

Correspondingly, the expected value of a function $g$ of a discrete random variable $X \sim p(x)$ is given by

$$\mathbb{E}_X[g(x)] = \sum_{x \in \mathcal{X}} g(x)\, p(x).$$

(Mean). The mean of a random variable $X$ with states $x \in \mathbb{R}^D$ is an average and is defined as

$$\mathbb{E}_X[x] = \begin{bmatrix} \mathbb{E}_{X_1}[x_1] \\ \vdots \\ \mathbb{E}_{X_D}[x_D] \end{bmatrix} \in \mathbb{R}^D,$$

where

$$\mathbb{E}_{X_d}[x_d] := \begin{cases} \int_{\mathcal{X}} x_d\, p(x_d)\, dx_d & \text{if } X \text{ is continuous} \\[4pt] \sum_{x_i \in \mathcal{X}} x_i\, p(x_d = x_i) & \text{if } X \text{ is discrete} \end{cases}$$

for $d = 1, \ldots, D$, where the subscript $d$ indicates the corresponding dimension of $x$. The integral and sum are over the states $\mathcal{X}$ of the target space of the random variable $X$.

(Covariance (Univariate)). The covariance between two univariate random variables $X, Y \in \mathbb{R}$ is given by the expected product of their deviations from their respective means, i.e.,

$$\mathrm{Cov}_{X,Y}[x, y] := \mathbb{E}_{X,Y}\big[(x - \mathbb{E}_X[x])(y - \mathbb{E}_Y[y])\big].$$

By using the linearity of expectations, the expression can be rewritten as the expected value of the product minus the product of the expected values, i.e.,

$$\mathrm{Cov}[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\,\mathbb{E}[y].$$

The covariance of a variable with itself, $\mathrm{Cov}[x, x]$, is called the variance and is denoted by $\mathbb{V}_X[x]$. The square root of the variance is called the standard deviation and is often denoted by $\sigma(x)$.

(Covariance (Multivariate)). If we consider two multivariate random variables $X$ and $Y$ with states $x \in \mathbb{R}^D$ and $y \in \mathbb{R}^E$, respectively, the covariance between $X$ and $Y$ is defined as

$$\mathrm{Cov}[x, y] = \mathbb{E}[x y^\top] - \mathbb{E}[x]\,\mathbb{E}[y]^\top = \mathrm{Cov}[y, x]^\top \in \mathbb{R}^{D \times E}.$$

(Variance). The variance of a random variable $X$ with states $x \in \mathbb{R}^D$ and mean vector $\mu \in \mathbb{R}^D$ is defined as

$$\mathbb{V}_X[x] = \mathrm{Cov}_X[x, x] = \mathbb{E}_X\big[(x - \mu)(x - \mu)^\top\big] = \mathbb{E}_X[x x^\top] - \mu \mu^\top \in \mathbb{R}^{D \times D}.$$

This $D \times D$ matrix is called the covariance matrix of $X$. The normalized version of covariance is called the correlation.

(Correlation). The correlation between two random variables $X, Y$ is given by

$$\mathrm{corr}[x, y] = \frac{\mathrm{Cov}[x, y]}{\sqrt{\mathbb{V}[x]\,\mathbb{V}[y]}} \in [-1, 1].$$
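A quick NumPy check on synthetic, hypothetically correlated data that this definition matches `np.corrcoef` (the $1/N$ factors cancel in the ratio):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.6 * x + rng.normal(scale=0.8, size=100_000)  # correlated with x by construction

cov = ((x - x.mean()) * (y - y.mean())).mean()     # Cov[x, y]
corr = cov / np.sqrt(x.var() * y.var())            # Cov[x, y] / sqrt(V[x] V[y])
print(corr)                                        # close to 0.6 here
assert np.isclose(corr, np.corrcoef(x, y)[0, 1])
```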

Empirical Means and Covariances

(Empirical Mean and Covariance). The empirical mean vector is the arithmetic average of the observations for each variable, and it is defined as

$$\bar{x} := \frac{1}{N} \sum_{n=1}^{N} x_n.$$

The empirical covariance matrix is a $D \times D$ matrix

$$\Sigma := \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^\top.$$
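A short sketch of these two estimators in NumPy on random data (note the $1/N$ convention above, rather than NumPy's default $1/(N-1)$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))          # N = 1000 observations of a D = 3 variable

x_bar = X.mean(axis=0)                  # empirical mean vector, shape (D,)
centered = X - x_bar
Sigma = centered.T @ centered / len(X)  # empirical covariance, a D x D matrix

# np.cov defaults to 1/(N-1); bias=True matches the 1/N definition above
assert np.allclose(Sigma, np.cov(X.T, bias=True))
```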

Other Expressions for the Variance

We now focus on a single random variable $X$. The standard definition of variance is the expectation of the squared deviation of a random variable $X$ from its expected value $\mu$, i.e.,

$$\mathbb{V}_X[x] := \mathbb{E}_X\big[(x - \mu)^2\big].$$

The above formula can be converted to the so-called raw-score formula for variance:

$$\mathbb{V}_X[x] = \mathbb{E}_X[x^2] - \big(\mathbb{E}_X[x]\big)^2.$$
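A numerical sanity check on a made-up sample (in floating-point arithmetic the raw-score form can be less stable, since it subtracts two potentially large, nearly equal numbers):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up sample

mu = x.mean()
var_definition = ((x - mu) ** 2).mean()    # E[(x - mu)^2]
var_raw_score = (x ** 2).mean() - mu ** 2  # E[x^2] - (E[x])^2
assert np.isclose(var_definition, var_raw_score)
```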

Sums and Transformations of Random Variables

Consider two random variables $X, Y$ with states $x, y \in \mathbb{R}^D$. Then:

$$\mathbb{E}[x + y] = \mathbb{E}[x] + \mathbb{E}[y],$$
$$\mathbb{E}[x - y] = \mathbb{E}[x] - \mathbb{E}[y],$$
$$\mathbb{V}[x + y] = \mathbb{V}[x] + \mathbb{V}[y] + \mathrm{Cov}[x, y] + \mathrm{Cov}[y, x],$$
$$\mathbb{V}[x - y] = \mathbb{V}[x] + \mathbb{V}[y] - \mathrm{Cov}[x, y] - \mathrm{Cov}[y, x].$$

Consider a random variable $X$ with mean $\mu$ and covariance matrix $\Sigma$ and a (deterministic) affine transformation $y = Ax + b$ of $x$. Then $y$ is itself a random variable whose mean vector and covariance matrix are given by

$$\mathbb{E}_Y[y] = A\mu + b, \qquad \mathbb{V}_Y[y] = A \Sigma A^\top.$$

Furthermore,

$$\mathrm{Cov}[x, y] = \Sigma A^\top.$$
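A Monte Carlo sketch of the affine identities with hypothetical $\mu$, $\Sigma$, $A$, and $b$ (the Gaussian sampling distribution is incidental; the identities hold for any distribution with these moments):

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=500_000)
y = x @ A.T + b                                  # y = Ax + b, applied row-wise

print(y.mean(axis=0), A @ mu + b)                # E[y] = A mu + b
print(np.cov(y.T, bias=True), A @ Sigma @ A.T)   # V[y] = A Sigma A^T
```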

Statistical Independence

(Independence). Two random variables $X, Y$ are statistically independent if and only if

$$p(x, y) = p(x)\, p(y).$$

If $X, Y$ are (statistically) independent, then $p(y \mid x) = p(y)$, $p(x \mid y) = p(x)$, $\mathbb{V}_{X,Y}[x + y] = \mathbb{V}_X[x] + \mathbb{V}_Y[y]$, and $\mathrm{Cov}_{X,Y}[x, y] = 0$. Another concept that is important in machine learning is conditional independence.

(Conditional Independence). Two random variables $X$ and $Y$ are conditionally independent given $Z$ if and only if

$$p(x, y \mid z) = p(x \mid z)\, p(y \mid z) \quad \text{for all } z \in \mathcal{Z},$$

where $\mathcal{Z}$ is the set of states of random variable $Z$. We write $X \perp\!\!\!\perp Y \mid Z$ to denote that $X$ is conditionally independent of $Y$ given $Z$.

By using the product rule of probability, we have $p(x, y \mid z) = p(x \mid y, z)\, p(y \mid z)$. Comparing this with the definition of conditional independence, $p(x, y \mid z) = p(x \mid z)\, p(y \mid z)$, we obtain

$$p(x \mid y, z) = p(x \mid z).$$

This alternative presentation provides the interpretation "given that we know $z$, knowledge about $y$ does not change our knowledge of $x$".
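A tiny discrete illustration with made-up conditional tables for binary $x$ and $y$ at one fixed value of $z$: if $p(x, y \mid z)$ factorizes, then $p(x \mid y, z)$ does not depend on $y$:

```python
import numpy as np

# Hypothetical conditionals at a fixed z, built so that p(x, y | z) factorizes
p_x_given_z = np.array([0.3, 0.7])
p_y_given_z = np.array([0.6, 0.4])
p_xy_given_z = np.outer(p_x_given_z, p_y_given_z)  # p(x, y | z) = p(x|z) p(y|z)

# p(x | y, z) = p(x, y | z) / p(y | z): dividing each column by p(y|z)
p_x_given_yz = p_xy_given_z / p_y_given_z
print(p_x_given_yz)  # every column equals p(x | z), so y carries no extra information
```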

Gaussian Distribution

For a univariate random variable, the Gaussian distribution has a density that is given by

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

The multivariate Gaussian distribution is fully characterized by a mean vector $\mu$ and a covariance matrix $\Sigma$ and is defined as

$$p(x \mid \mu, \Sigma) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right),$$

where $x \in \mathbb{R}^D$. We write $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$ or $X \sim \mathcal{N}(\mu, \Sigma)$.
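A sketch evaluating the multivariate density from the closed-form expression and comparing against scipy.stats (hypothetical $\mu$ and $\Sigma$):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = np.array([0.5, 0.5])

D = len(mu)
diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
pdf_manual = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

assert np.isclose(pdf_manual, multivariate_normal(mu, Sigma).pdf(x))
```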

Marginals and Conditionals of Gaussians are Gaussians

To consider the effect of applying the sum rule of probability and the effect of conditioning, we explicitly write the Gaussian distribution in terms of the concatenated states $[x^\top, y^\top]^\top$:

$$p(x, y) = \mathcal{N}\!\left( \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix},\; \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix} \right).$$

The conditional distribution $p(x \mid y)$ is also Gaussian and given by

$$p(x \mid y) = \mathcal{N}\big(\mu_{x \mid y},\; \Sigma_{x \mid y}\big), \qquad \mu_{x \mid y} = \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y), \qquad \Sigma_{x \mid y} = \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}.$$

Note that in computing the conditional mean, the $y$-value is an observation and no longer random.

The marginal distribution $p(x)$ of a joint Gaussian distribution $p(x, y)$ is itself Gaussian and is computed by applying the sum rule:

$$p(x) = \int p(x, y)\, dy = \mathcal{N}\big(x \mid \mu_x,\; \Sigma_{xx}\big).$$
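A minimal sketch of conditioning and marginalizing with scalar blocks (hypothetical numbers), plugging into the formulas above:

```python
import numpy as np

# Hypothetical 2D joint Gaussian over (x, y), partitioned into scalar blocks
mu_x, mu_y = 0.0, 1.0
S_xx, S_xy, S_yy = 2.0, 0.8, 1.0

y_obs = 2.0                                    # observed value of y

# Conditional p(x | y): plug the blocks into the formulas above
mu_cond = mu_x + S_xy / S_yy * (y_obs - mu_y)  # = 0.8
var_cond = S_xx - S_xy / S_yy * S_xy           # = 1.36

# Marginal p(x): simply read off the corresponding blocks
print(f"p(x | y=2) = N({mu_cond}, {var_cond}); p(x) = N({mu_x}, {S_xx})")
```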

Product of Gaussian Densities

The product of two Gaussians $\mathcal{N}(x \mid a, A)\, \mathcal{N}(x \mid b, B)$ is a Gaussian distribution scaled by a constant $c$, given by $c\, \mathcal{N}(x \mid \mathbf{c}, C)$ with

$$C = (A^{-1} + B^{-1})^{-1},$$
$$\mathbf{c} = C\,(A^{-1} a + B^{-1} b),$$
$$c = (2\pi)^{-D/2}\, |A + B|^{-1/2} \exp\!\big(-\tfrac{1}{2}(a - b)^\top (A + B)^{-1} (a - b)\big).$$

The scaling constant $c$ itself can be written in the form of a Gaussian density either in $a$ or in $b$ with an "inflated" covariance matrix $A + B$, i.e.,

$$c = \mathcal{N}(a \mid b,\; A + B) = \mathcal{N}(b \mid a,\; A + B).$$
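A pointwise check of the product formula in the univariate case (hypothetical parameters $a, A, b, B$); the identity is exact, so the assertion holds at any $x$:

```python
import numpy as np
from scipy.stats import norm

a, A = 0.0, 1.0        # N(x | a, A)
b, B = 2.0, 0.5        # N(x | b, B)

C = 1.0 / (1.0 / A + 1.0 / B)             # C = (A^{-1} + B^{-1})^{-1}
c_mean = C * (a / A + b / B)              # c = C (A^{-1} a + B^{-1} b)
c_scale = norm(b, np.sqrt(A + B)).pdf(a)  # scaling constant N(a | b, A + B)

x = 0.7                                   # arbitrary evaluation point
lhs = norm(a, np.sqrt(A)).pdf(x) * norm(b, np.sqrt(B)).pdf(x)
rhs = c_scale * norm(c_mean, np.sqrt(C)).pdf(x)
assert np.isclose(lhs, rhs)
```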

Sums and Linear Transformations

If $X, Y$ are independent Gaussian random variables (i.e., the joint distribution is given as $p(x, y) = p(x)\, p(y)$) with $p(x) = \mathcal{N}(x \mid \mu_x, \Sigma_x)$ and $p(y) = \mathcal{N}(y \mid \mu_y, \Sigma_y)$, then $x + y$ is also Gaussian distributed and given by

$$p(x + y) = \mathcal{N}\big(\mu_x + \mu_y,\; \Sigma_x + \Sigma_y\big).$$
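A quick Monte Carlo check with hypothetical univariate parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(1.0, np.sqrt(2.0), size=500_000)   # x ~ N(1, 2)
y = rng.normal(-1.0, np.sqrt(3.0), size=500_000)  # y ~ N(-1, 3), independent of x

z = x + y
print(z.mean(), z.var())  # approximately 0 and 5: N(mu_x + mu_y, sigma_x^2 + sigma_y^2)
```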

Consider a mixture of two univariate Gaussian densities

$$p(x) = \alpha\, p_1(x) + (1 - \alpha)\, p_2(x),$$

where $p_1(x) = \mathcal{N}(x \mid \mu_1, \sigma_1^2)$ and $p_2(x) = \mathcal{N}(x \mid \mu_2, \sigma_2^2)$ are univariate Gaussian densities and $0 < \alpha < 1$. The mean of the mixture density is given by the weighted sum of the means of each random variable:

$$\mathbb{E}[x] = \alpha \mu_1 + (1 - \alpha) \mu_2.$$

The variance of the mixture density is given by

$$\mathbb{V}[x] = \big[\alpha \sigma_1^2 + (1 - \alpha) \sigma_2^2\big] + \Big(\big[\alpha \mu_1^2 + (1 - \alpha) \mu_2^2\big] - \big[\alpha \mu_1 + (1 - \alpha) \mu_2\big]^2\Big).$$

Consider a Gaussian distributed random variable $X \sim \mathcal{N}(\mu, \Sigma)$. For a given matrix $A$ of appropriate shape, let $Y$ be a random variable such that $y = Ax$ is a transformed version of $x$. Then

$$p(y) = \mathcal{N}\big(y \mid A\mu,\; A \Sigma A^\top\big).$$
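A sketch of the mixture moment formulas against Monte Carlo samples (hypothetical weight and component parameters):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.4
mu1, var1 = -1.0, 0.5
mu2, var2 = 2.0, 1.5

mean_mix = alpha * mu1 + (1 - alpha) * mu2                 # weighted sum of means
var_mix = (alpha * var1 + (1 - alpha) * var2
           + alpha * mu1**2 + (1 - alpha) * mu2**2 - mean_mix**2)

# Monte Carlo check: pick a component per sample, then draw from it
n = 500_000
pick_first = rng.random(n) < alpha
samples = np.where(pick_first,
                   rng.normal(mu1, np.sqrt(var1), n),
                   rng.normal(mu2, np.sqrt(var2), n))
print(mean_mix, samples.mean())  # both approximately 0.8
print(var_mix, samples.var())    # both approximately 3.26
```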

Let us now consider the reverse transformation: when we know that a random variable has a mean that is a linear transformation of another random variable. Let $Y$ be a Gaussian random variable with mean $Ax$, i.e.,

$$p(y) = \mathcal{N}(y \mid Ax,\; \Sigma).$$

If $A$ has full column rank, we can left-multiply $y = Ax$ by the pseudo-inverse $(A^\top A)^{-1} A^\top$ to solve for $x = (A^\top A)^{-1} A^\top y$. Hence, $x$ is a linear transformation of $y$, and we obtain

$$p(x) = \mathcal{N}\big(x \mid (A^\top A)^{-1} A^\top y,\; (A^\top A)^{-1} A^\top \Sigma A (A^\top A)^{-1}\big).$$
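A minimal sketch of this reverse transformation with a hypothetical full-column-rank $A$ and an observed $y$:

```python
import numpy as np

# Hypothetical full-column-rank A mapping x in R^2 to the mean of y in R^3
A = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0]])
Sigma = 0.1 * np.eye(3)              # covariance of y given x
y = np.array([1.0, 2.5, 3.0])        # an observed y

pinv = np.linalg.inv(A.T @ A) @ A.T  # (A^T A)^{-1} A^T
mu_x = pinv @ y                      # mean of p(x)
Sigma_x = pinv @ Sigma @ pinv.T      # covariance of p(x)
print(mu_x)
print(Sigma_x)
```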