Bayesian Linear Regression, Maximum Likelihood and Maximum-a-Posteriori
Bayesian methods allow us to model the relationship between an input and an output while providing a measure of uncertainty, or "how sure we are", based on the data seen so far. Unlike most commonly-used frequentist methods, where the output of the method is a set of best-fit parameters, the output of a Bayesian regression is a probability distribution over each model parameter, called the posterior distribution.
For the sake of comparison, take the example of a simple linear regression $y = w_0 + w_1 x + \varepsilon$: a frequentist (least squares) fit returns a single best-fit line, while the Bayesian treatment returns a distribution over plausible lines, from which we can draw samples and quantify uncertainty.
Apart from uncertainty quantification, another benefit of Bayesian methods is the possibility of online learning, i.e. a continuous update of the trained model (fitted on previously-seen data) by looking only at the new data. This is a handy feature for e.g. datasets that are purged periodically.
Refresher: Normal distribution
A probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment. In this post, the methods will be based solely on the Normal distribution, defined for an input $x$ as

$$ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) $$

for a distribution with mean $\mu$ and standard deviation $\sigma$ in its univariate notation, or

$$ \mathcal{N}(x \mid \mu, \Sigma) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\left( -\tfrac{1}{2} (x-\mu)^\top \Sigma^{-1} (x-\mu) \right) $$

with mean vector $\mu$ and covariance matrix $\Sigma$ in its multivariate (matrix) notation, for $x \in \mathbb{R}^D$.
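As a quick sanity check, the densities above can be evaluated directly with scipy; this is a minimal sketch (the numbers chosen here are arbitrary, not from the post):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# univariate N(x | mu=0, sigma=1)
print(norm.pdf(0.5, loc=0.0, scale=1.0))

# multivariate N(x | mu, Sigma) in two dimensions
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(multivariate_normal.pdf(np.array([0.5, 0.5]), mean=mu, cov=Sigma))
```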
Bayes Theorem
From the field of probability, the product rule tells us that the joint distribution of two events $A$ and $B$ is $p(A, B) = p(B \mid A)\, p(A) = p(A \mid B)\, p(B)$.

In Machine Learning, we are mostly interested in how the parameters $w$ of the model behave given some observed data $\mathcal{D}$, i.e. in the posterior $p(w \mid \mathcal{D})$. Rearranging the product rule gives the Bayes Theorem:

$$ \underbrace{p(w \mid \mathcal{D})}_{\text{posterior}} = \frac{\overbrace{p(\mathcal{D} \mid w)}^{\text{likelihood}} \; \overbrace{p(w)}^{\text{prior}}}{\underbrace{p(\mathcal{D})}_{\text{evidence}}}, \quad \text{where the evidence is } p(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, p(w)\, dw. $$

The idea behind the Bayes Theorem in ML is to invert the relationship between the parameters $w$ and the data $\mathcal{D}$: it lets us go from the likelihood $p(\mathcal{D} \mid w)$, which is easy to write down for a given model, to the posterior $p(w \mid \mathcal{D})$, which is what we actually want.
An important theorem, called the Bernstein-von Mises Theorem, states that:
- for a sufficiently large sample size, the posterior distribution becomes independent of the prior (as long as the prior is neither 0 nor 1);
- i.e. for a sufficiently large dataset, the prior just doesn't matter much anymore, as the current data has enough information;
- if we let the number of datapoints go to infinity, the posterior distribution converges to a normal distribution whose mean is the maximum likelihood estimator;
- this is a restatement of the central limit theorem, where the posterior distribution becomes dominated by the likelihood function;
- i.e. the effect of the prior decreases as the amount of data increases;
- we typically want to maximize the posterior, so we can drop the evidence term $p(\mathcal{D})$ (as it is independent of $w$) and compute only the maximum of the term $p(\mathcal{D} \mid w)\, p(w)$, as sketched below.
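To make the last bullet concrete, here is a small illustrative sketch, assuming a toy 1-D problem (estimating the mean of Gaussian data with known noise): the unnormalized posterior is just likelihood times prior evaluated on a grid, and the evidence only re-appears as a final normalization constant.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=20)    # observations with unknown mean

theta = np.linspace(-2.0, 6.0, 1001)              # grid over the unknown mean
prior = norm.pdf(theta, loc=0.0, scale=10.0)      # broad Gaussian prior on theta
# log-likelihood of the whole dataset for each candidate theta (known noise std = 1)
log_lik = np.array([norm.logpdf(data, loc=t, scale=1.0).sum() for t in theta])

unnorm_post = np.exp(log_lik - log_lik.max()) * prior                   # evidence dropped
posterior = unnorm_post / (unnorm_post.sum() * (theta[1] - theta[0]))   # normalize on grid
print("grid MAP estimate:", theta[np.argmax(posterior)])
print("sample mean:      ", data.mean())
```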
There are two main optimization problems that we discuss in Bayesian methods: Maximum Likelihood Estimator (MLE) and Maximum-a-Posteriori (MAP).
Maximum Likelihood Estimator (MLE)
(In case the following reductions cause some confusion, check the appendix at the bottom of the page for the rules used, or The Matrix Cookbook for a more detailed overview of matrix operations.)
When we try to find how likely it is for an output $y$ to be observed given an input $x$ and model parameters $w$, we are computing the likelihood $p(y \mid x, w)$; the Maximum Likelihood Estimator picks the parameters $w$ that maximize it.
Let us interpret what the probability density $p(y \mid x, w)$ is modeling: for a fixed parameter setting $w$, it is a distribution over the possible outputs $y$ for a given input $x$; viewed instead as a function of $w$ with the data fixed, it is the likelihood that we want to maximize.
When the distributions of the prior and posterior are computationally tractable, the optimization of the parameters that define the distribution can be performed using the coordinate ascent method, an iterative method covered in another post. In practice, we iteratively take partial derivatives of the prior/posterior with respect to the model parameters (e.g. the mean $\mu$ and the variance $\sigma^2$), set them to zero, and solve.
To simplify the computation, we perform the log-trick and place the term to optimize inside a log function. We can do this because the logarithm is a strictly monotonically increasing function, so it does not change the location of the optimum, and because $\log \prod_n f_n = \sum_n \log f_n$.
I.e. we turn a log of products into a sum of logs. In the linear regression model $y = x^\top w + \varepsilon$, the likelihood is Gaussian, due to the Gaussian noise term $\varepsilon \sim \mathcal{N}(0, \sigma^2)$:

$$ p(y \mid X, w) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n^\top w, \sigma^2) \quad\Longrightarrow\quad -\log p(y \mid X, w) = \frac{1}{2\sigma^2} \| y - Xw \|^2 + \text{const}, $$

where $X \in \mathbb{R}^{N \times D}$ is the design matrix whose rows are the inputs $x_n^\top$, and $y \in \mathbb{R}^N$ is the vector of corresponding observations.

The second trick here is that the previous equation is quadratic in $w$: setting its gradient with respect to $w$ to zero gives the closed-form solution

$$ w_{ML} = (X^\top X)^{-1} X^\top y. $$

Note that this solution is independent of the noise variance $\sigma^2$.
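A minimal numerical sketch of this closed-form solution, assuming synthetic data and a design matrix with a leading column of ones for the bias (the variable names are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
# design matrix: a column of ones (bias) plus two random features
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.3, size=N)    # targets with Gaussian noise

# w_ML = (X^T X)^{-1} X^T y, computed with a linear solve for numerical stability
w_ml = np.linalg.solve(X.T @ X, X.T @ y)
print("estimated weights:", w_ml.round(3))
```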
Adding regularization
Regularizers can be added exactly as in the non-Bayesian regression and may also have an analytical solution. As an example, if we want to add an $L_2$ (ridge) penalty $\lambda \|w\|^2$ to the loss, the closed-form solution becomes

$$ w = (X^\top X + \lambda I)^{-1} X^\top y, $$

where $\lambda \geq 0$ is the regularization strength and $I$ is the identity matrix.
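A short sketch of this regularized closed-form, reusing the X/y conventions above (the helper name `ridge_weights` is just illustrative):

```python
import numpy as np

def ridge_weights(X, y, lam=1.0):
    """Closed-form L2-regularized least squares: (X^T X + lam*I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# e.g. ridge_weights(X, y, lam=0.1) with the X, y from the previous sketch
```

Note that for $\lambda = 0$ this reduces to the unregularized maximum likelihood solution.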
Estimating noise variance
So far we assumed the noise variance $\sigma^2$ was known. We can, however, use the same maximum likelihood principle to estimate it, by treating $\sigma^2$ as an additional parameter of the negative log-likelihood. The partial derivative of the loss with respect to $\sigma^2$, set to zero, yields

$$ \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (y_n - x_n^\top w)^2, $$

i.e. the maximum likelihood estimate of the noise variance is the mean of the squared residuals.
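In code, assuming the maximum likelihood weights have already been computed as above, this estimate is simply the mean squared residual (helper name is illustrative):

```python
import numpy as np

def noise_variance_ml(X, y, w):
    """Maximum likelihood noise variance: mean of the squared residuals."""
    residuals = y - X @ w
    return np.mean(residuals ** 2)
```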
Maximum-a-Posteriori (MAP)
Maximum likelihood without a regularizer is prone to overfitting (details in section 9.2.2 of the Mathematics for Machine Learning book). In the occurrence of overfitting, we run into very large parameter values. To mitigate the effect of huge values, we can place a prior on the parameter space and seek the parameters that maximize the posterior distribution instead.
If we have prior knowledge about the distribution of the parameters, we can multiply an additional term with the likelihood. This additional term is a prior probability distribution on the parameters, $p(w)$. For a given prior, after observing some data $X, y$, we look for the parameters that maximize the posterior $p(w \mid X, y)$ instead of the likelihood alone.
This is called the Maximum-a-Posteriori estimation (MAP), and is obtained by applying the Bayes Theorem.
The problem at hand is to find the parameters of the distribution that best represent the data. Adapting the Bayes Theorem equation above to the regression setting, the posterior over the weights is

$$ p(w \mid X, y) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)} \propto p(y \mid X, w)\, p(w). $$
The computation steps are similar to the log-trick applied to the MLE use case. The log-posterior is then:

$$ \log p(w \mid X, y) = \log p(y \mid X, w) + \log p(w) + \text{const}. $$

In practice, this is the sum of the log-likelihood $\log p(y \mid X, w)$ and the log-prior $\log p(w)$, plus a constant that does not depend on $w$, so maximizing the posterior is a compromise between fitting the data and staying close to the prior.
Computing the minimum value of the negative log-posterior, with a zero-mean Gaussian prior $p(w) = \mathcal{N}(0, b^2 I)$, we set the gradient with respect to $w$ to zero and obtain

$$ w_{MAP} = \left( X^\top X + \frac{\sigma^2}{b^2} I \right)^{-1} X^\top y. $$

Now we see that the only difference between the weights estimated using MAP ($w_{MAP}$) and using MLE ($w_{ML}$) is the additional term $\frac{\sigma^2}{b^2} I$ inside the inverse, which plays exactly the role of the $L_2$ regularizer above (with $\lambda = \sigma^2 / b^2$).
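A minimal sketch of this MAP solution, assuming a zero-mean isotropic Gaussian prior with variance $b^2$ and known noise variance $\sigma^2$ (names are illustrative):

```python
import numpy as np

def map_weights(X, y, sigma2, b2):
    """MAP weights for prior N(0, b2*I) and noise variance sigma2."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + (sigma2 / b2) * np.eye(D), X.T @ y)
```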
Closed-form solution for Bayesian Linear Regression
Instead of computing a point estimate via MLE or MAP, a special case of Bayesian optimization is linear regression with Normal prior and posterior. In this case, the posterior has an analytical solution. This approach is used very commonly, mainly because the result is a full distribution rather than a point estimate, and because the analytical solution for the posterior is extremely fast to compute, even for very large datasets and dimensionality.
In brief, Bayesian linear regression is a type of conditional modelling in which the mean of one variable is described by a linear combination of other variables, with the goal of obtaining the posterior probability of the regression coefficients (as well as other parameters describing the distribution of the regressand), ultimately allowing the out-of-sample prediction of the regressand (often labelled $y$) conditional on observed values of the regressors (source: Wikipedia page for Bayesian Linear Regression). I.e. Bayesian linear regression pushes the idea of the parameter prior a step further and does not even attempt to compute a point estimate of the parameters; instead, the full posterior distribution over the parameters is taken into account when making predictions. This means we do not fit any single set of parameters, but rather average over all plausible parameter settings (according to the posterior).
We assume all our weights are drawn from a Gaussian distribution and can be independent (if the covariance matrix is diagonal) or not (otherwise). In practice, we start with a prior mean $m_0$ and a prior covariance $S_0$.

We have then the following prior in multivariate notation:

$$ p(w) = \mathcal{N}(w \mid m_0, S_0). $$

And the likelihood:

$$ p(y \mid X, w) = \mathcal{N}(y \mid Xw, \sigma^2 I). $$
Going back to the Bayes Theorem equation, the posterior over the weights is

$$ p(w \mid X, y) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)}. $$

We now apply the log-trick and factorize the expression, so that we can isolate the quadratic, linear and independent terms in $w$:

$$ \log p(w \mid X, y) = -\frac{1}{2\sigma^2} (y - Xw)^\top (y - Xw) - \frac{1}{2} (w - m_0)^\top S_0^{-1} (w - m_0) + \text{const}. $$

As a side note: looking at the last term, we can see that this expression is quadratic in $w$, therefore the (unnormalized) posterior is itself a Gaussian in $w$. Expanding and collecting terms gives

$$ \log p(w \mid X, y) = -\frac{1}{2} w^\top \left( \frac{1}{\sigma^2} X^\top X + S_0^{-1} \right) w + \left( \frac{1}{\sigma^2} X^\top y + S_0^{-1} m_0 \right)^\top w + \text{const}, $$

where we used the linear and quadratic terms in $w$ and absorbed everything independent of $w$ into the constant.
Going back to the main problem, we now need to bring this unnormalized Gaussian into a form proportional to $\mathcal{N}(w \mid m_N, S_N)$, i.e. identify the posterior mean $m_N$ and covariance $S_N$ by completing the square.

Thus, by matching the squared ($w^\top \cdot\, w$) terms we obtain

$$ S_N^{-1} = S_0^{-1} + \frac{1}{\sigma^2} X^\top X, $$

and by matching the linear terms,

$$ m_N = S_N \left( S_0^{-1} m_0 + \frac{1}{\sigma^2} X^\top y \right), $$

in line with equations 3.50 and 3.51 in Chris Bishop's PRML book.
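These two update equations translate directly into a few lines of numpy; this is a sketch assuming a design matrix `X`, targets `y`, prior $\mathcal{N}(m_0, S_0)$ and known noise variance `sigma2`:

```python
import numpy as np

def posterior(X, y, m0, S0, sigma2):
    """Posterior N(m_N, S_N) over the weights for a Gaussian prior N(m0, S0)."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + (X.T @ X) / sigma2)   # S_N from the covariance update
    mN = SN @ (S0_inv @ m0 + (X.T @ y) / sigma2)      # m_N from the mean update
    return mN, SN
```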
Online learning
Online learning allows us to do iterative learning by continuously updating our posterior based on newly observed data. The main principle is that, following the Bayes Theorem, the posterior computed at one step becomes the prior of the next step: at every iteration we only need to pass through the newly-arrived data, since the previously-computed posterior already carries all the information from the data seen so far.
illustration of four steps of online learning for the linear model
We start with the prior belief that both weights ($w_0$ and $w_1$, the intercept and slope of the linear model) are drawn from a zero-mean Gaussian. As new datapoints are observed, the posterior over the weights contracts around their true values, and lines drawn from the posterior increasingly agree with the data.
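A minimal sketch of this principle, assuming a toy two-weight linear model and four batches of synthetic data: the posterior computed from one batch is fed back in as the prior for the next.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 0.25
true_w = np.array([-0.3, 0.5])
m, S = np.zeros(2), np.eye(2)              # start from the prior N(0, I) over (w0, w1)

for step in range(4):                      # four rounds of fresh data
    X = np.column_stack([np.ones(10), rng.uniform(-1, 1, size=10)])
    y = X @ true_w + rng.normal(scale=np.sqrt(sigma2), size=10)
    # Bayesian update: the current posterior becomes the prior of the next step
    S_prior_inv = np.linalg.inv(S)
    S = np.linalg.inv(S_prior_inv + (X.T @ X) / sigma2)
    m = S @ (S_prior_inv @ m + (X.T @ y) / sigma2)
    print(f"step {step}: mean={m.round(3)}, var={np.diag(S).round(3)}")
```

The posterior variance shrinks at every step, reproducing the behaviour shown in the four-panel illustration above.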
Predictive Variance
Remember that we wanted to model a noisy output defined by $y = x^\top w + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise.
We can then compute the expectation of the output for a new input $x$, averaging over the posterior of the weights:

$$ \mathbb{E}[y] = \mathbb{E}[x^\top w + \varepsilon] = x^\top m_N, $$

i.e. because the noise has mean zero, the result is equivalent to a linear regression where the weights are set to the closed-form solution of their posterior mean. The variance of the prediction is

$$ \operatorname{Var}[y] = x^\top S_N\, x + \sigma^2, $$

the sum of the uncertainty coming from the weights and the observation noise.
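In code, the predictive mean and variance follow directly from the posterior moments (a sketch, assuming `mN`, `SN` and `sigma2` have been computed as above):

```python
import numpy as np

def predict(x, mN, SN, sigma2):
    """Predictive mean and variance at a single input x."""
    mean = x @ mN
    var = x @ SN @ x + sigma2    # weight uncertainty + observation noise
    return mean, var
```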
Similarly to the visualization displayed before, introducing new datapoints improves the accuracy of our model:
Illustration of four steps of modelling synthetic sinusoidal data. Pictures show the blue ground truth, the red model mean approximation, and the light-orange area of the predictive variance (inspired by fig. 3.8, Pattern Recognition and Machine Learning, Chris Bishop).
Comparing two Bayesian models
If we want to compare two probabilistic models $M_1$ and $M_2$ given the observed data $\mathcal{D}$, we apply the Bayes Theorem once more and compare their posterior probabilities via the posterior odds:

$$ \frac{p(M_1 \mid \mathcal{D})}{p(M_2 \mid \mathcal{D})} = \frac{p(M_1)}{p(M_2)} \cdot \frac{p(\mathcal{D} \mid M_1)}{p(\mathcal{D} \mid M_2)}. $$

If we assume the same prior for both models, then the equation simplifies to the ratio of the evidences (marginal likelihoods),

$$ \frac{p(\mathcal{D} \mid M_1)}{p(\mathcal{D} \mid M_2)}, $$

known as the Bayes factor.
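As a sketch of how this comparison can be carried out in the conjugate Gaussian case (an assumed toy example, not from the post): the evidence of each model is the marginal likelihood $p(y \mid X) = \mathcal{N}(y \mid X m_0,\; \sigma^2 I + X S_0 X^\top)$, so two feature sets can be compared by their log-evidence.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(X, y, sigma2=0.1):
    """Log marginal likelihood of y for a standard-normal prior N(0, I) on the weights."""
    N, D = X.shape
    m0, S0 = np.zeros(D), np.eye(D)
    cov = sigma2 * np.eye(N) + X @ S0 @ X.T
    return multivariate_normal.logpdf(y, mean=X @ m0, cov=cov)

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.2, size=30)      # data is genuinely linear

M1 = np.column_stack([np.ones_like(x), x])               # linear features
M2 = np.column_stack([np.ones_like(x), x, x**2, x**3])   # cubic features
print("log Bayes factor (M1 vs M2):", log_evidence(M1, y) - log_evidence(M2, y))
```

With the data genuinely linear, the simpler model typically obtains the higher evidence, illustrating the Occam's-razor effect of the marginal likelihood.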
Final Remarks
For more advanced topics on Bayesian Linear Regression refer to chapter 3 of the Pattern Recognition and Machine Learning book by Chris Bishop and chapter 9.3 of Mathematics for Machine Learning, both available on the resources page. To download the source code of the closed-form solutions and reproduce the examples above, download the Bayesian Linear Regression notebook.
Finally, it has been shown that a kernelized Bayesian Linear Regression, with the kernel given by the prior covariance of the weights ($k(x, x') = x^\top S_0\, x'$ in the notation above), is equivalent to a Gaussian Process.
Misc: linear models of the exponential family
For a setup where the output is non-Gaussian but still a member of the exponential family, check Generalized Linear Models.
Appendix: refresher on Linear Algebra
Here is the list of algebraic rules used in this document. For further details, check The Matrix Cookbook for a more complete list of matrix operations.
- Multiplication:
  - $A(BC) = (AB)C$; $A(B + C) = AB + AC$; $(A + B)C = AC + BC$; $c(AB) = (cA)B = A(cB)$ for a scalar $c$; $I_m A = A I_n = A$; in general, $AB \neq BA$.
- Transpose:
  - $(A^\top)^\top = A$; $(A + B)^\top = A^\top + B^\top$; $(cA)^\top = c A^\top$ for a scalar $c$; $(AB)^\top = B^\top A^\top$; $(A^{-1})^\top = (A^\top)^{-1}$.
- Division:
  - if $ax = b$, then $x = a^{-1} b$, for a scalar $a \neq 0$;
  - if $Ax = b$, then $a_i^\top x = b_i$ is the system of linear equations for row $i$, repeated for every row;
  - therefore, $x = A^{-1} b$, if matrix $A$ has an inverse.
- Inverse:
  - $A A^{-1} = A^{-1} A = I$;
  - If $A$ is invertible, its inverse is unique;
  - If $A$ is invertible, then $Ax = b$ has the unique solution $x = A^{-1} b$;
  - If $A$ and $B$ are invertible, $(AB)^{-1} = B^{-1} A^{-1}$; $(cA)^{-1} = \frac{1}{c} A^{-1}$ for a scalar $c \neq 0$.
- First Order derivatives:
  - $\frac{\partial}{\partial x} (a^\top x) = a$; $\frac{\partial}{\partial x} (x^\top A x) = (A + A^\top) x$; $\frac{\partial}{\partial x} (Ax) = A$; $\frac{\partial}{\partial x} \|x - a\|^2 = 2(x - a)$.
- Variance and Covariance:
  - $\operatorname{Var}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$; $\operatorname{Var}[aX] = a^2 \operatorname{Var}[X]$, for scalar $a$; $\operatorname{Var}[X + Y] = \operatorname{Var}[X] + \operatorname{Var}[Y] + 2\operatorname{Cov}[X, Y]$; $\operatorname{Var}[X + Y] = \operatorname{Var}[X] + \operatorname{Var}[Y]$, if X and Y are independent (i.e., zero covariance); $\operatorname{Cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$;
  - If $Y$ is a linear transformation of $X$, of univariate form $Y = aX + b$ or multivariate form $Y = AX + b$, then:
    - Variance in univariate form: $\operatorname{Var}[Y] = a^2 \operatorname{Var}[X]$;
    - Variance in multivariate form: $\operatorname{Var}[Y] = A \Sigma A^\top$, where $\Sigma$ is the covariance matrix of $X$;
    - Expected Value in multivariate form: $\mathbb{E}[Y] = A\mu + b$, where $\mu$ is the mean of $X$ (formulas 6.50 and 6.51 in Mathematics for Machine Learning).
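As a quick Monte Carlo sanity check of the last two multivariate rules (not part of the original appendix; the matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
b = np.array([0.5, -0.5])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b                                   # Y = A X + b, sample by sample
print(np.allclose(Y.mean(axis=0), A @ mu + b, atol=0.05))      # E[Y] = A mu + b
print(np.allclose(np.cov(Y.T), A @ Sigma @ A.T, rtol=0.05))    # Var[Y] = A Sigma A^T
```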