Exponential Family of Distributions
In the previous post, we saw that computing the Maximum Likelihood estimator and the Maximum-a-Posteriori estimator on a normally-distributed set of parameters becomes much easier once we apply the log-trick. The rationale is that, since the logarithm is a monotonically increasing function, the maximizer of the likelihood and of the log-likelihood coincide, and taking the logarithm turns products of exponentials into simple sums.
However, this is not a property of the Gaussian distribution only. In fact, most common distributions, including the exponential, log-normal, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, geometric, inverse Gaussian, von Mises and von Mises-Fisher distributions, can be represented in a similar syntax, making them equally simple to compute with. The set of such distributions is called the Exponential Family of Distributions, and we will discuss it next.
Detour: relationship between common probability distributions
Probability distributions describe the probabilities of each outcome, with the common property that the probabilities of all events add up to 1. They can also be classified into two subsets: those described by a probability mass function, if defined over discrete values, or by a probability density function, if defined over some continuous interval. There are dozens (hundreds?) of different distributions, even though only about 15 of them are mentioned and used often, and these have well-known relationships among themselves:
15 most common probability distributions and their relationships. (source: the post Common probability distributions by Sean Owen)
A brief summary of their relationships follows. For more details, check the original post by Sean Owen:
- Bernoulli and Uniform: the uniform distribution yields equal probability to each discrete outcome, e.g. a coin toss or a dice roll; the Bernoulli yields unequal probabilities to two discrete outcomes, $p$ and $1-p$, e.g. an unfair coin toss;
- Binomial and Hypergeometric: the binomial can be seen as the probability of the sum of outcomes of what follows a Bernoulli distribution, e.g. rolling a dice 30 times, what is the probability that we get the outcome six? This count follows the binomial distribution, with parameters $n$ trials and $p$ probability of success (à la Bernoulli);
- Poisson and Binomial: like the binomial distribution, the Poisson distribution is a distribution of a count, i.e. the number of times some event happened over a discrete time interval, given a rate for the event to occur. It is parametrized by $\lambda = np$ (the $n$ and $p$ parameters of the binomial), as the sketch after this list illustrates;
- Geometric and Negative Binomial: while in the binomial we count the number of times the probability succeeds in yielding a given event over a number of trials, in the geometric distribution we count how many negative trials occur until our event happens; the negative binomial distribution is a simple generalization of the geometric, measuring the number of failures until $r$ successes have occurred, not just 1;
- Exponential and Weibull: the exponential distribution is the geometric on a continuous interval, parametrized by a rate $\lambda$, like the Poisson. While it describes the "time until event or failure" at a constant rate, the Weibull distribution models increases or decreases of the rate of failures over time (i.e. it models time-to-failure);
- Normal, Log-Normal, Student's t, and Chi-squared: if we take a set of values following the same (any) distribution and sum them, that sum of values approximately follows the normal distribution; this holds regardless of the underlying distribution, a phenomenon called the Central Limit Theorem. The log-normal distribution describes variables whose logarithm is normally distributed; equivalently, the exponentiation of a normally-distributed variable is log-normally distributed. Student's t-distribution is a normal distribution with fatter tails, although it approaches the normal distribution as its degrees-of-freedom parameter increases. The chi-squared distribution is the distribution of sums of squares of normally-distributed values;
- Gamma and Beta: the gamma distribution is a generalization of the exponential and the chi-squared distributions. Like the exponential distribution, it is used to model waiting times, e.g. the time until the next $n$ events occur. It appears in machine learning as the conjugate prior to some distributions. The beta distribution is the conjugate prior to most of the other distributions mentioned here.
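As a quick sanity check of the binomial-Poisson relationship above, the following sketch (with illustrative values $n = 1000$ and $p = 0.003$, using scipy) compares the two probability mass functions at a few counts:

```python
# For large n and small p, Binomial(n, p) is well approximated by
# Poisson(lambda = n*p). The values n=1000, p=0.003 are illustrative.
from scipy.stats import binom, poisson

n, p = 1000, 0.003
lam = n * p  # Poisson rate matching the binomial mean

for k in range(8):
    print(f"k={k}: binomial={binom.pmf(k, n, p):.5f}  poisson={poisson.pmf(k, lam):.5f}")
```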
Exponential Family of Distributions
The exponential family of distributions is the set of distributions whose probability density (or mass) function can be written in the form

$$ p(x \mid \eta) = h(x) \exp\left( \eta^\top T(x) - A(\eta) \right) $$

or in a more extensive notation:

$$ p(x \mid \eta) = \frac{1}{Z(\eta)}\, h(x)\, \exp\left( \eta^\top T(x) \right), \qquad Z(\eta) = e^{A(\eta)} = \int h(x) \exp\left( \eta^\top T(x) \right) dx $$

where $\eta$ is the natural parameter, $T(x)$ is the sufficient statistic, $A(\eta)$ is the log-partition function and $h(x)$ is the base measure.
The term $T(x)$ is the sufficient statistic: a function of the data that holds all the information the dataset provides about the parameters.
- The intuitive notion of sufficiency is that $T(X)$ is sufficient for $\theta$ if there is no information in $X$ regarding $\theta$ beyond that in $T(X)$. That is, having observed $T(X)$, we can throw away $X$ for the purposes of inference with respect to $\theta$;
- Moreover, this means that the likelihood ratio is the same for any two datasets $x$ and $y$ with the same sufficient statistic, i.e. if $T(x) = T(y)$, then $\frac{p(x \mid \theta_1)}{p(x \mid \theta_2)} = \frac{p(y \mid \theta_1)}{p(y \mid \theta_2)}$, as the short numeric check after this list illustrates.
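A small numerical illustration of this property, with two hypothetical Bernoulli datasets that share the same sufficient statistic $\sum_n x_n$:

```python
# Two Bernoulli datasets with the same sufficient statistic sum(x)
# yield the same likelihood ratio for any pair of parameter values.
import numpy as np

def bernoulli_likelihood(data, theta):
    data = np.asarray(data)
    return np.prod(theta ** data * (1 - theta) ** (1 - data))

x = [1, 0, 0, 1, 1, 0]  # T(x) = 3
y = [0, 1, 1, 0, 0, 1]  # T(y) = 3, different ordering
theta1, theta2 = 0.3, 0.7

ratio_x = bernoulli_likelihood(x, theta1) / bernoulli_likelihood(x, theta2)
ratio_y = bernoulli_likelihood(y, theta1) / bernoulli_likelihood(y, theta2)
print(ratio_x, ratio_y)  # identical, since T(x) == T(y)
```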
The term $\eta$ is the natural parameter of the distribution; the set of values of $\eta$ for which the density is finite and normalizable is called the natural parameter space.
The term $A(\eta)$ is the log-partition function: the logarithm of the normalization constant that makes the distribution integrate (or sum) to one,

$$ A(\eta) = \log \int h(x) \exp\left( \eta^\top T(x) \right) dx. $$
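To make the general form concrete, here is a minimal sketch (the function name and structure are our own, not a library API) that evaluates $p(x \mid \eta) = h(x)\exp\left(\eta\, T(x) - A(\eta)\right)$ for a scalar natural parameter, instantiated for the Poisson distribution:

```python
import math

def exp_family_pmf(x, eta, T, A, h):
    """Evaluate h(x) * exp(eta * T(x) - A(eta)) for a scalar natural parameter."""
    return h(x) * math.exp(eta * T(x) - A(eta))

lam = 4.0

def poisson_pmf(x):
    return exp_family_pmf(
        x,
        eta=math.log(lam),                  # natural parameter: log(lambda)
        T=lambda v: v,                      # sufficient statistic: x itself
        A=math.exp,                         # log-partition: A(eta) = e^eta = lambda
        h=lambda v: 1 / math.factorial(v),  # base measure: 1/x!
    )

print(poisson_pmf(3))  # matches 4**3 * exp(-4) / 3! ≈ 0.1954
```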
Another important point is that the mean and variance of $T(x)$ can be obtained by differentiating the log-partition function $A(\eta)$. The first derivative yields the mean:

$$ \frac{\partial A(\eta)}{\partial \eta} = \mathbb{E}\left[ T(x) \right] $$

For the complete dataset $X = \{x_1, \ldots, x_N\}$, the joint likelihood is itself a member of the exponential family, with sufficient statistic $\sum_{n=1}^{N} T(x_n)$ and log-partition $N A(\eta)$:

$$ p(X \mid \eta) = \left( \prod_{n=1}^{N} h(x_n) \right) \exp\left( \eta^\top \sum_{n=1}^{N} T(x_n) - N A(\eta) \right) $$

and as expected the second derivative is equal to the variance of $T(x)$:

$$ \frac{\partial^2 A(\eta)}{\partial \eta^2} = \operatorname{Var}\left[ T(x) \right] $$
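As a small symbolic check of these identities, consider the Poisson case, whose log-partition is $A(\eta) = e^\eta$ with $\eta = \log\lambda$; both derivatives should recover the rate $\lambda$, which is indeed both the mean and the variance of a Poisson variable:

```python
import sympy as sp

eta = sp.Symbol("eta")
A = sp.exp(eta)                # Poisson log-partition, with eta = log(lambda)

mean = sp.diff(A, eta)         # E[T(x)]: exp(eta) = lambda
variance = sp.diff(A, eta, 2)  # Var[T(x)]: exp(eta) = lambda
print(mean, variance)          # both print exp(eta), i.e. the rate lambda
```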
One requirement of the exponential family distributions is that the parameters must factorize (i.e. must be separable into products, each of which involves only one type of variable), appearing as either the power or the base of an exponentiation operation. I.e. the factors must be one of the following:

$$ f(x), \quad g(\theta), \quad c^{f(x)}, \quad c^{g(\theta)}, \quad f(x)^c, \quad g(\theta)^c, \quad f(x)^{g(\theta)}, \quad g(\theta)^{f(x)} $$

where $c$ is a constant, $f$ is a function of the data only, and $g$ is a function of the parameters only.
Another important point is that a product of two exponential-family distributions is as well part of the exponential family, but unnormalized: for two members with a common sufficient statistic $T(x)$,

$$ p_1(x \mid \eta_1)\, p_2(x \mid \eta_2) = h_1(x)\, h_2(x) \exp\left( (\eta_1 + \eta_2)^\top T(x) - A_1(\eta_1) - A_2(\eta_2) \right) $$
Finally, exponential family distributions have conjugate priors (i.e. the prior and the posterior are distributions from the exponential family), and the posterior predictive distribution always has a closed-form solution (provided that the normalizing factor can also be stated in closed form), both important properties for Bayesian statistics.
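As an illustration of conjugacy for one member of the family (a Beta prior on a Bernoulli likelihood, with illustrative hyperparameters and data), the posterior is again a Beta whose hyperparameters are updated only through the sufficient statistics:

```python
# Beta-Bernoulli conjugate update: posterior hyperparameters are the
# prior hyperparameters plus the counts of successes and failures.
import numpy as np

alpha, beta = 2.0, 2.0                   # Beta prior hyperparameters (illustrative)
data = np.array([1, 0, 1, 1, 0, 1, 1])   # Bernoulli observations (illustrative)

successes = data.sum()                   # sufficient statistic sum(x)
failures = len(data) - successes

alpha_post = alpha + successes           # closed-form posterior update
beta_post = beta + failures
print(f"posterior: Beta({alpha_post}, {beta_post})")
```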
Example: Univariate Gaussian distribution
The univariate Gaussian distribution is defined for an input $x \in \mathbb{R}$ as:

$$ p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) $$

for a distribution with mean $\mu$ and variance $\sigma^2$. Expanding the square and collecting terms in the exponent puts it in the exponential-family form:

$$ p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}} \exp\left( \frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \frac{\mu^2}{2\sigma^2} - \log\sigma \right) $$
where:
$\eta = \left[ \frac{\mu}{\sigma^2},\; -\frac{1}{2\sigma^2} \right]^\top$, $\quad T(x) = \left[ x,\; x^2 \right]^\top$, $\quad A(\eta) = \frac{\mu^2}{2\sigma^2} + \log\sigma = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)$, $\quad h(x) = \frac{1}{\sqrt{2\pi}}$.
We will now use the first and second derivative of $A(\eta)$ to recover the mean and variance of the distribution. Differentiating with respect to $\eta_1$:

$$ \frac{\partial A(\eta)}{\partial \eta_1} = -\frac{\eta_1}{2\eta_2} = \mu $$

which is the mean of $x$, by definition. Taking the second derivative:

$$ \frac{\partial^2 A(\eta)}{\partial \eta_1^2} = -\frac{1}{2\eta_2} = \sigma^2 $$

which is the variance of our normal distribution, as expected.
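These two derivatives can be checked symbolically; the following sketch uses sympy to differentiate $A(\eta)$ and then substitute the natural parameters back in terms of $\mu$ and $\sigma$:

```python
import sympy as sp

eta1, eta2 = sp.symbols("eta1 eta2")
mu, sigma = sp.symbols("mu sigma", positive=True)

# Log-partition in natural parameters, as derived above
A = -eta1**2 / (4 * eta2) - sp.Rational(1, 2) * sp.log(-2 * eta2)

# Substitute eta1 = mu/sigma^2 and eta2 = -1/(2 sigma^2) after differentiating
subs = {eta1: mu / sigma**2, eta2: -1 / (2 * sigma**2)}
print(sp.simplify(sp.diff(A, eta1).subs(subs)))     # -> mu
print(sp.simplify(sp.diff(A, eta1, 2).subs(subs)))  # -> sigma**2
```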
Example: Bernoulli distribution
Similarly, to compute the exponential family parameters of the Bernoulli distribution, we rewrite its probability mass function as:

$$ p(x \mid \pi) = \pi^x (1-\pi)^{1-x} = \exp\left( x \log\frac{\pi}{1-\pi} + \log(1-\pi) \right) $$
where:
$\eta = \log\frac{\pi}{1-\pi}$, $\quad T(x) = x$, $\quad A(\eta) = -\log(1-\pi) = \log\left( 1 + e^\eta \right)$, $\quad h(x) = 1$.
We now compute the mean of $T(x)$ by taking the first derivative of the log-partition function:

$$ \frac{dA(\eta)}{d\eta} = \frac{e^\eta}{1+e^\eta} = \pi $$

which is the mean of a Bernoulli variable. Taking a second derivative yields:

$$ \frac{d^2A(\eta)}{d\eta^2} = \frac{e^\eta}{(1+e^\eta)^2} = \pi(1-\pi) $$
which is the variance of a Bernoulli variable.
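A quick numeric check (with the illustrative choice $\pi = 0.3$) confirms that this exponential-family form reproduces the standard Bernoulli mass function:

```python
# Compare pi^x (1-pi)^(1-x) with h(x) * exp(eta*x - A(eta)), h(x) = 1.
import math

pi = 0.3
eta = math.log(pi / (1 - pi))    # natural parameter
A = math.log(1 + math.exp(eta))  # log-partition

for x in (0, 1):
    standard = pi**x * (1 - pi) ** (1 - x)
    exp_family = math.exp(eta * x - A)
    print(x, standard, exp_family)  # identical values
```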
Parameters for common distributions
The following table provides a summary of the most common distributions in the exponential family and their exponential-family parameters. For a more exhaustive list, check the Wikipedia entry for the Exponential Family.
| Distribution | Natural parameter(s) $\eta$ | Inverse parameter mapping | Base measure $h(x)$ | Sufficient statistic $T(x)$ | Log-partition $A(\eta)$ | Log-partition $A(\theta)$ |
|---|---|---|---|---|---|---|
| Bernoulli ($p$) | $\log\frac{p}{1-p}$ | $p = \frac{1}{1+e^{-\eta}}$ | $1$ | $x$ | $\log\left(1+e^\eta\right)$ | $-\log(1-p)$ |
| binomial (known number of trials $n$) | $\log\frac{p}{1-p}$ | $p = \frac{1}{1+e^{-\eta}}$ | $\binom{n}{x}$ | $x$ | $n\log\left(1+e^\eta\right)$ | $-n\log(1-p)$ |
| Poisson ($\lambda$) | $\log\lambda$ | $\lambda = e^\eta$ | $\frac{1}{x!}$ | $x$ | $e^\eta$ | $\lambda$ |
| negative binomial (known number of failures $r$) | $\log p$ | $p = e^\eta$ | $\binom{x+r-1}{x}$ | $x$ | $-r\log\left(1-e^\eta\right)$ | $-r\log(1-p)$ |
| exponential ($\lambda$) | $-\lambda$ | $\lambda = -\eta$ | $1$ | $x$ | $-\log(-\eta)$ | $-\log\lambda$ |
| Pareto (known minimum $x_m$) | $-\alpha-1$ | $\alpha = -1-\eta$ | $1$ | $\log x$ | $-\log(-1-\eta) + (1+\eta)\log x_m$ | $-\log\alpha - \alpha\log x_m$ |
| Laplace (known mean $\mu$) | $-\frac{1}{b}$ | $b = -\frac{1}{\eta}$ | $1$ | $\lvert x-\mu\rvert$ | $\log\left(-\frac{2}{\eta}\right)$ | $\log(2b)$ |
| normal (known variance $\sigma^2$) | $\frac{\mu}{\sigma}$ | $\mu = \sigma\eta$ | $\frac{e^{-x^2/(2\sigma^2)}}{\sqrt{2\pi}\,\sigma}$ | $\frac{x}{\sigma}$ | $\frac{\eta^2}{2}$ | $\frac{\mu^2}{2\sigma^2}$ |
| normal ($\mu, \sigma^2$) | $\left[\frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2}\right]$ | $\mu = -\frac{\eta_1}{2\eta_2}$, $\sigma^2 = -\frac{1}{2\eta_2}$ | $\frac{1}{\sqrt{2\pi}}$ | $\left[x,\, x^2\right]$ | $-\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)$ | $\frac{\mu^2}{2\sigma^2} + \log\sigma$ |
| lognormal ($\mu, \sigma^2$) | as normal | as normal | $\frac{1}{\sqrt{2\pi}\,x}$ | $\left[\log x,\, (\log x)^2\right]$ | as normal | as normal |
| gamma (shape $\alpha$, rate $\beta$) | $\left[\alpha-1,\, -\beta\right]$ | $\alpha = \eta_1+1$, $\beta = -\eta_2$ | $1$ | $\left[\log x,\, x\right]$ | $\log\Gamma(\eta_1+1) - (\eta_1+1)\log(-\eta_2)$ | $\log\Gamma(\alpha) - \alpha\log\beta$ |
| gamma (shape $k$, scale $\theta$) | $\left[k-1,\, -\frac{1}{\theta}\right]$ | $k = \eta_1+1$, $\theta = -\frac{1}{\eta_2}$ | $1$ | $\left[\log x,\, x\right]$ | $\log\Gamma(\eta_1+1) - (\eta_1+1)\log(-\eta_2)$ | $\log\Gamma(k) + k\log\theta$ |
| beta ($\alpha, \beta$) | $\left[\alpha,\, \beta\right]$ | $\alpha = \eta_1$, $\beta = \eta_2$ | $\frac{1}{x(1-x)}$ | $\left[\log x,\, \log(1-x)\right]$ | $\log\Gamma(\eta_1) + \log\Gamma(\eta_2) - \log\Gamma(\eta_1+\eta_2)$ | $\log\Gamma(\alpha) + \log\Gamma(\beta) - \log\Gamma(\alpha+\beta)$ |
| multivariate normal ($\mu, \Sigma$) | $\left[\Sigma^{-1}\mu,\, -\frac{1}{2}\Sigma^{-1}\right]$ | $\mu = -\frac{1}{2}\eta_2^{-1}\eta_1$, $\Sigma = -\frac{1}{2}\eta_2^{-1}$ | $(2\pi)^{-k/2}$ | $\left[x,\, xx^\top\right]$ | $-\frac{1}{4}\eta_1^\top\eta_2^{-1}\eta_1 - \frac{1}{2}\log\lvert -2\eta_2 \rvert$ | $\frac{1}{2}\mu^\top\Sigma^{-1}\mu + \frac{1}{2}\log\lvert\Sigma\rvert$ |
| multinomial (known number of trials $n$) | $\left[\log p_1, \ldots, \log p_k\right]$ | $p_i = e^{\eta_i}$ | $\frac{n!}{\prod_{i} x_i!}$ | $\left[x_1, \ldots, x_k\right]$ | $0$ | $0$ |

where $\Gamma$ denotes the gamma function. For the multinomial row, the natural parameters are redundant under the constraint $\sum_i p_i = 1$; an unconstrained variant uses $\eta_i = \log\frac{p_i}{p_k}$ instead.
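As a sanity check of one table row (the exponential distribution, with $\eta = -\lambda$, $T(x) = x$, $h(x) = 1$ and $A(\eta) = -\log(-\eta)$), the following sketch compares the exponential-family form against scipy's pdf at a few points, with an arbitrary test rate $\lambda = 1.5$:

```python
import math
from scipy.stats import expon

lam = 1.5
eta = -lam
A = -math.log(-eta)  # -log(lambda)

for x in (0.1, 1.0, 2.5):
    ours = math.exp(eta * x - A)          # h(x) = 1, yields lambda * exp(-lambda x)
    scipys = expon.pdf(x, scale=1 / lam)  # scipy parametrizes by scale = 1/lambda
    print(x, ours, scipys)                # should match
```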
Maximum Likelihood
In the previous post, we computed the Maximum Likelihood Estimator (MLE) for a Gaussian distribution. In this post, we have seen that the Gaussian, alongside plenty of other distributions, belongs to the Exponential Family of Distributions. We will now show that the MLE estimator can be generalized across all distributions in the Exponential Family.
As in the Gaussian use case, to compute the MLE we start by applying the log-trick to the general expression of the exponential family, and obtain the following log-likelihood:

$$ \log p(X \mid \eta) = \sum_{n=1}^{N} \log h(x_n) + \eta^\top \sum_{n=1}^{N} T(x_n) - N A(\eta) $$
we then compute the derivative with respect to $\eta$ and set it to zero:

$$ \nabla_\eta \log p(X \mid \eta) = \sum_{n=1}^{N} T(x_n) - N \nabla_\eta A(\eta) = 0 \quad\Longrightarrow\quad \nabla_\eta A(\eta_{MLE}) = \frac{1}{N} \sum_{n=1}^{N} T(x_n) $$
Not surprisingly, the result relates to the data only via the sufficient statistic $\sum_{n=1}^{N} T(x_n)$: whatever the member of the family, the MLE is found by matching the gradient of the log-partition function to the empirical average of the sufficient statistic.
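For instance, in the Poisson case $\nabla_\eta A(\eta) = e^\eta$, so the recipe reduces to $\lambda_{MLE} = \frac{1}{N}\sum_n x_n$; a quick numerical sketch with simulated data:

```python
# Setting dA/deta = e^eta equal to the average sufficient statistic
# gives eta_MLE = log(mean(x)), i.e. lambda_MLE = mean(x).
import numpy as np

rng = np.random.default_rng(0)
data = rng.poisson(lam=4.0, size=10_000)  # simulated observations

eta_mle = np.log(data.mean())  # solve e^eta = (1/N) sum T(x_n) = mean(x)
print(np.exp(eta_mle))         # lambda_MLE, close to the true rate 4.0
```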