This project originally started as a side component of my group's final project in our Probability and Statistics course. Our project focused on modelling access to improved sanitation facilities at a macroeconomic level, using several other country-level statistics as explanatory variables. Our response variable measured the proportion of the population in a given country that had access to improved sanitation facilities, and its density plot had two clear peaks. These two observations made it the perfect candidate for combining two ideas we had learned about, beta distributions and the EM algorithm, to explore fitting beta mixture densities. This developed beyond the case study of sanitation facilities into generalized code and comparison tests against the standard of gaussian mixture densities.
Note that the reason there appears to be density beyond the [0,100] range is an artifact of how the kernel density estimate is computed. The blue lines in the plot indicate the minimum and maximum in the dataset.
This post will contain the following sections:
The beta distribution is a probability distribution that is especially relevant for modelling proportions, because it is bounded on the interval [0,1]. In addition, the beta distribution can be skewed or symmetrical on this interval, which gives it the flexibility to model a wide range of situations. Since we were interested in modelling the proportion of the population within a country that has access to improved sanitation facilities, the beta distribution is a natural fit.
The plots above are beta density plots for different values of the two shape parameters, alpha and beta. As you can see, the beta distribution can take on many shapes. Starting from the upper left panel, and moving clockwise, you see density plots resembling gaussian, exponential, uniform, and gamma distributions, all within the parameterization of the beta distribution. The one major restriction is that it is only defined on the [0,1] interval.
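If you would like to reproduce a similar panel, a minimal sketch in base R is below; the specific shape parameters here are illustrative choices, not necessarily the ones used in the figure above:

# Four illustrative beta densities: roughly bell-shaped, exponential-like,
# uniform, and gamma-like (shape parameters chosen only for illustration)
shape_pairs = list(c(5, 5), c(1, 8), c(1, 1), c(2, 6))
x = seq(0, 1, length.out = 500)

par(mfrow = c(2, 2))
for (shapes in shape_pairs) {
  plot(x, dbeta(x, shapes[1], shapes[2]), type = "l",
       main = paste0("Beta(", shapes[1], ", ", shapes[2], ")"),
       xlab = "x", ylab = "Density")
}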
The probability density function (pdf) for the beta distribution is shown below, where \(\Gamma(y)\) is the gamma function, a continuous generalization of the factorial:
\[ p(x | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta - 1} \]
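As a quick sanity check of this formula, we can evaluate it directly with R's gamma function and compare it against the built-in dbeta; the particular values of x, alpha, and beta below are arbitrary:

# Evaluate the beta pdf by hand and compare it to R's dbeta
alpha = 2
beta = 5
x = 0.3

by_hand = gamma(alpha + beta) / (gamma(alpha) * gamma(beta)) *
  x^(alpha - 1) * (1 - x)^(beta - 1)

c(by_hand = by_hand, dbeta = dbeta(x, alpha, beta))   # the two values agree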
A mixture density is a probability density function that is parameterized by two or more density functions and mixture parameters. Below you can see a general definition of a mixture density, where \(p(x|\theta)\) could theoretically take on any parameterized density function:
\[ P(x|\theta) = \sum^{K}_{k=1}\pi_kp(x|\theta_k) \qquad \qquad \text{with} \quad 0 \leq \pi_k \leq 1 \quad \text{and} \quad \sum^{K}_{k=1} \pi_k = 1 \]
For convenience of phrase, we will typically refer to the \(\pi_k\) as mixture parameters, and the \(\theta_k\) as component parameters. Above we just worked with the density at one point, but when it comes to maximum likelihood estimation, we need to work with the total likelihood of a whole data set. This can be formulated with nested sums and products as follows (\(X\) is considered the full dataset, and \(x_n\) is considered a single data point):
\[ l(X|\theta) = \prod^{N}_{n=1}\sum^{K}_{k=1}\pi_kp(x_n | \theta_k) \]
For convenience, we can take the log of this and use the beta distribution parameterization to obtain the log-likelihood:
\[ L(X|\boldsymbol{\pi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \sum^{N}_{n=1}\log \left( \sum^{K}_{k=1}\pi_kp(x_n|\alpha_k, \beta_k) \right) \]
where \(\boldsymbol{\pi}\), \(\boldsymbol{\alpha}\), and \(\boldsymbol{\beta}\) are vectors of the parameters defining the \(K\) mixture components.
The traditional next step in finding maximum likelihood estimates for each of our parameters would be to take each of the partial derivatives, set them equal to zero, and solve for the parameter of interest. However, we do not get closed form solutions to these partial derivatives due to the summation of individual distributions. The way to solve this problem is to use an iterative method known as the EM algorithm.
The EM algorithm was developed to assist maximum likelihood estimation with missing or incomplete data, but it can be used in other situations where iterative procedures are necessary. Applications include missing value imputation, censored data, mixture models, and many others. In the context of mixture models and mixture densities, the need for the EM algorithm comes about because there is no closed-form solution for the maximum likelihood estimates of all the parameters (mixture and component parameters). The EM algorithm for mixture models works by iterating between maximizing the likelihood of the parameters given responsibility measures (which we will define) and calculating the expected responsibilities given the parameters. You can think of the responsibility as the probability that a given data point belongs to a particular distribution in the mixture model. As we will discuss in the next section, mixtures of beta distributions create additional problems when it comes to implementing the EM algorithm.
When working with gaussian mixture densities, the maximum likelihood estimates for the mixture and component parameters have closed forms given the responsibility measures, which means that the maximization step in a given iteration of the EM algorithm reduces to evaluating an equation (the formulas can be seen on the Wikipedia page for mixture models, and the chapter “Density Estimation with Gaussian Mixture Models” in Mathematics for Machine Learning by Marc Peter Deisenroth et al. gives a great overview of the math behind them). With beta mixtures, however, there is no closed form for the maximum likelihood estimates of \(\alpha\) and \(\beta\) because of the gamma functions in the likelihood equation. This means that we have to resort to numerical maximization methods. However, we can still take the formulas for the mixture parameters and the responsibilities from gaussian mixture densities and apply them to beta mixture densities:
\[\begin{align*} r_{nk} = \frac{\pi_kp(x_n | \alpha_k, \beta_k)}{\sum^{K}_{j=1}\pi_jp(x_n | \alpha_j, \beta_j)} \qquad & \text{Responsibility of the } k \text{th mixture component for the } n \text{th data point} \\ \pi^{\text{new}}_k = \frac{\sum^{N}_{n = 1}r_{nk}}{N} \qquad & \text{New mixture parameters based only on the responsibility} \end{align*}\]

This means that we can calculate the responsibilities and the mixture parameters sequentially, and then get numeric MLEs for the component parameters while holding the mixture parameters constant. Numerical optimization methods require starting points, and unlike the mixture parameters, where we have a bounded range of values and can choose the midpoint as the initial value, the component parameters \(\alpha\) and \(\beta\) can take any positive real value, so there isn't a reasonable midpoint to initialize them with. Instead, we will use the method of moments to calculate appropriate starting values for the numerical optimization. Once we have numerical maximum likelihood estimates for the component parameters from the first iteration, we can use those as the starting point for the numerical maximization in the second iteration.
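To make the lack of a closed form concrete, take the partial derivative of the log-likelihood with respect to \(\alpha_k\), where \(\psi\) denotes the digamma function (the derivative of \(\log\Gamma\)):

\[ \frac{\partial L}{\partial \alpha_k} = \sum^{N}_{n=1} r_{nk}\left[\psi(\alpha_k + \beta_k) - \psi(\alpha_k) + \log x_n\right] \]

Setting this to zero (with the responsibilities held fixed) gives \(\psi(\alpha_k) - \psi(\alpha_k + \beta_k) = \frac{\sum_n r_{nk}\log x_n}{\sum_n r_{nk}}\), which cannot be solved for \(\alpha_k\) algebraically because of the digamma terms; the same happens for \(\beta_k\) with \(\log(1 - x_n)\) in place of \(\log x_n\). This is exactly why we fall back on numerical maximization for the component parameters.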
To show this process, we will go through a simple example with code and calculations, explicitly showing the calculations through the first two iterations, and then showing how to code an iterative algorithm with a reasonable stopping point.
We will use a small dataset of a mixture of two beta distributions, the first with 40 data points from a Beta(1, 5) distribution and the second with 60 data points from a Beta(10, 2) distribution. Below you can see the pdf for this mixture, which we will parameterize as BM(mixture_param, alpha_1, beta_1, alpha_2, beta_2) = BM(0.4, 1, 5, 10, 2):
set.seed(1)
data = c(rbeta(40, 1, 5), rbeta(60, 10, 2))
To start the process of fitting a beta mixture density to our 100 data points, we initialize the mixture parameter to 0.5 and compute method of moments estimates for the four component parameters. We first sort the data points and partition them into the smallest 50% and the largest 50%. We then treat these two halves as separate datasets and calculate method of moments estimates for a beta distribution fit to each one. The algebra behind the method of moments estimates is not explored here, but you can see it carried out on the Wikipedia page for the Beta Distribution. We can see this whole process in the code below:
sorted_data = sort(data)
smallest_50 = sorted_data[1:50]
largest_50 = sorted_data[51:100]
MoM_Beta_Estimate <- function(data) {
  # Method of moments: match the sample mean and variance to the
  # theoretical mean and variance of a Beta(alpha, beta) distribution
  data_mean = mean(data)
  data_var = var(data)
  intermediate = data_mean*(1-data_mean)/data_var - 1
  alpha = data_mean*intermediate
  beta = (1-data_mean)*intermediate
  return(list(alpha = alpha, beta = beta))
}
MoM_Smallest = MoM_Beta_Estimate(smallest_50)
MoM_Largest = MoM_Beta_Estimate(largest_50)
If we compare this to the known underlying distribution, it might not seem very close, but it is a much better starting point than creating a universal initial set of values for the component parameters.
We can then use these method of moments estimates to start the numerical process of obtaining MLEs for the component parameters, holding the mixture parameter constant. To use the maxLik package, we have to define the pdf for a beta mixture, and the corresponding likelihood function, in a specific fashion.
beta_mixture_dens <- function(x, mixtures, alphas, betas, log = T) {
  # Weighted sum of beta densities, evaluated at each point in x
  len = length(mixtures)
  dens = 0
  for (i in 1:len) {
    dens = dens + mixtures[i]*dbeta(x, alphas[i], betas[i])
  }
  if (log) {return(log(dens))}
  else {return(dens)}
}
mixture_params = c(0.5, 0.5)
likelihood <- function(theta, x) {
  # Return NA for invalid (non-positive) parameters so maxLik rejects them
  if (any(theta <= 0)) {
    NA
  } else {
    # Odd positions of theta are alphas, even positions are betas;
    # the mixture parameters are taken from the global 'mixture_params'
    len = length(theta)
    alphas = theta[seq(1, len, by = 2)]
    betas = theta[seq(2, len, by = 2)]
    beta_mixture_dens(x, mixture_params, alphas, betas, log = T)
  }
}
A couple things should be noted here:
We can now combine our starting component parameters determined by method of moments and the likelihood functions we defined above within the ‘maxLik’ function to get numerically maximized estimates for the component parameters, keeping the mixture parameter constant (in this case, 0.5). We can already see it getting close to the actual distribution component parameters of (1, 5, 10, 2), and certainly an improvement on the method of moments estimates.
component_parameters = c(MoM_Smallest$alpha, MoM_Smallest$beta, MoM_Largest$alpha, MoM_Largest$beta)
new_component_params = maxLik::maxLik(likelihood, start = component_parameters, x = data)
new_component_params
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 13 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: 32.47405 (4 free parameter(s))
## Estimate(s): 1.293175 6.743566 14.19475 2.874781
component_parameters = new_component_params$estimate
This completes the first maximization step.
We now begin the expectation step by calculating the responsibilities for each density. We refer back to the mathematical definition of responsibility in a mixture density:
\[ r_{nk} = \frac{\pi_kp(x_n | \alpha_k, \beta_k)}{\sum^{K}_{j=1}\pi_jp(x_n | \alpha_j, \beta_j)} \qquad \qquad \pi^{\text{new}}_k = \frac{\sum^{N}_{n = 1}r_{nk}}{N} \]
We will start by getting the numerator of the responsibility measures using the built-in 'dbeta' function in R, which calculates the \(p(x_n | \alpha_k, \beta_k)\) part for us (first_dist and second_dist). We can then add these two vectors element-wise to get the denominator. Below we show, for the first five data points, the responsibility each of the two distributions takes (and confirm that they add up to 1).
first_dist = mixture_params[1]*dbeta(data, component_parameters[1], component_parameters[2])
second_dist = mixture_params[2]*dbeta(data, component_parameters[3], component_parameters[4])
responsibility_denom = first_dist + second_dist
first_responsibility = first_dist / responsibility_denom
second_responsibility = second_dist / responsibility_denom
## first second total
## 1 0.99908 0.00092 1
## 2 1.00000 0.00000 1
## 3 1.00000 0.00000 1
## 4 1.00000 0.00000 1
## 5 0.98005 0.01995 1
Now that we have the responsibilities \(r_{nk}\), we can easily calculate new mixture parameters using the formula above:
mixture_param1 = sum(first_responsibility)/length(data)
mixture_param2 = sum(second_responsibility)/length(data)
mixture_params = c(mixture_param1, mixture_param2)
mixture_params
## [1] 0.404255 0.595745
We can see that after only the first expectation step, the mixture parameters produced from the EM algorithm are almost equivalent to the actual population mixture parameters that were used to generate the data.
We can now go back to the maximization side and get numerical maximum likelihood estimates for the component parameters, using the component parameters from the previous maximization step as starting points. Notice that the way we defined the likelihood function hard-codes that the mixture parameters are pulled from the variable 'mixture_params'. Since this variable isn't defined inside the function, R looks for it in the global environment (in this case, the environment produced by the markdown) and finds the re-defined mixture parameters above.
new_component_params = maxLik::maxLik(likelihood, start = component_parameters, x = data)
new_component_params
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 7 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: 34.36495 (4 free parameter(s))
## Estimate(s): 1.323012 7.04211 13.97811 2.847226
component_parameters = new_component_params$estimate
It seems like it didn't improve much and is still a decent bit off from the population parameters, right? Keep in mind that this is finding the optimal parameters given the data. In fact, at this point we have found parameters that have a higher likelihood given the data than the actual population parameters used to generate it (34.4 compared to 31.7):
sum(beta_mixture_dens(data, c(0.4, 0.6), c(1, 10), c(5, 2), log = T))
## [1] 31.65598
If you look back through the output of the maxLik function, you will see a couple of things printed out:
To decide when to stop this iterative process of calculating responsibilities and mixture parameters and then MLEs for the component parameters, we will keep track of the likelihood after each maximization step. For simplicity, the stopping rule we will use is that the likelihood of the current parameters (\(\theta_t\)) is less than or equal to the likelihood of the previous parameters (\(\theta_{t-1}\)) plus a small threshold (\(\epsilon\)). In math notation, we stop at the iteration where (with \(\boldsymbol{\pi_t}, \boldsymbol{\alpha_t}, \boldsymbol{\beta_t}\) as vectors):
\[ L(\boldsymbol{\pi_t, \alpha_t, \beta_t} | x) \leq L(\boldsymbol{\pi_{t-1}, \alpha_{t-1}, \beta_{t-1}} | x) + \epsilon \]
Now technically the sequence of likelihoods can diverge: as with gaussian mixtures, a single component can collapse onto one data point (here by letting \(\alpha\) and \(\beta\) grow very large with their ratio fixed, or by letting \(\alpha\) or \(\beta\) shrink below 1 when data sit exactly at 0 or 1), driving the likelihood toward infinity. For all intents and purposes, though, the likelihood surface is well-behaved and has at least one finite maximum. Local optima are certainly possible, but generally rare.
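Putting these pieces together, a minimal sketch of the full iterative procedure might look like the following. It reuses the likelihood and beta_mixture_dens functions and the starting values from the worked example above; the epsilon value and the loop structure are illustrative, not the exact code used in the package introduced later:

# Starting values: equal mixture weights and method of moments component parameters
mixture_params = c(0.5, 0.5)
component_parameters = c(MoM_Smallest$alpha, MoM_Smallest$beta,
                         MoM_Largest$alpha, MoM_Largest$beta)

epsilon = 0.01        # stopping threshold for the change in log-likelihood
previous_loglik = -Inf

repeat {
  # Maximization step for the component parameters: numerically maximize the
  # likelihood with the mixture parameters held fixed
  fit = maxLik::maxLik(likelihood, start = component_parameters, x = data)
  component_parameters = fit$estimate
  current_loglik = fit$maximum

  # Stop once a pass improves the log-likelihood by less than epsilon
  if (current_loglik <= previous_loglik + epsilon) break
  previous_loglik = current_loglik

  # Expectation step: responsibilities of each component for each data point
  first_dist  = mixture_params[1] * dbeta(data, component_parameters[1], component_parameters[2])
  second_dist = mixture_params[2] * dbeta(data, component_parameters[3], component_parameters[4])
  responsibilities = cbind(first_dist, second_dist) / (first_dist + second_dist)

  # Maximization step for the mixture parameters: average responsibility per component
  # (the likelihood function reads mixture_params from the global environment)
  mixture_params = colMeans(responsibilities)
}

mixture_params
component_parameters

Because each numerical maximization is warm-started from the previous iteration's estimates, later passes tend to need fewer Newton-Raphson iterations than earlier ones, as we already saw above (13 iterations for the first maximization, 7 for the second).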
We now have an algorithm that can fit mixtures of beta distributions to data. The last thing we need to do is evaluate its performance and compare it to the standard in mixture models, gaussian mixture models.
To evaluate the performance of this iterative approach, we will use a package that I developed to fit beta mixture densities. You can install this package easily from GitHub using the devtools package.
devtools::install_github("https://github.com/palmerimatthew/BetaMixture")
To evaluate this algorithm and the code, we will look at several things:
We will start with a couple of datasets randomly generated from a mixture of beta distributions. These tests are to help determine the ability of the algorithm to fit a robust selection of data distributions. We will compare the log likelihood of the fitted parameters to the log likelihood of the population parameters defined below:
First_data <- c(rbeta(700, 5, 2), rbeta(300, 1, 10))
First_time <- system.time(First_fit <- BetaMixture::BM_Fit(First_data, 2, 0.01, 1))
First_pop <- sum(BetaMixture::BM_Density(First_data, c(0.7, 0.3), c(5,1), c(2,10), Log=T))
Second_data <- c(rbeta(200, 2, 20), rbeta(500, 20, 2), rbeta(300, 10, 10))
Second_time <- system.time(Second_fit <- BetaMixture::BM_Fit(Second_data, 3, 0.01, 1))
Second_pop <- sum(BetaMixture::BM_Density(Second_data, c(0.2, 0.5, 0.3), c(2, 20, 10), c(20, 2, 10), Log=T))
Third_data <- c(rbeta(500, 2, 5), rbeta(500, 5, 5))
Third_time <- system.time(Third_fit <- BetaMixture::BM_Fit(Third_data, 2, 0.01, 1))
Third_pop <- sum(BetaMixture::BM_Density(Third_data, c(0.5, 0.5), c(2, 5), c(5, 5), Log=T))
Fourth_data <- c(rbeta(300, 2, 20), rbeta(250, 15, 30), rbeta(350, 30, 10), rbeta(100, 15, 1))
Fourth_time <- system.time(Fourth_fit <- BetaMixture::BM_Fit(Fourth_data, 4, 0.01, 1))
Fourth_pop <- sum(BetaMixture::BM_Density(Fourth_data, c(0.3, 0.25, 0.35, 0.1), c(2, 15, 30, 15), c(20, 30, 10, 1), Log=T))
## Density Fitted_LogLik Population_LogLik Time
## 1 First 186.3575 184.3355 6.267
## 2 Second 330.1788 326.9853 29.653
## 3 Third 281.7124 280.9228 1.048
## 4 Fourth 210.3479 1390.7187 49.363
We see that the log likelihood of the fitted values from the algorithm is better than that of the actual population parameters for every density except the last one. This density contains 4 beta distributions, and looking at the fitted values, the algorithm never converges to accurate mixture parameters (0.2953, 0.243, 0.2493, 0.2124). The first two fitted mixture parameters are close to the first two actual mixture parameters; the last two are not. Performance here could be improved, either by incorporating random starting points for the mixture parameters, or by taking the fitted mixture parameters, perturbing some of them, and restarting the process. But those are concerns for another day. We can see that the algorithm does an excellent job with small mixtures, even mixtures that overlap substantially. Below you can see the four population pdfs, as well as the pdf for each fitted mixture density.
We will come back to this when we compare this algorithm to gaussian mixture density algorithms, but a concerning aspect is the time it takes to fit, which we can see above in the 'Time' column (in seconds). Most of this time comes from the numerical maximization of the likelihood that has to happen with mixtures of beta distributions. The last computation takes almost a minute. Just for reference, I am using a 2019 MacBook Pro with a 2.4 GHz 8-Core Intel Core i9 processor.
The most common method of probabilistic classification is the gaussian mixture model, a mixture of multivariate gaussian distributions. Here, we will just focus on mixtures of univariate gaussian distributions. We will use the same 4 datasets as above, but this time fit gaussian mixture densities to them, comparing the log likelihood of the fitted parameters and the time it takes to fit the model. We will use the 'mixtools' package to help us fit gaussian mixtures.
first_time <- system.time(first_fit <- mixtools::normalmixEM(First_data, k = 2))
second_time <- system.time(second_fit <- mixtools::normalmixEM(Second_data, k = 3))
third_time <- system.time(third_fit <- mixtools::normalmixEM(Third_data, k = 2))
fourth_time <- system.time(fourth_fit <- mixtools::normalmixEM(Fourth_data, k = 4))
## Density Fitted_LogLik Population_LogLik Time
## 1 First 51.4635 184.3355 0.009
## 2 Second 258.2869 326.9853 0.015
## 3 Third 274.3867 280.9228 0.075
## 4 Fourth 189.4871 1390.7187 0.056
If we compare the fitted gaussian mixture densities to the fitted beta mixture densities above, we can see from the log likelihoods that the beta mixtures fit better than the gaussian mixtures. This is to be expected, as we know that each of these toy datasets was generated from a mixture of beta distributions. However, it is a nice sanity check to see that the algorithm performs well and can outperform the standard procedure for probabilistic classification in certain situations. Below we can see the actual pdfs used to generate the data in each of these situations, as well as the fitted gaussian mixture densities.
To gauge the performance of this algorithm more holistically in comparison to gaussian mixture densities, we will look at several other types of mixtures:
One thing we have to deal with is that beta distributions have support on [0,1], while each of the distributions above has support beyond that range. To handle this, we will use a rather simplistic method: using the max and min values in the dataset to rescale the data to the [0,1] interval. When we do this, we also need to adjust the likelihoods so that we can accurately compare models fit on different scales (for example, a gaussian model on the real numbers and a beta model on [0,1]). To see why this is an issue, suppose we fit a uniform distribution to 100 data points on the [0,1] scale, and then fit another uniform distribution to the same data points multiplied by 10 (so now on a [0,10] scale). The log likelihood of the first model would be 0 (each point has a density of 1), while the log likelihood of the second model would be \(100\log(0.1) \approx -230\) (each point has a density of 0.1). These are the same model fit to the same data, so to compare them fairly we need to put the densities back on a common scale, accounting for the Jacobian of the rescaling.
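As a sketch of what this adjustment can look like in code (the helper names rescale_01 and adjust_loglik are hypothetical, introduced here just for illustration): a linear rescaling \(y = (x - \min)/(\max - \min)\) has Jacobian \(1/(\max - \min)\), so a log-likelihood computed on the rescaled data needs \(n\log(\max - \min)\) subtracted from it before it can be compared to a log-likelihood computed on the original scale.

# Map a dataset onto [0, 1] using its observed minimum and maximum.
# (In practice you may want to pad the range slightly so that no rescaled
# point lands exactly on 0 or 1, where a beta density can be 0 or infinite.)
rescale_01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Adjust a log-likelihood computed on the rescaled [0, 1] data so it is
# comparable to one computed on the original scale: the change of variables
# y = (x - min) / (max - min) contributes -log(max - min) per data point.
adjust_loglik <- function(loglik_01, x) {
  loglik_01 - length(x) * log(max(x) - min(x))
}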
We will revisit this rather simplistic re-scaling method in the conclusion. This section will also be rather brief for each example, with the code largely hidden; we refer you to the sections above for code references.
We'll look at two different set-ups of gaussian mixture densities, each with a mixture of 2 and a mixture of 3 components. In the first set-up the individual distributions are spread out from each other, and in the second they overlap somewhat.
We will generate 1000 data points from each of these distributions, and then compare a fitted beta mixture density to a fitted gaussian mixture density.
## Density Fit_Process Fitted_LogLik Actual_LogLik
## 1 First Beta 358.4709 359.9950
## 2 First Gaussian 362.3596 359.9950
## 3 Second Beta 284.9162 306.2075
## 4 Second Gaussian 311.1597 306.2075
## 5 Third Beta 237.1455 243.0241
## 6 Third Gaussian 243.6842 243.0241
## 7 Fourth Beta 211.7081 230.0979
## 8 Fourth Gaussian 233.6122 230.0979
We see that the gaussian mixture fits better than the beta mixture in each situation, but this is to be expected, just as the beta mixture fit better above on mixtures of betas. One encouraging sign, though, is that the log likelihoods of the fitted beta mixtures here are closer to those of the fitted gaussian mixtures than the gaussian fits were to the beta fits when the true underlying distribution was a mixture of beta distributions.
For mixtures of gamma and gaussian distributions, we will look at four mixture densities to evaluate performance. Similar to the gaussian-only mixtures, two will have three component densities and two will have two, and in two of them the densities are close together while in the other two they are well separated.
## Density Fit_Process Fitted_LogLik Actual_LogLik
## 1 First Beta 231.4421 232.7979
## 2 First Gaussian 187.9769 232.7979
## 3 Second Beta 353.2723 354.0699
## 4 Second Gaussian 326.3643 354.0699
## 5 Third Beta 579.7631 579.1763
## 6 Third Gaussian 560.2246 579.1763
## 7 Fourth Beta 220.9024 221.3713
## 8 Fourth Gaussian 199.0086 221.3713
We can see that in each of these situations, the beta mixture fits considerably better than the gaussian mixture. This is in large part due to the flexibility of the beta distribution.
Much the same as the previous examples, we will look at 4 mixture densities: 2 with 2 component distributions and 2 with 3, and 2 with well-separated distributions and 2 with distributions close to each other.
## Density Fit_Process Fitted_LogLik Actual_LogLik
## 1 First Beta 474.4318 470.7752
## 2 First Gaussian 381.2753 470.7752
## 3 Second Beta 239.5194 260.8734
## 4 Second Gaussian 227.3774 260.8734
## 5 Third Beta 280.3269 282.1749
## 6 Third Gaussian 245.4265 282.1749
## 7 Fourth Beta 217.3617 210.8145
## 8 Fourth Gaussian 195.9969 210.8145
Similar to the above section with gamma/gaussian mixtures, we see that the beta mixture outperforms the gaussian mixture in each example.
In this section we will look at mixtures of distributions that do not include gaussians, namely combinations of gamma, exponential, reversed gamma, and reversed exponential distributions. 'Reversed' distributions are simply parameterized in terms of negative x instead of x (so instead of having support [0, Inf), they have support (-Inf, 0]). These reversed distributions also have a shift parameter, so a reversed gamma with a shift parameter of 10, for example, has support (-Inf, 10]. The last number in the parameterization below for the reversed distributions is this shift parameter; the others are the usual parameters of the non-reversed version of the distribution.
## Density Fit_Process Fitted_LogLik Actual_LogLik
## 1 First Beta 357.01321 353.0160
## 2 First Gaussian 196.72498 353.0160
## 3 Second Beta 239.26531 237.8492
## 4 Second Gaussian 86.25996 237.8492
## 5 Third Beta 742.22826 743.3190
## 6 Third Gaussian 499.41301 743.3190
## 7 Fourth Beta 618.31377 620.2365
## 8 Fourth Gaussian 235.39804 620.2365
These situations are where the beta mixture truly outshines the gaussian mixture. In all of these examples, the gap between the log-likelihood of the beta mixture and that of the gaussian mixture is no longer a small additive difference but closer to a multiplicative factor.
## K Beta_LogLik Normal_LogLik
## 1 1 42.79317 -51.06238
## 2 2 51.43635 11.88376
## 3 3 52.10359 27.31194
## 4 4 52.56052 32.80612
We can see that the beta mixture considerably outperforms the gaussian mixture density fit on this dataset.
When I originally started this project, I didn't realize that this idea of using beta distributions in place of gaussian distributions was already widely used. Dirichlet distributions (a multivariate extension of the beta distribution) have been used in place of multivariate gaussians in both Dirichlet mixture models and Dirichlet processes. Both of these approaches have generally performed better than their gaussian predecessors. The main challenge with Dirichlet models compared to gaussian models is the potential need for numerical methods, but as computers become more and more powerful, this concern will largely dissipate. One other issue with Dirichlet models is that their input space is bounded while the gaussian input space is not. I'll admit that I don't know how Dirichlet mixture models deal with this in practice.
Some additional items that I might look at related to this in the future: