Beta Mixture Density

Introduction

This project originally started as a side component of my groups final project in our Probability and Statistics course. Our project focused on modelling access to improved sanitation facilities at a macroeconomic level using several other country-level statistics as explanatory variables. Our response variable measured the proportion of the population in a given country that had access to improved sanitation facilities. It also had two clear peaks in its density plot. These two observations made it the perfect candidate for combining two ideas we had learned about: beta distributions and the EM algorithm, to explore fitting beta mixture densities. This developed beyond the case-study of santitation facilities into generalized code and comparison tests against the standard of gaussian mixture densities.

Note that the reason there appears to be density beyond the [0,100] range is due to the way kernel densities are measured. The blue lines in the plot indicate the min and max in the dataset.

This post will contain the following sections:

Beta distribution overview, especially in relation to our sanitation facility case study
EM algorithm overview, and how it can be used to fit mixture densities
The challenge of fitting beta mixture densities using the EM algorithm, and the need for numerical maximization
The math and code behind fitting beta mixture densities
Comparison of beta mixture density fits to gaussian mixture density fits
Future steps and conclusion

The Beta Distribution

The beta distribution is a probability distribution that is especially relevant for modelling proportions. The reason for this is that the beta distribution is bounded on the interval [0,1]. In addition, the beta distribution can be skewed or symmetrical on this interval, which lends it flexibility to model potentially more situations. With our data we were interested in modeling the proportion of the population within a country that has access to improve santitation facilities, which makes it a great fit for the beta distribution.

The plots above are beta density plots for different values of the two shape parameters, alpha and beta. As you can see, the beta distribution can take on many shapes. Starting from the upper left panel, and moving clockwise, you see density plots resembling gaussian, exponential, uniform, and gamma distributions, all within the parameterization of the beta distribution. The one major restriction is that it is only defined on the [0,1] interval.

The probability density function (pdf) for the beta distribution is shown below, where \(\Gamma(y)\) is the gamma function, similar to factorials:

\[ p(x: \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta - 1} \]

The EM Algorithm and Mixture Densities

A mixture density is a probability density function that is parameterized by two or more density functions and mixture parameters. Below you can see a general definition of a mixture density, where \(p(x|\theta)\) could theoretically take on any parameterized density function:

\[ P(x|\theta) = \sum^{K}_{k=1}\pi_kp(x|\theta_k) \qquad \qquad \text{with} \quad 0 \leq \pi_k \leq 1 \quad \text{and} \quad \sum^{K}_{k=1} \pi_k = 1 \]

For convenience of phrase, we will typically refer to the \(\pi_k\) as mixture parameters, and the \(\theta_k\) as component parameters. Above we just worked with the density at one point, but when it comes to maximum likelihood estimation, we need to work with the total likelihood of a whole data set. This can be formulated with nested sums and products as follows (\(X\) is considered the full dataset, and \(x_n\) is considered a single data point):

\[ l(X|\theta) = \prod^{N}_{n=1}\sum^{K}_{k=1}\pi_kp(x_n | \theta_k) \]

For convenience, we can take the log of this, and use the beta distribution parameterization to obtain the log-likelihood

\[ L(X|\boldsymbol{\pi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \sum^{N}_{n=1}log \left( \sum^{K}_{k=1}\pi_kp(x|\alpha_k, \beta_k) \right) \]

where \(\boldsymbol{\pi}\), \(\boldsymbol{\alpha}\), and \(\boldsymbol{\beta}\) are vectors of the parameters defining the \(k\) mixture densities.

The traditional next step in finding maximum likelihood estimates for each of our parameters would be to take each of the partial derivatives, set them equal to zero, and solve for the parameter of interest. However, we do not get closed form solutions to these partial derivatives due to the summation of individual distributions. The way to solve this problem is to use an iterative method known as the EM algorithm.

The EM algorithm was developed to assist maximum likelihood estimation with missing or incomplete data, but it can be used in other situations where iterative procedures are necessary. Applications can include missing value imputation, censored data, mixture models, and many others. In the context of mixture models and mixture densities, the need for the EM algorithm comes about because there is no closed-form solution to the maximum likelihood of all the parameters (mixture and component parameters). The EM algorithm for mixture models works by iterating between maximizing the likelihood of the parameters given responsibilty measures (which we will define) and calculating the expected responsibilities given the parameters. You can think of the responsibility as the probability that a given data point belongs to a particular distribution in the mixture model. As we will discuss in the next section, mixtures of beta distributions create additional problems when it comes to implementing the EM algorithm.

The EM Algorithm in Practice with Beta Mixtures

Theoretical Justifications

When working with gaussian mixture densities, the maximum likelihood estimates for the mixture and component parameters have closed forms given responsibility measures, which means that we just have to evaluate an equation to finish the maximization step in a given iteration of the EM Algorithm (The formulas can be seen on the Wikipedia page for mixture models, and the “Density Estimation with Gaussian Mixture Models” in Mathematics for Machine Learning by Marc Peter Deisenroth et al. gives a great overview of the math behind these formulas). With beta mixtures however, there is not a closed form for the maximum likelihood estimates for \(\alpha\) and \(\beta\) because of the gamma functions in the likelihood equation. This means that we have to resort to numerical maximization methods. However, we can still use the formulas for the mixture parameters and the responsibility from gaussian mixture densities and apply them to beta mixture densities:

\[\begin{align*} r_{nk} = \frac{\pi_kp(x_n | \alpha_k, \beta_k)}{\sum^{K}_{j=1}\pi_jp(x_n | \alpha_j, \beta_j)} \qquad & \text{Responsibility of the } k \text{th mixture component for the } n \text{th data point} \\ \pi^{\text{new}}_k = \frac{\sum^{N}_{n = 1}r_{nk}}{N} \qquad & \text{New mixture parameters based only on the responsibility} \end{align*}\]

This means that we can calculate the responsibilities and the mixture parameters sequentially, and then get numeric MLEs for the component parameters, holding the mixture parameters constant. Numeric optimization methods require starting points, and unlike with the mixture parameter where we have a range of values and we can chose the midpoint as the initial value, the component parameters \(\alpha\) and \(\beta\) can take any positive real number, so there isn’t a reasonable midpoint to initialize them. Instead, we will use the method of moments to calculate appropriate starting values for the numerical optimization process. Once we have numerical maximum likelihood estimates for the component parameters in the first iteration, we can use these as the starting point for the numerical maximization in the second iteration.

Demonstration

data set-up

To show this process, we will go through a simple example with code and calculations, explicitly showing the calculations through the first two iterations, and then showing how to code an iterative algorithm with a reasonable stopping point.

We will use a small dataset of a mixture of two beta distributions, the first with 40 data points from a Beta(1, 5) distribution and the second with 60 data points from a Beta(10, 2) distribution. Below you can see the pdf for this mixture, which we will parameterize as BM(mixture_param, alpha_1, beta_1, alpha_2, beta_2) = BM(0.4, 1, 5, 10, 2):

set.seed(1)
data = c(rbeta(40, 1, 5), rbeta(60, 10, 2))

Method of Moments estimate of component parameters

To start the process of fitting a beta mixture density to our 100 data points, we start by initializing the mixture parameter to 0.5, and compute method of moments estimates for the four component parameters. We first sort the data points, and then partition the data into the smallest 50% and the largest 50%. We then treat these two new datasets as their own datasets and calculate method of moments estimates for different beta distributions fit to each dataset. The algebra behind the method of moments estimates is not explored here, but you can see it carried out on the Wikipedia page for the Beta Distribution. We can see this whole process in the code below:

sorted_data = sort(data)
smallest_50 = sorted_data[1:50]
largest_50 = sorted_data[51:100]
MoM_Beta_Estimate <- function(data) {
  mean = mean(data)
  var = var(data)
  intermediate = mean*(1-mean)/var - 1
  alpha = mean*intermediate
  beta = (1-mean)*intermediate
  return(list(alpha = alpha, beta = beta))
}
MoM_Smallest = MoM_Beta_Estimate(smallest_50)
MoM_Largest = MoM_Beta_Estimate(largest_50)

Method of Moments estimates for smallest 50%: \(\alpha\) = 0.6277637 and \(\beta\) = 1.7854317
Method of Moments estimates for largest 50%: \(\alpha\) = 25.8484581 and \(\beta\) = 4.196086

If we compare this to the known underlying distribution, it might not seem very close, but it is a much better starting point than creating a universal initial set of values for the component parameters.

Numerical Maximum Likelihood Estimates for Component Parameters (first iteration)

We can then use these method of moments estimates to start the numerical process of obtaining MLEs for the component parameters, holding the mixture parameter constant. To use the MaxLik package, we have to define the pdf for a beta mixture, and the corresponding likelihood function in a specific fashion.

beta_mixture_dens <- function(x, mixtures, alphas, betas, log = T) {
  len = length(mixtures)
  dens = 0
  for (i in 1:len) {
    dens = dens + mixtures[i]*dbeta(x, alphas[i], betas[i])
  }
  if (log) {return(log(dens))}
  else {return(dens)}
}
mixture_params = c(0.5, 0.5)
likelihood <- function(theta, x) {
    if (any(theta <= 0)) {
      NA
    } else {
      len = length(theta)
      alphas = theta[seq(1, len, by = 2)]
      betas = theta[seq(2, len, by = 2)]
      beta_mixture_dens(x, mixture_params, alphas, betas, log = T)
    }
  }

A couple things should be noted here:

We are not explicitly checking in the ‘beta_mixture_dens’ function if any of the given vectors have invalid values (mixture_params don’t add up to 1, any of the alphas/betas are less than or equal to 0). In a deployed package, this checks would have to be performed, so that an appropriate high-level error can be displayed to the user.
In the likelihood function we do check that the alphas and betas (enclosed in the theta vector) are all greater than zero, but this is because the maxLik function is able to fix itself when it generates an invalid parameter (only knows this when the likelihood function returns NA, if this check wasn’t there, and it tried to use a negative value for an alpha or beta, you would get NAs in your likelihood).
theta is assumed to be alternating between alpha and beta (alpha_1, beta_1, alpha_2, beta_2, etc.). The ordering doesn’t necessarily matter, as long as you are able to be consistent throughout your code in how you store and reference the values (you can see I was able to easily grab out the alphas (odds) and the betas (evens) from the theta vector).

We can now combine our starting component parameters determined by method of moments and the likelihood functions we defined above within the ‘maxLik’ function to get numerically maximized estimates for the component parameters, keeping the mixture parameter constant (in this case, 0.5). We can already see it getting close to the actual distribution component parameters of (1, 5, 10, 2), and certainly an improvement on the method of moments estimates.

component_parameters = c(MoM_Smallest$alpha, MoM_Smallest$beta, MoM_Largest$alpha, MoM_Largest$beta)
new_component_params = maxLik::maxLik(likelihood, start = component_parameters, x = data)
new_component_params

## Maximum Likelihood estimation
## Newton-Raphson maximisation, 13 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: 32.47405 (4 free parameter(s))
## Estimate(s): 1.293175 6.743566 14.19475 2.874781

component_parameters = new_component_params$estimate

This completes the first maximization step.

Responsibility and Mixture Parameter Estimation (second iteration)

We now begin the expectation step by calculating the responsibilities for each density. We refer back to the mathematical definition of responsibility in a mixture density:

\[ r_{nk} = \frac{\pi_kp(x_n | \alpha_k, \beta_k)}{\sum^{K}_{j=1}\pi_jp(x_n | \alpha_j, \beta_j)} \qquad \qquad \pi^{\text{new}}_k = \frac{\sum^{N}_{n = 1}r_{nk}}{N} \]

we will start by getting the numerator of the responsibility measures by using the built in ‘dbeta’ function in R, which can calculate the \(p(x_n | \alpha_k, \beta_k)\) part for us (first_dist and second_dist). We can then row-wise (or column-wise depending on your perspective) add them up to get the denominator. We show the first five data points, and the responsibility the two distributions have (and show that they add up to 1).

first_dist = mixture_params[1]*dbeta(data, component_parameters[1], component_parameters[2])
second_dist = mixture_params[2]*dbeta(data, component_parameters[3], component_parameters[4])
responsibility_denom = first_dist + second_dist
first_responsibility = first_dist / responsibility_denom
second_responsibility = second_dist / responsibility_denom

##     first  second total
## 1 0.99908 0.00092     1
## 2 1.00000 0.00000     1
## 3 1.00000 0.00000     1
## 4 1.00000 0.00000     1
## 5 0.98005 0.01995     1

Now that we have the responsibilities \(r_{nk}\), we can easily calculate new mixture parameters using the formula above:

mixture_param1 = sum(first_responsibility)/length(data)
mixture_param2 = sum(second_responsibility)/length(data)
mixture_params = c(mixture_param1, mixture_param2)
mixture_params

## [1] 0.404255 0.595745

We can see that after only the first expectation step, the mixture parameters produced from the EM algorithm are almost equivalent to the actual population mixture parameters that were used to generate the data.

Numerical Maximum Likelihood Estimates for Component Parameters (second iteration)

We can now go back to the maximization side, and get numerical maximum likelihood estimates for the component parameters using the component parameters from the previous maximization step as starting points. Notice that with the way we define the likelihood function, it hard-codes in that the mixture parameters should be pulled from the variable ‘mixture_params’. Since this parameter isn’t defined in the function, it would look for it in the global environment (in this case the environment produced by the markdown), and find the re-defined mixture parameters above.

new_component_params = maxLik::maxLik(likelihood, start = component_parameters, x = data)
new_component_params

## Maximum Likelihood estimation
## Newton-Raphson maximisation, 7 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: 34.36495 (4 free parameter(s))
## Estimate(s): 1.323012 7.04211 13.97811 2.847226

component_parameters = new_component_params$estimate

Seems like it didn’t improvement much, and is still a decent bit off from the population parameters right? Keep in mind that this is finding the optimal parameters given the data. In fact, at this point, we have found parameters that have a higher likelihood given the data compared to the actual population parameters used to define the distribution (34.4 compared to 31.7):

sum(beta_mixture_dens(data, c(0.4, 0.6), c(1, 10), c(5, 2), log = T))

## [1] 31.65598

Defining an end point to this iteration

If you look back through the output to the maxLik function, you will see a couple things printed out:

That it is using Newton-Raphson maximization, a process also used in applications like finding coefficients in GLMs
How many iterations the process went through to reach a maximum of the likelihood (13 and 7)
The Log Likelihood of the parameters given the data
The estimates of the four component parameters produced by the numerical process

To decide when to stop this iterative process of calculating responsibilities and mixture parameters and then MLEs for the component parameters, we will keep track of the likelihood after each maximization step. For simplicity, the stopping rule we will use is when the likelihood of the current parameters (\(\theta_t\)) is less than or equal to a small threshold (\(\epsilon\)) plus the likelihood of the previous parameters (\(\theta_{t-1}\)). In math notation the stopping iteration is (where \(\boldsymbol{\pi_t, \alpha_t, \beta_t}\) are vectors):

\[ L(\boldsymbol{\pi_t, \alpha_t, \beta_t} | x) \leq L(\boldsymbol{\pi_{t-1}, \alpha_{t-1}, \beta_{t-1}} | x) + \epsilon \]

Now technically the sequence of likelihoods can diverge (if \(\alpha\) and \(\beta\) are both 0, we get infinite density at both 0, and 1, which means potentially infinite likelihood), but for all intents and purposes the likelihood function will be convex, meaning there will be at least one maximum. The possibility of local optima is certainly possible, but generally rare.

We now have an algorithm that can fit mixtures of beta distributions to data. The last thing we need to do is evaluate its performance, and compare it to the standard in mixture models, which are guassian mixture models.

Performance and Comparison to Gaussian Mixture Densities

To evaluate the performance of this iterative approach, we will actually use a package that I developed to fit beta mixture densities. You can download this package easily off of github using the devtools package.

devtools::install_github("https://github.com/palmerimatthew/BetaMixture")

To evaluate this algorithm and the code, we will look at several things:

Ability to fit distributions that truly come from a mixture of beta distributions. Compare the fitted parameters to the known population parameters.
Compare the fitted beta mixture densities to fitted gaussian mixture densities on datasets coming from a mixture of beta distributions.
Compare the fitted beta mixture densities to fitted gaussian mixture densities on miscellaneous generated mixture densities.
Comparison of performance on Access to Improved Sanitation Facilities dataset

Fit true underlying Beta Mixtures

We will start with a couple of datasets randomly generated from a mixture of beta distributions. These tests are to help determine the ability of the algorithm to fit a robust selection of data distributions. We will compare the log likelihood of the fitted parameters to the log likelihood of the population parameters defined below:

First: Mix(0.7, 0.3), Alpha(5, 1), Beta(2, 10)
Second: Mix(0.2, 0.5, 0.3), Alpha(2, 20, 10), Beta(20, 2, 10)
Third: Mix(0.5, 0.5), Alpha(2, 5), Beta(5, 5)
Fourth: Mix(0.3, 0.1, 0.35, 0.25), Alpha(3, 20, 30, 15), Beta(20, 10, 7, 1)

First_data <- c(rbeta(700, 5, 2), rbeta(300, 1, 10))
First_time <- system.time(First_fit <- BetaMixture::BM_Fit(First_data, 2, 0.01, 1))
First_pop <- sum(BetaMixture::BM_Density(First_data, c(0.7, 0.3), c(5,1), c(2,10), Log=T))
Second_data <- c(rbeta(200, 2, 20), rbeta(500, 20, 2), rbeta(300, 10, 10))
Second_time <- system.time(Second_fit <- BetaMixture::BM_Fit(Second_data, 3, 0.01, 1))
Second_pop <- sum(BetaMixture::BM_Density(Second_data, c(0.2, 0.5, 0.3), c(2, 20, 10), c(20, 2, 10), Log=T))
Third_data <- c(rbeta(500, 2, 5), rbeta(500, 5, 5))
Third_time <- system.time(Third_fit <- BetaMixture::BM_Fit(Third_data, 2, 0.01, 1))
Third_pop <- sum(BetaMixture::BM_Density(Third_data, c(0.5, 0.5), c(2, 5), c(5, 5), Log=T))
Fourth_data <- c(rbeta(300, 2, 20), rbeta(250, 15, 30), rbeta(350, 30, 10), rbeta(100, 15, 1))
Fourth_time <- system.time(Fourth_fit <- BetaMixture::BM_Fit(Fourth_data, 4, 0.01, 1))
Fourth_pop <- sum(BetaMixture::BM_Density(Fourth_data, c(0.3, 0.25, 0.35, 0.1), c(2, 15, 30, 15), c(20, 30, 10, 1)))

##   Density Fitted_LogLik Population_LogLik   Time
## 1   First      186.3575          184.3355  6.267
## 2  Second      330.1788          326.9853 29.653
## 3   Third      281.7124          280.9228  1.048
## 4  Fourth      210.3479         1390.7187 49.363

We see that the log likelihood of the fitted values from the algorithm are all better than the actual population parameters except for in the last density. This density contain 4 beta distributions, and looking at the fitted values, it never converges to accurate mixture parameters (0.2953, 0.243, 0.2493, 0.2124). The first two mixture parameters are close to the first two actual mixture parameters; however the last two are not close. Improvements could be made to the performance in this area, either by incorporating random starting points for the mixture parameters, or taking the fitted mixture parameters, change some of them, and restarting the process. But these are concerns for another day. We can see that it does an excellent job with small mixtures, and even mixtures that are seemingly overlapping. Below you can see the four population pdfs, as well as the pdf for the fitted mixture density

We will come back to this when we compare this algorithm to gaussian mixture density algorithms, but a concerning aspect is the time that it takes to fit. We can see this above in the ‘Time’ column. Most of the this time comes from the numerical maximization of the likelihood that needs to happen with mixtures of beta distributions. We see the last computation takes almost a minute. Just for reference, I am using a 2019 Macbook Pro with a 2.4 GHz 8-Core Intel Core i9 processor.

Comparison to Gaussian Mixture Density

The simplest method of probabilistic classification is gaussian mixture models, which are mixtures of multivariate gaussian distributions. Here, we will just focus on mixtures of univariate gaussian distributions. We will use the same 4 datasets above, but this time fit gaussian mixture densities to them, comparing the log likelihood of these fitted parameters, and the time it takes to fit the model. We will use the ‘mixtools’ package to help us fit gaussian mixtures.

first_time <- system.time(first_fit <- mixtools::normalmixEM(First_data, k = 2))

second_time <- system.time(second_fit <- mixtools::normalmixEM(Second_data, k = 3))

third_time <- system.time(third_fit <- mixtools::normalmixEM(Third_data, k = 2))

fourth_time <- system.time(fourth_fit <- mixtools::normalmixEM(Fourth_data, k = 4))

##   Density FItted_LogLik Population_LogLik  Time
## 1   First       51.4635          184.3355 0.009
## 2  Second      258.2869          326.9853 0.015
## 3   Third      274.3867          280.9228 0.075
## 4  Fourth      189.4871         1390.7187 0.056

If we compare the fitted gaussian mixture densities to the fitted beta mixture densities above, we can see that with the log likelihood the beta mixtures fit better than the gaussian mixtures. This is to be expected, as we know that each of these toy datasets were generated from a mixture of beta distributions. However, it is a nice sanity check to see that the algorithm performs well, and can outperform the standard procedure for probabilistic classification in certain situations. We can see below the actual pdfs used to generate the data in each of these situations as well as the fitted gaussian mixture densities.

Performance on datasets generated from miscellaneous distributions

To gauge the performance of this algorithm more holistically in comparison to gaussian mixture densities, we will look at several other types of mixtures:

Gaussian mixtures
Gamma and Gaussian
Exponential and Gaussian
Gamma, Exponential, Reversed Gamma, Reversed Exponential combinations

One of the things that we have to do is come up with a way to deal with the issue that beta distributions have a [0,1] support, and each of the distributions above have support beyond that range. To deal with this, we will use a rather simplistic method of just using the max and min values in the dataset to rescale the dataset to a [0,1] scale. When we do this, we will also need to adjust the likelihoods so that we can accurately compare models on different scales (for example, gaussian model on the real numbers and Beta model on [0,1]). To see why this is an issue lets think about if we fit a uniform distribution to 100 data points on the [0,1] scale, and then fit another uniform distribution to the same data points multiplied by 10 (so now on a [0,10] scale). The likelihood of the first model would be 100 (since each point has a density of 1), and the likelihood of the second model would be 10 (since each point has a density of 0.1). However, these are the same models fit to the same data, so to compare we would need to rescale the densities.

We will revisit this rather simplistic method of re-scaling in the conclusion. This section will also be rather brief per instance, and the code will largely be hidden, and we refer you to the above section for any code references.

Population Gaussian mixture

We’ll look at two different set-ups of gaussian mixture densities each with a mixture of 2 and a mixture of 3. The first set-up will be where the individual distributions are spread out from each other, and the the second set-up will be where they overlap somewhat.

Mixture of 2, far apart: 0.3*Normal(-5, 1), 0.7*Normal(2, 2)
Mixture of 3, far apart: 0.4*Normal(0, 2), 0.4*Normal(-15, 3), 0.2*Normal(9, 2)
Mixture of 2, close: 0.4*Normal(0, 2), 0.6*Normal(6, 2)
Mixture of 3, close: 0.3*Normal(-1, 2), 0.3*Normal(3, 1), 0.4*Normal(6, 1)

We will generate 1000 data points from each of these distributions, and then compare a fitted beta mixture density to a fitted gaussian mixture density.

##   Density Fit_Process Fitted_LogLik Actual_LogLik
## 1   First        Beta      358.4709      359.9950
## 2   First    Gaussian      362.3596      359.9950
## 3  Second        Beta      284.9162      306.2075
## 4  Second    Gaussian      311.1597      306.2075
## 5   Third        Beta      237.1455      243.0241
## 6   Third    Gaussian      243.6842      243.0241
## 7  Fourth        Beta      211.7081      230.0979
## 8  Fourth    Gaussian      233.6122      230.0979

We see that the gaussian mixture fits better than the beta mixture in each situation, but this is to be expected just like above when the beta mixture fit better on mixtures of betas. However, one encouraging sign is that when you compare the log likelihoods of the fitted beta mixture to the fitted gaussian mixture, they are closer than when the true underlying distribution was a mixture of beta distributions.

Population of Gamma and Gaussian

For mixtures of gamma and gaussian distriutions, we will look at four distributions to evaluate the performance. Similar to how we did it with just gaussian mixtures, we will have two that have three densities and two that have two densities, and then we will have two where the densities are close, and two where the densities are all seperated.

Mixture of 2, far apart: 0.4*Gamma(2, 0.5), 0.6*Normal(15, 2)
Mixture of 3, far apart: 0.3*Gamma(3, 1), 0.4*Normal(10, 1), 0.3*Normal(19, 2)
Mixture of 2, close: 0.5*Gamma(2, 0.25), 0.5*Normal(18, 3)
Mixture of 3, close: 0.3*Gamma(1.5, 0.5), 0.3*Normal(7, 1), 0.4*Normal(12,2)

##   Density Fit_Process Fitted_LogLik Actual_LogLik
## 1   First        Beta      231.4421      232.7979
## 2   First    Gaussian      187.9769      232.7979
## 3  Second        Beta      353.2723      354.0699
## 4  Second    Gaussian      326.3643      354.0699
## 5   Third        Beta      579.7631      579.1763
## 6   Third    Gaussian      560.2246      579.1763
## 7  Fourth        Beta      220.9024      221.3713
## 8  Fourth    Gaussian      199.0086      221.3713

We can see that in each of these situations, the beta mixture fits considerably better than the gaussian mixtures. This is in large part due to the flexible of the beta distributions.

Exponential and Gaussian

Much the same as the previous examples, we will look at 4 mixture densities, 2 having 2 distributions and 2 having 3 distributions, and then 2 with distriutions with seperation and 2 with distributions close to each other.

Mixture of 2, far apart: 0.4*Exponential(1), 0.6*Normal(9, 2)
Mixture of 3, far apart: 0.2*Exponential(0.5), 0.5*Normal(12, 2), 0.3*Normal(20, 1)
Mixture of 2, close: 0.3*Exponential(0.4), 0.7*Normal(8, 2)
Mixture of 3, close: 0.2*Exponential(0.3), 0.4*Normal(7, 2), 0.3*Normal(12, 1)

##   Density Fit_Process Fitted_LogLik Actual_LogLik
## 1   First        Beta      474.4318      470.7752
## 2   First    Gaussian      381.2753      470.7752
## 3  Second        Beta      239.5194      260.8734
## 4  Second    Gaussian      227.3774      260.8734
## 5   Third        Beta      280.3269      282.1749
## 6   Third    Gaussian      245.4265      282.1749
## 7  Fourth        Beta      217.3617      210.8145
## 8  Fourth    Gaussian      195.9969      210.8145

Similar to the above section with gamma/gaussian mixtures, we see that the beta mixture outperforms the gaussian mixture in each example.

Miscellaneous Distributions

In this section we will look at mixture of distributions not including gaussians, namely Gamma, Exponential, Reversed Gamma, and Reversed Exponential combinations. ‘Reversed’ distributions are simply parameterized by negative x instead of just x (so instead of having support [0,Inf), then will have support (-Inf, 0]). These reversed distributions will also have a shift parameter, so a reversed gamma with a shift parameter of 10 for example will have a (-Inf, 10] support. the last number in the parameterization below for the reversed distributions are these shift parameters; the others are the typical parameters for the normal version of the distributions.

Gamma and Reversed Gamma: 0.4*Gamma(2, 1), 0.6*Reversed_Gamma(2, 0.5, 20)
Gamma and Reversed Exponential: 0.7*Gamma(3, 0.5), 0.3*Reversed_Exponential(0.5, 22.5)
Exponential and Reversed Gamma: 0.3*Exponential(1), 0.7*Reversed_Gamma(2, 0.5, 25)
Exponential and Reversed Exponential: 0.4*Exponential(1), 0.6*Reversed_Exponential(0.5, 15)

##   Density Fit_Process Fitted_LogLik Actual_LogLik
## 1   First        Beta     357.01321      353.0160
## 2   First    Gaussian     196.72498      353.0160
## 3  Second        Beta     239.26531      237.8492
## 4  Second    Gaussian      86.25996      237.8492
## 5   Third        Beta     742.22826      743.3190
## 6   Third    Gaussian     499.41301      743.3190
## 7  Fourth        Beta     618.31377      620.2365
## 8  Fourth    Gaussian     235.39804      620.2365

These situations are where the beta mixture truly outshines the gaussian mixture. We can see in all of these examples that the comparison between the log-likelihood of the beta mixture and the gaussian mixture is no longer on an additive scale, but now on a multiplicative scale.

Performance on Sanitation dataset in comparison to Gaussian mixture densities

##   K Beta_LogLik Normal_LogLik
## 1 1    42.79317     -51.06238
## 2 2    51.43635      11.88376
## 3 3    52.10359      27.31194
## 4 4    52.56052      32.80612

We can see that the beta mixture considerably outperforms the gaussian mixture density fit on this dataset.

Conclusions

When I originally started this project, I didn’t realize that this idea of using beta distributions instead of gaussian distributions was already widely used. Dirichlet distributions (multivariate extension of betas) have been used in the place of multivariate gaussians with both Dirichlet mixture models and Dirichlet Processes. Both of these changes have generally performed better than their gaussian predecessors. The main challenge with Dirichlet models instead of Gaussian models is the potential need for numeric methods, but as computers become more and more powerful, this concern will largely dissipate. One other issue with dirichlet models is that their input space is bounded, and the gaussian input space is not bounded. I’ll admit that I don’t know how dirichlet mixture models deal with this in practice.

Some additional items that I might look at related to this in the future:

We arbitrarily chose end points as the max and min of the dataset in order to reduce the dataset to the [0,1] range. Instead, we chould have chosen these end points as part of the maximization process, and potentially gotten a better fit.
Test the beta mixture algorithm’s ability to fit disproportionate mixtures. All the tests were done on mixtures that had similar mixture parameters (this might also involve incorporating random restarts in the algorithm).
Test performance on many other real datasets (old faithful and iris datasets for example).