The Conway–Maxwell–Poisson (CMP) distribution

The Conway–Maxwell–Poisson (CMP) distribution is a discrete probability distribution with two parameters \(\lambda >0\) and \(\nu \geq 0\), and probability mass function

\[ f_{s}(x)= \dfrac{\lambda^x}{(x!)^{\nu}} \dfrac{1}{Z(\lambda,\nu)},\, \] where \(Z(\lambda,\nu)=\sum _{{j=0}}^{\infty }{\frac {\lambda ^{j}}{(j!)^{\nu }}}\) is the normalising constant, and \(x \in \{0,1,2,3,\dots\}\). The parameter space is defined by \(\lambda ,\nu >0\), and \(0<\lambda<1\) , \(\nu =0\). One of the appealing properties of this distribution is that it can account for overdispersion and underdispersion for different values of the parameter \(\nu\).

See Shmueli et al. (2005) and Kimberly F. Sellers (2023) for more details about this distribution.

Likelihood function and Sufficient Statistics

The likelihood function of \((\lambda, \nu)\), for a sample \({\bf x} = (x_1,\dots,x_n)^{\top}\) is given by:

\[ \begin{align} L(\lambda, \nu \mid {\bf x}) &= \prod_{i=1}^n \dfrac{\lambda^{x_i}}{(x_i!)^{\nu}} \dfrac{1}{Z(\lambda,\nu)}\\ &= Z(\lambda,\nu)^{-n} \lambda^{\sum_{i}^n x_i}\left( \prod_{i=1}^n x_i! \right)^{-\nu}. \end{align} \]

This expression implies that \(\left(\prod_{i=1}^n x_i!,\sum_{i}^n x_i\right)\) are sufficient statistics for \((\lambda, \nu)\) (see Shmueli et al. (2005) and Kadane et al. (2006)). Consequently, \(\left(\dfrac{1}{n}\sum_{i=1}^n \log(x_i!), \dfrac{1}{n}\sum_{i}^n x_i\right)\) are also sufficient for \((\lambda, \nu)\).

The log-likelihood function is:

\[ \begin{align} \ell(\lambda, \nu \mid {\bf x}) &= - n \log Z(\lambda,\nu) + n\bar{\mathbf{x}}\log( \lambda)- {\nu}\sum_{i=1}^n \log \left(x_i! \right). \end{align} \]

One of the challenges in estimating the parameters \((\lambda, \nu)\) is that the normalising constant \(Z(\lambda,\nu)\) is not available in closed-form, which complicates the use of maximum likelihood estimation (Shmueli et al. (2005)). However, approximating this normalising constant is feasible by using a large number of terms in the corresponding series

\[ Z(\lambda,\nu) \approx \sum _{{j=0}}^{M}{\frac {\lambda ^{j}}{(j!)^{\nu }}}, \quad \text{ with } \, M \, \text{ large}. \]

We also introduce a new approximation which, although trivial, allows for borrowing other numerical methods to provide a numerically stable/tractable approximation of this normalising constant. Note that,

\[ \begin{split} Z(\lambda,\nu) &\approx \sum _{{j=0}}^{M}{\frac {\lambda ^{j}}{(j!)^{\nu }}}, \\ & = \sum _{{j=0}}^{M} \exp \left[ j \log(\lambda) - \nu \log(j!) \right ] \end{split} \]

Then, the term \(\log Z(\lambda,\nu)\), which appears in the log-likelihood function, can be seen as the logSumExp function applied to the set of values \(j \log(\lambda) - \nu \log(j!)\), \(j=0,\dots,M\). As shown later, this allows for a fast and stable approximation of the log-likelihood function.

Numerical methods based on Fisher scoring have been proposed (K. F. Sellers and Shmueli (2010), Chatla and Shmueli (2018), K. Sellers, Lotze, and Raim (2019)) to approximate the maximum likelihood estimators of the parameters \((\lambda, \nu)\). Recently, Bayesian methods (Benson and Friel (2021)) and Generalised Bayesian methods (Matsubara et al. (2023)) have also been employed to estimate the parameters \((\lambda, \nu)\). In the next section, we discuss a method introduced by Rubio and Johansen (2013) to approximate the maximum likelihood estimator for models with intractable likelihoods.

Approximate maximum likelihood estimator (AMLE)

The AMLE, introduced by Rubio and Johansen (2013), consists of using Approximate Bayesian Computation (ABC) methods to approximate the likelihood function in cases where the likelihood is intractable and/or too computationally expensive to evaluate, and maximising this approximated likelihood function. The basic ABC algorithm is shown below:

Simulate \({\bf\theta}^{\prime}\) from the prior distribution \(\pi(\cdot)\).
Generate \({\bf y}\) from the model \(f(\cdot\vert{\bf\theta}^{\prime})\).
Accept \({\bf\theta}^{\prime}\) with probability \(\propto K_{\varepsilon}({\bf x} \vert {\bf y})\) otherwise return to step 1.

where \(K_{\varepsilon}({\bf x}\vert {\bf y})\) is a normalised Markov kernel. The ABC algorithm produces a sample from the following approximation to the posterior distribution:

\[ \widehat{\pi}_{\varepsilon}({\bf\theta}\vert{\bf x}) = \dfrac{\widehat{f}_{\varepsilon}({\bf x}\vert{\bf\theta})\pi({\bf\theta})}{\int_{\Theta}\widehat{f}_{\varepsilon}({\bf x}\vert{\bf t})\pi({\bf t})d{\bf t}}, \]

where

\[ \widehat{f}_{\varepsilon}({\bf x}\vert{\bf\theta}) = \int_{{\mathbb R}^{q\times n}}K_{\varepsilon}({\bf x}\vert {\bf y}) f({\bf y}\vert{\bf\theta}) d{\bf{y}} , \]

is an approximation of the likelihood function. Therefore, if we use a uniform prior on a compact set that contains the Maximum Likelihood Estimator (MLE), and the Maximum a Posteriori (MAP, the maximum of the posterior distribution) coincide. Thus, it is possible to approximate the MLE by using ABC methods to generate an approximate sample from the posterior and then approximating the MAP using this sample. This can be done by using the ABC algorithm in the first step of the AMLE algorithm, as shown below:

The AMLE Algorithm

Obtain a sample \({\bf \theta}^*_{\varepsilon}=({\bf \theta}^*_{\varepsilon,1},...,{\bf \theta}^*_{\varepsilon,k})\) from \(\widehat{\pi}_{\varepsilon}({\bf \theta}\vert{\bf x})\).
Using the sample \({\bf \theta}^*_{\varepsilon}\) construct a nonparametric estimator \(\widehat{\pi}_{k,\varepsilon}({\bf \theta}\vert{\bf x})\) of the density \(\widehat{\pi}_{\varepsilon}({\bf \theta}\vert{\bf x})\).
Calculate the maximum of \(\widehat{\pi}_{k,\varepsilon}({\bf \theta}\vert{\bf x})\), \(\tilde{{\bf \theta}}_{\varepsilon}\). This is an approximation of the MLE \(\widehat{{\bf \theta}}\).

Asymptotic properties of the AMLE are studied in Rubio and Johansen (2013).

Disclaimer: this is simply a tutorial on the use of AMLE for the CMP distribution with pedagogical purposes, not a formal comparison of the different estimation methods.

R code

The following R code illustrates how to estimate the parameters \((\lambda,\nu)\) using the AMLE method for a simulated data set. A comparison with the results obtained with the command glm.cmp from the R package COMPoissonReg is presented.

R code

Required packages

rm(list=ls())

# Required packages
library(KernSmooth)

## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009

library(mvtnorm)
library(misc3d)
library(ks)
library(COMPoissonReg)

## Loading required package: Rcpp

## Loading required package: numDeriv

MLE functions

#--------------------------------------------------------------------
# LogSumExp Function: Recursive formula
# https://rpubs.com/FJRubio/LSE
#--------------------------------------------------------------------
LSE_R <- function(vec){ 
  n.vec <- length(vec)
  vec <- sort(vec, decreasing = TRUE)
  Lk <- vec[1]
  for (k in 1:(n.vec-1)) {
    Lk <- max(vec[k+1], Lk) + log1p(exp(-abs(vec[k+1] - Lk))) 
  }
  return(Lk)
}

# minus log-likelihood function
# Reparameterised using a log link

NN <- 100 # number of terms in the series for the normalising constant

log_lik = function(par){
  lambda <- exp(par[1])
  nu <- exp(par[2])
 
  logZs <- (0:NN)*log(lambda) - nu*lfactorial(0:NN)
  
  # log-normalising constant using logSumExp formula
   LZN <- LSE_R(logZs)
    
   # loglik
   out <- -ns*LZN + sumdat*log(lambda) -nu*slfact 
   
   return(-out)
}

AMLE functions

######################################################################################################
######################################################################################################
# AMLE
# https://doi.org/10.1214/13-EJS819
######################################################################################################
######################################################################################################



######################################################################################################
# Input variables
######################################################################################################
#--------------------------------------------------------------------------------------------------------
# dat    :   Vector of observations.
# N      :   ABC sample size to be used for the kernel density estimation
# tol    :   Tolerance for the ABC algorithm
# a1     :   Left-end point of the support of the uniform prior distribution on $\lambda$
# b1     :   Right-end point of the support of the uniform prior distribution on $\lambda$
# a2     :   Left-end point of the support of the uniform prior distribution on $\nu$
# b2     :   Right-end point of the support of the uniform prior distribution on $\nu$
#-------------------------------------------------------------------------------------------------------- 

# Metric

dist = function(x,y) return( sqrt( (mean(x) - mean(y) )^2  + ( mean(lfactorial(x)) - mean(lfactorial(y))  )^2  )  )

# AMLE function

AMLE = function(dat,N,tol,a1,b1,a2,b2){
  pp = qq = rep(0,N)
  j=1
#  pb = txtProgressBar(min = 1, max = N, initial = 1) 
  while(j<N+1){
    p = runif(1,a1,b1)
    q = runif(1,a2,b2)
    datABC <- rcmp(n = ns, lambda = p, nu = q)
    if(dist(datABC,dat)<tol) {
      pp[j] = p  
      qq[j] = q 
      j = j+1 
    }
  #  setTxtProgressBar(pb,j)
  }
#  close(pb)
  pp1 = cbind(pp,qq)
  H.scv=Hscv(pp1)
  den = kde(pp1, H=H.scv)
  ind=which(den$estimate==max(den$estimate), arr.ind=TRUE)
  MLE <- c(den$eval.points[[1]][ind[1]],den$eval.points[[2]][ind[2]])
  ABCSamp <- pp1
  out = list( MLE = MLE, ABCSamp = ABCSamp)
  return(out)
}

Example 1

################################################################################
# Data simulation
################################################################################

set.seed(1234)
# Sample size
ns <- 1000
# True parameter values
lambda0 <- 5
nu0 <- 2
# Data simulation
dat <- rcmp(n = ns, lambda = lambda0, nu = nu0)
table(dat)

## dat
##   0   1   2   3   4   5   6 
##  61 290 363 187  78  17   4

# Required values for log-likelihood
sumdat <- sum(dat)
slfact <- sum(lfactorial(dat))

summary(dat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.998   3.000   6.000

var(dat)

## [1] 1.251247

plot(table(dat), xlab = "x", ylab = "Frequency", cex.lab = 1.5, cex.axis = 1.5, 
     main = "Simulated data")

################################################################################
# AMLE of the simulated data (based on N = 1000 ABC simulations)
################################################################################

runAMLE = AMLE(dat = dat, N = 1000, tol = 0.02, a1 = 3, b1 = 7, a2 = 1, b2 = 4)

runAMLE$MLE

## [1] 4.350823 1.820564

# Histogram of the ABC posterior samples
hist(runAMLE$ABCSamp[,1], xlab = expression(lambda), ylab = "Density", probability = TRUE,
     cex.axis = 1.5, cex.lab = 1.5, main = "ABC Posterior", breaks = 20)
points(density(runAMLE$ABCSamp[,1]), type = "l", lwd = 2, col = "blue")
abline(v = lambda0, col = "red", lwd = 2)
box()

hist(runAMLE$ABCSamp[,2], xlab = expression(nu), ylab = "Density", probability = TRUE,
     cex.axis = 1.5, cex.lab = 1.5, main = "ABC Posterior", breaks = 20)
points(density(runAMLE$ABCSamp[,2]), type = "l", lwd = 2, col = "blue")
abline(v = nu0, col = "red", lwd = 2)
box()

# MLE direct implementation

OPT <- nlminb(c(0,0),log_lik, control = list(iter.max = 10000))

MLE <- exp(OPT$par)

MLE

## [1] 4.416575 1.836861

# Comparison with glm.cmp (see estimates of lambda and nu)
mle.cmp <- glm.cmp(formula.lambda = dat~1, formula.nu = dat~1, data = data.frame(dat))
mle.cmp

## CMP coefficients
##               Estimate     SE z-value   p-value
## X:(Intercept)   1.4854 0.0898 16.5466 1.693e-61
## S:(Intercept)   0.6081 0.0498 12.2074 2.839e-34
## --
## Transformed intercept-only parameters
##        Estimate     SE
## lambda   4.4166 0.3965
## nu       1.8369 0.0915
## --
## Chi-squared test for equidispersion
## X^2 = 107.6647, df = 1, p-value = 3.1832e-25
## --
## Elapsed: 0.12 sec   Sample size: 1000   formula interface
## LogLik: -1503.5224   AIC: 3011.0448   BIC: 3020.8603   
## Optimization Method: L-BFGS-B   Converged status: 0
## Message: CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH

Example 2

################################################################################
# Data simulation
################################################################################

set.seed(1234)
# Sample size
ns <- 1000
# True parameter values
lambda0 <- 5
nu0 <- 0.5
# Data simulation
dat <- rcmp(n = ns, lambda = lambda0, nu = nu0)
table(dat)

## dat
##  5  6  7  8  9 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
##  1  1  2  1  5  7  5 14 20 13 17 38 28 43 42 54 60 47 51 62 56 58 56 34 44 41 
## 32 33 34 35 36 37 38 39 40 41 42 43 44 45 48 49 50 51 
## 34 22 21 14 27 17 16  7  7 11  8  6  4  2  1  1  1  1

# Required values for log-likelihood
sumdat <- sum(dat)
slfact <- sum(lfactorial(dat))

summary(dat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   21.00   25.00   25.72   30.00   51.00

var(dat)

## [1] 53.06841

plot(table(dat), xlab = "x", ylab = "Frequency", cex.lab = 1.5, cex.axis = 1.5, 
     main = "Simulated data")

################################################################################
# AMLE of the simulated data (based on N = 1000 ABC simulations)
################################################################################

runAMLE = AMLE(dat = dat, N = 1000, tol = 0.02, a1 = 3, b1 = 7, a2 = 0.41, b2 = 0.55)

runAMLE$MLE

## [1] 4.6653988 0.4771598

# Histogram of the ABC posterior samples
hist(runAMLE$ABCSamp[,1], xlab = expression(lambda), ylab = "Density", probability = TRUE,
     cex.axis = 1.5, cex.lab = 1.5, main = "ABC Posterior", breaks = 20)
points(density(runAMLE$ABCSamp[,1]), type = "l", lwd = 2, col = "blue")
abline(v = lambda0, col = "red", lwd = 2)
box()

hist(runAMLE$ABCSamp[,2], xlab = expression(nu), ylab = "Density", probability = TRUE,
     cex.axis = 1.5, cex.lab = 1.5, main = "ABC Posterior", breaks = 20)
points(density(runAMLE$ABCSamp[,2]), type = "l", lwd = 2, col = "blue")
abline(v = nu0, col = "red", lwd = 2)
box()

# MLE direct implementation

OPT <- nlminb(c(0,0),log_lik, control = list(iter.max = 10000))

MLE <- exp(OPT$par)

MLE

## [1] 4.6136866 0.4741082

# Comparison with glm.cmp (see estimates of lambda and nu)
mle.cmp <- glm.cmp(formula.lambda = dat~1, formula.nu = dat~1, data = data.frame(dat))
mle.cmp

## CMP coefficients
##               Estimate     SE  z-value    p-value
## X:(Intercept)   1.5291 0.0697  21.9318 1.293e-106
## S:(Intercept)  -0.7463 0.0449 -16.6131  5.599e-62
## --
## Transformed intercept-only parameters
##        Estimate     SE
## lambda   4.6141 0.3217
## nu       0.4741 0.0213
## --
## Chi-squared test for equidispersion
## X^2 = 345.8788, df = 1, p-value = 3.3465e-77
## --
## Elapsed: 1.48 sec   Sample size: 1000   formula interface
## LogLik: -3397.1539   AIC: 6798.3079   BIC: 6808.1234   
## Optimization Method: L-BFGS-B   Converged status: 0
## Message: CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH

Benson, A., and N. Friel. 2021. “Bayesian Inference, Model Selection and Likelihood Estimation Using Fast Rejection Sampling: The Conway-Maxwell-Poisson Distribution.” Bayesian Analysis 16 (3): 905–31.

Chatla, Suneel Babu, and Galit Shmueli. 2018. “Efficient Estimation of COM–Poisson Regression and a Generalized Additive Model.” Computational Statistics & Data Analysis 121: 71–88.

Kadane, J. B., G. Shmueli, T. P. Minka, S. Borle, and P. Boatwright. 2006. “Conjugate Analysis of the Conway-Maxwell-Poisson Distribution.” Bayesian Analysis 1 (2): 363–74.

Matsubara, T., J. Knoblauch, F. X. Briol, and C. J. Oates. 2023. “Generalized Bayesian Inference for Discrete Intractable Likelihood.” Journal of the American Statistical Association, 1–11.

Rubio, F. J., and A. M. Johansen. 2013. “A Simple Approach to Maximum Intractable Likelihood Estimation.” Electronic Journal of Statistics 7: 1632–54.

Sellers, K. F., and G. Shmueli. 2010. “A Flexible Regression Model for Count Data.” The Annals of Applied Statistics, 943–61.

Sellers, Kimberly F. 2023. The Conway–Maxwell–Poisson Distribution. Vol. 8. Cambridge University Press.

Sellers, K., T. Lotze, and A. Raim. 2019. “Package ‘COMPoissonReg’.” Package “COMPoissonReg.

Shmueli, G., T. P. Minka, J. B. Kadane, S. Borle, and P. Boatwright. 2005. “A Useful Distribution for Fitting Discrete Data: Revival of the Conway–Maxwell–Poisson Distribution.” Journal of the Royal Statistical Society Series C: Applied Statistics 54 (1): 127–42.

(Approximate) Maximum Likelihood Estimation for the Conway-Maxwell-Poisson Distribution

F. Javier Rubio

02 January, 2024

The Conway–Maxwell–Poisson (CMP) distribution

Likelihood function and Sufficient Statistics

Approximate maximum likelihood estimator (AMLE)

R code

R code

Required packages

MLE functions

AMLE functions

Example 1

Example 2