The Conway–Maxwell–Poisson (CMP) distribution is a discrete probability distribution with two parameters \(\lambda >0\) and \(\nu \geq 0\), and probability mass function
\[ f_{s}(x)= \dfrac{\lambda^x}{(x!)^{\nu}} \dfrac{1}{Z(\lambda,\nu)},\, \] where \(Z(\lambda,\nu)=\sum _{{j=0}}^{\infty }{\frac {\lambda ^{j}}{(j!)^{\nu }}}\) is the normalising constant, and \(x \in \{0,1,2,3,\dots\}\). The parameter space is defined by \(\lambda ,\nu >0\), and \(0<\lambda<1\) , \(\nu =0\). One of the appealing properties of this distribution is that it can account for overdispersion and underdispersion for different values of the parameter \(\nu\).
See Shmueli et al. (2005) and Kimberly F. Sellers (2023) for more details about this distribution.
The likelihood function of \((\lambda, \nu)\), for a sample \({\bf x} = (x_1,\dots,x_n)^{\top}\) is given by:
\[ \begin{align} L(\lambda, \nu \mid {\bf x}) &= \prod_{i=1}^n \dfrac{\lambda^{x_i}}{(x_i!)^{\nu}} \dfrac{1}{Z(\lambda,\nu)}\\ &= Z(\lambda,\nu)^{-n} \lambda^{\sum_{i}^n x_i}\left( \prod_{i=1}^n x_i! \right)^{-\nu}. \end{align} \]
This expression implies that \(\left(\prod_{i=1}^n x_i!,\sum_{i}^n x_i\right)\) are sufficient statistics for \((\lambda, \nu)\) (see Shmueli et al. (2005) and Kadane et al. (2006)). Consequently, \(\left(\dfrac{1}{n}\sum_{i=1}^n \log(x_i!), \dfrac{1}{n}\sum_{i}^n x_i\right)\) are also sufficient for \((\lambda, \nu)\).
The log-likelihood function is:
\[ \begin{align} \ell(\lambda, \nu \mid {\bf x}) &= - n \log Z(\lambda,\nu) + n\bar{\mathbf{x}}\log( \lambda)- {\nu}\sum_{i=1}^n \log \left(x_i! \right). \end{align} \]
One of the challenges in estimating the parameters \((\lambda, \nu)\) is that the normalising constant \(Z(\lambda,\nu)\) is not available in closed-form, which complicates the use of maximum likelihood estimation (Shmueli et al. (2005)). However, approximating this normalising constant is feasible by using a large number of terms in the corresponding series
\[ Z(\lambda,\nu) \approx \sum _{{j=0}}^{M}{\frac {\lambda ^{j}}{(j!)^{\nu }}}, \quad \text{ with } \, M \, \text{ large}. \]
We also introduce a new approximation which, although trivial, allows for borrowing other numerical methods to provide a numerically stable/tractable approximation of this normalising constant. Note that,
\[ \begin{split} Z(\lambda,\nu) &\approx \sum _{{j=0}}^{M}{\frac {\lambda ^{j}}{(j!)^{\nu }}}, \\ & = \sum _{{j=0}}^{M} \exp \left[ j \log(\lambda) - \nu \log(j!) \right ] \end{split} \]
Then, the term \(\log
Z(\lambda,\nu)\), which appears in the log-likelihood function,
can be seen as the logSumExp function
applied to the set of values \(j \log(\lambda)
- \nu \log(j!)\), \(j=0,\dots,M\). As shown later, this allows
for a fast and stable approximation of the log-likelihood function.
Numerical methods based on Fisher scoring have been proposed (K. F. Sellers and Shmueli (2010), Chatla and Shmueli (2018), K. Sellers, Lotze, and Raim (2019)) to approximate the maximum likelihood estimators of the parameters \((\lambda, \nu)\). Recently, Bayesian methods (Benson and Friel (2021)) and Generalised Bayesian methods (Matsubara et al. (2023)) have also been employed to estimate the parameters \((\lambda, \nu)\). In the next section, we discuss a method introduced by Rubio and Johansen (2013) to approximate the maximum likelihood estimator for models with intractable likelihoods.
The AMLE, introduced by Rubio and Johansen (2013), consists of using Approximate Bayesian Computation (ABC) methods to approximate the likelihood function in cases where the likelihood is intractable and/or too computationally expensive to evaluate, and maximising this approximated likelihood function. The basic ABC algorithm is shown below:
Simulate \({\bf\theta}^{\prime}\) from the prior distribution \(\pi(\cdot)\).
Generate \({\bf y}\) from the model \(f(\cdot\vert{\bf\theta}^{\prime})\).
Accept \({\bf\theta}^{\prime}\) with probability \(\propto K_{\varepsilon}({\bf x} \vert {\bf y})\) otherwise return to step 1.
where \(K_{\varepsilon}({\bf x}\vert {\bf y})\) is a normalised Markov kernel. The ABC algorithm produces a sample from the following approximation to the posterior distribution:
\[ \widehat{\pi}_{\varepsilon}({\bf\theta}\vert{\bf x}) = \dfrac{\widehat{f}_{\varepsilon}({\bf x}\vert{\bf\theta})\pi({\bf\theta})}{\int_{\Theta}\widehat{f}_{\varepsilon}({\bf x}\vert{\bf t})\pi({\bf t})d{\bf t}}, \]
where
\[ \widehat{f}_{\varepsilon}({\bf x}\vert{\bf\theta}) = \int_{{\mathbb R}^{q\times n}}K_{\varepsilon}({\bf x}\vert {\bf y}) f({\bf y}\vert{\bf\theta}) d{\bf{y}} , \]
is an approximation of the likelihood function. Therefore, if we use a uniform prior on a compact set that contains the Maximum Likelihood Estimator (MLE), and the Maximum a Posteriori (MAP, the maximum of the posterior distribution) coincide. Thus, it is possible to approximate the MLE by using ABC methods to generate an approximate sample from the posterior and then approximating the MAP using this sample. This can be done by using the ABC algorithm in the first step of the AMLE algorithm, as shown below:
The AMLE Algorithm
Obtain a sample \({\bf \theta}^*_{\varepsilon}=({\bf \theta}^*_{\varepsilon,1},...,{\bf \theta}^*_{\varepsilon,k})\) from \(\widehat{\pi}_{\varepsilon}({\bf \theta}\vert{\bf x})\).
Using the sample \({\bf \theta}^*_{\varepsilon}\) construct a nonparametric estimator \(\widehat{\pi}_{k,\varepsilon}({\bf \theta}\vert{\bf x})\) of the density \(\widehat{\pi}_{\varepsilon}({\bf \theta}\vert{\bf x})\).
Calculate the maximum of \(\widehat{\pi}_{k,\varepsilon}({\bf \theta}\vert{\bf x})\), \(\tilde{{\bf \theta}}_{\varepsilon}\). This is an approximation of the MLE \(\widehat{{\bf \theta}}\).
Asymptotic properties of the AMLE are studied in Rubio and Johansen (2013).
See also: Approximate Maximum Likelihood Estimation (AMLE)
Disclaimer: this is simply a tutorial on the use of AMLE for the CMP distribution with pedagogical purposes, not a formal comparison of the different estimation methods.
The following R code illustrates how to estimate the parameters \((\lambda,\nu)\) using the AMLE method for a
simulated data set. A comparison with the results obtained with the
command glm.cmp from the R package
COMPoissonReg is presented.
rm(list=ls())
# Required packages
library(KernSmooth)
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
library(mvtnorm)
library(misc3d)
library(ks)
library(COMPoissonReg)
## Loading required package: Rcpp
## Loading required package: numDeriv
#--------------------------------------------------------------------
# LogSumExp Function: Recursive formula
# https://rpubs.com/FJRubio/LSE
#--------------------------------------------------------------------
LSE_R <- function(vec){
n.vec <- length(vec)
vec <- sort(vec, decreasing = TRUE)
Lk <- vec[1]
for (k in 1:(n.vec-1)) {
Lk <- max(vec[k+1], Lk) + log1p(exp(-abs(vec[k+1] - Lk)))
}
return(Lk)
}
# minus log-likelihood function
# Reparameterised using a log link
NN <- 100 # number of terms in the series for the normalising constant
log_lik = function(par){
lambda <- exp(par[1])
nu <- exp(par[2])
logZs <- (0:NN)*log(lambda) - nu*lfactorial(0:NN)
# log-normalising constant using logSumExp formula
LZN <- LSE_R(logZs)
# loglik
out <- -ns*LZN + sumdat*log(lambda) -nu*slfact
return(-out)
}
######################################################################################################
######################################################################################################
# AMLE
# https://doi.org/10.1214/13-EJS819
######################################################################################################
######################################################################################################
######################################################################################################
# Input variables
######################################################################################################
#--------------------------------------------------------------------------------------------------------
# dat : Vector of observations.
# N : ABC sample size to be used for the kernel density estimation
# tol : Tolerance for the ABC algorithm
# a1 : Left-end point of the support of the uniform prior distribution on $\lambda$
# b1 : Right-end point of the support of the uniform prior distribution on $\lambda$
# a2 : Left-end point of the support of the uniform prior distribution on $\nu$
# b2 : Right-end point of the support of the uniform prior distribution on $\nu$
#--------------------------------------------------------------------------------------------------------
# Metric
dist = function(x,y) return( sqrt( (mean(x) - mean(y) )^2 + ( mean(lfactorial(x)) - mean(lfactorial(y)) )^2 ) )
# AMLE function
AMLE = function(dat,N,tol,a1,b1,a2,b2){
pp = qq = rep(0,N)
j=1
# pb = txtProgressBar(min = 1, max = N, initial = 1)
while(j<N+1){
p = runif(1,a1,b1)
q = runif(1,a2,b2)
datABC <- rcmp(n = ns, lambda = p, nu = q)
if(dist(datABC,dat)<tol) {
pp[j] = p
qq[j] = q
j = j+1
}
# setTxtProgressBar(pb,j)
}
# close(pb)
pp1 = cbind(pp,qq)
H.scv=Hscv(pp1)
den = kde(pp1, H=H.scv)
ind=which(den$estimate==max(den$estimate), arr.ind=TRUE)
MLE <- c(den$eval.points[[1]][ind[1]],den$eval.points[[2]][ind[2]])
ABCSamp <- pp1
out = list( MLE = MLE, ABCSamp = ABCSamp)
return(out)
}
################################################################################
# Data simulation
################################################################################
set.seed(1234)
# Sample size
ns <- 1000
# True parameter values
lambda0 <- 5
nu0 <- 2
# Data simulation
dat <- rcmp(n = ns, lambda = lambda0, nu = nu0)
table(dat)
## dat
## 0 1 2 3 4 5 6
## 61 290 363 187 78 17 4
# Required values for log-likelihood
sumdat <- sum(dat)
slfact <- sum(lfactorial(dat))
summary(dat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.998 3.000 6.000
var(dat)
## [1] 1.251247
plot(table(dat), xlab = "x", ylab = "Frequency", cex.lab = 1.5, cex.axis = 1.5,
main = "Simulated data")
################################################################################
# AMLE of the simulated data (based on N = 1000 ABC simulations)
################################################################################
runAMLE = AMLE(dat = dat, N = 1000, tol = 0.02, a1 = 3, b1 = 7, a2 = 1, b2 = 4)
runAMLE$MLE
## [1] 4.350823 1.820564
# Histogram of the ABC posterior samples
hist(runAMLE$ABCSamp[,1], xlab = expression(lambda), ylab = "Density", probability = TRUE,
cex.axis = 1.5, cex.lab = 1.5, main = "ABC Posterior", breaks = 20)
points(density(runAMLE$ABCSamp[,1]), type = "l", lwd = 2, col = "blue")
abline(v = lambda0, col = "red", lwd = 2)
box()
hist(runAMLE$ABCSamp[,2], xlab = expression(nu), ylab = "Density", probability = TRUE,
cex.axis = 1.5, cex.lab = 1.5, main = "ABC Posterior", breaks = 20)
points(density(runAMLE$ABCSamp[,2]), type = "l", lwd = 2, col = "blue")
abline(v = nu0, col = "red", lwd = 2)
box()
# MLE direct implementation
OPT <- nlminb(c(0,0),log_lik, control = list(iter.max = 10000))
MLE <- exp(OPT$par)
MLE
## [1] 4.416575 1.836861
# Comparison with glm.cmp (see estimates of lambda and nu)
mle.cmp <- glm.cmp(formula.lambda = dat~1, formula.nu = dat~1, data = data.frame(dat))
mle.cmp
## CMP coefficients
## Estimate SE z-value p-value
## X:(Intercept) 1.4854 0.0898 16.5466 1.693e-61
## S:(Intercept) 0.6081 0.0498 12.2074 2.839e-34
## --
## Transformed intercept-only parameters
## Estimate SE
## lambda 4.4166 0.3965
## nu 1.8369 0.0915
## --
## Chi-squared test for equidispersion
## X^2 = 107.6647, df = 1, p-value = 3.1832e-25
## --
## Elapsed: 0.12 sec Sample size: 1000 formula interface
## LogLik: -1503.5224 AIC: 3011.0448 BIC: 3020.8603
## Optimization Method: L-BFGS-B Converged status: 0
## Message: CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH
################################################################################
# Data simulation
################################################################################
set.seed(1234)
# Sample size
ns <- 1000
# True parameter values
lambda0 <- 5
nu0 <- 0.5
# Data simulation
dat <- rcmp(n = ns, lambda = lambda0, nu = nu0)
table(dat)
## dat
## 5 6 7 8 9 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
## 1 1 2 1 5 7 5 14 20 13 17 38 28 43 42 54 60 47 51 62 56 58 56 34 44 41
## 32 33 34 35 36 37 38 39 40 41 42 43 44 45 48 49 50 51
## 34 22 21 14 27 17 16 7 7 11 8 6 4 2 1 1 1 1
# Required values for log-likelihood
sumdat <- sum(dat)
slfact <- sum(lfactorial(dat))
summary(dat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 21.00 25.00 25.72 30.00 51.00
var(dat)
## [1] 53.06841
plot(table(dat), xlab = "x", ylab = "Frequency", cex.lab = 1.5, cex.axis = 1.5,
main = "Simulated data")
################################################################################
# AMLE of the simulated data (based on N = 1000 ABC simulations)
################################################################################
runAMLE = AMLE(dat = dat, N = 1000, tol = 0.02, a1 = 3, b1 = 7, a2 = 0.41, b2 = 0.55)
runAMLE$MLE
## [1] 4.6653988 0.4771598
# Histogram of the ABC posterior samples
hist(runAMLE$ABCSamp[,1], xlab = expression(lambda), ylab = "Density", probability = TRUE,
cex.axis = 1.5, cex.lab = 1.5, main = "ABC Posterior", breaks = 20)
points(density(runAMLE$ABCSamp[,1]), type = "l", lwd = 2, col = "blue")
abline(v = lambda0, col = "red", lwd = 2)
box()
hist(runAMLE$ABCSamp[,2], xlab = expression(nu), ylab = "Density", probability = TRUE,
cex.axis = 1.5, cex.lab = 1.5, main = "ABC Posterior", breaks = 20)
points(density(runAMLE$ABCSamp[,2]), type = "l", lwd = 2, col = "blue")
abline(v = nu0, col = "red", lwd = 2)
box()
# MLE direct implementation
OPT <- nlminb(c(0,0),log_lik, control = list(iter.max = 10000))
MLE <- exp(OPT$par)
MLE
## [1] 4.6136866 0.4741082
# Comparison with glm.cmp (see estimates of lambda and nu)
mle.cmp <- glm.cmp(formula.lambda = dat~1, formula.nu = dat~1, data = data.frame(dat))
mle.cmp
## CMP coefficients
## Estimate SE z-value p-value
## X:(Intercept) 1.5291 0.0697 21.9318 1.293e-106
## S:(Intercept) -0.7463 0.0449 -16.6131 5.599e-62
## --
## Transformed intercept-only parameters
## Estimate SE
## lambda 4.6141 0.3217
## nu 0.4741 0.0213
## --
## Chi-squared test for equidispersion
## X^2 = 345.8788, df = 1, p-value = 3.3465e-77
## --
## Elapsed: 1.48 sec Sample size: 1000 formula interface
## LogLik: -3397.1539 AIC: 6798.3079 BIC: 6808.1234
## Optimization Method: L-BFGS-B Converged status: 0
## Message: CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH