This article presents methodology and simulations for model fitting, including exploratory plots, MSE evaluation, and Chi-square goodness-of-fit testing. It provides detailed comparisons between discrete distributions to help select the best-fitting model.

We observe the spore counts, which start at zero, and their corresponding frequencies:

| Spore | Frequency |
|------:|----------:|
|     0 |       131 |
|     1 |        55 |
|     2 |        21 |
|     3 |        14 |
|     4 |         6 |
|     5 |         6 |
|     6 |         2 |
|     7 |         0 |
|     8 |         0 |
|     9 |         0 |
|    10 |         0 |
|    11 |         2 |
|    12 |         1 |
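For reference, the frequency table can be entered as a small data frame in R, mirroring the construction used in the code appendix:

# spore counts 0-12 and their observed frequencies
data1 <- data.frame(Spore = 0:12,
                    Frequency = c(131, 55, 21, 14, 6, 6, 2, 0, 0, 0, 0, 2, 1))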

The principle behind fitting distributions to data is to find the type of distribution (Normal, Binomial, Poisson, etc.) and the parameter values (mean, variance, etc.) that give the highest probability of producing the observed data. The given data could plausibly be fitted by discrete distributions such as the Binomial, Poisson, Geometric, and Negative Binomial (the latter two in the failure-count form, so that their support also starts at 0), because the observed spore counts start at 0.
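The summary printed below comes from the sample moments of the grouped data; a minimal sketch, mirroring the appendix code:

# sample moments from the grouped frequency table
N     <- sum(data1$Frequency)                 # total number of observations (238)
RF    <- data1$Frequency / N                  # relative frequencies
MEAN  <- sum(RF * data1$Spore)                # sample mean
VAR   <- (sum(data1$Spore^2 * data1$Frequency) - N * MEAN^2) / (N - 1)  # sample variance
DISP  <- VAR / MEAN                           # dispersion index; > 1 indicates overdispersion
THETA <- 1 / DISP                             # moment estimate of the NB probability parameter
R     <- MEAN * THETA / (1 - THETA)           # moment estimate of the NB size, since MEAN = R(1 - THETA)/THETA
cbind(MEAN, VAR, DISP, THETA, R)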

##          MEAN      VAR     DISP     THETA         R
## [1,] 1.004202 3.075932 3.063062 0.3264707 0.4867531
# estimating the size parameter r by maximum likelihood (fitdistr from MASS)
fitdistr(origin,'Negative Binomial')
##      size         mu    
##   0.5823039   1.0042043 
##  (0.1087772) (0.1072183)

Pmme=P[which.max(likelihood)];Pmme
## [1] 0.3264073
Pmle=P[which.max(likelihood2)];Pmle
## [1] 0.3669874
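Pmme and Pmle above are the grid maximizers of the Negative Binomial log-likelihood in p, with the size parameter fixed at the moment estimate R and at the fitdistr estimate r, respectively. A minimal sketch of the same grid search, written with dnbinom's log density rather than the factorial form used in the appendix (loglik_p is an illustrative helper; R is the moment estimate from above):

library(MASS)                                 # for fitdistr()

# profile log-likelihood of p for a fixed size parameter
loglik_p <- function(p, size, x) sum(dnbinom(x, size = size, prob = p, log = TRUE))

origin <- rep(data1$Spore, data1$Frequency)   # raw counts reconstructed from the table
P      <- seq(0.01, 0.99, length.out = 5000)  # grid of candidate p values
r      <- fitdistr(origin, 'Negative Binomial')$estimate['size']
Pmme   <- P[which.max(sapply(P, loglik_p, size = R, x = origin))]  # size fixed at the moment estimate
Pmle   <- P[which.max(sapply(P, loglik_p, size = r, x = origin))]  # size fixed at the fitdistr MLE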

To find the parameters of the “best-fitting” distribution for these data, we compute the corresponding estimates under each candidate distribution; the resulting expected counts are shown below (the calculation is sketched after the table):

| Spore | Frequency |      RF |    E_poi |    E_bin |    E_geom |   E_nbin1 |   E_nbin2 |
|------:|----------:|--------:|---------:|---------:|----------:|----------:|----------:|
|     0 |       131 | 0.55042 | 87.18820 | 83.69363 | 118.75052 | 138.01908 | 132.77148 |
|     1 |        55 | 0.23109 | 87.55454 | 91.08095 |  59.49974 |  45.24852 |  48.93660 |
|     2 |        21 | 0.08824 | 43.96121 | 45.74784 |  29.81224 |  22.65529 |  24.50604 |
|     3 |        14 | 0.05882 | 14.71531 | 14.04216 |  14.93737 |  12.64846 |  13.35176 |
|     4 |         6 | 0.02521 |  3.69428 |  2.93877 |   7.48434 |   7.42600 |   7.56868 |
|     5 |         6 | 0.02521 |  0.74196 |  0.44282 |   3.75002 |   4.48822 |   4.39050 |
|     6 |         2 | 0.00840 |  0.12418 |  0.04943 |   1.87894 |   2.76436 |   2.58556 |
|     7 |         0 | 0.00000 |  0.01781 |  0.00414 |   0.94144 |   1.72536 |   1.53891 |
|     8 |         0 | 0.00000 |  0.00224 |  0.00026 |   0.47171 |   1.08753 |   0.92322 |
|     9 |         0 | 0.00000 |  0.00025 |  0.00001 |   0.23635 |   0.69071 |   0.55724 |
|    10 |         0 | 0.00000 |  0.00003 |  0.00000 |   0.11842 |   0.44134 |   0.33798 |
|    11 |         2 | 0.00840 |  0.00000 |  0.00000 |   0.05933 |   0.28338 |   0.20581 |
|    12 |         1 | 0.00420 |  0.00000 |  0.00000 |   0.02973 |   0.18270 |   0.12573 |
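The expected counts in the table are N times the fitted probability mass at each spore count, with each distribution's parameters matched to the sample mean (and, for the Negative Binomial, the size estimates R and r from above). A minimal sketch, following the appendix code:

# expected counts under each candidate distribution
x       <- data1$Spore
E_poi   <- N * dpois(x, lambda = MEAN)                               # Poisson, mean matched
E_bin   <- N * dbinom(x, size = length(x), prob = MEAN / length(x))  # Binomial with n = 13 trials, mean matched
E_geom  <- N * dgeom(x, prob = 1 / (1 + MEAN))                       # Geometric (NB with size = 1), mean matched
E_nbin1 <- N * dnbinom(x, size = R, mu = MEAN)                       # Negative Binomial, size from the moment estimate
E_nbin2 <- N * dnbinom(x, size = r, mu = MEAN)                       # Negative Binomial, size from fitdistr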

The data are overdispersed (the variance is roughly three times the mean), which suggests that the Negative Binomial or Geometric distribution might be a good candidate for the fit. We will verify this later with side-by-side barplots and a Chi-square goodness-of-fit test.

For better comparison, we then use the sample mean and sample variance to fit the Binomial, Poisson, Geometric, and Negative Binomial distributions; the patterns show clearly in the side-by-side barplots.
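As an illustration, a minimal sketch of one of the side-by-side barplots (observed counts against the expected Negative Binomial counts), in the same style as the appendix code:

# observed vs. expected counts, plotted side by side
barplot(rbind(Observed = data1$Frequency, Expected = E_nbin1), beside = TRUE,
        col = c("aquamarine3", "coral"), names.arg = data1$Spore)
legend("topright", c("Observed", "Expected Negative Binomial (mme)"),
       pch = 15, col = c("aquamarine3", "coral"), bty = "n")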

The large differences in the first three bars of the Poisson and Binomial plots indicate that these two may not fit well. In contrast, the Negative Binomial and Geometric fits largely remove the height differences, especially for the first bar; these findings support our earlier conclusion.

We can then perform a Chi-square goodness-of-fit test to verify this. First, consider the assumptions of the test. The Chi-square (X^2) statistic measures how well the expected frequencies of the fitted distribution agree with the observed frequencies of the data (a sketch of the computation follows the list of assumptions). The Chi-square test makes the following assumptions:

  1. The observed data consists of a random sample of n independent data points.
  2. The measurement scale can be nominal (i.e. non-numeric) or numerical.
  3. The n data points can be arranged into histogram form with N contiguous classes that cover the entire possible range of the variable.
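As a sketch of the computation: the statistic is the sum over classes of (observed − expected)^2 / expected. The helper below is illustrative only; note that the chisq.test() calls reported later in this report compare the two rows of a table (observed vs. expected), which generally gives a different statistic than this classical goodness-of-fit form.

# classical goodness-of-fit statistic: sum of (observed - expected)^2 / expected
gof_chisq <- function(obs, exp, n_est = 0) {
  stat <- sum((obs - exp)^2 / exp)
  df   <- length(obs) - 1 - n_est          # one df lost per parameter estimated from the data
  p    <- pchisq(stat, df = df, lower.tail = FALSE)
  c(statistic = stat, df = df, p.value = p)
}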

In order to carry out the test, we pool the spore counts 6 through 12 into a single class, so that the observed count in every cell is at least 5 (a minimal sketch of the pooling follows). We then recompute the estimates, expected counts, and side-by-side barplots:
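A minimal sketch of the pooling step (the appendix builds the same grouped table with rbind); the pooled 6-12 class has observed frequency 2 + 0 + 0 + 0 + 0 + 2 + 1 = 5:

# pool the spore counts 6 through 12 into a single class
Freq_grouped <- c(data1$Frequency[1:6], sum(data1$Frequency[7:13]))
names(Freq_grouped) <- c(0:5, "6-12")
Freq_grouped   # 131 55 21 14 6 6 5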

After pooling, the updated moment summary (MEAN, VAR, DISP, THETA, R) is:

## [1] 1.0000000 2.8607595 2.8607595 0.3495575 0.5374150

the fitdistr estimates of the Negative Binomial size and mu are:

##      size         mu    
##   0.5913633   1.0000003 
##  (0.1114659) (0.1063331)

and the updated grid estimates Pmme and Pmle are:

## [1] 0.3495399
## [1] 0.3716923
| Spore | Frequency |        RF |    E_poi |    E_bin |    E_geom |   E_nbin1 |   E_nbin2 |
|------:|----------:|----------:|---------:|---------:|----------:|----------:|----------:|
|     0 |       131 | 0.5504202 | 87.55531 | 80.90017 | 119.00000 | 135.28731 | 132.53819 |
|     1 |        55 | 0.2310924 | 87.55531 | 94.38353 |  59.50000 |  47.29070 |  49.25225 |
|     2 |        21 | 0.0882353 | 43.77765 | 47.19177 |  29.75000 |  23.64535 |  24.62612 |
|     3 |        14 | 0.0588235 | 14.59255 | 13.10882 |  14.87500 |  13.00843 |  13.36699 |
|     4 |         6 | 0.0252101 |  3.64814 |  2.18480 |   7.43750 |   7.48272 |   7.54160 |
|     5 |         6 | 0.0252101 |  0.72963 |  0.21848 |   3.71875 |   4.41679 |   4.35177 |
|  6-12 |         5 | 0.0210084 |  0.00024 |  0.00000 |   0.23242 |   0.60901 |   0.53936 |

## Warning in chisq.test(rbind(Frequency, E_poi)): Chi-squared approximation
## may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  rbind(Frequency, E_poi)
## X-squared = 33.792, df = 6, p-value = 7.378e-06
## Warning in chisq.test(rbind(Frequency, E_bin)): Chi-squared approximation
## may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  rbind(Frequency, E_bin)
## X-squared = 44.471, df = 6, p-value = 5.961e-08
## Warning in chisq.test(rbind(Frequency, E_geom)): Chi-squared approximation
## may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  rbind(Frequency, E_geom)
## X-squared = 7.296, df = 6, p-value = 0.2943
## Warning in chisq.test(rbind(Frequency, E_nbin1)): Chi-squared approximation
## may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  rbind(Frequency, E_nbin1)
## X-squared = 4.6018, df = 6, p-value = 0.5958
## Warning in chisq.test(rbind(Frequency, E_nbin2)): Chi-squared approximation
## may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  rbind(Frequency, E_nbin2)
## X-squared = 4.5882, df = 6, p-value = 0.5976

##                       Poisson Binomial Geometric Negative Binomial(mme)
## sum of residuals       0.1412   0.0124    3.4863                 6.2597
## Chi-square Statistic  33.7923  44.4712    7.2960                 4.6018
## p-value                0.0000   0.0000    0.2943                 0.5958
## MSE                  503.5382 688.6914   38.7969                15.6830
##                      Negative Binomial(mle)
## sum of residuals                     5.7837
## Chi-square Statistic                 4.5882
## p-value                              0.5976
## MSE                                 10.5632

The large p-value for the Negative Binomial distribution (0.596 for the MME fit, 0.598 for the MLE fit) means we fail to reject the null hypothesis that the data are consistent with a Negative Binomial distribution.

Based on the exploration of the properties of these four distributions, the side-by-side barplots, the MSE values, and the Chi-square goodness-of-fit tests, we conclude that the Negative Binomial distribution is the best candidate for this data set.

# ALL THE CODE
library(ggplot2)
library(knitr)
library(MASS)

data1=data.frame(c(0:12),c(131,55,21,14,6,6,2,0,0,0,0,2,1))
names(data1) <- c('Spore', 'Frequency')
kable(data1)

attach(data1)
N <- sum(Frequency)
RF <- Frequency/N
DF <- data.frame(data1, round(RF,5))
MEAN <- sum(RF*Spore)
VAR <- (sum(Spore^2*Frequency) - N*MEAN^2)/(N-1)#VAR can also be obtained by (MEAN+MEAN^2/R)
DISP <- VAR/MEAN # over dispersion
THETA <- 1/DISP
R <- MEAN*THETA/(1-THETA) # MEAN = R(1-THETA)/THETA
cbind(MEAN,VAR,DISP,THETA,R)

# grid search for p, with the size parameter fixed at the moment estimate R (MME)
origin = c(rep(0,131),rep(1,55),rep(2,21),rep(3,14),rep(4,6),rep(5,6),rep(6,2),11,11,12)
P=seq(0.01,0.99,length=5000)
likelihood = NULL
for (i in 1:length(P)){
likelihood[i] = N*R*log(P[i]) - N*log(factorial(R-1)) + 
  sum(origin*log(1-P[i]) + log(factorial(origin+R-1)) -log(factorial(origin)))}
# estimating the size parameter r by maximum likelihood (fitdistr), then grid search for p with size fixed at r
fitdistr(origin,'Negative Binomial')
r = fitdistr(origin,'Negative Binomial')$estimate[1]
likelihood2 = NULL
for (i in 1:length(P)){
likelihood2[i] = N*r*log(P[i]) - N*log(factorial(r-1)) + 
  sum(origin*log(1-P[i]) + log(factorial(origin+r-1)) -log(factorial(origin)))}
par(mfrow=c(1,2))
plot(P, likelihood, pch=20)
plot(P, likelihood2, pch=20)
P1=P[which.max(likelihood)];P1
P2=P[which.max(likelihood2)];P2

# expected counts under each candidate distribution, using the sample mean, variance and R
x = Spore
E_poi = round(N * dpois(x, lambda=MEAN),5)
E_bin = round(N * dbinom(x, length(x), prob = MEAN/length(x)),5)
E_geom = round(N * dgeom(x, 1/(1+MEAN)),5) # MEAN = R(1-THETA)/THETA, for geometric, R = 1
E_nbin1 = round(N * dnbinom(x, size = R, mu = MEAN),5)
E_nbin2 = round(N * dnbinom(x, size = r, mu = MEAN),5)
DF = data.frame(DF, E_poi, E_bin, E_geom, E_nbin1,E_nbin2)
kable(DF)

par(mfrow=c(2,2))
barplot(matrix(c(Frequency,E_poi),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=x)
legend("topright", c("Observed","Expected Poisson"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

barplot(matrix(c(Frequency,E_bin),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=x)
legend("topright", c("Observed","Expected Binomial"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

barplot(matrix(c(Frequency,E_geom),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=x)
legend("topright", c("Observed","Expected Geometric"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

par(mfrow=c(1,2))
barplot(matrix(c(Frequency,E_nbin1),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=x)
legend("topright", c("Observed","Expected Negative Binomial using mme"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

barplot(matrix(c(Frequency,E_nbin2),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=x)
legend("topright", c("Observed","Expected Negative Binomial using mle"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

# pool spore counts 6-12 into a single class (observed frequency 2+0+0+0+0+2+1 = 5)
data1 = rbind(data1[1:6,], c('6-12',5))
attach(data1)
Frequency = as.numeric(Frequency) # rbind with a character entry coerced Frequency to character
Spore1 = c(0:5,9) # the pooled 6-12 class is represented by 9, its midpoint
N = sum(Frequency)
RF = Frequency/N
DF = data.frame(data1, RF)
MEAN = sum(RF*Spore1) 
VAR = (sum(Spore1^2*Frequency) - N*MEAN^2)/(N-1)
DISP= VAR/MEAN # over dispersion
THETA = 1/DISP
R = MEAN*THETA/(1-THETA) # MEAN = R(1-THETA)/THETA
c(MEAN,VAR,DISP,THETA,R)

origin = c(rep(0,131),rep(1,55),rep(2,21),rep(3,14),rep(4,6),rep(5,6),rep(9,5))
P=seq(0.01,0.99,length=5000)
likelihood = NULL
for (i in 1:length(P)){
likelihood[i] = N*R*log(P[i]) - N*log(factorial(R-1)) + 
  sum(origin*log(1-P[i]) + log(factorial(origin+R-1)) -log(factorial(origin)))}
# estimating the size parameter r by maximum likelihood (fitdistr), then grid search for p with size fixed at r
fitdistr(origin,'Negative Binomial')
r = fitdistr(origin,'Negative Binomial')$estimate[1]
likelihood2 = NULL
for (i in 1:length(P)){
likelihood2[i] = N*r*log(P[i]) - N*log(factorial(r-1)) + 
  sum(origin*log(1-P[i]) + log(factorial(origin+r-1)) -log(factorial(origin)))}
par(mfrow=c(1,2))
plot(P, likelihood, pch=20)
plot(P, likelihood2, pch=20)
P1=P[which.max(likelihood)];P1
P2=P[which.max(likelihood2)];P2

# expected counts under each candidate distribution, using the updated sample mean, variance and R
x = Spore1
E_poi = round(N * dpois(x, lambda=MEAN),5)
E_bin = round(N * dbinom(x, length(x), prob = MEAN/length(x)),5)
E_geom = round(N * dgeom(x, 1/(1+MEAN)),5) # MEAN = R(1-THETA)/THETA, for geometric, R = 1
E_nbin1 = round(N * dnbinom(x, size = R, mu = MEAN),5)
E_nbin2 = round(N * dnbinom(x, size = r, mu = MEAN),5)
DF = data.frame(DF, E_poi, E_bin, E_geom, E_nbin1, E_nbin2)
kable(DF)

par(mfrow=c(2,2))
barplot(matrix(c(Frequency,E_poi),nr=2, byrow = TRUE), beside=T, 
        col = c("aquamarine3","coral"), 
        names.arg = c(0:5, '6-12'))
legend("topright", c("Observed","Expected Poisson"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

barplot(matrix(c(Frequency,E_bin),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=c(0:5, '6-12'))
legend("topright", c("Observed","Expected Binomial"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")


barplot(matrix(c(Frequency,E_geom),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=c(0:5, '6-12'))
legend("topright", c("Observed","Expected Geometric"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

par(mfrow=c(1,2))

barplot(matrix(c(Frequency,E_nbin1),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=c(0:5, '6-12'))
legend("topright", c("Observed","Expected Negative Binomial using mme"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

barplot(matrix(c(Frequency,E_nbin2),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=c(0:5, '6-12'))
legend("topright", c("Observed","Expected Negative Binomial using mle"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

CT1 = chisq.test(rbind(Frequency,E_poi)); CT1
CT2 = chisq.test(rbind(Frequency,E_bin));CT2
CT3 = chisq.test(rbind(Frequency,E_geom));CT3
CT4 = chisq.test(rbind(Frequency,E_nbin1));CT4
CT5 = chisq.test(rbind(Frequency,E_nbin2));CT5

C1 = CT1$statistic
C2 = CT2$statistic
C3 = CT3$statistic
C4 = CT4$statistic
C5 = CT5$statistic

P1 = CT1$p.value
P2 = CT2$p.value
P3 = CT3$p.value
P4 = CT4$p.value
P5 = CT5$p.value

RES1 = Frequency - E_poi
RES2 = Frequency - E_bin
RES3 = Frequency - E_geom
RES4 = Frequency - E_nbin1
RES5 = Frequency - E_nbin2

par(mfrow=c(1,5))
plot(x,RES1)
plot(x,RES2)
plot(x,RES3)
plot(x,RES4)
plot(x,RES5)
SR1 = sum(RES1)
SR2 = sum(RES2)
SR3 = sum(RES3)
SR4 = sum(RES4)
SR5 = sum(RES5)
MSE1 = mean(RES1^2)
MSE2 = mean(RES2^2)
MSE3 = mean(RES3^2)
MSE4 = mean(RES4^2)
MSE5 = mean(RES5^2)
SUMMARY = matrix(c(SR1, SR2, SR3, SR4, SR5, C1, C2, C3, C4, C5,P1, P2, P3, P4,P5, MSE1, MSE2, MSE3, MSE4,MSE5), nrow = 4, byrow = TRUE)
colnames(SUMMARY) = c("Poisson","Binomial","Geometric", "Negative Binomial(mme)","Negative Binomial(mle)")
rownames(SUMMARY) = c("sum of residuals", "Chi-square Statistic", "p-value", "MSE")
round(SUMMARY,4)