Fitting Distribution to Given Data

The article specifies some methodologies and simulations on model fitting, including plot exploring, MSE evaluating, and Chi-square goodness of fit testing. It provides some detailed comparisons between discrete distributions which can help selecting the best fitted models.

We may observe the number of Spore, which starts at zero, and corresponding frequencies·

Spore	Frequency
0	131
1	55
2	21
3	14
4	6
5	6
6	2
7	0
8	0
9	0
10	0
11	2
12	1

The principle behind fitting distributions to data is to find the type of distribution (Normal, Binomial, Poisson, etc) and the value of the parameters (mean, variance, etc) that give the highest probability of producing the observed data. Given data indicates that it could be fitted to discrete distribution such as Binomial, Poisson, Geometric(Shifted) and Negative Binomial(Shifted) distribution because the observed Spore starts at 0.

##          MEAN      VAR     DISP     THETA         R
## [1,] 1.004202 3.075932 3.063062 0.3264707 0.4867531

#estimating use r(mle)
fitdistr(origin,'Negative Binomial')

##      size         mu    
##   0.5823039   1.0042043 
##  (0.1087772) (0.1072183)

Pmme=P[which.max(likelihood)];Pmme

## [1] 0.3264073

Pmle=P[which.max(likelihood2)];Pmle

## [1] 0.3669874

To find the parameters of the “best-fitting” distribution corresponding to the data, we need to find the specific estimators of the given data distribution:

Spore	Frequency	round.RF..5.	E_poi	E_bin	E_geom	E_nbin1	E_nbin2
0	131	0.55042	87.18820	83.69363	118.75052	138.01908	132.77148
1	55	0.23109	87.55454	91.08095	59.49974	45.24852	48.93660
2	21	0.08824	43.96121	45.74784	29.81224	22.65529	24.50604
3	14	0.05882	14.71531	14.04216	14.93737	12.64846	13.35176
4	6	0.02521	3.69428	2.93877	7.48434	7.42600	7.56868
5	6	0.02521	0.74196	0.44282	3.75002	4.48822	4.39050
6	2	0.00840	0.12418	0.04943	1.87894	2.76436	2.58556
7	0	0.00000	0.01781	0.00414	0.94144	1.72536	1.53891
8	0	0.00000	0.00224	0.00026	0.47171	1.08753	0.92322
9	0	0.00000	0.00025	0.00001	0.23635	0.69071	0.55724
10	0	0.00000	0.00003	0.00000	0.11842	0.44134	0.33798
11	2	0.00840	0.00000	0.00000	0.05933	0.28338	0.20581
12	1	0.00420	0.00000	0.00000	0.02973	0.18270	0.12573

Overdispersion distribution would be applied here, which suggesting that Negative Binomial or Geometric distribution might be a good candidate for the data fitting. We will verify this later by side-by-side barplot and Chi-square goodness of fit test.

In order to have better comparisons, we then use sample mean and sample variance to fit Binomial, Poisson, Geometric and Negative Binomial distributions, the patterns are shown clearly from the side-by-side barplots.

The significant difference between first three bars in Poisson and Binomial plot indicates that these two may not be a good fit. On the contrary, Negative Binomial and Geometric overcome the height difference especially for the first bar, above findings supports our former conclusions.

Then we can perform Chi-square goodness of fit Test to verify this. First we can have a look at some assumptions of the test. The Chi-square (X^2) statistic measures how well the expected frequency of the fitted distribution compares with the observed frequency of a histogram of the observed data. The Chi-square test makes the following assumptions:

The observed data consists of a random sample of n independent data points.
The measurement scale can be nominal (i.e. non-numeric) or numerical.
The n data points can be arranged into histogram form with N contiguous classes that cover the entire possible range of the variable.

In order to do the test, we need to combine the last six groups to make the count in each cell at least 5. Consequently, we have the updated side-by-side barplots:

## The following objects are masked from data1 (pos = 3):
## 
##     Frequency, Spore

## [1] 1.0000000 2.8607595 2.8607595 0.3495575 0.5374150

##      size         mu    
##   0.5913633   1.0000003 
##  (0.1114659) (0.1063331)

## [1] 0.3495399

## [1] 0.3716923

Spore	Frequency	RF	E_poi	E_bin	E_geom	E_nbin1	E_nbin2
0	131	0.5504202	87.55531	80.90017	119.00000	135.28731	132.53819
1	55	0.2310924	87.55531	94.38353	59.50000	47.29070	49.25225
2	21	0.0882353	43.77765	47.19177	29.75000	23.64535	24.62612
3	14	0.0588235	14.59255	13.10882	14.87500	13.00843	13.36699
4	6	0.0252101	3.64814	2.18480	7.43750	7.48272	7.54160
5	6	0.0252101	0.72963	0.21848	3.71875	4.41679	4.35177
6-12	5	0.0210084	0.00024	0.00000	0.23242	0.60901	0.53936

## Warning in chisq.test(rbind(Frequency, E_poi)): Chi-squared approximation
## may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  rbind(Frequency, E_poi)
## X-squared = 33.792, df = 6, p-value = 7.378e-06

## Warning in chisq.test(rbind(Frequency, E_bin)): Chi-squared approximation
## may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  rbind(Frequency, E_bin)
## X-squared = 44.471, df = 6, p-value = 5.961e-08

## Warning in chisq.test(rbind(Frequency, E_geom)): Chi-squared approximation
## may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  rbind(Frequency, E_geom)
## X-squared = 7.296, df = 6, p-value = 0.2943

## Warning in chisq.test(rbind(Frequency, E_nbin1)): Chi-squared approximation
## may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  rbind(Frequency, E_nbin1)
## X-squared = 4.6018, df = 6, p-value = 0.5958

## Warning in chisq.test(rbind(Frequency, E_nbin2)): Chi-squared approximation
## may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  rbind(Frequency, E_nbin2)
## X-squared = 4.5882, df = 6, p-value = 0.5976

##                       Poisson Binomial Geometric Negative Binomial(mme)
## sum of residuals       0.1412   0.0124    3.4863                 6.2597
## Chi-square Statistic  33.7923  44.4712    7.2960                 4.6018
## p-value                0.0000   0.0000    0.2943                 0.5958
## MSE                  503.5382 688.6914   38.7969                15.6830
##                      Negative Binomial(mle)
## sum of residuals                     5.7837
## Chi-square Statistic                 4.5882
## p-value                              0.5976
## MSE                                 10.5632

To some extent, the p-value(0.596) for Negative Binomial distribution leads to failing to reject the null hypothesis that the data are consistent with Negative Binomial distribution.

Based on the explorations on the properties of these four distribution, side-by-side barplots, MSE and Chi-square goodness of fit test, we may come to the conclusion that Negative Binomial distribution would be the best candidate for this data set.

#ALL THE CODES
library(ggplot2)
library(knitr)
library(MASS)

data1=data.frame(c(0:12),c(131,55,21,14,6,6,2,0,0,0,0,2,1))
names(data1) <- c('Spore', 'Frequency')
kable(data1)

attach(data1)
N <- sum(Frequency)
RF <- Frequency/N
DF <- data.frame(data1, round(RF,5))
MEAN <- sum(RF*Spore)
VAR <- (sum(Spore^2*Frequency) - N*MEAN^2)/(N-1)#VAR can also be obtained by (MEAN+MEAN^2/R)
DISP <- VAR/MEAN # over dispersion
THETA <- 1/DISP
R <- MEAN*THETA/(1-THETA) # MEAN = R(1-THETA)/THETA
cbind(MEAN,VAR,DISP,THETA,R)

#esimating use r(mme)
origin = c(rep(0,131),rep(1,55),rep(2,21),rep(3,14),rep(4,6),rep(5,6),rep(6,2),11,11,12)
P=seq(0.01,0.99,length=5000)
likelihood = NULL
for (i in 1:length(P)){
likelihood[i] = N*R*log(P[i]) - N*log(factorial(R-1)) + 
  sum(origin*log(1-P[i]) + log(factorial(origin+R-1)) -log(factorial(origin)))}
#estimating use r(mle)
fitdistr(origin,'Negative Binomial')
r = fitdistr(origin,'Negative Binomial')$estimate[1]
likelihood2 = NULL
for (i in 1:length(P)){
likelihood2[i] = N*r*log(P[i]) - N*log(factorial(r-1)) + 
  sum(origin*log(1-P[i]) + log(factorial(origin+r-1)) -log(factorial(origin)))}
par(mfrow=c(1,2))
plot(P, likelihood, pch=20)
plot(P, likelihood2, pch=20)
P1=P[which.max(likelihood)];P1
P2=P[which.max(likelihood2)];P2

#fit in different distribution using mean, variance and R.
x = Spore
E_poi = round(N * dpois(x, lambda=MEAN),5)
E_bin = round(N * dbinom(x, length(x), prob = MEAN/length(x)),5)
E_geom = round(N * dgeom(x, 1/(1+MEAN)),5) # MEAN = R(1-THETA)/THETA, for geometric, R = 1
E_nbin1 = round(N * dnbinom(x, size = R, mu = MEAN),5)
E_nbin2 = round(N * dnbinom(x, size = r, mu = MEAN),5)
DF = data.frame(DF, E_poi, E_bin, E_geom, E_nbin1,E_nbin2)
kable(DF)

par(mfrow=c(2,2))
barplot(matrix(c(Frequency,E_poi),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=x)
legend("topright", c("Observed","Expected Poisson"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

barplot(matrix(c(Frequency,E_bin),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=x)
legend("topright", c("Observed","Expected Binomial"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

barplot(matrix(c(Frequency,E_geom),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=x)
legend("topright", c("Observed","Expected Geometric"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

par(mfrow=c(1,2))
barplot(matrix(c(Frequency,E_nbin1),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=x)
legend("topright", c("Observed","Expected Negative Binomial using mme"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

barplot(matrix(c(Frequency,E_nbin2),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=x)
legend("topright", c("Observed","Expected Negative Binomial using mle"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

data1 = rbind(data1[1:6,], c('6-12',5))
attach(data1)
Frequency = as.numeric(Frequency)
Spore1 = c(0:5,9)
N = sum(Frequency)
RF = Frequency/N
DF = data.frame(data1, RF)
MEAN = sum(RF*Spore1) 
VAR = (sum(Spore1^2*Frequency) - N*MEAN^2)/(N-1)
DISP= VAR/MEAN # over dispersion
THETA = 1/DISP
R = MEAN*THETA/(1-THETA) # MEAN = R(1-THETA)/THETA
c(MEAN,VAR,DISP,THETA,R)

origin = c(rep(0,131),rep(1,55),rep(2,21),rep(3,14),rep(4,6),rep(5,6),rep(9,5))
P=seq(0.01,0.99,length=5000)
likelihood = NULL
for (i in 1:length(P)){
likelihood[i] = N*R*log(P[i]) - N*log(factorial(R-1)) + 
  sum(origin*log(1-P[i]) + log(factorial(origin+R-1)) -log(factorial(origin)))}
#estimating use r(mle)
fitdistr(origin,'Negative Binomial')
r = fitdistr(origin,'Negative Binomial')$estimate[1]
likelihood2 = NULL
for (i in 1:length(P)){
likelihood2[i] = N*r*log(P[i]) - N*log(factorial(r-1)) + 
  sum(origin*log(1-P[i]) + log(factorial(origin+r-1)) -log(factorial(origin)))}
par(mfrow=c(1,2))
plot(P, likelihood, pch=20)
plot(P, likelihood2, pch=20)
P1=P[which.max(likelihood)];P1
P2=P[which.max(likelihood2)];P2

#fit in different distribution using mean, variance and R.
x = Spore1
E_poi = round(N * dpois(x, lambda=MEAN),5)
E_bin = round(N * dbinom(x, length(x), prob = MEAN/length(x)),5)
E_geom = round(N * dgeom(x, 1/(1+MEAN)),5) # MEAN = R(1-THETA)/THETA, for geometric, R = 1
E_nbin1 = round(N * dnbinom(x, size = R, mu = MEAN),5)
E_nbin2 = round(N * dnbinom(x, size = r, mu = MEAN),5)
DF = data.frame(DF, E_poi, E_bin, E_geom, E_nbin1, E_nbin2)
kable(DF)

par(mfrow=c(2,2))
barplot(matrix(c(Frequency,E_poi),nr=2, byrow = TRUE), beside=T, 
        col = c("aquamarine3","coral"), 
        names.arg = c(0:5, '6-12'))
legend("topright", c("Observed","Expected Poisson"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

barplot(matrix(c(Frequency,E_bin),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=c(0:5, '6-12'))
legend("topright", c("Observed","Expected Binomial"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")


barplot(matrix(c(Frequency,E_geom),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=c(0:5, '6-12'))
legend("topright", c("Observed","Expected Geometric"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

par(mfrow=c(1,2))

barplot(matrix(c(Frequency,E_nbin1),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=c(0:5, '6-12'))
legend("topright", c("Observed","Expected Negative Binomial using mme"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

barplot(matrix(c(Frequency,E_nbin2),nr=2, byrow = TRUE), beside=T, 
        col=c("aquamarine3","coral"), 
        names.arg=c(0:5, '6-12'))
legend("topright", c("Observed","Expected Negative Binomial using mle"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

CT1 = chisq.test(rbind(Frequency,E_poi)); CT1
CT2 = chisq.test(rbind(Frequency,E_bin));CT2
CT3 = chisq.test(rbind(Frequency,E_geom));CT3
CT4 = chisq.test(rbind(Frequency,E_nbin1));CT4
CT5 = chisq.test(rbind(Frequency,E_nbin2));CT5

C1 = CT1$statistic
C2 = CT2$statistic
C3 = CT3$statistic
C4 = CT4$statistic
C5 = CT5$statistic

P1 = CT1$p.value
P2 = CT2$p.value
P3 = CT3$p.value
P4 = CT4$p.value
P5 = CT5$p.value

RES1 = Frequency - E_poi
RES2 = Frequency - E_bin
RES3 = Frequency - E_geom
RES4 = Frequency - E_nbin1
RES5 = Frequency - E_nbin2

par(mfrow=c(1,5))
plot(x,RES1)
plot(x,RES2)
plot(x,RES3)
plot(x,RES4)
plot(x,RES5)
SR1 = sum(RES1)
SR2 = sum(RES2)
SR3 = sum(RES3)
SR4 = sum(RES4)
SR5 = sum(RES5)
MSE1 = mean(RES1^2)
MSE2 = mean(RES2^2)
MSE3 = mean(RES3^2)
MSE4 = mean(RES4^2)
MSE5 = mean(RES5^2)
SUMMARY = matrix(c(SR1, SR2, SR3, SR4, SR5, C1, C2, C3, C4, C5,P1, P2, P3, P4,P5, MSE1, MSE2, MSE3, MSE4,MSE5), nrow = 4, byrow = TRUE)
colnames(SUMMARY) = c("Poisson","Binomial","Geometric", "Negative Binomial(mme)","Negative Binomial(mle)")
rownames(SUMMARY) = c("sum of residuals", "Chi-square Statistic", "p-value", "MSE")
round(SUMMARY,4)

Fitting Distribution to Given Data

Simiao Ye

09-12-2015