The article specifies some methodologies and simulations on model fitting, including plot exploring, MSE evaluating, and Chi-square goodness of fit testing. It provides some detailed comparisons between discrete distributions which can help selecting the best fitted models.
We may observe the number of Spore, which starts at zero, and corresponding frequencies·
Spore | Frequency |
---|---|
0 | 131 |
1 | 55 |
2 | 21 |
3 | 14 |
4 | 6 |
5 | 6 |
6 | 2 |
7 | 0 |
8 | 0 |
9 | 0 |
10 | 0 |
11 | 2 |
12 | 1 |
The principle behind fitting distributions to data is to find the type of distribution (Normal, Binomial, Poisson, etc) and the value of the parameters (mean, variance, etc) that give the highest probability of producing the observed data. Given data indicates that it could be fitted to discrete distribution such as Binomial, Poisson, Geometric(Shifted) and Negative Binomial(Shifted) distribution because the observed Spore starts at 0.
## MEAN VAR DISP THETA R
## [1,] 1.004202 3.075932 3.063062 0.3264707 0.4867531
#estimating use r(mle)
fitdistr(origin,'Negative Binomial')
## size mu
## 0.5823039 1.0042043
## (0.1087772) (0.1072183)
Pmme=P[which.max(likelihood)];Pmme
## [1] 0.3264073
Pmle=P[which.max(likelihood2)];Pmle
## [1] 0.3669874
To find the parameters of the “best-fitting” distribution corresponding to the data, we need to find the specific estimators of the given data distribution:
Spore | Frequency | round.RF..5. | E_poi | E_bin | E_geom | E_nbin1 | E_nbin2 |
---|---|---|---|---|---|---|---|
0 | 131 | 0.55042 | 87.18820 | 83.69363 | 118.75052 | 138.01908 | 132.77148 |
1 | 55 | 0.23109 | 87.55454 | 91.08095 | 59.49974 | 45.24852 | 48.93660 |
2 | 21 | 0.08824 | 43.96121 | 45.74784 | 29.81224 | 22.65529 | 24.50604 |
3 | 14 | 0.05882 | 14.71531 | 14.04216 | 14.93737 | 12.64846 | 13.35176 |
4 | 6 | 0.02521 | 3.69428 | 2.93877 | 7.48434 | 7.42600 | 7.56868 |
5 | 6 | 0.02521 | 0.74196 | 0.44282 | 3.75002 | 4.48822 | 4.39050 |
6 | 2 | 0.00840 | 0.12418 | 0.04943 | 1.87894 | 2.76436 | 2.58556 |
7 | 0 | 0.00000 | 0.01781 | 0.00414 | 0.94144 | 1.72536 | 1.53891 |
8 | 0 | 0.00000 | 0.00224 | 0.00026 | 0.47171 | 1.08753 | 0.92322 |
9 | 0 | 0.00000 | 0.00025 | 0.00001 | 0.23635 | 0.69071 | 0.55724 |
10 | 0 | 0.00000 | 0.00003 | 0.00000 | 0.11842 | 0.44134 | 0.33798 |
11 | 2 | 0.00840 | 0.00000 | 0.00000 | 0.05933 | 0.28338 | 0.20581 |
12 | 1 | 0.00420 | 0.00000 | 0.00000 | 0.02973 | 0.18270 | 0.12573 |
Overdispersion distribution would be applied here, which suggesting that Negative Binomial or Geometric distribution might be a good candidate for the data fitting. We will verify this later by side-by-side barplot and Chi-square goodness of fit test.
In order to have better comparisons, we then use sample mean and sample variance to fit Binomial, Poisson, Geometric and Negative Binomial distributions, the patterns are shown clearly from the side-by-side barplots.
The significant difference between first three bars in Poisson and Binomial plot indicates that these two may not be a good fit. On the contrary, Negative Binomial and Geometric overcome the height difference especially for the first bar, above findings supports our former conclusions.
Then we can perform Chi-square goodness of fit Test to verify this. First we can have a look at some assumptions of the test. The Chi-square (X^2) statistic measures how well the expected frequency of the fitted distribution compares with the observed frequency of a histogram of the observed data. The Chi-square test makes the following assumptions:
In order to do the test, we need to combine the last six groups to make the count in each cell at least 5. Consequently, we have the updated side-by-side barplots:
## The following objects are masked from data1 (pos = 3):
##
## Frequency, Spore
## [1] 1.0000000 2.8607595 2.8607595 0.3495575 0.5374150
## size mu
## 0.5913633 1.0000003
## (0.1114659) (0.1063331)
## [1] 0.3495399
## [1] 0.3716923
Spore | Frequency | RF | E_poi | E_bin | E_geom | E_nbin1 | E_nbin2 |
---|---|---|---|---|---|---|---|
0 | 131 | 0.5504202 | 87.55531 | 80.90017 | 119.00000 | 135.28731 | 132.53819 |
1 | 55 | 0.2310924 | 87.55531 | 94.38353 | 59.50000 | 47.29070 | 49.25225 |
2 | 21 | 0.0882353 | 43.77765 | 47.19177 | 29.75000 | 23.64535 | 24.62612 |
3 | 14 | 0.0588235 | 14.59255 | 13.10882 | 14.87500 | 13.00843 | 13.36699 |
4 | 6 | 0.0252101 | 3.64814 | 2.18480 | 7.43750 | 7.48272 | 7.54160 |
5 | 6 | 0.0252101 | 0.72963 | 0.21848 | 3.71875 | 4.41679 | 4.35177 |
6-12 | 5 | 0.0210084 | 0.00024 | 0.00000 | 0.23242 | 0.60901 | 0.53936 |
## Warning in chisq.test(rbind(Frequency, E_poi)): Chi-squared approximation
## may be incorrect
##
## Pearson's Chi-squared test
##
## data: rbind(Frequency, E_poi)
## X-squared = 33.792, df = 6, p-value = 7.378e-06
## Warning in chisq.test(rbind(Frequency, E_bin)): Chi-squared approximation
## may be incorrect
##
## Pearson's Chi-squared test
##
## data: rbind(Frequency, E_bin)
## X-squared = 44.471, df = 6, p-value = 5.961e-08
## Warning in chisq.test(rbind(Frequency, E_geom)): Chi-squared approximation
## may be incorrect
##
## Pearson's Chi-squared test
##
## data: rbind(Frequency, E_geom)
## X-squared = 7.296, df = 6, p-value = 0.2943
## Warning in chisq.test(rbind(Frequency, E_nbin1)): Chi-squared approximation
## may be incorrect
##
## Pearson's Chi-squared test
##
## data: rbind(Frequency, E_nbin1)
## X-squared = 4.6018, df = 6, p-value = 0.5958
## Warning in chisq.test(rbind(Frequency, E_nbin2)): Chi-squared approximation
## may be incorrect
##
## Pearson's Chi-squared test
##
## data: rbind(Frequency, E_nbin2)
## X-squared = 4.5882, df = 6, p-value = 0.5976
## Poisson Binomial Geometric Negative Binomial(mme)
## sum of residuals 0.1412 0.0124 3.4863 6.2597
## Chi-square Statistic 33.7923 44.4712 7.2960 4.6018
## p-value 0.0000 0.0000 0.2943 0.5958
## MSE 503.5382 688.6914 38.7969 15.6830
## Negative Binomial(mle)
## sum of residuals 5.7837
## Chi-square Statistic 4.5882
## p-value 0.5976
## MSE 10.5632
To some extent, the p-value(0.596) for Negative Binomial distribution leads to failing to reject the null hypothesis that the data are consistent with Negative Binomial distribution.
Based on the explorations on the properties of these four distribution, side-by-side barplots, MSE and Chi-square goodness of fit test, we may come to the conclusion that Negative Binomial distribution would be the best candidate for this data set.
#ALL THE CODES
library(ggplot2)
library(knitr)
library(MASS)
data1=data.frame(c(0:12),c(131,55,21,14,6,6,2,0,0,0,0,2,1))
names(data1) <- c('Spore', 'Frequency')
kable(data1)
attach(data1)
N <- sum(Frequency)
RF <- Frequency/N
DF <- data.frame(data1, round(RF,5))
MEAN <- sum(RF*Spore)
VAR <- (sum(Spore^2*Frequency) - N*MEAN^2)/(N-1)#VAR can also be obtained by (MEAN+MEAN^2/R)
DISP <- VAR/MEAN # over dispersion
THETA <- 1/DISP
R <- MEAN*THETA/(1-THETA) # MEAN = R(1-THETA)/THETA
cbind(MEAN,VAR,DISP,THETA,R)
#esimating use r(mme)
origin = c(rep(0,131),rep(1,55),rep(2,21),rep(3,14),rep(4,6),rep(5,6),rep(6,2),11,11,12)
P=seq(0.01,0.99,length=5000)
likelihood = NULL
for (i in 1:length(P)){
likelihood[i] = N*R*log(P[i]) - N*log(factorial(R-1)) +
sum(origin*log(1-P[i]) + log(factorial(origin+R-1)) -log(factorial(origin)))}
#estimating use r(mle)
fitdistr(origin,'Negative Binomial')
r = fitdistr(origin,'Negative Binomial')$estimate[1]
likelihood2 = NULL
for (i in 1:length(P)){
likelihood2[i] = N*r*log(P[i]) - N*log(factorial(r-1)) +
sum(origin*log(1-P[i]) + log(factorial(origin+r-1)) -log(factorial(origin)))}
par(mfrow=c(1,2))
plot(P, likelihood, pch=20)
plot(P, likelihood2, pch=20)
P1=P[which.max(likelihood)];P1
P2=P[which.max(likelihood2)];P2
#fit in different distribution using mean, variance and R.
x = Spore
E_poi = round(N * dpois(x, lambda=MEAN),5)
E_bin = round(N * dbinom(x, length(x), prob = MEAN/length(x)),5)
E_geom = round(N * dgeom(x, 1/(1+MEAN)),5) # MEAN = R(1-THETA)/THETA, for geometric, R = 1
E_nbin1 = round(N * dnbinom(x, size = R, mu = MEAN),5)
E_nbin2 = round(N * dnbinom(x, size = r, mu = MEAN),5)
DF = data.frame(DF, E_poi, E_bin, E_geom, E_nbin1,E_nbin2)
kable(DF)
par(mfrow=c(2,2))
barplot(matrix(c(Frequency,E_poi),nr=2, byrow = TRUE), beside=T,
col=c("aquamarine3","coral"),
names.arg=x)
legend("topright", c("Observed","Expected Poisson"), pch=15,
col=c("aquamarine3","coral"),
bty="n")
barplot(matrix(c(Frequency,E_bin),nr=2, byrow = TRUE), beside=T,
col=c("aquamarine3","coral"),
names.arg=x)
legend("topright", c("Observed","Expected Binomial"), pch=15,
col=c("aquamarine3","coral"),
bty="n")
barplot(matrix(c(Frequency,E_geom),nr=2, byrow = TRUE), beside=T,
col=c("aquamarine3","coral"),
names.arg=x)
legend("topright", c("Observed","Expected Geometric"), pch=15,
col=c("aquamarine3","coral"),
bty="n")
par(mfrow=c(1,2))
barplot(matrix(c(Frequency,E_nbin1),nr=2, byrow = TRUE), beside=T,
col=c("aquamarine3","coral"),
names.arg=x)
legend("topright", c("Observed","Expected Negative Binomial using mme"), pch=15,
col=c("aquamarine3","coral"),
bty="n")
barplot(matrix(c(Frequency,E_nbin2),nr=2, byrow = TRUE), beside=T,
col=c("aquamarine3","coral"),
names.arg=x)
legend("topright", c("Observed","Expected Negative Binomial using mle"), pch=15,
col=c("aquamarine3","coral"),
bty="n")
data1 = rbind(data1[1:6,], c('6-12',5))
attach(data1)
Frequency = as.numeric(Frequency)
Spore1 = c(0:5,9)
N = sum(Frequency)
RF = Frequency/N
DF = data.frame(data1, RF)
MEAN = sum(RF*Spore1)
VAR = (sum(Spore1^2*Frequency) - N*MEAN^2)/(N-1)
DISP= VAR/MEAN # over dispersion
THETA = 1/DISP
R = MEAN*THETA/(1-THETA) # MEAN = R(1-THETA)/THETA
c(MEAN,VAR,DISP,THETA,R)
origin = c(rep(0,131),rep(1,55),rep(2,21),rep(3,14),rep(4,6),rep(5,6),rep(9,5))
P=seq(0.01,0.99,length=5000)
likelihood = NULL
for (i in 1:length(P)){
likelihood[i] = N*R*log(P[i]) - N*log(factorial(R-1)) +
sum(origin*log(1-P[i]) + log(factorial(origin+R-1)) -log(factorial(origin)))}
#estimating use r(mle)
fitdistr(origin,'Negative Binomial')
r = fitdistr(origin,'Negative Binomial')$estimate[1]
likelihood2 = NULL
for (i in 1:length(P)){
likelihood2[i] = N*r*log(P[i]) - N*log(factorial(r-1)) +
sum(origin*log(1-P[i]) + log(factorial(origin+r-1)) -log(factorial(origin)))}
par(mfrow=c(1,2))
plot(P, likelihood, pch=20)
plot(P, likelihood2, pch=20)
P1=P[which.max(likelihood)];P1
P2=P[which.max(likelihood2)];P2
#fit in different distribution using mean, variance and R.
x = Spore1
E_poi = round(N * dpois(x, lambda=MEAN),5)
E_bin = round(N * dbinom(x, length(x), prob = MEAN/length(x)),5)
E_geom = round(N * dgeom(x, 1/(1+MEAN)),5) # MEAN = R(1-THETA)/THETA, for geometric, R = 1
E_nbin1 = round(N * dnbinom(x, size = R, mu = MEAN),5)
E_nbin2 = round(N * dnbinom(x, size = r, mu = MEAN),5)
DF = data.frame(DF, E_poi, E_bin, E_geom, E_nbin1, E_nbin2)
kable(DF)
par(mfrow=c(2,2))
barplot(matrix(c(Frequency,E_poi),nr=2, byrow = TRUE), beside=T,
col = c("aquamarine3","coral"),
names.arg = c(0:5, '6-12'))
legend("topright", c("Observed","Expected Poisson"), pch=15,
col=c("aquamarine3","coral"),
bty="n")
barplot(matrix(c(Frequency,E_bin),nr=2, byrow = TRUE), beside=T,
col=c("aquamarine3","coral"),
names.arg=c(0:5, '6-12'))
legend("topright", c("Observed","Expected Binomial"), pch=15,
col=c("aquamarine3","coral"),
bty="n")
barplot(matrix(c(Frequency,E_geom),nr=2, byrow = TRUE), beside=T,
col=c("aquamarine3","coral"),
names.arg=c(0:5, '6-12'))
legend("topright", c("Observed","Expected Geometric"), pch=15,
col=c("aquamarine3","coral"),
bty="n")
par(mfrow=c(1,2))
barplot(matrix(c(Frequency,E_nbin1),nr=2, byrow = TRUE), beside=T,
col=c("aquamarine3","coral"),
names.arg=c(0:5, '6-12'))
legend("topright", c("Observed","Expected Negative Binomial using mme"), pch=15,
col=c("aquamarine3","coral"),
bty="n")
barplot(matrix(c(Frequency,E_nbin2),nr=2, byrow = TRUE), beside=T,
col=c("aquamarine3","coral"),
names.arg=c(0:5, '6-12'))
legend("topright", c("Observed","Expected Negative Binomial using mle"), pch=15,
col=c("aquamarine3","coral"),
bty="n")
CT1 = chisq.test(rbind(Frequency,E_poi)); CT1
CT2 = chisq.test(rbind(Frequency,E_bin));CT2
CT3 = chisq.test(rbind(Frequency,E_geom));CT3
CT4 = chisq.test(rbind(Frequency,E_nbin1));CT4
CT5 = chisq.test(rbind(Frequency,E_nbin2));CT5
C1 = CT1$statistic
C2 = CT2$statistic
C3 = CT3$statistic
C4 = CT4$statistic
C5 = CT5$statistic
P1 = CT1$p.value
P2 = CT2$p.value
P3 = CT3$p.value
P4 = CT4$p.value
P5 = CT5$p.value
RES1 = Frequency - E_poi
RES2 = Frequency - E_bin
RES3 = Frequency - E_geom
RES4 = Frequency - E_nbin1
RES5 = Frequency - E_nbin2
par(mfrow=c(1,5))
plot(x,RES1)
plot(x,RES2)
plot(x,RES3)
plot(x,RES4)
plot(x,RES5)
SR1 = sum(RES1)
SR2 = sum(RES2)
SR3 = sum(RES3)
SR4 = sum(RES4)
SR5 = sum(RES5)
MSE1 = mean(RES1^2)
MSE2 = mean(RES2^2)
MSE3 = mean(RES3^2)
MSE4 = mean(RES4^2)
MSE5 = mean(RES5^2)
SUMMARY = matrix(c(SR1, SR2, SR3, SR4, SR5, C1, C2, C3, C4, C5,P1, P2, P3, P4,P5, MSE1, MSE2, MSE3, MSE4,MSE5), nrow = 4, byrow = TRUE)
colnames(SUMMARY) = c("Poisson","Binomial","Geometric", "Negative Binomial(mme)","Negative Binomial(mle)")
rownames(SUMMARY) = c("sum of residuals", "Chi-square Statistic", "p-value", "MSE")
round(SUMMARY,4)