Abstract
Nigrum Inventory in a natural forest can help the efficient production of crowberry food products. The occurrence of Empetrum nigrum at one forest location of Södermanland län has correlations with some factors such as forest amount and the forest aggregation and the numbers of species richness of forest plants. In this study,the first task is to analyze the effects of the variables forest, forest.clumpy, and Empenigr on sp.richness with GLM model specification. The second task is to analyze the effects of the variables forest, forest.clumpy, and sp.richness on Empenigr with Logit model specification. The computational results conclude that a) for the case of GLM model Specification, the first best model specification is GLM_12 while the second best is GLM_10_A. Both two models show the positive statistically significant effects of forest, forest.clumpy and Empenigr on sp.richness. However, their interactive term (forest\(*\)forest.clumpy) shows insignificant effects on sp.richness; b) for the case of Logit model Specification, the first best model specification is LogitMOD_11 while the second best is LogitMOD_12_B. Both two models show the positive statistically significant effects of forest, forest.clumpy and sp.richness on Empenigr. Their interactive term (forest\(*\)forest.clumpy), however, shows the negative statistically significant effect on Empenigr.According to Wikipedia, Empetrum nigrum (also called crowberry or blackberry, see Figure 1) is the species of heather family Ericaceae, largely reproducing and spreading through underground stems near circumboreal regions in the northern hemisphere. Evolutionary biologists believe that long-distance migratory birds can help to disperse seeds from one location to the other. Empetrum nigrum is associated with mature coniferous forests and survival in muskegs and mixed-wood forests under wet soil conditions. It is good full of vitamins and fiber, and also good for health Production for food products with Empetrum nigrum requires its sufficient large inventory. The forest amount (“forest”) and forest aggregation (“forest.clumpy”) play a significant role to affect Empetrum nigrum inventory. In this min-project, the main study is to find out the effects of forest amount (“forest”) and forest aggregation (“forest.clumpy”) and their interactive terms on the numbers of species richness of forest plants (“sp.richness”), and as well on the occurrence of Empetrum nigrum (“Empenigr”) at Södermanlands län (See Figure 2).
In this experimental design, the flora inventory at Södermanland län suggests to divide into a square grid of 2.5 x 2.5 \(km^{2}\). The sampling size of the square gird suggests randomly selected 200 units. Each square grid entails for all kinds of plant species that counted up as “sp.richness” response varaible. Whenever the occurrence of Empetrum nigrum in the designated square grid, the binary variables(“Empenigr”) is indicated as “1”; otherwise as “0”. Both variables “forest” and “forest.clumpy” are explanatory effects on “sp.richness” or “Empenigr”. The variable “forest” indicates the Proportion of forested area; while “forest.clumpy” is the Aggregation index of the forest structure. The High value of “forest.clumpy” indicates that the forest is aggregated in one or a few large patches. All variable definitions and types of class are Tabluated in Table 1. The basic descriptive statistics of the given dataset (forest.csv) shows in Table 2.
From the original dataset (forest.csv), “Empenigr” is to serve as a binary reponse variable or to serve as a two-factor level of “Empenigr” explanatory variable. Therefore, it is suggested to create a new 2-factor-level “fac_Empenigr” as categorical class for explanatory variable. The mapping is “1” to “H” and “0” to “L”.
fac_forestcsv = cbind(forestcsv, as.factor(forestcsv$Empenigr))
colnames(fac_forestcsv)[5] = c("fac_Empenigr")
fac_forestcsv[, 5] = mapvalues(fac_forestcsv[, 5], from = c("1", "0"), to = c("H",
"L"))
head(fac_forestcsv, 10) forest forest.clumpy sp.richness Empenigr fac_Empenigr
1 71.94996 0.8609 109 1 H
2 20.88000 0.7961 48 0 L
3 83.24995 0.8216 117 1 H
4 36.99002 0.7993 50 0 L
5 74.50996 0.9017 78 0 L
6 30.17003 0.8970 71 0 L
7 61.28998 0.8518 79 0 L
8 66.42997 0.8879 96 0 L
9 83.54995 0.8271 102 1 H
10 43.77001 0.8814 61 0 L
A scatter matrix plot in Figure 3 is a better option to show their (inter-)relationship between variables in forest datset . A book of Crawley (2013) and papers of Rouder et al. (2016) and Boudreau, Ropars, and Harper (2010) are recommdended for this mini-project.
scatterplotMatrix(~forest + forest.clumpy + sp.richness | fac_Empenigr, regLine = FALSE,
smooth = FALSE, diagonal = list(method = "density"), by.groups = TRUE, data = fac_forestcsv)In the study, first procedure is to run Shapiro–Wilk Normality Test of all variables except for binary variable (Empenigr). The Shapiro–Wilk test is Normality Test of Frequentist Statistics Method. Details how to compute the test statistic can refer Shapiro and Wilk (1972). The Normality Test indicates that variables of “forest”, “forest.clumpy” and “sp.richness” are statistically significant to follow normal distribution.
library(RcmdrMisc)
normalityTest(~forest, test = "shapiro.test", data = forestcsv)
normalityTest(~forest.clumpy, test = "shapiro.test", data = forestcsv)
normalityTest(~sp.richness, test = "shapiro.test", data = forestcsv)### 'fac_Empenigr'
groupL = fac_forestcsv[which(fac_forestcsv[5] == "L"), ]
groupH = fac_forestcsv[which(fac_forestcsv[5] == "H"), ] forest forest.clumpy sp.richness Empenigr fac_Empenigr
Min. :19.78 Min. :0.7609 Min. : 51.00 Min. :1 L: 0
1st Qu.:49.10 1st Qu.:0.8230 1st Qu.: 86.00 1st Qu.:1 H:70
Median :65.98 Median :0.8629 Median : 97.00 Median :1
Mean :61.31 Mean :0.8599 Mean : 94.74 Mean :1
3rd Qu.:74.64 3rd Qu.:0.8895 3rd Qu.:104.00 3rd Qu.:1
Max. :90.76 Max. :0.9466 Max. :119.00 Max. :1
forest forest.clumpy sp.richness Empenigr fac_Empenigr
Min. : 8.28 Min. :0.7638 Min. : 23.00 Min. :0 L:130
1st Qu.:35.32 1st Qu.:0.8494 1st Qu.: 68.00 1st Qu.:0 H: 0
Median :48.72 Median :0.8706 Median : 79.00 Median :0
Mean :49.58 Mean :0.8657 Mean : 80.02 Mean :0
3rd Qu.:64.72 3rd Qu.:0.8851 3rd Qu.: 92.00 3rd Qu.:0
Max. :91.60 Max. :0.9487 Max. :127.00 Max. :0
The following correlation matrix indicates that
Empenigr forest forest.clumpy sp.richness
Empenigr 1.00000000 0.2875824 -0.07264119 0.37083433
forest 0.28758243 1.0000000 -0.18937695 0.56497549
forest.clumpy -0.07264119 -0.1893769 1.00000000 0.07721077
sp.richness 0.37083433 0.5649755 0.07721077 1.00000000
To test the group mean effects of 2-factor-level “fac_Empenigr” on other variables
The Hypothesis tests:
AnovaModel.10 <- aov(sp.richness ~ fac_Empenigr, data = fac_forestcsv)
ContractTukey_10 = glht(AnovaModel.10, linfct = mcp(fac_Empenigr = "Tukey"))
summary(ContractTukey_10)
AnovaModel.11 <- aov(forest.clumpy ~ fac_Empenigr, data = fac_forestcsv)
ContractTukey_11 = glht(AnovaModel.11, linfct = mcp(fac_Empenigr = "Tukey"))
summary(ContractTukey_11)
AnovaModel.12 <- aov(forest ~ fac_Empenigr, data = fac_forestcsv)
ContractTukey_12 = glht(AnovaModel.12, linfct = mcp(fac_Empenigr = "Tukey"))
summary(ContractTukey_12)Table 6 is the summary tabulated the output of Simultaneous Tests for General Linear Hypotheses of Multiple Comparisons of Means Tukey Contrasts Test.
Comparing Two Nested Models with ANOVA F Test:
The results found that GLM_10_A is statistically better than all other models except GLM_12, implying that the interactive terms of \(\mathrm{forest*forest.clumpy}\) is not statistical significant , and the two-factor-level of Empenigr also contributes a statistical significant effects on “sp.richness”. The following is to list out the First Best Model:GLM_10_A and the Second Best Model:GLM_12. Resdiual diagnostic check for these two models are also followed normal distributed.
GLM_10 <- glm(sp.richness ~ forest + forest.clumpy, family = gaussian(identity),
data = forestcsv)
GLM_10_A <- glm(sp.richness ~ 0 + forest + forest.clumpy, family = gaussian(identity),
data = forestcsv)
GLM_11 <- glm(sp.richness ~ forest + forest.clumpy + forest * forest.clumpy, family = gaussian(identity),
data = forestcsv)
GLM_11_A <- glm(sp.richness ~ 0 + forest + forest.clumpy + forest * forest.clumpy,
family = gaussian(identity), data = forestcsv)
GLM_12 <- glm(sp.richness ~ forest + forest.clumpy + factor(Empenigr), family = gaussian(identity),
data = forestcsv)
GLM_12_B <- glm(sp.richness ~ forest + forest.clumpy + forest * forest.clumpy + factor(Empenigr),
family = gaussian(identity), data = forestcsv)Analysis of Deviance Table
Model 1: sp.richness ~ 0 + forest + forest.clumpy
Model 2: sp.richness ~ forest + forest.clumpy
Model 3: sp.richness ~ forest + forest.clumpy + forest * forest.clumpy
Model 4: sp.richness ~ 0 + forest + forest.clumpy + forest * forest.clumpy
Model 5: sp.richness ~ forest + forest.clumpy + factor(Empenigr)
Model 6: sp.richness ~ forest + forest.clumpy + forest * forest.clumpy +
factor(Empenigr)
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 198 46628
2 197 46332 1 296.4 1.3504 0.2466
3 196 46150 1 182.1 0.8298 0.3635
4 197 46512 -1 -362.5 1.6517 0.2003
5 196 42820 1 3692.3 16.8223 6.025e-05 ***
6 195 42800 1 19.8 0.0903 0.7641
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
GLM_12 <- glm(sp.richness ~ forest + forest.clumpy + factor(Empenigr), family = gaussian(identity),
data = forestcsv)
summary(GLM_12)
Call:
glm(formula = sp.richness ~ forest + forest.clumpy + factor(Empenigr),
family = gaussian(identity), data = forestcsv)
Deviance Residuals:
Min 1Q Median 3Q Max
-43.667 -9.702 -0.443 11.050 33.004
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -30.98934 25.17236 -1.231 0.219766
forest 0.52173 0.05702 9.151 < 2e-16 ***
forest.clumpy 98.34708 28.28539 3.477 0.000625 ***
factor(Empenigr)1 9.17466 2.28831 4.009 8.65e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 218.4683)
Null deviance: 71764 on 199 degrees of freedom
Residual deviance: 42820 on 196 degrees of freedom
AIC: 1650.9
Number of Fisher Scoring iterations: 2
GLM_10_A <- glm(sp.richness ~ 0 + forest + forest.clumpy, family = gaussian(identity),
data = forestcsv)
summary(GLM_10_A)
Call:
glm(formula = sp.richness ~ 0 + forest + forest.clumpy, family = gaussian(identity),
data = forestcsv)
Deviance Residuals:
Min 1Q Median 3Q Max
-37.685 -10.673 -1.518 10.436 35.640
Coefficients:
Estimate Std. Error t value Pr(>|t|)
forest 0.56648 0.05421 10.45 <2e-16 ***
forest.clumpy 63.46043 3.58035 17.73 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 235.4952)
Null deviance: 1522550 on 200 degrees of freedom
Residual deviance: 46628 on 198 degrees of freedom
AIC: 1663.9
Number of Fisher Scoring iterations: 2
LogitMOD_10 <- glm(Empenigr ~ forest + forest.clumpy, data = forestcsv, family = binomial(logit))
LogitMOD_10_A <- glm(Empenigr ~ 0 + forest + forest.clumpy, data = forestcsv, family = binomial(logit))
LogitMOD_11 <- glm(Empenigr ~ forest + forest.clumpy + forest * forest.clumpy, data = forestcsv,
family = binomial(logit))
LogitMOD_11_A <- glm(Empenigr ~ 0 + forest + forest.clumpy + forest * forest.clumpy,
data = forestcsv, family = binomial(logit))
LogitMOD_12 <- glm(Empenigr ~ forest + forest.clumpy + sp.richness, data = forestcsv,
family = binomial(logit))
LogitMOD_12_B <- glm(Empenigr ~ forest + forest.clumpy + forest * forest.clumpy +
sp.richness, data = forestcsv, family = binomial(logit))The likelihood ratio test is to compare the model specification fit one to the other by estimating the Log likelihood.
\[LR = -2 log\left(\frac{Liki(Model_s)}{Liki(Model_b)}\right) = -2(logliki(Model_s)-logliki(Model_b))\]
In general, if the test statistic is large enough, \(H_{0}\) is statistically rejected, implying that the larger model has a significant model improvement over the smaller one. With sequential narrowing-down procedures of likelihood ratio test for 6 Logit models in Table 4, it found that the first best model is LogitMOD_11 while the second best model is LogitMOD_12_B.
library(mlogit)
lrtest(LogitMOD_10_A, LogitMOD_10, LogitMOD_11_A, LogitMOD_11, LogitMOD_12, LogitMOD_12_B)Likelihood ratio test
Model 1: Empenigr ~ 0 + forest + forest.clumpy
Model 2: Empenigr ~ forest + forest.clumpy
Model 3: Empenigr ~ 0 + forest + forest.clumpy + forest * forest.clumpy
Model 4: Empenigr ~ forest + forest.clumpy + forest * forest.clumpy
Model 5: Empenigr ~ forest + forest.clumpy + sp.richness
Model 6: Empenigr ~ forest + forest.clumpy + forest * forest.clumpy +
sp.richness
#Df LogLik Df Chisq Pr(>Chisq)
1 2 -121.01
2 3 -120.84 1 0.3538 0.55198
3 3 -120.98 0 0.2969 < 2e-16 ***
4 4 -117.95 1 6.0678 0.01377 *
5 4 -112.89 0 10.1293 < 2e-16 ***
6 5 -110.04 1 5.7010 0.01696 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Likelihood ratio test
Model 1: Empenigr ~ forest + forest.clumpy + forest * forest.clumpy
Model 2: Empenigr ~ 0 + forest + forest.clumpy + forest * forest.clumpy
Model 3: Empenigr ~ forest + forest.clumpy + sp.richness
Model 4: Empenigr ~ forest + forest.clumpy + forest * forest.clumpy +
sp.richness
#Df LogLik Df Chisq Pr(>Chisq)
1 4 -117.95
2 3 -120.98 -1 6.0678 0.01377 *
3 4 -112.89 1 16.1972 5.708e-05 ***
4 5 -110.04 1 5.7010 0.01696 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Likelihood ratio test
Model 1: Empenigr ~ 0 + forest + forest.clumpy + forest * forest.clumpy
Model 2: Empenigr ~ forest + forest.clumpy + sp.richness
Model 3: Empenigr ~ forest + forest.clumpy + forest * forest.clumpy +
sp.richness
#Df LogLik Df Chisq Pr(>Chisq)
1 3 -120.98
2 4 -112.89 1 16.197 5.708e-05 ***
3 5 -110.04 1 5.701 0.01696 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
LogitMOD_11 <- glm(Empenigr ~ forest + forest.clumpy + forest * forest.clumpy, data = forestcsv,
family = binomial(logit))
summary(LogitMOD_11)
Call:
glm(formula = Empenigr ~ forest + forest.clumpy + forest * forest.clumpy,
family = binomial(logit), data = forestcsv)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0126 -0.9227 -0.7090 1.2532 2.2699
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -30.7165 13.4065 -2.291 0.0220 *
forest 0.5084 0.2101 2.420 0.0155 *
forest.clumpy 32.8914 15.4238 2.133 0.0330 *
forest:forest.clumpy -0.5558 0.2441 -2.276 0.0228 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 258.98 on 199 degrees of freedom
Residual deviance: 235.90 on 196 degrees of freedom
AIC: 243.9
Number of Fisher Scoring iterations: 4
LogitMOD_12_B <- glm(Empenigr ~ forest + forest.clumpy + forest * forest.clumpy +
sp.richness, data = forestcsv, family = binomial(logit))
summary(LogitMOD_12_B)
Call:
glm(formula = Empenigr ~ forest + forest.clumpy + forest * forest.clumpy +
sp.richness, family = binomial(logit), data = forestcsv)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8298 -0.8245 -0.5645 1.0488 2.2260
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -31.66155 14.30537 -2.213 0.026880 *
forest 0.50739 0.22104 2.295 0.021708 *
forest.clumpy 30.93029 16.41250 1.885 0.059490 .
sp.richness 0.04381 0.01163 3.769 0.000164 ***
forest:forest.clumpy -0.58010 0.25719 -2.255 0.024103 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 258.98 on 199 degrees of freedom
Residual deviance: 220.07 on 195 degrees of freedom
AIC: 230.07
Number of Fisher Scoring iterations: 5
Nigrum Inventory in a natural forest can help the efficient production of crowberry food products. The occurrence of Empetrum nigrum at one forest location of Södermanland län has correlations with some factors such as forest amount and the forest aggregation and the numbers of species richness of forest plants. In this study, the first task is to analyze the effects of the variables forest, forest.clumpy, and Empenigr on sp.richness with GLM model specification. The second task is to analyze the effects of the variables forest, forest.clumpy, and sp.richness on Empenigr with Logit model specification. This experimental study is designated to select randomly the 200 sample dataset from Södermanland län. The preliminary testing for the dataset(forest.csv) used here includes scatter-matrix plot by group, normality test, correlation matrix estimation, and the Tukey contrast test.
The scatter-matrix plot by group ( = Empenigr) visualizes that the distributions of forest, forest.clumpy, and sp.richness are very different between the two groups. Contrarily, the normality test without by group ( = Empenigr) indicates that variables of forest, forest.clumpy, and sp.richness are still statistically significant to follow normal distributions. The estimation of correlation matrix shows that sp.richness shows a positive correlation of forest, forest.clumpy and Empenigr respectively. The variables of forest and forest.clumpy show a negative correlation, implying that a higher proportion of forested area, lower Aggregation index of the forest structure (less dense of flora forest structure). The variables of Empenigr and forest.clumpy show a negative correlation, implying that higher occurred counting rate of Empetrum nigrum, lower Aggregation index of the forest structure is. More important, the results of the Tukey Contrasts Test indicate that the fac_Empenigr (H(1) or L(0)) has a statistically significant group effect on the varialbes of forest and sp.richness but not on forest.clumpy. This finding implies that the group effects of Empetrum nigrum are significant in the proportion of forested area and the flora species abundance. However, the group effects of Empetrum nigrum show no obvious significant effect on the aggregation index of the forest structure.
With the preliminary testing , it found that the variables of forest, forest.clumpy, and sp.richness are following normal distributions. To study the effects of the variables of forest, forest.clumpy and Empenigr on sp.richness, I decide to use six GLM model specifications to test their linear correlations with Gaussian identity link fucntion for no probability transformation that tabulated in Table 3. The results of estimated parameters significance show in Table 7. It should be highlighted that there is no overdispersion problem as all six GLM model specifications if Gaussian identity link fucntion is applied. By using ANOVA F Test for the six Nested GLM Models, it found that the first best model is the first best model specification is GLM_12 while the second best is GLM_10_A. Contrary to the six GLM models, I use another six Logit models (see Table 4)to study the effects of forest, forest.clumpy and sp.richness on Empenigr. Since Empenigr is a binary response variable, the binomial logit link function is more appropiate. Besides, Likelihood ratio test is more appropiate to test the nest logit models bacause ANOVA F Test is invalid for nest logit model comparsion. Based on LR test, the first best model specification is LogitMOD_11 while the second best is LogitMOD_12_B.
According to the computational results in Table 9, it can conclude that
for the case of GLM model Specification, the first best model specification is “GLM_12” while the second best is “GLM_10_A”. Both two models show the positive statistically significant effects of “forest”, “forest.clumpy” and “Empenigr” on “sp.richness”. However, their interactive term ("forest*forest.clumpy“) shows no significant effects on”sp.richness".
for the case of Logit model Specification, the first best model specification is LogitMOD_11 while the second best is LogitMOD_12_B. Both two models show the positive statistically significant effects of “forest”, “forest.clumpy” and “sp.richness” on “Empenigr”. Their interactive term ("forest*forest.clumpy“), however, shows the negative statistically significant effect on”Empenigr".
Boudreau, Stéphane, Pascale Ropars, and Karen Amanda Harper. 2010. “Population Dynamics of Empetrum Hermaphroditum (Ericaceae) on a Subarctic Sand Dune: Evidence of Rapid Colonization Through Efficient Sexual Reproduction.” American Journal of Botany 97 (5): 770–81.
Crawley, Michael J. 2013. “The R Book Second Edition.” John Wiley & Sons.
Rouder, Jeffrey N, Christopher R Engelhardt, Simon McCabe, and Richard D Morey. 2016. “Model Comparison in Anova.” Psychonomic Bulletin & Review 23 (6): 1779–86. https://link.springer.com/content/pdf/10.3758/s13423-016-1026-5.pdf.
Shapiro, Samuel S, and MB Wilk. 1972. “An Analysis of Variance Test for the Exponential Distribution (Complete Samples).” Technometrics 14 (2): 355–70.