1 Introduction

According to Wikipedia, Empetrum nigrum (also called crowberry or blackberry, see Figure 1) is the species of heather family Ericaceae, largely reproducing and spreading through underground stems near circumboreal regions in the northern hemisphere. Evolutionary biologists believe that long-distance migratory birds can help to disperse seeds from one location to the other. Empetrum nigrum is associated with mature coniferous forests and survival in muskegs and mixed-wood forests under wet soil conditions. It is good full of vitamins and fiber, and also good for health Production for food products with Empetrum nigrum requires its sufficient large inventory. The forest amount (“forest”) and forest aggregation (“forest.clumpy”) play a significant role to affect Empetrum nigrum inventory. In this min-project, the main study is to find out the effects of forest amount (“forest”) and forest aggregation (“forest.clumpy”) and their interactive terms on the numbers of species richness of forest plants (“sp.richness”), and as well on the occurrence of Empetrum nigrum (“Empenigr”) at Södermanlands län (See Figure 2).

2 Research Questions

  • to study the effects of the variables “forest”, “forest.clumpy” and “Empenigr” on “sp.richness” using GLM
  • to study the effects of the variables “forest”, “forest.clumpy” and “sp.richness” on “Empenigr” using Logit Model

3 Model Specification:

3.1 GLM Model Specification:

3.2 Logit Regression Specification

4 Methodology

In this experimental design, the flora inventory at Södermanland län suggests to divide into a square grid of 2.5 x 2.5 \(km^{2}\). The sampling size of the square gird suggests randomly selected 200 units. Each square grid entails for all kinds of plant species that counted up as “sp.richness” response varaible. Whenever the occurrence of Empetrum nigrum in the designated square grid, the binary variables(“Empenigr”) is indicated as “1”; otherwise as “0”. Both variables “forest” and “forest.clumpy” are explanatory effects on “sp.richness” or “Empenigr”. The variable “forest” indicates the Proportion of forested area; while “forest.clumpy” is the Aggregation index of the forest structure. The High value of “forest.clumpy” indicates that the forest is aggregated in one or a few large patches. All variable definitions and types of class are Tabluated in Table 1. The basic descriptive statistics of the given dataset (forest.csv) shows in Table 2.

4.1 Preliminary Testing for Dataset

From the original dataset (forest.csv), “Empenigr” is to serve as a binary reponse variable or to serve as a two-factor level of “Empenigr” explanatory variable. Therefore, it is suggested to create a new 2-factor-level “fac_Empenigr” as categorical class for explanatory variable. The mapping is “1” to “H” and “0” to “L”.

4.1.2 Create 2-factor “fac_Empenigr”

  • “1” mapping to “H”
  • “0” mapping to “L”
     forest forest.clumpy sp.richness Empenigr fac_Empenigr
1  71.94996        0.8609         109        1            H
2  20.88000        0.7961          48        0            L
3  83.24995        0.8216         117        1            H
4  36.99002        0.7993          50        0            L
5  74.50996        0.9017          78        0            L
6  30.17003        0.8970          71        0            L
7  61.28998        0.8518          79        0            L
8  66.42997        0.8879          96        0            L
9  83.54995        0.8271         102        1            H
10 43.77001        0.8814          61        0            L

4.1.3 Scatter Matrix Plot

A scatter matrix plot in Figure 3 is a better option to show their (inter-)relationship between variables in forest datset . A book of Crawley (2013) and papers of Rouder et al. (2016) and Boudreau, Ropars, and Harper (2010) are recommdended for this mini-project.

4.1.4 Shapiro–Wilk Normality Test

In the study, first procedure is to run Shapiro–Wilk Normality Test of all variables except for binary variable (Empenigr). The Shapiro–Wilk test is Normality Test of Frequentist Statistics Method. Details how to compute the test statistic can refer Shapiro and Wilk (1972). The Normality Test indicates that variables of “forest”, “forest.clumpy” and “sp.richness” are statistically significant to follow normal distribution.

4.1.5 Descriptive Statistics by Group

     forest      forest.clumpy     sp.richness        Empenigr fac_Empenigr
 Min.   :19.78   Min.   :0.7609   Min.   : 51.00   Min.   :1   L: 0        
 1st Qu.:49.10   1st Qu.:0.8230   1st Qu.: 86.00   1st Qu.:1   H:70        
 Median :65.98   Median :0.8629   Median : 97.00   Median :1               
 Mean   :61.31   Mean   :0.8599   Mean   : 94.74   Mean   :1               
 3rd Qu.:74.64   3rd Qu.:0.8895   3rd Qu.:104.00   3rd Qu.:1               
 Max.   :90.76   Max.   :0.9466   Max.   :119.00   Max.   :1               
     forest      forest.clumpy     sp.richness        Empenigr fac_Empenigr
 Min.   : 8.28   Min.   :0.7638   Min.   : 23.00   Min.   :0   L:130       
 1st Qu.:35.32   1st Qu.:0.8494   1st Qu.: 68.00   1st Qu.:0   H:  0       
 Median :48.72   Median :0.8706   Median : 79.00   Median :0               
 Mean   :49.58   Mean   :0.8657   Mean   : 80.02   Mean   :0               
 3rd Qu.:64.72   3rd Qu.:0.8851   3rd Qu.: 92.00   3rd Qu.:0               
 Max.   :91.60   Max.   :0.9487   Max.   :127.00   Max.   :0               

4.1.6 Correlation Matrix

The following correlation matrix indicates that

  • “sp.richness” shows a positive correlation of “forest”, “forest.clumpy” and “Empenigr” respectively.
  • “forest” and “forest.clumpy” show a negative correlation, implying that higher poroportion of forested area, lower Aggregation index of the forest structure (less dense of flora forest structure).
  • “Empenigr” and “forest.clumpy” show a negative correlation, implying that higher occured counting rate of Empetrum nigrum, lower Aggregation index of the forest structure
                 Empenigr     forest forest.clumpy sp.richness
Empenigr       1.00000000  0.2875824   -0.07264119  0.37083433
forest         0.28758243  1.0000000   -0.18937695  0.56497549
forest.clumpy -0.07264119 -0.1893769    1.00000000  0.07721077
sp.richness    0.37083433  0.5649755    0.07721077  1.00000000

4.1.7 Tukey Contrast Test

  • To test the group mean effects of 2-factor-level “fac_Empenigr” on other variables

  • The Hypothesis tests:

    • \(H_{o}: \mu (\mathrm{Empenigr_{1}}) = \mu (\mathrm{Empenigr_{0}})\)
    • \(H_{a}:\) Their Means are not equal

Table 6 is the summary tabulated the output of Simultaneous Tests for General Linear Hypotheses of Multiple Comparisons of Means Tukey Contrasts Test.

  • The results indicate that the fac_Empenigr (H(1) or L(0)) has a statisitical significant group effects on “forest” and “sp.richness” but not on “forest.clumpy”
  • This finding implies that the group effects of Empetrum nigrum are significant on the poroportion of forested area and the the flora species abundance
  • However the group effects of Empetrum nigrum show no obvious significant on the aggregation index of the forest structure.

5 Computational Results

5.1 GLM Model Specification:

5.1.1 Nested Model Comparison with ANOVA F Test

Comparing Two Nested Models with ANOVA F Test:

  • \(H_{0}:\) : Restricted model is true
  • \(H_{a}:\) : Unrestricted model is true
  • F-statistics = \(F = \frac{\mathrm{(RRSEE-URSSE)* q}}{\mathrm{URSSE/DOF}}\), where
    • q = no. of restriction
    • DOF = no of degree of freedom in unrestricted model
    • RRSS = Restricted Residual Residual Sum of Squares
    • URSS = Unrestricted Residual Residual Sum of Squares

The results found that GLM_10_A is statistically better than all other models except GLM_12, implying that the interactive terms of \(\mathrm{forest*forest.clumpy}\) is not statistical significant , and the two-factor-level of Empenigr also contributes a statistical significant effects on “sp.richness”. The following is to list out the First Best Model:GLM_10_A and the Second Best Model:GLM_12. Resdiual diagnostic check for these two models are also followed normal distributed.

Analysis of Deviance Table

Model 1: sp.richness ~ 0 + forest + forest.clumpy
Model 2: sp.richness ~ forest + forest.clumpy
Model 3: sp.richness ~ forest + forest.clumpy + forest * forest.clumpy
Model 4: sp.richness ~ 0 + forest + forest.clumpy + forest * forest.clumpy
Model 5: sp.richness ~ forest + forest.clumpy + factor(Empenigr)
Model 6: sp.richness ~ forest + forest.clumpy + forest * forest.clumpy + 
    factor(Empenigr)
  Resid. Df Resid. Dev Df Deviance       F    Pr(>F)    
1       198      46628                                  
2       197      46332  1    296.4  1.3504    0.2466    
3       196      46150  1    182.1  0.8298    0.3635    
4       197      46512 -1   -362.5  1.6517    0.2003    
5       196      42820  1   3692.3 16.8223 6.025e-05 ***
6       195      42800  1     19.8  0.0903    0.7641    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.1.2 First Best: GLM_12


Call:
glm(formula = sp.richness ~ forest + forest.clumpy + factor(Empenigr), 
    family = gaussian(identity), data = forestcsv)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-43.667   -9.702   -0.443   11.050   33.004  

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -30.98934   25.17236  -1.231 0.219766    
forest              0.52173    0.05702   9.151  < 2e-16 ***
forest.clumpy      98.34708   28.28539   3.477 0.000625 ***
factor(Empenigr)1   9.17466    2.28831   4.009 8.65e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 218.4683)

    Null deviance: 71764  on 199  degrees of freedom
Residual deviance: 42820  on 196  degrees of freedom
AIC: 1650.9

Number of Fisher Scoring iterations: 2

5.1.3 Diagnostic: GLM_12

5.1.4 Second Best: GLM_10_A


Call:
glm(formula = sp.richness ~ 0 + forest + forest.clumpy, family = gaussian(identity), 
    data = forestcsv)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-37.685  -10.673   -1.518   10.436   35.640  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
forest         0.56648    0.05421   10.45   <2e-16 ***
forest.clumpy 63.46043    3.58035   17.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 235.4952)

    Null deviance: 1522550  on 200  degrees of freedom
Residual deviance:   46628  on 198  degrees of freedom
AIC: 1663.9

Number of Fisher Scoring iterations: 2

5.1.5 Diagnostic: GLM_10_A

5.1.6 Effect Plots: GLM_10_A

5.2 Logit Regression Specification

5.2.1 Nested Model Comparison with LR Test

The likelihood ratio test is to compare the model specification fit one to the other by estimating the Log likelihood.

\[LR = -2 log\left(\frac{Liki(Model_s)}{Liki(Model_b)}\right) = -2(logliki(Model_s)-logliki(Model_b))\]

  • \(H_{0}:\): the smaller model is the “best” model
  • \(H_{a}:\): the bigger model is the “best” model

In general, if the test statistic is large enough, \(H_{0}\) is statistically rejected, implying that the larger model has a significant model improvement over the smaller one. With sequential narrowing-down procedures of likelihood ratio test for 6 Logit models in Table 4, it found that the first best model is LogitMOD_11 while the second best model is LogitMOD_12_B.

Likelihood ratio test

Model 1: Empenigr ~ 0 + forest + forest.clumpy
Model 2: Empenigr ~ forest + forest.clumpy
Model 3: Empenigr ~ 0 + forest + forest.clumpy + forest * forest.clumpy
Model 4: Empenigr ~ forest + forest.clumpy + forest * forest.clumpy
Model 5: Empenigr ~ forest + forest.clumpy + sp.richness
Model 6: Empenigr ~ forest + forest.clumpy + forest * forest.clumpy + 
    sp.richness
  #Df  LogLik Df   Chisq Pr(>Chisq)    
1   2 -121.01                          
2   3 -120.84  1  0.3538    0.55198    
3   3 -120.98  0  0.2969    < 2e-16 ***
4   4 -117.95  1  6.0678    0.01377 *  
5   4 -112.89  0 10.1293    < 2e-16 ***
6   5 -110.04  1  5.7010    0.01696 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Likelihood ratio test

Model 1: Empenigr ~ forest + forest.clumpy + forest * forest.clumpy
Model 2: Empenigr ~ 0 + forest + forest.clumpy + forest * forest.clumpy
Model 3: Empenigr ~ forest + forest.clumpy + sp.richness
Model 4: Empenigr ~ forest + forest.clumpy + forest * forest.clumpy + 
    sp.richness
  #Df  LogLik Df   Chisq Pr(>Chisq)    
1   4 -117.95                          
2   3 -120.98 -1  6.0678    0.01377 *  
3   4 -112.89  1 16.1972  5.708e-05 ***
4   5 -110.04  1  5.7010    0.01696 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Likelihood ratio test

Model 1: Empenigr ~ 0 + forest + forest.clumpy + forest * forest.clumpy
Model 2: Empenigr ~ forest + forest.clumpy + sp.richness
Model 3: Empenigr ~ forest + forest.clumpy + forest * forest.clumpy + 
    sp.richness
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   3 -120.98                         
2   4 -112.89  1 16.197  5.708e-05 ***
3   5 -110.04  1  5.701    0.01696 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.2.2 First Best: LogitMOD_11


Call:
glm(formula = Empenigr ~ forest + forest.clumpy + forest * forest.clumpy, 
    family = binomial(logit), data = forestcsv)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0126  -0.9227  -0.7090   1.2532   2.2699  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)  
(Intercept)          -30.7165    13.4065  -2.291   0.0220 *
forest                 0.5084     0.2101   2.420   0.0155 *
forest.clumpy         32.8914    15.4238   2.133   0.0330 *
forest:forest.clumpy  -0.5558     0.2441  -2.276   0.0228 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 258.98  on 199  degrees of freedom
Residual deviance: 235.90  on 196  degrees of freedom
AIC: 243.9

Number of Fisher Scoring iterations: 4

5.2.3 Diagnostic: LogitMOD_11

5.2.4 Effect Plots: LogitMOD_11

5.2.5 Second Best: LogitMOD_12_B


Call:
glm(formula = Empenigr ~ forest + forest.clumpy + forest * forest.clumpy + 
    sp.richness, family = binomial(logit), data = forestcsv)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8298  -0.8245  -0.5645   1.0488   2.2260  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -31.66155   14.30537  -2.213 0.026880 *  
forest                 0.50739    0.22104   2.295 0.021708 *  
forest.clumpy         30.93029   16.41250   1.885 0.059490 .  
sp.richness            0.04381    0.01163   3.769 0.000164 ***
forest:forest.clumpy  -0.58010    0.25719  -2.255 0.024103 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 258.98  on 199  degrees of freedom
Residual deviance: 220.07  on 195  degrees of freedom
AIC: 230.07

Number of Fisher Scoring iterations: 5

5.2.6 Diagnostic : LogitMOD_12_B

5.2.7 Effect Plots: LogitMOD_12_B

6 Discussion and Summary

Nigrum Inventory in a natural forest can help the efficient production of crowberry food products. The occurrence of Empetrum nigrum at one forest location of Södermanland län has correlations with some factors such as forest amount and the forest aggregation and the numbers of species richness of forest plants. In this study, the first task is to analyze the effects of the variables forest, forest.clumpy, and Empenigr on sp.richness with GLM model specification. The second task is to analyze the effects of the variables forest, forest.clumpy, and sp.richness on Empenigr with Logit model specification. This experimental study is designated to select randomly the 200 sample dataset from Södermanland län. The preliminary testing for the dataset(forest.csv) used here includes scatter-matrix plot by group, normality test, correlation matrix estimation, and the Tukey contrast test.

The scatter-matrix plot by group ( = Empenigr) visualizes that the distributions of forest, forest.clumpy, and sp.richness are very different between the two groups. Contrarily, the normality test without by group ( = Empenigr) indicates that variables of forest, forest.clumpy, and sp.richness are still statistically significant to follow normal distributions. The estimation of correlation matrix shows that sp.richness shows a positive correlation of forest, forest.clumpy and Empenigr respectively. The variables of forest and forest.clumpy show a negative correlation, implying that a higher proportion of forested area, lower Aggregation index of the forest structure (less dense of flora forest structure). The variables of Empenigr and forest.clumpy show a negative correlation, implying that higher occurred counting rate of Empetrum nigrum, lower Aggregation index of the forest structure is. More important, the results of the Tukey Contrasts Test indicate that the fac_Empenigr (H(1) or L(0)) has a statistically significant group effect on the varialbes of forest and sp.richness but not on forest.clumpy. This finding implies that the group effects of Empetrum nigrum are significant in the proportion of forested area and the flora species abundance. However, the group effects of Empetrum nigrum show no obvious significant effect on the aggregation index of the forest structure.

With the preliminary testing , it found that the variables of forest, forest.clumpy, and sp.richness are following normal distributions. To study the effects of the variables of forest, forest.clumpy and Empenigr on sp.richness, I decide to use six GLM model specifications to test their linear correlations with Gaussian identity link fucntion for no probability transformation that tabulated in Table 3. The results of estimated parameters significance show in Table 7. It should be highlighted that there is no overdispersion problem as all six GLM model specifications if Gaussian identity link fucntion is applied. By using ANOVA F Test for the six Nested GLM Models, it found that the first best model is the first best model specification is GLM_12 while the second best is GLM_10_A. Contrary to the six GLM models, I use another six Logit models (see Table 4)to study the effects of forest, forest.clumpy and sp.richness on Empenigr. Since Empenigr is a binary response variable, the binomial logit link function is more appropiate. Besides, Likelihood ratio test is more appropiate to test the nest logit models bacause ANOVA F Test is invalid for nest logit model comparsion. Based on LR test, the first best model specification is LogitMOD_11 while the second best is LogitMOD_12_B.

7 Conclusion

According to the computational results in Table 9, it can conclude that

  • for the case of GLM model Specification, the first best model specification is “GLM_12” while the second best is “GLM_10_A”. Both two models show the positive statistically significant effects of “forest”, “forest.clumpy” and “Empenigr” on “sp.richness”. However, their interactive term ("forest*forest.clumpy“) shows no significant effects on”sp.richness".

  • for the case of Logit model Specification, the first best model specification is LogitMOD_11 while the second best is LogitMOD_12_B. Both two models show the positive statistically significant effects of “forest”, “forest.clumpy” and “sp.richness” on “Empenigr”. Their interactive term ("forest*forest.clumpy“), however, shows the negative statistically significant effect on”Empenigr".

Reference

Boudreau, Stéphane, Pascale Ropars, and Karen Amanda Harper. 2010. “Population Dynamics of Empetrum Hermaphroditum (Ericaceae) on a Subarctic Sand Dune: Evidence of Rapid Colonization Through Efficient Sexual Reproduction.” American Journal of Botany 97 (5): 770–81.

Crawley, Michael J. 2013. “The R Book Second Edition.” John Wiley & Sons.

Rouder, Jeffrey N, Christopher R Engelhardt, Simon McCabe, and Richard D Morey. 2016. “Model Comparison in Anova.” Psychonomic Bulletin & Review 23 (6): 1779–86. https://link.springer.com/content/pdf/10.3758/s13423-016-1026-5.pdf.

Shapiro, Samuel S, and MB Wilk. 1972. “An Analysis of Variance Test for the Exponential Distribution (Complete Samples).” Technometrics 14 (2): 355–70.