1 The Approach of GLM

  • The approach of GLM is to model a linear relationship between response variable and explanatory variables, by relaxing a) the assumption of constant variarance and b) the assumption of Gasussian error distribution.
  • With maximized the likelihood estimation of parameter values from the pre-assigned probability density function, GLM can estimate the linear relationship between the explanatory variable and the response variable.
  • In case explanatory variable is categorical or binary values, a logit transformation is necessary.
  • The logit of the probability \(p\) is the logarithm of the odds: \(logit(p) = log(\frac{p}{1-p}) = log(p) - log(1-p)\)
  • From statistical test’s view point, the null hypothesis \(H_{o}\) is rejected if and only if the log odds ratio is equal to 0. This result implies that the exploratory varible has statistically significant effect on the response variable. -Book Reference: Crawley (2013)

3 Objectives:

  • create and interpret a Poisson regression model analysing differences in species numbers for Islands in the archipelago of Galapagos.
  • use dataset gala in the faraway package
'data.frame':   30 obs. of  7 variables:
 $ Species  : num  58 31 3 25 2 18 24 10 8 2 ...
 $ Endemics : num  23 21 3 9 1 11 0 7 4 2 ...
 $ Area     : num  25.09 1.24 0.21 0.1 0.05 ...
 $ Elevation: num  346 109 114 46 77 119 93 168 71 112 ...
 $ Nearest  : num  0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
 $ Scruz    : num  0.6 26.3 58.7 47.4 1.9 ...
 $ Adjacent : num  1.84 572.33 0.78 0.18 903.82 ...
[1] "Species"   "Endemics"  "Area"      "Elevation" "Nearest"   "Scruz"    
[7] "Adjacent" 
             Species Endemics    Area Elevation Nearest Scruz Adjacent
Baltra            58       23   25.09       346     0.6   0.6     1.84
Bartolome         31       21    1.24       109     0.6  26.3   572.33
Caldwell           3        3    0.21       114     2.8  58.7     0.78
Champion          25        9    0.10        46     1.9  47.4     0.18
Coamano            2        1    0.05        77     1.9   1.9   903.82
Daphne.Major      18       11    0.34       119     8.0   8.0     1.84
Daphne.Minor      24        0    0.08        93     6.0  12.0     0.34
Darwin            10        7    2.33       168    34.1 290.2     2.85
Eden               8        4    0.03        71     0.4   0.4    17.95
Enderby            2        2    0.18       112     2.6  50.2     0.10
Espanola          97       26   58.27       198     1.1  88.3     0.57
Fernandina        93       35  634.49      1494     4.3  95.3  4669.32
Gardner1          58       17    0.57        49     1.1  93.1    58.27
Gardner2           5        4    0.78       227     4.6  62.2     0.21
Genovesa          40       19   17.35        76    47.4  92.2   129.49
Isabela          347       89 4669.32      1707     0.7  28.1   634.49
Marchena          51       23  129.49       343    29.1  85.9    59.56
Onslow             2        2    0.01        25     3.3  45.9     0.10
Pinta            104       37   59.56       777    29.1 119.6   129.49
Pinzon           108       33   17.95       458    10.7  10.7     0.03
Las.Plazas        12        9    0.23        94     0.5   0.6    25.09
Rabida            70       30    4.89       367     4.4  24.4   572.33
SanCristobal     280       65  551.62       716    45.2  66.6     0.57
SanSalvador      237       81  572.33       906     0.2  19.8     4.89
SantaCruz        444       95  903.82       864     0.6   0.0     0.52
SantaFe           62       28   24.08       259    16.5  16.5     0.52
SantaMaria       285       73  170.92       640     2.6  49.2     0.10
Seymour           44       16    1.84       147     0.6   9.6    25.09
Tortuga           16        8    1.24       186     6.8  50.9    17.95
Wolf              21       12    2.85       253    34.1 254.7     2.33

5 Ordinary LM Modeling:

  • \(Species = f(Area, Elevation, Nearest, Scruz, Adjacent)\)
  • Presence of Heteroscedasticity in Gala Dataset

Call:
lm(formula = Species ~ Adjacent + Area + Elevation + Endemics + 
    Nearest + Scruz, data = gala)

Residuals:
    Min      1Q  Median      3Q     Max 
-68.219 -10.225   1.830   9.557  71.090 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -15.337942   9.423550  -1.628    0.117    
Adjacent      0.001811   0.011879   0.152    0.880    
Area          0.013258   0.011403   1.163    0.257    
Elevation    -0.047537   0.047596  -0.999    0.328    
Endemics      4.393654   0.481203   9.131 4.13e-09 ***
Nearest      -0.101460   0.500871  -0.203    0.841    
Scruz         0.008256   0.105884   0.078    0.939    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.96 on 23 degrees of freedom
Multiple R-squared:  0.9494,    Adjusted R-squared:  0.9362 
F-statistic: 71.88 on 6 and 23 DF,  p-value: 9.674e-14

5.2 Variance Inflation Factor (VIF)

  • to test the problem of multicollinearity
 Adjacent      Area Elevation  Endemics   Nearest     Scruz 
 3.645582  3.356530 13.920084  5.979425  1.767133  1.793814 
            (Intercept) Adjacent   Area Elevation Endemics Nearest  Scruz
(Intercept)       1.000   -0.011  0.202    -0.072   -0.260  -0.035 -0.383
Adjacent         -0.011    1.000  0.539    -0.840    0.706   0.184  0.051
Area              0.202    0.539  1.000    -0.703    0.357   0.181  0.058
Elevation        -0.072   -0.840 -0.703     1.000   -0.845  -0.104 -0.166
Endemics         -0.260    0.706  0.357    -0.845    1.000  -0.024  0.257
Nearest          -0.035    0.184  0.181    -0.104   -0.024   1.000 -0.611
Scruz            -0.383    0.051  0.058    -0.166    0.257  -0.611  1.000

5.3 studentized Breusch-Pagan test

  • test the non-constant error variance

    studentized Breusch-Pagan test

data:  Species ~ Adjacent + Area + Elevation + Endemics + Nearest +     Scruz
BP = 10.67, df = 1, p-value = 0.001089

5.4 Conclusion: OLModel.1

  • The Normal Q-Q plot of the residuals in OLModel.1 indicates that errors are higly non-normally distributed.
  • The VIF test indicates that there is no problem of multicollinearity amony 6 explanatory variables in OLModel.1 .
  • The studentized Breusch-Pagan test indicates that the non-constant error variance is statistically significant in OLModel.1 .

6 Poisson Regression Modeling:


Call:
glm(formula = Species ~ Adjacent + Area + Elevation + Nearest + 
    Scruz, family = poisson(link = log), data = gala)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-8.2752  -4.4966  -0.9443   1.9168  10.1849  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.155e+00  5.175e-02  60.963  < 2e-16 ***
Adjacent    -6.630e-04  2.933e-05 -22.608  < 2e-16 ***
Area        -5.799e-04  2.627e-05 -22.074  < 2e-16 ***
Elevation    3.541e-03  8.741e-05  40.507  < 2e-16 ***
Nearest      8.826e-03  1.821e-03   4.846 1.26e-06 ***
Scruz       -5.709e-03  6.256e-04  -9.126  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 3510.73  on 29  degrees of freedom
Residual deviance:  716.85  on 24  degrees of freedom
AIC: 889.68

Number of Fisher Scoring iterations: 5

6.1 Basic Diagnostic Plot: glmodel.1

  • The Diagnostic test for glmodel.1 found that
    • Residual deviance = 716.85 on 24 Degrees of freedom
    • The ratio of Residual deviance/Degrees of freedom = 716.85/24 = 29.86875 which is greater than 1.2, indicating that the model is overdispersed . That is, the variance is increasing with the mean.
    • Transformation of Poisson to quasiPoisson is a method to solve the Overdispersionproblem.

6.2 Has overdispersion problem in glmodel.1

\[\frac{\mathrm{Residual\_Deviance}}{\mathrm{DOF}} = \frac{716.85}{24} = 29.87 >> 1 \]

7 QuasiPoisson Regression Modeling:


Call:
glm(formula = Species ~ Adjacent + Area + Elevation + Nearest + 
    Scruz, family = quasipoisson(link = log), data = gala)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-8.2752  -4.4966  -0.9443   1.9168  10.1849  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.1548079  0.2915901  10.819 1.03e-10 ***
Adjacent    -0.0006630  0.0001653  -4.012 0.000511 ***
Area        -0.0005799  0.0001480  -3.918 0.000649 ***
Elevation    0.0035406  0.0004925   7.189 1.98e-07 ***
Nearest      0.0088256  0.0102622   0.860 0.398292    
Scruz       -0.0057094  0.0035251  -1.620 0.118380    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 31.74921)

    Null deviance: 3510.73  on 29  degrees of freedom
Residual deviance:  716.85  on 24  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

7.1 No overdispersion problem in glmodel.2

\[\frac{\mathrm{Residual\_Deviance}}{\mathrm{DOF}} = \frac{716.85}{24} = 29.87 < 31.74921 \]


Call:
glm(formula = Species ~ Adjacent + Area + Elevation, family = quasipoisson(link = log), 
    data = gala)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-11.852   -4.248   -1.013    2.466   10.607  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.9613109  0.2617589  11.313 1.53e-11 ***
Adjacent    -0.0007508  0.0001524  -4.928 4.07e-05 ***
Area        -0.0005704  0.0001381  -4.129 0.000334 ***
Elevation    0.0035891  0.0004721   7.602 4.54e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 30.08156)

    Null deviance: 3510.73  on 29  degrees of freedom
Residual deviance:  818.74  on 26  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

8 Calculate an ANOVA Table

Analysis of Deviance Table (Type II tests)

Response: Species
Error estimate based on Pearson residuals 

           Sum Sq Df F value    Pr(>F)    
Adjacent   624.61  1 19.6732 0.0001745 ***
Area       487.51  1 15.3550 0.0006472 ***
Elevation 1672.72  1 52.6856 1.688e-07 ***
Nearest     22.57  1  0.7107 0.4075257    
Scruz       96.77  1  3.0481 0.0936229 .  
Residuals  761.98 24                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Reference

Crawley, Michael J. 2013. “The R Book Second Edition.” John Wiley & Sons.