1 Polynomial Regression

   Year Hymenoptera SunAPR
1  1970          NA  128.7
2  1971          NA  115.7
3  1972          NA  130.7
4  1973          NA  150.0
5  1974          NA  149.5
6  1975          NA  125.7
7  1976          NA  152.6
8  1977          NA  157.4
9  1978          NA  110.4
10 1979          NA  121.5
11 1980       100.0  166.6
12 1981        96.1  129.0
13 1982        94.5  173.1
14 1983        97.3  136.0
15 1984        99.6  217.1
16 1985       100.7  128.1
17 1986        98.3  127.8
18 1987        96.2  154.2
19 1988        94.5  129.6
20 1989        93.4  132.1
21 1990        95.2  206.5
22 1991        95.7  148.8
23 1992        98.8  124.4
24 1993       100.7  113.7
25 1994       101.8  166.1
26 1995       102.8  173.1
27 1996       105.3  131.6
28 1997       104.5  165.3
29 1998        98.1  120.0
30 1999        95.7  154.9
31 2000        95.1  136.7
32 2001        94.4  134.0
33 2002        90.8  193.5
34 2003        88.9  192.0
35 2004        88.0  135.8
36 2005        90.7  147.4
37 2006        92.8  159.9
38 2007        93.5  215.6
39 2008        89.7  153.0
40 2009        82.9  172.0
41 2010        82.3  201.3
42 2011        84.9  218.3
43 2012        84.2  129.4
44 2013        83.2  168.4
45 2014        80.1  146.4
46 2015        78.0  220.7
47 2016        77.6  168.3
'data.frame':   47 obs. of  3 variables:
 $ Year       : int  1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
 $ Hymenoptera: num  NA NA NA NA NA NA NA NA NA NA ...
 $ SunAPR     : num  129 116 131 150 150 ...

1.1 Run a linear regression: Hymenoptera~Year


Call:
lm(formula = Hymenoptera ~ Year, data = mydataindex)

Residuals:
   Min     1Q Median     3Q    Max 
-7.123 -3.793 -1.203  3.017 11.097 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1152.11166  145.81814   7.901 2.73e-09 ***
Year          -0.53001    0.07298  -7.262 1.76e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.74 on 35 degrees of freedom
  (10 observations deleted due to missingness)
Multiple R-squared:  0.6011,    Adjusted R-squared:  0.5897 
F-statistic: 52.74 on 1 and 35 DF,  p-value: 1.758e-08

1.2 Run a Polynomial regression: Hymenoptera~ Year + I(Year^2)


Call:
lm(formula = Hymenoptera ~ Year + I(Year^2), data = mydataindex)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5962 -2.5437 -0.1513  1.7272  7.4874 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.298e+05  2.134e+04  -6.083 6.72e-07 ***
Year         1.306e+02  2.136e+01   6.112 6.16e-07 ***
I(Year^2)   -3.281e-02  5.347e-03  -6.137 5.72e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.312 on 34 degrees of freedom
  (10 observations deleted due to missingness)
Multiple R-squared:  0.8107,    Adjusted R-squared:  0.7996 
F-statistic: 72.83 on 2 and 34 DF,  p-value: 5.124e-13

2 Generalized Additive Models(GAM)

2.1 GAM: gam(Hymenoptera~s(Year)

  • Generalized Additive Models(GAM) is a linear predictor additive models, assuming that the error structures follow Poisson and binomial distributed. Details can refere the Chapter 18 in Crawley (2013)

Family: gaussian 
Link function: identity 

Formula:
Hymenoptera ~ s(Year)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  93.1432     0.3408   273.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df     F p-value    
s(Year) 8.351  8.883 48.07  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.922   Deviance explained =   94%
GCV = 5.7515  Scale est. = 4.2979    n = 37

2.2 Diagnostics: gam(Hymenoptera~s(Year)


Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 7 iterations.
The RMS GCV score gradient at convergence was 1.756672e-06 .
The Hessian was positive definite.
Model rank =  10 / 10 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

          k'  edf k-index p-value  
s(Year) 9.00 8.35    0.67   0.015 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.3 GAM: gam(Hymenoptera~s(Year,k=14)

  • The k value controls the “wiggliness” of the response curve, i.e., how tightly the response curve follows the individual data points. A low k value makes the response smoother, while a high k value results in a response curve that follows the individual observation points closely.

Family: gaussian 
Link function: identity 

Formula:
Hymenoptera ~ s(Year, k = 14)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  93.1432     0.2988   311.7   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df     F p-value    
s(Year) 11.11  12.37 45.66  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =   0.94   Deviance explained = 95.8%
GCV = 4.9125  Scale est. = 3.3044    n = 37

2.4 Diagnostics: gam(Hymenoptera~s(Year,k=14))


Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 9 iterations.
The RMS GCV score gradient at convergence was 3.356308e-05 .
The Hessian was positive definite.
Model rank =  14 / 14 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

          k'  edf k-index p-value  
s(Year) 13.0 11.1    0.78    0.05 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.5 Polynomial model Vs GAM model.

Analysis of Variance Table

Model 1: Hymenoptera ~ Year + I(Year^2)
Model 2: Hymenoptera ~ s(Year)
  Res.Df    RSS     Df Sum of Sq      F   Pr(>F)    
1 34.000 373.05                                     
2 27.649 118.83 6.3508    254.22 9.3136 1.02e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table

Model 1: Hymenoptera ~ Year + I(Year^2)
Model 2: Hymenoptera ~ s(Year, k = 14)
  Res.Df    RSS     Df Sum of Sq      F    Pr(>F)    
1 34.000 373.05                                      
2 24.888  82.24 9.1116    290.81 9.6588 3.362e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results:

  • shows that the GAM model (gamyear_K14 or gamyear ) has much lower residual sum of squares (RSS) than the polynomial model (RegModel.2).
  • the difference is highly significant, indicating that the response curve of the GAM model follows more tightly the observation points.

2.6 GAM: gam(Hymenoptera~s(SunAPR))


Family: gaussian 
Link function: identity 

Formula:
Hymenoptera ~ s(SunAPR)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   93.143      1.157   80.47   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df     F p-value  
s(SunAPR)   1      1 4.769  0.0356 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.0948   Deviance explained =   12%
GCV = 52.399  Scale est. = 49.566    n = 37

2.7 Diagnostics: gam(Hymenoptera~s(SunAPR))


Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 14 iterations.
The RMS GCV score gradient at convergence was 5.263863e-06 .
The Hessian was positive definite.
Model rank =  10 / 10 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

          k' edf k-index p-value  
s(SunAPR)  9   1    0.79    0.07 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.8 GAM:gam(Hymenoptera~s(Year,k=14)+s(SunAPR)


Family: gaussian 
Link function: identity 

Formula:
Hymenoptera ~ s(Year, k = 14) + s(SunAPR)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   93.143      0.302   308.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
            edf Ref.df      F p-value    
s(Year)   11.13  12.37 38.764  <2e-16 ***
s(SunAPR)  1.00   1.00  0.246   0.625    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.938   Deviance explained = 95.9%
GCV = 5.2283  Scale est. = 3.3735    n = 37

2.9 Diagnostics:gam(Hymenoptera~s(Year,k=14)+s(SunAPR)


Method: GCV   Optimizer: magic
Smoothing parameter selection converged after 17 iterations.
The RMS GCV score gradient at convergence was 6.360075e-07 .
The Hessian was positive definite.
Model rank =  23 / 23 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

            k'  edf k-index p-value  
s(Year)   13.0 11.1    0.78   0.045 *
s(SunAPR)  9.0  1.0    1.07   0.655  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.10 GAM: gamyearsun_02 Vs gamyear

Analysis of Deviance Table

Model 1: Hymenoptera ~ s(Year, k = 14) + s(SunAPR)
Model 2: Hymenoptera ~ s(Year)
  Resid. Df Resid. Dev      Df Deviance      F  Pr(>F)  
1    22.627      80.54                                  
2    27.117     118.83 -4.4901  -38.294 2.5281 0.06331 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.11 GAM: gamyearsun_02 Vs gamsun_01

Analysis of Deviance Table

Model 1: Hymenoptera ~ s(Year, k = 14) + s(SunAPR)
Model 2: Hymenoptera ~ s(SunAPR)
  Resid. Df Resid. Dev      Df Deviance      F    Pr(>F)    
1    22.627      80.54                                      
2    35.000    1734.82 -12.373  -1654.3 39.633 2.238e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3 GAM: Poisson-distributed Data

   resp wave  hill
1    19  0.0  7.04
2    10  0.1  9.04
3    13  0.2  5.48
4     6  0.3  5.24
5    20  0.4  8.60
6    12  0.5  7.32
7     9  0.6  8.76
8     9  0.7 10.00
9     9  0.8  9.00
10   10  0.9  7.80
11   14  1.0  6.16
12    9  1.1  6.08
13   11  1.2  5.52
14   15  1.3  7.96
15   11  1.4  8.48
16   14  1.5  8.40
17   11  1.6  9.44
18   12  1.7  7.84
19   15  1.8  7.88
20    7  1.9  5.72
21   13  2.0  6.96
22    8  2.1  9.48
23   11  2.2  7.40
24    8  2.3  9.32
25    4  2.4  9.96
26   15  2.5  7.92
27    7  2.6  8.72
28    3  2.7  5.32
29    5  2.8  9.52
30    5  2.9  5.40
'data.frame':   126 obs. of  3 variables:
 $ resp: int  19 10 13 6 20 12 9 9 9 10 ...
 $ wave: num  0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 ...
 $ hill: num  7.04 9.04 5.48 5.24 8.6 7.32 8.76 10 9 7.8 ...

5 Poisson Regression Modeling: gamdata.csv

5.1 glmodel.A1: resp ~ wave + hill


Call:
glm(formula = resp ~ wave + hill, family = poisson(link = log), 
    data = gamdatacsv)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.1532  -0.7368  -0.0819   0.6189   3.6487  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.334230   0.161659  14.439   <2e-16 ***
wave        -0.017758   0.007982  -2.225   0.0261 *  
hill         0.002789   0.019940   0.140   0.8888    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 193.89  on 125  degrees of freedom
Residual deviance: 188.88  on 123  degrees of freedom
AIC: 699.01

Number of Fisher Scoring iterations: 4

5.2 Basic Diagnostic Plot: resp ~ wave + hill

  • The Diagnostic test for glmodel.A1 found that
    • Residual deviance = 188.88 on 123 Degrees of freedom
    • The ratio of Residual deviance/Degrees of freedom = 188.88 / 123 = 1.53561 which is greater than 1.2, indicating that the model is overdispersed . That is, the variance is increasing with the mean.
    • Transformation of Poisson to quasiPoisson is a method to solve the Overdispersionproblem.

5.3 Has overdispersion problem in glm(resp ~ wave + hill)

\[\frac{\mathrm{Residual\_Deviance}}{\mathrm{DOF}} = \frac{188.88}{123} = 1.53561 >> 1 \]

6 QuasiPoisson Regression Modeling:


Call:
glm(formula = resp ~ wave + hill, family = quasipoisson(link = log), 
    data = gamdatacsv)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.1532  -0.7368  -0.0819   0.6189   3.6487  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.334230   0.196143  11.901   <2e-16 ***
wave        -0.017758   0.009685  -1.834   0.0691 .  
hill         0.002789   0.024193   0.115   0.9084    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 1.472122)

    Null deviance: 193.89  on 125  degrees of freedom
Residual deviance: 188.88  on 123  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

6.1 Overdispersion problem in glm(resp ~ wave + hill)

  • not so serious if using quasipoisson

\[\frac{\mathrm{Residual\_Deviance}}{\mathrm{DOF}} = \frac{188.88}{123} = 1.53561 > 1.472122 \]

6.2 Calculate an ANOVA Table

Analysis of Deviance Table (Type II tests)

Response: resp
Error estimate based on Pearson residuals 

           Sum Sq  Df F value  Pr(>F)  
wave        4.956   1  3.3665 0.06895 .
hill        0.020   1  0.0133 0.90840  
Residuals 181.070 123                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Deviance Table (Type II tests)

Response: resp
Error estimate based on Pearson residuals 

           Sum Sq  Df F value  Pr(>F)  
wave        4.956   1  3.3665 0.06895 .
hill        0.020   1  0.0133 0.90840  
Residuals 181.070 123                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7 GAM Poisson Regression Modeling: gamdata.csv


Family: gaussian 
Link function: identity 

Formula:
resp ~ s(wave) + s(hill)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.4524     0.2416   39.12   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df      F  p-value    
s(wave) 6.056  7.208  5.278 2.42e-05 ***
s(hill) 3.377  4.184 20.625 5.21e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.476   Deviance explained = 51.6%
GCV = 8.0183  Scale est. = 7.3543    n = 126

Family: poisson 
Link function: log 

Formula:
resp ~ s(wave) + s(hill)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.20920    0.03001   73.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df Chi.sq  p-value    
s(wave) 5.805  6.949  29.57 0.000125 ***
s(hill) 3.593  4.452  66.89 4.52e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.465   Deviance explained = 51.4%
UBRE = -0.087376  Scale est. = 1         n = 126

7.1 Anova Test

Analysis of Deviance Table

Model 1: resp ~ s(wave) + s(hill)
Model 2: resp ~ wave + hill
  Resid. Df Resid. Dev    Df Deviance  Pr(>Chi)    
1     113.6     94.195                             
2     126.0    188.880 -12.4  -94.685 9.356e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Deviance Table

Model 1: resp ~ s(wave) + s(hill)
Model 2: resp ~ s(wave) + s(hill)
  Resid. Df Resid. Dev         Df Deviance  Pr(>Chi)    
1    113.60      94.19                                  
2    113.61     849.92 -0.0080872  -755.72 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Reference

Crawley, Michael J. 2013. “The R Book Second Edition.” John Wiley & Sons.