1 Dataset


Task 1: Load the “Northeast Climate” R Dataset ne_stations.RData, with the load function. Use the verbose=TRUE optional argument to display the names of the objects that are loaded.

load("ne_stations.RData", verbose = TRUE)
## Loading objects:
##   ne
##   ne.m
##   ne.df
##   state.ne
##   state.ne.m
##   ne.crs


Q1: What are the data types of these objects? For those that are spatial, what are their coördinate reference systems (CRS)?

Note: you will have to load the sp package to see the CRS.

A:

install.packages("sp")
library("sp")
class(ne)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
proj4string(ne)
## [1] "+proj=longlat +datum=NAD27 +no_defs +ellps=clrk66 +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat"
class(ne.m)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
proj4string(ne.m)
## [1] "+proj=aea +lat_0=42.5 +lat_1=39 +lat_2=44 +lon_0=-76 +ellps=WGS84 +units=m"
class(ne.df)
## [1] "data.frame"
class(state.ne)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
proj4string(state.ne)
## [1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"
class(state.ne.m)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
proj4string(state.ne.m)
## [1] "+proj=aea +lat_0=42.5 +lat_1=39 +lat_2=44 +lon_0=-76 +ellps=WGS84 +units=m"
class(ne.crs)
## [1] "CRS"
## attr(,"package")
## [1] "sp"

We will work mostly with object ne.df, a dataframe version of the climate stations.


Q2: For how many climate stations are there records? How many states of the USA are represented?

A:

length(ne.df$STATION_ID)
## [1] 305
levels(ne.df$STATE)
## [1] "NJ" "NY" "PA" "VT"
  • There are 305 climate stations in the records.
  • There are 4 states (NJ, NY, PA, VT) are represented.


Task 2: Display the names of the variables in this dataset.

names(ne.df)
##  [1] "STATION_ID" "STATE"      "STATION_NA" "LATITUDE_D" "LONGITUDE_"
##  [6] "ELEVATION_" "OID_"       "COOP_ID"    "STATE_1"    "STN_NAME"  
## [11] "LAT_DD"     "LONG_DD"    "ELEV_FT"    "JAN_GDD50"  "FEB_GDD50" 
## [16] "MAR_GDD50"  "APR_GDD50"  "MAY_GDD50"  "JUN_GDD50"  "JUL_GDD50" 
## [21] "AUG_GDD50"  "SEP_GDD50"  "OCT_GDD50"  "NOV_GDD50"  "DEC_GDD50" 
## [26] "ANN_GDD50"  "E"          "N"


Q3: What are the attributes in the ne.df data frame?

A:

  • Attributes in the ne.df data frame are “STATION_ID” “STATE” “STATION_NA” “LATITUDE_D” “LONGITUDE_” “ELEVATION_” “OID_” “COOP_ID” “STATE_1” “STN_NAME” “LAT_DD” “LONG_DD” “ELEV_FT” “JAN_GDD50” “FEB_GDD50” “MAR_GDD50” “APR_GDD50” “MAY_GDD50” “JUN_GDD50” “JUL_GDD50” “AUG_GDD50” “SEP_GDD50” “OCT_GDD50” “NOV_GDD50” “DEC_GDD50” “ANN_GDD50” “E” “N”.

The response variable here is ANN_GDD50 (annual growing-degree days), the coördinates are in fields E and N, the elevation of the station is in field ELEVATION_.


2 First-order trend surface


Task 3: Compute a first-order trend surface model using the two coordinates (East and North) as the predictors, and the growing degree days as the dependent (response) variable, using Ordinary Least Squares linear regression.

model_t3 <- lm(ANN_GDD50 ~ E + N, data = ne.df)

Display its summary and the linear model diagnostic plots “residuals vs. fitted” and “Normal Q-Q”.

summary(model_t3)
## 
## Call:
## lm(formula = ANN_GDD50 ~ E + N, data = ne.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1207.36  -232.79    27.11   259.59   847.42 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.298e+03  2.381e+01  96.478  < 2e-16 ***
## E            7.483e-04  1.211e-04   6.181 2.07e-09 ***
## N           -2.818e-03  1.369e-04 -20.577  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 371.5 on 302 degrees of freedom
## Multiple R-squared:  0.5837, Adjusted R-squared:  0.581 
## F-statistic: 211.7 on 2 and 302 DF,  p-value: < 2.2e-16
par(mfrow = c(1,2))
plot(model_t3, which = 1:2)


Q4: What proportion of the variation in GDD50 is explained by the first-order trend?

A:

  • The proportion of variability explained is 0.58.


Q5: Which of the two coördinates most affects the GDD50?

A:

  • For each km E it increase by 7.483e-04 m (0.0007483 m); for each km N it decrease by -2.818e-03 m (-0.002818 m). They both are not that important to the predictants. Compared to E, N is kind of more affected to the GDD50.


Q6: Is there a relation between the fitted values and residuals? If so, what?

A:

  • There is a relation between residuals and fitted values: residuals at both extremes are positive (under-predictions); in the mid-range most residuals are negative (over-predictions). Mean residual is not 0 through the range of fitted values.


Q7: Are the residuals normally-distributed? If not, how so? Are there any especially large residuals?

A:

  • Extreme residuals are not from a normal distribution. There are some extreme value influence the residuals distribution.


Optional Task: Identify the most poorly-modelled station.

Optional Q: Where is this? Why might it be so poorly modelled?

Optional A:

Task 4: Display a post-plot of the residuals, with the positive residuals (under-predictions) shown in green and the negative (over-predictions) in red.

res_t3 <- residuals(model_t3)
plot(ne.df$N ~ ne.df$E, cex=3*abs(res_t3)/max(abs(res_t3)), col=ifelse(res_t3 > 0, "green", "red"), xlab="E", ylab="N",
main="Residuals from 1st-order trend", sub="Positive: green; negative: red", asp=1)
grid()


Q8: What is the spatial pattern of the trend-surface residuals? Does it suggest a higher-order trend?

A:

  • Spatial pattern: large residuals tend to be near each other, and vice-versa. It suggests a higher-order trend.


3 Second-order trend surface


Task 5: Repeat all the steps of the previous section, and answer the same questions, for a full 2nd-order trend surface.

model_t5 <- lm(ANN_GDD50 ~ E + N + I(E^2) + I(N^2) +I(E*N), 
               data = ne.df)


Q9: What proportion of the variation in GDD50 is explained by the second-order trend?

A:

summary(model_t5)
## 
## Call:
## lm(formula = ANN_GDD50 ~ E + N + I(E^2) + I(N^2) + I(E * N), 
##     data = ne.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1076.18  -214.61    37.77   229.11   792.40 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.131e+03  3.529e+01  60.402  < 2e-16 ***
## E           -8.336e-05  1.444e-04  -0.577    0.564    
## N           -1.523e-03  1.788e-04  -8.520 7.95e-16 ***
## I(E^2)       2.821e-09  6.927e-10   4.073 5.95e-05 ***
## I(N^2)       7.526e-09  8.380e-10   8.981  < 2e-16 ***
## I(E * N)    -8.036e-09  9.691e-10  -8.292 3.84e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 323.9 on 299 degrees of freedom
## Multiple R-squared:  0.6868, Adjusted R-squared:  0.6816 
## F-statistic: 131.1 on 5 and 299 DF,  p-value: < 2.2e-16
  • The proportion of variability explained is 0.68.


Q10: Which of the two coördinates (taking into account their squares also) most affects the GDD50?

A:

  • For each km E it increase by -8.336e-05 +5.311309e-05 m (-0.00003024691 m); for each km N it decrease by -1.523e-03 + 8.675252e-05 m (-0.001436247 m). They both are not that important to the predictants. Compared to E, N is kind of more affected to the GDD50.


Q11: Is there a relation between the fitted values and residuals? If so, what?

A:

par(mfrow = c(1,2))
plot(model_t5, which = 1:2)

  • Relation of fits vs. residuals seen in 1st order trend has been removed. But systematic over-prediction of highest values.


Q12: Are the residuals normally-distributed? If not, how so? Are there any especially large residuals?

A:

  • No, they are not normally distributed. There are over-prediction of highest values. There are especially large residuals.


Optional Task: Identify the most poorly-modelled station.

Optional Q: Where is this? Why might it be so poorly modelled?

Optional A:

Task 6: Display a post-plot of the residuals, with the positive residuals (under-predictions) shown in green and the negative (over-predictions) in red.

res_t5 <- residuals(model_t5)
plot(ne.df$N ~ ne.df$E, cex=3*abs(res_t5)/max(abs(res_t5)), col=ifelse(res_t5 > 0, "green", "red"), xlab="E", ylab="N", main="Residuals from 2nd-order trend", sub="Positive: green; negative: red", asp=1)
grid()


Q13: What is the spatial pattern of the trend-surface residuals? Does it suggest a higher-order trend?

A:

  • These residuals form local clusters of positive, negative, and near-zero; there does not appear to be any overall spatial pattern. So, a higher-order trend surface is not indicated.


4 Comparing trend orders


Task 7: Compare the two orders of trend surfaces with an analysis of variance.

anova(model_t3, model_t5)
## Analysis of Variance Table
## 
## Model 1: ANN_GDD50 ~ E + N
## Model 2: ANN_GDD50 ~ E + N + I(E^2) + I(N^2) + I(E * N)
##   Res.Df      RSS Df Sum of Sq     F    Pr(>F)    
## 1    302 41685611                                 
## 2    299 31361422  3  10324189 32.81 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Q14: Is the 2nd-order trend clearly superior to the 1st-order trend?

A:

  • P-values are ***, suggesting that most independent variables are significant at 0.000 level, so the second-order surface is statistically superior to the first-order surface.


5 Mapping


Task 8: Load the raster package into the workspace.

install.packages("raster")


Then, load a gridded DEM covering the study area dem_nj_ny_pa_vt_4km.RData, with the load function. Use the verbose=TRUE optional argument to display the names of the objects that are loaded.

load("dem_nj_ny_pa_vt_4km.RData", verbose = TRUE)
## Loading objects:
##   dem.ne.m.sp


Q15: What object is loaded? What is its class? What is its CRS? Which of the point datasets does this match?

A:

  • Loading objects: dem.ne.m.sp
class(dem.ne.m.sp)
## [1] "SpatialPixelsDataFrame"
## attr(,"package")
## [1] "sp"
proj4string(dem.ne.m.sp)
## [1] "+proj=aea +lat_0=42.5 +lat_1=39 +lat_2=44 +lon_0=-76 +ellps=WGS84 +units=m"


Q16: What is the bounding box and resolution of the grid? (You can see these with the summary function.)

A:

summary(dem.ne.m.sp)
## Object of class SpatialPixelsDataFrame
## Coordinates:
##         min      max
## E -385999.3 355750.7
## N -395302.4 289937.6
## Is projected: TRUE 
## proj4string :
## [+proj=aea +lat_0=42.5 +lat_1=39 +lat_2=44 +lon_0=-76 +ellps=WGS84
## +units=m]
## Number of points: 22649
## Grid attributes:
##   cellcentre.offset cellsize cells.dim
## E         -384274.3     3450       216
## N         -393450.4     3704       185
## Data attributes:
##    ELEVATION_      
##  Min.   :  -6.697  
##  1st Qu.: 602.504  
##  Median :1173.889  
##  Mean   :1125.085  
##  3rd Qu.:1587.356  
##  Max.   :3992.470
  • bounding box
range(dem.ne.m.sp$E)
## [1] -384274.3  354025.7
range(dem.ne.m.sp$N)
## [1] -393450.4  288085.6


  • The resolution of grid is 3450m * 3704m.


Task 9: Display the grid with the spplot function.

(grid_t9 <- spplot(dem.ne.m.sp, xlab="East", ylab="North"))


Task 10: Predict over this study area with the 2nd-order trend surface. Add this as a field in the study area data frame.

pred_t10 <- predict.lm(model_t5, newdata = as.data.frame(dem.ne.m.sp))
dem.ne.m.sp$PRED_2ndTS <- predict(model_t5, newdata=as.data.frame(dem.ne.m.sp))


Task 11: Summarize the predictions.

summary(pred_t10)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1822    2087    2332    2413    2630    4227


Q17: How do the predictions compare with the summary of the weather stations observations?

summary(ne.df$ANN_GDD50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     795    2100    2463    2518    2930    4021
summary(pred_t10)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1822    2087    2332    2413    2630    4227


A:

Task 12: Display a map of the predictions using the spplot function, and specifying the field to display with the zcol argument to spplot.

(grid_t12 <- spplot(dem.ne.m.sp, zcol="PRED_2ndTS",
                    xlab="East", ylab="North"))

Q18: Is this trend surface a realistic representation of the growing-degree days? Explain.

A:

Yes it is. The figure shows realistic geological information, such as the curve line indicating ravers or valleies.


6 Generalized Additive Model (GAM)


Task 13: Display a scatterplot of the GDD50 against the two coördinates.

require(ggplot2)
## Loading required package: ggplot2
plote_t13 <- ggplot(ne.df, aes(x=E, y=ANN_GDD50)) + geom_point() +
   geom_smooth(method = "loess")
plotn_t13 <- ggplot(ne.df, aes(x=N, y=ANN_GDD50)) + geom_point() +
   geom_smooth(method = "loess")
require(gridExtra)
## Loading required package: gridExtra
grid.arrange(plote_t13, plotn_t13, ncol = 2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'


Q19: Do these relations look linear or quadratic throughout their ranges?

A:

The relation with North seems almost linear. However, the relation with East is much more scattered.


Task 14: Compute a Generalized Additive Model (GAM) of GDD50 modelled by the two coördinates.

require(mgcv)
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
model_t14 <- gam(ANN_GDD50 ~ s(E, N), data = ne.df)


Q20: What proportion of the variation in GDD50 is explained by the GAM? How does this compare to the 2nd-order trend surface?

summary(model_t14)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## ANN_GDD50 ~ s(E, N)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2517.52      15.73     160   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value    
## s(E,N) 24.46   27.8 36.98  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.771   Deviance explained = 78.9%
## GCV =  82348  Scale est. = 75474     n = 305

A:

  • The proportion of variability explained is 0.77.
  • Compared with 2nd-order trend surface, the proportion becomes higher.


Task 15: Predict over this study area with the fitted GAM. Add this as a field in the study area data frame.

pred_t15 <- predict.gam(model_t14, newdata = as.data.frame(dem.ne.m.sp))
dem.ne.m.sp$PRED_GAM <- pred_t15

Note: you will have to convert the SpatialPixelsDataFrame to a regular data.frame as the argument to newdata, because predict.gam does not recognize sp objects.


Task 16: Summarize the predictions.

summary(pred_t15)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1634    2047    2281    2396    2667    3633


Q21: How do the predictions compare with the summary of the weather stations observations, and to the summary of the 2nd-order trend surface predictions?

A:

summary(ne.df$ANN_GDD50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     795    2100    2463    2518    2930    4021
summary(pred_t10)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1822    2087    2332    2413    2630    4227
summary(pred_t15)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1634    2047    2281    2396    2667    3633


Task 17: Display a post-plot of the residuals.

res_t17 <- residuals.gam(model_t14)
plot(ne.df$N ~ ne.df$E, cex=3*abs(res_t17)/max(abs(res_t17)), col=ifelse(res_t17 > 0, "green", "red"), xlab="E", ylab="N",
main="Residuals from GAM", sub="Positive: green; negative: red", asp=1)
grid()


Q22: What is the spatial pattern of the GAM trend-surface residuals? How does this differ from the pattern of the 2nd-order trend residuals?

  • No residual spatial correlation. Compared with the 2nd-order trend residuals, GAM trend-surface residuals doesn’t appear any spatial pattern.


A:

Task 18: Display a map of the predictions using the spplot function.

library(sp)
(map_t18 <- spplot(dem.ne.m.sp, zcol="PRED_GAM",
                   xlab="East", ylab="North"))
## Warning in wkt(obj): CRS object has no comment

## Warning in wkt(obj): CRS object has no comment


Q23: Is this trend surface a realistic representation of the growing-degree days? Explain. How does it improve on the 2nd-order trend?

A:

  • Yes, it is. It improved by including elevation data in the prediction. Independent marginal effect of predictors: 2D trend, 1D elevation. Removes spatial dependence of OLS residuals at the range of the empirical smoother.


7 Thin-plate splines


Task: Load the fields package, to use for thin-plate spline calculation.

Task: Add a field to the stations dataframe, containing a matrix of the two coördinates, as required by the fields package.

Task: compute the thin-plate spline for the growing degree days, as a function of coördinates.

Task: Predict with the thin-plate spline over the study area. Note you will have to make a matrix from the grid coördinates, interpolate on that, and then convert the prediction to a field in the grid.

Task: Display the map.

Q: Is this trend surface a realistic representation of the growing-degree days? Explain. Does it improve on the GAM trend?

A:

8 Challenge – including elevation

Of course, elevation also affects temperature and thus the growing-degree days.

Repeat the exercise, but including elevation in all the models. Comment on how and why the trend surfaces (now 3D) are different/improved from the 2D trends.

In these models, consider elevation as an independent dimension, with no interaction with the geographic coördinates (although, that may not be a valid assumption… you could test this… ).

Also, see if a transformation of elevation might make its relation with the target variable more linear.

9 Saving the workspace

Task: Save the workspace for a later exercise using the save.image function: