Task 1: Load the “Northeast Climate” R Dataset ne_stations.RData, with the load function. Use the verbose=TRUE optional argument to display the names of the objects that are loaded.
load("ne_stations.RData", verbose = TRUE)
## Loading objects:
## ne
## ne.m
## ne.df
## state.ne
## state.ne.m
## ne.crs
Q1: What are the data types of these objects? For those that are spatial, what are their coördinate reference systems (CRS)?
Note: you will have to load the sp package to see the CRS.
A:
install.packages("sp")
library("sp")
class(ne)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
proj4string(ne)
## [1] "+proj=longlat +datum=NAD27 +no_defs +ellps=clrk66 +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat"
class(ne.m)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
proj4string(ne.m)
## [1] "+proj=aea +lat_0=42.5 +lat_1=39 +lat_2=44 +lon_0=-76 +ellps=WGS84 +units=m"
class(ne.df)
## [1] "data.frame"
class(state.ne)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
proj4string(state.ne)
## [1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"
class(state.ne.m)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
proj4string(state.ne.m)
## [1] "+proj=aea +lat_0=42.5 +lat_1=39 +lat_2=44 +lon_0=-76 +ellps=WGS84 +units=m"
class(ne.crs)
## [1] "CRS"
## attr(,"package")
## [1] "sp"
We will work mostly with object ne.df, a dataframe version of the climate stations.
Q2: For how many climate stations are there records? How many states of the USA are represented?
A:
length(ne.df$STATION_ID)
## [1] 305
levels(ne.df$STATE)
## [1] "NJ" "NY" "PA" "VT"
Task 2: Display the names of the variables in this dataset.
names(ne.df)
## [1] "STATION_ID" "STATE" "STATION_NA" "LATITUDE_D" "LONGITUDE_"
## [6] "ELEVATION_" "OID_" "COOP_ID" "STATE_1" "STN_NAME"
## [11] "LAT_DD" "LONG_DD" "ELEV_FT" "JAN_GDD50" "FEB_GDD50"
## [16] "MAR_GDD50" "APR_GDD50" "MAY_GDD50" "JUN_GDD50" "JUL_GDD50"
## [21] "AUG_GDD50" "SEP_GDD50" "OCT_GDD50" "NOV_GDD50" "DEC_GDD50"
## [26] "ANN_GDD50" "E" "N"
Q3: What are the attributes in the ne.df data frame?
A:
ne.df data frame are “STATION_ID” “STATE” “STATION_NA” “LATITUDE_D” “LONGITUDE_” “ELEVATION_” “OID_” “COOP_ID” “STATE_1” “STN_NAME” “LAT_DD” “LONG_DD” “ELEV_FT” “JAN_GDD50” “FEB_GDD50” “MAR_GDD50” “APR_GDD50” “MAY_GDD50” “JUN_GDD50” “JUL_GDD50” “AUG_GDD50” “SEP_GDD50” “OCT_GDD50” “NOV_GDD50” “DEC_GDD50” “ANN_GDD50” “E” “N”.The response variable here is ANN_GDD50 (annual growing-degree days), the coördinates are in fields E and N, the elevation of the station is in field ELEVATION_.
Task 3: Compute a first-order trend surface model using the two coordinates (East and North) as the predictors, and the growing degree days as the dependent (response) variable, using Ordinary Least Squares linear regression.
model_t3 <- lm(ANN_GDD50 ~ E + N, data = ne.df)
Display its summary and the linear model diagnostic plots “residuals vs. fitted” and “Normal Q-Q”.
summary(model_t3)
##
## Call:
## lm(formula = ANN_GDD50 ~ E + N, data = ne.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1207.36 -232.79 27.11 259.59 847.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.298e+03 2.381e+01 96.478 < 2e-16 ***
## E 7.483e-04 1.211e-04 6.181 2.07e-09 ***
## N -2.818e-03 1.369e-04 -20.577 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 371.5 on 302 degrees of freedom
## Multiple R-squared: 0.5837, Adjusted R-squared: 0.581
## F-statistic: 211.7 on 2 and 302 DF, p-value: < 2.2e-16
par(mfrow = c(1,2))
plot(model_t3, which = 1:2)
Q4: What proportion of the variation in GDD50 is explained by the first-order trend?
A:
Q5: Which of the two coördinates most affects the GDD50?
A:
Q6: Is there a relation between the fitted values and residuals? If so, what?
A:
Q7: Are the residuals normally-distributed? If not, how so? Are there any especially large residuals?
A:
Optional Task: Identify the most poorly-modelled station.
Optional Q: Where is this? Why might it be so poorly modelled?
Optional A:
Task 4: Display a post-plot of the residuals, with the positive residuals (under-predictions) shown in green and the negative (over-predictions) in red.
res_t3 <- residuals(model_t3)
plot(ne.df$N ~ ne.df$E, cex=3*abs(res_t3)/max(abs(res_t3)), col=ifelse(res_t3 > 0, "green", "red"), xlab="E", ylab="N",
main="Residuals from 1st-order trend", sub="Positive: green; negative: red", asp=1)
grid()
Q8: What is the spatial pattern of the trend-surface residuals? Does it suggest a higher-order trend?
A:
Task 5: Repeat all the steps of the previous section, and answer the same questions, for a full 2nd-order trend surface.
model_t5 <- lm(ANN_GDD50 ~ E + N + I(E^2) + I(N^2) +I(E*N),
data = ne.df)
Q9: What proportion of the variation in GDD50 is explained by the second-order trend?
A:
summary(model_t5)
##
## Call:
## lm(formula = ANN_GDD50 ~ E + N + I(E^2) + I(N^2) + I(E * N),
## data = ne.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1076.18 -214.61 37.77 229.11 792.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.131e+03 3.529e+01 60.402 < 2e-16 ***
## E -8.336e-05 1.444e-04 -0.577 0.564
## N -1.523e-03 1.788e-04 -8.520 7.95e-16 ***
## I(E^2) 2.821e-09 6.927e-10 4.073 5.95e-05 ***
## I(N^2) 7.526e-09 8.380e-10 8.981 < 2e-16 ***
## I(E * N) -8.036e-09 9.691e-10 -8.292 3.84e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 323.9 on 299 degrees of freedom
## Multiple R-squared: 0.6868, Adjusted R-squared: 0.6816
## F-statistic: 131.1 on 5 and 299 DF, p-value: < 2.2e-16
Q10: Which of the two coördinates (taking into account their squares also) most affects the GDD50?
A:
Q11: Is there a relation between the fitted values and residuals? If so, what?
A:
par(mfrow = c(1,2))
plot(model_t5, which = 1:2)
Q12: Are the residuals normally-distributed? If not, how so? Are there any especially large residuals?
A:
Optional Task: Identify the most poorly-modelled station.
Optional Q: Where is this? Why might it be so poorly modelled?
Optional A:
Task 6: Display a post-plot of the residuals, with the positive residuals (under-predictions) shown in green and the negative (over-predictions) in red.
res_t5 <- residuals(model_t5)
plot(ne.df$N ~ ne.df$E, cex=3*abs(res_t5)/max(abs(res_t5)), col=ifelse(res_t5 > 0, "green", "red"), xlab="E", ylab="N", main="Residuals from 2nd-order trend", sub="Positive: green; negative: red", asp=1)
grid()
Q13: What is the spatial pattern of the trend-surface residuals? Does it suggest a higher-order trend?
A:
Task 7: Compare the two orders of trend surfaces with an analysis of variance.
anova(model_t3, model_t5)
## Analysis of Variance Table
##
## Model 1: ANN_GDD50 ~ E + N
## Model 2: ANN_GDD50 ~ E + N + I(E^2) + I(N^2) + I(E * N)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 302 41685611
## 2 299 31361422 3 10324189 32.81 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Q14: Is the 2nd-order trend clearly superior to the 1st-order trend?
A:
***, suggesting that most independent variables are significant at 0.000 level, so the second-order surface is statistically superior to the first-order surface.Task 8: Load the raster package into the workspace.
install.packages("raster")
Then, load a gridded DEM covering the study area dem_nj_ny_pa_vt_4km.RData, with the load function. Use the verbose=TRUE optional argument to display the names of the objects that are loaded.
load("dem_nj_ny_pa_vt_4km.RData", verbose = TRUE)
## Loading objects:
## dem.ne.m.sp
Q15: What object is loaded? What is its class? What is its CRS? Which of the point datasets does this match?
A:
dem.ne.m.spclass(dem.ne.m.sp)
## [1] "SpatialPixelsDataFrame"
## attr(,"package")
## [1] "sp"
proj4string(dem.ne.m.sp)
## [1] "+proj=aea +lat_0=42.5 +lat_1=39 +lat_2=44 +lon_0=-76 +ellps=WGS84 +units=m"
Q16: What is the bounding box and resolution of the grid? (You can see these with the summary function.)
A:
summary(dem.ne.m.sp)
## Object of class SpatialPixelsDataFrame
## Coordinates:
## min max
## E -385999.3 355750.7
## N -395302.4 289937.6
## Is projected: TRUE
## proj4string :
## [+proj=aea +lat_0=42.5 +lat_1=39 +lat_2=44 +lon_0=-76 +ellps=WGS84
## +units=m]
## Number of points: 22649
## Grid attributes:
## cellcentre.offset cellsize cells.dim
## E -384274.3 3450 216
## N -393450.4 3704 185
## Data attributes:
## ELEVATION_
## Min. : -6.697
## 1st Qu.: 602.504
## Median :1173.889
## Mean :1125.085
## 3rd Qu.:1587.356
## Max. :3992.470
range(dem.ne.m.sp$E)
## [1] -384274.3 354025.7
range(dem.ne.m.sp$N)
## [1] -393450.4 288085.6
Task 9: Display the grid with the spplot function.
(grid_t9 <- spplot(dem.ne.m.sp, xlab="East", ylab="North"))
Task 10: Predict over this study area with the 2nd-order trend surface. Add this as a field in the study area data frame.
pred_t10 <- predict.lm(model_t5, newdata = as.data.frame(dem.ne.m.sp))
dem.ne.m.sp$PRED_2ndTS <- predict(model_t5, newdata=as.data.frame(dem.ne.m.sp))
Task 11: Summarize the predictions.
summary(pred_t10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1822 2087 2332 2413 2630 4227
Q17: How do the predictions compare with the summary of the weather stations observations?
summary(ne.df$ANN_GDD50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 795 2100 2463 2518 2930 4021
summary(pred_t10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1822 2087 2332 2413 2630 4227
A:
Task 12: Display a map of the predictions using the spplot function, and specifying the field to display with the zcol argument to spplot.
(grid_t12 <- spplot(dem.ne.m.sp, zcol="PRED_2ndTS",
xlab="East", ylab="North"))
Q18: Is this trend surface a realistic representation of the growing-degree days? Explain.
A:
Yes it is. The figure shows realistic geological information, such as the curve line indicating ravers or valleies.
Task 13: Display a scatterplot of the GDD50 against the two coördinates.
require(ggplot2)
## Loading required package: ggplot2
plote_t13 <- ggplot(ne.df, aes(x=E, y=ANN_GDD50)) + geom_point() +
geom_smooth(method = "loess")
plotn_t13 <- ggplot(ne.df, aes(x=N, y=ANN_GDD50)) + geom_point() +
geom_smooth(method = "loess")
require(gridExtra)
## Loading required package: gridExtra
grid.arrange(plote_t13, plotn_t13, ncol = 2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Q19: Do these relations look linear or quadratic throughout their ranges?
A:
The relation with North seems almost linear. However, the relation with East is much more scattered.
Task 14: Compute a Generalized Additive Model (GAM) of GDD50 modelled by the two coördinates.
require(mgcv)
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
model_t14 <- gam(ANN_GDD50 ~ s(E, N), data = ne.df)
Q20: What proportion of the variation in GDD50 is explained by the GAM? How does this compare to the 2nd-order trend surface?
summary(model_t14)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## ANN_GDD50 ~ s(E, N)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2517.52 15.73 160 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(E,N) 24.46 27.8 36.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.771 Deviance explained = 78.9%
## GCV = 82348 Scale est. = 75474 n = 305
A:
Task 15: Predict over this study area with the fitted GAM. Add this as a field in the study area data frame.
pred_t15 <- predict.gam(model_t14, newdata = as.data.frame(dem.ne.m.sp))
dem.ne.m.sp$PRED_GAM <- pred_t15
Note: you will have to convert the SpatialPixelsDataFrame to a regular data.frame as the argument to newdata, because predict.gam does not recognize sp objects.
Task 16: Summarize the predictions.
summary(pred_t15)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1634 2047 2281 2396 2667 3633
Q21: How do the predictions compare with the summary of the weather stations observations, and to the summary of the 2nd-order trend surface predictions?
A:
summary(ne.df$ANN_GDD50)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 795 2100 2463 2518 2930 4021
summary(pred_t10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1822 2087 2332 2413 2630 4227
summary(pred_t15)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1634 2047 2281 2396 2667 3633
Task 17: Display a post-plot of the residuals.
res_t17 <- residuals.gam(model_t14)
plot(ne.df$N ~ ne.df$E, cex=3*abs(res_t17)/max(abs(res_t17)), col=ifelse(res_t17 > 0, "green", "red"), xlab="E", ylab="N",
main="Residuals from GAM", sub="Positive: green; negative: red", asp=1)
grid()
Q22: What is the spatial pattern of the GAM trend-surface residuals? How does this differ from the pattern of the 2nd-order trend residuals?
A:
Task 18: Display a map of the predictions using the spplot function.
library(sp)
(map_t18 <- spplot(dem.ne.m.sp, zcol="PRED_GAM",
xlab="East", ylab="North"))
## Warning in wkt(obj): CRS object has no comment
## Warning in wkt(obj): CRS object has no comment
Q23: Is this trend surface a realistic representation of the growing-degree days? Explain. How does it improve on the 2nd-order trend?
A:
Task: Load the fields package, to use for thin-plate spline calculation.
Task: Add a field to the stations dataframe, containing a matrix of the two coördinates, as required by the fields package.
Task: compute the thin-plate spline for the growing degree days, as a function of coördinates.
Task: Predict with the thin-plate spline over the study area. Note you will have to make a matrix from the grid coördinates, interpolate on that, and then convert the prediction to a field in the grid.
Task: Display the map.
Q: Is this trend surface a realistic representation of the growing-degree days? Explain. Does it improve on the GAM trend?
A:
Of course, elevation also affects temperature and thus the growing-degree days.
Repeat the exercise, but including elevation in all the models. Comment on how and why the trend surfaces (now 3D) are different/improved from the 2D trends.
In these models, consider elevation as an independent dimension, with no interaction with the geographic coördinates (although, that may not be a valid assumption… you could test this… ).
Also, see if a transformation of elevation might make its relation with the target variable more linear.
Task: Save the workspace for a later exercise using the save.image function: