We'll first import the SST data that I showed in class, and just do some simple manipulations in R with it.
# library(ncdf4)
# nc <- nc_open('../Data/sst.mnmean.nc')
# mean.sst <- ncvar_get(nc)
# That worked for me, but the CRAN page suggests that it may not work on a
# PC.  You might try the ncdf package (not ncdf4).
# Either way, I will save it to an R data object that you can download
# directly:
# save(mean.sst, file='../Data/mean.sst.Rdata')
load("../Data/mean.sst.Rdata")
# I'll have to manually enter the coordinate information (with a dummy
# index right now for time)
dimnames(mean.sst) <- list(East = seq(from = 1, by = 2, length = 180),
    North = seq(from = 88, by = -2, to = -88),
    Date = seq(1, 1916))
There are a lot of packages for storing dates in R. For many physical scientists, zoo is a helpful package: it has special classes for year-month and year-quarter data. chron and timeDate are sometimes useful (timeDate has time-zone support). lubridate is a unique package that has separate classes for time periods, intervals, and durations (for example, an interval is anchored to specific points in time, while a duration or period is a difference that you might want to add to a time).
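To make the distinction concrete, here is a small illustrative sketch (the particular dates and objects are just examples, not part of the SST workflow):
library(zoo)
library(lubridate)
as.yearmon(2000 + 5/12)  # zoo's year-month class: 'Jun 2000'
as.yearqtr(2000 + 1/4)  # zoo's year-quarter class: '2000 Q2'
d <- ymd("2000-01-01")
d + years(1)  # a lubridate period: calendar arithmetic, giving '2001-01-01'
span <- interval(d, ymd("2000-07-01"))  # an interval is anchored at two specific instants
as.duration(span)  # the same span expressed as a duration (a count of seconds)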
We will now practice some space-time plotting of these data.
library(zoo)
dates <- as.yearmon(1850 + seq(0, 1915)/12)
# melt the sst array into a data.frame
library(reshape)
sst.df <- melt(mean.sst)
sst.df$Date <- as.Date(dates[sst.df$Date])
# Create a North Atlantic subset
na <- subset(sst.df, (East > (360 - 100) | East < 5) & North > 0 & North < 60)
# And wrap the Western Hemisphere coordinates to be negative Eastern
# Hemisphere coordinates
na$East[na$East > 200] <- na$East[na$East > 200] - 360
# Plot one month
library(ggplot2)
ggplot(data = subset(na, Date == "2000-06-01")) +
    geom_raster(aes(x = East, y = North, fill = value)) +
    scale_fill_gradient2("SST", midpoint = 20.5, high = scales::muted("red"),
        low = scales::muted("blue"), mid = "grey95") +
    coord_equal()
# Create a year and month variable, and plot all data for one year
library(lubridate)
na$Year <- year(na$Date) # year() is in the lubridate package
na$Month <- month(na$Date)
ggplot(data = subset(na, Year == 2000)) +
    geom_raster(aes(x = East, y = North, fill = value)) +
    scale_fill_gradient2("SST", midpoint = 20.5, high = scales::muted("red"),
        low = scales::muted("blue"), mid = "grey95") +
    coord_equal() + facet_wrap(~Month)
# Or... plot every August from 2000 forward
ggplot(data = subset(na, Year >= 2000 & Month == 8)) +
    geom_raster(aes(x = East, y = North, fill = value)) +
    scale_fill_gradient2("SST", midpoint = 20.5, high = scales::muted("red"),
        low = scales::muted("blue"), mid = "grey95") +
    coord_equal() + facet_wrap(~Year)
# Not very exciting!
# Calculate monthly anomalies.  Create a unique factor for each
# lat/lon/month pair, using the duplicated() trick.  First, create a
# lon/lat/month grouping factor
na <- na[order(na$East, na$North, na$Month), ]
na$group <- cumsum(!duplicated(na[, c("East", "North", "Month")]))
# The following both take too long:
# na <- ddply(na, .(East, North, Month),
#             function(x) cbind(x, data.frame(SSA = x$value - mean(x$value))))
# na <- ddply(na, .(group),
#             function(x) cbind(x, data.frame(SSA = x$value - mean(x$value))))
group_means <- tapply(na$value, na$group, mean, na.rm = TRUE) # Calculate means for each group
na$group_mean <- group_means[na$group] # Populate the dataframe with monthly mean
na$SSA <- na$value - na$group_mean # subtract monthly mean
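If the grouping-factor trick feels opaque, base R's ave() does the same group-wise centering in one step. This is just an equivalent sketch for comparison (SSA_alt is an illustrative name; nothing below depends on it):
# Group-wise monthly means via ave(); SSA_alt should match na$SSA
na$SSA_alt <- na$value - ave(na$value, na$East, na$North, na$Month,
    FUN = function(x) mean(x, na.rm = TRUE))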
# Now, repeat the plotting of Augusts since 2000
ggplot(data = subset(na, Year >= 2000 & Month == 8)) +
    geom_raster(aes(x = East, y = North, fill = SSA)) +
    scale_fill_gradient2("SSA", midpoint = 0, high = scales::muted("red"),
        low = scales::muted("blue"), mid = "grey95") +
    coord_equal() + facet_wrap(~Year) + ggtitle("August Anomalies") +
    scale_x_continuous("", breaks = NULL, expand = c(0, 0)) +
    scale_y_continuous("", breaks = NULL, expand = c(0, 0))
Data Processing. We will first create time series of the monthly SST and SSA data.
# Create a univariate index of North Atlantic SST
month.sst <- melt(tapply(na$value, list(na$Year, na$Month), mean, na.rm = TRUE))
names(month.sst) <- c("Year", "Month", "SST")
# Reorder
month.sst <- month.sst[order(month.sst$Year, month.sst$Month), ]
month.sst$Date <- as.Date(paste(month.sst$Year, month.sst$Month, "1", sep = "-"),
"%Y-%m-%d")
qplot(x = Date, y = SST, data = month.sst, geom = "path")
# Create a univariate index of North Atlantic SSA
month.ssa <- melt(tapply(na$SSA, list(na$Year, na$Month), mean, na.rm = TRUE))
names(month.ssa) <- c("Year", "Month", "SSA")
# Reorder
month.ssa <- month.ssa[order(month.ssa$Year, month.ssa$Month), ]
month.ssa$Date <- as.Date(paste(month.ssa$Year, month.ssa$Month, "1", sep = "-"),
"%Y-%m-%d")
qplot(x = Date, y = SSA, data = month.ssa, geom = "path")
acf(month.ssa$SSA, lag.max = 100 * 12, na.action = na.pass) # Create an autocorrelation function out to 100 years
That was easy, and it draws a nice little plot. The dashed horizontal lines give you a rough idea of when a deviation from zero is significant, but they are a bit misleading: they are built for testing the null hypothesis that all of the ACF values are zero, which is clearly not the case here.
Also, the labeling of the x-axis is ugly: it is in months, and years would be better. You could plot the data by hand, but then you wouldn't get the confidence intervals.
acf.obj <- acf(month.ssa$SSA, lag.max = 100 * 12, na.action = na.pass, plot = FALSE)
qplot(x = acf.obj$lag/12, y = acf.obj$acf, xlab = "lag (years)", ylab = "ACF")
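And if you want approximate confidence bounds back on the hand-drawn version, you can add them yourself. A sketch, assuming the usual +/-1.96/sqrt(n) white-noise bounds (essentially what acf() draws); n.obs is just an illustrative name:
n.obs <- sum(!is.na(month.ssa$SSA))
qplot(x = acf.obj$lag/12, y = acf.obj$acf, xlab = "lag (years)", ylab = "ACF") +
    geom_hline(yintercept = c(-1, 1) * 1.96/sqrt(n.obs), linetype = "dashed")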
But really, we should detrend the series to remove any nonstationarity:
month.ssa$SSA2 <- resid(lm(SSA ~ Date, data = month.ssa, na.action = na.exclude))
acf.obj <- acf(month.ssa$SSA2, lag.max = 100 * 12, na.action = na.pass, plot = FALSE)
qplot(x = acf.obj$lag/12, y = acf.obj$acf, xlab = "lag (years)", ylab = "ACF")
The first-order autoregressive model, AR(1), assumes a data model that looks like this: \( Y_t = \rho Y_{t-1}+e_t \)
Importantly, this means that each data value depends directly, in a causal manner, on its previous time value. But it does not depend directly on the value two time periods ago.
More generally, an AR(p) model looks like: \( Y_t = \sum_{\ell=1}^p \rho_\ell Y_{t-\ell} + e_t \)
Autoregressive models are pretty nice when it comes to forecasting, but we won't do that today. It can be difficult to tell from the ACF plot whether an autoregressive model is appropriate. A much better tool for this task is the “partial autocorrelation function” (PACF). In essence, the PACF does a sequential comparison of an AR(2) vs an AR(1), an AR(3) vs an AR(2), an AR(4) vs an AR(3), and so on. It lets us see whether there is a simple AR(p) structure hidden in the data.
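To see how a PACF behaves on a known process, here is a quick simulation sketch, separate from the SST data (the AR coefficients are arbitrary illustrative values):
set.seed(1)
sim <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 1000)  # a stationary AR(2)
pacf(sim, lag.max = 20)  # spikes at lags 1 and 2, then roughly nothing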
pacf.obj <- pacf(month.ssa$SSA2, lag.max = 100 * 12, na.action = na.pass, plot = TRUE)
pacf.obj <- pacf(month.ssa$SSA2, lag.max = 2 * 12, na.action = na.pass, plot = TRUE)
Note two things. First, the PACF doesn't pick up the long cycle at all; it completely misses it. The PACF is much better at describing high-frequency variation (within a few time lags), and usually misses the low-frequency structure entirely.
Second, it appears that the series is really close to an AR(1) process, but you might also justify an AR(4). Tough call.
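One way to adjudicate is to fit both candidates and compare an information criterion. A sketch using arima() (the orders are the two suggested by the PACF; fit.ar1 and fit.ar4 are illustrative names):
fit.ar1 <- arima(month.ssa$SSA2, order = c(1, 0, 0))
fit.ar4 <- arima(month.ssa$SSA2, order = c(4, 0, 0))
AIC(fit.ar1, fit.ar4)  # lower AIC suggests the better fit/complexity trade-off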
Next, we'll use some data from Jim Elsner at Florida State on hurricane frequencies.
load("../Elsner/annual.RData")
# There's no good reason for doing the following, really, there isn't
dat <- annual
dat$Yr <- dat$Year
###################
plot(dat$Yr, dat$B.1, xlab = "Year", ylab = "Hurricane Count", lab = c(10, 7,
20), type = "l", lwd = 2)
par(new = TRUE)
plot(dat$Yr, dat$sst, type = "l", col = "red", xaxt = "n", yaxt = "n", xlab = "",
ylab = "", lwd = 2)
axis(4)
mtext(expression(paste("SST [", degree, "C]")), side = 4, line = 2.5, las = 0)
legend("topleft", col = c("black", "red"), lty = 1, legend = c("Hurricanes",
"SST"))
It appears that there may be a correlation between hurricanes and SST.
We will look at the ACFs of the count and sea surface temperature first, and then fit a regression.
acf(subset(dat$sst, dat$Yr > 1932))
acf(log(subset(dat$B.1, dat$Yr > 1932)))
model1 <- lm(log(B.1) ~ sst + soi, data = dat, subset = Yr >= 1932)
summary(model1)
##
## Call:
## lm(formula = log(B.1) ~ sst + soi, data = dat, subset = Yr >=
## 1932)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9347 -0.1973 0.0411 0.2182 1.0234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6726 0.0466 35.91 < 2e-16 ***
## sst 0.5857 0.1826 3.21 0.00197 **
## soi 0.0560 0.0146 3.82 0.00027 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.369 on 76 degrees of freedom
## Multiple R-squared: 0.251, Adjusted R-squared: 0.231
## F-statistic: 12.7 on 2 and 76 DF, p-value: 1.7e-05
acf(residuals(model1), lag.max = 10)
There doesn't appear to be strong evidence pointing to temporal autocorrelation. Some might say that this is a type II error, because there is ALWAYS temporal autocorrelation. So fit a model with an AR(1) error.
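(If you want a formal test to go along with the eyeball check, the Durbin-Watson test is one option; this is an optional aside that assumes the lmtest package is installed.)
library(lmtest)
dwtest(model1)  # tests for first-order autocorrelation in the residuals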
A regression with AR(1) errors is called a “Generalized Least Squares” (GLS) model, which is not at all to be confused with a “Generalized Linear Model” (GLM). The GLM generalizes the distribution and the link function; the GLS generalizes the error structure so that the errors need not be independent or homoskedastic. GLS is a lot like random effects; in fact, they are so alike that they are essentially the same thing, except that random effects usually have discrete groupings and time series do not.
The GLS model with AR1 errors looks like this: \( Y_t = X_t\beta + e_t \) \( e_t = \rho e_{t-1} + u_t \)
library(nlme)
model2 <- gls(log(B.1) ~ sst + soi, data = dat, subset = Yr > 1932, correlation = corAR1(form = ~Yr))
summary(model2)
## Generalized least squares fit by REML
## Model: log(B.1) ~ sst + soi
## Data: dat
## Subset: Yr > 1932
## AIC BIC logLik
## 84.39 95.98 -37.2
##
## Correlation Structure: AR(1)
## Formula: ~Yr
## Parameter estimate(s):
## Phi
## 0.1647
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 1.6736 0.05474 30.573 0.0000
## sst 0.5842 0.20221 2.889 0.0051
## soi 0.0581 0.01425 4.078 0.0001
##
## Correlation:
## (Intr) sst
## sst -0.411
## soi 0.103 -0.040
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.5289 -0.5532 0.1077 0.6021 2.7617
##
## Residual standard error: 0.3725
## Degrees of freedom: 78 total; 75 residual
The other common time-series error model is a general ARMA(p, q); you would use corARMA() for this.
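For example, a sketch of a fit with ARMA(1,1) errors (the particular p and q, and the model2b name, are arbitrary illustrative choices):
model2b <- gls(log(B.1) ~ sst + soi, data = dat, subset = Yr > 1932,
    correlation = corARMA(form = ~Yr, p = 1, q = 1))
summary(model2b)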
Comparing the two models, we see that the coefficients are hardly changed; the standard error for sst goes up a bit, so its t-statistic comes down slightly.
What would I report?
Since the difference between the models is so small, if I were publishing in a climate field journal, I would report the model without autocorrelation. But I would also say that we checked the model with AR(1) errors, and found no significant difference.
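If you want something more formal than comparing the printed tables, you can refit the independent-errors model with gls() and compare the two error structures directly. A sketch (model2.indep is an illustrative name; both fits use REML with the same fixed effects, so the comparison is legitimate):
model2.indep <- gls(log(B.1) ~ sst + soi, data = dat, subset = Yr > 1932)
anova(model2.indep, model2)  # AIC, BIC, and a likelihood ratio test for the AR(1) term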
You can add the correlation to a GLM:
library(MASS)
# glmmPQL needs a random effects structure, so create a dummy with one
# group
dat$ID <- 1
model.3 <- glmmPQL(B.1 ~ sst + soi, data = subset(dat, Yr > 1932), family = quasipoisson,
correlation = corAR1(form = ~Yr), random = ~1 | ID)
## iteration 1
## iteration 2
Note that, since we added a random intercept for the single group, its variance has been estimated to be essentially zero.
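You can check this directly: glmmPQL() returns an lme-style object, so the usual nlme extractors apply.
summary(model.3)  # coefficient table, plus the estimated AR(1) parameter Phi
VarCorr(model.3)  # the random-intercept variance should be essentially zero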
I won't do it here, since it would really stretch these data thin, but you can also add correlation structures to gamm(), so that you have a GLM, additive terms, random effects, and time-series correlation all in one model. I wouldn't suggest this unless you have a heck of a lot of data.
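For the record, such a model might be sketched roughly like this (illustrative only; model.4 is a hypothetical name and I'm not claiming these data support it):
library(mgcv)
model.4 <- gamm(B.1 ~ s(sst) + s(soi), family = quasipoisson,
    correlation = corAR1(form = ~Yr), data = subset(dat, Yr > 1932))
summary(model.4$gam)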