In this lab, we will conduct a principal components analysis of surface temperatures over the equatorial Pacific.

The data are here: https://www.dropbox.com/sh/za8bxeor5fdruye/wzA88i84ey

The file is called Pac_Temp.Rdata.

These data represent a 40 x 10 spatial grid over 50 years. The data are saved as a 400x50 matrix of temperature values, and a vector for latitude and longitude

load("../Data/Pac_Temp.Rdata")
library(ggplot2)

# Note: each row of Temp is a pixel, and each column is a time period Create
# a grid of lat/lon values
grid <- expand.grid(x = lon, y = lat)

Plotting

In order to plot, we need to do a little planning. ggplot will needs a dataframe, with one column as the x coordinate, one column the y coordinate, one for the year, and one for the actual temperature value. R stores data by column. The grid object contains the coordinates for one year, we need to repeat that 50 times (once for each year).

ggplt.df <- data.frame(x = rep(grid$x, 50), y = rep(grid$y, 50), year = rep(c(1950:1999), 
    each = 400), temp = as.vector(Temp))
head(ggplt.df)

##        x     y year  temp
## 1 -200.0 -12.5 1950 29.23
## 2 -197.2 -12.5 1950 29.39
## 3 -194.4 -12.5 1950 29.39
## 4 -191.5 -12.5 1950 29.46
## 5 -188.7 -12.5 1950 29.58
## 6 -185.9 -12.5 1950 29.58

ggplot(data = subset(ggplt.df, year == 1990)) + geom_raster(aes(x = x, y = y, 
    fill = temp)) + coord_equal()

plot of chunk unnamed-chunk-2



ggplot(data = subset(ggplt.df, year > 1990)) + geom_raster(aes(x = x, y = y, 
    fill = temp)) + coord_equal() + facet_wrap(~year) + scale_fill_gradient2(midpoint = 26.5) + 
    scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 
    0))

plot of chunk unnamed-chunk-2

Remove the long term average from each pixel. i.e., calculate an SSA

SSA <- sweep(Temp, 1, rowMeans(Temp), "-")
ggplt.df$SSA <- as.vector(SSA)
ggplot(data = ggplt.df) + geom_raster(aes(x = x, y = y, fill = SSA)) + coord_equal() + 
    facet_wrap(~year) + scale_fill_gradient2(midpoint = 0) + scale_x_continuous(expand = c(0, 
    0)) + scale_y_continuous(expand = c(0, 0))

plot of chunk unnamed-chunk-3

Note, the major El Nino events of 1982 and 1997, and the unusual chain from 1990-1994.

Empirical Orthogonal Functions

In the climate field, they do Principal Components Analysis, but they want to sound cooler, so they call it Empirical Orthogonal Functions… or EOF.

But the jist is the same. In this case we have 50 maps. Each map (year) is a separate variable. Is there an efficient and simple way to characterize these 50 maps? Is is possible to create just 1, or 2 maps, and to say that these few maps characterize the entire series of 50 maps? If so, it sure would be a lot easier to understand the dominant spatial and temporal dynamics.

This is what we will do with Principal Components.


EOF <- prcomp(SSA, scale. = FALSE, center = FALSE)
# EOF <- prcomp(SSA)
names(EOF)

## [1] "sdev"     "rotation" "center"   "scale"    "x"

plot(EOF$sdev^2, ylab = "Variance", xlab = "Component")

plot of chunk unnamed-chunk-4

screeplot(EOF)

plot of chunk unnamed-chunk-4


# Create a matrix of standardized components for easier plotting
EOF$std_x <- sweep(EOF$x, 2, EOF$sdev, "/")

# And map....
ggplt.df$PC <- as.vector(EOF$std_x)
ggplt.df$PCname <- paste("Component", rep(1:50, each = 400), sep = "")
ggplot(data = ggplt.df) + geom_raster(aes(x = x, y = y, fill = PC)) + coord_equal() + 
    facet_wrap(~PCname) + scale_fill_gradient2(midpoint = 0) + scale_x_continuous(expand = c(0, 
    0)) + scale_y_continuous(expand = c(0, 0))

plot of chunk unnamed-chunk-4

# Whoa.  The Components aren't in order.

ggplt.df$PCname <- factor(paste("Component", rep(1:50, each = 400), sep = ""), 
    levels = paste("Component", 1:50, sep = ""))
ggplot(data = ggplt.df) + geom_raster(aes(x = x, y = y, fill = PC)) + coord_equal() + 
    facet_wrap(~PCname) + scale_fill_gradient2(midpoint = 0) + scale_x_continuous(expand = c(0, 
    0)) + scale_y_continuous(expand = c(0, 0))

plot of chunk unnamed-chunk-4

Note how the first component is the smoothest, and each is more and more complex. The fist component detects a dominant ridge running East-West. This adquately characterizes the bulge of warm water flowing across the Pacific Ocean each year.

Data Reconstruction

Let's reconstruct the data using the first component and it's time series.

In the rotation variable, of the PCA, you can see how important the principal component is each year. Essentially, this is like a time series of the principal component.


time.df <- data.frame(value = EOF$rotation[, 1], year = seq(1950, 1999))
ggplot(data = time.df) + geom_path(aes(x = year, y = value)) + ggtitle("Time Series of Principal Component 1")

plot of chunk unnamed-chunk-5


recon <- EOF$x[, 1, drop = FALSE] %*% t(EOF$rotation[, 1, drop = FALSE])
ggplt.df$recon1 <- as.vector(recon)

ggplot(data = ggplt.df) + geom_raster(aes(x = x, y = y, fill = recon1)) + coord_equal() + 
    facet_wrap(~year) + scale_fill_gradient2(midpoint = 0) + scale_x_continuous(expand = c(0, 
    0)) + scale_y_continuous(expand = c(0, 0))

plot of chunk unnamed-chunk-5

Compare this with the original data. You see that the first principal component does a decent job of describing the overall pattern from year to year.

Evaluating the model

EOF is a model. In particular, it assumes that each component is independent. What happens over time with one component is independent of what happens over time in the other components. Remember, these are axes. And knowing that we are above average on one axis shouldn't tell us anything about the other axes.

Let's look at a scatterplot of the first two components

# But let's plot the time series of the first two components.
plot(EOF$rotation[, 1], EOF$rotation[, 2])

plot of chunk unnamed-chunk-6

# Sure, the slope is zero, but there is clearly a relation, that is thrown
# off by two observations.  The two observations happen to be 1982 and 1997.

which(dimnames(EOF$rotation)[[1]] %in% c("1982", "1997"))

## integer(0)

crossprod(EOF$rotation[-c(33, 48), 1:2])  # Recalculate the correlation w/out those two years

##        PC1    PC2
## PC1 0.7139 0.3147
## PC2 0.3147 0.6538

cov2cor(crossprod(EOF$rotation[-c(33, 48), 1:2]))  # Standardize for interpretation

##        PC1    PC2
## PC1 1.0000 0.4606
## PC2 0.4606 1.0000

So, there is a pretty strong correlation between the two components apart from 1982,1997. This is bad. Very bad. But, it shows the influence of outliers on PCA.