Geography 415 Lab 4

Objectives:

Learn how to perform simple linear regression in R
Perform basic validity checks for linear regression

Part 1: Tutorial - Tree rings and paleoclimate

In many places with clear growing seasons, it is possible to clear determine the age of a tree by counting its annual growth rings. The relative of each ring is a rough proxy for the growing conditions experience by the tree that year. Furthermore, by grouping together neighboring trees, it is possible to constrict a proxy for the average growth conditions experienced by all the trees, and in some instances, by using the ring from well-preserved downed trees, it is possible to ring width series that date back hundreds, or even thousands of years.

The data used in this lab were provided by Prof. Henri Grissino-Mayer in the Geography Department. The file MLC.COL has yearly tree ring widths from a study site in El Malpais National Monument in New Mexico. Open the file with a text editor and look at it quickly. The data have three columns: year, standardized chronology, residual chronology. In general, if a tree had a particularly good or bad year, then this effect is carried over to the next year as well. This can somoetimes be a problem for analysis. The residual chronology addresses this - it is similar to the standardized chronology, except that it should have some of the year-to-year correlation filtered out.

These tree ring series are much longer than human records of temperature and precipitation. The hope is that if we can find a relation between ring width and climatic conditions, then we can use the ring width series to reocnstruct, or 'hindcast' the climatic conditions.

setwd("~/Dropbox/classes/Geog415_s13/geog415_s13_lab/lab4/")
# trw: tree ring width
trw <- read.table("MLC.COL", na.strings = ".")
# Check the data
head(trw)

##     V1    V2    V3
## 1 -136 0.441    NA
## 2 -135 0.814    NA
## 3 -134 0.677    NA
## 4 -133 0.197    NA
## 5 -132 0.347 0.517
## 6 -131 0.651 0.824

names(trw) <- c("year", "stand_chron", "resid_chron")
summary(trw)

##       year       stand_chron     resid_chron    
##  Min.   :-136   Min.   :0.006   Min.   :-0.271  
##  1st Qu.: 396   1st Qu.:0.674   1st Qu.: 0.736  
##  Median : 928   Median :0.972   Median : 0.992  
##  Mean   : 928   Mean   :1.000   Mean   : 1.000  
##  3rd Qu.:1460   3rd Qu.:1.278   3rd Qu.: 1.260  
##  Max.   :1992   Max.   :3.357   Max.   : 3.035  
##                                 NA's   :4

plot(trw$year, trw$resid_chron, type = "l", xlab = "Year", ylab = "residual chronology")

plot of chunk load_data

Tree growth should be a function of regional climate patterns + site-specific conditions. This function is called the “response function.'' If you create a proxy from many trees, hopefully the site-specific conditions will cancel each other out, leaving a response function of regional-proxy = f(climate patterns). It is standard dendrochronological practice to first explore this response function: which climate variables are trees responding to? Precipitation? Temperature? Annual average temperature, or are some months more important than others? We will skip this step here.

Once it is established that there is a relationship between tree growth and particulary climatic variables, you might imagine reversing this relation; i.e. thinking about climate as a function of ring width. Scientifically, this is silly (it's like putting the cart before the horse), but statisically, there is nothing wrong with it. We call this function the 'transfer function,' since it transfers ring width measurements into climate measurements.

Dr. Grissino-Mayer has also provided precipitation data from the study area in the file '2904.pcp.' This is a file of monthly precipitation measurements back to 1895. I'll load in the file, see what it looks like, fix some column names, and then turn it from 'wide' format to 'long format'. Check the data after each step… you'll seen what I mean by 'wide' or 'long'.

raw_precip <- read.table("2904.pcp")
head(raw_precip)

##     V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11  V12  V13
## 1 1895 1.48 0.61 0.24 0.02 0.73 0.12 3.94 1.96 0.94 1.07 0.94 0.37
## 2 1896 0.21 0.27 0.17 0.26 0.00 0.56 3.31 1.95 1.43 4.01 0.19 0.55
## 3 1897 1.47 0.59 0.63 0.10 0.95 0.56 2.47 1.81 2.35 0.87 0.02 0.21
## 4 1898 1.17 0.04 0.50 0.61 0.24 1.27 3.68 2.29 0.73 0.00 0.34 1.08
## 5 1899 0.36 0.32 0.52 0.13 0.00 0.72 3.98 1.38 1.55 0.27 0.57 0.24
## 6 1900 0.35 0.27 0.46 0.54 0.40 0.44 1.51 1.56 2.22 0.92 0.44 0.22

names(raw_precip) <- c("year", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", 
    "Aug", "Sep", "Oct", "Nov", "Dec")
# Coerce that from a 'wide' format to a 'long' format
library(reshape)

## Loading required package: plyr

## Attaching package: 'reshape'

## The following object(s) are masked from 'package:plyr':
## 
## rename, round_any

precip <- melt(raw_precip, id.vars = "year", variable_name = "month")
head(precip)

##   year month value
## 1 1895   Jan  1.48
## 2 1896   Jan  0.21
## 3 1897   Jan  1.47
## 4 1898   Jan  1.17
## 5 1899   Jan  0.36
## 6 1900   Jan  0.35

tail(precip)

##      year month value
## 1291 1997   Dec  1.62
## 1292 1998   Dec  0.72
## 1293 1999   Dec  0.27
## 1294 2000   Dec  0.40
## 1295 2001   Dec  0.47
## 1296 2002   Dec  1.25

# The order is screwy.  We should sort by year, then month
precip <- precip[order(precip$year, precip$month), ]

Homework question: describe in words all of the R commands in that last line.

We now have precipitation for each month, i.e. 12 observations per year. We only have one tree ring record per year. We will have to create climate variables per year, not per month. We might calculate rain during the growing season, but here, we'll calculate annual rain (measured from July to June). We'll use the rollmean function in the zoo package to calculate 12 month rolling means (moving average), and then extract just the June observations (i.e. July-June averages).

head(precip, 13)

##      year month value
## 1    1895   Jan  1.48
## 109  1895   Feb  0.61
## 217  1895   Mar  0.24
## 325  1895   Apr  0.02
## 433  1895   May  0.73
## 541  1895   Jun  0.12
## 649  1895   Jul  3.94
## 757  1895   Aug  1.96
## 865  1895   Sep  0.94
## 973  1895   Oct  1.07
## 1081 1895   Nov  0.94
## 1189 1895   Dec  0.37
## 2    1896   Jan  0.21

library(zoo)

## Attaching package: 'zoo'

## The following object(s) are masked from 'package:base':
## 
## as.Date, as.Date.numeric

precip$ma <- rollmean(precip$value, 12, fill = NA, align = "right") * 12
# I multiplied by 12 to turn monthly average into annual total.
precip_yr <- subset(precip, month == "Jun" & !is.na(ma))
names(precip_yr)

## [1] "year"  "month" "value" "ma"

names(trw)

## [1] "year"        "stand_chron" "resid_chron"

# And finally, create the regression dataset.
reg_data <- merge(x = trw, y = precip_yr, by.x = "year", by.y = "year", all.x = TRUE)

Simple Linear Regression

As you might expect in a statistics software language, performing the actual regression is really simple: one line of code.

(Warning about the science and statistics here: In practice, one would evaluate the regression for different subsets, and verify that the results are similar no matter which subset we use to fit. In this lab, we skip that and just fit the regression to the entire dataset.)

mod <- lm(ma ~ resid_chron, data = reg_data)  # mod is short for model in my head
summary(mod)  # print the summary

## 
## Call:
## lm(formula = ma ~ resid_chron, data = reg_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.818 -1.515 -0.364  1.266  6.176 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.675      0.530    16.4  < 2e-16 ***
## resid_chron    4.054      0.455     8.9  3.7e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 2.14 on 95 degrees of freedom
##   (2032 observations deleted due to missingness)
## Multiple R-squared: 0.455,   Adjusted R-squared: 0.449 
## F-statistic: 79.2 on 1 and 95 DF,  p-value: 3.68e-14

We see that each increase in the ring width corresponds to a 4.0537 in increase in annual precipitation. If we wanted to create 95% confidence intervals, we would have 4.0537 plus/minus 2 times 0.4554. The high t value (indicated by a small p value) indicates that the data are not consistent with a hypothesis in which the coefficient is zero, or that the data provide significant evidence against this hypothesis. To be honest, this hypothesis isn't quite interesting… presumably we would only be doing this if we already knew that there was a significant relation between the two.

We also see the $ R^2 $ of 0.4548, and the adjusted R-squared of 0.449. But is this R² significantly different from zero? Is the data consistent with a model in which all of the coeffecients (apart from the intercept) are zero? The F-statistic reports this. F is large (as measured by the small p), so we can reject the hypothesis of an intercept-only model.

Aside: In the simple linear regression, with only one variable, the t-statistic and the F-statistic are testing the exact same hypothesis: that the one coefficient is zero. You should see that t squared is equal to F.

Regression Diagnostics

Regression is easy enough. We can always run a regression and get results. But should we? Should we run the regression? Should we trust the results? For this, we need to qualitatively evaluate the assumptions. This is mixture of knowing your data, knowing your science, looking at a few different types of plots, and a little creativity. What's worse, is that it's a lot of work, and it's never going to get published. It is just assumed that the regression is OK, and if a reviewer questions you on the regression, you should have the supporting evidence that you will need already.

Here is a quick way to do the regression diagnostic plots.

plot(mod)  # You can then step through each plot, Hit <Return> to see the next plot

plot of chunk diagnostics

# Or you can see one plot at a time, e.g.
plot(mod, which = 2)

plot of chunk diagnostics

First, a little about residuals. First of all, regression residuals are not population model errors. The residuals are estimates of the population errors. Not all regression residuals are created equal. Some data rows have residuals that are noisier than other rows. Even if the data have constant variance, the regression residuals do not. The 'standardized residuals' correct for this. There is also something called 'studentized residuals'. There is usually very little difference in practice, but the studentized residuals may be more robust to outliers.

Homework: For the first three plots, describe wich assumptions they help to visually evaluate.

There appears to be a slight nonlinearity in the dataset. As you move from low to high fitted values, the residuals tend to be low, then high, then low again. It's hard to tell how much of a problem this is, and how to fix it. Possible techincal solutions might be:

log or square root transforming the predictand and/or predictor variables. Transforming the dependent variable is not a problem for the regression, but it does complicate how we interpret the prediction..
Adding ring width squared (this is a multivariate regression, will cover that in class soon)
Nonlinear regression (I won't cover this in lab)

If you want to create your own plot of residuals, you can calculate the residuals, rstandard, and rstudent functions to add the residuals to your data.frame.

residuals(mod)

##     2033     2034     2035     2036     2037     2038     2039     2040 
## -1.14282  1.03310 -2.85503  0.01947  0.63187 -3.38809 -1.68765 -1.51740 
##     2041     2042     2043     2044     2045     2046     2047     2048 
## -3.81808  3.21420 -0.35547 -1.68870 -0.79878 -0.80800 -1.80648 -1.74520 
##     2049     2050     2051     2052     2053     2054     2055     2056 
##  0.27579  0.81395 -3.67145  1.26623 -0.89324  0.10168  2.13146 -0.60898 
##     2057     2058     2059     2060     2061     2062     2063     2064 
##  0.67944 -0.15045  2.32008 -0.36377  2.62944 -0.83782  0.06182 -0.08135 
##     2065     2066     2067     2068     2069     2070     2071     2072 
## -1.24325 -0.11960 -1.57274  3.70813  2.90595 -0.39879  3.29084 -0.96391 
##     2073     2074     2075     2076     2077     2078     2079     2080 
##  1.57400 -0.35504  2.19648  0.40393  1.06570  6.17580  3.09232  1.61958 
##     2081     2082     2083     2084     2085     2086     2087     2088 
## -3.16050  3.29843 -1.99073 -0.85946 -1.98715 -0.58169 -1.27695 -1.11978 
##     2089     2090     2091     2092     2093     2094     2095     2096 
## -2.04420 -2.00365 -3.05961 -1.56770 -3.05597 -3.41902  3.27301  1.60134 
##     2097     2098     2099     2100     2101     2102     2103     2104 
## -0.96559 -1.64066  0.01967 -0.33111 -1.18893 -1.69847  1.68814  1.29620 
##     2105     2106     2107     2108     2109     2110     2111     2112 
## -1.14008 -2.41594 -1.12320 -1.47731  2.42311  5.64392 -3.52215 -0.58658 
##     2113     2114     2115     2116     2117     2118     2119     2120 
##  2.35749 -1.02228  1.05599  0.82585 -1.51516 -1.50415 -0.96283  1.10764 
##     2121     2122     2123     2124     2125     2126     2127     2128 
##  0.54419  3.53444 -1.09698  3.03915  0.11386  4.94774 -1.58460  1.19471 
##     2129 
##  3.59836

rstandard(mod)

##      2033      2034      2035      2036      2037      2038      2039 
## -0.537314  0.486891 -1.343804  0.009240  0.300736 -1.590563 -0.792704 
##      2040      2041      2042      2043      2044      2045      2046 
## -0.713024 -1.808357  1.516540 -0.166881 -0.801164 -0.376727 -0.380657 
##      2047      2048      2049      2050      2051      2052      2053 
## -0.847948 -0.819342  0.129855  0.382580 -1.735095  0.605338 -0.419249 
##      2054      2055      2056      2057      2058      2059      2060 
##  0.047779  1.033752 -0.288629  0.323566 -0.070976  1.094727 -0.172516 
##      2061      2062      2063      2064      2065      2066      2067 
##  1.234043 -0.397323  0.029288 -0.038182 -0.583487 -0.056278 -0.747084 
##      2068      2069      2070      2071      2072      2073      2074 
##  1.740771  1.363994 -0.187939  1.559429 -0.454815  0.738906 -0.167016 
##      2075      2076      2077      2078      2079      2080      2081 
##  1.044424  0.189950  0.500310  2.909863  1.453610  0.762396 -1.484141 
##      2082      2083      2084      2085      2086      2087      2088 
##  1.559423 -0.939798 -0.403396 -0.937342 -0.273144 -0.602774 -0.534618 
##      2089      2090      2091      2092      2093      2094      2095 
## -0.963170 -0.942839 -1.440635 -0.737688 -1.434213 -1.606718  1.537078 
##      2096      2097      2098      2099      2100      2101      2102 
##  0.761869 -0.454053 -0.771410  0.009236 -0.155394 -0.557989 -0.803078 
##      2103      2104      2105      2106      2107      2108      2109 
##  0.792667  0.617052 -0.539497 -1.133984 -0.527497 -0.702558  1.139827 
##      2110      2111      2112      2113      2114      2115      2116 
##  2.661111 -1.671989 -0.277363  1.107182 -0.480705  0.495993  0.390420 
##      2117      2118      2119      2120      2121      2122      2123 
## -0.711091 -0.706696 -0.454110  0.520694  0.256559  1.664954 -0.526702 
##      2124      2125      2126      2127      2128      2129 
##  1.443753  0.053520  2.376878 -0.743719  0.562736  1.712538

Assuming that everything is OK, we can then go on to predict the pre-historic climate:

# The prediction is easy, just one line of code)
pred <- predict(mod, newdata = reg_data, interval = "prediction", se.fit = TRUE)
# head(pred, 12)
pred <- cbind(reg_data, pred)  # Bind the original data and prediction data together
require(ggplot2)

## Loading required package: ggplot2

ggplot(aes(x = year), data = pred) + geom_line(aes(y = fit.fit)) + geom_smooth(aes(y = fit.fit, 
    ymin = fit.lwr, ymax = fit.upr), stat = "identity")

## Warning: Removed 4 rows containing missing values (geom_path).

## Warning: Removed 4 rows containing missing values (geom_path).

plot of chunk predict

Easy enough, right?

But that's a bit messy. Maybe we want to plot 10 year moving averages in order to more clearly see the long term trends:

pred$fit_s <- rollmean(pred$fit.fit, 10, align = "right", fill = NA)
head(pred, 20)

##    year stand_chron resid_chron month value ma fit.fit fit.lwr fit.upr
## 1  -136       0.441          NA  <NA>    NA NA      NA      NA      NA
## 2  -135       0.814          NA  <NA>    NA NA      NA      NA      NA
## 3  -134       0.677          NA  <NA>    NA NA      NA      NA      NA
## 4  -133       0.197          NA  <NA>    NA NA      NA      NA      NA
## 5  -132       0.347       0.517  <NA>    NA NA   10.77   6.469   15.07
## 6  -131       0.651       0.824  <NA>    NA NA   12.02   7.736   16.29
## 7  -130       0.512       0.754  <NA>    NA NA   11.73   7.449   16.01
## 8  -129       0.830       1.117  <NA>    NA NA   13.20   8.929   17.48
## 9  -128       0.713       0.910  <NA>    NA NA   12.36   8.088   16.64
## 10 -127       0.492       0.648  <NA>    NA NA   11.30   7.012   15.59
## 11 -126       1.620       1.739  <NA>    NA NA   15.72  11.407   20.04
## 12 -125       1.384       1.482  <NA>    NA NA   14.68  10.392   18.97
## 13 -124       1.616       1.643  <NA>    NA NA   15.34  11.029   19.64
## 14 -123       2.938       2.777  <NA>    NA NA   19.93  15.385   24.48
## 15 -122       2.886       2.487  <NA>    NA NA   18.76  14.292   23.22
## 16 -121       2.003       1.539  <NA>    NA NA   14.91  10.618   19.21
## 17 -120       1.476       1.008  <NA>    NA NA   12.76   8.487   17.04
## 18 -119       1.242       0.593  <NA>    NA NA   11.08   6.784   15.37
## 19 -118       2.154       1.716  <NA>    NA NA   15.63  11.316   19.95
## 20 -117       1.290       0.994  <NA>    NA NA   12.70   8.430   16.98
##    se.fit df residual.scale fit_s
## 1      NA 95          2.142    NA
## 2      NA 95          2.142    NA
## 3      NA 95          2.142    NA
## 4      NA 95          2.142    NA
## 5  0.3293 95          2.142    NA
## 6  0.2426 95          2.142    NA
## 7  0.2583 95          2.142    NA
## 8  0.2190 95          2.142    NA
## 9  0.2280 95          2.142    NA
## 10 0.2873 95          2.142    NA
## 11 0.3780 95          2.142    NA
## 12 0.2902 95          2.142    NA
## 13 0.3432 95          2.142    NA
## 14 0.8116 95          2.142    NA
## 15 0.6852 95          2.142    NA
## 16 0.3080 95          2.142    NA
## 17 0.2188 95          2.142    NA
## 18 0.3042 95          2.142    NA
## 19 0.3695 95          2.142    NA
## 20 0.2195 95          2.142    NA

# Hmmmm.... rollmean doesn't like something... maybe the first four NAs?
pred2 <- subset(pred, !is.na(fit.fit))
pred2$fit_s <- rollmean(pred2$fit.fit, 10, fill = NA, align = "right")
head(pred2, 20)  # Yay

##    year stand_chron resid_chron month value ma fit.fit fit.lwr fit.upr
## 5  -132       0.347       0.517  <NA>    NA NA  10.771   6.469   15.07
## 6  -131       0.651       0.824  <NA>    NA NA  12.015   7.736   16.29
## 7  -130       0.512       0.754  <NA>    NA NA  11.731   7.449   16.01
## 8  -129       0.830       1.117  <NA>    NA NA  13.203   8.929   17.48
## 9  -128       0.713       0.910  <NA>    NA NA  12.364   8.088   16.64
## 10 -127       0.492       0.648  <NA>    NA NA  11.302   7.012   15.59
## 11 -126       1.620       1.739  <NA>    NA NA  15.724  11.407   20.04
## 12 -125       1.384       1.482  <NA>    NA NA  14.683  10.392   18.97
## 13 -124       1.616       1.643  <NA>    NA NA  15.335  11.029   19.64
## 14 -123       2.938       2.777  <NA>    NA NA  19.932  15.385   24.48
## 15 -122       2.886       2.487  <NA>    NA NA  18.757  14.292   23.22
## 16 -121       2.003       1.539  <NA>    NA NA  14.914  10.618   19.21
## 17 -120       1.476       1.008  <NA>    NA NA  12.761   8.487   17.04
## 18 -119       1.242       0.593  <NA>    NA NA  11.079   6.784   15.37
## 19 -118       2.154       1.716  <NA>    NA NA  15.631  11.316   19.95
## 20 -117       1.290       0.994  <NA>    NA NA  12.704   8.430   16.98
## 21 -116       3.357       3.035  <NA>    NA NA  20.978  16.346   25.61
## 22 -115       1.912       1.475  <NA>    NA NA  14.654  10.364   18.94
## 23 -114       0.490      -0.077  <NA>    NA NA   8.363   3.967   12.76
## 24 -113       0.273      -0.129  <NA>    NA NA   8.152   3.745   12.56
##    se.fit df residual.scale fit_s
## 5  0.3293 95          2.142    NA
## 6  0.2426 95          2.142    NA
## 7  0.2583 95          2.142    NA
## 8  0.2190 95          2.142    NA
## 9  0.2280 95          2.142    NA
## 10 0.2873 95          2.142    NA
## 11 0.3780 95          2.142    NA
## 12 0.2902 95          2.142    NA
## 13 0.3432 95          2.142    NA
## 14 0.8116 95          2.142 13.71
## 15 0.6852 95          2.142 14.50
## 16 0.3080 95          2.142 14.79
## 17 0.2188 95          2.142 14.90
## 18 0.3042 95          2.142 14.69
## 19 0.3695 95          2.142 15.01
## 20 0.2195 95          2.142 15.15
## 21 0.9253 95          2.142 15.68
## 22 0.2881 95          2.142 15.67
## 23 0.5617 95          2.142 14.98
## 24 0.5836 95          2.142 13.80

ggplot(aes(x = year, y = fit_s), data = pred2) + geom_line()

## Warning: Removed 9 rows containing missing values (geom_path).

plot of chunk smooth_predict

And, just for fun, let's show times that are wetter than now or dryer than now:

# subset the data to remove missing values
mean_precip <- with(subset(pred2, !is.na(ma)), mean(ma))
# Create a column that is the lesser of the fit and the mean
pred2$low <- pmin(pred2$fit_s, mean_precip)
# Create a column that is the greater of the fit and the mean
pred2$hi <- pmax(pred2$fit_s, mean_precip)
# Create a column for the mean
pred2$avg <- mean_precip
# Create a plot, with two ribbons: one for data below average and one for
# data above average
ggplot(aes(x = year, y = fit_s), data = pred2) + geom_ribbon(aes(ymin = low, 
    ymax = avg), color = "brown", fill = "brown") + geom_ribbon(aes(ymin = avg, 
    ymax = hi), color = "green", fill = "green") + geom_vline(aes(xintercept = 1895))

plot of chunk unnamed-chunk-2

Homework

Describe in words what this command did: precip <- precip[order(precip$year, precip$month), ]
For the first three diagnostic plots, describe wich assumptions they help to visually evaluate.
These data are time series data. In general, data are correlated in time, and this would be a problem for simple linear regression. Plot the regression residuals vs time, and visually assess whether there is evidence of temporal (serial) correlation.
On the Blackboard site is the dataset GalapagosData.txt. The species data represents the number of species recorded from each of the Galapagos islands. A fundamental 'law' of island biogeography is that species diversity tends to follow a power law relationship with island area, i.e. $ \mbox{species} = \alpha\times{\mbox{area}^\beta} $. This is not linear, but it suggests that the following regression might make sense: $ \log(\mbox{species}) = a + \beta \times log(\mbox{area}) $. $ a $ is not quite $ \alpha $, rather $ a=\log(\alpha) $.
Fit this regression, and present a brief write-up that a) describes the results in words, and b) summarizes your conclusions from diagnostic model checking.