Objectives:
In many places with clear growing seasons, it is possible to clear determine the age of a tree by counting its annual growth rings. The relative of each ring is a rough proxy for the growing conditions experience by the tree that year. Furthermore, by grouping together neighboring trees, it is possible to constrict a proxy for the average growth conditions experienced by all the trees, and in some instances, by using the ring from well-preserved downed trees, it is possible to ring width series that date back hundreds, or even thousands of years.
The data used in this lab were provided by Prof. Henri Grissino-Mayer in the Geography Department. The file MLC.COL has yearly tree ring widths from a study site in El Malpais National Monument in New Mexico. Open the file with a text editor and look at it quickly. The data have three columns: year, standardized chronology, residual chronology. In general, if a tree had a particularly good or bad year, then this effect is carried over to the next year as well. This can somoetimes be a problem for analysis. The residual chronology addresses this - it is similar to the standardized chronology, except that it should have some of the year-to-year correlation filtered out.
These tree ring series are much longer than human records of temperature and precipitation. The hope is that if we can find a relation between ring width and climatic conditions, then we can use the ring width series to reocnstruct, or 'hindcast' the climatic conditions.
setwd("~/Dropbox/classes/Geog415_s13/geog415_s13_lab/lab4/")
# trw: tree ring width
trw <- read.table("MLC.COL", na.strings = ".")
# Check the data
head(trw)
## V1 V2 V3
## 1 -136 0.441 NA
## 2 -135 0.814 NA
## 3 -134 0.677 NA
## 4 -133 0.197 NA
## 5 -132 0.347 0.517
## 6 -131 0.651 0.824
names(trw) <- c("year", "stand_chron", "resid_chron")
summary(trw)
## year stand_chron resid_chron
## Min. :-136 Min. :0.006 Min. :-0.271
## 1st Qu.: 396 1st Qu.:0.674 1st Qu.: 0.736
## Median : 928 Median :0.972 Median : 0.992
## Mean : 928 Mean :1.000 Mean : 1.000
## 3rd Qu.:1460 3rd Qu.:1.278 3rd Qu.: 1.260
## Max. :1992 Max. :3.357 Max. : 3.035
## NA's :4
plot(trw$year, trw$resid_chron, type = "l", xlab = "Year", ylab = "residual chronology")
Tree growth should be a function of regional climate patterns + site-specific conditions. This function is called the “response function.'' If you create a proxy from many trees, hopefully the site-specific conditions will cancel each other out, leaving a response function of regional-proxy = f(climate patterns). It is standard dendrochronological practice to first explore this response function: which climate variables are trees responding to? Precipitation? Temperature? Annual average temperature, or are some months more important than others? We will skip this step here.
Once it is established that there is a relationship between tree growth and particulary climatic variables, you might imagine reversing this relation; i.e. thinking about climate as a function of ring width. Scientifically, this is silly (it's like putting the cart before the horse), but statisically, there is nothing wrong with it. We call this function the 'transfer function,' since it transfers ring width measurements into climate measurements.
Dr. Grissino-Mayer has also provided precipitation data from the study area in the file '2904.pcp.' This is a file of monthly precipitation measurements back to 1895. I'll load in the file, see what it looks like, fix some column names, and then turn it from 'wide' format to 'long format'. Check the data after each step… you'll seen what I mean by 'wide' or 'long'.
raw_precip <- read.table("2904.pcp")
head(raw_precip)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
## 1 1895 1.48 0.61 0.24 0.02 0.73 0.12 3.94 1.96 0.94 1.07 0.94 0.37
## 2 1896 0.21 0.27 0.17 0.26 0.00 0.56 3.31 1.95 1.43 4.01 0.19 0.55
## 3 1897 1.47 0.59 0.63 0.10 0.95 0.56 2.47 1.81 2.35 0.87 0.02 0.21
## 4 1898 1.17 0.04 0.50 0.61 0.24 1.27 3.68 2.29 0.73 0.00 0.34 1.08
## 5 1899 0.36 0.32 0.52 0.13 0.00 0.72 3.98 1.38 1.55 0.27 0.57 0.24
## 6 1900 0.35 0.27 0.46 0.54 0.40 0.44 1.51 1.56 2.22 0.92 0.44 0.22
names(raw_precip) <- c("year", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul",
"Aug", "Sep", "Oct", "Nov", "Dec")
# Coerce that from a 'wide' format to a 'long' format
library(reshape)
## Loading required package: plyr
## Attaching package: 'reshape'
## The following object(s) are masked from 'package:plyr':
##
## rename, round_any
precip <- melt(raw_precip, id.vars = "year", variable_name = "month")
head(precip)
## year month value
## 1 1895 Jan 1.48
## 2 1896 Jan 0.21
## 3 1897 Jan 1.47
## 4 1898 Jan 1.17
## 5 1899 Jan 0.36
## 6 1900 Jan 0.35
tail(precip)
## year month value
## 1291 1997 Dec 1.62
## 1292 1998 Dec 0.72
## 1293 1999 Dec 0.27
## 1294 2000 Dec 0.40
## 1295 2001 Dec 0.47
## 1296 2002 Dec 1.25
# The order is screwy. We should sort by year, then month
precip <- precip[order(precip$year, precip$month), ]
Homework question: describe in words all of the R commands in that last line.
We now have precipitation for each month, i.e. 12 observations per year. We only have one tree ring record per year. We will have to create climate variables per year, not per month. We might calculate rain during the growing season, but here, we'll calculate annual rain (measured from July to June). We'll use the rollmean function in the zoo package to calculate 12 month rolling means (moving average), and then extract just the June observations (i.e. July-June averages).
head(precip, 13)
## year month value
## 1 1895 Jan 1.48
## 109 1895 Feb 0.61
## 217 1895 Mar 0.24
## 325 1895 Apr 0.02
## 433 1895 May 0.73
## 541 1895 Jun 0.12
## 649 1895 Jul 3.94
## 757 1895 Aug 1.96
## 865 1895 Sep 0.94
## 973 1895 Oct 1.07
## 1081 1895 Nov 0.94
## 1189 1895 Dec 0.37
## 2 1896 Jan 0.21
library(zoo)
## Attaching package: 'zoo'
## The following object(s) are masked from 'package:base':
##
## as.Date, as.Date.numeric
precip$ma <- rollmean(precip$value, 12, fill = NA, align = "right") * 12
# I multiplied by 12 to turn monthly average into annual total.
precip_yr <- subset(precip, month == "Jun" & !is.na(ma))
names(precip_yr)
## [1] "year" "month" "value" "ma"
names(trw)
## [1] "year" "stand_chron" "resid_chron"
# And finally, create the regression dataset.
reg_data <- merge(x = trw, y = precip_yr, by.x = "year", by.y = "year", all.x = TRUE)
As you might expect in a statistics software language, performing the actual regression is really simple: one line of code.
(Warning about the science and statistics here: In practice, one would evaluate the regression for different subsets, and verify that the results are similar no matter which subset we use to fit. In this lab, we skip that and just fit the regression to the entire dataset.)
mod <- lm(ma ~ resid_chron, data = reg_data) # mod is short for model in my head
summary(mod) # print the summary
##
## Call:
## lm(formula = ma ~ resid_chron, data = reg_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.818 -1.515 -0.364 1.266 6.176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.675 0.530 16.4 < 2e-16 ***
## resid_chron 4.054 0.455 8.9 3.7e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.14 on 95 degrees of freedom
## (2032 observations deleted due to missingness)
## Multiple R-squared: 0.455, Adjusted R-squared: 0.449
## F-statistic: 79.2 on 1 and 95 DF, p-value: 3.68e-14
We see that each increase in the ring width corresponds to a 4.0537 in increase in annual precipitation. If we wanted to create 95% confidence intervals, we would have 4.0537 plus/minus 2 times 0.4554. The high t value (indicated by a small p value) indicates that the data are not consistent with a hypothesis in which the coefficient is zero, or that the data provide significant evidence against this hypothesis. To be honest, this hypothesis isn't quite interesting… presumably we would only be doing this if we already knew that there was a significant relation between the two.
We also see the \( R^2 \) of 0.4548, and the adjusted R-squared of 0.449. But is this R2 significantly different from zero? Is the data consistent with a model in which all of the coeffecients (apart from the intercept) are zero? The F-statistic reports this. F is large (as measured by the small p), so we can reject the hypothesis of an intercept-only model.
Aside: In the simple linear regression, with only one variable, the t-statistic and the F-statistic are testing the exact same hypothesis: that the one coefficient is zero. You should see that t squared is equal to F.
Regression is easy enough. We can always run a regression and get results. But should we? Should we run the regression? Should we trust the results? For this, we need to qualitatively evaluate the assumptions. This is mixture of knowing your data, knowing your science, looking at a few different types of plots, and a little creativity. What's worse, is that it's a lot of work, and it's never going to get published. It is just assumed that the regression is OK, and if a reviewer questions you on the regression, you should have the supporting evidence that you will need already.
Here is a quick way to do the regression diagnostic plots.
plot(mod) # You can then step through each plot, Hit <Return> to see the next plot
# Or you can see one plot at a time, e.g.
plot(mod, which = 2)
First, a little about residuals. First of all, regression residuals are not population model errors. The residuals are estimates of the population errors. Not all regression residuals are created equal. Some data rows have residuals that are noisier than other rows. Even if the data have constant variance, the regression residuals do not. The 'standardized residuals' correct for this. There is also something called 'studentized residuals'. There is usually very little difference in practice, but the studentized residuals may be more robust to outliers.
Homework: For the first three plots, describe wich assumptions they help to visually evaluate.
There appears to be a slight nonlinearity in the dataset. As you move from low to high fitted values, the residuals tend to be low, then high, then low again. It's hard to tell how much of a problem this is, and how to fix it. Possible techincal solutions might be:
If you want to create your own plot of residuals, you can calculate the residuals, rstandard, and rstudent functions to add the residuals to your data.frame.
residuals(mod)
## 2033 2034 2035 2036 2037 2038 2039 2040
## -1.14282 1.03310 -2.85503 0.01947 0.63187 -3.38809 -1.68765 -1.51740
## 2041 2042 2043 2044 2045 2046 2047 2048
## -3.81808 3.21420 -0.35547 -1.68870 -0.79878 -0.80800 -1.80648 -1.74520
## 2049 2050 2051 2052 2053 2054 2055 2056
## 0.27579 0.81395 -3.67145 1.26623 -0.89324 0.10168 2.13146 -0.60898
## 2057 2058 2059 2060 2061 2062 2063 2064
## 0.67944 -0.15045 2.32008 -0.36377 2.62944 -0.83782 0.06182 -0.08135
## 2065 2066 2067 2068 2069 2070 2071 2072
## -1.24325 -0.11960 -1.57274 3.70813 2.90595 -0.39879 3.29084 -0.96391
## 2073 2074 2075 2076 2077 2078 2079 2080
## 1.57400 -0.35504 2.19648 0.40393 1.06570 6.17580 3.09232 1.61958
## 2081 2082 2083 2084 2085 2086 2087 2088
## -3.16050 3.29843 -1.99073 -0.85946 -1.98715 -0.58169 -1.27695 -1.11978
## 2089 2090 2091 2092 2093 2094 2095 2096
## -2.04420 -2.00365 -3.05961 -1.56770 -3.05597 -3.41902 3.27301 1.60134
## 2097 2098 2099 2100 2101 2102 2103 2104
## -0.96559 -1.64066 0.01967 -0.33111 -1.18893 -1.69847 1.68814 1.29620
## 2105 2106 2107 2108 2109 2110 2111 2112
## -1.14008 -2.41594 -1.12320 -1.47731 2.42311 5.64392 -3.52215 -0.58658
## 2113 2114 2115 2116 2117 2118 2119 2120
## 2.35749 -1.02228 1.05599 0.82585 -1.51516 -1.50415 -0.96283 1.10764
## 2121 2122 2123 2124 2125 2126 2127 2128
## 0.54419 3.53444 -1.09698 3.03915 0.11386 4.94774 -1.58460 1.19471
## 2129
## 3.59836
rstandard(mod)
## 2033 2034 2035 2036 2037 2038 2039
## -0.537314 0.486891 -1.343804 0.009240 0.300736 -1.590563 -0.792704
## 2040 2041 2042 2043 2044 2045 2046
## -0.713024 -1.808357 1.516540 -0.166881 -0.801164 -0.376727 -0.380657
## 2047 2048 2049 2050 2051 2052 2053
## -0.847948 -0.819342 0.129855 0.382580 -1.735095 0.605338 -0.419249
## 2054 2055 2056 2057 2058 2059 2060
## 0.047779 1.033752 -0.288629 0.323566 -0.070976 1.094727 -0.172516
## 2061 2062 2063 2064 2065 2066 2067
## 1.234043 -0.397323 0.029288 -0.038182 -0.583487 -0.056278 -0.747084
## 2068 2069 2070 2071 2072 2073 2074
## 1.740771 1.363994 -0.187939 1.559429 -0.454815 0.738906 -0.167016
## 2075 2076 2077 2078 2079 2080 2081
## 1.044424 0.189950 0.500310 2.909863 1.453610 0.762396 -1.484141
## 2082 2083 2084 2085 2086 2087 2088
## 1.559423 -0.939798 -0.403396 -0.937342 -0.273144 -0.602774 -0.534618
## 2089 2090 2091 2092 2093 2094 2095
## -0.963170 -0.942839 -1.440635 -0.737688 -1.434213 -1.606718 1.537078
## 2096 2097 2098 2099 2100 2101 2102
## 0.761869 -0.454053 -0.771410 0.009236 -0.155394 -0.557989 -0.803078
## 2103 2104 2105 2106 2107 2108 2109
## 0.792667 0.617052 -0.539497 -1.133984 -0.527497 -0.702558 1.139827
## 2110 2111 2112 2113 2114 2115 2116
## 2.661111 -1.671989 -0.277363 1.107182 -0.480705 0.495993 0.390420
## 2117 2118 2119 2120 2121 2122 2123
## -0.711091 -0.706696 -0.454110 0.520694 0.256559 1.664954 -0.526702
## 2124 2125 2126 2127 2128 2129
## 1.443753 0.053520 2.376878 -0.743719 0.562736 1.712538
Assuming that everything is OK, we can then go on to predict the pre-historic climate:
# The prediction is easy, just one line of code)
pred <- predict(mod, newdata = reg_data, interval = "prediction", se.fit = TRUE)
# head(pred, 12)
pred <- cbind(reg_data, pred) # Bind the original data and prediction data together
require(ggplot2)
## Loading required package: ggplot2
ggplot(aes(x = year), data = pred) + geom_line(aes(y = fit.fit)) + geom_smooth(aes(y = fit.fit,
ymin = fit.lwr, ymax = fit.upr), stat = "identity")
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_path).
Easy enough, right?
But that's a bit messy. Maybe we want to plot 10 year moving averages in order to more clearly see the long term trends:
pred$fit_s <- rollmean(pred$fit.fit, 10, align = "right", fill = NA)
head(pred, 20)
## year stand_chron resid_chron month value ma fit.fit fit.lwr fit.upr
## 1 -136 0.441 NA <NA> NA NA NA NA NA
## 2 -135 0.814 NA <NA> NA NA NA NA NA
## 3 -134 0.677 NA <NA> NA NA NA NA NA
## 4 -133 0.197 NA <NA> NA NA NA NA NA
## 5 -132 0.347 0.517 <NA> NA NA 10.77 6.469 15.07
## 6 -131 0.651 0.824 <NA> NA NA 12.02 7.736 16.29
## 7 -130 0.512 0.754 <NA> NA NA 11.73 7.449 16.01
## 8 -129 0.830 1.117 <NA> NA NA 13.20 8.929 17.48
## 9 -128 0.713 0.910 <NA> NA NA 12.36 8.088 16.64
## 10 -127 0.492 0.648 <NA> NA NA 11.30 7.012 15.59
## 11 -126 1.620 1.739 <NA> NA NA 15.72 11.407 20.04
## 12 -125 1.384 1.482 <NA> NA NA 14.68 10.392 18.97
## 13 -124 1.616 1.643 <NA> NA NA 15.34 11.029 19.64
## 14 -123 2.938 2.777 <NA> NA NA 19.93 15.385 24.48
## 15 -122 2.886 2.487 <NA> NA NA 18.76 14.292 23.22
## 16 -121 2.003 1.539 <NA> NA NA 14.91 10.618 19.21
## 17 -120 1.476 1.008 <NA> NA NA 12.76 8.487 17.04
## 18 -119 1.242 0.593 <NA> NA NA 11.08 6.784 15.37
## 19 -118 2.154 1.716 <NA> NA NA 15.63 11.316 19.95
## 20 -117 1.290 0.994 <NA> NA NA 12.70 8.430 16.98
## se.fit df residual.scale fit_s
## 1 NA 95 2.142 NA
## 2 NA 95 2.142 NA
## 3 NA 95 2.142 NA
## 4 NA 95 2.142 NA
## 5 0.3293 95 2.142 NA
## 6 0.2426 95 2.142 NA
## 7 0.2583 95 2.142 NA
## 8 0.2190 95 2.142 NA
## 9 0.2280 95 2.142 NA
## 10 0.2873 95 2.142 NA
## 11 0.3780 95 2.142 NA
## 12 0.2902 95 2.142 NA
## 13 0.3432 95 2.142 NA
## 14 0.8116 95 2.142 NA
## 15 0.6852 95 2.142 NA
## 16 0.3080 95 2.142 NA
## 17 0.2188 95 2.142 NA
## 18 0.3042 95 2.142 NA
## 19 0.3695 95 2.142 NA
## 20 0.2195 95 2.142 NA
# Hmmmm.... rollmean doesn't like something... maybe the first four NAs?
pred2 <- subset(pred, !is.na(fit.fit))
pred2$fit_s <- rollmean(pred2$fit.fit, 10, fill = NA, align = "right")
head(pred2, 20) # Yay
## year stand_chron resid_chron month value ma fit.fit fit.lwr fit.upr
## 5 -132 0.347 0.517 <NA> NA NA 10.771 6.469 15.07
## 6 -131 0.651 0.824 <NA> NA NA 12.015 7.736 16.29
## 7 -130 0.512 0.754 <NA> NA NA 11.731 7.449 16.01
## 8 -129 0.830 1.117 <NA> NA NA 13.203 8.929 17.48
## 9 -128 0.713 0.910 <NA> NA NA 12.364 8.088 16.64
## 10 -127 0.492 0.648 <NA> NA NA 11.302 7.012 15.59
## 11 -126 1.620 1.739 <NA> NA NA 15.724 11.407 20.04
## 12 -125 1.384 1.482 <NA> NA NA 14.683 10.392 18.97
## 13 -124 1.616 1.643 <NA> NA NA 15.335 11.029 19.64
## 14 -123 2.938 2.777 <NA> NA NA 19.932 15.385 24.48
## 15 -122 2.886 2.487 <NA> NA NA 18.757 14.292 23.22
## 16 -121 2.003 1.539 <NA> NA NA 14.914 10.618 19.21
## 17 -120 1.476 1.008 <NA> NA NA 12.761 8.487 17.04
## 18 -119 1.242 0.593 <NA> NA NA 11.079 6.784 15.37
## 19 -118 2.154 1.716 <NA> NA NA 15.631 11.316 19.95
## 20 -117 1.290 0.994 <NA> NA NA 12.704 8.430 16.98
## 21 -116 3.357 3.035 <NA> NA NA 20.978 16.346 25.61
## 22 -115 1.912 1.475 <NA> NA NA 14.654 10.364 18.94
## 23 -114 0.490 -0.077 <NA> NA NA 8.363 3.967 12.76
## 24 -113 0.273 -0.129 <NA> NA NA 8.152 3.745 12.56
## se.fit df residual.scale fit_s
## 5 0.3293 95 2.142 NA
## 6 0.2426 95 2.142 NA
## 7 0.2583 95 2.142 NA
## 8 0.2190 95 2.142 NA
## 9 0.2280 95 2.142 NA
## 10 0.2873 95 2.142 NA
## 11 0.3780 95 2.142 NA
## 12 0.2902 95 2.142 NA
## 13 0.3432 95 2.142 NA
## 14 0.8116 95 2.142 13.71
## 15 0.6852 95 2.142 14.50
## 16 0.3080 95 2.142 14.79
## 17 0.2188 95 2.142 14.90
## 18 0.3042 95 2.142 14.69
## 19 0.3695 95 2.142 15.01
## 20 0.2195 95 2.142 15.15
## 21 0.9253 95 2.142 15.68
## 22 0.2881 95 2.142 15.67
## 23 0.5617 95 2.142 14.98
## 24 0.5836 95 2.142 13.80
ggplot(aes(x = year, y = fit_s), data = pred2) + geom_line()
## Warning: Removed 9 rows containing missing values (geom_path).
And, just for fun, let's show times that are wetter than now or dryer than now:
# subset the data to remove missing values
mean_precip <- with(subset(pred2, !is.na(ma)), mean(ma))
# Create a column that is the lesser of the fit and the mean
pred2$low <- pmin(pred2$fit_s, mean_precip)
# Create a column that is the greater of the fit and the mean
pred2$hi <- pmax(pred2$fit_s, mean_precip)
# Create a column for the mean
pred2$avg <- mean_precip
# Create a plot, with two ribbons: one for data below average and one for
# data above average
ggplot(aes(x = year, y = fit_s), data = pred2) + geom_ribbon(aes(ymin = low,
ymax = avg), color = "brown", fill = "brown") + geom_ribbon(aes(ymin = avg,
ymax = hi), color = "green", fill = "green") + geom_vline(aes(xintercept = 1895))