Jeffrey D. Walker 2013-01-14
This tutorial explains how to perform parametric and non-parametric trend tests on multiple time series, and plot a summary of the resulting slope estimates.
First, we load four libraries needed to perform the analysis.
library(lubridate)
library(scales)
library(plyr)
library(ggplot2)
library(wq)
theme_set(theme_bw())
Next, we create a dummy set of data for multiple sites and parameters. For each site/parameter pair, we 1) generate random slopes and error variances, and 2) generate a synthetic monthly time series.
start.date <- as.Date("2000-01-01") # initial date of all timeseries
n.years <- 10 # number of years
initial.value <- 100 # initial value of each time series
dates <- seq.Date(start.date, by = "month", length.out = 12 * n.years) # array of dates
sites <- factor(c("A", "B", "C")) # site factors
params <- factor(c("X", "Y", "Z")) # parameter factors
set.seed(123) # set seed for reproducibility
# generate random slopes and error variances for each site/parameter pair
ts.slopes <- expand.grid(SITE = sites, PARAM = params)
ts.slopes$SLOPE <- runif(n = nrow(ts.slopes), min = -0.1, max = 0.1)
ts.slopes$VAR <- runif(n = nrow(ts.slopes), min = 0, max = 3)
ts.slopes
## SITE PARAM SLOPE VAR
## 1 A X -0.042484 1.3698
## 2 B X 0.057661 2.8705
## 3 C X -0.018205 1.3600
## 4 A Y 0.076603 2.0327
## 5 B Y 0.088093 1.7179
## 6 C Y -0.090889 0.3088
## 7 A Z 0.005621 2.6995
## 8 B Z 0.078484 0.7383
## 9 C Z 0.010287 0.1262
# generate time series for each site/parameter
df <- expand.grid(SITE = sites, PARAM = params, DATE = dates, VALUE = NA)
for (s in sites) {
for (p in params) {
SLOPE <- subset(ts.slopes, SITE == s & PARAM == p)$SLOPE
VAR <- subset(ts.slopes, SITE == s & PARAM == p)$VAR
df[which(df$SITE == s & df$PARAM == p), "VALUE"] <- initial.value +
seq(0, length(dates) - 1) * SLOPE/12 + rnorm(length(dates), 0, sqrt(VAR))
}
}
Generate plot of each time series, with linear regression overlay.
ggplot(df, aes(DATE, VALUE)) + geom_line() + geom_smooth(method = "lm") + facet_grid(SITE ~
PARAM, scales = "free_y")
It is generally a good idea to compare different statistical methods to test for trends. We will use two types of tests: 1) Seasonall Kendall Test (non-parametric), and 2) Linear Regression (parametric). For each test, we define a function that takes a single time series (as a subset of the original dataframe, df), and returns a 1-row data frame containing a summary of the slope, p-value, etc.
To explain how each function works, we first extract one time series from the data frame corresponding to Site 'A' and Parameter 'X', which is saved as variable x.
x <- subset(df, SITE == "A" & PARAM == "X")
head(x)
## SITE PARAM DATE VALUE
## 1 A X 2000-01-01 99.48
## 10 A X 2000-02-01 101.43
## 19 A X 2000-03-01 100.41
## 28 A X 2000-04-01 100.46
## 37 A X 2000-05-01 100.12
## 46 A X 2000-06-01 99.33
For the Seasonal Kendall test, we will use the seaKen() function from the wq package. Since this function requires the input timeseries to be of type ts, we must convert x from a dataframe to a ts object. Note that the arguments start and end both take a vector of length 2, with the first element being the initial/final year, and the second being the initial/final month. Since this is a month time series, the frequency is 12.
t <- ts(x$VALUE, start = c(year(start.date), 1), end = c(year(start.date) +
n.years - 1, 12), frequency = 12)
t
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 2000 99.48 101.43 100.41 100.46 100.12 99.33 102.07 100.56 97.67 100.79
## 2001 99.70 98.75 99.10 99.22 97.97 100.92 100.12 98.60 101.40 100.42
## 2002 100.94 100.87 100.71 100.55 99.83 99.54 99.45 99.08 99.64 98.40
## 2003 98.56 99.40 99.32 100.77 99.76 100.15 99.82 99.80 101.45 99.58
## 2004 100.51 99.97 100.08 100.26 99.23 99.42 98.62 98.55 100.16 100.32
## 2005 102.19 99.21 97.08 100.95 98.94 98.96 100.97 99.43 98.33 99.97
## 2006 100.20 99.31 100.49 99.48 100.12 101.01 100.23 99.34 101.06 100.88
## 2007 98.97 101.29 98.99 102.25 101.48 99.41 98.48 98.85 99.97 99.38
## 2008 99.61 98.74 97.70 99.20 100.72 98.97 100.35 97.74 99.57 100.24
## 2009 98.87 98.62 98.41 99.74 98.49 99.03 99.30 101.75 98.83 99.86
## Nov Dec
## 2000 99.41 98.71
## 2001 99.58 100.97
## 2002 102.42 101.29
## 2003 101.61 98.02
## 2004 99.86 100.87
## 2005 99.59 99.76
## 2006 100.35 99.99
## 2007 99.26 98.55
## 2008 99.98 99.74
## 2009 99.67 98.45
We can then apply the function seaKen() to the object t.
sk <- seaKen(t)
sk
## $sen.slope
## [1] -0.09148
##
## $sen.slope.pct
## [1] -0.09168
##
## $p.value
## [1] 0.01226
##
## $miss
## 1 2 3 4 5 6 7 8 9 10 11 12
## 0 0 0 0 0 0 0 0 0 0 0 0
str(sk)
## List of 4
## $ sen.slope : num -0.0915
## $ sen.slope.pct: num -0.0917
## $ p.value : num 0.0123
## $ miss : Named num [1:12] 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "names")= chr [1:12] "1" "2" "3" "4" ...
The slope, percent slope and associated p-value can thus be extracted from the output of the seaKen() function, and the results saved as a 1-row data frame.
slope <- sk$sen.slope
slope.pct <- sk$sen.slope.pct/100
direction <- ordered(ifelse(slope > 0, "Increasing", "Decreasing"), levels = c("Increasing",
"Decreasing"))
pval <- sk$p.value
signif <- cut(pval, breaks = c(0, 0.05, 0.1, 1), labels = c("p<0.05", "0.05<p<0.10",
"p>0.10"))
data.frame(METHOD = "SeasonalKendall", SLOPE = slope, SLOPE.PCT = slope.pct,
PVAL = pval, SIGNIF = signif, DIRECTION = direction)
## METHOD SLOPE SLOPE.PCT PVAL SIGNIF DIRECTION
## 1 SeasonalKendall -0.09148 -0.0009168 0.01226 p<0.05 Decreasing
This process is then saved to a function called trend.sk().
trend.sk <- function(x) {
t <- ts(x$VALUE, start = c(year(start.date), 1), end = c(year(start.date) +
n.years - 1, 12), frequency = 12)
sk <- seaKen(t)
slope <- sk$sen.slope
slope.pct <- sk$sen.slope.pct/100
direction <- ordered(ifelse(slope > 0, "Increasing", "Decreasing"), levels = c("Increasing",
"Decreasing"))
pval <- sk$p.value
signif <- cut(pval, breaks = c(0, 0.05, 0.1, 1), labels = c("p<0.05", "0.05<p<0.10",
"p>0.10"))
data.frame(METHOD = "SeasonalKendall", SLOPE = slope, SLOPE.PCT = slope.pct,
PVAL = pval, SIGNIF = signif, DIRECTION = direction)
}
trend.sk(x)
## METHOD SLOPE SLOPE.PCT PVAL SIGNIF DIRECTION
## 1 SeasonalKendall -0.09148 -0.0009168 0.01226 p<0.05 Decreasing
A similar function can be written that summarizes the trend test using linear reag
trend.lm <- function(x) {
x$Year <- year(x$DATE) + (month(x$DATE) - 1)/12
l <- summary(lm(VALUE ~ Year, data = x))
slope <- l$coefficients[2, "Estimate"]
slope.pct <- slope/mean(x$VALUE)
direction <- ordered(ifelse(slope > 0, "Increasing", "Decreasing"), levels = c("Increasing",
"Decreasing"))
pval <- l$coefficients[2, "Pr(>|t|)"]
signif <- cut(pval, breaks = c(0, 0.05, 0.1, 1), labels = c("p<0.05", "0.05<p<0.10",
"p>0.10"))
data.frame(METHOD = "LinearRegression", SLOPE = slope, SLOPE.PCT = slope.pct,
PVAL = pval, SIGNIF = signif, DIRECTION = direction)
}
trend.lm(x)
## METHOD SLOPE SLOPE.PCT PVAL SIGNIF DIRECTION
## 1 LinearRegression -0.06524 -0.0006538 0.04922 p<0.05 Decreasing
With the two trend tests defined by trend.sk() and trend.lm(), which both return a 1-row data frame with the same column names, we can apply both tests to all the time series in the original data frame using the ddply() function from the plyr package.
trend.sk.batch <- ddply(df, c("SITE", "PARAM"), trend.sk)
trend.lm.batch <- ddply(df, c("SITE", "PARAM"), trend.lm)
trends <- rbind(trend.sk.batch, trend.lm.batch)
trends
## SITE PARAM METHOD SLOPE SLOPE.PCT PVAL SIGNIF
## 1 A X SeasonalKendall -0.09148 -0.0009168 1.226e-02 p<0.05
## 2 A Y SeasonalKendall 0.02117 0.0002110 4.540e-01 p>0.10
## 3 A Z SeasonalKendall -0.01813 -0.0001809 7.371e-01 p>0.10
## 4 B X SeasonalKendall 0.17183 0.0017121 4.887e-03 p<0.05
## 5 B Y SeasonalKendall 0.06103 0.0006080 5.281e-02 0.05<p<0.10
## 6 B Z SeasonalKendall 0.11096 0.0011065 1.494e-03 p<0.05
## 7 C X SeasonalKendall -0.11504 -0.0011500 2.122e-03 p<0.05
## 8 C Y SeasonalKendall -0.11936 -0.0011983 4.782e-07 p<0.05
## 9 C Z SeasonalKendall 0.01169 0.0001168 3.394e-01 p>0.10
## 10 A X LinearRegression -0.06524 -0.0006538 4.922e-02 p<0.05
## 11 A Y LinearRegression 0.06420 0.0006398 1.595e-01 p>0.10
## 12 A Z LinearRegression -0.02638 -0.0002632 6.158e-01 p>0.10
## 13 B X LinearRegression 0.16765 0.0016704 2.142e-03 p<0.05
## 14 B Y LinearRegression 0.06343 0.0006320 1.015e-01 p>0.10
## 15 B Z LinearRegression 0.08732 0.0008708 1.754e-03 p<0.05
## 16 C X LinearRegression -0.08264 -0.0008262 3.424e-02 p<0.05
## 17 C Y LinearRegression -0.11392 -0.0011437 7.480e-09 p<0.05
## 18 C Z LinearRegression 0.01761 0.0001759 1.286e-01 p>0.10
## DIRECTION
## 1 Decreasing
## 2 Increasing
## 3 Decreasing
## 4 Increasing
## 5 Increasing
## 6 Increasing
## 7 Decreasing
## 8 Decreasing
## 9 Increasing
## 10 Decreasing
## 11 Increasing
## 12 Decreasing
## 13 Increasing
## 14 Increasing
## 15 Increasing
## 16 Decreasing
## 17 Decreasing
## 18 Increasing
We can then compare the results of each test using a bar chart.
ggplot(trends, aes(SITE, SLOPE, fill = METHOD)) + geom_bar(stat = "identity",
position = "dodge") + facet_wrap(~PARAM) + labs(x = "Site", y = "Trend Slope (units/yr)")
Finally, we can also summarize the results using a “lolli-pop” chart, which shows the magnitude, direction and significance of each time series.
p.slope <- ggplot() + aes(SLOPE, SITE) +
geom_vline(xint=0) +
geom_segment(aes(x=0, xend=SLOPE, yend=SITE)) +
geom_point(size=4, color='white') +
geom_point(size=4, aes(alpha=ordered(SIGNIF, levels=c("p>0.10","0.05<p<0.10","p<0.05")), color=DIRECTION)) +
geom_point(size=4, shape=1) +
scale_alpha_manual('Significance', values=c(0.01, 0.4, 1), drop=FALSE) +
scale_color_manual('Slope Sign', values=c('orangered','steelblue'), labels=c('Increasing','Decreasing'), drop=FALSE) +
labs(x='Trend Slope (units/yr)', y='Site')
p.slope %+% subset(trends, METHOD=="SeasonalKendall") + facet_wrap(~PARAM) + ggtitle("Seasonal Kendall Trend Tests")
p.slope %+% subset(trends, METHOD=="LinearRegression") + facet_wrap(~PARAM) + ggtitle("Linear Regression Trend Tests")