Advanced R and Data Analysis

Get Started - load the data and packages again.

library(dplyr)
library(ggplot2)
load('rainfall.RData')

Focusing on time series

⩰ Components and superposition of effects



Source: http://aplus.com/a/instachaz-painfully-accurate-graphs-about-life

📆 🕙 Time Series are often several superimposed effects


  • Cyclic
    • Seasonal variation in rainfall
    • Salary on payday

    • Long Term Trend
      • Overall increase/decrease in rainfall
      • Whether I’m getting richer or poorer
    • Random (Noise)
      • Very wet one-off days
      • Unplanned big spend on car repair

⪸ As seen in the rainfall data?


rain %>% group_by(Year,Month) %>% summarise(rf=sum(Rainfall)) -> monthly_total
monthly_total %>% ungroup() %>% arrange(Year,Month) %>% filter(Year <= 1860) %>% mutate(ind=row_number())-> temp
ggplot(temp,aes(x=ind,y=rf)) + geom_line(col='wheat4') + geom_point(col='seagreen') +
  scale_x_continuous(breaks=1+(0:11)*12,labels=1850:1861) +
  labs(x="Year",y="Rainfall",title = "Total Irish Rainfall")

⏳ The R time-series object


ts - A time series object that is useful for some functions…
monthly_total$rf %>% ts(freq=12,start=1850) -> rain_ts # Time series object
rain_ts %>% window(c(1850,1),c(1853,12)) # window from jan 1850 - dec 1853
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1850 2836.3 2158.9  964.1 3457.2 1492.1 1362.4 2584.8 1906.4 1763.1 1567.8
## 1851 4875.5 1379.9 1997.9 1368.1 1124.2 2537.1 2258.3 2416.2 1234.5 2834.2
## 1852 4036.1 2311.2 1116.0 1110.0 1754.6 4484.9 1716.0 2752.0 1433.1 2217.7
## 1853 3618.6 1073.9 2088.0 1881.9  829.1 1936.7 2512.3 2072.0 1504.1 4627.6
##         Nov    Dec
## 1850 2828.5 2600.1
## 1851 1134.3 1719.2
## 1852 5581.2 5205.9
## 1853 2651.1 1209.3

Note c(1850,1) is January 1850 etc. when using functions like window

🌔 A month plot function


Plot to show average levels by month and individual yearly variations from this.
rain %>% filter(Station=="Dublin Airport") -> dub_rain
dub_rain %>% 
  mutate(yf = 0.05+0.9*(Year - min(Year))/(max(Year) - min(Year)) + 
           as.integer(Month)) %>% 
  group_by(Month) %>% 
  mutate(avg=mean(Rainfall)) -> dub_mp
m_plot <- ggplot(dub_mp,aes(x=yf,group=Month)) + labs(x='Month') +
  scale_x_continuous(breaks=1:12+0.5,
                     labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")) +
  scale_y_continuous(limits=c(0,250))

m_plot

A look at the month plot:


m_plot + geom_line(aes(y=Rainfall),col='navy') +
  geom_line(aes(y=avg),col='indianred',lwd=2) 

Decomposition


View data as seasonal + trend + random components:

rain_ts  %>% 
  stl(s.window='periodic') -> rain_decomp
rain_decomp
##           seasonal    trend   remainder
## Jan 1850  541.6231 2290.835     3.84188
## Feb 1850 -193.1494 2273.775    78.27395
## Mar 1850 -286.0009 2256.716 -1006.61494
## Apr 1850 -555.8361 2244.188  1768.84808
## May 1850 -491.4910 2231.660  -248.06925
## Jun 1850 -456.0382 2225.708  -407.27014
## Jul 1850 -219.3928 2219.756   584.43633

R. B. Cleveland, W. S. Cleveland, J.E. McRae, and I. Terpenning (1990) STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, v6, pp3–73.

Graphical representation


library(forecast)
rain_ts  %>% window(c(1850,1),c(1859,12)) %>% 
  stl(s.window='periodic') %>% autoplot()

… for all of the data:


rain_ts  %>% stl(s.window='periodic') %>% autoplot()

… for all of the data with more smoothing:


rain_ts  %>% stl(s.window='periodic',t.window=361) %>% autoplot()

Further issues for time series

A closer look at the error term


  • You can extract the error term from an stl object
    • column 3 is the error term of the time.series component
rain_ts  %>% stl(s.window='periodic',t.window=361) -> rain_decomp2
rain_decomp2$time.series[,3] %>% window(c(1850,1),c(1853,12)) 
##              Jan         Feb         Mar         Apr         May
## 1850   184.79467   241.67250  -860.76751  1901.54871  -128.51254
## 1851  2215.27112  -546.05105   164.30894  -196.27484  -505.13609
## 1852  1367.14757   376.52540  -726.31461  -463.09839   116.54036
## 1853   940.92402  -869.49815   236.94120   300.03678  -817.74510
##              Jun         Jul         Aug         Sep         Oct
## 1850  -294.35012   690.72311  -361.86057  -327.84485 -1036.83437
## 1851   871.62633   355.49956   139.21588  -865.16840   220.84208
## 1852  2810.70278  -195.52399   466.29233  -675.29195  -404.38147
## 1853   253.69668   591.94927  -222.55505  -613.15997  1996.62987
##              Nov         Dec
## 1850   285.70343  -125.77091
## 1851 -1417.22012 -1015.39446
## 1852  3020.95633  2462.58199
## 1853    81.94704 -1542.94794

Temporal Autocorrelation


  • This is the degree to which current values of a variable depend on previous values
    • Autocorrelation at lag 1 is correlation between \(x_t\) and \(x_{t-1}\)
      • often written as \(\rho(x_t,x_{t-1})\) or for shorthand \(\rho_1\).
      • … and so on. For lag \(k\) it is \(\rho(x_t,x_{t-k})\), or \(\rho_k\).

  • A graph of \(\rho_k\) vs. \(k\) is called a correlogram
    • Looking at correlograms helps further understand the process driving the time series
      • In particular, if a particular month has an unusually high or low rainfall level how long is this anomaly likely to persist?

The correlogram


  • This is provided by acf (AutoCorrelation Function)
    • Blue dotted lines show significant values at 95% : Significant 6 month lag negative correlation
    • Initial spike fairly irrelevent - just says \(\rho_0 = 1\) - obviously so as this is just \(\rho(x_i,x_i)\)!
rain_decomp2$time.series[,3] %>% ggAcf()

Forecasting


rain_ts %>% window(c(1980,1),c(2014,12)) %>% stl(s.window='periodic') %>% forecast(h=120) -> fcast
autoplot(fcast) 

Conclusion

💡 New ideas

  • New R ideas
    • ts objects for time and related functions
  • New time series ideas
    • Decomposition: trend/seasonal/random
    • Autocorrelation: How long do anomalies last?
    • Forecasting: Take good note of error bands, question assumptions
  • Next lecture - Interactive methods