Advanced R and Data Analysis

Overview

More on time series
Making interactive tools
Some new spatial techniques
Reproducibility

R studio

A useful tool
Download and install (it’s free) - https://www.rstudio.com/products/rstudio/download/

Looking at the rainfall data

💧 Main working data set is `rainfall.RData`…

📥 Get it from Moodle in section headed Chris Brunsdon’s Lectures.

load('rainfall.RData')
head(stations)

## # A tibble: 6 × 9
##       Station Elevation Easting Northing   Lat  Long    County
##         <chr>     <int>   <dbl>    <dbl> <dbl> <dbl>     <chr>
## 1      Athboy        87  270400   261700 53.60 -6.93     Meath
## 2 Foulksmills        71  284100   118400 52.30 -6.77   Wexford
## 3   Mullingar       112  241780   247765 53.47 -7.37 Westmeath
## 4     Portlaw         8  246600   115200 52.28 -7.31 Waterford
## 5    Rathdrum       131  319700   186000 52.91 -6.22   Wicklow
## 6 Strokestown        49  194500   279100 53.75 -8.10 Roscommon
## # ... with 2 more variables: Abbreviation <chr>, Source <chr>

This packages all of the rainfall information

head(rain)

## # A tibble: 6 × 4
##    Year  Month Rainfall Station
##   <dbl> <fctr>    <dbl>   <chr>
## 1  1850    Jan    169.0  Ardara
## 2  1851    Jan    236.4  Ardara
## 3  1852    Jan    249.7  Ardara
## 4  1853    Jan    209.1  Ardara
## 5  1854    Jan    188.5  Ardara
## 6  1855    Jan     32.3  Ardara

Its been converted to an R binary data file for convenience.
There are no missing values.
Data runs from 1850 to 2014 by month.
👍🏻Thanks to Conor Murphy and Simon Noone for supplying.

Introduction to the `dplyr` package

🎁 A new package `dplyr`

library(dplyr)
rain %>% group_by(Station) %>% 
  summarise(mrain=mean(Rainfall))  -> rain_summary
head(rain_summary)

## # A tibble: 6 × 2
##      Station     mrain
##        <chr>     <dbl>
## 1     Ardara 140.36753
## 2     Armagh  68.32096
## 3     Athboy  74.74356
## 4    Belfast  87.10995
## 5       Birr  70.83498
## 6 Cappoquinn 121.22455

The %>% command acts as a ‘pipeline’.
x %>% f(y) is the same as f(x,y).
x %>% f1(y) %>% f2(z) is the same as f2(f1(x,y),z) but easier to read.

📆 Group by month

rain %>% group_by(Month) %>% 
  summarise(mrain=mean(Rainfall)) -> rain_months
head(rain_months)

## # A tibble: 6 × 2
##    Month     mrain
##   <fctr>     <dbl>
## 1    Jan 112.64355
## 2    Feb  83.24975
## 3    Mar  79.53280
## 4    Apr  68.74165
## 5    May  71.31769
## 6    Jun  72.74104

summarise applies a summary function to each group and creates a new data frame with one entry for each group. Any R summary function can be used - eg median, sd, max or a user defined function.

📊 Basic graphic investigation

barplot(rain_months$mrain,names=rain_months$Month,las=3,col='dodgerblue')

📊 Shorter version using `with`

with(rain_months,barplot(mrain,names=Month,las=3,col='dodgerblue'))

📈 Also for yearly data

rain %>% group_by(Year) %>% 
  summarise(total_rain=sum(Rainfall)) -> rain_years
with(rain_years,plot(Year,total_rain,type='l',col='dodgerblue'))

✂️ Filter for individual stations using `filter`

rain %>% group_by(Year) %>% 
  filter(Station=='Strokestown') %>%
  summarise(total_rain=sum(Rainfall)) -> rain_years_str
with(rain_years_str,plot(Year,total_rain,type='l',col='dodgerblue'))

🔢 Multiple groupings - eg average by month/station combination

rain %>% group_by(Month,Station) %>% 
  summarise(mean_rain=mean(Rainfall)) -> rain_season_station
head(rain_season_station)

## Source: local data frame [6 x 3]
## Groups: Month [1]
## 
##    Month    Station mean_rain
##   <fctr>      <chr>     <dbl>
## 1    Jan     Ardara 174.82606
## 2    Jan     Armagh  74.57242
## 3    Jan     Athboy  84.94759
## 4    Jan    Belfast 101.20718
## 5    Jan       Birr  79.92074
## 6    Jan Cappoquinn 153.97159

Note that with is implicit in the dplyr commands such as filter and summarise

Re-arrange using `reshape2` package

library(reshape2)
rain_season_station %>% acast(Station~Month) %>% head

## Using mean_rain as value column: use value.var to override.

##                  Jan       Feb       Mar      Apr      May       Jun
## Ardara     174.82606 126.82303 123.02000 98.79333 96.90727 105.24061
## Armagh      74.57242  55.97182  56.48879 53.67030 59.23182  62.72939
## Athboy      84.94759  62.62133  62.44944 58.97874 62.16260  68.11460
## Belfast    101.20718  74.50206  73.10221 65.90492 69.23426  74.48525
## Birr        79.92074  57.88501  58.42056 54.07187 60.25831  61.96445
## Cappoquinn 153.97159 117.77099 110.02890 94.00365 95.81437  94.86357
##                  Jul       Aug       Sep       Oct       Nov       Dec
## Ardara     123.70485 145.24788 152.80727 174.44788 176.45030 186.14182
## Armagh      72.50636  81.92182  69.02576  80.94242  73.73121  79.05939
## Athboy      76.28662  88.84710  76.35077  88.14627  80.95767  87.05997
## Belfast     87.70003 102.44499  87.11622 106.35394 100.31760 102.95073
## Birr        75.10084  86.75669  72.21722  83.80810  77.87266  81.74334
## Cappoquinn 104.09372 125.42245 116.04466 146.70658 141.10012 154.87402

Now results organised into 2D array

… which makes a useful heatmap

library(reshape2)
rain_season_station %>% acast(Station~Month) %>% heatmap(Colv=NA)

Why `%>%` is useful here

This code takes the rain data frame and runs through the steps to aggregated, reshape and compute the heat map:

rain %>% group_by(Month,Station) %>% summarise(mean_rain=mean(Rainfall)) %>%
  acast(Station~Month) %>%
  heatmap(Colv=NA)

So does this code, in ‘traditional’ R format:

  heatmap(acast(summarise(group_by(rain,Month,Station),
                          mean_rain=mean(Rainfall)),Station~Month),Colv=NA)

Which do you think is clearer?

Multiple summary stats are also possible

rain %>% group_by(Month,Station) %>% 
  summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) -> rain_mnsd
head(rain_mnsd)

## Source: local data frame [6 x 4]
## Groups: Month [1]
## 
##    Month    Station mean_rain  sd_rain
##   <fctr>      <chr>     <dbl>    <dbl>
## 1    Jan     Ardara 174.82606 64.59834
## 2    Jan     Armagh  74.57242 28.91264
## 3    Jan     Athboy  84.94759 32.84887
## 4    Jan    Belfast 101.20718 40.60078
## 5    Jan       Birr  79.92074 33.74263
## 6    Jan Cappoquinn 153.97159 66.33995

`mutate` - Create new variables in a data frame

rain %>% group_by(Month) %>% 
  summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) %>%
  mutate(cv_rain=100 * sd_rain/mean_rain)  -> rain_mnsdcv
head(rain_mnsdcv)

## # A tibble: 6 × 4
##    Month mean_rain  sd_rain  cv_rain
##   <fctr>     <dbl>    <dbl>    <dbl>
## 1    Jan 112.64355 57.56599 51.10456
## 2    Feb  83.24975 51.45153 61.80382
## 3    Mar  79.53280 44.33714 55.74699
## 4    Apr  68.74165 36.39073 52.93841
## 5    May  71.31769 37.20539 52.16853
## 6    Jun  72.74104 40.92818 56.26560

📖 CV: Coefficient of variation - standard deviation as a percentage of mean.
Shows relative variability of rainfall

Relative variability visualised…

barplot(rain_mnsdcv$cv_rain,names=rain_mnsdcv$Month,las=3,col='indianred')

… and absolute variability

barplot(rain_mnsdcv$sd_rain,names=rain_mnsdcv$Month,las=3,col='indianred')

Focusing on time series

⩰ Components and superposition of effects

Source: http://aplus.com/a/instachaz-painfully-accurate-graphs-about-life

📆 🕙 Time Series are often several superimposed effects

Cyclic
- Seasonal variation in rainfall
- Salary on payday
Long Term Trend
- Overall increase/decrease in rainfall
- Whether I’m getting richer or poorer
Random (Noise)
- Very wet one-off days
- Unplanned big spend on car repair

⪸ As seen in the rainfall data?

rain %>% group_by(Year,Month) %>% summarise(rf=sum(Rainfall)) -> monthly_total
plot(monthly_total$rf[1:120],type='b',ann=F, axes=F, cex=0.6)
axis(1,at=seq(0,120,by=12),labels=1850:1860); axis(2); title('Total Irish Rainfall 1850-1860')

⏳ The R time-series object

ts - A time series object that is useful for some functions…

monthly_total$rf %>% ts(freq=12,start=1850) -> rain_ts # Time serioes object
rain_ts %>% window(c(1850,1),c(1853,12)) # window from jan 1850 - dec 1853

##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1850 2836.3 2158.9  964.1 3457.2 1492.1 1362.4 2584.8 1906.4 1763.1 1567.8
## 1851 4875.5 1379.9 1997.9 1368.1 1124.2 2537.1 2258.3 2416.2 1234.5 2834.2
## 1852 4036.1 2311.2 1116.0 1110.0 1754.6 4484.9 1716.0 2752.0 1433.1 2217.7
## 1853 3618.6 1073.9 2088.0 1881.9  829.1 1936.7 2512.3 2072.0 1504.1 4627.6
##         Nov    Dec
## 1850 2828.5 2600.1
## 1851 1134.3 1719.2
## 1852 5581.2 5205.9
## 1853 2651.1 1209.3

Note c(1850,1) is January 1850 etc. when using functions like window

🌔 The `monthplot` function

Plot to show average levels for each month and individual yearly variations from this.

rain_ts %>% window(c(1850,1),c(1869,12)) %>%
  monthplot(col='dodgerblue',col.base='indianred',lwd.base=3)

… for the whole data set:

Remove the window filter to see whole data (1850-2014)

rain_ts %>% 
  monthplot(col='dodgerblue',col.base='indianred',lwd.base=3)

Decomposition

View data as trend + seasonal + random components:

rain_ts  %>% 
  stl(s.window='periodic') -> rain_decomp
rain_decomp

##           seasonal    trend   remainder
## Jan 1850  541.6231 2290.835     3.84188
## Feb 1850 -193.1494 2273.775    78.27395
## Mar 1850 -286.0009 2256.716 -1006.61494
## Apr 1850 -555.8361 2244.188  1768.84808
## May 1850 -491.4910 2231.660  -248.06925
## Jun 1850 -456.0382 2225.708  -407.27014
## Jul 1850 -219.3928 2219.756   584.43633

R. B. Cleveland, W. S. Cleveland, J.E. McRae, and I. Terpenning (1990) STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, v6, pp3–73.

Graphical representation

rain_ts  %>% window(c(1850,1),c(1859,12)) %>% 
  stl(s.window='periodic') %>% plot

… for all of the data:

rain_ts  %>% stl(s.window='periodic') %>% plot

… for all of the data with more smoothing:

rain_ts  %>% stl(s.window='periodic',t.window=361) %>% plot

Further issues for time series

A closer look at the error term

You can extract the error term from an stl object
column 3 is the error term of the time.series component

rain_ts  %>% stl(s.window='periodic',t.window=361) -> rain_decomp2
rain_decomp2$time.series[,3] %>% window(c(1850,1),c(1853,12))

##              Jan         Feb         Mar         Apr         May
## 1850   184.79467   241.67250  -860.76751  1901.54871  -128.51254
## 1851  2215.27112  -546.05105   164.30894  -196.27484  -505.13609
## 1852  1367.14757   376.52540  -726.31461  -463.09839   116.54036
## 1853   940.92402  -869.49815   236.94120   300.03678  -817.74510
##              Jun         Jul         Aug         Sep         Oct
## 1850  -294.35012   690.72311  -361.86057  -327.84485 -1036.83437
## 1851   871.62633   355.49956   139.21588  -865.16840   220.84208
## 1852  2810.70278  -195.52399   466.29233  -675.29195  -404.38147
## 1853   253.69668   591.94927  -222.55505  -613.15997  1996.62987
##              Nov         Dec
## 1850   285.70343  -125.77091
## 1851 -1417.22012 -1015.39446
## 1852  3020.95633  2462.58199
## 1853    81.94704 -1542.94794

Temporal Autocorrelation

This is the degree to which current values of a variable depend on previous values
Autocorrelation at lag 1 is correlation between \(x_t\) and \(x_{t-1}\)
- often written as \(\rho(x_t,x_{t-1})\) or for shorthand \(\rho_1\).
- … and so on. For lag \(k\) it is \(\rho(x_t,x_{t-k})\), or \(\rho_k\).

A graph of \(\rho_k\) vs. \(k\) is called a *correlogram
Looking at correlograms helps further understand the process driving the time series
- In particular, if a particular month has an unusually high or low rainfall level how long is this anomaly likely to persist?

The correlogram

This is provided by acf (AutoCorrelation Function)
Blue dotted lines show significant values at 95% : Significant 6 month lag negative correlation
Initial spike fairly irrelevent - just says \(\rho_0 = 1\) - obviously so as this is just \(\rho(x_i,x_i)\)!

rain_decomp2$time.series[,3] %>% acf(main='Rainfall Correlogram')

The frequency domain

Another way to think about time series is in the frequency domain
Identify ‘spikes’ at different frequencies - ie 1 cycle per year, 2 cycles per year etc. - use the spectrum function

rain_ts %>% spectrum(spans=3,log='dB') # spans controls smoothing on spectral estimate

Conclusion

💡 New ideas

New R ideas
- dplyr and pipeline-based commands, and related functions
- ts objects for time and related functions
New time series ideas
- Graphical tools: monthplot etc.
- Decomposition: trend/seasonal/random
- Autocorrelation: How long do anomalies last?
- Spectral Analysis: useful to check for unexpected cycles
Next lecture - Interactive methods

Further Topics in R and Data Analysis

Chris Brunsdon