Advanced R and Data Analysis



Overview


  • More on time series
  • Making interactive tools
  • Some new spatial techniques
  • Reproducibility

R studio




Looking at the rainfall data



💧 Main working data set is rainfall.RData


📥 Get it from Moodle in section headed Chris Brunsdon’s Lectures.
load('rainfall.RData')
head(stations)
## # A tibble: 6 × 9
##       Station Elevation Easting Northing   Lat  Long    County
##         <chr>     <int>   <dbl>    <dbl> <dbl> <dbl>     <chr>
## 1      Athboy        87  270400   261700 53.60 -6.93     Meath
## 2 Foulksmills        71  284100   118400 52.30 -6.77   Wexford
## 3   Mullingar       112  241780   247765 53.47 -7.37 Westmeath
## 4     Portlaw         8  246600   115200 52.28 -7.31 Waterford
## 5    Rathdrum       131  319700   186000 52.91 -6.22   Wicklow
## 6 Strokestown        49  194500   279100 53.75 -8.10 Roscommon
## # ... with 2 more variables: Abbreviation <chr>, Source <chr>


This packages all of the rainfall information


head(rain)
## # A tibble: 6 × 4
##    Year  Month Rainfall Station
##   <dbl> <fctr>    <dbl>   <chr>
## 1  1850    Jan    169.0  Ardara
## 2  1851    Jan    236.4  Ardara
## 3  1852    Jan    249.7  Ardara
## 4  1853    Jan    209.1  Ardara
## 5  1854    Jan    188.5  Ardara
## 6  1855    Jan     32.3  Ardara

  • Its been converted to an R binary data file for convenience.
  • There are no missing values.
  • Data runs from 1850 to 2014 by month.
  • 👍🏻Thanks to Conor Murphy and Simon Noone for supplying.


Introduction to the dplyr package



🎁 A new package dplyr


library(dplyr)
rain %>% group_by(Station) %>% 
  summarise(mrain=mean(Rainfall))  -> rain_summary
head(rain_summary)
## # A tibble: 6 × 2
##      Station     mrain
##        <chr>     <dbl>
## 1     Ardara 140.36753
## 2     Armagh  68.32096
## 3     Athboy  74.74356
## 4    Belfast  87.10995
## 5       Birr  70.83498
## 6 Cappoquinn 121.22455

  • The %>% command acts as a ‘pipeline’.
  • x %>% f(y) is the same as f(x,y).
  • x %>% f1(y) %>% f2(z) is the same as f2(f1(x,y),z) but easier to read.


📆 Group by month


rain %>% group_by(Month) %>% 
  summarise(mrain=mean(Rainfall)) -> rain_months
head(rain_months)
## # A tibble: 6 × 2
##    Month     mrain
##   <fctr>     <dbl>
## 1    Jan 112.64355
## 2    Feb  83.24975
## 3    Mar  79.53280
## 4    Apr  68.74165
## 5    May  71.31769
## 6    Jun  72.74104

summarise applies a summary function to each group and creates a new data frame with one entry for each group. Any R summary function can be used - eg median, sd, max or a user defined function.



📊 Basic graphic investigation


barplot(rain_months$mrain,names=rain_months$Month,las=3,col='dodgerblue')



📊 Shorter version using with


with(rain_months,barplot(mrain,names=Month,las=3,col='dodgerblue'))



📈 Also for yearly data


rain %>% group_by(Year) %>% 
  summarise(total_rain=sum(Rainfall)) -> rain_years
with(rain_years,plot(Year,total_rain,type='l',col='dodgerblue'))



✂️ Filter for individual stations using filter


rain %>% group_by(Year) %>% 
  filter(Station=='Strokestown') %>%
  summarise(total_rain=sum(Rainfall)) -> rain_years_str
with(rain_years_str,plot(Year,total_rain,type='l',col='dodgerblue'))



🔢 Multiple groupings - eg average by month/station combination


rain %>% group_by(Month,Station) %>% 
  summarise(mean_rain=mean(Rainfall)) -> rain_season_station
head(rain_season_station)
## Source: local data frame [6 x 3]
## Groups: Month [1]
## 
##    Month    Station mean_rain
##   <fctr>      <chr>     <dbl>
## 1    Jan     Ardara 174.82606
## 2    Jan     Armagh  74.57242
## 3    Jan     Athboy  84.94759
## 4    Jan    Belfast 101.20718
## 5    Jan       Birr  79.92074
## 6    Jan Cappoquinn 153.97159

Note that with is implicit in the dplyr commands such as filter and summarise



Re-arrange using reshape2 package


library(reshape2)
rain_season_station %>% acast(Station~Month) %>% head
## Using mean_rain as value column: use value.var to override.
##                  Jan       Feb       Mar      Apr      May       Jun
## Ardara     174.82606 126.82303 123.02000 98.79333 96.90727 105.24061
## Armagh      74.57242  55.97182  56.48879 53.67030 59.23182  62.72939
## Athboy      84.94759  62.62133  62.44944 58.97874 62.16260  68.11460
## Belfast    101.20718  74.50206  73.10221 65.90492 69.23426  74.48525
## Birr        79.92074  57.88501  58.42056 54.07187 60.25831  61.96445
## Cappoquinn 153.97159 117.77099 110.02890 94.00365 95.81437  94.86357
##                  Jul       Aug       Sep       Oct       Nov       Dec
## Ardara     123.70485 145.24788 152.80727 174.44788 176.45030 186.14182
## Armagh      72.50636  81.92182  69.02576  80.94242  73.73121  79.05939
## Athboy      76.28662  88.84710  76.35077  88.14627  80.95767  87.05997
## Belfast     87.70003 102.44499  87.11622 106.35394 100.31760 102.95073
## Birr        75.10084  86.75669  72.21722  83.80810  77.87266  81.74334
## Cappoquinn 104.09372 125.42245 116.04466 146.70658 141.10012 154.87402

  • Now results organised into 2D array


… which makes a useful heatmap


library(reshape2)
rain_season_station %>% acast(Station~Month) %>% heatmap(Colv=NA)



Why %>% is useful here


This code takes the rain data frame and runs through the steps to aggregated, reshape and compute the heat map:

rain %>% group_by(Month,Station) %>% summarise(mean_rain=mean(Rainfall)) %>%
  acast(Station~Month) %>%
  heatmap(Colv=NA)

So does this code, in ‘traditional’ R format:

  heatmap(acast(summarise(group_by(rain,Month,Station),
                          mean_rain=mean(Rainfall)),Station~Month),Colv=NA)

Which do you think is clearer?



Multiple summary stats are also possible


rain %>% group_by(Month,Station) %>% 
  summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) -> rain_mnsd
head(rain_mnsd)
## Source: local data frame [6 x 4]
## Groups: Month [1]
## 
##    Month    Station mean_rain  sd_rain
##   <fctr>      <chr>     <dbl>    <dbl>
## 1    Jan     Ardara 174.82606 64.59834
## 2    Jan     Armagh  74.57242 28.91264
## 3    Jan     Athboy  84.94759 32.84887
## 4    Jan    Belfast 101.20718 40.60078
## 5    Jan       Birr  79.92074 33.74263
## 6    Jan Cappoquinn 153.97159 66.33995



mutate - Create new variables in a data frame


rain %>% group_by(Month) %>% 
  summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) %>%
  mutate(cv_rain=100 * sd_rain/mean_rain)  -> rain_mnsdcv
head(rain_mnsdcv)
## # A tibble: 6 × 4
##    Month mean_rain  sd_rain  cv_rain
##   <fctr>     <dbl>    <dbl>    <dbl>
## 1    Jan 112.64355 57.56599 51.10456
## 2    Feb  83.24975 51.45153 61.80382
## 3    Mar  79.53280 44.33714 55.74699
## 4    Apr  68.74165 36.39073 52.93841
## 5    May  71.31769 37.20539 52.16853
## 6    Jun  72.74104 40.92818 56.26560

  • 📖 CV: Coefficient of variation - standard deviation as a percentage of mean.
  • Shows relative variability of rainfall


Relative variability visualised…


barplot(rain_mnsdcv$cv_rain,names=rain_mnsdcv$Month,las=3,col='indianred')



… and absolute variability


barplot(rain_mnsdcv$sd_rain,names=rain_mnsdcv$Month,las=3,col='indianred')



Focusing on time series



⩰ Components and superposition of effects



Source: http://aplus.com/a/instachaz-painfully-accurate-graphs-about-life



📆 🕙 Time Series are often several superimposed effects


  • Cyclic
    • Seasonal variation in rainfall
    • Salary on payday
  • Long Term Trend
    • Overall increase/decrease in rainfall
    • Whether I’m getting richer or poorer
  • Random (Noise)
    • Very wet one-off days
    • Unplanned big spend on car repair


⪸ As seen in the rainfall data?


rain %>% group_by(Year,Month) %>% summarise(rf=sum(Rainfall)) -> monthly_total
plot(monthly_total$rf[1:120],type='b',ann=F, axes=F, cex=0.6)
axis(1,at=seq(0,120,by=12),labels=1850:1860); axis(2); title('Total Irish Rainfall 1850-1860')



⏳ The R time-series object


ts - A time series object that is useful for some functions…
monthly_total$rf %>% ts(freq=12,start=1850) -> rain_ts # Time serioes object
rain_ts %>% window(c(1850,1),c(1853,12)) # window from jan 1850 - dec 1853
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1850 2836.3 2158.9  964.1 3457.2 1492.1 1362.4 2584.8 1906.4 1763.1 1567.8
## 1851 4875.5 1379.9 1997.9 1368.1 1124.2 2537.1 2258.3 2416.2 1234.5 2834.2
## 1852 4036.1 2311.2 1116.0 1110.0 1754.6 4484.9 1716.0 2752.0 1433.1 2217.7
## 1853 3618.6 1073.9 2088.0 1881.9  829.1 1936.7 2512.3 2072.0 1504.1 4627.6
##         Nov    Dec
## 1850 2828.5 2600.1
## 1851 1134.3 1719.2
## 1852 5581.2 5205.9
## 1853 2651.1 1209.3

Note c(1850,1) is January 1850 etc. when using functions like window



🌔 The monthplot function


Plot to show average levels for each month and individual yearly variations from this.
rain_ts %>% window(c(1850,1),c(1869,12)) %>%
  monthplot(col='dodgerblue',col.base='indianred',lwd.base=3)



… for the whole data set:


Remove the window filter to see whole data (1850-2014)
rain_ts %>% 
  monthplot(col='dodgerblue',col.base='indianred',lwd.base=3)



Decomposition


View data as trend + seasonal + random components:

rain_ts  %>% 
  stl(s.window='periodic') -> rain_decomp
rain_decomp
##           seasonal    trend   remainder
## Jan 1850  541.6231 2290.835     3.84188
## Feb 1850 -193.1494 2273.775    78.27395
## Mar 1850 -286.0009 2256.716 -1006.61494
## Apr 1850 -555.8361 2244.188  1768.84808
## May 1850 -491.4910 2231.660  -248.06925
## Jun 1850 -456.0382 2225.708  -407.27014
## Jul 1850 -219.3928 2219.756   584.43633

R. B. Cleveland, W. S. Cleveland, J.E. McRae, and I. Terpenning (1990) STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, v6, pp3–73.



Graphical representation


rain_ts  %>% window(c(1850,1),c(1859,12)) %>% 
  stl(s.window='periodic') %>% plot


… for all of the data:


rain_ts  %>% stl(s.window='periodic') %>% plot


… for all of the data with more smoothing:


rain_ts  %>% stl(s.window='periodic',t.window=361) %>% plot


Further issues for time series


A closer look at the error term


  • You can extract the error term from an stl object
  • column 3 is the error term of the time.series component
rain_ts  %>% stl(s.window='periodic',t.window=361) -> rain_decomp2
rain_decomp2$time.series[,3] %>% window(c(1850,1),c(1853,12)) 
##              Jan         Feb         Mar         Apr         May
## 1850   184.79467   241.67250  -860.76751  1901.54871  -128.51254
## 1851  2215.27112  -546.05105   164.30894  -196.27484  -505.13609
## 1852  1367.14757   376.52540  -726.31461  -463.09839   116.54036
## 1853   940.92402  -869.49815   236.94120   300.03678  -817.74510
##              Jun         Jul         Aug         Sep         Oct
## 1850  -294.35012   690.72311  -361.86057  -327.84485 -1036.83437
## 1851   871.62633   355.49956   139.21588  -865.16840   220.84208
## 1852  2810.70278  -195.52399   466.29233  -675.29195  -404.38147
## 1853   253.69668   591.94927  -222.55505  -613.15997  1996.62987
##              Nov         Dec
## 1850   285.70343  -125.77091
## 1851 -1417.22012 -1015.39446
## 1852  3020.95633  2462.58199
## 1853    81.94704 -1542.94794

Temporal Autocorrelation


  • This is the degree to which current values of a variable depend on previous values
  • Autocorrelation at lag 1 is correlation between \(x_t\) and \(x_{t-1}\)
    • often written as \(\rho(x_t,x_{t-1})\) or for shorthand \(\rho_1\).
    • … and so on. For lag \(k\) it is \(\rho(x_t,x_{t-k})\), or \(\rho_k\).

  • A graph of \(\rho_k\) vs. \(k\) is called a *correlogram
  • Looking at correlograms helps further understand the process driving the time series
    • In particular, if a particular month has an unusually high or low rainfall level how long is this anomaly likely to persist?

The correlogram


  • This is provided by acf (AutoCorrelation Function)
  • Blue dotted lines show significant values at 95% : Significant 6 month lag negative correlation
  • Initial spike fairly irrelevent - just says \(\rho_0 = 1\) - obviously so as this is just \(\rho(x_i,x_i)\)!
rain_decomp2$time.series[,3] %>% acf(main='Rainfall Correlogram')


The frequency domain


  • Another way to think about time series is in the frequency domain
  • Identify ‘spikes’ at different frequencies - ie 1 cycle per year, 2 cycles per year etc. - use the spectrum function
rain_ts %>% spectrum(spans=3,log='dB') # spans controls smoothing on spectral estimate 


Conclusion


💡 New ideas

  • New R ideas
    • dplyr and pipeline-based commands, and related functions
    • ts objects for time and related functions
  • New time series ideas
    • Graphical tools: monthplot etc.
    • Decomposition: trend/seasonal/random
    • Autocorrelation: How long do anomalies last?
    • Spectral Analysis: useful to check for unexpected cycles
  • Next lecture - Interactive methods