Advanced R and Data Analysis

Overview for Rest of Module

Looking at the rainfall data

💧 Main working data set is rainfall.RData


📥 Get it from Teams in the usual folder.
load('rainfall.RData')
head(stations)
## # A tibble: 6 x 9
##   Station   Elevation Easting Northing   Lat  Long County  Abbreviation Source  
##   <chr>         <int>   <dbl>    <dbl> <dbl> <dbl> <chr>   <chr>        <chr>   
## 1 Athboy           87  270400   261700  53.6 -6.93 Meath   AB           Met Eir…
## 2 Foulksmi…        71  284100   118400  52.3 -6.77 Wexford F            Met Eir…
## 3 Mullingar       112  241780   247765  53.5 -7.37 Westme… M            Met Eir…
## 4 Portlaw           8  246600   115200  52.3 -7.31 Waterf… P            Met Eir…
## 5 Rathdrum        131  319700   186000  52.9 -6.22 Wicklow RD           Met Eir…
## 6 Strokest…        49  194500   279100  53.8 -8.1  Roscom… S            Met Eir…

This packages all of the rainfall information


head(rain)
## # A tibble: 6 x 4
##    Year Month Rainfall Station
##   <dbl> <fct>    <dbl> <chr>  
## 1  1850 Jan      169   Ardara 
## 2  1851 Jan      236.  Ardara 
## 3  1852 Jan      250.  Ardara 
## 4  1853 Jan      209.  Ardara 
## 5  1854 Jan      188.  Ardara 
## 6  1855 Jan       32.3 Ardara

  • Its been converted to an R binary data file for convenience.
    • There are no missing values.
    • Data runs from 1850 to 2014 by month.
    • 👍🏻Thanks to Prof. Conor Murphy and Simon Noone for supplying.

Introduction to the dplyr package

🎁 Recap for the package dplyr


library(dplyr)
rain %>% group_by(Station) %>% 
  summarise(mrain=mean(Rainfall))  -> rain_summary
head(rain_summary)
## # A tibble: 6 x 2
##   Station    mrain
##   <chr>      <dbl>
## 1 Ardara     140. 
## 2 Armagh      68.3
## 3 Athboy      74.7
## 4 Belfast     87.1
## 5 Birr        70.8
## 6 Cappoquinn 121.

  • The %>% command acts as a ‘pipeline’.
    • x %>% f(y) is the same as f(x,y).
    • x %>% f1(y) %>% f2(z) is the same as f2(f1(x,y),z) but easier to read.

📆 Group by month


rain %>% group_by(Month) %>% 
  summarise(mrain=mean(Rainfall)) -> rain_months
head(rain_months)
## # A tibble: 6 x 2
##   Month mrain
##   <fct> <dbl>
## 1 Jan   113. 
## 2 Feb    83.2
## 3 Mar    79.5
## 4 Apr    68.7
## 5 May    71.3
## 6 Jun    72.7

summarise applies a summary function to each group and creates a new data frame with one entry for each group. Any R summary function can be used - eg median, sd, max or a user defined function.

📊 Basic graphic investigation


barplot(rain_months$mrain,names=rain_months$Month,las=3,col='dodgerblue')

📊 Shorter version using with


with(rain_months,barplot(mrain,names=Month,las=3,col='dodgerblue'))

📈 Also for yearly data


rain %>% group_by(Year) %>% 
  summarise(total_rain=sum(Rainfall)) -> rain_years
## `summarise()` ungrouping output (override with `.groups` argument)
with(rain_years,plot(Year,total_rain,type='l',col='dodgerblue'))

✂️ Filter for individual stations using filter


rain %>% group_by(Year) %>% 
  filter(Station=='Strokestown') %>%
  summarise(total_rain=sum(Rainfall)) -> rain_years_str
## `summarise()` ungrouping output (override with `.groups` argument)
with(rain_years_str,plot(Year,total_rain,type='l',col='dodgerblue'))

🔢 Multiple groupings - eg average by month/station combination


rain %>% group_by(Month,Station) %>% 
  summarise(mean_rain=mean(Rainfall)) -> rain_season_station
## `summarise()` regrouping output by 'Month' (override with `.groups` argument)
head(rain_season_station)
## # A tibble: 6 x 3
## # Groups:   Month [1]
##   Month Station    mean_rain
##   <fct> <chr>          <dbl>
## 1 Jan   Ardara         175. 
## 2 Jan   Armagh          74.6
## 3 Jan   Athboy          84.9
## 4 Jan   Belfast        101. 
## 5 Jan   Birr            79.9
## 6 Jan   Cappoquinn     154.

Note that with is implicit in the dplyr commands such as filter and summarise

Re-arrange using pivot_wider function


rain_season_station %>% 
  pivot_wider(values_from='mean_rain',names_from='Month')
## # A tibble: 25 x 13
##    Station   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
##    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Ardara  175.  127.  123.   98.8  96.9 105.  124.  145.  153.  174.  176. 
##  2 Armagh   74.6  56.0  56.5  53.7  59.2  62.7  72.5  81.9  69.0  80.9  73.7
##  3 Athboy   84.9  62.6  62.4  59.0  62.2  68.1  76.3  88.8  76.4  88.1  81.0
##  4 Belfast 101.   74.5  73.1  65.9  69.2  74.5  87.7 102.   87.1 106.  100. 
##  5 Birr     79.9  57.9  58.4  54.1  60.3  62.0  75.1  86.8  72.2  83.8  77.9
##  6 Cappoq… 154.  118.  110.   94.0  95.8  94.9 104.  125.  116.  147.  141. 
##  7 Cork A… 138.  103.   92.3  76.7  77.2  71.5  76.9  92.8  91.7 120.  121. 
##  8 Derry   100.   74.6  70.2  62.9  64.0  72.9  83.4  94.1  85.3 103.  100. 
##  9 Drumsna  97.2  74.4  71.9  63.6  68.9  73.0  82.1  97.0  83.1  97.4  95.3
## 10 Dublin…  63.7  49.0  51.3  49.2  56.1  56.3  64.1  74.9  61.8  73.8  68.6
## # … with 15 more rows, and 1 more variable: Dec <dbl>

  • Now results organised into 2D array

… which makes a useful heatmap


rain_season_station %>% 
  pivot_wider(values_from='mean_rain',names_from='Month') -> temp  
temp %>% select(-Station) %>% as.matrix() -> rainmat
temp$Station -> rownames(rainmat) 
heatmap(rainmat,Colv=NA)

Multiple summary stats are also possible


rain %>% group_by(Month,Station) %>% 
  summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) -> rain_mnsd
head(rain_mnsd)
## # A tibble: 6 x 4
## # Groups:   Month [1]
##   Month Station    mean_rain sd_rain
##   <fct> <chr>          <dbl>   <dbl>
## 1 Jan   Ardara         175.     64.6
## 2 Jan   Armagh          74.6    28.9
## 3 Jan   Athboy          84.9    32.8
## 4 Jan   Belfast        101.     40.6
## 5 Jan   Birr            79.9    33.7
## 6 Jan   Cappoquinn     154.     66.3

mutate - Create new variables in a data frame


rain %>% group_by(Month) %>% 
  summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) %>%
  mutate(cv_rain=100 * sd_rain/mean_rain)  -> rain_mnsdcv
head(rain_mnsdcv)
## # A tibble: 6 x 4
##   Month mean_rain sd_rain cv_rain
##   <fct>     <dbl>   <dbl>   <dbl>
## 1 Jan       113.     57.6    51.1
## 2 Feb        83.2    51.5    61.8
## 3 Mar        79.5    44.3    55.7
## 4 Apr        68.7    36.4    52.9
## 5 May        71.3    37.2    52.2
## 6 Jun        72.7    40.9    56.3

  • 📖 CV: Coefficient of variation - standard deviation as a percentage of mean.
    • Shows relative variability of rainfall

Relative variability visualised…


barplot(rain_mnsdcv$cv_rain,names=rain_mnsdcv$Month,las=3,col='indianred')

… and absolute variability


barplot(rain_mnsdcv$sd_rain,names=rain_mnsdcv$Month,las=3,col='indianred')

Focusing on time series

⩰ Components and superposition of effects

📆 🕙 Time Series are often several superimposed effects


  • Cyclic
    • Seasonal variation in rainfall

    • Salary on payday

    • Long Term Trend

      • Overall increase/decrease in rainfall
      • Whether I’m getting richer or poorer
    • Random (Noise)

      • Very wet one-off days
      • Unplanned big spend on car repair

⪸ As seen in the rainfall data?


rain %>% group_by(Year,Month) %>% summarise(rf=sum(Rainfall)) -> monthly_total
plot(monthly_total$rf[1:120],type='b',ann=F, axes=F, cex=0.6)
axis(1,at=seq(0,120,by=12),labels=1850:1860); axis(2); title('Total Irish Rainfall 1850-1860')

⏳ The R time-series object


ts - A time series object that is useful for some functions…
monthly_total$rf %>% ts(freq=12,start=1850) -> rain_ts # Time serioes object
rain_ts %>% window(c(1850,1),c(1853,12)) # window from jan 1850 - dec 1853
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1850 2836.3 2158.9  964.1 3457.2 1492.1 1362.4 2584.8 1906.4 1763.1 1567.8
## 1851 4875.5 1379.9 1997.9 1368.1 1124.2 2537.1 2258.3 2416.2 1234.5 2834.2
## 1852 4036.1 2311.2 1116.0 1110.0 1754.6 4484.9 1716.0 2752.0 1433.1 2217.7
## 1853 3618.6 1073.9 2088.0 1881.9  829.1 1936.7 2512.3 2072.0 1504.1 4627.6
##         Nov    Dec
## 1850 2828.5 2600.1
## 1851 1134.3 1719.2
## 1852 5581.2 5205.9
## 1853 2651.1 1209.3

Note c(1850,1) is January 1850 etc. when using functions like window

🌔 The monthplot function


Plot to show average levels for each month and individual yearly variations from this.
rain_ts %>% window(c(1850,1),c(1869,12)) %>%
  monthplot(col='dodgerblue',col.base='indianred',lwd.base=3)

… for the whole data set:


Remove the window filter to see whole data (1850-2014)
rain_ts %>% 
  monthplot(col='dodgerblue',col.base='indianred',lwd.base=3)

Decomposition


View data as trend + seasonal + random components:

rain_ts  %>% 
  stl(s.window='periodic') -> rain_decomp
rain_decomp
##           seasonal    trend   remainder
## Jan 1850  541.6231 2290.835     3.84188
## Feb 1850 -193.1494 2273.775    78.27395
## Mar 1850 -286.0009 2256.716 -1006.61494
## Apr 1850 -555.8361 2244.188  1768.84808
## May 1850 -491.4910 2231.660  -248.06925
## Jun 1850 -456.0382 2225.708  -407.27014
## Jul 1850 -219.3928 2219.756   584.43633

R. B. Cleveland, W. S. Cleveland, J.E. McRae, and I. Terpenning (1990) STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, v6, pp3–73.

Graphical representation


rain_ts  %>% window(c(1850,1),c(1859,12)) %>% 
  stl(s.window='periodic') %>% plot

… for all of the data:


rain_ts  %>% stl(s.window='periodic') %>% plot

… for all of the data with more smoothing:


rain_ts  %>% stl(s.window='periodic',t.window=361) %>% plot

Further issues for time series

A closer look at the error term


  • You can extract the error term from an stl object
    • column 3 is the error term of the time.series component
rain_ts  %>% stl(s.window='periodic',t.window=361) -> rain_decomp2
rain_decomp2$time.series[,3] %>% window(c(1850,1),c(1853,12)) 
##              Jan         Feb         Mar         Apr         May         Jun
## 1850   184.79467   241.67250  -860.76751  1901.54871  -128.51254  -294.35012
## 1851  2215.27112  -546.05105   164.30894  -196.27484  -505.13609   871.62633
## 1852  1367.14757   376.52540  -726.31461  -463.09839   116.54036  2810.70278
## 1853   940.92402  -869.49815   236.94120   300.03678  -817.74510   253.69668
##              Jul         Aug         Sep         Oct         Nov         Dec
## 1850   690.72311  -361.86057  -327.84485 -1036.83437   285.70343  -125.77091
## 1851   355.49956   139.21588  -865.16840   220.84208 -1417.22012 -1015.39446
## 1852  -195.52399   466.29233  -675.29195  -404.38147  3020.95633  2462.58199
## 1853   591.94927  -222.55505  -613.15997  1996.62987    81.94704 -1542.94794

Temporal Autocorrelation


  • This is the degree to which current values of a variable depend on previous values
    • Autocorrelation at lag 1 is correlation between \(x_t\) and \(x_{t-1}\)
      • often written as \(\rho(x_t,x_{t-1})\) or for shorthand \(\rho_1\).
      • … and so on. For lag \(k\) it is \(\rho(x_t,x_{t-k})\), or \(\rho_k\).

  • A graph of \(\rho_k\) vs. \(k\) is called a *correlogram
    • Looking at correlograms helps further understand the process driving the time series
      • In particular, if a particular month has an unusually high or low rainfall level how long is this anomaly likely to persist?

The correlogram


  • This is provided by acf (AutoCorrelation Function)
    • Blue dotted lines show significant values at 95% : Significant 6 month lag negative correlation
    • Initial spike fairly irrelevent - just says \(\rho_0 = 1\) - obviously so as this is just \(\rho(x_i,x_i)\)!
rain_decomp2$time.series[,3] %>% acf(main='Rainfall Correlogram')

The frequency domain


  • Another way to think about time series is in the frequency domain
    • Identify ‘spikes’ at different frequencies - ie 1 cycle per year, 2 cycles per year etc. - use the spectrum function
rain_ts %>% spectrum(spans=3,log='dB') # spans controls smoothing on spectral estimate 

Conclusion

💡 New ideas

  • New R ideas
    • dplyr and pipeline-based commands, and related functions
    • ts objects for time and related functions
    • New time series ideas
      • Graphical tools: monthplot etc.
      • Decomposition: trend/seasonal/random
      • Autocorrelation: How long do anomalies last?
      • Spectral Analysis: useful to check for unexpected cycles
    • Next lecture - Interactive methods