Further Topics in R and Data Analysis

Advanced R and Data Analysis

Overview for Rest of Module

More on time series
Making interactive tools
Some new spatial techniques
Reproducibility

R studio

A useful tool
Download and install (it’s free) - https://www.rstudio.com/products/rstudio/download/

Looking at the rainfall data

💧 Main working data set is `rainfall.RData`…

📥 Get it from Teams in the usual folder.

load('rainfall.RData')
head(stations)

## # A tibble: 6 x 9
##   Station   Elevation Easting Northing   Lat  Long County  Abbreviation Source  
##   <chr>         <int>   <dbl>    <dbl> <dbl> <dbl> <chr>   <chr>        <chr>   
## 1 Athboy           87  270400   261700  53.6 -6.93 Meath   AB           Met Eir…
## 2 Foulksmi…        71  284100   118400  52.3 -6.77 Wexford F            Met Eir…
## 3 Mullingar       112  241780   247765  53.5 -7.37 Westme… M            Met Eir…
## 4 Portlaw           8  246600   115200  52.3 -7.31 Waterf… P            Met Eir…
## 5 Rathdrum        131  319700   186000  52.9 -6.22 Wicklow RD           Met Eir…
## 6 Strokest…        49  194500   279100  53.8 -8.1  Roscom… S            Met Eir…

This packages all of the rainfall information

head(rain)

## # A tibble: 6 x 4
##    Year Month Rainfall Station
##   <dbl> <fct>    <dbl> <chr>  
## 1  1850 Jan      169   Ardara 
## 2  1851 Jan      236.  Ardara 
## 3  1852 Jan      250.  Ardara 
## 4  1853 Jan      209.  Ardara 
## 5  1854 Jan      188.  Ardara 
## 6  1855 Jan       32.3 Ardara

Its been converted to an R binary data file for convenience.
- There are no missing values.
- Data runs from 1850 to 2014 by month.
- 👍🏻Thanks to Prof. Conor Murphy and Simon Noone for supplying.

Introduction to the `dplyr` package

🎁 Recap for the package `dplyr`

library(dplyr)
rain %>% group_by(Station) %>% 
  summarise(mrain=mean(Rainfall))  -> rain_summary
head(rain_summary)

## # A tibble: 6 x 2
##   Station    mrain
##   <chr>      <dbl>
## 1 Ardara     140. 
## 2 Armagh      68.3
## 3 Athboy      74.7
## 4 Belfast     87.1
## 5 Birr        70.8
## 6 Cappoquinn 121.

The %>% command acts as a ‘pipeline’.
- x %>% f(y) is the same as f(x,y).
- x %>% f1(y) %>% f2(z) is the same as f2(f1(x,y),z) but easier to read.

📆 Group by month

rain %>% group_by(Month) %>% 
  summarise(mrain=mean(Rainfall)) -> rain_months
head(rain_months)

## # A tibble: 6 x 2
##   Month mrain
##   <fct> <dbl>
## 1 Jan   113. 
## 2 Feb    83.2
## 3 Mar    79.5
## 4 Apr    68.7
## 5 May    71.3
## 6 Jun    72.7

summarise applies a summary function to each group and creates a new data frame with one entry for each group. Any R summary function can be used - eg median, sd, max or a user defined function.

📊 Basic graphic investigation

barplot(rain_months$mrain,names=rain_months$Month,las=3,col='dodgerblue')

📊 Shorter version using `with`

with(rain_months,barplot(mrain,names=Month,las=3,col='dodgerblue'))

📈 Also for yearly data

rain %>% group_by(Year) %>% 
  summarise(total_rain=sum(Rainfall)) -> rain_years

## `summarise()` ungrouping output (override with `.groups` argument)

with(rain_years,plot(Year,total_rain,type='l',col='dodgerblue'))

✂️ Filter for individual stations using `filter`

rain %>% group_by(Year) %>% 
  filter(Station=='Strokestown') %>%
  summarise(total_rain=sum(Rainfall)) -> rain_years_str

## `summarise()` ungrouping output (override with `.groups` argument)

with(rain_years_str,plot(Year,total_rain,type='l',col='dodgerblue'))

🔢 Multiple groupings - eg average by month/station combination

rain %>% group_by(Month,Station) %>% 
  summarise(mean_rain=mean(Rainfall)) -> rain_season_station

## `summarise()` regrouping output by 'Month' (override with `.groups` argument)

head(rain_season_station)

## # A tibble: 6 x 3
## # Groups:   Month [1]
##   Month Station    mean_rain
##   <fct> <chr>          <dbl>
## 1 Jan   Ardara         175. 
## 2 Jan   Armagh          74.6
## 3 Jan   Athboy          84.9
## 4 Jan   Belfast        101. 
## 5 Jan   Birr            79.9
## 6 Jan   Cappoquinn     154.

Note that with is implicit in the dplyr commands such as filter and summarise

Re-arrange using `pivot_wider` function

rain_season_station %>% 
  pivot_wider(values_from='mean_rain',names_from='Month')

## # A tibble: 25 x 13
##    Station   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
##    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Ardara  175.  127.  123.   98.8  96.9 105.  124.  145.  153.  174.  176. 
##  2 Armagh   74.6  56.0  56.5  53.7  59.2  62.7  72.5  81.9  69.0  80.9  73.7
##  3 Athboy   84.9  62.6  62.4  59.0  62.2  68.1  76.3  88.8  76.4  88.1  81.0
##  4 Belfast 101.   74.5  73.1  65.9  69.2  74.5  87.7 102.   87.1 106.  100. 
##  5 Birr     79.9  57.9  58.4  54.1  60.3  62.0  75.1  86.8  72.2  83.8  77.9
##  6 Cappoq… 154.  118.  110.   94.0  95.8  94.9 104.  125.  116.  147.  141. 
##  7 Cork A… 138.  103.   92.3  76.7  77.2  71.5  76.9  92.8  91.7 120.  121. 
##  8 Derry   100.   74.6  70.2  62.9  64.0  72.9  83.4  94.1  85.3 103.  100. 
##  9 Drumsna  97.2  74.4  71.9  63.6  68.9  73.0  82.1  97.0  83.1  97.4  95.3
## 10 Dublin…  63.7  49.0  51.3  49.2  56.1  56.3  64.1  74.9  61.8  73.8  68.6
## # … with 15 more rows, and 1 more variable: Dec <dbl>

Now results organised into 2D array

… which makes a useful heatmap

rain_season_station %>% 
  pivot_wider(values_from='mean_rain',names_from='Month') -> temp  
temp %>% select(-Station) %>% as.matrix() -> rainmat
temp$Station -> rownames(rainmat) 
heatmap(rainmat,Colv=NA)

Multiple summary stats are also possible

rain %>% group_by(Month,Station) %>% 
  summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) -> rain_mnsd
head(rain_mnsd)

## # A tibble: 6 x 4
## # Groups:   Month [1]
##   Month Station    mean_rain sd_rain
##   <fct> <chr>          <dbl>   <dbl>
## 1 Jan   Ardara         175.     64.6
## 2 Jan   Armagh          74.6    28.9
## 3 Jan   Athboy          84.9    32.8
## 4 Jan   Belfast        101.     40.6
## 5 Jan   Birr            79.9    33.7
## 6 Jan   Cappoquinn     154.     66.3

`mutate` - Create new variables in a data frame

rain %>% group_by(Month) %>% 
  summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) %>%
  mutate(cv_rain=100 * sd_rain/mean_rain)  -> rain_mnsdcv
head(rain_mnsdcv)

## # A tibble: 6 x 4
##   Month mean_rain sd_rain cv_rain
##   <fct>     <dbl>   <dbl>   <dbl>
## 1 Jan       113.     57.6    51.1
## 2 Feb        83.2    51.5    61.8
## 3 Mar        79.5    44.3    55.7
## 4 Apr        68.7    36.4    52.9
## 5 May        71.3    37.2    52.2
## 6 Jun        72.7    40.9    56.3

📖 CV: Coefficient of variation - standard deviation as a percentage of mean.
- Shows relative variability of rainfall

Relative variability visualised…

barplot(rain_mnsdcv$cv_rain,names=rain_mnsdcv$Month,las=3,col='indianred')

… and absolute variability

barplot(rain_mnsdcv$sd_rain,names=rain_mnsdcv$Month,las=3,col='indianred')

Focusing on time series

⩰ Components and superposition of effects

Source: http://aplus.com/a/instachaz-painfully-accurate-graphs-about-life

📆 🕙 Time Series are often several superimposed effects

Cyclic
- Seasonal variation in rainfall
- Salary on payday
- Long Term Trend
  - Overall increase/decrease in rainfall
  - Whether I’m getting richer or poorer
- Random (Noise)
  - Very wet one-off days
  - Unplanned big spend on car repair

⪸ As seen in the rainfall data?

rain %>% group_by(Year,Month) %>% summarise(rf=sum(Rainfall)) -> monthly_total
plot(monthly_total$rf[1:120],type='b',ann=F, axes=F, cex=0.6)
axis(1,at=seq(0,120,by=12),labels=1850:1860); axis(2); title('Total Irish Rainfall 1850-1860')

⏳ The R time-series object

ts - A time series object that is useful for some functions…

monthly_total$rf %>% ts(freq=12,start=1850) -> rain_ts # Time serioes object
rain_ts %>% window(c(1850,1),c(1853,12)) # window from jan 1850 - dec 1853

##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1850 2836.3 2158.9  964.1 3457.2 1492.1 1362.4 2584.8 1906.4 1763.1 1567.8
## 1851 4875.5 1379.9 1997.9 1368.1 1124.2 2537.1 2258.3 2416.2 1234.5 2834.2
## 1852 4036.1 2311.2 1116.0 1110.0 1754.6 4484.9 1716.0 2752.0 1433.1 2217.7
## 1853 3618.6 1073.9 2088.0 1881.9  829.1 1936.7 2512.3 2072.0 1504.1 4627.6
##         Nov    Dec
## 1850 2828.5 2600.1
## 1851 1134.3 1719.2
## 1852 5581.2 5205.9
## 1853 2651.1 1209.3

Note c(1850,1) is January 1850 etc. when using functions like window

🌔 The `monthplot` function

Plot to show average levels for each month and individual yearly variations from this.

rain_ts %>% window(c(1850,1),c(1869,12)) %>%
  monthplot(col='dodgerblue',col.base='indianred',lwd.base=3)

… for the whole data set:

Remove the window filter to see whole data (1850-2014)

rain_ts %>% 
  monthplot(col='dodgerblue',col.base='indianred',lwd.base=3)

Decomposition

View data as trend + seasonal + random components:

rain_ts  %>% 
  stl(s.window='periodic') -> rain_decomp
rain_decomp

##           seasonal    trend   remainder
## Jan 1850  541.6231 2290.835     3.84188
## Feb 1850 -193.1494 2273.775    78.27395
## Mar 1850 -286.0009 2256.716 -1006.61494
## Apr 1850 -555.8361 2244.188  1768.84808
## May 1850 -491.4910 2231.660  -248.06925
## Jun 1850 -456.0382 2225.708  -407.27014
## Jul 1850 -219.3928 2219.756   584.43633

R. B. Cleveland, W. S. Cleveland, J.E. McRae, and I. Terpenning (1990) STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, v6, pp3–73.

Graphical representation

rain_ts  %>% window(c(1850,1),c(1859,12)) %>% 
  stl(s.window='periodic') %>% plot

… for all of the data:

rain_ts  %>% stl(s.window='periodic') %>% plot

… for all of the data with more smoothing:

rain_ts  %>% stl(s.window='periodic',t.window=361) %>% plot

Further issues for time series

A closer look at the error term

You can extract the error term from an stl object
- column 3 is the error term of the time.series component

rain_ts  %>% stl(s.window='periodic',t.window=361) -> rain_decomp2
rain_decomp2$time.series[,3] %>% window(c(1850,1),c(1853,12))

##              Jan         Feb         Mar         Apr         May         Jun
## 1850   184.79467   241.67250  -860.76751  1901.54871  -128.51254  -294.35012
## 1851  2215.27112  -546.05105   164.30894  -196.27484  -505.13609   871.62633
## 1852  1367.14757   376.52540  -726.31461  -463.09839   116.54036  2810.70278
## 1853   940.92402  -869.49815   236.94120   300.03678  -817.74510   253.69668
##              Jul         Aug         Sep         Oct         Nov         Dec
## 1850   690.72311  -361.86057  -327.84485 -1036.83437   285.70343  -125.77091
## 1851   355.49956   139.21588  -865.16840   220.84208 -1417.22012 -1015.39446
## 1852  -195.52399   466.29233  -675.29195  -404.38147  3020.95633  2462.58199
## 1853   591.94927  -222.55505  -613.15997  1996.62987    81.94704 -1542.94794

Temporal Autocorrelation

This is the degree to which current values of a variable depend on previous values
- Autocorrelation at lag 1 is correlation between \(x_t\) and \(x_{t-1}\)
  - often written as \(\rho(x_t,x_{t-1})\) or for shorthand \(\rho_1\).
  - … and so on. For lag \(k\) it is \(\rho(x_t,x_{t-k})\), or \(\rho_k\).

A graph of \(\rho_k\) vs. \(k\) is called a *correlogram
- Looking at correlograms helps further understand the process driving the time series
  - In particular, if a particular month has an unusually high or low rainfall level how long is this anomaly likely to persist?

The correlogram

This is provided by acf (AutoCorrelation Function)
- Blue dotted lines show significant values at 95% : Significant 6 month lag negative correlation
- Initial spike fairly irrelevent - just says \(\rho_0 = 1\) - obviously so as this is just \(\rho(x_i,x_i)\)!

rain_decomp2$time.series[,3] %>% acf(main='Rainfall Correlogram')

The frequency domain

Another way to think about time series is in the frequency domain
- Identify ‘spikes’ at different frequencies - ie 1 cycle per year, 2 cycles per year etc. - use the spectrum function

rain_ts %>% spectrum(spans=3,log='dB') # spans controls smoothing on spectral estimate

Conclusion

💡 New ideas

New R ideas
- dplyr and pipeline-based commands, and related functions
- ts objects for time and related functions
- New time series ideas
  - Graphical tools: monthplot etc.
  - Decomposition: trend/seasonal/random
  - Autocorrelation: How long do anomalies last?
  - Spectral Analysis: useful to check for unexpected cycles
- Next lecture - Interactive methods