Introduction to the dplyr
package
🎁 Recap for the package dplyr
library(dplyr)
rain %>% group_by(Station) %>%
summarise(mrain=mean(Rainfall)) -> rain_summary
head(rain_summary)
## # A tibble: 6 x 2
## Station mrain
## <chr> <dbl>
## 1 Ardara 140.
## 2 Armagh 68.3
## 3 Athboy 74.7
## 4 Belfast 87.1
## 5 Birr 70.8
## 6 Cappoquinn 121.
- The
%>%
command acts as a ‘pipeline’.
x %>% f(y)
is the same as f(x,y)
.
x %>% f1(y) %>% f2(z)
is the same as f2(f1(x,y),z)
but easier to read.
📆 Group by month
rain %>% group_by(Month) %>%
summarise(mrain=mean(Rainfall)) -> rain_months
head(rain_months)
## # A tibble: 6 x 2
## Month mrain
## <fct> <dbl>
## 1 Jan 113.
## 2 Feb 83.2
## 3 Mar 79.5
## 4 Apr 68.7
## 5 May 71.3
## 6 Jun 72.7
summarise
applies a summary function to each group and creates a new data frame with one entry for each group. Any R summary function can be used - eg median
, sd
, max
or a user defined function.
📊 Basic graphic investigation
barplot(rain_months$mrain,names=rain_months$Month,las=3,col='dodgerblue')

📊 Shorter version using with
with(rain_months,barplot(mrain,names=Month,las=3,col='dodgerblue'))

📈 Also for yearly data
rain %>% group_by(Year) %>%
summarise(total_rain=sum(Rainfall)) -> rain_years
## `summarise()` ungrouping output (override with `.groups` argument)
with(rain_years,plot(Year,total_rain,type='l',col='dodgerblue'))

✂️ Filter for individual stations using filter
rain %>% group_by(Year) %>%
filter(Station=='Strokestown') %>%
summarise(total_rain=sum(Rainfall)) -> rain_years_str
## `summarise()` ungrouping output (override with `.groups` argument)
with(rain_years_str,plot(Year,total_rain,type='l',col='dodgerblue'))

🔢 Multiple groupings - eg average by month/station combination
rain %>% group_by(Month,Station) %>%
summarise(mean_rain=mean(Rainfall)) -> rain_season_station
## `summarise()` regrouping output by 'Month' (override with `.groups` argument)
head(rain_season_station)
## # A tibble: 6 x 3
## # Groups: Month [1]
## Month Station mean_rain
## <fct> <chr> <dbl>
## 1 Jan Ardara 175.
## 2 Jan Armagh 74.6
## 3 Jan Athboy 84.9
## 4 Jan Belfast 101.
## 5 Jan Birr 79.9
## 6 Jan Cappoquinn 154.
Note that with
is implicit in the dplyr
commands such as filter
and summarise
Re-arrange using pivot_wider
function
rain_season_station %>%
pivot_wider(values_from='mean_rain',names_from='Month')
## # A tibble: 25 x 13
## Station Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Ardara 175. 127. 123. 98.8 96.9 105. 124. 145. 153. 174. 176.
## 2 Armagh 74.6 56.0 56.5 53.7 59.2 62.7 72.5 81.9 69.0 80.9 73.7
## 3 Athboy 84.9 62.6 62.4 59.0 62.2 68.1 76.3 88.8 76.4 88.1 81.0
## 4 Belfast 101. 74.5 73.1 65.9 69.2 74.5 87.7 102. 87.1 106. 100.
## 5 Birr 79.9 57.9 58.4 54.1 60.3 62.0 75.1 86.8 72.2 83.8 77.9
## 6 Cappoq… 154. 118. 110. 94.0 95.8 94.9 104. 125. 116. 147. 141.
## 7 Cork A… 138. 103. 92.3 76.7 77.2 71.5 76.9 92.8 91.7 120. 121.
## 8 Derry 100. 74.6 70.2 62.9 64.0 72.9 83.4 94.1 85.3 103. 100.
## 9 Drumsna 97.2 74.4 71.9 63.6 68.9 73.0 82.1 97.0 83.1 97.4 95.3
## 10 Dublin… 63.7 49.0 51.3 49.2 56.1 56.3 64.1 74.9 61.8 73.8 68.6
## # … with 15 more rows, and 1 more variable: Dec <dbl>
- Now results organised into 2D array
… which makes a useful heatmap
rain_season_station %>%
pivot_wider(values_from='mean_rain',names_from='Month') -> temp
temp %>% select(-Station) %>% as.matrix() -> rainmat
temp$Station -> rownames(rainmat)
heatmap(rainmat,Colv=NA)
Multiple summary stats are also possible
rain %>% group_by(Month,Station) %>%
summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) -> rain_mnsd
head(rain_mnsd)
## # A tibble: 6 x 4
## # Groups: Month [1]
## Month Station mean_rain sd_rain
## <fct> <chr> <dbl> <dbl>
## 1 Jan Ardara 175. 64.6
## 2 Jan Armagh 74.6 28.9
## 3 Jan Athboy 84.9 32.8
## 4 Jan Belfast 101. 40.6
## 5 Jan Birr 79.9 33.7
## 6 Jan Cappoquinn 154. 66.3
mutate
- Create new variables in a data frame
rain %>% group_by(Month) %>%
summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) %>%
mutate(cv_rain=100 * sd_rain/mean_rain) -> rain_mnsdcv
head(rain_mnsdcv)
## # A tibble: 6 x 4
## Month mean_rain sd_rain cv_rain
## <fct> <dbl> <dbl> <dbl>
## 1 Jan 113. 57.6 51.1
## 2 Feb 83.2 51.5 61.8
## 3 Mar 79.5 44.3 55.7
## 4 Apr 68.7 36.4 52.9
## 5 May 71.3 37.2 52.2
## 6 Jun 72.7 40.9 56.3
- 📖 CV: Coefficient of variation - standard deviation as a percentage of mean.
- Shows relative variability of rainfall
Relative variability visualised…
barplot(rain_mnsdcv$cv_rain,names=rain_mnsdcv$Month,las=3,col='indianred')

… and absolute variability
barplot(rain_mnsdcv$sd_rain,names=rain_mnsdcv$Month,las=3,col='indianred')

Focusing on time series
📆 🕙 Time Series are often several superimposed effects
⪸ As seen in the rainfall data?
rain %>% group_by(Year,Month) %>% summarise(rf=sum(Rainfall)) -> monthly_total
plot(monthly_total$rf[1:120],type='b',ann=F, axes=F, cex=0.6)
axis(1,at=seq(0,120,by=12),labels=1850:1860); axis(2); title('Total Irish Rainfall 1850-1860')

⏳ The R time-series object
ts
- A time series object that is useful for some functions…
monthly_total$rf %>% ts(freq=12,start=1850) -> rain_ts # Time serioes object
rain_ts %>% window(c(1850,1),c(1853,12)) # window from jan 1850 - dec 1853
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1850 2836.3 2158.9 964.1 3457.2 1492.1 1362.4 2584.8 1906.4 1763.1 1567.8
## 1851 4875.5 1379.9 1997.9 1368.1 1124.2 2537.1 2258.3 2416.2 1234.5 2834.2
## 1852 4036.1 2311.2 1116.0 1110.0 1754.6 4484.9 1716.0 2752.0 1433.1 2217.7
## 1853 3618.6 1073.9 2088.0 1881.9 829.1 1936.7 2512.3 2072.0 1504.1 4627.6
## Nov Dec
## 1850 2828.5 2600.1
## 1851 1134.3 1719.2
## 1852 5581.2 5205.9
## 1853 2651.1 1209.3
Note c(1850,1)
is January 1850 etc. when using functions like window
🌔 The monthplot
function
Plot to show average levels for each month and individual yearly variations from this.
rain_ts %>% window(c(1850,1),c(1869,12)) %>%
monthplot(col='dodgerblue',col.base='indianred',lwd.base=3)

… for the whole data set:
Remove the window filter to see whole data (1850-2014)
rain_ts %>%
monthplot(col='dodgerblue',col.base='indianred',lwd.base=3)

Decomposition
View data as trend + seasonal + random components:
rain_ts %>%
stl(s.window='periodic') -> rain_decomp
rain_decomp
## seasonal trend remainder
## Jan 1850 541.6231 2290.835 3.84188
## Feb 1850 -193.1494 2273.775 78.27395
## Mar 1850 -286.0009 2256.716 -1006.61494
## Apr 1850 -555.8361 2244.188 1768.84808
## May 1850 -491.4910 2231.660 -248.06925
## Jun 1850 -456.0382 2225.708 -407.27014
## Jul 1850 -219.3928 2219.756 584.43633
R. B. Cleveland, W. S. Cleveland, J.E. McRae, and I. Terpenning (1990) STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, v6, pp3–73.
Graphical representation
rain_ts %>% window(c(1850,1),c(1859,12)) %>%
stl(s.window='periodic') %>% plot

… for all of the data:
rain_ts %>% stl(s.window='periodic') %>% plot

… for all of the data with more smoothing:
rain_ts %>% stl(s.window='periodic',t.window=361) %>% plot

Further issues for time series
A closer look at the error term
- You can extract the error term from an
stl
object
- column 3 is the error term of the
time.series
component
rain_ts %>% stl(s.window='periodic',t.window=361) -> rain_decomp2
rain_decomp2$time.series[,3] %>% window(c(1850,1),c(1853,12))
## Jan Feb Mar Apr May Jun
## 1850 184.79467 241.67250 -860.76751 1901.54871 -128.51254 -294.35012
## 1851 2215.27112 -546.05105 164.30894 -196.27484 -505.13609 871.62633
## 1852 1367.14757 376.52540 -726.31461 -463.09839 116.54036 2810.70278
## 1853 940.92402 -869.49815 236.94120 300.03678 -817.74510 253.69668
## Jul Aug Sep Oct Nov Dec
## 1850 690.72311 -361.86057 -327.84485 -1036.83437 285.70343 -125.77091
## 1851 355.49956 139.21588 -865.16840 220.84208 -1417.22012 -1015.39446
## 1852 -195.52399 466.29233 -675.29195 -404.38147 3020.95633 2462.58199
## 1853 591.94927 -222.55505 -613.15997 1996.62987 81.94704 -1542.94794
Temporal Autocorrelation
- This is the degree to which current values of a variable depend on previous values
- Autocorrelation at lag 1 is correlation between \(x_t\) and \(x_{t-1}\)
- often written as \(\rho(x_t,x_{t-1})\) or for shorthand \(\rho_1\).
- … and so on. For lag \(k\) it is \(\rho(x_t,x_{t-k})\), or \(\rho_k\).
- A graph of \(\rho_k\) vs. \(k\) is called a *correlogram
- Looking at correlograms helps further understand the process driving the time series
- In particular, if a particular month has an unusually high or low rainfall level how long is this anomaly likely to persist?
The correlogram
- This is provided by
acf
(AutoCorrelation Function)
- Blue dotted lines show significant values at 95% : Significant 6 month lag negative correlation
- Initial spike fairly irrelevent - just says \(\rho_0 = 1\) - obviously so as this is just \(\rho(x_i,x_i)\)!
rain_decomp2$time.series[,3] %>% acf(main='Rainfall Correlogram')

The frequency domain
- Another way to think about time series is in the frequency domain
- Identify ‘spikes’ at different frequencies - ie 1 cycle per year, 2 cycles per year etc. - use the
spectrum
function
rain_ts %>% spectrum(spans=3,log='dB') # spans controls smoothing on spectral estimate
