Introduction to the dplyr
package
🎁 A new package dplyr
library(dplyr)
rain %>% group_by(Station) %>%
summarise(mrain=mean(Rainfall)) -> rain_summary
head(rain_summary)
## # A tibble: 6 × 2
## Station mrain
## <chr> <dbl>
## 1 Ardara 140.36753
## 2 Armagh 68.32096
## 3 Athboy 74.74356
## 4 Belfast 87.10995
## 5 Birr 70.83498
## 6 Cappoquinn 121.22455
- The
%>%
command acts as a ‘pipeline’.
x %>% f(y)
is the same as f(x,y)
.
x %>% f1(y) %>% f2(z)
is the same as f2(f1(x,y),z)
but easier to read.
📆 Group by month
rain %>% group_by(Month) %>%
summarise(mrain=mean(Rainfall)) -> rain_months
head(rain_months)
## # A tibble: 6 × 2
## Month mrain
## <fctr> <dbl>
## 1 Jan 112.64355
## 2 Feb 83.24975
## 3 Mar 79.53280
## 4 Apr 68.74165
## 5 May 71.31769
## 6 Jun 72.74104
summarise
applies a summary function to each group and creates a new data frame with one entry for each group. Any R summary function can be used - eg median
, sd
, max
or a user defined function.
📊 Basic graphic investigation
barplot(rain_months$mrain,names=rain_months$Month,las=3,col='dodgerblue')

📊 Shorter version using with
with(rain_months,barplot(mrain,names=Month,las=3,col='dodgerblue'))

📈 Also for yearly data
rain %>% group_by(Year) %>%
summarise(total_rain=sum(Rainfall)) -> rain_years
with(rain_years,plot(Year,total_rain,type='l',col='dodgerblue'))

✂️ Filter for individual stations using filter
rain %>% group_by(Year) %>%
filter(Station=='Strokestown') %>%
summarise(total_rain=sum(Rainfall)) -> rain_years_str
with(rain_years_str,plot(Year,total_rain,type='l',col='dodgerblue'))

🔢 Multiple groupings - eg average by month/station combination
rain %>% group_by(Month,Station) %>%
summarise(mean_rain=mean(Rainfall)) -> rain_season_station
head(rain_season_station)
## Source: local data frame [6 x 3]
## Groups: Month [1]
##
## Month Station mean_rain
## <fctr> <chr> <dbl>
## 1 Jan Ardara 174.82606
## 2 Jan Armagh 74.57242
## 3 Jan Athboy 84.94759
## 4 Jan Belfast 101.20718
## 5 Jan Birr 79.92074
## 6 Jan Cappoquinn 153.97159
Note that with
is implicit in the dplyr
commands such as filter
and summarise
Re-arrange using reshape2
package
library(reshape2)
rain_season_station %>% acast(Station~Month) %>% head
## Using mean_rain as value column: use value.var to override.
## Jan Feb Mar Apr May Jun
## Ardara 174.82606 126.82303 123.02000 98.79333 96.90727 105.24061
## Armagh 74.57242 55.97182 56.48879 53.67030 59.23182 62.72939
## Athboy 84.94759 62.62133 62.44944 58.97874 62.16260 68.11460
## Belfast 101.20718 74.50206 73.10221 65.90492 69.23426 74.48525
## Birr 79.92074 57.88501 58.42056 54.07187 60.25831 61.96445
## Cappoquinn 153.97159 117.77099 110.02890 94.00365 95.81437 94.86357
## Jul Aug Sep Oct Nov Dec
## Ardara 123.70485 145.24788 152.80727 174.44788 176.45030 186.14182
## Armagh 72.50636 81.92182 69.02576 80.94242 73.73121 79.05939
## Athboy 76.28662 88.84710 76.35077 88.14627 80.95767 87.05997
## Belfast 87.70003 102.44499 87.11622 106.35394 100.31760 102.95073
## Birr 75.10084 86.75669 72.21722 83.80810 77.87266 81.74334
## Cappoquinn 104.09372 125.42245 116.04466 146.70658 141.10012 154.87402
- Now results organised into 2D array
… which makes a useful heatmap
library(reshape2)
rain_season_station %>% acast(Station~Month) %>% heatmap(Colv=NA)
Why %>%
is useful here
This code takes the rain
data frame and runs through the steps to aggregated, reshape and compute the heat map:
rain %>% group_by(Month,Station) %>% summarise(mean_rain=mean(Rainfall)) %>%
acast(Station~Month) %>%
heatmap(Colv=NA)
So does this code, in ‘traditional’ R format:
heatmap(acast(summarise(group_by(rain,Month,Station),
mean_rain=mean(Rainfall)),Station~Month),Colv=NA)
Which do you think is clearer?
Multiple summary stats are also possible
rain %>% group_by(Month,Station) %>%
summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) -> rain_mnsd
head(rain_mnsd)
## Source: local data frame [6 x 4]
## Groups: Month [1]
##
## Month Station mean_rain sd_rain
## <fctr> <chr> <dbl> <dbl>
## 1 Jan Ardara 174.82606 64.59834
## 2 Jan Armagh 74.57242 28.91264
## 3 Jan Athboy 84.94759 32.84887
## 4 Jan Belfast 101.20718 40.60078
## 5 Jan Birr 79.92074 33.74263
## 6 Jan Cappoquinn 153.97159 66.33995
mutate
- Create new variables in a data frame
rain %>% group_by(Month) %>%
summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) %>%
mutate(cv_rain=100 * sd_rain/mean_rain) -> rain_mnsdcv
head(rain_mnsdcv)
## # A tibble: 6 × 4
## Month mean_rain sd_rain cv_rain
## <fctr> <dbl> <dbl> <dbl>
## 1 Jan 112.64355 57.56599 51.10456
## 2 Feb 83.24975 51.45153 61.80382
## 3 Mar 79.53280 44.33714 55.74699
## 4 Apr 68.74165 36.39073 52.93841
## 5 May 71.31769 37.20539 52.16853
## 6 Jun 72.74104 40.92818 56.26560
- 📖 CV: Coefficient of variation - standard deviation as a percentage of mean.
- Shows relative variability of rainfall
Relative variability visualised…
barplot(rain_mnsdcv$cv_rain,names=rain_mnsdcv$Month,las=3,col='indianred')

… and absolute variability
barplot(rain_mnsdcv$sd_rain,names=rain_mnsdcv$Month,las=3,col='indianred')

Focusing on time series
📆 🕙 Time Series are often several superimposed effects
- Cyclic
- Seasonal variation in rainfall
- Salary on payday
- Long Term Trend
- Overall increase/decrease in rainfall
- Whether I’m getting richer or poorer
- Random (Noise)
- Very wet one-off days
- Unplanned big spend on car repair
⪸ As seen in the rainfall data?
rain %>% group_by(Year,Month) %>% summarise(rf=sum(Rainfall)) -> monthly_total
plot(monthly_total$rf[1:120],type='b',ann=F, axes=F, cex=0.6)
axis(1,at=seq(0,120,by=12),labels=1850:1860); axis(2); title('Total Irish Rainfall 1850-1860')

⏳ The R time-series object
ts
- A time series object that is useful for some functions…
monthly_total$rf %>% ts(freq=12,start=1850) -> rain_ts # Time serioes object
rain_ts %>% window(c(1850,1),c(1853,12)) # window from jan 1850 - dec 1853
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1850 2836.3 2158.9 964.1 3457.2 1492.1 1362.4 2584.8 1906.4 1763.1 1567.8
## 1851 4875.5 1379.9 1997.9 1368.1 1124.2 2537.1 2258.3 2416.2 1234.5 2834.2
## 1852 4036.1 2311.2 1116.0 1110.0 1754.6 4484.9 1716.0 2752.0 1433.1 2217.7
## 1853 3618.6 1073.9 2088.0 1881.9 829.1 1936.7 2512.3 2072.0 1504.1 4627.6
## Nov Dec
## 1850 2828.5 2600.1
## 1851 1134.3 1719.2
## 1852 5581.2 5205.9
## 1853 2651.1 1209.3
Note c(1850,1)
is January 1850 etc. when using functions like window
🌔 The monthplot
function
Plot to show average levels for each month and individual yearly variations from this.
rain_ts %>% window(c(1850,1),c(1869,12)) %>%
monthplot(col='dodgerblue',col.base='indianred',lwd.base=3)

… for the whole data set:
Remove the window filter to see whole data (1850-2014)
rain_ts %>%
monthplot(col='dodgerblue',col.base='indianred',lwd.base=3)

Decomposition
View data as trend + seasonal + random components:
rain_ts %>%
stl(s.window='periodic') -> rain_decomp
rain_decomp
## seasonal trend remainder
## Jan 1850 541.6231 2290.835 3.84188
## Feb 1850 -193.1494 2273.775 78.27395
## Mar 1850 -286.0009 2256.716 -1006.61494
## Apr 1850 -555.8361 2244.188 1768.84808
## May 1850 -491.4910 2231.660 -248.06925
## Jun 1850 -456.0382 2225.708 -407.27014
## Jul 1850 -219.3928 2219.756 584.43633
R. B. Cleveland, W. S. Cleveland, J.E. McRae, and I. Terpenning (1990) STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, v6, pp3–73.
Graphical representation
rain_ts %>% window(c(1850,1),c(1859,12)) %>%
stl(s.window='periodic') %>% plot

… for all of the data:
rain_ts %>% stl(s.window='periodic') %>% plot

… for all of the data with more smoothing:
rain_ts %>% stl(s.window='periodic',t.window=361) %>% plot

Further issues for time series
A closer look at the error term
- You can extract the error term from an
stl
object
- column 3 is the error term of the
time.series
component
rain_ts %>% stl(s.window='periodic',t.window=361) -> rain_decomp2
rain_decomp2$time.series[,3] %>% window(c(1850,1),c(1853,12))
## Jan Feb Mar Apr May
## 1850 184.79467 241.67250 -860.76751 1901.54871 -128.51254
## 1851 2215.27112 -546.05105 164.30894 -196.27484 -505.13609
## 1852 1367.14757 376.52540 -726.31461 -463.09839 116.54036
## 1853 940.92402 -869.49815 236.94120 300.03678 -817.74510
## Jun Jul Aug Sep Oct
## 1850 -294.35012 690.72311 -361.86057 -327.84485 -1036.83437
## 1851 871.62633 355.49956 139.21588 -865.16840 220.84208
## 1852 2810.70278 -195.52399 466.29233 -675.29195 -404.38147
## 1853 253.69668 591.94927 -222.55505 -613.15997 1996.62987
## Nov Dec
## 1850 285.70343 -125.77091
## 1851 -1417.22012 -1015.39446
## 1852 3020.95633 2462.58199
## 1853 81.94704 -1542.94794
Temporal Autocorrelation
- This is the degree to which current values of a variable depend on previous values
- Autocorrelation at lag 1 is correlation between \(x_t\) and \(x_{t-1}\)
- often written as \(\rho(x_t,x_{t-1})\) or for shorthand \(\rho_1\).
- … and so on. For lag \(k\) it is \(\rho(x_t,x_{t-k})\), or \(\rho_k\).
- A graph of \(\rho_k\) vs. \(k\) is called a *correlogram
- Looking at correlograms helps further understand the process driving the time series
- In particular, if a particular month has an unusually high or low rainfall level how long is this anomaly likely to persist?
The correlogram
- This is provided by
acf
(AutoCorrelation Function)
- Blue dotted lines show significant values at 95% : Significant 6 month lag negative correlation
- Initial spike fairly irrelevent - just says \(\rho_0 = 1\) - obviously so as this is just \(\rho(x_i,x_i)\)!
rain_decomp2$time.series[,3] %>% acf(main='Rainfall Correlogram')

The frequency domain
- Another way to think about time series is in the frequency domain
- Identify ‘spikes’ at different frequencies - ie 1 cycle per year, 2 cycles per year etc. - use the
spectrum
function
rain_ts %>% spectrum(spans=3,log='dB') # spans controls smoothing on spectral estimate
