Introduction to the dplyr
package
🎁 A new package dplyr
library(dplyr)
rain %>% group_by(Station) %>%
summarise(mrain=mean(Rainfall)) -> rain_summary
head(rain_summary)
## # A tibble: 6 x 2
## Station mrain
## <chr> <dbl>
## 1 Ardara 140.
## 2 Armagh 68.3
## 3 Athboy 74.7
## 4 Belfast 87.1
## 5 Birr 70.8
## 6 Cappoquinn 121.
- The
%>%
command acts as a ‘pipeline’.
x %>% f(y)
is the same as f(x,y)
.
x %>% f1(y) %>% f2(z)
is the same as f2(f1(x,y),z)
but easier to read.
📆 Group by month
rain %>% group_by(Month) %>%
summarise(mrain=mean(Rainfall)) -> rain_months
head(rain_months)
## # A tibble: 6 x 2
## Month mrain
## <fct> <dbl>
## 1 Jan 113.
## 2 Feb 83.2
## 3 Mar 79.5
## 4 Apr 68.7
## 5 May 71.3
## 6 Jun 72.7
summarise
applies a summary function to each group and creates a new data frame with one entry for each group. Any R summary function can be used - eg median
, sd
, max
or a user defined function.
📊 Basic graphic investigation
library(ggplot2)
ggplot(rain_months,aes(x=Month,y=mrain)) + geom_col()

📊 What does that mean?
library(ggplot2)
Load ggplot2
- this is an R graphics library
- Enter line:
ggplot(rain_months,aes(x=Month,y=mrain)) + geom_col()
ggplot
is the main command - it creates a graph
aes
Aesthetics link variables to characteristics of the graph
- Here
x
is month and y
is mean rainfall
geom_col
specifies the kind of graph - here it is a column plot
- you add a geometry to the graph object to specify the actual graph
📊 Modifying the style of plots
ggplot(rain_months,aes(x=Month,y=mrain)) + geom_col(fill='dodgerblue')

📈 Also for yearly data
Here we use geom_line
to get a line graph, rather than column.
rain %>% group_by(Year) %>%
summarise(total_rain=sum(Rainfall)) -> rain_years
ggplot(rain_years,aes(x=Year,y=total_rain)) + geom_line(col='indianred')

✂️ Filter for individual stations using filter
rain %>% group_by(Year) %>%
filter(Station=='Strokestown') %>%
summarise(total_rain=sum(Rainfall)) -> rain_years_str
ggplot(rain_years_str,aes(x=Year,y=total_rain)) + geom_line(col='indianred')

Multiple summary stats are also possible
rain %>% group_by(Month,Station) %>%
summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) -> rain_mnsd
head(rain_mnsd)
## # A tibble: 6 x 4
## # Groups: Month [1]
## Month Station mean_rain sd_rain
## <fct> <chr> <dbl> <dbl>
## 1 Jan Ardara 175. 64.6
## 2 Jan Armagh 74.6 28.9
## 3 Jan Athboy 84.9 32.8
## 4 Jan Belfast 101. 40.6
## 5 Jan Birr 79.9 33.7
## 6 Jan Cappoquinn 154. 66.3
mutate
- Create new variables in a data frame
rain %>% group_by(Month) %>%
summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) %>%
mutate(cv_rain=100 * sd_rain/mean_rain) -> rain_mnsdcv
head(rain_mnsdcv)
## # A tibble: 6 x 4
## Month mean_rain sd_rain cv_rain
## <fct> <dbl> <dbl> <dbl>
## 1 Jan 113. 57.6 51.1
## 2 Feb 83.2 51.5 61.8
## 3 Mar 79.5 44.3 55.7
## 4 Apr 68.7 36.4 52.9
## 5 May 71.3 37.2 52.2
## 6 Jun 72.7 40.9 56.3
- 📖 CV: Coefficient of variation - standard deviation as a percentage of mean.
- Shows relative variability of rainfall
Relative variability visualised…
ggplot(rain_mnsdcv,aes(x=Month,y=cv_rain)) + geom_col(fill='darkred')

… and absolute variability
ggplot(rain_mnsdcv,aes(x=Month,y=sd_rain)) + geom_col(fill='seagreen')

🔢 Multiple groupings - eg average by month/station combination
rain %>% group_by(Month,Station) %>%
summarise(mean_rain=mean(Rainfall)) -> rain_season_station
head(rain_season_station)
## # A tibble: 6 x 3
## # Groups: Month [1]
## Month Station mean_rain
## <fct> <chr> <dbl>
## 1 Jan Ardara 175.
## 2 Jan Armagh 74.6
## 3 Jan Athboy 84.9
## 4 Jan Belfast 101.
## 5 Jan Birr 79.9
## 6 Jan Cappoquinn 154.
Note that with
is implicit in the dplyr
commands such as filter
and summarise
Re-arrange using pivot_wider
in dplyr
package
library(tidyverse)
rain_season_station %>% pivot_wider(names_from=Month,values_from=mean_rain)
## # A tibble: 25 x 13
## Station Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Ardara 175. 127. 123. 98.8 96.9 105. 124. 145. 153. 174.
## 2 Armagh 74.6 56.0 56.5 53.7 59.2 62.7 72.5 81.9 69.0 80.9
## 3 Athboy 84.9 62.6 62.4 59.0 62.2 68.1 76.3 88.8 76.4 88.1
## 4 Belfast 101. 74.5 73.1 65.9 69.2 74.5 87.7 102. 87.1 106.
## 5 Birr 79.9 57.9 58.4 54.1 60.3 62.0 75.1 86.8 72.2 83.8
## 6 Cappoq… 154. 118. 110. 94.0 95.8 94.9 104. 125. 116. 147.
## 7 Cork A… 138. 103. 92.3 76.7 77.2 71.5 76.9 92.8 91.7 120.
## 8 Derry 100. 74.6 70.2 62.9 64.0 72.9 83.4 94.1 85.3 103.
## 9 Drumsna 97.2 74.4 71.9 63.6 68.9 73.0 82.1 97.0 83.1 97.4
## 10 Dublin… 63.7 49.0 51.3 49.2 56.1 56.3 64.1 74.9 61.8 73.8
## # … with 15 more rows, and 2 more variables: Nov <dbl>, Dec <dbl>
- Now results organised into 2D array
Can also visualise 2D arrays - geom_tile
ggplot(rain_season_station,aes(x=Month,y=Station,fill=mean_rain)) + geom_tile()

Create a heatmap
rain %>% group_by(Station) %>%
summarise(mean_rain=sum(Rainfall)) -> rain_station
rain_season_station %>% mutate(Station = reorder(Station,rain_station$mean_rain)) -> rain_season_station
- Similar to above, but order the station columns so that similar ones are closer together
- Base this on the overall rainfall per station
- Create an ordering of stations by amount of rainfall
- Rainiest places at the top, driest places at the bottom
Create a heatmap
hm <- ggplot(rain_season_station,aes(x=Month,y=Station,fill=mean_rain)) + geom_tile(col='white')
hm

Storing plots to modify is useful (1)
hm + scale_fill_viridis_c(direction=-1) + labs(fill='Mean Rainfall')

Storing plots to modify is useful (2)
hm + scale_fill_distiller(palette = 'Reds',direction=1) + labs(fill='Mean Rainfall')

Alternative aesthetic and geometry
ggplot(rain_season_station,aes(x=Month,y=Station,size=mean_rain)) + geom_point(col='salmon')

Why %>%
is useful here
This code takes the rain
data frame and runs through the steps to aggregate nationally for 1950.
rain %>% filter(Year==1950) %>% group_by(Station) %>%
summarise(total_rain=sum(Rainfall)) -> rain_station_1950
So does this code, in ‘traditional’ R format:
rain_station_1950 <- summarise(
group_by(
filter(Station,Year==1950),Station),
total_rain=sum(Rainfall))
Which do you think is clearer?