Introduction to the `dplyr` package

🎁 A new package `dplyr`

library(dplyr)
rain %>% group_by(Station) %>% 
  summarise(mrain=mean(Rainfall))  -> rain_summary
head(rain_summary)

## # A tibble: 6 x 2
##   Station    mrain
##   <chr>      <dbl>
## 1 Ardara     140. 
## 2 Armagh      68.3
## 3 Athboy      74.7
## 4 Belfast     87.1
## 5 Birr        70.8
## 6 Cappoquinn 121.

The %>% command acts as a ‘pipeline’.
- x %>% f(y) is the same as f(x,y).
- x %>% f1(y) %>% f2(z) is the same as f2(f1(x,y),z) but easier to read.

📆 Group by month

rain %>% group_by(Month) %>% 
  summarise(mrain=mean(Rainfall)) -> rain_months
head(rain_months)

## # A tibble: 6 x 2
##   Month mrain
##   <fct> <dbl>
## 1 Jan   113. 
## 2 Feb    83.2
## 3 Mar    79.5
## 4 Apr    68.7
## 5 May    71.3
## 6 Jun    72.7

summarise applies a summary function to each group and creates a new data frame with one entry for each group. Any R summary function can be used - eg median, sd, max or a user defined function.

📊 Basic graphic investigation

library(ggplot2)
ggplot(rain_months,aes(x=Month,y=mrain)) + geom_col()

📊 What does that mean?

library(ggplot2) Load ggplot2 - this is an R graphics library
Enter line: ggplot(rain_months,aes(x=Month,y=mrain)) + geom_col()
ggplot is the main command - it creates a graph
- aes Aesthetics link variables to characteristics of the graph
- Here x is month and y is mean rainfall
geom_col specifies the kind of graph - here it is a column plot
you add a geometry to the graph object to specify the actual graph

📊 Modifying the style of plots

ggplot(rain_months,aes(x=Month,y=mrain)) + geom_col(fill='dodgerblue')

📈 Also for yearly data

Here we use geom_line to get a line graph, rather than column.

rain %>% group_by(Year) %>% 
  summarise(total_rain=sum(Rainfall)) -> rain_years
ggplot(rain_years,aes(x=Year,y=total_rain)) +  geom_line(col='indianred')

✂️ Filter for individual stations using `filter`

rain %>% group_by(Year) %>% 
  filter(Station=='Strokestown') %>%
  summarise(total_rain=sum(Rainfall)) -> rain_years_str
ggplot(rain_years_str,aes(x=Year,y=total_rain)) +  geom_line(col='indianred')

Multiple summary stats are also possible

rain %>% group_by(Month,Station) %>% 
  summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) -> rain_mnsd
head(rain_mnsd)

## # A tibble: 6 x 4
## # Groups:   Month [1]
##   Month Station    mean_rain sd_rain
##   <fct> <chr>          <dbl>   <dbl>
## 1 Jan   Ardara         175.     64.6
## 2 Jan   Armagh          74.6    28.9
## 3 Jan   Athboy          84.9    32.8
## 4 Jan   Belfast        101.     40.6
## 5 Jan   Birr            79.9    33.7
## 6 Jan   Cappoquinn     154.     66.3

`mutate` - Create new variables in a data frame

rain %>% group_by(Month) %>% 
  summarise(mean_rain=mean(Rainfall),sd_rain=sd(Rainfall)) %>%
  mutate(cv_rain=100 * sd_rain/mean_rain)  -> rain_mnsdcv
head(rain_mnsdcv)

## # A tibble: 6 x 4
##   Month mean_rain sd_rain cv_rain
##   <fct>     <dbl>   <dbl>   <dbl>
## 1 Jan       113.     57.6    51.1
## 2 Feb        83.2    51.5    61.8
## 3 Mar        79.5    44.3    55.7
## 4 Apr        68.7    36.4    52.9
## 5 May        71.3    37.2    52.2
## 6 Jun        72.7    40.9    56.3

📖 CV: Coefficient of variation - standard deviation as a percentage of mean.
- Shows relative variability of rainfall

Relative variability visualised…

ggplot(rain_mnsdcv,aes(x=Month,y=cv_rain)) + geom_col(fill='darkred')

… and absolute variability

ggplot(rain_mnsdcv,aes(x=Month,y=sd_rain)) + geom_col(fill='seagreen')

🔢 Multiple groupings - eg average by month/station combination

rain %>% group_by(Month,Station) %>% 
  summarise(mean_rain=mean(Rainfall)) -> rain_season_station
head(rain_season_station)

## # A tibble: 6 x 3
## # Groups:   Month [1]
##   Month Station    mean_rain
##   <fct> <chr>          <dbl>
## 1 Jan   Ardara         175. 
## 2 Jan   Armagh          74.6
## 3 Jan   Athboy          84.9
## 4 Jan   Belfast        101. 
## 5 Jan   Birr            79.9
## 6 Jan   Cappoquinn     154.

Note that with is implicit in the dplyr commands such as filter and summarise

Re-arrange using `pivot_wider` in `dplyr` package

library(tidyverse)
rain_season_station %>% pivot_wider(names_from=Month,values_from=mean_rain)

## # A tibble: 25 x 13
##    Station   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct
##    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Ardara  175.  127.  123.   98.8  96.9 105.  124.  145.  153.  174. 
##  2 Armagh   74.6  56.0  56.5  53.7  59.2  62.7  72.5  81.9  69.0  80.9
##  3 Athboy   84.9  62.6  62.4  59.0  62.2  68.1  76.3  88.8  76.4  88.1
##  4 Belfast 101.   74.5  73.1  65.9  69.2  74.5  87.7 102.   87.1 106. 
##  5 Birr     79.9  57.9  58.4  54.1  60.3  62.0  75.1  86.8  72.2  83.8
##  6 Cappoq… 154.  118.  110.   94.0  95.8  94.9 104.  125.  116.  147. 
##  7 Cork A… 138.  103.   92.3  76.7  77.2  71.5  76.9  92.8  91.7 120. 
##  8 Derry   100.   74.6  70.2  62.9  64.0  72.9  83.4  94.1  85.3 103. 
##  9 Drumsna  97.2  74.4  71.9  63.6  68.9  73.0  82.1  97.0  83.1  97.4
## 10 Dublin…  63.7  49.0  51.3  49.2  56.1  56.3  64.1  74.9  61.8  73.8
## # … with 15 more rows, and 2 more variables: Nov <dbl>, Dec <dbl>

Now results organised into 2D array

Can also visualise 2D arrays - `geom_tile`

ggplot(rain_season_station,aes(x=Month,y=Station,fill=mean_rain)) + geom_tile()

Create a heatmap

rain %>% group_by(Station) %>% 
  summarise(mean_rain=sum(Rainfall)) -> rain_station
rain_season_station %>% mutate(Station = reorder(Station,rain_station$mean_rain)) -> rain_season_station

Similar to above, but order the station columns so that similar ones are closer together
Base this on the overall rainfall per station
Create an ordering of stations by amount of rainfall
Rainiest places at the top, driest places at the bottom

Create a heatmap

hm <- ggplot(rain_season_station,aes(x=Month,y=Station,fill=mean_rain)) + geom_tile(col='white')
hm

Storing plots to modify is useful (1)

hm + scale_fill_viridis_c(direction=-1) + labs(fill='Mean Rainfall')

Storing plots to modify is useful (2)

hm + scale_fill_distiller(palette = 'Reds',direction=1) + labs(fill='Mean Rainfall')

Alternative aesthetic and geometry

ggplot(rain_season_station,aes(x=Month,y=Station,size=mean_rain)) + geom_point(col='salmon')

Why `%>%` is useful here

This code takes the rain data frame and runs through the steps to aggregate nationally for 1950.

rain %>% filter(Year==1950) %>% group_by(Station) %>% 
  summarise(total_rain=sum(Rainfall)) -> rain_station_1950

So does this code, in ‘traditional’ R format:

rain_station_1950 <- summarise(
  group_by(
    filter(Station,Year==1950),Station),
    total_rain=sum(Rainfall))

Which do you think is clearer?

Further Topics in R and Data Analysis

Chris Brunsdon

Advanced R and Data Analysis

Overview

Looking at the rainfall data

💧 Main working data set is `rainfall.RData`…

This packages all of the rainfall information

Introduction to the `dplyr` package

🎁 A new package `dplyr`

📆 Group by month

📊 Basic graphic investigation

📊 What does that mean?

📊 Modifying the style of plots

📈 Also for yearly data

✂️ Filter for individual stations using `filter`

Multiple summary stats are also possible

`mutate` - Create new variables in a data frame

Relative variability visualised…

… and absolute variability

🔢 Multiple groupings - eg average by month/station combination

Re-arrange using `pivot_wider` in `dplyr` package

Can also visualise 2D arrays - `geom_tile`

Create a heatmap

Create a heatmap

Storing plots to modify is useful (1)

Storing plots to modify is useful (2)

Alternative aesthetic and geometry

Why `%>%` is useful here

Conclusion

💡 New ideas

Further Topics in R and Data Analysis

Chris Brunsdon

Advanced R and Data Analysis

Overview

Looking at the rainfall data

💧 Main working data set is rainfall.RData…

This packages all of the rainfall information

Introduction to the dplyr package

🎁 A new package dplyr

📆 Group by month

📊 Basic graphic investigation

📊 What does that mean?

📊 Modifying the style of plots

📈 Also for yearly data

✂️ Filter for individual stations using filter

Multiple summary stats are also possible

mutate - Create new variables in a data frame

Relative variability visualised…

… and absolute variability

🔢 Multiple groupings - eg average by month/station combination

Re-arrange using pivot_wider in dplyr package

Can also visualise 2D arrays - geom_tile

Create a heatmap

Create a heatmap

Storing plots to modify is useful (1)

Storing plots to modify is useful (2)

Alternative aesthetic and geometry

Why %>% is useful here

Conclusion

💡 New ideas

💧 Main working data set is `rainfall.RData`…

Introduction to the `dplyr` package

🎁 A new package `dplyr`

✂️ Filter for individual stations using `filter`

`mutate` - Create new variables in a data frame

Re-arrange using `pivot_wider` in `dplyr` package

Can also visualise 2D arrays - `geom_tile`

Why `%>%` is useful here