Introduction

Often the data sets we work with contain large numbers of observations, representing the data at it’s most granular level. This is needed so that we can perform all the calculation and statistical analysis necessary to generate insights from our data. Sometimes it is also useful to see our data at a higher level, or in aggregate - for example the overall average.

In this vignette I will explain how to use the diplyr package to aggregate data and calculate mean and median for an aggregated set. I will also show how to use the tidyr package to manipulate the shape of the outcome so we can display the resultant mean and median data on the same chart using ggplot. This vignette uses the “mpg” data set from ggplot. I will start by loading the library for the packages that will be used

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)

1. Viewing the data

For a first look at the data I will use the head() function to view the first few rows of data and the str() function to see a description of the data.

head(mpg)
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans  drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto(~ f        18    29 p     comp~
## 2 audi         a4      1.8  1999     4 manua~ f        21    29 p     comp~
## 3 audi         a4      2    2008     4 manua~ f        20    31 p     comp~
## 4 audi         a4      2    2008     4 auto(~ f        21    30 p     comp~
## 5 audi         a4      2.8  1999     6 auto(~ f        16    26 p     comp~
## 6 audi         a4      2.8  1999     6 manua~ f        18    26 p     comp~
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...

So I can see from this that I have data on the mileage of a whole range of cars. Each observation or row in the data represents a particular car. The variables I am interested in looking at are the mileage data - these are the cty and hwy columns that represent the city and highway cycle mileage (miles per gallon or mpg).

I am interested in seeing how the mpg data compares across engine sizes and car classes. Using an example presented in the R for Data Science book, I can create a scatter plot showing how the highway mpg (hwy) for each car compares to the size of it’s engine (displ). I plot these two variables on the x and y axis and add the class of car as a colour so I can see that as well. The ggplot code below produces this chart.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

This chart presents the individual data points well in the scatterplot.

Aggregating the data

From the scatterplot it looks like there may be some differences by class of car. I want to see what the difference is between the classes as a whole by looking at their mean and median mileage data. I also want to compare the city as well as the highway mileage data. To do this I will use the dplyr package to aggregate the data by class.

mpg_aggbyclass <- mpg %>%  ## Create a new data frame called mpg_aggbyclass, using mpg data
  group_by(class) %>%  ## group this data by the "class" variable
  summarise_at(vars(hwy,cty), list(avg = ~mean(., na.rm=TRUE), med = ~median(., na.rm=TRUE)))
## summarise the variables hwy and cty at this class level. Calculate the mean and label it "avg"; calculate the median and label it "med".  Remove any null values with the na.rm=TRUE

Looking at this new data frame, I now have an average and median value for the city and highway mileage by class of vehicle.

mpg_aggbyclass
## # A tibble: 7 x 5
##   class      hwy_avg cty_avg hwy_med cty_med
##   <chr>        <dbl>   <dbl>   <dbl>   <dbl>
## 1 2seater       24.8    15.4    25        15
## 2 compact       28.3    20.1    27        20
## 3 midsize       27.3    18.8    27        18
## 4 minivan       22.4    15.8    23        16
## 5 pickup        16.9    13      17        13
## 6 subcompact    28.1    20.4    26        19
## 7 suv           18.1    13.5    17.5      13

Changing the shape of the data

Now that I have these data, I want to see how the different average mileage conditions (city and highway) compare across and within each class of car on a chart. The new data frame is in a wide format, and not well suited to the way I want to plot it. I need to transform this data into a more narrow format using the gather function of tidyr.

summary_by_class <- mpg_aggbyclass %>% gather(Condition, Mileage, hwy_avg:cty_med)
summary_by_class
## # A tibble: 28 x 3
##    class      Condition Mileage
##    <chr>      <chr>       <dbl>
##  1 2seater    hwy_avg      24.8
##  2 compact    hwy_avg      28.3
##  3 midsize    hwy_avg      27.3
##  4 minivan    hwy_avg      22.4
##  5 pickup     hwy_avg      16.9
##  6 subcompact hwy_avg      28.1
##  7 suv        hwy_avg      18.1
##  8 2seater    cty_avg      15.4
##  9 compact    cty_avg      20.1
## 10 midsize    cty_avg      18.8
## # ... with 18 more rows

Plot the new average and median data

Now that I have prepared the data I can use ggplot to create a gathered bar chart, and see the mean and median mileage for each car type and each driving condition next to each other, again using ggplot

ggplot(data = summary_by_class) + 
  geom_bar(mapping = aes(fill = Condition, x = class, y = Mileage), stat = "identity", position = "dodge")

It is now much easier to view and compare the average and median mileage by car class and by city and highway conditions.

References

R for Data Science https://r4ds.had.co.nz/ https://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame