Often the data sets we work with contain large numbers of observations, representing the data at it’s most granular level. This is needed so that we can perform all the calculation and statistical analysis necessary to generate insights from our data. Sometimes it is also useful to see our data at a higher level, or in aggregate - for example the overall average.
In this vignette I will explain how to use the diplyr package to aggregate data and calculate mean and median for an aggregated set. I will also show how to use the tidyr package to manipulate the shape of the outcome so we can display the resultant mean and median data on the same chart using ggplot. This vignette uses the “mpg” data set from ggplot. I will start by loading the library for the packages that will be used
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
For a first look at the data I will use the head() function to view the first few rows of data and the str() function to see a description of the data.
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manua~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manua~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto(~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto(~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manua~ f 18 26 p comp~
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
So I can see from this that I have data on the mileage of a whole range of cars. Each observation or row in the data represents a particular car. The variables I am interested in looking at are the mileage data - these are the cty and hwy columns that represent the city and highway cycle mileage (miles per gallon or mpg).
I am interested in seeing how the mpg data compares across engine sizes and car classes. Using an example presented in the R for Data Science book, I can create a scatter plot showing how the highway mpg (hwy) for each car compares to the size of it’s engine (displ). I plot these two variables on the x and y axis and add the class of car as a colour so I can see that as well. The ggplot code below produces this chart.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
This chart presents the individual data points well in the scatterplot.
From the scatterplot it looks like there may be some differences by class of car. I want to see what the difference is between the classes as a whole by looking at their mean and median mileage data. I also want to compare the city as well as the highway mileage data. To do this I will use the dplyr package to aggregate the data by class.
mpg_aggbyclass <- mpg %>% ## Create a new data frame called mpg_aggbyclass, using mpg data
group_by(class) %>% ## group this data by the "class" variable
summarise_at(vars(hwy,cty), list(avg = ~mean(., na.rm=TRUE), med = ~median(., na.rm=TRUE)))
## summarise the variables hwy and cty at this class level. Calculate the mean and label it "avg"; calculate the median and label it "med". Remove any null values with the na.rm=TRUE
Looking at this new data frame, I now have an average and median value for the city and highway mileage by class of vehicle.
mpg_aggbyclass
## # A tibble: 7 x 5
## class hwy_avg cty_avg hwy_med cty_med
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2seater 24.8 15.4 25 15
## 2 compact 28.3 20.1 27 20
## 3 midsize 27.3 18.8 27 18
## 4 minivan 22.4 15.8 23 16
## 5 pickup 16.9 13 17 13
## 6 subcompact 28.1 20.4 26 19
## 7 suv 18.1 13.5 17.5 13
Now that I have these data, I want to see how the different average mileage conditions (city and highway) compare across and within each class of car on a chart. The new data frame is in a wide format, and not well suited to the way I want to plot it. I need to transform this data into a more narrow format using the gather function of tidyr.
summary_by_class <- mpg_aggbyclass %>% gather(Condition, Mileage, hwy_avg:cty_med)
summary_by_class
## # A tibble: 28 x 3
## class Condition Mileage
## <chr> <chr> <dbl>
## 1 2seater hwy_avg 24.8
## 2 compact hwy_avg 28.3
## 3 midsize hwy_avg 27.3
## 4 minivan hwy_avg 22.4
## 5 pickup hwy_avg 16.9
## 6 subcompact hwy_avg 28.1
## 7 suv hwy_avg 18.1
## 8 2seater cty_avg 15.4
## 9 compact cty_avg 20.1
## 10 midsize cty_avg 18.8
## # ... with 18 more rows
Now that I have prepared the data I can use ggplot to create a gathered bar chart, and see the mean and median mileage for each car type and each driving condition next to each other, again using ggplot
ggplot(data = summary_by_class) +
geom_bar(mapping = aes(fill = Condition, x = class, y = Mileage), stat = "identity", position = "dodge")
It is now much easier to view and compare the average and median mileage by car class and by city and highway conditions.
R for Data Science https://r4ds.had.co.nz/ https://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame