Boxplots

This is a boxplot.

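For a given set of data, a boxplot shows us three things: the quartiles, the median and the outliers (if any). The lower end of the box is the first quartile, the line across the middle is the second quartile (the median), and the top of the box is the third quartile. The lines extending from the box (the whiskers) reach to the smallest and largest values that are not outliers, so their tips are the minimum and maximum only when there are no outliers. By default, ggplot2 treats any value more than 1.5 times the interquartile range (the height of the box) beyond the box as an outlier and draws it as a separate point above or below the whiskers.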
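To make these pieces concrete, here is a minimal sketch that computes them by hand in base R. The vector x is made up purely for illustration, the variable names are placeholders, and the 1.5 x IQR rule for outliers is the default that geom_boxplot uses.

```r
# Hypothetical values, invented for this illustration only
x <- c(12, 15, 17, 18, 20, 21, 23, 24, 26, 45)

# First quartile, median (second quartile) and third quartile
q <- unname(quantile(x, c(0.25, 0.5, 0.75)))

# Interquartile range: the height of the box
iqr <- q[3] - q[1]

# Values beyond these fences are drawn as outlier points
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# The whiskers stop at the most extreme values inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

q         # 17.25 20.50 23.75
outliers  # 45
whiskers  # 12 26
```

Here 45 is an outlier, so the upper whisker ends at 26 rather than at the maximum.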

So it’s a nice tool that shows us the spread of the data.

Boxplots in ggplot2

Let’s load the tidyverse in RStudio.

library(tidyverse)

The dataset we’ll be using is called mpg.

mpg
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
##  2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
##  3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
##  4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
##  5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
##  6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
##  7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
##  8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
##  9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
## 10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
## # ... with 224 more rows

It’s a reasonably small dataset containing information on different car models.

Let’s use a boxplot to visualise the median and quartiles of hwy (highway miles per gallon).

ggplot(mpg, aes(y = hwy)) + 
  geom_boxplot()

Note that we specified that hwy should go on the y-axis. What happens if it goes on the x-axis?
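One way to find out is simply to try it. The sketch below assumes ggplot2 3.3.0 or later, where geom_boxplot works out the orientation from the aesthetics supplied; in older versions you would need coord_flip() instead. The name p_horizontal is just a label chosen here.

```r
library(ggplot2)

# Same data, but hwy is mapped to x instead of y
p_horizontal <- ggplot(mpg, aes(x = hwy)) +
  geom_boxplot()

p_horizontal
```

The box should be drawn lying on its side, with the same quartiles and outliers as before.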

From the boxplot above, we have a good idea of how most of the cars in our dataset perform on the highway. Although there are a few very efficient outliers above 40 miles per gallon, most cars fall in the 20s.
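If you want to check these impressions against the numbers, something like the following should work. summary() is base R, and boxplot.stats() comes from the grDevices package that ships with R; its hinges are computed slightly differently from ggplot2’s quartiles, so treat it as a rough cross-check rather than an exact match.

```r
library(ggplot2)  # provides the mpg dataset

# Minimum, quartiles, median, mean and maximum of highway mileage
summary(mpg$hwy)

# Values that base R's boxplot rule (1.5 x IQR beyond the hinges)
# flags as outliers
boxplot.stats(mpg$hwy)$out
```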

But is this true for every type of car? Wouldn’t a midsize car do better than, say, an SUV?

We should find out how different classes of cars do. How many classes of cars are there?

mpg %>% 
  group_by(class) %>% 
  count()
## # A tibble: 7 x 2
## # Groups:   class [7]
##   class          n
##   <chr>      <int>
## 1 2seater        5
## 2 compact       47
## 3 midsize       41
## 4 minivan       11
## 5 pickup        33
## 6 subcompact    35
## 7 suv           62

So we need to tell ggplot2 to create 7 boxplots. The dataset needs to be split up according to the class.

We can put class on the x-axis, and ggplot2 should create 7 boxplots.

ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot()

Pickups and SUVs have the lowest highway mileage, which is reasonable since they are large, heavy vehicles.
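The same comparison can be made numerically. This is a small sketch using dplyr (loaded along with the tidyverse); the names hwy_by_class and median_hwy are just labels chosen here.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Median highway mileage for each class of car
hwy_by_class <- mpg %>%
  group_by(class) %>%
  summarise(median_hwy = median(hwy)) %>%
  arrange(median_hwy)   # lowest mileage first

hwy_by_class
```

The rows at the top of the result correspond to the lowest boxes in the plot.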

Something about how boxplots work emerges.

Boxplots only summarise numerical data. You can draw several boxplots side by side to compare them, but the split between them is made on the basis of a categorical variable.

Compare this with a scatter plot of points, where both the x- and y-axes can carry numeric variables. A boxplot summarises only one numeric variable, here on the y-axis. The other axis can, at best, be used to show different subdivisions of whatever data the boxplot is capturing.
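One consequence, sketched below, is that a variable stored as a number has to be treated as a category before it can split the boxplots. cyl (number of cylinders) is stored as an integer in mpg, so this example wraps it in factor(); picking cyl is just an illustrative choice.

```r
library(ggplot2)

# factor() turns the integer cyl into a categorical variable,
# giving one boxplot per distinct cylinder count
p_cyl <- ggplot(mpg, aes(factor(cyl), hwy)) +
  geom_boxplot()

p_cyl
```

Without factor(), ggplot2 sees a continuous x and should warn about it instead of splitting the data into separate boxes.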

Colours

Let’s further highlight each boxplot by assigning them different colours. Like most geoms that draw a filled shape, geom_boxplot has two arguments for colour: colour (which can also be written color or col) and fill. The colour argument changes the border colour and the fill argument changes the fill colour.

ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot(aes(fill = class))

Note that fill goes inside aes(). This is because we mapped fill to class, which is a part of the data. If we wanted one fill colour regardless of the data (such as fill = "red"), we wouldn’t put it in aes().
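For comparison, here is a sketch of the fixed-colour case. The particular colours are arbitrary choices, and p_fixed is just a label.

```r
library(ggplot2)

# fill and colour are set outside aes(), so every box gets the same
# red interior and black border, and no legend is drawn
p_fixed <- ggplot(mpg, aes(class, hwy)) +
  geom_boxplot(fill = "red", colour = "black")

p_fixed
```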

More colours

Let’s bring in some more palettes to play with.

library(viridis)

The viridis package has several fun palettes. It’s very popular due to its visual appeal as well as the fact that it’s colourblind-friendly.

There are two types of colour palettes in R: discrete and continuous. Discrete colours are a set of distinct individual colours - think of a painter’s palette. Continuous colours form a smooth gradient - one colour gradually shifts into another.

Think about our colour assignment - it’s done on the basis of a categorical variable. That means that we need a set of distinct colours. This is something we need to specify.

ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot(aes(fill = class)) + 
  scale_fill_viridis_d()

There are four main viridis scale functions, all of which ship with ggplot2 itself:

scale_fill_viridis_c()
scale_fill_viridis_d()
scale_colour_viridis_c()
scale_colour_viridis_d()

The c and d at the end stand for continuous and discrete. Don’t mistake the c for categorical!

Then there are separate functions for colour and fill. So if we use fill, we need to use scale_fill… and if we use colour, we need to use scale_colour…
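As a sketch of the other variants: the first plot maps class to the box borders, so it needs the discrete colour scale; the second maps the numeric variable displ (engine displacement) to point colour, so it needs the continuous one. Using geom_point and displ here is just an illustrative choice.

```r
library(ggplot2)

# Discrete variable mapped to colour (borders) -> scale_colour_viridis_d()
p_discrete <- ggplot(mpg, aes(class, hwy)) +
  geom_boxplot(aes(colour = class)) +
  scale_colour_viridis_d()

# Continuous variable mapped to colour -> scale_colour_viridis_c()
p_continuous <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = displ)) +
  scale_colour_viridis_c()

p_discrete
p_continuous
```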

Homework

What are some of the other categorical variables in mpg? Do you think the type of drive train (drv) has any impact on the mileage?

Try to make boxplots for iris, another dataset. Which variables (columns) can we make a boxplot for? Which variables (columns) can be used to segment the boxplots?

ggplot(mpg, aes(drv, hwy)) + 
  geom_boxplot(aes(fill = drv)) + 
  scale_fill_viridis_d(option = "C")

ggplot(iris, aes(Species, Petal.Length)) + 
  geom_boxplot(aes(fill = Species), alpha = 0.5) + 
  scale_fill_viridis_d(option = "E")