Violin Plots

tidyverse

Let’s load the tidyverse in RStudio.

library(tidyverse)

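A violin plot is a more detailed version of a boxplot: instead of summarising the data with a few quantiles, it shows the shape of the whole distribution. Let’s compare the two by drawing one on top of the other.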

ggplot(mpg, aes(drv, hwy)) + 
  geom_boxplot() + 
  geom_violin(alpha = 0.2)

The violin plot takes a little more getting used to, but it shows us more about where the data is concentrated: the wider the violin at a given value, the more observations lie near that value.
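The outline of each violin is a kernel density estimate of hwy, mirrored around the centre line. As a rough sketch of what that means, the snippet below computes such an estimate for the four-wheel-drive cars with base R's density(). The names hwy_4wd, dens and peak are only illustrative, and geom_violin() applies its own scaling and trimming, so the curve will not match the plot exactly.

```r
library(ggplot2)  # provides the mpg dataset

# Highway mileage for four-wheel-drive cars only
hwy_4wd <- mpg$hwy[mpg$drv == "4"]

# Kernel density estimate: the curve that a violin mirrors left and right
dens <- density(hwy_4wd)

# The hwy value where the estimated density is highest,
# which is roughly where the violin is widest
peak <- dens$x[which.max(dens$y)]
peak
```

Calling plot(dens) draws the same curve on its side, which can help when reading a violin for the first time.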

The Bikeshare Dataset

To explore violin plots, let’s use a new dataset.

The ISLR2 package contains an interesting dataset called Bikeshare. It contains the hourly and daily count of rental bikes between the years 2011 and 2012, along with weather and seasonal information.

install.packages("ISLR2")
library(ISLR2)

Bikeshare

This is long and difficult to read. Let’s make Bikeshare a tibble first.

Bikeshare <- Bikeshare %>% as_tibble()
Bikeshare
## # A tibble: 8,645 x 15
##    season mnth    day hr    holiday weekday workingday weathersit    temp atemp
##     <dbl> <fct> <dbl> <fct>   <dbl>   <dbl>      <dbl> <fct>        <dbl> <dbl>
##  1      1 Jan       1 0           0       6          0 clear         0.24 0.288
##  2      1 Jan       1 1           0       6          0 clear         0.22 0.273
##  3      1 Jan       1 2           0       6          0 clear         0.22 0.273
##  4      1 Jan       1 3           0       6          0 clear         0.24 0.288
##  5      1 Jan       1 4           0       6          0 clear         0.24 0.288
##  6      1 Jan       1 5           0       6          0 cloudy/misty  0.24 0.258
##  7      1 Jan       1 6           0       6          0 clear         0.22 0.273
##  8      1 Jan       1 7           0       6          0 clear         0.2  0.258
##  9      1 Jan       1 8           0       6          0 clear         0.24 0.288
## 10      1 Jan       1 9           0       6          0 clear         0.32 0.348
## # ... with 8,635 more rows, and 5 more variables: hum <dbl>, windspeed <dbl>,
## #   casual <dbl>, registered <dbl>, bikers <dbl>

With over 8,000 rows and 15 columns, Bikeshare is not a small dataset. Before asking any questions of it, let’s spend a few moments understanding the data that we have. The manual for the ISLR2 package, at https://cran.r-project.org/web//packages/ISLR2/ISLR2.pdf, describes each column (search for “Bikeshare”).

The season column is the season of the year, and clearly a categorical variable. How many seasons are there (and how much data do we have for each season)?

Bikeshare %>% count(season)
## # A tibble: 4 x 2
##   season     n
##    <dbl> <int>
## 1      1  2068
## 2      2  2203
## 3      3  2240
## 4      4  2134

We have the 4 seasons, numbered from 1 to 4 instead of named. According to the manual, 1 is winter, 2 is spring, 3 is summer and 4 is fall.

The registered column is the number of registered bikers in each hourly record, and clearly a numerical variable. What’s the average number of registered bikers per hour across the whole dataset?

mean(Bikeshare$registered)
## [1] 115.1939

We can now ask whether average registrations differ across the seasons.

ggplot(Bikeshare, aes(season, registered)) + 
  geom_violin()

Something’s gone wrong - we were supposed to have four violins, one per season. The plotting code itself looks right, so the issue must lie in how the dataset stores season.

Converting to Factors with dplyr

Taking another look at our dataset, season is a <dbl>, short for double. That’s how R stores ordinary numbers. So while we know that seasons are categorical, R didn’t understand that, because the seasons are numbered from 1 to 4. It assumed that season must be a numerical variable.
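A quick way to check how R stores a variable is to ask it directly. The sketch below uses a small stand-in vector, season_codes (a made-up name, not part of Bikeshare), so that it runs without any packages.

```r
# Toy stand-in for the season column: plain numbers from 1 to 4
season_codes <- c(1, 2, 3, 4)

typeof(season_codes)      # "double"
is.numeric(season_codes)  # TRUE
is.factor(season_codes)   # FALSE
```

The same checks work on the real column, for example typeof(Bikeshare$season).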

We need to tell it to consider it a categorical variable, or in R’s terms, a factor.

Bikeshare %>% 
  mutate(season = as.factor(season))

Don’t forget to assign the result back to Bikeshare! mutate() returns a modified copy and leaves the original untouched, so without the assignment the change is lost.

Bikeshare <- Bikeshare %>% 
  mutate(season = as.factor(season))
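If named seasons would be easier to read than the codes 1 to 4, factor() can attach labels during the conversion. This is only a sketch on a toy vector: season_codes and season_named are made-up names, and the mapping of codes to season names is an assumption that should be checked against the ISLR2 manual.

```r
# Toy stand-in for the season column
season_codes <- c(1, 2, 3, 4, 1)

# Convert to a factor and give each code a readable label.
# Assumed coding: 1 = Winter, 2 = Spring, 3 = Summer, 4 = Fall
season_named <- factor(
  season_codes,
  levels = 1:4,
  labels = c("Winter", "Spring", "Summer", "Fall")
)

season_named
levels(season_named)
```

The same factor() call could go inside mutate() in place of as.factor(season). The rest of this section keeps the numeric labels, so the outputs shown here still match.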

And now we can proceed with our plot.

ggplot(Bikeshare) + 
  geom_violin(aes(season, registered))

Season 1 (which is winter) has lower ridership. But it’s not perfectly clear how much lower, because we lost the median line that a boxplot gives us. Like any plot, violin plots have their drawbacks.

But geom_violin has an argument just for this: draw_quantiles = 0.5 draws a line at the median of each violin.

ggplot(Bikeshare, aes(season, registered)) + 
  geom_violin(draw_quantiles = 0.5)

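Now it’s much clearer. Finally, we also need the exact numbers. Note that the summary here reports the mean for each season, whereas the line in the plot marks the median.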

Bikeshare %>% 
  group_by(season) %>% 
  summarise(average_riders = mean(registered))
## # A tibble: 4 x 2
##   season average_riders
##   <fct>           <dbl>
## 1 1                62.2
## 2 2               122. 
## 3 3               145. 
## 4 4               128.

So while the overall average is about 115 registered riders per hour, we can see that from winter (62) to summer (145), the average ridership more than doubles. That is an inference we could perhaps have made without the help of data.
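The line drawn by draw_quantiles = 0.5 marks the median, while the table above reports means, so the two need not agree. A minimal sketch that puts both side by side, assuming the dplyr and ISLR2 packages are installed; the names season_summary, mean_riders, median_riders and n_hours are only illustrative.

```r
library(dplyr)
library(ISLR2)  # provides the Bikeshare dataset

season_summary <- Bikeshare %>%
  group_by(season) %>%
  summarise(
    mean_riders   = mean(registered),    # what the table above reports
    median_riders = median(registered),  # what the line in the violin marks
    n_hours       = n()                  # number of hourly records per season
  )

season_summary
```

A large gap between the mean and the median in a season would suggest a skewed distribution, which is the kind of detail a violin plot is meant to reveal.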

Homework

Spend some time exploring the Bikeshare dataset. Are there any other categorical variables? Could registered ridership be dependent on them too?
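One way to start the search is to count how many distinct values each column takes: columns with only a handful of values are good candidates for categorical variables. This is just a sketch, assuming dplyr and ISLR2 are installed; distinct_counts is a made-up name.

```r
library(dplyr)
library(ISLR2)  # provides the Bikeshare dataset

# Number of distinct values in every column, smallest first
distinct_counts <- sapply(Bikeshare, n_distinct)
sort(distinct_counts)
```

A low count is only a hint, not proof: whether a column is really categorical still depends on what it measures.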

# scale_fill_viridis_d() ships with ggplot2, so no extra package is needed
Bikeshare <- Bikeshare %>% 
  mutate(holiday = as.factor(holiday),
         weekday = as.factor(weekday))

ggplot(Bikeshare, aes(holiday, registered)) + 
  geom_violin(aes(col = holiday), fill = NA, draw_quantiles = 0.5)  +
  labs(title = "Registered bikers on non-holidays vs. holidays",
       caption = "Data obtained from the ISLR2 package")

ggplot(Bikeshare, aes(weekday, registered)) + 
  geom_violin(aes(fill = weekday)) +
  scale_fill_viridis_d(option = "B") + 
  labs(title = "Registered bikers on each day of the week",
       caption = "Data obtained from the ISLR2 package")