Let’s bring the tidyverse into RStudio.
library(tidyverse)
A violin plot is a boxplot variant that also shows the shape of the distribution. Let’s overlay the two on the same axes to compare them.
ggplot(mpg, aes(drv, hwy)) +
  geom_boxplot() +
  geom_violin(alpha = 0.2)
The violin plot is harder to read at first, but its width shows where the data is concentrated, which the boxplot cannot.
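To see what the violin’s outline encodes, here is a minimal base R sketch. It assumes a made-up bimodal sample (not the mpg data) and uses density() as a stand-in for the kernel density estimate that a violin mirrors around its axis. The five-number summary behind a boxplot gives no hint of the two clusters, while the density estimate does.

```r
# Made-up bimodal sample: two clusters around 20 and 30 (illustration only)
set.seed(1)
x <- c(rnorm(100, mean = 20, sd = 1), rnorm(100, mean = 30, sd = 1))

# What a boxplot draws: minimum, lower hinge, median, upper hinge, maximum
box_summary <- fivenum(x)
print(box_summary)

# What a violin draws: a kernel density estimate, mirrored around the axis
d <- density(x)

# Estimated density at the two cluster centres and in the gap between them
at_points <- approx(d$x, d$y, xout = c(20, 25, 30))$y
names(at_points) <- c("at_20", "at_25", "at_30")
print(round(at_points, 3))
```

The density dips around 25, which a violin would show as a narrow waist. The boxplot’s median falls in that same sparse region, so the box alone would suggest the data is centred where there are almost no observations.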
Taking another look at our dataset, season is stored as a double, which is R’s type for numeric values. We know that seasons are categorical, but because they are coded from 1 to 4, R treats the column as numerical.
We need to tell R to treat it as a categorical variable, or in R’s terms, a factor.
Bikeshare %>%
  mutate(season = as.factor(season))
Don’t forget to assign the result back to Bikeshare! The pipeline returns a modified copy and leaves the original untouched, so without the assignment the change is lost.
Bikeshare <- Bikeshare %>%
  mutate(season = as.factor(season))
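The conversion can be tried on a small vector first. The sketch below uses a made-up vector of season codes rather than the real column, and shows an optional variant with readable labels. The labels for codes 1 and 3 follow the text (winter and summer); "Spring" and "Fall" for codes 2 and 4 are an assumption.

```r
# Toy stand-in for the season column: numeric codes 1 to 4 (not the real data)
season_codes <- c(1, 2, 3, 4, 1, 3)

# R stores the codes as plain numbers
print(typeof(season_codes))
print(is.factor(season_codes))

# as.factor() turns each distinct code into a level
season_factor <- as.factor(season_codes)
print(levels(season_factor))

# Optional: factor() with labels gives readable level names.
# "Spring" and "Fall" are assumed labels for codes 2 and 4.
season_named <- factor(
  season_codes,
  levels = 1:4,
  labels = c("Winter", "Spring", "Summer", "Fall")
)
print(table(season_named))
```

With named levels, the plot axis and the summary table would show season names instead of the codes 1 to 4.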
And now we can proceed with our plot.
ggplot(Bikeshare) +
  geom_violin(aes(season, registered))
Season 1 (which is winter) has slightly lower ridership. But it’s not clear how much lower, because the violin lacks the median line that a boxplot draws. Like any plot, violin plots have their drawbacks.
But geom_violin has an argument for exactly this: draw_quantiles = 0.5 draws a line at the median of each violin.
ggplot(Bikeshare, aes(season, registered)) +
  geom_violin(draw_quantiles = 0.5)
Now it’s much clearer. Finally, we also need the exact numbers.
Bikeshare %>%
  group_by(season) %>%
  summarise(average_riders = mean(registered))
## # A tibble: 4 x 2
##   season average_riders
##   <fct>           <dbl>
## 1 1                62.2
## 2 2               122.
## 3 3               145.
## 4 4               128.
So while the yearly average is 115 registered riders, the average more than doubles from winter (62.2) to summer (145). That is an inference we could perhaps have made without the help of data.
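One caveat when comparing the plot with the table: the line drawn by draw_quantiles = 0.5 is the median, while the summary above reports the mean, and the two can differ for skewed counts. A minimal sketch, assuming a hypothetical toy tibble (toy_rides, with invented values) in place of Bikeshare, reports both side by side:

```r
library(dplyr)

# Toy ride counts (hypothetical values, not taken from Bikeshare)
toy_rides <- tibble(
  season     = factor(c(1, 1, 1, 3, 3, 3)),
  registered = c(10, 20, 90, 100, 150, 200)
)

# Mean and median per season in one summary
season_summary <- toy_rides %>%
  group_by(season) %>%
  summarise(
    mean_riders   = mean(registered),
    median_riders = median(registered)
  )

print(season_summary)
```

In the toy data, one large value pulls the mean of season 1 well above its median, while the two agree for season 3. The same pair of columns could be added to the Bikeshare summary to check how far they differ there.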
Spend some time exploring the Bikeshare dataset. Are there any other categorical variables? Could registered ridership be dependent on them too?
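One possible way to start the search is to count the distinct values in each column, since coded categorical variables usually have only a handful. The sketch below runs on a hypothetical miniature table (toy, with invented values), and the cutoff of 7 is an arbitrary choice for this illustration, not a rule.

```r
# Hypothetical miniature of a bikeshare-style table (invented values)
toy <- data.frame(
  holiday    = c(rep(0, 13), 1),
  weekday    = rep(0:6, times = 2),
  temp       = seq(0.2, 0.85, length.out = 14),
  registered = c(13, 32, 27, 110, 95, 60, 41, 18, 77, 54, 120, 88, 66, 25)
)

# Number of distinct values in each column
n_distinct_values <- sapply(toy, function(col) length(unique(col)))
print(n_distinct_values)

# Columns with few distinct values are candidates for as.factor();
# the cutoff of 7 is arbitrary and only suits this toy table
candidates <- names(n_distinct_values)[n_distinct_values <= 7]
print(candidates)
```

A low count is only a hint. Whether a column is really categorical still depends on what its codes mean.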
library(viridis)
# Treat the coded holiday and weekday columns as factors too
Bikeshare <- Bikeshare %>%
  mutate(holiday = as.factor(holiday),
         weekday = as.factor(weekday))
# Holidays vs. other days, with the median marked
ggplot(Bikeshare, aes(holiday, registered)) +
  geom_violin(aes(col = holiday), fill = NA, draw_quantiles = 0.5) +
  labs(title = "Registered bikers on working days vs. holidays",
       caption = "Data obtained from the ISLR2 package")
# One violin per day of the week, filled with a viridis palette
ggplot(Bikeshare, aes(weekday, registered)) +
  geom_violin(aes(fill = weekday)) +
  scale_fill_viridis_d(option = "B") +
  labs(title = "Registered bikers on each day of the week",
       caption = "Data obtained from the ISLR2 package")