Mastering Boxplots in ggplot2

Analyzing Mammal Sleep Patterns through Distribution

Author

Abdullah Al Shamim

Published

February 10, 2026

Introduction

A boxplot summarizes the distribution of a variable through its quartiles. It is a widely used way to flag potential outliers and to compare the center and spread of several groups side by side.

Mastering Boxplots: Visualizing Distribution and Outliers

Boxplots (or box-and-whisker plots) are essential for understanding the distribution of a numerical variable across different categories. They help us identify the median, quartiles, and potential outliers. In this guide, we use the msleep dataset to analyze how total sleep varies across mammals grouped by diet (the vore column).

1. Basic Boxplot (The Starting Point)

We begin by mapping a categorical variable (vore) to the X-axis and a continuous variable (sleep_total) to the Y-axis.

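  • Explanation: geom_boxplot() draws a five-number-style summary. The box spans the first quartile (Q1) to the third quartile (Q3) with a line at the median, and each whisker extends to the most extreme observation within 1.5 × IQR of the box. Points beyond the whiskers are drawn individually as potential outliers, so the whisker ends equal the minimum and maximum only when a group has no outliers.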
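The quantities behind each box can also be computed by hand. The sketch below assumes the conventional 1.5 × IQR rule for the whiskers; the column names (q1, med, lower_fence, and so on) are illustrative choices, not part of the original post.

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# Per-diet summary of the values a boxplot is built from.
# Column names are illustrative, not a ggplot2 convention.
box_stats <- msleep %>%
  group_by(vore) %>%
  summarise(
    q1  = quantile(sleep_total, 0.25, names = FALSE),
    med = median(sleep_total),
    q3  = quantile(sleep_total, 0.75, names = FALSE),
    iqr = q3 - q1,
    lower_fence = q1 - 1.5 * iqr,  # whiskers stop at the last point inside the fences
    upper_fence = q3 + 1.5 * iqr,
    n_outliers  = sum(sleep_total < lower_fence | sleep_total > upper_fence)
  )

box_stats
```

Any group with n_outliers greater than zero should show individual points beyond its whiskers in the plot.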

2. Systemic Data Cleaning (Handling Missing Values)

Datasets often contain missing information.

  • Explanation: drop_na() with no arguments removes every row that has an NA in any column, which can drastically reduce your data. drop_na(vore) is the more targeted choice: it removes only the rows where the dietary category is unknown, keeping as much data as possible.
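A quick way to see the difference is to compare row counts. This is a minimal sketch; the variable names n_all, n_complete, and n_vore are made up for illustration.

```r
library(tidyr)
library(ggplot2)  # provides the msleep dataset

n_all      <- nrow(msleep)                 # every row in the dataset
n_complete <- nrow(drop_na(msleep))        # rows with no NA in any column
n_vore     <- nrow(drop_na(msleep, vore))  # rows with a known diet category

c(all = n_all, complete = n_complete, vore_known = n_vore)
```

The gap between the second and third counts is the data that a blanket drop_na() would discard unnecessarily for this plot.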

3. Color vs. Fill

  • color: Changes the color of the lines (outlines).
  • fill: Fills the interior of the boxes with color.
  • Pro-tip: Using fill by category makes the groups much easier to distinguish at a glance.
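Both arguments can also be set to constant values outside aes(), in which case every box gets the same look. The sketch below shows this; the specific color names are arbitrary choices for illustration.

```r
library(tidyverse)

# Constant colors are set outside aes(), so they apply to every box
# and no legend is created.
p_constant <- msleep %>%
  drop_na(vore) %>%
  ggplot(aes(x = vore, y = sleep_total)) +
  geom_boxplot(color = "grey30", fill = "lightblue")

p_constant
```

Moving color or fill inside aes() and assigning a variable such as vore is what makes the appearance vary by group.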

4. Refining Aesthetics and Transparency

Fully opaque fills can be visually heavy. We use alpha to make the fill partly transparent and theme_test() for a plain white, framed background.

5. Coordinate Flipping

If your category names are long, flipping the axes makes the plot more readable.

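  • Explanation: coord_flip() swaps the two axes, so each category label is written horizontally beside its box. This is particularly useful when long labels would otherwise overlap along the bottom axis.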
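As an alternative sketch, assuming ggplot2 3.3.0 or later, the same horizontal layout can be produced without coord_flip() by mapping the category to y directly; geom_boxplot() then infers the orientation from the aesthetics.

```r
library(tidyverse)

# Horizontal boxplots: the categorical variable goes on the y-axis,
# the continuous variable on the x-axis.
p_horizontal <- msleep %>%
  drop_na(vore) %>%
  ggplot(aes(x = sleep_total, y = vore)) +
  geom_boxplot()

p_horizontal
```

With this approach, labs(x = ...) and labs(y = ...) refer to the axes as they appear on screen, whereas with coord_flip() the x label still names the original x aesthetic, which is drawn vertically.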

1. Setup & Initial Preview

We load the tidyverse, which includes ggplot2, dplyr, and tidyr, and use the msleep dataset that ships with ggplot2. It contains the sleep times and weights of various mammals.

Code
library(tidyverse)

# Previewing the dataset
glimpse(msleep)
Rows: 83
Columns: 11
$ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater shor…
$ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bra…
$ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carn…
$ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Art…
$ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "dome…
$ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5…
$ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, …
$ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, N…
$ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 1…
$ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0…
$ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.04…

2. Step-by-Step Construction

Step 1: Foundation

We map vore (dietary category) to the X-axis and sleep_total to the Y-axis.

Code
msleep %>% 
  ggplot(aes(x = vore, y = sleep_total)) +
  geom_boxplot()

Step 2: Systemic Cleaning

Notice the “NA” category in the plot above. We use drop_na(vore) to remove it from the visualization while keeping rows whose only missing values are in other columns.

Code
msleep %>% 
  drop_na(vore) %>% 
  ggplot(aes(x = vore, y = sleep_total)) +
  geom_boxplot()


3. Enhancing Visual Information

Color vs. Fill

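Mapping color = vore tints only the box outlines, while mapping fill = vore colors the box interiors, which makes it easier to distinguish between herbivores, carnivores, omnivores, and insectivores. The first plot below maps color; the second maps fill and adds alpha = 0.6 for transparency.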

Code
msleep %>% 
  drop_na(vore) %>% 
  ggplot(aes(x = vore, y = sleep_total, color = vore)) +
  geom_boxplot() +
  theme_minimal()

Code
msleep %>% 
  drop_na(vore) %>% 
  ggplot(aes(x = vore, y = sleep_total, fill = vore)) +
  geom_boxplot(alpha = 0.6) +
  theme_minimal()


4. Final Publication-Ready Plot

This plot combines the elements covered so far: transparency, a professional theme, descriptive labels, and coordinate flipping for better readability. It also marks outliers with red crosses.

Code
msleep %>% 
  drop_na(vore) %>% 
  ggplot(aes(x = vore, y = sleep_total, fill = vore)) +
  geom_boxplot(alpha = 0.5, outlier.color = "red", outlier.shape = 4) +
  coord_flip() +
  theme_test(base_size = 14) +
  labs(title = "Mammalian Sleep Analysis",
       subtitle = "Total Sleep Duration Categorized by Diet (Vore)",
       x = "Dietary Category",
       y = "Total Sleep (Hours)",
       fill = "Diet Type",
       caption = "Dataset: msleep | Prepared by Abdullah Al Shamim") +
  theme(legend.position = "none") # Legend removed: the diet axis already labels each group
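To use the figure in a report, the plot can be written to disk with ggsave(). This is a sketch only: the output path, dimensions, and resolution are hypothetical values, and the plot is a trimmed version of the one above.

```r
library(tidyverse)

final_plot <- msleep %>%
  drop_na(vore) %>%
  ggplot(aes(x = vore, y = sleep_total, fill = vore)) +
  geom_boxplot(alpha = 0.5) +
  coord_flip() +
  theme_test(base_size = 14) +
  theme(legend.position = "none")

# Hypothetical output path and size; adjust to the target publication.
out_file <- file.path(tempdir(), "mammal_sleep_boxplot.png")
ggsave(out_file, plot = final_plot, width = 7, height = 5, dpi = 300)
```

Passing the plot explicitly through the plot argument avoids relying on whichever plot happened to be displayed last.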


Systemic Summary Toolkit

Function         | Role          | Why use it?
geom_boxplot()   | Geometry      | Displays the 5-number summary.
drop_na(vore)    | Cleaning      | Removes specific NAs to maintain data integrity.
alpha            | Transparency  | Softens colors for better presentation.
coord_flip()     | Layout        | Improves readability for categorical labels.