This vignette demonstrates how to use the dplyr and purrr packages within the tidyverse. We are using the built-in msleep dataset, which contains information about the sleep patterns of mammals. This collaborative example is designed to show how we can transform raw data into a clean, summarized format.
Capability 1: Data Manipulation with dplyr
Here I am using dplyr to isolate herbivores. We calculate a new column for awake time and sort the results to see which herbivores sleep the most. This demonstrates the power of piping functions like filter(), mutate(), select(), and arrange() together.
library(tidyverse)

# Cleaning and transforming the mammal sleep data
herbivore_sleep <- msleep %>%
  filter(vore == "herbi") %>%
  mutate(awake_time = 24 - sleep_total) %>%
  select(name, sleep_total, awake_time, brainwt) %>%
  arrange(desc(sleep_total))

# Display the first few rows
knitr::kable(head(herbivore_sleep), caption = "Top Sleeping Herbivores")
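Top Sleeping Herbivores

| name                           | sleep_total | awake_time | brainwt |
|:-------------------------------|------------:|-----------:|--------:|
| Arctic ground squirrel         |        16.6 |        7.4 |  0.0057 |
| Golden-mantled ground squirrel |        15.9 |        8.1 |      NA |
| Eastern american chipmunk      |        15.8 |        8.2 |      NA |
| Western american chipmunk      |        14.9 |        9.1 |      NA |
| Round-tailed muskrat           |        14.6 |        9.4 |      NA |
| Mountain beaver                |        14.4 |        9.6 |      NA |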
Capability 2: Functional Programming with purrr
The purrr package is used here to apply the mean function to every numeric column in the dataset at once. This is more concise than writing a separate line of code for each column. We use map_df so that the output comes back as a data frame, which can be passed straight to knitr::kable().
library(tidyverse)

# Use ggplot2::msleep to ensure R always sees it as the dataset
ggplot2::msleep %>%
  dplyr::select(where(is.numeric)) %>%
  purrr::map_df(~ mean(.x, na.rm = TRUE)) %>%
  knitr::kable(caption = "Dataset Averages using purrr")
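Dataset Averages using purrr

| sleep_total | sleep_rem | sleep_cycle |    awake |   brainwt |   bodywt |
|------------:|----------:|------------:|---------:|----------:|---------:|
|    10.43373 |   1.87541 |   0.4395833 | 13.56747 | 0.2815814 | 166.1363 |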
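As a possible alternative, the same column means can be sketched with dplyr's summarise() and across(), or with purrr::map_dbl() when a named numeric vector is enough. This is only an illustration, not the method used above: it assumes dplyr, purrr, and ggplot2 are installed, and it is motivated by map_df() being marked as superseded in purrr 1.0.0 and later. The object names numeric_cols, col_means_tbl, and col_means_vec are hypothetical.

```r
library(dplyr)
library(purrr)

# Keep only the numeric columns of the built-in msleep dataset
numeric_cols <- ggplot2::msleep %>%
  dplyr::select(where(is.numeric))

# Option 1: a one-row tibble of column means via summarise() + across()
col_means_tbl <- numeric_cols %>%
  dplyr::summarise(dplyr::across(dplyr::everything(), ~ mean(.x, na.rm = TRUE)))

# Option 2: a named numeric vector of column means via map_dbl()
col_means_vec <- purrr::map_dbl(numeric_cols, ~ mean(.x, na.rm = TRUE))

print(col_means_tbl)
print(col_means_vec)
```

Both options ignore missing values through na.rm = TRUE, as the original map_df() call does, so they should agree with the averages in the table above.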
Conclusion
This example shows how the pipe operator (%>%), which dplyr re-exports from magrittr, allows us to clean and transform data efficiently before passing it directly into a summary table or visualization.
AI Transcript
Phase 1: Dataset Selection and Reasoning
User Inquiry: “So what dataset from kaggle do I need to use?”
AI Response & Logic: The AI suggested that while Kaggle is an option, using a built-in tidyverse dataset like msleep is superior for a “collaborative vignette” because it ensures the code will run for all classmates instantly without requiring them to download external files or handle broken file paths.
Decision: The user chose to stay with the built-in msleep dataset for reliability and ease of extension by peers.
Phase 2: Troubleshooting the “Select” Error
User Inquiry: “Error in UseMethod(‘select’) : no applicable method for ‘select’ applied to an object of class ‘function’”
Technical Logic:
* The Problem: The AI identified that the R environment was “masking” the tidyverse select function.
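The namespace-qualified style used in Capability 2 (ggplot2::msleep, dplyr::select) can be sketched as follows. This is a minimal illustration, assuming dplyr and ggplot2 are installed. The first call deliberately passes a function (mean) to dplyr::select() to produce an error of the same form as the one quoted above; whether that was the cause in the original session is not confirmed by the transcript. The names err_msg and safe_result are hypothetical.

```r
library(dplyr)

# Deliberately pass a function instead of a data frame to select();
# capture the error message rather than stopping the script
err_msg <- tryCatch(
  dplyr::select(mean, 1),
  error = function(e) conditionMessage(e)
)
print(err_msg)

# Qualify both the dataset and the verb with their package names so that
# neither can be confused with another object of the same name
safe_result <- ggplot2::msleep %>%
  dplyr::select(name, sleep_total)

print(head(safe_result))
```

Writing package::name is more verbose than relying on library(), but it makes the intended dataset and function explicit for a classmate who extends the code in a different session.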
Phase 3: Final Polishing for Collaboration
User Inquiry: “Have I completed my code completely?”
AI Response & Logic: The AI confirmed the code was functional but suggested adding “Annotations” (descriptive text). This is a requirement for a “vignette” to ensure that the classmate who has to “extend” the code later understands the logic.
Extend
print(msleep)
# A tibble: 83 × 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Cheet… Acin… carni Carn… lc 12.1 NA NA 11.9
2 Owl m… Aotus omni Prim… <NA> 17 1.8 NA 7
3 Mount… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
4 Great… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
6 Three… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6
7 North… Call… carni Carn… vu 8.7 1.4 0.383 15.3
8 Vespe… Calo… <NA> Rode… <NA> 7 NA NA 17
9 Dog Canis carni Carn… domesticated 10.1 2.9 0.333 13.9
10 Roe d… Capr… herbi Arti… lc 3 NA NA 21
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
# List the unique values in columns 2 to 5 (genus, vore, order, conservation)
for (i in 2:5) {
  col_name <- colnames(msleep)[i]
  unique_vals <- unique(msleep[[i]])
  cat("Unique values in column", col_name, ":\n")
  print(unique_vals)
  cat("---------------------------\n")
}
# A tibble: 8 × 2
# Groups: conservation [2]
name conservation
<chr> <chr>
1 Asian elephant en
2 Golden hamster en
3 Tiger en
4 Giant armadillo en
5 Jaguar nt
6 House mouse nt
7 Mountain beaver nt
8 Round-tailed muskrat nt
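The plotting call that follows uses an object named line_plot_data with an obs_id column, but the code that builds it is not shown. A minimal sketch of one way such an object could be built is given here. It assumes the data were filtered to the "en" and "nt" conservation codes, grouped by conservation, and ranked by sleep_total within each group, which matches the grouped tibble printed above and the plot subtitle. The exact steps the author used are not known.

```r
library(dplyr)

# Assumed reconstruction: keep Endangered ("en") and Near Threatened ("nt")
# species, sort each group from shortest to longest total sleep, and
# number the rows within each group to get an observation rank
line_plot_data <- ggplot2::msleep %>%
  dplyr::filter(conservation %in% c("en", "nt")) %>%
  dplyr::group_by(conservation) %>%
  dplyr::arrange(sleep_total, .by_group = TRUE) %>%
  dplyr::mutate(obs_id = dplyr::row_number())

# Show the species in each group, in rank order
line_plot_data %>%
  dplyr::select(name, conservation) %>%
  print()
```

Ranking within each group (rather than across the whole table) lets both lines share the same x-axis positions 1 to 4, which is what makes the two profiles directly comparable.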
ggplot(line_plot_data, aes(x = obs_id, y = sleep_total, color = conservation)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  labs(
    title = "Total Sleep Profile: Endangered vs. Near Threatened",
    subtitle = "Animals sorted from shortest to longest sleep duration",
    x = "Observation Rank",
    y = "Total Sleep (Hours)",
    color = "Status"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("en" = "#1f77b4", "nt" = "#ff7f0e"))
The total sleep profile suggests that Endangered status does not dictate a specific sleep duration; rather, the Endangered group contains animals with diverse biological needs that are currently under environmental pressure. Near Threatened species show a “tighter” profile, suggesting that their physiological sleep requirements may be more uniform across the sampled species. Because each group contains only four species, this pattern should be read as tentative. The comparison suggests that while sleep is a biological necessity, it is shaped more by the specific evolutionary traits of each animal than by its conservation status alone.
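The “tighter” profile described above can be checked numerically. The sketch below, assuming dplyr and ggplot2 are installed, summarises the spread of sleep_total within each of the two conservation groups. The name sleep_spread and its column names are hypothetical, and the range and standard deviation are just two convenient measures of spread, not ones the original analysis is known to have used.

```r
library(dplyr)

# Spread of total sleep within the Endangered ("en") and
# Near Threatened ("nt") groups
sleep_spread <- ggplot2::msleep %>%
  dplyr::filter(conservation %in% c("en", "nt")) %>%
  dplyr::group_by(conservation) %>%
  dplyr::summarise(
    n_species   = dplyr::n(),
    min_sleep   = min(sleep_total),
    max_sleep   = max(sleep_total),
    range_sleep = max_sleep - min_sleep,
    sd_sleep    = sd(sleep_total)
  )

print(sleep_spread)
```

A smaller range_sleep and sd_sleep for the "nt" row than for the "en" row would be consistent with the plot, although with so few species per group the comparison remains descriptive rather than inferential.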