TidyVerse Vignette: purrr and dplyr - EXTEND

Author

Ciara Bonnett, David Chen

Introduction

This vignette demonstrates how to use the dplyr and purrr packages within the TidyVerse. We use the built-in msleep dataset, which contains information about the sleep patterns of mammals. This collaborative example is designed to show how raw data can be transformed into a clean, summarized format.

Capability 1: Data Manipulation with dplyr

Here we use dplyr to isolate herbivores. We calculate a new column for awake time and sort the results to see which herbivores sleep the most. This demonstrates the power of piping functions like filter(), mutate(), select(), and arrange() together.

library(tidyverse)

# Cleaning and transforming the mammal sleep data
herbivore_sleep <- msleep %>%
  filter(vore == "herbi") %>%
  mutate(awake_time = 24 - sleep_total) %>%
  select(name, sleep_total, awake_time, brainwt) %>%
  arrange(desc(sleep_total))
  
# Display the first few rows
knitr::kable(head(herbivore_sleep), caption = "Top Sleeping Herbivores")
Top Sleeping Herbivores

name                            sleep_total  awake_time  brainwt
Arctic ground squirrel                 16.6         7.4   0.0057
Golden-mantled ground squirrel         15.9         8.1       NA
Eastern american chipmunk              15.8         8.2       NA
Western american chipmunk              14.9         9.1       NA
Round-tailed muskrat                   14.6         9.4       NA
Mountain beaver                        14.4         9.6       NA
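
To make the benefit of piping concrete, here is a small sketch of the same four steps written as nested function calls, followed by a check that both forms produce the same table. The name herbivore_nested is introduced here for illustration only and is not part of the original vignette.

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# Piped version, as in the vignette: reads top to bottom
herbivore_sleep <- msleep %>%
  filter(vore == "herbi") %>%
  mutate(awake_time = 24 - sleep_total) %>%
  select(name, sleep_total, awake_time, brainwt) %>%
  arrange(desc(sleep_total))

# Same steps as nested calls (illustrative name): must be read inside out
herbivore_nested <- arrange(
  select(
    mutate(filter(msleep, vore == "herbi"), awake_time = 24 - sleep_total),
    name, sleep_total, awake_time, brainwt
  ),
  desc(sleep_total)
)

# Compare the two results
all.equal(herbivore_sleep, herbivore_nested)
```

The nested form does the same work, but the order of operations is harder to follow, which is the main argument for the pipe.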

Capability 2: Functional Programming with purrr

The purrr package is used here to apply the mean function to every numeric column in the dataset at once. This is much more concise than writing a separate line of code for each column. We pass na.rm = TRUE so that missing values do not turn a column's mean into NA, and we use map_df so that the output comes back as a data frame rather than a list.

library(tidyverse)

# Use ggplot2::msleep to ensure R always sees it as the dataset
ggplot2::msleep %>%
  dplyr::select(where(is.numeric)) %>%
  purrr::map_df(~mean(.x, na.rm = TRUE)) %>%
  knitr::kable(caption = "Dataset Averages using purrr")
Dataset Averages using purrr

sleep_total  sleep_rem  sleep_cycle     awake    brainwt    bodywt
   10.43373    1.87541    0.4395833  13.56747  0.2815814  166.1363
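
As a point of comparison, the same column means can be computed with dplyr alone, using summarise() together with across(). This is a sketch of an alternative, not a replacement for the purrr approach above; the object names purrr_means and dplyr_means are introduced here for illustration. As far as we are aware, purrr 1.0.0 and later mark map_df() as superseded, although it continues to work.

```r
library(dplyr)
library(purrr)
library(ggplot2)  # provides the msleep dataset

# purrr approach from the vignette: map mean() over each numeric column
purrr_means <- msleep %>%
  select(where(is.numeric)) %>%
  map_df(~ mean(.x, na.rm = TRUE))

# dplyr-only alternative: summarise() across the same columns
dplyr_means <- msleep %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

# Compare the two one-row results
all.equal(purrr_means, dplyr_means)
```

Either form avoids writing one mean() call per column; which to prefer is largely a matter of whether the surrounding code is already list-oriented (purrr) or data-frame-oriented (dplyr).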

Conclusion

This example shows how the pipe operator (%>%) allows us to chain dplyr and purrr steps, so that data can be cleaned, transformed, and summarized efficiently before being passed directly into a table or visualization.

AI Transcript

Phase 1: Dataset Selection and Reasoning

User Inquiry: “So what dataset from kaggle do I need to use?”

AI Response & Logic: The AI suggested that while Kaggle is an option, using a built-in TidyVerse dataset like msleep is superior for a “collaborative vignette” because it ensures the code will run for all classmates instantly without requiring them to download external files or handle broken file paths.

Decision: The user chose to stay with the built-in msleep dataset for reliability and ease of extension by peers.

Phase 2: Troubleshooting the “Select” Error

User Inquiry: “Error in UseMethod(‘select’) : no applicable method for ‘select’ applied to an object of class ‘function’”

Technical Logic:

* The Problem: The AI identified that the R environment was “masking” the TidyVerse select function.

Phase 3: Final Polishing for Collaboration

User Inquiry: “Have I completed my code completely?”

AI Response & Logic: The AI confirmed the code was functional but suggested adding “Annotations” (descriptive text). This is a requirement for a “vignette” to ensure that the classmate who has to “extend” the code later understands the logic.

Extend

print(msleep)
# A tibble: 83 × 11
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA      11.9
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
 3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
 4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
 5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
 6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
 7 North… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
 8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
 9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
10 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
# ℹ 73 more rows
# ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
colnames(msleep)
 [1] "name"         "genus"        "vore"         "order"        "conservation"
 [6] "sleep_total"  "sleep_rem"    "sleep_cycle"  "awake"        "brainwt"     
[11] "bodywt"      

Using a Basic “for” Loop

for (i in 2:5) {
  col_name <- colnames(msleep)[i]
  unique_vals <- unique(msleep[[i]])
  
  cat("Unique values in column", col_name, ":\n")
  print(unique_vals)
  cat("---------------------------\n")
}
Unique values in column genus :
 [1] "Acinonyx"      "Aotus"         "Aplodontia"    "Blarina"      
 [5] "Bos"           "Bradypus"      "Callorhinus"   "Calomys"      
 [9] "Canis"         "Capreolus"     "Capri"         "Cavis"        
[13] "Cercopithecus" "Chinchilla"    "Condylura"     "Cricetomys"   
[17] "Cryptotis"     "Dasypus"       "Dendrohyrax"   "Didelphis"    
[21] "Elephas"       "Eptesicus"     "Equus"         "Erinaceus"    
[25] "Erythrocebus"  "Eutamias"      "Felis"         "Galago"       
[29] "Giraffa"       "Globicephalus" "Haliochoerus"  "Heterohyrax"  
[33] "Homo"          "Lemur"         "Loxodonta"     "Lutreolina"   
[37] "Macaca"        "Meriones"      "Mesocricetus"  "Microtus"     
[41] "Mus"           "Myotis"        "Neofiber"      "Nyctibeus"    
[45] "Octodon"       "Onychomys"     "Oryctolagus"   "Ovis"         
[49] "Pan"           "Panthera"      "Papio"         "Paraechinus"  
[53] "Perodicticus"  "Peromyscus"    "Phalanger"     "Phoca"        
[57] "Phocoena"      "Potorous"      "Priodontes"    "Procavia"     
[61] "Rattus"        "Rhabdomys"     "Saimiri"       "Scalopus"     
[65] "Sigmodon"      "Spalax"        "Spermophilus"  "Suncus"       
[69] "Sus"           "Tachyglossus"  "Tamias"        "Tapirus"      
[73] "Tenrec"        "Tupaia"        "Tursiops"      "Genetta"      
[77] "Vulpes"       
---------------------------
Unique values in column vore :
[1] "carni"   "omni"    "herbi"   NA        "insecti"
---------------------------
Unique values in column order :
 [1] "Carnivora"       "Primates"        "Rodentia"        "Soricomorpha"   
 [5] "Artiodactyla"    "Pilosa"          "Cingulata"       "Hyracoidea"     
 [9] "Didelphimorphia" "Proboscidea"     "Chiroptera"      "Perissodactyla" 
[13] "Erinaceomorpha"  "Cetacea"         "Lagomorpha"      "Diprotodontia"  
[17] "Monotremata"     "Afrosoricida"    "Scandentia"     
---------------------------
Unique values in column conservation :
[1] "lc"           NA             "nt"           "domesticated" "vu"          
[6] "en"           "cd"          
---------------------------
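
Since this vignette is about purrr, it may help to see the loop above rewritten as a map. The sketch below assumes the same columns 2 through 5 and uses purrr::map() to collect the unique values into a named list, then map_int() to count them. The names unique_by_col and unique_counts are illustrative and not part of the original code.

```r
library(dplyr)
library(purrr)
library(ggplot2)  # provides the msleep dataset

# Collect the unique values of columns 2 to 5 into a named list
unique_by_col <- msleep %>%
  select(2:5) %>%
  map(unique)

# Count the unique values per column (NA counts as one value, as in the loop output)
unique_counts <- map_int(unique_by_col, length)
unique_counts
```

The map version returns its results as objects that can be reused later, whereas the loop only prints them.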

Average Total Sleep Time by Conservation Status

# Filter out rows with missing conservation status, then calculate group means
sleep_summary <- msleep %>%
  filter(!is.na(conservation)) %>%
  group_by(conservation) %>%
  summarise(avg_sleep = mean(sleep_total)) %>%
  arrange(desc(avg_sleep))

# Create plot
ggplot(sleep_summary, aes(x = reorder(conservation, -avg_sleep), y = avg_sleep, fill = conservation)) +
  geom_col() +
  geom_text(aes(label = round(avg_sleep, 1)), vjust = -0.5) + # Add labels on top
  labs(title = "Average Total Sleep Time by Conservation Status",
       x = "Conservation Status",
       y = "Average Sleep (Hours)") +
  theme_minimal() +
  guides(fill = "none")

Count of Observations per Conservation Status

# Count observations per conservation status (NA is kept as its own group)
msleep_counts <- msleep %>%
  count(conservation) %>%
  arrange(desc(n))

print(msleep_counts)
# A tibble: 7 × 2
  conservation     n
  <chr>        <int>
1 <NA>            29
2 lc              27
3 domesticated    10
4 vu               7
5 en               4
6 nt               4
7 cd               2
ggplot(msleep, aes(x = conservation, fill = conservation)) +
  geom_bar() +
  geom_text(stat='count', aes(label=after_stat(count)), vjust=-0.5) + # Adds the count labels
  theme_minimal() +
  labs(title = "Count of Observations per Conservation Status",
       x = "Conservation Status",
       y = "Number of Animals")

Filter for “en” and “nt”, then create a rank index within each group

line_plot_data <- msleep %>%
  filter(conservation %in% c("en", "nt")) %>%
  arrange(conservation, sleep_total) %>%
  group_by(conservation) %>%
  mutate(obs_id = row_number())
print(line_plot_data[,c(1,5)])
# A tibble: 8 × 2
# Groups:   conservation [2]
  name                 conservation
  <chr>                <chr>       
1 Asian elephant       en          
2 Golden hamster       en          
3 Tiger                en          
4 Giant armadillo      en          
5 Jaguar               nt          
6 House mouse          nt          
7 Mountain beaver      nt          
8 Round-tailed muskrat nt          
ggplot(line_plot_data, aes(x = obs_id, y = sleep_total, color = conservation)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  labs(title = "Total Sleep Profile: Endangered vs. Near Threatened",
       subtitle = "Animals sorted from shortest to longest sleep duration",
       x = "Observation Rank",
       y = "Total Sleep (Hours)",
       color = "Status") +
  theme_minimal() +
  scale_color_manual(values = c("en" = "#1f77b4", "nt" = "#ff7f0e"))

The total sleep profile suggests that Endangered status does not dictate a specific sleep duration: the four endangered species span a wide range of sleep totals, which is consistent with a group of animals that have diverse biological needs and are currently under environmental pressure. The four Near Threatened species show a “tighter” profile, suggesting that their physiological sleep requirements may be more uniform. With only four species in each group, these patterns are suggestive rather than conclusive, but they are consistent with the idea that sleep duration is shaped more by an animal's specific evolutionary traits than by its conservation status alone.
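
One way to put a number on how “tight” each profile is would be to summarize the spread of sleep_total within each group. The sketch below is an illustrative addition, assuming range and standard deviation are reasonable measures of spread for groups this small; the name sleep_spread and the summary column names are ours.

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# Summarize the spread of total sleep for the two groups in the line plot
sleep_spread <- msleep %>%
  filter(conservation %in% c("en", "nt")) %>%
  group_by(conservation) %>%
  summarise(
    n           = n(),                     # species per group
    min_sleep   = min(sleep_total),
    max_sleep   = max(sleep_total),
    range_sleep = max_sleep - min_sleep,   # width of the profile
    sd_sleep    = sd(sleep_total)
  )

sleep_spread
```

A larger range_sleep or sd_sleep for one group corresponds to the wider profile in the plot, though with four species per group either statistic is sensitive to a single unusual animal.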