We’ll be examining the msleep dataset from ggplot2. We want to determine if mammal diet (here called “vore”) affects the total amount of time spent sleeping (Sleep_total), the total amount of time spent in REM sleep (Sleep_rem), and the length of a sleep cycle (Sleep_cycle). We’ll clean, analyze, and examine the dataset below.
We need the following libraries.
library(ggdist)
## Warning: package 'ggdist' was built under R version 4.2.3
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.2.3
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
library(scales)
## Warning: package 'scales' was built under R version 4.2.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks rstatix::filter(), stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Let’s check out the dataset.
view(msleep)

Data Cleaning

I did the same process again for conservation status. The International Union for Conservation of Nature (IUCN) determined nine categories for threatened species: Not Evaluated, Data Deficient, Least Concern, Near Threatened, Vulnerable, Endangered, Critically Endangered, Extinct in the Wild, and Extinct. The dataset included “domesticated” as a status. I noticed that missing values in this column were used for two reasons: multiple species under a given name or not included in the IUCN database. Truly missing values were taken from Animalia, and missing values due to multiple species were then entered accordingly.
sum(is.na(msleep$conservation))
## [1] 29
msleep[8,5] <- "lc"
msleep[16,5] <- "lc"
msleep[27,5] <- "lc"
msleep[34,5] <- "domesticated"
msleep[65,5] <- "lc"
msleep[68,5] <- "lc"
msleep[73,5] <- "lc"
msleep[75,5] <- "lc"
msleep[76,5] <- "lc"
msleep[82,5] <- "lc"
msleep[83,5] <- "lc"

msleep$conservation <- msleep$conservation %>% replace_na("ms")

sum(is.na(msleep$conservation))
## [1] 0
Making naming and capitalization consistent.
msleep$conservation[msleep$conservation == "cd"] <- "Critically Endangered"
msleep$conservation[msleep$conservation == "domesticated"] <- "Domesticated"
msleep$conservation[msleep$conservation == "en"] <- "Endangered"
msleep$conservation[msleep$conservation == "lc"] <- "Least Concern"
msleep$conservation[msleep$conservation == "ms"] <- "Multiple Species"
msleep$conservation[msleep$conservation == "nt"] <- "Near Threatened"
msleep$conservation[msleep$conservation == "vu"] <- "Vulnerable"

msleep$vore[msleep$vore == "carni"] <- "Carnivore"
msleep$vore[msleep$vore == "herbi"] <- "Herbivore"
msleep$vore[msleep$vore == "insecti"] <- "Insectivore"
msleep$vore[msleep$vore == "omni"] <- "Omnivore"

msleep$name <- str_to_title(msleep$name)
names(msleep) <- str_to_title(names(msleep))
Check data set to ensure cleaning occurred.
view(msleep)

Data Analysis

Here are some summary statistics grouped by vore. In the tibble, we see sample size, minimum value, maximum value, median, mean, and standard deviation. We will see some of this as boxplots in the visualization section.
sumstat <- msleep %>%
  group_by(Vore) %>%
  select(Sleep_total, Sleep_rem, Sleep_cycle) %>%
  get_summary_stats() 
## Adding missing grouping variables: `Vore`
view(sumstat)
The summary statistics are helpful. Let’s check for differences in the means of the vore groups over the three variables we discussed at the beginning.
summary(aov(Sleep_total ~ Vore, data = msleep))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Vore         3  126.6   42.20   2.226 0.0916 .
## Residuals   79 1497.5   18.96                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(Sleep_rem ~ Vore, data = msleep))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Vore         3  17.72   5.907   4.036 0.0113 *
## Residuals   57  83.41   1.463                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 22 observations deleted due to missingness
summary(aov(Sleep_cycle ~ Vore, data = msleep))
##             Df Sum Sq Mean Sq F value Pr(>F)
## Vore         3  0.582  0.1941   1.596  0.213
## Residuals   28  3.406  0.1216               
## 51 observations deleted due to missingness
Based on the significance of the p-value (alpha = 0.05), here is no difference for total amount of sleep, but there is a difference for total REM sleep and length of sleep cycle. Use Tukey’s honestly significant difference (HSD) test to see where the differences exist.
TukeyHSD(aov(Sleep_rem ~ Vore, data = msleep))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sleep_rem ~ Vore, data = msleep)
## 
## $Vore
##                             diff        lwr       upr     p adj
## Herbivore-Carnivore   -0.9011111 -2.0862443 0.2840221 0.1955735
## Insectivore-Carnivore  0.8433333 -0.8098939 2.4965605 0.5355059
## Omnivore-Carnivore    -0.3344444 -1.5971176 0.9282287 0.8962753
## Insectivore-Herbivore  1.7444444  0.2995117 3.1893772 0.0118840
## Omnivore-Herbivore     0.5666667 -0.4075068 1.5408401 0.4211798
## Omnivore-Insectivore  -1.1777778 -2.6869608 0.3314053 0.1769193
TukeyHSD(aov(Sleep_cycle ~ Vore, data = msleep))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sleep_cycle ~ Vore, data = msleep)
## 
## $Vore
##                              diff        lwr       upr     p adj
## Herbivore-Carnivore    0.04472222 -0.4621412 0.5515857 0.9949676
## Insectivore-Carnivore -0.20666667 -0.8454424 0.4321090 0.8134355
## Omnivore-Carnivore     0.21909091 -0.2945041 0.7326859 0.6534640
## Insectivore-Herbivore -0.25138889 -0.8011595 0.2983817 0.6022034
## Omnivore-Herbivore     0.17436869 -0.2231148 0.5718522 0.6333127
## Omnivore-Insectivore   0.42575758 -0.1302253 0.9817405 0.1807779
Check the adjusted p-value. For REM sleep, we see differences for the Herbivore-Carnivore, Insectivore-Herbivore, Omnivore-Herbivore, and Omnivore-Insectivore pairings. For length of sleep cycle, this difference only exists for the Omnivore-Insectivore pairing.
This is most likely due to missing values. The REM sleep variable has 22 missing values, and the sleep cycle variable has 51 missing values.
sum(is.na(msleep$Sleep_rem))
## [1] 22
sum(is.na(msleep$Sleep_cycle))
## [1] 51

Data Visualization

Here are some preliminary plots. The first plot represents mammals by vore, and the second plot shows vore by conservation status. We clearly see herbivores comprise the majority of our dataset. Herbivores and carnivores are distributed over all conservation statuses, but it appears that insectivores and omnivores are focused among the Least Concern and Multiple Species statuses.
ggplot(msleep, aes(Vore)) +
  geom_bar() +
  labs(title = "Mammals by Vore", x = "Vore", y = "Number of Mammals") +
  ylim(0, 40) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(face = "bold")) 

ggplot(msleep) + 
  geom_bar(aes(Conservation, fill = Vore)) + 
  scale_fill_manual(values=c('#fde725', '#35b779', '#31688e', '#440154')) +
  labs(title = "Mammals by Conservation Status", x = "Conservation Status", y = "Number of Mammals") +
  ylim(0, 40) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(face = "bold")) +
  theme(axis.text.x = element_text(angle = 90))

Let’s view some boxplots for total sleep, total REM sleep, and sleep cycle.
ggplot(msleep, aes(x = factor(Vore), y = Sleep_total)) +
    geom_boxplot() +
    scale_y_continuous(breaks = pretty(c(0,20), n = 5), limits = c(0,20)) +
    labs(title = "Total Sleep by Vore", x = "Vore", y = "Total Sleep (hour)") +
    theme_classic() +
    theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(face = "bold")) +
    stat_summary(fun = mean, geom = "point", col = "red") +  
    stat_summary(fun = mean, geom = "text", col = "red",     
               vjust = -2, aes(label = paste("Mean:", round(..y.., digits = 2))))
## Warning: The dot-dot notation (`..y..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(y)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(msleep, aes(x = factor(Vore), y = Sleep_rem)) +
    geom_boxplot() +
    scale_y_continuous(breaks = pretty(c(0,10), n = 4), limits = c(0,8)) +
    labs(title = "Total REM Sleep by Vore", x = "Vore", y = "Total REM Sleep (hour)", caption = "22 values are missing.") +
    theme_classic() +
    theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(face = "bold")) +
    stat_summary(fun = mean, geom = "point", col = "red") +  
    stat_summary(fun = mean, geom = "text", col = "red",     
               vjust = 2.2, aes(label = paste("Mean:", round(..y.., digits = 2))))
## Warning: Removed 22 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 22 rows containing non-finite outside the scale range
## (`stat_summary()`).
## Removed 22 rows containing non-finite outside the scale range
## (`stat_summary()`).

Consider the total sleep graphic. Carnivores had the biggest spread of data. Carnivores had the closest mean to the median. Insectivores had the highest mean, and herbivores had the lowest mean. I believe this is due to their sample sizes. Omnivores had the most outliers. Carnivores and omnivores almost had the same median.
Consider the REM sleep graphic, which has 22 missing values. Carnivores had the biggest spread of data. Omnivores had the closest mean to the median. Insectivores had the highest mean, and herbivores had the lowest mean. Herbivores was the only group without outliers. Carnivores and omnivores almost had the same median. Notice the interquartile range (IQR) is much smaller for this graphic. This may be due to the amount of missing values for this variable.
For the variable Sleep_cycle (length of sleep cycle), there are 51 missing values. A boxplot doesn’t yield clear results. Use a dot plot grouped and colored by vore instead. (If the graphic doesn’t visualize after the first run, run it a second time.)
ggplot(msleep, aes(x = factor(Vore), y = Sleep_cycle, fill = Vore)) +
  stat_dots(position = "dodge", dotsize = 2) +
  scale_fill_manual(values=c('#fde725', '#35b779', '#31688e', '#440154')) +
  scale_y_continuous(breaks = pretty(c(0,1.6), n = 8), limits = c(0,1.6)) +
  labs(title = "Length of Sleep Cycle by Vore", x = "Vore", y = "Length of Sleep Cycle (hours)",
       caption = "51 values are missing.") + 
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(face = "bold")) +
  theme(axis.text.x = element_text(angle = 90)) 
## Warning: Removed 51 rows containing missing values or values outside the scale range
## (`stat_slabinterval()`).

From the summary statistics, we know that the medians were 0.38 for carnivores, 0.22 for herbivores, 0.18 for insectivores, and 0.50 for omnivores; the means were 0.37 for carnivores, 0.42 for herbivores, 0.17 for insectivores, and 0.60 for omnivores. With this graphic, we can also see modes. The modes were around 0.40 for carnivores, 0.20 for herbivores, 0.20 for insectivores, and 0.20 for omnivores. The carnivore and insectivore groups had means and medians very close to their modes. This is probably due to sample sizes. Herbivores had a median that was very close to its mode. Herbivores had the highest mean, and insectivores had the lowest mean.
Because of the ANOVA results, total sleep seems to be the only reliable variable. There may be a relationship between total sleep and brain weight. Here is a scatter plot that is colored by vore. The x-axis is scaled logarithmically to better show the spread of data. The color for Insectivore was changed to red to better show contrast for the scatter plot.
ggplot(msleep, aes(Brainwt, Sleep_total)) + 
  geom_point(aes(color = Vore), size = 2.5) +
  scale_color_manual(values=c('Carnivore' = '#fde725', 'Herbivore' = '#35b779', 'Insectivore' = 'red', 'Omnivore' = '#440154')) +
  labs(title = "Total Sleep versus Brain Weight", x = "Brain Weight (kg)", y = "Total Sleep (hours)",
       caption = "There are 27 brain weight values missing. The x-axis is \n scaled by logarithm base-10 due to size of values.") +
  scale_x_log10(labels = label_number()) +
  geom_smooth(se = FALSE) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(face = "bold")) 
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 27 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 27 rows containing missing values or values outside the scale range
## (`geom_point()`).

summary(lm(Sleep_total ~ Brainwt, msleep))
## 
## Call:
## lm(formula = Sleep_total ~ Brainwt, data = msleep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4708 -2.2158 -0.3283  2.0377  9.2693 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.6311     0.5793   18.35  < 2e-16 ***
## Brainwt      -1.6325     0.5748   -2.84  0.00635 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.163 on 54 degrees of freedom
##   (27 observations deleted due to missingness)
## Multiple R-squared:   0.13,  Adjusted R-squared:  0.1138 
## F-statistic: 8.065 on 1 and 54 DF,  p-value: 0.006348
It appears there is a negative correlation between total sleep and brain weight, however the model only accounts for 11.38% of the total variance. Linear modeling is beyond the scope of this project, so no further steps will be taken.

Conclusion

Mammal vore doesn’t influence total amount of sleep, and it may not influence total amount of REM sleep (ANOVA shows some difference between those means). There is not enough data to determine if vore influences length of sleep cycle. There may be a relationship between total among of sleep and brain weight.