We need the following libraries.
library(ggdist)
## Warning: package 'ggdist' was built under R version 4.2.3
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.2.3
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
library(scales)
## Warning: package 'scales' was built under R version 4.2.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks rstatix::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Data Cleaning
I did the same process again for conservation status. The
International Union for Conservation of Nature (IUCN) determined nine
categories for threatened species: Not Evaluated, Data Deficient, Least
Concern, Near Threatened, Vulnerable, Endangered, Critically Endangered,
Extinct in the Wild, and Extinct. The dataset included “domesticated” as
a status. I noticed that missing values in this column were used for two
reasons: multiple species under a given name or not included in the IUCN
database. Truly missing values were taken from Animalia, and missing
values due to multiple species were then entered accordingly.
sum(is.na(msleep$conservation))
## [1] 29
msleep[8,5] <- "lc"
msleep[16,5] <- "lc"
msleep[27,5] <- "lc"
msleep[34,5] <- "domesticated"
msleep[65,5] <- "lc"
msleep[68,5] <- "lc"
msleep[73,5] <- "lc"
msleep[75,5] <- "lc"
msleep[76,5] <- "lc"
msleep[82,5] <- "lc"
msleep[83,5] <- "lc"
msleep$conservation <- msleep$conservation %>% replace_na("ms")
sum(is.na(msleep$conservation))
## [1] 0
Making naming and capitalization consistent.
msleep$conservation[msleep$conservation == "cd"] <- "Critically Endangered"
msleep$conservation[msleep$conservation == "domesticated"] <- "Domesticated"
msleep$conservation[msleep$conservation == "en"] <- "Endangered"
msleep$conservation[msleep$conservation == "lc"] <- "Least Concern"
msleep$conservation[msleep$conservation == "ms"] <- "Multiple Species"
msleep$conservation[msleep$conservation == "nt"] <- "Near Threatened"
msleep$conservation[msleep$conservation == "vu"] <- "Vulnerable"
msleep$vore[msleep$vore == "carni"] <- "Carnivore"
msleep$vore[msleep$vore == "herbi"] <- "Herbivore"
msleep$vore[msleep$vore == "insecti"] <- "Insectivore"
msleep$vore[msleep$vore == "omni"] <- "Omnivore"
msleep$name <- str_to_title(msleep$name)
names(msleep) <- str_to_title(names(msleep))
Check data set to ensure cleaning occurred.
view(msleep)
Data Analysis
The summary statistics are helpful. Let’s check for differences in
the means of the vore groups over the three variables we discussed at
the beginning.
summary(aov(Sleep_total ~ Vore, data = msleep))
## Df Sum Sq Mean Sq F value Pr(>F)
## Vore 3 126.6 42.20 2.226 0.0916 .
## Residuals 79 1497.5 18.96
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(Sleep_rem ~ Vore, data = msleep))
## Df Sum Sq Mean Sq F value Pr(>F)
## Vore 3 17.72 5.907 4.036 0.0113 *
## Residuals 57 83.41 1.463
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 22 observations deleted due to missingness
summary(aov(Sleep_cycle ~ Vore, data = msleep))
## Df Sum Sq Mean Sq F value Pr(>F)
## Vore 3 0.582 0.1941 1.596 0.213
## Residuals 28 3.406 0.1216
## 51 observations deleted due to missingness
Based on the significance of the p-value (alpha = 0.05), here is no
difference for total amount of sleep, but there is a difference for
total REM sleep and length of sleep cycle. Use Tukey’s honestly
significant difference (HSD) test to see where the differences
exist.
TukeyHSD(aov(Sleep_rem ~ Vore, data = msleep))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sleep_rem ~ Vore, data = msleep)
##
## $Vore
## diff lwr upr p adj
## Herbivore-Carnivore -0.9011111 -2.0862443 0.2840221 0.1955735
## Insectivore-Carnivore 0.8433333 -0.8098939 2.4965605 0.5355059
## Omnivore-Carnivore -0.3344444 -1.5971176 0.9282287 0.8962753
## Insectivore-Herbivore 1.7444444 0.2995117 3.1893772 0.0118840
## Omnivore-Herbivore 0.5666667 -0.4075068 1.5408401 0.4211798
## Omnivore-Insectivore -1.1777778 -2.6869608 0.3314053 0.1769193
TukeyHSD(aov(Sleep_cycle ~ Vore, data = msleep))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sleep_cycle ~ Vore, data = msleep)
##
## $Vore
## diff lwr upr p adj
## Herbivore-Carnivore 0.04472222 -0.4621412 0.5515857 0.9949676
## Insectivore-Carnivore -0.20666667 -0.8454424 0.4321090 0.8134355
## Omnivore-Carnivore 0.21909091 -0.2945041 0.7326859 0.6534640
## Insectivore-Herbivore -0.25138889 -0.8011595 0.2983817 0.6022034
## Omnivore-Herbivore 0.17436869 -0.2231148 0.5718522 0.6333127
## Omnivore-Insectivore 0.42575758 -0.1302253 0.9817405 0.1807779
Check the adjusted p-value. For REM sleep, we see differences for
the Herbivore-Carnivore, Insectivore-Herbivore, Omnivore-Herbivore, and
Omnivore-Insectivore pairings. For length of sleep cycle, this
difference only exists for the Omnivore-Insectivore pairing.
This is most likely due to missing values. The REM sleep variable
has 22 missing values, and the sleep cycle variable has 51 missing
values.
sum(is.na(msleep$Sleep_rem))
## [1] 22
sum(is.na(msleep$Sleep_cycle))
## [1] 51
Data Visualization
Here are some preliminary plots. The first plot represents mammals
by vore, and the second plot shows vore by conservation status. We
clearly see herbivores comprise the majority of our dataset. Herbivores
and carnivores are distributed over all conservation statuses, but it
appears that insectivores and omnivores are focused among the Least
Concern and Multiple Species statuses.
ggplot(msleep, aes(Vore)) +
geom_bar() +
labs(title = "Mammals by Vore", x = "Vore", y = "Number of Mammals") +
ylim(0, 40) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(face = "bold"))

ggplot(msleep) +
geom_bar(aes(Conservation, fill = Vore)) +
scale_fill_manual(values=c('#fde725', '#35b779', '#31688e', '#440154')) +
labs(title = "Mammals by Conservation Status", x = "Conservation Status", y = "Number of Mammals") +
ylim(0, 40) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(face = "bold")) +
theme(axis.text.x = element_text(angle = 90))

Let’s view some boxplots for total sleep, total REM sleep, and sleep
cycle.
ggplot(msleep, aes(x = factor(Vore), y = Sleep_total)) +
geom_boxplot() +
scale_y_continuous(breaks = pretty(c(0,20), n = 5), limits = c(0,20)) +
labs(title = "Total Sleep by Vore", x = "Vore", y = "Total Sleep (hour)") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(face = "bold")) +
stat_summary(fun = mean, geom = "point", col = "red") +
stat_summary(fun = mean, geom = "text", col = "red",
vjust = -2, aes(label = paste("Mean:", round(..y.., digits = 2))))
## Warning: The dot-dot notation (`..y..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(y)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(msleep, aes(x = factor(Vore), y = Sleep_rem)) +
geom_boxplot() +
scale_y_continuous(breaks = pretty(c(0,10), n = 4), limits = c(0,8)) +
labs(title = "Total REM Sleep by Vore", x = "Vore", y = "Total REM Sleep (hour)", caption = "22 values are missing.") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(face = "bold")) +
stat_summary(fun = mean, geom = "point", col = "red") +
stat_summary(fun = mean, geom = "text", col = "red",
vjust = 2.2, aes(label = paste("Mean:", round(..y.., digits = 2))))
## Warning: Removed 22 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 22 rows containing non-finite outside the scale range
## (`stat_summary()`).
## Removed 22 rows containing non-finite outside the scale range
## (`stat_summary()`).

Consider the total sleep graphic. Carnivores had the biggest spread
of data. Carnivores had the closest mean to the median. Insectivores had
the highest mean, and herbivores had the lowest mean. I believe this is
due to their sample sizes. Omnivores had the most outliers. Carnivores
and omnivores almost had the same median.
Consider the REM sleep graphic, which has 22 missing values.
Carnivores had the biggest spread of data. Omnivores had the closest
mean to the median. Insectivores had the highest mean, and herbivores
had the lowest mean. Herbivores was the only group without outliers.
Carnivores and omnivores almost had the same median. Notice the
interquartile range (IQR) is much smaller for this graphic. This may be
due to the amount of missing values for this variable.
For the variable Sleep_cycle (length of sleep cycle), there are 51
missing values. A boxplot doesn’t yield clear results. Use a dot plot
grouped and colored by vore instead. (If the graphic doesn’t visualize
after the first run, run it a second time.)
ggplot(msleep, aes(x = factor(Vore), y = Sleep_cycle, fill = Vore)) +
stat_dots(position = "dodge", dotsize = 2) +
scale_fill_manual(values=c('#fde725', '#35b779', '#31688e', '#440154')) +
scale_y_continuous(breaks = pretty(c(0,1.6), n = 8), limits = c(0,1.6)) +
labs(title = "Length of Sleep Cycle by Vore", x = "Vore", y = "Length of Sleep Cycle (hours)",
caption = "51 values are missing.") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(face = "bold")) +
theme(axis.text.x = element_text(angle = 90))
## Warning: Removed 51 rows containing missing values or values outside the scale range
## (`stat_slabinterval()`).

From the summary statistics, we know that the medians were 0.38 for
carnivores, 0.22 for herbivores, 0.18 for insectivores, and 0.50 for
omnivores; the means were 0.37 for carnivores, 0.42 for herbivores, 0.17
for insectivores, and 0.60 for omnivores. With this graphic, we can also
see modes. The modes were around 0.40 for carnivores, 0.20 for
herbivores, 0.20 for insectivores, and 0.20 for omnivores. The carnivore
and insectivore groups had means and medians very close to their modes.
This is probably due to sample sizes. Herbivores had a median that was
very close to its mode. Herbivores had the highest mean, and
insectivores had the lowest mean.
Because of the ANOVA results, total sleep seems to be the only
reliable variable. There may be a relationship between total sleep and
brain weight. Here is a scatter plot that is colored by vore. The x-axis
is scaled logarithmically to better show the spread of data. The color
for Insectivore was changed to red to better show contrast for the
scatter plot.
ggplot(msleep, aes(Brainwt, Sleep_total)) +
geom_point(aes(color = Vore), size = 2.5) +
scale_color_manual(values=c('Carnivore' = '#fde725', 'Herbivore' = '#35b779', 'Insectivore' = 'red', 'Omnivore' = '#440154')) +
labs(title = "Total Sleep versus Brain Weight", x = "Brain Weight (kg)", y = "Total Sleep (hours)",
caption = "There are 27 brain weight values missing. The x-axis is \n scaled by logarithm base-10 due to size of values.") +
scale_x_log10(labels = label_number()) +
geom_smooth(se = FALSE) +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(face = "bold"))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 27 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 27 rows containing missing values or values outside the scale range
## (`geom_point()`).

summary(lm(Sleep_total ~ Brainwt, msleep))
##
## Call:
## lm(formula = Sleep_total ~ Brainwt, data = msleep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4708 -2.2158 -0.3283 2.0377 9.2693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.6311 0.5793 18.35 < 2e-16 ***
## Brainwt -1.6325 0.5748 -2.84 0.00635 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.163 on 54 degrees of freedom
## (27 observations deleted due to missingness)
## Multiple R-squared: 0.13, Adjusted R-squared: 0.1138
## F-statistic: 8.065 on 1 and 54 DF, p-value: 0.006348
It appears there is a negative correlation between total sleep and
brain weight, however the model only accounts for 11.38% of the total
variance. Linear modeling is beyond the scope of this project, so no
further steps will be taken.