I used the following code to list all data sets built into R.
#data(package = .packages(all.available = TRUE))
I chose ‘msleep’, which describes the sleep patterns of mammals along with several other important descriptors.
print(head(msleep))
## # A tibble: 6 x 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Chee~ Acin~ carni Carn~ lc 12.1 NA NA 11.9
## 2 Owl ~ Aotus omni Prim~ <NA> 17 1.8 NA 7
## 3 Moun~ Aplo~ herbi Rode~ nt 14.4 2.4 NA 9.6
## 4 Grea~ Blar~ omni Sori~ lc 14.9 2.3 0.133 9.1
## 5 Cow Bos herbi Arti~ domesticated 4 0.7 0.667 20
## 6 Thre~ Brad~ herbi Pilo~ <NA> 14.4 2.2 0.767 9.6
## # ... with 2 more variables: brainwt <dbl>, bodywt <dbl>
colnames(msleep)
## [1] "name" "genus" "vore" "order" "conservation"
## [6] "sleep_total" "sleep_rem" "sleep_cycle" "awake" "brainwt"
## [11] "bodywt"
print(nrow(msleep))
## [1] 83
The animal is identified by its common name, order, and genus. The data also provide dietary characteristics, conservation status, as well as brain and body weight. There are also several values describing sleep duration, interval, and how much sleep is designated as REM. To start off this discussion, I will test the assumptions that total sleep is correlated with brain weight.
plot(msleep$brainwt, msleep$sleep_total)
There are a few very large outliers in terms of body weight that will surely affect any association between body weight and sleep. To offset this, I’m going to look at brain weight as a proportion of total body weight.
msleep_df <- msleep
msleep_df$brainprop <- 100*(round(msleep$brainwt/msleep$bodywt,3))
glimpse(msleep_df)
## Rows: 83
## Columns: 12
## $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater s...
## $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "...
## $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "c...
## $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "d...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0...
## $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0....
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333...
## $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0...
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000...
## $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0...
## $ brainprop <dbl> NA, 3.2, NA, 1.5, 0.1, NA, NA, NA, 0.5, 0.7, 0.3, 0.8,...
100*round(sum(is.na(msleep_df$brainprop))/nrow(msleep_df),3)
## [1] 32.5
This new proportion does leave me with several missing values as not every mammal has accurate brain and body weights. About 30% of the values are missing a proportion. It may be helpful in the future to approximate weight based off weights of similar taxonomies.
brain_sleep_line <- lm(data = msleep_df, formula = sleep_total ~ brainprop)
plot(msleep_df$brainprop, msleep_df$sleep_total, xlab ="Proportion of brain weight to body weight, percent",
ylab ="Total time sleeping per day, hours")
abline(brain_sleep_line)
Plotting the total sleep relative to brain proportion shows a slightly positive correlation. However, there remain some outliers with large brains relative to their body.
plot(fitted(brain_sleep_line), resid(brain_sleep_line))
abline(h = 0)
hist(resid(brain_sleep_line))
The residual plot appears to be the most scattered at lower brain proportions, while there’s less scatting for the outliers I mentioned previously. The histogram also illustrates that there residual values are skewed to the right and do not conform to a normal distribution.
qqnorm(resid(brain_sleep_line))
qqline(resid(brain_sleep_line))
There appear to be outliers at both extremes of the data, giving us a clear graphical representation of deviation from a normal distribution. Finally, I’ll find information about the numerical summary and use the ‘gvlma’ package to automatically test my data for linear model assumptions.
print(summary(brain_sleep_line))
##
## Call:
## lm(formula = sleep_total ~ brainprop, data = msleep_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6615 -3.4030 -0.5124 2.1246 9.3404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.6093 0.8565 10.05 5.7e-14 ***
## brainprop 1.5030 0.6184 2.43 0.0184 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.237 on 54 degrees of freedom
## (27 observations deleted due to missingness)
## Multiple R-squared: 0.0986, Adjusted R-squared: 0.0819
## F-statistic: 5.907 on 1 and 54 DF, p-value: 0.01844
gvlma::gvlma(brain_sleep_line)
##
## Call:
## lm(formula = sleep_total ~ brainprop, data = msleep_df)
##
## Coefficients:
## (Intercept) brainprop
## 8.609 1.503
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma::gvlma(x = brain_sleep_line)
##
## Value p-value Decision
## Global Stat 5.7769 0.21644 Assumptions acceptable.
## Skewness 3.4407 0.06361 Assumptions acceptable.
## Kurtosis 0.3703 0.54282 Assumptions acceptable.
## Link Function 1.0498 0.30556 Assumptions acceptable.
## Heteroscedasticity 0.9161 0.33850 Assumptions acceptable.
Based on the R-squared value, the brain mass as proportion of total body weight explains less than 10% of the variation in total sleep duration. Although the assumptions package does find the skewness acceptable, I noticed that the p value is right on the border, at p = 0.063. I think my presented linear model requires more attention to the persistent outliers and missing values to draw better conclusions.