Discussion 11

Assignment: Using R, create a simple linear regression model and test its assumptions. You may use any data that interest you.

I used the following code to list all data sets built into R.

#data(package = .packages(all.available = TRUE))

I chose ‘msleep’, which describes the sleep patterns of mammals along with several other important descriptors.

print(head(msleep))

## # A tibble: 6 x 11
##   name  genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr> <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Chee~ Acin~ carni Carn~ lc                  12.1      NA        NA      11.9
## 2 Owl ~ Aotus omni  Prim~ <NA>                17         1.8      NA       7  
## 3 Moun~ Aplo~ herbi Rode~ nt                  14.4       2.4      NA       9.6
## 4 Grea~ Blar~ omni  Sori~ lc                  14.9       2.3       0.133   9.1
## 5 Cow   Bos   herbi Arti~ domesticated         4         0.7       0.667  20  
## 6 Thre~ Brad~ herbi Pilo~ <NA>                14.4       2.2       0.767   9.6
## # ... with 2 more variables: brainwt <dbl>, bodywt <dbl>

colnames(msleep)

##  [1] "name"         "genus"        "vore"         "order"        "conservation"
##  [6] "sleep_total"  "sleep_rem"    "sleep_cycle"  "awake"        "brainwt"     
## [11] "bodywt"

print(nrow(msleep))

## [1] 83

The animal is identified by its common name, order, and genus. The data also provide dietary characteristics, conservation status, as well as brain and body weight. There are also several values describing sleep duration, interval, and how much sleep is designated as REM. To start off this discussion, I will test the assumptions that total sleep is correlated with brain weight.

plot(msleep$brainwt, msleep$sleep_total)

There are a few very large outliers in terms of body weight that will surely affect any association between body weight and sleep. To offset this, I’m going to look at brain weight as a proportion of total body weight.

msleep_df <- msleep
msleep_df$brainprop <- 100*(round(msleep$brainwt/msleep$bodywt,3))
glimpse(msleep_df)

## Rows: 83
## Columns: 12
## $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater s...
## $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "...
## $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "c...
## $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "d...
## $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0...
## $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0....
## $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333...
## $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0...
## $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000...
## $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0...
## $ brainprop    <dbl> NA, 3.2, NA, 1.5, 0.1, NA, NA, NA, 0.5, 0.7, 0.3, 0.8,...

100*round(sum(is.na(msleep_df$brainprop))/nrow(msleep_df),3)

## [1] 32.5

This new proportion does leave me with several missing values as not every mammal has accurate brain and body weights. About 30% of the values are missing a proportion. It may be helpful in the future to approximate weight based off weights of similar taxonomies.

brain_sleep_line <- lm(data = msleep_df, formula = sleep_total ~ brainprop)
plot(msleep_df$brainprop, msleep_df$sleep_total, xlab ="Proportion of brain weight to body weight, percent",
ylab ="Total time sleeping per day, hours")
abline(brain_sleep_line)

Plotting the total sleep relative to brain proportion shows a slightly positive correlation. However, there remain some outliers with large brains relative to their body.

Testing Assumptions

Reisiduals

plot(fitted(brain_sleep_line), resid(brain_sleep_line))
abline(h = 0)

hist(resid(brain_sleep_line))

The residual plot appears to be the most scattered at lower brain proportions, while there’s less scatting for the outliers I mentioned previously. The histogram also illustrates that there residual values are skewed to the right and do not conform to a normal distribution.

qqnorm(resid(brain_sleep_line))
qqline(resid(brain_sleep_line))

There appear to be outliers at both extremes of the data, giving us a clear graphical representation of deviation from a normal distribution. Finally, I’ll find information about the numerical summary and use the ‘gvlma’ package to automatically test my data for linear model assumptions.

print(summary(brain_sleep_line))

## 
## Call:
## lm(formula = sleep_total ~ brainprop, data = msleep_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6615 -3.4030 -0.5124  2.1246  9.3404 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.6093     0.8565   10.05  5.7e-14 ***
## brainprop     1.5030     0.6184    2.43   0.0184 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.237 on 54 degrees of freedom
##   (27 observations deleted due to missingness)
## Multiple R-squared:  0.0986, Adjusted R-squared:  0.0819 
## F-statistic: 5.907 on 1 and 54 DF,  p-value: 0.01844

gvlma::gvlma(brain_sleep_line)

## 
## Call:
## lm(formula = sleep_total ~ brainprop, data = msleep_df)
## 
## Coefficients:
## (Intercept)    brainprop  
##       8.609        1.503  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma::gvlma(x = brain_sleep_line) 
## 
##                     Value p-value                Decision
## Global Stat        5.7769 0.21644 Assumptions acceptable.
## Skewness           3.4407 0.06361 Assumptions acceptable.
## Kurtosis           0.3703 0.54282 Assumptions acceptable.
## Link Function      1.0498 0.30556 Assumptions acceptable.
## Heteroscedasticity 0.9161 0.33850 Assumptions acceptable.

Based on the R-squared value, the brain mass as proportion of total body weight explains less than 10% of the variation in total sleep duration. Although the assumptions package does find the skewness acceptable, I noticed that the p value is right on the border, at p = 0.063. I think my presented linear model requires more attention to the persistent outliers and missing values to draw better conclusions.

Discussion 11

Thomas Hill

Assignment: Using R, create a simple linear regression model and test its assumptions. You may use any data that interest you.

Testing Assumptions

Reisiduals