Assignment: Using R, build aregression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

I used the following code to list all data sets built into R.

#data(package = .packages(all.available = TRUE))

I chose ‘msleep’, which describes the sleep patterns of mammals along with several other important descriptors. I used this dataset for Discussion 11 when the assignment was to check the assumptions of the linear model. For this discussion, I’ll be adapting the columns into a multivariate model to predict the total sleep in hours.

print(head(msleep))
## # A tibble: 6 x 11
##   name  genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr> <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Chee~ Acin~ carni Carn~ lc                  12.1      NA        NA      11.9
## 2 Owl ~ Aotus omni  Prim~ <NA>                17         1.8      NA       7  
## 3 Moun~ Aplo~ herbi Rode~ nt                  14.4       2.4      NA       9.6
## 4 Grea~ Blar~ omni  Sori~ lc                  14.9       2.3       0.133   9.1
## 5 Cow   Bos   herbi Arti~ domesticated         4         0.7       0.667  20  
## 6 Thre~ Brad~ herbi Pilo~ <NA>                14.4       2.2       0.767   9.6
## # ... with 2 more variables: brainwt <dbl>, bodywt <dbl>
colnames(msleep)
##  [1] "name"         "genus"        "vore"         "order"        "conservation"
##  [6] "sleep_total"  "sleep_rem"    "sleep_cycle"  "awake"        "brainwt"     
## [11] "bodywt"
print(nrow(msleep))
## [1] 83

Background

In my previous discusison, I divided brain weight by body weight to find the proportion of brain weight relative to body weight. I’ll do this again for the multiple regression. I also did a little research into the colleciton of this data, which is based on regressions in the following paper by Savage et al., https://www.pnas.org/content/104/3/1051. For this reason, I’m also going to include the root of body weight as an accepted measure of metabolism. I also expressed rem sleep as a proportion of total sleep.

msleep_df <- msleep
msleep_df$brainprop <- 100*(round(msleep$brainwt/msleep$bodywt,3))
msleep_df$remprop <- 100*(round(msleep$sleep_rem/msleep$sleep_total,3))
glimpse(msleep_df)
## Rows: 83
## Columns: 13
## $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater s...
## $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "...
## $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "c...
## $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "d...
## $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0...
## $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0....
## $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333...
## $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0...
## $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000...
## $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0...
## $ brainprop    <dbl> NA, 3.2, NA, 1.5, 0.1, NA, NA, NA, 0.5, 0.7, 0.3, 0.8,...
## $ remprop      <dbl> NA, 10.6, 16.7, 15.4, 17.5, 15.3, 16.1, NA, 28.7, NA, ...
100*round(sum(is.na(msleep_df$brainprop))/nrow(msleep_df),3)
## [1] 32.5

The most informative variables to explore are all except name and genus. I believe both of these are too specific to be a meaningful input. This means I’ll be considering an initial 9 variables to predict sleep total. Also of note are the three categorical variables that still have class of character. I will convert them to factors to detect their missing values.

msleep_df$vore <- factor(msleep_df$vore)
msleep_df$conservation <- factor(msleep_df$conservation)
msleep_df$order <- factor(msleep_df$order)
sleep_multi_lm <- lm(data = msleep_df, formula = sleep_total ~ order + conservation + vore + sleep_rem + sleep_cycle + awake + brainwt + bodywt + brainprop + sqrt(bodywt) + remprop)
summary(sleep_multi_lm)
## 
## Call:
## lm(formula = sleep_total ~ order + conservation + vore + sleep_rem + 
##     sleep_cycle + awake + brainwt + bodywt + brainprop + sqrt(bodywt) + 
##     remprop, data = msleep_df)
## 
## Residuals:
## ALL 20 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (5 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)           2.400e+01         NA      NA       NA
## orderCarnivora       -1.038e-16         NA      NA       NA
## orderChiroptera       4.598e-15         NA      NA       NA
## orderCingulata       -9.134e-16         NA      NA       NA
## orderDidelphimorphia  2.094e-17         NA      NA       NA
## orderErinaceomorpha   3.508e-15         NA      NA       NA
## orderLagomorpha       2.365e-15         NA      NA       NA
## orderPerissodactyla  -2.122e-15         NA      NA       NA
## orderRodentia         4.849e-15         NA      NA       NA
## orderSoricomorpha     7.170e-15         NA      NA       NA
## conservationen       -1.563e-15         NA      NA       NA
## conservationlc       -1.367e-15         NA      NA       NA
## conservationnt        8.609e-16         NA      NA       NA
## conservationvu        8.130e-16         NA      NA       NA
## voreherbi             3.925e-16         NA      NA       NA
## voreinsecti          -6.255e-16         NA      NA       NA
## voreomni                     NA         NA      NA       NA
## sleep_rem             9.821e-16         NA      NA       NA
## sleep_cycle           1.520e-14         NA      NA       NA
## awake                -1.000e+00         NA      NA       NA
## brainwt               0.000e+00         NA      NA       NA
## bodywt                       NA         NA      NA       NA
## brainprop                    NA         NA      NA       NA
## sqrt(bodywt)                 NA         NA      NA       NA
## remprop                      NA         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
##   (63 observations deleted due to missingness)
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 19 and 0 DF,  p-value: NA

This initial summary has too few degrees of freedom to provide meaningful analysis. Sixty-three observations were excluded because of becaues of missing values. I will attempt to eliminate a variable if it is contributing too few values, and if that’s not sufficient I will provide an inferred values based on comparable species.

summary(msleep_df)
##      name              genus                vore             order   
##  Length:83          Length:83          carni  :19   Rodentia    :22  
##  Class :character   Class :character   herbi  :32   Carnivora   :12  
##  Mode  :character   Mode  :character   insecti: 5   Primates    :12  
##                                        omni   :20   Artiodactyla: 6  
##                                        NA's   : 7   Soricomorpha: 5  
##                                                     Cetacea     : 3  
##                                                     (Other)     :23  
##        conservation  sleep_total      sleep_rem      sleep_cycle    
##  cd          : 2    Min.   : 1.90   Min.   :0.100   Min.   :0.1167  
##  domesticated:10    1st Qu.: 7.85   1st Qu.:0.900   1st Qu.:0.1833  
##  en          : 4    Median :10.10   Median :1.500   Median :0.3333  
##  lc          :27    Mean   :10.43   Mean   :1.875   Mean   :0.4396  
##  nt          : 4    3rd Qu.:13.75   3rd Qu.:2.400   3rd Qu.:0.5792  
##  vu          : 7    Max.   :19.90   Max.   :6.600   Max.   :1.5000  
##  NA's        :29                    NA's   :22      NA's   :51      
##      awake          brainwt            bodywt           brainprop    
##  Min.   : 4.10   Min.   :0.00014   Min.   :   0.005   Min.   :0.100  
##  1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174   1st Qu.:0.400  
##  Median :13.90   Median :0.01240   Median :   1.670   Median :0.700  
##  Mean   :13.57   Mean   :0.28158   Mean   : 166.136   Mean   :1.039  
##  3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750   3rd Qu.:1.500  
##  Max.   :22.10   Max.   :5.71200   Max.   :6654.000   Max.   :4.000  
##                  NA's   :27                           NA's   :27     
##     remprop    
##  Min.   : 3.7  
##  1st Qu.:11.3  
##  Median :15.6  
##  Mean   :17.4  
##  3rd Qu.:22.7  
##  Max.   :34.7  
##  NA's   :22

According to the dataframe summary, there are 51 missing entries for sleep cycle. I’ll omit this column and see what the regression looks like.

sleep_multi_lm <- lm(data = msleep_df, formula = sleep_total ~ order + vore + conservation + awake + brainwt + bodywt + brainprop + sqrt(bodywt) + remprop)
summary(sleep_multi_lm)
## Warning in summary.lm(sleep_multi_lm): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = sleep_total ~ order + vore + conservation + awake + 
##     brainwt + bodywt + brainprop + sqrt(bodywt) + remprop, data = msleep_df)
## 
## Residuals:
##          4          5          9         11         12         14         15 
##  1.161e-15 -6.826e-17 -1.500e-16  1.752e-16  2.386e-15 -2.198e-15  5.738e-16 
##         17         18         19         20         22         23         24 
## -1.735e-15 -2.712e-31  1.319e-15 -6.409e-31 -7.396e-32  1.217e-16 -1.217e-16 
##         25         26         28         32         33         40         42 
##  1.726e-31 -1.726e-31  1.377e-16  1.230e-17 -1.319e-15 -7.642e-31  2.219e-31 
##         48         49         62         64         67         71         74 
##  2.219e-31 -1.070e-16  2.712e-31 -1.285e-15 -3.205e-31  1.098e-15  7.396e-32 
##         77 
##  2.465e-32 
## 
## Coefficients:
##                        Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)           2.400e+01  1.105e-14  2.172e+15   <2e-16 ***
## orderCarnivora        2.702e-15  6.323e-15  4.270e-01   0.6911    
## orderChiroptera      -1.831e-14  9.818e-15 -1.865e+00   0.1356    
## orderCingulata       -3.504e-15  7.071e-15 -4.960e-01   0.6462    
## orderDidelphimorphia -1.490e-14  6.179e-15 -2.412e+00   0.0734 .  
## orderErinaceomorpha  -1.617e-15  6.971e-15 -2.320e-01   0.8280    
## orderHyracoidea      -4.859e-15  4.688e-15 -1.037e+00   0.3584    
## orderLagomorpha      -6.016e-15  4.376e-15 -1.375e+00   0.2412    
## orderPerissodactyla  -1.274e-14  5.090e-15 -2.503e+00   0.0665 .  
## orderPrimates        -1.028e-14  7.548e-15 -1.362e+00   0.2450    
## orderRodentia        -7.223e-15  4.778e-15 -1.512e+00   0.2051    
## orderSoricomorpha    -2.904e-15  8.137e-15 -3.570e-01   0.7392    
## voreherbi             8.126e-15  7.171e-15  1.133e+00   0.3204    
## voreinsecti           1.144e-14  7.200e-15  1.589e+00   0.1873    
## voreomni              8.578e-15  7.612e-15  1.127e+00   0.3228    
## conservationen       -5.688e-16  3.503e-15 -1.620e-01   0.8789    
## conservationlc       -1.720e-15  2.052e-15 -8.380e-01   0.4491    
## conservationnt       -4.268e-15  2.967e-15 -1.439e+00   0.2236    
## conservationvu        2.257e-14  8.744e-15  2.581e+00   0.0612 .  
## awake                -1.000e+00  4.679e-16 -2.137e+15   <2e-16 ***
## brainwt               6.408e-14  2.578e-14  2.486e+00   0.0678 .  
## bodywt                2.908e-17  3.431e-17  8.480e-01   0.4444    
## brainprop            -8.122e-16  9.670e-16 -8.400e-01   0.4482    
## sqrt(bodywt)         -2.012e-15  1.341e-15 -1.500e+00   0.2080    
## remprop              -2.641e-16  2.120e-16 -1.246e+00   0.2809    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.328e-15 on 4 degrees of freedom
##   (54 observations deleted due to missingness)
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.975e+30 on 24 and 4 DF,  p-value: < 2.2e-16

While missing values have improved, the summary is giving me an additional error message that one variable, ‘awake’, may be too highly correlated to total sleep. Below is a simple linear regression of these two variables.

sleep_vs_awake <- lm(data = msleep_df, formula = sleep_total ~ awake)
summary(sleep_vs_awake)
## 
## Call:
## lm(formula = sleep_total ~ awake, data = msleep_df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.004529 -0.002016 -0.001256  0.000283  0.046893 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.9959187  0.0026771    8963   <2e-16 ***
## awake       -0.9996104  0.0001876   -5329   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007563 on 81 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.84e+07 on 1 and 81 DF,  p-value: < 2.2e-16

Of note is the intercept equal to almost 24 hours. I likely think this in an artifact that total sleep and awake make up two parts of a 24-hour period. I will omit ‘awake’ as a variable for this reason.

sleep_multi_lm <- lm(data = msleep_df, formula = sleep_total ~ order + vore + conservation + brainwt + bodywt + brainprop + sqrt(bodywt) + remprop)
summary(sleep_multi_lm)
## 
## Call:
## lm(formula = sleep_total ~ order + vore + conservation + brainwt + 
##     bodywt + brainprop + sqrt(bodywt) + remprop, data = msleep_df)
## 
## Residuals:
##          4          5          9         11         12         14         15 
##  3.519e+00  1.969e-01 -5.733e-01 -4.338e-01 -1.571e+00  1.252e+00 -1.459e+00 
##         17         18         19         20         22         23         24 
## -2.060e+00  1.887e-15 -4.754e-01 -1.110e-16 -2.220e-16 -3.568e-01  3.568e-01 
##         25         26         28         32         33         40         42 
## -3.331e-16  4.441e-16  4.586e-01  1.147e-01  4.754e-01  1.776e-15  3.331e-16 
##         48         49         62         64         67         71         74 
##  6.661e-16  2.368e-01 -1.887e-15 -2.655e-01  1.332e-15  5.845e-01  2.220e-16 
##         77 
## -1.110e-16 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)            7.69000    7.64083   1.006    0.360
## orderCarnivora         3.93208    5.78213   0.680    0.527
## orderChiroptera        7.30460    8.79715   0.830    0.444
## orderCingulata         8.51145    5.58449   1.524    0.188
## orderDidelphimorphia   2.00259    5.83751   0.343    0.746
## orderErinaceomorpha   -6.82913    5.92202  -1.153    0.301
## orderHyracoidea       -3.24507    4.23889  -0.766    0.479
## orderLagomorpha        0.36607    4.17960   0.088    0.934
## orderPerissodactyla    3.43276    4.61658   0.744    0.491
## orderPrimates         -0.89859    7.20309  -0.125    0.906
## orderRodentia          2.61859    4.41373   0.593    0.579
## orderSoricomorpha     -4.27167    7.53924  -0.567    0.595
## voreherbi              0.78596    6.84473   0.115    0.913
## voreinsecti            2.25732    6.80705   0.332    0.754
## voreomni               5.91705    6.77759   0.873    0.423
## conservationen         1.80489    3.24975   0.555    0.603
## conservationlc         1.16388    1.89139   0.615    0.565
## conservationnt         0.94926    2.80365   0.339    0.749
## conservationvu        -2.66780    8.27202  -0.323    0.760
## brainwt              -13.64962   23.87114  -0.572    0.592
## bodywt                 0.02903    0.03012   0.964    0.379
## brainprop             -0.20850    0.91953  -0.227    0.830
## sqrt(bodywt)          -0.72539    1.24033  -0.585    0.584
## remprop                0.08428    0.19910   0.423    0.690
## 
## Residual standard error: 2.225 on 5 degrees of freedom
##   (54 observations deleted due to missingness)
## Multiple R-squared:  0.9617, Adjusted R-squared:  0.7857 
## F-statistic: 5.465 on 23 and 5 DF,  p-value: 0.03382

Categorical Data

Next, I noticed that several of the categorical values may benefit from treating them aas ordinal. In particular, conservation status varies based on the size and vulnerability of the species. Additionally, I’ll look closer at any ranking that could be included in the taxonomic order. I’ll also continue to work on the large amount of missing values.

levels(msleep_df$conservation)
## [1] "cd"           "domesticated" "en"           "lc"           "nt"          
## [6] "vu"
msleep_df$conservation <- replace_na(msleep_df$conservation, 'lc') #replace na values with 'least concern', as the original data source na meant that the species have not conservation status
msleep_df$conservation <- msleep_df$conservation %>% recode_factor('cd' = 'nt') #the two 'cd' values are giraffe and pilot whale, which are designated as nearly threatened

msleep_df$conservation <- factor(msleep_df$conservation, order = TRUE, levels = c("domesticated", 'lc', 'nt', 'vu', 'en')) #reordered conservation status by severity

summary(msleep_df$conservation)
## domesticated           lc           nt           vu           en 
##           10           56            6            7            4

Next, I’m going to recode diet as expressed in the ‘vore’ column. I will order this to offer a general spectrum between herbivore and carnivore. I will recode insectivores to be considered carnivores in this instance. This assumption also explores the idea that nutrient density also says something about an organism’s metabolism and sleep needs.

levels(msleep_df$vore)
## [1] "carni"   "herbi"   "insecti" "omni"
msleep_df$vore <- msleep_df$vore %>% recode_factor('insecti' = 'carni') #recode insectivores as carnivores
msleep_df$vore <- factor(msleep_df$vore, order = TRUE, levels = c("herbi", 'omni', 'carni')) #reordered conservation status by severity

summary(msleep_df$vore)
## herbi  omni carni  NA's 
##    32    20    24     7

Last, I’m going to explore the idea of making the ‘order’ column ordinal by making an assumption about each order containing morphologically similar organisms. To test this, I’m going to group by order, find the body weight, and generate a dot plot to see if these orders occupy distinct intervals of body weight.

levels(msleep_df$order)
##  [1] "Afrosoricida"    "Artiodactyla"    "Carnivora"       "Cetacea"        
##  [5] "Chiroptera"      "Cingulata"       "Didelphimorphia" "Diprotodontia"  
##  [9] "Erinaceomorpha"  "Hyracoidea"      "Lagomorpha"      "Monotremata"    
## [13] "Perissodactyla"  "Pilosa"          "Primates"        "Proboscidea"    
## [17] "Rodentia"        "Scandentia"      "Soricomorpha"
msleep_order_stats <- msleep_df %>% 
  select(order, bodywt) %>%
  group_by(order) %>% #group by taxonomic designation
  mutate(mean_wt = mean(bodywt)) %>% #find average weight by order
  arrange(mean_wt)  %>% #arrange by average weight
  mutate(min_wt = min(bodywt)) %>% #find minimum weight by order
  mutate(max_wt = max(bodywt))  %>% #find max weight by order
  distinct(order, mean_wt, min_wt, max_wt) %>% #remove duplicate entries
  mutate(order = factor(order, levels = .$order))

msleep_order_stats
## # A tibble: 19 x 4
## # Groups:   order [19]
##    order             mean_wt   min_wt   max_wt
##    <fct>               <dbl>    <dbl>    <dbl>
##  1 Chiroptera         0.0165    0.01     0.023
##  2 Soricomorpha       0.0414    0.005    0.075
##  3 Scandentia         0.104     0.104    0.104
##  4 Rodentia           0.288     0.021    1.35 
##  5 Erinaceomorpha     0.66      0.55     0.77 
##  6 Afrosoricida       0.9       0.9      0.9  
##  7 Didelphimorphia    1.03      0.37     1.7  
##  8 Diprotodontia      1.36      1.1      1.62 
##  9 Lagomorpha         2.5       2.5      2.5  
## 10 Hyracoidea         3.06      2.62     3.6  
## 11 Pilosa             3.85      3.85     3.85 
## 12 Monotremata        4.5       4.5      4.5  
## 13 Primates          13.9       0.2     62    
## 14 Cingulata         31.8       3.5     60    
## 15 Carnivora         57.7       2      163.   
## 16 Artiodactyla     282.       14.8    900.   
## 17 Perissodactyla   305.      187      521    
## 18 Cetacea          342.       53.2    800    
## 19 Proboscidea     4600.     2547     6654
msleep_order_vector <- as.character(msleep_order_stats$order) #save orders to vector for recoding purposes
ggplot(msleep_order_stats, aes(log(mean_wt), order)) + 
  geom_point() +
  geom_point(aes(log(max_wt)), color = 'red') +
  geom_point(aes(log(min_wt)), color = 'blue')

After all that, it appears that there’s a decent amount of overlap between the different orders. There are some outliers that appear distinct, but the largest orders, like rodentia, carnivora and primates, have overlap with each other. Ultimately, I don’t think that the order variable is a good one to model as I don’t have enough values to assume underlying normality. Using order as a variable didn’t pan out, so instead I’ll use order to interpolate some missing values for body weight and move onto backward elimination finally.

library(imputeTS) #interpolate 
## Warning: package 'imputeTS' was built under R version 3.6.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
msleep_df$estimated_bodywt <- na.interpolation(msleep_df$bodywt, option = "linear")
## Warning: na.interpolation will be replaced by na_interpolation.
##            Functionality stays the same.
##            The new function name better fits modern R code style guidelines.
##            Please adjust your code accordingly.
summary(msleep_df)
##      name              genus              vore             order   
##  Length:83          Length:83          herbi:32   Rodentia    :22  
##  Class :character   Class :character   omni :20   Carnivora   :12  
##  Mode  :character   Mode  :character   carni:24   Primates    :12  
##                                        NA's : 7   Artiodactyla: 6  
##                                                   Soricomorpha: 5  
##                                                   Cetacea     : 3  
##                                                   (Other)     :23  
##        conservation  sleep_total      sleep_rem      sleep_cycle    
##  domesticated:10    Min.   : 1.90   Min.   :0.100   Min.   :0.1167  
##  lc          :56    1st Qu.: 7.85   1st Qu.:0.900   1st Qu.:0.1833  
##  nt          : 6    Median :10.10   Median :1.500   Median :0.3333  
##  vu          : 7    Mean   :10.43   Mean   :1.875   Mean   :0.4396  
##  en          : 4    3rd Qu.:13.75   3rd Qu.:2.400   3rd Qu.:0.5792  
##                     Max.   :19.90   Max.   :6.600   Max.   :1.5000  
##                                     NA's   :22      NA's   :51      
##      awake          brainwt            bodywt           brainprop    
##  Min.   : 4.10   Min.   :0.00014   Min.   :   0.005   Min.   :0.100  
##  1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174   1st Qu.:0.400  
##  Median :13.90   Median :0.01240   Median :   1.670   Median :0.700  
##  Mean   :13.57   Mean   :0.28158   Mean   : 166.136   Mean   :1.039  
##  3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750   3rd Qu.:1.500  
##  Max.   :22.10   Max.   :5.71200   Max.   :6654.000   Max.   :4.000  
##                  NA's   :27                           NA's   :27     
##     remprop     estimated_bodywt  
##  Min.   : 3.7   Min.   :   0.005  
##  1st Qu.:11.3   1st Qu.:   0.174  
##  Median :15.6   Median :   1.670  
##  Mean   :17.4   Mean   : 166.136  
##  3rd Qu.:22.7   3rd Qu.:  41.750  
##  Max.   :34.7   Max.   :6654.000  
##  NA's   :22

Backward Elimination and Final Model

sleep_multi_lm <- lm(data = msleep_df, formula =  sleep_total ~ vore + conservation +  estimated_bodywt + sqrt(estimated_bodywt) + remprop)
summary(sleep_multi_lm)
## 
## Call:
## lm(formula = sleep_total ~ vore + conservation + estimated_bodywt + 
##     sqrt(estimated_bodywt) + remprop, data = msleep_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.899 -2.466 -0.140  2.165  6.434 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            12.49366    1.70200   7.341 2.82e-09 ***
## vore.L                  1.43765    0.89386   1.608  0.11460    
## vore.Q                  0.25338    0.90500   0.280  0.78075    
## conservation.L          2.96008    1.85081   1.599  0.11659    
## conservation.Q          2.51115    2.00372   1.253  0.21645    
## conservation.C          2.93365    1.60895   1.823  0.07476 .  
## conservation^4          2.08926    1.90782   1.095  0.27917    
## estimated_bodywt        0.01432    0.01002   1.429  0.15986    
## sqrt(estimated_bodywt) -0.76159    0.25874  -2.943  0.00507 ** 
## remprop                 0.06723    0.07150   0.940  0.35204    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.431 on 46 degrees of freedom
##   (27 observations deleted due to missingness)
## Multiple R-squared:  0.5455, Adjusted R-squared:  0.4565 
## F-statistic: 6.134 on 9 and 46 DF,  p-value: 1.252e-05
plot(sleep_multi_lm)

### Conclusions

The above model uses four variables, one of which is transformed, to predict total sleep duration. These variables explain about 45% of the variation of sleep duration. The largest outliers differ from the trendline by over 5 hours, or about 20% of the day. This is a lot of variation and suggests that sleep time isn’t purely related to indicators of metabolism.

The most significant contributor is the root of total body weight, which according to the sign of the coefficient has a negative correlation with sleep time. This may be explained that sleep is primarily a neurological function and increased body mass does not increase the demand for regeneration. Additionally, Savage et al mention that some of the largest mammals, marine animals like dolphins and whales, have very different sleep schedules compared to terrestrial counterparts. Instead of their entire brain sleeping at one time, one hemisphere will rest completely while the other remains altert. This phenomenon is called unihemispheric slow-wave sleep (USWS), and was compensated for in the original publication by dividing the sleep times by half. I think this behavior is unique and more difficult to model than by halving sleep time.

If this model were to be developed further, a dataset with more species and fewer missing values would support a stronger relationship. While there was overlap in size between orders like carnivora and primates, species of megafauna versus smaller mammals like bats and rodents are very distinct and could be treated as their own categories.