Data 605 - Discussion 13

Discussion 13

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

summary(msleep)

##      name              genus               vore          
##  Length:83          Length:83          Length:83         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     order           conservation        sleep_total      sleep_rem    
##  Length:83          Length:83          Min.   : 1.90   Min.   :0.100  
##  Class :character   Class :character   1st Qu.: 7.85   1st Qu.:0.900  
##  Mode  :character   Mode  :character   Median :10.10   Median :1.500  
##                                        Mean   :10.43   Mean   :1.875  
##                                        3rd Qu.:13.75   3rd Qu.:2.400  
##                                        Max.   :19.90   Max.   :6.600  
##                                                        NA's   :22     
##   sleep_cycle         awake          brainwt            bodywt        
##  Min.   :0.1167   Min.   : 4.10   Min.   :0.00014   Min.   :   0.005  
##  1st Qu.:0.1833   1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174  
##  Median :0.3333   Median :13.90   Median :0.01240   Median :   1.670  
##  Mean   :0.4396   Mean   :13.57   Mean   :0.28158   Mean   : 166.136  
##  3rd Qu.:0.5792   3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750  
##  Max.   :1.5000   Max.   :22.10   Max.   :5.71200   Max.   :6654.000  
##  NA's   :51                       NA's   :27

head(msleep)

## # A tibble: 6 x 11
##   name  genus vore  order conservation sleep_total sleep_rem sleep_cycle
##   <chr> <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl>
## 1 Chee~ Acin~ carni Carn~ lc                  12.1      NA        NA    
## 2 Owl ~ Aotus omni  Prim~ <NA>                17         1.8      NA    
## 3 Moun~ Aplo~ herbi Rode~ nt                  14.4       2.4      NA    
## 4 Grea~ Blar~ omni  Sori~ lc                  14.9       2.3       0.133
## 5 Cow   Bos   herbi Arti~ domesticated         4         0.7       0.667
## 6 Thre~ Brad~ herbi Pilo~ <NA>                14.4       2.2       0.767
## # ... with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>

Can I predict the Brain Weight of a mammal based on the Body weight? Does the awake have an effect? Does the eating habits have an effect?

Vore will become my dichotomous value as I will pair down the dataset to only the two highest vore types.

It may make sense the amount of time the body is awake could have an effect. I will create a quadratic variable then analyze the difference.

msleep.analysis <- msleep%>%
  select(name, genus, vore, sleep_total, awake, brainwt,bodywt)%>%
  filter(!is.na(brainwt))%>%
  filter(!is.na(vore))%>%
  filter(vore %in% c("herbi","omni"))%>%
  mutate(category = ifelse(vore == "herbi",0,1))


corr.df <- msleep.analysis%>%
 select(sleep_total, awake, brainwt,bodywt,category)

corr.data <-cor(corr.df)
corrplot(corr.data, type = "lower", order="hclust", tl.col = "black", tl.srt = 45, method = "number")

ggpairs(
  corr.df,
  lower = list(continuous = ggally_points, combo = ggally_dot_no_facet)
  )

corr.df <- corr.df %>%
  mutate(awake2 = awake^2)
  

msleep_lm <- lm(brainwt ~ bodywt + awake + sleep_total + awake2 + category, data = corr.df)
summary (msleep_lm)

## 
## Call:
## lm(formula = brainwt ~ bodywt + awake + sleep_total + awake2 + 
##     category, data = corr.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72732 -0.14347 -0.05026  0.01966  2.00405 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.062e-01  7.662e-01  -0.139    0.891    
## bodywt       9.294e-04  7.114e-05  13.065 2.26e-14 ***
## awake       -4.823e-03  1.167e-01  -0.041    0.967    
## sleep_total         NA         NA      NA       NA    
## awake2       1.077e-03  4.176e-03   0.258    0.798    
## category     1.115e-01  1.679e-01   0.664    0.512    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4413 on 32 degrees of freedom
## Multiple R-squared:  0.8771, Adjusted R-squared:  0.8617 
## F-statistic:  57.1 on 4 and 32 DF,  p-value: 4.07e-14

Awake does not drop the pvalue significantly enough for us to say it has an effect on the analysis

msleep_lm <- lm(brainwt ~ bodywt + category, data = corr.df)
summary (msleep_lm)

## 
## Call:
## lm(formula = brainwt ~ bodywt + category, data = corr.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77970 -0.12703 -0.09761 -0.01868  2.05547 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.015e-01  1.044e-01   0.973    0.338    
## bodywt      9.604e-04  6.504e-05  14.765 2.37e-16 ***
## category    3.018e-02  1.489e-01   0.203    0.841    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4393 on 34 degrees of freedom
## Multiple R-squared:  0.8706, Adjusted R-squared:  0.863 
## F-statistic: 114.4 on 2 and 34 DF,  p-value: 7.979e-16

I am back to the same problem I had with the previous model. I wonder, can i make the model work IF I remove the highest values as they are so far outside the others, it makes the model unworkable.

corr.df.2 <- corr.df %>%
  filter(brainwt < 1)

msleep_lm_2 <- lm(brainwt ~ bodywt + category, data = corr.df.2)
summary (msleep_lm_2)

## 
## Call:
## lm(formula = brainwt ~ bodywt + category, data = corr.df.2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19411 -0.05624 -0.02607  0.04184  0.32791 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0288100  0.0253368   1.137    0.264    
## bodywt      0.0009805  0.0001264   7.754 9.51e-09 ***
## category    0.0321025  0.0344244   0.933    0.358    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09589 on 31 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6452 
## F-statistic: 31.01 on 2 and 31 DF,  p-value: 4.013e-08

The values do come out to be much cleaner. How does the rest of the model look if I add in the previous quadratic values?

corr.df.2 <- corr.df %>%
  mutate(awake2 = msleep.analysis$awake^2)%>%
  mutate(awake = msleep.analysis$awake)%>%
  filter(brainwt < 1)
  
msleep_lm_2 <- lm(brainwt ~ bodywt + category + awake + awake2, data = corr.df.2)
summary (msleep_lm_2)

## 
## Call:
## lm(formula = brainwt ~ bodywt + category + awake + awake2, data = corr.df.2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.15846 -0.06163 -0.01495  0.02705  0.32641 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0072615  0.1666929   0.044    0.966    
## bodywt       0.0008070  0.0001518   5.315 1.06e-05 ***
## category     0.0567448  0.0357596   1.587    0.123    
## awake       -0.0094590  0.0259036  -0.365    0.718    
## awake2       0.0006980  0.0009572   0.729    0.472    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0923 on 29 degrees of freedom
## Multiple R-squared:  0.7111, Adjusted R-squared:  0.6712 
## F-statistic: 17.84 on 4 and 29 DF,  p-value: 1.716e-07

msleep_lm_2 <- lm(brainwt ~ bodywt + category + awake2, data = corr.df.2)
summary (msleep_lm_2)

## 
## Call:
## lm(formula = brainwt ~ bodywt + category + awake2, data = corr.df.2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.16291 -0.06107 -0.00746  0.02560  0.32217 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0512863  0.0449404  -1.141   0.2628    
## bodywt       0.0008260  0.0001405   5.877 1.97e-06 ***
## category     0.0536134  0.0342110   1.567   0.1276    
## awake2       0.0003540  0.0001678   2.109   0.0434 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09096 on 30 degrees of freedom
## Multiple R-squared:  0.7098, Adjusted R-squared:  0.6807 
## F-statistic: 24.45 on 3 and 30 DF,  p-value: 3.337e-08

Removing the outliers to define the scope of the model does allow for the awake value to become significant so long as the quadratic is included.

corr.df.2%>%
  ggplot(aes(bodywt, brainwt))+
  geom_point()+
  geom_smooth(method = lm, se = F)+
  labs(title = "Original Data", x="Body Weight", y="Brain Weight")+
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

msleep_lm_2 %>%
  ggplot(aes(fitted(msleep_lm_2),resid(msleep_lm_2)))+
  geom_point()+
  geom_smooth(method = lm, se = F)+
  labs(title = "Residual Data", x="Fitted", y="Residual")+
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

Again. the outliers on the Residual data play havoc

msleep_lm_2 %>%
  ggplot(aes(sample=resid(msleep_lm_2)))+
  stat_qq()+
  stat_qq_line()+
  labs(title = "Q-Q Plot")+
  theme_minimal()

Again, using linear regression on this dataset, even removing the outliers to state the model is accurate IF the Brain Weight is less than 1lb.

Data 605 - Discussion 13

John Kellogg

4/22/2020

Discussion 13