Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
summary(msleep)
## name genus vore
## Length:83 Length:83 Length:83
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## order conservation sleep_total sleep_rem
## Length:83 Length:83 Min. : 1.90 Min. :0.100
## Class :character Class :character 1st Qu.: 7.85 1st Qu.:0.900
## Mode :character Mode :character Median :10.10 Median :1.500
## Mean :10.43 Mean :1.875
## 3rd Qu.:13.75 3rd Qu.:2.400
## Max. :19.90 Max. :6.600
## NA's :22
## sleep_cycle awake brainwt bodywt
## Min. :0.1167 Min. : 4.10 Min. :0.00014 Min. : 0.005
## 1st Qu.:0.1833 1st Qu.:10.25 1st Qu.:0.00290 1st Qu.: 0.174
## Median :0.3333 Median :13.90 Median :0.01240 Median : 1.670
## Mean :0.4396 Mean :13.57 Mean :0.28158 Mean : 166.136
## 3rd Qu.:0.5792 3rd Qu.:16.15 3rd Qu.:0.12550 3rd Qu.: 41.750
## Max. :1.5000 Max. :22.10 Max. :5.71200 Max. :6654.000
## NA's :51 NA's :27
head(msleep)
## # A tibble: 6 x 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Chee~ Acin~ carni Carn~ lc 12.1 NA NA
## 2 Owl ~ Aotus omni Prim~ <NA> 17 1.8 NA
## 3 Moun~ Aplo~ herbi Rode~ nt 14.4 2.4 NA
## 4 Grea~ Blar~ omni Sori~ lc 14.9 2.3 0.133
## 5 Cow Bos herbi Arti~ domesticated 4 0.7 0.667
## 6 Thre~ Brad~ herbi Pilo~ <NA> 14.4 2.2 0.767
## # ... with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
Can I predict the Brain Weight of a mammal based on the Body weight? Does the awake have an effect? Does the eating habits have an effect?
Vore will become my dichotomous value as I will pair down the dataset to only the two highest vore types.
It may make sense the amount of time the body is awake could have an effect. I will create a quadratic variable then analyze the difference.
msleep.analysis <- msleep%>%
select(name, genus, vore, sleep_total, awake, brainwt,bodywt)%>%
filter(!is.na(brainwt))%>%
filter(!is.na(vore))%>%
filter(vore %in% c("herbi","omni"))%>%
mutate(category = ifelse(vore == "herbi",0,1))
corr.df <- msleep.analysis%>%
select(sleep_total, awake, brainwt,bodywt,category)
corr.data <-cor(corr.df)
corrplot(corr.data, type = "lower", order="hclust", tl.col = "black", tl.srt = 45, method = "number")
ggpairs(
corr.df,
lower = list(continuous = ggally_points, combo = ggally_dot_no_facet)
)
corr.df <- corr.df %>%
mutate(awake2 = awake^2)
msleep_lm <- lm(brainwt ~ bodywt + awake + sleep_total + awake2 + category, data = corr.df)
summary (msleep_lm)
##
## Call:
## lm(formula = brainwt ~ bodywt + awake + sleep_total + awake2 +
## category, data = corr.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72732 -0.14347 -0.05026 0.01966 2.00405
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.062e-01 7.662e-01 -0.139 0.891
## bodywt 9.294e-04 7.114e-05 13.065 2.26e-14 ***
## awake -4.823e-03 1.167e-01 -0.041 0.967
## sleep_total NA NA NA NA
## awake2 1.077e-03 4.176e-03 0.258 0.798
## category 1.115e-01 1.679e-01 0.664 0.512
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4413 on 32 degrees of freedom
## Multiple R-squared: 0.8771, Adjusted R-squared: 0.8617
## F-statistic: 57.1 on 4 and 32 DF, p-value: 4.07e-14
Awake does not drop the pvalue significantly enough for us to say it has an effect on the analysis
msleep_lm <- lm(brainwt ~ bodywt + category, data = corr.df)
summary (msleep_lm)
##
## Call:
## lm(formula = brainwt ~ bodywt + category, data = corr.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77970 -0.12703 -0.09761 -0.01868 2.05547
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.015e-01 1.044e-01 0.973 0.338
## bodywt 9.604e-04 6.504e-05 14.765 2.37e-16 ***
## category 3.018e-02 1.489e-01 0.203 0.841
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4393 on 34 degrees of freedom
## Multiple R-squared: 0.8706, Adjusted R-squared: 0.863
## F-statistic: 114.4 on 2 and 34 DF, p-value: 7.979e-16
I am back to the same problem I had with the previous model. I wonder, can i make the model work IF I remove the highest values as they are so far outside the others, it makes the model unworkable.
corr.df.2 <- corr.df %>%
filter(brainwt < 1)
msleep_lm_2 <- lm(brainwt ~ bodywt + category, data = corr.df.2)
summary (msleep_lm_2)
##
## Call:
## lm(formula = brainwt ~ bodywt + category, data = corr.df.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.19411 -0.05624 -0.02607 0.04184 0.32791
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0288100 0.0253368 1.137 0.264
## bodywt 0.0009805 0.0001264 7.754 9.51e-09 ***
## category 0.0321025 0.0344244 0.933 0.358
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09589 on 31 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6452
## F-statistic: 31.01 on 2 and 31 DF, p-value: 4.013e-08
The values do come out to be much cleaner. How does the rest of the model look if I add in the previous quadratic values?
corr.df.2 <- corr.df %>%
mutate(awake2 = msleep.analysis$awake^2)%>%
mutate(awake = msleep.analysis$awake)%>%
filter(brainwt < 1)
msleep_lm_2 <- lm(brainwt ~ bodywt + category + awake + awake2, data = corr.df.2)
summary (msleep_lm_2)
##
## Call:
## lm(formula = brainwt ~ bodywt + category + awake + awake2, data = corr.df.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.15846 -0.06163 -0.01495 0.02705 0.32641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0072615 0.1666929 0.044 0.966
## bodywt 0.0008070 0.0001518 5.315 1.06e-05 ***
## category 0.0567448 0.0357596 1.587 0.123
## awake -0.0094590 0.0259036 -0.365 0.718
## awake2 0.0006980 0.0009572 0.729 0.472
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0923 on 29 degrees of freedom
## Multiple R-squared: 0.7111, Adjusted R-squared: 0.6712
## F-statistic: 17.84 on 4 and 29 DF, p-value: 1.716e-07
msleep_lm_2 <- lm(brainwt ~ bodywt + category + awake2, data = corr.df.2)
summary (msleep_lm_2)
##
## Call:
## lm(formula = brainwt ~ bodywt + category + awake2, data = corr.df.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.16291 -0.06107 -0.00746 0.02560 0.32217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0512863 0.0449404 -1.141 0.2628
## bodywt 0.0008260 0.0001405 5.877 1.97e-06 ***
## category 0.0536134 0.0342110 1.567 0.1276
## awake2 0.0003540 0.0001678 2.109 0.0434 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09096 on 30 degrees of freedom
## Multiple R-squared: 0.7098, Adjusted R-squared: 0.6807
## F-statistic: 24.45 on 3 and 30 DF, p-value: 3.337e-08
Removing the outliers to define the scope of the model does allow for the awake value to become significant so long as the quadratic is included.
corr.df.2%>%
ggplot(aes(bodywt, brainwt))+
geom_point()+
geom_smooth(method = lm, se = F)+
labs(title = "Original Data", x="Body Weight", y="Brain Weight")+
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
msleep_lm_2 %>%
ggplot(aes(fitted(msleep_lm_2),resid(msleep_lm_2)))+
geom_point()+
geom_smooth(method = lm, se = F)+
labs(title = "Residual Data", x="Fitted", y="Residual")+
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
Again. the outliers on the Residual data play havoc
msleep_lm_2 %>%
ggplot(aes(sample=resid(msleep_lm_2)))+
stat_qq()+
stat_qq_line()+
labs(title = "Q-Q Plot")+
theme_minimal()
Again, using linear regression on this dataset, even removing the outliers to state the model is accurate IF the Brain Weight is less than 1lb.