The goal of this document is to use R as a tool for creating and interpreting a multiple linear regression model. These models differ from a simple linear regression model in that they include more than one predictor. The multiple linear regression model follows this format \[y = \beta_0 + \beta_1x_1 + \dots + \beta_kx_k + \epsilon\] where \(\beta_1, \beta_2, \dots, \beta_k\) represent the linear relationships between \(x_1, x_2, \dots, x_k\) and the response \(y\), respectively, and \(\epsilon\) is the error term. We’ll dive into the details and mechanics of the model later, but for now we’ll call up a data set to help us explore this type of model.
For this R guide, we used data from the “alr3” package in R. We loaded the “sleep1” data set, which has 11 variables for 62 animal species. The values in the data set are recorded as species averages; for example, body weight is the average body weight for mammals of that species. The variables we will look at for this R guide are body weight, brain weight, and maximum life span. Maximum life span is how many years a member of that species could live: if the maximum life span of humans is 100 years, we would say a human could live up to 100 years. Body weight is measured in kilograms, brain weight in grams, and life span in years. We use this set of code to pull the “sleep1” data from the “alr3” library.
#install.packages("alr3")
library(alr3)
## Warning: package 'alr3' was built under R version 3.3.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.3.3
data(sleep1)
attach(sleep1)
head(sleep1)
Notice how we attached our data with the attach() function, just like in the simple regression guide, so that we can call our variables without referencing the whole data set each time. It is worth noting that 4 species are missing data for brain weight, body weight, and/or life span. That is okay, because once we have a model we could even use it to predict values for those species.
For this R Guide, we can ignore all the variables except the body weight, brain weight and life span variables.
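To see exactly which species are missing one of these three values, we can ask R for the incomplete rows. This is just a quick check using base R’s complete.cases() on the three columns of interest.
# rows of sleep1 with a missing body weight, brain weight, or life span
sleep1[!complete.cases(sleep1[, c("BodyWt", "BrainWt", "Life")]), ]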
The term multicollinearity refers to dependence between predictors in a multiple linear regression model. Severe multicollinearity inflates the variance of the least squares point estimates \(\hat{\beta_1}\) and \(\hat{\beta_2}\), so they can land far from the true regression parameters and give a poor picture of the true linear relationships between the predictors and the response.
Variance inflation factors (VIFs) are one way to test for severe multicollinearity. The VIF for a predictor is calculated from the coefficient of determination of a linear regression model that relates that predictor to the set of all the others. In our case, we have only two predictors, so our VIF calculation only requires the \(R^2\) value of one model.
multicolmod<-lm(BodyWt ~ BrainWt)
summary(multicolmod)
##
## Call:
## lm(formula = BodyWt ~ BrainWt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1552.25 -8.00 47.36 55.10 1553.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -56.85555 42.97805 -1.323 0.191
## BrainWt 0.90291 0.04453 20.278 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 323.5 on 60 degrees of freedom
## Multiple R-squared: 0.8727, Adjusted R-squared: 0.8705
## F-statistic: 411.2 on 1 and 60 DF, p-value: < 2.2e-16
As shown above, the \(R^2\) of the model that relates a species’ body weight to its brain weight is 0.8727. To calculate the VIF, we use this formula: \[VIF = (1-R^2)^{-1}\]
1/(1-0.8727)
## [1] 7.85546
In practice, we consider multicollinearity to be an issue if the largest VIF is greater than 10. Since our only VIF is 7.85546, we do not consider multicollinearity to be an issue between body weight and brain weight. This is a bit surprising, since the brain is literally part of the body: intuitively, a species’ average body weight should track its average brain weight closely, so we might have expected the two to be almost completely correlated and to vary roughly proportionately.
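As a quick cross-check, the car package (loaded automatically along with alr3) provides a vif() function that computes variance inflation factors directly from a fitted multiple regression model. The values may differ slightly from our hand calculation above because the model below drops the four species with a missing life span.
# VIFs from the car package; with only two predictors, both VIFs are identical
vif(lm(Life ~ BodyWt + BrainWt))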
A groupwise-means model does not apply to the data we are using in this R Guide, but it is worth mentioning because we talked a lot about it in class. In a groupwise-means model, one of the predictors is categorical and is coded with a numeric value (such as 1 or 0). For example, a data set about athletes might record whether an individual plays football (1) or runs cross country (0). The general form of that model would look like this: \[\widehat{weight} = \hat{\beta_{0}} + \hat{\beta_{1}} (height) + \hat{\beta_{2}} (sport)\] where sport is 1 for football players and 0 for cross country runners. We did this in class, so I just wanted to address a categorical variable, since my example only uses quantitative predictor variables; a small sketch of what the R code could look like follows below.
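As a purely hypothetical sketch (the sports data, variable names, and numbers below are invented for illustration and are not part of sleep1), fitting such a model only requires coding the categorical predictor as a 0/1 indicator or a factor:
# hypothetical data: weight (lbs), height (inches), sport (1 = football, 0 = cross country)
sports <- data.frame(weight = c(210, 165, 190, 150, 225, 140),
                     height = c(74, 70, 72, 68, 76, 67),
                     sport  = c(1, 0, 1, 0, 1, 0))
groupMod <- lm(weight ~ height + sport, data = sports)
coef(groupMod)  # the coefficient on sport shifts the intercept between the two groups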
We can use R to create a multiple linear regression equation that takes both brain weight and body weight into consideration when predicting maximum life span. Rather than predicting the maximum life span of a species from brain weight or body weight alone, the multiple linear regression model uses both of these quantitative predictors to generate a prediction for the maximum life span. Both predictors are quantitative because they are continuous, numeric variables.
myMultipleLinReg <- lm(Life ~ BodyWt + BrainWt)
summary(myMultipleLinReg)
##
## Call:
## lm(formula = Life ~ BodyWt + BrainWt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.823 -9.460 -2.710 7.507 41.776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.333955 1.843452 8.318 2.65e-11 ***
## BodyWt -0.026633 0.005263 -5.061 4.99e-06 ***
## BrainWt 0.033744 0.005094 6.624 1.56e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.18 on 55 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.4946, Adjusted R-squared: 0.4763
## F-statistic: 26.92 on 2 and 55 DF, p-value: 7.064e-09
Referring back to the introduction, remember that the general equation for a multiple linear regression model with two individual predictors is: \[\widehat{response} = \hat{\beta_{0}} + \hat{\beta_{1}} (predictor1) + \hat{\beta_{2}} (predictor2)\]
In this equation, \(\hat{\beta_{0}}\) is the intercept, \(\hat{\beta_{1}}\) is the slope coefficient for the first predictor variable, and \(\hat{\beta_{2}}\) is the slope coefficient for the second predictor variable.
We can see from the summary that in our case \(\hat{\beta_{0}}\) = 15.333955 years, \(\hat{\beta_{1}}\) = -0.026633 and \(\hat{\beta_{2}}\) = 0.033744. This means that our multiple linear regression equation will look like this: \[\widehat{Maximum Life Span} = 15.333955 - 0.026633 * (Body Weight) + 0.033744 *(Brain Weight)\]
The intercept in this model, \(\hat{\beta_{0}}\), cannot be meaningfully interpreted because no animal species has a body weight of 0 kilograms and a brain weight of 0 grams.
The slope coefficient on body weight, \(\hat{\beta_{1}}\), means that if the average body weight of a species increased by 1 kg while its brain weight stayed the same, we would expect the maximum life span of that species to decrease by 0.026633 years on average. We can set up our hypothesis test to be: \[\sf{H_{0}} : \beta_{1} = 0\] \[\sf{H_{1}} : \beta_{1} \neq 0\]
We can look at the summary of this model and see that the p-value on this coefficient is 4.99*(10^-6). Since the p-value is less than .05, we reject the null hypothesis and conclude that body weight is a significant predictor in our model.
The slope coefficient on brain weight, \(\hat{\beta_{2}}\), means that if the average brain weight of a species increased by one gram while its body weight stayed the same, the maximum life span would be expected to increase by 0.033744 years on average. We can set up our hypothesis test to be: \[\sf{H_{0}} : \beta_{2} = 0\] \[\sf{H_{1}} : \beta_{2} \neq 0\]
We can look at the summary of this model and see that the p-value on this coefficient is 1.56*(10^-8). Since the p-value is less than .05, we reject the null hypothesis and conclude that brain weight is a significant predictor in our model.
The final multiple linear regression model is \[\widehat{Maximum Life Span} = 15.333955 - 0.026633 (Body Weight) + 0.033744 (Brain Weight)\]
We see from the summary that the Multiple R-squared value is .4946. This means that about 49.46% of the variation in maximum life span can be attributed to the brain weight and body weight variables. Later we will look at polynomial regression models to try to increase this number; all else being equal, a higher \(R^2\) means the model explains more of the variation in maximum life span from the brain weight and body weight of the species.
We can also see from the summary that the p-value on this model is 7.064*(10^-9). This p-value is essentially zero, which means the model as a whole is significant.
We can also use R to create a multiple linear regression model that has an interaction term. The general command for this model is lm(y ~ x1 * x2). We want to know whether the relationship between \(y\) and \(x_1\) depends on the value (or category) of \(x_2\).
In our case, we want to know if the relationship between maximum life span and brain weight depends on the value of body weight. Our final model will look like this: \[\widehat{Maximum Life Span} = \hat{\sf{\beta _ {0}}} + \hat{\sf{\beta _ {1}}} (Body Weight) + \hat{\sf{\beta _ {2}}} (Brain Weight) + \hat{\sf{\beta _ {3}}} (Body Weight * Brain Weight)\]
The interaction term is the last term, (Body Weight * Brain Weight). We want to test this model to see if the slope coefficient, \(\hat{\sf{\beta _ {3}}}\) ,is significant. If it is significant, that would mean that the relationship between maximum life span and brain weight does depend on body weight. If it is not significant, that would mean that the relationship between maximum life span and brain weight does not depend on body weight.
myInteractionMod <- lm(Life ~ BodyWt * BrainWt)
myInteractionMod
##
## Call:
## lm(formula = Life ~ BodyWt * BrainWt)
##
## Coefficients:
## (Intercept) BodyWt BrainWt BodyWt:BrainWt
## 1.453e+01 -6.374e-04 2.985e-02 -4.013e-06
We see from this command that the interaction model is: \[\widehat{Maximum Life Span} = 14.53 - 0.0006374 (Body Weight) + 0.02985 (Brain Weight) - 0.000004013(Body Weight * Brain Weight)\]
Now, I am going to demonstrate the T-Distribution Test. In order to do this, I am going to use the first multiple linear regression model I created and the multiple linear regression model with an interaction term.
I will use the T-Test to determine whether the interaction term is significant to this model.
The multiple linear regression model is: \[\widehat{Maximum Life Span} = 15.333955 - 0.026633 (Body Weight) + 0.033744(Brain Weight)\]
The multiple linear regression with an interaction term is: \[\widehat{Maximum Life Span} = 14.53 - 0.0006374 (Body Weight) + 0.02985 (Brain Weight) - 0.000004013 (Body Weight * Brain Weight)\]
In order to do the T-Test, we use the interaction model. We run the summary() function on the interaction model and look at the p-value next to the interaction term coefficient. We set up the hypothesis test for the slope parameter on the interaction term, \(\beta_{3}\), as follows: \[\sf{H_{0}} : \beta_{3} = 0\] \[\sf{H_{1}} : \beta_{3} \neq 0\]
Recall that \(\hat{\sf{\beta _ {3}}}\) is the slope coefficient in front of the interaction term. So this is the coefficient of interest when running the T-Distribution Test.
summary(myInteractionMod)
##
## Call:
## lm(formula = Life ~ BodyWt * BrainWt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.281 -9.020 -2.012 7.303 46.433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.453e+01 1.916e+00 7.580 4.71e-10 ***
## BodyWt -6.374e-04 1.927e-02 -0.033 0.974
## BrainWt 2.985e-02 5.762e-03 5.181 3.37e-06 ***
## BodyWt:BrainWt -4.013e-06 2.864e-06 -1.401 0.167
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.06 on 54 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.5124, Adjusted R-squared: 0.4853
## F-statistic: 18.91 on 3 and 54 DF, p-value: 1.639e-08
We can see from these results that the p-value on the interaction term is .167. This is not less than .05, so we fail to reject the null hypothesis. Thus, we can conclude that the interaction term is not significant in our model.
Using the summary above, the Multiple R-Squared value for our interaction model is 0.5124. This means that 51.24% of the variation in maximum life span can be explained by our predictors. This is slightly higher than the 49.46% we observed in our model without the interaction term. Thus, the model with the interaction term is slightly better, but not significantly better as seen from the hypothesis test above. We can also see from the summary that the p-value on this model is 1.639*(10^-8). This p-value is essentially zero which means our model is a significant model.
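An equivalent way to compare the models with and without the interaction term is a partial F-test using R’s anova() function on the two nested models. For a single added term it reproduces the t-test above (the F statistic is the square of the t statistic), so the call below is only a sketch of the more general approach for comparing nested models.
# partial F-test: does adding the interaction term significantly improve the fit?
anova(myMultipleLinReg, myInteractionMod)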
We can work off of our multiple linear regression model and add polynomial terms to it.
First, I will start by making a quadratic model. I will square the predictors and look at their significance by using the summary function.
brainS <- (BrainWt)^2
bodyS <- (BodyWt) ^2
myQuadMod <- lm(Life ~ BrainWt + brainS + BodyWt + bodyS)
summary(myQuadMod)
##
## Call:
## lm(formula = Life ~ BrainWt + brainS + BodyWt + bodyS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.006 -6.160 -0.564 3.283 37.264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.069e+01 1.409e+00 7.589 5.09e-10 ***
## BrainWt 8.756e-02 7.999e-03 10.946 3.24e-15 ***
## brainS -1.247e-05 1.605e-06 -7.771 2.60e-10 ***
## BodyWt -4.530e-02 1.329e-02 -3.408 0.001258 **
## bodyS 5.334e-06 1.392e-06 3.832 0.000338 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.972 on 53 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.7742, Adjusted R-squared: 0.7572
## F-statistic: 45.43 on 4 and 53 DF, p-value: < 2.2e-16
We can see from the output that: \(\hat{\sf{\beta _ {0}}} = 10.690\) is the intercept, \(\hat{\sf{\beta _ {1}}} = 0.08756\) is the slope coefficient on brain weight, \(\hat{\sf{\beta _ {2}}} = -0.00001247\) is the slope coefficient on brain weight squared, \(\hat{\sf{\beta _ {3}}}= -0.04530\) is the slope coefficient on body weight, and \(\hat{\sf{\beta _ {4}}} = 0.000005334\) is the slope coefficient on body weight squared.
Thus, the final equation for the quadratic linear regression equation is: \[\widehat{Maximum Life Span} = 10.690 + 0.08756 (Brain Weight) - 0.00001247 (Brain Weight)^2 - 0.04530 (Body Weight) + 0.000005334(Body Weight)^2\]
We want to use a hypothesis test to see if the brain weight squared coefficient is significant. We can set up the hypothesis test for the brain weight squared coefficient, \(\beta_{2}\): \[\sf{H _ {0}} : \beta_{2} = 0 \] \[\sf{H _ {1}} : \beta_{2} \neq 0\]
We can see from the summary that the p-value on the brain weight squared term is 2.60*(10^-10). We know that this is less than .05, and we can therefore reject the null hypothesis. This means that the brain weight squared term is significant to our model.
We can set up the hypothesis test for the body weight squared coefficient, \(\beta_{4}\): \[\sf{H _ {0}} : \beta_{4} = 0\] \[\sf{H _ {1}} : \beta_{4} \neq 0\]
We can see from the summary that the p-value on the body weight squared term is .000338. We know that this is less than .05, and we can therefore reject the null hypothesis. This means that the body weight squared term is significant to our model.
Since the squared terms are significant in the model, we must also keep all lower-order terms in the model. Thus, we automatically know we will keep the brain weight and body weight predictors in our model.
We can see from the summary function that the Multiple R-squared is .7742. This means that 77.42% of the variation in maximum life span can be explained by the variables in our model. In the original multiple linear regression model, that number was 49.46%. Since the amount of variation explained is so much higher, we can conclude that this quadratic polynomial model is a substantial improvement over the original multiple linear regression model.
The final equation for the multiple linear regression quadratic polynomial model is: \[\widehat{(Maximum Life Span)} = 10.69 + 0.08756 (Brain Weight) - 0.00001247 (Brain Weight)^2 - 0.0453 (Body Weight) + 0.000005334 (Body Weight)^2\]
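As a side note, R can fit the same quadratic model without creating the squared variables by hand, by wrapping the squared terms in I() inside the formula. This is just an equivalent way of writing myQuadMod, so the coefficients should match the summary above.
# identical quadratic fit written with I() instead of precomputed squared columns
myQuadMod2 <- lm(Life ~ BrainWt + I(BrainWt^2) + BodyWt + I(BodyWt^2))
coef(myQuadMod2)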
After creating the quadratic polynomial model, we can look at creating a cubic polynomial model. This involves cubing the predictor variables.
brainC <- (BrainWt)^3
bodyC <- (BodyWt)^3
myCubedMod <- lm(Life ~BrainWt + brainS + brainC + BodyWt + bodyS + bodyC)
summary(myCubedMod)
##
## Call:
## lm(formula = Life ~ BrainWt + brainS + brainC + BodyWt + bodyS +
## bodyC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.940 -5.461 -0.363 1.978 37.646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.892e+00 1.563e+00 6.327 6.27e-08 ***
## BrainWt 1.030e-01 2.620e-02 3.930 0.000256 ***
## brainS -3.435e-05 2.375e-05 -1.446 0.154264
## brainC 6.077e-09 4.716e-09 1.289 0.203377
## BodyWt -3.007e-02 5.348e-02 -0.562 0.576433
## bodyS -4.552e-05 9.096e-05 -0.500 0.618910
## bodyC 5.582e-09 1.074e-08 0.520 0.605549
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9 on 51 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.7813, Adjusted R-squared: 0.7556
## F-statistic: 30.37 on 6 and 51 DF, p-value: 3.317e-15
We can see from the output that:
\(\sf{\hat{\beta_{0}}} = 9.892\) is the intercept, \(\sf{\hat{\beta_{1}}} = 0.1030\) is the slope coefficient on brain weight, \(\sf{\hat{\beta_{2}}} = -0.00003435\) is the slope coefficient on brain weight squared, \(\sf{\hat{\beta_{3}}} = 0.000000006077\) is the slope coefficient on brain weight cubed, \(\sf{\hat{\beta_{4}}} = -0.03007\) is the slope coefficient on body weight, \(\sf{\hat{\beta_{5}}} = -0.00004552\) is the slope coefficient on body weight squared, and \(\sf{\hat{\beta_{6}}} = 0.000000005582\) is the slope coefficient on body weight cubed.
Thus, the final equation for the cubic regression model is: \[\widehat{Maximum Life Span} = 9.892 + 0.1030 (Brain Weight) - 0.00003435 (Brain Weight)^2 + 0.000000006077 (Brain Weight)^3 - 0.03007 (Body Weight) - 0.00004552 (Body Weight)^2 + 0.000000005582 (Body Weight)^3\]
Similar to how we analyzed the quadratic model, we can analyze the cubic model. We want to know if the brain weight cubed coefficient is significant to our model. We can set up the hypothesis test for the brain weight cubed coefficient, \(\beta_{3}\): \[\sf{H _ {0}} : \beta_{3} = 0\] \[\sf{H _ {1}} : \beta_{3} \neq 0\]
We can see from the summary that the p-value on brain weight cubed is .203377 which is not less than .05. Thus, we fail to reject the null hypothesis and conclude that the brain weight cubed value is not significant to our model.
We can set up the hypothesis test for the body weight cubed coefficient, \(\beta_{6}\): \[\sf{H _ {0}} : \beta_{6} = 0\] \[\sf{H _ {1}} : \beta_{6} \neq 0\]
We can see from the summary that the p-value on body weight cubed is .605549 which is not less than .05. Thus, we fail to reject the null hypothesis and conclude that the body weight cubed value is not significant to our model.
Should we decide to build our models only out of polynomial terms, it would be useful to be able to compare many candidate polynomial models automatically. The stepAIC() function in the MASS package does this: starting from a large model, it removes terms one at a time, keeping the change that most reduces the AIC, and stops when no removal improves it.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
brain2 <- (BrainWt)^2
body2 <- (BodyWt)^2
brain3 <- (BrainWt)^3
body3 <- (BodyWt)^3
brain4 <- (BrainWt)^4
body4 <- (BodyWt)^4
brain5 <- (BrainWt)^5
body5 <- (BodyWt)^5
BigMod <- lm( Life ~ BrainWt + brain2 + brain3 + brain4 + BodyWt + body2 + body3 + body4)
stepAIC(BigMod)
## Start: AIC=262.31
## Life ~ BrainWt + brain2 + brain3 + brain4 + BodyWt + body2 +
## body3 + body4
##
## Df Sum of Sq RSS AIC
## - body3 1 0.50 3915.7 260.31
## - body4 1 0.69 3915.8 260.32
## - body2 1 1.85 3917.0 260.33
## - BodyWt 1 24.21 3939.4 260.66
## - brain4 1 80.65 3995.8 261.49
## <none> 3915.2 262.31
## - brain3 1 211.55 4126.7 263.36
## - brain2 1 280.57 4195.7 264.32
## - BrainWt 1 976.28 4891.4 273.22
##
## Step: AIC=260.31
## Life ~ BrainWt + brain2 + brain3 + brain4 + BodyWt + body2 +
## body4
##
## Df Sum of Sq RSS AIC
## - body2 1 17.48 3933.1 258.57
## - BodyWt 1 81.02 3996.7 259.50
## - body4 1 92.29 4007.9 259.66
## <none> 3915.7 260.31
## - brain4 1 213.70 4129.4 261.39
## - brain3 1 230.48 4146.1 261.63
## - brain2 1 280.19 4195.8 262.32
## - BrainWt 1 1051.54 4967.2 272.11
##
## Step: AIC=258.57
## Life ~ BrainWt + brain2 + brain3 + brain4 + BodyWt + body4
##
## Df Sum of Sq RSS AIC
## <none> 3933.1 258.57
## - BodyWt 1 158.79 4091.9 258.87
## - brain4 1 218.29 4151.4 259.70
## - body4 1 224.52 4157.7 259.79
## - brain3 1 233.42 4166.6 259.92
## - brain2 1 308.53 4241.7 260.95
## - BrainWt 1 1749.94 5683.1 277.92
##
## Call:
## lm(formula = Life ~ BrainWt + brain2 + brain3 + brain4 + BodyWt +
## body4)
##
## Coefficients:
## (Intercept) BrainWt brain2 brain3 brain4
## 8.914e+00 1.487e-01 -2.159e-04 1.501e-07 -2.381e-11
## BodyWt body4
## -3.135e-02 1.943e-12
The stepwise procedure first drops the body weight cubed term, then the body weight squared term, and then stops because no further removal lowers the AIC. The selected model keeps brain weight through its fourth power along with body weight and body weight to the fourth power, and its coefficients are printed at the bottom of the output. A model built out of high powers of the predictors is hard to interpret, though, so next we try a different kind of transformation.
To explore the linear relationships between a species’ life span and its body weight and brain weight further, we can apply logarithmic transformations to the response and/or predictors of our regression model. By now we are comfortable with the functions to create a linear model, so let’s call this one “logTransformMod” and transform \(\widehat{Maximum Life Span}\) to \(\widehat{log(Maximum Life Span)}\). To do this we simply substitute the natural log of our response (R’s log() function uses the natural log by default) for the regular response in the lm() function input.
logTransformMod <- lm(log(Life) ~ BodyWt + BrainWt)
summary(logTransformMod)
##
## Call:
## lm(formula = log(Life) ~ BodyWt + BrainWt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7318 -0.6176 0.1493 0.6888 1.4619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4246210 0.1170774 20.710 < 2e-16 ***
## BodyWt -0.0007953 0.0003342 -2.379 0.02083 *
## BrainWt 0.0011160 0.0003235 3.450 0.00109 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8368 on 55 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.241, Adjusted R-squared: 0.2134
## F-statistic: 8.733 on 2 and 55 DF, p-value: 0.0005084
There are a number of statistics worth taking note of in this summary. First of all, based on the p-values listed, both the average body weight and the average brain weight of a species are individually significant linear predictors of the log of the species’ maximum life span.
Interpreting the point estimates of the regression coefficients is less intuitive here because the response is on the log scale: each coefficient describes the change in log(maximum life span) per one-unit change in that predictor, holding the other fixed. Out of curiosity, we can also take the log of the predictors and test for a linear relationship between the transformed predictors and the transformed response.
logTransformMod2 <- lm(log(Life) ~ log(BodyWt) + log(BrainWt))
summary(logTransformMod2)
##
## Call:
## lm(formula = log(Life) ~ log(BodyWt) + log(BrainWt))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2188 -0.2944 -0.1341 0.2181 1.9097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.13935 0.24296 4.690 1.85e-05 ***
## log(BodyWt) -0.18798 0.08325 -2.258 0.0279 *
## log(BrainWt) 0.53138 0.10643 4.993 6.36e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5609 on 55 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.659, Adjusted R-squared: 0.6466
## F-statistic: 53.14 on 2 and 55 DF, p-value: 1.417e-13
The AIC function in R can help us determine which models are more useful. To do this, we create a list of “candidate” models that include different combinations of predictor variables. The first model, the simplest, has no predictors. The second model uses just the average body weight as the predictor, and the third uses average brain weight alone. The fourth model is the multiple linear regression model without an interaction between average brain and body weights, and the fifth is the multiple regression model with the interaction term. The sixth model is the quadratic regression model and the seventh is the cubic regression model. The eighth model applies the log transform to the average body weight predictor only, the ninth applies it to the average brain weight predictor only, and the tenth applies it to both predictors. Finally, the eleventh model applies the log transform to the response variable as well as both of the predictors.
brainS <- (BrainWt)^2
bodyS <- (BodyWt)^2
brainC <- (BrainWt)^3
bodyC <- (BodyWt)^3
one <- lm(Life ~ 1)
two <- lm(Life ~ BodyWt)
three <- lm(Life ~ BrainWt)
four <- lm(Life ~ BodyWt + BrainWt)
five <- lm(Life ~ BrainWt * BodyWt)
six <- lm(Life ~ BrainWt + brainS + BodyWt + bodyS)
seven <- lm(Life ~ BrainWt + brainS + brainC + BodyWt + bodyS + bodyC)
eight <- lm((Life) ~ (BrainWt) + log(BodyWt))
nine <- lm((Life) ~ log(BrainWt) + (BodyWt))
ten <- lm(Life ~ log(BrainWt) + log(BodyWt) )
eleven <- lm(log(Life) ~ log(BrainWt) + log(BodyWt))
After creating all of these models, we can run the AIC function on them. The AIC function takes a set of models as its input, evaluates each one, and returns its AIC value. The model with the lowest AIC value is preferred; when two models have similar AIC values, we prefer the simpler one. The winner is our best candidate for predicting a species’ life span from its average brain weight and body weight.
AIC(one, two, three, four, five, six, seven, eight, nine, ten, eleven)
We can see from the output that the model with the lowest AIC value is the eleventh model. Since this is the model where we took the log of the response variable as well as the log of the predictor variables, it is relatively simple compared to the polynomial models. Its AIC value is also noticeably smaller than the AIC values of all the other candidates. Thus, we select the eleventh model as our best model:
myBestModel <- lm(log(Life) ~ log(BrainWt) + log(BodyWt))
summary(myBestModel)
##
## Call:
## lm(formula = log(Life) ~ log(BrainWt) + log(BodyWt))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2188 -0.2944 -0.1341 0.2181 1.9097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.13935 0.24296 4.690 1.85e-05 ***
## log(BrainWt) 0.53138 0.10643 4.993 6.36e-06 ***
## log(BodyWt) -0.18798 0.08325 -2.258 0.0279 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5609 on 55 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.659, Adjusted R-squared: 0.6466
## F-statistic: 53.14 on 2 and 55 DF, p-value: 1.417e-13
\[\widehat{log(Maximum Life Span)} = 1.13935 + 0.53138 * log(BrainWt) - 0.18798 * log(BodyWt)\]
As seen in the section above, the best model for predicting the maximum life span for a species is: \[\widehat{log(Maximum Life Span)} = 1.13935 + 0.53138 * log(BrainWt) - 0.18798 * log(BodyWt)\].
We can use the summary() function in R to look at the statistics on this model.
summary(myBestModel)
##
## Call:
## lm(formula = log(Life) ~ log(BrainWt) + log(BodyWt))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2188 -0.2944 -0.1341 0.2181 1.9097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.13935 0.24296 4.690 1.85e-05 ***
## log(BrainWt) 0.53138 0.10643 4.993 6.36e-06 ***
## log(BodyWt) -0.18798 0.08325 -2.258 0.0279 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5609 on 55 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.659, Adjusted R-squared: 0.6466
## F-statistic: 53.14 on 2 and 55 DF, p-value: 1.417e-13
We can see from this summary that the p-value of the whole model is 1.417*(10^-13). Since this p-value is less than .05, we know that the model as a whole is highly significant.
We can also see that the p-value associated with the slope coefficient on log(BrainWt) is 6.36*(10^-6). We can set up the hypothesis test to be: \[\sf{H _ {0}} : \beta_{1} = 0\] \[\sf{H _ {1}} : \beta_{1} \neq 0\]
Since the p-value is less than .05, we reject the null hypothesis. Thus, we conclude that \(\hat{\beta_{1}}\) is significant to our model.
We can do a similar thing with log(BodyWt). We see that the p-value associated with the slope coefficient on log(BodyWt) is .0279. We can set up the hypothesis test to be: \[\sf{H _ {0}} : \beta_{2} = 0 \] \[\sf{H _ {1}} : \beta_{2} \neq 0 \]
Since the p-value is less than .05, we reject the null hypothesis. Thus, we conclude that \(\hat{\beta_{2}}\) is significant to our model.
Next, let’s interpret the individual coefficients of this log-log model.
We can see from the summary that our intercept, \(\sf{\hat{\beta_{0}}}\), is equal to 1.13935. Because both predictors enter the model as logs, the intercept corresponds to a species with log(BrainWt) = 0 and log(BodyWt) = 0, that is, a brain weight of 1 gram and a body weight of 1 kilogram. For such a species the model predicts a log(maximum life span) of 1.13935, or about exp(1.13935) ≈ 3.1 years. Since that combination of weights is not realistic for most species, the intercept is mostly a mathematical anchor for the model rather than a quantity we interpret on its own.
We can also see from the summary function that \(\sf{\hat{\beta_{1}}}\), the slope coefficient on log(BrainWt), is equal to .53138. This means that if the natural log of a species’ brain weight increased by one unit, holding body weight fixed, we would expect the log of its maximum life span to increase by .53138.
From the summary, we see that \(\sf{\hat{\beta_{2}}}\), the slope coefficient on log(BodyWt), is equal to -.18798. This means that if the natural log of a species’ body weight increased by one unit, holding brain weight fixed, we would expect the log of its maximum life span to decrease by .18798.
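Because both the response and the predictors are on the natural log scale, these coefficients behave like elasticities: multiplying a predictor by some factor multiplies the predicted life span by that factor raised to the coefficient. A quick numerical check using the rounded estimates above:
# doubling brain weight multiplies the predicted life span by about 2^0.53138
2^0.53138     # roughly 1.45, i.e. about 45% longer
# a 1% increase in brain weight corresponds to roughly a 0.53% increase in life span
1.01^0.53138  # roughly 1.0053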
Under the assumptions of the model, error terms \(\epsilon_i\) are statistically independent samples of a normal distribution with mean \({\mu}=0\) and variance \({\sigma}^2\). \({\sigma}^2\) represents the variance of the population of life span values at body weight \({x_1}\) and brain weight \({x_2}\) .
Let’s test this assumption by first creating a residual plot, just like we did in the simple linear regression R guide.
myResiduals = myBestModel$residuals
plot(myResiduals)
abline(0 , 0)
From this plot, we can see that there is no strong evidence of heteroscedasticity. The residuals seem to be independently and identically distributed.
Using the plot() function with our multiple linear regression model of choice as the input, we can gather much more information about the model, especially in comparison to the other candidate models.
plot(myBestModel)
The first plot in our output shows us a general plot of our residuals against the fitted values for our model. Ideally, the red line indicating the trend would be a straight line across 0. As you can see, we come pretty close to achieving that, so things are looking up so far.
The second plot in the output is the normal q-q plot. If you remember from the simple linear regression guide, we studied these to determine whether our error terms appeared to follow a normal distribution. There is some deviation on the tails which tells us that the life spans for the species in rows 16, 7, and 33 vary more wildly than they should.
To investigate that, let’s take a side bar and call up which mammals those terms correspond to. We need to call up certain rows of data, and we want all of the columns to appear with headers so we know what values we’re seeing. To tell R which rows we want, we use the combine function, denoted c(), whose inputs are the rows that we want to view. Since we want rows of our “sleep1” data, we nest the combine function inside sleep1[ ] so that R gathers the data in those three rows of the set. Then we leave an open-ended comma in the input to indicate that we want all of the columns to be included.
sleep1[c(7,16,33),]
Notice that the three outliers are three unusual types of mammals. The echidna is a porcupine- and anteater-like mammal that lays eggs, and the other two are bats, which are flying mammals. Since these species display such atypical characteristics, we might choose to disregard them. Our data set is small already, though, and for the sake of continuity we will just continue our analysis under the conclusion that the constant variance assumption holds.
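If we were curious how sensitive the fit is to those three species, a quick refit without them would look like the sketch below; we do not use it for the rest of this guide.
# refit the chosen model without the three unusual species (rows 7, 16, and 33)
noOddMammals <- lm(log(Life) ~ log(BrainWt) + log(BodyWt), data = sleep1[-c(7, 16, 33), ])
summary(noOddMammals)$coefficients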
The third plot is the scale-location plot, which shows the square root of the standardized residuals against the fitted values. Here we are again looking for no trend: a roughly flat red line supports the constant variance assumption for the error terms.
The last plot shows the influence that each data point has on the least squares estimates of our regression coefficients. Points falling outside the Cook’s distance curves labelled 0.5 and 1 are said to have an outsized impact. Almost all of our points lie well within both curves, which indicates no individual species is skewing our model predictions severely.
We can use our model to predict log(maximum life span) and then back-transform the result to years. Let’s say we have a species with an average body weight of 521.0 kilograms and an average brain weight of 655.0 grams; this corresponds to the average body weight and brain weight of horses. Let’s see what our model would predict for this species’ maximum life span on average. Our model is: \[\widehat{log(Maximum Life Span)} = 1.13935 - 0.18798 *log(Body Weight) + 0.53138*log(Brain Weight)\]
BrainWeight <- log(655)
BodyWeight <- log(521)
myPred1 <- 1.13935 + 0.53138 * (BrainWeight) - 0.18798 * (BodyWeight)
myPred1
## [1] 3.4092
This prediction is on the log scale, so we exponentiate to get back to years: exp(3.4092) is roughly 30.2, meaning the model predicts a maximum life span of about 30 years for a species with the horse’s average body and brain weights. The actual maximum life span recorded for the horse is 46.0 years, so the model under-predicts for this particular species.
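Rather than typing the rounded coefficients by hand, we can also let predict() evaluate the fitted model for us and then exponentiate the result (a simple back-transformation that ignores the retransformation bias of a log model). The data frame name “horse” below is just one we chose for this example.
# model-based prediction for the horse, back-transformed from the log scale to years
horse <- data.frame(BodyWt = 521, BrainWt = 655)
exp(predict(myBestModel, horse))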
Next, we use the species average for an arctic ground squirrel’s brain and body weights. The average body weight of the arctic squirrel is .920 kilograms and the average brain weight is 5.7 grams. Plugging it into our model returns:
BrainWeight <- log(5.7)
BodyWeight <- log(0.920)
myPredUnknown <- 1.13935 + 0.53138 *(BrainWeight) - 0.18798 *(BodyWeight)
myPredUnknown
## [1] 2.079873
Back-transforming again, exp(2.079873) is roughly 8.0, so the model predicts a maximum life span of about 8 years for the arctic ground squirrel.
Remember from the simple linear regression guide that if we want to make an interval that describes a point estimate of an individual response variable value at a specified explanatory value, we can simply adjust the input of the predict() function to read “(interval = ‘prediction’).”
con5050 <- data.frame(BodyWt = 50, BrainWt = 50)
predict(myBestModel , con5050 , interval = "prediction")
## fit lwr upr
## 1 2.482763 1.316001 3.649524
This means that we can predict with 95% confidence that the log(life span of a species of mammal weighing 50 kg on average and having a 50 g brain weight) will be between 1.316001 and 3.649524.
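Since the whole interval is on the log scale, we can exponentiate its endpoints to express it in years; the log transform is monotone, so the endpoints map directly (roughly 3.7 to 38.5 years here).
# prediction interval back-transformed from the log scale to years
exp(predict(myBestModel, con5050, interval = "prediction"))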
Similar to the simple linear regression, the confint() function will yield the limits of a 95% confidence interval for the regression parameters of the model that we put into it by default.
confint(myBestModel)
## 2.5 % 97.5 %
## (Intercept) 0.6524563 1.62625094
## log(BrainWt) 0.3180861 0.74467534
## log(BodyWt) -0.3548172 -0.02113382
Our main interest is in the slope parameters \({\beta_1}\) and \(\beta_2\) corresponding to the log of the average brain weight and the log of the average body weight of a species, respectively. This output tells us that we can be 95% confident that the coefficient describing the relationship between log(average brain weight) and log(maximum life span) takes a value between 0.3180861 and 0.7446753. We will set aside the intercept interval since, as discussed above, the intercept is mostly a mathematical anchor for the model.
Remember that we could change the confidence level using the inputs (model1, level = “desired confidence %”):
confint(myBestModel , level = 0.9)
## 5 % 95 %
## (Intercept) 0.7328778 1.54582944
## log(BrainWt) 0.3533162 0.70944517
## log(BodyWt) -0.3272597 -0.04869129
Notice how an interval of a higher degree of confidence is wider? This is because we can be more certain that the true \({\beta_1}\) and \({\beta_2}\) lie within a larger interval. Conversely, we cannot be as confident that \({\beta_1}\) and \({\beta_2}\) fall within a very small range of values without more information.
Our next objective is to get an idea of the mean life span of mammals with certain species averages for body weight and brain weight. Let’s say that we want to know something about the mean life span of all species of mammals that have an average body weight of 14 kg and an average brain weight of 3 g. Just like in the simple linear regression case, we use the predict() function to produce the limits of a 100(1-\({\alpha}\))% confidence interval after specifying the values of our predictors.
con143 <- data.frame(BodyWt = 14 , BrainWt = 3)
predict(myBestModel , con143 , interval = "confidence")
## fit lwr upr
## 1 1.227057 0.5611169 1.892997
Our code produced the bounds of the 95% confidence interval by default for the log(mean life span of mammals with average body weight 14 kg and average brain weight 3 g). The main takeaway is that we are 95% confident that the log(mean life span for species of mammals that have an average body weight of 14 kg and an average brain weight of 3 g) is between 0.5611169 and 1.892997.
We don’t need a calculator or computer to see that the prediction interval for an individual response is quite a bit wider than the confidence interval for the mean, even though the two intervals above were computed at different predictor values. As mentioned in the simple linear regression guide, a prediction interval must account for the variation of an individual response around the mean in addition to the uncertainty in estimating the mean itself, so at any given predictor values it is always the wider of the two. The sketch below makes the comparison at a single point.
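As a final check, we can compute both intervals at the same predictor values and compare their widths directly; the prediction interval should come out wider.
# confidence and prediction intervals at the same point (14 kg body weight, 3 g brain weight)
predict(myBestModel, con143, interval = "confidence")
predict(myBestModel, con143, interval = "prediction")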