Now we will begin working with multiple linear regression (MLR). MLR is just like simple linear regression, but with two or more predictors instead of just one. The basic formula is: \[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \dots + \hat{\beta}_k x_{ik}\] where \(k\) is the number of predictors. MLR lets us build models that explain more of the variation in the response than a single predictor could. The null hypothesis is that there is no linear relationship between the response and any of the predictors; the alternative hypothesis is that at least one of the predictors has a linear relationship with the response. This R Guide will cover the basics of MLR using models with and without interaction terms. We will also look at the complications caused by multicollinearity and heteroscedasticity and how to deal with them.
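Stated formally, these hypotheses correspond to the overall F-test reported at the bottom of every lm() summary: \[H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \qquad \text{versus} \qquad H_a: \beta_j \neq 0 \text{ for at least one } j\]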
For this R Guide we will be using the USSeatBelts data from the AER package. First, let's load the package, call up the data, and look at the first few rows to get a feel for the data set.
library(AER)
## Warning: package 'AER' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 3.4.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.4.3
## Loading required package: survival
data("USSeatBelts")
attach(USSeatBelts)
head(USSeatBelts)
## state year miles fatalities seatbelt speed65 speed70 drinkage alcohol
## 1 AK 1983 3358 0.04466945 NA no no yes no
## 2 AK 1984 3589 0.03733630 NA no no yes no
## 3 AK 1985 3840 0.03307291 NA no no yes no
## 4 AK 1986 4008 0.02519960 NA no no yes no
## 5 AK 1987 3900 0.01948718 NA no no yes no
## 6 AK 1988 3841 0.02525384 NA no no yes no
## income age enforce
## 1 17973 28.23497 no
## 2 18093 28.34354 no
## 3 18925 28.37282 no
## 4 18466 28.39665 no
## 5 18021 28.45325 no
## 6 18447 28.85142 no
This is balanced panel data for the years 1983-1997 covering the 50 US states and the District of Columbia, collected to study traffic fatalities and seat belt usage. The federal government began setting target percentages for seat belt usage in 1997, and the first mandatory seat belt law was adopted by New York in 1984. The output from the head command shows that there are 12 variables in this data set, with a good mix of qualitative and quantitative variables. For our first model we will use speed70, drinkage, alcohol, income, and age as our initial predictors. We can also easily see how many observations there are in this data set by using the nrow command.
nrow(USSeatBelts)
## [1] 765
This tells us that there are 765 observations in this data set. More observations typically lead to more precisely estimated models, so this is a positive initial sign. Now we will construct our first model.
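Before doing so, it can help to confirm which variables R stores as factors (qualitative) and which as numeric (quantitative), since that determines how lm() will code them. A minimal sketch, whose output is not shown here:
str(USSeatBelts)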
When looking at a data set, a variable may have been recorded in an inconvenient unit, or its values may be much smaller than the rest of the values in the data set. From the first several rows above, it is obvious that the fatalities variable takes values far smaller than the other variables. The reason is its unit: number of fatalities per million traffic miles. There is nothing wrong with that unit, but such a small response leads to very small regression coefficients on the predictors. We can rescale the variable so that its values are on a more convenient order of magnitude. Let's multiply the fatalities variable by 1000. We do this in R by creating a new variable name followed by what we want the variable to be; in our case it is "fatalities * 1000." We can then use the "head" command again to verify that our new variable is now included. It will be located to the right of the last variable in the original data set.
USSeatBelts$fatal1000 <- USSeatBelts$fatalities * 1000
head(USSeatBelts)
## state year miles fatalities seatbelt speed65 speed70 drinkage alcohol
## 1 AK 1983 3358 0.04466945 NA no no yes no
## 2 AK 1984 3589 0.03733630 NA no no yes no
## 3 AK 1985 3840 0.03307291 NA no no yes no
## 4 AK 1986 4008 0.02519960 NA no no yes no
## 5 AK 1987 3900 0.01948718 NA no no yes no
## 6 AK 1988 3841 0.02525384 NA no no yes no
## income age enforce fatal1000
## 1 17973 28.23497 no 44.66945
## 2 18093 28.34354 no 37.33630
## 3 18925 28.37282 no 33.07291
## 4 18466 28.39665 no 25.19960
## 5 18021 28.45325 no 19.48718
## 6 18447 28.85142 no 25.25384
Now we can see the values of our new variable, which are on a much more convenient scale. This does not change the information in the data, only the unit of the fatalities variable: since fatalities is recorded per million traffic miles, multiplying by 1000 makes fatal1000 the number of fatalities per billion traffic miles. This rescaled variable will give us regression coefficients that are easier to read.
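To see that this rescaling changes nothing of substance, note that regressing fatalities and fatal1000 on the same predictor gives slopes that differ by exactly a factor of 1000. A quick sketch (output omitted):
coef(lm(fatalities ~ income, data = USSeatBelts))
coef(lm(fatal1000 ~ income, data = USSeatBelts))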
Now that we have identified speed70, drinkage, alcohol, income, and age as our candidate predictors, we can build our model. We will use backward elimination, a model-building method in which predictors are deleted from the model one at a time until every remaining predictor is significant. A t-test determines whether or not a predictor is significant and thus stays in our model.
The backward elimination method begins with a model that contains all of our predictors. Use the lm(response ~ \(predictor_1\) + \(predictor_2\) + … + \(predictor_k\)) command to build the model with all predictors. I call the model allmod for ease of use. Use the summary() command with the model as the argument to display the t-test statistics and p-values for each predictor. We will use a significance level of \(\alpha=0.05\).
allmod<-lm(USSeatBelts$fatal1000 ~ speed70 + drinkage + alcohol + income + age)
summary(allmod)
##
## Call:
## lm(formula = USSeatBelts$fatal1000 ~ speed70 + drinkage + alcohol +
## income + age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4566 -2.7543 -0.4003 2.2780 20.1322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.150e+01 3.328e+00 15.474 < 2e-16 ***
## speed70yes 2.178e+00 6.314e-01 3.449 0.000594 ***
## drinkageyes 2.511e-01 5.365e-01 0.468 0.639911
## alcoholyes -2.239e+00 5.015e-01 -4.465 9.24e-06 ***
## income -8.550e-04 3.885e-05 -22.008 < 2e-16 ***
## age -4.195e-01 1.008e-01 -4.161 3.53e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.283 on 759 degrees of freedom
## Multiple R-squared: 0.5214, Adjusted R-squared: 0.5182
## F-statistic: 165.4 on 5 and 759 DF, p-value: < 2.2e-16
Notice that R's default name for each qualitative predictor is the predictor name followed by the level, such as "speed70yes". The predictor with the highest p-value is drinkageyes at 0.6399, located under Pr(>|t|). The t-test statistic for drinkage, located under t value, is 0.468. At a significance level of \(\alpha=0.05\), we do not have enough evidence that drinkage is a significant predictor of the fatality rate (fatal1000). So we will drop drinkage as a predictor and move to the next iteration of the backward elimination method. We will test the significance of our remaining predictors in a new model, which I call newmod.
newmod<-lm(USSeatBelts$fatal1000 ~ speed70 + alcohol + income + age)
summary(newmod)
##
## Call:
## lm(formula = USSeatBelts$fatal1000 ~ speed70 + alcohol + income +
## age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3976 -2.7487 -0.3729 2.2369 20.1805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.152e+01 3.326e+00 15.489 < 2e-16 ***
## speed70yes 2.177e+00 6.311e-01 3.450 0.000591 ***
## alcoholyes -2.217e+00 4.990e-01 -4.443 1.02e-05 ***
## income -8.486e-04 3.635e-05 -23.348 < 2e-16 ***
## age -4.170e-01 1.006e-01 -4.144 3.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.281 on 760 degrees of freedom
## Multiple R-squared: 0.5213, Adjusted R-squared: 0.5187
## F-statistic: 206.9 on 4 and 760 DF, p-value: < 2.2e-16
The p-values for each predictor are less than \(\alpha=0.05\), so we have strong evidence that speed70, alcohol, income, and age are all significant predictors of the fatality rate (fatal1000). Because every predictor is significant, we are done building our model: its predictors are speed70, alcohol, income, and age.
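As a cross-check on the manual backward elimination (not part of the original workflow), the drop1() function refits the full model with each term removed in turn and reports an F-test for each removal; drinkage should again show the largest p-value. A sketch:
drop1(allmod, test = "F")
The step() command can automate the whole search, although it selects terms by AIC rather than by individual t-tests, so it will not necessarily make the same choices.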
Now that we have our predictors, we can obtain the regression equation and examine the coefficients. Typing the model name, newmod, prints the regression coefficients; the summary above also displays them under the "Estimate" column.
newmod
##
## Call:
## lm(formula = USSeatBelts$fatal1000 ~ speed70 + alcohol + income +
## age)
##
## Coefficients:
## (Intercept) speed70yes alcoholyes income age
## 51.5170000 2.1771753 -2.2170420 -0.0008487 -0.4170447
The regression equation, found under "Coefficients" in the output, is: \[\widehat{\text{fatal1000}} = 51.5170 + 2.1772\,\text{speed70yes} - 2.2170\,\text{alcoholyes} - 0.0009\,\text{income} - 0.4170\,\text{age}\] According to this model, we predict states with higher income and higher mean age to experience lower fatality rates. We also predict the fatality rate to drop by 2.2170 when the state has a 0.08 maximum blood alcohol content law in effect and to increase by 2.1772 when there is a 70 mile per hour (or higher) speed limit. It rarely makes sense to interpret the intercept of a regression model, and we do not interpret it here because age cannot be zero. Before we can put this model to use, we must verify that the model meets certain requirements and satisfies the multiple linear regression model assumptions.
Multicollinearity must be checked before using a regression model because it can limit the use and conclusions of the model. Multicollinearity simply means that the predictors are related to, or dependent on, one another. This is unfavorable because it inflates the standard errors of the coefficients, which in turn shrinks the test statistics and increases the p-values, reducing our ability to reject the null hypothesis in favor of the alternative hypothesis mentioned in the introduction. There are multiple ways of testing for multicollinearity. You can compute simple correlations between the predictor variables, but that only works when the predictors are quantitative, which is not the case for our model. We will instead look at the variance inflation factors (VIF) of our predictors. The VIF quantifies how much each predictor inflates the model's standard errors, and it is calculated for each predictor using: \[VIF_j = \dfrac{1}{1 - R^2_j}\] where \(R^2_j\) is the \(R^2\) obtained from regressing predictor \(j\) on the other predictors. We hope that all of the VIF values lie below 10; anything above 10 signals a strong relationship among the predictors and therefore problematic multicollinearity in our model. We can see all the VIF values for our model simply by using the "vif" function in R and placing our model name within the parentheses.
vif(newmod)
## speed70 alcohol income age
## 1.090460 1.068576 1.274845 1.217226
The VIF for each predictor is close to 1, so we are not concerned about multicollinearity in our model.
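To see where these numbers come from, the VIF for income can be reproduced by hand from the formula above: regress income on the remaining predictors and plug the resulting \(R^2\) into \(1/(1-R^2)\). A sketch (the model name auxmod is ours), whose result should agree with the vif() output:
auxmod <- lm(income ~ speed70 + alcohol + age, data = USSeatBelts)
1 / (1 - summary(auxmod)$r.squared)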
There are several assumptions that must be true in order to interpret our model and use it to create intervals and perform hypothesis testing. The assumptions are:
Mean Assumption: At any given combination of values of \(x_1\),\(x_2\),…,\(x_k\), the population of potential error term values has a mean equal to 0.
Constant Variance (Homoscedasticity) Assumption: The different populations of potential error term values corresponding to different combinations of values of \(x_1\),\(x_2\),…,\(x_k\) have equal variances.
Normality Assumption: At any given combination of values of \(x_1\),\(x_2\),…,\(x_k\), the population of potential error term values has a normal distribution.
Independence Assumption: Any one value of the error term is statistically independent of any other value of the error term.
We will verify the homoscedasticity, normality, and independence assumptions.
We can check the constant variance assumption by examining a plot of the residuals against the fitted values. The fitted values are simply the fatal1000 values our model predicts.
par(mfrow= c(2,2))
plot(newmod) # plot(newmod, which = 1) would display only the first (residuals vs fitted) plot
The residuals appear to become more spread out as the predicted fatality rate increases. This model violates the homoscedasticity assumption: the error variance increases with the fitted values. The red line in the first plot also indicates that this model systematically overpredicts both low and high fatality rates. In any real application of the regression model, the assumptions will not hold exactly. However, this model exhibits a major homoscedasticity violation, which inhibits our ability to make statistical inferences. We will not further examine the assumptions of this model or use it to make predictions. Instead, we can attempt to remedy the heteroscedasticity and the systematic overprediction by performing transformations.
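Before moving to transformations, the visual impression can be backed up with a formal check. The lmtest package, already loaded above as a dependency of AER, provides the Breusch-Pagan test, whose null hypothesis is constant error variance, so a small p-value points toward heteroscedasticity. A sketch (output not shown):
bptest(newmod)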
One way we can remedy model issues like multicollinearity and/or heteroscedasticity is by using transformations. Our goal in transforming this model is to reduce the heteroscedasticity seen in the first model. Three common transformations are the square root, the natural log, and the inverse. The square root can help with heteroscedasticity, while the natural log and inverse transformations can help equalize the error variance and reduce curvature that may be present in the residuals. Transformations can be applied to the response as well as to the predictor variables.
When considering the use of transformations, it is a good idea to go through our variables to see which would be good candidates. First we look at our response variable, fatal1000. Since we already rescaled this variable in our exploratory data analysis, we should avoid changing it further. Next we look at our predictor variables. One obvious candidate is income: its values range from 8,372 USD to 35,863 USD, so scaling them down with either a square root or a natural log transformation may help equalize the error across the residuals. Upon examination of the residuals vs fitted plots, a log transformation of income improves the heteroscedasticity and curvature the most. We build the transformed model the same way as before, except that we replace "income" with "log(income)" in the formula. Let's name this model "transmod". Again use the plot command with the model name as the argument to display the diagnostic plots.
transmod<-lm(USSeatBelts$fatal1000 ~ speed70 + alcohol + log(income) + age)
par(mfrow= c(2,2))
plot(transmod)
We will now check the same assumptions for the transformed model that we did for the original model, to verify the improvement in heteroscedasticity as well as the other assumptions.
The residuals vs fitted plot indicates that the residuals still spread out as the fitted values increase. However, the heteroscedasticity is improved over the original model, since the residuals are more concentrated around zero. The overpredictions for both low and high fatality values also improve. The scale-location plot indicates the residuals have their largest variance at middle fitted values. We are still somewhat concerned that this model violates the homoscedasticity assumption, but we will continue to use it because of the improvement over the original model.
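The claim that the log transformation helps the most can be checked by fitting all three candidate transformations of income and comparing their residuals vs fitted plots side by side. A sketch (the model names sqrtmod and invmod are ours, introduced only for this comparison):
sqrtmod <- lm(fatal1000 ~ speed70 + alcohol + sqrt(income) + age, data = USSeatBelts)
invmod <- lm(fatal1000 ~ speed70 + alcohol + I(1/income) + age, data = USSeatBelts)
par(mfrow = c(1, 3))
plot(sqrtmod, which = 1)
plot(transmod, which = 1)
plot(invmod, which = 1)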
A histogram of the transformed model residuals allows us to visualize whether the residuals follow a normal distribution. Create the histogram by using the hist() command with the residuals as the argument; the residuals can be pulled from the model using transmod$residuals. Note that a title can be added using main="title", the x-axis label can be changed using the xlab="label" argument, and the y-axis label can be changed in a similar manner if necessary.
hist(transmod$residuals, xlab="Residuals", main= "Transformed Model Residuals" )
The residuals do not follow a bell curve closely and are right skewed. We can also check normality using the Q-Q plot produced by the "plot" command, where we see similar behavior: the residuals veer off the normal line in the higher quantiles. We are somewhat concerned about violating the normality assumption, but will continue with our analysis.
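A formal normality check is also available, although with 765 observations the Shapiro-Wilk test will flag even mild departures from normality, so the visual checks above usually carry more weight. A sketch (output not shown):
shapiro.test(residuals(transmod))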
The previous plots and the residuals vs leverage plot indicate that data points 1, 2, and 3 are possible outliers. These points differ from the rest of the data because Alaska (state code AK) experienced high fatal1000 values, ranging from roughly 33 to 44, from 1983 to 1985, while its mean age was low at around 28 years. However, none of the data lies outside Cook's distance, which would be indicated by a dotted red line, so these outliers do not strongly influence the model.
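The influence of individual observations can also be checked numerically: cooks.distance() returns one value per observation, and a common rough rule of thumb flags values near or above 1. A sketch listing the largest few (output not shown):
head(sort(cooks.distance(transmod), decreasing = TRUE))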
Transforming a model adds complexity in both the interpretation and explanation of the conclusions of a model. As above, we can obtain the regression coefficients for the transformed model using the coef() command.
coef(transmod)
## (Intercept) speed70yes alcoholyes log(income) age
## 190.5476951 2.4727488 -2.0779741 -16.0059813 -0.3624513
The regression equation for the transformed model is: \[\widehat{\text{fatal1000}} = 190.5477 + 2.4727\,\text{speed70yes} - 2.0780\,\text{alcoholyes} - 16.0060\,\log(\text{income}) - 0.3625\,\text{age}\] Similar to the previous model, this model predicts that the presence of a 0.08 maximum blood alcohol content law lowers the fatality rate by 2.0780 and that the presence of a speed limit of 70 or greater increases it by 2.4727. Each one-year increase in mean age is expected to lower the fatality rate by 0.3625. The log transform of median per capita income cannot be read directly as a monetary value or as a measurement of income; according to this model, a higher log(income) value is expected to decrease the fatality rate.
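One way to put the log(income) coefficient on a more interpretable scale is to ask what a 1% increase in income does to the predicted fatality rate: the change is the coefficient multiplied by log(1.01). A quick sketch of that arithmetic:
-16.0059813 * log(1.01) # roughly -0.16 fatal1000 units for each 1% increase in income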
Confidence intervals for the regression coefficients give an estimate for the true regression coefficients. Use the confint() command with the model name and level of confidence as the arguments. Note that R defaults to a confidence level of 95% so the level=.95 argument below is not required.
confint(transmod, level=.95)
## 2.5 % 97.5 %
## (Intercept) 179.3901717 201.7052184
## speed70yes 1.2704060 3.6750916
## alcoholyes -3.0274093 -1.1285389
## log(income) -17.2560617 -14.7559008
## age -0.5541762 -0.1707264
The output lists the intercept and each predictor in the first column. The next two columns list the lower and upper bounds of each interval. We are 95% confident that the true regression coefficient values lie within their respective intervals. For example, we are 95% confident that the true regression coefficient for age lies between -0.5542 and -0.1707. The point estimate for age, which is listed above in both the coef() and summary() output, is -0.3625.
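If only one coefficient is of interest, or a different confidence level is wanted, confint() also accepts a parm argument and any level. A sketch (output not shown):
confint(transmod, parm = "age", level = 0.99)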
We can use our model to make point predictions and to explore how changing the predictors affects the fatality rate. To compare the model's prediction with the actual observed value, we will use real data: the observation for Indiana in 1988, letting R do some of the indexing for us.
I call the actual fatality rate actualfatal and the predicted value predfatal for ease of use. The 1988 data for Indiana is in row 231, and the speed70, alcohol, income, and age values are in columns 7, 9, 10, and 11 respectively; the observed fatal1000 value is in column 13. Call data by using the USSeatBelts[ ] command with the row and column numbers as arguments. Again use the coef() command to obtain the regression coefficients and multiply them by a vector containing the predictor values obtained from indexing. Two details matter here: the factor values for speed70 and alcohol must be converted to the 0/1 indicator coding that the model uses, which we can do by testing whether each value equals "yes", and the income value must be entered on the log scale, since the model uses log(income).
actualfatal<-USSeatBelts[231, 13]
# speed70 and alcohol are factors, so convert them to 0/1 indicators with == "yes"
predfatal<-coef(transmod)%*%c(1, USSeatBelts[231, 7]=="yes", USSeatBelts[231,9]=="yes", log(USSeatBelts[231, 10]), USSeatBelts[231,11])
actualfatal
## [1] 21.49675
The actual fatality rate experienced by Indiana in 1988, shown in the output above, is 21.49675 fatal1000 units. Evaluating predfatal with the coefficients reported earlier gives a prediction of roughly 23.68, so the transformed model overpredicts the 1988 Indiana fatality rate by about 2.2.
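A less error-prone way to get the same prediction is to hand predict() the stored row directly; it applies the dummy coding and the log transformation for us, so the result should agree with the hand calculation above. A sketch (output not shown):
predict(transmod, newdata = USSeatBelts[231, ])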
Now we will use our model to create both a confidence interval (CI) and a prediction interval (PI) for the response at a particular combination of predictor values. We will again use the 1988 Indiana data point for both intervals. We expect the confidence interval to be narrower than the prediction interval: the CI estimates the mean response for all observations with these predictor values, while the PI must cover a single new observation, whose individual error term adds extra variability, so the PI has to be wider to achieve the same level of confidence. First we will call up the values of the data point we are using, along with the observed value of the response; this will be useful when looking at the two intervals.
USSeatBelts[231 , "speed70"]
## [1] no
## Levels: no yes
USSeatBelts[231 , "alcohol"]
## [1] no
## Levels: no yes
USSeatBelts[231 , "income"]
## [1] 15231
USSeatBelts[231 , "age"]
## [1] 35.06287
USSeatBelts[231 , "fatal1000"]
## [1] 21.49675
Now that we have our values, we create a data frame containing them using the "data.frame" command. Note that income is entered on its original scale; because the model formula contains log(income), R applies the log transformation for us when predicting.
newdata1 <- data.frame(speed70 = "no" , alcohol = "no" , income = 15231 , age = 35.06287)
Now that we've created our data frame, we use the "predict" function to create the intervals. All that we need to supply is the transformed model, the data frame, and whether we want a CI or a PI via the interval argument.
predict(transmod , newdata1 , interval = "confidence")
predict(transmod , newdata1 , interval = "prediction")
Each call returns the point prediction (fit) together with the lower (lwr) and upper (upr) bounds of the requested interval. The point prediction agrees with the value we computed by hand above, and, as expected, the prediction interval is wider than the confidence interval around that same fitted value.
A very common term in a multiple linear regression model is an interaction term, which is created by multiplying two predictors together. This term is included when the relationship between the mean value of the response and one predictor depends on the value of another predictor. In other words, if the effect of one predictor on the response changes with the level of a second predictor, an interaction term can account for this relationship in the model. For our model, we will again look at the USSeatBelts data set.
To create a regression model with an interaction term, we still use the "lm" command, just like before, entering our response variable first and then our desired predictors. The only difference is that we join the two interacting predictors with "*" instead of "+"; the "*" tells R to include both main effects along with their interaction term. We will name this model "intermod". For this model let's use the quantitative variable age and the categorical variable alcohol, which is a yes or no answer to the question "Is there a maximum of 0.08 blood alcohol content?" Intuitively, it is plausible that the effect of a blood alcohol limit on the fatality rate differs with the mean age of a state's population. We will use our interaction model to see whether this proves to be true.
intermod <- lm(fatal1000 ~ income + age * alcohol + speed70, data = USSeatBelts)
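The "*" notation is shorthand: age * alcohol expands to the two main effects plus their interaction, so the spelled-out formula below fits exactly the same model. A sketch (intermod2 is an illustrative name):
intermod2 <- lm(fatal1000 ~ income + age + alcohol + age:alcohol + speed70, data = USSeatBelts) # identical fit to intermod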
Now that we have created our model, we will look at the diagnostic plots to verify that we have not broken any of our assumptions that were stated above.
par(mfrow= c(2,2))
plot(intermod)
In the first plot, we see that the residuals are centered near zero, but the red trend line rises at both ends, meaning the model tends to underpredict at the lowest and highest fitted values. Next, in the Q-Q plot, the residuals follow the normal line well, although the points rise above it in the higher quantiles. Lastly, looking at the scale-location plot of the standardized residuals, there are possible signs of heteroscedasticity: we want the red trend line to be horizontal, but it moves upward on the right side of the graph. The trend is not severe enough for us to conclude that we have violated the constant variance assumption, so we can move forward with this model, but we will need to keep a close eye on it.
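It is also worth checking whether the interaction term itself is statistically significant before drawing conclusions from it; its t-test appears in the age:alcoholyes row of the model summary. A sketch (output not shown):
summary(intermod)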
The next step is checking our model for obvious signs of multicollinearity. We can again see all the VIF values for our model simply by using the "vif" function in R and placing our model name within the parentheses.
vif(intermod)
## income age alcohol speed70 age:alcohol
## 1.277233 1.579922 233.347295 1.090846 232.566522
Here we see that two of our five terms far exceed the benchmark maximum of 10: the VIF for alcohol and for the age-alcohol interaction term both exceed 200. This indicates severe multicollinearity among our predictors, which is not surprising when a main effect and its own interaction term appear in the same model. At this point it is not sensible to continue with this model, and we conclude that this interaction term does not improve our model because of the severe multicollinearity it introduces.
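Much of this collinearity is structural, because age and the age-alcohol interaction are built from the same variable. One commonly suggested remedy, not pursued further here, is to center the quantitative predictor before forming the interaction, which often brings the VIF values back down. A sketch (age_c and intermod_c are illustrative names):
USSeatBelts$age_c <- USSeatBelts$age - mean(USSeatBelts$age)
intermod_c <- lm(fatal1000 ~ income + age_c * alcohol + speed70, data = USSeatBelts)
vif(intermod_c)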