This R guide covers multiple linear regression and the statistics and tests that come with it. It also discusses ways to check how accurate a model is and what we can look at to decide whether we have the best model for our data. We will walk through how to run a multiple linear regression in R and what we can do with the resulting model: interpretations, intervals, significance tests, and other useful information.
To demonstrate and explain the methods, we will walk through everything using a data set about cars. This data set includes information on number of cylinders, engine displacement, horsepower, weight, quarter mile time, engine type, transmission type, rear axle ratio, number of gears, number of carburetors, and miles per gallon. The first step is to get our data into R, which takes a single line of code because the data set ships with R.
data(mtcars)
Now our data set is in the R environment so we can use it and manipulate it to run our analysis. The next step is always to get to know your data: you need to know what it contains and how it is laid out to choose the best techniques for analyzing it. Two commands give us a quick glance. The first is head(), which shows the first 6 rows of data. The second is names(), which lists all of the variable names; this matters because every reference to a variable must match its spelling and capitalization exactly.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
Now we have an idea of what we have for data, so we have to decide what we want to look at. Multiple linear regression is a tool for examining the relationship between a response variable and two or more predictor variables (quantitative or categorical), and for using those predictors to estimate the one dependent variable of interest. Looking at this data, we are interested in creating a model that predicts the miles per gallon (mpg) a car gets based on other variables such as number of cylinders in the engine (cyl), displacement (disp, measured in cubic inches), gross horsepower (hp), rear axle ratio (drat), weight (wt, in 1000 lbs), quarter mile time (qsec), V engine or straight engine (vs), transmission type (am, 0 = automatic and 1 = manual), number of forward gears (gear), and number of carburetors (carb).
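In general form, a multiple linear regression model with \(k\) predictors can be written as \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon\), where each slope \(\beta_j\) describes the change in the response associated with a one unit change in \(x_j\) while the other predictors are held constant, and \(\varepsilon\) is the random error term.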
Now, in order for R to know which data set we are referencing, there is a little trick: the attach() command. This lets us skip typing the data set name in every command (attach() can cause confusion if objects with the same names already exist in the workspace, but it keeps the commands short for this guide).
attach(mtcars)
After getting the data prepared, the best way to see the relationships between variables is to plot them and look at the patterns. To do this I will use the plot() command, which takes the x-axis variable first and the y-axis variable second, separated by a comma. I will run one plot for each predictor variable paired with our response variable, MPG.
plot(mpg, cyl)
plot(mpg, disp)
plot(mpg, hp)
plot(mpg, drat)
plot(mpg, wt)
plot(mpg, qsec)
plot(mpg, vs)
plot(mpg, am)
plot(mpg, gear)
plot(mpg, carb)
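As a shortcut, the same ten plots could be produced by looping over the predictor names; this is just a sketch that mirrors the calls above, with mpg on the x-axis.
for (v in setdiff(names(mtcars), "mpg")) {
  plot(mtcars$mpg, mtcars[[v]], xlab = "mpg", ylab = v) # one plot per predictor
}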
These plots start to give us an idea of which variables might be related to MPG and which variables may be collinear with each other. We will keep these in mind as we continue our analysis, and we will return to these plots to assess some other things mentioned later in this guide.
The next step is to create our regression model to get some concrete numbers about the relationship between our predictors and the response. We are looking for numbers that tell us about the strength of the fit and the error of our model. The fit is the big one because it tells us how well our predictor variables predict our response variable.
The multiple linear model command is a very simple one. It is set up like:
modelname <- lm(response ~ predictor1 + predictor2 + predictor3 + …)
As mentioned earlier, we want to use the variables to predict MPG.
Then we just call the model’s name and it shows us its coefficients.
model <- lm(mpg~cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb)
model
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
## am + gear + carb)
##
## Coefficients:
## (Intercept) cyl disp hp drat wt
## 12.30337 -0.11144 0.01334 -0.02148 0.78711 -3.71530
## qsec vs am gear carb
## 0.82104 0.31776 2.52023 0.65541 -0.19942
As we can see here, our model was created under the name “model”, so just calling its name in a command shows us our coefficients. We could now write the regression equation from this output: \(\widehat{mpg} = 12.303 - 0.111\,cyl + 0.013\,disp - 0.021\,hp + 0.787\,drat - 3.715\,wt + 0.821\,qsec + 0.318\,vs + 2.520\,am + 0.655\,gear - 0.199\,carb\), where each variable stands for its observed value.
Now to interpret our model and its equation. Interpretation is important so that other people understand what analysis is being done by the model. Multiple linear regression is a little trickier than simple linear regression in its interpretations but it still is understandable. The best way to explain it is to say what we expect to happen to the response variable when we increase one predictor variable by one unit, while holding all other variables constant. For our model this would sound like: For every thousand pound increase in the weight of the car, our MPG will decrease 3.715, with all other variables held constant.
The other thing to interpret would be the intercept, but you can only interpret it if it makes sense for all of the predictor variables to be zero. In our case several variables cannot realistically be zero (for example weight or quarter mile time), so we cannot interpret the intercept.
There are several things to look at in the output, including the correlation/coefficient of determination (R, R-squared, and adjusted R-squared), t-tests, the F-test, coefficient confidence intervals, and mean squared error.
Arguably the most important measures of how good a model is are R and R-squared. These tell us how well our model predicts the desired response variable; the closer R-squared is to one the better, and a value of 1 would mean our model fits perfectly. Different from simple linear models, we also have an adjusted R-squared that takes into account how many predictors we are using. It penalizes the model for each additional predictor, so predictors that contribute little will actually lower this value; for that reason adjusted R-squared, rather than plain R-squared, is the one to use when comparing models with different numbers of predictors. To find these we can just look in the model summary using the summary() command.
summary(model)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
## am + gear + carb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
As we can see our R-squared value is .869, which is very good; it means the predictors explain 86.9% of the variability in MPG, so our model is able to predict mpg well. Our adjusted R-squared is a little lower, as expected, since we are using 10 predictors. It could improve if one or more of the predictors is not contributing much and is hurting the model. One thing to note, though, is that comparing plain R-squared values is not a great way of deciding which model is better than another.
Mean squared error is just what it sounds like: we take our squared errors (residuals) and divide their sum by the residual degrees of freedom, n − (k + 1) = 32 − 11 = 21 here. Taking the square root gives the residual standard error, which puts the error back in the units of mpg. This is a good measure of how accurate a model is because we obviously want as little error as possible. We can compute it with the following formula:
sqrt(sum((model$residuals)^2)/21)
## [1] 2.650197
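If we do not want to hard code the 21 degrees of freedom, the same value can be computed from the model object itself; this is just an equivalent sketch.
sqrt(sum(residuals(model)^2) / df.residual(model)) # same value, degrees of freedom taken from the model
sigma(model) # returns the residual standard error directly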
Or we can take the easy route and read the “Residual standard error” line in the summary:
summary(model)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
## am + gear + carb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
As you can see these values are equal. There are other useful pieces of information in this summary such as the coefficients for each predictor and degrees of freedom for the model.
An important thing to do with our model is to test the significance of a single predictor variable to see whether it is related to our response variable. We use a hypothesis test to do this: the null hypothesis is that the predictor’s coefficient is equal to 0, and the alternative hypothesis is that its coefficient is not equal to 0. To find our p-value we run a t-test with n − (k + 1) degrees of freedom. Our test statistic is \(t = b_j / s_{b_j}\), the point estimate of the coefficient divided by the standard error of that coefficient. The easiest way to find this is to pull those two numbers out of the model summary and divide one by the other.
tstat <- coef(summary(model))[3,1]/coef(summary(model))[3,2]
tstat
## [1] 0.7467585
Now we can use our test statistic and the 21 degrees of freedom for our model to run our t-test. We simply double the pt() result because we need to include both tails of the distribution, not just one. The command goes: pt(test statistic, degrees of freedom, lower.tail = TRUE/FALSE), where FALSE gives the area to the right of the test statistic.
2*pt(tstat, 21, lower.tail=FALSE)
## [1] 0.4634887
Here we can see our p-value is larger than our .05 level of significance, so we do not have enough evidence to reject the null hypothesis; our predictor variable of engine displacement does not have a significant relationship with our response variable MPG once the other predictors are accounted for.
We can do another one of these tests for our intercept.
tstat2 <- coef(summary(model))[1,1]/coef(summary(model))[1,2]
tstat2
## [1] 0.6573058
2*pt(tstat2, 21, lower.tail=FALSE)
## [1] 0.5181244
We have another large p-value, so the intercept is not significant to our model either; in any case we still cannot interpret it, since no car could reasonably have a value of 0 for every predictor.
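As a sketch, we could also compute the t statistics and two-sided p-values for every coefficient at once; abs() is used so that negative t statistics are handled correctly when doubling the upper tail.
est <- coef(summary(model))[, "Estimate"]
se <- coef(summary(model))[, "Std. Error"]
2 * pt(abs(est / se), df.residual(model), lower.tail = FALSE) # matches the Pr(>|t|) column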
An even easier way is to look in the model’s summary at the t value and Pr(>|t|) columns for each coefficient.
summary(model)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
## am + gear + carb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
As you can see these values are exactly the same, so using the summary() command is a great time saver.
An additional way to see if there is a linear relationship between the predictor variables and the response variable is the F-test. This test determines whether there is any relationship at all between our predictors and the response. It is set up like our t-test: the null hypothesis is that all of the predictor coefficients are 0, meaning no relationship between the response variable and its predictors, and the alternative is that at least one coefficient is not 0, meaning some linear relationship exists. A common α value for rejecting the null hypothesis is .05. We can find our F-test p-value at the bottom of our model’s summary.
summary(model)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs +
## am + gear + carb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
Our p-value is very small, so we can reject our null hypothesis and say that there is a relationship between at least one of the predictor variables and the response variable. One thing you might notice is that the p-values for our F and t tests are not identical, as they were in simple linear regression. We will also do a partial F-test later.
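If we want that F-test p-value on its own rather than reading it off the printout, one option (a sketch) is to pull the F statistic and its degrees of freedom out of the summary object.
fstat <- summary(model)$fstatistic # F value, numerator df, denominator df
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE) # should reproduce the p-value above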
Another thing to look at is the confidence intervals for our coefficients. Our point estimates for each coefficient are not exact so we want to find a range where we are a certain percent confident that the actual value is in this range. A common interval is a 95% confidence interval so we will create one using the confint() command.
confint(model, level=.95)
## 2.5 % 97.5 %
## (Intercept) -26.62259745 51.22934576
## cyl -2.28468553 2.06180457
## disp -0.02380146 0.05047194
## hp -0.06675236 0.02378812
## drat -2.61383350 4.18805545
## wt -7.65495413 0.22434628
## qsec -0.69883421 2.34091571
## vs -4.05880242 4.69432805
## am -1.75681208 6.79726585
## gear -2.44999107 3.76081711
## carb -1.92290442 1.52406591
We can interpret this output to mean: for every thousand pound increase in the car’s weight, we are 95% confident that the corresponding MPG changes by between -7.655 and 0.224, with the other variables held constant.
An important consideration for a multiple linear regression model is whether our assumptions are met. One important assumption is that our errors (residuals) have an approximately normal distribution. A residual is the observed value minus the predicted value for a given observation. The easiest check is to pull out the residuals with name <- modelname$residuals and make a histogram of them using the hist() command.
resid<- model$residuals
hist(resid)
This histogram does not give enough evidence to show that the residuals are not approximately normally distributed: it has fairly even tails and looks fairly bell shaped.
Another good way to look at this assumption is a normal quantile (Q-Q) plot, where we plot our residuals against theoretical normal quantiles along with a straight reference line. If our errors are normally distributed they will follow the straight line.
qqnorm(resid)
qqline(resid)
As we can see from this plot our errors follow the straight line well so we can say this assumption is met.
Another assumption is homoscedasticity, which means constant error variance throughout our data. We can look at this by plotting our residuals and adding a horizontal line at 0.
plot(model$residuals ~ disp)
abline(0,0)
We will say this assumption is met, as the plot gives no evidence of different levels of variance and the points appear to be randomly scattered rather than showing any kind of pattern.
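A common complementary check is to plot the residuals against the model’s fitted values rather than against a single predictor; the idea is the same, and the sketch below shows it.
plot(model$fitted.values, model$residuals, xlab = "Fitted Values", ylab = "Residuals")
abline(0, 0) # residuals should scatter evenly around this line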
Our last assumption is that the cars in this data set are independent of one another. We will assume this is met, as the data set contains many different makes, models, and types of vehicles, so none of them should have an effect on any of the others.
There are a couple plots we would like to look at that will give us more of an idea of how good our model is or how well it is predicting. Some of these include looking at plots of our residuals.
plot(model)
Looking at the Residuals vs Fitted plot, we can see a slight pattern to the residuals of our model, whereas we would rather see the residuals distributed as randomly as possible. However, we may be able to fix this by dropping some predictors or adding transformations to our model. The quantiles of our residuals follow a normal distribution, as there is not much deviation from the line in the Normal Q-Q plot. Based on the scale-location plot of the standardized residuals, we can again see that some points are clumped together whereas we would rather have them evenly spread out. Lastly, analyzing the residuals vs leverage plot, which marks Cook’s distance, we can see that some data points have a lot of leverage and may be affecting the accuracy of our model by skewing the results.
After completing our residual analysis, we concluded that some transformations may be beneficial to our model. Let’s see if it is possible to improve our model by transforming some of our predictors, either with a square root transformation or a log transformation.
model1 = lm(mpg ~cyl+log(disp)+log(hp)+drat+wt+qsec+vs+am+gear+carb)
summary(model1)
##
## Call:
## lm(formula = mpg ~ cyl + log(disp) + log(hp) + drat + wt + qsec +
## vs + am + gear + carb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3225 -1.6278 -0.4725 1.1672 3.7616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.749992 25.678622 2.054 0.0526 .
## cyl 0.934117 1.010657 0.924 0.3658
## log(disp) -4.923860 3.852996 -1.278 0.2152
## log(hp) -3.406400 3.003988 -1.134 0.2696
## drat 0.169684 1.549901 0.109 0.9139
## wt -0.975286 1.506152 -0.648 0.5243
## qsec 0.156231 0.686120 0.228 0.8221
## vs -0.005731 1.904313 -0.003 0.9976
## am 0.986507 2.084110 0.473 0.6408
## gear 1.706226 1.463070 1.166 0.2566
## carb -0.973755 0.653397 -1.490 0.1510
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.452 on 21 degrees of freedom
## Multiple R-squared: 0.8879, Adjusted R-squared: 0.8345
## F-statistic: 16.63 on 10 and 21 DF, p-value: 8.023e-08
The log transformation did improve our model, at least based on the summary, as our R-squared value increased from .869 to .8879. We can also check whether the plots for our new model improved at all:
plot(model1)
The residuals vs fitted values plot improved slightly, although the points are still not evenly distributed. The quantiles still follow a normal distribution, and the standardized residuals are more evenly spread out. The leverage plot also improved: the data are no longer as bunched up as before, and all of the points lie within the dashed Cook’s distance lines, meaning they are within the threshold for acceptable influence.
We made a model using all of the variables we were given, but that might not be our best model; we may be able to create a better one with fewer predictors. Here we will go through some steps for reducing our model and explain how to tell whether one model is better than another.
Now that we feel we have two comparable models, we will run an Akaike Information Criterion (AIC) analysis on each of them in the backwards direction. This means R takes our full model with all of the variables in it and removes one at a time so that the model’s AIC value improves at every step, where the smaller the value the better the model. When the AIC cannot get any better, it stops and R tells us that this is the best model to use. It essentially gets rid of the worst predictor at every step until it reaches the point where it would be removing a good predictor instead of a bad one. We should be left with our best predictors after using the stepAIC() command on our model, which is part of the R package MASS. One common rule of thumb is that an AIC decrease of at least 10 is needed to justify using a more complicated model over a simpler one; smaller differences favor the simpler model.
Mallow’s Cp is another way to test models when adding or removing predictors: \(C_p = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{MSE_{full}} - (n - 2(k+1))\), where the numerator is the SSE of the candidate model and \(MSE_{full}\) is the mean squared error of the full model. Ideally this would roughly equal k + 1. It is used much less often than AIC, so we will rely on AIC here since it is the better known criterion.
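For completeness, here is a minimal sketch of how Mallow’s Cp could be computed by hand for any candidate reduced model, using the full model’s MSE as the benchmark; the helper name mallows_cp is just for illustration.
mallows_cp <- function(reduced, full) {
  sse_reduced <- sum(residuals(reduced)^2) # SSE of the candidate model
  mse_full <- summary(full)$sigma^2 # MSE of the full model
  sse_reduced / mse_full - (nobs(reduced) - 2 * length(coef(reduced)))
}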
library(MASS)
stepAIC(model)
## Start: AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - cyl 1 0.0799 147.57 68.915
## - vs 1 0.1601 147.66 68.932
## - carb 1 0.4067 147.90 68.986
## - gear 1 1.3531 148.85 69.190
## - drat 1 1.6270 149.12 69.249
## - disp 1 3.9167 151.41 69.736
## - hp 1 6.8399 154.33 70.348
## - qsec 1 8.8641 156.36 70.765
## <none> 147.49 70.898
## - am 1 10.5467 158.04 71.108
## - wt 1 27.0144 174.51 74.280
##
## Step: AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - vs 1 0.2685 147.84 66.973
## - carb 1 0.5201 148.09 67.028
## - gear 1 1.8211 149.40 67.308
## - drat 1 1.9826 149.56 67.342
## - disp 1 3.9009 151.47 67.750
## - hp 1 7.3632 154.94 68.473
## <none> 147.57 68.915
## - qsec 1 10.0933 157.67 69.032
## - am 1 11.8359 159.41 69.384
## - wt 1 27.0280 174.60 72.297
##
## Step: AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 1 0.6855 148.53 65.121
## - gear 1 2.1437 149.99 65.434
## - drat 1 2.2139 150.06 65.449
## - disp 1 3.6467 151.49 65.753
## - hp 1 7.1060 154.95 66.475
## <none> 147.84 66.973
## - am 1 11.5694 159.41 67.384
## - qsec 1 15.6830 163.53 68.200
## - wt 1 27.3799 175.22 70.410
##
## Step: AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 1 1.565 150.09 63.457
## - drat 1 1.932 150.46 63.535
## <none> 148.53 65.121
## - disp 1 10.110 158.64 65.229
## - am 1 12.323 160.85 65.672
## - hp 1 14.826 163.35 66.166
## - qsec 1 26.408 174.94 68.358
## - wt 1 69.127 217.66 75.350
##
## Step: AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - drat 1 3.345 153.44 62.162
## - disp 1 8.545 158.64 63.229
## <none> 150.09 63.457
## - hp 1 13.285 163.38 64.171
## - am 1 20.036 170.13 65.466
## - qsec 1 25.574 175.67 66.491
## - wt 1 67.572 217.66 73.351
##
## Step: AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - disp 1 6.629 160.07 61.515
## <none> 153.44 62.162
## - hp 1 12.572 166.01 62.682
## - qsec 1 26.470 179.91 65.255
## - am 1 32.198 185.63 66.258
## - wt 1 69.043 222.48 72.051
##
## Step: AIC=61.52
## mpg ~ hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - hp 1 9.219 169.29 61.307
## <none> 160.07 61.515
## - qsec 1 20.225 180.29 63.323
## - am 1 25.993 186.06 64.331
## - wt 1 78.494 238.56 72.284
##
## Step: AIC=61.31
## mpg ~ wt + qsec + am
##
## Df Sum of Sq RSS AIC
## <none> 169.29 61.307
## - am 1 26.178 195.46 63.908
## - qsec 1 109.034 278.32 75.217
## - wt 1 183.347 352.63 82.790
##
## Call:
## lm(formula = mpg ~ wt + qsec + am)
##
## Coefficients:
## (Intercept) wt qsec am
## 9.618 -3.917 1.226 2.936
First running this on our original model with every variable, R gives us an output telling us to keep only 3 variables as predictors: quarter mile time, weight, and transmission type.
Now if we run this on our transformation model we get:
stepAIC(model1)
## Start: AIC=65.93
## mpg ~ cyl + log(disp) + log(hp) + drat + wt + qsec + vs + am +
## gear + carb
##
## Df Sum of Sq RSS AIC
## - vs 1 0.0001 126.28 63.928
## - drat 1 0.0721 126.35 63.947
## - qsec 1 0.3118 126.59 64.007
## - am 1 1.3473 127.63 64.268
## - wt 1 2.5214 128.80 64.561
## - cyl 1 5.1370 131.42 65.204
## - log(hp) 1 7.7323 134.01 65.830
## <none> 126.28 65.928
## - gear 1 8.1782 134.46 65.936
## - log(disp) 1 9.8204 136.10 66.325
## - carb 1 13.3555 139.63 67.146
##
## Step: AIC=63.93
## mpg ~ cyl + log(disp) + log(hp) + drat + wt + qsec + am + gear +
## carb
##
## Df Sum of Sq RSS AIC
## - drat 1 0.0720 126.35 61.947
## - qsec 1 0.3497 126.63 62.017
## - am 1 1.4160 127.70 62.285
## - wt 1 2.5408 128.82 62.566
## - cyl 1 5.4640 131.74 63.284
## - log(hp) 1 7.9966 134.28 63.893
## <none> 126.28 63.928
## - gear 1 8.2188 134.50 63.946
## - log(disp) 1 9.9454 136.22 64.354
## - carb 1 13.3899 139.67 65.153
##
## Step: AIC=61.95
## mpg ~ cyl + log(disp) + log(hp) + wt + qsec + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - qsec 1 0.3086 126.66 60.025
## - am 1 1.4720 127.82 60.317
## - wt 1 2.5345 128.89 60.582
## - cyl 1 5.3971 131.75 61.285
## <none> 126.35 61.947
## - log(hp) 1 8.5533 134.91 62.043
## - gear 1 8.6525 135.00 62.066
## - log(disp) 1 10.1591 136.51 62.421
## - carb 1 13.3729 139.72 63.166
##
## Step: AIC=60.02
## mpg ~ cyl + log(disp) + log(hp) + wt + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - am 1 1.1724 127.83 58.320
## - wt 1 2.3372 129.00 58.610
## - cyl 1 5.1185 131.78 59.292
## <none> 126.66 60.025
## - gear 1 8.7232 135.38 60.156
## - log(hp) 1 9.3330 135.99 60.300
## - log(disp) 1 12.4852 139.15 61.033
## - carb 1 15.9928 142.65 61.830
##
## Step: AIC=58.32
## mpg ~ cyl + log(disp) + log(hp) + wt + gear + carb
##
## Df Sum of Sq RSS AIC
## - wt 1 2.7307 130.56 56.996
## - cyl 1 6.6592 134.49 57.945
## <none> 127.83 58.320
## - log(hp) 1 8.7242 136.56 58.432
## - gear 1 15.8483 143.68 60.060
## - log(disp) 1 16.0475 143.88 60.104
## - carb 1 16.7435 144.58 60.258
##
## Step: AIC=57
## mpg ~ cyl + log(disp) + log(hp) + gear + carb
##
## Df Sum of Sq RSS AIC
## - log(hp) 1 7.249 137.81 56.725
## <none> 130.56 56.996
## - cyl 1 13.526 144.09 58.150
## - gear 1 32.779 163.34 62.164
## - carb 1 38.572 169.13 63.279
## - log(disp) 1 53.640 184.20 66.010
##
## Step: AIC=56.73
## mpg ~ cyl + log(disp) + gear + carb
##
## Df Sum of Sq RSS AIC
## - cyl 1 8.707 146.52 56.686
## <none> 137.81 56.725
## - gear 1 26.481 164.29 60.349
## - carb 1 60.918 198.73 66.439
## - log(disp) 1 97.216 235.03 71.807
##
## Step: AIC=56.69
## mpg ~ log(disp) + gear + carb
##
## Df Sum of Sq RSS AIC
## <none> 146.52 56.686
## - gear 1 20.189 166.71 58.817
## - carb 1 52.570 199.09 64.497
## - log(disp) 1 152.562 299.08 77.519
##
## Call:
## lm(formula = mpg ~ log(disp) + gear + carb)
##
## Coefficients:
## (Intercept) log(disp) gear carb
## 51.789 -6.592 1.787 -1.227
This output tells us to keep only 3 variables as well: log of displacement, number of gears, and number of carburetors. Then we want to run a model for each using the recommended variables.
model2<-lm(mpg~qsec+wt+am)
summary(model2)
##
## Call:
## lm(formula = mpg ~ qsec + wt + am)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## qsec 1.2259 0.2887 4.247 0.000216 ***
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
model6<-lm(mpg~log(disp)+gear+carb)
summary(model6)
##
## Call:
## lm(formula = mpg ~ log(disp) + gear + carb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0461 -1.3931 -0.5111 1.8053 4.2983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.7887 8.5069 6.088 1.45e-06 ***
## log(disp) -6.5917 1.2208 -5.399 9.31e-06 ***
## gear 1.7869 0.9097 1.964 0.05950 .
## carb -1.2271 0.3872 -3.170 0.00368 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.288 on 28 degrees of freedom
## Multiple R-squared: 0.8699, Adjusted R-squared: 0.8559
## F-statistic: 62.4 on 3 and 28 DF, p-value: 1.62e-12
Here we can see that the model with the transformations does not improve much compared to the reduced model without any transformations. Transformed models are also harder to interpret, so we will stick with model2 instead.
Instead of using the stepAIC() command, we could also have created a nested model and used the anova(first model, second model) command to test whether the nested model is an adequate simplification of the parent model. This command tests the null hypothesis that the coefficients of the dropped predictors are all equal to 0 against the alternative that at least one of them is not equal to 0. If the nested model is adequate, we do not want to reject the null hypothesis; rejecting it would mean that at least one of the predictors we dropped was significant to the model and should not have been dropped.
nestmodel = lm(mpg ~ wt + qsec + am)
anova(model,nestmodel)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ wt + qsec + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 21 147.49
## 2 28 169.29 -7 -21.791 0.4432 0.8636
From the ANOVA table above, we can see that the p-value is quite large, so we do not have enough evidence to reject the null hypothesis. This means the nested model using weight, quarter mile time, and transmission type is preferable to the model using all of the predictors, since we were able to create a simpler model that is still accurate in predicting the response variable mpg.
Multicollinearity is an important issue that needs to be addressed in multiple linear regression. If our data display multicollinearity, then our predictor variables are correlated with each other, which inflates the variance of the coefficient estimates and can lead to overfitting and higher errors. The first thing to do to check for multicollinearity is to run a plot of your data. This produces a scatterplot matrix where you can see how all of the variables relate to one another.
plot(mtcars)
From this plot it may be hard to tell what is correlated and what is not, but one pair of variables that looks highly correlated is weight and displacement. Since displacement is not included in our final model, that particular pair is not a concern, but we should still check the correlations between the variables we are using with the cor() command. If our data do not display multicollinearity, then all of these values will be less than .8 in absolute value.
cor(qsec, wt)
## [1] -0.1747159
cor(am, wt)
## [1] -0.6924953
cor(am, qsec)
## [1] -0.2298609
These all check out to be less than .8 in absolute value, which means our predictors are not highly correlated with each other. For future reference, the negative values indicate negative correlation and the positive values indicate positive correlation.
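Equivalently, we could get all of the pairwise correlations in one call by handing cor() just the columns we are using; a quick sketch:
cor(mtcars[, c("qsec", "wt", "am")]) # full correlation matrix for the model's predictors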
The final way to check for multicollinearity is the variance inflation factor (VIF), which quantifies its severity. For a predictor \(x_j\), \(VIF_j = \frac{1}{1 - R_j^2}\), where \(R_j^2\) comes from regressing \(x_j\) on the other predictors in the model. For our model to check out, all of the VIFs need to be less than 10.
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
vif(model2)
## qsec wt am
## 1.364339 2.482952 2.541437
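To connect the vif() output back to the formula, here is a sketch of the VIF for weight computed by hand: regress wt on the other two predictors and plug its R-squared into 1/(1 - R^2); it should match the value reported above.
r2_wt <- summary(lm(wt ~ qsec + am))$r.squared # R-squared from regressing wt on the other predictors
1 / (1 - r2_wt) # hand-computed VIF for wt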
Our VIF looks good so we can come to the conclusion that our model is not being impacted by multicollinearity.
If the residual plot for our model were to show a clear pattern, we could use polynomial regression to create a more accurate model. This means we start by adding an \(x^2\) term, then \(x^3\), and so on until we get the model that fits the data best. Let’s check the residuals of our model plotted against its fitted values:
plot(model2$residuals ~ model2$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
abline(0,0)
This does not have a clear pattern to it, but we can add a quadratic term and see whether the plot of the residuals against the fitted values improves:
quadmod = lm(mpg ~ qsec + I(qsec^2)+ wt + am)
plot(quadmod$residuals ~ quadmod$fitted.values, xlab = "Fitted Values", ylab = "Residuals")
abline(0,0)
As shown by the new plot, there is no meaningful difference between it and the plot without the quadratic term. Just to be sure, however, we can use the summary() command on the model with the quadratic term to check its significance:
summary(quadmod)
##
## Call:
## lm(formula = mpg ~ qsec + I(qsec^2) + wt + am)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9778 -1.5052 -0.2261 1.1002 4.2934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36.2912 30.1135 -1.205 0.2386
## qsec 6.2889 3.2478 1.936 0.0634 .
## I(qsec^2) -0.1388 0.0887 -1.565 0.1293
## wt -3.8798 0.6939 -5.591 6.25e-06 ***
## am 3.1009 1.3798 2.247 0.0330 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.398 on 27 degrees of freedom
## Multiple R-squared: 0.8622, Adjusted R-squared: 0.8417
## F-statistic: 42.22 on 4 and 27 DF, p-value: 3.042e-11
AIC(quadmod)
## [1] 153.3415
The quadratic term I(qsec^2) has a p-value of .1293, showing that it is not significant and therefore not worth keeping in our final model.
The other thing to check is whether adding an interaction term would be helpful for our model. An interaction term lets a predictor variable have a different relationship with the response variable depending on the value of another predictor. We suspected that quarter mile time or weight might have a different relationship with mpg depending on the transmission, so we tested those and found that transmission and weight are two good terms to interact.
model3<-lm(mpg~qsec+wt*am)
summary(model3)
##
## Call:
## lm(formula = mpg ~ qsec + wt * am)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5076 -1.3801 -0.5588 1.0630 4.3684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.723 5.899 1.648 0.110893
## qsec 1.017 0.252 4.035 0.000403 ***
## wt -2.937 0.666 -4.409 0.000149 ***
## am 14.079 3.435 4.099 0.000341 ***
## wt:am -4.141 1.197 -3.460 0.001809 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.084 on 27 degrees of freedom
## Multiple R-squared: 0.8959, Adjusted R-squared: 0.8804
## F-statistic: 58.06 on 4 and 27 DF, p-value: 7.168e-13
AIC(model3)
## [1] 144.3736
This is by far our best model: it has the highest R-squared, every variable is significant, the F-test is highly significant, and it has the lowest AIC. The interaction term in this model means that when the transmission is manual, every additional thousand pounds of weight reduces mpg by 4.141 more than it does for an automatic transmission.
We have decided on this being our final model so we want to run all our assumptions and residual analysis again.
resid3<- model3$residuals
hist(resid3)
qqnorm(resid3)
qqline(resid3)
plot(model3$residuals ~ disp)
abline(0,0)
The assumptions look good and we can conclude that they are all met.
plot(model3)
All of these plots look as good as, if not better than, the plots before. Keep in mind that they are much harder to get perfect with a smaller data set like ours.
We made a strong model, but now there are ways to use it to draw more conclusions about our data set.
We can also make confidence and prediction intervals for a specific point in our predictors’ ranges. For this example we will choose a weight of 2.92 thousand pounds, a quarter mile time of 20.1 seconds, and a manual transmission.
In order to do this we need to create a data frame with these three variables set to these values.
newdata <- data.frame(wt=2.92, qsec=20.1, am=1)
Now after that we can actually make our intervals. The first one we will do is a prediction interval. Use the command predict(model name, the new data frame, interval=“predict”).
predy <- predict(model3, newdata, interval="predict")
predy
## fit lwr upr
## 1 23.57616 18.62856 28.52376
The output is our prediction interval, centered at the point prediction of 23.57616. The lower and upper bounds mean that for a single car with those three predictor values, we are 95% confident its MPG will be between 18.62856 and 28.52376.
The other interval is a confidence interval which is the same code but with confidence instead of predict.
confy <- predict(model3, newdata, interval="confidence")
confy
## fit lwr upr
## 1 23.57616 21.08774 26.06458
The output is our confidence interval, also centered at 23.57616, which here estimates the mean response. The lower and upper bounds mean that for the population of cars with those three predictor values, we are 95% confident the mean MPG is between 21.08774 and 26.06458.
Thinking about the concepts of prediction vs. confidence intervals, we would expect our prediction interval to be much wider than our confidence interval. A quick way to check that this is true is to compute each interval’s width:
confy %*% c(0, -1, 1) #conf interval width
## [,1]
## 1 4.976848
predy %*% c(0, -1, 1)
## [,1]
## 1 9.895191
We can see that our expectation holds. The other thing both intervals should share is the same center, which should be the point prediction for those three predictor values. We can check that by simply testing whether they are equal.
confy[1] == predy[1]
## [1] TRUE
R gives us a nice output saying simply “TRUE” which means they are equal and have the same centers.
Now the point of this model is to use it as a prediction tool. To predict MPG we would just plug the values for our predictor variables into our model.
We will look at a single row of data by typing the data set’s name followed by [row number, ].
mtcars[20, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
This gives us all of the data from row 20, the Toyota Corolla, which has values for every variable we are interested in. We will use the predictors to predict MPG, so we plug them into our equation to get this line of code for a prediction:
pred<-coef(summary(model3))[1,1]+coef(summary(model3))[2,1]*19.9+coef(summary(model3))[3,1]*1.835+coef(summary(model3))[4,1]*1+coef(summary(model3))[5,1]*1.835*1
pred
## [1] 31.0523
This gives us our predicted value for MPG as 31.0523. Now the important measure is how far our predicted value is from the observed value or what is called a residual. You can find this by simply taking the observed value minus the predicted value.
33.9-31.0523
## [1] 2.8477
As we can see, our residual is right about 2.85 MPG, meaning our model under-predicted by 2.85 MPG. Residuals like this are exactly what we used earlier to calculate our model’s error.
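Rather than typing out the coefficients, the predict() command will do the same arithmetic for us; a quick sketch using the same row of data:
predict(model3, newdata = mtcars[20, ]) # same predicted mpg without typing the coefficients
mtcars$mpg[20] - predict(model3, newdata = mtcars[20, ]) # the residual again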
Overall, you should now have a solid understanding of what multiple linear regression is and how to build a multiple linear regression model in R. These methods can be replicated with any data set that is available in R or can be imported into R.