In this guide we will walk you through some multiple linear regression basics in R. Multiple linear regression models include a response variable and more than one predictor variable, making them more complex than simple linear regression models.
Our regression equation will now look something like this:
\[\hat{y}= \hat{\beta_0}+\hat{\beta_1} x_1+\hat{\beta_2}x_2+\hat{\beta_3}x_3\]
We will use the fuel2001 data from the alr3 package. Once the package is installed and loaded, we can start working with the data. We will attach the data so we do not have to keep referencing it by name.
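If the alr3 package is not already installed, the usual one-time step is:
install.packages("alr3")   # only needs to be run once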
The fuel2001 data includes information on motor fuel consumption and related variables for each U.S. state (plus the District of Columbia) in the year 2001. Our response variable will be the amount of gasoline sold for road use (in thousands of gallons). We will use the other variables included as our predictors.
We will use the plot and summary commands on our data to get an idea of what we are working with. The plot command produces a matrix of pairwise scatterplots showing the relationships between all of the variables. The summary command gives us the minimum, 1st quartile, median, mean, 3rd quartile, and maximum for each variable.
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data("fuel2001")
attach(fuel2001)
plot(fuel2001)
summary(fuel2001)
## Drivers FuelC Income Miles
## Min. : 328094 Min. : 148769 Min. :20993 Min. : 1534
## 1st Qu.: 1087128 1st Qu.: 737361 1st Qu.:25323 1st Qu.: 36586
## Median : 2718209 Median : 2048664 Median :27871 Median : 78914
## Mean : 3750504 Mean : 2542786 Mean :28404 Mean : 77419
## 3rd Qu.: 4424256 3rd Qu.: 3039932 3rd Qu.:31209 3rd Qu.:112828
## Max. :21623793 Max. :14691753 Max. :40640 Max. :300767
## MPC Pop Tax
## Min. : 6556 Min. : 381882 Min. : 7.50
## 1st Qu.: 9391 1st Qu.: 1162624 1st Qu.:18.00
## Median :10458 Median : 3115130 Median :20.00
## Mean :10448 Mean : 4257046 Mean :20.15
## 3rd Qu.:11311 3rd Qu.: 4845200 3rd Qu.:23.25
## Max. :17495 Max. :25599275 Max. :29.00
Before we create our multiple linear regression model, we need to check for multicollinearity. That is, we will check whether our predictor variables are too highly correlated with each other, which can inflate our standard errors. We will use the variance inflation factor (VIF) to check for multicollinearity. If a VIF is greater than 10, we have to worry about correlation between the predictor variables.
First, we fit auxiliary models among our predictor variables, taking each predictor in turn as the response and all of the others as predictors.
modDrivers<- lm(Drivers ~ Income + Miles + MPC + Pop + Tax)
modIncome<- lm(Income ~ Drivers + Miles + MPC + Pop + Tax)
modMiles<- lm(Miles ~ Income + Drivers + MPC + Pop + Tax)
modMPC<- lm(MPC ~ Income + Miles + Drivers + Pop + Tax)
modPop<- lm(Pop ~ Income + Miles + MPC + Drivers + Tax)
modTax<- lm(Tax ~ Income + Miles + MPC + Pop + Drivers)
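As a side note, the VIF for any one predictor is simply \(1/(1-R^2)\), where \(R^2\) comes from that predictor's auxiliary regression above, so we could compute it by hand. For example, for Income:
# VIF for Income computed by hand from its auxiliary regression
1/(1 - summary(modIncome)$r.squared)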
Now we are ready to check the VIFs using the vif command from the car package (loaded along with alr3). Note that vif takes a single fitted model and returns one VIF for each of that model's predictors, so in the call below only the first model, modDrivers, is used, and the output shows the VIFs for its five predictors.
vif(modDrivers,modIncome,modMiles,modMPC,modPop,modTax)
## Income Miles MPC Pop Tax
## 1.661833 2.322735 1.645997 2.611040 1.108626
We see that each VIF is less than 10, so we can move forward and make a model!
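As an aside, vif is more often applied directly to the regression model we actually plan to fit; a minimal sketch (output not shown) would be:
# VIFs for the predictors of the fuel consumption model itself
vif(lm(FuelC ~ Drivers + Income + Miles + MPC + Pop + Tax, data = fuel2001))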
We will create our multiple linear regression model similarly to how we construct simple linear regression models.
We will name our model gasmod and use the lm (“lm” stands for linear model) command to create our model, listing first the response variable and then the explanatory variables. The response and explanatory variables will be separated by “~” and each explanatory variable will be separated by “+” because we are not using an interaction term in this model.
For this model we will include 2 predictor variables, the number of licensed drivers in the state and per capita personal income, to predict our response variable of fuel consumption.
Let’s also use the summary command, which will produce a summary of our model.
gasmod<-lm(FuelC~Drivers+Income)
summary(gasmod)
##
## Call:
## lm(formula = FuelC ~ Drivers + Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1732521 -144319 -54704 148908 1820242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.842e+05 4.214e+05 1.861 0.0689 .
## Drivers 6.735e-01 1.681e-02 40.053 <2e-16 ***
## Income -2.701e+01 1.508e+01 -1.792 0.0794 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 458700 on 48 degrees of freedom
## Multiple R-squared: 0.9722, Adjusted R-squared: 0.9711
## F-statistic: 840.5 on 2 and 48 DF, p-value: < 2.2e-16
We can find the intercept and slope estimates in the column labeled Estimate and use them to write out our fitted regression equation. In this case we get:
\[\widehat{Gas} = 784200 + 0.6735\,Drivers - 27.01\,Income\]
where Gas is fuel consumed (in thousands of gallons), Drivers is the number of licensed drivers, and Income is per capita income in dollars.
Because FuelC is measured in thousands of gallons, a slope of 0.6735 means that for every additional licensed driver in a state, fuel consumption is predicted to increase by 673.5 gallons on average, all else held constant. Likewise, for every additional dollar of per capita income, fuel consumption is predicted to fall by 27,010 gallons on average, all else held constant.
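As a quick illustration using the rounded coefficients above, a state with 1,000,000 licensed drivers and $30,000 per capita income would be predicted to consume about 647,400 thousand gallons:
# Plugging illustrative values into the fitted equation (rounded coefficients)
784200 + 0.6735*1000000 - 27.01*30000
## [1] 647400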
Let’s plot our model to see what we are working with.
We will use the plot command with our model as an argument. This produces 4 separate diagnostic plots, so we will first run par(mfrow=c(2,2)) to tell R to arrange the plots in a 2-by-2 grid. You can change these dimensions by changing the numbers in the command. This step is not necessary, but it is helpful.
par(mfrow=c(2,2))
plot(gasmod)
First, we will look at the residuals vs. fitted plot. This can be used to check the assumption of homoscedasticity or equal variance. Ideally, we want to see equal spread and no trend in this plot. Things to look out for include curvature, heteroscedasticity, and outliers.
The curvature is the most important thing to pay attention to in this plot. It indicates our “mean function” could be missing something. We can improve this plot by using transforms or trying polynomial regression.
Looking at the residuals vs. fitted plot of our model, we see that though the curvature of the line is not terrible, there is definitely evidence of heteroscedasticity.
Because the residuals vs. fitted plot is so important, we will work with transformations to improve its curvature and spread before looking at our other plots.
There are many different types of transformations, but some common ones include square root, log, and inverse transformations.
First, let’s try a log transformation on our response variable and then plot our new model to see if our plots have improved.
gasmodt <- lm(log(FuelC) ~ Drivers + Income, data = fuel2001)
par(mfrow=c(2,2))
plot(gasmodt)
Our residuals vs. fitted plot looks quite a bit worse than the original, with much more curvature. We had better try another transformation.
Next, let's try a square root transformation on our response variable.
gasmodt2 <- lm(sqrt(FuelC) ~ Drivers + Income, data = fuel2001)
par(mfrow=c(2,2))
plot(gasmodt2)
Again, our residuals vs. fitted plot looks worse.
Let’s try a log transformation on all variables.
gasmodt3 <- lm(log(FuelC) ~ log(Drivers) + log(Income), data = fuel2001)
par(mfrow=c(2,2))
plot(gasmodt3)
The new residuals vs. fitted plot looks great! There is only a small amount of curvature in our line and there are no longer signs of heteroscedasticity.
We will move forward with this model and take a look at our other plots.
We will check model assumptions using the plot command with our model as an argument. If assumptions do not hold, then our model is untrustworthy.
par(mfrow=c(2,2))
plot(gasmodt3)
As we talked about earlier, things to look out for in our residuals vs. fitted plot include curvature, heteroscedasticity, and outliers. The curvature in our plot is minimal, there are no longer signs of heteroscedasticity, and no obvious outliers.
Because our transformed plot no longer shows signs of heteroscedasticity, we can say the model assumption of homoscedasticity (equal variance) holds.
The next plot is a normal Q-Q plot, which we are familiar with from simple linear regression. It helps us check the assumption that the errors follow a normal distribution. This Q-Q plot uses standardized residuals (each residual scaled by its estimated standard deviation), so it may look a little different from the ones we made to check this assumption in the past. The closer the points follow the line, the closer the residuals are to normal.
Looking at our Normal Q-Q plot above, we see that the points follow the line fairly closely which is what we want to see. Thus, we can say the normality assumption holds.
The third plot, the scale-location plot, is similar to the residuals vs. fitted plot except that it plots the square root of the absolute standardized residuals, so all points are non-negative and the skewness of the largest residuals is reduced. Ideally, there will not be an obvious trend.
Though the line in our scale-location plot is not completely straight, there is no obvious trend.
The final plot, residuals vs. leverage, incorporates Cook's distance. Cook's distance measures how much influence each data point has on the fitted regression coefficients. We want all of our data points to fall inside the dashed red Cook's distance contours (which resemble a funnel); a point falling outside them has too much influence on the coefficients.
Looking at our residuals vs. leverage plot, we see that all points fall within the red lines.
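If we want a numeric check to go along with the plot, Cook's distances can also be extracted directly from the model; a small sketch (output not shown):
# Largest Cook's distance in the transformed model
max(cooks.distance(gasmodt3))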
Now that we have transformed our model and checked assumptions, let’s use our model to make a prediction.
Let’s look at a model summary first to see what we are working with.
summary(gasmodt3)
##
## Call:
## lm(formula = log(FuelC) ~ log(Drivers) + log(Income), data = fuel2001)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24925 -0.05895 -0.01716 0.08780 0.29932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.60979 1.17173 3.081 0.003414 **
## log(Drivers) 1.03119 0.01759 58.632 < 2e-16 ***
## log(Income) -0.43580 0.11734 -3.714 0.000531 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1238 on 48 degrees of freedom
## Multiple R-squared: 0.9866, Adjusted R-squared: 0.986
## F-statistic: 1765 on 2 and 48 DF, p-value: < 2.2e-16
Like before, we can see our intercept and slope estimates. Remember, the response and predictors are now log transformed. We will keep this in mind as we make our predictions.
To make a prediction, we simply plug in values of our predictive variables into our regression equation. Our transformed regression equation will look like this:
\[\widehat{log(Gas)} = 3.60979 + 1.03119 *log(Drivers)-0.43580*log(Income) \]
Where Gas is thousands of gallons of gas, Drivers is number of licensed drivers, and Income is per capita income.
Let’s predict fuel consumption for a state with 1,000,000 licensed drivers and $30,000 as its income per capita.
We create the variables DriversPredict and IncomePredict to store the values we are plugging into our equation for drivers and income. Because of our log transformation, we will apply a log to these numbers.
The variable FuelPredict will then store our prediction for fuel consumption. Our equation will give us the log of our fuel consumption, so we will apply the function exp to undo the log to make interpretation easier.
DriversPredict<-log(1000000)
IncomePredict<-log(30000)
FuelPredict<- 3.60979+1.03119*DriversPredict-0.43580*IncomePredict
exp(FuelPredict)
## [1] 636388.7
This means that our model predicts a state with 1,000,000 licensed drivers and $30,000 per capita income would have consumed 636,388,700 gallons of fuel in the year 2001.
Let's make prediction and confidence intervals. A prediction interval gives a range, at a given confidence level, for the fuel consumption of an individual state with 1,000,000 licensed drivers and $30,000 per capita income. A confidence interval gives a range, at a given confidence level, for the mean fuel consumption of states with 1,000,000 licensed drivers and $30,000 per capita income.
First, we will create a new data frame for the point we are interested in. To do this we will use the command data.frame with the arguments Drivers=1000000 and Income=30000. If we were interested in another point, we could simply replace the arguments after the command.
new.data<-data.frame(Drivers=1000000,Income=30000)
new.data
## Drivers Income
## 1 1e+06 30000
Now we can use our new data frame to make some intervals. Both the prediction interval and the confidence interval use the predict command with three arguments: our model, the new data frame, and interval= set to the type of interval we want. The default confidence level is 95%. The only difference between the two commands is the interval type we specify.
For a prediction interval we will specify the interval type, "predict". Remember that because of our log transformation, we will apply exp to our interval for interpretation.
predsl <- predict(gasmodt3, new.data, interval="predict")
exp(predsl)
## fit lwr upr
## 1 636393.6 493680.1 820363
The prediction interval estimates that, when a state has 1,000,000 licensed drivers and $30,000 per capita income, 95% of the time the amount of fuel consumed in a year will fall between 493,680,100 and 820,363,000 gallons.
For a confidence interval we will specify the interval type, "confidence". Again, we will apply exp to our interval.
confsl <- predict(gasmodt3, new.data, interval="confidence")
exp(confsl)
## fit lwr upr
## 1 636393.6 604964.9 669455.1
We’re 95% confident that the mean amount of fuel consumed in a year by states with 1,000,000 licensed drivers and $30,000 per capita income will fall in the interval from 604,964,900 to 669,455,100 gallons.
If we want to create confidence intervals for our regression coefficients, we can use the command confint. Its arguments are our model and the confidence level we want; here we apply it to our earlier log-response model, gasmodt, and ask for 90% intervals.
confint(gasmodt,level=.90)
## 5 % 95 %
## (Intercept) 1.296793e+01 1.481946e+01
## Drivers 1.838706e-07 2.577410e-07
## Income -4.916113e-05 1.706804e-05
This output tells us that we are 90% confident the intercept is between about 12.97 and 14.82, the coefficient on Drivers is between about 1.84e-07 and 2.58e-07, and the coefficient on Income is between about -4.92e-05 and 1.71e-05. Because the response in gasmodt is log(FuelC), these coefficients describe changes in the log of fuel consumption. Notice that the interval for Income contains zero, so at the 90% level we cannot rule out that income has no effect once drivers are accounted for. As usual, we do not interpret the intercept directly, because we have no data points near zero drivers and zero income.
Sometimes multiple linear regression models include an interaction term. An interaction term models whether the relationship between the response and one predictor depends on the value of another predictor. We will do a t-test on our interaction term to see if it is significant.
\(H_o\) : There is not a significant interaction between the 2 predictor variables.
\(H_a\) : There is a significant interaction between the 2 predictor variables.
gasmod2<-lm(FuelC~Drivers*Income)
summary(gasmod2)
##
## Call:
## lm(formula = FuelC ~ Drivers * Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1532767 -111468 3182 117182 1368979
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.549e+05 6.011e+05 -0.757 0.4529
## Drivers 1.136e+00 1.695e-01 6.699 2.34e-08 ***
## Income 1.472e+01 2.080e+01 0.708 0.4826
## Drivers:Income -1.513e-05 5.526e-06 -2.738 0.0087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 430500 on 47 degrees of freedom
## Multiple R-squared: 0.9761, Adjusted R-squared: 0.9745
## F-statistic: 638.7 on 3 and 47 DF, p-value: < 2.2e-16
Because our interaction term (Drivers:Income) has a small p-value (0.0087), we reject the null hypothesis. We can say that the relationship between fuel consumption and the number of licensed drivers significantly depends on per capita income.
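For reference, using the estimates above, the fitted equation including the interaction term is:
\[\widehat{FuelC} = -454900 + 1.136\,Drivers + 14.72\,Income - 1.513\times 10^{-5}\,(Drivers \times Income)\]
where FuelC is again measured in thousands of gallons.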
There are many ways to build a model. Three popular methods include forward selection, backward elimination and all subsets regression. We will look at backward elimination.
In backward elimination, we start with all of our candidate predictor variables in the model. We then remove the least significant predictor one at a time, refitting after each removal, until all of the remaining predictors are significant given the others.
We will be using t-tests to determine if our predictors are significant, with an alpha level of .05. If the p-value is less than alpha then the predictor is significant. The least significant predictor will be the one with the highest p-value.
First, let’s build a linear model with all of the predictors in it. The summary command runs the t-test for us. Under the columns labeled, t value and Pr(>|t|), we can find the resulting test statistic and p-values.
mod1<-lm(FuelC~Drivers+Income+Miles+MPC+Pop+Tax)
summary(mod1)
##
## Call:
## lm(formula = FuelC ~ Drivers + Income + Miles + MPC + Pop + Tax)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1480910 -158802 19267 174208 1090089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.902e+05 8.199e+05 -0.598 0.552983
## Drivers 6.368e-01 1.452e-01 4.386 7.09e-05 ***
## Income 7.690e+00 1.632e+01 0.471 0.639793
## Miles 5.850e+00 1.621e+00 3.608 0.000784 ***
## MPC 4.562e+01 3.565e+01 1.280 0.207337
## Pop -1.945e-02 1.245e-01 -0.156 0.876586
## Tax -2.087e+04 1.324e+04 -1.576 0.122235
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 398400 on 44 degrees of freedom
## Multiple R-squared: 0.9808, Adjusted R-squared: 0.9782
## F-statistic: 374.6 on 6 and 44 DF, p-value: < 2.2e-16
When we look at the t-tests on all of the predictors, we see that a couple of predictors are highly significant. However, some predictors have p-values much higher than our alpha level of .05. We will start by removing the predictor that is currently least significant: population (Pop), which has the highest p-value.
mod2<-lm(FuelC~Drivers+Income+Miles+MPC+Tax)
summary(mod2)
##
## Call:
## lm(formula = FuelC ~ Drivers + Income + Miles + MPC + Tax)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1517423 -161111 19930 173532 1085471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.844e+05 8.102e+05 -0.598 0.552903
## Drivers 6.144e-01 2.229e-02 27.560 < 2e-16 ***
## Income 7.526e+00 1.611e+01 0.467 0.642587
## Miles 5.813e+00 1.587e+00 3.664 0.000652 ***
## MPC 4.643e+01 3.488e+01 1.331 0.189820
## Tax -2.114e+04 1.298e+04 -1.629 0.110298
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 394100 on 45 degrees of freedom
## Multiple R-squared: 0.9808, Adjusted R-squared: 0.9787
## F-statistic: 459.5 on 5 and 45 DF, p-value: < 2.2e-16
After doing this, we still have variables that are not significant. We will next take away income, because it now has the highest p-value.
mod3<-lm(FuelC~Drivers+Miles+MPC+Tax)
summary(mod3)
##
## Call:
## lm(formula = FuelC ~ Drivers + Miles + MPC + Tax)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1511238 -167913 12604 166767 1104257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.788e+05 4.742e+05 -0.377 0.707764
## Drivers 6.175e-01 2.110e-02 29.273 < 2e-16 ***
## Miles 5.582e+00 1.495e+00 3.735 0.000517 ***
## MPC 3.903e+01 3.081e+01 1.267 0.211586
## Tax -2.155e+04 1.284e+04 -1.678 0.100061
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 390700 on 46 degrees of freedom
## Multiple R-squared: 0.9807, Adjusted R-squared: 0.979
## F-statistic: 584.3 on 4 and 46 DF, p-value: < 2.2e-16
We will next take out miles driven per capita (MPC), because it now has the highest p-value.
mod4<-lm(FuelC~Drivers+Miles+Tax)
summary(mod4)
##
## Call:
## lm(formula = FuelC ~ Drivers + Miles + Tax)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1578697 -176469 37821 160777 1070883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.133e+05 2.736e+05 1.145 0.258
## Drivers 6.056e-01 1.900e-02 31.877 < 2e-16 ***
## Miles 6.227e+00 1.414e+00 4.403 6.13e-05 ***
## Tax -2.599e+04 1.243e+04 -2.091 0.042 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 393200 on 47 degrees of freedom
## Multiple R-squared: 0.98, Adjusted R-squared: 0.9787
## F-statistic: 768.6 on 3 and 47 DF, p-value: < 2.2e-16
We can see that after taking out MPC, all of our variables are now significant at an alpha level of .05. Since all of our variables are significant, we will use this as our model. The predictors in our final model are the number of licensed drivers, miles of federal highway, and the gasoline tax rate in each state.
There are several ways to compare models. We want a model that fits well, with small residuals, without becoming more complicated than necessary.
One way to compare models is using a t-test, which shows us if we can drop one predictor. We compare the p-values from the t-test of nested models to see if additional variables are worth adding. If the p-value of the added variable is less than our alpha level of .05, then it is worth keeping in the model. If its p-value is greater than alpha, then we can take it out and use the simpler model.
Let’s use a t-test to determine if we should include the predictor, miles, in our model.
mod1<-lm(FuelC~Drivers)
summary(mod1)
##
## Call:
## lm(formula = FuelC ~ Drivers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1842399 -109834 -33255 154576 1906386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.588e+04 9.053e+04 0.507 0.615
## Drivers 6.657e-01 1.662e-02 40.067 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 468900 on 49 degrees of freedom
## Multiple R-squared: 0.9704, Adjusted R-squared: 0.9698
## F-statistic: 1605 on 1 and 49 DF, p-value: < 2.2e-16
From this first t-test, we see that drivers is a significant predictor.
mod2<-lm(FuelC~Drivers+Miles)
summary(mod2)
##
## Call:
## lm(formula = FuelC ~ Drivers + Miles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1667584 -207441 63143 156912 1055581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.207e+05 1.016e+05 -2.173 0.034783 *
## Drivers 6.121e-01 1.938e-02 31.578 < 2e-16 ***
## Miles 6.041e+00 1.460e+00 4.137 0.000141 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 406800 on 48 degrees of freedom
## Multiple R-squared: 0.9782, Adjusted R-squared: 0.9773
## F-statistic: 1075 on 2 and 48 DF, p-value: < 2.2e-16
When we check if miles is significant when added to the model, we see that it has a p-value less than alpha, so we can conclude that it is worth keeping in the model.
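As a side note, an equivalent way to test whether a single added predictor is worth keeping is the partial F-test from the anova command applied to the two nested models; when only one term is added, this gives the same p-value as the t-test above. A minimal sketch (output not shown):
# Nested-model comparison: partial F-test for adding Miles to the drivers-only model
anova(mod1, mod2)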
Let’s try another example. This time we will check if it is worth including the variable income.
mod1<-lm(FuelC~Drivers)
summary(mod1)
##
## Call:
## lm(formula = FuelC ~ Drivers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1842399 -109834 -33255 154576 1906386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.588e+04 9.053e+04 0.507 0.615
## Drivers 6.657e-01 1.662e-02 40.067 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 468900 on 49 degrees of freedom
## Multiple R-squared: 0.9704, Adjusted R-squared: 0.9698
## F-statistic: 1605 on 1 and 49 DF, p-value: < 2.2e-16
We can see that drivers is significant.
mod2<-lm(FuelC~Drivers + Income)
summary(mod2)
##
## Call:
## lm(formula = FuelC ~ Drivers + Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1732521 -144319 -54704 148908 1820242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.842e+05 4.214e+05 1.861 0.0689 .
## Drivers 6.735e-01 1.681e-02 40.053 <2e-16 ***
## Income -2.701e+01 1.508e+01 -1.792 0.0794 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 458700 on 48 degrees of freedom
## Multiple R-squared: 0.9722, Adjusted R-squared: 0.9711
## F-statistic: 840.5 on 2 and 48 DF, p-value: < 2.2e-16
We see that when we add income as a variable in addition to drivers, it is not significant at an alpha level of .05. This means that it is not worth adding into our model. The model with just drivers in it is better, because all the variables are significant and it is simpler.
AIC stands for Akaike Information Criterion. It is a way to compare models that balances model fit against the number of predictors. AIC is only useful when comparing two or more models. We will compare the same pairs of models we just compared with the t-tests.
mod1<-lm(FuelC~Drivers)
mod2<-lm(FuelC~Drivers+Miles)
AIC(mod1)
## [1] 1480.629
AIC(mod2)
## [1] 1467.076
When we compare the AICs of two models, we prefer the one with the lower AIC, keeping complexity in mind: if the simpler model has the lower AIC we choose it outright, and if the more complex model has the lower AIC we require its AIC to be at least 10 smaller before the extra complexity is worthwhile. In this case, the more complex model has an AIC about 13 smaller than the simpler one, so we will use the model with both drivers and miles as the predictors. Next, let's try it again with the other models we compared with the t-test.
mod1<-lm(FuelC~Drivers)
mod2<-lm(FuelC~Drivers + Income)
AIC(mod1)
## [1] 1480.629
AIC(mod2)
## [1] 1479.327
When we compare these models, we see that the more complex model has an AIC only about 1 lower than the simpler model. This is not a big enough drop to justify the added complexity, so we would choose the model with only drivers as the predictor in this case.
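As a final aside, R's built-in step function can automate this kind of AIC-based backward elimination. A minimal sketch starting from the full model (fullmod is just an illustrative name; output not shown):
# Automated backward elimination driven by AIC, starting from all six predictors
fullmod <- lm(FuelC ~ Drivers + Income + Miles + MPC + Pop + Tax, data = fuel2001)
step(fullmod, direction = "backward")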