Getting Started

In this guide we will walk you through some multiple linear regression basics in R. Multiple linear regression models include a response variable and more than one predictor variable, making them more complex than simple linear regression models.

Our regression equation will now look something like this:

\[\hat{y}= \hat{\beta_0}+\hat{\beta_1} x_1+\hat{\beta_2}x_2+\hat{\beta_3}x_3\]

We will use the fuel2001 data from the alr3 package. After you install and load this package, we can start working with the data. We will also attach the data so we do not have to keep prefixing column names with the name of the data frame.
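If alr3 is not already installed on your machine, it can be installed from CRAN first (a one-time step; this assumes the package is available on your CRAN mirror):

install.packages("alr3")   # also pulls in the car package, which alr3 loads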

The fuel2001 data includes information on motor fuel consumption and related variables for each U.S. state in the year 2001. Our response variable will be FuelC, the amount of gasoline sold for road use (in thousands of gallons). We will use the other variables included as our predictors.

  • Drivers: The number of licensed drivers in the state
  • Income: Per capita personal income (year 2000)
  • Miles: Number of federal-aid highway miles in the state
  • MPC: Estimated miles driven per capita
  • Pop: Population age 16 and over
  • Tax: Gasoline tax rate (cents per gallon)

We will use the plot and summary commands on our data to get an idea of what we are working with. The plot command shows pairwise scatterplots of all of the variables. The summary command gives us the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values for each variable.

library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data("fuel2001")
attach(fuel2001)
plot(fuel2001)

summary(fuel2001)
##     Drivers             FuelC              Income          Miles       
##  Min.   :  328094   Min.   :  148769   Min.   :20993   Min.   :  1534  
##  1st Qu.: 1087128   1st Qu.:  737361   1st Qu.:25323   1st Qu.: 36586  
##  Median : 2718209   Median : 2048664   Median :27871   Median : 78914  
##  Mean   : 3750504   Mean   : 2542786   Mean   :28404   Mean   : 77419  
##  3rd Qu.: 4424256   3rd Qu.: 3039932   3rd Qu.:31209   3rd Qu.:112828  
##  Max.   :21623793   Max.   :14691753   Max.   :40640   Max.   :300767  
##       MPC             Pop                Tax       
##  Min.   : 6556   Min.   :  381882   Min.   : 7.50  
##  1st Qu.: 9391   1st Qu.: 1162624   1st Qu.:18.00  
##  Median :10458   Median : 3115130   Median :20.00  
##  Mean   :10448   Mean   : 4257046   Mean   :20.15  
##  3rd Qu.:11311   3rd Qu.: 4845200   3rd Qu.:23.25  
##  Max.   :17495   Max.   :25599275   Max.   :29.00

Multicollinearity

Before we create our multiple linear regression model, we need to check for multicollinearity. This means we will check whether our predictor variables are too highly correlated with each other, which can inflate our standard errors. We will use the variance inflation factor (VIF) to check for multicollinearity. If a VIF is greater than 10, then we have to worry about the correlation between predictor variables.

First, we create auxiliary models of our predictor variables, with one predictor as the response and all of the others as predictors. These regressions are what the VIF is built from: for each predictor, \(VIF = 1/(1-R^2)\), where \(R^2\) comes from its auxiliary model.

modDrivers<- lm(Drivers ~ Income + Miles + MPC + Pop + Tax)
modIncome<- lm(Income ~ Drivers + Miles + MPC + Pop + Tax)
modMiles<- lm(Miles ~ Income + Drivers + MPC + Pop + Tax)
modMPC<- lm(MPC ~ Income + Miles + Drivers + Pop + Tax)
modPop<- lm(Pop ~ Income + Miles + MPC + Drivers + Tax)
modTax<- lm(Tax ~ Income + Miles + MPC + Pop + Drivers)

Now we are ready to check the VIFs. The vif command takes a single fitted model and returns the VIF for each predictor in that model; since VIFs depend only on the relationships among the predictors, applying it to modDrivers reports the VIFs for Income, Miles, MPC, Pop, and Tax.

vif(modDrivers)
##   Income    Miles      MPC      Pop      Tax 
## 1.661833 2.322735 1.645997 2.611040 1.108626

We see that each VIF is less than 10, so we can move forward and make a model!
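As a cross-check, the VIF for any one predictor can also be computed by hand from its auxiliary regression, since \(VIF = 1/(1-R^2)\). A minimal sketch for Drivers (whose VIF is not shown in the output above):

1/(1 - summary(modDrivers)$r.squared)   # VIF for Drivers, using the R-squared of Drivers ~ all other predictors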

Creating a Model

We will create our multiple linear regression model similarly to how we construct simple linear regression models.

We will name our model gasmod and use the lm (“lm” stands for linear model) command to create our model, listing first the response variable and then the explanatory variables. The response and explanatory variables will be separated by “~” and each explanatory variable will be separated by “+” because we are not using an interaction term in this model.

For this model we will include 2 predictor variables, the number of licensed drivers in the state and per capita personal income, to predict our response variable of fuel consumption.

Let’s also use the summary command, which will produce a summary of our model.

gasmod<-lm(FuelC~Drivers+Income)
summary(gasmod)
## 
## Call:
## lm(formula = FuelC ~ Drivers + Income)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1732521  -144319   -54704   148908  1820242 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.842e+05  4.214e+05   1.861   0.0689 .  
## Drivers      6.735e-01  1.681e-02  40.053   <2e-16 ***
## Income      -2.701e+01  1.508e+01  -1.792   0.0794 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 458700 on 48 degrees of freedom
## Multiple R-squared:  0.9722, Adjusted R-squared:  0.9711 
## F-statistic: 840.5 on 2 and 48 DF,  p-value: < 2.2e-16

We can find the intercept and slopes in the column labeled Estimate and use them to write out our fitted regression equation. In this case we get the equation:

\[\widehat{Gas} = 784200 + 0.6735*(Drivers) - 27.01*(Income) \]

Where Gas is thousands of gallons of gas consumed, Drivers is number of licensed drivers, and Income is per capita income.

This means that for every additional licensed driver in a state, state fuel consumption will increase by 673.5 gallons on average, all else held constant. Also, for every additional dollar of income per capita, state fuel consumption will fall by 27,010 gallons on average, all else held constant.

Transforms

Let’s plot our model to see what we are working with.

We will use the plot command with our model as an argument. This produces 4 separate diagnostic plots, so we will first run the command par(mfrow=c(2,2)) to tell R to arrange the plots in 2 rows and 2 columns. You can change these dimensions by changing the numbers in the command. This command is not necessary, but it is helpful.

par(mfrow=c(2,2))
plot(gasmod)

First, we will look at the residuals vs. fitted plot. This can be used to check the assumption of homoscedasticity or equal variance. Ideally, we want to see equal spread and no trend in this plot. Things to look out for include curvature, heteroscedasticity, and outliers.

The curvature is the most important thing to pay attention to in this plot. It indicates our “mean function” could be missing something. We can improve this plot by using transforms or trying polynomial regression.

Looking at the residuals vs. fitted plot of our model, we see that though the curvature of the line is not terrible, there is definitely evidence of heteroscedasticity.

Because the residuals vs. fitted plot is so important, we will work with transforms to improve its curvature and spread before looking at our other plots.

There are many different types of transforms but some common ones include square root, log, and inverse transformations.

First, let’s try a log transformation on our response variable and then plot our new model to see if our plots have improved.

gasmodt<-lm(log(FuelC)~ Drivers + Income, data= fuel2001)
par(mfrow=c(2,2))
plot(gasmodt)

Our residuals vs. fitted plot looks quite a bit worse than before, with much more curvature. We had better try another transformation.

Next, let’s try a square root transformation on our response variable.

gasmodt2<-lm(sqrt(FuelC)~ Drivers + Income, data= fuel2001)
par(mfrow=c(2,2))
plot(gasmodt2)

Again, our residuals vs. fitted plot looks worse.

Let’s try a log transformation on all variables.

gasmodt3<-lm(log(FuelC)~ log(Drivers) + log(Income), data= fuel2001)
par(mfrow=c(2,2))
plot(gasmodt3)

The new residuals vs. fitted plot looks great! There is only a small amount of curvature in our line and there are no longer signs of heteroscedasticity.

We will move forward with this model and take a look at our other plots.

Checking Model Assumptions

We will check model assumptions using the plot command with our model as an argument. If assumptions do not hold, then our model is untrustworthy.

par(mfrow=c(2,2))
plot(gasmodt3)

Residuals Vs. Fitted Plot

As we talked about earlier, things to look out for in our residuals vs. fitted plot include curvature, heteroscedasticity, and outliers. The curvature in our plot is minimal, there are no longer signs of heteroscedasticity, and no obvious outliers.

Because our transformed plot no longer shows signs of heteroscedasticity, we can say the model assumption of homoscedasticity (equal variance) holds.

Normal Q-Q Plot

The next plot is a normal Q-Q plot, which we are familiar with from simple linear regression. It helps us check the assumption that the errors follow a normal distribution. This Q-Q plot uses standardized residuals (each residual scaled by its estimated standard deviation) rather than the raw residuals, so it may look a little different from the ones we made to check this assumption in the past. The closer the points follow the line, the closer the residuals are to normal.

Looking at our Normal Q-Q plot above, we see that the points follow the line fairly closely which is what we want to see. Thus, we can say the normality assumption holds.
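If we prefer to draw this plot ourselves rather than relying on plot(gasmodt3), the qqnorm and qqline commands do it directly; a minimal sketch using the standardized residuals:

qqnorm(rstandard(gasmodt3))   # Q-Q plot of the standardized residuals
qqline(rstandard(gasmodt3))   # reference line; points near this line are consistent with normality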

Scale-Location Plot

This plot is similar to the residuals vs. fitted plot, except that it plots the square root of the absolute standardized residuals against the fitted values. Taking absolute values folds the bottom half of the plot upward so all points are nonnegative, and the square root reduces the skewness. Ideally, there will not be an obvious trend.

Though the line in our scale-location plot is not completely straight, there is no obvious trend.
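To see exactly what this plot is showing, we could rebuild it by hand; a rough sketch (without the red smoothing line that plot() adds):

plot(fitted(gasmodt3), sqrt(abs(rstandard(gasmodt3))),
     xlab="Fitted values", ylab="sqrt(|standardized residuals|)")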

Residuals vs. Leverage Plot

This plot looks at Cook’s distance, which measures how much influence each data point has on the fitted regression coefficients. We want all of our data points to fall within the red Cook’s distance lines, which resemble a funnel. If a point falls outside them, that data point has too much influence on the regression coefficients.

Looking at our residuals vs. leverage plot, we see that all points fall within the red lines.
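If we want the Cook’s distance values themselves rather than reading them off the plot, we can extract them directly. A minimal sketch; the 4/n cutoff used here is just one common rule of thumb, not the contour drawn by plot():

cd <- cooks.distance(gasmodt3)   # Cook's distance for every state in the data
which(cd > 4/length(cd))         # flag any observations above a common 4/n rule of thumb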

Point Prediction

Now that we have transformed our model and checked assumptions, let’s use our model to make a prediction.

Let’s look at a model summary first to see what we are working with.

summary(gasmodt3)
## 
## Call:
## lm(formula = log(FuelC) ~ log(Drivers) + log(Income), data = fuel2001)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.24925 -0.05895 -0.01716  0.08780  0.29932 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.60979    1.17173   3.081 0.003414 ** 
## log(Drivers)  1.03119    0.01759  58.632  < 2e-16 ***
## log(Income)  -0.43580    0.11734  -3.714 0.000531 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1238 on 48 degrees of freedom
## Multiple R-squared:  0.9866, Adjusted R-squared:  0.986 
## F-statistic:  1765 on 2 and 48 DF,  p-value: < 2.2e-16

Like before, we can find our coefficient estimates in the Estimate column. Remember, the model is now on the log scale: the coefficients go with log(Drivers) and log(Income), and the fitted values are for log(FuelC). We will keep this in mind as we make our predictions.

To make a prediction, we simply plug in values of our predictive variables into our regression equation. Our transformed regression equation will look like this:

\[\widehat{log(Gas)} = 3.60979 + 1.03119 *log(Drivers)-0.43580*log(Income) \]

Where Gas is thousands of gallons of gas, Drivers is number of licensed drivers, and Income is per capita income.

Let’s predict fuel consumption for a state with 1,000,000 licensed drivers and $30,000 as its income per capita.

We create the variables DriversPredict and IncomePredict to store the values we are plugging into our equation for drivers and income. Because of our log transformation, we will apply a log to these numbers.

The variable FuelPredict will then store our prediction for fuel consumption. Our equation will give us the log of our fuel consumption, so we will apply the function exp to undo the log to make interpretation easier.

DriversPredict<-log(1000000) 
IncomePredict<-log(30000)
FuelPredict<- 3.60979+1.03119*DriversPredict-0.43580*IncomePredict
exp(FuelPredict)
## [1] 636388.7

This means that our model predicts a state with 1,000,000 licensed drivers and $30,000 per capita income would have consumed 636,388,700 gallons of fuel in the year 2001.
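Rather than plugging the rounded coefficients in by hand, we could also get the point prediction with the predict command, which applies the log transformations in the model formula for us; a minimal sketch:

exp(predict(gasmodt3, data.frame(Drivers=1000000, Income=30000)))

Because predict uses the unrounded coefficients, this gives 636,393.6 thousand gallons, matching the fit column of the intervals in the next section and differing from the hand calculation above only by rounding.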

Prediction and Confidence Intervals

Let’s make prediction and confidence intervals. A prediction interval gives us a range, at a given level of confidence, for the fuel consumption of a single state with 1,000,000 licensed drivers and $30,000 per capita income. A confidence interval gives us a range, at a given level of confidence, for the average fuel consumption of states with 1,000,000 licensed drivers and $30,000 per capita income.

First, we will create a new data frame for the point we are interested in. To do this we will use the command data.frame with the arguments Drivers=1000000 and Income=30000. If we were interested in another point, we could simply replace the arguments after the command.

new.data<-data.frame(Drivers=1000000,Income=30000)
new.data
##   Drivers Income
## 1   1e+06  30000

Now we can use our new data frame to make some intervals. Both the prediction interval and the confidence interval use the command predict with three arguments: our model, the new data frame, and an interval argument naming the type of interval we want. The default confidence level is 95%. The only difference between the commands for the two kinds of intervals is the type of interval we specify.

Prediction Intervals

For a prediction interval we will specify the interval type, "predict". Remember that because of our log transformation, we will apply exp to our interval for interpretation.

predsl <- predict(gasmodt3, new.data, interval="predict") 
exp(predsl)
##        fit      lwr    upr
## 1 636393.6 493680.1 820363

The prediction interval estimates, with 95% confidence, that a state with 1,000,000 licensed drivers and $30,000 per capita income would consume between 493,680,100 and 820,363,000 gallons of fuel in a year.

Confidence Intervals

For a confidence interval we will specify the interval type, "confidence". Again, we will apply exp to our interval.

confsl <- predict(gasmodt3, new.data, interval="confidence")
exp(confsl)
##        fit      lwr      upr
## 1 636393.6 604964.9 669455.1

We’re 95% confident that the mean amount of fuel consumed in a year by states with 1,000,000 licensed drivers and $30,000 per capita income will fall in the interval from 604,964,900 to 669,455,100 gallons.

Confidence Intervals for Regression Coefficients

If we want to create a confidence interval for our regression coefficients, we can use the command confint. Its arguments are a fitted model and the level of confidence we want for the interval. For illustration we will apply it to gasmodt, the earlier model with a log-transformed response and untransformed predictors; the same command works for any of the models we have fit.

confint(gasmodt,level=.90)
##                       5 %         95 %
## (Intercept)  1.296793e+01 1.481946e+01
## Drivers      1.838706e-07 2.577410e-07
## Income      -4.916113e-05 1.706804e-05

Reading this output, we are 90% confident that the intercept of this model is between about 12.97 and 14.82, that the coefficient on Drivers is between about 1.84e-07 and 2.58e-07, and that the coefficient on Income is between about -4.92e-05 and 1.71e-05. Notice that the interval for Income contains zero, which agrees with our earlier finding that Income is not a significant predictor once Drivers is in the model. As before, we do not try to interpret the intercept because we have no data points near zero.

Interaction Term

Sometimes multiple linear regression models will include an interaction term. An interaction term is a way to model if the relationship between our response and one of our predictor variables depends on another predictor. We will do a t-test on our interaction term to see if it is significant.
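In equation form, the interaction model we are about to fit looks like this (in R, the formula Drivers*Income expands to both main effects plus their product):

\[\widehat{FuelC}= \hat{\beta_0}+\hat{\beta_1}Drivers+\hat{\beta_2}Income+\hat{\beta_3}(Drivers \times Income)\]

The slope on Drivers is effectively \(\hat{\beta_1}+\hat{\beta_3}Income\), so it changes as Income changes; that is the sense in which the relationship between fuel consumption and Drivers can depend on Income.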

\(H_o\) : There is not a significant interaction between the 2 predictor variables.

\(H_a\) : There is a significant interaction between the 2 predictor variables.

gasmod2<-lm(FuelC~Drivers*Income)
summary(gasmod2)
## 
## Call:
## lm(formula = FuelC ~ Drivers * Income)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1532767  -111468     3182   117182  1368979 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -4.549e+05  6.011e+05  -0.757   0.4529    
## Drivers         1.136e+00  1.695e-01   6.699 2.34e-08 ***
## Income          1.472e+01  2.080e+01   0.708   0.4826    
## Drivers:Income -1.513e-05  5.526e-06  -2.738   0.0087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 430500 on 47 degrees of freedom
## Multiple R-squared:  0.9761, Adjusted R-squared:  0.9745 
## F-statistic: 638.7 on 3 and 47 DF,  p-value: < 2.2e-16

Because our interaction term (Drivers:Income) has a significant p-value we will reject the null hypothesis. We can say that the relationship between fuel consumption and the number of licensed drivers significantly depends on income per capita.

Backward Elimination

There are many ways to build a model. Three popular methods include forward selection, backward elimination and all subsets regression. We will look at backward elimination.

When we do backward elimination, we start with all of our predictor variables in the model. Then we repeatedly remove the least significant predictor until every predictor that remains is significant given the others.

We will be using t-tests to determine if our predictors are significant, with an alpha level of .05. If the p-value is less than alpha then the predictor is significant. The least significant predictor will be the one with the highest p-value.

First, let’s build a linear model with all of the predictors in it. The summary command runs the t-tests for us. Under the columns labeled t value and Pr(>|t|), we can find the resulting test statistics and p-values.

mod1<-lm(FuelC~Drivers+Income+Miles+MPC+Pop+Tax)
summary(mod1)
## 
## Call:
## lm(formula = FuelC ~ Drivers + Income + Miles + MPC + Pop + Tax)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1480910  -158802    19267   174208  1090089 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.902e+05  8.199e+05  -0.598 0.552983    
## Drivers      6.368e-01  1.452e-01   4.386 7.09e-05 ***
## Income       7.690e+00  1.632e+01   0.471 0.639793    
## Miles        5.850e+00  1.621e+00   3.608 0.000784 ***
## MPC          4.562e+01  3.565e+01   1.280 0.207337    
## Pop         -1.945e-02  1.245e-01  -0.156 0.876586    
## Tax         -2.087e+04  1.324e+04  -1.576 0.122235    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 398400 on 44 degrees of freedom
## Multiple R-squared:  0.9808, Adjusted R-squared:  0.9782 
## F-statistic: 374.6 on 6 and 44 DF,  p-value: < 2.2e-16

When we look at the t-tests on all of the predictors, we see that a couple of predictors are very significant. However, some predictors have p-values much higher than our alpha level of .05. We will start by getting rid of the variable that is currently least significant, population (Pop).

mod2<-lm(FuelC~Drivers+Income+Miles+MPC+Tax)
summary(mod2)
## 
## Call:
## lm(formula = FuelC ~ Drivers + Income + Miles + MPC + Tax)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1517423  -161111    19930   173532  1085471 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.844e+05  8.102e+05  -0.598 0.552903    
## Drivers      6.144e-01  2.229e-02  27.560  < 2e-16 ***
## Income       7.526e+00  1.611e+01   0.467 0.642587    
## Miles        5.813e+00  1.587e+00   3.664 0.000652 ***
## MPC          4.643e+01  3.488e+01   1.331 0.189820    
## Tax         -2.114e+04  1.298e+04  -1.629 0.110298    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 394100 on 45 degrees of freedom
## Multiple R-squared:  0.9808, Adjusted R-squared:  0.9787 
## F-statistic: 459.5 on 5 and 45 DF,  p-value: < 2.2e-16

After doing this, we still have variables that are not significant. We will next take away income, because it now has the highest p-value.

mod3<-lm(FuelC~Drivers+Miles+MPC+Tax)
summary(mod3)
## 
## Call:
## lm(formula = FuelC ~ Drivers + Miles + MPC + Tax)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1511238  -167913    12604   166767  1104257 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.788e+05  4.742e+05  -0.377 0.707764    
## Drivers      6.175e-01  2.110e-02  29.273  < 2e-16 ***
## Miles        5.582e+00  1.495e+00   3.735 0.000517 ***
## MPC          3.903e+01  3.081e+01   1.267 0.211586    
## Tax         -2.155e+04  1.284e+04  -1.678 0.100061    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 390700 on 46 degrees of freedom
## Multiple R-squared:  0.9807, Adjusted R-squared:  0.979 
## F-statistic: 584.3 on 4 and 46 DF,  p-value: < 2.2e-16

We will next take out miles driven per capita (MPC), because it now has the highest p-value.

mod4<-lm(FuelC~Drivers+Miles+Tax)
summary(mod4)
## 
## Call:
## lm(formula = FuelC ~ Drivers + Miles + Tax)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1578697  -176469    37821   160777  1070883 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.133e+05  2.736e+05   1.145    0.258    
## Drivers      6.056e-01  1.900e-02  31.877  < 2e-16 ***
## Miles        6.227e+00  1.414e+00   4.403 6.13e-05 ***
## Tax         -2.599e+04  1.243e+04  -2.091    0.042 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 393200 on 47 degrees of freedom
## Multiple R-squared:   0.98,  Adjusted R-squared:  0.9787 
## F-statistic: 768.6 on 3 and 47 DF,  p-value: < 2.2e-16

We can see that after taking out MPC, all of our variables are significant at an alpha level of .05, so we will use this as our model. The variables in our model are the number of licensed drivers, the number of federal-aid highway miles, and the gasoline tax rate in each state.
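As a side note, R can automate this kind of elimination with the step command, which drops terms based on AIC (discussed below) rather than the t-test p-values we used here, so it will not necessarily stop at exactly the same model. A minimal sketch, starting from the full model mod1:

step(mod1, direction="backward")   # prints each elimination step and returns the chosen model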

Comparing Models

There are several ways to compare models. We want a model that fits well (small residuals) without becoming more complicated than it needs to be.

Comparing Models Using a T-test

One way to compare models is using a t-test, which shows us if we can drop one predictor. We compare the p-values from the t-test of nested models to see if additional variables are worth adding. If the p-value of the added variable is less than our alpha level of .05, then it is worth keeping in the model. If its p-value is greater than alpha, then we can take it out and use the simpler model.

Let’s use a t-test to determine if we should include the predictor, miles, in our model.

mod1<-lm(FuelC~Drivers)
summary(mod1)
## 
## Call:
## lm(formula = FuelC ~ Drivers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1842399  -109834   -33255   154576  1906386 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.588e+04  9.053e+04   0.507    0.615    
## Drivers     6.657e-01  1.662e-02  40.067   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 468900 on 49 degrees of freedom
## Multiple R-squared:  0.9704, Adjusted R-squared:  0.9698 
## F-statistic:  1605 on 1 and 49 DF,  p-value: < 2.2e-16

We see that drivers is a significant variable from the first t-test we do.

mod2<-lm(FuelC~Drivers+Miles)
summary(mod2)
## 
## Call:
## lm(formula = FuelC ~ Drivers + Miles)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1667584  -207441    63143   156912  1055581 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.207e+05  1.016e+05  -2.173 0.034783 *  
## Drivers      6.121e-01  1.938e-02  31.578  < 2e-16 ***
## Miles        6.041e+00  1.460e+00   4.137 0.000141 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 406800 on 48 degrees of freedom
## Multiple R-squared:  0.9782, Adjusted R-squared:  0.9773 
## F-statistic:  1075 on 2 and 48 DF,  p-value: < 2.2e-16

When we check if miles is significant when added to the model, we see that it has a p-value less than alpha, so we can conclude that it is worth keeping in the model.
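The same comparison of two nested models can also be run with the anova command. When only one predictor is added, this partial F-test is equivalent to the t-test above (the F statistic is the square of the t statistic), so it gives the same p-value for Miles. A minimal sketch:

anova(mod1, mod2)   # nested-model F-test for adding Miles to the Drivers-only model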

Let’s try another example. This time we will check if it is worth including the variable income.

mod1<-lm(FuelC~Drivers)
summary(mod1)
## 
## Call:
## lm(formula = FuelC ~ Drivers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1842399  -109834   -33255   154576  1906386 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.588e+04  9.053e+04   0.507    0.615    
## Drivers     6.657e-01  1.662e-02  40.067   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 468900 on 49 degrees of freedom
## Multiple R-squared:  0.9704, Adjusted R-squared:  0.9698 
## F-statistic:  1605 on 1 and 49 DF,  p-value: < 2.2e-16

We can see that drivers is significant.

mod2<-lm(FuelC~Drivers + Income)
summary(mod2)
## 
## Call:
## lm(formula = FuelC ~ Drivers + Income)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1732521  -144319   -54704   148908  1820242 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.842e+05  4.214e+05   1.861   0.0689 .  
## Drivers      6.735e-01  1.681e-02  40.053   <2e-16 ***
## Income      -2.701e+01  1.508e+01  -1.792   0.0794 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 458700 on 48 degrees of freedom
## Multiple R-squared:  0.9722, Adjusted R-squared:  0.9711 
## F-statistic: 840.5 on 2 and 48 DF,  p-value: < 2.2e-16

We see that when we add income as a variable in addition to drivers, it is not significant at an alpha level of .05. This means that it is not worth adding into our model. The model with just drivers in it is better, because all the variables are significant and it is simpler.

Comparing Models Using AIC

AIC stands for the Akaike Information Criterion. It is a way to compare models that balances model fit against the number of predictors. AIC is only useful when comparing two or more models. We will compare the same models we just compared with the t-tests.
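For reference, one common way to write the criterion is

\[AIC = 2k - 2\ln(\hat{L})\]

where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood of the model; lower values indicate a better trade-off between fit and complexity.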

mod1<-lm(FuelC~Drivers)
mod2<-lm(FuelC~Drivers+Miles)
AIC(mod1)
## [1] 1480.629
AIC(mod2)
## [1] 1467.076

When we compare the AICs of two models, lower is better. If the simpler model has the lower AIC, we choose it. If the more complex model has the lower AIC, we want its AIC to be at least 10 lower than the simpler model’s before we decide the added complexity is worth it. In this case, the more complex model has an AIC about 13 lower than the simpler one, so we will use the model with both drivers and miles as the predictors. Next, let’s try it again with the other models we compared with the t-test.

mod1<-lm(FuelC~Drivers)
mod2<-lm(FuelC~Drivers + Income)
AIC(mod1)
## [1] 1480.629
AIC(mod2)
## [1] 1479.327

When we compare these models, we see that the more complex model only has an AIC about 1 lower than the simpler model. This is not a big enough drop to make it worth making the model more complex. We would choose the model with only drivers as the predictor in this case.