Introduction to Multiple Linear Regression

The focus of this R guide is learning about and building multiple linear regression models. We will use a variety of techniques in RStudio to build several different multiple linear regression models, and we will also focus on the interpretation and significance of the various components of our models. Key topics throughout this R guide include:

  • Techniques for model building

  • Methods for model comparison

  • How interaction terms work and when to use them

  • Building models with polynomial regression

  • Using transformations to produce the correct mean function for our model

Exploratory Data Analysis

Data Introduction

For this R guide, I will be using the built-in mtcars data set. This data comes from the datasets package in R, making it simple to access and call on during our analysis. The data comes from the 1974 edition of Motor Trend magazine and contains fuel consumption and other attributes for 32 different vehicles.

Before jumping into the analysis, it is important to familiarize ourselves with the data and see what we are going to be working with. After accessing the mtcars data set via the data() command, we use the names() and head() commands to look at the column names and the first few rows of data.

data(mtcars)
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Data Clean-Up

This data set has 11 different variables, as can be seen above. For the sake of simplicity, we are going to remove several of these columns before starting our model building. One reason for doing this is so we don’t end up with an overly complex model. Additionally, several of these variables, such as vs, am, and gear, take on only a small set of values and would be better served as potential indicator variables. This could be useful for us in future projects, but for now we are going to remove the following variables: vs, am, gear, drat, and cyl. We can get rid of these columns by assigning them the NULL value.

mtcars$vs <- NULL
mtcars$am <- NULL
mtcars$gear <- NULL
mtcars$drat <- NULL
mtcars$cyl <- NULL
names(mtcars)
## [1] "mpg"  "disp" "hp"   "wt"   "qsec" "carb"

Removing those variables leaves us with the following six variables for each vehicle:

  • mpg: Miles per gallon

  • disp: Displacement of vehicle measured in cubic inches

  • hp: Gross horsepower

  • wt: Weight measured in thousands of pounds

  • qsec: Quarter mile time

  • carb: Number of carburetors

Data Summary

These will be the variables that we will go forward with to build our multiple linear regression model. One last thing to check out is the summary of our data.

summary(mtcars)
##       mpg             disp             hp              wt       
##  Min.   :10.40   Min.   : 71.1   Min.   : 52.0   Min.   :1.513  
##  1st Qu.:15.43   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:2.581  
##  Median :19.20   Median :196.3   Median :123.0   Median :3.325  
##  Mean   :20.09   Mean   :230.7   Mean   :146.7   Mean   :3.217  
##  3rd Qu.:22.80   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.610  
##  Max.   :33.90   Max.   :472.0   Max.   :335.0   Max.   :5.424  
##       qsec            carb      
##  Min.   :14.50   Min.   :1.000  
##  1st Qu.:16.89   1st Qu.:2.000  
##  Median :17.71   Median :2.000  
##  Mean   :17.85   Mean   :2.812  
##  3rd Qu.:18.90   3rd Qu.:4.000  
##  Max.   :22.90   Max.   :8.000

The summary gives us a high-level view of our data, and provides a better idea as to the range of values for each of our variables.

Multicollinearity

Multicollinearity occurs when the predictors in a regression model are correlated with or dependent upon each other, and it is not a good thing to have. It is a problem because it inflates our standard errors, which shrinks our test statistics toward zero. The result is that we lose power to reject the null hypothesis of our t-test, meaning we are less likely to include \(\beta_k\) in our model. Another issue is that multicollinearity makes our estimators unstable. We would expect our \(\beta\) values to be similar from one data sample to another, but when the estimators become unstable they are less repeatable and tell a different story depending on the data set analyzed.

Scatterplot Matrix

We can check for multicollinearity in several different ways. The simplest and quickest way is to plot the predictor variables against one another to see if there is a linear trend. In models with more than two predictors, there is a shortcut so we don’t have to plot every combination of variables: we can again use the plot() command, passing it dataset[first column of predictors:last column of predictors] to produce a scatterplot matrix.

plot(mtcars[2:6])

This gives us an efficient way to visually check the relationships between our predictors, since all of the scatterplots appear in one visual. In the plot above, displacement/horsepower, weight/horsepower, and weight/displacement all show a positive relationship. However, this by itself does not confirm that there is multicollinearity. To determine whether there is evidence of multicollinearity, we have two additional methods to assess the correlation strength for these combinations of variables.

Correlation Value

One way that we can check the degree of correlation is by using the cor() command. When looking at the correlation between two variables, any value with |cor(x,y)| > 0.9 is concerning and would be cause to reevaluate the variables used in the model, as it is evidence of multicollinearity.

cor(mtcars$hp, mtcars$wt)
## [1] 0.6587479
cor(mtcars$hp, mtcars$disp)
## [1] 0.7909486
cor(mtcars$wt, mtcars$disp)
## [1] 0.8879799

We can see that the correlations between these combinations of variables are 0.66, 0.79, and 0.89, respectively. All values are less than 0.9, pointing towards there being no serious multicollinearity.
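
As a quick aside, we can also compute the full correlation matrix in a single call rather than checking each pair separately. This is just a convenience sketch using the same cor() command.

round(cor(mtcars[2:6]), 2)  # every pairwise correlation among the predictors, rounded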

Variance Inflation Factor (VIF)

One final step we can take to make sure there is no multicollinearity is to look at the Variance Inflation Factor (VIF). We do this by regressing each predictor variable on the remaining predictor variables. We can then take the \(R^2_j\) value from that regression and plug it into the VIF equation, \(VIF_j = 1/(1-R^2_j)\). Another option is to use the VIF() command, which comes from the fmsb package. Our argument for the VIF() command will be a linear regression model with the two variables we are checking for multicollinearity.

library(fmsb)
VIF(lm(hp ~ wt, data = mtcars))
## [1] 1.766625
VIF(lm(hp ~ disp, data = mtcars))
## [1] 2.670938
VIF(lm(wt ~ disp, data = mtcars))
## [1] 4.728319

When examining our variance inflation factors, a lower value is always better. We want the VIF to be as close to one as possible, and any VIF greater than 10 would be considered an issue and evidence of multicollinearity. With our VIFs being only 1.77, 2.67, and 4.73, we can safely conclude that there is no multicollinearity between our predictor variables.
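
As a hedged sketch, the textbook definition of VIF regresses each predictor on all of the remaining predictors rather than on one other predictor at a time. That version, still using fmsb's VIF() command, might look like the following; which predictors to check is our choice.

VIF(lm(hp ~ wt + disp + qsec + carb, data = mtcars))    # VIF for hp against all other predictors
VIF(lm(wt ~ hp + disp + qsec + carb, data = mtcars))    # VIF for wt
VIF(lm(disp ~ hp + wt + qsec + carb, data = mtcars))    # VIF for disp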

Model Building and Testing

The steps to create a multiple linear regression model in R are similar to the steps we took to create our simple linear regression model in Chapter 3. One difference in multiple linear regression is that we have to think more about how we build our model. Because we are going to have more than one predictor variable, we have to be conscious of which variables are the best to put into our model.

Model Testing

Before building a model, it is important to know about the different tests and criteria we can use when building a multiple linear regression model. These tests can then be used to determine whether our predictors are significant and if they should be kept in our model.

Although it may be tempting to plug in a bunch of predictors and simply go with whichever model has the highest \(R^2\) value, we don’t want to do that, as it doesn’t tell us anything about which of our predictors are most significant or whether any of them can be dropped. Instead we will use one of the following tests.

T-Test and F-Test

T-Test

The t-test in multiple linear regression is very similar to the t-test in simple linear regression covered in the previous R guide. In multiple linear regression we simply have more predictors to look at, but we still test each one the same way and look at the p-values provided in the summary. When conducting the t-test, we are testing the null hypothesis \(H_0: \beta_k = 0\) against the alternative hypothesis \(H_a: \beta_k \neq 0\) for each of our predictors. If a predictor is not significant, we remove it from the model.
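
As a small sketch (the model below is chosen purely for illustration), the t-test p-values can be pulled directly out of the model summary rather than read off the printed table:

example.mod <- lm(mpg ~ hp + wt + qsec, data = mtcars)   # illustrative model
summary(example.mod)$coefficients[, "Pr(>|t|)"]          # t-test p-value for each coefficient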

F-Test and Partial F-Test

Like the t-test, the F-test was also covered in the previous R guide. We use the F-test to ask, “Does Y have a linear relationship with any of the predictors?” The null hypothesis is \(H_0: {\beta_1}={\beta_2}=...={\beta_k}=0\), or in other words, none of the predictor variables is significantly related to the response variable. The alternative hypothesis is \(H_a:\) at least one of \({\beta_1},{\beta_2},...,{\beta_k}\) does not equal 0, meaning at least one of the predictor variables is significantly related to the response variable. While this is useful information, it doesn’t tell us whether any individual predictor is insignificant and can be dropped from the model. We could go back and use the t-test for each predictor, but another option is the partial F-test.

When using the partial F-test we create two different models: one complete model that has all of our predictors, and one reduced model that contains only a subset of the predictors (known as a nested model). We can then use the anova() command to test the two models against each other and determine whether we can get by with just the subset of predictors. If the p-value from our anova output is small, we reject the null hypothesis and conclude that at least one of the predictors dropped from the reduced model is significantly related to the response, so we should keep the complete model; otherwise, the reduced model is sufficient.
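
As a sketch of how this looks in R (the specific complete and reduced models here are chosen purely for illustration), we fit both models and hand them to anova():

complete.mod <- lm(mpg ~ hp + wt + qsec, data = mtcars)   # complete model
reduced.mod  <- lm(mpg ~ hp + wt, data = mtcars)          # nested reduced model
anova(reduced.mod, complete.mod)                          # small p-value favors the complete model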

Other Tests for Model Comparison

Though the t-test and partial F-test are great tools for model building, they do have some limitations. For example, one drawback of the partial F-test is that we can only compare a nested model to the complete model. We have other tests we can use to compare any two models, even if they are not nested.

AIC (Akaike Information Criterion)

AIC is a great tool we can use to compare two models and check whether one is better than the other. When using AIC as a testing criterion, we want the AIC value to be as small as possible, and when comparing models, we need to see the AIC value drop by at least 10 in order to justify keeping the predictive variable we are testing in the model. One important thing to note is that the actual AIC value we get is not important on its own; it is only useful in comparison to the AIC value of another model. A benefit of using AIC is that our models do not have to be nested, so we could compare y = x1 + x2 against y = x1 + x3. Additionally, there are commands such as stepAIC() (from the MASS package) that automate the process and can build the optimal model for us. This is a great tool for model building, especially if the models we’re comparing are not nested.
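
For example, a minimal sketch of automated backward selection by AIC with stepAIC() might look like this; the starting model and direction are our own choices for illustration.

library(MASS)                                    # provides stepAIC()
start.mod <- lm(mpg ~ ., data = mtcars)          # begin with all predictors
step.mod  <- stepAIC(start.mod, direction = "backward", trace = FALSE)
summary(step.mod)                                # the model stepAIC settles on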

Mallows’ Cp

Mallows’ Cp is one final test we can use to analyze models when building a multiple linear regression model. When using Mallows’ Cp, we want our test statistic to be small, and ideally we want it to be approximately equal to K + 1, where K is the number of predictors in our model. If that is the case, then the model we tested should be considered desirable.
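
Since base R does not print Mallows’ Cp for us directly, here is a minimal sketch of computing it by hand for one candidate model, using the full model's mean squared error to estimate \(\sigma^2\); the candidate model is just an example.

full.mod <- lm(mpg ~ ., data = mtcars)           # model with all predictors
cand.mod <- lm(mpg ~ hp + wt, data = mtcars)     # candidate subset model
n        <- nrow(mtcars)
mse.full <- summary(full.mod)$sigma^2            # estimate of sigma^2 from the full model
sse.cand <- sum(resid(cand.mod)^2)               # SSE of the candidate model
p        <- length(coef(cand.mod))               # parameters, including the intercept (K + 1)
cp       <- sse.cand / mse.full - (n - 2 * p)    # Mallows' Cp
cp                                               # want this close to K + 1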

Model Building Strategies

Now that we know about the different criteria and tests we can use to determine which predictors to include, we can build our model with any of the following strategies.

Forward Selection

With forward selection we start with a simple model and then add predictors, using any of the tests or criteria discussed above to determine whether each new predictor is a good fit. The simple model we begin with will often be a simple linear regression model with just one predictor. From there we test our model, and if we find that the new predictor is significant, we keep it in the model. We then continue adding predictors and testing until we have built an optimal model.
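
As a minimal sketch of a single forward-selection step, the built-in add1() command can test each candidate addition with a partial F-test; the starting predictor and candidate pool below are assumptions for illustration.

start.mod <- lm(mpg ~ wt, data = mtcars)             # simple starting model
add1(start.mod,                                      # try adding each candidate in turn
     scope = ~ wt + hp + qsec + disp + carb,         # pool of available predictors
     test = "F")                                     # partial F-test for each addition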

Backward Elimination

When using backward elimination to build our model, we begin with a complicated model that contains all of our predictors. Once we have our complex model, we remove the insignificant predictors, using any of the model tests discussed previously, until only the significant predictors remain. Though it would be convenient to remove all of the non-significant predictors from the complex model at once, we need to remove the least significant predictor first, then create a new model without that variable and test it again. Removing that predictor may alter the significance of the other variables, so best practice is to remove them step by step to ensure optimal model construction.

All Subsets

We can also use an all subsets method to select a model. This method of model creation does not produce nested models, so it is important that we only use AIC or Mallows’ Cp as our criteria. To do this, we create models for all subsets of the predictor variables and use the AIC() command to get their AIC values. Once we compare these AIC values, we can decide which model is best to use. We typically do this based on two criteria: the difference in AIC values and the number of predictors in each model. If one of the models has a substantially lower AIC value than all the others, we will use that as our final model. But if several models have similar AIC values, we will often choose the one with the fewest predictors, as this keeps our model simpler.
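
As a small sketch, AIC() can take several fitted models at once and return their AIC values side by side; the subsets below are illustrative, not an exhaustive search.

mod.a <- lm(mpg ~ hp + wt, data = mtcars)          # a few candidate (non-nested) subsets
mod.b <- lm(mpg ~ disp + qsec, data = mtcars)
mod.c <- lm(mpg ~ wt + qsec + carb, data = mtcars)
AIC(mod.a, mod.b, mod.c)                           # lower AIC is better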

The Multiple Linear Regression Model

Building the Model

Similar to simple linear regression, we will use the lm() command to create our model. However, this time our argument sequence will be lm(response ~ predictor1 + predictor2 + …). We can keep adding or removing predictors until we create the model we’re looking for. We also add data = mtcars as a final argument in the lm() command to indicate that our data and variable names come from the mtcars data set. We will build this model via backward elimination, using the t-test as our criterion for variable elimination. To begin, we will build a complex model containing all of the predictor variables in our data set, with miles per gallon (mpg) as the response variable. One handy trick we can take advantage of here is that if we use a period in place of listing our predictors, lm() will automatically include all of the other variables as predictors. After creating our model and assigning it to a variable (complex.mod), we call on the summary of the model to conduct our t-tests.

complex.mod <- lm(mpg ~ . , data = mtcars)
summary(complex.mod)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7078 -1.6474 -0.4117  1.1574  5.7845 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 25.451680  10.075899   2.526  0.01797 * 
## disp         0.007518   0.016800   0.448  0.65822   
## hp          -0.023743   0.020746  -1.144  0.26287   
## wt          -5.072530   1.772990  -2.861  0.00823 **
## qsec         0.669074   0.576980   1.160  0.25674   
## carb         0.271857   0.715775   0.380  0.70717   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.665 on 26 degrees of freedom
## Multiple R-squared:  0.8361, Adjusted R-squared:  0.8045 
## F-statistic: 26.52 on 5 and 26 DF,  p-value: 1.952e-09

The summary table for multiple linear regression models is very similar to the summary discussed in the Chapter 3 R guide on simple linear regression. Because we are using the t-test as our criterion for significance, we will focus on the Pr(>|t|) column on the far right of the coefficients section. This complex model is a great example of a model where it may be tempting to look at our p-values and remove all the predictors that are not significant, which in this case would be everything except weight. However, it is important to carry out this process step by step and not skip anything. Since we are using backward elimination as our model building strategy, we will eliminate the predictor that displays the least significance. For this model, that variable is the number of carburetors, as it has the highest p-value at 0.70717. We eliminate carb from the model and create a simplified model with the remaining predictors.

simplified.mod1 <- lm(mpg ~ disp + hp + wt + qsec , data = mtcars)
summary(simplified.mod1)
## 
## Call:
## lm(formula = mpg ~ disp + hp + wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8664 -1.5819 -0.3788  1.1712  5.6468 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 27.329638   8.639032   3.164  0.00383 **
## disp         0.002666   0.010738   0.248  0.80576   
## hp          -0.018666   0.015613  -1.196  0.24227   
## wt          -4.609123   1.265851  -3.641  0.00113 **
## qsec         0.544160   0.466493   1.166  0.25362   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.622 on 27 degrees of freedom
## Multiple R-squared:  0.8351, Adjusted R-squared:  0.8107 
## F-statistic: 34.19 on 4 and 27 DF,  p-value: 3.311e-10

One thing to notice in this first simplified model summary is that all of the p-values for our t-tests have been adjusted and are now different from those in the first model. This is why we can’t skip steps and eliminate all of the non-significant variables from the first model at once. Looking at this new model, we will again eliminate the least significant variable, which in this case is displacement. Therefore, the next model we create will contain all of the current predictors except for disp.

simplified.mod2 <- lm(mpg ~ hp + wt + qsec , data = mtcars)
summary(simplified.mod2)
## 
## Call:
## lm(formula = mpg ~ hp + wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8591 -1.6418 -0.4636  1.1940  5.6092 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.61053    8.41993   3.279  0.00278 ** 
## hp          -0.01782    0.01498  -1.190  0.24418    
## wt          -4.35880    0.75270  -5.791 3.22e-06 ***
## qsec         0.51083    0.43922   1.163  0.25463    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.578 on 28 degrees of freedom
## Multiple R-squared:  0.8348, Adjusted R-squared:  0.8171 
## F-statistic: 47.15 on 3 and 28 DF,  p-value: 4.506e-11

Again our p-values have been adjusted since our model has changed. For this model, the p-values for horsepower and quarter mile time are the two highest and are relatively close in value. Because the p-value for qsec is slightly higher than that for hp, we will eliminate qsec and proceed with just horsepower and weight.

simplified.mod3 <- lm(mpg ~ hp + wt , data = mtcars)
summary(simplified.mod3)
## 
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Finally, we reach a model where all of our predictors are significant. After eliminating quarter mile time from the regression model, the p-value for horsepower dropped all the way to 0.00145, which is less than our significance level of 0.05. Because all of our predictor variables now display significance, we determine that this is the optimal model when building via backward elimination. It is possible that if we were to build using forward selection, we would have ended up with a different model. Because this is the final model for this section, we will rename it best.mod.1 to use going forward.

best.mod.1 <- simplified.mod3

Writing and Interpreting the Regression Equation

Now that we have finished building our model through backward elimination, we can use the coefficient values to write our multiple linear regression equation. The general multiple linear regression equation is: \[\hat{y}= \hat{\beta_0}+\hat{\beta_1} x_1+\hat{\beta_2} x_2\] To get our \(\hat{\beta}\) values, we can either go back to the summary for our model, or we can call on the model to pull them up again.

best.mod.1
## 
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           hp           wt  
##    37.22727     -0.03177     -3.87783

The intercept coefficient will be the \(\hat{\beta_0}\) value similar to simple linear regression, and the hp and wt coefficients will be the \(\hat{\beta_1}\) and \(\hat{\beta_2}\) values. Inputting the values will thus yield the multiple linear regression equation: \[\hat{mpg_i}= 37.22727-.03177 (horsepower_i) -3.87783 (weight_i)\]

Similar to the simple linear regression model created in the Chapter 3 R guide, it is important to interpret the regression equation in order to understand what it tells us about the relationships among our variables. We cannot interpret the intercept value, as it is not realistically possible for our other variables to equal zero; no car can physically weigh zero pounds or have zero horsepower.

When interpreting the weight and horsepower coefficients, it is important to understand that when we interpret one coefficient, all other variables are being held constant. For example, the weight coefficient of -3.87783 indicates that for every 1000-pound increase in car weight, the miles per gallon achieved by that car will decrease by 3.87783, assuming horsepower is held constant. Similarly, the -0.03177 coefficient for horsepower indicates that increasing the horsepower by one will lower the miles per gallon by 0.03177, assuming the weight of the car is held constant.

Confidence Intervals for Coefficients

Because the coefficients of our model are least squares estimates, it is helpful to create confidence intervals to get a better idea of the plausible range for our regression coefficients. We can do this using the confint() command. Its arguments are the model name and the level at which we want our confidence intervals. For this model, we will use a level of 0.95, meaning that we are 95% confident that each interval captures the true regression coefficient.

confint(best.mod.1, level = .95)
##                   2.5 %      97.5 %
## (Intercept) 33.95738245 40.49715778
## hp          -0.05024078 -0.01330512
## wt          -5.17191604 -2.58374544

Because we were not able to interpret the intercept of the equation before, we will not interpret the confidence interval for the intercept here either. We can interpret the horsepower interval as meaning that we are 95% confident that for every one-unit increase in horsepower, the miles per gallon will decrease by between 0.01 and 0.05 on average. Similarly for weight, we are 95% confident that for every one-thousand-pound increase in weight, the miles per gallon of that car will decrease by between 2.58 and 5.17 on average.

Checking the Model Assumptions

Similar to simple linear regression, it is necessary to check our model assumptions before using the model to make predictions. If any of our assumptions are violated, we will have to tweak or redo our model, as it could be providing us with inconsistent or inaccurate results. The three main assumptions we will check are: normality of errors, constant variance, and a correct mean function.

Normality Assumption

Similar to simple linear regression, one of our assumptions is that the population of potential error terms comes from a normal distribution. We can check this assumption using a qqnorm plot and a qqline.

qqnorm(best.mod.1$residuals)
qqline(best.mod.1$residuals)

In this QQ plot we want the points to display a linear trend that matches the qqline plotted with them. Although most of the points display the linear trend we’re looking for, there are three points in the upper right corner that clearly deviate from the line, suggesting a departure from normality. They are a red flag, and something we would want to correct if we are going to use this model.

Constant Variance (Homoscedasticity)

Another assumption is that there is constant variance for all possible values of x1 and x2. We can check this assumption by plotting the residual values of the model against the fitted values.

plot(best.mod.1$residuals ~ best.mod.1$fitted.values,
     xlab = "Fitted Values",
     ylab = "Residual Values")
abline(0,0)

When assessing the constant variance assumption, we want to see the residual points form a horizontal band with approximately the same vertical spread across all of our fitted values. This is not the case in this residual plot, as there appears to be much greater vertical spread on the right side of the plot. This indicates that our variance is not constant, and that the spread of the residuals is greater for larger values of horsepower and weight. This violates our assumption of constant variance, and is an issue we have to address if we wish to use this model to make statistical inferences.

Correct Mean Function

One final assumption we make is that our model has the correct mean function. Essentially, this means we do not want to see any patterns or trends in our residual plot. Finding a violation of this assumption suggests that there is a more appropriate form for the model. We assess this assumption using the same plot as we did for our constant variance assumption.

plot(best.mod.1$residuals ~ best.mod.1$fitted.values,
     xlab = "Fitted Values",
     ylab = "Residual Values")
abline(0,0)

In this plot we do have an issue with our mean function. Based on the plot, we are overpredicting for the higher and lower fitted values, but consistently underpredicting for the middle values. This violates our assumption of a correct mean function and is an issue we have to address.

Due to the clear and discernible trend in our plot, as well as the red flags in our normality and constant variance checks, it would not be advisable to use this model for any sort of prediction or statistical inference. Fortunately, we have a way to correct these issues and improve our model: applying transformations to our variables.

Variable Transformations

Transformation Introduction

We cannot use our model in any practical way if our model assumptions are violated or if our mean function is not correct. Because of this, using transformations to correct these issues is one of the most important topics when learning multiple linear regression. There are several main transformations that can be particularly helpful. They are:

  • Square root transformation \((x^{1/2})\)

  • Power transformation \((x^{1/k})\)

  • Log transformation \((\log(x))\)

  • Inverse transformation \((1/x)\)

The goal when using these transformations is to improve the diagnostic plots and produce a linear trend. This will often help to correct our assumption violations as well. There is no clear-cut rule as to when to use each transformation, so it is best to try each one, as well as different combinations of transformations, and then check the residual plots to decide which is best.
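
As a sketch of that trial-and-error process (the candidate transformations below are arbitrary examples, not recommendations), we can fit several transformed versions of the current model and put their residual-vs-fitted plots side by side:

cand1 <- lm(mpg ~ sqrt(hp) + sqrt(wt), data = mtcars)    # square root of the predictors
cand2 <- lm(log(mpg) ~ log(hp) + log(wt), data = mtcars) # log of response and predictors
cand3 <- lm(I(1/mpg) ~ hp + wt, data = mtcars)           # inverse of the response
par(mfrow = c(1, 3))
plot(cand1, which = 1)   # which = 1 draws only the residuals-vs-fitted plot
plot(cand2, which = 1)
plot(cand3, which = 1)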

Using Transformations

As a reminder, our current linear model is:

best.mod.1
## 
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           hp           wt  
##    37.22727     -0.03177     -3.87783

In the residual plots above, our points seemed to take on a parabolic shape, with overpredictions at the extreme high and low values and underpredictions in the middle. Because of this shape, my first instinct is to try a square root transformation on our predictor variables. To do this, we create a new linear model that is the same as our current one, but with horsepower and weight square rooted.

tran.mod.1 <- lm(mpg ~ sqrt(hp) + sqrt(wt) , data = mtcars)

After creating this model, we will use the plot() command, with the model name as the argument to pull up a series of four diagnostic plots, which are:

  1. Residual vs. Fitted: This plot helps us decide if the mean function of our model is appropriate. Therefore it is the most important plot to look at. When looking at this plot, we want to ask questions such as: Is there a trend? Is there heteroscedasticity? Do we have outliers? Are there points for which the model is not appropriate?

  2. QQ-Norm: This is the same as the QQ-norm plot we use when checking our normality assumption, and it is interpreted the same way.

  3. Scale-Location: This plot shows the square root of the absolute standardized residuals against the fitted values; the square root reduces skewness and makes it easier to see trends in the spread of the residuals.

  4. Cook’s Distance (Leverage): This plot measures each data point’s influence on the regression coefficients and is helpful in identifying potential outliers. Points falling outside the red Cook’s distance lines are influencing the model too much and can be considered potential outliers.

These plots will help us to assess the improvements from our transformation (or lack thereof).

par(mfrow = c(2,2))
plot(tran.mod.1)

For now, the plot we will focus on is the first residual plot, as our main goal is to make sure our model has the correct mean function. Looking at the residual plot from our first transformation, we can still see a parabolic trend, which is emphasized by the red trend line plotted on top of the points.

The next step we can try is to take the square root of miles per gallon as well, in the hope of flattening out the trend line.

tran.mod.2 <- lm(sqrt(mpg) ~ sqrt(hp) + sqrt(wt) , data = mtcars)
par(mfrow = c(2,2))
plot(tran.mod.2)

We can see the residual vs fitted plot now starting to flatten out and resemble the flat line we are looking for. But there is still room for improvement, so we will try taking the log of both of our predictor variables, rather than square root.

tran.mod.3 <- lm(sqrt(mpg) ~ log(hp) + log(wt) , data = mtcars)
par(mfrow = c(2,2))
plot(tran.mod.3)

Now that is much more like it. Our trend line is almost perfectly flat now, which indicates we have corrected our mean function and eliminated any trends in over- or underprediction. Looking at the QQ-norm plot, we can also see that the three points we identified before as a cause for concern have now come back in line with the normal distribution. Finally, we can see in the leverage plot that none of our points fall outside the lines on the Cook’s distance plot, meaning no points are influencing the model more than they should.

Because we have transformed each of the variables, we need to recheck our model summary, as these transformations will have affected our regression coefficient values.

summary(tran.mod.3)
## 
## Call:
## lm(formula = sqrt(mpg) ~ log(hp) + log(wt), data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32172 -0.16754 -0.02988  0.14624  0.44294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.6415     0.4743  18.219  < 2e-16 ***
## log(hp)      -0.5744     0.1234  -4.653 6.66e-05 ***
## log(wt)      -1.2505     0.1848  -6.768 1.99e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2227 on 29 degrees of freedom
## Multiple R-squared:  0.8938, Adjusted R-squared:  0.8865 
## F-statistic: 122.1 on 2 and 29 DF,  p-value: 7.544e-15

One issue that arises from variable transformations is that they can sometimes come at the expense of interpretability. For example, we have now transformed our multiple regression model from: \[\hat{mpg_i}= 37.22727-.03177 (horsepower_i) -3.87783 (weight_i)\] to \[\sqrt{\hat{mpg_i}}= 8.6415-.5744 \log{(horsepower_i)} -1.2505 \log{(weight_i)}\] However, this was a necessary trade-off, as our previous model held little practical value due to its assumption violations. Having corrected those violations, we can now use this model for more practical purposes, such as point predictions and confidence/prediction intervals.

Using the Model

Point Prediction

One practical use of our regression model is to make point predictions. For example, say you wanted to buy the hottest new car of 1975, an Oldsmobile Vista Cruiser, and the specs say it has a 170-horsepower engine and weighs 4500 pounds. Based on these specifications, you want an idea of what fuel efficiency you could expect from that car. We can plug 170 and 4.5 into our regression equation and predict the miles per gallon we would expect to get. Remember that our model predicts the square root of miles per gallon, so we will have to square our answer to get the actual point prediction.

mpg <- 8.6415 - .5744*log(170) - 1.2505*log(4.5)
mpg^2
## [1] 14.52104

After our calculations, we predict that the 1975 Vista Cruiser is only going to get 14.5 miles per gallon. Perhaps based on this you will decide to buy the more practical and fuel efficient Plymouth Valiant.

Intervals

Rather than just using the model to create a point prediction, it may be more helpful to use confidence and prediction intervals, which provide a range of values that accounts for some of the variability in our data. To construct these intervals, we first have to create a new data frame with our values for horsepower and weight. We do this using the data.frame() command.

newdata.mtcars <- data.frame(hp = 170, wt = 4.5)

We can then use this new data point to create our confidence and prediction intervals.

Prediction Interval

We create a prediction interval for the miles per gallon of an individual car using the exact same command and arguments that we used in the previous R guide. Again we have to remember that our regression equation has the square root of miles per gallon as the response, so we will have to square our prediction interval.

pred <-predict(tran.mod.3, newdata.mtcars, interval="predict")
pred^2
##        fit      lwr      upr
## 1 14.52237 11.12895 18.36667

The output from this prediction interval is a point estimate of 14.52 and a prediction interval of [11.13, 18.37]. Because we did not specify a confidence level, R defaults to a 95% level. We can interpret this interval as meaning we are 95% confident that a car that has 170 horsepower and weighs 4500 pounds will get between 11.13 and 18.37 miles per gallon.

Confidence Interval

We can also create a confidence interval for the mean miles per gallon that a 170 hp, 4500 lb car would get. We use the same command as for our prediction interval, just specifying that it is a confidence interval rather than a prediction interval.

conf <-predict(tran.mod.3, newdata.mtcars, interval="confidence")
conf^2
##        fit      lwr      upr
## 1 14.52237 13.51994 15.56064

For our confidence interval we get the same point estimate and an interval of [13.52, 15.56]. This means that we are 95% confident that the mean miles per gallon of a 170 hp, 4500 lb car is between 13.52 and 15.56. One key point to note is that our confidence interval is narrower than the prediction interval. This is because there is more variability in predicting a single car’s miles per gallon than in estimating the mean miles per gallon for all 170 hp, 4500 lb cars.