For this assignment we were given a data set describing a collection of cars and asked to explore the relationship between a set of variables and miles per gallon (MPG), the outcome. We were asked to address two questions: is an automatic or manual transmission better for MPG, and how can the relationship between transmission type (and other variables) and MPG be quantified?
Using linear regression we found that manual transmissions provide greater miles per gallon than automatic transmissions. As a predictor of MPG, however, transmission type by itself is weak. Using a stepwise linear regression process we found that a model using weight (wt), quarter mile time (qsec) and transmission type (am) provides a much better prediction of MPG. We also learned that limiting the stepwise process to the variables with the greatest correlation to MPG did not necessarily yield a better model.
setwd("/Users/mitchellfawcett/Documents/RProjects/RegressionModels")
## Load Motor Trend car data
data(mtcars)
The data consists of observations for 32 cars with 11 variables.
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (lb/1000)
[, 7] qsec 1/4 mile time
[, 8] vs V/S (“v” or straight cylinder arrangement)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
The correlation coefficient r for each variable with respect to MPG is as follows:
round(cor(mtcars$mpg, mtcars), 3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## [1,] 1 -0.852 -0.848 -0.776 0.681 -0.868 0.419 0.664 0.6 0.48 -0.551
The variable am has a correlation coefficient with respect to MPG of 0.60. Six other variables have a greater absolute correlation: cyl, disp, hp, drat, wt, and vs. Three variables have an absolute correlation with MPG of less than 0.60: qsec, gear and carb.
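As a rough sketch, the ranking described above can be reproduced by sorting the absolute correlations. The object names `r` and `ranked` are ours and do not appear in the original analysis.

```r
data(mtcars)

# Correlation of mpg with every other column (1 x 10 matrix -> named vector)
r <- cor(mtcars$mpg, mtcars[, names(mtcars) != "mpg"])[1, ]

# Rank predictors by absolute correlation, strongest first
ranked <- sort(abs(r), decreasing = TRUE)
round(ranked, 3)

# Predictors at least as strongly correlated with mpg as am is (am included)
strong <- names(ranked)[ranked >= abs(r["am"])]
strong
```

Sorting on the absolute value keeps strongly negative predictors such as wt and cyl at the top of the list instead of the bottom.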
The variable of primary interest is transmission type (column 9, “am”), which is either automatic or manual. The am variable is stored as a number but takes only two values: 0 for an automatic transmission and 1 for a manual transmission.
A Welch two-sample t-test shows that the difference in mean MPG (17.147 for automatic and 24.392 for manual) is significant at the 5% level (p-value 0.001374); the 95% confidence interval for the difference excludes zero.
mpg.a <- mtcars[mtcars$am == 0, "mpg"]
mpg.m <- mtcars[mtcars$am == 1, "mpg"]
t.test(mpg.a, mpg.m)
##
## Welch Two Sample t-test
##
## data: mpg.a and mpg.m
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
The answer to the first question (is an automatic or manual transmission better for MPG?) is manual transmission, when only transmission type is taken into account.
Next we turn to quantifying the relationship between MPG and transmission type, together with the other variables, using regression modeling.
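The same comparison can also be written with the formula interface of t.test(), which avoids building the two subsets by hand. This is only an alternative sketch; the name `tt` is ours.

```r
data(mtcars)

# Welch two-sample t-test; am = 0 is automatic, am = 1 is manual
tt <- t.test(mpg ~ am, data = mtcars)

tt$estimate                              # group means
mean_diff <- unname(diff(tt$estimate))   # manual minus automatic
mean_diff
tt$conf.int                              # 95% CI for (automatic - manual)
```

Because the interval is reported for automatic minus manual, both of its limits are negative when manual cars have the higher mean.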
Model 1: A simple linear regression model with MPG as the outcome and am as the predictor follows:
mtcars.lm <- lm(mpg ~ am, data = mtcars)
summary(mtcars.lm)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Consistent with the t-test, the mean MPG for automatic transmissions is given by the estimated intercept, 17.147. The estimated coefficient of am is 7.245, meaning that going from an automatic to a manual transmission is associated with an average increase of 7.245 MPG.
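With a single 0/1 predictor, the regression reproduces the two group means. A minimal check of that reading of the coefficients, using helper names of our own (`b`, `mean_auto`, `mean_manual`):

```r
data(mtcars)

fit <- lm(mpg ~ am, data = mtcars)
b <- unname(coef(fit))   # b[1] = intercept, b[2] = am coefficient

mean_auto   <- mean(mtcars$mpg[mtcars$am == 0])
mean_manual <- mean(mtcars$mpg[mtcars$am == 1])

# Intercept is the automatic mean; intercept + slope is the manual mean
auto_ok   <- isTRUE(all.equal(b[1], mean_auto))
manual_ok <- isTRUE(all.equal(b[1] + b[2], mean_manual))
c(automatic = auto_ok, manual = manual_ok)
```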
The low p-value (0.000285) of the model gives us confidence that the slope of the relationship between transmission type and MPG is non-zero and confidence that transmission type is predictive of miles per gallon.
However, the adjusted R-squared of the single-predictor model is 0.3385 (multiple R-squared 0.3598), meaning only about one third of the variability in the MPG outcomes can be accounted for by the model. The Residual Standard Error is 4.902. The low R-squared indicates there is room for improvement in the model. A model consisting of one binary variable predicting a continuous outcome also seems intuitively too simplistic.
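The two R-squared figures in the summary are related by a simple penalty for the number of predictors. The sketch below recomputes the adjusted value by hand; `n`, `p` and `adj_by_hand` are our own names.

```r
data(mtcars)

fit <- lm(mpg ~ am, data = mtcars)
s <- summary(fit)

n <- nrow(mtcars)   # 32 observations
p <- 1              # one predictor (am)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

round(c(r.squared = s$r.squared,
        adjusted  = s$adj.r.squared,
        by_hand   = adj_by_hand), 4)
```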
Model 2: R’s step() function takes an initial linear model and, in backward mode, repeatedly removes the variable whose removal most improves the model's AIC, stopping when no removal helps. We start with a formula consisting of the seven variables whose absolute correlation with MPG is equal to or greater than 0.60 (the am correlation coefficient).
mtcars.lm2 <- lm(mpg ~ am + cyl + disp + hp + drat + wt + vs, data = mtcars)
summary(step(mtcars.lm2, direction = c("backward"), trace = 0))
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## cyl -0.94162 0.55092 -1.709 0.098480 .
## hp -0.01804 0.01188 -1.519 0.140015
## wt -3.16697 0.74058 -4.276 0.000199 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
This results in a model having cyl, hp and wt as the predictor variables. The adjusted R-squared for this model is 0.8263, meaning that approximately 83% of the variability of the response variable (mpg) around its mean can be accounted for by the model. The model has a Residual Standard Error of 2.512. This is a big improvement over the single-predictor model. Of the three predictors chosen for the model, only weight (wt) has a p-value (0.000199) small enough to be considered highly significant. The other two predictors, horsepower (hp) and number of cylinders (cyl), have p-values of 0.140 and 0.098 respectively, so neither is significant at the 5% level.
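step() keeps predictors that are not individually significant because, by default, it selects on AIC and not on p-values. A small sketch comparing the AIC of the starting and selected models follows; the names `start` and `selected` are ours.

```r
data(mtcars)

start    <- lm(mpg ~ am + cyl + disp + hp + drat + wt + vs, data = mtcars)
selected <- step(start, direction = "backward", trace = 0)

# extractAIC() returns c(equivalent degrees of freedom, AIC);
# step() tries to minimise the second value
aic_start    <- extractAIC(start)[2]
aic_selected <- extractAIC(selected)[2]
c(start = aic_start, selected = aic_selected)

kept <- attr(terms(selected), "term.labels")
kept
```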
Model 3: The am variable was one of the variables removed from the model by the stepwise process. To see whether adding it back would significantly improve the model's predictive power, we can use anova() to compare the stepwise model with and without the am variable.
mtcars.lm3 <- lm(formula = mpg ~ cyl + hp + wt, data = mtcars)
mtcars.lm4 <- update(mtcars.lm3, formula = mpg ~ cyl + hp + wt + am, data = mtcars)
anova(mtcars.lm3, mtcars.lm4)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + hp + wt
## Model 2: mpg ~ cyl + hp + wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 176.62
## 2 27 170.00 1 6.6228 1.0519 0.3142
The p-value 0.3142 using the anova() function indicates that reintroducing am into Model 2 found by the stepwise process using high r value variables does not provide enough improvement to the model to make it worth adding.
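The F statistic in the ANOVA table can be recomputed from the two residual sums of squares, which makes clear what is being tested. This is a sketch with our own object names (`reduced`, `full`, `f_stat`, `p_value`).

```r
data(mtcars)

reduced <- lm(mpg ~ cyl + hp + wt, data = mtcars)
full    <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)

rss_reduced <- deviance(reduced)   # residual sum of squares without am
rss_full    <- deviance(full)      # residual sum of squares with am
df_full     <- df.residual(full)                # 27
df_diff     <- df.residual(reduced) - df_full   # 1 extra parameter

# Drop in RSS per added parameter, relative to the full model's error variance
f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)
round(c(F = f_stat, p = p_value), 4)
```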
Model 4: We next used the step() function with a formula consisting of all the variables in the data set instead of limiting ourselves to the higher correlated variables.
mtcars.lm5 <- lm(mpg ~ . , data = mtcars)
summary(step(mtcars.lm5, direction = c("backward"), trace = 0))
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
Somewhat counter-intuitively, the step() function found a model with a higher adjusted R-squared and a lower Residual Standard Error when we included variables that had a lower correlation with MPG in the starting formula.
Model 4, consisting of weight (wt) (p-value 6.95e-06), quarter mile time (qsec) (p-value 0.000216) and transmission type (am) (p-value 0.046716), is the best model we found based on RSE (2.459) and adjusted R-squared (0.8336).
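To make the comparison easier to read, the adjusted R-squared and Residual Standard Error of the three fitted formulas can be collected in one table. The list and column names below are our own labels.

```r
data(mtcars)

models <- list(
  model1 = lm(mpg ~ am, data = mtcars),
  model2 = lm(mpg ~ cyl + hp + wt, data = mtcars),
  model4 = lm(mpg ~ wt + qsec + am, data = mtcars)
)

# One row per model: adjusted R-squared and residual standard error
comparison <- t(sapply(models, function(m) {
  s <- summary(m)
  c(adj.r.squared = s$adj.r.squared, rse = s$sigma)
}))
round(comparison, 4)
```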
However, R-squared cannot tell us whether the coefficient estimates and predictions are biased, which is why we also need to assess the residual plots.
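par(mfrow = c(2, 2))
## Plot diagnostics for the model selected by step(), not the full starting model
mtcars.best <- step(mtcars.lm5, direction = "backward", trace = 0)
plot(mtcars.best)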
The residual plots do not indicate anything unusual that would lead us to question the usefulness of the model we created using a stepwise process initialized with all the variables in the data set. The residuals versus fitted values plot (upper left) shows a fairly even spread of residuals. The normal Q-Q plot (upper right) shows the residuals to be approximately normally distributed. Likewise, the scale-location plot (lower left), which plots the square root of the absolute standardized residuals against the fitted values, has an even spread with no significant trend or pattern. The residuals versus leverage plot (lower right), which includes Cook's distance contours, shows no data points exerting undue influence on the model.
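The visual impressions above could be backed up with numbers. The sketch below is an addition of ours and not part of the original analysis: it assumes a Shapiro-Wilk test as the normality check and simply lists the largest Cook's distances, without applying any particular cutoff.

```r
data(mtcars)

fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Shapiro-Wilk test on the residuals (null hypothesis: residuals are normal)
sw <- shapiro.test(residuals(fit))
sw$p.value

# Largest Cook's distances, to see which cars influence the fit most
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)
```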
The model we finally chose was: mpg = 9.6178 - 3.9165(wt) + 1.2259(qsec) + 2.9358(am)
What this says is that, holding weight and qsec constant, going from an automatic to a manual transmission increases MPG by 2.9358 on average.
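The "holding weight and qsec constant" reading can be illustrated by predicting MPG for two cars that differ only in transmission, and the uncertainty in the am coefficient can be shown with a confidence interval. The two cars in `new_cars` are hypothetical values chosen by us for illustration.

```r
data(mtcars)

fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# 95% confidence interval for the transmission effect
ci_am <- confint(fit, "am", level = 0.95)
ci_am

# Two hypothetical cars, identical except for transmission type
new_cars <- data.frame(wt = c(3, 3), qsec = c(18, 18), am = c(0, 1))
pred <- predict(fit, newdata = new_cars)

# The difference in predictions equals the am coefficient
am_effect <- unname(diff(pred))
am_effect
```

The interval is wide relative to the estimate, which is consistent with the am p-value being only just below 0.05.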
With an adjusted R-squared of 0.8336 and no evidence of major bias in the data, we feel this would be a useful model for predicting MPG. There may be a model with a higher R-squared that could be marginally “better”, but it doesn’t seem to be worth the effort to find it.
Miles per gallon is an extremely variable value that can be influenced by many factors beyond the 10 variables in the mtcars data set. Using a formula like the one above to predict MPG involves making assumptions about how the MPG values for the data set were obtained. For example, if the MPG values in the data set were obtained by having cars run in place on dynamometers, the predictions made by the model only apply to cars running on dynamometers. It would be misleading to apply the predictions to cars being driven on actual roads by typical drivers.
We used an automated process for identifying a useful model because we lacked the expert knowledge of cars to do it any other way. If we had the time and resources we might have used a different process, one that used knowledge of cars and one that explored the various interactions between variables. One could argue that that is the only way to go about building models, but that is beyond the scope of this project.
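As an indication of what exploring interactions might look like, the sketch below adds a weight-by-transmission term to the chosen model and compares the two fits. The choice of the wt:am interaction is purely our assumption for illustration; it was not part of this analysis and no conclusion is drawn from it here.

```r
data(mtcars)

additive         <- lm(mpg ~ wt + qsec + am, data = mtcars)
with_interaction <- lm(mpg ~ wt * am + qsec, data = mtcars)   # adds a wt:am term

# Nested-model F test: does the interaction term reduce the residual error?
cmp <- anova(additive, with_interaction)
cmp
```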