Finding a relationship between this and that

(and making sure it is believable)


Author: Russ Robbins

Affiliated Code Repository (right click, and open new window or tab)


Executive Summary

The primary purpose of this research was to determine whether an automatic or a manual transmission leads to higher miles per gallon (MPG). The second purpose was to quantify the MPG difference between automatic and manual transmissions. Based on the data set provided, and only the data set provided, there is no statistically significant relationship between transmission type (automatic or manual) and MPG. Instead, the one statistically significant factor in this data set that describes the increase or decrease of MPG in a linear and easily interpretable way is the number of cylinders in a car’s engine. On average, moving from a 4-cylinder to a 6-cylinder engine, or from a 6-cylinder to an 8-cylinder engine, decreases MPG by 5.75 miles per gallon. The fitted model predicts 26.38 MPG for 4-cylinder vehicles, 20.63 MPG for 6-cylinder vehicles, and 14.88 MPG for 8-cylinder vehicles. Further, taking into account the uncertainty in the estimate (the variability seen in the data), I expect with 95% confidence that the decrease in MPG from a 4- to a 6-cylinder engine, or from a 6- to an 8-cylinder engine, is no less than 4.43 miles per gallon (2 cylinders times a 2.22 MPG decrease per cylinder) and no greater than 7.07 miles per gallon (2 cylinders times a 3.53 MPG decrease per cylinder).

The Model


Figures 1, 2, and 3 show the model from three different perspectives.

Figure 1: MPG by Number of Cylinders in Engine


As a person moves from an engine with more cylinders to one with fewer cylinders, they can expect their MPG to increase, and vice versa.

##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 37.88458  2.0738436 18.267808 8.369155e-18
## x           -2.87579  0.3224089 -8.919699 6.112687e-10

Figure 2: Regression Model [ 37.88 - ( 2.88 * Number of Cylinders ) = MPG ]


To compute the predicted MPG, multiply 2.88 by the number of cylinders and subtract the result from 37.88.
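As a worked example of the regression equation, the sketch below plugs 4, 6, and 8 cylinders into the coefficients printed in the output above. The helper name `predict_mpg` is hypothetical and only used for illustration.

```r
# Coefficients copied from the regression output above.
intercept <- 37.88458
slope     <- -2.87579   # change in MPG per additional cylinder

# Hypothetical helper: predicted MPG for a given number of cylinders.
predict_mpg <- function(cylinders) intercept + slope * cylinders

predicted <- predict_mpg(c(4, 6, 8))
names(predicted) <- c("4 cyl", "6 cyl", "8 cyl")
print(round(predicted, 2))

# Change in predicted MPG for each two-cylinder step (4 -> 6, 6 -> 8).
step_change <- diff(predicted)
print(round(step_change, 2))
```

The three predictions reproduce the 26.38, 20.63, and 14.88 MPG figures quoted in the Executive Summary, and each two-cylinder step is twice the slope.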

## [1] -7.068474 -4.434687

Figure 3: 95% Confidence Interval (change in MPG per two-cylinder step)


MPG savings are between roughly 4.4 and 7.1 miles per gallon when you move from an 8- to a 6-cylinder engine or from a 6- to a 4-cylinder engine.
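The interval can be reconstructed from the slope and standard error in the regression output. The sketch below assumes 30 residual degrees of freedom (32 cars minus 2 estimated coefficients) and doubles the per-cylinder interval to cover a two-cylinder step.

```r
# Slope and standard error copied from the regression output above.
slope <- -2.87579
se    <- 0.3224089
df    <- 30   # assumption: 32 observations minus 2 coefficients

# Two-sided 95% interval for the per-cylinder slope.
t_crit     <- qt(0.975, df)
ci_per_cyl <- slope + c(-1, 1) * t_crit * se

# Engines in this data set differ in steps of two cylinders.
ci_two_cyl <- 2 * ci_per_cyl

print(round(ci_per_cyl, 3))
print(round(ci_two_cyl, 3))
```

The doubled interval matches the two numbers printed above the figure caption, which indicates that the printed interval describes a two-cylinder change, not a single cylinder.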

Steps in the Analysis


  1. Used multiple linear regression in order to keep the resulting model easily interpretable. (This included not transforming any explanatory variables or creating any interaction terms by combining explanatory variables and then using them in the model.)

  2. Explored data. This included summary of each of the variables. It also included plotting the relationships of each variable and every other variable pairwise.

  3. Eliminated variables that are not directly explanatory for miles per gallon. I used “all subsets regression.”

  4. Considered sets of independent variables and their joint prospective effect on miles per gallon. Sought the optimal model by reducing the explanatory variables from eight to the smallest workable number. Developed and used an understanding of the uncertainty in the competing models’ estimates. Using hypothesis testing, sought the model that maximizes R-squared and whose coefficients are statistically significant, as assessed for the model as a whole with the F-statistic and for individual coefficients with t-statistics. See Figure 4.

  5. Checked, by running diagnostic procedures, that the assumptions made about the explanatory variables hold for any models that appear explanatory. See Figures 5 through 7.

  6. Considered uncertainty with regard to the predictions of the model by using confidence intervals.

  7. Documented the results of the analysis in an easy-to-understand report.

  8. Explored variables that do not explain MPG but may be proxies for other explanatory variables.
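The exploration step can be sketched in base R as follows. This assumes that the data set referred to as `m` in the model printouts is R's built-in `mtcars`, which the report does not state explicitly. The pairwise plot is only drawn in an interactive session.

```r
# Assumption: the report's data frame `m` is the built-in mtcars data set.
m <- mtcars

# Summary of each variable.
str(m)
print(summary(m))

# Correlation of every variable with mpg, sorted from most negative.
cors <- sort(cor(m)[, "mpg"])
print(round(cors, 3))

# Pairwise scatterplots of each variable against every other variable.
if (interactive()) pairs(m)
```

Variables with strongly negative correlations (such as `cyl` and `wt`) are natural candidates for the model-selection steps.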


Appendix


The figures below provide additional information about the kinds of diagnostics I ran to see whether particular relationships were actually linear. Each figure is just one of many such results that I analyzed.


## 
## Call:
## lm(formula = mpg ~ cyl + wt + carb, data = m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6692 -1.5668 -0.4254  1.2567  5.7404 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.6021     1.6823  23.541  < 2e-16 ***
## cyl          -1.2898     0.4326  -2.981 0.005880 ** 
## wt           -3.1595     0.7423  -4.256 0.000211 ***
## carb         -0.4858     0.3295  -1.474 0.151536    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.517 on 28 degrees of freedom
## Multiple R-squared:  0.8425, Adjusted R-squared:  0.8256 
## F-statistic: 49.91 on 3 and 28 DF,  p-value: 2.322e-11
## 
## Call:
## lm(formula = mpg ~ cyl + wt, data = m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2893 -1.5512 -0.4684  1.5743  6.1004 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
## cyl          -1.5078     0.4147  -3.636 0.001064 ** 
## wt           -3.1910     0.7569  -4.216 0.000222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8185 
## F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

Figure 4: Sample model printouts to help in considering the F- and t-statistics


Forward and backward subsetting were used, and the F- and t-statistics were considered in each.
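One way to compare the two printed models is a nested F test, sketched below in base R. It again assumes `m` is the built-in `mtcars` data set. The calls `drop1()` and `anova()` are stand-ins for whatever selection procedure was actually used, which the report does not show.

```r
# Assumption: the report's data frame `m` is the built-in mtcars data set.
m <- mtcars

fit3 <- lm(mpg ~ cyl + wt + carb, data = m)
fit2 <- lm(mpg ~ cyl + wt, data = m)

# F test for dropping each term from the three-variable model.
print(drop1(fit3, test = "F"))

# Nested comparison: does adding carb improve on cyl + wt?
cmp <- anova(fit2, fit3)
print(cmp)

# Adjusted R-squared of both candidates.
adj_r2 <- c(cyl_wt_carb = summary(fit3)$adj.r.squared,
            cyl_wt      = summary(fit2)$adj.r.squared)
print(round(adj_r2, 4))
```

Because the models differ by one term, the nested F test for `carb` has the same p-value as its t-test in the first printout (about 0.15), so dropping `carb` costs little explanatory power.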

Figure 5: Do the residuals approximate a normal distribution?


Probably not.
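A visual judgment like this can be complemented with a formal test. The sketch below runs a Shapiro-Wilk test on the residuals. It assumes the figure was based on the `mpg ~ cyl + wt` model and that `m` is the built-in `mtcars` data set, neither of which the report states.

```r
# Assumptions: `m` is mtcars, and the residuals come from mpg ~ cyl + wt.
m   <- mtcars
fit <- lm(mpg ~ cyl + wt, data = m)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(res)
print(sw)

# Normal Q-Q plot of the residuals (interactive sessions only).
if (interactive()) {
  qqnorm(res)
  qqline(res)
}
```

With only 32 observations the test has limited power, so it is best read alongside the Q-Q plot.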

##      cyl       wt 
## 2.579312 2.579312

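Figure 6: Are cyl and wt collinear? Partly, but not perfectly.


Variance inflation factors of about 2.58 show that cyl and wt are correlated with each other, so they explain much of the same variance. Therefore we shouldn’t rely on both in the same model.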
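The two identical values printed above are consistent with variance inflation factors (VIFs) for a two-predictor model. With only two predictors, the VIF is 1 / (1 - r^2), where r is the correlation between them. The sketch below computes it by hand, assuming `m` is the built-in `mtcars` data set.

```r
# Assumption: the report's data frame `m` is the built-in mtcars data set.
m <- mtcars

# Correlation between the two predictors.
r <- cor(m$cyl, m$wt)

# With two predictors, both share the same variance inflation factor.
vif_two <- 1 / (1 - r^2)

print(round(c(correlation = r, vif = vif_two), 4))
```

A correlation of 1 would make the VIF infinite. A value near 2.58 corresponds to a correlation of about 0.78.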


Figure 7: Evaluate Nonlinearity using component + residual plot


The result of this diagnostic is not good. It appears that wt does not have a linear relationship with MPG. This suggests that if we are going to use a linear model with no variable transformations, again for easy interpretability, we should stick with using cyl as the predictor for MPG. Therefore, at this point, I changed the suggested model to one in which cyl alone predicts MPG. Dropping wt is not a major loss, since cyl and wt are highly correlated and thus explain much of the same variance. Further, the resulting model (cyl -> MPG) is very interpretable. It has a large R-squared and large F- and t-statistics, so it is very reasonable as a final model if the other diagnostics prove supportive. From this point forward I ran several other diagnostics and checked whether the variables I had excluded (qsec and hp) explained any additional variance. They did not. I ended up with a model in which cyl predicts MPG well and simply.
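The two checks described in this paragraph can be sketched in base R. The code below assumes `m` is the built-in `mtcars` data set, uses partial residuals from `residuals(..., type = "partial")` as a stand-in for the component + residual plot, and uses a nested F test for the excluded variables. The report does not show the exact procedures used.

```r
# Assumption: the report's data frame `m` is the built-in mtcars data set.
m <- mtcars

# Partial residuals for wt: residuals plus the fitted wt component.
fit_cw <- lm(mpg ~ cyl + wt, data = m)
pr_wt  <- residuals(fit_cw, type = "partial")[, "wt"]

# A smoother that bends away from the straight line suggests nonlinearity.
if (interactive()) {
  plot(m$wt, pr_wt, xlab = "wt", ylab = "Component + residual (mpg)")
  abline(lm(pr_wt ~ m$wt), lty = 2)
  lines(lowess(m$wt, pr_wt))
}

# Final model, and a nested F test for the excluded variables.
fit_cyl  <- lm(mpg ~ cyl, data = m)
fit_more <- lm(mpg ~ cyl + qsec + hp, data = m)
excluded_test <- anova(fit_cyl, fit_more)

print(summary(fit_cyl)$coefficients)
print(excluded_test)
```

A large p-value in the second row of the `anova()` table would indicate that qsec and hp add little beyond cyl.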