The purpose of this project is to provide the magazine “Motor Trend” with an analysis of the relationship between transmission type (automatic/manual) and MPG (miles per gallon). We need to answer two questions:
- “Is an automatic or manual transmission better for MPG”
- “Quantify the MPG difference between automatic and manual transmissions”
Data
We are using a dataset from 1974, “mtcars”, containing 32 observations (car models) and 11 variables. We’ll start with two variables: “mpg” as the outcome and “am” (transmission, 0 = automatic, 1 = manual) as the regressor. Then we’ll use other variables to explain the variability further.
Process
1. Exploratory data analysis: visualization, quantification, statistical inference.
2. Simple linear regression of “mpg” on “am”; check how much of the variance is explained by the model.
3. Fit multiple models to measure the impact of other variables.
4. Residual plots to detect/quantify uncertainty in the model.
Key findings
Manual transmission is better than automatic for MPG, with a difference of about 7.24 MPG between the two. In spite of the strong correlation between transmission type and mpg, transmission alone explains only 36% of the variability of mpg. Other factors that influence mpg are the number of cylinders, weight and horsepower. Including these factors in the model explains about 86% of the variability of mpg.
1. Exploratory Data Analysis
data(mtcars) #read data
head(mtcars, 2) #display first 2 observations & variable names
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
#transform relevant variables into factors
mtcars$cyl  <- factor(mtcars$cyl)
mtcars$vs   <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am   <- factor(mtcars$am)
We create a boxplot (appendix 1) that shows a clear difference in MPG between transmission types: manual transmission is associated with a clearly higher MPG.
Let’s quantify this:
as.data.frame(rbind(summary(mtcars$mpg[mtcars$am==0]), summary(mtcars$mpg[mtcars$am==1])), row.names = c("automatic", "manual"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## automatic 10.4 14.95 17.3 17.15 19.2 24.4
## manual 15.0 21.00 22.8 24.39 30.4 33.9
Both the boxplot and the summaries show a clear difference, mean for automatic is 17.15 MPG, versus 24.39 for manual, a difference of about 7.24 MPG.
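The same difference can be computed directly; a minimal sketch, assuming the data prepared above:
#difference of the group means: manual (1) minus automatic (0)
diff(with(mtcars, tapply(mpg, am, mean))) #about 7.24 MPG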
Is this difference statistically significant? We can check this with a t test. The null hypothesis is that both means are equal (mu_automatic - mu_manual = 0).
(t.test(mpg ~ factor(am), data=mtcars))$p.value
## [1] 0.001373638
The p-value of 0.00137 is below the significance level alpha of 0.05. We reject the null hypothesis in favor of the alternative: there is a significant difference between the mean MPG of automatic and manual transmission cars.
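The same t test also provides a confidence interval for the difference in means; a quick sketch (the interval should exclude zero, consistent with rejecting the null):
#95% confidence interval for mean(automatic) - mean(manual)
t.test(mpg ~ am, data = mtcars)$conf.int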
2. Simple linear regression, mpg and am
The correlation between transmission and MPG is clear, but how much of MPG variance is explained by transmission? We can perform a simple linear regression:
fit1 <- lm(mpg ~ am, data = mtcars); summary(fit1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
We retrieve our previous results: the intercept (beta0) is the mean MPG for automatic transmission, and the am coefficient (beta1, the slope) is the increase in the mean for manual transmission; the sum of the two (17.15 + 7.24 = 24.39) is the mean for manual transmission.
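As a quick check, a minimal sketch assuming fit1 from above, the fitted coefficients reproduce the two group means:
coef(fit1) #intercept = automatic mean, am1 = manual increment
sum(coef(fit1)) #automatic mean + increment = manual mean, about 24.39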
What interests us here is the R-squared value of about 36%, which is the proportion of the variance of MPG explained by the transmission variable.
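This R-squared value can also be recomputed by hand as one minus the ratio of residual to total variation; a sketch, assuming fit1 from above:
#R-squared = 1 - RSS/TSS, should match summary(fit1)$r.squared (about 0.36)
1 - sum(resid(fit1)^2) / sum((mtcars$mpg - mean(mtcars$mpg))^2)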
This is not enough; adding other variables may explain more of the variance of MPG. We need to select the relevant ones.
3. Fit multiple models
A pairs plot (appendix 2) shows that the following variables seem to have some correlation with mpg: am, cyl, disp, hp, wt and qsec. We confirm this intuition with the step function (stepwise selection of useful variables, omitting the ones that don’t contribute significantly).
step(lm(mpg ~ ., data = mtcars), trace = 0) #trace = 0 suppresses the intermediate steps; only the final model is printed
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Coefficients:
## (Intercept) cyl6 cyl8 hp wt
## 33.70832 -3.03134 -2.16368 -0.03211 -2.49683
## am1
## 1.80921
We retrieve here more or less the same variables: am, cyl, hp, wt. Only disp and qsec are not retained.
They may be strongly related to mpg, but probably won’t add much explained variance. We can verify this by fitting two multivariable regression models:
“fit4” uses the 4 regressors suggested by the step function; “fit6” uses the 6 regressors visually identified in the pairs plot.
fit4 <- lm(mpg ~ am + cyl + hp + wt, data = mtcars)
fit6 <- lm(mpg ~ am + cyl + hp + wt + disp + qsec, data = mtcars)
#Summary and R-squared value with 4 regressors; this result is used in the conclusion
summary(fit4)
##
## Call:
## lm(formula = mpg ~ am + cyl + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## am1 1.80921 1.39630 1.296 0.20646
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
#R squared value with 6 regressors
summary(fit6)$r.squared
## [1] 0.8736016
The difference in explained variance between the two models is only about 0.8 percentage points. It seems that adding the variables disp and qsec does not add much to the model.
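One way to see this, a sketch assuming fit4 and fit6 from above, is to compare the adjusted R-squared values, which penalize the two extra regressors:
#adjusted R-squared of fit6 is expected to be no better than that of fit4
c(fit4 = summary(fit4)$adj.r.squared, fit6 = summary(fit6)$adj.r.squared)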
We can formalize this with the anova function, comparing the different fitted models between one another:
anova(fit1, fit4, fit6)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl + hp + wt
## Model 3: mpg ~ am + cyl + hp + wt + disp + qsec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 26 151.03 4 569.87 24.0231 4.303e-08 ***
## 3 24 142.33 2 8.70 0.7331 0.4909
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of 0.49 for model 3 confirms that disp and qsec do not significantly improve the fit, while the very small p-value for model 2 shows that cyl, hp and wt do; we therefore keep fit4.
4. Residual plot
The residual plot (appendix 3) allows us to diagnose potential issues with the model:
- Residuals vs. Fitted: points are randomly scattered, supporting the independence assumption.
- Normal Q-Q plot: the points lie roughly along the line, suggesting the residuals are normally distributed.
- Scale-Location: points are scattered in a constant band, meaning the variance is roughly constant.
- Residuals vs. Leverage: there are a few influential observations, which we can see on the plot and identify with the dfbetas and hatvalues functions (see the sketch after this list).
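A minimal sketch of how the dfbetas and hatvalues functions could be used to flag influential observations, assuming fit4 from above:
#leverage: the three observations with the largest hat values
head(sort(hatvalues(fit4), decreasing = TRUE), 3)
#influence: the largest absolute dfbeta per observation, across all coefficients
head(sort(apply(abs(dfbetas(fit4)), 1, max), decreasing = TRUE), 3)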
All other variables held constant (see the prediction sketch below):
- for every weight increase of 1,000 lb, mpg decreases by about 2.5;
- the number of cylinders also has an impact: mpg decreases by about 3 for 6 cylinders (compared to 4) and by about 2.16 for 8 cylinders.
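These effects can be illustrated with a small prediction sketch, assuming fit4 from above; the two cars below are hypothetical and differ only in weight:
#two hypothetical manual, 6-cylinder, 110 hp cars whose weights differ by 1,000 lb (wt is in units of 1,000 lb)
nd <- data.frame(am  = factor(c(1, 1), levels = levels(mtcars$am)),
                 cyl = factor(c(6, 6), levels = levels(mtcars$cyl)),
                 hp  = c(110, 110),
                 wt  = c(2.5, 3.5))
diff(predict(fit4, newdata = nd)) #expected to be about -2.5 MPG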
Appendix 1. Box plot
boxplot(mpg~am, data = mtcars, main = "Boxplot MPG & Transmission", xlab = "Transmission, 0 = automatic, 1 = manual", ylab = "MPG")
We see a clear difference, manual transmission showing a higher (better) MPG.
Appendix 2. Pairs plot for all variables
pairs(mpg ~., panel = panel.smooth, data = mtcars)
Appendix 3. Residual plot
par(mfrow=c(2, 2))
plot(fit4)