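Motor Trend collected data on various cars to understand the impact of several factors on miles per gallon (MPG). Specifically, this report addresses two questions: “Is an automatic or manual transmission better for MPG?” and “How large is the MPG difference between automatic and manual transmissions?”
The analysis shows that manual transmissions have higher MPG on average, by about 7.2 MPG in the unadjusted comparison. Once the weight of the car and the horsepower of the engine are included in the model, the estimated advantage of manual transmissions shrinks to about 2.1 MPG and is no longer statistically significant (p = 0.14). Weight and horsepower are themselves significant influencing factors.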
## make the factor variables after creating a copy of the data
library(datasets)
mt <- mtcars
mt$cyl <- factor(mt$cyl, labels = c("4cyl", "6cyl", "8cyl"))
mt$vs <- factor(mt$vs, labels = c("VeeEngine", "StraightEngine"))
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))
mt$gear <- factor(mt$gear, labels = c("3gears", "4gears", "5gears"))
See A1 in the Appendix for the structure of this dataset.
aggregate(mpg ~ am, data = mt, mean)
##          am      mpg
## 1 Automatic 17.14737
## 2    Manual 24.39231
The mean MPG for automatic transmissions is about 7 MPG lower than for manual transmissions. The box plot (see A2 in the Appendix) shows the same pattern. A t-test checks whether this difference is statistically significant.
t.test(mpg~am, data = mt, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual
##                17.14737                24.39231
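Since the p-value is 0.001374 (less than 0.05), we reject the null hypothesis that the two group means are equal. In this unadjusted comparison, manual transmissions have significantly higher MPG, with a 95% confidence interval of roughly 3.2 to 11.3 MPG for the difference.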
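The test object also carries the estimates and the interval shown above, which helps quantify the difference rather than only test it. The following is a minimal sketch, not part of the original analysis. It rebuilds `mt` from `mtcars` so that it runs on its own, and the names `tt`, `mean_diff` and `ci_manual_minus_auto` are introduced here purely for illustration.

```r
library(datasets)

# Rebuild the factor-coded copy so the sketch is self-contained
mt <- mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

# Welch two-sample t-test, stored so its components can be inspected
tt <- t.test(mpg ~ am, data = mt, conf.level = 0.95)

# Difference in group means, expressed as Manual minus Automatic
mean_diff <- unname(tt$estimate[2] - tt$estimate[1])

# The reported interval is for Automatic minus Manual,
# so negate it and reverse the order of the bounds
ci_manual_minus_auto <- rev(-as.numeric(tt$conf.int))

mean_diff
ci_manual_minus_auto
```

Because the formula interface reports Automatic minus Manual, the sign is flipped here so that positive values mean manual cars have higher MPG.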
fit <- lm(mpg ~ am, data = mt)
The simple model (see A3 in the Appendix) has an R-squared of 0.3598, meaning transmission type alone explains 35.98% of the variance in MPG. Based on subject matter experts’ (SME) knowledge, the weight of a car and the horsepower and displacement of its engine are also influencing factors. The number of cylinders can have an influence too, but displacement and horsepower already reflect it. Including unnecessary variables inflates standard errors, while omitting relevant variables biases the estimates, so the candidate variables are added one at a time and the nested models are compared.
fit2 <- update(fit, mpg ~ am + wt)
fit3 <- update(fit, mpg ~ am + wt + hp)
fit4 <- update(fit, mpg ~ am + wt + hp + disp)
anova(fit, fit2, fit3, fit4)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + hp
## Model 4: mpg ~ am + wt + hp + disp
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)
## 1     30 720.90
## 2     29 278.32  1    442.58 66.4206 9.394e-09 ***
## 3     28 180.29  1     98.03 14.7118 0.0006826 ***
## 4     27 179.91  1      0.38  0.0576 0.8122229
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Examining the Pr(>F) column for these nested models, adding weight and then horsepower each significantly improves the fit, while adding displacement does not (p = 0.81). Our theory is that horsepower is already influenced by displacement, so displacement adds little further information.
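One way to probe the theory about displacement is to look at how strongly the candidate predictors move together. The sketch below is an illustrative check rather than part of the original analysis. The names `cor_matrix`, `r2_disp` and `vif_disp` are assumed here, and the variance inflation factor is computed by hand instead of relying on an add-on package.

```r
library(datasets)

# Pairwise Pearson correlations among the continuous candidate predictors
predictors <- mtcars[, c("wt", "hp", "disp")]
cor_matrix <- cor(predictors)
round(cor_matrix, 2)

# Variance inflation factor for disp, computed by hand:
# regress disp on the other predictors and use 1 / (1 - R^2)
r2_disp <- summary(lm(disp ~ am + wt + hp, data = mtcars))$r.squared
vif_disp <- 1 / (1 - r2_disp)
vif_disp
```

A high correlation or a large variance inflation factor would be consistent with displacement being largely redundant once weight and horsepower are in the model, though it does not by itself establish why.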
## fit3 is the best model (am, wt and hp)
bestfit <- fit3
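The fit3 model (see A4 in the Appendix) has an R-squared of 0.8399, meaning 83.99% of the variance is explained. This is a substantial improvement over the simple model. In this model, a manual transmission is associated with about 2.08 more MPG than an automatic at the same weight and horsepower, although that coefficient is not statistically significant (p = 0.141).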
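Since R-squared never decreases when a variable is added, it can help to put it next to the adjusted R-squared for each candidate model. The following is a small sketch under the assumption that the four models are refit from scratch. The list `models` and the data frame `fit_stats` are names introduced for illustration only.

```r
library(datasets)

# Rebuild the factor-coded copy so the sketch is self-contained
mt <- mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

# The four nested models considered in the report
models <- list(
  am            = lm(mpg ~ am, data = mt),
  am_wt         = lm(mpg ~ am + wt, data = mt),
  am_wt_hp      = lm(mpg ~ am + wt + hp, data = mt),
  am_wt_hp_disp = lm(mpg ~ am + wt + hp + disp, data = mt)
)

# Collect R-squared and adjusted R-squared, one row per model
fit_stats <- data.frame(
  r_squared     = sapply(models, function(m) summary(m)$r.squared),
  adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared)
)
round(fit_stats, 4)
```

The adjusted value penalizes extra terms, so a model that only gains R-squared by adding a weak predictor will not look better on that column.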
## compare fit3 (am, wt and hp) to simple linear model (am)
anova(fit, bestfit)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + hp
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
## 1     30 720.90
## 2     28 180.29  2    540.61 41.979 3.745e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing the simple linear model to the fit3 model gives a p-value of 3.745e-09. We therefore reject the null hypothesis that the added terms contribute nothing: the model that includes transmission, weight and horsepower fits significantly better. This model could be refined further to see whether there are other influencing factors.
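To put a number on the transmission effect in the chosen model, the coefficient and its confidence interval can be read off the fit. This is a hedged sketch that refits the model so it runs standalone. `am_effect` and `am_ci` are illustrative names, not objects from the original report.

```r
library(datasets)

# Rebuild the data and the chosen model so the sketch is self-contained
mt <- mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))
bestfit <- lm(mpg ~ am + wt + hp, data = mt)

# Estimated Manual-versus-Automatic difference in MPG,
# holding weight and horsepower fixed
am_effect <- unname(coef(bestfit)["amManual"])

# 95% confidence interval for that coefficient
am_ci <- confint(bestfit, "amManual", level = 0.95)

am_effect
am_ci
```

An interval that spans zero means the adjusted difference cannot be distinguished from no difference at the 5% level, which matches the p-value of 0.141 reported for `amManual` in A4.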
The residual plots (see A5 in the Appendix) do not show any notable abnormality that would require more in-depth examination.
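The visual check can be complemented with a few numeric diagnostics. The sketch below is one possible set of checks, not something the original analysis ran. It uses a Shapiro-Wilk test on the residuals plus leverage and Cook's distance, and the names `sw`, `lev`, `cooks` and `top_influence` are assumptions made for illustration.

```r
library(datasets)

# Rebuild the data and the chosen model so the sketch is self-contained
mt <- mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))
bestfit <- lm(mpg ~ am + wt + hp, data = mt)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(resid(bestfit))
sw$p.value

# Leverage (hat values) and influence (Cook's distance) for each car
lev <- hatvalues(bestfit)
cooks <- cooks.distance(bestfit)

# The three cars with the largest Cook's distance
top_influence <- head(sort(cooks, decreasing = TRUE), 3)
top_influence
```

Cars that appear at the top of `top_influence` are the ones worth a closer look if the diagnostic plots raise any doubt.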
str(mt)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4cyl","6cyl",..: 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "VeeEngine","StraightEngine": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3gears","4gears",..: 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## box plot
library(ggplot2)
g2 <- ggplot(mt, aes(am,mpg))
g2 <- g2 + geom_boxplot(fill = "light grey", colour = "blue",
                        outlier.colour = "red", outlier.shape = 1) +
  labs(x = "Transmission Type") +
  labs(y = "Miles Per Gallon (MPG)") +
  labs(title = "MPG for Automatic and Manual Transmission Cars")
print(g2)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mt)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3923 -3.0923 -0.2974  3.2439  9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
summary(bestfit)
##
## Call:
## lm(formula = mpg ~ am + wt + hp, data = mt)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4221 -1.7924 -0.3788  1.2249  5.5317
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.002875   2.642659  12.867 2.82e-13 ***
## amManual     2.083710   1.376420   1.514 0.141268
## wt          -2.878575   0.904971  -3.181 0.003574 **
## hp          -0.037479   0.009605  -3.902 0.000546 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared: 0.8399, Adjusted R-squared: 0.8227
## F-statistic: 48.96 on 3 and 28 DF, p-value: 2.908e-11
par(mfrow = c(2,2))
plot(bestfit)