Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions.
To understand the data better, we first do some exploratory data analysis and data cleansing so that the data is easier to model.
data(mtcars)
pairs(mtcars, panel=panel.smooth, main="Motor Trend Car Data")
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Looking at the plot above and the output of str(mtcars), the following variables take only a few distinct values and are candidates for factors: cyl, vs, am, gear, carb. We convert each to a factor. We also label the transmission levels as Automatic and Manual instead of using 0 and 1 for this categorical variable.
# Label the transmission codes (0 = automatic, 1 = manual) while converting to a factor
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
# Convert the remaining discrete variables to factors
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
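One way to back up the visual judgment about which variables are categorical is to count the distinct values in each column. The sketch below works on a fresh copy of the data (the names `raw`, `n_unique`, and `factor_candidates`, and the cutoff of 6, are illustrative assumptions, not part of the original analysis).

```r
# Work on a separate copy so the converted mtcars object above is untouched
raw <- datasets::mtcars

# Number of distinct values per column
n_unique <- sapply(raw, function(x) length(unique(x)))

# Columns with few distinct values are candidates for factors
# (the cutoff of 6 is an arbitrary choice for this illustration)
factor_candidates <- names(n_unique)[n_unique <= 6]
print(n_unique)
print(factor_candidates)
```

With this cutoff the candidates are the same five variables chosen above: cyl, vs, am, gear, and carb.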
Let’s look at the relationship between mpg and am alone.
boxplot(mpg ~ am, data=mtcars)
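To put a number on what the boxplot shows, the group means can be compared directly. This is a minimal sketch on a separate copy of the data; the object names (`raw`, `group_means`, `unadjusted_diff`, `welch`) and the choice of a Welch two-sample t-test are assumptions added for illustration, and the comparison ignores every other variable.

```r
# Separate copy of the data with a labelled transmission factor
raw <- datasets::mtcars
raw$am <- factor(raw$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean mpg within each transmission type
group_means <- tapply(raw$mpg, raw$am, mean)

# Unadjusted difference: manual minus automatic
unadjusted_diff <- unname(group_means["Manual"] - group_means["Automatic"])

# Welch two-sample t-test of mpg by transmission type
welch <- t.test(mpg ~ am, data = raw)

print(group_means)
print(unadjusted_diff)
print(welch$p.value)
```

This unadjusted gap is only a starting point, since it attributes to the transmission any difference that is really due to other characteristics of the cars.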
Now, let's look at all the other variables and see if we can find a model that is useful for predicting mpg. We start by using least squares on the entire dataset, including all variables as regressors. We then remove variables whose p-values are greater than 0.05 and refit, keeping am in every model because it is the variable of interest. We repeat this until the remaining variables all seem significant enough to influence the outcome mpg.
fitall <- lm(mpg ~ ., data=mtcars)
summary(fitall)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vs1 1.93085 2.87126 0.672 0.5115
## amManual 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
fit1 <- lm(mpg ~ am + wt + qsec + hp + disp + drat + gear + carb + vs, data=mtcars)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ am + wt + qsec + hp + disp + drat + gear +
## carb + vs, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1897 -1.3843 -0.3634 0.9201 4.6548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.87276 15.43382 0.834 0.416
## amManual 1.95166 2.80814 0.695 0.496
## wt -3.89980 2.25665 -1.728 0.102
## qsec 0.52619 0.83550 0.630 0.537
## hp -0.04916 0.03139 -1.566 0.136
## disp 0.03120 0.02420 1.289 0.215
## drat 2.39306 2.19773 1.089 0.291
## gear4 0.88681 3.45338 0.257 0.800
## gear5 1.97540 3.60004 0.549 0.590
## carb2 -0.75477 2.25913 -0.334 0.742
## carb3 2.08122 3.72838 0.558 0.584
## carb4 -1.30096 3.76183 -0.346 0.734
## carb6 0.96288 5.50575 0.175 0.863
## carb8 3.04608 7.39745 0.412 0.686
## vs1 1.52262 2.57613 0.591 0.562
##
## Residual standard error: 2.779 on 17 degrees of freedom
## Multiple R-squared: 0.8834, Adjusted R-squared: 0.7873
## F-statistic: 9.197 on 14 and 17 DF, p-value: 2.359e-05
fit2 <- lm(mpg ~ am + wt + qsec + hp + disp + drat + gear + vs, data=mtcars)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ am + wt + qsec + hp + disp + drat + gear +
## vs, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0548 -1.4564 -0.3425 1.2825 4.7168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.80934 13.36845 0.809 0.4274
## amManual 2.92105 2.00082 1.460 0.1584
## wt -3.68884 1.52013 -2.427 0.0239 *
## qsec 0.91001 0.64014 1.422 0.1692
## hp -0.02721 0.01720 -1.582 0.1279
## disp 0.01380 0.01341 1.029 0.3145
## drat 1.18599 1.74482 0.680 0.5038
## gear4 -0.42897 2.43311 -0.176 0.8617
## gear5 0.88164 2.57587 0.342 0.7354
## vs1 0.65015 1.93968 0.335 0.7407
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.581 on 22 degrees of freedom
## Multiple R-squared: 0.8698, Adjusted R-squared: 0.8166
## F-statistic: 16.34 on 9 and 22 DF, p-value: 8.402e-08
fit3 <- lm(mpg ~ am + wt + qsec + hp + disp + drat + vs, data=mtcars)
summary(fit3)
##
## Call:
## lm(formula = mpg ~ am + wt + qsec + hp + disp + drat + vs, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4067 -1.4690 -0.2824 1.1415 4.5365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.49805 12.48039 1.001 0.32662
## amManual 3.02402 1.66840 1.813 0.08244 .
## wt -3.94974 1.26261 -3.128 0.00457 **
## qsec 0.87149 0.61331 1.421 0.16819
## hp -0.02282 0.01526 -1.496 0.14778
## disp 0.01374 0.01136 1.210 0.23821
## drat 0.95533 1.40737 0.679 0.50376
## vs1 0.59017 1.83303 0.322 0.75027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.495 on 24 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8286
## F-statistic: 22.4 on 7 and 24 DF, p-value: 4.532e-09
fit4 <- lm(mpg ~ am + wt + qsec + hp + disp + drat, data=mtcars)
summary(fit4)
##
## Call:
## lm(formula = mpg ~ am + wt + qsec + hp + disp + drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2669 -1.6148 -0.2585 1.1220 4.5564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.71062 10.97539 0.976 0.33848
## amManual 2.98469 1.63382 1.827 0.07969 .
## wt -4.04454 1.20558 -3.355 0.00254 **
## qsec 0.99073 0.48002 2.064 0.04955 *
## hp -0.02180 0.01465 -1.488 0.14938
## disp 0.01310 0.01098 1.193 0.24405
## drat 1.02065 1.36748 0.746 0.46240
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.45 on 25 degrees of freedom
## Multiple R-squared: 0.8667, Adjusted R-squared: 0.8347
## F-statistic: 27.09 on 6 and 25 DF, p-value: 8.637e-10
fit5 <- lm(mpg ~ am + wt + qsec + hp + disp, data=mtcars)
summary(fit5)
##
## Call:
## lm(formula = mpg ~ am + wt + qsec + hp + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5399 -1.7398 -0.3196 1.1676 4.5534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.36190 9.74079 1.474 0.15238
## amManual 3.47045 1.48578 2.336 0.02749 *
## wt -4.08433 1.19410 -3.420 0.00208 **
## qsec 1.00690 0.47543 2.118 0.04391 *
## hp -0.02117 0.01450 -1.460 0.15639
## disp 0.01124 0.01060 1.060 0.29897
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.429 on 26 degrees of freedom
## Multiple R-squared: 0.8637, Adjusted R-squared: 0.8375
## F-statistic: 32.96 on 5 and 26 DF, p-value: 1.844e-10
fit6 <- lm(mpg ~ am + wt + qsec + hp, data=mtcars)
summary(fit6)
##
## Call:
## lm(formula = mpg ~ am + wt + qsec + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4975 -1.5902 -0.1122 1.1795 4.5404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.44019 9.31887 1.871 0.07215 .
## amManual 2.92550 1.39715 2.094 0.04579 *
## wt -3.23810 0.88990 -3.639 0.00114 **
## qsec 0.81060 0.43887 1.847 0.07573 .
## hp -0.01765 0.01415 -1.247 0.22309
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.435 on 27 degrees of freedom
## Multiple R-squared: 0.8579, Adjusted R-squared: 0.8368
## F-statistic: 40.74 on 4 and 27 DF, p-value: 4.589e-11
fit7 <- lm(mpg ~ am + wt + qsec, data=mtcars)
summary(fit7)
##
## Call:
## lm(formula = mpg ~ am + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## amManual 2.9358 1.4109 2.081 0.046716 *
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
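The manual elimination above could also be scripted. The following is a rough sketch, not the procedure used in this report: it relies on `drop1()` with term-level F-tests instead of reading coefficient t-tests from `summary()`, always keeps am, and removes the least significant remaining term until every p-value is at most 0.05. Because it tests whole factors at once and may drop terms in a different order, it is not guaranteed to end at the same model. The names `cars`, `fit`, `pvals`, and `worst` are hypothetical.

```r
# Separate copy of the data with the same factor conversions as above
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
for (v in c("cyl", "vs", "gear", "carb")) cars[[v]] <- factor(cars[[v]])

# Start from the model with every variable as a regressor
fit <- lm(mpg ~ ., data = cars)

repeat {
  # F-test for dropping each term in turn
  tests <- drop1(fit, test = "F")
  pvals <- setNames(tests[["Pr(>F)"]], rownames(tests))

  # Ignore the "<none>" row and never consider dropping am
  pvals <- pvals[!(names(pvals) %in% c("<none>", "am"))]

  # Stop when nothing is left to drop or everything is significant
  if (length(pvals) == 0 || max(pvals) <= 0.05) break

  # Remove the least significant term and refit
  worst <- names(which.max(pvals))
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

print(formula(fit))
```

Automated selection of this kind is convenient, but the resulting p-values are optimistic because the same data were used to choose the model.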
Now that we have chosen a model, we check whether the model that includes am + wt + qsec is better than the model that includes only am as a regressor.
fitam <- lm(mpg ~ am, data=mtcars)
fitbestmodel <- update(fitam, mpg ~ am + wt + qsec)
anova(fitam, fitbestmodel)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + qsec
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 169.29 2 551.61 45.618 1.55e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of the F-test (1.55e-09) is far below 0.05, so we reject the hypothesis that wt and qsec add nothing beyond am. Including these two variables in the model improves our prediction of mpg over using am alone.
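As a check on how the F statistic in the table arises, it can be recomputed by hand from the two residual sums of squares. The sketch below uses the rounded values printed in the ANOVA table, so the result agrees with the table only up to rounding; the variable names are illustrative.

```r
# Values copied from the ANOVA table above
rss_am   <- 720.90   # residual sum of squares, mpg ~ am
rss_best <- 169.29   # residual sum of squares, mpg ~ am + wt + qsec
df_extra <- 2        # number of added regressors (wt and qsec)
df_resid <- 28       # residual degrees of freedom of the larger model

# F = (drop in RSS per added regressor) / (residual variance of larger model)
f_stat <- ((rss_am - rss_best) / df_extra) / (rss_best / df_resid)

# Upper-tail probability under the F distribution
p_value <- pf(f_stat, df_extra, df_resid, lower.tail = FALSE)

print(f_stat)
print(p_value)
```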
par(mfrow=c(2, 2))
plot(fitbestmodel)
summary(fitbestmodel)
##
## Call:
## lm(formula = mpg ~ am + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## amManual 2.9358 1.4109 2.081 0.046716 *
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
Based on the above model, we conclude that manual transmission is better than automatic transmission for MPG: holding wt and qsec constant, a manual transmission is associated with an estimated increase of 2.9358 miles per gallon (p = 0.0467). However, wt and qsec are also significant predictors of mpg, so the size of the transmission effect depends on adjusting for them.
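A confidence interval gives a sense of how uncertain the 2.9358 estimate is. The following is a small self-contained sketch that refits the chosen model on a separate copy of the data and calls `confint()`; the names `cars`, `fit`, and `ci` are illustrative.

```r
# Separate copy of the data with a labelled transmission factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Refit the chosen model
fit <- lm(mpg ~ am + wt + qsec, data = cars)

# 95% confidence interval for the manual-vs-automatic coefficient
ci <- confint(fit, "amManual", level = 0.95)
print(ci)
```

From the estimate and standard error reported above, the interval runs from roughly 0.05 to 5.8 miles per gallon. It excludes zero only narrowly, which matches the p-value being just under 0.05.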