This report analyzes the mtcars data set to explore how transmission type affects miles per gallon (MPG). It focuses on whether an automatic or a manual transmission is better for MPG, and it quantifies the MPG difference between the two.
The report shows that, holding weight and quarter-mile time fixed, cars with manual transmission get about 2.94 MPG more than cars with automatic transmission.
First, take a glimpse at the cars data set.
df <- mtcars
df$am <- factor(df$am, labels=c("auto","manual"))
head(df,3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 manual 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 manual 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 manual 4 1
A boxplot of mpg grouped by am (shown in the Appendix) suggests that auto transmission cars have lower mpg. Run a two-sample t-test to check whether their mean mpg is significantly lower than that of manual transmission cars. The alternative hypothesis is “auto transmission cars have a lower mean mpg than manual transmission cars”.
ttest <- t.test(df$mpg[df$am=="auto"], df$mpg[df$am=="manual"],alternative = "less")
ttest$p.value
## [1] 0.0006868192
The p-value is 0.0006868, well below 0.05, so the mean mpg of auto transmission cars is significantly lower than that of manual transmission cars.
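The same test can also be written with the formula interface. This is a minimal sketch that assumes the relabelled factor used above; the name `ttest_formula` is only illustrative. Note that `t.test` runs Welch's unequal-variance test by default (`var.equal = FALSE`), and with `alternative = "less"` the tested difference is the first factor level (auto) minus the second (manual).

```r
# Rebuild the data frame so this block runs on its own
df <- mtcars
df$am <- factor(df$am, labels = c("auto", "manual"))

# Welch two-sample t-test (var.equal = FALSE is the default);
# "less" tests mean(auto) - mean(manual) < 0
ttest_formula <- t.test(mpg ~ am, data = df, alternative = "less")
ttest_formula$p.value   # same p-value as the vector-based call
ttest_formula$estimate  # the two group means
```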
mdl1 <- lm(mpg~am, df)
summary(mdl1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## ammanual 7.244939 1.764422 4.106127 2.850207e-04
cat("R-squared: " , summary(mdl1)$r.squared)
## R-squared: 0.3597989
Because am is a factor with auto as the reference level, the intercept is the mean MPG of cars with auto transmissions, 17.147, and the ammanual coefficient shows that cars with manual transmissions average 7.245 MPG more than auto. The \(R^2\) is only 0.3598, which means the model explains only about 36% of the variance.
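The reading of the two coefficients can be checked directly against the group means. The sketch below is illustrative only; `group_means` and `coefs` are names introduced here, not part of the original analysis.

```r
# Rebuild the data frame so this block runs on its own
df <- mtcars
df$am <- factor(df$am, labels = c("auto", "manual"))

# Mean mpg within each transmission group
group_means <- tapply(df$mpg, df$am, mean)

# With auto as the reference level, the intercept equals the auto mean
# and the ammanual coefficient equals the difference in group means
coefs <- coef(lm(mpg ~ am, data = df))
all.equal(coefs[["(Intercept)"]], group_means[["auto"]])
all.equal(coefs[["ammanual"]],
          group_means[["manual"]] - group_means[["auto"]])
```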
Next, run a regression that includes all the other variables, and explore those variables with the pairs plot (in the Appendix). Together these suggest that we may also want to include “qsec” and “wt” in our model.
mdl <- lm(mpg~., df)
summary(mdl)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337416 18.71788443 0.6573058 0.51812440
## cyl -0.11144048 1.04502336 -0.1066392 0.91608738
## disp 0.01333524 0.01785750 0.7467585 0.46348865
## hp -0.02148212 0.02176858 -0.9868407 0.33495531
## drat 0.78711097 1.63537307 0.4813036 0.63527790
## wt -3.71530393 1.89441430 -1.9611887 0.06325215
## qsec 0.82104075 0.73084480 1.1234133 0.27394127
## vs 0.31776281 2.10450861 0.1509915 0.88142347
## ammanual 2.52022689 2.05665055 1.2254035 0.23398971
## gear 0.65541302 1.49325996 0.4389142 0.66520643
## carb -0.19941925 0.82875250 -0.2406258 0.81217871
mdl2 <- update(mdl1, mpg ~ am + qsec)
mdl3 <- update(mdl2, mpg ~ am + qsec + wt)
anova(mdl1, mdl2, mdl3)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + qsec
## Model 3: mpg ~ am + qsec + wt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 352.63 1 368.26 60.911 1.679e-08 ***
## 3 28 169.29 1 183.35 30.326 6.953e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding qsec and then wt each significantly improves the fit. In the residual plots of Model 3 (shown in the Appendix), the residuals appear approximately normally distributed and homoskedastic.
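The visual check of the residuals can be supplemented with a formal test. The sketch below uses the Shapiro-Wilk test from base R; it is an addition for illustration, not a step of the original analysis, and `sw` is a name introduced here.

```r
# Refit Model 3 so this block runs on its own
df <- mtcars
df$am <- factor(df$am, labels = c("auto", "manual"))
mdl3 <- lm(mpg ~ am + qsec + wt, data = df)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a large p-value gives no evidence against normality
sw <- shapiro.test(residuals(mdl3))
sw$p.value
```

With only 32 observations the test has limited power, so it is best read alongside the Q-Q plot rather than in place of it.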
summary(mdl3)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## ammanual 2.935837 1.4109045 2.080819 4.671551e-02
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
cat("R-squared: " , summary(mdl3)$r.squared)
## R-squared: 0.8496636
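Written out, the fitted Model 3 from the coefficient table above is as follows, with coefficients rounded to three decimals. The indicator \(\text{am}_{\text{manual}}\) is notation introduced here: it equals 1 for a manual transmission and 0 for an auto transmission.

\[
\widehat{\text{mpg}} = 9.618 + 2.936\,\text{am}_{\text{manual}} + 1.226\,\text{qsec} - 3.917\,\text{wt}
\]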
Thus, I select Model 3, which explains 85% of the variance. Including the other regressors changes the coefficient of “am”, which drops from 7.245 to 2.936. Holding qsec and wt fixed, cars with manual transmission get 2.94 MPG more than cars with auto transmission.
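To convey the uncertainty around the 2.94 estimate, a confidence interval for the coefficient can be computed with `confint`. This is a minimal sketch; `ci_am` is a name introduced here for illustration.

```r
# Refit Model 3 so this block runs on its own
df <- mtcars
df$am <- factor(df$am, labels = c("auto", "manual"))
mdl3 <- lm(mpg ~ am + qsec + wt, data = df)

# 95% confidence interval for the manual-vs-auto difference,
# holding qsec and wt fixed
ci_am <- confint(mdl3, "ammanual", level = 0.95)
ci_am
```

Because the p-value of ammanual (0.0467) is below 0.05, this interval excludes zero, but only narrowly, so the size of the transmission effect is estimated with considerable uncertainty.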
Appendix: boxplot of MPG by transmission.
library(ggplot2)  # provides ggplot() and geom_boxplot()
g <- ggplot(df, aes(x=am,y=mpg))
g+geom_boxplot() + xlab("Transmission") + ylab("Miles/(US) gallon") +
ggtitle("MPG by Auto and Manual Transmission")
Pairs plot of regressors.
pairs(mpg ~ ., data = mtcars)
Residual plot of Model 3.
par(mfrow=c(2,2))
plot(mdl3)