Executive Summary

This report analyzes the mtcars data set to explore how transmission type affects miles per gallon (MPG). It focuses on whether an automatic or a manual transmission is better for MPG, and it quantifies the MPG difference between the two.

The selected model estimates that, holding weight (wt) and quarter-mile time (qsec) constant, cars with manual transmission get about 2.94 MPG more than cars with automatic transmission.

Exploratory Data Analysis

First, take a glimpse at the cars data set.

df <- mtcars
df$am <- factor(df$am, labels=c("auto","manual"))
head(df,3)
##                mpg cyl disp  hp drat    wt  qsec vs     am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0 manual    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0 manual    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1 manual    4    1

A boxplot of mpg grouped by am (shown in the Appendix) suggests that auto transmission cars have lower mpg. Run a two-sample t-test (Welch's test, the default for t.test) to verify whether their mean mpg is significantly lower than that of manual transmission cars. The alternative hypothesis is “auto transmission cars have a lower mean mpg than manual cars”.

ttest <- t.test(df$mpg[df$am=="auto"], df$mpg[df$am=="manual"],alternative = "less")
ttest$p.value
## [1] 0.0006868192

The p-value is 0.0006868, well below 0.05, so the mean mpg of auto transmission cars is significantly lower than that of manual transmission cars.
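The same test can also be written with the formula interface of t.test, which some readers find easier to follow. This is only a sketch of an equivalent call, not the code used to produce the report; the object name `welch` is introduced here for illustration. Because “auto” is the first factor level, the alternative "less" still means that the auto mean is lower than the manual mean.

```r
# Rebuild the data frame so the example is self-contained.
df <- mtcars
df$am <- factor(df$am, labels = c("auto", "manual"))

# Formula interface: mpg split by the two levels of am.
# The difference tested is (first level) - (second level), i.e. auto - manual.
welch <- t.test(mpg ~ am, data = df, alternative = "less")

welch$estimate  # the two group means
welch$p.value   # one-sided p-value
```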

Simple Linear Regression

mdl1 <- lm(mpg~am, df)
summary(mdl1)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## ammanual     7.244939   1.764422  4.106127 2.850207e-04
cat("R-squared: " , summary(mdl1)$r.squared)
## R-squared:  0.3597989

Because transmission is a factor with “auto” as the reference level, the intercept is the mean MPG of cars with auto transmissions, 17.147, and the mean MPG of cars with manual transmissions is 7.245 higher. The \(R^2\) is only 0.3598, which means the model explains only 36% of the variance.
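The reading of these two coefficients as a group mean and a difference of group means can be checked directly against the data. The following is a minimal sketch; the name `means` is introduced here for illustration.

```r
# Rebuild the data frame and the simple model so the example is self-contained.
df <- mtcars
df$am <- factor(df$am, labels = c("auto", "manual"))
mdl1 <- lm(mpg ~ am, data = df)

# Mean mpg within each transmission group.
means <- tapply(df$mpg, df$am, mean)

# The intercept should match the auto mean, and the ammanual
# coefficient should match the manual-minus-auto difference.
all.equal(unname(coef(mdl1)["(Intercept)"]), unname(means["auto"]))
all.equal(unname(coef(mdl1)["ammanual"]), unname(means["manual"] - means["auto"]))
```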

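Then run a regression including all other variables, and explore the variables with the pairs plot (in the Appendix). None of the individual coefficients in the full model is significant at the 0.05 level, so a smaller model is worth considering. From the pairs plot, we may also want to include “qsec” and “wt” in our model.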

mdl <- lm(mpg~., df)
summary(mdl)$coef
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## vs           0.31776281  2.10450861  0.1509915 0.88142347
## ammanual     2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871
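One possible cross-check on the choice of regressors is backward stepwise selection by AIC with step(). This is a different technique from the pairs-plot inspection used in this report, and it is shown only as a sketch; the names `full_mdl` and `step_mdl` are introduced here for illustration.

```r
# Rebuild the data frame so the example is self-contained.
df <- mtcars
df$am <- factor(df$am, labels = c("auto", "manual"))

# Start from the full model and drop terms while AIC keeps improving.
full_mdl <- lm(mpg ~ ., data = df)
step_mdl <- step(full_mdl, direction = "backward", trace = 0)

# Inspect which regressors survive the selection.
formula(step_mdl)
```

Stepwise selection is data-driven and can overfit, so its result is best read as supporting evidence rather than as a replacement for the nested model test.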

Nested Model Test

mdl2 <- update(mdl1, mpg ~ am + qsec)
mdl3 <- update(mdl2, mpg ~ am + qsec + wt)
anova(mdl1, mdl2, mdl3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + qsec
## Model 3: mpg ~ am + qsec + wt
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     29 352.63  1    368.26 60.911 1.679e-08 ***
## 3     28 169.29  1    183.35 30.326 6.953e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The nested model test shows that adding qsec and then wt each significantly improves the fit. The residuals of Model 3 (plots shown in the Appendix) appear approximately normally distributed and homoskedastic.
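The visual check of the residuals could be complemented with a formal test. The sketch below applies the Shapiro-Wilk normality test to the residuals of Model 3; this test is an addition for illustration and was not part of the original analysis. A large p-value is consistent with normality but does not prove it, and the test says nothing about homoskedasticity.

```r
# Rebuild the data frame and Model 3 so the example is self-contained.
df <- mtcars
df$am <- factor(df$am, labels = c("auto", "manual"))
mdl3 <- lm(mpg ~ am + qsec + wt, data = df)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal.
sw <- shapiro.test(residuals(mdl3))
sw$p.value
```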

summary(mdl3)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## ammanual     2.935837  1.4109045  2.080819 4.671551e-02
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
cat("R-squared: " , summary(mdl3)$r.squared)
## R-squared:  0.8496636

Thus, I select Model 3, which explains 85% of the variance. Including the other regressors changes the coefficient of “am”: it drops from 7.245 to 2.936. Holding qsec and wt constant, cars with manual transmission are estimated to get 2.94 MPG more than cars with auto transmission.
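To convey the uncertainty around the 2.94 estimate, a 95% confidence interval for the coefficients of Model 3 can be computed with confint(). This is a minimal sketch; the name `ci` is introduced here for illustration.

```r
# Rebuild the data frame and Model 3 so the example is self-contained.
df <- mtcars
df$am <- factor(df$am, labels = c("auto", "manual"))
mdl3 <- lm(mpg ~ am + qsec + wt, data = df)

# 95% confidence intervals for all coefficients.
ci <- confint(mdl3, level = 0.95)

# Interval for the manual-versus-auto difference.
ci["ammanual", ]
```

Since the p-value of ammanual is just under 0.05, the lower bound of this interval lies only slightly above zero, so the size of the transmission effect is estimated with considerable uncertainty.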

Appendix

library(ggplot2)  # provides ggplot() and geom_boxplot()
g <- ggplot(df, aes(x = am, y = mpg))
g + geom_boxplot() + xlab("Transmission") + ylab("Miles/(US) gallon") +
    ggtitle("MPG by Auto and Manual Transmission")

Pair plot of regressors.

pairs(mpg ~ ., data = mtcars)

Residual plot of Model 3.

par(mfrow=c(2,2))
plot(mdl3)