Executive Summary

This project uses the mtcars data set, a collection of cars, to explore the relationship between a set of variables and miles per gallon (MPG). In particular, we address the following two questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions.

We begin with exploratory analysis, followed by hypothesis testing and regression analysis. The study shows that MPG depends not only on transmission type but also on other variables.

Exploratory analysis

library(dplyr)
library(datasets)
library(ggplot2)
library(GGally)
newcars <- mtcars
head(newcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
cor(newcars$mpg, newcars[-1])
##            cyl       disp         hp      drat         wt     qsec
## [1,] -0.852162 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.418684
##             vs        am      gear       carb
## [1,] 0.6640389 0.5998324 0.4802848 -0.5509251

We are interested in the relationship between mpg and the other variables, in particular transmission (am). The correlation between mpg and am is positive (about 0.60); since am is coded 0 for automatic and 1 for manual, manual cars tend to have higher mpg. We can also check the average mpg of automatic and manual cars. First, we convert the “am” variable from numeric to factor.

newcars$am <- factor(newcars$am, labels = c("Automatic", "Manual"))
newcars %>% group_by(am) %>% summarise(mean(mpg))
## # A tibble: 2 x 2
##          am `mean(mpg)`
##      <fctr>       <dbl>
## 1 Automatic    17.14737
## 2    Manual    24.39231

We can see that the mean mpg is higher for manual cars than for automatic cars, so manual transmission appears to be better for mpg. A related box plot can be found in the Appendix.

Hypothesis test:

t.test(newcars$mpg ~ newcars$am)
## 
##  Welch Two Sample t-test
## 
## data:  newcars$mpg by newcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

The p-value from the t-test is 0.001374, which is less than 0.05. Hence, the difference in mean mpg between the two transmission types is statistically significant, and we can reject the null hypothesis that there is no difference between automatic and manual transmissions. However, this test does not take into account other variables that might affect mpg.

Quantify the MPG difference by linear model:

We begin with a linear regression of mpg on transmission type (am).

model1 <- lm(mpg ~ am, data = newcars)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ am, data = newcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

This model considers only the effect of transmission type and explains 36% of the variance. It gives the same information as before: the mean mpg for automatic transmission is 17.147, and the mean for manual transmission is 7.245 higher.
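As a quick check of this reading, the sketch below refits the same model and compares its coefficients with group means computed directly. It rebuilds newcars so that it runs on its own; the names mean_auto and mean_manual are introduced here for illustration only and are not part of the original analysis.

```r
# Rebuild the data as in the report so this block runs on its own
newcars <- mtcars
newcars$am <- factor(newcars$am, labels = c("Automatic", "Manual"))

model1 <- lm(mpg ~ am, data = newcars)

# Group means computed directly (illustrative helper names)
mean_auto   <- mean(newcars$mpg[newcars$am == "Automatic"])
mean_manual <- mean(newcars$mpg[newcars$am == "Manual"])

b <- coef(model1)

# Intercept equals the mean of the reference level (Automatic)
isTRUE(all.equal(b[["(Intercept)"]], mean_auto))

# Slope equals the Manual mean minus the Automatic mean
isTRUE(all.equal(b[["amManual"]], mean_manual - mean_auto))
```

Both comparisons should hold because, with a single two-level factor as the only predictor, least squares fits each group by its own mean.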

Now, we fit a multiple linear regression that takes all variables into consideration.

model2 <- lm(mpg ~ ., data = newcars)
summary(model2)
## 
## Call:
## lm(formula = mpg ~ ., data = newcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## amManual     2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

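None of the coefficients in the model using all variables is significant at the 0.05 level, which is common when predictors are strongly correlated with one another. The coefficients with the smallest p-values are wt, am and qsec. We therefore build a third model, using the step function (which selects variables by AIC) to choose the variables for the best model.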

model3 <- step(lm(mpg ~ ., data = newcars), trace = 0, steps = 10)
summary(model3)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = newcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amManual      2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

This model contains three variables and explains about 85% of the variance (adjusted R-squared 0.8336). These are the same three variables singled out from the previous model.
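One way to check that the extra variables earn their place is to compare the models directly. The sketch below is not part of the original analysis: it refits the transmission-only model and the selected model (written out explicitly with the formula reported above, rather than rerunning step), then compares them with AIC() and a nested-model F-test via anova(). The names aic_tab and cmp are illustrative.

```r
# Rebuild the data as in the report so this block runs on its own
newcars <- mtcars
newcars$am <- factor(newcars$am, labels = c("Automatic", "Manual"))

model1 <- lm(mpg ~ am, data = newcars)              # transmission only
model2 <- lm(mpg ~ ., data = newcars)               # all variables
model3 <- lm(mpg ~ wt + qsec + am, data = newcars)  # formula chosen by step

# Lower AIC indicates a better trade-off between fit and complexity
aic_tab <- AIC(model1, model2, model3)
aic_tab

# F-test for nested models: does adding wt and qsec to model1
# reduce the residual sum of squares by more than chance would?
cmp <- anova(model1, model3)
cmp
```

A small p-value in the second row of the anova table would indicate that wt and qsec add explanatory power beyond transmission type alone.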

Conclusion:

From the above analysis, we can conclude that

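  1. MPG depends mostly on three variables, namely transmission type, weight (wt) and quarter-mile time (qsec).
  2. Manual transmission gives better mpg than automatic transmission by about 2.94 mpg, holding weight and qsec constant.
  3. Each additional 1000 lbs of weight is associated with a decrease of about 3.92 mpg, holding the other variables constant.
  4. qsec varies positively with mpg: each additional second is associated with an increase of about 1.23 mpg.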

So the question of whether manual or automatic transmission is better for mpg is not that straightforward. The mpg performance depends on all three of these variables.
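The point estimates above come with uncertainty. A minimal sketch, assuming the same three-variable model as model3, of how 95% confidence intervals for the coefficients could be obtained with confint(); the name ci is illustrative and this step is not part of the original analysis.

```r
# Rebuild the data and the selected model so this block runs on its own
newcars <- mtcars
newcars$am <- factor(newcars$am, labels = c("Automatic", "Manual"))
model3 <- lm(mpg ~ wt + qsec + am, data = newcars)

# 95% confidence intervals for all coefficients
ci <- confint(model3, level = 0.95)
ci

# Interval for the manual-vs-automatic difference,
# adjusted for weight and quarter-mile time
ci["amManual", ]
```

An interval for amManual whose lower bound sits close to zero would be consistent with the p-value of 0.0467 reported for that coefficient, which is only just below 0.05.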

Appendix

Box plot of transmission types

g <- ggplot(newcars, aes(x = am, y = mpg, color = am))
g <- g + geom_boxplot()
g <- g + labs(x = "Transmission Type (am)", y = "Miles per gallon (MPG)")
g

Correlation plot

newcars1 <- subset(mtcars, select = c("mpg", "cyl", "disp", 
                                      "hp", "wt", "qsec", "am"))
g1 <- ggpairs(newcars1, lower = list(continuous = "smooth"))
g1

Diagnostic plots

par(mfrow = c(2,2))
plot(model3)