Gas Mileage of Automatic and Manual Automobiles

Author: Murtuza Ali Lakhani
Date: October 23, 2015

1. Executive Summary

This report analyzes the mtcars data set to determine how a set of vehicle characteristics relates to miles per gallon (mpg) and to answer two research questions: 1) Is an automatic or a manual transmission better for MPG? 2) What is the MPG difference between automatic and manual transmissions? Stepwise regression retains transmission type (am) together with two other predictors, weight (wt) and quarter-mile time (qsec). The resulting model explains 84.97 percent of the variation in mpg (R-squared = 0.8497), and the p-value of its overall F-test (1.21e-11) indicates that the model is statistically significant. The coefficient for transmission type is 2.9358 (p = 0.0467): holding wt and qsec constant, a change from automatic to manual transmission is associated with an increase of 2.9358 mpg. We therefore conclude that, for the cars in this data set, manual transmission yields better gas mileage than automatic transmission, and that the adjusted difference is 2.9358 miles per gallon.

2. Getting and Exploring Data

In this part of the project, we focus on reading in the data and performing exploratory analysis.

2.1. Get and process the data

Load the mtcars data set and inspect the variable names and types.

data(mtcars)
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
class(mtcars$am)
## [1] "numeric"

2.2. Recode the transmission variable

We convert the "am" predictor from numeric (0/1) to a factor with two levels, labeled "Automatic" and "Manual" for better readability.

# In mtcars, am is coded 0 = automatic, 1 = manual; map the codes to labels explicitly
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
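
As a quick sanity check on the recoding, the group sizes can be tabulated. The sketch below is illustrative only: it works on a local copy named `cars` (a hypothetical name, not part of the original analysis) so that the session's `mtcars` object is left untouched.

```r
# Work on a copy so the session's mtcars object is not modified.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Count the cars in each transmission group.
am_counts <- table(cars$am)
print(am_counts)
```

The two counts should add up to the 32 cars in the data set.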

2.3. Perform exploratory testing

The first check is a boxplot of gas mileage for cars with automatic and manual transmissions. The boxplot in Appendix I - Part 1 shows that manual cars have better gas mileage, but this observation alone is insufficient for a conclusion because 1) the variability within each group has not yet been taken into account, and 2) relationships among the study variables have not been taken into account.
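
The within-group variability mentioned above can be summarized numerically alongside the boxplot. This is a minimal sketch, assuming a local copy named `cars` (hypothetical); `group_mean` and `group_sd` are likewise illustrative names.

```r
# Work on a copy so the session's mtcars object is not modified.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean and standard deviation of mpg within each transmission group.
group_mean <- tapply(cars$mpg, cars$am, mean)
group_sd   <- tapply(cars$mpg, cars$am, sd)
print(round(cbind(mean = group_mean, sd = group_sd), 2))
```

The group means should match the sample estimates reported by the t-test in Appendix I - Part 2.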

The second check is a t-test of whether the difference in gas mileage between the transmission types is significant. The results are included in Appendix I - Part 2. The low p-value (0.001374) indicates a significant difference in mean gas mileage between the transmission types (17.15 mpg for automatic versus 24.39 mpg for manual), but even this is insufficient to conclude with confidence. A two-sample t-test compares the groups without adjusting for any other study variable; it is informative only if the other variables affect automatic and manual cars equally, which may or may not be correct. This brings us to the third check.
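
The same test can be written with the formula interface of `t.test()`, which makes the unadjusted difference in means easy to extract. The sketch below assumes a local copy named `cars` (hypothetical); `welch` and `mean_diff` are illustrative names.

```r
# Work on a copy so the session's mtcars object is not modified.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Welch two-sample t-test (unequal variances is the default in t.test()).
welch <- t.test(mpg ~ am, data = cars)

# Unadjusted difference in group means, manual minus automatic.
mean_diff <- unname(welch$estimate[2] - welch$estimate[1])
print(mean_diff)

# 95 percent confidence interval, reported as automatic minus manual.
print(welch$conf.int)
```

Note that this raw difference does not control for any other variable, which is the limitation discussed above.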

For the third check we perform a correlation analysis to understand how the variables in the data set relate to gas mileage. The results, included in Appendix I - Part 3, show that every other variable is correlated with mpg: the correlations range in magnitude from 0.42 (qsec) to 0.87 (wt), and most exceed 0.5. This suggests the need for multiple regression analysis, so that the other variables are accounted for when the effect of transmission type is estimated.
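
One way to see why the raw comparison may be confounded is to look at how transmission type itself correlates with other candidate predictors. The following is a sketch only; the copy `cars`, the name `am_cor`, and the particular subset of predictors shown are illustrative choices, not part of the original analysis.

```r
# Use the raw data: am must stay numeric (0/1) for cor() to accept it.
cars <- datasets::mtcars

# Correlation of transmission type with a few other candidate predictors.
am_cor <- cor(cars)["am", c("wt", "qsec", "hp", "disp")]
print(round(am_cor, 2))
```

Predictors that correlate with both am and mpg are the ones a multiple regression needs to adjust for.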

2.4. Check the assumptions

Inference from a regression model relies on approximately normal errors. As a screening step, we check the multivariate normality of the data set using the mardiaTest function from the MVN package. The results are included in Appendix II - Part 1. The outcome shows that the data are not multivariate normal. Appendix II - Part 2 shows the distribution properties of the dependent variable, mpg. A Shapiro-Wilk normality test indicates that mpg is not quite normal, but is close to normal. For the purposes of this exercise we ignore these deviations in the given data and revisit normality through the model residuals in Section 3.
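
The Shapiro-Wilk check on mpg referred to above can be reproduced with base R. This is a minimal sketch, assuming a local copy named `cars` (hypothetical); `sw_mpg` is an illustrative name.

```r
# Work on a copy so the session's mtcars object is not modified.
cars <- datasets::mtcars

# Shapiro-Wilk test; the null hypothesis is that mpg is normally distributed.
sw_mpg <- shapiro.test(cars$mpg)
print(sw_mpg)
```

A W statistic close to 1 indicates a distribution close to normal.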

3. Regression Analysis

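Since this case involves ten candidate independent variables, we perform stepwise regression with the step function. Starting from the full model, it removes predictors one at a time for as long as doing so lowers the Akaike information criterion (AIC), and it returns the model with the lowest AIC found.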

# Start from the full model; trace = 0 suppresses the step-by-step log
stepwise <- step(lm(mpg ~ ., data = mtcars), trace = 0, steps = 20000)
summary(stepwise)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amManual      2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
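
As a cross-check on the selection, the AIC of the full model can be compared with that of the selected model. The sketch below refits both models on a local copy named `cars` (hypothetical); `full_fit`, `selected_fit`, and `aic_table` are illustrative names.

```r
# Work on a copy so the session's mtcars object is not modified.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Full model with every predictor, and the model chosen by step().
full_fit     <- lm(mpg ~ ., data = cars)
selected_fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Lower AIC indicates a better trade-off between fit and complexity.
aic_table <- AIC(full_fit, selected_fit)
print(aic_table)
```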

Stepwise selection retains transmission type together with two other predictors, wt and qsec, all of which have a significant influence on the dependent variable, mpg. The combined model explains 84.97 percent of the variation in mpg (adjusted R-squared: 0.8336), and the low p-value of the overall F-test (1.21e-11) indicates that the model is significant. The coefficient of 2.9358 for transmission type (standard error 1.4109, p = 0.0467) quantifies its influence: with wt and qsec held constant, changing the transmission from automatic to manual increases the expected mpg by 2.9358.
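
The uncertainty around the transmission coefficient can be expressed as a confidence interval. This is a sketch under the assumption that the selected model is refitted on a local copy named `cars` (hypothetical); `fit` and `am_ci` are illustrative names.

```r
# Work on a copy so the session's mtcars object is not modified.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Refit the model selected by the stepwise procedure.
fit <- lm(mpg ~ wt + qsec + am, data = cars)

# 95 percent confidence interval for the manual-vs-automatic coefficient.
am_ci <- confint(fit, "amManual", level = 0.95)
print(am_ci)
```

Because the coefficient's p-value is just below 0.05, the lower bound of this interval lies only slightly above zero.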

As a final check, we inspect the residuals for signs of non-normality and examine the residuals vs. fitted values plot for heteroscedasticity. The diagnostic plots (Appendix III) show approximately normal residuals and no evidence of heteroscedasticity.
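
The visual diagnostics can be complemented with simple numeric checks. The sketch below is one possible approach, not the method used in the original analysis: it applies a Shapiro-Wilk test to the residuals and, as a rough indicator of non-constant variance, correlates the absolute residuals with the fitted values. The names `cars`, `fit`, `res`, `sw_res`, and `spread_cor` are hypothetical.

```r
# Work on a copy so the session's mtcars object is not modified.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ wt + qsec + am, data = cars)
res <- residuals(fit)

# Normality of the residuals (null hypothesis: residuals are normal).
sw_res <- shapiro.test(res)
print(sw_res)

# Rough heteroscedasticity indicator: do residual magnitudes grow with the fit?
spread_cor <- cor(abs(res), fitted(fit))
print(spread_cor)
```

A spread correlation near zero is consistent with roughly constant residual variance; a formal test would be needed for a firmer statement.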

The above analysis supports the conclusion that manual transmission yields better gas mileage than automatic transmission in this data set, and that the difference, adjusted for weight and quarter-mile time, is 2.9358 miles per gallon.


Appendix I

Part 1 - Boxplot of gas mileage (mpg) in relation to transmission type (am)

boxplot(mpg~am, data = mtcars,
        col = c("light blue", "light green"),
        xlab = "Transmission Type (am)",
        ylab = "Miles per Gallon (mpg)",
        main = "Miles per Gallon by Transmission Type")

Part 2 - Welch two-sample t-test on transmission type in relation to gas mileage (mpg)

automatic <- mtcars[mtcars$am == "Automatic",]
manual <- mtcars[mtcars$am == "Manual",]
t.test(automatic$mpg, manual$mpg)
## 
##  Welch Two Sample t-test
## 
## data:  automatic$mpg and manual$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

Part 3 - Correlations across the study variables in relation to miles per gallon

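# Reload the original data so that am is numeric again; cor() requires numeric columns
data(mtcars)
# First row of the correlation matrix: correlation of each variable with mpg
sort(cor(mtcars)[1,])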
##         wt        cyl       disp         hp       carb       qsec 
## -0.8676594 -0.8521620 -0.8475514 -0.7761684 -0.5509251  0.4186840 
##       gear         am         vs       drat        mpg 
##  0.4802848  0.5998324  0.6640389  0.6811719  1.0000000

Appendix II

Part 1 - Multivariate normality check for the data set (mtcars)

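# mardiaTest() is provided by older releases of the MVN package;
# later releases expose Mardia's test through mvn() instead
library(MVN)
mardiaTest(mtcars, qqplot = FALSE)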
##    Mardia's Multivariate Normality Test 
## --------------------------------------- 
##    data : mtcars 
## 
##    g1p            : 73.98927 
##    chi.skew       : 394.6095 
##    p.value.skew   : 2.114446e-05 
## 
##    g2p            : 143.229 
##    z.kurtosis     : 0.038304 
##    p.value.kurt   : 0.9694453 
## 
##    chi.small.skew : 438.2442 
##    p.value.small  : 1.687389e-08 
## 
##    Result          : Data are not multivariate normal. 
## ---------------------------------------

Part 2 - The distribution properties of the dependent variable, mpg

x <- mtcars$mpg

# Histogram of mpg with a fitted normal curve overlaid
h <- hist(x, breaks = 10, col = "green", xlab = "Miles per Gallon (mpg)",
          main = "Distribution Properties of Miles per Gallon (mpg)")

xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
# Rescale the density to the histogram's count scale (bin width * n)
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "red", lwd = 3)

# Kernel density plot
d <- density(x)
plot(d, xlab = "Miles per Gallon (mpg)", main = "Density Plot for Miles per Gallon (mpg)")


Appendix III

Examination of the residuals vs. fitted values plot for heteroscedasticity

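# Arrange the four standard lm diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
# Refit the model selected in Section 3
finalfit <- lm(mpg ~ am + wt + qsec, data = mtcars)
plot(finalfit)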