You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

"Is an automatic or manual transmission better for MPG"
"Quantify the MPG difference between automatic and manual transmissions"

Introduction

Using hypothesis testing and simple linear regression, we can conclude that there is a significant difference between the mean MPG for automatic and manual transmission cars. To adjust for confounding variables such as the weight and quarter mile time (acceleration) of the car, a multivariate regression analysis was run to understand the impact of transmission type on MPG. The final model indicates that weight and, to a lesser degree, quarter mile time (acceleration) account for a large part of the MPG difference between automatic and manual transmission cars.

About The Data

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

Preprocessing

Load the required packages

library(knitr) 
library(ggplot2)

Reading the data

data(mtcars)
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

From the variable names we can see that several of the variables are categorical rather than continuous, so we convert them to factors.

mtcars<-transform(mtcars,vs=as.factor(vs))
mtcars<-transform(mtcars,cyl=as.factor(cyl))
mtcars<-transform(mtcars,am=factor(am))
mtcars<-transform(mtcars,carb=factor(carb))
mtcars<-transform(mtcars,gear=factor(gear))
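The same conversion can be written in a single step. The sketch below is only an illustration: the names `mt` and `factor_cols` are not part of the original analysis, and it works on a copy so the built-in data set is left unchanged.

```r
# Work on a copy (hypothetical name `mt`) so the built-in data set stays untouched
mt <- datasets::mtcars

# Columns that encode categories rather than measurements
factor_cols <- c("cyl", "vs", "am", "gear", "carb")

# Convert all of them to factors in one assignment
mt[factor_cols] <- lapply(mt[factor_cols], factor)

# Check the resulting column types
sapply(mt, class)
```

Listing the categorical columns once makes it easier to see which variables are treated as factors in the models that follow.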

Exploratory Data Analysis

To investigate the relationship between the variables in our data, we will use plotting and linear models to explore trends. Refer to APPENDIX 1 for the graph. From the graph, we can see that mpg (miles per gallon) is related to other factors, such as the weight and the number of cylinders, and not just to the type of transmission.

Comparing averages of the data.

# Mean MPG and 99% normal-approximation margin of error for manual cars (am == 1)
manMpg <- mtcars$mpg[mtcars$am == "1"]
meanMan <- mean(manMpg)
manError <- qnorm(0.995) * sd(manMpg) / sqrt(length(manMpg))

# Same quantities for automatic cars (am == 0)
autoMpg <- mtcars$mpg[mtcars$am == "0"]
meanAuto <- mean(autoMpg)
autoError <- qnorm(0.995) * sd(autoMpg) / sqrt(length(autoMpg))

# Collect means, margins of error and interval limits in one table
transmissionMean <- c(meanMan, meanAuto)
error <- c(manError, autoError)
lowerLimit <- transmissionMean - error
upperLimit <- transmissionMean + error
newData <- cbind(transmissionMean, error, lowerLimit, upperLimit)
row.names(newData) <- c("manual", "automatic")
newData
##           transmissionMean    error lowerLimit upperLimit
## manual            24.39231 4.405390   19.98692    28.7977
## automatic         17.14737 2.265628   14.88174    19.4130

Here we can see that the lower bound of the interval for the manual cars' mean MPG is 19.99, which is higher than the upper bound of 19.41 for the automatic cars. The intervals are built with qnorm(0.995), so they are at the 99% level. So it appears that manual cars are more efficient. However, this conclusion assumes that all other characteristics of automatic and manual cars are the same (e.g. that both groups have the same weight distribution), which needs to be explored further in the multiple linear regression analysis. Refer to APPENDIX 1 for the graph. Now let's strengthen the finding from our pairs plot that other parameters are responsible for part of the variation in the MPG of the cars.
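The introduction refers to hypothesis testing, but no test is shown in this section. One common choice for comparing two group means is a Welch two-sample t-test. The sketch below is an assumption about how such a test could be run, not the original author's code; `mt` and `tt` are illustrative names. With only 13 manual and 19 automatic cars, a t-based interval is somewhat wider than the normal-approximation interval used above.

```r
# Fresh copy of the data (hypothetical name `mt`); am is 0 = automatic, 1 = manual
mt <- datasets::mtcars

# Welch two-sample t-test of MPG by transmission type
tt <- t.test(mpg ~ am, data = mt)

tt$estimate   # mean MPG in each group
tt$conf.int   # 95% interval for (automatic mean - manual mean)
tt$p.value    # p-value for the null hypothesis of equal means
```

An interval for the difference that excludes zero points to the same conclusion as the non-overlapping intervals in the table.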

Simple Linear Regression

fit<-lm(mpg~am, data=mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am1            7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

We see that the R^2 value is 0.3598. This means that our model only explains 35.98% of the variance. We need to understand the impact of transmission in conjunction with other factors to quantify the mpg difference between automatic and manual transmission.
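With a single two-level factor as the regressor, the coefficients have a direct reading: the intercept is the mean MPG of the automatic cars and the am1 coefficient is the difference between the two group means. The short check below illustrates this; `mt`, `fit_simple` and `group_means` are illustrative names, not part of the original analysis.

```r
# Fresh copy with am as a factor (hypothetical name `mt`)
mt <- datasets::mtcars
mt$am <- factor(mt$am)

fit_simple <- lm(mpg ~ am, data = mt)

# Group means computed directly from the data
group_means <- tapply(mt$mpg, mt$am, mean)
diff_means <- as.numeric(group_means["1"] - group_means["0"])

coef(fit_simple)[["(Intercept)"]]  # equals the automatic group mean
coef(fit_simple)[["am1"]]          # equals manual mean minus automatic mean
diff_means
```

This is why the 7.245 estimate in the summary matches the gap between 24.39 and 17.15 in the table of averages.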

Multivariate Regression

We can now explore other linear models by including additional regressors.

fit1<-step(lm(mpg~., data=mtcars),trace=0, steps=10000)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## am1          1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

From this summary we can see that the stepwise search keeps cyl, hp, wt and am. Weight has the smallest p-value, followed by horsepower and the six-cylinder indicator, while am is no longer significant (p = 0.206) once these variables are included. qsec is not retained by the stepwise search; it is examined in the next model. The stars next to each coefficient mark statistical significance (a smaller p-value), not the size of the effect. The adjusted R^2 is 0.84, which means that the model explains about 84% of the variation in mpg after accounting for the number of predictors, a clear improvement over the simple regression.
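To see what the stepwise search did, the selected formula and the AIC of the full and reduced models can be compared. This is a minimal sketch, assuming the same factor conversions as in the preprocessing step; `mt`, `full` and `selected` are illustrative names.

```r
# Fresh copy with the categorical columns converted to factors
mt <- datasets::mtcars
for (col in c("cyl", "vs", "am", "gear", "carb")) {
  mt[[col]] <- factor(mt[[col]])
}

# Full model with every regressor, then AIC-based stepwise selection
full <- lm(mpg ~ ., data = mt)
selected <- step(full, trace = 0)

formula(selected)                       # terms kept by the search
kept <- attr(terms(selected), "term.labels")
c(full = AIC(full), selected = AIC(selected))
```

step() minimises AIC, so a term can stay in the selected model even when its individual p-value is above 0.05, as happens with am here.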

Adjusted Multivariate Regression

fit2<-lm(mpg~am+ wt+ hp + qsec, data=mtcars)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ am + wt + hp + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4975 -1.5902 -0.1122  1.1795  4.5404 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 17.44019    9.31887   1.871  0.07215 . 
## am1          2.92550    1.39715   2.094  0.04579 * 
## wt          -3.23810    0.88990  -3.639  0.00114 **
## hp          -0.01765    0.01415  -1.247  0.22309   
## qsec         0.81060    0.43887   1.847  0.07573 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.435 on 27 degrees of freedom
## Multiple R-squared:  0.8579, Adjusted R-squared:  0.8368 
## F-statistic: 40.74 on 4 and 27 DF,  p-value: 4.589e-11

This model captures 86% of the overall variation in mpg (multiple R^2 = 0.8579). With a p-value of 3.745e-09 for the comparison of the two models, we reject the null hypothesis that the additional regressors contribute nothing and conclude that our multivariate model is significantly different from our simple linear regression model.

Results Summary

The adjusted model explains 86% of the variance in miles per gallon (mpg). Moreover, we see that wt and qsec did indeed confound the relationship between am and mpg (mostly wt). Reading the coefficient for am, we say that, holding weight, horsepower and quarter mile time fixed, manual transmission cars get on average 2.93 MPG more than automatic transmission cars (p = 0.046). This effect is much smaller than the 7.25 MPG difference found when we did not adjust for weight, horsepower and qsec.
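The code for the model comparison is not shown in this section. A comparison of nested linear models is usually done with an F-test through anova(), so the sketch below assumes that approach. The names `mt`, `fit_simple`, `fit_adj` and `cmp` are illustrative.

```r
# Fresh copy with am as a factor (hypothetical name `mt`)
mt <- datasets::mtcars
mt$am <- factor(mt$am)

# Simple model and the adjusted model from the section above
fit_simple <- lm(mpg ~ am, data = mt)
fit_adj <- lm(mpg ~ am + wt + hp + qsec, data = mt)

# F-test: do wt, hp and qsec jointly improve on the simple model?
cmp <- anova(fit_simple, fit_adj)
cmp

# 95% confidence interval for the adjusted transmission effect
confint(fit_adj)["am1", ]
```

The confidence interval for am1 gives a range for the adjusted MPG difference, which is more informative than the point estimate alone.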

APPENDIX 1
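The figure for this appendix is not included in this copy of the report. The text describes it as a pairs plot, so a plot of that kind could be drawn as sketched below. The choice of variables and the title are assumptions, and `mt` and `plot_vars` are illustrative names.

```r
# Fresh copy of the data (hypothetical name `mt`)
mt <- datasets::mtcars

# Variables discussed in the report; the exact selection is an assumption
plot_vars <- c("mpg", "cyl", "wt", "hp", "qsec", "am")

# Scatterplot matrix with a smoothed trend line in each panel
pairs(mt[, plot_vars], panel = panel.smooth,
      main = "Pairwise relationships in mtcars")
```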

APPENDIX 2

boxplot(mpg~am, data = mtcars,
        col = c("dark grey", "light grey"),
        xlab = "Transmission",
        ylab = "Miles per Gallon",
        main = "MPG by Transmission Type")

APPENDIX 3

Correlation test

data(mtcars)
cor(mtcars)[,1]
##        mpg        cyl       disp         hp       drat         wt 
##  1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594 
##       qsec         vs         am       gear       carb 
##  0.4186840  0.6640389  0.5998324  0.4802848 -0.5509251

From the correlations, we can see that cyl, disp, hp, wt and carb are negatively correlated with mpg. In addition to am (which must be included in our regression model because it is the variable of interest), we see that wt, cyl, disp, and hp are highly correlated with our dependent variable mpg.
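Sorting the correlations by absolute value makes the strongest candidates easier to read off. The names `mt`, `cors` and `ranked` below are illustrative.

```r
# Untransformed data, since cor() needs numeric columns
mt <- datasets::mtcars

# Correlation of every variable with mpg, dropping mpg itself
cors <- cor(mt)[, "mpg"]
cors <- cors[names(cors) != "mpg"]

# Strongest relationships first, regardless of sign
ranked <- cors[order(abs(cors), decreasing = TRUE)]
ranked
```

A high pairwise correlation with mpg only suggests a candidate regressor. Several of these variables are also correlated with each other, so not all of them need to appear in the final model.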