Problem statement

Motor Trend is a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

  1. “Is an automatic or manual transmission better for MPG?”
  2. “Quantify the MPG difference between automatic and manual transmissions”

Executive Summary

We use hypothesis testing and multivariate regression to analyze the relationship between miles per gallon (MPG) and other variables, including the type of transmission (automatic or manual).

We conclude that manual transmission is better for MPG than automatic transmission: about 2.94 MPG higher on average once weight and quarter-mile time are accounted for.

The other variables in the final model are weight and quarter-mile time (a measure of acceleration). Both have a significant effect on MPG, so they must be included when quantifying the difference in MPG between automatic and manual transmission cars.

We load the mtcars data set and look at its column names and a summary of their contents.

data(mtcars)
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

The predictor variable, am, is stored as a numeric class. Since it is a dichotomous variable (0 = automatic, 1 = manual), let’s convert it to a factor and label the levels as Automatic and Manual.

mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
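As a quick sanity check on the coding, the sketch below builds the same factor on a copy of the data with factor(), naming the 0/1 codes explicitly, and counts the cars in each group. The names mt, am_f and am_counts are illustrative and not part of the original analysis.

```r
# Work on a copy so the analysis data frame is left untouched
mt <- datasets::mtcars

# Map the 0/1 codes to labels explicitly instead of relying on level order
am_f <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Number of cars per transmission type
am_counts <- table(am_f)
print(am_counts)
```

Naming the levels and labels together makes the mapping visible in the code, which is a little safer than assigning to levels() and trusting the order.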

Exploratory Data Analysis

We begin the exploratory data analysis by looking at the pairwise scatter plots between all variables. (Plots are shown in Appendix 1.)

Before modeling our variable of interest MPG, we need to check if it follows a Normal distribution, whether there are any outliers, etc.

par(mfrow = c(1, 2))
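# Histogram of MPG with a Normal curve overlaid to judge Normality
x <- mtcars$mpg
h <- hist(x, breaks = 10, col = "green", xlab = "Miles Per Gallon",
          main = "Histogram of Miles per Gallon")
# Normal density on a grid, using the sample mean and standard deviation
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
# Rescale from density to counts: bin width * number of observations
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "red", lwd = 3)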

# Kernel Density Plot
d <- density(mtcars$mpg)
plot(d, xlab = "MPG", main ="Density Plot for MPG")

From the histogram, MPG appears to follow an approximately Normal distribution, and we do not see any obvious outliers.
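A visual impression can be backed up with a formal check. The following is a minimal sketch, not part of the original analysis, that applies the Shapiro-Wilk test to mpg and lists any points flagged by the usual 1.5 * IQR boxplot rule. The variable names are illustrative.

```r
mpg <- datasets::mtcars$mpg

# Shapiro-Wilk test: the null hypothesis is that the sample is Normal,
# so a large p-value means no evidence against Normality
sw <- shapiro.test(mpg)
print(sw)

# Points beyond 1.5 * IQR from the hinges (the usual boxplot rule)
mpg_outliers <- boxplot.stats(mpg)$out
print(mpg_outliers)
```

With only 32 observations the test has limited power, so it is best read alongside the plots rather than instead of them.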

We can check how mpg varies by transmission type (automatic versus manual) using a boxplot.

boxplot(mpg~am, data = mtcars,
        col = c("dark green", "light green"),
        xlab = "Transmission",
        ylab = "Miles per Gallon",
        main = "MPG by Transmission Type")

From the boxplot we see that manual transmission cars get more miles per gallon than automatic cars. However, we can dig deeper to confirm.

Testing the Hypothesis

Null Hypothesis (H0):

There is no difference in mean miles per gallon (MPG) between automatic and manual transmission cars.

Alternate Hypothesis (Ha):

There is a difference in mean miles per gallon (MPG) between automatic and manual transmission cars.

aggregate(mpg~am, data = mtcars, mean)
##          am      mpg
## 1 Automatic 17.14737
## 2    Manual 24.39231

The mean MPG for manual transmission is 24.39231, whereas that for automatic transmission is 17.14737.
Thus the mean MPG of cars with manual transmission is 7.245 MPG higher than that of cars with automatic transmission. (We have not yet considered other confounding variables.)

We will run a t-test at a significance level (alpha) of 0.05 to find out whether the difference is significant.

autoTrans <- mtcars[mtcars$am == "Automatic",]
manualTrans <- mtcars[mtcars$am == "Manual",]
t.test(autoTrans$mpg, manualTrans$mpg)
## 
##  Welch Two Sample t-test
## 
## data:  autoTrans$mpg and manualTrans$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

The p-value is 0.001374, which is well below 0.05, hence we can reject the null hypothesis and conclude that there is a significant difference in the mean MPG between cars with manual transmission and cars with automatic transmission.
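To make the test less of a black box, the sketch below recomputes the Welch statistic, its degrees of freedom (the Welch-Satterthwaite approximation) and the two-sided p-value by hand. It works on a fresh copy of the data with am still coded 0/1, and all variable names are illustrative.

```r
mt <- datasets::mtcars
auto_mpg   <- mt$mpg[mt$am == 0]  # automatic
manual_mpg <- mt$mpg[mt$am == 1]  # manual

# Squared standard error of each group mean
se2_auto   <- var(auto_mpg) / length(auto_mpg)
se2_manual <- var(manual_mpg) / length(manual_mpg)

# Welch t statistic: difference in means over its standard error
t_stat <- (mean(auto_mpg) - mean(manual_mpg)) / sqrt(se2_auto + se2_manual)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto_mpg) - 1) +
     se2_manual^2 / (length(manual_mpg) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

print(c(t = t_stat, df = df_welch, p = p_value))
```

The three numbers should agree with the t, df and p-value lines of the t.test() output above, up to rounding.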

Now we need to quantify the difference as per the second problem statement.

Correlation between variables

Since we are interested in determining the relationship between mpg and the other variables, we first check the correlation between mpg and each of them using the cor() function.

# Reload the original data so that am is numeric again; cor() needs numeric columns
data(mtcars)
sort(cor(mtcars)[1,])
##         wt        cyl       disp         hp       carb       qsec 
## -0.8676594 -0.8521620 -0.8475514 -0.7761684 -0.5509251  0.4186840 
##       gear         am         vs       drat        mpg 
##  0.4802848  0.5998324  0.6640389  0.6811719  1.0000000

We see that our variable of interest, am, is positively correlated with the dependent variable mpg (a correlation of about 0.60).

Variables showing positive correlation with mpg, in descending order of strength, are drat, vs, am, gear and qsec.

Variables showing negative correlation with mpg, in descending order of strength, are wt, cyl, disp, hp and carb.
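The correlations above are all with mpg, so they do not show how transmission type relates to the other predictors. As a hedged sketch, the snippet below correlates am with the four variables most negatively correlated with mpg. A strong correlation here would suggest that part of the raw MPG gap may be attributable to those variables rather than to transmission itself. The names mt and am_vs_others are illustrative.

```r
mt <- datasets::mtcars

# Correlation of transmission (0 = automatic, 1 = manual) with the
# strongest negative correlates of mpg
am_vs_others <- cor(mt$am, mt[, c("wt", "cyl", "disp", "hp")])
print(round(am_vs_others, 3))
```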

Building the model

Simple Linear Regression

fit <- lm(mpg~am, data = mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Looking at the intercept and the slope, we can say that, on average, automatic cars get 17.147 MPG and manual transmission cars get 17.147 + 7.245 = 24.392 MPG. (We have not yet considered other confounding variables.)

In addition, we see that the R^2 value is 0.3598. This means that our model explains only 35.98% of the variance in mpg.
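Two identities explain why this regression adds nothing beyond the earlier summaries: with a single 0/1 predictor the slope is exactly the difference in group means, and R^2 is the squared correlation between mpg and am. A minimal sketch on a fresh copy of the data (variable names are illustrative):

```r
mt <- datasets::mtcars
simple_fit <- lm(mpg ~ am, data = mt)

# Slope for the 0/1 predictor vs. the raw difference in group means
slope <- unname(coef(simple_fit)["am"])
mean_diff <- mean(mt$mpg[mt$am == 1]) - mean(mt$mpg[mt$am == 0])

# R^2 of a one-predictor model vs. the squared correlation
r_squared <- summary(simple_fit)$r.squared
cor_squared <- cor(mt$mpg, mt$am)^2

print(c(slope = slope, mean_diff = mean_diff))
print(c(r_squared = r_squared, cor_squared = cor_squared))
```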

Multivariate Regression

# We use the stepwise algorithm (step(), which selects by AIC) to choose the best model

bestFitModel <- step(lm(mpg ~ ., data = mtcars), trace = 0, steps = 10000)

summary(bestFitModel)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

This shows that, apart from transmission, the weight of the vehicle and its acceleration (quarter-mile time) act as confounding variables in explaining the variation in mpg. The adjusted R^2 is 0.8336, which means that the model explains about 83% of the variation in mpg (85% unadjusted), a very good fit to these data.
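Since step() ranks candidate models by AIC, one way to see what the selection achieved is to compare the AIC of the full model with that of the selected one. This is a sketch on a fresh copy of the data; the model and variable names are illustrative.

```r
mt <- datasets::mtcars

full_model     <- lm(mpg ~ ., data = mt)               # all ten predictors
selected_model <- lm(mpg ~ wt + qsec + am, data = mt)  # model kept by step()

# Lower AIC indicates a better trade-off between fit and complexity
aic_values <- c(full = AIC(full_model), selected = AIC(selected_model))
print(aic_values)
```

AIC() and the criterion used internally by step() differ only by a constant for a given data set, so the ordering of the two models is the same under either.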

Model with Best Fit to quantify MPG difference between Automatic and Manual transmission

# As seen from the stepwise regression, we select the model with 3 variables (wt, qsec and am), which accounts for about 83% of the total variance (adjusted R^2).

bestFitModel <- lm(mpg~am + wt + qsec, data = mtcars)
anova(fit, bestFitModel)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + qsec
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     28 169.29  2    551.61 45.618 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is very small (1.55e-09), hence we reject the null hypothesis that wt and qsec add nothing to the model, and can say that our multivariate model fits significantly better than our simple linear regression model.
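The F statistic in the table can be reproduced from the two residual sums of squares: it is the drop in RSS per added term, divided by the residual variance of the larger model. A minimal sketch on a fresh copy of the data, with illustrative names:

```r
mt <- datasets::mtcars
small_model <- lm(mpg ~ am, data = mt)
large_model <- lm(mpg ~ am + wt + qsec, data = mt)

rss_small <- sum(resid(small_model)^2)  # residual sum of squares, 30 df
rss_large <- sum(resid(large_model)^2)  # residual sum of squares, 28 df

# Number of terms added when moving to the larger model
df_extra <- df.residual(small_model) - df.residual(large_model)

# F = (drop in RSS per added term) / (residual variance of the larger model)
f_stat <- ((rss_small - rss_large) / df_extra) /
  (rss_large / df.residual(large_model))
p_value_f <- pf(f_stat, df_extra, df.residual(large_model), lower.tail = FALSE)

print(c(F = f_stat, p = p_value_f))
```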

Before finalizing our model, it is important to check the residuals for any signs of non-normality and examine the residuals vs. fitted values plot for heteroskedasticity.

This check and the relevant plots are shown in Appendix 2.

The residual diagnostics are consistent with normality and show no evidence of heteroskedasticity.
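The plots can be complemented by a numerical check. As a hedged sketch that is not part of the original analysis, the snippet below runs a Shapiro-Wilk test on the residuals of the final model. A large p-value would be consistent with Normal residuals. The names are illustrative.

```r
mt <- datasets::mtcars
final_model <- lm(mpg ~ am + wt + qsec, data = mt)

# Null hypothesis: the residuals come from a Normal distribution
resid_sw <- shapiro.test(resid(final_model))
print(resid_sw)
```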

Now we can check the important parameters of our final model through the “summary” command.

# bestFitModel results

summary(bestFitModel)
## 
## Call:
## lm(formula = mpg ~ am + wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## am            2.9358     1.4109   2.081 0.046716 *  
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Results:

The weight of the vehicle and its acceleration (quarter-mile time) act as confounding variables when we are determining the relation between transmission type and mpg.

Accounting for the above confounding variables, manual transmission cars get on average 2.94 MPG more than automatic transmission cars. (This is much lower than the earlier 7.245, which did not account for confounding.) The effect is statistically significant at the 0.05 level, but only marginally so (p = 0.0467).
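A point estimate alone hides how uncertain the adjusted difference is. The sketch below, which is an addition to the original analysis, asks for a 95% confidence interval on the am coefficient. Because the p-value is just under 0.05, the interval should exclude zero only narrowly. The names are illustrative.

```r
mt <- datasets::mtcars
final_model <- lm(mpg ~ am + wt + qsec, data = mt)

# 95% confidence interval for the manual-vs-automatic effect
am_ci <- confint(final_model, "am", level = 0.95)
print(am_ci)
```

The interval is symmetric around the estimate of 2.9358, with a half-width of the t quantile on 28 degrees of freedom times the standard error of 1.4109.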

The adjusted R^2 is 0.8336, which means that the model explains about 83% of the variation in mpg, a very good fit to these data.

APPENDIX

Appendix 1: Pairwise scatterplots between variables

pairs(mtcars)

Appendix 2: Residual diagnostics for final model

par(mfrow = c(2,2))
plot(bestFitModel)