Uchenna Emechebe 17th Jan 2017
Project
I work for Motor Trend. They are interested in knowing:
1) Whether manual or automatic transmission is better for gas mileage (mpg)
2) If there is a difference in gas mileage between manual and automatic, how
large that difference is.
The data set is the mtcars data set in R.
# Load the data set
data(mtcars)
# A peek into the data to get a feel of the structure and the variables
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# The data has 11 variables and 32 observations. I am interested in the mpg variable and the am
# variable. The am variable is coded as 0 for automatic transmission and 1 for manual transmission.
# Since I am interested in how each one affects gas mileage, I want to recode them for ease
# of interpretation.
# Recode am == 0 as Automatic and am == 1 as Manual, stored as a factor with
# Automatic as the reference level
mtcars$am = factor(mtcars$am, levels = c(0, 1), labels = c('Automatic', 'Manual'))
# Let's do some exploratory analysis to see if there is a relationship
# between mpg and transmission type (am)
boxplot(mpg~am, xlab='Transmission', ylab='miles per gallon',data = mtcars)

# Exploratory data analysis suggests that there is a difference. Manual transmission
# seems to yield more miles per gallon than automatic transmission.
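The boxplot comparison can also be summarised numerically. The following is a minimal sketch, not part of the original analysis: it works on a fresh copy of the data (the name `mt` is an assumption made here so the objects above are not overwritten) and reports the mean mpg and the number of cars for each transmission type.

```r
# Fresh copy of the built-in data; 'mt' is a hypothetical name used only in this sketch
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean mpg and number of cars per transmission type
group_means <- tapply(mt$mpg, mt$am, mean)
group_counts <- table(mt$am)

print(round(group_means, 2))
print(group_counts)
```

The two groups are not the same size, which is worth keeping in mind when reading the boxplot.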
# To test the hypothesis that manual transmission yields more miles per gallon
# than automatic transmission, I decided to fit a linear model using the continuous
# variable mpg as the response and the am variable as the predictor. In plain English,
# I am using type of transmission as a predictor of how many miles we can achieve per gallon.
# Null hypothesis : There is no difference between Automatic and Manual with respect to mpg
# Alternative hypothesis : There is a difference
fit = lm(mpg~am,data=mtcars)
# Let's look at the coefficients
summary(fit)$coeff
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## amManual     7.244939   1.764422  4.106127 2.850207e-04
# Automatic transmission yields an average of 17.15 mpg, while manual transmission yields
# 7.24 mpg more (17.15 + 7.24 = 24.39 mpg).
# This difference is highly significant: the estimate is about 4.1 standard
# errors away from zero, with a p value of .00029.
# As a result of these statistics, I reject the null in favour of the alternative:
# in this data set, manual cars get more miles per gallon than automatic cars.
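With a single two-level predictor, the test on the slope of the linear model is the same as a two-sample t test that assumes equal variances in both groups. The sketch below illustrates that equivalence; it is not part of the original analysis, and the names `mt` and `fit_am` are assumptions introduced here.

```r
# Fresh copy of the data; 'mt' and 'fit_am' are hypothetical names for this sketch
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit_am <- lm(mpg ~ am, data = mt)
slope <- coef(summary(fit_am))["amManual", ]

# Pooled-variance two-sample t test on the same grouping
tt <- t.test(mpg ~ am, data = mt, var.equal = TRUE)

# The t statistics agree up to sign (t.test reports Automatic minus Manual)
print(c(lm_t = unname(slope["t value"]), ttest_t = unname(tt$statistic)))
print(c(lm_p = unname(slope["Pr(>|t|)"]), ttest_p = tt$p.value))
```

Dropping `var.equal = TRUE` gives the Welch test, which does not assume equal variances and will generally give a slightly different p value.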
# Let's get a confidence interval for these numbers:
confint(fit)
##                2.5 %   97.5 %
## (Intercept) 14.85062 19.44411
## amManual     3.64151 10.84837
# This shows that we can say with 95 percent confidence that manual transmission
# yields between 3.6 and 10.8 more mpg than automatic transmission.
# So how much of the variation in miles per gallon does transmission explain?
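The interval for the transmission effect can be reproduced by hand as estimate plus or minus the critical t value times the standard error. This is a minimal sketch of that arithmetic, assuming a fresh copy of the data; `mt`, `fit_am` and `ci_manual` are names introduced here for illustration.

```r
# Fresh copy of the data; all object names here are hypothetical
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit_am <- lm(mpg ~ am, data = mt)

est <- coef(summary(fit_am))["amManual", "Estimate"]
se <- coef(summary(fit_am))["amManual", "Std. Error"]

# Two-sided 95 percent critical value on the residual degrees of freedom (30 here)
t_crit <- qt(0.975, df = df.residual(fit_am))

ci_manual <- est + c(-1, 1) * t_crit * se
print(ci_manual)
```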
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3923 -3.0923 -0.2974  3.2439  9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
# R squared shows that only 36 percent of the variance in mpg is explained by transmission type.
# This suggests that variables other than transmission type are affecting mpg. This motivates the
# need to go back and re-fit our model accounting for these other variables.
# Before we do that, let's look at the residuals
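The R squared value is one minus the ratio of the residual sum of squares to the total sum of squares of mpg around its mean. The sketch below recomputes it from those two quantities, assuming a fresh copy of the data; the names `mt`, `fit_am`, `rss` and `tss` are introduced here for illustration.

```r
# Fresh copy of the data; all object names here are hypothetical
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit_am <- lm(mpg ~ am, data = mt)

rss <- sum(resid(fit_am)^2)             # variation left unexplained by am
tss <- sum((mt$mpg - mean(mt$mpg))^2)   # total variation in mpg
r_squared <- 1 - rss / tss

print(round(c(rss = rss, tss = tss, r_squared = r_squared), 4))
```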
plot(fit$fitted,resid(fit))
abline(h=0, lwd=1)

# From the plot, you can see that there is a pattern in the residuals.
# This reinforces the point that we need to account for other variables.
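Because am has only two levels, the fitted values of this model take only two distinct values, so the residuals line up in two columns. One way to look for structure is to plot the residuals against a variable left out of the model. The sketch below does this for wt; the choice of wt and the names `mt` and `fit_am` are assumptions made here for illustration.

```r
# Fresh copy of the data; all object names here are hypothetical
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit_am <- lm(mpg ~ am, data = mt)

# Number of distinct fitted values (one per transmission type)
n_fitted <- length(unique(round(fitted(fit_am), 8)))
print(n_fitted)

# Correlation between the residuals and weight, a variable not in the model
resid_wt_cor <- cor(resid(fit_am), mt$wt)
print(resid_wt_cor)

plot(mt$wt, resid(fit_am), xlab = "wt (1000 lbs)", ylab = "residual (mpg)")
abline(h = 0, lwd = 1)
```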
Testing the effect of transmission on mpg after accounting for other confounding
variables in the data set
# Let's fit a sequence of nested models, adding one variable at a time, and then
# keep the variables that significantly improve the fit
fit1 = lm(mpg~am+cyl,data=mtcars)
fit2 = lm(mpg~am+cyl+disp,data=mtcars)
fit3 = lm(mpg~am+cyl+disp+hp,data=mtcars)
fit4 = lm(mpg~am+cyl+disp+hp+drat,data=mtcars)
fit5 = lm(mpg~am+cyl+disp+hp+drat+wt,data=mtcars)
fit6 = lm(mpg~am+cyl+disp+hp+drat+wt+qsec,data=mtcars)
fit7 = lm(mpg~am+cyl+disp+hp+drat+wt+qsec+vs,data=mtcars)
fit8 = lm(mpg~am+cyl+disp+hp+drat+wt+qsec+vs+gear,data=mtcars)
fit9 = lm(mpg~am+cyl+disp+hp+drat+wt+qsec+vs+gear+carb,data=mtcars)
# Now let's select the variables that are significant in the model
anova(fit,fit1,fit2,fit3,fit4,fit5,fit6,fit7,fit8,fit9)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl
## Model 3: mpg ~ am + cyl + disp
## Model 4: mpg ~ am + cyl + disp + hp
## Model 5: mpg ~ am + cyl + disp + hp + drat
## Model 6: mpg ~ am + cyl + disp + hp + drat + wt
## Model 7: mpg ~ am + cyl + disp + hp + drat + wt + qsec
## Model 8: mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs
## Model 9: mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear
## Model 10: mpg ~ am + cyl + disp + hp + drat + wt + qsec + vs + gear + carb
##    Res.Df    RSS Df Sum of Sq       F    Pr(>F)
## 1      30 720.90
## 2      29 271.36  1    449.53 64.0039 8.231e-08 ***
## 3      28 252.08  1     19.28  2.7452   0.11241
## 4      27 216.37  1     35.71  5.0849   0.03493 *
## 5      26 214.50  1      1.87  0.2663   0.61121
## 6      25 162.43  1     52.06  7.4127   0.01275 *
## 7      24 149.09  1     13.34  1.8999   0.18260
## 8      23 148.87  1      0.22  0.0309   0.86214
## 9      22 147.90  1      0.97  0.1384   0.71365
## 10     21 147.49  1      0.41  0.0579   0.81218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Each row tests the variable added at that step, given the variables already in the model.
# From the anova analysis, we need cyl, hp and wt in our model
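Because these F tests are sequential, whether a variable looks significant depends on what was entered before it. The sketch below illustrates this with disp, comparing its contribution when added after am alone and when added after am and cyl. It is an illustration under the assumption of a fresh copy of the data, with hypothetical names `mt`, `p_disp_after_am` and `p_disp_after_cyl`. Note that a two-model comparison uses the residual variance of the larger of the two models, so its F value differs from the corresponding row of the ten-model table.

```r
# Fresh copy of the data; all object names here are hypothetical
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# disp added directly after am
p_disp_after_am <- anova(lm(mpg ~ am, data = mt),
                         lm(mpg ~ am + disp, data = mt))[2, "Pr(>F)"]

# disp added after am and cyl
p_disp_after_cyl <- anova(lm(mpg ~ am + cyl, data = mt),
                          lm(mpg ~ am + cyl + disp, data = mt))[2, "Pr(>F)"]

print(c(after_am = p_disp_after_am, after_am_cyl = p_disp_after_cyl))
```

Correlated predictors such as cyl and disp share explanatory power, so the one entered first tends to take the credit.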
# So refit the model
Model = lm(mpg~am+cyl+hp+wt,data=mtcars)
summary(Model)$coeff
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 36.14653575 3.10478079 11.642218 4.944804e-12
## amManual     1.47804771 1.44114927  1.025603 3.141799e-01
## cyl         -0.74515702 0.58278741 -1.278609 2.119166e-01
## hp          -0.02495106 0.01364614 -1.828433 7.855337e-02
## wt          -2.60648071 0.91983749 -2.833632 8.603218e-03
# Accounting for the other variables, Manual transmission now has an increase
# of 1.48 miles per gallon over Automatic transmission. However, this is no
# longer significant (p value of .3142)
# Let's look at the confidence interval
confint(Model)
##                   2.5 %       97.5 %
## (Intercept) 29.77605177 42.517019733
## amManual    -1.47894635  4.435041763
## cyl         -1.94093802  0.450623969
## hp          -0.05295064  0.003048517
## wt          -4.49383134 -0.719130075
# The interval for amManual contains 0, so based on this analysis I can't reject the null hypothesis.
# Holding cyl, hp and wt constant, the data show no significant difference in mpg between
# Manual and Automatic transmission.
# Let's look at the residuals
plot(Model$fitted,resid(Model))
abline(h=0, lwd=1)
