Automatic or Manual? Which Transmission Model is better for MPG? A Regression Model

Executive Summary

It is a common belief that a manual transmission is more fuel efficient than an automatic transmission. We generally think that changing gears manually gives us better fuel efficiency.

In this project, I have used a dataset from the 1974 Motor Trend US magazine, mainly to answer the following questions: 1. Is an automatic or manual transmission better for miles per gallon (MPG)? 2. How different is the MPG between automatic and manual transmissions?

Using hypothesis testing and simple linear regression, we determine that there is a significant difference between the mean MPG of automatic and manual transmission cars, with manual cars getting 7.245 MPG more on average.

However, to adjust for confounding variables in the data frame, such as the weight and horsepower of the car, we ran a multivariate regression to get a better estimate of the impact of transmission type on MPG.

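After comparing the models using ANOVA, the results from the multivariate regression reveal that, on average and holding weight and horsepower constant, manual transmission cars get 2.084 miles per gallon more than automatic transmission cars. This adjusted difference is not statistically significant at the 5% level (p = 0.141).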

Data Processing

Step 1 :

Reading the “mtcars” data

data(mtcars)

Step 2:

Study the structure of the data.

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The data frame has 32 observations on 11 variables. The variables are: mpg = Miles/(US) gallon, cyl = Number of cylinders, disp = Displacement (cu.in.), hp = Gross horsepower, drat = Rear axle ratio, wt = Weight (lb/1000), qsec = 1/4 mile time, vs = V/S, am = Transmission (0 = automatic, 1 = manual), gear = Number of forward gears, carb = Number of carburetors.

Step 3 :

After checking the structure of the data frame, we see that our explanatory variable of interest, “am”, is a numeric variable. Let's convert this variable to a factor and label the levels as “Automatic” and “Manual” for better interpretability.

mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
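
The conversion above can also be written in one step with explicit levels and labels. The following is a minimal sketch on a copy of the data (the names `cars` and `am_counts` are illustrative and not part of the original analysis), followed by a count of cars in each group:

```r
# Work on a copy so the analysis data frame is left untouched
cars <- datasets::mtcars

# Map the codes explicitly: 0 -> "Automatic", 1 -> "Manual"
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# Count the cars in each transmission group
am_counts <- table(cars$am)
print(am_counts)
```

Spelling out `levels` and `labels` together avoids relying on the default ordering of the factor levels.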

Exploratory Data Analysis

Step 1:

Since we are going to run a linear regression, we want to make sure that its assumptions are met. The assumptions are:

1. Linearity: the relationship between the explanatory and response variables should be linear. To check this, we look at a scatter plot of the data or at the residual plot.

2. Nearly normal residuals: the residuals should be nearly normally distributed and centred at zero.

3. Constant variability: the variability of the residuals around the zero line should be roughly constant. This is also called the homoscedasticity assumption.

As a first rough check, let's plot the dependent variable mpg to look at its distribution.

par(mfrow = c(1, 2))

Histogram with Normal Curve

x <- mtcars$mpg
h <- hist(x, breaks = 10, col = "blue", xlab = "Miles/(US) gallon",
          main = "Histogram of Miles/(US) gallon")
# Overlay a normal curve, rescaled from density to counts (bin width * n)
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "magenta", lwd = 2)

Figure: histogram of Miles/(US) gallon with a fitted normal curve

Kernel Density Plot

d <- density(mtcars$mpg)
plot(d, xlab = "MPG", col = "green", main = "Density Plot of MPG")

Figure: density plot of MPG

The distribution of mpg is approximately normal and there are no apparent outliers skewing the data.
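
The visual impression can be complemented with a formal test. The sketch below uses the Shapiro-Wilk test from base R, which is our addition and not part of the original analysis; the names `mpg` and `sw` are illustrative.

```r
# mpg values taken straight from the built-in dataset
mpg <- datasets::mtcars$mpg

# Shapiro-Wilk test: the null hypothesis is that the sample is normal
sw <- shapiro.test(mpg)
print(sw)

# TRUE means the test finds no evidence against normality at the 5% level
sw$p.value > 0.05
```

With only 32 observations the test has limited power, so it should be read alongside the plots rather than instead of them.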

Step 2:

Now let's check how mpg varies by automatic versus manual transmission.

boxplot(mpg ~ am, data = mtcars,
        col = c("dark grey", "light grey"),
        xlab = "Transmission",
        ylab = "Miles/(US) gallon",
        main = "MPG by Transmission Type")

Figure: box plot of MPG by transmission type

Again, there is no apparent outlier in our dataset. Moreover, we can easily see a difference in MPG by transmission type. As suspected, manual transmission cars seem to get better miles per gallon than automatic transmission cars. However, we should dig deeper.

Hypothesis Testing

aggregate(mpg ~ am, data = mtcars, mean)

##          am   mpg
## 1 Automatic 17.15
## 2    Manual 24.39

The mean MPG of manual transmission cars is 7.245 MPG higher than that of automatic transmission cars. Is this a significant difference? Null hypothesis: there is no difference in mean MPG between the two groups. Alternative hypothesis: there is a difference. We set our significance level (alpha) at 0.05 (that is, a 95% confidence level) and run a t-test to find out.

autoData <- mtcars[mtcars$am == "Automatic",]
manualData <- mtcars[mtcars$am == "Manual",]
t.test(autoData$mpg, manualData$mpg)

## 
##  Welch Two Sample t-test
## 
## data:  autoData$mpg and manualData$mpg
## t = -3.767, df = 18.33, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.28  -3.21
## sample estimates:
## mean of x mean of y 
##     17.15     24.39

With a p-value of 0.001374, I reject the null hypothesis and conclude that there is a significant difference between the mean MPG of manual transmission cars and that of automatic transmission cars. Now I must quantify that difference.
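
The same Welch test can be run without splitting the data by hand. This sketch uses the formula interface of `t.test` on a fresh copy of the data, where am is still coded 0/1; the names `cars`, `tt` and `mean_diff` are illustrative.

```r
# Fresh copy of the data; am is numeric here (0 = automatic, 1 = manual)
cars <- datasets::mtcars

# Formula interface: compare mpg across the two groups defined by am
tt <- t.test(mpg ~ am, data = cars)

# Difference in group means, manual minus automatic
mean_diff <- unname(tt$estimate[2] - tt$estimate[1])
round(mean_diff, 3)

# 95% confidence interval for (automatic mean - manual mean)
round(tt$conf.int, 2)
```

Because the interval is reported for the first group minus the second, it is negative here, matching the sign of the t statistic in the output above.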

Model Building

Step 1 : Correlation

To determine which explanatory variables should go into our model, I create a correlation matrix for the ‘mtcars’ dataset and look at the row for mpg.

# Reload the original data so that am is numeric again (cor() needs numeric columns)
data(mtcars)
sort(cor(mtcars)[1,])

##      wt     cyl    disp      hp    carb    qsec    gear      am      vs 
## -0.8677 -0.8522 -0.8476 -0.7762 -0.5509  0.4187  0.4803  0.5998  0.6640 
##    drat     mpg 
##  0.6812  1.0000

In addition to ‘am’ (which must be included in our regression model because it is the variable of interest), I see that wt, cyl, disp, and hp are highly correlated with our dependent variable mpg. As such, they may be good candidates to include in our model. However, if we look at the full correlation matrix, we also see that cyl and disp are highly correlated with each other. Since explanatory variables should not exhibit collinearity, we leave cyl and disp out of our model. Including wt and hp in our regression equation makes sense: from practical experience, we know that heavier cars and cars with more horsepower should have lower MPG.
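
The collinearity argument above can be inspected directly by looking at the correlations among the candidate predictors rather than only their correlation with mpg. A minimal sketch (the names `cars`, `pred_cor` and `cyl_disp` are illustrative):

```r
# Fresh copy of the data with all columns numeric
cars <- datasets::mtcars

# Pairwise correlations among the four candidate predictors
pred_cor <- cor(cars[, c("wt", "cyl", "disp", "hp")])
round(pred_cor, 2)

# Correlation between cyl and disp specifically
cyl_disp <- pred_cor["cyl", "disp"]
cyl_disp
```

Reading the whole matrix, not just one pair, is useful because the retained predictors are themselves correlated to some degree with the ones left out.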

Step 2 : Regression Analysis

Simple Linear Regression

To begin our model testing, we fit a simple linear regression for mpg on am.

fit <- lm(mpg~am, data = mtcars)
summary(fit)

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.392 -3.092 -0.297  3.244  9.508 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    17.15       1.12   15.25  1.1e-15 ***
## am              7.24       1.76    4.11  0.00029 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.9 on 30 degrees of freedom
## Multiple R-squared:  0.36,   Adjusted R-squared:  0.338 
## F-statistic: 16.9 on 1 and 30 DF,  p-value: 0.000285

This model does not tell us much more than the hypothesis test did. Interpreting the intercept and the coefficient, we say that, on average, automatic cars get 17.147 MPG and manual transmission cars get 7.245 MPG more. In addition, we see that the R^2 value is 0.3598. This means that our model explains only 35.98% of the variance in mpg.
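
The reason the simple regression adds little to the t-test is that, with a single 0/1 predictor, the fitted coefficients are just the group means in another form. A short sketch to confirm this (the names `cars`, `fit_simple`, `group_means`, `int_ok` and `slope_ok` are illustrative):

```r
# Fresh copy of the data; am is numeric (0 = automatic, 1 = manual)
cars <- datasets::mtcars
fit_simple <- lm(mpg ~ am, data = cars)

# Mean mpg in each transmission group, named "0" and "1"
group_means <- tapply(cars$mpg, cars$am, mean)

# The intercept equals the mean of the automatic group
int_ok <- all.equal(unname(coef(fit_simple)["(Intercept)"]),
                    unname(group_means["0"]))

# The slope equals the manual mean minus the automatic mean
slope_ok <- all.equal(unname(coef(fit_simple)["am"]),
                      unname(group_means["1"] - group_means["0"]))

c(intercept_matches = isTRUE(int_ok), slope_matches = isTRUE(slope_ok))
```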

Let's check the SSE (sum of squared errors).

fit$residuals

##           Mazda RX4       Mazda RX4 Wag          Datsun 710 
##             -3.3923             -3.3923             -1.5923 
##      Hornet 4 Drive   Hornet Sportabout             Valiant 
##              4.2526              1.5526              0.9526 
##          Duster 360           Merc 240D            Merc 230 
##             -2.8474              7.2526              5.6526 
##            Merc 280           Merc 280C          Merc 450SE 
##              2.0526              0.6526             -0.7474 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
##              0.1526             -1.9474             -6.7474 
## Lincoln Continental   Chrysler Imperial            Fiat 128 
##             -6.7474             -2.4474              8.0077 
##         Honda Civic      Toyota Corolla       Toyota Corona 
##              6.0077              9.5077              4.3526 
##    Dodge Challenger         AMC Javelin          Camaro Z28 
##             -1.6474             -1.9474             -3.8474 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2 
##              2.0526              2.9077              1.6077 
##        Lotus Europa      Ford Pantera L        Ferrari Dino 
##              6.0077             -8.5923             -4.6923 
##       Maserati Bora          Volvo 142E 
##             -9.3923             -2.9923

SSE <- sum(fit$residuals^2)
SSE

## [1] 720.9

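The SSE is 720.9. On its own this number has no absolute scale, but together with the low R^2 it shows that a large share of the variation in mpg is left unexplained. So our model is not a good fit.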
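
To put the SSE in context, it helps to compare it with the total variation in mpg, since R^2 is defined as 1 - SSE/SST. A minimal sketch (the names `cars`, `fit_simple`, `sse`, `sst` and `r2` are illustrative):

```r
# Fresh copy of the data and the simple model
cars <- datasets::mtcars
fit_simple <- lm(mpg ~ am, data = cars)

# Unexplained variation: sum of squared residuals
sse <- sum(residuals(fit_simple)^2)

# Total variation of mpg around its overall mean
sst <- sum((cars$mpg - mean(cars$mpg))^2)

# Share of variation explained by the model
r2 <- 1 - sse / sst
round(c(SSE = sse, SST = sst, R2 = r2), 4)
```

The resulting R^2 should agree with the value reported by `summary()` for the same model.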

Multivariate Linear Regression

Next, we fit a multivariate linear regression for mpg on am, wt, and hp. Since the simple model is nested within this larger one, we run an ANOVA (an F-test) to compare the two models and see whether adding wt and hp significantly improves the fit.

bestfit <- lm(mpg~am + wt + hp, data = mtcars)
anova(fit, bestfit)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + hp
##   Res.Df RSS Df Sum of Sq  F  Pr(>F)    
## 1     30 721                            
## 2     28 180  2       541 42 3.7e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a p-value of 3.745e-09, we reject the null hypothesis that the added terms do not improve the fit, and conclude that our multivariate model is significantly different from our simple model.
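
The F statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares, which makes clear what the test compares. A sketch under the same model definitions (the names `cars`, `fit_simple`, `fit_full`, `rss1`, `rss2`, `df_extra`, `f_stat` and `p_val` are illustrative):

```r
# Fresh copy of the data and both nested models
cars <- datasets::mtcars
fit_simple <- lm(mpg ~ am, data = cars)
fit_full <- lm(mpg ~ am + wt + hp, data = cars)

# Residual sums of squares of the small and the large model
rss1 <- sum(residuals(fit_simple)^2)
rss2 <- sum(residuals(fit_full)^2)

# Number of terms added (wt and hp)
df_extra <- df.residual(fit_simple) - df.residual(fit_full)

# F = (drop in RSS per added term) / (residual variance of the large model)
f_stat <- ((rss1 - rss2) / df_extra) / (rss2 / df.residual(fit_full))
p_val <- pf(f_stat, df_extra, df.residual(fit_full), lower.tail = FALSE)

c(F = f_stat, p = p_val)
```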

Once again, we should check the SSE (sum of squared errors).

bestfit$residuals

##           Mazda RX4       Mazda RX4 Wag          Datsun 710 
##            -3.42206            -2.68802            -3.12277 
##      Hornet 4 Drive   Hornet Sportabout             Valiant 
##             0.77440             1.15820            -2.00774 
##          Duster 360           Merc 240D            Merc 230 
##            -0.24407             1.90346             1.42512 
##            Merc 280           Merc 280C          Merc 450SE 
##            -0.29069            -1.69069             0.85910 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
##             0.78038            -1.17569            -0.80722 
## Lincoln Continental   Chrysler Imperial            Fiat 128 
##             0.06844             4.70322             5.11988 
##         Honda Civic      Toyota Corolla       Toyota Corona 
##             0.91121             5.53172            -1.77175 
##    Dodge Challenger         AMC Javelin          Camaro Z28 
##            -2.74848            -3.29316            -0.46686 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2 
##             2.82402            -0.74295            -0.51587 
##        Lotus Europa      Ford Pantera L        Ferrari Dino 
##             2.90380            -1.26712            -1.85415 
##       Maserati Bora          Volvo 142E 
##             1.74530            -2.59896

SSE <- sum(bestfit$residuals^2)
SSE

## [1] 180.3

Now the SSE is 180.3, down from 720.9. This means that the multivariate model leaves much less of the variation in mpg unexplained, so it fits the data considerably better.

Appendix

Before we report the details of our model, it is important to check the residuals for any signs of non-normality and to examine the residuals vs. fitted values plot for any signs of heteroskedasticity.

par(mfrow = c(2,2))
plot(bestfit)

Figure: diagnostic plots for the multivariate model (output of plot(bestfit))

Our residuals appear approximately normally distributed and homoskedastic. We can now report the estimates from our final model.
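
The diagnostic plots can be backed up with simple numeric checks. The sketch below is our addition, not part of the original analysis: it applies a Shapiro-Wilk test to the residuals and, as a rough informal indicator of non-constant variance, the rank correlation between the absolute residuals and the fitted values. The names `cars`, `fit_full`, `res`, `sw_res` and `spread_cor` are illustrative.

```r
# Fresh copy of the data and the multivariate model
cars <- datasets::mtcars
fit_full <- lm(mpg ~ am + wt + hp, data = cars)
res <- residuals(fit_full)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw_res <- shapiro.test(res)
print(sw_res)

# Rough heteroskedasticity indicator: do residuals grow with fitted values?
# Values near 0 are consistent with constant variance.
spread_cor <- cor(abs(res), fitted(fit_full), method = "spearman")
spread_cor
```

Neither check replaces the plots; they only give a number to set beside the visual judgement.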

summary(bestfit)

## 
## Call:
## lm(formula = mpg ~ am + wt + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.422 -1.792 -0.379  1.225  5.532 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.00288    2.64266   12.87  2.8e-13 ***
## am           2.08371    1.37642    1.51  0.14127    
## wt          -2.87858    0.90497   -3.18  0.00357 ** 
## hp          -0.03748    0.00961   -3.90  0.00055 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.54 on 28 degrees of freedom
## Multiple R-squared:  0.84,   Adjusted R-squared:  0.823 
## F-statistic:   49 on 3 and 28 DF,  p-value: 2.91e-11

This model explains about 83.99% of the variance. Moreover, we see that wt and hp did indeed confound the relationship between am and mpg (mostly wt). Now when we read the coefficient for am, we say that, on average and holding weight and horsepower constant, manual transmission cars get 2.084 MPG more than automatic transmission cars. Note that this coefficient is no longer statistically significant at the 5% level (p = 0.141).
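
A confidence interval conveys the uncertainty around the am coefficient more directly than the point estimate alone. A minimal sketch using `confint` from base R (the names `cars`, `fit_full` and `ci_am` are illustrative):

```r
# Fresh copy of the data and the multivariate model
cars <- datasets::mtcars
fit_full <- lm(mpg ~ am + wt + hp, data = cars)

# 95% confidence interval for the am coefficient only
ci_am <- confint(fit_full, "am", level = 0.95)
round(ci_am, 2)
```

Since the p-value for am in the summary above is greater than 0.05, this interval includes zero, so the data are compatible with no adjusted difference between the two transmission types.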

sougata biswas

Thursday, August 21, 2014
