It is commonly believed that a manual transmission is more fuel efficient than an automatic transmission in a motor car. We generally think that changing gears manually gives us better fuel efficiency.
In this project, I have used a dataset from the 1974 Motor Trend US magazine, mainly to answer the following questions: 1. Is an automatic or manual transmission better for miles per gallon (MPG)? 2. How different is the MPG between automatic and manual transmissions?
Using hypothesis testing and simple linear regression, we determine that there is a significant difference between the mean MPG for automatic and manual transmission cars, with the latter getting 7.245 more MPG on average.
However, to adjust for confounding variables in the data frame, such as the weight and horsepower of the car, we ran a multivariate regression to get a better estimate of the impact of transmission type on MPG.
After comparing the models using ANOVA, the multivariate regression estimates that, on average, manual transmission cars get 2.084 miles per gallon more than automatic transmission cars, although this adjusted difference is not statistically significant at the 0.05 level (p = 0.141).
Reading the “mtcars” data
data(mtcars)
Study the structure of the data.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
A data frame with 32 observations on 11 variables. The variables in this data frame are:
mpg = Miles/(US) gallon
cyl = Number of cylinders
disp = Displacement (cu.in.)
hp = Gross horsepower
drat = Rear axle ratio
wt = Weight (1000 lbs)
qsec = 1/4 mile time
vs = Engine shape (0 = V-shaped, 1 = straight)
am = Transmission (0 = automatic, 1 = manual)
gear = Number of forward gears
carb = Number of carburetors
After checking the structure of the data frame, we see that our explanatory variable of interest, “am”, is a numeric variable. Let's convert this variable to a factor and label the levels as “Automatic” and “Manual” for better interpretability.
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
Since we run a linear regression, we want to make sure that its assumptions are met. The assumptions are:
1. Linearity: The relationship between the explanatory and response variables should be linear. To check this, we look at a scatter plot of the data or at the residual plot.
2. Nearly normal residuals: The residuals should be nearly normally distributed, centred at zero.
3. Constant variability: The variability of the residuals around the zero line should be roughly constant. This is also called the homoscedasticity assumption.
Let's plot the dependent variable mpg to check its distribution.
par(mfrow = c(1, 2))
x <- mtcars$mpg
h<-hist(x, breaks=10, col="blue", xlab="Miles/(US) gallon",
main="Histogram of Miles/(US) gallon")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="magenta", lwd=2)
d <- density(mtcars$mpg)
plot(d, xlab = "MPG", col = "green",main ="Density Plot of MPG")
The distribution of mpg is approximately normal, and there are no apparent outliers skewing the data.
Now let's check how mpg varies by automatic versus manual transmission.
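The visual impression of normality can be backed up with a formal check. The following is a minimal sketch, assuming a Shapiro-Wilk test is an acceptable normality check for a sample of this size; the local copy `cars` and the object `sw_mpg` are illustrative names that do not appear in the original analysis.

```r
# Local copy so the working data frame is left untouched (illustrative name)
cars <- datasets::mtcars

# Shapiro-Wilk test: the null hypothesis is that mpg is normally distributed.
# A p-value above 0.05 means we do not reject normality.
sw_mpg <- shapiro.test(cars$mpg)
print(sw_mpg)

# Normal Q-Q plot as a visual companion to the histogram
qqnorm(cars$mpg, main = "Normal Q-Q Plot of MPG")
qqline(cars$mpg, col = "red")
```

With only 32 observations the test has limited power, so it is best read together with the plots rather than on its own.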
boxplot(mpg ~ am, data = mtcars,
col = c("dark grey", "light grey"),
xlab = "Transmission",
ylab = "Miles/(US) gallon",
main = "MPG by Transmission Type")
Again, there are no apparent outliers in our dataset. Moreover, we can easily see a difference in MPG by transmission type. As suspected, manual transmission cars seem to get better miles per gallon than automatic transmission cars. However, we should dig deeper.
aggregate(mpg ~ am, data = mtcars, mean)
## am mpg
## 1 Automatic 17.15
## 2 Manual 24.39
The mean MPG of manual transmission cars is 7.245 MPG higher than that of automatic transmission cars. Is this a significant difference? Null hypothesis: there is no difference in mean MPG between the two groups. Alternative hypothesis: there is a difference in mean MPG. We set our alpha at 0.05 (that is, a 95% confidence level) and run a t-test to find out.
autoData <- mtcars[mtcars$am == "Automatic",]
manualData <- mtcars[mtcars$am == "Manual",]
t.test(autoData$mpg, manualData$mpg)
##
## Welch Two Sample t-test
##
## data: autoData$mpg and manualData$mpg
## t = -3.767, df = 18.33, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.28 -3.21
## sample estimates:
## mean of x mean of y
## 17.15 24.39
With a p-value of 0.001374, I reject the null hypothesis and conclude that there is a significant difference between the mean MPG of manual transmission cars and that of automatic transmission cars. Now I must quantify that difference.
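The same test can be run without splitting the data frame by hand. This is a small sketch using the formula interface of `t.test`; the names `cars`, `tt`, `group_means`, `mean_diff` and `alpha` are illustrative and not part of the original analysis.

```r
# Local copy with am labelled as a factor (illustrative name)
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Welch two-sample t-test via the formula interface
tt <- t.test(mpg ~ am, data = cars)

# Group means and their difference (manual minus automatic)
group_means <- tapply(cars$mpg, cars$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

alpha <- 0.05
cat("Difference in means:", round(mean_diff, 3), "\n")
cat("p-value:", signif(tt$p.value, 4), "\n")
cat("Reject H0 at alpha = 0.05:", tt$p.value < alpha, "\n")
```

Making the decision rule explicit in code keeps the chosen alpha visible next to the p-value it is compared with.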
To determine which explanatory variables should go into our model, I create a correlation matrix for the ‘mtcars’ dataset and look at the row for mpg.
# Reload the original data so that am is numeric again; cor() requires numeric columns
data(mtcars)
sort(cor(mtcars)[1,])
## wt cyl disp hp carb qsec gear am vs
## -0.8677 -0.8522 -0.8476 -0.7762 -0.5509 0.4187 0.4803 0.5998 0.6640
## drat mpg
## 0.6812 1.0000
In addition to ‘am’ (which must be included in our regression model because it is the variable of interest), I see that wt, cyl, disp, and hp are highly correlated with our dependent variable mpg. As such, they may be good candidates to include in our model. However, if we look at the full correlation matrix, we also see that cyl and disp are highly correlated with each other. Since explanatory variables should not exhibit collinearity, we should not have both cyl and disp in our model. Including wt and hp in our regression equation makes sense: from practical experience, we know that heavier cars and cars with more horsepower should have lower MPG.
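The collinearity argument can be checked directly. The sketch below prints the pairwise correlations among the candidate predictors and computes variance inflation factors by hand in base R. The VIF step is an addition for illustration, not something the original analysis ran, and `cars`, `predictors` and `vif_manual` are hypothetical names.

```r
# Local copy of the numeric data (illustrative name)
cars <- datasets::mtcars

# Pairwise correlations among the candidate predictors
predictors <- c("wt", "cyl", "disp", "hp")
print(round(cor(cars[, predictors]), 3))

# Variance inflation factor for each predictor, computed by hand:
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on the remaining candidate predictors.
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = cars))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

A VIF well above 1 indicates that a predictor is largely explained by the others, which is the situation the text describes for cyl and disp.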
To begin our model testing, we fit a simple linear regression for mpg on am.
fit <- lm(mpg~am, data = mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.392 -3.092 -0.297 3.244 9.508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.15 1.12 15.25 1.1e-15 ***
## am 7.24 1.76 4.11 0.00029 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.9 on 30 degrees of freedom
## Multiple R-squared: 0.36, Adjusted R-squared: 0.338
## F-statistic: 16.9 on 1 and 30 DF, p-value: 0.000285
This model tells us little beyond what the hypothesis test already showed. Interpreting the intercept and the coefficient, we say that, on average, automatic cars get 17.147 MPG and manual transmission cars get 7.245 MPG more. In addition, we see that the R^2 value is 0.3598. This means that our model only explains 35.98% of the variance.
Let’s check the SSE (sum of squared errors).
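With a single two-level predictor, the regression coefficients are simply the group means in another form. A short sketch of that correspondence follows, using illustrative names (`cars`, `fit_simple`, `b`, `auto_mean`, `manual_mean`) that are not in the original.

```r
# Local copy with am kept numeric (0 = automatic, 1 = manual)
cars <- datasets::mtcars
fit_simple <- lm(mpg ~ am, data = cars)
b <- coef(fit_simple)

auto_mean <- mean(cars$mpg[cars$am == 0])
manual_mean <- mean(cars$mpg[cars$am == 1])

# The intercept is the automatic-group mean; intercept + slope is the manual-group mean
cat("Intercept:", round(b[1], 3), " automatic mean:", round(auto_mean, 3), "\n")
cat("Intercept + slope:", round(b[1] + b[2], 3), " manual mean:", round(manual_mean, 3), "\n")
```

This is why the simple regression reproduces the 7.245 difference found by comparing the two means.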
fit$residuals
## Mazda RX4 Mazda RX4 Wag Datsun 710
## -3.3923 -3.3923 -1.5923
## Hornet 4 Drive Hornet Sportabout Valiant
## 4.2526 1.5526 0.9526
## Duster 360 Merc 240D Merc 230
## -2.8474 7.2526 5.6526
## Merc 280 Merc 280C Merc 450SE
## 2.0526 0.6526 -0.7474
## Merc 450SL Merc 450SLC Cadillac Fleetwood
## 0.1526 -1.9474 -6.7474
## Lincoln Continental Chrysler Imperial Fiat 128
## -6.7474 -2.4474 8.0077
## Honda Civic Toyota Corolla Toyota Corona
## 6.0077 9.5077 4.3526
## Dodge Challenger AMC Javelin Camaro Z28
## -1.6474 -1.9474 -3.8474
## Pontiac Firebird Fiat X1-9 Porsche 914-2
## 2.0526 2.9077 1.6077
## Lotus Europa Ford Pantera L Ferrari Dino
## 6.0077 -8.5923 -4.6923
## Maserati Bora Volvo 142E
## -9.3923 -2.9923
SSE = sum(fit$residuals^2)
SSE
## [1] 720.9
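The SSE is 720.9. On its own this number has no absolute scale, but together with an R^2 of about 0.36 it shows that most of the variation in mpg is left unexplained, so this model is not a good fit.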
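To put the SSE on a scale, it helps to compare it with the total sum of squares of mpg. A minimal sketch of that calculation is shown here; `cars`, `fit_simple`, `sse`, `sst` and `r2` are illustrative names.

```r
# Local copy of the data (illustrative name)
cars <- datasets::mtcars
fit_simple <- lm(mpg ~ am, data = cars)

sse <- sum(residuals(fit_simple)^2)          # variation left unexplained by the model
sst <- sum((cars$mpg - mean(cars$mpg))^2)    # total variation in mpg

# R^2 is the share of total variation that the model explains
r2 <- 1 - sse / sst

cat("SSE:", round(sse, 1), "\n")
cat("SST:", round(sst, 1), "\n")
cat("R^2:", round(r2, 4), "\n")
```

The ratio SSE/SST is the fraction of variation left unexplained, which is what makes 720.9 large in this context.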
Next, we fit a multivariate linear regression for mpg on am, wt, and hp. Since we have two models of the same data, we run an ANOVA to compare the two models and see if they are significantly different.
bestfit <- lm(mpg~am + wt + hp, data = mtcars)
anova(fit, bestfit)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + hp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 721
## 2 28 180 2 541 42 3.7e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With a p-value of 3.745e-09, we reject the null hypothesis and claim that our multivariate model is significantly different from our simple model.
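The F statistic in the ANOVA table can be reproduced from the two residual sums of squares. The sketch below shows the arithmetic, with illustrative names (`cars`, `fit_small`, `fit_large`, `f_stat`, `p_val`) that do not appear in the original.

```r
# Local copy of the data (illustrative name)
cars <- datasets::mtcars
fit_small <- lm(mpg ~ am, data = cars)
fit_large <- lm(mpg ~ am + wt + hp, data = cars)

# Residual sums of squares and residual degrees of freedom
rss_small <- deviance(fit_small)
rss_large <- deviance(fit_large)
df_small <- df.residual(fit_small)   # 30
df_large <- df.residual(fit_large)   # 28

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
f_stat <- ((rss_small - rss_large) / (df_small - df_large)) / (rss_large / df_large)
p_val <- pf(f_stat, df_small - df_large, df_large, lower.tail = FALSE)

cat("F statistic:", round(f_stat, 2), "\n")
cat("p-value:", signif(p_val, 4), "\n")
```

The null hypothesis being tested is that the coefficients of wt and hp are both zero, so a small p-value says that adding them improves the fit.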
Once again, we should check the SSE (sum of squared errors).
bestfit$residuals
## Mazda RX4 Mazda RX4 Wag Datsun 710
## -3.42206 -2.68802 -3.12277
## Hornet 4 Drive Hornet Sportabout Valiant
## 0.77440 1.15820 -2.00774
## Duster 360 Merc 240D Merc 230
## -0.24407 1.90346 1.42512
## Merc 280 Merc 280C Merc 450SE
## -0.29069 -1.69069 0.85910
## Merc 450SL Merc 450SLC Cadillac Fleetwood
## 0.78038 -1.17569 -0.80722
## Lincoln Continental Chrysler Imperial Fiat 128
## 0.06844 4.70322 5.11988
## Honda Civic Toyota Corolla Toyota Corona
## 0.91121 5.53172 -1.77175
## Dodge Challenger AMC Javelin Camaro Z28
## -2.74848 -3.29316 -0.46686
## Pontiac Firebird Fiat X1-9 Porsche 914-2
## 2.82402 -0.74295 -0.51587
## Lotus Europa Ford Pantera L Ferrari Dino
## 2.90380 -1.26712 -1.85415
## Maserati Bora Volvo 142E
## 1.74530 -2.59896
SSE = sum(bestfit$residuals^2)
SSE
## [1] 180.3
Now the SSE is 180.3, down from 720.9. The model leaves far less of the variation in mpg unexplained, so it fits the data much better.
Before we report the details of our model, it is important to check the residuals for any signs of non-normality and to examine the residuals vs. fitted values plot for any signs of heteroskedasticity.
par(mfrow = c(2,2))
plot(bestfit)
Our residuals appear approximately normally distributed and homoskedastic. We can now report the estimates from our final model.
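The diagnostic plots can be complemented with formal tests. The following is a sketch only: it assumes a Shapiro-Wilk test for normality of the residuals and a studentized Breusch-Pagan test computed by hand in base R for constant variance. Neither test was part of the original analysis, and all object names (`cars`, `fit_full`, `sw_res`, `aux`, `lm_stat`, `bp_p`) are illustrative.

```r
# Local copy of the data (illustrative name)
cars <- datasets::mtcars
fit_full <- lm(mpg ~ am + wt + hp, data = cars)
res <- residuals(fit_full)

# Normality: Shapiro-Wilk test on the residuals
sw_res <- shapiro.test(res)
print(sw_res)

# Constant variance: studentized Breusch-Pagan test computed by hand.
# Regress the squared residuals on the model's predictors; under the null
# of homoskedasticity, n * R^2 follows a chi-square distribution with
# degrees of freedom equal to the number of predictors (3 here).
cars$res_sq <- res^2
aux <- lm(res_sq ~ am + wt + hp, data = cars)
lm_stat <- nrow(cars) * summary(aux)$r.squared
bp_p <- pchisq(lm_stat, df = 3, lower.tail = FALSE)

cat("Breusch-Pagan statistic:", round(lm_stat, 3), " p-value:", signif(bp_p, 3), "\n")
```

In both tests a p-value above the chosen alpha means the assumption is not rejected, which is weaker than confirming it, so the plots remain the primary evidence.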
summary(bestfit)
##
## Call:
## lm(formula = mpg ~ am + wt + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.422 -1.792 -0.379 1.225 5.532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.00288 2.64266 12.87 2.8e-13 ***
## am 2.08371 1.37642 1.51 0.14127
## wt -2.87858 0.90497 -3.18 0.00357 **
## hp -0.03748 0.00961 -3.90 0.00055 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.54 on 28 degrees of freedom
## Multiple R-squared: 0.84, Adjusted R-squared: 0.823
## F-statistic: 49 on 3 and 28 DF, p-value: 2.91e-11
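This model explains about 84% of the variance (R^2 = 0.84). Moreover, we see that wt and hp did indeed confound the relationship between am and mpg (mostly wt). Reading the coefficient for am, we say that, on average and holding weight and horsepower constant, manual transmission cars get 2.084 MPG more than automatic transmission cars. Note, however, that this coefficient is not statistically significant at the 0.05 level (p = 0.141), so the adjusted difference could plausibly be zero.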
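A confidence interval makes the uncertainty around the am coefficient easier to read than the point estimate alone. This is a minimal sketch using `confint`; the names `cars`, `fit_full`, `ci_am` and `includes_zero` are illustrative.

```r
# Local copy of the data (illustrative name)
cars <- datasets::mtcars
fit_full <- lm(mpg ~ am + wt + hp, data = cars)

# 95% confidence interval for the transmission effect, adjusted for wt and hp
ci_am <- confint(fit_full, "am", level = 0.95)
print(ci_am)

# Check whether the interval contains zero
includes_zero <- ci_am[1, 1] < 0 && ci_am[1, 2] > 0
cat("Interval includes zero:", includes_zero, "\n")
```

An interval that contains zero is consistent with the coefficient's p-value being above 0.05 in the model summary.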