You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions: - “Is an automatic or manual transmission better for MPG” - “Quantify the MPG difference between automatic and manual transmissions”
I will use the mtcars dataset, as documented at the following link: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html
library(ggplot2)
data(mtcars)
mtcars2 <- within(mtcars, {
  vs <- factor(vs, labels = c("V", "S"))
  am <- factor(am, labels = c("automatic", "manual"))
  cyl <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})
summary(mtcars2)
## mpg cyl disp hp drat
## Min. :10.40 4:11 Min. : 71.1 Min. : 52.0 Min. :2.760
## 1st Qu.:15.43 6: 7 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
## Median :19.20 8:14 Median :196.3 Median :123.0 Median :3.695
## Mean :20.09 Mean :230.7 Mean :146.7 Mean :3.597
## 3rd Qu.:22.80 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
## Max. :33.90 Max. :472.0 Max. :335.0 Max. :4.930
## wt qsec vs am gear carb
## Min. :1.513 Min. :14.50 V:18 automatic:19 3:15 1: 7
## 1st Qu.:2.581 1st Qu.:16.89 S:14 manual :13 4:12 2:10
## Median :3.325 Median :17.71 5: 5 3: 3
## Mean :3.217 Mean :17.85 4:10
## 3rd Qu.:3.610 3rd Qu.:18.90 6: 1
## Max. :5.424 Max. :22.90 8: 1
I will first create a box-and-whisker plot to get a visual sense of how transmission type (automatic vs. manual) relates to mpg.
ggplot(data = mtcars2, aes(x = am, y = mpg)) + geom_boxplot() + ggtitle("Box-and-whiskers")
We can see from the plot that manual-transmission cars tend to have better MPG: the median MPG of the manual group is visibly higher than that of the automatic group.
Next, I fit a preliminary simple linear regression of mpg on transmission type.
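The visual impression from the boxplot can be backed up with group summaries. The following is a minimal sketch using only base R; the helper names `am_label`, `mpg_means` and `mpg_medians` are introduced here for illustration and are not part of the original analysis.

```r
# Summarise mpg by transmission type (am: 0 = automatic, 1 = manual).
data(mtcars)

# Label the 0/1 transmission indicator so the output is readable.
am_label <- factor(mtcars$am, levels = c(0, 1),
                   labels = c("automatic", "manual"))

# Mean and median mpg within each transmission group.
mpg_means   <- tapply(mtcars$mpg, am_label, mean)
mpg_medians <- tapply(mtcars$mpg, am_label, median)

round(rbind(mean = mpg_means, median = mpg_medians), 2)
```

The boxplot's centre line corresponds to the median row, while a regression on transmission type compares the means.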
summary(lm(mpg ~ am, data=mtcars2))
##
## Call:
## lm(formula = mpg ~ am, data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## ammanual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
The interpretation of the linear model is as follows: both the intercept and the transmission coefficient are statistically significant at the 0.1% level (p < 0.001). The intercept, 17.15, is the mean MPG of automatic cars, and the ammanual coefficient says that manual cars average 7.25 MPG more. Additionally, the R-squared is 0.36, indicating that the independent variable explains 36% of the variation in the dependent variable. In other words, whether the car has an automatic or manual transmission explains about 36% of the variation in mpg.
To get a wider picture, I will proceed to fit a linear model that includes all the variables. The purpose is to identify the key variables with respect to mpg in order to gain more insight. We are looking for the most parsimonious model, so I will also look at the correlation of each variable with mpg to help us choose the best model.
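With a single two-level factor as the predictor, the regression slope is simply the difference between the two group means, and its p-value matches that of an equal-variance two-sample t-test. The sketch below illustrates this equivalence; the column name `am_f` and the object names `fit0`, `slope`, `mean_diff` and `tt` are assumptions made for this example.

```r
# Simple regression on a binary factor vs. an equal-variance t-test.
data(mtcars)
mtcars$am_f <- factor(mtcars$am, levels = c(0, 1),
                      labels = c("automatic", "manual"))

fit0  <- lm(mpg ~ am_f, data = mtcars)
slope <- coef(fit0)[["am_fmanual"]]          # manual minus automatic

# Difference of the raw group means (manual - automatic).
mean_diff <- unname(diff(tapply(mtcars$mpg, mtcars$am_f, mean)))

# Pooled-variance t-test; var.equal = TRUE mirrors the lm() assumption.
tt <- t.test(mpg ~ am_f, data = mtcars, var.equal = TRUE)

c(slope = slope, mean_diff = mean_diff, t_test_p = tt$p.value)

# 95% confidence intervals for the intercept and the slope.
confint(fit0)
```

The confidence interval gives a range for the unadjusted MPG difference, which is more informative than the p-value alone.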
summary(lm(mpg ~ ., mtcars2))
##
## Call:
## lm(formula = mpg ~ ., data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.57171 19.56616 1.358 0.1945
## cyl.L -0.23770 5.06256 -0.047 0.9632
## cyl.Q 2.02541 2.14952 0.942 0.3610
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vsS 1.93085 2.87126 0.672 0.5115
## ammanual 1.21212 3.21355 0.377 0.7113
## gear.L 1.78785 2.64200 0.677 0.5089
## gear.Q 0.12235 2.40896 0.051 0.9602
## carb.L 6.06156 6.72822 0.901 0.3819
## carb.Q 1.78825 2.80043 0.639 0.5327
## carb.C 0.42384 2.57389 0.165 0.8714
## carb^4 0.93317 2.45041 0.381 0.7087
## carb^5 -2.46410 2.90450 -0.848 0.4096
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
The model shows that, taken individually, the variables have little statistical significance. Only hp and wt are significant, and only at the 10% level. Hence, even though the adjusted R-squared is higher (0.78, versus 0.34 for the transmission-only model), the individual coefficient estimates are not reliable: 17 coefficients are estimated from only 32 observations, and several predictors are strongly correlated with one another, so the standard errors are large.
Since few of the variables are individually significant in the full model, I next compute the correlation between each variable and mpg in order to choose the best variables to include in the regression model.
cor(mtcars)[1,]
## mpg cyl disp hp drat wt
## 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594
## qsec vs am gear carb
## 0.4186840 0.6640389 0.5998324 0.4802848 -0.5509251
Since cyl, disp, hp, drat and wt have the strongest correlations with mpg (absolute values between 0.68 and 0.87), they are the most promising variables to include in our model, alongside am, the variable of interest.
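Ranking the correlations by absolute value makes the selection easier to read than scanning the raw vector. This is a minimal sketch; `cors` and `ranked` are names chosen for this example. Note that `cor()` treats every column, including cyl and am, as numeric, so this is only a rough screening step.

```r
# Rank predictors by the strength of their linear correlation with mpg.
data(mtcars)

cors <- cor(mtcars)["mpg", ]                 # correlations with mpg

# Drop mpg itself, then sort by absolute value, strongest first.
ranked <- sort(abs(cors[names(cors) != "mpg"]), decreasing = TRUE)
round(ranked, 3)

# Correlations among the top candidates: high values here warn that
# their coefficients will be hard to separate in a joint model.
round(cor(mtcars[, c("wt", "cyl", "disp", "hp")]), 2)
```

A predictor that correlates strongly with mpg may still add little once a closely related predictor is already in the model, which is why the candidates are then checked jointly in a regression.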
summary(lm(mpg ~ cyl+disp+hp+drat+wt+am, mtcars2))
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + am, data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8267 -1.4366 -0.4153 1.1649 5.0671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.755744 6.108432 5.035 3.8e-05 ***
## cyl.L -1.797442 2.163142 -0.831 0.4142
## cyl.Q 1.433586 1.020878 1.404 0.1730
## disp 0.004395 0.013090 0.336 0.7400
## hp -0.033038 0.014476 -2.282 0.0316 *
## drat 0.326616 1.471086 0.222 0.8262
## wt -2.726729 1.200207 -2.272 0.0323 *
## ammanual 1.681130 1.554386 1.082 0.2902
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.501 on 24 degrees of freedom
## Multiple R-squared: 0.8667, Adjusted R-squared: 0.8278
## F-statistic: 22.29 on 7 and 24 DF, p-value: 4.768e-09
Since drat, cyl and disp are not statistically significant in this model (all p-values above 0.17), I will drop these variables and run the regression again.
amodel <- lm(mpg ~ hp+wt+am, mtcars2)
summary(amodel)
##
## Call:
## lm(formula = mpg ~ hp + wt + am, data = mtcars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4221 -1.7924 -0.3788 1.2249 5.5317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.002875 2.642659 12.867 2.82e-13 ***
## hp -0.037479 0.009605 -3.902 0.000546 ***
## wt -2.878575 0.904971 -3.181 0.003574 **
## ammanual 2.083710 1.376420 1.514 0.141268
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared: 0.8399, Adjusted R-squared: 0.8227
## F-statistic: 48.96 on 3 and 28 DF, p-value: 2.908e-11
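Dropping cyl, disp and drat on the basis of their individual p-values can be cross-checked with a nested-model F-test, which asks whether the dropped terms jointly improve the fit. The sketch below is one way to do that; the object names `small`, `large` and `cmp` are assumptions made for this example.

```r
# Nested-model F-test: do cyl, disp and drat jointly improve the fit?
data(mtcars)
mtcars2 <- within(mtcars, {
  am  <- factor(am, labels = c("automatic", "manual"))
  cyl <- ordered(cyl)
})

small <- lm(mpg ~ hp + wt + am, data = mtcars2)
large <- lm(mpg ~ cyl + disp + hp + drat + wt + am, data = mtcars2)

# Compares residual sums of squares; the second row holds the F-test.
cmp <- anova(small, large)
cmp
```

A large p-value in the second row indicates that the extra terms do not significantly reduce the residual sum of squares, which supports keeping the smaller model.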
Now that we have a more parsimonious model, we can do some interpretation. wt, the car's weight in units of 1000 lbs, is strongly related to mpg: holding horsepower and transmission type constant, a car that weighs 1000 lbs more is expected to get about 2.88 fewer miles per (US) gallon. This result is important because it shows that weight is essential for MPG. Another important variable is hp, gross horsepower, which is statistically significant at the 0.1% level (p = 0.0005). Because adding hp and wt shrinks the transmission coefficient from 7.25 to 2.08, hp and wt act as confounding variables in the relationship between transmission type and miles per gallon. Additionally, hp, wt and am together explain about 84% of the variance in MPG (adjusted R-squared 0.82). Finally, manual cars are estimated to get about 2.1 more MPG than automatic cars when weight and gross horsepower are held constant, but this difference is not statistically significant at the 5% level (p = 0.14).
To check for heteroskedasticity and non-normality of the residuals, I will plot the residuals and examine their behaviour.
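Confidence intervals make the size and uncertainty of each effect explicit. The following sketch refits the same three-predictor model and extracts 95% intervals; the object name `ci` is an assumption made for this example.

```r
# 95% confidence intervals for the hp + wt + am model.
data(mtcars)
mtcars2 <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))

amodel <- lm(mpg ~ hp + wt + am, data = mtcars2)

# Each row gives the lower and upper bound for one coefficient.
ci <- confint(amodel, level = 0.95)
round(ci, 3)

# The "ammanual" row is the adjusted manual-vs-automatic difference;
# an interval that spans zero means the sign is not firmly established.
ci["ammanual", ]
```

The intervals for hp and wt lie entirely below zero, while the interval for ammanual includes zero, consistent with its p-value of 0.14.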
par(mfrow = c(2, 2))
plot(amodel)
From the first graph, the points in the Residuals vs. Fitted plot appear randomly scattered with no obvious pattern, which supports the linearity assumption and suggests that the residuals do not depend on the fitted values. The second plot, the Normal Q-Q plot, suggests that the residuals are approximately normally distributed because most points lie close to the dotted line. The third graph, Scale-Location, shows points scattered in a roughly horizontal band, indicating approximately constant variance.
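The visual diagnostics can be complemented with formal tests. The sketch below uses the Shapiro-Wilk test for normality and a hand-computed Koenker (studentized) version of the Breusch-Pagan test for constant variance, so that only base R is needed. This is a substitute chosen for illustration, not part of the original analysis, and the names `res`, `sw`, `aux`, `bp_stat` and `bp_p` are assumptions.

```r
# Formal checks on the residuals of the hp + wt + am model.
data(mtcars)
mtcars2 <- within(mtcars, am <- factor(am, labels = c("automatic", "manual")))
amodel  <- lm(mpg ~ hp + wt + am, data = mtcars2)
res     <- residuals(amodel)

# Normality: Shapiro-Wilk test (null hypothesis: residuals are normal).
sw <- shapiro.test(res)

# Constant variance: regress squared residuals on the predictors.
# Under homoskedasticity, n * R^2 is approximately chi-squared with
# degrees of freedom equal to the number of predictors (3 here).
mtcars2$res_sq <- res^2
aux     <- lm(res_sq ~ hp + wt + am, data = mtcars2)
bp_stat <- nrow(mtcars2) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 3, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
```

Small p-values would signal a departure from normality or constant variance. With only 32 observations both tests have limited power, so they are best read alongside the plots rather than in place of them.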
“Is an automatic or manual transmission better for MPG”
"Quantify the MPG difference between automatic and manual transmissions"
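Using linear regression analysis, we are now able to answer the initial questions. Taken on its own, manual transmission is associated with significantly better MPG: manual cars average 7.25 MPG more than automatic cars (p < 0.001). Much of that gap is explained by weight and horsepower. When weight and gross horsepower are held constant, manual cars get an estimated 2.1 more MPG than automatic cars, and that adjusted difference is not statistically significant at the 5% level (p = 0.14). Hence, manual transmissions achieve a higher MPG than automatic transmissions in this data set, but the advantage attributable to the transmission itself is small and uncertain.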