In this brief report we try to answer an essential question for every car enthusiast: is an automatic or a manual transmission better for MPG? And how can we quantify the difference? This is the kind of question that regression analysis can answer, and we use the standard R dataset mtcars to show it.
First we load some R packages: datasets to obtain mtcars, and ggplot2 and GGally to obtain the pair-wise correlations and box plots needed for quick data exploration.
suppressMessages(library(datasets)) # to load mtcars
suppressMessages(library(ggplot2)) # to allow GGally to work and plot
suppressMessages(library(GGally)) # for pair-wise correlation and boxplot
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
In this section we first use the ggpairs function of the R package GGally to take a quick look at the pair-wise correlations between all the variables. The aim is to spot covariates, that is, hidden links between mpg and the transmission mode that could lead to false conclusions.
# Function to obtain nice plots with ggpairs
my_fn <- function(data, mapping, method = "loess", ...) {
  p <- ggplot(data = data, mapping = mapping) +
    geom_point() +
    geom_smooth(method = method, ...)
  p
}
# Default loess curve
ggpairs(mtcars, lower = list(continuous = my_fn))
What we see here, with the categorical variables treated as continuous for now, is a very high level of inter-correlation between the variables. For example, “am” (1 if the transmission is manual, 0 if it is automatic) is correlated with “mpg” at R=0.6, but the absolute value of the correlation with “mpg” is greater for six other predictors: “cyl” (0.85), “disp” (0.84), “hp” (0.77), “drat” (0.68), “wt” (0.86) and “vs” (0.66). Any estimate of the link between “mpg” and “am” should therefore be adjusted for these predictors to avoid false conclusions.
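The same ranking can be read without the plot matrix. The sketch below is only an illustration: it works on a fresh copy of the data (the name `cars` is ours, not part of the original analysis) so that am is still numeric, and it sorts the predictors by the absolute value of their correlation with mpg.

```r
# Fresh copy of the data, so that am is still coded 0/1 here
cars <- datasets::mtcars

# Pearson correlation of every other variable with mpg
mpg_cor <- cor(cars)["mpg", colnames(cars) != "mpg"]

# Rank the predictors by absolute correlation, strongest first
ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
round(ranked, 2)
```

The signs matter for interpretation: wt, cyl, disp and hp are negatively correlated with mpg, while drat, vs and am are positively correlated.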
Before continuing the analysis we convert the numeric am variable to a categorical one, using factor.
mtcars$am <- factor(mtcars$am,labels=c('Automatic','Manual'))
We now continue the investigation in order to answer the two questions, starting with a single-variable analysis that compares mpg between the two transmission groups.
A two-sample t-test gives a first answer:
t.test(mtcars$mpg~mtcars$am)
##
## Welch Two Sample t-test
##
## data: mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
The p-value is 0.001374, so we reject the hypothesis that the difference in means is equal to 0. The mean mileage is 17.15 mpg for automatic transmission and 24.39 mpg for manual transmission, a difference of about 7.2 mpg (95% confidence interval: 3.21 to 11.28). Manual transmission seems better for mileage.
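The numbers quoted above can also be pulled out of the test object instead of being read off the printout. This is a minimal sketch on a fresh copy of the data; `cars`, `tt` and `mean_diff` are illustrative names.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Welch two-sample t-test, same test as above but using the data argument
tt <- t.test(mpg ~ am, data = cars)

tt$estimate   # the two group means
tt$conf.int   # 95% CI for mean(Automatic) - mean(Manual), hence negative
tt$p.value

# Difference expressed as Manual minus Automatic
mean_diff <- unname(tt$estimate[2] - tt$estimate[1])
mean_diff
```

Note that the confidence interval is reported for automatic minus manual, which is why both bounds are negative.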
Box plots are useful to see the separation between the two groups more clearly.
boxplot(mpg ~ am, data = mtcars,
        xlab = "Transmission",
        ylab = "Miles per Gallon",
        main = "MPG by Transmission Type")
par(mfrow = c(2,2))
A single-predictor regression gives a linear model mpg=f(am):
model1 <- lm(mpg ~ am, data = mtcars)
summary(model1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
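The model here is mpg=17.147+7.245 amManual: the intercept is the mean mpg of automatic cars, and the amManual coefficient is the difference in mean mpg between manual and automatic cars, which quantifies the difference between the two transmissions. Note also that regression answers the two questions at the same time.
We see here that the regression model explains only 36% of the variance. We can do better!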
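With a single two-level factor as predictor, the slope is simply the difference between the two group means. The sketch below checks this on a fresh copy of the data (the names `cars`, `fit` and `group_means` are illustrative) and also extracts a confidence interval and R-squared.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = cars)

# Group means computed directly
group_means <- tapply(cars$mpg, cars$am, mean)

coef(fit)                 # intercept = automatic mean, amManual = difference
diff(group_means)         # manual mean minus automatic mean
confint(fit)              # 95% confidence intervals for both coefficients
summary(fit)$r.squared    # share of the variance explained
```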
To be more certain of the link, the effect of am should be adjusted for the strong predictors of mpg. The workhorse here is automatic stepwise model construction. Other powerful methods such as the lasso (penalized regression) can also handle the problem very efficiently. We show here the model selected by the algorithm.
Multi <- lm(mpg ~ ., data = mtcars) # full model with every predictor
Best <- step(Multi, trace = 0) # automatic stepwise model construction
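Because trace=0 hides the intermediate steps, it can help to inspect the path the algorithm followed. The sketch below reruns the selection on a fresh copy of the data (`cars`, `full` and `best` are illustrative names) and prints the `anova` component that step attaches to its result. With no scope argument, step can only drop terms from the full model, using AIC as the criterion.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

full <- lm(mpg ~ ., data = cars)
best <- step(full, trace = 0)   # backward elimination by AIC

best$anova      # one row per step: the term dropped and the resulting AIC
formula(best)   # formula of the selected model
```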
Summary of the best model
summary(Best)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## amManual 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
The best model formula found here is mpg ~ wt + qsec + am
The model that shows a better mpg for manual transmission is now mpg=9.62+2.94 amManual-3.92 wt+1.23 qsec (all three predictors significant at the 5% level), and this also quantifies the difference between the two transmissions. In this model 2.94 is the adjusted coefficient that links mpg and the transmission, down from 7.245 in the single-predictor model, and the regression model now explains 85% of the variance. The conclusion is now better supported than with the single-predictor model.
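Two quick checks can support this reading: a confidence interval for the adjusted transmission effect, and an F-test comparing the single-predictor model with the adjusted one. This is a sketch that refits both models directly from their formulas; the names `simple`, `adjusted`, `ci_am` and `cmp` are illustrative.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

simple   <- lm(mpg ~ am, data = cars)
adjusted <- lm(mpg ~ wt + qsec + am, data = cars)

# 95% confidence interval for the adjusted transmission effect
ci_am <- confint(adjusted)["amManual", ]
ci_am

# Nested-model F-test: do wt and qsec improve on the am-only model?
cmp <- anova(simple, adjusted)
cmp
```

The interval for amManual excludes zero only narrowly, which is consistent with its p-value of 0.047 in the summary above.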
The diagnostic plots suggest that the model fits adequately: the residuals show no obvious pattern and lie close to the normality line in the Q-Q plot.
par(mfrow = c(2,2))
plot(Best)
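The visual impression from the diagnostic plots can be complemented with a few numbers. The sketch below is one possible set of checks, not part of the original analysis: a Shapiro-Wilk test on the residuals, plus leverage and Cook's distance to flag cars that weigh heavily on the fit. The names `cars`, `best`, `sw`, `lev` and `cook` are illustrative.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
best <- lm(mpg ~ wt + qsec + am, data = cars)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(residuals(best))
sw

# Leverage and influence: the cars with the largest values deserve a look
lev  <- hatvalues(best)
cook <- cooks.distance(best)
head(sort(lev, decreasing = TRUE), 3)
head(sort(cook, decreasing = TRUE), 3)
```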
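Using stepwise model construction, we have shown that, after adjusting for the other strong predictors of mpg found in the mtcars dataset, manual transmission remains the better transmission for mpg, with mpg=9.62+2.94 amManual-3.92 wt+1.23 qsec: at equal weight and quarter-mile time, a manual car is estimated to get about 2.94 more miles per gallon than an automatic one (p=0.047).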