Analysis of the effect of automobile transmission type on the MPG

Summary

The aim of this report was to determine whether an automatic or a manual transmission is better for MPG, and to quantify the MPG difference between the two. The data were taken from the 1974 Motor Trend US magazine. To account for possible confounding factors, a linear model of MPG as a function of other variables was built, and the questions were then answered while controlling for those confounders.

Loading the data

The data used for this analysis are the mtcars data set included with the standard R data sets. They were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). The data set was loaded, and the am variable was converted into a factor with the levels “Automatic” and “Manual”.

Exploratory Data Analysis

Exploratory data analysis was performed by plotting MPG grouped by transmission type (Figure 1). There appears to be a difference in MPG between the two transmission types. However, other variables could confound this relationship. Therefore, to test whether a car’s transmission type affects MPG, it is important to control for factors that could mask or exaggerate it.
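The unadjusted difference suggested by the boxplot can be put into numbers with a quick check. The sketch below is illustrative and not part of the original appendix; it assumes the raw 0/1 coding of am in the built-in mtcars data, and the names groupMeans, rawDiff and rawTest are introduced here only for illustration.

```r
# Unadjusted comparison of MPG by transmission type (am: 0 = automatic, 1 = manual)
data(mtcars)

# Mean MPG within each transmission group
groupMeans <- tapply(mtcars$mpg, mtcars$am, mean)

# Raw difference in means: manual minus automatic
rawDiff <- unname(groupMeans["1"] - groupMeans["0"])

# Welch two-sample t-test of the same difference, ignoring all other variables
rawTest <- t.test(mpg ~ am, data = mtcars)

print(groupMeans)
print(rawDiff)
print(rawTest$p.value)
```

The raw gap is roughly 7 MPG in favor of manual cars, but it ignores every other variable, which is why an adjusted model is needed before drawing conclusions.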

Model Selection

Initially, a linear model of MPG versus all other variables was constructed and assigned to the fitAll variable. The summary of its coefficients is shown in Table 1. Since this model probably suffers from overfitting, an attempt was made to reduce the number of predictor variables. To do that, the function regsubsets from the leaps package was used. It makes it possible to select the model with the highest adjusted \(R^2\) and the lowest number of predictors. The intercept is included in all analyzed predictor combinations (Figure 2). It appears that the best model contains the intercept, weight, \(1/4\) mile time and transmission type as predictors (Figure 2), since it shows the largest adjusted \(R^2\) with the smallest number of predictors.

This model was created and assigned to the fit variable. Its coefficients are presented in Table 2. The table shows that the coefficient for the transmission type variable is significantly different from 0 at \(\alpha\) = 0.05 (p = 0.0467). The estimated increase in MPG when switching from automatic to manual transmission, while holding weight and quarter-mile time constant, is 2.94. The 95% confidence interval for this increase is [0.046, 5.826].
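The interval quoted for the transmission coefficient can be reproduced from the estimate and its standard error. The following is a minimal sketch, assuming the same model formula as in the appendix; the names est, se, tCrit, manualCI and builtinCI are introduced here for illustration.

```r
# Refit the selected model and recover the interval for the transmission coefficient
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Estimate and standard error for switching from automatic to manual
est <- coef(summary(fit))["amManual", "Estimate"]
se  <- coef(summary(fit))["amManual", "Std. Error"]

# 95% interval by hand: estimate +/- t quantile * standard error
tCrit <- qt(0.975, df = df.residual(fit))
manualCI <- c(est - tCrit * se, est + tCrit * se)

# The same interval from the built-in helper
builtinCI <- confint(fit, "amManual", level = 0.95)

print(manualCI)
print(builtinCI)
```

Because the lower bound lies only slightly above 0, the evidence for a transmission effect is modest even though it passes the 0.05 threshold.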

Since the fit model is nested within the fitAll model, ANOVA was used to compare them (Table 3). The test is not significant (p = 0.86), so the additional predictors in fitAll do not significantly improve the fit. I conclude that they can be excluded from the final model.
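The F statistic of this nested-model comparison can be recomputed by hand from the residual sums of squares and degrees of freedom reported in Table 3. The sketch below copies those values directly; the variable names are illustrative.

```r
# Values copied from Table 3
rssReduced <- 169.2859  # residual sum of squares of fit (28 residual df)
rssFull    <- 147.4944  # residual sum of squares of fitAll (21 residual df)
dfReduced  <- 28
dfFull     <- 21

# Extra degrees of freedom and sum of squares explained by the additional predictors
extraDf <- dfReduced - dfFull
extraSS <- rssReduced - rssFull

# F statistic: mean square gained per extra parameter
# divided by the residual mean square of the full model
fStat <- (extraSS / extraDf) / (rssFull / dfFull)

# Upper-tail probability under the F distribution with (7, 21) degrees of freedom
pValue <- pf(fStat, extraDf, dfFull, lower.tail = FALSE)

print(fStat)
print(pValue)
```

An F statistic below 1 means the seven extra predictors reduce the residual sum of squares by less, per parameter, than the residual variance of the full model.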

Model Diagnostics

To validate the fit model, an analysis of residuals was performed by plotting the model (Figure 3). The upper left plot indicates that the residuals are roughly symmetrically distributed around 0. The Q-Q plot of the standardized residuals of the fit model against standard normal quantiles is shown in the upper right panel of Figure 3. The points approximately follow a straight line, which is consistent with normally distributed residuals. The Scale-Location plot is shown in the lower left panel of Figure 3. The residuals appear approximately homoscedastic: the points form a roughly random horizontal band, although the trend line has a slight slope. The Residuals vs Leverage plot is shown in the lower right panel of Figure 3. All points lie within the Cook’s distance lines, which indicates that there are no highly influential points.
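The visual checks can be complemented with numeric diagnostics. The sketch below is an illustrative addition, not part of the original analysis: it applies a Shapiro-Wilk test to the residuals and lists the cars with the largest Cook's distance. The names normTest, cooksD and leverage are introduced here for illustration.

```r
# Refit the selected model
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
normTest <- shapiro.test(residuals(fit))

# Cook's distance and leverage (hat values) for every car
cooksD   <- cooks.distance(fit)
leverage <- hatvalues(fit)

# Normality p-value, the three cars with the largest Cook's distance,
# and the car with the highest leverage
print(normTest$p.value)
print(head(sort(cooksD, decreasing = TRUE), 3))
print(names(which.max(leverage)))
```

These numbers should be read alongside Figure 3 rather than instead of it, since with 32 observations formal tests have limited power.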

Appendix

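# Load the packages used in this analysis
library(datasets)
library(knitr)    # kable() for tables
library(ggplot2)  # plotting
library(leaps)    # regsubsets() for subset selection
library(MASS)
library(gvlma)

# Load the data and label the transmission factor (0 = automatic, 1 = manual)
data(mtcars)
mtcars$am <- factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")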

?mtcars
# Figure 1: MPG by transmission type
ggplot(mtcars, aes(x = am, y = mpg)) +
    geom_boxplot(aes(color = am)) +
    geom_jitter(width = 0.25) +
    xlab("transmission") +
    ylab("miles per gallon")

fitAll <- lm(mpg ~ ., data = mtcars)
kable(summary(fitAll)$coefficients, caption = "Coefficients of the linear model of MPG as a function of all other variables in the `mtcars` data set.")

mpgMod <- regsubsets(mpg ~ ., data = mtcars, nbest = 4)
plot(mpgMod, scale = "adjr2")

fit <- lm(mpg ~ wt + qsec + am, data = mtcars)
kable(summary(fit)$coefficients, caption = "Coefficients of the improved linear model of MPG as a function of weight, quarter-mile time and transmission type in the `mtcars` data set.")

kable(anova(fit, fitAll), caption = "Comparison of the improved model with the model that includes all predictor variables by ANOVA")

par(mfrow = c(2, 2)); plot(fit)

Table 1. Coefficients of the linear model of MPG as a function of all other variables in the `mtcars` data set.
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	12.3033742	18.7178844	0.6573058	0.5181244
cyl	-0.1114405	1.0450234	-0.1066392	0.9160874
disp	0.0133352	0.0178575	0.7467585	0.4634887
hp	-0.0214821	0.0217686	-0.9868407	0.3349553
drat	0.7871110	1.6353731	0.4813036	0.6352779
wt	-3.7153039	1.8944143	-1.9611887	0.0632522
qsec	0.8210407	0.7308448	1.1234133	0.2739413
vs	0.3177628	2.1045086	0.1509915	0.8814235
amManual	2.5202269	2.0566506	1.2254035	0.2339897
gear	0.6554130	1.4932600	0.4389142	0.6652064
carb	-0.1994193	0.8287525	-0.2406258	0.8121787

Table 2. Coefficients of the improved linear model of MPG as a function of weight, quarter-mile time and transmission type in the `mtcars` data set.
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	9.617781	6.9595930	1.381946	0.1779152
wt	-3.916504	0.7112016	-5.506882	0.0000070
qsec	1.225886	0.2886696	4.246676	0.0002162
amManual	2.935837	1.4109045	2.080819	0.0467155

Table 3. Comparison of the improved model with the model that includes all predictor variables by ANOVA
Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
28	169.2859	NA	NA	NA	NA
21	147.4944	7	21.7915	0.4432337	0.8636073

Figure 1. MPG by transmission type. All data from the mtcars data set.

Figure 2. Adjusted \(R^2\) values for different combinations of predictor variables

Figure 3. Analysis of residuals