Comparing Mileage by Transmission Type

Executive Summary

In the final model, manual transmission was associated with an estimated mileage 11.57 mpg higher than automatic transmission (p = 0.009; t value = 2.83; 95% CI 3.17 to 19.97), adjusting for weight (wt) and number of cylinders (cyl). Because the model includes a weight-by-transmission interaction, this coefficient is the estimated difference at wt = 0, and the difference shrinks by 4.07 mpg for every additional 1,000 lb of weight. For cars with automatic transmission, mileage decreased by 2.40 mpg for every 1-unit (1,000 lb) increase in weight, holding the number of cylinders constant. Compared with 4-cylinder cars, mileage was 2.71 mpg lower for 6-cylinder cars (p = 0.056) and 4.78 mpg lower for 8-cylinder cars (p = 0.005), holding weight and type of transmission constant. The interaction between weight and transmission was significant (p = 0.007), suggesting that the relationship between miles per gallon and weight varies by type of transmission.

A brief description of the data

For this analysis we use the dataset mtcars, which is included with every standard installation of R. The data comprise fuel consumption (mpg) and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models): number of cylinders (cyl), engine displacement (disp), gross horsepower (hp), rear axle ratio (drat), weight (wt), quarter mile time (qsec), engine shape (vs), type of transmission (am), number of forward gears (gear), and number of carburetors (carb). We focus on one question: "Which type of transmission (automatic or manual) produces better mileage (more miles per gallon, MPG)?" Below is a comparison of the distribution and mean (represented by the green lines) of miles per gallon for cars with automatic and manual transmission.

Mileage (Miles/Gallon or MPG) by Transmission Type

Shapiro.Wilk.normality.test	Automatic	Manual
statistic.w	0.98	0.95
p.value	0.89	0.54

Assuming normality of our data (Shapiro-Wilk p-values of 0.89 and 0.54 for automatic and manual cars, respectively) and that random sampling was performed, the difference in average mileage between cars with automatic and manual transmission is significant, with a p-value of 0.0014 (t = -3.77, 95% CI = -11.28 to -3.21).

	t	deg.f	p.val	low.CI	upp.CI	auto	manual
t.test_mpg~am	-3.77	18.33	0.0014	-11.28	-3.21	17.15	24.39
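
The two tables above can be reproduced with base R alone. The following is a minimal sketch, not necessarily the code used for this report; the object names `sw`, `sw_table` and `tt` are illustrative assumptions.

```r
# Shapiro-Wilk normality test of mpg within each transmission group
# (am: 0 = automatic, 1 = manual)
sw <- lapply(split(mtcars$mpg, mtcars$am), shapiro.test)
sw_table <- sapply(sw, function(x) {
  c(statistic.w = unname(x$statistic), p.value = x$p.value)
})
colnames(sw_table) <- c("Automatic", "Manual")
print(round(sw_table, 2))

# Welch two-sample t-test (unequal variances, the default for t.test)
tt <- t.test(mpg ~ am, data = mtcars)
print(tt)

# The pieces reported in the table: t, degrees of freedom, p-value,
# confidence interval for (automatic - manual), and the two group means
tt$statistic
tt$parameter
tt$p.value
tt$conf.int
tt$estimate
```

The confidence interval is for the automatic mean minus the manual mean, which is why both limits are negative.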

Fitting our Model

Among the continuous variables, weight (wt) is the most strongly correlated with mileage, so we use it as our initial predictor together with the type of transmission (factor variable am).

	mpg	disp	hp	drat	wt	qsec
mpg	1	-0.8475514	-0.7761684	0.6811719	-0.8676594	0.418684
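
The correlation row above can be obtained as sketched below. The variable selection mirrors the columns of the table; the names `vars`, `cors` and `strongest` are illustrative assumptions.

```r
# Pearson correlations between mpg and the continuous variables
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
cors <- cor(mtcars[, vars])["mpg", ]
print(cors)

# Rank the predictors by absolute correlation with mpg (dropping mpg itself)
predictors <- cors[names(cors) != "mpg"]
print(predictors[order(-abs(predictors))])

# Name of the predictor with the largest absolute correlation
strongest <- names(which.max(abs(predictors)))
strongest
```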

Adding the other variables to our models resulted in the model mpg ~ wt * factor(am) + factor(cyl), which gave the best fit (R squared value = 0.877). All coefficients are significant, with confidence intervals that do not include 0, except the coefficient for 6-cylinder cars (p = 0.056, 95% CI -5.500 to 0.080).

	Estimate	Std.Error	t.value	P.Value	2.5 %	97.5 %
(Intercept)	29.775	2.840	10.483	0.000	23.936	35.613
wt	-2.399	0.844	-2.842	0.009	-4.134	-0.664
factor(am)1	11.569	4.088	2.830	0.009	3.166	19.971
factor(cyl)6	-2.710	1.357	-1.996	0.056	-5.500	0.080
factor(cyl)8	-4.776	1.556	-3.070	0.005	-7.974	-1.578
wt:factor(am)1	-4.068	1.397	-2.911	0.007	-6.940	-1.196
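
Because the model contains an interaction, the transmission effect depends on weight. The sketch below shows one way to read the coefficients: it derives the weight slope for each transmission type and the weight at which the two fitted lines cross. The names `fit`, `slope_auto`, `slope_manual`, `am_effect` and `crossover` are illustrative assumptions, not part of the original analysis.

```r
# Final model: weight-by-transmission interaction, adjusted for cylinders
fit <- lm(mpg ~ wt * factor(am) + factor(cyl), data = mtcars)
b <- coef(fit)

# Coefficient table with 95% confidence intervals
print(round(cbind(summary(fit)$coef, confint(fit)), 3))

# Change in mpg per 1,000 lb for each transmission type
slope_auto   <- unname(b["wt"])
slope_manual <- unname(b["wt"] + b["wt:factor(am)1"])

# Estimated manual-minus-automatic difference in mpg at a given weight
# (weight in units of 1,000 lb, same number of cylinders)
am_effect <- function(w) unname(b["factor(am)1"] + b["wt:factor(am)1"] * w)

# Weight at which the estimated difference is zero
crossover <- unname(-b["factor(am)1"] / b["wt:factor(am)1"])

c(slope_auto = slope_auto, slope_manual = slope_manual, crossover = crossover)
am_effect(c(2, 3, 4))
```

Below the crossover weight the fitted manual line lies above the automatic line, and above it the order reverses, so the transmission coefficient on its own should not be read as a constant advantage.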

Regression Diagnostics

The plots referred to in this section may be viewed in the Appendix section. The sum of our residuals is -3.33 x 10^-16, which is effectively zero. The points on the plot of the Residuals vs. Fitted values are randomly scattered, and the non-constant variance score test (a test for heteroscedasticity) is not significant (p = 0.113), which suggests that the error variance does not change with the level of the fitted values.
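
These two checks could be run as in the sketch below. It assumes the non-constant variance score test refers to `ncvTest()` from the car package; that call is guarded so the rest of the example still runs when car is not installed. The names `fit` and `res_sum` are illustrative.

```r
# Final model from the previous section
fit <- lm(mpg ~ wt * factor(am) + factor(cyl), data = mtcars)

# Residuals of a least-squares fit with an intercept sum to zero,
# up to floating-point error
res_sum <- sum(resid(fit))
res_sum

# Score test for non-constant error variance (assumes the car package)
if (requireNamespace("car", quietly = TRUE)) {
  print(car::ncvTest(fit))
}
```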

The Q-Q plot and the shapiro.test for normality show a normal distribution of the residuals (p = 0.103). The Scale-Location plot and the Residuals vs Leverage plot identified 3 points of interest which depart from the cluster of data points. We further examine these data points for their influence on our model using the function influence.measures.

Using the outlierTest function from the car package, we have identified Fiat 128 as an outlier.
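
The residual normality test and the influence checks could look like the following sketch. `shapiro.test()`, `influence.measures()` and `rstudent()` are base R; `outlierTest()` is assumed to come from the car package and is guarded accordingly. The names `sw_res`, `infl`, `flagged`, `rs` and `most_extreme` are illustrative.

```r
fit <- lm(mpg ~ wt * factor(am) + factor(cyl), data = mtcars)

# Shapiro-Wilk test on the model residuals
sw_res <- shapiro.test(resid(fit))
print(sw_res)

# Influence measures (DFBETAS, DFFITS, covariance ratio, Cook's distance, hat);
# is.inf marks the observations that exceed the usual cut-offs
infl <- influence.measures(fit)
flagged <- rownames(infl$is.inf)[apply(infl$is.inf, 1, any)]
flagged

# Observation with the largest absolute studentized residual
rs <- rstudent(fit)
most_extreme <- names(which.max(abs(rs)))
most_extreme

# Bonferroni outlier test on the studentized residuals (assumes the car package)
if (requireNamespace("car", quietly = TRUE)) {
  print(car::outlierTest(fit))
}
```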

In the interest of reproducible research, the code for this analysis is available upon request.

Appendix

fig.1 Plotting mileage (mpg) vs weight by transmission (am)

library(ggplot2)
g <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(am)))
g <- g + geom_point(size = 6, colour = "black") + geom_point(size = 4)
g <- g + xlab("Weight (1000 lbs)") + ylab("mpg")
g

fig.2 plotting mpg and weight by transmission type with our regression line mpg ~ wt

1 line, 1 intercept, 1 slope

fitwt <- lm(mpg ~ wt, data = mtcars)
g1 <- g
g1 <- g1 + geom_abline(intercept = coef(fitwt)[1], slope = coef(fitwt)[2], size = 2)
g1

library(knitr)  # provides kable()
model1 <- lm(mpg ~ wt, data = mtcars)
kable(summary(model1)$coef)

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	37.285126	1.877627	19.857575	0
wt	-5.344472	0.559101	-9.559044	0

summary(model1)$r.squared

[1] 0.7528328

summary(model1)$adj.r.squared

[1] 0.7445939

fig.3 plotting mpg and weight by transmission type with our regression line mpg ~ wt + factor(am)

2 lines, 2 intercepts, 1 slope. The lines are very close to each other.

fitwt_am <- lm(mpg ~ wt + factor(am), data = mtcars)
g2 <- g
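# automatic (am = 0): baseline intercept; red matches the default point colour and figs. 4-5
g2 <- g2 + geom_abline(intercept = coef(fitwt_am)[1], slope = coef(fitwt_am)[2], size = 1, col = "red")
# manual (am = 1): intercept shifted by the factor(am)1 coefficient, same slope
g2 <- g2 + geom_abline(intercept = coef(fitwt_am)[1] + coef(fitwt_am)[3], slope = coef(fitwt_am)[2], size = 1, col = "blue")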
g2

model2 <- lm(mpg ~ wt + factor(am), data = mtcars)
kable(summary(model2)$coef)

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	37.3215513	3.0546385	12.2179928	0.0000000
wt	-5.3528114	0.7882438	-6.7908072	0.0000002
factor(am)1	-0.0236152	1.5456453	-0.0152786	0.9879146

summary(model2)$r.squared

[1] 0.7528348

summary(model2)$adj.r.squared

[1] 0.7357889

fig.4 plotting mpg and weight by transmission type with our regression line mpg ~ wt * factor(am)

2 lines, 2 intercepts, 2 slopes. Interaction between weight and transmission

fitwt_Iam <- lm(mpg ~ wt * factor(am), data = mtcars)
g3 <- g
g3 <- g3 + geom_abline(intercept = coef(fitwt_Iam)[1], slope = coef(fitwt_Iam)[2], size = 2, col = "red")
g3 <- g3 + geom_abline(intercept = coef(fitwt_Iam)[1] + coef(fitwt_Iam)[3], slope = coef(fitwt_Iam)[2] + coef(fitwt_Iam)[4], size = 2, col = "blue")
g3

model3 <- lm(mpg ~ wt * factor(am), data = mtcars)
kable(summary(model3)$coef)

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	31.416055	3.0201093	10.402291	0.0000000
wt	-3.785907	0.7856478	-4.818836	0.0000455
factor(am)1	14.878422	4.2640422	3.489276	0.0016210
wt:factor(am)1	-5.298361	1.4446993	-3.667449	0.0010171

summary(model3)$r.squared

[1] 0.8330375

summary(model3)$adj.r.squared

[1] 0.8151486

fig.5 plotting mpg and weight by transmission type with our regression line mpg ~ wt * factor(am) + factor(cyl)

2 lines, 2 intercepts, 2 slopes. Interaction between weight and transmission, with adjustment for number of cylinders. The lines shown are for the reference level (4 cylinders).

fitwt_Iamcyl <- lm(mpg ~ wt * factor(am) + factor(cyl), data = mtcars)
g4 <- g
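# automatic (am = 0), 4-cylinder reference level
g4 <- g4 + geom_abline(intercept = coef(fitwt_Iamcyl)[1], slope = coef(fitwt_Iamcyl)[2], size = 2, col = "red")
# manual (am = 1): coefficient 3 shifts the intercept, coefficient 6 (wt:factor(am)1) shifts the slope
g4 <- g4 + geom_abline(intercept = coef(fitwt_Iamcyl)[1] + coef(fitwt_Iamcyl)[3], slope = coef(fitwt_Iamcyl)[2] + coef(fitwt_Iamcyl)[6], size = 2, col = "blue")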
g4

model4 <- lm(mpg ~ wt * factor(am) +  factor(cyl), data = mtcars)
kable(summary(model4)$coef)

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	29.774836	2.8403415	10.482836	0.0000000
wt	-2.398713	0.8439884	-2.842116	0.0086039
factor(am)1	11.568790	4.0877912	2.830083	0.0088538
factor(cyl)6	-2.709777	1.3573517	-1.996371	0.0564651
factor(cyl)8	-4.776110	1.5558306	-3.069814	0.0049646
wt:factor(am)1	-4.067981	1.3974151	-2.911075	0.0072955

summary(model4)$r.squared

[1] 0.8774548

summary(model4)$adj.r.squared

[1] 0.8538884

model_comparison <- anova(model1, model2, model3, model4)
print(kable(model_comparison))

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
30	278.3219	NA	NA	NA	NA
29	278.3197	1	0.0022403	0.0004221	0.9837651
28	188.0077	1	90.3120314	17.0163295	0.0003372
26	137.9917	2	50.0159319	4.7119280	0.0179394

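library(car)  # provides vif()
# sqrt() is applied to every column of the GVIF table, so each value shown
# below (including the Df column) is the square root of what vif() returns
kable(round(sqrt(vif(model4)), digits = 1))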

	GVIF	Df	GVIF^(1/(2*Df))
wt	2.0	1.0	1.4
factor(am)	4.9	1.0	2.2
factor(cyl)	1.7	1.4	1.2
wt:factor(am)	4.3	1.0	2.1

fig.6-10 plot of residuals

par(mfrow=c(2,2))
plot(fitwt_Iamcyl)