Executive Summary

In the final model, the coefficient for manual transmission was 11.57 miles per gallon (mpg) relative to automatic transmission (p = 0.009; t value = 2.83, 95% CI 3.17 to 19.98), holding the number of cylinders (cyl) constant. Because the model includes a weight-by-transmission interaction, this is the estimated difference at a weight (wt) of 0, and the advantage of manual transmission shrinks as weight increases. For cars with automatic transmission, mileage decreased by 2.4 mpg for every 1-unit (1000 lb) increase in weight, holding the number of cylinders constant. Compared with 4-cylinder cars, mileage was 2.7 mpg lower for 6-cylinder cars (p = 0.056) and 4.8 mpg lower for 8-cylinder cars (p = 0.005), holding weight and type of transmission constant. The interaction between weight and transmission was significant (p = 0.007), suggesting that the relationship between miles per gallon and weight varies by type of transmission.

A brief description of the data

For this analysis we use the dataset mtcars, which is included with every standard installation of R. The data comprise fuel consumption (mpg) and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models): number of cylinders (cyl), engine displacement (disp), gross horsepower (hp), rear axle ratio (drat), weight (wt), quarter mile time (qsec), engine shape (vs), type of transmission (am), number of forward gears (gear), and number of carburetors (carb). We focus on one question: “Which type of transmission (automatic or manual) produces better mileage (more miles per gallon, MPG)?” Below is a comparison of the distribution and mean (represented by the green lines) of miles per gallon for cars with automatic and manual transmission.

Mileage (Miles/Gallon or MPG) by Transmission Type

Shapiro.Wilk.normality.test  Automatic  Manual
statistic.w                  0.98       0.95
p.value                      0.89       0.54
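
The per-group normality check summarized in the table above could be reproduced along the following lines. This is a minimal sketch in base R using the built-in mtcars data; the object names `sw` and `sw_table` are illustrative and do not come from the original analysis.

```r
# Shapiro-Wilk normality test of mpg within each transmission group
# (am: 0 = automatic, 1 = manual). Object names here are illustrative.
sw <- lapply(split(mtcars$mpg, mtcars$am), shapiro.test)

# Collect the W statistic and p-value of each test into a small table
sw_table <- sapply(sw, function(x) c(statistic.w = unname(x$statistic),
                                     p.value     = x$p.value))
colnames(sw_table) <- c("Automatic", "Manual")
round(sw_table, 2)
```

A large p-value here means the test does not reject normality; it does not prove that the data are normally distributed, which matters with groups this small.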

The Shapiro-Wilk test did not reject normality of mileage in either group (p = 0.89 for automatic, p = 0.54 for manual). Assuming normality and random sampling, the difference in mean mileage between cars with automatic (17.15 mpg) and manual (24.39 mpg) transmission is significant (Welch two-sample t-test: t = -3.77, df = 18.33, p = 0.0014, 95% CI -11.28 to -3.21).

               t      deg.f  p.val  low.CI  upp.CI  auto   manual
t.test_mpg~am  -3.77  18.33  0.00   -11.28  -3.21   17.15  24.39
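
The comparison of group means above can be sketched as follows. The fractional degrees of freedom in the table indicate the unequal-variance (Welch) form of the test, which is the default of `t.test()`; the names `tt` and `tt_row` are illustrative.

```r
# Welch two-sample t-test of mpg by transmission type
# (am: 0 = automatic, 1 = manual). t.test() does not assume equal
# variances by default, which is why the degrees of freedom are fractional.
tt <- t.test(mpg ~ am, data = mtcars)

# Gather the quantities reported in the table above into one named vector
tt_row <- c(t      = unname(tt$statistic),
            deg.f  = unname(tt$parameter),
            p.val  = tt$p.value,
            low.CI = tt$conf.int[1],
            upp.CI = tt$conf.int[2],
            auto   = unname(tt$estimate[1]),
            manual = unname(tt$estimate[2]))
round(tt_row, 4)
```

The sign of t and of the confidence limits is negative because the difference is taken as automatic minus manual.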

Fitting our Model

Among the continuous variables, weight (wt) is the most strongly correlated with mileage, so we use it as our initial predictor together with the type of transmission (factor variable am).

     mpg  disp        hp          drat       wt          qsec
mpg  1    -0.8475514  -0.7761684  0.6811719  -0.8676594  0.418684
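
The correlation row above, and the choice of weight as the first predictor, can be reproduced with a sketch like this one. The selection of columns in `cont_vars` is an assumption made to match the table; the object names are illustrative.

```r
# Pearson correlations of mpg with the continuous variables in mtcars.
# cont_vars is an assumed selection chosen to match the table above.
cont_vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
cor_mpg   <- cor(mtcars[, cont_vars])["mpg", , drop = FALSE]
cor_mpg

# Predictor with the largest absolute correlation with mpg
# (the first column is mpg itself, so it is dropped before ranking)
strongest <- names(which.max(abs(cor_mpg[1, -1])))
strongest
```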

Adding the other variables to our models led to the model mpg ~ wt * factor(am) + factor(cyl), which had the best fit (R squared value = 0.877). All of its coefficients are significant, with confidence intervals that exclude 0, except factor(cyl)6 (p = 0.056, 95% CI -5.500 to 0.080).

                Estimate  Std.Error  t.value    P.Value  2.5 %   97.5 %
(Intercept)     29.775    2.840      10.482836  0.000    23.936  35.613
wt              -2.399    0.844      -2.842116  0.009    -4.134  -0.664
factor(am)1     11.569    4.088      2.830083   0.009    3.166   19.971
factor(cyl)6    -2.710    1.357      -1.996371  0.056    -5.500  0.080
factor(cyl)8    -4.776    1.556      -3.069814  0.005    -7.974  -1.578
wt:factor(am)1  -4.068    1.397      -2.911075  0.007    -6.940  -1.196
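
Because the model contains an interaction, the wt and factor(am)1 coefficients cannot be read in isolation. Using the estimates in the table, the weight slope for manual cars is about -2.40 + (-4.07) = -6.47 mpg per 1000 lb, and the fitted automatic and manual lines cross near 11.57 / 4.07 ≈ 2.84 (1000 lb). The sketch below refits the model and derives these quantities; the object names are illustrative.

```r
# Final model: weight-by-transmission interaction plus adjustment for cylinders
fit <- lm(mpg ~ wt * factor(am) + factor(cyl), data = mtcars)

# Coefficient table with 95% confidence intervals
coef_table <- cbind(summary(fit)$coefficients, confint(fit))
round(coef_table, 3)

# With the interaction term, the slope on weight differs by transmission
b <- coef(fit)
slope_auto   <- b[["wt"]]                           # automatic cars (am = 0)
slope_manual <- b[["wt"]] + b[["wt:factor(am)1"]]   # manual cars (am = 1)

# Weight (in 1000 lb) at which the manual and automatic fitted lines cross;
# below it the model predicts better mileage for manual cars, above it worse
crossover_wt <- -b[["factor(am)1"]] / b[["wt:factor(am)1"]]

round(c(slope_auto = slope_auto, slope_manual = slope_manual,
        crossover_wt = crossover_wt), 2)
```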

Regression Diagnostics

The plots referred to in this section may be viewed in the Appendix. The sum of our residuals is -3.3306691 × 10^{-16}, which is zero up to floating-point error. The points on the plot of Residuals vs. Fitted values are randomly scattered, and the Non-constant Variance Score Test is not significant (p = 0.113), which suggests that the error variance does not change with the level of the fitted values (that is, there is no evidence of heteroscedasticity).
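
The report does not show how the score test was run. Its name matches the output header of `ncvTest()` in the car package, but that attribution is an assumption. The sketch below computes a score test of the same Breusch-Pagan type against the fitted values using base R only, so it may not reproduce the reported p-value exactly; all object names are illustrative.

```r
fit <- lm(mpg ~ wt * factor(am) + factor(cyl), data = mtcars)

# Residuals of a least-squares fit with an intercept sum to zero
# (up to floating-point rounding)
sum(residuals(fit))

# Score test for non-constant variance against the fitted values:
# 1. scale the squared residuals by the ML variance estimate (RSS / n),
# 2. regress the scaled values on the fitted values,
# 3. compare half the regression sum of squares with chi-squared on 1 df.
e2      <- residuals(fit)^2
u       <- e2 / mean(e2)
aux     <- lm(u ~ fitted(fit))
reg_ss  <- sum((fitted(aux) - mean(u))^2)
score   <- reg_ss / 2
p_value <- pchisq(score, df = 1, lower.tail = FALSE)
c(chisq = score, p = p_value)
```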

The Q-Q plot and the Shapiro-Wilk test (shapiro.test) are consistent with normally distributed residuals (p = 0.103). The Scale-Location plot and the Residuals vs Leverage plot identified 3 points of interest which depart from the cluster of data points. We further examined these data points for their influence on our model using the function influence.measures.
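
The residual normality check and the influence screening described above can be sketched in base R as follows. Object names are illustrative, and which observations get flagged depends on the default cutoffs used by `influence.measures()`.

```r
fit <- lm(mpg ~ wt * factor(am) + factor(cyl), data = mtcars)

# Shapiro-Wilk test on the model residuals
sw_res <- shapiro.test(residuals(fit))
sw_res$p.value

# influence.measures() computes DFBETAS, DFFITS, covariance ratios,
# Cook's distances and hat values, and marks cases that exceed its cutoffs
inf     <- influence.measures(fit)
flagged <- names(which(apply(inf$is.inf, 1, any)))
flagged

# The three largest Cook's distances, which are the points that
# plot.lm() labels on the Residuals vs Leverage panel by default
top_cook <- head(sort(cooks.distance(fit), decreasing = TRUE), 3)
top_cook
```

Calling `summary(inf)` prints the flagged rows together with the measures that triggered each flag.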

Using the outlierTest function from the car package, we identified the Fiat 128 as an outlier.
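
The outlier check is a Bonferroni-corrected test on the most extreme studentized residual. The following base R sketch shows the calculation without requiring the car package; object names are illustrative. Whether the flagged car counts as an outlier depends on the Bonferroni p-value and on the cutoff chosen.

```r
fit <- lm(mpg ~ wt * factor(am) + factor(cyl), data = mtcars)

# Studentized (deleted) residuals and the most extreme observation
rs      <- rstudent(fit)
extreme <- names(which.max(abs(rs)))

# Two-sided p-value from a t distribution with (residual df - 1) degrees
# of freedom, then a Bonferroni correction for having selected the
# largest of n residuals
n      <- length(rs)
p_raw  <- 2 * pt(abs(rs[[extreme]]), df = df.residual(fit) - 1, lower.tail = FALSE)
p_bonf <- min(1, n * p_raw)

data.frame(car = extreme, rstudent = rs[[extreme]],
           unadjusted.p = p_raw, bonferroni.p = p_bonf)
```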

In the interest of reproducible research, the code for this analysis is available upon request.

Appendix

fig.1 Plotting mileage (mpg) vs weight by transmission (am)

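library(ggplot2)
# Scatterplot of mpg against weight, coloured by transmission (0 = automatic, 1 = manual)
g <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(am)))
# Larger black points underneath give each coloured point a dark outline
g <- g + geom_point(size = 6, colour = "black") + geom_point(size = 4)
# wt is recorded in units of 1000 lbs
g <- g + xlab("Weight (1000 lbs)") + ylab("mpg")
g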

fig.2 plotting mpg and weight by transmission type with our regression line mpg ~ wt

1 line, 1 intercept, 1 slope

fitwt <- lm(mpg ~ wt, data = mtcars)
g1 <- g
g1 <- g1 + geom_abline(intercept = coef(fitwt)[1], slope = coef(fitwt)[2], size = 2)
g1

library(knitr)  # provides kable()
model1 <- lm(mpg ~ wt, data = mtcars)
kable(summary(model1)$coef)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.285126 1.877627 19.857575 0
wt -5.344472 0.559101 -9.559044 0
summary(model1)$r.squared

[1] 0.7528328

summary(model1)$adj.r.squared

[1] 0.7445939

fig.3 plotting mpg and weight by transmission type with our regression line mpg ~ wt + factor(am)

2 lines, 2 intercepts, 1 slope. The lines are very close to each other.

fitwt_am <- lm(mpg ~ wt + factor(am), data = mtcars)
g2 <- g
g2 <- g2 + geom_abline(intercept = coef(fitwt_am)[1], slope = coef(fitwt_am)[2], size = 1, col = "blue")
g2 <- g2 + geom_abline(intercept = coef(fitwt_am)[1] + coef(fitwt_am)[3], slope = coef(fitwt_am)[2], size = 1, col = "red")
g2

model2 <- lm(mpg ~ wt + factor(am), data = mtcars)
kable(summary(model2)$coef)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.3215513 3.0546385 12.2179928 0.0000000
wt -5.3528114 0.7882438 -6.7908072 0.0000002
factor(am)1 -0.0236152 1.5456453 -0.0152786 0.9879146
summary(model2)$r.squared

[1] 0.7528348

summary(model2)$adj.r.squared

[1] 0.7357889

fig.4 plotting mpg and weight by transmission type with our regression line mpg ~ wt * factor(am)

2 lines, 2 intercepts, 2 slopes. Interaction between weight and transmission

fitwt_Iam <- lm(mpg ~ wt * factor(am), data = mtcars)
g3 <- g
g3 <- g3 + geom_abline(intercept = coef(fitwt_Iam)[1], slope = coef(fitwt_Iam)[2], size = 2, col = "red")
g3 <- g3 + geom_abline(intercept = coef(fitwt_Iam)[1] + coef(fitwt_Iam)[3], slope = coef(fitwt_Iam)[2] + coef(fitwt_Iam)[4], size = 2, col = "blue")
g3

model3 <- lm(mpg ~ wt * factor(am), data = mtcars)
kable(summary(model3)$coef)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.416055 3.0201093 10.402291 0.0000000
wt -3.785907 0.7856478 -4.818836 0.0000455
factor(am)1 14.878422 4.2640422 3.489276 0.0016210
wt:factor(am)1 -5.298361 1.4446993 -3.667449 0.0010171
summary(model3)$r.squared

[1] 0.8330375

summary(model3)$adj.r.squared

[1] 0.8151486

fig.5 plotting mpg and weight by transmission type with our regression line mpg ~ wt * factor(am) + factor(cyl)

2 lines, 2 intercepts, 2 slopes. Interaction between weight and transmission, adjusted for number of cylinders. The lines shown are for the reference level of 4 cylinders.

fitwt_Iamcyl <- lm(mpg ~ wt * factor(am) + factor(cyl), data = mtcars)
g4 <- g
g4 <- g4 + geom_abline(intercept = coef(fitwt_Iamcyl )[1], slope = coef(fitwt_Iamcyl )[2], size = 2, col = "red")
g4 <- g4 + geom_abline(intercept = coef(fitwt_Iamcyl)[1] + coef(fitwt_Iamcyl)[3], slope = coef(fitwt_Iamcyl)[2] + coef(fitwt_Iamcyl)[6], size = 2, col = "blue")
g4

model4 <- lm(mpg ~ wt * factor(am) +  factor(cyl), data = mtcars)
kable(summary(model4)$coef)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.774836 2.8403415 10.482836 0.0000000
wt -2.398713 0.8439884 -2.842116 0.0086039
factor(am)1 11.568790 4.0877912 2.830083 0.0088538
factor(cyl)6 -2.709777 1.3573517 -1.996371 0.0564651
factor(cyl)8 -4.776110 1.5558306 -3.069814 0.0049646
wt:factor(am)1 -4.067981 1.3974151 -2.911075 0.0072955
summary(model4)$r.squared

[1] 0.8774548

summary(model4)$adj.r.squared

[1] 0.8538884

model_comparison <- anova(model1, model2, model3, model4)
print(kable(model_comparison))
Res.Df RSS Df Sum of Sq F Pr(>F)
30 278.3219 NA NA NA NA
29 278.3197 1 0.0022403 0.0004221 0.9837651
28 188.0077 1 90.3120314 17.0163295 0.0003372
26 137.9917 2 50.0159319 4.7119280 0.0179394
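library(car)  # provides vif()
# Note: sqrt() is applied to every column of the vif() output, so the values below are
# square roots and the Df column shows sqrt(Df) (1.4 is sqrt(2) for factor(cyl)).
kable(round(sqrt(vif(model4)), digits = 1))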
GVIF Df GVIF^(1/(2*Df))
wt 2.0 1.0 1.4
factor(am) 4.9 1.0 2.2
factor(cyl) 1.7 1.4 1.2
wt:factor(am) 4.3 1.0 2.1

fig.6-10 plot of residuals

par(mfrow=c(2,2))
plot(fitwt_Iamcyl)