Motor Trend recently performed an analysis of the ‘mtcars’ dataset to determine the relationship between transmission type (manual vs. automatic) and fuel economy (mpg).
We found a statistically significant difference, with manual transmissions performing more efficiently than automatic ones. After controlling for confounding variables (e.g., horsepower and weight), we find that, on average, manual transmissions provide a ~2.08 mpg improvement.
To begin, we want to get a sense of the shape and quality of the data. This can be done using the ‘head’, ‘str’, and ‘summary’ functions, as well as a preliminary violin plot. The data seem clean, and the transmission variable (am) is coded as 0s and 1s: 0 represents “Automatic” and 1 represents “Manual.” The underlying code for this output is in the appendix.
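As a minimal sketch of that first look, assuming a fresh copy of the built-in dataset (the `am_counts` name is introduced here only for illustration):

```r
# Load a fresh copy of the built-in dataset (from base R's 'datasets' package)
data("mtcars")

head(mtcars)      # first six rows
str(mtcars)       # column types and dimensions
summary(mtcars)   # quartiles and mean for each column
anyNA(mtcars)     # TRUE if any value is missing

# Count cars per transmission code (0 = automatic, 1 = manual)
am_counts <- table(mtcars$am)
am_counts
```

The counts per group are worth checking early, since unequal group sizes are one reason to prefer the Welch version of the t-test used later.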
## am mpg
## 1 Automatic 17.14737
## 2 Manual 24.39231
Looking at the chart, manual transmissions appear to be more fuel-efficient. On average, their fuel economy is ~7.2 mpg higher than that of automatic transmissions (24.39 vs. 17.15 mpg). Let’s perform a t-test and a simple regression to see whether the difference is statistically significant and how much of the variation in mpg transmission explains.
The results are in the appendix. We use a two-sided t-test because the null hypothesis is that the two group means are equal. The p-value of ~0.001 indicates a statistically significant difference between the two. The regression reports an R2 of ~0.95, but because the model is fit without an intercept, R computes that figure relative to zero rather than to the mean of mpg, so it overstates how much transmission explains. The squared correlation between mpg and am (0.60 squared is about 0.36, using the correlation reported below) suggests that transmission, by itself, explains roughly 36% of the variance in mpg. Let’s see whether other variables confound this result.
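The R2 reported for the transmission-only regression comes from a fit without an intercept. As a hedged sketch (the object names `fit_no_int` and `fit_int` are illustrative, not from the original analysis), the same model can be fit both ways to compare the two R2 definitions:

```r
data("mtcars")

# Same fit as in the appendix: no intercept, one coefficient per group
fit_no_int <- lm(mpg ~ factor(am) - 1, data = mtcars)
# Same model with R's default intercept
fit_int <- lm(mpg ~ factor(am), data = mtcars)

r2_no_int <- summary(fit_no_int)$r.squared  # measured against zero
r2_int    <- summary(fit_int)$r.squared     # measured against mean(mpg)

c(no_intercept = r2_no_int, with_intercept = r2_int)

# With a single predictor, the centered R2 equals the squared correlation
cor(mtcars$mpg, mtcars$am)^2
```

Both fits produce identical fitted values; only the baseline used for R2 differs.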
We will run ‘cor’ to see how strongly each variable correlates with mpg. This helps us choose which variables to include in the model.
##        mpg        cyl       disp         hp       drat         wt
##  1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594
##       qsec         vs         am       gear       carb
##  0.4186840  0.6640389  0.5998324  0.4802848 -0.5509251
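The code for the correlation step does not appear in the appendix. One way to produce output of this shape, offered as an assumption rather than the original code, is to take the mpg row of the full correlation matrix. Note that `cor` needs numeric columns, so this has to run before am is converted to a factor:

```r
data("mtcars")  # fresh copy, so 'am' is still numeric 0/1

# Correlation of every column with mpg (the mpg row of the correlation matrix)
mpg_cor <- cor(mtcars)["mpg", ]
mpg_cor

# Rank the predictors by the strength of their association with mpg
ranked <- sort(abs(mpg_cor[names(mpg_cor) != "mpg"]), decreasing = TRUE)
ranked
```

Ranking by absolute value puts strongly negative predictors such as wt alongside strongly positive ones, which is what matters when choosing candidate covariates.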
##
## Call:
## lm(formula = mpg ~ hp + wt + factor(am) - 1, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4221 -1.7924 -0.3788 1.2249 5.5317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## hp -0.037479 0.009605 -3.902 0.000546 ***
## wt -2.878575 0.904971 -3.181 0.003574 **
## factor(am)0 34.002875 2.642659 12.867 2.82e-13 ***
## factor(am)1 36.086585 1.736338 20.783 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared: 0.9872, Adjusted R-squared: 0.9853
## F-statistic: 538.2 on 4 and 28 DF, p-value: < 2.2e-16
We included hp and wt. For each regression we used “-1” to remove the intercept, so each transmission level gets its own coefficient. This leaves the coefficients as absolute values instead of values relative to a baseline level. It also raises the reported R2, because R then measures R2 against zero rather than against the mean of mpg. The effect of manual transmissions (relative to automatic) is therefore the difference between the two coefficients: 36.087 - 34.003 = 2.084, or ~2.08 mpg.
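The difference between the two transmission coefficients can also be read off a model that keeps the intercept. The sketch below uses illustrative object names (`no_int`, `with_int`, `diff_no_int`) and shows that both routes give the same estimate:

```r
data("mtcars")

no_int   <- lm(mpg ~ hp + wt + factor(am) - 1, data = mtcars)
with_int <- lm(mpg ~ hp + wt + factor(am), data = mtcars)

# Difference between the two group coefficients in the no-intercept fit
diff_no_int <- unname(coef(no_int)["factor(am)1"] - coef(no_int)["factor(am)0"])
diff_no_int

# With an intercept, the same quantity is a single coefficient, reported
# together with its own standard error, t value, and p-value
summary(with_int)$coefficients["factor(am)1", ]
```

The with-intercept summary row is useful because it tests the manual-versus-automatic difference itself, which the no-intercept output does not report directly.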
Every coefficient in this model is significantly different from zero, and the model reports an R2 of ~0.98 (again measured against zero, since there is no intercept). Note that the tests on the two am coefficients compare each one to zero, not to each other. Looking at the residual plots (in the appendix), there do not appear to be any patterns or outliers that exert undue influence on the prediction.
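A numeric check can complement the visual reading of the residual plots. The sketch below computes Cook's distance for each car. The 4 / n cutoff is a common rule of thumb and an assumption here, not part of the original analysis:

```r
data("mtcars")
fit <- lm(mpg ~ hp + wt + factor(am) - 1, data = mtcars)

# Cook's distance combines residual size and leverage for each observation
cooks <- cooks.distance(fit)
head(sort(cooks, decreasing = TRUE), 3)

# Rule-of-thumb cutoff (assumed, not from the original analysis)
threshold <- 4 / nrow(mtcars)
flagged <- names(cooks)[cooks > threshold]
flagged
```

Any cars listed in `flagged` are candidates for a closer look, not automatic exclusions.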
library(ggplot2)  # required for the violin plot
data("mtcars")
# Relabel the 0/1 transmission indicator as a factor
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
# Violin plot of mpg by transmission type
violin <- ggplot(data = mtcars, aes(y = mpg, x = am, fill = am))
violin <- violin + geom_violin(alpha = .5)
violin <- violin + xlab("Transmission Type") + ylab("MPG")
violin <- violin + scale_fill_discrete(name = "Transmission Type", labels = c("Automatic", "Manual"))
violin
# Mean mpg per transmission type
aggregate(mpg ~ am, data = mtcars, mean)
# am was already converted to a labelled factor above
mtcars_auto <- mtcars[mtcars$am == "Automatic", ]
mtcars_man <- mtcars[mtcars$am == "Manual", ]
# Welch two-sample t-test (two-sided by default)
t.test(mtcars_auto$mpg, mtcars_man$mpg)
##
## Welch Two Sample t-test
##
## data: mtcars_auto$mpg and mtcars_man$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
reg1 <- lm(mpg ~ factor(am) -1, data=mtcars)
summary(reg1)
##
## Call:
## lm(formula = mpg ~ factor(am) - 1, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## factor(am)Automatic 17.147 1.125 15.25 1.13e-15 ***
## factor(am)Manual 24.392 1.360 17.94 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.9487, Adjusted R-squared: 0.9452
## F-statistic: 277.2 on 2 and 30 DF, p-value: < 2.2e-16
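# reg2 adds hp and wt to the transmission-only model (formula taken from the summary output shown earlier)
reg2 <- lm(mpg ~ hp + wt + factor(am) - 1, data = mtcars)
anova(reg1, reg2)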
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am) - 1
## Model 2: mpg ~ hp + wt + factor(am) - 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 180.29 2 540.61 41.979 3.745e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2,2))
plot(reg2)