As a statistician for Motor Trend (I wish…), I wanted to investigate two questions regarding the relationship between mpg and transmission type:
First, let’s explore the relevant dataset: mtcars.
library(datasets); library(plyr)
data(mtcars);?mtcars
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Calling the str function gives us the data type of each column in the dataset. Additionally, the dataset information sheet (?mtcars) gives us the following:
A data frame with 32 observations on 11 variables.
To answer the two questions mentioned in the executive summary, we start with some basic exploratory analysis.
Using a boxplot to visualize the dataset, we can see right away that manual transmissions have a much higher median mpg than automatic transmissions. However, we have to be careful: no other variables have been accounted for yet.
#rename levels
mtcars$am <- factor(mtcars$am)
levels(mtcars$am) <- c("auto", "manual")
#boxplot
plot(mpg~factor(am), data = mtcars, xlab = "Transmission", main = "MPG by Transmission Type")
In the table below, we see a 7.24 mpg difference in mean mpg between manual and automatic transmissions when no other variables are included.
tbl <- aggregate(mpg~factor(am), data = mtcars, mean)
tbl <- rename(tbl, c("factor(am)" = "transmission"))
tbl
## transmission mpg
## 1 auto 17.14737
## 2 manual 24.39231
Testing the null hypothesis that manual and automatic transmission cars have the same mean mpg, we run a Welch two-sample t-test:
a <- subset(mtcars, am == "auto")
m <- subset(mtcars, am == "manual")
t.test(a$mpg, m$mpg)
##
## Welch Two Sample t-test
##
## data: a$mpg and m$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
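The p-value of our test is 0.001374, well below the 0.05 level, so we reject the null hypothesis. The 95% confidence interval for the difference in means (automatic minus manual) runs from -11.28 to -3.21 mpg and excludes zero. Thus, mpg appears to differ by transmission type.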
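As a sanity check on the t-test output, the Welch statistic, its degrees of freedom, and the p-value can be recomputed by hand. This is a minimal sketch: it reads a fresh copy of the data through `datasets::mtcars` (so the relabelled `mtcars` used elsewhere is left alone), and the names `cars`, `auto_mpg`, `manual_mpg`, `t_stat`, `df_welch` and `p_value` are illustrative, not part of the original analysis.

```r
# Fresh copy of the data; am is still numeric here (0 = automatic, 1 = manual)
cars <- datasets::mtcars
auto_mpg   <- cars$mpg[cars$am == 0]
manual_mpg <- cars$mpg[cars$am == 1]

# Variance of each group's sample mean
v_auto   <- var(auto_mpg) / length(auto_mpg)
v_manual <- var(manual_mpg) / length(manual_mpg)

# Welch statistic: difference in means over the unpooled standard error
t_stat <- (mean(auto_mpg) - mean(manual_mpg)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto_mpg) - 1) + v_manual^2 / (length(manual_mpg) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 4)
```

The three values should agree with the t, df and p-value lines of the t.test output above.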
To start off, I look at a basic linear model to see the effect transmission type has on mpg directly.
#Simple linear:
fit <- lm(mpg~factor(am), data = mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## factor(am)manual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
From the summary, we see the expected 7.24 mpg increase from automatic to manual transmissions in the coefficient estimate. However, this model explains only about one third of the variance (adjusted R-squared of 0.3385), so a better model may be available if we add more variables.
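With a single two-level factor as the only predictor, the regression simply restates the group means: the intercept is the mean of the reference level (auto) and the slope is the difference between the two means. A small sketch of that check follows; `cars`, `group_means` and `fit_simple` are names introduced here for illustration.

```r
# Fresh copy with the same relabelling used in the analysis
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("auto", "manual"))

# Mean mpg per transmission type
group_means <- tapply(cars$mpg, cars$am, mean)

# Same simple model as above, fitted on the copy
fit_simple <- lm(mpg ~ am, data = cars)
coefs <- coef(fit_simple)

# Intercept equals the automatic mean
intercept_ok <- isTRUE(all.equal(unname(coefs["(Intercept)"]),
                                 unname(group_means["auto"])))

# Slope equals the manual mean minus the automatic mean
slope_ok <- isTRUE(all.equal(unname(coefs["ammanual"]),
                             unname(group_means["manual"] - group_means["auto"])))

c(intercept_ok = intercept_ok, slope_ok = slope_ok)
```

This is why the 7.245 estimate matches the difference in the table of means exactly.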
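Next, I fit a model with all variables, then use the anova table to choose which predictors to keep in the final model for mpg. Note that anova reports sequential sums of squares, so each row depends on the order in which the variables enter the model.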
#Multivariate Linear:
fit_multi <- lm(mpg ~., data = mtcars)
anova(fit_multi)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## cyl 1 817.71 817.71 116.4245 5.034e-10 ***
## disp 1 37.59 37.59 5.3526 0.030911 *
## hp 1 9.37 9.37 1.3342 0.261031
## drat 1 16.47 16.47 2.3446 0.140644
## wt 1 77.48 77.48 11.0309 0.003244 **
## qsec 1 3.95 3.95 0.5623 0.461656
## vs 1 0.13 0.13 0.0185 0.893173
## am 1 14.47 14.47 2.0608 0.165858
## gear 1 0.97 0.97 0.1384 0.713653
## carb 1 0.41 0.41 0.0579 0.812179
## Residuals 21 147.49 7.02
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit_best <- lm(mpg ~ wt + disp + hp + cyl + factor(am), data = mtcars)
summary(fit_best)
##
## Call:
## lm(formula = mpg ~ wt + disp + hp + cyl + factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5952 -1.5864 -0.7157 1.2821 5.5725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.20280 3.66910 10.412 9.08e-11 ***
## wt -3.30262 1.13364 -2.913 0.00726 **
## disp 0.01226 0.01171 1.047 0.30472
## hp -0.02796 0.01392 -2.008 0.05510 .
## cyl -1.10638 0.67636 -1.636 0.11393
## factor(am)manual 1.55649 1.44054 1.080 0.28984
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.505 on 26 degrees of freedom
## Multiple R-squared: 0.8551, Adjusted R-squared: 0.8273
## F-statistic: 30.7 on 5 and 26 DF, p-value: 4.029e-10
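From the selected model, 86% of the variance is explained (multiple R-squared of 0.8551; adjusted 0.8273), so the model fits far better than the single-variable one. Additionally, we see a 1.56 mpg increase for manual transmissions when weight, displacement, horsepower and cylinder count are held constant, although this coefficient is not statistically significant (p = 0.29).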
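One way to quantify how much transmission type adds once the other predictors are in the model is to look at the confidence interval for its coefficient and to compare nested models with and without it. The sketch below assumes the same predictors as the selected model; `cars`, `fit_small`, `fit_full`, `ci_am` and `nested` are illustrative names.

```r
# Fresh copy with the same relabelling used in the analysis
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("auto", "manual"))

# Nested pair: the selected model with and without transmission type
fit_small <- lm(mpg ~ wt + disp + hp + cyl, data = cars)
fit_full  <- lm(mpg ~ wt + disp + hp + cyl + am, data = cars)

# 95% confidence interval for the adjusted manual-vs-automatic difference
ci_am <- confint(fit_full, "ammanual", level = 0.95)
ci_am

# F-test for adding am; with one extra term it matches the t-test on am
nested <- anova(fit_small, fit_full)
nested
```

Because only one term is added, the p-value of the F-test is the same as the one reported for factor(am)manual in the summary table.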
Taking a look at the residual plot, we also can see a decent model fit.
par(mfrow = c(2,2))
plot(fit_best)
The residuals seem fairly scattered and random with very little evidence of heteroskedasticity. The errors look approximately normal and no points appear to exert extreme leverage on the model.
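The visual read of the diagnostic plots can be backed up with a few numeric checks. The following is a rough sketch, not part of the original analysis: the cutoff of twice the average leverage is a common rule of thumb rather than a formal test, and `cars`, `fit_check`, `lev`, `cook` and `high_leverage` are names chosen for illustration.

```r
# Fresh copy with the same relabelling used in the analysis
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("auto", "manual"))
fit_check <- lm(mpg ~ wt + disp + hp + cyl + am, data = cars)

# Leverage (hat values) and influence (Cook's distance) for each car
lev  <- hatvalues(fit_check)
cook <- cooks.distance(fit_check)

# Rule of thumb: flag leverage above twice the average, 2 * p / n
p <- length(coef(fit_check))
n <- nrow(cars)
high_leverage <- names(lev)[lev > 2 * p / n]
high_leverage

# The three most influential observations by Cook's distance
head(sort(cook, decreasing = TRUE), 3)

# Shapiro-Wilk test as a numeric companion to the normal Q-Q plot
shapiro.test(residuals(fit_check))
```

Any cars flagged here are worth comparing against the labelled points in the residual plots before drawing conclusions.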
I can now concisely summarize the conclusions reached throughout the analysis.
To answer both questions, a car with a manual transmission seems to be the better choice, though the evidence weakens once other variables are accounted for. The group means showed a 7.24 mpg difference between manual and automatic transmissions in favor of manual when no other variables are considered. After finding a better model, we still see a 1.56 mpg increase for manual transmission cars when weight, displacement, horsepower and cylinder count are held constant, but that estimate is not statistically significant (p = 0.29). These factors make sense in the model since they all have to do with engine size and output.