Executive Summary

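As a statistician for Motor Trend (I wish…), I wanted to investigate two questions regarding the relationship between mpg and transmission type: whether an automatic or a manual transmission is better for mpg, and how large the mpg difference between the two is.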

Exploratory Analysis

First, let’s explore the relevant dataset: mtcars.

library(datasets); library(plyr)
data(mtcars);?mtcars
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Calling the str function gives us the data type of each column in the dataset. Additionally, the dataset's help page gives us the following:

Format

A data frame with 32 observations on 11 variables.

  [, 1]  mpg   Miles/(US) gallon
  [, 2]  cyl   Number of cylinders
  [, 3]  disp  Displacement (cu.in.)
  [, 4]  hp    Gross horsepower
  [, 5]  drat  Rear axle ratio
  [, 6]  wt    Weight (1000 lbs)
  [, 7]  qsec  1/4 mile time
  [, 8]  vs    V/S
  [, 9]  am    Transmission (0 = automatic, 1 = manual)
  [,10]  gear  Number of forward gears
  [,11]  carb  Number of carburetors

Questions

To answer the two questions mentioned in the executive summary, we start with some basic exploratory analysis.

Using a boxplot to visualize the dataset, we can see right away that manual transmissions have a much higher median mpg than automatic transmissions. However, we have to be careful: no other variables have been accounted for yet.

#rename levels
mtcars$am <- factor(mtcars$am)
levels(mtcars$am) <- c("auto", "manual")
#boxplot
plot(mpg~factor(am), data = mtcars, xlab = "Transmission", main = "MPG by Transmission Type")

In the table below, we see a 7.24 mpg difference in mean mpg between manual and automatic transmissions when no other variables are included.

tbl <- aggregate(mpg~factor(am), data = mtcars, mean)
tbl <- rename(tbl, c("factor(am)" = "transmission"))
tbl
##   transmission      mpg
## 1         auto 17.14737
## 2       manual 24.39231

Testing the hypothesis that the mean mpg is the same for manual and automatic transmission cars, we run a two-sample t-test:

a <- subset(mtcars, am == "auto")
m <- subset(mtcars, am == "manual")
t.test(a$mpg, m$mpg)
## 
##  Welch Two Sample t-test
## 
## data:  a$mpg and m$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

The p-value of our test is 0.001374, so we reject the null hypothesis of equal means at the 0.05 level. Thus, mean mpg appears to differ by transmission type, although this test alone does not account for any other variables.
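To make the test above more concrete, the Welch statistic can be rebuilt by hand from the two group means and variances. This is a minimal sketch, not part of the original analysis: the names cars, auto, manual, se, t_stat, df_welch, and p_value are illustrative, and it works on a fresh copy of the dataset (where am is still coded 0/1) so that the mtcars object used elsewhere is left untouched.

```r
# Work on a fresh copy so the mtcars object used elsewhere is not modified.
cars   <- datasets::mtcars
auto   <- cars$mpg[cars$am == 0]   # am == 0: automatic
manual <- cars$mpg[cars$am == 1]   # am == 1: manual

# Variance of each group's sample mean
v_auto   <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over its standard error
se     <- sqrt(v_auto + v_manual)
t_stat <- (mean(auto) - mean(manual)) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 4)
```

If the sketch is right, these values should agree with the t, df, and p-value that t.test reported above.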

Modeling

To start off, I look at a basic linear model to see the effect transmission type has on mpg directly.

#Simple linear:
fit <- lm(mpg~factor(am), data = mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        17.147      1.125  15.247 1.13e-15 ***
## factor(am)manual    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

From the summary, we see the expected 7.24 mpg increase from automatic to manual transmissions in the coefficient estimate. However, this model explains only about one third of the variance (adjusted R-squared of 0.3385), so a better model may be available if we add more variables.

Next, I want to consider all variables, then attempt to choose the best model for mpg by looking at the anova table for effects. Note that anova reports sequential sums of squares, so each row depends on the order in which the variables enter the model.
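The reason the slope matches the table of means is that, with a single two-level factor as the predictor, the intercept is the mean of the reference level and the slope is the difference between the two group means. A small sketch of that check follows; the names cars, fit_simple, mean_auto, and mean_manual are illustrative and not from the analysis above.

```r
# Fresh copy of the data with labeled transmission levels
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

fit_simple <- lm(mpg ~ am, data = cars)

mean_auto   <- mean(cars$mpg[cars$am == "auto"])
mean_manual <- mean(cars$mpg[cars$am == "manual"])

# Intercept = mean of the reference level (auto);
# slope = manual mean minus auto mean.
rbind(
  from_lm    = unname(coef(fit_simple)),
  from_means = c(mean_auto, mean_manual - mean_auto)
)
```

The two rows should be identical up to floating-point error, which is why this model cannot tell us anything the table of means did not.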

#Multivariate Linear:
fit_multi <- lm(mpg ~., data = mtcars)
anova(fit_multi)
## Analysis of Variance Table
## 
## Response: mpg
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## cyl        1 817.71  817.71 116.4245 5.034e-10 ***
## disp       1  37.59   37.59   5.3526  0.030911 *  
## hp         1   9.37    9.37   1.3342  0.261031    
## drat       1  16.47   16.47   2.3446  0.140644    
## wt         1  77.48   77.48  11.0309  0.003244 ** 
## qsec       1   3.95    3.95   0.5623  0.461656    
## vs         1   0.13    0.13   0.0185  0.893173    
## am         1  14.47   14.47   2.0608  0.165858    
## gear       1   0.97    0.97   0.1384  0.713653    
## carb       1   0.41    0.41   0.0579  0.812179    
## Residuals 21 147.49    7.02                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Selected model: weight, displacement, horsepower, cylinders and transmission
fit_best <- lm(mpg ~  wt + disp + hp + cyl + factor(am), data = mtcars)
summary(fit_best)
## 
## Call:
## lm(formula = mpg ~ wt + disp + hp + cyl + factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5952 -1.5864 -0.7157  1.2821  5.5725 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      38.20280    3.66910  10.412 9.08e-11 ***
## wt               -3.30262    1.13364  -2.913  0.00726 ** 
## disp              0.01226    0.01171   1.047  0.30472    
## hp               -0.02796    0.01392  -2.008  0.05510 .  
## cyl              -1.10638    0.67636  -1.636  0.11393    
## factor(am)manual  1.55649    1.44054   1.080  0.28984    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.505 on 26 degrees of freedom
## Multiple R-squared:  0.8551, Adjusted R-squared:  0.8273 
## F-statistic:  30.7 on 5 and 26 DF,  p-value: 4.029e-10

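From the selected model, about 86% of the variance is explained (multiple R-squared of 0.8551, adjusted R-squared of 0.8273), so the model is much more comprehensive now. Additionally, we see a 1.56 mpg increase for manual transmissions when weight, displacement, horsepower, and cylinder count are held constant, although this coefficient is not statistically significant (p = 0.29).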
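Two quick follow-up checks help put the transmission coefficient in context: a confidence interval for the adjusted difference, and a nested F-test comparing the transmission-only model with the larger one. This is a hedged sketch using base R only; the names cars, fit_simple, fit_adj, ci_am, and cmp are illustrative and not part of the original analysis.

```r
# Fresh copy of the data with labeled transmission levels
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))

fit_simple <- lm(mpg ~ am, data = cars)
fit_adj    <- lm(mpg ~ wt + disp + hp + cyl + am, data = cars)

# 95% confidence interval for the adjusted manual-vs-auto difference
ci_am <- confint(fit_adj, "ammanual", level = 0.95)
ci_am

# Nested F-test: do the four extra covariates improve on the am-only model?
cmp <- anova(fit_simple, fit_adj)
cmp
```

Because the p-value for the transmission coefficient in the summary above is 0.29, the 95% interval should include zero, so the data are compatible with no adjusted difference at all as well as with a difference of a few mpg.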

Taking a look at the residual plot, we also can see a decent model fit.

par(mfrow = c(2,2))
plot(fit_best)

The residuals seem fairly scattered and random, with very little evidence of heteroskedasticity. The errors look approximately normal, and no points appear to exert extreme leverage on the model.
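The visual impressions from the diagnostic plots can be backed up with a few numbers. The sketch below is one possible way to do that with base R, not the procedure used in the analysis; the names cars, fit_adj, lev, p, and n are illustrative, and the 2p/n cutoff is only a common rule of thumb.

```r
# Refit the selected model on a fresh copy of the data
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("auto", "manual"))
fit_adj <- lm(mpg ~ wt + disp + hp + cyl + am, data = cars)

# Shapiro-Wilk test on the residuals (null hypothesis: normality)
shapiro.test(resid(fit_adj))

# Leverage: hat values sum to the number of coefficients,
# so the average leverage is p / n
lev <- hatvalues(fit_adj)
p   <- length(coef(fit_adj))
n   <- nrow(cars)
head(sort(lev, decreasing = TRUE), 3)   # highest-leverage cars
2 * p / n                               # rule-of-thumb cutoff for high leverage

# Influence: Cook's distance combines leverage and residual size
head(sort(cooks.distance(fit_adj), decreasing = TRUE), 3)
```

Cars whose leverage is well above the cutoff, or whose Cook's distance stands out from the rest, would be the ones worth a closer look before trusting the coefficients.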

Conclusion

I can now concisely summarize the conclusions reached throughout the analysis.

To answer both questions, a car with a manual transmission seems to be the better choice for mpg. The table of group means showed a 7.24 mpg difference between manual and automatic transmissions in favor of manual when no other variables are considered. After finding a better model, we see there is still a 1.56 mpg increase in manual transmission cars when weight, displacement, horsepower, and cylinder count are held constant, though that adjusted difference is not statistically significant (p = 0.29). These factors make sense in the model since they all relate to vehicle size and engine output.