Description

The mtcars data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). This analysis addresses two questions of particular interest:

  1. Is an automatic or manual transmission better for MPG?

  2. How large is the MPG difference between automatic and manual transmissions?

The following statistical techniques are used:
- t-test
- simple linear regression
- multiple linear regression

Summary

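According to the t-test, without considering other variables, manual cars have a significantly higher mean mpg than automatic cars (24.39 vs 17.15). Once weight (wt) and quarter-mile time (qsec) are taken into account, the transmission effect is no longer statistically significant, so it cannot be said that either transmission type is better without investigating the remaining 14% of the variability of mpg that the model leaves unexplained.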

Load the data and required libraries

data <- mtcars
library(ggplot2)
library(broom)
library(ggfortify)
theme_set(theme_bw())

Glance at the data

str(data)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Description of dataset from R documentation

Name   Description
mpg    Miles/(US) gallon
cyl    Number of cylinders
disp   Displacement (cu.in.)
hp     Gross horsepower
drat   Rear axle ratio
wt     Weight (1000 lbs)
qsec   1/4 mile time
vs     Engine (0 = V-shaped, 1 = straight)
am     Transmission (0 = automatic, 1 = manual)
gear   Number of forward gears
carb   Number of carburetors

vs, am, gear and carb should be treated as categorical variables, so they are converted to factors.

data$vs <- factor(data$vs)
data$am <- factor(ifelse(data$am == 0, "automatic", "manual"))
data$gear <- factor(data$gear)
data$carb <- factor(data$carb)

Is an automatic or manual transmission better for MPG?

ggplot(data, aes(am, mpg, fill = am)) + 
    geom_boxplot() +
    labs(x = "", y = "Miles Per Gallon") + 
    theme(legend.position = "none")

According to the box plot, the mpg of automatic cars is generally lower than that of manual cars: the median line of the automatic group sits below that of the manual group.
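
The visual comparison can be backed up with group-wise numbers. The following is a minimal sketch in base R; `am_label` and the other object names are illustrative and not part of the original analysis.

```r
# Summarise mpg by transmission type using the built-in mtcars data.
# `am_label` is an illustrative name introduced only for this sketch.
am_label <- factor(ifelse(mtcars$am == 0, "automatic", "manual"))

group_n      <- tapply(mtcars$mpg, am_label, length)
group_mean   <- tapply(mtcars$mpg, am_label, mean)
group_median <- tapply(mtcars$mpg, am_label, median)

# One row per statistic, one column per transmission type
print(rbind(n = group_n, mean = group_mean, median = group_median))
```

The means printed here should agree with the sample estimates reported by the t-test in the next section.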

t-test between mpg and automatic/manual

t.test(mpg ~ am, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group automatic and group manual is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group automatic    mean in group manual 
##                17.14737                24.39231

The p-value is 0.0014 and the 95% confidence interval for the difference in means (automatic minus manual), -11.28 to -3.21 mpg, does not contain zero, so the difference is statistically significant. The null hypothesis of no difference in mean mpg between automatic and manual cars can be rejected: without considering other variables, manual cars average about 7.2 mpg more than automatic cars (24.39 vs 17.15).
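
To make the test output less of a black box, the Welch statistic, its degrees of freedom, the p-value and the confidence interval can be recomputed by hand. This is a sketch using the standard Welch-Satterthwaite formulas; the object names are illustrative.

```r
# Recompute the Welch two-sample t-test from its definition.
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Squared standard error of each group mean
se2_auto   <- var(auto) / length(auto)
se2_manual <- var(manual) / length(manual)
se_diff    <- sqrt(se2_auto + se2_manual)

# Difference is automatic minus manual, matching the t.test() output
mean_diff <- mean(auto) - mean(manual)
t_stat    <- mean_diff / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto) - 1) + se2_manual^2 / (length(manual) - 1))

p_value <- 2 * pt(-abs(t_stat), df_welch)
ci_95   <- mean_diff + c(-1, 1) * qt(0.975, df_welch) * se_diff

print(c(t = t_stat, df = df_welch, p = p_value))
print(ci_95)
```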

Linear Regression

model1 <- lm(mpg ~ am, data = data)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ am, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## ammanual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

With simple linear regression, manual cars show an increase of 7.245 mpg over automatic cars (p = 0.000285). However, \(R^2\) is only 0.3598, meaning that am alone explains about 36% of the variability of mpg, so it is worth considering multiple linear regression. The formula for this model is

\(mpg = 17.147 + 7.245 \cdot am\)

where am is 0 for automatic and 1 for manual. An approximate 95% confidence interval for each coefficient can be calculated as Estimate +/- 2 x Std. Error.
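
As a quick check of the rule of thumb, the approximate intervals can be set side by side with the exact t-based intervals from `confint()`. This is a sketch only; `cars` and `fit_simple` are illustrative names that rebuild the simple regression from the built-in mtcars data.

```r
# Rebuild the simple regression on a local copy of mtcars
cars <- transform(mtcars, am = factor(ifelse(am == 0, "automatic", "manual")))
fit_simple <- lm(mpg ~ am, data = cars)

# Rule of thumb: Estimate +/- 2 x Std. Error
est <- coef(summary(fit_simple))
approx_ci <- cbind(
  lower = est[, "Estimate"] - 2 * est[, "Std. Error"],
  upper = est[, "Estimate"] + 2 * est[, "Std. Error"]
)

# Exact interval based on the t distribution with 30 degrees of freedom
exact_ci <- confint(fit_simple, level = 0.95)

print(approx_ci)
print(exact_ci)
```

With 30 residual degrees of freedom the exact multiplier is `qt(0.975, 30)`, slightly above 2, so the exact intervals are a little wider than the approximate ones.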

Multiple Linear Regression

Models are selected by backward elimination, keeping the model with the lowest AIC value.

step(lm(mpg ~ ., data = data), direction = "backward", trace = 0)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = data)
## 
## Coefficients:
## (Intercept)           wt         qsec     ammanual  
##       9.618       -3.917        1.226        2.936

The model with wt, qsec and am is selected as the best multiple linear regression model. Fitted on the original scale, this model has an adjusted \(R^2\) of 0.8336; the response mpg was then changed to log scale, and the log-scale model is the one reported here.
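
The two response scales can be compared with a short sketch such as the one below. The names `cars`, `fit_raw` and `fit_log` are illustrative. One caveat, stated as an assumption of this sketch rather than something from the original analysis: adjusted \(R^2\) values computed on mpg and on log(mpg) are not strictly comparable, so the last line also reports the squared correlation between observed mpg and the back-transformed fitted values.

```r
# Fit the selected model on both response scales
cars <- transform(mtcars, am = factor(ifelse(am == 0, "automatic", "manual")))
fit_raw <- lm(mpg ~ wt + qsec + am, data = cars)
fit_log <- lm(log(mpg) ~ wt + qsec + am, data = cars)

# Adjusted R-squared as reported by summary() for each model
adj_r2 <- c(
  raw = summary(fit_raw)$adj.r.squared,
  log = summary(fit_log)$adj.r.squared
)
print(round(adj_r2, 4))

# Agreement between observed mpg and back-transformed log-model predictions
r2_backtransformed <- cor(cars$mpg, exp(fitted(fit_log)))^2
print(r2_backtransformed)
```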

model2 <- lm(log(mpg) ~ wt + qsec + am, data = data)
summary(model2)
## 
## Call:
## lm(formula = log(mpg) ~ wt + qsec + am, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13879 -0.08114 -0.03466  0.07030  0.26575 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.69410    0.31326   8.600 2.40e-09 ***
## wt          -0.22456    0.03201  -7.015 1.25e-07 ***
## qsec         0.05329    0.01299   4.101  0.00032 ***
## ammanual     0.08558    0.06351   1.347  0.18863    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1107 on 28 degrees of freedom
## Multiple R-squared:  0.8752, Adjusted R-squared:  0.8619 
## F-statistic: 65.47 on 3 and 28 DF,  p-value: 9.036e-13

Because mpg is on the log scale, the fitted values and coefficients must be exponentiated to interpret them in mpg units. The formula is

\(\log(mpg) = 2.6941 - 0.22456 \cdot wt + 0.05329 \cdot qsec + 0.08558 \cdot am\)

or, equivalently,

\(mpg = \exp(2.6941 - 0.22456 \cdot wt + 0.05329 \cdot qsec + 0.08558 \cdot am)\)
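
On the log scale a coefficient acts multiplicatively on mpg, so exponentiating it gives a percentage change. The sketch below uses the am estimate and standard error copied from the summary output above; the object names are illustrative.

```r
# Values copied from the summary(model2) output above
b_am  <- 0.08558   # estimate for ammanual on the log scale
se_am <- 0.06351   # its standard error
df_resid <- 28     # residual degrees of freedom

# Multiplicative effect of manual vs automatic, holding wt and qsec fixed
ratio      <- exp(b_am)
pct_change <- (ratio - 1) * 100

# 95% confidence interval, built on the log scale and then exponentiated
ci_log <- b_am + c(-1, 1) * qt(0.975, df_resid) * se_am
ci_pct <- (exp(ci_log) - 1) * 100

print(round(pct_change, 1))
print(round(ci_pct, 1))
```

Under this model a manual transmission is associated with roughly 9% higher mpg at the same weight and quarter-mile time, but the interval spans zero, which is consistent with the non-significant p-value for ammanual.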

An adjusted \(R^2\) of 0.86 means that 86% of the variability of log(mpg) can be explained by this model, but the transmission type (am) is not statistically significant (p = 0.189). Do the diagnostic plots support the model assumptions?

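# par(mfrow = ...) only affects base graphics, so it is not needed here;
# autoplot() from ggfortify arranges the diagnostic panels itself
autoplot(model2)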

  1. Linearity of the data
    In the Residuals vs Fitted plot, the residuals show no clear pattern and the blue line is approximately horizontal at zero, so a linear relationship between the predictors and the outcome variable can be assumed.

  2. Homogeneity of variance
    In the Scale-Location plot, 3 observations in the upper left corner pull the line away from horizontal.

  3. Normality of residuals
    In the Normal Q-Q plot, the residuals approximately follow the straight line.

  4. Leverage
    The Residuals vs Leverage plot highlights the 4 most extreme points. Their standardized residuals are below -1, but none falls below -2, so no outliers give cause for concern.
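
The visual checks above can be complemented with numeric diagnostics. The following is a sketch, not part of the original analysis: `cars` and `fit_log` are illustrative names that refit the log-scale model, and the `4 / n` cutoff for Cook's distance is a common rule of thumb rather than a fixed standard.

```r
# Refit the log-scale model on a local copy of mtcars
cars <- transform(mtcars, am = factor(ifelse(am == 0, "automatic", "manual")))
fit_log <- lm(log(mpg) ~ wt + qsec + am, data = cars)

std_res <- rstandard(fit_log)       # standardized residuals
lev     <- hatvalues(fit_log)       # leverage of each observation
cooks   <- cooks.distance(fit_log)  # influence of each observation

# The four observations with the largest absolute standardized residuals
print(head(sort(abs(std_res), decreasing = TRUE), 4))

# Observations flagged by the 4 / n rule of thumb for Cook's distance
print(names(cooks)[cooks > 4 / nrow(cars)])

# Shapiro-Wilk test as a numeric companion to the Normal Q-Q plot
print(shapiro.test(residuals(fit_log)))
```

Any car flagged here deserves a closer look in the Residuals vs Leverage panel before it is treated as an outlier.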