The mtcars data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). This analysis addresses two questions:
Is an automatic or manual transmission better for MPG?
How large is the MPG difference between automatic and manual transmissions?
The following statistical techniques are used.
- t-test
- simple linear regression
- multiple linear regression
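According to the t-test, without considering other variables, manual cars have a significantly higher mean mpg than automatic cars (about 7.2 mpg more). Once weight and quarter-mile time are taken into account, the transmission effect is no longer statistically significant, so it cannot be said that one transmission type is better than the other. The final model explains 86% of the variability of log(mpg), leaving the remaining 14% unexplained.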
data <- mtcars
library(ggplot2)
library(broom)
library(ggfortify)
theme_set(theme_bw())
str(data)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
| Name | Description |
|---|---|
| mpg | Miles/(US) gallon |
| cyl | Number of cylinders |
| disp | Displacement (cu.in.) |
| hp | Gross horsepower |
| drat | Rear axle ratio |
| wt | Weight (1000 lbs) |
| qsec | 1/4 mile time |
| vs | Engine (0 = V-shaped, 1 = straight) |
| am | Transmission (0 = automatic, 1 = manual) |
| gear | Number of forward gears |
| carb | Number of carburetors |
vs, am, gear and carb are categorical variables, so they are converted to factors.
data$vs <- factor(data$vs)
data$am <- factor(ifelse(data$am == 0, "automatic", "manual"))
data$gear <- factor(data$gear)
data$carb <- factor(data$carb)
ggplot(data, aes(am, mpg, fill = am)) +
geom_boxplot() +
labs(x = "", y = "Miles Per Gallon") +
theme(legend.position = "none")
According to the box plot, automatic cars have a lower median mpg than manual cars.
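The centre line of each box is a median, not a mean. As a minimal sketch, the group summaries behind the plot can be computed directly; the names `cars`, `medians` and `means` are illustrative and not part of the original analysis.

```r
# Self-contained copy of the data, with transmission recoded as in the text
cars <- mtcars
cars$am <- factor(ifelse(cars$am == 0, "automatic", "manual"))

# Median and mean mpg for each transmission type
medians <- tapply(cars$mpg, cars$am, median)
means <- tapply(cars$mpg, cars$am, mean)

# One row per statistic, one column per transmission type
print(rbind(median = medians, mean = means))
```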
t.test(mpg ~ am, data = data)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group automatic and group manual is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group automatic mean in group manual
## 17.14737 24.39231
The p-value is 0.0014 and the 95% confidence interval for the difference in means (-11.28 to -3.21, automatic minus manual) does not contain zero, so the mean difference between automatic and manual cars is statistically significant. The null hypothesis of no mean difference can be rejected: without considering other variables, manual cars average about 7.2 mpg more than automatic cars (24.39 vs 17.15).
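The quantities quoted from the test output can also be pulled out of the returned object rather than read off the printout. A minimal sketch, assuming the same recoding of am as above; `cars`, `tt`, `mean_diff` and `ci_manual_minus_auto` are illustrative names.

```r
# Self-contained copy of the data, with transmission recoded as in the text
cars <- mtcars
cars$am <- factor(ifelse(cars$am == 0, "automatic", "manual"))

# Welch two-sample t-test (the default for t.test with a formula)
tt <- t.test(mpg ~ am, data = cars)

# Difference in group means, expressed as manual minus automatic
mean_diff <- unname(tt$estimate[2] - tt$estimate[1])

# t.test() reports the interval for automatic minus manual,
# so flip the sign and the order to match mean_diff
ci_manual_minus_auto <- rev(-as.numeric(tt$conf.int))

print(mean_diff)
print(ci_manual_minus_auto)
print(tt$p.value)
```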
model1 <- lm(mpg ~ am, data = data)
summary(model1)
##
## Call:
## lm(formula = mpg ~ am, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## ammanual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
With simple linear regression, manual cars show a 7.245 mpg increase compared to automatic cars (p = 0.000285). However, \(R^2\) is just 0.3598, which means that am alone explains only 36% of the variability of mpg, so it is worth considering multiple linear regression. The fitted model is
\(mpg = 17.147 + 7.245 \cdot am\)
Here am is 0 for automatic and 1 for manual. An approximate 95% confidence interval for each coefficient can be calculated as Estimate +/- 2 x Std. Error.
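The rule of thumb above can be compared with the exact t-based interval from `confint()`. A minimal sketch, refitting the simple model on a fresh copy of the data; `cars`, `fit`, `approx_ci` and `exact_ci` are illustrative names.

```r
# Self-contained copy of the data, with transmission recoded as in the text
cars <- mtcars
cars$am <- factor(ifelse(cars$am == 0, "automatic", "manual"))

fit <- lm(mpg ~ am, data = cars)
coefs <- summary(fit)$coefficients

# Rule-of-thumb interval: estimate plus or minus two standard errors
approx_ci <- cbind(
  lower = coefs[, "Estimate"] - 2 * coefs[, "Std. Error"],
  upper = coefs[, "Estimate"] + 2 * coefs[, "Std. Error"]
)

# Exact interval based on the t distribution with 30 degrees of freedom
exact_ci <- confint(fit, level = 0.95)

print(approx_ci)
print(exact_ci)
```

With 30 residual degrees of freedom the t quantile is slightly above 2, so the exact interval is a little wider than the approximation.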
Variables are selected by backward elimination, which starts from the full model and drops terms as long as doing so lowers the AIC.
step(lm(mpg ~ ., data = data), direction = "backward", trace = 0)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = data)
##
## Coefficients:
## (Intercept) wt qsec ammanual
## 9.618 -3.917 1.226 2.936
Backward elimination retains wt, qsec and am, so this is the model used for multiple linear regression. The response mpg is log-transformed because the same model on the original scale gave a lower adjusted \(R^2\) (0.8336).
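The comparison between the two response scales can be sketched as follows; `cars`, `fit_raw`, `fit_log` and `adj_r2` are illustrative names. The two values are measured on different response scales, so the comparison is indicative rather than a formal test.

```r
# Self-contained copy of the data, with transmission recoded as in the text
cars <- mtcars
cars$am <- factor(ifelse(cars$am == 0, "automatic", "manual"))

# Same predictors, response on the original and on the log scale
fit_raw <- lm(mpg ~ wt + qsec + am, data = cars)
fit_log <- lm(log(mpg) ~ wt + qsec + am, data = cars)

# Adjusted R-squared of each fit
adj_r2 <- c(
  raw = summary(fit_raw)$adj.r.squared,
  log = summary(fit_log)$adj.r.squared
)
print(round(adj_r2, 4))
```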
model2 <- lm(log(mpg) ~ wt + qsec + am, data = data)
summary(model2)
##
## Call:
## lm(formula = log(mpg) ~ wt + qsec + am, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13879 -0.08114 -0.03466 0.07030 0.26575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.69410 0.31326 8.600 2.40e-09 ***
## wt -0.22456 0.03201 -7.015 1.25e-07 ***
## qsec 0.05329 0.01299 4.101 0.00032 ***
## ammanual 0.08558 0.06351 1.347 0.18863
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1107 on 28 degrees of freedom
## Multiple R-squared: 0.8752, Adjusted R-squared: 0.8619
## F-statistic: 65.47 on 3 and 28 DF, p-value: 9.036e-13
Because mpg is modelled on the log scale, the coefficients must be exponentiated to interpret them on the original mpg scale. The fitted model is
\(\log(mpg) = 2.6941 - 0.22456 \cdot wt + 0.05329 \cdot qsec + 0.08558 \cdot am\), or equivalently \(mpg = \exp(2.6941 - 0.22456 \cdot wt + 0.05329 \cdot qsec + 0.08558 \cdot am)\)
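On the log scale each coefficient acts multiplicatively on mpg. For example, exp(0.08558) is about 1.089, which corresponds to roughly 9% higher mpg for a manual car at fixed wt and qsec. A minimal sketch of that back-transformation, with `cars`, `fit_log`, `ratio`, `pct_change` and `ratio_ci` as illustrative names:

```r
# Self-contained copy of the data, with transmission recoded as in the text
cars <- mtcars
cars$am <- factor(ifelse(cars$am == 0, "automatic", "manual"))

fit_log <- lm(log(mpg) ~ wt + qsec + am, data = cars)

# Multiplicative effect on mpg of a one-unit increase in each predictor
ratio <- exp(coef(fit_log))

# The same effect expressed as a percentage change in mpg
pct_change <- 100 * (ratio - 1)

# 95% confidence intervals, back-transformed to the ratio scale
ratio_ci <- exp(confint(fit_log))

print(round(pct_change, 1))
print(round(ratio_ci, 3))
```

An interval for ammanual that contains 1 on the ratio scale corresponds to a coefficient that is not significantly different from zero on the log scale.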
An adjusted \(R^2\) of 0.86 means that 86% of the variability of log(mpg) is explained by this model, but the transmission type (am) is not statistically significant (p = 0.189). The diagnostic plots are used next to check the model assumptions.
# autoplot() from ggfortify arranges the four diagnostic panels itself;
# par(mfrow) only affects base graphics
autoplot(model2)
Linearity of the data
In the Residuals vs Fitted plot, the residuals show no clear pattern and the blue line is approximately horizontal at zero. So a linear relationship between the predictors and the outcome variable can be assumed.
Homogeneity of variance
In the Scale-Location plot, 3 observations at the upper left corner pull the line away from horizontal.
Normality of residuals
The Normal Q-Q plot shows that the residuals approximately follow the straight line.
Leverage
The Residuals vs Leverage plot highlights the 4 most extreme points, which have standardized residuals below -1. None of them falls below -2, so no outliers are flagged on that side.
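The visual checks can be complemented numerically. A minimal sketch, refitting the log-scale model on a fresh copy of the data, computes standardized residuals and leverages and lists any observation whose absolute standardized residual is above 2. The names `cars`, `fit_log`, `r_std`, `lev` and `flagged` are illustrative. Since the largest raw residual in the summary (0.266) is about 2.4 times the residual standard error (0.1107), the positive tail is worth inspecting as well as the negative one.

```r
# Self-contained copy of the data, with transmission recoded as in the text
cars <- mtcars
cars$am <- factor(ifelse(cars$am == 0, "automatic", "manual"))

fit_log <- lm(log(mpg) ~ wt + qsec + am, data = cars)

# Standardized residuals and leverage (hat values) for every car
r_std <- rstandard(fit_log)
lev <- hatvalues(fit_log)

# Observations whose standardized residual is beyond 2 in either direction
flagged <- data.frame(rstandard = r_std, leverage = lev)[abs(r_std) > 2, ]
print(flagged)

# Range of standardized residuals, to check both tails at once
print(range(r_std))
```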