This report investigates the relationship between the type of transmission and miles per gallon (MPG) based on the mtcars dataset. A T-test between manual and automatic transmission vehicles shows that vehicles with the former transmission have a 7.245 greater MPG than the latter. Multiple linear regressions revealed that the former contributed less significantly to MPG, only an improvement of 1.81 MPG. The main contribution to the overall vehicle MPG originates from the other variables, including weight, horsepower, and number of cylinders.
The data is loaded and preprocessed as below:
# Setup
knitr::opts_chunk$set(warning=FALSE)
# Load required libraries
library(ggplot2)
# Load data and examine data structure (see Appendix 1)
data(mtcars)
# Convert the categorical variable to appropriate class
categorical_vars <- c("cyl", "vs", "am", "gear", "carb")
mtcars[categorical_vars] <- lapply(mtcars[categorical_vars], factor)
# Rename binary variables
levels(mtcars$am) <- c("Automatic", "Manual")
levels(mtcars$vs) <- c("V-shaped", "Straight")
The relationships between all variables are first explored through a scatter plot matrix (see Appendix 2), where some strong correlations were observed between certain variables and MPG.
A box plot that compares the MPG of the 2 transmission types indicates that manual transmission vehicle allows longer MPG in general (see Appendix 3).
A simple model is first fitted to the data, including only the main explanatory and response variables. Albeit the p-value returned is small, the R-squared value indicates that only around 34% of the variance in MPG could be explained, hinting at the need for inclusion of more variables (see Appendix 4).
simpleModel <- lm(formula=mpg~am, data=mtcars)
Therefore, all variables are included in the next analysis, resulting in the full model. This model returns opposite result compared to the previous model (see Appendix 5). Although about 89% of the MPG variance could be explained, all p-values are greater than 0.05, thus no result is significant. This implies that only the most significant variables should be included.
fullModel <- lm(formula=mpg~., data=mtcars)
The determination of most statistically significant variables is achieved by both forward selection and backward elimination methods by AIC algorithm.
stepModel <- step(fullModel, direction="both", trace=0)
The resulting model includes cylinders, horsepower, and weight as confounder variables, and transmission being the response variable. It explains about 87% of the variance in MPG. The p-values are statistically significantly for all 3 confounder variables at the significance level of 0.05 (see Appendix 6).
The coefficients indicate that holding the other variables constant, the increment in the number of cylinders from 4 to 6 leads to lower MPG by 3.03, while further increment to 8 cylinders reduce the MPG by 2.16. Every unit increase in the horsepower corresponds to reduction in MPG by 0.0321 while every 1000 lbs increase in the vehicle weight decreases the MPG by 2.5 if other variables are held constant. A manual transmission corresponds to 1.81 higher MPG compared to automatic transmission.
The 95% confidence interval of each coefficient is computed and attached to Appendix 7. A Welch 2 sample T-test on transmission type and MPG indeed results in a p-value much smaller than 0.05 and a confidence interval that does not include 0, giving evidence to reject the null hypothesis that transmission type has no impact on MPG (see Appendix 8).
The residual plots (see Appendix 9) leads to the following conclusions:
Some regression diagnostics of the model are computed to find the leverage points:
leverage <- hatvalues(stepModel)
tail(sort(leverage), 3)
## Toyota Corona Lincoln Continental Maserati Bora
## 0.2777872 0.2936819 0.4713671
influential <- dfbetas(stepModel)
tail(sort(influential[, 6]), 3)
## Chrysler Imperial Fiat 128 Toyota Corona
## 0.3507458 0.4292043 0.7305402
The top three points in each case of influence measures are found in the residual plots, thus indicating that the analysis was correct.
The inclusion of confounding variables like weight, horsepower, and number of cylinders in the analysis revealed that the difference in MPG based on transmission type is not found to be significant. The difference could largely be explained by the other variables instead. Thus the question of whether an automatic or manual transmission is better for MPG could not be answered.
Based on the best fit model, it could be said that holding the other confounder variables constant, a manual transmission vehicle could travel anywhere between 4.7 MPG longer or 1.1 MPG shorter than its counterpart. Other variables would be much better variables to be tuned for obtaining optimal MPG.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "V-shaped","Straight": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
pairs(mpg~., data=mtcars)
boxplot(formula=mpg~am, data=mtcars, xlab="Transmission Type", ylab="MPG", main="MPG by Transmission Type")
summary(simpleModel)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
summary(fullModel)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vsStraight 1.93085 2.87126 0.672 0.5115
## amManual 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
summary(stepModel)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## amManual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
confint(stepModel)
## 2.5 % 97.5 %
## (Intercept) 28.35390366 39.062744138
## cyl6 -5.92405718 -0.138631806
## cyl8 -6.85902199 2.531671342
## hp -0.06025492 -0.003963941
## wt -4.31718120 -0.676477640
## amManual -1.06093363 4.679356394
t.test(formula=mpg~am, data=mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
par(mfrow = c(2, 2))
plot(stepModel)