This project applies linear regression to the mtcars data. The analysis looks at several variables to answer whether transmission type (manual vs. automatic) affects fuel efficiency (miles per gallon).
The analysis shows that, when only mileage and transmission are considered, the two are correlated. However, other confounding variables also affect mileage. A deeper multivariable analysis suggests that transmission type alone does not determine fuel efficiency.
data(mtcars)
library(ggplot2) # to create plot
library(gridExtra) # to arrange plots
library(car) # Companion to Applied Regression (vif function)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
This data set contains 32 observations of 11 variables: mpg (miles per gallon), cyl (number of cylinders), disp (displacement), hp (horsepower), drat (rear axle ratio), wt (weight), qsec (quarter-mile time), vs (engine shape: 0 = V-shaped, 1 = straight), am (transmission: 0 = automatic, 1 = manual), gear (number of forward gears) and carb (number of carburetors).
Figure 1 (appendix) gives a quick look at the pairwise relationships between them. Several variables are categorical, although they are stored as numbers: cyl, vs, am, gear and carb.
If we look only at transmission type vs. mpg (Figure 2 - appendix), manual transmission appears to give better mileage than automatic. We can quantify this by calculating the correlation between them:
cor(mtcars$am, mtcars$mpg)
## [1] 0.5998324
Or perform a simple t-test to compare:
t.test(mtcars$mpg[mtcars$am == "0"], mtcars$mpg[mtcars$am == "1"])
##
## Welch Two Sample t-test
##
## data: mtcars$mpg[mtcars$am == "0"] and mtcars$mpg[mtcars$am == "1"]
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
Both of these simple analyses suggest that manual transmission is better for mileage. The difference in mean mileage is about 7.2 mpg (24.4 compared to 17.1).
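The Welch statistic reported above can be reproduced by hand, which makes it clearer what the test compares. The following is a minimal sketch; the names `cars`, `auto`, `manual`, `se_diff` and `t_manual` are illustrative and not part of the original analysis.

```r
# Sketch: recompute the Welch t statistic from group means and variances.
cars <- datasets::mtcars            # fresh copy, leaves the global mtcars untouched
auto   <- cars$mpg[cars$am == 0]    # automatic transmission
manual <- cars$mpg[cars$am == 1]    # manual transmission

# Standard error of the difference in means with unequal variances (Welch)
se_diff <- sqrt(var(auto) / length(auto) + var(manual) / length(manual))
t_manual <- (mean(auto) - mean(manual)) / se_diff

# Same statistic as computed by t.test (Welch is its default)
t_builtin <- unname(t.test(auto, manual)$statistic)
print(c(by_hand = t_manual, t.test = t_builtin))
```

The sign is negative because the automatic group is listed first and has the lower mean.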
However, other variables may affect mpg as well. Figure 3 (appendix) shows that cyl and vs each have a separate yet similar effect on mpg. Therefore it is necessary to fit a linear regression model that adjusts for more variables.
In this approach, I build three kinds of linear model:
1. mpg against am only
2. mpg against all remaining variables
3. the best model of mpg against the meaningful variables among the remaining ones
I then interpret each of them to understand how transmission relates to mileage once the other variables are taken into account.
First, we need to convert all categorical variables to factors:
cols <- c("cyl", "am", "gear", "vs", "carb")
mtcars[cols] <- lapply(mtcars[cols], factor)
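The effect of this conversion can be checked on a copy of the data. The sketch below is only an illustration: it uses a local copy named `cars` (an assumed name) and shows that a factor with k levels enters a linear model as k - 1 indicator columns, with the first level as the reference.

```r
# Sketch: inspect how factor columns are expanded in the design matrix.
cars <- datasets::mtcars
cols <- c("cyl", "am", "gear", "vs", "carb")
cars[cols] <- lapply(cars[cols], factor)

# Number of levels per converted column
n_levels <- sapply(cars[cols], nlevels)
print(n_levels)

# cyl has levels 4, 6, 8; the model uses cyl = 4 as reference
# and creates one indicator column each for 6 and 8.
design_cols <- colnames(model.matrix(mpg ~ cyl, data = cars))
print(design_cols)
```

This is why the regression output lists coefficients such as cyl6, cyl8 and am1 rather than a single slope per variable.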
Now let’s first look at a linear model that predicts mpg from am alone:
fit <- lm(mpg ~ am, data = mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 14.85062 19.44411
## am1 3.64151 10.84837
This model takes am = 0 (automatic) as the reference level. It shows that mileage with manual transmission is significantly higher than with automatic, by 7.245 mpg (95% CI from 3.64 to 10.85), with a p-value of 0.000285. This is not surprising, as we observed the same pattern in the exploratory analysis.
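With a single two-level factor as predictor, the coefficients have a direct reading: the intercept is the mean mpg of the reference group and the am1 coefficient is the difference between the two group means. A minimal sketch of that check, using illustrative names (`cars`, `fit_am`, `means`) on a local copy of the data:

```r
# Sketch: coefficients of mpg ~ am versus plain group means.
cars <- datasets::mtcars
cars$am <- factor(cars$am)          # 0 = automatic (reference), 1 = manual

fit_am <- lm(mpg ~ am, data = cars)
means  <- tapply(cars$mpg, cars$am, mean)

intercept <- unname(coef(fit_am)["(Intercept)"])
slope_am1 <- unname(coef(fit_am)["am1"])

# Intercept = mean of automatic cars; am1 = manual mean minus automatic mean
print(c(intercept = intercept, mean_auto = unname(means["0"])))
print(c(am1 = slope_am1, mean_diff = unname(means["1"] - means["0"])))
```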
The R-squared value shows that this model explains only 36% of the variance in mpg, so I need to include more variables. The first attempt is to include all of them:
fit0 <- lm(mpg ~ ., data = mtcars)
summary(fit0)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vs1 1.93085 2.87126 0.672 0.5115
## am1 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
This model shows that more variables explain mpg better, with an adjusted R-squared of 78%.
Interestingly, although the model estimates that manual transmission gives 1.21 mpg more than automatic with the other variables held fixed, this difference is not statistically significant (p-value of 0.7113).
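The two quantities discussed here, the adjusted R-squared and the p-value of the am1 coefficient, can be pulled out of the model summaries directly instead of being read off the printed tables. The sketch below refits both models on a local copy of the data; the names `cars`, `fit_am`, `fit_full`, `adj_r2` and `am_row` are illustrative.

```r
# Sketch: extract adjusted R-squared and the am1 row programmatically.
cars <- datasets::mtcars
cols <- c("cyl", "am", "gear", "vs", "carb")
cars[cols] <- lapply(cars[cols], factor)

fit_am   <- lm(mpg ~ am, data = cars)   # transmission only
fit_full <- lm(mpg ~ ., data = cars)    # all variables

adj_r2 <- c(am_only = summary(fit_am)$adj.r.squared,
            full    = summary(fit_full)$adj.r.squared)
print(round(adj_r2, 4))

# Estimate, standard error, t value and p-value for manual vs. automatic
am_row <- coef(summary(fit_full))["am1", ]
print(am_row)
```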
In practice, there are different criteria for choosing which variables to include in the model. Starting from the full model (fit0 above), I select them using two common methods:
a) variance inflation
b) a combination test (stepwise selection).
The variance inflation factor (VIF) measures how much the variance of a coefficient estimate is inflated by correlation between that predictor and the other predictors. Here I use it as a heuristic to rank the variables. From the original full model, the variance inflation of each variable is summarized below:
vif(fit0)
## GVIF Df GVIF^(1/(2*Df))
## cyl 128.120962 2 3.364380
## disp 60.365687 1 7.769536
## hp 28.219577 1 5.312210
## drat 6.809663 1 2.609533
## wt 23.830830 1 4.881683
## qsec 10.790189 1 3.284842
## vs 8.088166 1 2.843970
## am 9.930495 1 3.151269
## gear 50.852311 2 2.670408
## carb 503.211851 5 1.862838
To include am, I take the variables in order from high to low variance inflation, using the GVIF^(1/(2*Df)) column, and stop at am. They are (in order): disp, hp, wt, cyl, qsec, and am. I build a model with these variables only:
fit1 <- lm(mpg ~ disp+hp+wt+cyl+qsec+am, data = mtcars)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ disp + hp + wt + cyl + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9501 -1.4335 -0.1542 1.3632 4.1917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.75881 11.54341 1.798 0.0847 .
## disp 0.00680 0.01289 0.528 0.6026
## hp -0.02477 0.01537 -1.612 0.1201
## wt -3.40642 1.30019 -2.620 0.0150 *
## cyl6 -1.98412 1.76113 -1.127 0.2710
## cyl8 -0.97700 3.24098 -0.301 0.7657
## qsec 0.67413 0.57760 1.167 0.2546
## am1 2.91836 1.70260 1.714 0.0994 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.435 on 24 degrees of freedom
## Multiple R-squared: 0.8736, Adjusted R-squared: 0.8367
## F-statistic: 23.7 on 7 and 24 DF, p-value: 2.566e-09
Even after removing some of the variables, this model explains 84% of the variance in mpg (adjusted R-squared), and the residual plot does not show any trend (Figure 4 - appendix). So, this is a good model. Although the model estimates that manual transmission gives 2.9 mpg more than automatic, the effect of transmission on mpg is not significant at the 5% level (p-value of 0.099). Based on this result, there is no significant difference between automatic and manual transmission.
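For reference, the variance inflation factor of a numeric predictor can be computed by hand as 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below is a simplified illustration that assumes a model with three numeric predictors only (disp, hp, wt), so it will not reproduce the GVIF values of the full model above; the names `predictors`, `vif_by_regression` and `vif_by_correlation` are illustrative.

```r
# Sketch: VIF by hand for a model with numeric predictors only.
cars <- datasets::mtcars
predictors <- cars[, c("disp", "hp", "wt")]

# Regress each predictor on the remaining ones and apply 1 / (1 - R^2)
vif_by_regression <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_by_correlation <- diag(solve(cor(predictors)))

print(round(vif_by_regression, 3))
print(round(vif_by_correlation, 3))
```

A large value means the predictor is largely explained by the others, so its coefficient is estimated with less precision.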
The idea of the combination test is to try many candidate models derived from the full one, evaluate their quality and pick the best. The step function does this stepwise, comparing models by AIC:
fit2 <- step(fit0, trace = 0) # trace = 0 suppresses printing of the intermediate models
summary(fit2)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## am1 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
This method returns a slightly different and smaller set of variables: cyl, hp, wt and am. Again, this is a good model that explains 84% of the variance in mpg (adjusted R-squared, not much different from the VIF method above) with randomly scattered residuals (Figure 5 - appendix). This model estimates that manual transmission gives 1.8 mpg more than automatic. Even so, the effect of transmission on mpg is again not significant (p-value of 0.206). Based on this model, there is no significant difference between automatic and manual transmission.
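To see what the stepwise search optimizes, the AIC of the full model can be compared with that of the selected model. This is a minimal sketch on a local copy of the data, with illustrative names (`cars`, `fit_full`, `fit_step`, `aic_values`, `selected_terms`); a lower AIC indicates a better trade-off between fit and number of parameters.

```r
# Sketch: compare AIC before and after stepwise selection.
cars <- datasets::mtcars
cols <- c("cyl", "am", "gear", "vs", "carb")
cars[cols] <- lapply(cars[cols], factor)

fit_full <- lm(mpg ~ ., data = cars)
fit_step <- step(fit_full, trace = 0)   # no scope given, so terms are only dropped

aic_values <- c(full = AIC(fit_full), stepwise = AIC(fit_step))
print(round(aic_values, 2))

# Terms kept by the search
selected_terms <- attr(terms(fit_step), "term.labels")
print(selected_terms)
```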
From this data, a simple regression of mpg on am suggests that manual transmission is better than automatic by 7.245 mpg (95% CI from 3.64 to 10.85), with a p-value of 0.000285. However, this model is poor and explains only 36% of the variance in mpg.
Further investigation shows that transmission is not the only variable that affects mileage. I tested further models, one with all variables and the others with selected variables. These models fit well (adjusted R-squared of 78% for the full model and 84% for the selected models). They show that although manual transmission seems to perform better than automatic, with coefficients between 1.8 and 2.9 mpg in the selected models, transmission itself does not affect mileage as much as other variables. Its p-values are about 0.1 and 0.2, respectively, so the effect is not significant.
Finally, I think the deeper models give a reasonable result, because real car efficiency should depend on a combination of several factors (which in this study are number of cylinders, weight and horsepower), not only on transmission type.
# Figure 1: pairwise scatterplots of all variables
pairs(mtcars, main = "Correlation of mtcars")
# Figure 2: mpg by transmission type
g <- qplot(as.factor(am), mpg, data = mtcars) + geom_boxplot(aes(fill = as.factor(am)))
g <- g + xlab("Transmission (0=auto, 1=manual)") + scale_fill_discrete(name = "Transmission")
g
# Figure 3: mpg by cyl and by vs, one panel per transmission type
g1 <- qplot(cyl, mpg, data = mtcars, facets = ~am) + geom_boxplot(aes(fill = factor(cyl)))
g1 <- g1 + xlab("No. of cylinders, by transmission (0=auto, 1=manual)") + scale_fill_discrete(name = "No. of cylinders")
g2 <- qplot(vs, mpg, data = mtcars, facets = ~am) + geom_boxplot(aes(fill = factor(vs)))
g2 <- g2 + xlab("Engine alignment (vs), by transmission (0=auto, 1=manual)")
g2 <- g2 + scale_fill_discrete(name = "Engine alignment (vs)")
grid.arrange(g1, g2, ncol = 2)
# Figure 4: residuals vs. fitted values for the VIF-based model
plot(fit1, which = 1)
# Figure 5: residuals vs. fitted values for the stepwise model
plot(fit2, which = 1)