Summary

This project applies a linear regression model to the mtcars data set. The analysis looks at several variables to answer whether transmission type (manual vs. automatic) affects fuel efficiency (miles per gallon).

My analysis shows that if we look only at the relationship between mileage and transmission, the two are clearly correlated. However, other variables confound this relationship. A deeper multivariable analysis suggests that transmission type alone does not determine fuel efficiency.

Data loading

data(mtcars)
library(ggplot2) # to create plot
library(gridExtra) # to arrange plots
## Warning: package 'gridExtra' was built under R version 3.4.4
library(car) # Companion to Applied Regression (vif function)
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4

Exploratory analysis

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

This data set contains 32 observations of 11 variables: mpg (miles per gallon), cyl (number of cylinders), disp (displacement, cu. in.), hp (gross horsepower), drat (rear axle ratio), wt (weight, in 1000 lbs), qsec (quarter-mile time), vs (engine shape: 0 = V-shaped, 1 = straight), am (transmission: 0 = automatic, 1 = manual), gear (number of forward gears) and carb (number of carburetors).

Figure 1 (appendix) gives a quick look at the pairwise relationships between them. Several variables take only a few distinct values and are best treated as categorical: cyl, vs, am, gear and carb.

Simple approach

If we look only at transmission type vs. mpg (Figure 2 - appendix), we get a first impression that manual transmission is better than automatic in terms of mileage. We can also simply calculate the correlation between them:

cor(mtcars$am, mtcars$mpg)
## [1] 0.5998324

Or perform a simple t-test to compare:

t.test(mtcars$mpg[mtcars$am == "0"], mtcars$mpg[mtcars$am == "1"])
## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg[mtcars$am == "0"] and mtcars$mpg[mtcars$am == "1"]
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

Both of these simple analyses suggest that manual transmission is better for mileage. The difference in group means is about 7.2 miles per gallon (24.4 compared to 17.1).
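
As a cross-check on the t-test above, the Welch statistic can be rebuilt by hand from the group means, variances and sizes. This is only a sketch: the names `cars`, `auto`, `manual`, `se_welch` and `t_welch` are introduced here for illustration, and a fresh copy of the data set is used so that am is still numeric.

```r
cars   <- datasets::mtcars                 # fresh copy, am is numeric 0/1
auto   <- cars$mpg[cars$am == 0]           # automatic transmission
manual <- cars$mpg[cars$am == 1]           # manual transmission

# Difference in sample means (manual minus automatic)
mean_diff <- mean(manual) - mean(auto)

# Welch statistic: difference in means divided by the unpooled standard error
se_welch <- sqrt(var(auto) / length(auto) + var(manual) / length(manual))
t_welch  <- (mean(auto) - mean(manual)) / se_welch

# Same call shape as in the text, for comparison
welch <- t.test(auto, manual)
c(mean_diff = mean_diff, t_by_hand = t_welch, t_from_test = unname(welch$statistic))
```

The hand-computed statistic should agree with the one reported by t.test, since Welch's test differs from the classical test only in using the unpooled standard error and an adjusted number of degrees of freedom.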

However, other variables may affect mpg as well. Figure 3 (appendix) shows that cyl and vs have separate yet similar effects on mpg. Therefore it is necessary to fit a linear regression model that adjusts for more variables.

More sophisticated approach

In this approach, I build different linear models including:
1. only mpg and am
2. mpg and all remaining variables
3. the best model between mpg and meaningful variables among the remaining ones

I then interpret each of them to find a deeper relationship between transmission and other variables with respect to mileage.

First, we need to convert all categorical variables to factors:

cols <- c("cyl", "am", "gear", "vs", "carb")
mtcars[cols] <- lapply(mtcars[cols], factor)
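
To illustrate why this conversion matters, the sketch below compares the design matrix that lm would build for cyl as a number and as a factor. The names `d`, `numeric_cols` and `factor_cols` are introduced here for illustration only, and the comparison uses a separate copy of the data.

```r
d <- datasets::mtcars                      # separate copy, leaves mtcars untouched
d$cyl <- factor(d$cyl)                     # levels "4", "6", "8"

# Numeric cyl: a single slope, i.e. the same mpg change per extra cylinder
numeric_cols <- colnames(model.matrix(mpg ~ cyl, data = datasets::mtcars))

# Factor cyl: one indicator column per non-reference level (4 is the reference)
factor_cols <- colnames(model.matrix(mpg ~ cyl, data = d))

numeric_cols
factor_cols
```

With the factor coding, each level gets its own coefficient relative to the reference level, which is why the model summaries below report rows such as cyl6, cyl8 and am1.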

1. Linear regression: mpg vs. am

Now let’s first look at a linear model that predicts mpg from am alone:

fit <- lm(mpg ~ am, data = mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am1            7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285
confint(fit)
##                2.5 %   97.5 %
## (Intercept) 14.85062 19.44411
## am1          3.64151 10.84837

This model takes am = 0 (automatic) as the reference. It shows that mileage with manual transmission is significantly higher than with automatic, by 7.245 miles per gallon (95% CI: 3.64 to 10.85), with a p-value of 0.000285. This is not surprising, as we observed the same pattern earlier in the exploratory analysis.
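
The link between this model and the earlier simple analyses can be checked directly. The sketch below, using names introduced here for illustration (`d`, `fit_am`, `group_means`, `p_lm`, `p_pooled`, `r`), shows that the intercept is the mean mpg of automatic cars, the am1 coefficient is the difference in group means, the p-value matches a pooled-variance t-test, and R-squared is the squared correlation between am and mpg.

```r
d <- datasets::mtcars
d$am <- factor(d$am)                       # 0 = automatic, 1 = manual
fit_am <- lm(mpg ~ am, data = d)

# Intercept = mean of the reference group; am1 = difference in group means
group_means <- tapply(d$mpg, d$am, mean)
slope       <- unname(coef(fit_am)["am1"])
mean_gap    <- group_means[["1"]] - group_means[["0"]]

# The p-value for am1 equals that of a pooled-variance (classical) t-test
p_lm     <- summary(fit_am)$coefficients["am1", "Pr(>|t|)"]
p_pooled <- t.test(mpg ~ am, data = d, var.equal = TRUE)$p.value

# With a single predictor, R-squared is the squared correlation
r  <- cor(datasets::mtcars$am, datasets::mtcars$mpg)
r2 <- summary(fit_am)$r.squared

c(slope = slope, mean_gap = mean_gap, p_lm = p_lm, p_pooled = p_pooled,
  r_squared = r2, cor_squared = r^2)
```

The Welch test used earlier does not pool the variances, which is why its p-value (0.001374) differs slightly from the one in the regression summary.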

The R-squared value shows that this model explains only 36% of the variance in mpg. Therefore I need to include more variables. The first attempt is to include all of them:

2. Linear regression: mpg vs. all variables

fit0 <- lm(mpg ~ ., data = mtcars)
summary(fit0)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 23.87913   20.06582   1.190   0.2525  
## cyl6        -2.64870    3.04089  -0.871   0.3975  
## cyl8        -0.33616    7.15954  -0.047   0.9632  
## disp         0.03555    0.03190   1.114   0.2827  
## hp          -0.07051    0.03943  -1.788   0.0939 .
## drat         1.18283    2.48348   0.476   0.6407  
## wt          -4.52978    2.53875  -1.784   0.0946 .
## qsec         0.36784    0.93540   0.393   0.6997  
## vs1          1.93085    2.87126   0.672   0.5115  
## am1          1.21212    3.21355   0.377   0.7113  
## gear4        1.11435    3.79952   0.293   0.7733  
## gear5        2.52840    3.73636   0.677   0.5089  
## carb2       -0.97935    2.31797  -0.423   0.6787  
## carb3        2.99964    4.29355   0.699   0.4955  
## carb4        1.09142    4.44962   0.245   0.8096  
## carb6        4.47757    6.38406   0.701   0.4938  
## carb8        7.25041    8.36057   0.867   0.3995  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.779 
## F-statistic:  7.83 on 16 and 15 DF,  p-value: 0.000124

This model shows that including more variables explains mpg better, with an adjusted R-squared of 78%.
Interestingly, although the model estimates that manual transmission gives 1.21 more miles per gallon than automatic, the change from automatic to manual is not significant (p-value = 0.7113).

3. Linear regression: mpg vs. best variables

In practice, there are different criteria for choosing the best variables to include in the model. Starting from the full model (fit0 above), I select them using two methods:
a) ranking by variance inflation
b) a combination test (stepwise selection).

a) Variance inflation (vif function)

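The variance inflation factor measures how much the variance of a coefficient estimate is inflated by correlation between that predictor and the other predictors. It describes collinearity among the predictors rather than their contribution to the response, so I use it here only as a rough heuristic for ranking variables. For the full model, the variance inflation of each variable is summarized below: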

vif(fit0)
##            GVIF Df GVIF^(1/(2*Df))
## cyl  128.120962  2        3.364380
## disp  60.365687  1        7.769536
## hp    28.219577  1        5.312210
## drat   6.809663  1        2.609533
## wt    23.830830  1        4.881683
## qsec  10.790189  1        3.284842
## vs     8.088166  1        2.843970
## am     9.930495  1        3.151269
## gear  50.852311  2        2.670408
## carb 503.211851  5        1.862838
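
For a numeric predictor, the ordinary variance inflation factor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below computes it by hand in base R for a smaller, purely numeric model (mpg ~ disp + hp + wt), which is an assumption made here to keep the example simple; the GVIF values above come from the full model with factors and will differ. The names `d`, `r2_wt`, `vif_wt` and `vif_all` are introduced for illustration.

```r
d <- datasets::mtcars
predictors <- c("disp", "hp", "wt")

# VIF of wt: regress wt on the other predictors and use 1 / (1 - R^2)
r2_wt  <- summary(lm(wt ~ disp + hp, data = d))$r.squared
vif_wt <- 1 / (1 - r2_wt)

# All VIFs at once: the diagonal of the inverse correlation matrix
vif_all <- diag(solve(cor(d[, predictors])))
names(vif_all) <- predictors

vif_wt
vif_all
```

When a term has more than one degree of freedom, as cyl, gear and carb do here, car::vif reports a generalized VIF, and GVIF^(1/(2*Df)) is the column that is comparable across terms.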

Because I want to keep am, I rank the variables from high to low by the adjusted value GVIF^(1/(2*Df)), which is comparable across terms with different degrees of freedom, and stop at am. They are (in order): disp, hp, wt, cyl, qsec, and am. I build a model with these variables only:

fit1 <- lm(mpg ~ disp+hp+wt+cyl+qsec+am, data = mtcars)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ disp + hp + wt + cyl + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9501 -1.4335 -0.1542  1.3632  4.1917 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 20.75881   11.54341   1.798   0.0847 .
## disp         0.00680    0.01289   0.528   0.6026  
## hp          -0.02477    0.01537  -1.612   0.1201  
## wt          -3.40642    1.30019  -2.620   0.0150 *
## cyl6        -1.98412    1.76113  -1.127   0.2710  
## cyl8        -0.97700    3.24098  -0.301   0.7657  
## qsec         0.67413    0.57760   1.167   0.2546  
## am1          2.91836    1.70260   1.714   0.0994 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.435 on 24 degrees of freedom
## Multiple R-squared:  0.8736, Adjusted R-squared:  0.8367 
## F-statistic:  23.7 on 7 and 24 DF,  p-value: 2.566e-09

Even with several variables removed, this model explains 84% of the variance in mpg (adjusted R-squared), and the residual plot shows no obvious trend (Figure 4 - appendix). So this is a reasonable model. Although it estimates that manual transmission gives 2.9 more miles per gallon than automatic with the other variables held fixed, the effect is not significant at the 0.05 level (p-value = 0.0994). Based on this result, there is no statistically significant difference between automatic and manual transmission.
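
Another way to read the am1 row is as a comparison of two nested models, one with am and one without. The sketch below refits both on a separate copy of the data; the names `d`, `with_am`, `without_am`, `p_f` and `p_t` are introduced here for illustration. Because am adds a single coefficient, the F-test from anova gives the same p-value as the t-test in the summary.

```r
d <- datasets::mtcars
d[c("cyl", "am")] <- lapply(d[c("cyl", "am")], factor)

with_am    <- lm(mpg ~ disp + hp + wt + cyl + qsec + am, data = d)
without_am <- update(with_am, . ~ . - am)  # same model with am dropped

# F-test: does adding am reduce the residual sum of squares significantly?
cmp <- anova(without_am, with_am)
p_f <- cmp[2, "Pr(>F)"]

# t-test for the am1 coefficient in the larger model
p_t <- summary(with_am)$coefficients["am1", "Pr(>|t|)"]

c(p_from_anova = p_f, p_from_summary = p_t)
```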

b) Combination test (step function)

The idea of this method is to start from the full model, repeatedly try dropping terms, score each candidate model by AIC, and keep the model with the lowest AIC.

fit2 <- step(fit0, trace = 0) # trace = 0 hides the intermediate models from the console
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## am1          1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

This method returns a slightly different and smaller set of variables: cyl, hp, wt and am. Again, this is a reasonable model that explains 84% of the variance in mpg (adjusted R-squared, not much different from the VIF method above) with no trend in the residuals (Figure 5 - appendix). This model estimates that manual transmission gives 1.8 more miles per gallon than automatic. Even so, the effect of transmission is again not significant (p-value = 0.206). Based on this model, there is no statistically significant difference between automatic and manual transmission.
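
The selection can be made more explicit by naming the search direction and comparing AIC before and after. This is a sketch on a separate copy of the data; the names `d`, `full` and `reduced` are introduced here for illustration. When no scope is given, step searches backward only, so spelling out direction = "backward" should reproduce the call used above.

```r
d <- datasets::mtcars
cols <- c("cyl", "am", "gear", "vs", "carb")
d[cols] <- lapply(d[cols], factor)

full    <- lm(mpg ~ ., data = d)
# Backward elimination: drop one term at a time while AIC keeps decreasing
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)                           # terms kept by the search
AIC(full, reduced)                         # lower AIC is better
```

AIC trades goodness of fit against the number of parameters, so a term such as am can be retained even when its individual coefficient is not significant at the 0.05 level.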

Conclusion

From this data, a simple regression of mpg on am suggests that manual transmission is better than automatic by 7.245 miles per gallon (95% CI: 3.64 to 10.85), with a p-value of 0.000285. However, this model is poor and explains only 36% of the variance in mpg.

Further investigation shows that transmission is not the only variable that affects mileage. I tested further models with all variables and with the best variables. These models fit well (adjusted R-squared of 78% and 84%, respectively). They show that although manual transmission seems to perform better than automatic, with coefficients between 1.8 and 2.9 miles per gallon in the two best-variable models, transmission itself does not affect mileage as much as other variables. Its effect is not significant in those two models, with p-values of about 0.1 and 0.2, respectively.

In closing, I think the deeper models give reasonable results, because real-world fuel efficiency should depend on a combination of several factors (in this study: number of cylinders, weight and horsepower) and not only on transmission type.

Appendix

Figure 1

pairs(mtcars, main = "Correlation of mtcars")

Figure 2

g <- qplot(as.factor(am),mpg,data=mtcars) + geom_boxplot(aes(fill = as.factor(am)))
g <- g + xlab("Transmission (0=auto, 1=manual)") + scale_fill_discrete(name="Transmission")
g

Figure 3

# The x axis is cyl (or vs); the two panels are transmission (0=auto, 1=manual)
g1 <- qplot(cyl, mpg, data = mtcars, facets = ~am) + geom_boxplot(aes(fill = factor(cyl)))
g1 <- g1 + xlab("No. of cylinders (panels: 0=auto, 1=manual)") + scale_fill_discrete(name="No. of cylinders")
g2 <- qplot(vs, mpg, data = mtcars, facets = ~am) + geom_boxplot(aes(fill = factor(vs)))
g2 <- g2 + xlab("Engine shape vs (panels: 0=auto, 1=manual)")
g2 <- g2 + scale_fill_discrete(name="Engine shape (vs)")

# Place the two plots side by side
grid.arrange(g1, g2, ncol = 2)

Figure 4

plot(fit1, which = 1) # residuals vs. fitted values

Figure 5

plot(fit2, which = 1) # residuals vs. fitted values