Executive Summary

In this project we build a linear regression model to predict miles per gallon (MPG) using the mtcars dataset. We evaluate candidate regressors with the final goal of answering the following questions:

  1. Is an automatic or manual transmission better for MPG?
  2. How large is the MPG difference between automatic and manual transmissions?

Exploratory Data Analysis

From the plot shown in the appendix, we see that mpg has a strong linear correlation with disp, hp and wt. It also has a moderate correlation with drat, and distinctive patterns can be seen for the categorical variables cyl, vs and am.
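The strength of these relationships can also be checked numerically. The following is a minimal sketch, not part of the original analysis; the names `cars` and `mpg_cor` are illustrative, and a fresh copy of the dataset is used so that all columns are still numeric.

```r
# Pearson correlation of mpg with each continuous candidate regressor.
cars <- datasets::mtcars  # fresh copy, independent of any later preprocessing

mpg_cor <- sapply(cars[, c("disp", "hp", "wt", "drat")],
                  function(x) cor(cars$mpg, x))
print(round(mpg_cor, 3))

# Mean mpg within each level of the categorical variables.
print(aggregate(mpg ~ cyl, data = cars, FUN = mean))
print(aggregate(mpg ~ vs,  data = cars, FUN = mean))
print(aggregate(mpg ~ am,  data = cars, FUN = mean))
```

Correlations with a large absolute value support the "strong" reading of the plot, while the group means give a first impression of the patterns in cyl, vs and am.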

Building Linear Regression

Starting with a naive approach, we build a simple linear regression model that predicts mpg from the variable am alone.

data(mtcars) # Load and preprocess data
mtcars$am <- as.factor(mtcars$am)   # 0 = automatic, 1 = manual
mtcars$cyl <- as.factor(mtcars$cyl) # number of cylinders: 4, 6 or 8
mtcars$vs <- as.factor(mtcars$vs)   # 0 = V-shaped, 1 = straight engine

naive <- lm(mpg ~ am, data = mtcars)
summary(naive)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am1          7.244939   1.764422  4.106127 2.850207e-04
summary(naive)$r.squared
## [1] 0.3597989

This model states that cars with a manual transmission (am = 1) get, on average, 7.245 more miles per gallon than cars with an automatic transmission (am = 0). However, the R-squared is only 0.36, meaning the model explains only 36% of the variance in mpg. Hence, we need to build a more robust model.
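With a single two-level factor as the regressor, the intercept is the mean mpg of the reference group and the slope is the difference between the two group means. The sketch below illustrates this; it relabels the levels using the coding given in the mtcars documentation (0 = automatic, 1 = manual), and the names `cars`, `group_means` and `diff_means` are illustrative.

```r
# Relate the coefficients of lm(mpg ~ am) to the raw group means.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

group_means <- tapply(cars$mpg, cars$am, mean)
naive <- lm(mpg ~ am, data = cars)

print(group_means)  # mean mpg per transmission type
print(coef(naive))  # intercept = automatic mean, slope = manual - automatic

diff_means <- unname(group_means["manual"] - group_means["automatic"])
```

Labelling the levels explicitly makes the direction of the effect visible in the coefficient name (`ammanual`) instead of relying on the reader remembering the 0/1 coding.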

First, we use all the predictor variables mentioned in the Exploratory Data Analysis section.

fit <- lm(mpg ~ am + vs + cyl + disp + hp + wt + drat, data = mtcars)
summary(fit)$coef
##                 Estimate Std. Error      t value     Pr(>|t|)
## (Intercept) 29.829969134 6.74446788  4.422879559 0.0001962074
## am1          2.558988828 1.74302127  1.468134026 0.1556117761
## vs1          2.004897600 1.82994849  1.095603300 0.2845926673
## cyl6        -2.055523435 1.80310789 -1.139989150 0.2660238246
## cyl8        -0.023304443 3.81651017 -0.006106218 0.9951806281
## disp         0.004360163 0.01303611  0.334468226 0.7410571328
## hp          -0.035794756 0.01463423 -2.445960216 0.0225138202
## wt          -2.594622674 1.20129538 -2.159854031 0.0414485707
## drat         0.388141033 1.46606024  0.264751080 0.7935593982
summary(fit)$adj.r.squared
## [1] 0.829248

We now have a good model, with an adjusted R-squared of 0.8292. However, inference on this model indicates that some of the regressors are not statistically significant; for example, the p-value for drat is as high as 0.79. Moreover, some of the regressors are highly correlated; for example, disp and wt have a correlation of 0.888. The second model, shown below, aims for parsimony. The steps taken to reach this parsimonious model are not discussed here.
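The collinearity concern can be quantified with a variance inflation factor (VIF). The sketch below is one possible check and is not necessarily how the regressors were chosen in this report; it computes the VIF of disp by hand as 1 / (1 - R²), where R² comes from regressing disp on the other regressors of the full model. The names `cars`, `cor_disp_wt` and `vif_disp` are illustrative.

```r
# Pairwise correlation and a hand-computed VIF for disp.
cars <- datasets::mtcars
cars$am  <- as.factor(cars$am)
cars$cyl <- as.factor(cars$cyl)
cars$vs  <- as.factor(cars$vs)

cor_disp_wt <- cor(cars$disp, cars$wt)
print(round(cor_disp_wt, 3))

# Regress disp on every other regressor used in the full model.
aux <- lm(disp ~ am + vs + cyl + hp + wt + drat, data = cars)
vif_disp <- 1 / (1 - summary(aux)$r.squared)
print(vif_disp)
```

A VIF well above 1 means the variance of the disp coefficient is inflated by its overlap with the other regressors, which is a common argument for dropping it.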

fit1 <- lm(mpg ~ cyl + vs + am + hp + wt, data = mtcars)
summary(fit1)$coef
##                Estimate Std. Error     t value     Pr(>|t|)
## (Intercept) 31.18461386 3.42002374  9.11824486 1.996628e-09
## cyl6        -2.09010865 1.62867960 -1.28331481 2.111508e-01
## cyl8         0.29097541 3.14269833  0.09258776 9.269690e-01
## vs1          1.99000402 1.76018458  1.13056554 2.689680e-01
## am1          2.70384441 1.59850120  1.69148726 1.031742e-01
## hp          -0.03475025 0.01381876 -2.51471630 1.871372e-02
## wt          -2.37336709 0.88763117 -2.67382125 1.302256e-02
summary(fit1)$adj.r.squared
## [1] 0.8417804

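We arrive at our final model, which has a higher adjusted R-squared (0.8418 versus 0.8292) and fewer regressors. In this model, holding all other variables constant, a manual-transmission car gets on average 2.70 more miles per gallon than an automatic-transmission car. Note, however, that this coefficient is not statistically significant at the 5% level (p = 0.103).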
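To attach uncertainty to the transmission effect, one option is a confidence interval for the am coefficient together with a nested-model F-test against the larger model. This is a hedged sketch using only base R; the names `cars`, `am_ci` and `nested` are illustrative, and the models are refitted here so that the block runs on its own.

```r
# Confidence interval for am and a nested comparison of the two models.
cars <- datasets::mtcars
cars$am  <- as.factor(cars$am)   # 0 = automatic, 1 = manual
cars$cyl <- as.factor(cars$cyl)
cars$vs  <- as.factor(cars$vs)

fit  <- lm(mpg ~ am + vs + cyl + disp + hp + wt + drat, data = cars)
fit1 <- lm(mpg ~ cyl + vs + am + hp + wt, data = cars)

am_ci <- confint(fit1, "am1", level = 0.95)
print(am_ci)   # interval for the manual-minus-automatic difference

nested <- anova(fit1, fit)  # does adding disp and drat improve the fit?
print(nested)
```

If the interval contains zero, the data are compatible with no transmission effect once cyl, vs, hp and wt are accounted for; a large p-value in the F-test supports leaving disp and drat out.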

Appendix

Correlation plot

library(GGally)
ggpairs(mtcars)

Residuals and Diagnostics

par(mfrow = c(2, 2))
plot(fit1)

Interpretations:

  1. Residuals are randomly scattered around zero across the fitted values, showing no systematic pattern
  2. Residuals are approximately normally distributed, with slight departures in both tails
  3. The variance of the residuals is roughly constant across the fitted values
  4. No observation appears to be an influential outlier
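The visual readings above can be complemented with numerical diagnostics. The sketch below is an optional addition and was not part of the original analysis; it runs a Shapiro-Wilk test on the residuals and lists the observations with the largest Cook's distance and leverage. The names `cars`, `sw`, `cook` and `lev` are illustrative.

```r
# Numerical companions to the four diagnostic plots.
cars <- datasets::mtcars
cars$am  <- as.factor(cars$am)
cars$cyl <- as.factor(cars$cyl)
cars$vs  <- as.factor(cars$vs)
fit1 <- lm(mpg ~ cyl + vs + am + hp + wt, data = cars)

sw <- shapiro.test(resid(fit1))  # null hypothesis: residuals are normal
print(sw)

cook <- cooks.distance(fit1)     # influence of each car on the fit
lev  <- hatvalues(fit1)          # leverage of each car
print(head(sort(cook, decreasing = TRUE), 3))
print(head(sort(lev,  decreasing = TRUE), 3))
```

Cars that rank highly on both Cook's distance and leverage are the ones worth inspecting individually before concluding that no observation is influential.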