In this project we build a linear regression model to predict miles per gallon (MPG) from the mtcars
dataset. We evaluate suitable regressors with the final goal of answering the following questions:
From the plot shown in the appendix, we see that mpg has a strong linear correlation with disp, hp, and wt. It also has a moderate correlation with drat, and distinctive patterns can be seen for the categorical variables cyl, vs, and am.
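As a numeric companion to the pairs plot, the correlations can be computed directly. This is a minimal sketch, not part of the original analysis; the names cars and mpg_cor are introduced here for illustration, and the copy is taken from datasets::mtcars so that it does not interfere with any factor conversion done elsewhere.

```r
# Fresh, all-numeric copy of the data (hypothetical name `cars`)
cars <- datasets::mtcars

# Pearson correlation of mpg with each continuous candidate regressor
mpg_cor <- sapply(cars[, c("disp", "hp", "wt", "drat")],
                  function(x) cor(cars$mpg, x))
round(mpg_cor, 3)
```

The sign of each value gives the direction of the relationship: disp, hp, and wt are negatively correlated with mpg, while drat is positively correlated.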
Starting with a naive approach, we build a simple linear regression model to predict mpg from the single variable am.
data(mtcars) #Load and preprocess data
mtcars$am <- as.factor(mtcars$am)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
naive <- lm(mpg ~ am, data = mtcars)
summary(naive)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am1 7.244939 1.764422 4.106127 2.850207e-04
summary(naive)$r.squared
## [1] 0.3597989
This model states that cars with manual transmission (am = 1) get, on average, 7.245 more miles per gallon than cars with automatic transmission (am = 0). However, the R-squared is only 0.36, meaning the model explains only about 36% of the variance in mpg. Hence, we need to build a more robust model.
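With a single two-level factor as the regressor, the intercept is the mean of the reference group and the am1 coefficient is the difference between the two group means. The sketch below checks this under the coding documented in ?mtcars (0 = automatic, 1 = manual); the names cars, group_means, and mean_diff are illustrative.

```r
# Fresh copy of the data (hypothetical name `cars`)
cars <- datasets::mtcars

# Mean mpg per transmission type: "0" = automatic, "1" = manual
group_means <- tapply(cars$mpg, cars$am, mean)
group_means

# Difference manual minus automatic; should match the am1 estimate
mean_diff <- unname(group_means["1"] - group_means["0"])
mean_diff
```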
First, we use all the predictor variables mentioned in the Exploratory Data Analysis section.
fit <- lm(mpg ~ am + vs + cyl + disp + hp + wt + drat, data = mtcars)
summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.829969134 6.74446788 4.422879559 0.0001962074
## am1 2.558988828 1.74302127 1.468134026 0.1556117761
## vs1 2.004897600 1.82994849 1.095603300 0.2845926673
## cyl6 -2.055523435 1.80310789 -1.139989150 0.2660238246
## cyl8 -0.023304443 3.81651017 -0.006106218 0.9951806281
## disp 0.004360163 0.01303611 0.334468226 0.7410571328
## hp -0.035794756 0.01463423 -2.445960216 0.0225138202
## wt -2.594622674 1.20129538 -2.159854031 0.0414485707
## drat 0.388141033 1.46606024 0.264751080 0.7935593982
summary(fit)$adj.r.squared
## [1] 0.829248
We obtain a good model, with an adjusted R-squared of 0.8292. However, inference for this linear model indicates that some of the regressors are not statistically significant, e.g. the p-value for drat is as high as 0.79. Moreover, some of the regressors are highly correlated, e.g. disp and wt have a correlation of 0.888. The second model aims for parsimony and is shown below. The steps taken to reach this parsimonious model are not discussed here.
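The collinearity noted above can be quantified with variance inflation factors. The following is only a sketch of one possible check, not the selection procedure used in this report (which is not described); vif_manual is a hypothetical helper that computes the textbook quantity 1 / (1 - R^2) by regressing each continuous predictor on all the others.

```r
# Fresh copy with the same factor conversion as in the report
cars <- datasets::mtcars
for (v in c("am", "cyl", "vs")) cars[[v]] <- factor(cars[[v]])

# Pairwise correlation quoted in the text
disp_wt_cor <- cor(cars$disp, cars$wt)
disp_wt_cor

# Hypothetical helper: VIF = 1 / (1 - R^2) from regressing `target` on `others`
vif_manual <- function(target, others, data) {
  r2 <- summary(lm(reformulate(others, response = target), data = data))$r.squared
  1 / (1 - r2)
}

predictors <- c("am", "vs", "cyl", "disp", "hp", "wt", "drat")
vifs <- sapply(c("disp", "hp", "wt", "drat"),
               function(v) vif_manual(v, setdiff(predictors, v), cars))
vifs
```

A large value for a predictor means that most of its variation is already explained by the other regressors, which inflates the standard error of its coefficient.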
fit1 <- lm(mpg ~ cyl + vs + am + hp + wt, data = mtcars)
summary(fit1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.18461386 3.42002374 9.11824486 1.996628e-09
## cyl6 -2.09010865 1.62867960 -1.28331481 2.111508e-01
## cyl8 0.29097541 3.14269833 0.09258776 9.269690e-01
## vs1 1.99000402 1.76018458 1.13056554 2.689680e-01
## am1 2.70384441 1.59850120 1.69148726 1.031742e-01
## hp -0.03475025 0.01381876 -2.51471630 1.871372e-02
## wt -2.37336709 0.88763117 -2.67382125 1.302256e-02
summary(fit1)$adj.r.squared
## [1] 0.8417804
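We arrive at our final model, which has a higher adjusted R-squared and fewer regressors. In this model, holding every other variable constant, a car with manual transmission gets on average 2.70 more miles per gallon than a car with automatic transmission. Note, however, that the p-value for am1 is 0.103, so this difference is not statistically significant at the 5% level.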
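Two follow-up checks help support this conclusion: a nested-model F-test of the reduced model against the full one, and a confidence interval for the transmission effect. This is a sketch rather than part of the original analysis; the models are refit under the illustrative names full and reduced so that the block runs on its own.

```r
# Fresh copy with the same factor conversion as in the report
cars <- datasets::mtcars
for (v in c("am", "cyl", "vs")) cars[[v]] <- factor(cars[[v]])

full    <- lm(mpg ~ am + vs + cyl + disp + hp + wt + drat, data = cars)
reduced <- lm(mpg ~ cyl + vs + am + hp + wt, data = cars)

# F-test: does dropping disp and drat significantly worsen the fit?
cmp <- anova(reduced, full)
cmp

# 95% confidence interval for the manual-vs-automatic difference
am_ci <- confint(reduced, "am1", level = 0.95)
am_ci
```

A non-significant F-test indicates that the two dropped regressors add little beyond what the reduced model already captures. If the interval for am1 contains zero, the data are consistent with no transmission effect once cyl, vs, hp, and wt are accounted for.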
library(GGally)
ggpairs(mtcars)
par(mfrow = c(2, 2))
plot(fit1)
Interpretations: