We analyze the mtcars data set, which contains information on different car models. In particular, we want to determine whether a manual or an automatic transmission is better for fuel consumption (MPG), and to quantify the difference.
We fitted a simple linear regression model that uses Transmission to predict MPG, and a multivariable linear regression model that includes the other relevant variables together with Transmission. We then compared the two models to see which performs better.
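As a result, Manual cars appear to be more fuel efficient than Automatic cars. In the simple model, Manual cars average 7.24 MPG more than Automatic cars. When the other variables are held equal, the estimated advantage drops to 1.8 MPG, and this adjusted difference is not statistically significant (p ≈ 0.21).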
\(Y = \beta_0 + \beta_1 X\), where \(Y\) = MPG and \(X\) = Transmission Type (0 = Automatic, 1 = Manual)
Now we estimate \(\beta_0\) and \(\beta_1\):
data(mtcars) #load data set
mtcars$am <- as.factor(mtcars$am); mtcars$cyl <- as.factor(mtcars$cyl) #convert to factor
levels(mtcars$am) <- c("Automatic", "Manual")
model_simple <- lm(mpg ~ am, data = mtcars) #create linear regression model
coef(summary(model_simple))
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## amManual     7.244939   1.764422  4.106127 2.850207e-04
The result is \(\beta_0 = 17.15\) and \(\beta_1 = 7.24\).
As we just showed, the mean MPG of Manual cars is 7.24 higher than the mean MPG of Automatic cars. But is this a significant difference?
With a p-value of \(2.85 \times 10^{-4}\), we reject the null hypothesis (H0) of no difference and conclude that the mean MPG differs significantly between Manual and Automatic cars, with Automatic cars being less efficient.
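Because am is a two-level factor, the slope of the simple model should equal the difference between the two group means, and its test should match a pooled-variance two-sample t-test. The following is a minimal sketch of that cross-check. The names `cars`, `group_means`, `fit_simple` and `t_pooled` are illustrative and not part of the original analysis, and a local copy of the data is used so that the report's own `mtcars` object is not modified.

```r
# Local copy so the report's own mtcars object is left untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Mean MPG per transmission type and their difference
group_means <- tapply(cars$mpg, cars$am, mean)
diff_means <- unname(group_means["Manual"] - group_means["Automatic"])

# Slope of the simple regression mpg ~ am
fit_simple <- lm(mpg ~ am, data = cars)
slope <- unname(coef(fit_simple)["amManual"])
p_lm <- coef(summary(fit_simple))["amManual", "Pr(>|t|)"]

# Two-sample t-test assuming equal variances (same assumption as lm)
t_pooled <- t.test(mpg ~ am, data = cars, var.equal = TRUE)

print(c(diff_means = diff_means, slope = slope))
print(c(p_ttest = t_pooled$p.value, p_lm = p_lm))
```

Both printed pairs should agree, which confirms that the regression on a single binary predictor is a restatement of the comparison of means.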
\(Y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n\)
Now we have to identify the significant predictors \(X_i\) and their coefficients \(\beta_i\). Note that we keep Transmission (am) in the model because it is the object of the study, and then verify whether Transmission is relevant to MPG.
Based on the correlation matrix (see Appendix), we will study the predictors whose correlation with MPG is at least 0.7 in absolute value after rounding to one decimal (wt, cyl, disp, hp, drat, vs), plus am.
summary(model_initial<-lm(mpg ~ wt+cyl+disp+hp+drat+vs+am, data = mtcars))$coef
# See the summary details in the Appendix
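The correlation screen that produced this list of predictors can be sketched as follows. This is an illustration under the assumption that the threshold is applied to correlations rounded to one decimal, as in the Appendix table. The names `cars`, `r_mpg` and `selected` are hypothetical, and the untouched numeric data is used because `cor` requires numeric columns.

```r
# Numeric copy of the data (no factor conversion), as cor() needs numeric columns
cars <- datasets::mtcars

# Correlation of every variable with mpg, dropping mpg itself
r_mpg <- cor(cars)["mpg", ]
r_mpg <- r_mpg[names(r_mpg) != "mpg"]

# Keep variables whose |correlation|, rounded to one decimal, is at least 0.7.
# Comparing round(|r| * 10) with 7 avoids floating-point issues around 0.7.
selected <- names(r_mpg)[round(abs(r_mpg) * 10) >= 7]

print(round(r_mpg, 2))
print(selected)
```

Transmission (am) does not pass this screen on its own; it is added to the model because it is the variable under study.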
We follow a stepwise strategy to select the best model, using the function step, which chooses a model by AIC with a stepwise algorithm.
model_best <- step(model_initial, direction = "both")
# See the summary details in the Appendix
As a result, the stepwise selection retains the variables wt, cyl, hp and am (see Appendix); this model explains about 87% of the variation in MPG.
Now we refit the selected model explicitly, with Transmission (am) included
summary(model_final<-lm(mpg ~ cyl + hp + wt + am, data = mtcars))$coef
# See the summary details in the Appendix
This model explains about 87% of the variability in MPG.
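The explained variability quoted above corresponds to the multiple R-squared of the fitted model. The sketch below shows one way to extract it, together with the adjusted R-squared and a 95% confidence interval for the Transmission coefficient. It refits the same formula on a local copy of the data; the names `cars`, `fit`, `r2`, `adj_r2` and `ci_am` are illustrative and not part of the original analysis.

```r
# Local copy with the same factor conversions used in the report
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

# Same formula as model_final
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)
fit_summary <- summary(fit)

# Share of MPG variance explained, raw and adjusted for the number of terms
r2 <- fit_summary$r.squared
adj_r2 <- fit_summary$adj.r.squared

# 95% confidence interval for the Manual vs Automatic effect
ci_am <- confint(fit, "amManual", level = 0.95)

print(c(r2 = r2, adj_r2 = adj_r2))
print(ci_am)
```

Given the estimate (1.81) and standard error (1.40) reported in the Appendix, the interval for amManual spans zero, which is consistent with the p-value of about 0.21 for that coefficient.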
Now we compare the model that uses only am (Transmission) with the one we just built with wt, cyl, hp and am (Transmission). As the models are nested, we use the anova function.
anova(model_simple, model_final)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
## 1     30 720.90
## 2     26 151.03  4    569.87 24.527 1.688e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This tells us that the two models differ significantly, with a p-value of \(1.69 \times 10^{-8}\).
So the inclusion of the new variables improves on the initial model with just one variable.
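The F statistic in the table can be reproduced by hand from the residual sums of squares and degrees of freedom it reports. The sketch below uses the rounded values printed above, so the result may differ from the table in the last digit; the variable names are illustrative.

```r
# Values copied from the Analysis of Variance Table
rss_simple <- 720.90   # RSS of mpg ~ am
rss_final  <- 151.03   # RSS of mpg ~ cyl + hp + wt + am
df_simple  <- 30       # residual degrees of freedom, simple model
df_final   <- 26       # residual degrees of freedom, final model

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
df_added <- df_simple - df_final
f_stat <- ((rss_simple - rss_final) / df_added) / (rss_final / df_final)

# Upper-tail probability of the F distribution with (df_added, df_final) df
p_value <- pf(f_stat, df_added, df_final, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

A large F means that the four added parameters reduce the residual sum of squares far more than would be expected by chance.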
head(mtcars[1:10])
##                    mpg cyl disp  hp drat    wt  qsec vs        am gear
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0    Manual    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0    Manual    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1    Manual    4
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1 Automatic    3
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0 Automatic    3
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1 Automatic    3
boxplot(mpg~am, data = mtcars,
xlab = "Transmission",
ylab = "Miles per Gallon (MPG)",
main = "MPG by Transmission Type")
We see that there are no outliers and that Automatic cars have a lower median MPG than Manual cars.
Model with the variables that are highly correlated with MPG
summary(model_initial)$coef
##                 Estimate Std. Error      t value     Pr(>|t|)
## (Intercept) 29.829969134 6.74446788  4.422879559 0.0001962074
## wt          -2.594622674 1.20129538 -2.159854031 0.0414485707
## cyl6        -2.055523435 1.80310789 -1.139989150 0.2660238246
## cyl8        -0.023304443 3.81651017 -0.006106218 0.9951806281
## disp         0.004360163 0.01303611  0.334468226 0.7410571328
## hp          -0.035794756 0.01463423 -2.445960216 0.0225138202
## drat         0.388141033 1.46606024  0.264751080 0.7935593982
## vs           2.004897600 1.82994849  1.095603300 0.2845926673
## amManual     2.558988828 1.74302127  1.468134026 0.1556117761
Model selected by step (wt, cyl, hp and am)
summary(model_best)$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 33.70832390 2.60488618 12.940421 7.733392e-13
## wt          -2.49682942 0.88558779 -2.819404 9.081408e-03
## cyl6        -3.03134449 1.40728351 -2.154040 4.068272e-02
## cyl8        -2.16367532 2.28425172 -0.947214 3.522509e-01
## hp          -0.03210943 0.01369257 -2.345025 2.693461e-02
## amManual     1.80921138 1.39630450  1.295714 2.064597e-01
Model with am, wt, cyl and hp
summary(model_final)$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 33.70832390 2.60488618 12.940421 7.733392e-13
## cyl6        -3.03134449 1.40728351 -2.154040 4.068272e-02
## cyl8        -2.16367532 2.28425172 -0.947214 3.522509e-01
## hp          -0.03210943 0.01369257 -2.345025 2.693461e-02
## wt          -2.49682942 0.88558779 -2.819404 9.081408e-03
## amManual     1.80921138 1.39630450  1.295714 2.064597e-01
data(mtcars)
corr <- round(cor(mtcars), 1)
head(corr[, 1:6])
##       mpg  cyl disp   hp drat   wt
## mpg   1.0 -0.9 -0.8 -0.8  0.7 -0.9
## cyl  -0.9  1.0  0.9  0.8 -0.7  0.8
## disp -0.8  0.9  1.0  0.8 -0.7  0.9
## hp   -0.8  0.8  0.8  1.0 -0.4  0.7
## drat  0.7 -0.7 -0.7 -0.4  1.0 -0.7
## wt   -0.9  0.8  0.9  0.7 -0.7  1.0
par(mfrow=c(2, 2))
plot(model_final)