Executive Summary

We want to analyze the mtcars data set, which contains information on different car models. In particular, we want to know whether a manual or an automatic transmission is better for fuel consumption (MPG), and by how much.

We created a simple linear regression model that uses Transmission to predict MPG, and a multivariable linear regression model that includes the relevant variables together with Transmission. We then compared the multivariable model with the simple one to see which performs better.

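As a result, we find that Manual cars are more fuel efficient than Automatic cars. In the simple model, Manual cars average 7.24 MPG more than Automatic cars. When the other variables (weight, cylinders and horsepower) stay equal, the estimated advantage is 1.8 MPG, although this adjusted difference is not statistically significant (p-value = 0.21).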

Analysis

Simple linear regression

\(Y = \beta_0 + \beta_1 X\), where Y = MPG and X = Transmission Type (0 - Automatic / 1 - Manual)
Now we calculate \(\beta_0\) and \(\beta_1\)

data(mtcars) #load data set
mtcars$am <- as.factor(mtcars$am); mtcars$cyl <- as.factor(mtcars$cyl) #convert to factor
levels(mtcars$am) <- c("Automatic", "Manual")
model_simple <- lm(mpg ~ am, data = mtcars) #create linear regression model
coef(summary(model_simple))
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## amManual     7.244939   1.764422  4.106127 2.850207e-04

The result is that \(\beta_0\) = 17.15 and \(\beta_1\) = 7.24

  • For Automatic cars (am = 0), the mean MPG is \(\beta_0\) (= 17.15)
  • For Manual cars (am = 1), the mean MPG is \(\beta_0 + \beta_1\) (= 17.15 + 7.24 = 24.39)
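The link between the coefficients and the group means can be checked directly. The following is a minimal sketch on a fresh copy of the data; the name `cars` and the helper variables are introduced here only for illustration, and the factor coding is assumed to match the one used above.

```r
# Fresh copy of the built-in data; `cars` is an illustrative name
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = cars)

# Mean MPG per transmission type
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept: fitted mean of the reference level (Automatic)
beta0 <- unname(coef(fit)["(Intercept)"])
# Intercept + slope: fitted mean of the Manual group
beta0_plus_beta1 <- unname(coef(fit)["(Intercept)"] + coef(fit)["amManual"])

print(group_means)
print(c(Automatic = beta0, Manual = beta0_plus_beta1))
```

Both printed vectors should agree, because a regression on a single two-level factor simply reproduces the two group means.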

Hypothesis testing

As we just showed, the mean MPG of Manual cars is 7.24 higher than the mean MPG of Automatic cars. But is this a significant difference?

With a p-value of \(2.85 \times 10^{-4}\), we reject the null hypothesis (H0) of no difference and conclude that there is a significant difference in mean MPG between Manual cars and Automatic cars, with Automatic cars being less efficient than Manual cars.
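The same test can be cross-checked with a two-sample t-test. The sketch below assumes equal variances in both groups (`var.equal = TRUE`), which is the assumption the regression makes; the default Welch test in `t.test` would give a slightly different p-value. The names `cars`, `p_lm` and `p_t` are illustrative.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# p-value of the slope in the simple regression
p_lm <- coef(summary(lm(mpg ~ am, data = cars)))["amManual", "Pr(>|t|)"]

# Pooled-variance two-sample t-test on the same grouping
p_t <- t.test(mpg ~ am, data = cars, var.equal = TRUE)$p.value

print(c(regression = p_lm, t_test = p_t))
```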

Multiple linear regression

\(Y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n\)
Now we have to identify the significant predictors \(X_i\) and their coefficients \(\beta_i\). Note that we will keep Transmission (am) in the model, as it is the object of the study, and then verify whether Transmission is relevant to MPG.

Based on the correlation matrix (see Appendix), we will study the predictors whose correlation with MPG is about 0.7 or more in absolute value (after rounding to one decimal): wt, cyl, disp, hp, drat and vs, plus am.
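The screening step can be sketched in code. The cut-off of 0.65 used below is an assumption: it roughly corresponds to 0.7 after rounding to one decimal, as in the Appendix table. The names `cars`, `r_mpg` and `candidates` are illustrative. Note that am is added to the model regardless of this screen, since it is the object of the study.

```r
cars <- datasets::mtcars  # all columns are numeric in the original data

# Correlation of every other variable with mpg
r_mpg <- cor(cars)["mpg", setdiff(names(cars), "mpg")]

# Keep the variables whose |correlation| rounds to 0.7 or more
candidates <- names(r_mpg)[abs(r_mpg) >= 0.65]

print(round(r_mpg, 2))
print(candidates)
```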

model_initial <- lm(mpg ~ wt + cyl + disp + hp + drat + vs + am, data = mtcars)
summary(model_initial)$coef
#See details of summary on the Appendix

We will follow a stepwise strategy to select the best model. For this we use the function step, which chooses a model by AIC in a stepwise algorithm.

model_best <- step(model_initial, direction = "both")
#See details of summary on the Appendix
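To see what the search does without reading the full log, the sketch below reruns it silently on a fresh copy of the data and compares the AIC of the starting and selected models. The names `cars`, `full` and `best` are illustrative; `trace = 0` only suppresses the printed steps.

```r
cars <- datasets::mtcars
cars$am  <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

full <- lm(mpg ~ wt + cyl + disp + hp + drat + vs + am, data = cars)
best <- step(full, direction = "both", trace = 0)

# Terms that survive the search
print(formula(best))

# extractAIC() returns c(equivalent df, AIC) on the scale that step() uses
print(rbind(full = extractAIC(full), best = extractAIC(best)))
```

A lower AIC for the selected model is expected here, since step only accepts moves that reduce it.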

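As a result, the stepwise search retains the variables wt, cyl, hp and am (see Appendix); the other candidates (disp, drat and vs) are dropped.

Since Transmission is already part of the selected model, the final model is the same fit, written out explicitly

model_final <- lm(mpg ~ cyl + hp + wt + am, data = mtcars)
summary(model_final)$coef
#See details of summary on the Appendix

This model explains about 87% of the variability in MPG.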
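The explained variability and the uncertainty around the Transmission effect can be read from the fitted model. This is a minimal sketch on a fresh copy of the data; `cars`, `fit`, `r2`, `adj_r2` and `ci_am` are illustrative names, and the 95% level is an assumption.

```r
cars <- datasets::mtcars
cars$am  <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)
s <- summary(fit)

r2     <- s$r.squared      # share of MPG variability explained
adj_r2 <- s$adj.r.squared  # same, penalized for the number of terms

# 95% confidence interval for the Manual vs Automatic difference
ci_am <- confint(fit, "amManual", level = 0.95)

print(round(c(r2 = r2, adj_r2 = adj_r2), 3))
print(ci_am)
```

If the interval for amManual contains zero, the adjusted difference between transmissions cannot be distinguished from no difference at that level.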

Comparing models

Now we compare the model with just the variable am (Transmission) and the one that we just built with wt, cyl, hp and am (Transmission). As the models are nested, we use the anova function.

anova(model_simple, model_final)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     26 151.03  4    569.87 24.527 1.688e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This tells us that the two models are significantly different (p-value = \(1.69 \times 10^{-8}\)).
So the inclusion of the new variables improves the initial model with just one variable.
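The F statistic in the table is \(F = \frac{(RSS_1 - RSS_2)/(df_1 - df_2)}{RSS_2/df_2}\), where the subscripts refer to Model 1 and Model 2. The sketch below recomputes it by hand on a fresh copy of the data; `cars`, `small`, `large` and the other names are illustrative.

```r
cars <- datasets::mtcars
cars$am  <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

small <- lm(mpg ~ am, data = cars)
large <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Residual sums of squares and residual degrees of freedom
rss_small <- deviance(small); df_small <- df.residual(small)
rss_large <- deviance(large); df_large <- df.residual(large)

# Drop in RSS per extra parameter, relative to the residual variance of the larger model
f_stat  <- ((rss_small - rss_large) / (df_small - df_large)) / (rss_large / df_large)
p_value <- pf(f_stat, df_small - df_large, df_large, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```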

Appendix

Data set

head(mtcars[1:10])
##                    mpg cyl disp  hp drat    wt  qsec vs        am gear
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0    Manual    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0    Manual    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1    Manual    4
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1 Automatic    3
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0 Automatic    3
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1 Automatic    3

MPG by Transmission Type

boxplot(mpg~am, data = mtcars,
        xlab = "Transmission",
        ylab = "Miles per Gallon (MPG)",
        main = "MPG by Transmission Type")

We see that there are no outliers and that Automatic cars have a lower median MPG than Manual cars.
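What the plot shows can also be checked numerically. The sketch below uses `boxplot.stats`, which applies the same 1.5 x IQR rule that `boxplot` uses by default to flag outliers; `cars`, `stats_by_am`, `outliers` and `medians` are illustrative names.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Box plot statistics for each transmission type
stats_by_am <- lapply(split(cars$mpg, cars$am), boxplot.stats)

# Points beyond the whiskers (empty when there are no outliers)
outliers <- lapply(stats_by_am, function(s) s$out)
# The third element of $stats is the median
medians <- sapply(stats_by_am, function(s) s$stats[3])

print(outliers)
print(medians)
```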

Models summaries

Model with the variables that have high correlation to MPG

summary(model_initial)$coef
##                 Estimate Std. Error      t value     Pr(>|t|)
## (Intercept) 29.829969134 6.74446788  4.422879559 0.0001962074
## wt          -2.594622674 1.20129538 -2.159854031 0.0414485707
## cyl6        -2.055523435 1.80310789 -1.139989150 0.2660238246
## cyl8        -0.023304443 3.81651017 -0.006106218 0.9951806281
## disp         0.004360163 0.01303611  0.334468226 0.7410571328
## hp          -0.035794756 0.01463423 -2.445960216 0.0225138202
## drat         0.388141033 1.46606024  0.264751080 0.7935593982
## vs           2.004897600 1.82994849  1.095603300 0.2845926673
## amManual     2.558988828 1.74302127  1.468134026 0.1556117761

Model selected by step (wt, cyl, hp and am)

summary(model_best)$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 33.70832390 2.60488618 12.940421 7.733392e-13
## wt          -2.49682942 0.88558779 -2.819404 9.081408e-03
## cyl6        -3.03134449 1.40728351 -2.154040 4.068272e-02
## cyl8        -2.16367532 2.28425172 -0.947214 3.522509e-01
## hp          -0.03210943 0.01369257 -2.345025 2.693461e-02
## amManual     1.80921138 1.39630450  1.295714 2.064597e-01

Model with am, wt, cyl and hp

summary(model_final)$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 33.70832390 2.60488618 12.940421 7.733392e-13
## cyl6        -3.03134449 1.40728351 -2.154040 4.068272e-02
## cyl8        -2.16367532 2.28425172 -0.947214 3.522509e-01
## hp          -0.03210943 0.01369257 -2.345025 2.693461e-02
## wt          -2.49682942 0.88558779 -2.819404 9.081408e-03
## amManual     1.80921138 1.39630450  1.295714 2.064597e-01

Correlations

data(mtcars) #reload the data set so that am and cyl are numeric again
corr <- round(cor(mtcars), 1)
head(corr[, 1:6])
##       mpg  cyl disp   hp drat   wt
## mpg   1.0 -0.9 -0.8 -0.8  0.7 -0.9
## cyl  -0.9  1.0  0.9  0.8 -0.7  0.8
## disp -0.8  0.9  1.0  0.8 -0.7  0.9
## hp   -0.8  0.8  0.8  1.0 -0.4  0.7
## drat  0.7 -0.7 -0.7 -0.4  1.0 -0.7
## wt   -0.9  0.8  0.9  0.7 -0.7  1.0

Residual Plots

par(mfrow=c(2, 2))
plot(model_final)