Motor Trend is a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG), the outcome. They are particularly interested in the following two questions: is an automatic or a manual transmission better for MPG, and how large is the MPG difference between automatic and manual transmissions?
Considering a model that includes weight, quarter-mile time (a measure of acceleration) and transmission, we find that manual cars achieve about 2.94 miles per gallon (MPG) more than automatic cars of the same weight and quarter-mile time.
data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
summary(mtcars)
## mpg cyl disp hp
## Min. :10.4 Min. :4.00 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.4 1st Qu.:4.00 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.2 Median :6.00 Median :196.3 Median :123.0
## Mean :20.1 Mean :6.19 Mean :230.7 Mean :146.7
## 3rd Qu.:22.8 3rd Qu.:8.00 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.9 Max. :8.00 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.76 Min. :1.51 Min. :14.5 Min. :0.000
## 1st Qu.:3.08 1st Qu.:2.58 1st Qu.:16.9 1st Qu.:0.000
## Median :3.69 Median :3.33 Median :17.7 Median :0.000
## Mean :3.60 Mean :3.22 Mean :17.8 Mean :0.438
## 3rd Qu.:3.92 3rd Qu.:3.61 3rd Qu.:18.9 3rd Qu.:1.000
## Max. :4.93 Max. :5.42 Max. :22.9 Max. :1.000
## am gear carb
## Min. :0.000 Min. :3.00 Min. :1.00
## 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:2.00
## Median :0.000 Median :4.00 Median :2.00
## Mean :0.406 Mean :3.69 Mean :2.81
## 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :1.000 Max. :5.00 Max. :8.00
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
colSums(is.na(mtcars))
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
##    0    0    0    0    0    0    0    0    0    0    0
No missing values were found.
mtcars$am <- factor(mtcars$am,
levels = c(0,1),
labels = c("automatic", "manual"))
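Since the direction of every later comparison depends on this recoding, a quick check of the label mapping can be useful. The sketch below works on a local copy called `mt` (a name introduced here, not part of the original analysis) so that the objects used in the rest of the report are left untouched.

```r
# Local copy of the data; `mt` and `am_raw` are illustrative names
mt <- datasets::mtcars
am_raw <- mt$am  # original 0/1 coding

# Same recoding as in the report: 0 -> automatic, 1 -> manual
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Cross-tabulate raw codes against the new labels
coding_check <- table(raw = am_raw, label = mt$am)
print(coding_check)
```

All cars coded 0 should fall in the "automatic" column and all cars coded 1 in the "manual" column, with the off-diagonal cells empty.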
require(knitr)
## Loading required package: knitr
opts_chunk$set(fig.align='center')
par(bty= "n")
boxplot(mpg ~ am, data = mtcars, col = c("lightblue", "salmon"),
        xlab = "transmission type", ylab = "miles per gallon")
From the plot above, we notice that cars with manual transmission tend to have a higher MPG than cars with automatic transmission. To test whether the difference is significant and to quantify it, we regress mpg on transmission type, coded as a factor variable:
fit <- lm(mpg ~ am, mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.392 -3.092 -0.297 3.244 9.508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.15 1.12 15.25 1.1e-15 ***
## ammanual 7.24 1.76 4.11 0.00029 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.9 on 30 degrees of freedom
## Multiple R-squared: 0.36, Adjusted R-squared: 0.338
## F-statistic: 16.9 on 1 and 30 DF, p-value: 0.000285
The result indicates that the average MPG of manual cars is 7.24 higher (ammanual) than that of automatic cars (17.15, the intercept), with a p-value of 0.000285, well below the 1% significance level. Nevertheless, this estimate may be severely biased, since many variables that are more intuitively related to MPG are omitted from the model.
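With a single two-level factor as predictor, the regression is equivalent to comparing two group means. The following sketch, which uses a local copy `mt` and other names introduced here for illustration, recovers the intercept and slope from the group means and reproduces the p-value with an equal-variance t-test.

```r
# Local copy of the data with the same factor coding as the report
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Mean mpg within each transmission group
group_means <- tapply(mt$mpg, mt$am, mean)
print(round(group_means, 3))

# Manual minus automatic: this equals the ammanual coefficient
mean_diff <- unname(group_means["manual"] - group_means["automatic"])
print(round(mean_diff, 3))

# Equal-variance two-sample t-test; its p-value matches the lm() test on ammanual
tt <- t.test(mpg ~ am, data = mt, var.equal = TRUE)
print(tt$p.value)
```

The automatic-group mean corresponds to the intercept, and the difference between the two means corresponds to the ammanual estimate.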
To check this suspicion, we draw scatterplots of all variable pairs in the mtcars dataset:
pairs(mtcars)
Looking at the first row of the scatterplot matrix, we can immediately see that the number of cylinders (cyl), displacement (disp), horsepower (hp), rear axle ratio (drat) and weight (wt) are strongly related to mpg. We list the first row of the correlation matrix for these variables:
sub_mtcars <- subset(mtcars, select = -c(qsec,vs,am,gear,carb))
cor(sub_mtcars)[1,]
##     mpg     cyl    disp      hp    drat      wt
##  1.0000 -0.8522 -0.8476 -0.7762  0.6812 -0.8677
Comparing the correlations of these car characteristics with MPG, we notice that three variables are highly correlated with MPG (absolute correlation above 0.8): wt, cyl and disp. All three correlations are negative.
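To see the same ranking for every variable at once, the correlations with mpg can be sorted by absolute value. This is a small sketch on a fresh numeric copy of the data (`mt`, `mpg_cor` and `mpg_cor_ranked` are names introduced here); it keeps am as 0/1 so that it can be included in the correlation matrix.

```r
# Fresh numeric copy: all columns, including am, are numeric here
mt <- datasets::mtcars

# Correlation of mpg with every other variable
mpg_cor <- cor(mt)["mpg", colnames(mt) != "mpg"]

# Rank by absolute strength, strongest first
mpg_cor_ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
print(round(mpg_cor_ranked, 4))
```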
We build a first model based on Simple Linear Regression.
fit.simple <- lm(mpg ~ am, mtcars)
summary(fit.simple)$adj.r.squared
## [1] 0.3385
The adjusted \(R^2\) value indicates that the model explains only about 34% of the variation in MPG, which is low.
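For reference, the adjusted \(R^2\) can be recomputed from the ordinary \(R^2\) with the usual formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\), where \(n\) is the number of observations and \(p\) the number of predictors. The sketch below uses names introduced here (`mt`, `fit_am`, `adj_r2`) and is only meant to show where the reported value comes from.

```r
# Local copy of the data with the same factor coding as the report
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit_am <- lm(mpg ~ am, data = mt)

r2 <- summary(fit_am)$r.squared
n  <- nrow(mt)                    # number of observations
p  <- length(coef(fit_am)) - 1    # predictors, excluding the intercept

# Adjusted R^2 penalises R^2 for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(c(r2 = r2, adj_r2 = adj_r2))
```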
summary(fit.simple)$coefficients
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.134e-15
## ammanual       7.245      1.764   4.106 2.850e-04
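This model tells us that manual cars average 7.245 MPG more than automatic cars. Because no other variable is controlled for, this is an association and not the causal effect of changing the transmission.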
Next, we use stepwise selection (the step() function, which compares models by AIC and here searches in both directions) to find a better model, forcing the am variable to stay in the model:
data(mtcars)
mtcars$am <- factor(mtcars$am, levels=c(0,1), labels=c("automatic", "manual"))
fit.step <- step(lm(mpg~., mtcars), trace=0, scope=list(lower=~am), direction="both")
summary(fit.step)$call
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
The best model proposed by the stepwise search includes the weight (wt) and the “1/4 mile time” (qsec) of the cars, in addition to transmission (am), to explain fuel economy (MPG).
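As a cross-check on the stepwise result, the candidate models can be compared directly by AIC, where lower is better. This sketch refits three models under names introduced here (`fit_am`, `fit_sel`, `fit_full`). Note that AIC() and the criterion printed by step() differ by an additive constant for models fitted to the same data, so the values differ but the ranking is the same.

```r
# Local copy of the data with the same factor coding as the report
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit_am   <- lm(mpg ~ am, data = mt)              # transmission only
fit_sel  <- lm(mpg ~ wt + qsec + am, data = mt)  # model chosen by step()
fit_full <- lm(mpg ~ ., data = mt)               # all predictors

# One row per model, with degrees of freedom and AIC
aic_table <- AIC(fit_am, fit_sel, fit_full)
print(aic_table)
```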
summary(fit.step)$adj.r.squared
## [1] 0.8336
The adjusted \(R^2\) is 0.8336 which means that the model explains 83% of the variation.
We then compare the model proposed by Stepwise with our first model using ANOVA.
anova(fit.simple, fit.step)[2,6] #p-value
## [1] 1.55e-09
The p-value is very low, so we reject the null hypothesis that the coefficients of the added terms (wt and qsec) are both zero, and conclude that the model proposed by the stepwise algorithm fits significantly better than our first simple model.
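The ANOVA comparison of two nested linear models is an F-test on the reduction in residual sum of squares. The sketch below recomputes that statistic by hand, using names introduced here (`fit_am`, `fit_sel`, `f_stat`, `p_val`), to make explicit what the reported p-value measures.

```r
# Local copy of the data with the same factor coding as the report
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit_am  <- lm(mpg ~ am, data = mt)              # smaller (nested) model
fit_sel <- lm(mpg ~ wt + qsec + am, data = mt)  # larger model

# Residual sums of squares and residual degrees of freedom
rss_small <- sum(resid(fit_am)^2)
rss_big   <- sum(resid(fit_sel)^2)
df_small  <- df.residual(fit_am)
df_big    <- df.residual(fit_sel)

# F = (drop in RSS per added parameter) / (residual variance of larger model)
f_stat <- ((rss_small - rss_big) / (df_small - df_big)) / (rss_big / df_big)
p_val  <- pf(f_stat, df_small - df_big, df_big, lower.tail = FALSE)
print(c(F = f_stat, p = p_val))
```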
par(mfrow=c(2,2))
plot(fit.step)
The figure above shows the diagnostic plots for the selected model. The residuals appear to be uncorrelated with the fitted values, independent, and (almost) identically distributed with mean zero.
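The visual reading of the diagnostic plots could be complemented by a numerical check. The following sketch applies a Shapiro-Wilk normality test to the residuals. This test is not part of the original analysis and is shown only as one possible supplement; `mt`, `fit_sel`, `res` and `sw` are names introduced here.

```r
# Local copy of the data with the same factor coding as the report
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit_sel <- lm(mpg ~ wt + qsec + am, data = mt)
res <- resid(fit_sel)

# Least-squares residuals average to zero (up to rounding) by construction
print(mean(res))

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
print(sw)
```

A p-value above the chosen significance level would be consistent with normally distributed residuals, although with only 32 observations the test has limited power.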
summary(fit.step)$coefficients
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    9.618     6.9596   1.382 1.779e-01
## wt            -3.917     0.7112  -5.507 6.953e-06
## qsec           1.226     0.2887   4.247 2.162e-04
## ammanual       2.936     1.4109   2.081 4.672e-02
Given the coefficients of our model, we can say that manual cars have better fuel economy than automatic cars: they achieve about 2.94 miles per gallon (MPG) more than automatic cars of the same weight (wt) and “1/4 mile time” (qsec). This adjusted difference is only marginally significant at the 5% level (p = 0.047) and is much smaller than the 7.24 MPG difference estimated without controlling for the other variables.
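Because the p-value for the transmission effect is close to 0.05, an interval estimate conveys the uncertainty better than the point estimate alone. The sketch below computes a 95% confidence interval for the ammanual coefficient, using names introduced here (`mt`, `fit_sel`, `ci`).

```r
# Local copy of the data with the same factor coding as the report
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

fit_sel <- lm(mpg ~ wt + qsec + am, data = mt)

# 95% confidence interval for the manual-vs-automatic difference,
# holding weight and quarter-mile time fixed
ci <- confint(fit_sel, "ammanual", level = 0.95)
print(ci)
```

An interval whose lower bound sits just above zero tells the same story as the p-value: the data favour manual cars, but the size of the advantage is estimated imprecisely.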