You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
mtcars
is a data frame with 32 observations on 11 variables. At first, we load the data set and perform data transformations by factoring the necessary variables.
library(ggplot2)
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- factor(mtcars$am,labels=c('Automatic','Manual'))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
Initially, we plot the relationships between all the variables of the dataset (see Figure 1 in the Appendix). From the plot, we can notice that variables like cyl
, disp
, hp
, drat
, wt
, vs
and am
seem to have some strong correlation with mpg
. Since we are interested in the effects of car transmission type on mpg
, we plot boxplots of the variable mpg
when am
is Automatic
or Manual
(see Figure 2 in the Appendix). This plot shows an increase in the mpg
when the transmission is Manual
.
From the previous plots, we have seen that several variables seem to have high correlation with mpg
, so we fit an initial model with all the variables as predictors, and perfom stepwise model selection to select significant predictors for the final model which is the best model
initModel <- lm(mpg ~ ., data=mtcars)
bestModel <- step(initModel, direction = "both")
summary(initModel)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5087 -1.3584 -0.0948 0.7745 4.6251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.87913 20.06582 1.190 0.2525
## cyl6 -2.64870 3.04089 -0.871 0.3975
## cyl8 -0.33616 7.15954 -0.047 0.9632
## disp 0.03555 0.03190 1.114 0.2827
## hp -0.07051 0.03943 -1.788 0.0939 .
## drat 1.18283 2.48348 0.476 0.6407
## wt -4.52978 2.53875 -1.784 0.0946 .
## qsec 0.36784 0.93540 0.393 0.6997
## vs1 1.93085 2.87126 0.672 0.5115
## amManual 1.21212 3.21355 0.377 0.7113
## gear4 1.11435 3.79952 0.293 0.7733
## gear5 2.52840 3.73636 0.677 0.5089
## carb2 -0.97935 2.31797 -0.423 0.6787
## carb3 2.99964 4.29355 0.699 0.4955
## carb4 1.09142 4.44962 0.245 0.8096
## carb6 4.47757 6.38406 0.701 0.4938
## carb8 7.25041 8.36057 0.867 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
## F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
summary(bestModel)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## amManual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
This model is “mpg ~ cyl + hp + wt + am”. It has the Residual standard error as 2.41 on 26 degrees of freedom. The Adjusted R-squared value is 0.8401, which means that the model can explain about 84% of the variance of the MPG variable.
From the above computations, the best model consists of the variables, cyl
, wt
and hp
as confounders and am
as the independent variable
Now, we compare the model which we obtained and a base model with only am
as the predictor variable.
amModel <- lm(mpg ~ am, data = mtcars)
anova(amModel, bestModel)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 26 151.03 4 569.87 24.527 1.688e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This model captured 84% of the overall variation in mpg. With a p-value of 3.745e-09, we reject the null hypothesis and claim that our multivariate model is significantly different from our simple linear regression model.
According to the residual plots (see Figure 3 in Appendix), we can make the following observations:
Residuals vs. Fitted
plot shows the randomness of the distribution of the points, so it confirms the variable indipendence.Normal Q-Q plot
indicates that the residuals are normally distributed because the points lie closely to the line.Scale-Location
plot confirms the constant variance assumption, as the points are randomly distributed.Residuals vs. Leverage
shows that the labeled points appear to be leverage points, and no outliers are presentThe labeled points appear to be leverage points above the rest of the points.In order to find out the interesting leverage points, we can compute some regression diagnostics of our model. We compute top three points in each case of influence measures.
leverage <- hatvalues(bestModel)
tail(sort(leverage),3)
## Toyota Corona Lincoln Continental Maserati Bora
## 0.2777872 0.2936819 0.4713671
influential <- dfbetas(bestModel)
tail(sort(influential[,6]),3)
## Chrysler Imperial Fiat 128 Toyota Corona
## 0.3507458 0.4292043 0.7305402
This shows that our analysis is correct, because the same cars are mentioned in the residual plots.
We can perform t-test to see that the manual and automatic transmissions are significatively different:
t.test(mpg ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
We can conclude that cars with Manual
transmission get more miles per gallon mpg
compared to cars with Automatic
transmission.
boxplot(mpg ~ am, xlab="Transmission (0 = Automatic, 1 = Manual)", ylab="MPG", col=c("lightblue","blue"),
data = mtcars, main="Boxplot of MPG vs. Transmission")
2. Pair Graph of Motor Trend Car Road Tests
plot(mtcars, panel=panel.smooth, main="Pair Graph of Motor Trend Car Road Tests")
3. Residual Plots
par(mfrow = c(2,2))
plot(bestModel)