For an article in Motor Trend, we are interested in the mileage per gallon (MPG) of cars and whether the type of transmission (automatic vs manual) has a significant effect on this. To conclude on the relation between MPG and transmission type, we study the mtcars dataset from R’s datasets library. We use different linear regression models to fit the relation, and discuss which model is most appropriate.
mpg and ammpg and with each other.str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
any(is.na(mtcars))
## [1] FALSE
mtcars has no missing values and contains numeric data on mpg and 10 other variables, among which the transmission type am. Before doing in-depth regression modeling, we perform a quick analysis on the relation between transmission type and MPG.
gg <- ggplot(data = mtcars, aes(x = am, y = mpg))
gg <- gg + geom_boxplot(aes(fill = factor(am)), alpha = 0.5)
gg <- gg + xlab("") + ylab("Miles per gallon") + scale_x_continuous(breaks = NULL)
gg <- gg + scale_fill_discrete(name="Transmission", labels = c("manual","automatic"))
gg
In the boxplot above, we show the distribution of mpg, for manual and automatic transmission. From this plot, it seems that cars with automatic transmission perform better on MPG than cars with manual transmission.
man <- mtcars[mtcars$am == 0,]
aut <- mtcars[mtcars$am == 1,]
t.test(man$mpg,aut$mpg,paired = FALSE)
##
## Welch Two Sample t-test
##
## data: man$mpg and aut$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
A T-test on this difference agrees with our observation and shows a very significant dependence of mpg on am. Intuitively, it seems strange that the type of transmission has such a strong effect on MPG. There might be confounding variables that, when taken into account, change the relation between mpg and transmission type. Let’s dive a bit deeper into the data!
In order to determine which variables to use in a regression model, we first check which variables are correlated:
ggpairs(mtcars)
The upper-right half of the table above shows the correlations between variables. We observe that cyl,disp, hp and wt are most correlated with mpg. However, these variables also show mutual correlation. Since including correlated regressors in a regression model can distort the fit, we pick the two variables with the lowest mutual correlation: weight and horse power.
In this section, we compare the fit coefficients and p-values found in three regression models:
mpg ~ am
mpg~ am + wt + hp
mpg ~ am + wt + hp + cyl + disp
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am)
fit1 <- lm(mpg ~ factor(am) , data = mtcars)
fit2 <- lm(mpg ~ factor(am) + wt + hp , data = mtcars)
fit3 <- lm(mpg ~ factor(am) + wt + hp + disp + cyl, data = mtcars)
summary(fit1)$coef; summary(fit2)$coef; summary(fit3)$coef;
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## factor(am)1 7.244939 1.764422 4.106127 2.850207e-04
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.00287512 2.642659337 12.866916 2.824030e-13
## factor(am)1 2.08371013 1.376420152 1.513862 1.412682e-01
## wt -2.87857541 0.904970538 -3.180850 3.574031e-03
## hp -0.03747873 0.009605422 -3.901830 5.464023e-04
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.20279869 3.66909647 10.412045 9.084987e-11
## factor(am)1 1.55649163 1.44053603 1.080495 2.898430e-01
## wt -3.30262301 1.13364263 -2.913284 7.256888e-03
## hp -0.02796002 0.01392172 -2.008374 5.509659e-02
## disp 0.01225708 0.01170645 1.047036 3.047194e-01
## cyl -1.10637984 0.67635506 -1.635797 1.139322e-01
The first model shows a very significant dependence of mpg on am, with a two-sided p-value of around 0.0003. When we include wt and hp as regressors, the change of MPG with transmission type has diminished from 7.2 to 2.1 and the new p-value of 0.14 indicates that when corrected for horse power and weight of cars, the type of transmission has no significant effect on MPG anymore. Including additional variables `disp and cyl does not change this story much.
anova(fit1,fit2,fit3)
## Analysis of Variance Table
##
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + wt + hp
## Model 3: mpg ~ factor(am) + wt + hp + disp + cyl
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 180.29 2 540.61 43.0841 5.576e-09 ***
## 3 26 163.12 2 17.17 1.3685 0.2722
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Performing an anova test on the nested model confirms that including wt and hp as regressors in model 2 improved the linear model enormously, reducing the Residual Sum Squared by a factor 4. Further including disp and cyl in model 3 only gave a very minor, non-significant (p = 0.2722) improvement.
summary(fit1)$r.squared; summary(fit2)$r.squared; summary(fit3)$r.squared;
## [1] 0.3597989
## [1] 0.8398903
## [1] 0.8551394
Model1, model2 and model3 respectively explain 36, 84 and 86% of the variance in the data.
e1 <- resid(fit1)
e2 <- resid(fit2)
e3 <- resid(fit3)
x1 <- rep(1, each=length(e1))
x2 <- rep(2, each=length(e1))
x3 <- rep(3, each=length(e1))
e1 <- cbind(x1,e1)
e2 <- cbind(x2,e2)
e3 <- cbind(x3,e3)
n <- 1:length(e3[,1])
e <- data.frame(rbind(e1,e2,e3))
## Warning in data.row.names(row.names, rowsi, i): some row.names duplicated:
## 33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96
## --> row.names NOT used
e <- cbind(rep(n,3), e)
colnames(e) <- c("n","model","resid")
gg2 <- ggplot(data = e, aes(n, resid)) + geom_point(aes(shape = factor(model), colour = factor(model)))
gg2 <- gg2 + xlab("") + ylab("Residual")
gg2
A plot of the residuals shows that the residuals are homoskedastic for all three models, indicating that our linear model is not obviously missing a hidden trend in the data.