You work for Motor Trend, a magazine about the automobile industry. Using a data set of a collection of cars, the magazine wants to explore the relationship between a set of variables and miles per gallon (MPG), the outcome. It is particularly interested in the following two questions:
Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions.
library(ggplot2)
data(mtcars)
First, let’s take a look at the data set:
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Next, we need to convert the “am” variable to a factor. Currently, the value 0 stands for automatic transmission and 1 stands for manual transmission.
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
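The same conversion can be done in one step, which makes the mapping from codes to labels explicit. This is a minimal sketch; `cars_raw`, `am_factor`, and `am_counts` are illustrative names, and a fresh copy of the data is used so the `mtcars` object above is not modified.

```r
# Work on a fresh copy of the built-in data (illustrative name)
cars_raw <- datasets::mtcars

# levels = c(0, 1) fixes the order of the codes, so "Automatic" is
# attached to 0 and "Manual" to 1 regardless of the order in the data
am_factor <- factor(cars_raw$am, levels = c(0, 1),
                    labels = c("Automatic", "Manual"))

# Count the cars of each transmission type as a sanity check
am_counts <- table(am_factor)
print(am_counts)
```

Checking the counts right after recoding is a cheap way to confirm that no value was turned into NA by a mismatched level.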
Let’s compare miles per gallon for automatic and manual transmissions with a boxplot:
g <- ggplot(aes(x=am, y=mpg), data=mtcars)
g <- g + geom_boxplot(aes(fill=am))
g <- g + ggtitle("Miles per Gallon by Transmission Type")
g <- g + xlab("Transmission Type")
g <- g + ylab("Miles per Gallon")
g
From this graph, we can see that cars with manual transmissions have a higher median mpg than cars with automatic transmissions. However, there could be confounding factors: transmission type may be correlated with another variable that also affects mpg. Let’s look at the relationship between mpg and weight:
g2 <- ggplot(mtcars, aes(x=wt, y=mpg))
g2 <- g2 + geom_point()
g2 <- g2 + facet_grid(.~am)
g2 <- g2 + ggtitle("Miles per Gallon by Transmission Type and Weight")
g2 <- g2 + xlab("Weight")
g2 <- g2 + ylab("Miles per Gallon")
g2
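This graph suggests that, at comparable weights, manual transmission cars still tend to have higher mpg than automatic transmission cars. The manual cars in this data set are also lighter overall, however, so the comparison is best confirmed with a regression model that includes weight.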
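To put numbers next to the scatter plot, the group means of weight and mpg can be tabulated. This is a sketch on a fresh copy of the data; `cars` and `group_means` are illustrative names.

```r
# Fresh copy of the data with am recoded as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# Mean weight (wt, in 1000 lbs) and mean mpg for each transmission type
group_means <- aggregate(cbind(wt, mpg) ~ am, data = cars, FUN = mean)
print(group_means)
```

If the two groups differ clearly in mean weight, weight is a plausible confounder for the transmission effect and belongs in the model.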
To see if there is a statistically significant difference in mpg between automatic and manual transmission cars, we can run a t-test. The null hypothesis is that there is no difference in the average mpg between the two types of cars.
t.test(mtcars$mpg~mtcars$am)
##
## Welch Two Sample t-test
##
## data: mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
The p-value is 0.0014, which is less than 0.05, so we reject the null hypothesis at the 5% level. The average mpg differs between automatic and manual transmission cars: manual cars average about 7.2 mpg more (24.39 versus 17.15).
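The numbers quoted from the t-test do not have to be read off the printout; they are stored in the object that `t.test` returns. This is a sketch using the formula interface with a `data` argument on a fresh copy of the data; `cars` and `tt` are illustrative names.

```r
# Fresh copy of the data with am recoded as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

# Welch two-sample t-test of mpg by transmission type
tt <- t.test(mpg ~ am, data = cars)

# Components of the returned "htest" object
tt$p.value    # p-value of the test
tt$conf.int   # 95% confidence interval for the difference in means
tt$estimate   # the two group means
```

Pulling values from the object keeps the reported figures in sync with the analysis if the data or the test options change.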
First, we will run a simple linear model to see the relationship between mpg and our “am” variable:
model1 <- lm(mpg~am, data = mtcars)
summary(model1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
While the p-value for the coefficient is small (0.0003), our R-squared value is only 0.360. This tells us that the model explains only 36% of the variation in mpg. Let’s try to find a better model.
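Two things about this simple model can be checked by hand: with a single two-level factor as predictor, the slope equals the difference between the group means, and R-squared is one minus the ratio of residual to total sum of squares. This sketch refits the model on a fresh copy of the data; `cars`, `fit`, `mean_diff`, `rss`, `tss`, and `r_squared` are illustrative names.

```r
# Fresh copy of the data with am recoded as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = cars)

# The amManual coefficient is the difference in group means
group_means <- tapply(cars$mpg, cars$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
rss <- sum(residuals(fit)^2)
tss <- sum((cars$mpg - mean(cars$mpg))^2)
r_squared <- 1 - rss / tss

print(c(slope = unname(coef(fit)["amManual"]),
        mean_diff = mean_diff,
        r_squared = r_squared))
```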
To search for a better multivariate model, we will use the step function, which performs stepwise selection by AIC starting from the model with all predictors:
model2 <- step(lm(mpg ~ ., data = mtcars), trace = 0, steps = 10000)
summary(model2)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## amManual 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
Now our model has an R-squared value of 0.850 (adjusted R-squared 0.834), meaning that the new model explains 85% of the variation in mpg. In addition to the am variable, the selected model includes wt (weight) and qsec (quarter mile time).
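Because `step` chooses models by AIC, it can help to compare the AIC of the candidate models directly. This sketch fits the three models explicitly on a fresh copy of the data; `cars`, `am_fit`, `full_fit`, `step_fit`, and `aic_table` are illustrative names, and `step_fit` simply writes out the formula reported in the summary above.

```r
# Fresh copy of the data with am recoded as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

am_fit   <- lm(mpg ~ am, data = cars)              # simple model
full_fit <- lm(mpg ~ ., data = cars)               # all predictors
step_fit <- lm(mpg ~ wt + qsec + am, data = cars)  # formula chosen by step

# Lower AIC indicates a better trade-off between fit and complexity
aic_table <- AIC(am_fit, full_fit, step_fit)
print(aic_table)
```

`step` uses `extractAIC` internally, which differs from `AIC` by a constant for models fitted to the same data, so the ranking of the models is the same.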
We can run an ANOVA F-test on the two nested models to see whether our new multivariate linear model fits significantly better than the simple linear model.
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 169.29 2 551.61 45.618 1.55e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model1, model2)$"Pr(>F)"[2]
## [1] 1.550495e-09
With a p-value of 1.55e-09, we can reject the null hypothesis that adding wt and qsec does not improve the fit. The multivariate model fits significantly better than the simple model.
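The F statistic in the ANOVA table can be reproduced from the residual sums of squares and degrees of freedom it prints. This sketch uses the rounded values from the table, so the result matches only to a few decimal places; the variable names are illustrative.

```r
# Values copied from the ANOVA table (rounded as printed)
rss1 <- 720.90; df1 <- 30   # Model 1: mpg ~ am
rss2 <- 169.29; df2 <- 28   # Model 2: mpg ~ wt + qsec + am

# F = (drop in RSS per added parameter) / (residual variance of larger model)
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```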
Let’s take a look at our model’s residuals:
layout(matrix(c(1,2,3,4),2,2))
plot(model2)
The residuals don’t show any obvious pattern, appear to follow a normal distribution, and show no clear sign of heteroskedasticity.
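The visual check of normality can be complemented with a formal test. This is a sketch using the Shapiro-Wilk test from base R on the residuals of the selected model, refitted on a fresh copy of the data; `cars`, `fit`, and `sw` are illustrative names, and this test is an addition, not part of the original analysis.

```r
# Fresh copy of the data with am recoded as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit))
print(sw)
```

A p-value above the chosen significance level would be consistent with the impression from the Q-Q plot, although with only 32 observations the test has limited power.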
Is an automatic or manual transmission better for MPG?
Manual. In both the simple and the multivariate regression, cars with manual transmissions have higher MPG than cars with automatic transmissions.
Quantify the MPG difference between automatic and manual transmissions.
Holding weight and quarter mile time constant, a car with a manual transmission is estimated to get about 2.9 more miles per gallon than a car with an automatic transmission (coefficient 2.94, p = 0.047).
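A point estimate is easier to interpret alongside its uncertainty. This sketch computes a 95% confidence interval for the transmission coefficient with `confint`, refitting the selected model on a fresh copy of the data; `cars`, `fit`, and `ci` are illustrative names.

```r
# Fresh copy of the data with am recoded as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ wt + qsec + am, data = cars)

# 95% confidence interval for the manual-vs-automatic difference,
# holding wt and qsec constant
ci <- confint(fit, "amManual", level = 0.95)
print(ci)
```

Since the coefficient's p-value is just under 0.05, the lower bound of the interval lies close to zero, so the size of the advantage is estimated with considerable uncertainty.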