Regression Models Peer Assessment

Executive Summary

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions?

Exploratory Data Analysis

Reading in and cleaning the data

library(ggplot2)
data(mtcars)

First, let’s take a look at the data set:

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Next, we need to convert the “am” variable to a factor. Currently, the value 0 stands for automatic transmission and 1 stands for manual transmission.

mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
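
A quick sanity check on the recoding can be sketched as follows. It works on a local copy named `dat` (a name introduced here for illustration) so that the report's `mtcars` is left untouched, and it uses `factor()` with explicit `levels` and `labels`, which ties each label to its numeric code instead of relying on the order of the levels.

```r
# Local copy so the recoding can be checked without touching mtcars
dat <- datasets::mtcars

# Map 0 -> "Automatic" and 1 -> "Manual" explicitly
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Cross-tabulate the original codes against the new labels
recode_check <- table(code = datasets::mtcars$am, label = dat$am)
print(recode_check)
```

Every car should land on the diagonal of the table: code 0 under "Automatic" and code 1 under "Manual".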

Exploratory graphs

Let’s compare the distribution of miles per gallon for automatic and manual transmissions:

g <- ggplot(mtcars, aes(x=am, y=mpg))
# Fill by transmission type; mpg is continuous and cannot colour a whole box
g <- g + geom_boxplot(aes(fill=am))
g <- g + ggtitle("Miles per Gallon by Transmission Type")
g <- g + xlab("Transmission Type") 
g <- g + ylab("Miles per Gallon")
g

From this graph, we can see that cars with manual transmissions have a higher median mpg than cars with automatic transmissions. However, there could be confounding factors: a variable related to both transmission type and mpg could account for part of the gap. Let’s look at the relationship between mpg and weight:

g2 <- ggplot(mtcars, aes(x=wt, y=mpg)) 
g2 <- g2 + geom_point()
g2 <- g2 + facet_grid(.~am)
g2 <- g2 + ggtitle("Miles per Gallon by Weight and Transmission Type")
g2 <- g2 + xlab("Weight (1000 lbs)") 
g2 <- g2 + ylab("Miles per Gallon")
g2

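This graph shows that mpg falls as weight rises in both groups, and that the manual transmission cars in this sample are mostly lighter than the automatic ones. Weight is therefore a plausible confounder, and the graph alone cannot tell us how much of the mpg gap remains once weight is accounted for.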
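
The visual impression can be checked numerically. The sketch below works on a local copy `dat` (an illustrative name), compares the mean weight of the two groups, and fits mpg on weight and transmission together. The model `mpg ~ wt + am` is an illustration added here and is not one of the models fitted in this report.

```r
dat <- datasets::mtcars
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean weight (in 1000 lbs) for each transmission type
wt_by_am <- tapply(dat$wt, dat$am, mean)
print(wt_by_am)

# Transmission effect on mpg after adjusting for weight only
fit_wt_am <- lm(mpg ~ wt + am, data = dat)
print(coef(summary(fit_wt_am)))
```

If the `amManual` estimate is small relative to its standard error, weight accounts for most of the raw gap between the two groups.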

T-test

To see if there is a statistically significant difference in mpg between automatic and manual transmission cars, we can run a t-test. The null hypothesis is that there is no difference in the average mpg between the two types of cars.

t.test(mtcars$mpg~mtcars$am)

## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

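The p-value is 0.0014, which is less than 0.05, so we reject the null hypothesis. Manual transmission cars average 24.4 mpg against 17.1 mpg for automatic transmission cars, a difference of about 7.2 mpg (95% confidence interval: 3.2 to 11.3). This comparison does not adjust for any other variable.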
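
To make the test less of a black box, the Welch statistic and its degrees of freedom can be recomputed by hand. This is a minimal sketch using the raw 0/1 coding of `am` from `datasets::mtcars`; the variable names are introduced here for illustration.

```r
dat <- datasets::mtcars
auto   <- dat$mpg[dat$am == 0]
manual <- dat$mpg[dat$am == 1]

# Per-group variance of the mean (variances are not pooled in Welch's test)
va <- var(auto) / length(auto)
vm <- var(manual) / length(manual)

# Welch t statistic (automatic minus manual, matching the group order above)
t_manual <- (mean(auto) - mean(manual)) / sqrt(va + vm)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (va + vm)^2 / (va^2 / (length(auto) - 1) + vm^2 / (length(manual) - 1))

print(c(t = t_manual, df = df_welch))
```

The two values should agree with the `t` and `df` reported by `t.test` above.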

Regression Analysis

Model 1: Simple linear model

First, we will run a simple linear model to see the relationship between mpg and our “am” variable:

model1 <- lm(mpg~am, data = mtcars)
summary(model1)

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

While the p-value for the coefficient is small (0.0003), our R-squared value is 0.360. This tells us that our model explains only 36% of the variation in mpg. Let’s try to find a better model.
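
Two quantities in this output can be reproduced directly, which helps in reading them. With a single two-level predictor, the slope is the difference between the group means, and R-squared is one minus the ratio of residual to total sum of squares. The sketch below uses a local copy `dat` and a refit `fit1`, both names introduced here for illustration.

```r
dat <- datasets::mtcars
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit1 <- lm(mpg ~ am, data = dat)

# With one binary predictor, the slope is the difference in group means
group_means <- tapply(dat$mpg, dat$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

# R-squared: share of the total variation not left in the residuals
rss <- sum(resid(fit1)^2)
tss <- sum((dat$mpg - mean(dat$mpg))^2)
r_squared <- 1 - rss / tss

print(c(slope = unname(coef(fit1)["amManual"]), mean_diff = mean_diff,
        r_squared = r_squared))
```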

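Model 2: Multivariable linear model

To look for a better model with several predictors, we will use the step function. Called without a scope, it starts from the model containing every variable and drops one term at a time for as long as doing so lowers the AIC: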

model2 <- step(lm(mpg ~ ., data = mtcars), trace = 0, steps = 10000)
summary(model2)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amManual      2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Now our model has an R-squared value of 0.850 (adjusted 0.834), meaning that the new model explains about 85% of the variation in mpg. In addition to the am variable, the selected model includes wt (weight) and qsec (quarter mile time). The am coefficient drops from 7.2 to 2.9 once these two variables are included.
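
Since `step` chooses between models by AIC, it can be useful to see the criterion for the starting model and the selected one side by side. This is a sketch on a local copy `dat`; `full` and `selected` are names introduced here, and `selected` is refit from the formula reported above instead of being taken from `model2`.

```r
dat <- datasets::mtcars
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

full     <- lm(mpg ~ ., data = dat)               # starting point for step()
selected <- lm(mpg ~ wt + qsec + am, data = dat)  # model reported above

# Lower AIC is preferred; df counts the coefficients plus the error variance
aic_table <- AIC(full, selected)
print(aic_table)
```

`AIC()` and the criterion printed by `step` differ by a constant for a given data set, so both rank the models in the same order.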

Model testing:

Because model 1 is nested in model 2, we can run an F-test with anova to see whether adding wt and qsec significantly improves the fit over the simple linear model.

anova(model1, model2)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     28 169.29  2    551.61 45.618 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(model1, model2)$"Pr(>F)"[2]

## [1] 1.550495e-09

With a p-value of 1.55e-09, we reject the null hypothesis that the coefficients of wt and qsec are both zero: the larger model fits significantly better than the simple one.
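
The F statistic in the table can be reproduced from the residual sums of squares alone. The sketch below copies the rounded values from the anova output above, so the result matches the table only up to rounding; the variable names are introduced here for illustration.

```r
# Values copied from the anova table above
rss_small <- 720.90   # residual sum of squares, mpg ~ am
rss_large <- 169.29   # residual sum of squares, mpg ~ wt + qsec + am
df_small  <- 30       # residual degrees of freedom, smaller model
df_large  <- 28       # residual degrees of freedom, larger model

# F = (drop in RSS per added term) / (residual variance of the larger model)
f_stat <- ((rss_small - rss_large) / (df_small - df_large)) /
  (rss_large / df_large)

# Upper-tail probability of the F distribution with (2, 28) degrees of freedom
p_value <- pf(f_stat, df_small - df_large, df_large, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```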

Residuals:

Let’s take a look at our model’s residuals:

layout(matrix(c(1,2,3,4),2,2)) 
plot(model2)

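The residuals show no obvious pattern against the fitted values, fall close to the line in the normal Q-Q plot, and show no clear sign of heteroskedasticity. With only 32 observations, these plots can reveal only gross violations of the model assumptions.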
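
The visual reading can be complemented with a couple of numeric checks. The sketch below refits the selected model on a local copy (`dat` and `fit2` are names introduced here), runs a Shapiro-Wilk test on the residuals, and lists any observations with large standardized residuals. The cutoff of 2 is a common rule of thumb, not something taken from this report.

```r
dat <- datasets::mtcars
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit2 <- lm(mpg ~ wt + qsec + am, data = dat)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(fit2))
print(sw)

# Standardized residuals beyond about +/-2 deserve a closer look
std_res <- rstandard(fit2)
print(std_res[abs(std_res) > 2])
```

A large p-value from the test does not prove normality, especially in a sample this small; it only means that no departure was detected.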

Conclusions

Is an automatic or manual transmission better for MPG?
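Manual. In this sample, cars with manual transmissions get better MPG than cars with automatic transmissions, both in the raw comparison and in the regression that adjusts for weight and quarter mile time.
Quantify the MPG difference between automatic and manual transmissions?
On average, a car with a manual transmission is estimated to get 2.9 more miles per gallon than a car with an automatic transmission, holding weight and quarter mile time constant (standard error 1.4, p = 0.047). The unadjusted difference is 7.2 miles per gallon.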
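
A point estimate alone hides how uncertain the adjusted difference is. One way to express that uncertainty is a confidence interval for the `amManual` coefficient, sketched below on a refit of the selected model (`dat` and `fit2` are names introduced here for illustration).

```r
dat <- datasets::mtcars
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit2 <- lm(mpg ~ wt + qsec + am, data = dat)

# 95% confidence interval for the adjusted manual-vs-automatic difference
ci_am <- confint(fit2, "amManual", level = 0.95)
print(ci_am)
```

The interval is the estimate plus or minus the 97.5% quantile of the t distribution with 28 degrees of freedom times the standard error, so it can also be checked against the coefficient table above.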