Executive Summary

You work for Motor Trend, a magazine about the automobile industry. Using a data set on a collection of cars, the magazine wants to explore the relationship between a set of variables and miles per gallon (MPG), the outcome. This analysis concentrates on how transmission type (automatic or manual) relates to MPG and on how large the difference between the two is.

Exploratory Data Analysis

Reading in and cleaning the data

library(ggplot2)
data(mtcars)

First, let’s take a look at the data set:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Next, we need to convert the “am” variable to a factor. Currently, the value 0 stands for automatic transmission and 1 stands for manual transmission.

mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
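
A quick sanity check on the recoding, as a minimal sketch: tabulating the new factor shows how many cars fall in each group. The data frame name `d` is an illustrative copy and is not part of the original analysis.

```r
data(mtcars)

# Work on a copy so the original data frame is left untouched
d <- mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Count the cars in each transmission group
am_counts <- table(d$am)
print(am_counts)
```

Naming the levels explicitly in `factor()` ties each label to its numeric code, which avoids relying on the sort order of the levels.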

Exploratory graphs

Let’s compare the distribution of miles per gallon for automatic and manual transmissions:

g <- ggplot(mtcars, aes(x = am, y = mpg))
# Fill by transmission type; mpg is continuous and cannot colour a whole box
g <- g + geom_boxplot(aes(fill = am))
g <- g + ggtitle("Miles per Gallon by Transmission Type")
g <- g + xlab("Transmission Type")
g <- g + ylab("Miles per Gallon")
g

From this graph, we can see that cars with manual transmissions tend to have higher mpg than cars with automatic transmissions. However, there could be confounding factors: mpg may also depend on another variable that differs between the two groups. Let’s look at the relationship between mpg and weight:

g2 <- ggplot(mtcars, aes(x=wt, y=mpg)) 
g2 <- g2 + geom_point()
g2 <- g2 + facet_grid(.~am)
g2 <- g2 + ggtitle("Miles per Gallon by Weight and Transmission Type")
g2 <- g2 + xlab("Weight (1000 lbs)")
g2 <- g2 + ylab("Miles per Gallon")
g2

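This graph suggests that manual transmission cars still have higher mpg than automatic transmission cars when weight is taken into account. Because the manual cars in this data set are also lighter on average, the regression models below are needed to separate the two effects.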
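
The visual impression can also be checked numerically. The sketch below is an addition to the original analysis: it compares mean weight across the two groups and fits a model containing both weight and transmission type. The names `d`, `wt_by_am` and `fit_wt` are illustrative.

```r
data(mtcars)
d <- mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean weight (in 1000 lbs) for each transmission type
wt_by_am <- tapply(d$wt, d$am, mean)
print(wt_by_am)

# mpg modelled on weight and transmission type together
fit_wt <- lm(mpg ~ wt + am, data = d)
print(coef(summary(fit_wt)))
```

If the amManual coefficient in this model is much smaller than the raw difference in group means, weight accounts for a large part of the apparent transmission effect.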

T-test

To see if there is a statistically significant difference in mpg between automatic and manual transmission cars, we can run a t-test. The null hypothesis is that there is no difference in the average mpg between the two types of cars.

t.test(mtcars$mpg~mtcars$am)
## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

The p-value is 0.0014, which is less than 0.05, so we reject the null hypothesis. Manual transmission cars average 24.39 mpg versus 17.15 mpg for automatic cars, a difference of about 7.2 mpg (95% confidence interval: 3.21 to 11.28 mpg).
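
To make the test less of a black box, the Welch statistic can be reproduced by hand. This is a minimal sketch using the standard Welch formulas: the difference in group means divided by the square root of the summed per-group variances over sample sizes, with Welch-Satterthwaite degrees of freedom. All variable names are illustrative.

```r
data(mtcars)

# Split mpg by the original 0/1 coding (0 = automatic, 1 = manual)
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Per-group variance of the sample mean
v_auto   <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over its standard error
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df_welch)

print(c(t = t_stat, df = df_welch, p = p_value))
```

These values should agree with the t, df and p-value reported by t.test above.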

Regression Analysis

Model 1: Simple linear model

First, we will run a simple linear model to see the relationship between mpg and our “am” variable:

model1 <- lm(mpg~am, data = mtcars)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

While the p-value for the coefficient is small (0.0003), the R-squared value is only 0.36, so the model explains just 36% of the variation in mpg. Let’s try to find a better model.
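
The simple model can be read directly in terms of group means, and its R-squared can be recomputed from the residuals. The following sketch illustrates both points; the names `fit`, `group_means` and `r_squared` are illustrative.

```r
data(mtcars)
d <- mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = d)

# With a single two-level factor, the intercept is the mean of the
# reference group and the slope is the difference between group means
group_means <- tapply(d$mpg, d$am, mean)
mean_diff <- as.numeric(group_means["Manual"] - group_means["Automatic"])

# R-squared = 1 - residual sum of squares / total sum of squares
rss <- sum(residuals(fit)^2)
tss <- sum((d$mpg - mean(d$mpg))^2)
r_squared <- 1 - rss / tss

print(c(slope = coef(fit)[["amManual"]], mean_diff = mean_diff,
        r_squared = r_squared))
```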

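Model 2: Multivariable linear model

To find a better model with several predictors, we will use the step function, which starts from the model containing every variable and removes terms one at a time as long as the AIC decreases:

model2 <- step(lm(mpg ~ ., data = mtcars), trace = 0, steps = 10000)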
summary(model2)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amManual      2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

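Now our model has an R-squared value of 0.85, meaning that it explains 85% of the variation in mpg. In addition to the am variable, the selected model includes wt (weight) and qsec (quarter-mile time). Holding weight and quarter-mile time constant, manual transmission cars get an estimated 2.94 mpg more than automatic cars (p = 0.047).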
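
Two follow-up checks may help when reading this model: a confidence interval for the transmission coefficient, and an AIC comparison between the full model and the selected one. The sketch below refits the selected formula directly instead of calling step; the names `fit_sel`, `fit_full`, `ci_am` and `aic_vals` are illustrative.

```r
data(mtcars)
d <- mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# The model chosen by step, refitted directly
fit_sel  <- lm(mpg ~ wt + qsec + am, data = d)
# The full model that step started from
fit_full <- lm(mpg ~ ., data = d)

# 95% confidence interval for the manual-vs-automatic difference,
# holding weight and quarter-mile time constant
ci_am <- confint(fit_sel, "amManual", level = 0.95)
print(ci_am)

# Lower AIC indicates the preferred model
aic_vals <- AIC(fit_full, fit_sel)
print(aic_vals)
```

A wide interval for amManual would be a reminder that, with only 32 cars, the adjusted transmission effect is estimated imprecisely.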

Model testing

We can run an ANOVA F-test to check whether the larger model fits significantly better than the simple linear model.

anova(model1, model2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     30 720.90                                 
## 2     28 169.29  2    551.61 45.618 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model1, model2)$"Pr(>F)"[2]
## [1] 1.550495e-09

With a p-value of 1.55e-09, we reject the null hypothesis that adding wt and qsec does not improve the fit. The larger model is significantly better than the simple one.
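
The F statistic in the table can be rebuilt from the two residual sums of squares, which shows what the test is measuring. This is a minimal sketch of the standard nested-model F-test; the names `m1`, `m2`, `f_stat` and `p_value` are illustrative.

```r
data(mtcars)
d <- mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

m1 <- lm(mpg ~ am, data = d)              # smaller model
m2 <- lm(mpg ~ wt + qsec + am, data = d)  # larger model, contains m1

# For lm objects, deviance() returns the residual sum of squares
rss1 <- deviance(m1)
rss2 <- deviance(m2)

# Number of extra parameters in the larger model
df_extra <- df.residual(m1) - df.residual(m2)

# F = (drop in RSS per extra parameter) / (residual variance of larger model)
f_stat <- ((rss1 - rss2) / df_extra) / (rss2 / df.residual(m2))
p_value <- pf(f_stat, df_extra, df.residual(m2), lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```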

Residuals

Let’s take a look at our model’s residuals:

layout(matrix(c(1,2,3,4),2,2)) 
plot(model2)

The residuals show no obvious pattern, appear to be approximately normally distributed, and show no clear sign of heteroskedasticity.
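
The diagnostic plots can be complemented with a few numeric checks. The sketch below is an addition to the original analysis: it runs a Shapiro-Wilk normality test on the residuals and lists the most influential observations by Cook's distance. The names `fit`, `res`, `sw`, `lev` and `cook` are illustrative.

```r
data(mtcars)
d <- mtcars
d$am <- factor(d$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ wt + qsec + am, data = d)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value would cast doubt on normality
sw <- shapiro.test(res)
print(sw)

# Leverage (hat values) and influence (Cook's distance) per car
lev  <- hatvalues(fit)
cook <- cooks.distance(fit)

# The three cars with the largest influence on the fit
print(head(sort(cook, decreasing = TRUE), 3))
```

With only 32 observations, such tests have limited power, so they are best read alongside the plots and not as a replacement for them.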

Conclusions