Executive Summary

We are asked to perform a regression analysis for Motor Trend, a magazine about the automobile industry. Using a data set of a collection of cars, they want to explore the relationship between a set of variables and the outcome, miles per gallon (MPG). They are particularly interested in the following two questions:

  1. “Is an automatic or manual transmission better for MPG?”
  2. “Quantify the MPG difference between automatic and manual transmissions”

Exploratory Analysis

Let's load the data and check its general structure.

data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

We start with the first question: is an automatic or manual transmission better for MPG?

# Convert am to a factor for later use.
mtcars$am <- as.factor(mtcars$am)
# In mtcars, am = 0 means automatic and am = 1 means manual.
levels(mtcars$am) <- c("Automatic", "Manual")
boxplot(mpg ~ am, data = mtcars, main = "MPG by Transmission", varwidth = TRUE, col = c(3, 4), ylab = "MPG")

Although this is only an exploratory graph, manual transmissions appear to have higher MPG than automatics, as expected.

To quantify the difference in MPG, we compare the group means:

 by(mtcars$mpg, mtcars$am , mean)
## mtcars$am: Automatic
## [1] 17.14737
## -------------------------------------------------------- 
## mtcars$am: Manual
## [1] 24.39231

This comparison ignores all the other variables, but on its face manual cars average about 7.2 MPG more than automatic cars (24.39 versus 17.15).
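As a small sketch, the gap between the two group means can be computed directly. It works on a fresh copy of the built-in data set, where am is still coded 0/1; the names `cars`, `group_means` and `mpg_gap` are illustrative and not part of the original analysis.

```r
# Work on a fresh copy so the factor conversion above is not required.
cars <- datasets::mtcars

# Mean MPG per transmission type (am: 0 = automatic, 1 = manual).
group_means <- tapply(cars$mpg, cars$am, mean)

# Manual minus automatic.
mpg_gap <- unname(group_means["1"] - group_means["0"])
print(round(mpg_gap, 2))
```

This is only the raw, unadjusted difference; it says nothing yet about weight, horsepower or cylinders.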

We can test this difference formally with a t-test:

t.test(mpg ~ am, data = mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

With a p-value of 0.001374, our first impression is that transmission type is significantly associated with MPG. The 95% confidence interval for the difference in mean MPG (automatic minus manual) is (-11.28, -3.21), so automatic cars average between roughly 3.2 and 11.3 MPG less than manual cars.
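Because the sign of the interval depends on the order of the groups, it can help to restate it as manual minus automatic. The sketch below assumes a fresh copy of the built-in data (am coded 0/1); `welch` and the `ci_*` names are illustrative.

```r
# Fresh copy of the data; am is numeric here (0 = automatic, 1 = manual).
cars <- datasets::mtcars

# Welch two-sample t-test, the default for t.test().
welch <- t.test(mpg ~ am, data = cars)

# The reported interval is for group 0 minus group 1 (automatic minus manual).
ci_auto_minus_manual <- as.numeric(welch$conf.int)

# Negate and reverse to express it as manual minus automatic.
ci_manual_minus_auto <- rev(-ci_auto_minus_manual)
print(round(ci_manual_minus_auto, 2))
```

Both forms describe the same interval; only the direction of the comparison changes.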

Before developing the model, it helps to look at the pairwise relationships among all the variables so that we can decide which ones to include in our linear regression model.

 pairs(mtcars,  pch = 18, panel = panel.smooth)
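A numeric companion to the pairs plot is the correlation of each variable with mpg. The sketch below is an optional cross-check, not part of the original analysis; it uses a fresh copy of the data so that every column is numeric, and `cor_with_mpg` and `ranked` are illustrative names. Pearson correlation only captures linear, pairwise association, so it should be read alongside the plot rather than instead of it.

```r
# Fresh copy of the data so that every column is numeric.
cars <- datasets::mtcars

# Pearson correlation of every variable with mpg, dropping mpg itself.
cor_with_mpg <- cor(cars)[, "mpg"]
cor_with_mpg <- cor_with_mpg[names(cor_with_mpg) != "mpg"]

# Sort by absolute strength, strongest first.
ranked <- cor_with_mpg[order(abs(cor_with_mpg), decreasing = TRUE)]
print(round(ranked, 2))
```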

Regression Model

Our approach is linear regression. We will fit a few models with different sets of regressors, using the pairs plot as a guide to which variables to include.

Both intuition and the pairs plot suggest that cyl, wt, hp and, of course, am are related to mpg: we expect the number of cylinders, the weight of the car and the horsepower to affect fuel economy. disp and vs also seem to have an effect, but we are not sure whether adding them improves the model.

The variables carb, qsec, drat and gear do not show promising relationships with mpg, so we discard them.

We will fit three models: the first uses transmission type alone, the second adds the variables listed above (cyl, hp and wt), and the third also adds disp and vs.

# Treat the number of cylinders as a factor.
mtcars$cyl <- as.factor(mtcars$cyl)
# Create the models
fit1 <- lm(mpg ~ am, data = mtcars)
fit2 <- lm(mpg ~ am + cyl + hp + wt, data = mtcars)
fit3 <- lm(mpg ~ am + cyl + hp + wt + disp + vs, data = mtcars)

Let's compare the nested models with an analysis of variance (ANOVA):

anova (fit1, fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + cyl + hp + wt
## Model 3: mpg ~ am + cyl + hp + wt + disp + vs
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     26 151.03  4    569.87 23.8956 4.524e-08 ***
## 3     24 143.09  2      7.94  0.6655    0.5233    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The low p-value (4.524e-08) for model 2 shows that adding cyl, hp and wt significantly improves the fit over the model with am alone. The high p-value for model 3 (0.5233) suggests there is little to gain from also including disp and vs.

So we select fit2 as our model.
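To see where the F statistics in the table come from, they can be recomputed by hand from the residual sums of squares. The sketch below copies the RSS and residual degrees of freedom from the table above; the variable names are illustrative. When several models are compared, anova() scales each step by the residual mean square of the largest model, here fit3.

```r
# Values copied from the ANOVA table above.
rss    <- c(fit1 = 720.90, fit2 = 151.03, fit3 = 143.09)
res_df <- c(fit1 = 30, fit2 = 26, fit3 = 24)

# Denominator: residual mean square of the largest model (fit3).
mse_full <- rss[["fit3"]] / res_df[["fit3"]]

# Drop in RSS and degrees of freedom spent at each step.
rss_drop <- -diff(rss)
df_used  <- -diff(res_df)

# F statistic and upper-tail p-value for each added block of regressors.
f_stat <- (rss_drop / df_used) / mse_full
p_val  <- pf(f_stat, df1 = df_used, df2 = res_df[["fit3"]], lower.tail = FALSE)
print(round(cbind(F = f_stat, p = p_val), 4))
```

Small differences from the printed table in the last digits are expected, since the inputs are rounded.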

Let’s take a closer look at fit2 to “Quantify the MPG difference between automatic and manual transmissions”:

summary(fit2)
## 
## Call:
## lm(formula = mpg ~ am + cyl + hp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## amManual     1.80921    1.39630   1.296  0.20646    
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

Adding the extra regressors changes the picture considerably. Holding cylinders, horsepower and weight fixed, the estimated effect of a manual transmission is an increase of 1.81 MPG over an automatic, far smaller than the unadjusted difference in means. Its p-value is 0.20646, well above the usual 0.05 threshold, so we cannot conclude that transmission type has a significant effect on mpg once these variables are accounted for.

Weight and horsepower have much lower p-values (0.00908 and 0.02693) and are significant at the 5% level.
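A confidence interval for the transmission coefficient makes the uncertainty explicit. The sketch below refits the selected model on a fresh copy of the data so that it runs on its own; `cars`, `fit_adj` and `am_ci` are illustrative names, and the factor coding mirrors the conversion done earlier.

```r
# Rebuild the data with the same factor coding used in the analysis.
cars <- datasets::mtcars
cars$am  <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

# Same formula as fit2.
fit_adj <- lm(mpg ~ am + cyl + hp + wt, data = cars)

# 95% confidence interval for the manual-versus-automatic coefficient.
am_ci <- confint(fit_adj, "amManual", level = 0.95)
print(round(am_ci, 2))
```

An interval that contains zero is consistent with the large p-value for amManual in the summary.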

Residuals and Diagnostics

par(mfrow = c(2,2))
plot(fit2)
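The diagnostic plots can be complemented with a few numeric checks. The sketch below is an optional addition that refits the same model on a fresh copy of the data and inspects leverage and Cook's distance. The cutoff of twice the average leverage is a common rule of thumb, assumed here for illustration, and all variable names are illustrative.

```r
# Rebuild the data and the selected model so the block runs on its own.
cars <- datasets::mtcars
cars$am  <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)
fit_adj <- lm(mpg ~ am + cyl + hp + wt, data = cars)

# Leverage (hat values) and influence (Cook's distance) per car.
lev  <- hatvalues(fit_adj)
cook <- cooks.distance(fit_adj)

# Rule of thumb (an assumption): flag leverage above 2 * p / n.
n_coef <- length(coef(fit_adj))
n_obs  <- nrow(cars)
high_leverage <- names(lev)[lev > 2 * n_coef / n_obs]
print(high_leverage)

# The three most influential cars by Cook's distance.
print(head(sort(cook, decreasing = TRUE), 3))
```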