Regression Models Project - Motor Trend Data Analysis Report

Executive Summary

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

  • “Is an automatic or manual transmission better for MPG”
  • “Quantify the MPG difference between automatic and manual transmissions”

Exploratory Data Analysis

mtcars is a data frame with 32 observations on 11 variables. At first, we load the data set and perform data transformations by factoring the necessary variables.

library(ggplot2)
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- factor(mtcars$am,labels=c('Automatic','Manual'))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

Initially, we plot the relationships between all the variables of the dataset (see Figure 1 in the Appendix). From the plot, we can notice that variables like cyl, disp, hp, drat, wt, vs and am seem to have some strong correlation with mpg. Since we are interested in the effects of car transmission type on mpg, we plot boxplots of the variable mpg when am is Automatic or Manual (see Figure 2 in the Appendix). This plot shows an increase in the mpg when the transmission is Manual.

Regression Analysis

From the previous plots, we have seen that several variables seem to have high correlation with mpg, so we fit an initial model with all the variables as predictors, and perfom stepwise model selection to select significant predictors for the final model which is the best model

initModel <- lm(mpg ~ ., data=mtcars)
bestModel <- step(initModel, direction = "both")
summary(initModel)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 23.87913   20.06582   1.190   0.2525  
## cyl6        -2.64870    3.04089  -0.871   0.3975  
## cyl8        -0.33616    7.15954  -0.047   0.9632  
## disp         0.03555    0.03190   1.114   0.2827  
## hp          -0.07051    0.03943  -1.788   0.0939 .
## drat         1.18283    2.48348   0.476   0.6407  
## wt          -4.52978    2.53875  -1.784   0.0946 .
## qsec         0.36784    0.93540   0.393   0.6997  
## vs1          1.93085    2.87126   0.672   0.5115  
## amManual     1.21212    3.21355   0.377   0.7113  
## gear4        1.11435    3.79952   0.293   0.7733  
## gear5        2.52840    3.73636   0.677   0.5089  
## carb2       -0.97935    2.31797  -0.423   0.6787  
## carb3        2.99964    4.29355   0.699   0.4955  
## carb4        1.09142    4.44962   0.245   0.8096  
## carb6        4.47757    6.38406   0.701   0.4938  
## carb8        7.25041    8.36057   0.867   0.3995  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.779 
## F-statistic:  7.83 on 16 and 15 DF,  p-value: 0.000124
summary(bestModel)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## amManual     1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

This model is “mpg ~ cyl + hp + wt + am”. It has the Residual standard error as 2.41 on 26 degrees of freedom. The Adjusted R-squared value is 0.8401, which means that the model can explain about 84% of the variance of the MPG variable.

From the above computations, the best model consists of the variables, cyl, wt and hp as confounders and am as the independent variable

Now, we compare the model which we obtained and a base model with only am as the predictor variable.

amModel <- lm(mpg ~ am, data = mtcars)
anova(amModel, bestModel)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     26 151.03  4    569.87 24.527 1.688e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This model captured 84% of the overall variation in mpg. With a p-value of 3.745e-09, we reject the null hypothesis and claim that our multivariate model is significantly different from our simple linear regression model.

Residual Analysis and Diagnostics

According to the residual plots (see Figure 3 in Appendix), we can make the following observations:

  • The Residuals vs. Fitted plot shows the randomness of the distribution of the points, so it confirms the variable indipendence.
  • The Normal Q-Q plot indicates that the residuals are normally distributed because the points lie closely to the line.
  • The Scale-Location plot confirms the constant variance assumption, as the points are randomly distributed.
  • The Residuals vs. Leverage shows that the labeled points appear to be leverage points, and no outliers are presentThe labeled points appear to be leverage points above the rest of the points.

In order to find out the interesting leverage points, we can compute some regression diagnostics of our model. We compute top three points in each case of influence measures.

leverage <- hatvalues(bestModel)
tail(sort(leverage),3)
##       Toyota Corona Lincoln Continental       Maserati Bora 
##           0.2777872           0.2936819           0.4713671
influential <- dfbetas(bestModel)
tail(sort(influential[,6]),3)
## Chrysler Imperial          Fiat 128     Toyota Corona 
##         0.3507458         0.4292043         0.7305402

This shows that our analysis is correct, because the same cars are mentioned in the residual plots.

Inference

We can perform t-test to see that the manual and automatic transmissions are significatively different:

t.test(mpg ~ am, data = mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

We can conclude that cars with Manual transmission get more miles per gallon mpg compared to cars with Automatic transmission.

Appendix: Figures

  1. Boxplot of MPG vs. Transmission
boxplot(mpg ~ am, xlab="Transmission (0 = Automatic, 1 = Manual)", ylab="MPG", col=c("lightblue","blue"),
        data = mtcars, main="Boxplot of MPG vs. Transmission")

2. Pair Graph of Motor Trend Car Road Tests

plot(mtcars, panel=panel.smooth, main="Pair Graph of Motor Trend Car Road Tests")

3. Residual Plots

par(mfrow = c(2,2))
plot(bestModel)