Executive Summary

Using the mtcars dataset, a collection of 32 cars, this report explores the relationship between a set of variables and the outcome, miles per gallon (MPG). In particular, it seeks to answer the following two questions:

  1. “Is an automatic or manual transmission better for MPG?”
  2. “Quantify the MPG difference between automatic and manual transmissions”

Steps taken:

  1. Load and process the dataset.
  2. Exploratory analysis of the dataset.
  3. Model selection among several regression models.
  4. Analysis of the residuals for the best-fit model.

Loading and Processing the Data

# Loading libraries
library(ggplot2)
library(GGally)
library(dplyr)

# Loading mtcars dataset
data(mtcars)

mtcarsData <- mtcars

# Convert the non continuous variables into factors
mtcarsData$am <- as.factor(mtcarsData$am)
levels(mtcarsData$am) <- c("Auto", "Manual")

mtcarsData$vs <- as.factor(mtcarsData$vs)
levels(mtcarsData$vs) <- c("V", "S")

mtcarsData$cyl <- as.factor(mtcarsData$cyl)
mtcarsData$gear <- as.factor(mtcarsData$gear)
mtcarsData$carb <- as.factor(mtcarsData$carb)

Exploratory Data Analyses

# Structure of the dataset (dimensions and column types)
str(mtcarsData)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "V","S": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "Auto","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
# Look at the first 6 observations of the dataset
head(mtcarsData)
##                    mpg cyl disp  hp drat    wt  qsec vs     am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  V Manual    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  V Manual    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  S Manual    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  S   Auto    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  V   Auto    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  S   Auto    3    1

We’ll begin by looking at the pair-wise relationship between the variables in the dataset. The figure below shows scatterplots produced by plotting each variable against all others.

pairs(mtcarsData, panel=panel.smooth, pch=5, cex=0.5, gap=0.3, lwd=3, las=1, cex.axis=0.8)

We’ll also take a look at the relationship between transmission type and mpg.

ggplot(mtcarsData, aes(am, mpg, colour = am)) +
    geom_boxplot() + theme(legend.position = "right") + ggtitle("Relationship between MPG and Transmission type (am)") +
    theme(plot.title = element_text(lineheight = 1, face = "bold", size = 14))

We can observe that manual transmission cars tend to have higher MPG than automatic transmission cars.
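To check whether the gap seen in the boxplot is larger than chance variation alone would suggest, one option is a two-sample t-test. The following is a minimal sketch that is not part of the original analysis; it rebuilds the am factor from the built-in mtcars data, and the names carsT and ttest are illustrative.

```r
# Illustrative sketch: Welch two-sample t-test of mpg by transmission type.
data(mtcars)
carsT <- mtcars
carsT$am <- factor(carsT$am, levels = c(0, 1), labels = c("Auto", "Manual"))

# Welch t-test (does not assume equal variances in the two groups)
ttest <- t.test(mpg ~ am, data = carsT)
print(ttest)

# Group means reported by the test: Auto first, then Manual
print(ttest$estimate)
```

A small p-value here only says that the unadjusted group means differ; it does not account for other differences between the two groups of cars, such as weight.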

Regression Models - Model Selection

1. Starting with a simple model looking at just the transmission type.

fitAM <- lm(mpg ~ am, mtcarsData)
summary(fitAM)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcarsData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

While the p-value is low, this model explains little: with transmission type (am) as the only predictor, it accounts for about 36% of the variance in mpg (multiple R-squared 0.3598, adjusted R-squared 0.3385). We therefore need to look at how the other variables affect mpg.
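For a model with a single two-level factor, the intercept is the mean of the baseline group and the slope is the difference between the group means. The sketch below illustrates this and adds a 95% confidence interval for the difference; the names carsSimple, fitSimple, groupMeans and ciSimple are illustrative and do not appear in the original analysis.

```r
# Illustrative sketch: the am-only model reproduces the group means.
data(mtcars)
carsSimple <- mtcars
carsSimple$am <- factor(carsSimple$am, levels = c(0, 1),
                        labels = c("Auto", "Manual"))

fitSimple <- lm(mpg ~ am, data = carsSimple)

# Mean mpg per transmission type
groupMeans <- tapply(carsSimple$mpg, carsSimple$am, mean)
print(groupMeans)

# Intercept = mean of Auto; amManual = mean(Manual) - mean(Auto)
print(coef(fitSimple))

# 95% confidence intervals for both coefficients
ciSimple <- confint(fitSimple)
print(ciSimple)
```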

2. Next we consider a model that includes all the variables.

fitAll <- lm(mpg ~ ., mtcarsData)
summary(fitAll)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcarsData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5087 -1.3584 -0.0948  0.7745  4.6251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 23.87913   20.06582   1.190   0.2525  
## cyl6        -2.64870    3.04089  -0.871   0.3975  
## cyl8        -0.33616    7.15954  -0.047   0.9632  
## disp         0.03555    0.03190   1.114   0.2827  
## hp          -0.07051    0.03943  -1.788   0.0939 .
## drat         1.18283    2.48348   0.476   0.6407  
## wt          -4.52978    2.53875  -1.784   0.0946 .
## qsec         0.36784    0.93540   0.393   0.6997  
## vsS          1.93085    2.87126   0.672   0.5115  
## amManual     1.21212    3.21355   0.377   0.7113  
## gear4        1.11435    3.79952   0.293   0.7733  
## gear5        2.52840    3.73636   0.677   0.5089  
## carb2       -0.97935    2.31797  -0.423   0.6787  
## carb3        2.99964    4.29355   0.699   0.4955  
## carb4        1.09142    4.44962   0.245   0.8096  
## carb6        4.47757    6.38406   0.701   0.4938  
## carb8        7.25041    8.36057   0.867   0.3995  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.833 on 15 degrees of freedom
## Multiple R-squared:  0.8931, Adjusted R-squared:  0.779 
## F-statistic:  7.83 on 16 and 15 DF,  p-value: 0.000124

As expected, the adjusted R-squared (0.779) is higher than that of the previous model. However, none of the individual coefficients is significant at the 5% level, which suggests that we can get a better model with fewer variables.

3. Trimming down the variables.

fitBest <- step(fitAll,direction="both",trace=FALSE)
summary(fitBest)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcarsData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## amManual     1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

We observe that the selected model, mpg ~ cyl + hp + wt + am, includes 4 variables, and that the transmission type, am, is retained as one of them. The adjusted R-squared improves to 84.01%, compared to 77.9% for the previous model. As expected, the p-value for the overall model is small. Not every coefficient is significant at the 5% level, however: cyl6, hp and wt are, while cyl8 (p = 0.352) and amManual (p = 0.206) are not.
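The three candidate models can also be compared side by side. The sketch below is an illustration rather than part of the original analysis: it refits the models under the illustrative names fitAmOnly, fitFull and fitReduced, compares them with AIC(), and runs a nested F-test between the am-only model and the selected model. Note that step() ranks models with extractAIC(), which differs from AIC() by a constant for linear models fitted to the same data, so the ordering is the same.

```r
# Illustrative sketch: compare the three models by AIC and a nested F-test.
data(mtcars)
carsCmp <- mtcars
factorCols <- c("cyl", "vs", "am", "gear", "carb")
carsCmp[factorCols] <- lapply(carsCmp[factorCols], factor)

fitAmOnly  <- lm(mpg ~ am, data = carsCmp)
fitFull    <- lm(mpg ~ ., data = carsCmp)
fitReduced <- lm(mpg ~ cyl + hp + wt + am, data = carsCmp)

# Lower AIC indicates a better trade-off between fit and complexity
aicTable <- AIC(fitAmOnly, fitFull, fitReduced)
print(aicTable)

# F-test: do cyl, hp and wt add explanatory power beyond am alone?
nestedTest <- anova(fitAmOnly, fitReduced)
print(nestedTest)
```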

A Closer Look at the Best Fit Model

Exploring this model further:

par(mfrow = c(2,2))
plot(fitBest)

Residuals Analysis

According to the above plots:

  • Residuals vs. Fitted: There is no consistent pattern, suggesting that the assumption of a linear relationship is reasonable. The residuals also appear to band around the Residual=0 line, suggesting that the variances of the error terms are roughly equal. There do not appear to be any outliers.
  • Normal Q-Q plot: The residuals appear to be approximately normally distributed, as the data points lie close to the line.
  • Scale-Location: The plot is consistent with the constant variance assumption, as the points are randomly distributed.
  • Residuals vs. Leverage: All values fall well within the 0.5 Cook's distance bands, which suggests that there are no unduly influential observations.
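The visual checks above can be supplemented with numeric diagnostics. The following is a minimal sketch, not part of the original analysis, that refits the selected model under the illustrative name fitDiag and then runs a Shapiro-Wilk normality test on the residuals and lists the largest Cook's distances and leverage values.

```r
# Illustrative sketch: numeric counterparts to the diagnostic plots.
data(mtcars)
carsDiag <- mtcars
carsDiag$cyl <- factor(carsDiag$cyl)
carsDiag$am <- factor(carsDiag$am, levels = c(0, 1),
                      labels = c("Auto", "Manual"))
fitDiag <- lm(mpg ~ cyl + hp + wt + am, data = carsDiag)

# Shapiro-Wilk test: a large p-value gives no evidence against normality
normTest <- shapiro.test(resid(fitDiag))
print(normTest)

# Cook's distance: the three most influential cars
cooksD <- cooks.distance(fitDiag)
print(head(sort(cooksD, decreasing = TRUE), 3))

# Leverage (hat values): the three cars with the most unusual predictors
lev <- hatvalues(fitDiag)
print(head(sort(lev, decreasing = TRUE), 3))
```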

Conclusion

Given the very small sample size of just 32 observations, the results should be treated with caution. Despite this, the model fitBest appears to be quite a good fit: it has a high adjusted R-squared, and the residual analysis does not reveal any obvious violations of the model assumptions.

Back to the questions:

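Based on our analysis above, manual transmission cars appear to give better MPG. Without adjusting for other variables, manual cars average 7.2 miles per gallon more than automatic cars. In our best-fit model, which holds cyl, hp and wt constant, the estimated difference shrinks to 1.8 miles per gallon in favour of manual cars. This adjusted difference is not statistically significant (p = 0.206), so much of the raw gap is likely explained by the other variables in the model rather than by transmission type itself.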
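The uncertainty around the adjusted estimate can be quantified with a confidence interval. The sketch below is illustrative and not part of the original analysis; it refits the selected model under the illustrative name fitFinal and extracts a 95% interval for the amManual coefficient.

```r
# Illustrative sketch: 95% confidence interval for the adjusted
# manual-vs-automatic difference in mpg.
data(mtcars)
carsFinal <- mtcars
carsFinal$cyl <- factor(carsFinal$cyl)
carsFinal$am <- factor(carsFinal$am, levels = c(0, 1),
                       labels = c("Auto", "Manual"))
fitFinal <- lm(mpg ~ cyl + hp + wt + am, data = carsFinal)

# Point estimate of the adjusted difference
amEstimate <- coef(fitFinal)["amManual"]
print(amEstimate)

# Interval for the same coefficient; it spans zero when the
# coefficient is not significant at the 5% level
amCI <- confint(fitFinal, "amManual", level = 0.95)
print(amCI)
```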