Executive Summary

This project aims to build a Regression Model for “mtcars” dataset. We are particularly interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). Our aim is to find answers for the following 2 questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions?

Loading the dataset into R/Rstudio

Before we start our analysis we need to load the dataset. The following R code does that.

data("mtcars")

Exploratory Data Analysis

We will start our analysis by doing some basic exploratory analysis.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
head(mtcars, 3)
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Model Selection

Is an automatic or manual transmission better for MPG?

Let’s firs make a simple boxplot to see if there are overall differences between the two type of transmission. We will use “ggplot2” package to create the boxplot but one could produce a similar boxplot using a different package.

require(ggplot2)
## Loading required package: ggplot2
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mtcars <- mtcars %>%
        mutate(am = ifelse(test = am==1, yes = "Manual", no = "Automatic")) %>%
        mutate(am = as.factor(am))
g <- ggplot(data = mtcars, aes(x = am, y = mpg, fill = am))
g + geom_boxplot() + 
        labs(fill="Transmission Type") +
        theme(legend.position = c(.2, .8)) +
        ggtitle(label = "Miles Per Gallon relationship with Transmission types") +
        theme(plot.title = element_text(hjust = 0.5)) +
        xlab("Transmission Type") + ylab("Miles Per Gallon")

We can see from the boxplot that the amount of MPG for Manual Transmission is higher in general. However, we need to do statistical analysis to claim if the difference between the transmission types are statistically significant. Apart from that, we need to quantify the difference between the two transmission types.

Quantifying the difference

The following R codes quantify the mean difference between the two transmission types.

comparison <- mtcars %>%
        group_by(am) %>%
        summarise(meanMPG = mean(mpg)) %>%
        add_row(am = "Difference (Manual - Automatic)", meanMPG = 7.24494)
## `summarise()` ungrouping output (override with `.groups` argument)
data.frame(comparison)
##                                am  meanMPG
## 1                       Automatic 17.14737
## 2                          Manual 24.39231
## 3 Difference (Manual - Automatic)  7.24494

The table shows that there is about 7 mpg difference between the transmission types (in favor of the type Manual)

Simple Linear Regression Model

Let’s first create a simple linear regression model miles per gallon (mpg) as the outcome and transmission (am) as a predictor variable.

fit0 <- lm(formula = mpg ~ am, data = mtcars)
summary(fit0)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Our first linear model with only one predictor indicates that our model is statistically significant meaning the variable “am” can be used as a good predictor for miles per gallon. Also, note that our coefficient (slope) for Manual transmission is 7.245 indicating that the manual transmission is about 7 mpg higher compared to automatic transmission. Coefficients have very high t values and very small p values both showing that the result is significant. However, R-squared of the model is relatively low, suggesting only about 35% of the variation can be explained by the model. Residual standard error is 4.9. Next we will add different variables to see if we can improve our model.

Multiple Linear Regression Models

Now let’s use other variables as well in order to improve our model. Adding more variables to the model can also help us identify multicollinearity.

fit1 <- lm(formula = mpg ~ am + wt, data = mtcars)
fit2 <- lm(formula = mpg ~ am + wt + qsec, data = mtcars)
anova(fit0, fit1, fit2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + qsec
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     29 278.32  1    442.58 73.203 2.673e-09 ***
## 3     28 169.29  1    109.03 18.034 0.0002162 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see from the anova table that adding other variables can improve the model performance. The ANOVA table above shows clear decrease of RSS and SSQ as well as very small p values. Therefore, we the last model (fit2) can be chosen as a final model.

Final Model

Let’s see how is our final model is performing.

summary(fit2)
## 
## Call:
## lm(formula = mpg ~ am + wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## amManual      2.9358     1.4109   2.081 0.046716 *  
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Compared to our first model where there was only one predictor this time R squared has improved drastically (from about 36% of explained variation to 85%).

Residual Plots and Some Diagnostics

Before claiming that our model is working well we need to check the assumptions of Linear Regression. The following R codes will produce multiple residual plots in order to detect if there is any assumptions violated.

require(ggfortify)
## Loading required package: ggfortify
autoplot(fit2)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

The plots above show that relatively speaking the assumptions of Linear Regression are met. Residuals seem to be normally distributed and heteroscedasticity is seen in the first plot.

Uncertainty and Inference

It is also very important to mention that there may be some error and uncertainty with the chosen model. Therefore, with the following R codes we will demonstrate how much uncertainty we may expect with our model using 95% confidence interval.

confint(fit2)
##                   2.5 %    97.5 %
## (Intercept) -4.63829946 23.873860
## amManual     0.04573031  5.825944
## wt          -5.37333423 -2.459673
## qsec         0.63457320  1.817199