This report is a course project for the Regression Models course of the Data Science Specialization by Johns Hopkins University on Coursera.
In this report, we use the mtcars data set (available in R's built-in datasets package) to study the relationship between a car’s transmission and the number of miles per gallon of gasoline, along with a set of other variables that could affect this relationship.
You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG, the outcome). They are particularly interested in the following two questions:
- Is an automatic or manual transmission better for MPG?
- Quantify the MPG difference between automatic and manual transmissions.
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
A data frame with 32 observations on 11 (numeric) variables.
Setting the working directory, loading the data and the required libraries.
# Setting working directory first
setwd("~/Coursera/8_Data_Science_Specialization/7 Regression Models/Week 4/Assignment")
# Loading Libraries
library(ggplot2)
library(knitr)
library(kableExtra)
library(dplyr)
# Loading Data
data(mtcars)
Let’s take a look at the first rows of our data set.
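The table below can be produced with the knitr and kableExtra packages loaded above; a minimal sketch (the exact styling options are assumptions):
# Show the first six rows of mtcars as a formatted table
head(mtcars) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)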
|  | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
To answer this project’s questions, we will use linear models to quantify the relationship between the MPG variable and the remaining variables, and we will look for the best model fit in the subsequent regression analysis section.
Taking a look at how the MPG variable of interest is distributed (Figure 1 in the appendix), we can see that it roughly follows a normal distribution. The imprecision of this visual assessment may be due to the small sample size of only 32 observations. In any case, multiple linear regression assumes that the residuals are normally distributed, so we will check this assumption once we have the best model fit for this project.
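Figure 1 could be produced with a histogram of the outcome using the ggplot2 package loaded above; a sketch (the bin width and labels are assumptions):
# Histogram of the MPG outcome to inspect its distribution
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "steelblue", colour = "white") +
  labs(title = "Distribution of Miles per Gallon", x = "Miles per Gallon (mpg)", y = "Count")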
To know which transmission type is better for fuel consumption, we first plot boxplots of the MPG variable when am is Automatic or Manual (see Figure 2 in the appendix). We can clearly notice an increase in MPG when the transmission is Manual.
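A boxplot like the one in Figure 2 could be generated along these lines; a sketch (the factor labels and fill colours are assumptions):
# Boxplots of MPG by transmission type (0 = Automatic, 1 = Manual)
mtcars %>%
  mutate(Transmission = factor(am, labels = c("Automatic", "Manual"))) %>%
  ggplot(aes(x = Transmission, y = mpg, fill = Transmission)) +
  geom_boxplot() +
  labs(title = "Miles per Gallon by Transmission Type", y = "Miles per Gallon (mpg)")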
We can take a look into our data and explore the relationships between all the variables (Figure 3 in the appendix). Our outcome variable, miles per gallon, seems to have a relationship with several other variables, such as cyl, disp, hp, drat, wt, vs and am.
These relationships are easier to notice by looking at the correlogram (Figure 4 in the appendix). Correlations closer to 1 are shown in a darker tone of blue and correlations closer to -1 in a darker tone of red; in both cases, a darker tone indicates a stronger relationship.
Besides our interest in the outcome variable, we can also get an idea of whether or not to consider the combination of certain variables in a linear model depending on their correlation. This is because one variable could be explaining another one, e.g. Weight (wt) vs Displacement (disp) sharing a correlation of 0.89. We might explain this relationship because a heavy vehicle simply needs more power than a smaller one to produce comparable acceleration and load-hauling capacity, and that is usually achieved by a larger-displacement engine.
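The correlation matrix underlying Figures 3 and 4 can be computed directly; a sketch that prints the matrix rather than drawing the correlogram itself:
# Correlation matrix between all variables;
# e.g. wt and disp share a correlation of about 0.89
round(cor(mtcars), 2)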
In this section, we will try to find the best model fit to explain the relationship between the set of variables and miles per gallon (MPG) through a variable selection procedure called stepwise regression. This method consists of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion. Usually, this takes the form of a sequence of F-tests or t-tests, but other criteria are possible, such as adjusted R-squared, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), Mallows’s Cp, PRESS, or the false discovery rate.
Stepwise Regression and similar selection methods might not be the best for regression analysis [1], [2], but for the sake of this academic project, we will run this procedure.
We will start our selection with a component of the stepwise regression technique called Backward Elimination, in which each variable is considered for subtraction from the set of explanatory variables based on a prespecified criterion, here the Akaike information criterion as used by the R function step().
Let’s start by fitting a model that contains all the variables.
# Fitting a base model that contains all the explanatory variables
Fit_Base = lm(mpg ~ ., data = mtcars)
# We can check the model's formula in case it's needed.
formula(Fit_Base)
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
# Summary
summary(Fit_Base)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
With this first model, we can see that none of the coefficients is statistically significant at the 0.05 level. This can be due to multicollinearity (as mentioned in the Weight (wt) vs Displacement (disp) example in the Exploratory Analysis section) and to the lack of simplicity of the model (parsimony principle).
We will use the R function step() to look for the best model, starting from our Fit_Base model that already includes all the variables. As we just mentioned, this function uses the Akaike information criterion to select variables for subtraction: the lower the AIC, the better the fit. In other words, variables are removed one by one.
# Running the Backward Elimination Procedure
Fit_Best <- step(Fit_Base, direction = "backward")
## Start: AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - cyl 1 0.0799 147.57 68.915
## - vs 1 0.1601 147.66 68.932
## - carb 1 0.4067 147.90 68.986
## - gear 1 1.3531 148.85 69.190
## - drat 1 1.6270 149.12 69.249
## - disp 1 3.9167 151.41 69.736
## - hp 1 6.8399 154.33 70.348
## - qsec 1 8.8641 156.36 70.765
## <none> 147.49 70.898
## - am 1 10.5467 158.04 71.108
## - wt 1 27.0144 174.51 74.280
##
## Step: AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - vs 1 0.2685 147.84 66.973
## - carb 1 0.5201 148.09 67.028
## - gear 1 1.8211 149.40 67.308
## - drat 1 1.9826 149.56 67.342
## - disp 1 3.9009 151.47 67.750
## - hp 1 7.3632 154.94 68.473
## <none> 147.57 68.915
## - qsec 1 10.0933 157.67 69.032
## - am 1 11.8359 159.41 69.384
## - wt 1 27.0280 174.60 72.297
##
## Step: AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 1 0.6855 148.53 65.121
## - gear 1 2.1437 149.99 65.434
## - drat 1 2.2139 150.06 65.449
## - disp 1 3.6467 151.49 65.753
## - hp 1 7.1060 154.95 66.475
## <none> 147.84 66.973
## - am 1 11.5694 159.41 67.384
## - qsec 1 15.6830 163.53 68.200
## - wt 1 27.3799 175.22 70.410
##
## Step: AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 1 1.565 150.09 63.457
## - drat 1 1.932 150.46 63.535
## <none> 148.53 65.121
## - disp 1 10.110 158.64 65.229
## - am 1 12.323 160.85 65.672
## - hp 1 14.826 163.35 66.166
## - qsec 1 26.408 174.94 68.358
## - wt 1 69.127 217.66 75.350
##
## Step: AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - drat 1 3.345 153.44 62.162
## - disp 1 8.545 158.64 63.229
## <none> 150.09 63.457
## - hp 1 13.285 163.38 64.171
## - am 1 20.036 170.13 65.466
## - qsec 1 25.574 175.67 66.491
## - wt 1 67.572 217.66 73.351
##
## Step: AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - disp 1 6.629 160.07 61.515
## <none> 153.44 62.162
## - hp 1 12.572 166.01 62.682
## - qsec 1 26.470 179.91 65.255
## - am 1 32.198 185.63 66.258
## - wt 1 69.043 222.48 72.051
##
## Step: AIC=61.52
## mpg ~ hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - hp 1 9.219 169.29 61.307
## <none> 160.07 61.515
## - qsec 1 20.225 180.29 63.323
## - am 1 25.993 186.06 64.331
## - wt 1 78.494 238.56 72.284
##
## Step: AIC=61.31
## mpg ~ wt + qsec + am
##
## Df Sum of Sq RSS AIC
## <none> 169.29 61.307
## - am 1 26.178 195.46 63.908
## - qsec 1 109.034 278.32 75.217
## - wt 1 183.347 352.63 82.790
As we can see, the AIC of the model decreases every time we remove a suggested variable. The Backward Elimination procedure stops when removing any of the remaining variables would no longer lower the AIC (the <none> row has the lowest AIC in the final step).
This is what our best model fit looks like:
# Model Summary
summary(Fit_Best)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
# Confidence Intervals
confint(Fit_Best)
## 2.5 % 97.5 %
## (Intercept) -4.63829946 23.873860
## wt -5.37333423 -2.459673
## qsec 0.63457320 1.817199
## am 0.04573031 5.825944
From the summary of this model, we can interpret the following:
- Our best model fit explains about 85% of the variability in the MPG outcome (multiple R-squared = 0.8497, adjusted R-squared = 0.8336) with the regressors weight (wt), 1/4 mile time (qsec) and transmission (am).
- The model as a whole is significant in explaining the MPG outcome, as the p-value of the F-statistic is 1.21e-11, which is smaller than the 0.05 significance level.
- When the am variable is 0 (automatic), the intercept of the model is 9.6178 (the coefficient of the reference level). When am is 1 (manual), the intercept becomes 9.6178 + 2.9358 (the reference coefficient plus the coefficient of the manual transmission level); see the short sketch after this list.
- The p-value of the am coefficient (0.0467) suggests that manual transmission is significantly different from automatic transmission. Holding wt and qsec constant, manual transmission yields on average 2.9358 more miles per gallon than automatic transmission.
- Holding 1/4 mile time (qsec) and transmission (am) constant, as the weight of the car increases by 1 unit (1000 lbs), the miles per gallon decreases on average by 3.9165.
- Holding weight (wt) and transmission (am) constant, as the 1/4 mile time increases by 1 unit (1 second), the miles per gallon increases on average by 1.2259.
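As a quick illustration of the intercept shift mentioned above, the coefficients can be extracted directly from the fitted model (a minimal sketch):
# Intercept for automatic (am = 0) vs manual (am = 1) transmission,
# taken from the coefficients of the best model fit
coefs <- coef(Fit_Best)
intercept_automatic <- coefs["(Intercept)"]               # 9.6178
intercept_manual    <- coefs["(Intercept)"] + coefs["am"] # 9.6178 + 2.9358 = 12.5536
c(automatic = unname(intercept_automatic), manual = unname(intercept_manual))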
Assuming that the Miles per Gallon data for automatic and manual cars are normally distributed, we can test the hypothesis that the two populations of cars have equal means.
H0: M1 - M2 = 0. There is no difference in miles per gallon between automatic and manual transmission.
H1: M1 - M2 > 0. Miles per gallon are higher when the cars have manual transmission.
Where:
M1 is the average Miles per gallon for manual cars.
M2 is the average Miles per gallon for automatic cars.
# Welch two-sample t-test comparing MPG by transmission type
t.test(mpg ~ am, data = mtcars, alternative = "greater", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.9993
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -10.57662 Inf
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
Note that t.test(mpg ~ am, ...) takes the difference as automatic (group 0) minus manual (group 1), so the alternative = "greater" option used above tests the opposite direction of our H1; the one-sided p-value for our stated alternative (manual > automatic) is 1 - 0.9993 ≈ 0.0007. At the 5% significance level, there is therefore sufficient evidence to reject the null hypothesis (p-value < significance level): cars with manual transmission have higher miles per gallon.
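Equivalently, the test could be run with alternative = "less" so that the alternative hypothesis matches H1 directly (a sketch; output not shown here):
# One-sided Welch t-test in the direction of H1:
# mean(automatic) - mean(manual) < 0, i.e. manual cars have higher MPG
t.test(mpg ~ am, data = mtcars, alternative = "less", paired = FALSE, var.equal = FALSE, conf.level = 0.95)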
Consult Figure 5 in the appendix for this analysis.
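Figure 5 likely corresponds to the standard diagnostic plots of the fitted lm object, which could be reproduced as follows (a sketch):
# Residual diagnostics for the best model fit:
# Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (Cook's distance)
par(mfrow = c(2, 2))
plot(Fit_Best)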
- The points in the Residuals vs. Fitted plot seem to be randomly scattered, supporting the assumption of homoscedasticity of the errors.
- In the Normal Q-Q plot, the points mostly fall on the line, indicating that the residuals are approximately normally distributed.
- The Scale-Location plot confirms the constant variance assumption, as the points are randomly distributed.
- The Cook’s distances are all less than 1 (D < 1), which means that no observation is highly influential.
Based on the observations from our best model fit, we can conclude the following:
- Our best model fit explains about 85% of the variability in the MPG outcome (multiple R-squared = 0.8497, adjusted R-squared = 0.8336) with the regressors weight (wt), 1/4 mile time (qsec) and transmission (am).
- Holding wt and qsec constant, manual transmission yields on average 2.9358 more miles per gallon than automatic transmission.
- Holding 1/4 mile time (qsec) and transmission (am) constant, as the weight of the car increases by 1 unit (1000 lbs), the miles per gallon decreases on average by 3.9165.
- Holding weight (wt) and transmission (am) constant, as the 1/4 mile time increases by 1 unit (1 second), the miles per gallon increases on average by 1.2259.