This report is a course project for the Regression Models course of the Data Science Specialization by Johns Hopkins University on Coursera.
In this report, we use the mtcars data set (available in R's built-in datasets package) to study the relationship between a car’s transmission and the number of miles per gallon of gasoline, along with a set of other variables that could affect this relationship.
You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG, the outcome). They are particularly interested in the following two questions:
- Is an automatic or manual transmission better for MPG?
- Quantify the MPG difference between automatic and manual transmissions.
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
A data frame with 32 observations on 11 (numeric) variables.
Setting the working directory, loading the data and the required libraries.
# Setting working directory first
setwd("~/Coursera/8_Data_Science_Specialization/7 Regression Models/Week 4/Assignment")
# Loading Libraries
library(ggplot2)
library(knitr)
library(kableExtra)
library(dplyr)
# Loading Data
data(mtcars)
Let’s take a look at the first rows of our data set.
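The table below can be produced with the knitr and kableExtra packages loaded above; a minimal sketch (the exact styling options are assumptions):
# Show the first six rows of mtcars as a formatted table
head(mtcars) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)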
|  | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
To answer this project’s questions, we will use linear models to quantify the relationship between the MPG variable and the remaining variables, and we will look for the best model fit in the subsequent regression analysis section.
Taking a look at how the MPG variable of interest is distributed (Figure 1 in the appendix), we can see that it roughly follows a normal distribution. The imprecision of this visual assessment may be due to the small sample size of only 32 observations. In any case, multiple linear regression assumes that the residuals are normally distributed, so we will check this assumption once we have the best model fit for this project.
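Figure 1 could be produced with a histogram of the outcome using the ggplot2 package loaded above; a sketch (the bin width and labels are assumptions):
# Histogram of the MPG outcome to inspect its distribution
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "steelblue", colour = "white") +
  labs(title = "Distribution of Miles per Gallon", x = "Miles per Gallon (mpg)", y = "Count")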
To know which transmission type is better for fuel consumption, we first plot boxplots of the MPG variable when am is Automatic or Manual (see Figure 2 in the appendix). We can clearly notice an increase in MPG when the transmission is Manual.
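A boxplot like the one in Figure 2 could be generated along these lines; a sketch (the factor labels and fill colours are assumptions):
# Boxplots of MPG by transmission type (0 = Automatic, 1 = Manual)
mtcars %>%
  mutate(Transmission = factor(am, labels = c("Automatic", "Manual"))) %>%
  ggplot(aes(x = Transmission, y = mpg, fill = Transmission)) +
  geom_boxplot() +
  labs(title = "Miles per Gallon by Transmission Type", y = "Miles per Gallon (mpg)")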
We can take a look into our data and explore the relationships between all the variables (Figure 3 in the appendix). Our outcome variable, miles per gallon, seems to have a relationship with several other variables, such as cyl, disp, hp, drat, wt, vs and am.
These relationships are easier to notice by looking at the correlogram (Figure 4 in the appendix). Correlations closer to 1 are shown in a darker tone of blue and correlations closer to -1 in a darker tone of red; in both cases, a darker tone indicates a stronger relationship.
Besides our interest in the outcome variable, we can also get an idea of whether or not to consider the combination of certain variables in a linear model depending on their correlation. This is because one variable could be explaining another one, e.g. Weight (wt) vs Displacement (disp) sharing a correlation of 0.89. We might explain this relationship because a heavy vehicle simply needs more power than a smaller one to produce comparable acceleration and load-hauling capacity, and that is usually achieved by a larger-displacement engine.
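The correlation matrix underlying Figures 3 and 4 can be computed directly; a sketch that prints the matrix rather than drawing the correlogram itself:
# Correlation matrix between all variables;
# e.g. wt and disp share a correlation of about 0.89
round(cor(mtcars), 2)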
In this section, we will try to find the best model fit to explain the relationship between the set of variables and miles per gallon (MPG) through a variable selection procedure called stepwise regression. This method consists of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion. Usually, this takes the form of a sequence of F-tests or t-tests, but other criteria are possible, such as adjusted R-squared, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), Mallows’s Cp, PRESS, or the false discovery rate.
Stepwise Regression and similar selection methods might not be the best for regression analysis [1], [2], but for the sake of this academic project, we will run this procedure.
We will start our selection with a component of the stepwise regression technique called Backward Elimination, in which each variable is considered for subtraction from the set of explanatory variables based on a prespecified criterion, here the Akaike information criterion as used by the R function step().
Let’s start by fitting a model that contains all the variables.
# Fitting a base model that contains all the explanatory variables
Fit_Base = lm(mpg ~ ., data = mtcars)
# We can check the model's formula in case it's needed.
formula(Fit_Base)
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
# Summary
summary(Fit_Base)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
With this first model, we can see that none of the coefficients is statistically significant at the 0.05 level. This can be due to multicollinearity (as mentioned in the Weight (wt) vs Displacement (disp) example in the Exploratory Analysis section) and to the lack of simplicity of the model (parsimony principle).
We will use the R function step() to look for the best model, starting from our Fit_Base model that already includes all the variables. As we just mentioned, this function uses the Akaike information criterion to select variables for subtraction: the lower the AIC, the better the fit. In other words, variables are removed one by one.
# Running the Backward Elimination Procedure
Fit_Best <- step(Fit_Base, direction = "backward")
## Start: AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - cyl 1 0.0799 147.57 68.915
## - vs 1 0.1601 147.66 68.932
## - carb 1 0.4067 147.90 68.986
## - gear 1 1.3531 148.85 69.190
## - drat 1 1.6270 149.12 69.249
## - disp 1 3.9167 151.41 69.736
## - hp 1 6.8399 154.33 70.348
## - qsec 1 8.8641 156.36 70.765
## <none> 147.49 70.898
## - am 1 10.5467 158.04 71.108
## - wt 1 27.0144 174.51 74.280
##
## Step: AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - vs 1 0.2685 147.84 66.973
## - carb 1 0.5201 148.09 67.028
## - gear 1 1.8211 149.40 67.308
## - drat 1 1.9826 149.56 67.342
## - disp 1 3.9009 151.47 67.750
## - hp 1 7.3632 154.94 68.473
## <none> 147.57 68.915
## - qsec 1 10.0933 157.67 69.032
## - am 1 11.8359 159.41 69.384
## - wt 1 27.0280 174.60 72.297
##
## Step: AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
##
## Df Sum of Sq RSS AIC
## - carb 1 0.6855 148.53 65.121
## - gear 1 2.1437 149.99 65.434
## - drat 1 2.2139 150.06 65.449
## - disp 1 3.6467 151.49 65.753
## - hp 1 7.1060 154.95 66.475
## <none> 147.84 66.973
## - am 1 11.5694 159.41 67.384
## - qsec 1 15.6830 163.53 68.200
## - wt 1 27.3799 175.22 70.410
##
## Step: AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
##
## Df Sum of Sq RSS AIC
## - gear 1 1.565 150.09 63.457
## - drat 1 1.932 150.46 63.535
## <none> 148.53 65.121
## - disp 1 10.110 158.64 65.229
## - am 1 12.323 160.85 65.672
## - hp 1 14.826 163.35 66.166
## - qsec 1 26.408 174.94 68.358
## - wt 1 69.127 217.66 75.350
##
## Step: AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - drat 1 3.345 153.44 62.162
## - disp 1 8.545 158.64 63.229
## <none> 150.09 63.457
## - hp 1 13.285 163.38 64.171
## - am 1 20.036 170.13 65.466
## - qsec 1 25.574 175.67 66.491
## - wt 1 67.572 217.66 73.351
##
## Step: AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - disp 1 6.629 160.07 61.515
## <none> 153.44 62.162
## - hp 1 12.572 166.01 62.682
## - qsec 1 26.470 179.91 65.255
## - am 1 32.198 185.63 66.258
## - wt 1 69.043 222.48 72.051
##
## Step: AIC=61.52
## mpg ~ hp + wt + qsec + am
##
## Df Sum of Sq RSS AIC
## - hp 1 9.219 169.29 61.307
## <none> 160.07 61.515
## - qsec 1 20.225 180.29 63.323
## - am 1 25.993 186.06 64.331
## - wt 1 78.494 238.56 72.284
##
## Step: AIC=61.31
## mpg ~ wt + qsec + am
##
## Df Sum of Sq RSS AIC
## <none> 169.29 61.307
## - am 1 26.178 195.46 63.908
## - qsec 1 109.034 278.32 75.217
## - wt 1 183.347 352.63 82.790
As we can see, the AIC of the model decreases every time we remove a suggested variable. The Backward Elimination procedure stops when removing any of the remaining variables would no longer lower the AIC (the <none> row has the lowest AIC in the final step).
This is what our best model fit looks like:
# Model Summary
summary(Fit_Best)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
# Confidence Intervals
confint(Fit_Best)
## 2.5 % 97.5 %
## (Intercept) -4.63829946 23.873860
## wt -5.37333423 -2.459673
## qsec 0.63457320 1.817199
## am 0.04573031 5.825944
From the summary of this model, we can interpret the following:
- Our best model fit explains about 85% of the variability in the MPG outcome (multiple R-squared = 0.8497, adjusted R-squared = 0.8336) with the regressors weight (wt), 1/4 mile time (qsec) and transmission (am).
- The model as a whole is significant in explaining the MPG outcome, as the p-value of the F-statistic is 1.21e-11, which is smaller than the 0.05 significance level.
- When the am variable is 0 (automatic), the intercept of the model is 9.6178 (the coefficient of the reference level). When am is 1 (manual), the intercept becomes 9.6178 + 2.9358 (the reference coefficient plus the coefficient of the manual transmission level); see the short sketch after this list.
- The p-value of the am coefficient (0.0467) suggests that manual transmission is significantly different from automatic transmission. Holding wt and qsec constant, manual transmission yields on average 2.9358 more miles per gallon than automatic transmission.
- Holding 1/4 mile time (qsec) and transmission (am) constant, as the weight of the car increases by 1 unit (1000 lbs), the miles per gallon decreases on average by 3.9165.
- Holding weight (wt) and transmission (am) constant, as the 1/4 mile time increases by 1 unit (1 second), the miles per gallon increases on average by 1.2259.
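As a quick illustration of the intercept shift mentioned above, the coefficients can be extracted directly from the fitted model (a minimal sketch):
# Intercept for automatic (am = 0) vs manual (am = 1) transmission,
# taken from the coefficients of the best model fit
coefs <- coef(Fit_Best)
intercept_automatic <- coefs["(Intercept)"]               # 9.6178
intercept_manual    <- coefs["(Intercept)"] + coefs["am"] # 9.6178 + 2.9358 = 12.5536
c(automatic = unname(intercept_automatic), manual = unname(intercept_manual))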
Assuming that the Miles per Gallon data for automatic and manual cars are normally distributed, we can test the hypothesis that the two populations of cars have equal means.
H0: M1 - M2 = 0. There is no difference in miles per gallon between automatic and manual transmission.
H1: M1 - M2 > 0. Miles per gallon are higher when the cars have manual transmission.
Where:
M1 is the average Miles per gallon for manual cars.
M2 is the average Miles per gallon for automatic cars.
# Welch two-sample t-test comparing MPG by transmission type
t.test(mpg ~ am, data = mtcars, alternative = "greater", paired = FALSE, var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.9993
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -10.57662 Inf
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
Note that t.test(mpg ~ am, ...) takes the difference as automatic (group 0) minus manual (group 1), so the alternative = "greater" option used above tests the opposite direction of our H1; the one-sided p-value for our stated alternative (manual > automatic) is 1 - 0.9993 ≈ 0.0007. At the 5% significance level, there is therefore sufficient evidence to reject the null hypothesis (p-value < significance level): cars with manual transmission have higher miles per gallon.
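Equivalently, the test could be run with alternative = "less" so that the alternative hypothesis matches H1 directly (a sketch; output not shown here):
# One-sided Welch t-test in the direction of H1:
# mean(automatic) - mean(manual) < 0, i.e. manual cars have higher MPG
t.test(mpg ~ am, data = mtcars, alternative = "less", paired = FALSE, var.equal = FALSE, conf.level = 0.95)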
Consult Figure 5 in the appendix for this analysis.
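Figure 5 likely corresponds to the standard diagnostic plots of the fitted lm object, which could be reproduced as follows (a sketch):
# Residual diagnostics for the best model fit:
# Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (Cook's distance)
par(mfrow = c(2, 2))
plot(Fit_Best)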
- The points in the Residuals vs. Fitted plot seem to be randomly scattered, supporting the assumption of homoscedasticity of the errors.
- In the Normal Q-Q plot, the points mostly fall on the line, indicating that the residuals are approximately normally distributed.
- The Scale-Location plot confirms the constant variance assumption, as the points are randomly distributed.
- The Cook’s distances are all less than 1 (D < 1), which means that no observation is highly influential.
Based on the observations from our best model fit, we can conclude the following:
- Our best model fit explains about 85% of the variability in the MPG outcome (multiple R-squared = 0.8497, adjusted R-squared = 0.8336) with the regressors weight (wt), 1/4 mile time (qsec) and transmission (am).
- Holding wt and qsec constant, manual transmission yields on average 2.9358 more miles per gallon than automatic transmission.
- Holding 1/4 mile time (qsec) and transmission (am) constant, as the weight of the car increases by 1 unit (1000 lbs), the miles per gallon decreases on average by 3.9165.
- Holding weight (wt) and transmission (am) constant, as the 1/4 mile time increases by 1 unit (1 second), the miles per gallon increases on average by 1.2259.