Executive Summary

This project is for Motor Trend, a magazine about the automobile industry. We will analyze the mtcars dataset to explore the relationship between a set of variables and miles per gallon (MPG), the outcome. This dataset was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). We will focus on answering these two questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions

To answer these two questions, we will perform exploratory data analysis (EDA) and use hypothesis testing and linear regression to make statistical inferences.

Setup

Load packages

library(ggplot2)
library(statsr)
library(GGally)
library(knitr)
library(car)
library(MASS)

Load data

data(mtcars)

Part 1: Data

dim(mtcars)
## [1] 32 11

The data has 32 observations and 11 variables. Note that am encodes the transmission type (0 = automatic, 1 = manual). We will look at the first six rows to get more detail about this dataset.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Part 2: Exploratory Data Analysis (EDA)

We will visualize the relationship between am and mpg via a boxplot

ggplot(mtcars, aes(y = mpg, x = factor(am, labels = c("automatic", "manual")), fill = factor(am))) +
  geom_boxplot(colour = "black", size = 1) +
  xlab("Transmission") +ylab("MPG")

According to the plot, automatic cars have lower MPG than manual cars, which suggests automatic cars are less fuel efficient. However, this difference could be due to random chance, so we will perform a hypothesis test later to verify it.
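
As a quick numerical complement to the boxplot, the group means can be computed directly (a minimal sketch using base R; the same means reappear in the t-test output below):

# Mean mpg by transmission type (0 = automatic, 1 = manual)
tapply(mtcars$mpg, mtcars$am, mean)

Next, we will draw a pairwise plot to see which variables correlate with each other.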

ggpairs(mtcars,
        lower = list(continuous = "smooth"),
        diag = list(continuous = "barDiag"),
        axisLabels = "show")

Part 3: Hypothesis testing

Two sample t-test

We assume that the observations are independent and that mpg is approximately normally distributed within each group, and we set alpha to 5%. We will perform the hypothesis test with the t.test function and report a 95% confidence interval.
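
The normality assumption can be sanity-checked within each group, for example with a Shapiro-Wilk test (a quick sketch, not part of the original analysis):

# Shapiro-Wilk normality test for mpg in each transmission group
shapiro.test(mtcars$mpg[mtcars$am == 0])  # automatic
shapiro.test(mtcars$mpg[mtcars$am == 1])  # manual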

State the hypotheses

  • H0: the mean MPG of automatic and manual cars is equal
  • HA: the mean MPG of automatic and manual cars is not equal

Perform the test

test <- t.test(mpg ~ am, data = mtcars, var.equal = FALSE, paired = FALSE, conf.level = 0.95)
test
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

The p-value (0.001374) is much smaller than 0.05, so we reject the null hypothesis. We have strong evidence that automatic and manual cars differ in fuel efficiency (MPG).


Part 4: Modeling

Simple linear regression model

mtcars$amfactor <- factor(mtcars$am, labels = c("automatic", "manual")) 
summary(lm(mpg ~ amfactor, data = mtcars))$coef
##                 Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)    17.147368   1.124603 15.247492 1.133983e-15
## amfactormanual  7.244939   1.764422  4.106127 2.850207e-04

The intercept of 17.147368 is simply the mean MPG of automatic transmission cars. The slope of 7.244939 is the difference in mean MPG between manual and automatic transmissions. The p-value of 2.850207e-04 for this difference is significant, so according to this model manual transmission is more fuel efficient.
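
As a sanity check, the fitted coefficients reproduce the group means (a short sketch; the values match the t-test output above):

# Intercept = mean mpg of automatic cars; intercept + slope = mean for manual cars
mean(mtcars$mpg[mtcars$am == 0])  # 17.14737
mean(mtcars$mpg[mtcars$am == 1])  # 24.39231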

Multivariable linear regression model

Model selection

If we only take one predictor into consideration, the model may not be sufficient because we don't know whether other variables are also related to the outcome. Thus, we should develop a model that includes the effects of other variables.

summary(lm(mpg ~ cyl+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+gear+carb, data = mtcars))$coef
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## factor(vs)1  0.31776281  2.10450861  0.1509915 0.88142347
## factor(am)1  2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871

One problem with a multivariable linear regression (MLR) model is collinearity. If two or more highly correlated predictors are included in the model together, the standard errors are inflated and we get very unstable estimates of the slopes. We can assess collinearity with the variance inflation factor (VIF): the VIF for a predictor is 1/(1 - R²), where R² comes from regressing that predictor on all the other predictors.

Detecting collinearity
fitvif <- lm(mpg~cyl+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+gear+carb, data = mtcars)
vif(fitvif)
##        cyl       disp         hp       drat         wt       qsec 
##  15.373833  21.620241   9.832037   3.374620  15.164887   7.527958 
## factor(vs) factor(am)       gear       carb 
##   4.965873   4.648487   5.357452   7.908747

To have a better view of the result, we will use “kable” function

kable(vif(fitvif), align = "c")
Variable          VIF
cyl         15.373833
disp        21.620241
hp           9.832037
drat         3.374620
wt          15.164887
qsec         7.527958
factor(vs)   4.965873
factor(am)   4.648487
gear         5.357452
carb         7.908747

VIF values greater than 10 are considered large, so we should keep only a subset of these variables in the model. To decide which variables to keep, we will perform stepwise selection.

Stepwise selection method
fit <- lm(mpg~cyl+disp+hp+drat+wt+qsec+factor(vs)+factor(am)+gear+carb, data = mtcars)
step <- stepAIC(fit, direction = "both", trace = FALSE)
summary(step)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## factor(am)1  2.935837  1.4109045  2.080819 4.671551e-02
summary(step)$r.squared
## [1] 0.8496636

So, besides am, the variables wt and qsec are strongly related to mpg. The R-squared of 84.97% means that this model explains 84.97% of the variance in mpg.
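
Because R-squared never decreases when predictors are added, the adjusted R-squared is also worth checking (a quick sketch; output omitted here):

# Adjusted R-squared penalizes the number of predictors
summary(step)$adj.r.squared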

Nested likelihood ratio test
fit1 <- lm(mpg ~ factor(am), data = mtcars)
fit2 <- lm(mpg ~ factor(am) + wt, data = mtcars)
fit3 <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)
anova(fit1, fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + wt
## Model 3: mpg ~ factor(am) + wt + qsec
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     29 278.32  1    442.58 73.203 2.673e-09 ***
## 3     28 169.29  1    109.03 18.034 0.0002162 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We will add more variables to see whether the model improves.

fit4 <- lm(mpg ~ factor(am) + wt + qsec + hp, data = mtcars)
fit5 <- lm(mpg ~ factor(am) + wt + qsec + hp + drat, data = mtcars)
anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + wt
## Model 3: mpg ~ factor(am) + wt + qsec
## Model 4: mpg ~ factor(am) + wt + qsec + hp
## Model 5: mpg ~ factor(am) + wt + qsec + hp + drat
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     29 278.32  1    442.58 72.536 5.362e-09 ***
## 3     28 169.29  1    109.03 17.870 0.0002579 ***
## 4     27 160.07  1      9.22  1.511 0.2299925    
## 5     26 158.64  1      1.43  0.234 0.6326111    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the p-values for fit4 and fit5 are not significant: adding hp and drat does not significantly reduce the residual sum of squares. So wt, qsec, and am are the significant predictors for the model.
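
As a cross-check (a sketch, not part of the original analysis), the candidate models can also be compared by AIC, where lower is better:

# AIC comparison of the five nested models
AIC(fit1, fit2, fit3, fit4, fit5)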

Final model
fit_final <- lm(mpg ~ wt + factor(am) + qsec, data = mtcars)
summary(fit_final)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## factor(am)1  2.935837  1.4109045  2.080819 4.671551e-02
## qsec         1.225886  0.2886696  4.246676 2.161737e-04

We can answer the second research question via this table

Question: What is the MPG difference between automatic and manual transmissions? Answer: On average, manual transmission cars get 2.94 MPG more than automatic transmission cars, holding wt and qsec constant.
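
A 95% confidence interval for this difference can be read off with confint (a minimal sketch; because the coefficient's p-value is below 0.05, the interval excludes zero):

# 95% CI for the manual-vs-automatic coefficient in the final model
confint(fit_final)["factor(am)1", ]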

Regression diagnostics

Detecting collinearity
fitvif <- lm(mpg~wt+factor(am)+qsec,data=mtcars)
kable(vif(fitvif), align = "c")
Variable         VIF
wt          2.482951
factor(am)  2.541437
qsec        1.364339

All VIF values are well below 10, so collinearity is not a concern in the final model.

Linearity condition
plot(fit_final$residuals ~ mtcars$wt + mtcars$qsec, main = "Linearity Condition")

Residuals vs. fitted values
plot(fit_final$residuals ~ fit_final$fitted, main = "Residuals vs Fitted")

plot(abs(fit_final$residuals) ~ fit_final$fitted, main = "Absolute value of residuals vs. fitted")

Normality of residuals
qqnorm(fit_final$residuals)
qqline(fit_final$residuals)

Independence of residuals
plot(fit_final$residuals, main = "Independent residuals")
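
Finally, influence diagnostics can flag individual cars that drive the fit (a sketch going beyond the original checks):

# Leverage (hat values) of each observation in the final model
round(hatvalues(fit_final), 3)
# dfbetas: how much each car shifts the transmission coefficient
round(dfbetas(fit_final)[, "factor(am)1"], 3)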