Regression Models Course Project
Anna Huynh
12/30/2020

1. Overview

The assignment offers an assumption that You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). Our analysis is eventually to answer two primary questions which are summarized as below:

1- Is an automatic or manual transmission better for MPG?

Manual transmission is better for fuel consumption (mpg: Miles/gallon)

2- Quantify the MPG difference between automatic and manual transmissions?

We basically run two models of linear regression to figure out whether an additional predictor significantly reduces the residual’s variance, the predictor therefore is deemed important. We, then used Analysis of variance (ANOVA) to quantify the significance of additional and confidently made conclusion upon the model serving better our target.

The working dataset is “mtcars”. The link is enclosed: (https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars)

2. Getting Data

library(datasets)
data(mtcars)

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

3. Exploratory Data Analysis

3.1. Checking Missing Values

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

3.2. Detecting Outliers

# Use Percentiles method for outliers detection
# mpg _ Miles/(US) gallon
lower_bound_mpg <- quantile(mtcars$mpg , 0.025)
upper_bound_mpg <- quantile(mtcars$mpg, 0.975)
outlier_ind_mpg <- which(mtcars$mpg < lower_bound_mpg | mtcars$mpg > upper_bound_mpg)
# Print outliers
head(mtcars[outlier_ind_mpg, ])
##                 mpg cyl disp hp drat    wt qsec vs am gear carb
## Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.9  1  1    4    1
# disp _ Displacement (cu.in.)
lower_bound_disp <- quantile(mtcars$disp , 0.025)
upper_bound_disp <- quantile(mtcars$disp, 0.975)
outlier_ind_disp <- which(mtcars$disp < lower_bound_disp | mtcars$disp > upper_bound_disp)
# Print outliers
head(mtcars[outlier_ind_disp, ])
##                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Toyota Corolla     33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
# hp _ Gross horsepower
lower_bound_hp <- quantile(mtcars$hp , 0.025)
upper_bound_hp <- quantile(mtcars$hp, 0.975)
outlier_ind_hp <- which(mtcars$hp < lower_bound_hp | mtcars$hp > upper_bound_hp)
# Print outliers
head(mtcars[outlier_ind_hp, ])
##                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Honda Civic   30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
# qsec _ 1/4 mile time
lower_bound_qsec <- quantile(mtcars$qsec , 0.025)
upper_bound_qsec <- quantile(mtcars$qsec, 0.975)
outlier_ind_qsec <- which(mtcars$qsec < lower_bound_qsec | mtcars$qsec > upper_bound_qsec)
# Print outliers
head(mtcars[outlier_ind_qsec, ])
##                 mpg cyl  disp  hp drat   wt qsec vs am gear carb
## Merc 230       22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.17 14.5  0  1    5    4
# carb 
lower_bound_carb <- quantile(mtcars$carb , 0.025)
upper_bound_carb <- quantile(mtcars$carb, 0.975)
outlier_ind_carb <- which(mtcars$carb < lower_bound_carb | mtcars$carb > upper_bound_carb)
# Print outliers
head(mtcars[outlier_ind_qsec, ])
##                 mpg cyl  disp  hp drat   wt qsec vs am gear carb
## Merc 230       22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.17 14.5  0  1    5    4

3.3. Data Visualization

require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.3
hist(mtcars$mpg, col = "green")
rug(mtcars$mpg)
abline(v = median(mtcars$mpg), col = "magenta", lwd = 4)

hist(mtcars$disp, col = "blue")
rug(mtcars$disp)
abline(v = median(mtcars$disp), col = "magenta", lwd = 4)

hist(mtcars$hp, col = "gold")
rug(mtcars$hp)
abline(v = median(mtcars$hp), col = "magenta", lwd = 4)

hist(mtcars$qsec, col = "maroon")
rug(mtcars$qsec)
abline(v = median(mtcars$qsec), col = "magenta", lwd = 4)

hist(mtcars$carb, col = "lightblue")
rug(mtcars$carb)
abline(v = median(mtcars$carb), col = "magenta", lwd = 4)

3.4. Testing Variables correlation

library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.3
## corrplot 0.84 loaded
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.0.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 4.0.3
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
# # Mark the insignificant coefficients according to the specified p-value significance level
cor_9 <- rcorr(as.matrix(mtcars))
mtcars_cor <- cor_9$r
p_mat <- cor_9$P

col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(mtcars_cor, method = "color", col = col(200),  
         type = "upper", order = "hclust", 
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "darkblue", tl.srt = 45, #Text label color and rotation
         # Combine with significance level
         p.mat = p_mat, sig.level = 0.01,  
         # hide correlation coefficient on the principal diagonal
         diag = FALSE 
         )

# Get some colors with Heatmap()
col <- colorRampPalette(c("darkblue", "white", "darkorange"))(20)
mtcars_hea <- cor(mtcars)
heatmap(x = mtcars_hea, col = col, symm = TRUE)

## 4. Data Analysis

4.1. Data Preparation

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Transform transmission into values of automatic (<=0) & manual (> 0) 
mtcars2 <- mtcars %>%
        arrange(mpg) %>%
        mutate(auto_manu = ifelse(am <= 0, 0, 1))

mtcars2 <- data.frame(mtcars2)

head(mtcars2)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb auto_manu
## 1 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4         0
## 2 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4         0
## 3 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4         0
## 4 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4         0
## 5 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4         0
## 6 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8         1

4.2. Diagnosis Analysis

Using VIF (Variance Inflation Factor) to describe the increase in the variance of a coefficient due to the correlation of its regressor with the other regressor. If the regressor is strongly correlated with others, hence will increase their VIF’s. So, we made linear regression to model “mpg” factor in terms of the other regressors, including transmission (0 = automatic, 1 = manual).

library(car)
## Warning: package 'car' was built under R version 4.0.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.0.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
# Run linear regression (03 regressors) with transmission
mdl3_with_tran <- lm(mpg ~ drat+gear+as.factor(auto_manu), mtcars2)
data.frame(vif(mdl3_with_tran))
##                      vif.mdl3_with_tran.
## drat                            2.253884
## gear                            3.001611
## as.factor(auto_manu)            3.114484
# Run linear regression (03 regressors) without transmission
mdl3_without_tran <- lm(mpg ~ drat+gear, mtcars2)
data.frame(vif(mdl3_without_tran))
##      vif.mdl3_without_tran.
## drat               1.958689
## gear               1.958689
# Run linear regression (05 regressors) with transmission
mdl5_with_tran <- lm(mpg ~ drat+gear+qsec+vs+as.factor(auto_manu), mtcars2)
data.frame(vif(mdl5_with_tran))
##                      vif.mdl5_with_tran.
## drat                            2.845364
## gear                            3.240614
## qsec                            3.415864
## vs                              3.517920
## as.factor(auto_manu)            3.351250
# Run linear regression (05 regressors) without transmission
mdl5_without_tran <- lm(mpg ~ drat+gear+qsec+vs, mtcars2)
data.frame(vif(mdl5_without_tran))
##      vif.mdl5_without_tran.
## drat               2.387146
## gear               2.473239
## qsec               3.271375
## vs                 3.505223
# Run linear regression (06 regressors) with transmission
mdl6_with_tran <- lm(mpg ~ wt+disp+hp+carb+cyl+as.factor(auto_manu), mtcars2)
data.frame(vif(mdl6_with_tran))
##                      vif.mdl6_with_tran.
## wt                              9.720042
## disp                           18.236233
## hp                              8.949862
## carb                            4.955221
## cyl                             7.913536
## as.factor(auto_manu)            3.109477
# Run linear regression (06 regressors) without transmission
mdl6_without_tran <- lm(mpg ~ wt+disp+hp+carb+cyl, mtcars2)
data.frame(vif(mdl6_without_tran))
##      vif.mdl6_without_tran.
## wt                 6.434340
## disp              17.171084
## hp                 8.946679
## carb               4.068529
## cyl                6.958400

Observation:

  1. In the model of 06 regressors,the variance in the estimated coefficient of “wt”(Weight (1000 lbs)) was significantly different (9.72 i/o 6.43) in model with i/o without transmission. We might guess that most of the variance inflation of “wt” is due to influence of “am” (transmission).

  2. In the model of 05 and 03 regressors, omitting transmission has had slightly effect the VIF for these regressors, so chances are that they are not strongly correlated as expected from correlation matrix.

From VIF Analysis, we ended up with the most correlated regressor (weight) and the most uncorrelated regressor (“drat” _ Rear axle ratio) for further analysis of auto/manual transmission and fuel consumption (Miles/ gallon).

Let’s see how much important of Rear axle ratio is to a vehicle?

Definition of Rear axle ratio: Axle ratio is important because it helps determine how much towing capacity, torque, or fuel-economy a truck is going to get. A vehicle with a higher numerical axle ratio will have more towing capacity. However, with the added towing capacity comes decreased fuel economy and a lower top speed.

We, then, created two models “fit”. Car productivity(“mpg”) is the outcome and the three predictors are transmission(“auto_manu”), Rear axle ratio(“drat”), and weight(“wt”). The data is “mtcars2”. Next, we considered the interaction between transmission and Rear axle ratio to see how that affects changes in rates of car productivity.

fit1 <- lm(mpg ~ as.factor(auto_manu) + drat -1, mtcars2) # model 01
summary(fit1)$coef
##                         Estimate Std. Error     t value   Pr(>|t|)
## as.factor(auto_manu)0 -1.9498833   7.073285 -0.27566869 0.78475740
## as.factor(auto_manu)1  0.8571777   8.713579  0.09837263 0.92231322
## drat                   5.8111432   2.129833  2.72844960 0.01069548
fit2 <- lm(mpg ~ as.factor(auto_manu) * wt -1, mtcars2) # model 02
summary(fit2)$coef
##                           Estimate Std. Error   t value     Pr(>|t|)
## as.factor(auto_manu)0    31.416055  3.0201093 10.402291 4.001043e-11
## as.factor(auto_manu)1    46.294478  3.0101489 15.379465 3.488923e-15
## wt                       -3.785908  0.7856478 -4.818836 4.551182e-05
## as.factor(auto_manu)1:wt -5.298360  1.4446993 -3.667449 1.017148e-03

Observation:

Model 01: With uncorrelated regressor, two predictors of transmission and Rear axle ratio - The percentage of mpg with transmission = 0 (automatic): -1.9498833. Or for one unit change in Rear axle ratio with automatic transmission, the expected change in the mpg decreased 1.95 (Miles/gallon)

  • The percentage of mpg with transmission = 1 (manual): 0.8571777. In the other words, for one unit change in Rear axle ratio with manual transmission, the expected change in the mpg increased 0.86 (Miles/gallon)

  • For one unit change in Rear axle ratio with both automatic & manual transmission, the expected change in the mpg increased 5.81 (Miles/gallon)

Model 02: With correlated regressor, one predictor of transmission and weight is held constant - The percentage of mpg with transmission = 0 (automatic): 31.416055. Or for one unit change in weight with automatic transmission, the expected change in the mpg increased 31.41 (Miles/gallon)

  • The percentage of mpg with transmission = 1 (manual): 46.294478. In the other words, for one unit change in weight with manual transmission, the expected change in the mpg increased 46.29 (Miles/gallon)

  • For one unit change in weight with both automatic & manual transmission, the expected change in the mpg decreased 3.79 (Miles/gallon)

  • The distance of the weight change in percent of manual transmission from that of automatic transmission was decreasing of 9.08 (Miles/gallon)(= -(5.29 + 3.79)).

In both models, we might have conclusion that manual transmission is better for car productivity((Miles/(US) gallon)

library(car)

# Plot "fit" models
qqPlot(fit1, frame = FALSE, pch = 19, col = "magenta", bg = "steelblue", cex = 2)

## [1] 10 32
qqPlot(fit2, frame = FALSE, pch = 19, col = "green", bg = "steelblue", cex = 2)

## [1] 26 31
# Test residuals for normality
shapiro.test(fit2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit2$residuals
## W = 0.94234, p-value = 0.08724
shapiro.test(fit1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit1$residuals
## W = 0.98503, p-value = 0.9253
# Use Analysis of variance (ANOVA) to quantify the significance of additional 
# regressors.
anova(fit1, fit2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ as.factor(auto_manu) + drat - 1
## Model 2: mpg ~ as.factor(auto_manu) * wt - 1
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     29 573.64                                  
## 2     28 188.01  1    385.63 57.432 2.961e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation:

1- The Shapiro-Wilk p-value of 0.08724 & 0.9253 fail to reject normality (Null Hypothesis), supporting confidence in our analysis of variance.

2- Based on the calculated p-value (2.961e-08 < 0.05), We are confident that model 02 is significantly better than model 01. So at least one of the two additional regressors (transmission, weight) is significant.