Regression Models Course Project

by Sandy Sng
28 May 2018

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

  • Is an automatic or manual transmission better for MPG?
  • Quantify the MPG difference between automatic and manual transmissions

Pre-processing

library(datasets)
data(mtcars)
?mtcars
str(mtcars)

Analysis

Look at the correlation between “mpg” and all variables, specifically at “am”.

cor(mtcars$mpg,mtcars[,-1])
##            cyl       disp         hp      drat         wt     qsec
## [1,] -0.852162 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.418684
##             vs        am      gear       carb
## [1,] 0.6640389 0.5998324 0.4802848 -0.5509251

The correlation between “mpg” and “am” is positive (+0.5998). According to ?mtcars, “am” codes the transmission as 0 = automatic and 1 = manual, so higher “am” values (manual) go with higher “mpg”.

This suggests that cars with manual transmissions tend to have higher mpg.

  • cyl, disp, hp, wt, carb are negatively correlated to mpg
  • drat, qsec, vs, am, gear are positively correlated to mpg
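
The same correlations are easier to compare when ranked by strength. The snippet below is a minimal sketch that reloads a fresh copy of mtcars (so “am” is still numeric); the names r and r_sorted are illustrative and not part of the original analysis.

```r
library(datasets)
data(mtcars)

# Correlation of mpg with every other column, as a named vector
r <- cor(mtcars$mpg, mtcars[, -1])[1, ]

# Rank by absolute value so the strongest associations come first
r_sorted <- r[order(abs(r), decreasing = TRUE)]
round(r_sorted, 3)
```

Ranked this way, “am” sits in the lower half of the list, which is one reason to look at other variables as well later on.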

Convert “am” (0 = automatic, 1 = manual) to a labelled factor, then test this hypothesis with a two-sample t-test at the 95% confidence level.

mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <-c("Automatic", "Manual")
t.test(mtcars$mpg~mtcars$am,conf.level=0.95)
## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

Since the p-value (0.001374) is below 0.05, we reject the null hypothesis of equal means. Manual cars average 24.39 mpg against 17.15 mpg for automatics, a difference of about 7.2 mpg (95% confidence interval: 3.2 to 11.3 mpg).
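
The difference in means and its confidence interval can also be read directly from the test object rather than from the printed output. This is a small sketch; the names cars, tt, mean_diff and ci_manual_minus_auto are illustrative, and the data are reloaded so the block runs on its own.

```r
library(datasets)
data(mtcars)

# Work on a copy so the original data frame is left untouched
cars <- transform(mtcars,
                  am = factor(am, levels = c(0, 1),
                              labels = c("Automatic", "Manual")))

tt <- t.test(mpg ~ am, data = cars, conf.level = 0.95)

# tt$estimate holds the two group means (Automatic first, Manual second)
mean_diff <- unname(tt$estimate[2] - tt$estimate[1])

# t.test reports Automatic minus Manual; flip the sign and order
# to express the interval as Manual minus Automatic
ci_manual_minus_auto <- rev(-tt$conf.int[1:2])

mean_diff
ci_manual_minus_auto
```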

The t-test ignores every other variable, so we next fit multivariable regression models to see whether more of the variance in mpg can be explained.

Check which predictors are highly correlated with one another, since this collinearity inflates the variance of the coefficient estimates.

library(car) # to use vif function 
## Loading required package: carData
fit <- lm(mpg ~ . , data = mtcars)
vif(fit)
##       cyl      disp        hp      drat        wt      qsec        vs 
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873 
##        am      gear      carb 
##  4.648487  5.357452  7.908747

Note that the variance is inflated most for “disp”, “cyl”, and “wt”, each with a variance inflation factor above 15. We will add these three variables to the multivariable regression model m3.
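
To make the meaning of these numbers concrete, a variance inflation factor can be computed by hand: regress one predictor on all the others and take 1 / (1 - R2) of that auxiliary fit. The sketch below does this in base R without the car package; predictors and manual_vif are illustrative names, and the result should match vif(fit) above for these single-column terms.

```r
library(datasets)
data(mtcars)

# All columns except the outcome
predictors <- mtcars[, names(mtcars) != "mpg"]

manual_vif <- sapply(names(predictors), function(v) {
  # Auxiliary regression: predictor v on every other predictor
  others <- setdiff(names(predictors), v)
  aux <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux)$r.squared)
})

round(manual_vif, 2)
```

A factor of 15 for “wt” means the variance of its coefficient is about 15 times what it would be if “wt” were uncorrelated with the other predictors.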

Try several multivariable regression models to find a better fit than the simple model m1. For m1, the Residual Standard Error (RSE) is 4.902 and the Multiple R-squared (R2) is 0.3598: m1 predicts mpg with a typical error of about 4.9 mpg and explains only 36% of the variance. This is not the best model.

A better alternative model will have a lower RSE and higher multiple R-squared value.

m1 <- lm(mpg~am, data = mtcars)                         # RSE 4.902, multipleR2 0.3598
m2 <- lm(mpg~am + wt + cyl + hp, data = mtcars)         # RSE 2.509, multipleR2 0.849
m3 <- lm(mpg~am + wt + cyl + disp, data = mtcars)       # RSE 2.642, multipleR2 0.8327
m4 <- lm(mpg~am + wt + cyl + disp + hp + carb, data = mtcars)   # RSE 2.541, multipleR2 0.8566
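
The RSE and R-squared values quoted in the comments above can be collected programmatically instead of being copied from each summary(). The sketch below refits the four models on a fresh copy of the data. The adjusted R-squared column is an addition not used in the original analysis; it is included because plain R-squared can never decrease when predictors are added.

```r
library(datasets)
data(mtcars)

cars <- transform(mtcars,
                  am = factor(am, levels = c(0, 1),
                              labels = c("Automatic", "Manual")))

models <- list(
  m1 = lm(mpg ~ am, data = cars),
  m2 = lm(mpg ~ am + wt + cyl + hp, data = cars),
  m3 = lm(mpg ~ am + wt + cyl + disp, data = cars),
  m4 = lm(mpg ~ am + wt + cyl + disp + hp + carb, data = cars)
)

# One row per model: residual standard error, R-squared, adjusted R-squared
fit_stats <- t(sapply(models, function(m) {
  s <- summary(m)
  c(RSE = s$sigma, R2 = s$r.squared, adjR2 = s$adj.r.squared)
}))

round(fit_stats, 4)
```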

anova(m1, m2, m3, m4) # test whether adding more variables is necessary
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + cyl + hp
## Model 3: mpg ~ am + wt + cyl + disp
## Model 4: mpg ~ am + wt + cyl + disp + hp + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     27 170.00  3    550.90 28.4370 3.191e-08 ***
## 3     27 188.43  0    -18.43                      
## 4     25 161.44  2     26.99  2.0896    0.1448    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

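Of the models tried, m2 and m4 are the better options because they have the lowest RSE and the highest Multiple R-squared values. In the Analysis of Variance table, m2 is a statistically significant improvement over the simple linear model m1 (p = 3.191e-08). m3 has the same number of terms as m2 (0 degrees of freedom between them), so that row carries no test. The last row compares m4 with m3, and its two extra terms are not significant (p = 0.1448). m4 also has a slightly higher RSE than m2 (2.541 versus 2.509) and fewer residual degrees of freedom (25 versus 27). As such, it is not necessary to add 2 more variables (disp and carb) to move from m2 to m4.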
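
The table above compares each model with the one listed before it, so it contains no row that tests m4 against m2 directly. Because m2 is nested in m4, that comparison can be run on its own. This is a sketch with illustrative names (cars, cmp), refitting both models so the block is self-contained.

```r
library(datasets)
data(mtcars)

cars <- transform(mtcars,
                  am = factor(am, levels = c(0, 1),
                              labels = c("Automatic", "Manual")))

m2 <- lm(mpg ~ am + wt + cyl + hp, data = cars)
m4 <- lm(mpg ~ am + wt + cyl + disp + hp + carb, data = cars)

# F-test for the two terms (disp and carb) that m4 adds to m2
cmp <- anova(m2, m4)
cmp
```

A large p-value in the second row would support keeping the smaller model m2.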

Results

m2 is the better alternative model: its RSE (typical error) is 2.509 mpg and its Multiple R-squared is 0.849. It has a lower typical error than the simple linear model m1 (RSE = 4.902) and explains far more of the variance (84.9% versus only 36% for m1).

Given the above analysis, to answer the question of “Is an automatic or manual transmission better for MPG”, we have to consider three more variables: Weight, Number of cylinders, and Gross horsepower, instead of just the Transmission (auto/manual).
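
To quantify the transmission effect once weight, cylinders and horsepower are held fixed, the “am” coefficient of m2 and its confidence interval can be extracted as sketched below. The coefficient name amManual follows from the factor labels used earlier; cars, coef_table, am_effect and am_ci are illustrative names.

```r
library(datasets)
data(mtcars)

cars <- transform(mtcars,
                  am = factor(am, levels = c(0, 1),
                              labels = c("Automatic", "Manual")))

m2 <- lm(mpg ~ am + wt + cyl + hp, data = cars)

# Estimate, standard error, t value and p-value for the Manual effect
coef_table <- summary(m2)$coefficients
am_effect <- coef_table["amManual", ]

# 95% confidence interval for the adjusted Manual minus Automatic difference
am_ci <- confint(m2, "amManual", level = 0.95)

am_effect
am_ci
```

The estimate is the expected mpg difference between a manual and an automatic car with the same weight, cylinder count and horsepower. If the interval includes zero, the adjusted difference is not distinguishable from no difference at the 95% level.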

Appendix: Regression Diagnostics

To check the validity of the model (m2), we will have to check the 4 assumptions required to use a linear model as an explainer/predictor:

  1. The Y-values, or the errors, are independent (This won’t/can’t be checked as it requires knowledge of study design and data collection)
  2. The Y-values can be expressed as a linear function of the X-variable
  3. For a given value of X, the Y-values (or the errors) are normally-distributed
  4. The variation of observations around the regression line (the Residual SE) is constant (homoscedasticity)

par(mfrow = c(2,2))
plot(m2)

  1. Independence cannot be assessed from the plots above.
  2. Checked with the Residuals vs Fitted graph: the red trend line is relatively flat, so the linearity assumption seems valid.
  3. Checked with the Normal Q-Q graph: the points lie roughly on a diagonal line, so the errors appear to be normally distributed.
  4. Checked with the Scale-Location graph: this shows whether the spread of the residuals changes with the fitted values (i.e. whether errors become larger, on average, as the fitted value increases or decreases). The red trend line is relatively flat, so the data appear homoscedastic.

Residuals vs Leverage graph: this does not show any influential outliers. No points have extreme Cook’s distances, so no observation appears to exert too much influence or leverage on the fit.
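
The visual checks can be backed up with numbers. The sketch below runs a Shapiro-Wilk test on the residuals and lists the observations with the largest Cook’s distances. The 4/n cutoff is a common rule of thumb assumed here for illustration, not a threshold used in the original analysis, and the variable names are illustrative.

```r
library(datasets)
data(mtcars)

cars <- transform(mtcars,
                  am = factor(am, levels = c(0, 1),
                              labels = c("Automatic", "Manual")))

m2 <- lm(mpg ~ am + wt + cyl + hp, data = cars)

# Normality of residuals: a small p-value would cast doubt on assumption 3
sw <- shapiro.test(residuals(m2))
sw$p.value

# Influence: Cook's distance for every car, largest first
cd <- cooks.distance(m2)
head(sort(cd, decreasing = TRUE), 5)

# Cars exceeding the assumed rule-of-thumb cutoff of 4 / n
cutoff <- 4 / nrow(cars)
flagged <- names(cd)[cd > cutoff]
flagged
```

Any flagged car is only a candidate for a closer look; exceeding a rule-of-thumb cutoff does not by itself mean the observation should be removed.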