Executive Summary:

This is a project report for the Regression Models course which explores some linear multivariable regression models for MPG (miles per gallon) as an outcome variable versus AM (automatic(0)/manual(1) transmission) as a primary predictor along with other covariates in the mtcars dataset. Several candidate regression models are considered for selecting a reasonable fit using ANOVA for comparing the overall significance of rejecting the null hypothesis for each candidate. At the outset, selected metrics are evaluated to check sanity of the selected model. Overall conclusions for the selected model are summarized as follows:

Analysis:

Let us explore the mtcars dataset briefly.

data(mtcars)

# Let us take a brief examination of the mtcars dataset to understand the variables
dim(mtcars)
## [1] 32 11
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

The outcome variable is mpg and the primary predictor of interest is the factor variable am: 0 = automatic, 1 = manual.

regressors <- names(mtcars)
regressors <- regressors[!regressors %in% c("am","mpg")]

# Compute the correlations between the other candidate regressors and am
corvec <- sapply(regressors,function(x) cor(mtcars[,x],mtcars$am))

Let us rank the correlations between am and other candidate regressors. The absolute value is included in the process of ranking to include the impact of high correlations both positive and negative.

scor <- sort(abs(corvec), decreasing=FALSE)
scor
##       carb         vs       qsec         hp        cyl       disp 
## 0.05753435 0.16834512 0.22986086 0.24320426 0.52260705 0.59122704 
##         wt       drat       gear 
## 0.69249526 0.71271113 0.79405876

The top regressors in scor have low correlation and bottom regressors in scor are highly correlated to am. This ranking of correlations helps in deciding which covariates should be added for a better regression fit. Let us start adding one regressor at one time starting with the regressor and build up a set of models and then use ANOVA for comparison. Our first preference is to include covariates which are least correlated first in the prefered models.

fit1 <- lm(mpg ~ factor(am), mtcars)
fit2 <- lm(mpg ~ factor(am) + carb, mtcars)
fit3 <- lm(mpg ~ factor(am) + carb + vs, mtcars)
fit4 <- lm(mpg ~ factor(am) + carb + vs + qsec, mtcars)
fit5 <- lm(mpg ~ factor(am) + carb + vs + qsec + hp, mtcars)
fit6 <- lm(mpg ~ factor(am) + carb + vs + qsec + hp + cyl, mtcars)
fit7 <- lm(mpg ~ factor(am) + carb + vs + qsec + hp + cyl + disp, mtcars)
fit7 <- lm(mpg ~ factor(am) + carb + vs + qsec + hp + cyl + disp + wt, mtcars)
fit8 <- lm(mpg ~ factor(am) + carb + vs + qsec + hp + cyl + disp + wt + drat, mtcars)
fit9 <- lm(mpg ~ factor(am) + carb + vs + qsec + hp + cyl + disp + wt + drat + gear, mtcars)
anova(fit1,fit2,fit3,fit4,fit5,fit6,fit7,fit8,fit9)
## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + carb
## Model 3: mpg ~ factor(am) + carb + vs
## Model 4: mpg ~ factor(am) + carb + vs + qsec
## Model 5: mpg ~ factor(am) + carb + vs + qsec + hp
## Model 6: mpg ~ factor(am) + carb + vs + qsec + hp + cyl
## Model 7: mpg ~ factor(am) + carb + vs + qsec + hp + cyl + disp + wt
## Model 8: mpg ~ factor(am) + carb + vs + qsec + hp + cyl + disp + wt + 
##     drat
## Model 9: mpg ~ factor(am) + carb + vs + qsec + hp + cyl + disp + wt + 
##     drat + gear
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 333.68  1    387.22 55.1314 2.665e-07 ***
## 3     28 245.65  1     88.03 12.5332  0.001939 ** 
## 4     27 240.46  1      5.20  0.7398  0.399452    
## 5     26 204.62  1     35.84  5.1022  0.034657 *  
## 6     25 197.01  1      7.61  1.0834  0.309780    
## 7     23 150.71  2     46.30  3.2958  0.056905 .  
## 8     22 148.85  1      1.87  0.2659  0.611495    
## 9     21 147.49  1      1.35  0.1926  0.665206    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From ANOVA summary, we can see that only fit2, fit3, and fit5 have p-values (Pr(>F)) that point to rejecting the null hypothesis and have some significance. The other models in this candidate list are not eligible as they indicate null hypothesis. Let us select these two models, fit3 and fit5 for further analysis.

From lm summary for fit5, we see that the individual P-Values for the coefficients only show null hypothesis rejection for factor(am) and hp cofficients. Other coefficients show null hypothesis. Let us examine lm summary for fit3 model as shown below:

# fit3 <- lm(mpg ~ factor(am)*carb*vs, mtcars)
summary(fit3)
## 
## Call:
## lm(formula = mpg ~ factor(am) + carb + vs, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2803 -1.2308  0.4078  2.0519  4.8197 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.5174     1.6091  12.130 1.16e-12 ***
## factor(am)1   6.7980     1.1015   6.172 1.15e-06 ***
## carb         -1.4308     0.4081  -3.506  0.00155 ** 
## vs            4.1957     1.3246   3.168  0.00370 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.962 on 28 degrees of freedom
## Multiple R-squared:  0.7818, Adjusted R-squared:  0.7585 
## F-statistic: 33.45 on 3 and 28 DF,  p-value: 2.138e-09

All the coefficients of fit3 show very significant p-values in rejecting the null hypothesis. It appears that fit3 is a much better fit than fit5 based on adjusted R-squared (0.7585), and p-value of 2.14e-09. Note the intercept of 19.5174 miles/gallon reflects the mean of factor(am)0, which is the mean mpg for automatic transmission. The estimated coefficient for factor(am)1 which is the mean mpg for manual transmission is 19.5174 + 6.7980 = 26.3154 miles/gallon. In other words, the mean mpg for manual transmission is 6.7980 miles/gallon higher than the mean mpg for automatic transmission.

Since the analysis of variance is sensitive to assumption that the model residuals are normal. Let us do a Shapiro-Wilk test for normality of residuals.

shapiro.test(fit3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit3$residuals
## W = 0.946, p-value = 0.1109

The P-value indicates null hypothesis. The residual is normal. Let us select fit3 as the model.

Let us look at the statistics associated with the residual errors for fit3 model. See Appendix for the residual plot.

# residual plots and diagnostics for fit3 model
resid3 <- resid(fit3)

See the appendix for the plots of the residual error

Appendix:

coplot(mpg ~ factor(am) | as.factor(carb) + as.factor(vs), 
       data = mtcars,
       panel = panel.smooth,
       col = "purple", 
       bg = "light blue", 
       pch = 21,
       bar.bg = c(fac = "light blue"),
       rows = 1)

#Plot of fit3
par(mfrow=c(1,2))
plot(fit3)

# residual plots and diagnostics for fit3 model
resid3 <- resid(fit3)
par(mfrow=c(1,2))
plot(mtcars$am, 
     resid3,
     col = "purple", 
     bg = "purple", 
     pch = 21,
     ylab="Residual mpg",
     xlab="Transmission(am)", 
     main="Residuals for am"
) 
abline(0, 0)

plot(mtcars$carb, 
     resid3,
     col = "orange", 
     bg = "orange", 
     pch = 21,
     ylab="Residuals",
     xlab="# of carb", 
     main="Residuals for carb"
) 
abline(0, 0)  

plot(mtcars$vs, 
     resid3,
     col = "orange", 
     bg = "orange", 
     pch = 21,
     ylab="Residuals",
     xlab="VS Engine", 
     main="Residuals for vs"
) 
abline(0, 0)