Regression Models Course Project

Executive Summary

In this report, we analyze the mtcars data set (extracted from the 1974 Motor Trend US magazine, and comprising fuel consumption and 10 aspects of automobile design and performance for 32 automobiles) in order to ascertain and quantify the MPG difference between automatic and manual transmissions.
Our multivariable linear regression model leads us to conclude that a manual transmission is better for MPG: with weight and quarter-mile time held constant, we estimate with 95% confidence a 0.05 to 5.83 increase in MPG for manual cars versus automatic.

Exploratory Data Analysis

The mtcars data set is a data frame with 32 observations on 11 numeric variables.

library( datasets ) ; data( mtcars ) ; str(mtcars, list.len=0)

## 'data.frame':    32 obs. of  11 variables:
##   [list output truncated]

The mpg ~ am boxplot (Appendix Figure 1) shows the difference between automatic and manual cars - the latter appear to have better MPG.
The variables of interest (mpg and am) are also markedly correlated with other variables in the data set (see Appendix Figure 2 for the scatterplot matrix). This suggests using covariate adjustment and investigating the effect of am on mpg adjusted for those variables.
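As a numeric complement to the scatterplot matrix, the correlations of mpg and am with the remaining variables can be tabulated. This is a minimal sketch on the raw, all-numeric mtcars data; the use of Pearson correlation is an assumption, since the report does not name a particular measure.

```r
library(datasets)

# Variables other than the two of interest
others <- setdiff(names(mtcars), c("mpg", "am"))

# Pearson correlations of each remaining variable with mpg and with am
cors <- cor(mtcars)[others, c("mpg", "am")]

# Sort rows by the strength of the association with mpg (strongest first)
cors_sorted <- round(cors[order(-abs(cors[, "mpg"])), ], 2)
print(cors_sorted)
```

Variables near the top of the table that also correlate with am are natural candidates for adjustment.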

Model Selection

For simplicity and brevity, handling outliers and including interaction terms are out of scope of this project.
Here is our approach to fitting and selecting a model:

fit the initial model (just with a single predictor) - mpg ~ am
fit the full model (with all variables as predictors) - mpg ~ .
fit a stepwise-selected model (returned by step function)
compare the models with anova function

Data Transformation

The continuous regressor variables (disp, hp, drat, wt and qsec) are candidates for centering (for a more interpretable intercept).
The remaining regressor variables (cyl, vs, am, gear and carb) are converted into factors.

mtcars2 <- within( mtcars, {
  disp <- disp-mean(disp) ; hp <- hp-mean(hp) ; drat <- drat-mean(drat) ; wt <- wt-mean(wt) ;
  qsec <- qsec-mean(qsec) ; cyl <- factor(cyl) ; vs <- factor(vs,labels=c("V","S")) ;
  am <- factor(am,labels=c("automatic","manual")) ; gear <- factor(gear) ; carb <- factor(carb) } )
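Centering changes the meaning of the intercept but leaves the slopes untouched. The sketch below illustrates this with a single regressor, wt, chosen here only for illustration (it is not one of the report's models): after centering, the intercept becomes the predicted mpg for a car of average weight, which in a one-regressor model equals the mean mpg.

```r
library(datasets)

# Intercept = predicted mpg at wt = 0 (not a meaningful car)
raw <- lm(mpg ~ wt, data = mtcars)

# Intercept = predicted mpg at the average weight
centered <- lm(mpg ~ I(wt - mean(wt)), data = mtcars)

slopes     <- c(raw = unname(coef(raw)[2]), centered = unname(coef(centered)[2]))
intercepts <- c(raw = unname(coef(raw)[1]), centered = unname(coef(centered)[1]))

print(slopes)             # the two slopes coincide
print(intercepts)         # only the intercept moves
print(mean(mtcars$mpg))   # matches the centered intercept
```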

Model with a single predictor

Firstly, we fit and analyse a minimal linear regression model of mpg with the single predictor am (2-level factor).

m0 <- lm(mpg ~ am, mtcars2) ; summary(m0)$coef  # fitting the initial model and getting the coefficients

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## ammanual     7.244939   1.764422  4.106127 2.850207e-04

summary(m0)$r.squared    # getting R2

## [1] 0.3597989

confint(m0)    # getting confidence intervals

##                2.5 %   97.5 %
## (Intercept) 14.85062 19.44411
## ammanual     3.64151 10.84837

The intercept, 17.1, is the mean mpg for automatic transmission cars, the reference level of the factor am. The coefficient ammanual, 7.2, is the difference from that reference level and is interpreted as the increase in mean mpg for manual transmission cars. The t-tests for both coefficients are statistically significant: the p-values, 1.133983e-15 and 2.850207e-04, are below 0.05 (a typical benchmark for the Type I error rate), and both confidence intervals lie entirely above zero. With an \(R^2\) (R-squared) value of 0.36, however, this model leaves 64% of the total variability unexplained, so adding more predictors (ideally, only the necessary ones) is worth exploring.
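The interpretation of the two coefficients can be checked directly against the group means. The sketch below rebuilds the factor am on a copy of the data (the name cars is used here only for illustration) and also compares the regression t-test with a pooled-variance two-sample t-test, to which it is equivalent.

```r
library(datasets)

# Copy of the data with am as a labelled factor, as in the report
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

fit         <- lm(mpg ~ am, data = cars)
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept = mean of the reference group; slope = difference between group means
print(coef(fit))
print(c(automatic = group_means[["automatic"]],
        difference = group_means[["manual"]] - group_means[["automatic"]]))

# Same p-value as the ammanual row of the coefficient table
pooled <- t.test(mpg ~ am, data = cars, var.equal = TRUE)
print(pooled$p.value)
```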

Model with all variables as predictors

Our second model includes all variables as predictors.

m_all <- lm(mpg ~ . , mtcars2)    # fitting the full model
summary(m_all)$coef[c("(Intercept)","ammanual"),]    # getting the coefficients of interest

##              Estimate Std. Error   t value   Pr(>|t|)
## (Intercept) 17.984247   5.324119 3.3778825 0.00414153
## ammanual     1.212116   3.213545 0.3771896 0.71131573

c( R2=summary(m_all)$r.squared, adj.R2=summary(m_all)$adj.r.squared )  # getting R2 and adjusted R2

##        R2    adj.R2 
## 0.8930749 0.7790215

confint(m_all, parm=c("(Intercept)","ammanual") )    # getting confidence intervals

##                 2.5 %    97.5 %
## (Intercept)  6.636157 29.332338
## ammanual    -5.637394  8.061625

None of the predictor coefficients in this model is significant (their p-values are greater than 0.05); only the intercept is (p-value 0.004). With the t-test for the coefficient ammanual, p-value 0.711, we fail to reject the null hypothesis of a zero difference in mpg between automatic and manual transmission cars. Likewise, the confidence interval (-5.637, 8.062) contains zero, so a zero difference cannot be ruled out. The standard error of ammanual is considerably larger than in the previous model (3.214 vs 1.764), which is the result of adding unnecessary and correlated variables. Given the large number of regressors included in the model, the adjusted \(R^2\) value, 0.7790215, is a better indicator of model performance than \(R^2\) (0.8930749).
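The adjusted \(R^2\) penalizes \(R^2\) for the number of estimated coefficients: adjusted \(R^2\) = 1 - (1 - \(R^2\))(n - 1)/(residual degrees of freedom). A minimal sketch of that arithmetic follows; it refits the full model on a copy of the data with the same factor coding (the names cars and full are illustrative, and centering is skipped because it does not affect \(R^2\)).

```r
library(datasets)

# Same factor coding as in the report; continuous regressors left uncentered
cars <- within(mtcars, {
  cyl  <- factor(cyl)
  vs   <- factor(vs)
  am   <- factor(am)
  gear <- factor(gear)
  carb <- factor(carb)
})
full <- lm(mpg ~ ., data = cars)

n      <- nrow(cars)
df_res <- df.residual(full)   # n minus the number of estimated coefficients
r2     <- summary(full)$r.squared

# Adjusted R^2 computed by hand, next to the value reported by summary()
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res
print(c(R2 = r2, adj.R2 = adj_r2, from.summary = summary(full)$adj.r.squared))
```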

Stepwise-selected model

This model is the result of a stepwise variable selection procedure. We use the step() function, which adds and drops terms to minimize AIC and is supposed to return the “best” model. The procedure balances the risk of overfitting against the risk of underfitting: the first inflates the standard errors of the other regressors, the second biases the coefficients of interest.

m2 <- step(m0, scope=list(lower=m0,upper=m_all), direction="both", trace=0) ; summary(m2)$coef  # get the model

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 18.897941  0.7193542 26.270704 2.855851e-21
## ammanual     2.935837  1.4109045  2.080819 4.671551e-02
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
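Since step() ranks candidate models by AIC, the selection can be cross-checked by comparing the AIC of the three models side by side. This is a sketch on a refitted copy of the models (the names cars, m_single, m_step and m_full are illustrative). AIC() and the criterion used internally by step() differ by a constant for a fixed data set, so the ranking is the same.

```r
library(datasets)

# Same factor coding as in the report; centering does not change the fit
cars <- within(mtcars, {
  cyl  <- factor(cyl)
  vs   <- factor(vs)
  am   <- factor(am)
  gear <- factor(gear)
  carb <- factor(carb)
})

m_single <- lm(mpg ~ am, data = cars)
m_step   <- lm(mpg ~ am + wt + qsec, data = cars)
m_full   <- lm(mpg ~ ., data = cars)

# Lower AIC is better; df counts the coefficients plus the error variance
aic <- AIC(m_single, m_step, m_full)
print(aic)
```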

c( R2=summary(m2)$r.squared, adj.R2=summary(m2)$adj.r.squared )    # getting R2 and adjusted R2

##        R2    adj.R2 
## 0.8496636 0.8335561

confint(m2, parm=c("(Intercept)","ammanual") )    # getting confidence intervals

##                   2.5 %    97.5 %
## (Intercept) 17.42441087 20.371471
## ammanual     0.04573031  5.825944

The resulting model is mpg ~ am + wt + qsec. Notably, all the coefficients in this model are significant (p-values less than 0.05), so we reject the null hypothesis of a zero difference in mpg between automatic and manual transmission cars. The confidence interval for the coefficient ammanual lies entirely above zero (0.046, 5.826), suggesting that the manual transmission is better than the automatic (by 2.936 mpg on average, with wt and qsec held constant). The \(R^2\) value is 0.85, i.e. this model explains 85% of the total variability.
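The confidence interval for ammanual can be reproduced by hand as estimate ± t-quantile × standard error, with 32 - 4 = 28 residual degrees of freedom. The sketch below plugs in the figures from the coefficient table above; the variable names are illustrative.

```r
est <- 2.935837    # ammanual estimate from the coefficient table
se  <- 1.4109045   # its standard error
df  <- 32 - 4      # observations minus estimated coefficients

# Two-sided 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df)
ci     <- est + c(-1, 1) * t_crit * se
print(round(ci, 3))
```

The result should agree with the confint() output up to rounding of the inputs.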

Model Comparison

We use the anova function to compare our three models.

anova( m0, m2, m_all )  # compare the nested models with sequential F-tests

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + qsec
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     28 169.29  2    551.61 34.3604 2.509e-06 ***
## 3     15 120.40 13     48.88  0.4685    0.9114    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

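Our “winner” is Model 2 (mpg ~ am + wt + qsec), the stepwise-selected model we created last: it fits significantly better than Model 1 (p-value 2.509e-06), while the extra regressors of Model 3 bring no significant improvement over it (p-value 0.9114).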
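The F statistics in the table can be reproduced from its own columns. Each one divides the mean drop in RSS (Sum of Sq / Df) by the residual mean square of the largest model (Model 3). The sketch below uses the rounded figures printed above, so small discrepancies in the last digits are expected; the variable names are illustrative.

```r
# Residual mean square of the largest model (Model 3)
rss_full <- 120.40
df_full  <- 15
ms_res   <- rss_full / df_full

# Mean drop in RSS for each comparison, divided by the common denominator
f_2_vs_1 <- (551.61 / 2)  / ms_res   # Model 2 against Model 1
f_3_vs_2 <- (48.88 / 13)  / ms_res   # Model 3 against Model 2

# Upper-tail F probabilities
p_2_vs_1 <- pf(f_2_vs_1, 2,  df_full, lower.tail = FALSE)
p_3_vs_2 <- pf(f_3_vs_2, 13, df_full, lower.tail = FALSE)

print(c(F = f_2_vs_1, p = p_2_vs_1))
print(c(F = f_3_vs_2, p = p_3_vs_2))
```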

Diagnostic Plots

We use the following residual plots for diagnostics (to assess quality of our final linear-regression model).

par(mfrow=c(1,2), mar=c(3,4,2,2), mgp=c(2,1,0)) ; plot(m2,which=c(1:2))  # plot diagnostics for the model

The left panel, the Residuals (\(e_i\)) versus Fitted values (\(\hat Y_i\)) plot, shows points scattered roughly symmetrically around the horizontal line at 0, with no apparent relation to the fitted values. There are no systematic patterns, suggesting homoscedasticity (the assumption of equal variance).

The right panel, the Normal Q-Q plot, shows points that lie roughly on a straight line (the \(45^\circ\) diagonal). This is consistent with normally distributed residuals, i.e. the assumption of normality (of the errors) appears to be fulfilled.
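The visual impression from the Q-Q plot could be complemented by a formal normality test. The sketch below applies the Shapiro-Wilk test to the residuals of the final model. This test is not part of the original analysis, and with only 32 observations it has limited power, so it is a rough check at best. The model is refitted on the raw data, which leaves the residuals unchanged.

```r
library(datasets)

# Final model refitted on the raw data (centering does not alter residuals)
fit <- lm(mpg ~ factor(am) + wt + qsec, data = mtcars)

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(residuals(fit))
print(sw)
```

A small p-value would count against normality; a large one means only that no departure was detected.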

Conclusion

Model 2 (mpg ~ am + wt + qsec) best describes the effect of transmission type on fuel consumption: adjusted for weight and quarter-mile time, there is a 2.9-mpg average difference in favor of manual transmission, and with 95% confidence, we estimate a 0.05 to 5.83 increase in MPG for manual cars versus automatic.

Appendix

Figure 1. Difference in `mpg` between automatic and manual transmission cars

par( mar=c(4,4,1,2) , mgp=c(2,1,0) ) ; boxplot( mpg ~ factor(am, labels=c("Automatic","Manual")), data=mtcars, xlab="Miles Per Gallon (MPG)", ylab="", col=c("red","blue"), horizontal=TRUE )

Figure 2. Scatterplot matrix from the `mtcars` data set

pairs( mtcars, pch=18, col=c("red","blue")[mtcars$am+1] )  #colors: red - automatic, blue - manual cars

Vladimir Maganov

2022-04-23
