Regression Models Final Course Project

The following analysis of the built-in mtcars data set in R was created as the final project for the Coursera Regression Models course. A description of the mtcars data set can be found at the following link: https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars

The objective of the project is to answer the following two questions using the built-in dataset:

“Is an automatic or manual transmission better for MPG?”
“Quantify the MPG difference between automatic and manual transmissions.”

This report includes all the exploratory data analysis conducted in order to reach a final conclusion on both of these questions.

Overall Conclusion

Overall, the analysis shown here revealed that manual transmission vehicles yield better MPG than automatic transmission vehicles. Manual transmission cars deliver an average of about 24.4 MPG, compared to about 17.1 MPG for automatic transmission cars.

Regression analysis shows that MPG can be explained as a function of the following regressors: transmission type (am), displacement (disp), and either weight (wt) or cylinders (cyl). The last two variables were found to be confounded with each other and thus it was concluded that the optimal model should only include one of them. For the purposes of this analysis, the variable chosen for additional residual analysis was weight.

Is an automatic or manual transmission better for MPG

Loading the Data

The dataset has 10 control variables, 1 target variable (mpg), and 32 different cars. The first rows of the dataset, after recoding several categorical variables as factors, are shown below.

library(knitr)
library(ggplot2)
data(mtcars);
mtcars$vs <- factor(mtcars$vs)
mtcars$am.label <- factor(mtcars$am, labels=c("Automatic","Manual")) # 0=automatic, 1=manual
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  am.label
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4    Manual
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4    Manual
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    Manual
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 Automatic
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 Automatic
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 Automatic

Box plot showing the MPGs by Transmission Type (Manual vs. Automatic)

boxplot(mpg ~ am.label, data = mtcars, col = c("green", "gray"), ylab = "Miles Per Gallon", xlab = "Transmission Type")

Not surprisingly, the boxplot shows that, overall, manual transmission cars provide better MPG than those with automatic transmissions.

Quantify the MPG difference between automatic and manual transmissions

Regression Analysis

To get an exact calculation of the average MPG values by transmission type, the means can be computed like this:

aggregate(mtcars$mpg,by=list(mtcars$am.label),FUN=mean)
##     Group.1        x
## 1 Automatic 17.14737
## 2    Manual 24.39231

This shows that, on average, manual transmission vehicles yield about 7.2 MPG more than those with automatic transmissions. A simple linear regression of mpg on transmission type shows how much of the variance in MPG is explained by transmission type alone.

simple_reg <- lm(mpg ~ factor(am), data=mtcars)
summary(simple_reg)
## 
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

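The p-value for the transmission coefficient is less than 0.0003, so we reject the null hypothesis that transmission type has no effect on mean MPG. The estimated coefficient (7.245) matches the difference between the two group means. The R-squared value shows that only about one third of the variance can be attributed to transmission type alone (R-squared = 0.36).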
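One way to quantify the uncertainty around the estimated difference is a confidence interval for the transmission coefficient. The following is a minimal sketch, assuming a fresh R session in which am is still the numeric 0/1 column from the built-in dataset; the object names (fit, group_means, mean_diff, slope, ci) are illustrative and not part of the original analysis.

```r
# Sketch: the dummy-variable slope and its 95% confidence interval.
fit <- lm(mpg ~ factor(am), data = mtcars)

# With a single 0/1 regressor, the slope equals
# mean(mpg | manual) - mean(mpg | automatic).
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
mean_diff <- unname(group_means["1"] - group_means["0"])
slope <- unname(coef(fit)["factor(am)1"])

# 95% confidence intervals for the intercept and the slope.
ci <- confint(fit, level = 0.95)
print(ci)
```

If the interval for the slope excludes zero, that agrees with rejecting the null hypothesis at the 5% level in the unadjusted model.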

anova_test <- aov(mpg ~ ., data = mtcars)
summary(anova_test)
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## cyl          1  817.7   817.7 102.591 2.3e-08 ***
## disp         1   37.6    37.6   4.717 0.04525 *  
## hp           1    9.4     9.4   1.176 0.29430    
## drat         1   16.5    16.5   2.066 0.16988    
## wt           1   77.5    77.5   9.720 0.00663 ** 
## qsec         1    3.9     3.9   0.495 0.49161    
## vs           1    0.1     0.1   0.016 0.90006    
## am           1   14.5    14.5   1.816 0.19657    
## gear         2    2.3     1.2   0.145 0.86578    
## carb         5   19.0     3.8   0.477 0.78789    
## Residuals   16  127.5     8.0                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results from the ANOVA suggest three variables that we should consider adding to a multivariate model: cyl (number of cylinders), disp (displacement, measured in cu. in.), and wt (weight of the car, in 1000 lbs).
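A caveat when reading this kind of table: for a single fitted model, R reports sequential (Type I) sums of squares, so each row depends on the terms listed before it. The sketch below illustrates this on a smaller model than the one above, using only cyl, disp, wt, and am as an assumption made for brevity; the object names are illustrative.

```r
# Sketch: sequential sums of squares depend on term order.
fit_am_last  <- lm(mpg ~ cyl + disp + wt + am, data = mtcars)
fit_am_first <- lm(mpg ~ am + cyl + disp + wt, data = mtcars)

# anova() on one lm object gives sequential (Type I) sums of squares.
ss_am_last  <- anova(fit_am_last)["am", "Sum Sq"]
ss_am_first <- anova(fit_am_first)["am", "Sum Sq"]
print(c(am_entered_last = ss_am_last, am_entered_first = ss_am_first))

# drop1() tests each term given all the others, so the result
# does not depend on the order in the formula.
print(drop1(fit_am_last, test = "F"))
```

Both fits are the same model, so their residual sums of squares are identical; only the way the explained variance is split across rows changes.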

multvar_reg <- lm(mpg ~ cyl + disp + wt + am, data = mtcars)
summary(multvar_reg)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + wt + am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.318 -1.362 -0.479  1.354  6.059 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.898313   3.601540  11.356 8.68e-12 ***
## cyl         -1.784173   0.618192  -2.886  0.00758 ** 
## disp         0.007404   0.012081   0.613  0.54509    
## wt          -3.583425   1.186504  -3.020  0.00547 ** 
## am           0.129066   1.321512   0.098  0.92292    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.642 on 27 degrees of freedom
## Multiple R-squared:  0.8327, Adjusted R-squared:  0.8079 
## F-statistic: 33.59 on 4 and 27 DF,  p-value: 4.038e-10

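The multivariate regression model shows that adding these 3 variables improves the model substantially, explaining over 80% of the variance (R-squared = 0.83, up from 0.36). In this model, cyl and wt are significant at the 0.05 level (p = 0.008 and p = 0.005), while disp (p = 0.545) and am (p = 0.923) are not. Because cyl, disp, and wt all reflect the size of the car, the models below compare reduced specifications: disp and am alone, and then disp and am together with either cyl or wt.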

multvar_reg_opt <- lm(mpg ~ + disp + am, data = mtcars)
summary(multvar_reg_opt)
## 
## Call:
## lm(formula = mpg ~ +disp + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6382 -2.4751 -0.5631  2.2333  6.8386 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.848081   1.834071  15.184 2.45e-15 ***
## disp        -0.036851   0.005782  -6.373 5.75e-07 ***
## am           1.833458   1.436100   1.277    0.212    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.218 on 29 degrees of freedom
## Multiple R-squared:  0.7333, Adjusted R-squared:  0.7149 
## F-statistic: 39.87 on 2 and 29 DF,  p-value: 4.749e-09
multvar_reg_opt2 <- lm(mpg ~ + cyl + disp + am, data = mtcars)
summary(multvar_reg_opt2)
## 
## Call:
## lm(formula = mpg ~ +cyl + disp + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0863 -1.7831 -0.4842  1.5987  6.6358 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 32.91686    2.77914  11.844 2.03e-12 ***
## cyl         -1.61822    0.69937  -2.314   0.0282 *  
## disp        -0.01559    0.01065  -1.463   0.1545    
## am           1.92873    1.33973   1.440   0.1611    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3 on 28 degrees of freedom
## Multiple R-squared:  0.7761, Adjusted R-squared:  0.7522 
## F-statistic: 32.36 on 3 and 28 DF,  p-value: 3.06e-09
multvar_reg_opt3 <- lm(mpg ~ + wt + disp + am, data = mtcars)
summary(multvar_reg_opt3)
## 
## Call:
## lm(formula = mpg ~ +wt + disp + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4890 -2.4106 -0.7232  1.7503  6.3293 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.675911   3.240609  10.700 2.12e-11 ***
## wt          -3.279044   1.327509  -2.470   0.0199 *  
## disp        -0.017805   0.009375  -1.899   0.0679 .  
## am           0.177724   1.484316   0.120   0.9055    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.967 on 28 degrees of freedom
## Multiple R-squared:  0.781,  Adjusted R-squared:  0.7576 
## F-statistic: 33.29 on 3 and 28 DF,  p-value: 2.25e-09

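The optimal model seems to explain about 78% of the variance (R-squared = 0.776 with cyl, 0.781 with wt) and includes transmission type, displacement, and either weight or cylinders. Note that in both of these models the am coefficient is smaller than in the simple regression (1.93 and 0.18 MPG, respectively) and is not significant at the 0.05 level.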
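The choice between these candidate models can also be checked more formally. The following is a sketch under stated assumptions rather than part of the original analysis: it assumes cyl is left as a numeric column, and it uses pairwise correlations, nested F-tests via anova(), and AIC as the comparison tools. The object names (base_fit, wt_fit, cyl_fit, cmp_wt, cmp_cyl) are illustrative.

```r
# Sketch: compare the three reduced models fitted above.
base_fit <- lm(mpg ~ disp + am, data = mtcars)
wt_fit   <- lm(mpg ~ wt + disp + am, data = mtcars)
cyl_fit  <- lm(mpg ~ cyl + disp + am, data = mtcars)

# Pairwise correlations among the size-related regressors.
print(round(cor(mtcars[, c("wt", "cyl", "disp")]), 2))

# Nested F-tests: does adding wt (or cyl) to disp + am reduce the
# residual sum of squares by more than chance would explain?
cmp_wt  <- anova(base_fit, wt_fit)
cmp_cyl <- anova(base_fit, cyl_fit)
print(cmp_wt)
print(cmp_cyl)

# AIC for all three fits (lower is better).
print(AIC(base_fit, wt_fit, cyl_fit))
```

Because each comparison adds a single term, the F-test p-value in the second row of each table equals the p-value of that term's t-test in the corresponding summary output.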

Residual Plot Analysis

par(mfrow = c(2, 2))
plot(multvar_reg_opt3)

The residual plots shared here are for the model explaining MPG as a function of transmission type, displacement, and weight (the third reduced model considered previously). These results suggest that the residuals are homoscedastic. They also show that the residuals are approximately normally distributed, with the exception of a few outliers.
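The visual reading of the plots can be complemented with a few numeric diagnostics. This is a minimal sketch, assuming the same model formula as above; the Shapiro-Wilk test and Cook's distance are added here as illustrations and were not part of the original analysis, and the object names are hypothetical.

```r
# Sketch: numeric checks to accompany the residual plots.
fit <- lm(mpg ~ wt + disp + am, data = mtcars)
res <- resid(fit)

# Shapiro-Wilk test of normality on the residuals.
# A small p-value would suggest a departure from normality.
sw <- shapiro.test(res)
print(sw)

# Cook's distance: the cars with the most influence on the fit.
cd <- cooks.distance(fit)
print(head(sort(cd, decreasing = TRUE), 3))

# Leverage (hat values): these sum to the number of coefficients.
hv <- hatvalues(fit)
print(head(sort(hv, decreasing = TRUE), 3))
```

With only 32 observations these tests have limited power, so they are best read alongside the plots rather than as a replacement for them.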