Regression Models Course Project

Executive Summary

This report, commissioned by Motor Trend, an automobile industry magazine, analyzes road-test data for cars extracted from the 1974 Motor Trend US magazine.

The analysis explores the relationship between a set of vehicle variables and miles per gallon (MPG).

The analysis fits single-variable and multivariable regression models, interprets the results to describe the relationships between the variables, plots those relationships and the residuals, and runs model diagnostics.

The two primary questions to address are:

Is an automatic or manual transmission better for MPG?
Quantify the MPG difference between automatic and manual transmissions.

Data Processing and Analysis

For this analysis, we will use the mtcars dataset which has data for 32 cars.

The cars are Mazda RX4, Mazda RX4 Wag, Datsun 710, Hornet 4 Drive, Hornet Sportabout, Valiant, Duster 360, Merc 240D, Merc 230, Merc 280, Merc 280C, Merc 450SE, Merc 450SL, Merc 450SLC, Cadillac Fleetwood, Lincoln Continental, Chrysler Imperial, Fiat 128, Honda Civic, Toyota Corolla, Toyota Corona, Dodge Challenger, AMC Javelin, Camaro Z28, Pontiac Firebird, Fiat X1-9, Porsche 914-2, Lotus Europa, Ford Pantera L, Ferrari Dino, Maserati Bora, Volvo 142E.

It is a data frame with 32 observations on 11 variables.

[, 1]  mpg   Miles/(US) gallon
[, 2]  cyl   Number of cylinders
[, 3]  disp  Displacement (cu.in.)
[, 4]  hp    Gross horsepower
[, 5]  drat  Rear axle ratio
[, 6]  wt    Weight (lb/1000)
[, 7]  qsec  1/4 mile time
[, 8]  vs    Engine shape (0 = V-shaped, 1 = straight)
[, 9]  am    Transmission (0 = automatic, 1 = manual)
[,10]  gear  Number of forward gears
[,11]  carb  Number of carburetors

Because the transmission regressor “am” is a numeric variable coded 0 for automatic and 1 for manual, we convert it to a factor with levels “Auto” and “Manual” and store it in a new variable, “Transmission”. The modified dataset is saved in a new object, mtcars1.

library(datasets)
library(dplyr)
library(ggplot2)

mtcars1 <- mtcars
mtcars1$Transmission <- as.factor(mtcars1$am)
levels(mtcars1$Transmission) <- c ("Auto", "Manual")
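As a quick sanity check on the relabelling, the original 0/1 coding can be cross-tabulated against the new factor. This is only a sketch; the object name `check` is introduced here for illustration and is not part of the original analysis.

```r
library(datasets)

# Rebuild the factor exactly as in the report.
mtcars1 <- mtcars
mtcars1$Transmission <- as.factor(mtcars1$am)
levels(mtcars1$Transmission) <- c("Auto", "Manual")

# Cross-tabulate the original coding against the new labels:
# every 0 should land in "Auto" and every 1 in "Manual".
check <- table(am = mtcars1$am, Transmission = mtcars1$Transmission)
print(check)
```

All off-diagonal cells should be zero if the level order matches the intended coding.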

Is an automatic or manual transmission better for MPG?

To answer this question, we plot the distribution of MPG for automatic and manual transmissions.

plot(mtcars1$Transmission, mtcars1$mpg, main = "Car Mileage-Auto vs Manual Transmission", xlab ="Transmission", ylab = "Mileage")

The plot shows that cars with manual transmissions tend to have better gas mileage than cars with automatic transmissions.
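The visual impression from the box plot can be backed by simple group summaries. The sketch below computes the mean and median mpg per transmission type; the names `mpg_mean`, `mpg_median` and `group_summary` are illustrative only.

```r
library(datasets)

# Mean and median mpg by transmission (0 = automatic, 1 = manual).
mpg_mean   <- tapply(mtcars$mpg, mtcars$am, mean)
mpg_median <- tapply(mtcars$mpg, mtcars$am, median)

# Combine into a small table with readable column names.
group_summary <- rbind(mean = mpg_mean, median = mpg_median)
colnames(group_summary) <- c("Auto", "Manual")
print(round(group_summary, 2))
```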

Quantify the MPG difference between automatic and manual transmissions.

We will fit several linear regression models with mpg as the outcome and the other variables as regressors.

The first model fits the effect of transmission alone on miles per gallon (mpg). We use a two-sided test at significance level 0.05; the null hypothesis is that transmission has no effect on mpg.

fitlm1 <- lm(mpg~am, mtcars)
summary(fitlm1)

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

In the fitted model, the intercept (17.147) is the mean mpg of cars with automatic transmissions, and the am coefficient (7.245) is the estimated difference in mean mpg between manual and automatic cars. Manual cars therefore average about 7.24 mpg more than automatic cars, i.e. roughly 24.39 mpg.

The p value of the am coefficient (0.000285) is below 0.05, so we reject the null hypothesis and conclude that transmission type is associated with mileage.

The R-squared value of 0.3598 means, however, that transmission alone explains only about 36% of the variance in mpg, so this model is not a good fit by itself and other variables need to be considered.
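With a single 0/1 regressor, the regression coefficients correspond directly to group means. The following sketch checks this and adds a 95% confidence interval for the transmission effect; the object names (`fit`, `diff_means`, `ci_am`) are chosen here for illustration.

```r
library(datasets)

fit <- lm(mpg ~ am, data = mtcars)

# Intercept = mean mpg of automatic cars; slope = manual mean minus automatic mean.
auto_mean   <- mean(mtcars$mpg[mtcars$am == 0])
manual_mean <- mean(mtcars$mpg[mtcars$am == 1])
diff_means  <- manual_mean - auto_mean

# 95% confidence interval for the unadjusted difference in mean mpg.
ci_am <- confint(fit, "am", level = 0.95)

print(c(intercept = unname(coef(fit)[1]), auto_mean = auto_mean))
print(c(slope = unname(coef(fit)["am"]), diff_means = diff_means))
print(ci_am)
```

An interval that excludes zero agrees with the small p value reported for am in the model summary.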

Multiple variable models

We will fit a linear model to measure the effect of all the variables on miles per gallon (mpg). We again use a two-sided test at significance level 0.05; the null hypothesis for each regressor is that it has no effect on mpg.

fitlm2 <- lm(mpg~., mtcars)
summary(fitlm2)$coef

##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## vs           0.31776281  2.10450861  0.1509915 0.88142347
## am           2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871

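In this model, the p values of all individual coefficients are greater than 0.05, so we fail to reject the null hypothesis for any single regressor. This does not mean the regressors are unrelated to mpg; including all regressors at once changes the estimated effect of each one (the am coefficient falls from 7.245 to 2.520) and inflates the standard errors, which points to correlation among the regressors.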
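One way to see that non-significant individual coefficients do not imply a useless model is to compare them with the overall F-test, which asks whether all slopes are jointly zero. This is a sketch using the F statistic stored in the model summary; `overall_p` and `min_slope_p` are names introduced here.

```r
library(datasets)

full <- lm(mpg ~ ., data = mtcars)
s <- summary(full)

# Overall F-test: p value for the hypothesis that all slopes are zero.
f <- s$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Smallest individual p value among the slopes (intercept row dropped).
min_slope_p <- min(s$coefficients[-1, 4])

print(c(overall_p = overall_p, min_slope_p = min_slope_p))
```

A small overall p value together with large individual p values is a typical symptom of collinear regressors.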

Calculating the variance inflation factors (VIFs) for each of the regressors:

library(car)
vif(fitlm2)

##       cyl      disp        hp      drat        wt      qsec        vs 
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873 
##        am      gear      carb 
##  4.648487  5.357452  7.908747

These VIFs show, for each regression coefficient, how much its variance is inflated by correlation with the other regressors. For instance, the variance of the estimated coefficient of “wt” is 15.164887 times what it would be if “wt” were uncorrelated with the other regressors.

We now fit a new model that excludes the regressor with the highest VIF (disp) and recompute the variance inflation factors.
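For a model with only numeric regressors, the VIF of regressor j equals 1 / (1 - R_j^2), where R_j^2 comes from regressing that regressor on all the others. The sketch below reproduces the value for wt without the car package; `aux`, `r2_wt` and `vif_wt` are illustrative names.

```r
library(datasets)

# Auxiliary regression: wt on every other regressor.
# mpg is the outcome of the main model, so it is excluded here.
aux <- lm(wt ~ . - mpg, data = mtcars)

# VIF = 1 / (1 - R^2) of the auxiliary regression.
r2_wt  <- summary(aux)$r.squared
vif_wt <- 1 / (1 - r2_wt)

print(c(r2_wt = r2_wt, vif_wt = vif_wt))
```

The result should match the wt entry returned by `vif(fitlm2)`.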

vif(lm(mpg~.- disp, mtcars))

##       cyl        hp      drat        wt      qsec        vs        am 
## 14.284737  7.123361  3.329298  6.189050  6.914423  4.916053  4.645108 
##      gear      carb 
##  5.324402  4.310597

Omitting disp markedly decreases the VIF of wt (from 15.16 to 6.19) and also lowers those of hp and carb, but has almost no effect on the VIFs of drat, vs, am and gear. This suggests that disp is strongly related to wt, while it adds little to the collinearity of drat, vs, am and gear beyond what the remaining regressors already account for.

Further models, obtained by removing regressors one at a time in descending order of p value, can be found in the Appendix.
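The step-by-step exclusion can also be automated. The loop below is a sketch of p-value-based backward elimination, assuming a stopping rule of "all slope p values below 0.05"; it refits after each removal, which is a slight variation on sorting the full-model p values once. The variable `vars` and the 0.05 threshold are assumptions made for this illustration.

```r
library(datasets)

# Start from all regressors and repeatedly drop the least significant one.
vars <- setdiff(names(mtcars), "mpg")
repeat {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  ct  <- summary(fit)$coefficients
  # p values of the slopes only (intercept row dropped), kept with their names.
  p <- setNames(ct[-1, 4], rownames(ct)[-1])
  # Stop when every remaining slope is significant or one regressor is left.
  if (max(p) < 0.05 || length(vars) == 1) break
  vars <- setdiff(vars, names(p)[which.max(p)])
}

print(vars)
print(round(p, 4))
```

On mtcars this procedure removes the same regressors, in the same order, as the models listed in the Appendix.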

Among the multivariable models, the model with regressors transmission (am), wt and qsec is the best fit for mpg: all three coefficients have p values below 0.05, so we can reject the null hypothesis for each, and the adjusted R-squared is 0.8336.

fitmod7 <- lm(mpg ~.-cyl -vs-carb-gear-drat-disp-hp, mtcars)
summary(fitmod7)

## 
## Call:
## lm(formula = mpg ~ . - cyl - vs - carb - gear - drat - disp - 
##     hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
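To judge whether adding wt and qsec improves on the transmission-only model, the two nested models can be compared with an F-test, and the adjusted transmission effect can be reported with a confidence interval. This is a sketch; `fit_am`, `fit_adj`, `cmp` and `ci_adj` are names introduced here.

```r
library(datasets)

fit_am  <- lm(mpg ~ am, data = mtcars)
fit_adj <- lm(mpg ~ wt + qsec + am, data = mtcars)

# F-test for nested models: do wt and qsec reduce the residual variance?
cmp <- anova(fit_am, fit_adj)
print(cmp)

# Transmission effect adjusted for wt and qsec, with its 95% confidence interval.
ci_adj <- confint(fit_adj, "am")
print(coef(fit_adj)["am"])
print(ci_adj)
```

The adjusted interval is the more appropriate answer to "how much better is manual", since it holds weight and quarter-mile time fixed.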

Exploratory analysis:

We plot the pairwise relationships between all variables in the dataset:

library(datasets); library(GGally); library(ggplot2)
g <- ggpairs(mtcars, lower = list(continuous = "smooth"))
g
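The pairs plot is dense, so a numeric view of the same information can help. The sketch below lists the Pearson correlation of each variable with mpg, ordered by absolute strength; `cors` and `cors_sorted` are illustrative names.

```r
library(datasets)

# Pearson correlation of every other variable with mpg.
cors <- cor(mtcars)[, "mpg"]
cors <- cors[names(cors) != "mpg"]

# Order by absolute value so the strongest associations come first.
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors_sorted, 3))
```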

Plotting diagnostics for the best-fitting model, with regressors transmission (am), wt and qsec and outcome miles per gallon (mpg).

This produces four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage.

par(mfrow = c(2, 2))
plot(fitmod7)
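The diagnostic plots can be complemented with a few numbers. The sketch below extracts leverage (hat values), influence (Cook's distance) and a Shapiro-Wilk test of residual normality for the same three-regressor model; the choice of these particular diagnostics is an assumption of this example, not something prescribed by the report.

```r
library(datasets)

fit <- lm(mpg ~ wt + qsec + am, data = mtcars)

# Leverage: hat values sum to the number of estimated coefficients (4 here).
lev <- hatvalues(fit)

# Influence: Cook's distance for each car.
cook <- cooks.distance(fit)

# Shapiro-Wilk test on the residuals; a large p value gives
# no evidence against normally distributed residuals.
sw <- shapiro.test(residuals(fit))

print(head(sort(lev, decreasing = TRUE), 3))
print(head(sort(cook, decreasing = TRUE), 3))
print(sw$p.value)
```

Cars at the top of the leverage and Cook's distance lists are the ones worth inspecting in the Residuals vs Leverage panel.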

Summary:

From the above analysis, we can conclude that cars with manual transmissions have better gas mileage than cars with automatic transmissions. Without adjusting for other variables, the difference is 7.24 miles per gallon; after adjusting for weight (wt) and quarter-mile time (qsec), the estimated difference is 2.94 miles per gallon.

Regarding the multivariable analysis, the linear model with regressors transmission (am), wt and qsec is the best-fitting model for mpg. The VIF analysis also indicates that disp is strongly related to wt, while removing it barely changes the VIFs of drat, vs, am and gear.

Appendix:

Identifying the best-fitting model by excluding, in each subsequent model, the regressor with the highest p value.

We first sort the p values of the full multivariable model in descending order.

sort(summary(fitlm2)$coef[,4], decreasing = TRUE)

##         cyl          vs        carb        gear        drat (Intercept) 
##  0.91608738  0.88142347  0.81217871  0.66520643  0.63527790  0.51812440 
##        disp          hp        qsec          am          wt 
##  0.46348865  0.33495531  0.27394127  0.23398971  0.06325215

fitmod1 <- lm(mpg ~.-cyl, mtcars)
summary(fitmod1)

## 
## Call:
## lm(formula = mpg ~ . - cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4286 -1.5908 -0.0412  1.2120  4.5961 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 10.96007   13.53030   0.810   0.4266  
## disp         0.01283    0.01682   0.763   0.4538  
## hp          -0.02191    0.02091  -1.048   0.3062  
## drat         0.83520    1.53625   0.544   0.5921  
## wt          -3.69251    1.83954  -2.007   0.0572 .
## qsec         0.84244    0.68678   1.227   0.2329  
## vs           0.38975    1.94800   0.200   0.8433  
## am           2.57743    1.94035   1.328   0.1977  
## gear         0.71155    1.36562   0.521   0.6075  
## carb        -0.21958    0.78856  -0.278   0.7833  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.59 on 22 degrees of freedom
## Multiple R-squared:  0.8689, Adjusted R-squared:  0.8153 
## F-statistic: 16.21 on 9 and 22 DF,  p-value: 9.031e-08

fitmod2 <- lm(mpg ~.-cyl -vs, mtcars)
summary(fitmod2)

## 
## Call:
## lm(formula = mpg ~ . - cyl - vs, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.356 -1.576 -0.149  1.218  4.604 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  9.76828   11.89230   0.821   0.4199  
## disp         0.01214    0.01612   0.753   0.4590  
## hp          -0.02095    0.01993  -1.051   0.3040  
## drat         0.87510    1.49113   0.587   0.5630  
## wt          -3.71151    1.79834  -2.064   0.0505 .
## qsec         0.91083    0.58312   1.562   0.1319  
## am           2.52390    1.88128   1.342   0.1928  
## gear         0.75984    1.31577   0.577   0.5692  
## carb        -0.24796    0.75933  -0.327   0.7470  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.535 on 23 degrees of freedom
## Multiple R-squared:  0.8687, Adjusted R-squared:  0.823 
## F-statistic: 19.02 on 8 and 23 DF,  p-value: 2.008e-08

fitmod3 <- lm(mpg ~.-cyl -vs -carb, mtcars)
summary(fitmod3)

## 
## Call:
## lm(formula = mpg ~ . - cyl - vs - carb, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1200 -1.7753 -0.1446  1.0903  4.7172 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  9.19763   11.54220   0.797  0.43334   
## disp         0.01552    0.01214   1.278  0.21342   
## hp          -0.02471    0.01596  -1.548  0.13476   
## drat         0.81023    1.45007   0.559  0.58151   
## wt          -4.13065    1.23593  -3.342  0.00272 **
## qsec         1.00979    0.48883   2.066  0.04981 * 
## am           2.58980    1.83528   1.411  0.17104   
## gear         0.60644    1.20596   0.503  0.61964   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.488 on 24 degrees of freedom
## Multiple R-squared:  0.8681, Adjusted R-squared:  0.8296 
## F-statistic: 22.56 on 7 and 24 DF,  p-value: 4.218e-09

fitmod4 <- lm(mpg ~.-cyl -vs-carb-gear, mtcars)
summary(fitmod4)

## 
## Call:
## lm(formula = mpg ~ . - cyl - vs - carb - gear, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2669 -1.6148 -0.2585  1.1220  4.5564 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 10.71062   10.97539   0.976  0.33848   
## disp         0.01310    0.01098   1.193  0.24405   
## hp          -0.02180    0.01465  -1.488  0.14938   
## drat         1.02065    1.36748   0.746  0.46240   
## wt          -4.04454    1.20558  -3.355  0.00254 **
## qsec         0.99073    0.48002   2.064  0.04955 * 
## am           2.98469    1.63382   1.827  0.07969 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.45 on 25 degrees of freedom
## Multiple R-squared:  0.8667, Adjusted R-squared:  0.8347 
## F-statistic: 27.09 on 6 and 25 DF,  p-value: 8.637e-10

fitmod5 <- lm(mpg ~.-cyl -vs-carb-gear-drat, mtcars)
summary(fitmod5)

## 
## Call:
## lm(formula = mpg ~ . - cyl - vs - carb - gear - drat, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5399 -1.7398 -0.3196  1.1676  4.5534 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 14.36190    9.74079   1.474  0.15238   
## disp         0.01124    0.01060   1.060  0.29897   
## hp          -0.02117    0.01450  -1.460  0.15639   
## wt          -4.08433    1.19410  -3.420  0.00208 **
## qsec         1.00690    0.47543   2.118  0.04391 * 
## am           3.47045    1.48578   2.336  0.02749 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.429 on 26 degrees of freedom
## Multiple R-squared:  0.8637, Adjusted R-squared:  0.8375 
## F-statistic: 32.96 on 5 and 26 DF,  p-value: 1.844e-10

fitmod6 <- lm(mpg ~.-cyl -vs-carb-gear-drat-disp, mtcars)
summary(fitmod6)

## 
## Call:
## lm(formula = mpg ~ . - cyl - vs - carb - gear - drat - disp, 
##     data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4975 -1.5902 -0.1122  1.1795  4.5404 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 17.44019    9.31887   1.871  0.07215 . 
## hp          -0.01765    0.01415  -1.247  0.22309   
## wt          -3.23810    0.88990  -3.639  0.00114 **
## qsec         0.81060    0.43887   1.847  0.07573 . 
## am           2.92550    1.39715   2.094  0.04579 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.435 on 27 degrees of freedom
## Multiple R-squared:  0.8579, Adjusted R-squared:  0.8368 
## F-statistic: 40.74 on 4 and 27 DF,  p-value: 4.589e-11

fitmod7 <- lm(mpg ~.-cyl -vs-carb-gear-drat-disp-hp, mtcars)
summary(fitmod7)

## 
## Call:
## lm(formula = mpg ~ . - cyl - vs - carb - gear - drat - disp - 
##     hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

fitmod8 <- lm(mpg ~.-cyl -vs-carb-gear-drat-disp-hp-qsec, mtcars)
summary(fitmod8)

## 
## Call:
## lm(formula = mpg ~ . - cyl - vs - carb - gear - drat - disp - 
##     hp - qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5295 -2.3619 -0.1317  1.4025  6.8782 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.32155    3.05464  12.218 5.84e-13 ***
## wt          -5.35281    0.78824  -6.791 1.87e-07 ***
## am          -0.02362    1.54565  -0.015    0.988    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.098 on 29 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7358 
## F-statistic: 44.17 on 2 and 29 DF,  p-value: 1.579e-09