Regression Models Course Project

Overview

Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

“Is an automatic or manual transmission better for MPG”
“Quantify the MPG difference between automatic and manual transmissions”

Exploratory Analysis

First I plot a boxplot of mpg by transmission type. This figure shows that Manual type seems to give a good effect on the fuel consumption. However, this model which is transmission type as a only variable has a very low Adjusted R-squared.

data(mtcars)
boxplot(mpg ~ am, data = mtcars, ylab = "MPG", xlab = "Transmission Type (0 = Automatic, 1 = Manual)")

fit_1 <- lm(mpg ~ am, data = mtcars)
summary(fit_1)

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## am             7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Next checking what variable affects to fuel consumption, I use simple liner regression model for all variable.

fit_all <- lm(mpg ~ ., data = mtcars)
summary(fit_all)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

Regression Analysis

The previous model (fit_all) includes all of the variables. To select predictors, I use step() function and fit_all as an initial model.

best_fit <- step(fit_all, direction = "both")

summary(best_fit)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The best model obtained from the above computations shows that variables “qsec” and “wt”and “am” as the variable. Further the adjusted R-squared value of 0.8336 which is larger than fit_all which is all factor as the variables.

Residuals plot

Residuals can be easily plotted by plot() function.

par(mfrow=c(2, 2))
plot(best_fit)

Describing these figures

TOP LEFT: The Residuals vs. Fitted plot is mostly randomly scattered. It shows that this model is a independence condition.
TOP RIGHT: The Normal Q-Q plot are mostly on the line. This shows that the residuals are normally distributed.
BOTTOM LET: The Scale-Location plot is mostly randomly scattered. It is like a band. It shows that variance are almost constant.
BOTTOM RIGHT: Cook’s distance figure has a outlier point at top right. This point’s car name is “chrysler imperial”.

Summary

The obtained regression equation is mpg = 9.6178 - 3.9165 wt + 1.2259 qsec + 2.9358 am.
Manual transmission is higher fuel consumption than the automatic transmission. Its coefficient is 2.9358.
Heavy car will worsen the fuel consumption. MPG decrease 3.9165 for every 1000 lb.
High qsec improve the fuel consumption. Its coefficient is 1.2259.