Regression Models

Regression Models: Peer Assessment 1

Jenina Halitsky

October 26, 2014

========================================================================================================================

Executive Summary

This study is a regression analysis of the relationship between transmission type and fuel efficiency (MPG) in the mtcars dataset, taking other variables in the dataset into account. The main goal is to answer the following questions:

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions.

The remainder of this document explains how and why I reached the following answers to the questions above.

  1. A manual transmission is better for MPG than an automatic transmission. On average, cars with a manual transmission travel 7.24 more miles per gallon than cars with an automatic transmission:
    am        avg_mpg  stdev_mpg
    manual    24.39231 6.166504
    automatic 17.14737 3.833966

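  2. To adjust for confounding variables such as the weight and horsepower of the car, a multivariable regression gives a better estimate of the effect of transmission type on MPG. In a linear model that includes weight and horsepower (preferred over the single-predictor model in an ANOVA comparison), manual transmission cars get an estimated 2.084 miles per gallon more than automatic transmission cars, holding weight and horsepower fixed. This adjusted difference is not statistically significant at the 5% level (p = 0.141).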

========================================================================================================================

Data Processing

Load needed libraries

        require(plyr)
## Loading required package: plyr
        require(ggplot2)
## Loading required package: ggplot2
        library(datasets)

Load mtcars Dataset

        data(mtcars)
        str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The variables present in the dataset are:
* mpg - Miles/(US) gallon
* cyl - Number of cylinders
* disp - Displacement (cu.in.)
* hp - Gross horsepower
* drat - Rear axle ratio
* wt - Weight (lb/1000)
* qsec - 1/4 mile time
* vs - Engine shape (0 = V-shaped, 1 = straight)
* am - Transmission (0 = automatic, 1 = manual)
* gear - Number of forward gears
* carb - Number of carburetors

Convert Transmission Type

The dataset represents the transmission type as a 1 or a 0. For readability, we add a factor variable with the labels manual and automatic, with manual as the first level.

        mtcars$am.factor <- factor(mtcars$am, levels = c(1, 0), labels = c('manual', 'automatic'))
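
As a quick sanity check on the recoding, a cross-tabulation of the original 0/1 column against the labelled factor should show every 1 mapped to manual and every 0 mapped to automatic. This is a minimal sketch: `am.check` and `recode.table` are hypothetical names used only here, and the factor is rebuilt locally so the snippet runs on its own.

```r
library(datasets)
data(mtcars)

# Hypothetical local copy of the recoded factor, with 'manual' as the first level
am.check <- factor(mtcars$am, levels = c(1, 0), labels = c("manual", "automatic"))

# Rows: original 0/1 coding; columns: factor labels
recode.table <- table(original = mtcars$am, recoded = am.check)
print(recode.table)
```

All 13 cars with am = 1 should land in the manual column and all 19 cars with am = 0 in the automatic column, with zeros elsewhere.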

========================================================================================================================

Is an automatic or manual transmission better for MPG?

Exploration

To answer this question, let's compute the mean and standard deviation of MPG for each transmission type. This helps us determine whether there is a noticeable difference in fuel efficiency between the transmission types.

        data.frame(am = levels(mtcars$am.factor),
                   avg_mpg = aggregate(mtcars$mpg, by = list(mtcars$am.factor), mean)$x,
                   stdev_mpg = aggregate(mtcars$mpg, by = list(mtcars$am.factor), sd)$x)
##          am avg_mpg stdev_mpg
## 1    manual   24.39     6.167
## 2 automatic   17.15     3.834

This shows that, on average, a manual transmission is better for MPG than an automatic transmission: manual cars travel 7.24 more miles per gallon (24.39 versus 17.15).
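
The difference between the two group means can be computed directly, and a two-sample t-test is one way to gauge whether a gap of this size could plausibly be due to chance. The sketch below is not part of the original analysis; it assumes the Welch test (the default of `t.test`) is acceptable here, and `group.means`, `mean.diff` and `welch` are hypothetical names.

```r
library(datasets)
data(mtcars)

# Mean MPG by the original 0/1 transmission coding (0 = automatic, 1 = manual)
group.means <- tapply(mtcars$mpg, mtcars$am, mean)

# Difference in average MPG: manual minus automatic
mean.diff <- unname(group.means["1"] - group.means["0"])
print(round(mean.diff, 3))

# Welch two-sample t-test comparing MPG between the two transmission groups
welch <- t.test(mpg ~ am, data = mtcars)
print(welch$p.value)
```

Note that this comparison ignores every other variable in the dataset, so it describes the raw gap only.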

Regression Analysis

A simple regression model for MPG with a single predictor of AM

        mpg.am <- lm(mpg~am, data=mtcars)
        summary(mpg.am)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.392 -3.092 -0.297  3.244  9.508 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    17.15       1.12   15.25  1.1e-15 ***
## am              7.24       1.76    4.11  0.00029 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.9 on 30 degrees of freedom
## Multiple R-squared:  0.36,   Adjusted R-squared:  0.338 
## F-statistic: 16.9 on 1 and 30 DF,  p-value: 0.000285

Interpreting:
* Intercept = 17.147, which is the mean MPG of automatic cars (am = 0).
* AM coefficient = 7.245, which is the difference between the mean MPG of manual cars and the mean MPG of automatic cars. The manual mean is therefore 17.147 + 7.245 = 24.392.
* R-squared = 0.3598, which means that the model explains only 35.98% of the variance in MPG.
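
With am coded 0 for automatic and 1 for manual, the intercept of this model is the mean MPG for am = 0, and the intercept plus the am coefficient is the mean MPG for am = 1. The following sketch checks that relationship numerically; it refits the model under the hypothetical name `simple.fit` so that it runs on its own.

```r
library(datasets)
data(mtcars)

# Refit the single-predictor model so this snippet is self-contained
simple.fit <- lm(mpg ~ am, data = mtcars)
b <- unname(coef(simple.fit))

auto.mean   <- mean(mtcars$mpg[mtcars$am == 0])
manual.mean <- mean(mtcars$mpg[mtcars$am == 1])

# Intercept should match the automatic mean;
# intercept + am coefficient should match the manual mean
print(all.equal(b[1], auto.mean))
print(all.equal(b[1] + b[2], manual.mean))
```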

See the Appendix, Figure 1: MPG by Transmission Type, for a visual representation.

========================================================================================================================

Quantify the MPG difference between automatic and manual transmissions.

Exploration

To answer this question, let's look at the correlation between MPG and transmission type.

        cor(mtcars$am, mtcars$mpg)
## [1] 0.5998

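The correlation is 0.5998, a moderately strong positive correlation. Because am is coded 0 for automatic and 1 for manual, a positive value means that manual cars tend to have higher MPG.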
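
For a regression with a single predictor, the squared correlation equals the model's R-squared, so 0.5998 squared is about 0.36: transmission type alone accounts for roughly a third of the variance in MPG. A minimal sketch of that identity, with `r` and `r.squared` as hypothetical names:

```r
library(datasets)
data(mtcars)

# Pearson correlation between transmission (0/1) and MPG
r <- cor(mtcars$am, mtcars$mpg)

# R-squared of the single-predictor regression of mpg on am
r.squared <- summary(lm(mpg ~ am, data = mtcars))$r.squared

print(round(r^2, 4))
print(round(r.squared, 4))
```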

Regression Analysis

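To adjust for other variables, we fit a second model that adds weight (wt) and horsepower (hp) to the simple model. Because the simple model is nested in the larger one, we can use ANOVA to test whether the additional predictors significantly improve the fit.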

        bestfit <- lm(mpg~am + wt + hp, data = mtcars)
        anova(mpg.am, bestfit)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + hp
##   Res.Df RSS Df Sum of Sq  F  Pr(>F)    
## 1     30 721                            
## 2     28 180  2       541 42 3.7e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpreting:
* The p-value = 3.745e-09, so we reject the null hypothesis that the coefficients of wt and hp are both zero. The model that includes wt and hp fits significantly better than the simple model.
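
The F statistic in the ANOVA table compares the drop in residual sum of squares against the residual variance of the larger model. With the rounded values from the table, that is (541 / 2) / (180 / 28), or about 42. The sketch below reproduces the calculation from the fitted models; `small`, `large` and the other variable names are hypothetical, and the models are refit so the snippet runs on its own.

```r
library(datasets)
data(mtcars)

small <- lm(mpg ~ am, data = mtcars)            # simple model
large <- lm(mpg ~ am + wt + hp, data = mtcars)  # model with wt and hp added

# Residual sums of squares and residual degrees of freedom
rss.small <- sum(residuals(small)^2)
rss.large <- sum(residuals(large)^2)
df.small  <- df.residual(small)
df.large  <- df.residual(large)

# F = (reduction in RSS per added predictor) / (residual variance of larger model)
f.stat  <- ((rss.small - rss.large) / (df.small - df.large)) / (rss.large / df.large)
p.value <- pf(f.stat, df.small - df.large, df.large, lower.tail = FALSE)

print(c(F = f.stat, p = p.value))
```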

Final Model

        summary(bestfit)
## 
## Call:
## lm(formula = mpg ~ am + wt + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.422 -1.792 -0.379  1.225  5.532 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.00288    2.64266   12.87  2.8e-13 ***
## am           2.08371    1.37642    1.51  0.14127    
## wt          -2.87858    0.90497   -3.18  0.00357 ** 
## hp          -0.03748    0.00961   -3.90  0.00055 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.54 on 28 degrees of freedom
## Multiple R-squared:  0.84,   Adjusted R-squared:  0.823 
## F-statistic:   49 on 3 and 28 DF,  p-value: 2.91e-11

Interpreting:
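* R-squared = 0.8399, so the model explains 83.99% of the variance in MPG.
* Estimate for am = 2.084: holding weight and horsepower fixed, manual transmission cars are estimated to get 2.084 more MPG than automatic transmission cars. With p = 0.141, this adjusted difference is not statistically significant at the 5% level.
* Weight (wt) and horsepower (hp) do confound the relationship between am and mpg (mostly wt): once they are included, the am estimate drops from about 7.24 to 2.084.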
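
A confidence interval for the am coefficient makes the uncertainty around the adjusted estimate explicit. This is a hedged sketch rather than part of the original analysis: it refits the model under the hypothetical name `adj.fit` and uses `confint` from base R at the conventional 95% level.

```r
library(datasets)
data(mtcars)

# Refit the adjusted model so this snippet is self-contained
adj.fit <- lm(mpg ~ am + wt + hp, data = mtcars)

# 95% confidence interval for the transmission coefficient only
am.ci <- confint(adj.fit, "am", level = 0.95)
print(am.ci)
```

Using the estimate and standard error printed in the summary, the interval works out to roughly 2.084 ± 2.048 × 1.376, or about -0.74 to 4.90. Because it contains zero, the data are consistent with no adjusted difference between transmission types.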

See the Appendix, Figure 2: Residuals vs. Fitted Values, which suggests that the residuals are approximately normally distributed and homoskedastic.

========================================================================================================================

Appendix

Figure 1: MPG by Transmission Type

        boxplot(mpg~am.factor, data = mtcars,
                col = c("red", "blue"),
                main = "MPG by Transmission Type",
                xlab = "Transmission",
                ylab = "Miles per Gallon")

[Figure 1: boxplot of MPG by transmission type]

The manual transmission (red box) has a mean MPG of 24.39 and the automatic transmission (blue box) has a mean MPG of 17.15. Note that the heavy line inside each box marks the median, not the mean.

Figure 2: Residuals vs. Fitted Values

It is important to check the residuals for signs of non-normality and to examine the residuals vs. fitted values plot for signs of heteroskedasticity.

        par(mfrow = c(2,2))
        plot(bestfit)        

[Figure 2: diagnostic plots for the final model, 2 x 2 panel produced by plot(bestfit)]

The plots suggest that the residuals are approximately normally distributed and homoskedastic.
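
A formal test can complement the visual check of normality. The sketch below applies the Shapiro-Wilk test from base R to the residuals of the final model; it was not part of the original analysis, no particular result is asserted here, and `final.fit` and `sw` are hypothetical names.

```r
library(datasets)
data(mtcars)

# Refit the final model so this snippet is self-contained
final.fit <- lm(mpg ~ am + wt + hp, data = mtcars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(residuals(final.fit))
print(sw)
```

A small p-value would be evidence against normality; a large one means only that the test found no clear departure from it, which is weaker than proof that the residuals are normal.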