Regression Models Course Project

By Megan Williams

Executive Summary

Research Question

Using the mtcars data set (32 observations on 11 variables), we explore the relationship between a set of variables and the outcome, miles per gallon (MPG). Specifically, this report addresses the following questions:

-"Is an automatic or manual transmission better for MPG?"

-"Quantify the MPG difference between automatic and manual transmissions"

Summary of Results

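Although our initial analysis suggested a significant difference in fuel consumption between the two transmission types, further analysis indicated that fuel consumption is more strongly associated with the number of cylinders, horsepower, and weight. Manual cars averaged about 7.2 MPG more than automatic cars, but after adjusting for these variables the estimated difference fell to about 1.8 MPG and was no longer statistically significant (p = 0.21).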

Limitations

This analysis was not without limitations. The small sample size may have led to an over- or under-estimation of fuel consumption with respect to transmission type.

Exploratory Data Analysis

First, let's load the mtcars data and take a look at the variables

data(mtcars)
summary(mtcars)

##       mpg            cyl            disp             hp       
##  Min.   :10.4   Min.   :4.00   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.4   1st Qu.:4.00   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.2   Median :6.00   Median :196.3   Median :123.0  
##  Mean   :20.1   Mean   :6.19   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.8   3rd Qu.:8.00   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.9   Max.   :8.00   Max.   :472.0   Max.   :335.0  
##       drat            wt            qsec            vs       
##  Min.   :2.76   Min.   :1.51   Min.   :14.5   Min.   :0.000  
##  1st Qu.:3.08   1st Qu.:2.58   1st Qu.:16.9   1st Qu.:0.000  
##  Median :3.69   Median :3.33   Median :17.7   Median :0.000  
##  Mean   :3.60   Mean   :3.22   Mean   :17.8   Mean   :0.438  
##  3rd Qu.:3.92   3rd Qu.:3.61   3rd Qu.:18.9   3rd Qu.:1.000  
##  Max.   :4.93   Max.   :5.42   Max.   :22.9   Max.   :1.000  
##        am             gear           carb     
##  Min.   :0.000   Min.   :3.00   Min.   :1.00  
##  1st Qu.:0.000   1st Qu.:3.00   1st Qu.:2.00  
##  Median :0.000   Median :4.00   Median :2.00  
##  Mean   :0.406   Mean   :3.69   Mean   :2.81  
##  3rd Qu.:1.000   3rd Qu.:4.00   3rd Qu.:4.00  
##  Max.   :1.000   Max.   :5.00   Max.   :8.00

Next, let's convert the categorical variables (am, cyl, gear, carb, and vs) to factors so that R treats them as groups rather than as numbers, and take a look at the structure of our dataset

mtcars$am = factor(mtcars$am)
mtcars$cyl = factor(mtcars$cyl)
mtcars$gear = factor(mtcars$gear)
mtcars$carb = factor(mtcars$carb)
mtcars$vs = factor(mtcars$vs)
levels(mtcars$am) = c("automatic", "manual")
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

*NOTE: For more information on the variables, follow this link: http://rpubs.com/williamsmr/29922
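The five conversions above can also be written in one pass. The following is a minimal sketch, assuming a working copy named `mt` and a helper vector `cat_vars`; both names are illustrative and do not appear in the original analysis.

```r
# Work on a copy so the built-in dataset is left untouched.
# 'mt' and 'cat_vars' are illustrative names, not part of the original report.
mt <- datasets::mtcars

# Columns that encode categories rather than quantities
cat_vars <- c("am", "cyl", "gear", "carb", "vs")

# Convert all of them to factors in a single step
mt[cat_vars] <- lapply(mt[cat_vars], factor)

# In mtcars, am is coded 0 = automatic, 1 = manual
levels(mt$am) <- c("automatic", "manual")

# Number of levels per converted column
sapply(mt[cat_vars], nlevels)
```

Working on a copy keeps the original `mtcars` data frame available for comparison.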

Next, let's test for normality

shapiro.test(mtcars$mpg)

## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.9476, p-value = 0.1229

The p-value (0.1229) is greater than .05, so we fail to reject the null hypothesis of normality: the fuel consumption sample is consistent with a normal distribution.
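Because the next step compares two groups, it may also be worth checking normality within each transmission type rather than only on the pooled sample. The sketch below is an optional extra check that was not part of the original analysis; the names `mt` and `p_by_group` are illustrative.

```r
# Optional check, not run in the original report:
# Shapiro-Wilk test of mpg within each transmission group.
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("automatic", "manual"))  # 0 = automatic, 1 = manual

# Split mpg by transmission type and extract one p-value per group
p_by_group <- sapply(split(mt$mpg, mt$am),
                     function(x) shapiro.test(x)$p.value)
p_by_group
```

As with the pooled test, a p-value above .05 means only that normality is not rejected for that group.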

Next, let's use a simple t-test to compare the means for the mpg (fuel consumption) for two types of transmission (automatic or manual)

t.test(mtcars$mpg ~ mtcars$am)

## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg by mtcars$am
## t = -3.767, df = 18.33, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.28  -3.21
## sample estimates:
## mean in group automatic    mean in group manual 
##                   17.15                   24.39

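Our p-value (0.001374) is less than .05, suggesting that there is in fact a statistically significant difference in fuel consumption between the two types of transmission. On average, manual cars achieved 24.39 MPG and automatic cars 17.15 MPG, a difference of about 7.24 MPG in favor of manual transmissions (95% confidence interval: 3.21 to 11.28 MPG). See Figure 1 in the Appendix for a visual representation.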
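The unadjusted difference can be pulled out of the data directly, which addresses the "quantify" part of the research question. This is a small sketch under the same factor coding used above; `mt`, `group_means`, `mean_diff`, and `welch` are illustrative names.

```r
# Quantify the unadjusted MPG difference between transmission types.
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("automatic", "manual"))  # 0 = automatic, 1 = manual

# Mean mpg per group, then manual minus automatic
group_means <- tapply(mt$mpg, mt$am, mean)
mean_diff <- unname(group_means["manual"] - group_means["automatic"])

# Same Welch test as above, written with a data argument
welch <- t.test(mpg ~ am, data = mt)

round(group_means, 2)
round(mean_diff, 2)
round(welch$conf.int, 2)   # interval is for automatic minus manual
```

Note that `t.test` reports the interval for the first factor level minus the second, which is why both bounds are negative.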

Model Selection

Now we will construct our regression model! First, we use backward stepwise selection, starting from the model with all predictors and using the AIC penalty (k = 2), to determine which variables other than transmission type are associated with fuel consumption

mod1 = lm(mpg ~ ., data = mtcars)
mod2 = step(mod1, direction="backward", k=2, trace=0) 
summary(mod2)

## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.939 -1.256 -0.401  1.125  5.051 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  33.7083     2.6049   12.94  7.7e-13
## cyl6         -3.0313     1.4073   -2.15   0.0407
## cyl8         -2.1637     2.2843   -0.95   0.3523
## hp           -0.0321     0.0137   -2.35   0.0269
## wt           -2.4968     0.8856   -2.82   0.0091
## ammanual      1.8092     1.3963    1.30   0.2065
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.866,  Adjusted R-squared:  0.84 
## F-statistic: 33.6 on 5 and 26 DF,  p-value: 1.51e-10

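The p-value for the overall F-test is less than .05, so the model as a whole is significant. The adjusted R-squared is 0.84 (multiple R-squared 0.866), so the model explains roughly 84% of the variance in fuel consumption. Horsepower, weight, and the six-cylinder contrast (relative to four cylinders) are significant at the .05 level; the eight-cylinder contrast is not (p = 0.35). Holding these variables constant, manual cars are estimated to get 1.81 MPG more than automatic cars, but this difference is not statistically significant (p = 0.21).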
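A confidence interval for the transmission coefficient makes the adjusted comparison concrete. The sketch below refits the selected model on a working copy and extracts the interval with `confint`; the names `mt`, `cat_vars`, `fit`, `am_est`, and `am_ci` are illustrative and not part of the original analysis.

```r
# Adjusted effect of transmission type, with a 95% confidence interval.
mt <- datasets::mtcars
cat_vars <- c("am", "cyl", "gear", "carb", "vs")
mt[cat_vars] <- lapply(mt[cat_vars], factor)
levels(mt$am) <- c("automatic", "manual")

# Model chosen by the backward stepwise selection above
fit <- lm(mpg ~ cyl + hp + wt + am, data = mt)

# Point estimate and interval for manual vs. automatic
am_est <- coef(fit)["ammanual"]
am_ci  <- confint(fit, "ammanual", level = 0.95)

round(am_est, 2)
round(am_ci, 2)
```

An interval that spans zero is consistent with the non-significant p-value reported for `ammanual`.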

Let's test the significance of a model including the significant variables as compared to the basic model with just transmission

mod1 = lm(mpg ~ am, data = mtcars)
mod2 = lm(mpg ~ cyl + hp + wt + am, data = mtcars)
anova(mod1, mod2)

## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
##   Res.Df RSS Df Sum of Sq    F  Pr(>F)
## 1     30 721                          
## 2     26 151  4       570 24.5 1.7e-08

The p-value (1.7e-08) is far below .05, so adding cylinders, horsepower, and weight significantly improves the fit over the model with transmission type alone.
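The F statistic in this table comes from comparing the residual sums of squares of the two nested models. As a sketch, it can be reproduced by hand and checked against `anova`; the names `base_fit`, `full_fit`, `f_manual`, and `f_anova` are illustrative.

```r
# Reproduce the nested-model F statistic from the residual sums of squares.
mt <- datasets::mtcars
cat_vars <- c("am", "cyl", "gear", "carb", "vs")
mt[cat_vars] <- lapply(mt[cat_vars], factor)
levels(mt$am) <- c("automatic", "manual")

base_fit <- lm(mpg ~ am, data = mt)
full_fit <- lm(mpg ~ cyl + hp + wt + am, data = mt)

# Residual sum of squares and residual degrees of freedom for each model
rss_base <- deviance(base_fit)
rss_full <- deviance(full_fit)
df_base  <- df.residual(base_fit)
df_full  <- df.residual(full_fit)

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
f_manual <- ((rss_base - rss_full) / (df_base - df_full)) / (rss_full / df_full)

# The value reported by anova() for the same comparison
f_anova <- anova(base_fit, full_fit)$F[2]

c(manual = f_manual, anova = f_anova)
```

The two values should agree, which confirms that the test asks whether the four extra parameters reduce the residual variation by more than chance would.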

Residuals Analysis

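The diagnostic plots in Figure 2 of the "Appendix" section suggest that the residuals are approximately normally distributed: the points in the normal Q-Q panel should lie close to the reference line.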
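A formal test can complement the visual check. The following sketch applies the Shapiro-Wilk test to the model residuals; it was not run in the original report, so no result is claimed here, and the names `fit` and `res_check` are illustrative.

```r
# Optional check, not run in the original report:
# Shapiro-Wilk test on the residuals of the selected model.
mt <- datasets::mtcars
cat_vars <- c("am", "cyl", "gear", "carb", "vs")
mt[cat_vars] <- lapply(mt[cat_vars], factor)
levels(mt$am) <- c("automatic", "manual")

fit <- lm(mpg ~ cyl + hp + wt + am, data = mt)

# Null hypothesis: the residuals come from a normal distribution
res_check <- shapiro.test(residuals(fit))
res_check$p.value
```

A p-value above .05 would mean that normality of the residuals is not rejected, in line with the reading of the Q-Q plot.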

Appendix

Figure 1. Boxplot representing fuel consumption by transmission type

boxplot(mpg ~ am, data = mtcars, xlab = "Transmission type", ylab = "Miles per gallon")

Figure 2. Regression model (mod2) suggested by backwards stepwise analysis

par(mfrow=c(2, 2))
plot(mod2)
