By Megan Williams
Executive Summary
Research Question
Using the mtcars data set (32 observations on 11 variables), we explore the relationship between a set of variables and miles per gallon (MPG), the outcome. Specifically, this report addresses the following questions:
-"Is an automatic or manual transmission better for MPG?"
-"Quantify the MPG difference between automatic and manual transmissions."
Summary of Results
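Although our initial analysis suggested a significant difference in fuel consumption between the two transmission types (manual cars averaged about 7.2 MPG more than automatic cars), further analysis indicated that fuel consumption is driven more by number of cylinders, horsepower, and weight. After adjusting for those variables, the estimated advantage of manual transmissions fell to about 1.8 MPG and was no longer statistically significant (p = 0.21).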
Limitations
This analysis was not without limitations. The small sample size may have led to an over- or under-estimation of fuel consumption with respect to transmission type.
Exploratory Data Analysis
First, let's load the mtcars data and take a look at the variables:
data(mtcars)
summary(mtcars)
## mpg cyl disp hp
## Min. :10.4 Min. :4.00 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.4 1st Qu.:4.00 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.2 Median :6.00 Median :196.3 Median :123.0
## Mean :20.1 Mean :6.19 Mean :230.7 Mean :146.7
## 3rd Qu.:22.8 3rd Qu.:8.00 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.9 Max. :8.00 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.76 Min. :1.51 Min. :14.5 Min. :0.000
## 1st Qu.:3.08 1st Qu.:2.58 1st Qu.:16.9 1st Qu.:0.000
## Median :3.69 Median :3.33 Median :17.7 Median :0.000
## Mean :3.60 Mean :3.22 Mean :17.8 Mean :0.438
## 3rd Qu.:3.92 3rd Qu.:3.61 3rd Qu.:18.9 3rd Qu.:1.000
## Max. :4.93 Max. :5.42 Max. :22.9 Max. :1.000
## am gear carb
## Min. :0.000 Min. :3.00 Min. :1.00
## 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:2.00
## Median :0.000 Median :4.00 Median :2.00
## Mean :0.406 Mean :3.69 Mean :2.81
## 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :1.000 Max. :5.00 Max. :8.00
Next, let's convert the categorical variables (am, cyl, gear, carb, and vs) to factors, which makes the rest of our analyses easier to handle, and take a look at the structure of our dataset:
mtcars$am = factor(mtcars$am)
mtcars$cyl = factor(mtcars$cyl)
mtcars$gear = factor(mtcars$gear)
mtcars$carb = factor(mtcars$carb)
mtcars$vs = factor(mtcars$vs)
levels(mtcars$am) = c("automatic", "manual")
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
*NOTE: For more information on the variables, follow this link: http://rpubs.com/williamsmr/29922
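Because the limitations above hinge on sample size, it helps to see how many cars fall in each transmission group. The following is a minimal sketch, not part of the original analysis: it works on a local copy named `mt` (a name chosen here for illustration) so the `mtcars` object used in the rest of the report is left untouched, and it converts the same five columns in a single step.

```r
# Local copy; `mt` is a hypothetical name used only in this sketch.
mt <- datasets::mtcars

# The five columns converted to factors in the report.
cat_vars <- c("cyl", "vs", "am", "gear", "carb")
mt[cat_vars] <- lapply(mt[cat_vars], factor)
levels(mt$am) <- c("automatic", "manual")

# Number of cars per transmission type.
am_counts <- table(mt$am)
print(am_counts)
```

With only 19 automatic and 13 manual cars, estimates for either group rest on few observations.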
Next, let's test whether MPG is normally distributed using the Shapiro-Wilk test:
shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.9476, p-value = 0.1229
The large p-value (p = 0.12, above .05) means we cannot reject the hypothesis that fuel consumption is normally distributed, so treating MPG as approximately normal is reasonable.
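The t-test that follows compares two groups, so the normality assumption arguably matters within each transmission group rather than for the pooled sample. A minimal sketch of that per-group check is shown here, assuming a local copy `mt` (a name introduced for this example); it is offered as a supplement, not as the procedure the report relied on.

```r
# Local copy; `mt` is a hypothetical name used only in this sketch.
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("automatic", "manual"))

# Shapiro-Wilk p-value for MPG within each transmission group.
p_by_group <- tapply(mt$mpg, mt$am, function(x) shapiro.test(x)$p.value)
print(round(p_by_group, 3))
```

With groups this small the test has limited power, so a p-value above .05 is weak evidence of normality rather than proof of it.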
Next, let's use a two-sample t-test (Welch's version, which does not assume equal variances) to compare mean MPG between the two transmission types (automatic and manual):
t.test(mtcars$mpg ~ mtcars$am)
##
## Welch Two Sample t-test
##
## data: mtcars$mpg by mtcars$am
## t = -3.767, df = 18.33, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.28 -3.21
## sample estimates:
## mean in group automatic mean in group manual
## 17.15 24.39
Our p-value (0.0014) is less than .05, suggesting that there is in fact a statistically significant difference in fuel consumption between the two transmission types: manual cars average 24.39 MPG versus 17.15 MPG for automatic cars. See Figure 1 in the Appendix for a visual representation.
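To address the second research question directly, the unadjusted difference and its confidence interval can be pulled out of the test object. This is a sketch under the assumption of a local copy `mt` (a name chosen for this example). Note that `t.test` reports the interval for automatic minus manual, so the sign is flipped here to express it as manual minus automatic.

```r
# Local copy; `mt` is a hypothetical name used only in this sketch.
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("automatic", "manual"))

tt <- t.test(mpg ~ am, data = mt)

# Difference in group means, expressed as manual minus automatic.
mpg_diff <- unname(diff(tt$estimate))

# Flip the sign and order of the reported interval to match that direction.
ci <- -rev(as.numeric(tt$conf.int))

cat(sprintf("Manual - automatic: %.2f MPG (95%% CI %.2f to %.2f)\n",
            mpg_diff, ci[1], ci[2]))
```

This matches the output above: a difference of about 7.2 MPG with a 95% interval of roughly 3.2 to 11.3 MPG, before adjusting for any other variable.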
Model Selection
Now we will construct our regression model. First, we use backward stepwise regression (with k = 2, so terms are dropped according to AIC) to determine which variables other than transmission type have an impact on fuel consumption:
mod1 = lm(mpg ~ ., data = mtcars)
mod2 = step(mod1, direction="backward", k=2, trace=0)
summary(mod2)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.939 -1.256 -0.401 1.125 5.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.7083 2.6049 12.94 7.7e-13
## cyl6 -3.0313 1.4073 -2.15 0.0407
## cyl8 -2.1637 2.2843 -0.95 0.3523
## hp -0.0321 0.0137 -2.35 0.0269
## wt -2.4968 0.8856 -2.82 0.0091
## ammanual 1.8092 1.3963 1.30 0.2065
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.866, Adjusted R-squared: 0.84
## F-statistic: 33.6 on 5 and 26 DF, p-value: 1.51e-10
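The overall F-test p-value for this model is less than .05, suggesting significance. The model explains about 87% of the variance in fuel consumption (multiple R-squared 0.866; adjusted R-squared 0.84). Horsepower, weight, and the six-cylinder indicator are significant at the .05 level, whereas the eight-cylinder indicator and transmission type are not. Holding the other variables fixed, manual cars are estimated to get 1.81 MPG more than automatic cars, but this difference is not statistically significant (p = 0.21).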
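The adjusted transmission effect is easier to judge with a confidence interval than with a p-value alone. The sketch below refits the selected model on a local copy and extracts the interval for the `ammanual` coefficient; `mt` and `fit` are names introduced here for illustration.

```r
# Local copy; `mt` and `fit` are hypothetical names used only in this sketch.
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("automatic", "manual"))
mt$cyl <- factor(mt$cyl)

# Same formula as the model selected by the stepwise procedure.
fit <- lm(mpg ~ cyl + hp + wt + am, data = mt)

# Adjusted manual-vs-automatic difference and its 95% confidence interval.
am_est <- coef(fit)[["ammanual"]]
am_ci <- confint(fit, "ammanual", level = 0.95)

print(am_est)
print(am_ci)
```

Because the coefficient's p-value is above .05, the 95% interval spans zero: the data are compatible with no adjusted difference as well as with a modest advantage for manual transmissions.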
Let's test whether the model with these additional variables fits significantly better than the basic model with transmission alone:
mod1 = lm(mpg ~ am, data = mtcars)
mod2 = lm(mpg ~ cyl + hp + wt + am, data = mtcars)
anova(mod1, mod2)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 721
## 2 26 151 4 570 24.5 1.7e-08
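The p-value of less than .05 indicates that mod2 fits significantly better than mod1; that is, adding cylinders, horsepower, and weight explains significantly more of the variation in MPG than transmission type alone.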
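The F statistic in the table can be reproduced by hand from the residual sums of squares: with the rounded values shown, (570 / 4) / (151 / 26) is approximately 24.5. The sketch below does the same calculation at full precision; the names `mt`, `small`, and `full` are chosen for this example and stand in for the data, mod1, and mod2.

```r
# Local copy; `mt`, `small`, and `full` are hypothetical names for this sketch.
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("automatic", "manual"))
mt$cyl <- factor(mt$cyl)

small <- lm(mpg ~ am, data = mt)                  # transmission only
full  <- lm(mpg ~ cyl + hp + wt + am, data = mt)  # selected model

# Residual sums of squares and the number of added parameters.
rss_small <- deviance(small)
rss_full  <- deviance(full)
df_extra  <- df.residual(small) - df.residual(full)

# Nested-model F statistic and its p-value.
f_stat <- ((rss_small - rss_full) / df_extra) / (rss_full / df.residual(full))
p_val  <- pf(f_stat, df_extra, df.residual(full), lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```

The comparison is valid because the smaller model is nested in the larger one: every term in the transmission-only model also appears in the full model.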
Residuals Analysis
Figure 2 in the Appendix shows the diagnostic plots for mod2; the normal Q-Q panel suggests that the residuals are approximately normally distributed.
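As a numeric complement to the diagnostic plots, the residuals can be tested for normality and the highest-leverage cars listed. This is a sketch of one possible check, not part of the original analysis; `mt`, `fit`, `res`, and `lev` are names introduced here.

```r
# Local copy; all object names here are hypothetical and used only in this sketch.
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("automatic", "manual"))
mt$cyl <- factor(mt$cyl)
fit <- lm(mpg ~ cyl + hp + wt + am, data = mt)

# Shapiro-Wilk test on the residuals; a p-value above .05 would be
# consistent with approximately normal residuals.
res <- residuals(fit)
res_normality <- shapiro.test(res)
print(res_normality)

# Cars with the largest leverage (hat values), i.e. the most unusual
# combinations of predictor values.
lev <- sort(hatvalues(fit), decreasing = TRUE)
print(head(lev, 3))
```

High-leverage cars are worth a second look in a data set of 32 observations, since a single unusual car can shift the fitted coefficients noticeably.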
Appendix
Figure 1. Boxplot representing fuel consumption by transmission type
boxplot(mpg ~ am, data = mtcars, xlab = "Transmission type", ylab = "Miles per gallon")
Figure 2. Diagnostic plots for the regression model (mod2) suggested by backward stepwise analysis
par(mfrow=c(2, 2))
plot(mod2)