Josh Katz

February 28, 2016

Executive Summary

The objective of this analysis was to determine whether fuel efficiency, measured in miles per gallon (mpg), is higher in automatic or manual transmission cars, and to quantify any difference. The built-in R dataset “mtcars” was used. This dataset contains 32 rows of cars from 1973-1974. Each row has 11 columns of car attributes such as mpg and horsepower (hp); see Appendix for more information.

The dataset mtcars was loaded and inspected (see Appendix). Manual transmission cars averaged 7.25 mpg more than automatic cars. However, an R-squared of 0.36 showed that transmission type (am: automatic or manual) explained only 36% of the variance in mpg. To increase the explained variance, a multiple regression model was fitted with explanatory variables chosen by a stepwise procedure: number of cylinders, horsepower, weight (wt: per 1000 lbs of car), and transmission type. This model raised the R-squared to 0.87. Transmission type was retained by the stepwise procedure, but its coefficient was not significant (p = 0.21, above the 0.05 threshold).

This study demonstrated that a book isn’t always to be judged by its cover, or in this case by a quick glance at and test of the data. A more robust analysis showed that fuel economy (mpg) was better described by car attributes other than transmission type: weight, horsepower and number of cylinders. In conclusion, which type of transmission achieves better fuel efficiency remains uncertain, because once horsepower, weight and number of cylinders are accounted for, the remaining difference between manual and automatic cars is not statistically significant.

Load dataset and transform categorical variables to factors

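data(mtcars)
## categorical variables are stored as numbers; convert them to factors
mtcars$cyl = factor(mtcars$cyl)
mtcars$vs = factor(mtcars$vs)
## am is coded 0 = automatic, 1 = manual
mtcars$am = factor(mtcars$am, labels = c("Automatic", "Manual"))
mtcars$gear = factor(mtcars$gear)
mtcars$carb = factor(mtcars$carb)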
Simple Regression (see Appendix for output)

A simple regression model of mpg explained by transmission type alone gives a 7.25 mpg increase from automatic to manual. The fitted line (yhat = 17.15 + 7.25x, where x = 1 for a manual car and 0 for an automatic) means that an automatic car achieves on average 17.15 mpg, while a manual car achieves on average 7.25 mpg more, or about 24.39 mpg. However, the R-squared is only 0.36, which means that transmission type explains only 36% of the mpg variance. A model with additional explanatory variables is used below to improve the R-squared.
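With a single two-level factor as the only predictor, the regression coefficients are just the group means in disguise. The sketch below checks this on a fresh copy of the data; the object names `mt`, `fit`, `automatic_mpg` and `manual_mpg` are illustrative and do not appear in the original analysis.

```r
# Sketch: the simple regression reproduces the per-group mean mpg.
data(mtcars)
mt <- mtcars                                   # illustrative working copy
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = mt)
group_means <- tapply(mt$mpg, mt$am, mean)     # mean mpg per transmission type

# Intercept = automatic mean; intercept + amManual = manual mean
automatic_mpg <- unname(coef(fit)["(Intercept)"])
manual_mpg <- automatic_mpg + unname(coef(fit)["amManual"])

print(round(group_means, 2))
print(round(c(Automatic = automatic_mpg, Manual = manual_mpg), 2))
```

Both printed vectors should agree, which is why the slope can be read directly as the difference in average mpg between the two groups.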

Multiple Regression Analysis (see Appendix for output)

A stepwise procedure (removing and adding back explanatory variables) was performed to select the car attributes that best explain mpg. The model line (yhat = 33.71 - 3.03cyl6 - 2.16cyl8 - 0.03hp - 2.50wt + 1.81manual) can be interpreted as follows: the baseline for a 4-cylinder automatic is 33.71 mpg (at zero horsepower and weight, so it is a reference point rather than a realistic car), and mpg is adjusted by -3.03 for a 6-cylinder, -2.16 for an 8-cylinder, -0.03 for every additional unit of horsepower, -2.50 for every 1000 lb increase in car weight and +1.81 for a manual transmission. In conclusion, transmission type (p = 0.21 > 0.05) is not a significant predictor of fuel efficiency (mpg) once the other predictors are included; adding hp, wt and cyl improved the R-squared from 0.36 to 0.87.
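To make the interpretation of the coefficients concrete, the sketch below applies the model line by hand to one car and compares the result with `predict()`. The model formula is the one reported in the Appendix; the input values (6 cylinders, 110 hp, wt = 2.62, manual) are taken from the first row of the dataset, and the names `mt`, `fit`, `by_hand` and `new_car` are illustrative.

```r
# Sketch: evaluate the fitted model line by hand for a single car.
data(mtcars)
mt <- mtcars                                   # illustrative working copy
mt$cyl <- factor(mt$cyl)
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ cyl + hp + wt + am, data = mt) # formula from the Appendix output
b <- coef(fit)

# 6-cylinder, 110 hp, 2620 lb (wt = 2.62), manual transmission
by_hand <- unname(b["(Intercept)"] + b["cyl6"] +
                  b["hp"] * 110 + b["wt"] * 2.62 + b["amManual"])

new_car <- data.frame(
  cyl = factor("6", levels = levels(mt$cyl)),
  hp = 110,
  wt = 2.62,
  am = factor("Manual", levels = levels(mt$am))
)
by_predict <- unname(predict(fit, newdata = new_car))

print(round(c(by_hand = by_hand, by_predict = by_predict), 2))
```

The two numbers should match, since `predict()` performs the same sum of coefficient times attribute that the model line describes. The observed mpg for that row is 21, so the difference from the prediction is that car's residual.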

Residual Plots and Diagnostics

The plots below were used to investigate the linear model assumptions: errors are independent, normally distributed, and have constant variance. The assumptions appear reasonable, as described below the figure.

par(mfrow=c(2,2))
plot(model_opt,pch=23,col="orange",cex=2.5,cex.lab=1.6,lwd=3)

Plot analysis, from left to right and top to bottom: 1) Residuals vs Fitted: the residuals (the vertical distance from each point to its fitted value) show no clear pattern and scatter randomly about the dotted zero line. 2) Normal Q-Q: the residuals for the most part follow the line, so they can be assumed to be approximately normally distributed. 3) Scale-Location: the red line is fairly flat, suggesting homoscedasticity, i.e. the spread of the residuals is roughly constant across fitted values. 4) Residuals vs Leverage: none of the observations has a Cook’s distance greater than 0.5, so no single car unduly influences the fit.
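The visual checks can be complemented with numbers. The sketch below is not part of the original analysis: it refits the same model on a fresh copy of the data, runs a Shapiro-Wilk test on the residuals (a large p-value is consistent with normality) and reports the largest Cook's distance. The names `mt`, `fit`, `sw` and `cd` are illustrative.

```r
# Sketch: numeric companions to the Q-Q and leverage plots.
data(mtcars)
mt <- mtcars                                   # illustrative working copy
mt$cyl <- factor(mt$cyl)
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ cyl + hp + wt + am, data = mt)

sw <- shapiro.test(residuals(fit))             # H0: residuals are normally distributed
cd <- cooks.distance(fit)                      # one influence value per car

print(sw$p.value)
print(round(max(cd), 3))
print(names(which.max(cd)))                    # most influential car
```

With only 32 observations these tests have limited power, so they are best read alongside the plots rather than in place of them.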

In conclusion, which type of car transmission achieves better fuel efficiency is uncertain, as other car attributes (horsepower, car weight and number of cylinders) may be better indicators of fuel efficiency. This model could be further refined with techniques such as reducing collinearity between explanatory variables, for example between horsepower and number of cylinders or weight.
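A quick way to gauge the collinearity mentioned above is a correlation matrix of the numeric predictors. This sketch is an illustration rather than part of the original analysis; it reloads `mtcars` so that `cyl` is numeric again, and the name `predictor_cor` is illustrative.

```r
# Sketch: pairwise correlations among the model's predictors.
data(mtcars)                                   # fresh copy: cyl is numeric here
predictor_cor <- cor(mtcars[, c("cyl", "hp", "wt")])
print(round(predictor_cor, 2))
```

Strong positive correlations here mean the predictors carry overlapping information, which inflates the standard errors of the individual coefficients and makes each one harder to interpret on its own.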

Appendix

Exploratory Data Analysis of mtcars
#?mtcars ##get dataset attribute info 
str(mtcars) ##look at structure/class 
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars) ##look at distribution of attributes
##       mpg        cyl         disp             hp             drat      
##  Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
##  1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
##  Median :19.20   8:14   Median :196.3   Median :123.0   Median :3.695  
##  Mean   :20.09          Mean   :230.7   Mean   :146.7   Mean   :3.597  
##  3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
##  Max.   :33.90          Max.   :472.0   Max.   :335.0   Max.   :4.930  
##        wt             qsec       vs             am     gear   carb  
##  Min.   :1.513   Min.   :14.50   0:18   Automatic:19   3:15   1: 7  
##  1st Qu.:2.581   1st Qu.:16.89   1:14   Manual   :13   4:12   2:10  
##  Median :3.325   Median :17.71                         5: 5   3: 3  
##  Mean   :3.217   Mean   :17.85                                4:10  
##  3rd Qu.:3.610   3rd Qu.:18.90                                6: 1  
##  Max.   :5.424   Max.   :22.90                                8: 1
library(ggplot2) ##open up plotting package

Boxplot comparison of mpg explained by transmission type

## boxplot of mpg for each transmission type
p=ggplot(mtcars,aes(x=am,y=mpg,fill=am))+geom_boxplot()+labs(x="Transmission Type",y="Miles Per Gallon (mpg)")
p+scale_fill_brewer(palette="Purples")+ theme(legend.position = "none")

Simple Regression

simple_model=lm(mpg~am,mtcars);summary(simple_model)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

Multiple Regression

model_all=lm(mpg~.,mtcars)
model_opt=step(model_all,direction="both",trace=FALSE)
summary(model_opt)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## amManual     1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10
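
The non-significance of transmission type can also be checked by comparing nested models with and without am. This sketch is an addition to the original output; the names `mt`, `reduced`, `full`, `cmp` and `p_am` are illustrative.

```r
# Sketch: F-test for adding transmission type to the cyl + hp + wt model.
data(mtcars)
mt <- mtcars                                   # illustrative working copy
mt$cyl <- factor(mt$cyl)
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

reduced <- lm(mpg ~ cyl + hp + wt, data = mt)
full <- lm(mpg ~ cyl + hp + wt + am, data = mt)

cmp <- anova(reduced, full)                    # does am explain extra variance?
p_am <- cmp[["Pr(>F)"]][2]
print(cmp)
```

Because am contributes a single coefficient, this F-test is equivalent to the t-test on amManual in the summary above, so `p_am` should match the 0.20646 reported there.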