The objective of this analysis was to determine whether fuel efficiency, measured in miles per gallon (mpg), is higher in automatic or manual transmission cars, and to quantify any difference. The built-in R dataset, “mtcars”, was used. This dataset contains 32 rows, one per car model from 1973-1974. Each row has 11 columns of car attributes such as mpg and horsepower (hp); see Appendix for more information.
The dataset mtcars was loaded and inspected; see Appendix. Manual transmission cars averaged 7.25 mpg more than automatic cars. However, an R-squared of 0.36 showed that transmission type (am: automatic or manual) explained only 36% of the mpg variance. To increase the explained variance, a multiple regression model was fitted with explanatory variables chosen by a step-wise procedure: horsepower (hp), weight (wt, per 1000 lbs of car), number of cylinders (cyl), and transmission type. This model improved the R-squared to 0.87; transmission type was retained by the procedure, but its coefficient was not significant (p = 0.21 > 0.05).
This study demonstrated that a book should not be judged by its cover, or in this case that data should not be judged by a quick glance and a single test. A more thorough analysis showed that fuel economy (mpg) was better described by weight, horsepower and number of cylinders than by transmission type. In conclusion, which transmission type achieves better fuel efficiency remains uncertain, because other car attributes (horsepower, car weight and number of cylinders) may be better indicators of fuel efficiency.
Load dataset and transform categorical variables to factors
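data(mtcars)
# Convert the categorical columns from numeric codes to factors
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am, labels = c("Automatic", "Manual"))  # 0 = automatic, 1 = manual
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)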
A simple regression model of mpg explained by transmission type alone gives a 7.25 mpg increase from automatic to manual. The fitted line (yhat = 17.1 + 7.25x, where x = 0 for automatic and x = 1 for manual) means that an automatic car achieves on average 17.1 mpg, while a manual car achieves on average 7.25 mpg more, or about 24.4 mpg. However, the R-squared is only 0.36, which means that transmission type explains only 36% of the mpg variance. A model with additional explanatory variables is used below to improve the R-squared.
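As a quick check on this interpretation, the following sketch compares the coefficients of the simple model with the mean mpg of each transmission group. The object names (d, fit, group_means, mean_diff) are illustrative and are not part of the original analysis.

```r
# Illustrative check: with a single two-level factor as the only predictor,
# the intercept equals the mean of the reference group (Automatic) and the
# slope equals the difference between the two group means.
data(mtcars)
d <- mtcars  # copy, so the sketch does not depend on earlier transformations
d$am <- factor(d$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = d)

# Mean mpg per transmission type
group_means <- tapply(d$mpg, d$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

print(coef(fit))     # intercept and amManual slope
print(group_means)   # group means for comparison
print(mean_diff)     # should match the amManual slope
```

If the two printouts agree, the 7.25 mpg figure is simply the raw difference in group means, which is why it says nothing about weight, horsepower or cylinders.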
A step-wise procedure (removing and adding back explanatory variables) was performed to select the explanatory variables, car attributes, that best explain mpg. The model line (yhat = 33.7 - 3.03cyl6 - 2.16cyl8 - 0.03hp - 2.50wt + 1.81manual) can be interpreted as follows: the intercept, 33.7, is the baseline for a 4-cylinder automatic car (extrapolated to zero horsepower and zero weight), and mpg is adjusted by -3.03 for a 6-cylinder, -2.16 for an 8-cylinder, -0.03 for every 1 hp increase in horsepower, -2.50 for every 1000 lb increase in car weight and +1.81 for a manual. In conclusion, transmission type (p = 0.21 > 0.05) is not a significant predictor of fuel efficiency (mpg) once the other predictors are included; the predictors hp, wt and cyl improved the R-squared from 0.36 to 0.87.
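One way to probe the contribution of transmission type is a nested-model F-test, sketched below. This comparison is not part of the original analysis; the formula is the one reported by the step-wise procedure in the Appendix, and the object names (d, reduced, full, cmp, p_am) are illustrative.

```r
# Illustrative nested-model comparison: does adding 'am' to a model that
# already contains cyl, hp and wt significantly reduce the residual error?
data(mtcars)
d <- mtcars
d$cyl <- factor(d$cyl)
d$am <- factor(d$am, labels = c("Automatic", "Manual"))

reduced <- lm(mpg ~ cyl + hp + wt, data = d)       # without transmission type
full    <- lm(mpg ~ cyl + hp + wt + am, data = d)  # formula selected by step()

cmp <- anova(reduced, full)   # F-test for the single added term
p_am <- cmp[["Pr(>F)"]][2]    # p-value for adding 'am'

print(cmp)
print(c(reduced = summary(reduced)$r.squared,
        full    = summary(full)$r.squared))
```

Because only one term is added, the F-test p-value matches the t-test p-value of amManual in the model summary, so this is a restatement of the same evidence rather than an independent test.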
Graphics were developed below to investigate the linear model assumptions: errors are independent, normally distributed, and have constant variance. The assumptions appear reasonable, as described below the figure.
par(mfrow=c(2,2))
plot(model_opt,pch=23,col="orange",cex=2.5,cex.lab=1.6,lwd=3)
Plot analysis from left to right, top to bottom: 1) The residuals (the vertical distance of each point from the fitted regression) show no pattern; they scatter randomly about the dotted line. 2) The residuals in the Quantile-Quantile plot mostly follow the line and can be assumed to be normally distributed. 3) The red line is fairly flat, indicating homoscedasticity: the spread of the residuals is roughly constant across the fitted values. 4) None of the points has a Cook’s distance greater than 0.5.
In conclusion, which transmission type achieves better fuel efficiency is uncertain, as other car attributes (horsepower, car weight and number of cylinders) may be better indicators of fuel efficiency. This model could be further refined by techniques such as reducing the collinearity between explanatory variables, for example between horsepower and number of cylinders or weight.
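The visual checks can be complemented with numeric ones. The sketch below refits the model with the formula reported in the Appendix, then applies a Shapiro-Wilk test to the residuals and reports the largest Cook's distance. The Shapiro-Wilk test is an addition for illustration and was not used in the original analysis; the object names (d, fit, res, sw, cd) are likewise illustrative.

```r
# Illustrative numeric companions to the diagnostic plots.
data(mtcars)
d <- mtcars
d$cyl <- factor(d$cyl)
d$am <- factor(d$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ cyl + hp + wt + am, data = d)  # formula from the Appendix output

res <- residuals(fit)
sw <- shapiro.test(res)    # H0: residuals are normally distributed
cd <- cooks.distance(fit)  # influence of each observation on the fit

print(sw$p.value)          # a small value would cast doubt on normality
print(max(cd))             # compare against the 0.5 reference contour
print(names(which.max(cd)))  # the most influential car
```

With only 32 observations these tests have limited power, so they are best read alongside the plots rather than in place of them.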
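The overlap between explanatory variables can be gauged with pairwise correlations and variance inflation factors, as in the sketch below. It treats cyl as a numeric count rather than a factor, which is a simplification relative to the fitted model, and the object names (preds, cor_mat, vif) are illustrative.

```r
# Illustrative collinearity check on the numeric predictors.
data(mtcars)
preds <- mtcars[, c("cyl", "hp", "wt")]  # cyl kept numeric for this sketch

# Pairwise Pearson correlations between the predictors
cor_mat <- cor(preds)
print(round(cor_mat, 2))

# Variance inflation factor for each predictor: 1 / (1 - R^2), where R^2
# comes from regressing that predictor on the remaining predictors.
vif <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))
```

Large correlations or inflation factors would suggest that the individual coefficients for cyl, hp and wt are hard to separate, even when the model as a whole fits well.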
#?mtcars ##get dataset attribute info
str(mtcars) ##look at structure/class
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars) ##look at distribution of attributes
## mpg cyl disp hp drat
## Min. :10.40 4:11 Min. : 71.1 Min. : 52.0 Min. :2.760
## 1st Qu.:15.43 6: 7 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
## Median :19.20 8:14 Median :196.3 Median :123.0 Median :3.695
## Mean :20.09 Mean :230.7 Mean :146.7 Mean :3.597
## 3rd Qu.:22.80 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
## Max. :33.90 Max. :472.0 Max. :335.0 Max. :4.930
## wt qsec vs am gear carb
## Min. :1.513 Min. :14.50 0:18 Automatic:19 3:15 1: 7
## 1st Qu.:2.581 1st Qu.:16.89 1:14 Manual :13 4:12 2:10
## Median :3.325 Median :17.71 5: 5 3: 3
## Mean :3.217 Mean :17.85 4:10
## 3rd Qu.:3.610 3rd Qu.:18.90 6: 1
## Max. :5.424 Max. :22.90 8: 1
library(ggplot2) ##open up plotting package
Boxplot comparison of mpg explained by transmission type
p <- ggplot(mtcars, aes(x = am, y = mpg, fill = am)) + geom_boxplot() + labs(x = "Transmission Type", y = "Miles Per Gallon (mpg)")
p + scale_fill_brewer(palette = "Purples") + theme(legend.position = "none")
Simple Regression
simple_model=lm(mpg~am,mtcars);summary(simple_model)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
Multiple Regression
model_all <- lm(mpg ~ ., data = mtcars)  # start from all attributes as predictors
model_opt <- step(model_all, direction = "both", trace = FALSE)  # step-wise selection by AIC
summary(model_opt)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## amManual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10