Miles per Gallon Analysis

SONJA OFFWOOD

Executive Summary

The following analysis uses the mtcars dataset from R to compare miles per gallon (mpg) for automatic and manual vehicles. The analysis indicates a clear difference in mpg between the two groups: manual cars travel further per gallon. However, other variables also come into play when modelling mpg, including the weight and the horsepower of the car.

Data Processing and Analysis

To start, we load the data and required packages into R. We also convert the categorical variables (in this case transmission and number of cylinders) into factors.

data(mtcars)
library(ggplot2)
mtcars$am = factor(mtcars$am, levels=c(0,1),labels=c('Automatic','Manual'))
mtcars$cyl = factor(mtcars$cyl,levels=c(4,6,8), labels=c("4cyl","6cyl","8cyl")) 
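As a quick numeric companion to the boxplot in the appendix, the group sizes and mean mpg per transmission type can be tabulated before any modelling. The sketch below is an illustration and not part of the original analysis. The names `cars`, `counts`, `group_means` and `raw_test` are assumptions introduced here, and the Welch t-test ignores every other variable in the dataset.

```r
# Work on a copy so the document's own `mtcars` object is left untouched.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Number of cars and mean mpg for each transmission type.
counts <- table(cars$am)
group_means <- tapply(cars$mpg, cars$am, mean)
print(counts)
print(round(group_means, 2))

# Welch two-sample t-test of mpg between the two transmission types.
# This is an unadjusted comparison: weight, horsepower etc. are not controlled for.
raw_test <- t.test(mpg ~ am, data = cars)
print(raw_test$p.value)
```

A small p-value here only says that the raw group means differ; it does not say whether the gap survives once other predictors are included.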

At first glance (see boxplot in appendix) it appears as though manual cars have a higher mpg than automatic cars. We will attempt to fit a linear model using mpg as the dependent variable and a combination of the other variables as the predictors. After initial analysis (see plots in the appendix), the variables am, cyl, wt and hp all appear to be related to the miles per gallon of cars, especially in relation to the transmission of the car. Adding these variables one at a time gives a sequence of nested models, which we compare in an anova table to find out which additions significantly improve the fit. Note that the models below exclude the intercept, so each transmission type gets its own coefficient.

fit1=lm(mpg~factor(am)-1, data=mtcars)
fit2=update(fit1, mpg~factor(am)+wt-1)
fit3=update(fit1, mpg~factor(am)+wt+hp-1)
fit4=update(fit1, mpg~factor(am)+wt+hp+factor(cyl)-1)
anova(fit1,fit2,fit3,fit4)
## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am) - 1
## Model 2: mpg ~ factor(am) + wt - 1
## Model 3: mpg ~ factor(am) + wt + hp - 1
## Model 4: mpg ~ factor(am) + wt + hp + factor(cyl) - 1
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 278.32  1    442.58 76.1924  3.32e-09 ***
## 3     28 180.29  1     98.03 16.8762 0.0003525 ***
## 4     26 151.03  2     29.27  2.5191 0.0999982 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
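Each F value in the table is the drop in residual sum of squares per added degree of freedom, divided by the residual mean square of the largest model (Model 4). The sketch below recomputes the F statistics and p-values by hand to make that arithmetic explicit. The names `cars`, `fits`, `rss`, `rdf`, `scale_ms`, `f_manual` and `p_manual` are assumptions introduced for illustration.

```r
# Work on a copy so the document's own `mtcars` object is left untouched.
cars <- datasets::mtcars
cars$am  <- factor(cars$am,  levels = c(0, 1),    labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl, levels = c(4, 6, 8), labels = c("4cyl", "6cyl", "8cyl"))

# The same four nested models as above.
fits <- list(
  lm(mpg ~ am - 1,                 data = cars),
  lm(mpg ~ am + wt - 1,            data = cars),
  lm(mpg ~ am + wt + hp - 1,       data = cars),
  lm(mpg ~ am + wt + hp + cyl - 1, data = cars)
)

rss <- sapply(fits, deviance)     # residual sum of squares of each model
rdf <- sapply(fits, df.residual)  # residual degrees of freedom of each model

# Denominator: residual mean square of the largest model.
scale_ms <- rss[4] / rdf[4]

# Numerator: reduction in RSS per degree of freedom spent at each step.
df_used  <- -diff(rdf)
f_manual <- (-diff(rss) / df_used) / scale_ms
p_manual <- pf(f_manual, df_used, rdf[4], lower.tail = FALSE)

print(round(f_manual, 4))
print(signif(p_manual, 4))
```

The hand-computed values can be compared against the `F` and `Pr(>F)` columns returned by `anova()` on the same four fits.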

Using a 1% significance level, we can see from the p-values above that adding the weight and horsepower variables significantly improves the model, whereas adding the cylinder variable does not (p = 0.10). We will use fit3 as the model going forward, as it appears to be the best fit at the 1% level. The model includes transmission, weight and horsepower as the predictor variables. We will also examine the residuals of this model.

fit3=lm(mpg~factor(am)+wt+hp-1,data=mtcars)
summary(fit3)
## 
## Call:
## lm(formula = mpg ~ factor(am) + wt + hp - 1, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4221 -1.7924 -0.3788  1.2249  5.5317 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## factor(am)Automatic 34.002875   2.642659  12.867 2.82e-13 ***
## factor(am)Manual    36.086585   1.736338  20.783  < 2e-16 ***
## wt                  -2.878575   0.904971  -3.181 0.003574 ** 
## hp                  -0.037479   0.009605  -3.902 0.000546 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared:  0.9872, Adjusted R-squared:  0.9853 
## F-statistic: 538.2 on 4 and 28 DF,  p-value: < 2.2e-16

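Because the intercept is dropped, the two factor(am) coefficients are the fitted intercepts for each transmission type, and their t-tests compare each intercept with zero instead of comparing the transmissions with each other. The estimated difference is 36.09 - 34.00, or about 2.08 mpg in favour of manual cars with weight and horsepower held fixed; this output alone does not show whether that difference is significant. The reported R-squared of 0.9872 is likewise measured about zero and not about the mean mpg, so it overstates the fit relative to a model with an intercept. There is a negative slope for both weight and horsepower, i.e. as the weight and the horsepower of the vehicle go up, the mpg reduces. This model is in line with the plots in the appendix of this document.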
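The gap between manual and automatic cars can be tested directly by refitting the same predictors with an intercept, so that the Manual coefficient becomes the difference from the Automatic baseline. This is a sketch of one possible follow-up, not the author's method. The names `cars`, `fit_noint`, `fit_int` and `gap` are assumptions introduced here.

```r
# Work on a copy so the document's own `mtcars` object is left untouched.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Same predictors, without and with an intercept.
fit_noint <- lm(mpg ~ am + wt + hp - 1, data = cars)
fit_int   <- lm(mpg ~ am + wt + hp,     data = cars)

# Difference between the two per-transmission intercepts of the no-intercept fit.
b   <- coef(fit_noint)
gap <- unname(b["amManual"] - b["amAutomatic"])
print(gap)

# In the intercept model the same quantity appears as the amManual coefficient,
# together with a standard error, t value and p-value for the difference itself.
print(coef(summary(fit_int))["amManual", ])
print(confint(fit_int, "amManual"))
```

Both parameterisations give identical fitted values; only the meaning of the coefficients, and therefore of their t-tests and of R-squared, changes.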

We next look at the residuals of this model.

par(mfrow=c(2,2))
plot(fit3)

The QQ plot is consistent with approximately normal residuals. Nothing noteworthy stands out in the residuals vs fitted plot, which suggests that the model is appropriate for the data.
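The visual check can be complemented with numeric diagnostics. The sketch below, which is an addition and not part of the original analysis, runs a Shapiro-Wilk test on the residuals and lists the cars with the largest Cook's distance. The names `cars`, `fit`, `res`, `sw` and `cook` are assumptions introduced here, and with only 32 observations the test has limited power.

```r
# Work on a copy so the document's own `mtcars` object is left untouched.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Same specification as fit3 above.
fit <- lm(mpg ~ am + wt + hp - 1, data = cars)
res <- residuals(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a large p-value means no evidence against normality.
sw <- shapiro.test(res)
print(sw)

# Cook's distance flags observations with a large influence on the fit.
cook <- cooks.distance(fit)
print(head(sort(cook, decreasing = TRUE), 3))
```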

As a next step, it is probably worth investigating interaction effects between the variables, to see if any of those are also significant in predicting mpg. However, given the small size of the dataset, including interaction terms may overfit the model.
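One such check could look like the sketch below, which adds a transmission-by-weight interaction and compares it with the additive model using a nested F-test. The choice of this particular interaction, and the names `cars`, `fit_additive`, `fit_interact` and `cmp`, are assumptions made for illustration; other interactions could be tested the same way.

```r
# Work on a copy so the document's own `mtcars` object is left untouched.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Additive model versus a model where the weight slope may differ by transmission.
# `am * wt` expands to am + wt + am:wt.
fit_additive <- lm(mpg ~ am + wt + hp, data = cars)
fit_interact <- lm(mpg ~ am * wt + hp, data = cars)

# Nested-model F-test for the single extra interaction term.
cmp <- anova(fit_additive, fit_interact)
print(cmp)
```

Each interaction term costs a degree of freedom, which is the overfitting concern with a dataset of this size.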

Appendix

# Boxplot of mpg for automatic and manual vehicles
qplot(am, mpg, data=mtcars, geom=c("boxplot", "jitter"), 
   fill=am, main="Mileage by Transmission",
   xlab="Transmission", ylab="Miles per Gallon")

# Boxplot of mpg for different cylinders 
qplot(cyl, mpg, data=mtcars, geom=c("boxplot", "jitter"), 
   fill=cyl, main="Mileage by Cylinder",
   xlab="Number of Cylinders", ylab="Miles per Gallon")

# Regression of mpg on weight for Automatic and Manual 
qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"), 
   method="lm", formula=y~x, color=am, 
   main="Regression of MPG on Weight", 
   xlab="Weight", ylab="Miles per Gallon")

# Regression of mpg on horsepower for Automatic and Manual 
qplot(hp, mpg, data=mtcars, geom=c("point", "smooth"), 
   method="lm", formula=y~x, color=am, 
   main="Regression of MPG on Horsepower", 
   xlab="Horsepower", ylab="Miles per Gallon")
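`qplot()` has been deprecated since ggplot2 3.4.0, so on a recent installation the calls above may emit deprecation warnings. As a hedged alternative, the weight regression plot can be written with `ggplot()` directly. This sketch assumes ggplot2 is installed; the names `cars` and `p` are introduced here for illustration.

```r
library(ggplot2)

# Work on a copy so the document's own `mtcars` object is left untouched.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Regression of mpg on weight, with a separate linear fit per transmission type.
p <- ggplot(cars, aes(x = wt, y = mpg, colour = am)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(title = "Regression of MPG on Weight",
       x = "Weight", y = "Miles per Gallon", colour = "Transmission")
print(p)
```

The other three plots follow the same pattern, with `geom_boxplot()` and `geom_jitter()` taking the place of the point and smooth layers for the boxplots.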