Executive Summary

Using a data set describing a collection of cars, we are interested in exploring the relationship between a set of variables and miles per gallon (MPG).

We are particularly interested in the following two questions:
- Is an automatic or manual transmission better for MPG?
- What is the MPG difference between automatic and manual transmissions?

Is an automatic or manual transmission better for MPG?

After performing a regression analysis with MPG as the outcome and the type of transmission as the predictor, we have determined that a manual transmission is better for fuel economy: vehicles fitted with a manual transmission tend to achieve a higher MPG than vehicles fitted with an automatic transmission.

What is the MPG difference between automatic and manual transmissions?

Our calculations and the linear model estimate that manual transmissions achieve, on average, 7.24 mpg more than automatic transmissions, with an average of 17.15 mpg for automatic transmissions and 24.39 mpg for manual transmissions.

Data Exploration

Library Load

library(dplyr)
library(GGally)

Data Transformation

cardata <- mtcars
dim(cardata)
## [1] 32 11
head(cardata)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
summary(cardata)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
class(cardata$mpg)
## [1] "numeric"
class(cardata$am)
## [1] "numeric"
#As the transmission variable "am" is numeric, we transform it to a factor
cardata$am <- as.factor(cardata$am)
class(cardata$am)
## [1] "factor"

Graphing the Data

Because the outcome variable, “mpg”, is a continuous variable, and the transmission-type variable, “am”, is a factor variable, a boxplot best represents the data.

with(cardata, plot(am, mpg, main = "MPG by Transmission", ylab = "Miles per Gallon", xlab = "Automatic : 0       Manual : 1"))

From this boxplot we see that the median MPG for automatic transmissions is around 17 mpg, while that of manual transmissions is around 23 mpg.
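
As a quick numeric check of the boxplot reading (a sketch using base R), the group means can also be computed directly; they are reported formally by the model coefficients below.

# Mean MPG by transmission type (0 = automatic, 1 = manual)
with(cardata, tapply(mpg, am, mean))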

Regression Analysis

Building Linear Model

Selected Model and Parameters

We use a standard linear regression model, as the outcome/dependent variable, “mpg”, is a continuous variable, and the predictor/independent variable, “am”, is a factor variable.

Because our predictor variable, “am”, is a factor, we can remove the intercept; each coefficient then directly estimates the mean MPG for its transmission type.
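
For comparison, a minimal sketch of the with-intercept form (not the model used below): the intercept is then the automatic-transmission mean, and the “am1” coefficient is the difference between the manual and automatic means.

# Sketch: with the intercept retained, "am1" estimates the manual-vs-automatic difference
coef(lm(mpg ~ am, data = cardata))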

Other model configurations and parameters tested will be listed at the end of the report.

lm_mpg <- lm(mpg ~ am - 1, data = cardata)

Inspecting the Linear Model

round(quantile(lm_mpg$residuals), 2)
##    0%   25%   50%   75%  100% 
## -9.39 -3.09 -0.30  3.24  9.51

The residuals are approximately normally distributed.
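
As an additional, hedged check (not part of the original write-up), a Shapiro-Wilk test on the residuals can complement the quantile summary; a non-significant p-value is consistent with approximate normality.

# Sketch: formal normality check of the residuals (base R stats package)
shapiro.test(lm_mpg$residuals)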

summary(lm_mpg)$coefficients
##     Estimate Std. Error  t value     Pr(>|t|)
## am0 17.14737   1.124603 15.24749 1.133983e-15
## am1 24.39231   1.359578 17.94109 1.376283e-17

The “am0” indicates transmission type 0 (Automatic), and “am1” indicates transmission type 1 (Manual).

The coefficients have low standard errors, and the extremely small p-values associated with the t-statistics indicate that the coefficients are statistically significant.

round(summary(lm_mpg)$r.squared, 3)
## [1] 0.949

The R-squared value is quite high, indicating that the type of transmission alone accounts for roughly 95% of the variation in miles per gallon in this fitted model. (Note that for a model without an intercept, R-squared is computed relative to zero rather than to the mean, so it should be interpreted with some care.)

Coefficient Interpretation

Considering the coefficients of this model:

lm_mpg$coefficients
##      am0      am1 
## 17.14737 24.39231

“am0” indicates Automatic and “am1” indicates Manual. The coefficient values provided are the average MPG for each of the two types of transmission.

From the values presented in the coefficient table, we see that automatic transmissions average 17.15 mpg, and manual transmissions average 24.39 mpg.
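
The 7.24 mpg difference quoted in the executive summary is simply the gap between these two coefficients; a minimal check:

# Difference between the manual and automatic group means (approximately 7.24 mpg)
unname(diff(lm_mpg$coefficients))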

Confidence Intervals

confint(lm_mpg)
##        2.5 %   97.5 %
## am0 14.85062 19.44411
## am1 21.61568 27.16894

Considering the confidence intervals we can, with 95% confidence, say that the average MPG for automatic transmissions is in the range of 14.85 mpg to 19.44 mpg, and the average MPG for manual transmissions is in the range of 21.62 mpg to 27.17 mpg.
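
As a related sketch (an addition, not part of the main analysis), refitting with the intercept retained yields a confidence interval for the manual-minus-automatic difference itself rather than for the two group means.

# Sketch: 95% CI for the difference in mean MPG between manual and automatic
confint(lm(mpg ~ am, data = cardata))["am1", ]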

Model Diagnostics

Anova

Anova computes the analysis of variance (or deviance) tables for one or more fitted model objects.

anova(lm_mpg)
## Analysis of Variance Table
## 
## Response: mpg
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## am         2 13321.4  6660.7  277.18 < 2.2e-16 ***
## Residuals 30   720.9    24.0                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Anova table confirms that our predictor variable, “am”, is statistically significant, and provides us with the breakdown of variance in the form of the SSR (sum of squared residuals).

Leverage

We will identify points that have high leverage, and that have the potential to affect the outcome of the model.

round(summary(hatvalues(lm_mpg)),3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.053   0.053   0.053   0.062   0.077   0.077

Judging by the summary output of the hat values we can see that there are no problematic data points, and that the range from minimum to maximum is quite narrow. It indicates that all data points have approximately the same leverage on the outcome of the model.

Influence

We will identify points that have a high influence on the outcome of the model.

These data points would most likely show up in the regression model’s diagnostic plots below.

influential <- as.data.frame(round(dfbetas(lm_mpg), 2))
influential$Vehicle <- row.names(dfbetas(lm_mpg))
influential <- select(influential, c(Vehicle, am1))
influential <- arrange(influential, desc(am1))

The two data points with the most positive influence are:

head(influential, 2)
##          Vehicle  am1
## 1 Toyota Corolla 0.62
## 2       Fiat 128 0.51

The two data points with the most negative influence are:

tail(influential, 2)
##           Vehicle   am1
## 31 Ford Pantera L -0.55
## 32  Maserati Bora -0.61

Model Diagnostics Plots

By plotting the various regression statistics against each other, we can diagnose the various factors at play in the selected model.

This plot shows the spread of the residuals, with some outliers/influencers indicated.

The near-horizontal level of the red line shows that the model fit is good.

plot(lm_mpg, 1)

This plot shows that the error terms are approximately normally distributed.

(Although not as closely as one would ideally like.)

plot(lm_mpg, 2)

This plot indicates the effect that the outliers have on the residuals.

One can clearly see that the influential points identified earlier have an impact on the residuals of the model.

plot(lm_mpg, 3)

This plot shows the influence each outlier (row names and observation numbers supplied) has on the outcome.

For more accurate predictions or model outcomes, one could consider removing the influential outliers.

plot(lm_mpg, 4)

The leverage of the outliers is indicated in this plot.

As the red line is nearly horizontal, we can see that the leverage of all points is approximately the same.

plot(lm_mpg, 5)
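
As an illustrative sketch only (the analysis above keeps all observations), one could refit the model without the influential vehicles identified by the dfbetas output to gauge how sensitive the coefficients are:

# Sketch: sensitivity check, excluding the most influential vehicles listed earlier
influencers <- c("Toyota Corolla", "Fiat 128", "Ford Pantera L", "Maserati Bora")
lm_mpg_trim <- lm(mpg ~ am - 1, data = cardata[!(rownames(cardata) %in% influencers), ])
summary(lm_mpg_trim)$coefficients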

Other Models Considered

Other linear regression model configurations were also considered: the model within the scope of the questions posed had an R-squared value of 0.949, i.e. the type of transmission alone explained 94.9% of the variance in MPG. Considering the various other factors that could affect MPG, it is logical to combine them with the type of transmission to see whether the R-squared value can be increased, thereby accounting for more of the variance in the MPG outcome.

Other variables available

Looking at the data, we can see that there are other variables that could have a potential impact on the MPG outcome:

head(cardata, 2)
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

Potential variables include:
- Displacement (disp)
- Weight (wt)
- Number of gears (gear)

We will construct a few additional models, adding these variables as predictors along with the type of transmission, and then see which model best predicts MPG with the given data.

MPG influenced by Transmission and Displacement

lm_td <- lm(mpg ~ am + disp - 1, data = cardata)
summary(lm_td)
## 
## Call:
## lm(formula = mpg ~ am + disp - 1, data = cardata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6382 -2.4751 -0.5631  2.2333  6.8386 
## 
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)    
## am0  27.848081   1.834071  15.184 2.45e-15 ***
## am1  29.681539   1.218689  24.355  < 2e-16 ***
## disp -0.036851   0.005782  -6.373 5.75e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.218 on 29 degrees of freedom
## Multiple R-squared:  0.9786, Adjusted R-squared:  0.9764 
## F-statistic: 442.4 on 3 and 29 DF,  p-value: < 2.2e-16

The model shows that, with the addition of displacement to type of transmission, the R-squared value increases somewhat, and we can see that the displacement variable also has a significant impact on the outcome.

MPG influenced by Transmission, Displacement, and Weight

lm_tdw <- lm(mpg ~ am + disp + wt - 1, data = cardata)
summary(lm_tdw)
## 
## Call:
## lm(formula = mpg ~ am + disp + wt - 1, data = cardata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4890 -2.4106 -0.7232  1.7503  6.3293 
## 
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)    
## am0  34.675911   3.240609  10.700 2.12e-11 ***
## am1  34.853635   2.376440  14.666 1.14e-14 ***
## disp -0.017805   0.009375  -1.899   0.0679 .  
## wt   -3.279044   1.327509  -2.470   0.0199 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.967 on 28 degrees of freedom
## Multiple R-squared:  0.9824, Adjusted R-squared:  0.9799 
## F-statistic: 391.7 on 4 and 28 DF,  p-value: < 2.2e-16

By adding weight alongside transmission and displacement, the R-squared value increases slightly, but displacement is no longer statistically significant at the 5% level (p ≈ 0.068).
That being said, this is the model with the highest R-squared value.

Note that the R-squared value only shows how much of the variance is explained by the predictor variables in the model, and should not be the sole consideration or deciding factor for model fit.
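
One hedged alternative criterion (an addition for illustration, not used elsewhere in this report) is AIC, which penalises model complexity; lower values indicate a better trade-off between fit and parsimony.

# Sketch: compare the candidate models on AIC (lower is better)
AIC(lm_mpg, lm_td, lm_tdw)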

MPG influenced by Displacement, Weight, and number of Gears

lm_dwg <- lm(mpg ~ disp + wt + gear - 1, data = cardata)
summary(lm_dwg)
## 
## Call:
## lm(formula = mpg ~ disp + wt + gear - 1, data = cardata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5520  -2.1574   0.4345   2.9138  10.3726 
## 
## Coefficients:
##      Estimate Std. Error t value Pr(>|t|)    
## disp -0.02933    0.01587  -1.848   0.0748 .  
## wt    1.85260    1.69342   1.094   0.2830    
## gear  5.55331    0.67944   8.173 5.18e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.073 on 29 degrees of freedom
## Multiple R-squared:  0.9468, Adjusted R-squared:  0.9413 
## F-statistic: 172.2 on 3 and 29 DF,  p-value: < 2.2e-16

The R-squared value is high, explaining 94.7% of the variance in the MPG outcome, but the other models have more significant variables and higher R-squared values.

Anova Diagnostics

We will run an Anova diagnostic (as an example) on the model with the highest R-squared value:

anova(lm_tdw)
## Analysis of Variance Table
## 
## Response: mpg
##           Df  Sum Sq Mean Sq  F value    Pr(>F)    
## am         2 13321.4  6660.7 756.4188 < 2.2e-16 ***
## disp       1   420.6   420.6  47.7669 1.637e-07 ***
## wt         1    53.7    53.7   6.1013   0.01987 *  
## Residuals 28   246.6     8.8                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this Anova analysis we can see that all the variables are statistically significant.
Anova can also be used when you nest multiple models, to compare them and see the effect each newly added variable (added during the nesting process, for example with the update function) has on the fit, as sketched below.
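
A minimal sketch of that nested comparison, building the models with update() as mentioned:

# Sketch: nested model comparison, adding one predictor at a time via update()
fit_base <- lm(mpg ~ am - 1, data = cardata)
fit_disp <- update(fit_base, . ~ . + disp)
fit_wt   <- update(fit_disp, . ~ . + wt)
anova(fit_base, fit_disp, fit_wt)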

Multicollinearity

As we are using many variables to determine the MPG outcome, we need to ensure that there isn’t multicollinearity.
There is a chance that there is a linear relationship between displacement and weight.
Let’s plot the pairs to see the relationship between the variables:

g <- ggpairs(cardata[, c(1, 3, 6, 10)], lower = list(continuous = wrap("smooth", method = "loess")))
g

From this we can see that some variables have a linear relationship with each other, and thus we have multicollinearity.
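
A quick numeric complement to the pairs plot (a sketch): the pairwise correlations between the continuous predictors considered in the candidate models.

# Sketch: pairwise correlations between the continuous predictors
round(cor(cardata[, c("disp", "wt", "gear")]), 2)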

When there is an intercept in the model we can use the VIF function to determine how much variance inflation the multicollinear variables have caused. (The vif function can be found in the “car” or “HH” packages.)

For the VIF example we will NOT remove the intercept, otherwise the VIF output is not meaningful.

testvif <- lm(mpg ~ am + disp + wt, data = cardata)
car::vif(testvif)
##       am     disp       wt 
## 1.931264 4.752561 5.939675

These values show that the variance of the displacement coefficient is inflated by a factor of roughly 4.8 due to its correlation with the other predictors, and the variance of the weight coefficient by a factor of roughly 5.9.