Introduction

In this project, we use the mtcars dataset to explore how miles per gallon (MPG) is affected by different factors. In particular, we address two questions: (1) Is an automatic or manual transmission better for MPG? (2) How large is the MPG difference between automatic and manual transmissions?

Loading the necessary libraries

The following libraries are used for data handling (data.table), plotting (ggplot2), model selection (leaps), and table rendering (printr). The mtcars dataset itself is read and copied in the next section.

library(data.table)
library(ggplot2)
library(leaps)
library(printr)

Loading and preprocessing the data

Let’s fetch the data first, change categorical variables to factors, and relabel am to Automatic and Manual.

data("mtcars")
mtcars_num <- copy(mtcars)  # keep an all-numeric copy before converting columns to factors
head(mtcars)
                   mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4         21.0    6   160  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag     21.0    6   160  110  3.90  2.875  17.02   0   1     4     4
Datsun 710        22.8    4   108   93  3.85  2.320  18.61   1   1     4     1
Hornet 4 Drive    21.4    6   258  110  3.08  3.215  19.44   1   0     3     1
Hornet Sportabout 18.7    8   360  175  3.15  3.440  17.02   0   0     3     2
Valiant           18.1    6   225  105  2.76  3.460  20.22   1   0     3     1
as.data.frame(t(sapply(mtcars, class)))
mpg cyl disp hp drat wt qsec vs am gear carb
numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric
#data preparation
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am, labels = c("Automatic","Manual"))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
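
The conversions above can be checked with a short sketch like the one below. It works on a local copy (the name cars_df is only illustrative) and uses sapply() rather than apply(), because apply() converts the data frame to a matrix first and then reports the same class for every column.

```r
# Work on a local copy so the built-in data set stays untouched
cars_df <- datasets::mtcars
cars_df$cyl  <- factor(cars_df$cyl)
cars_df$vs   <- factor(cars_df$vs)
cars_df$am   <- factor(cars_df$am, labels = c("Automatic", "Manual"))
cars_df$gear <- factor(cars_df$gear)
cars_df$carb <- factor(cars_df$carb)

# sapply() inspects each column separately, so factors are reported as factors
col_classes <- sapply(cars_df, class)
print(col_classes)

# apply() coerces to a matrix; with factors present the matrix is character,
# so every column is reported with the same class
matrix_classes <- apply(cars_df, 2, class)
print(unique(matrix_classes))

# Cross-tabulate to confirm that 0 became Automatic and 1 became Manual
print(table(original = datasets::mtcars$am, relabelled = cars_df$am))
```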

Perform a t-test to determine whether the difference is significant

# store the result under its own name so the t.test() function is not masked
t_result <- t.test(mpg ~ am, data = mtcars)
t_result
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

The t-test rejects the null hypothesis that the difference in means is zero (p = 0.001374). Mean MPG therefore differs significantly between transmission types: automatic cars average 17.15 MPG, compared with 24.39 MPG for manual cars.
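
As a cross-check, the Welch statistic and its degrees of freedom can be recomputed by hand from the group means and variances. This is only a sketch; the names cars_df, t_by_hand, and df_by_hand are illustrative and not part of the original analysis.

```r
cars_df <- datasets::mtcars
cars_df$am <- factor(cars_df$am, labels = c("Automatic", "Manual"))

welch <- t.test(mpg ~ am, data = cars_df)

auto   <- cars_df$mpg[cars_df$am == "Automatic"]
manual <- cars_df$mpg[cars_df$am == "Manual"]

# Per-group variance of the mean
v_auto   <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over its standard error
t_by_hand <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation to the degrees of freedom
df_by_hand <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

print(c(t = t_by_hand, df = df_by_hand))
print(welch$conf.int)  # 95% interval for mean(Automatic) - mean(Manual)
```

Because the interval is for Automatic minus Manual, both of its bounds are negative when manual cars have the higher mean MPG.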

Estimate with basic linear regression

The most basic way to estimate the difference is to fit a simple linear regression with MPG as the dependent variable and transmission type as the only predictor.

basic_fit <- lm(mpg ~ am, mtcars)
summary(basic_fit)$coefficients
             Estimate  Std. Error    t value  Pr(>|t|)
(Intercept) 17.147368    1.124602  15.247492  0.000000
amManual     7.244939    1.764422   4.106127  0.000285
summary(basic_fit)$r.squared
## [1] 0.3597989

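The simple regression with transmission as the only predictor explains just 36% of the variation in MPG, so other variables must account for much of the rest. We therefore examine the influence of additional predictors using stepwise regression and best subsets regression.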
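
With a single two-level factor as predictor, the regression coefficients are just a re-expression of the group means: the intercept is the mean of the reference group (Automatic) and the amManual coefficient is the difference in means. A minimal sketch, with illustrative object names:

```r
cars_df <- datasets::mtcars
cars_df$am <- factor(cars_df$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = cars_df)
group_means <- tapply(cars_df$mpg, cars_df$am, mean)

# Intercept = mean of the reference level (Automatic)
intercept <- coef(fit)[["(Intercept)"]]
# Slope = mean(Manual) - mean(Automatic)
slope <- coef(fit)[["amManual"]]

print(group_means)
print(c(intercept = intercept, slope = slope))
print(summary(fit)$r.squared)  # share of MPG variance explained by am alone
```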

Stepwise linear regression vs. best subsets regression

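# fit the full model first; naming it full_fit avoids masking the step() function
full_fit <- lm(mpg ~ ., data = mtcars)
step_fit <- step(full_fit, direction = "both", trace = FALSE)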

# now perform a best subsets regression
# reference: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/155-best-subsets-regression-essentials-in-r/
best_subsets <- regsubsets(mpg ~ ., mtcars)
# mtcars has 10 candidate predictors (more once factors are expanded into dummy variables);
# nvmax is left at its default of 8, so subsets of up to 8 variables are searched.
best_subsets_summary <- summary(best_subsets)
adjr2 <- which.max(best_subsets_summary$adjr2)
cp <- which.min(best_subsets_summary$cp)
bic <- which.min(best_subsets_summary$bic)
best_set <- best_subsets_summary$outmat[c(adjr2,cp),]
sub3_fit <- lm(mpg ~ am + wt + qsec, mtcars)
sub5_fit <- lm(mpg ~ am + cyl + hp + wt + vs, mtcars)
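
The stepwise result is stored but not printed above. A sketch of how it could be inspected is shown below; it rebuilds the factor-coded data under the illustrative name cars_df and uses only base R functions.

```r
cars_df <- datasets::mtcars
cars_df$cyl  <- factor(cars_df$cyl)
cars_df$vs   <- factor(cars_df$vs)
cars_df$am   <- factor(cars_df$am, labels = c("Automatic", "Manual"))
cars_df$gear <- factor(cars_df$gear)
cars_df$carb <- factor(cars_df$carb)

full_fit <- lm(mpg ~ ., data = cars_df)
step_fit <- step(full_fit, direction = "both", trace = FALSE)

# Formula of the model that stepwise selection settled on
print(formula(step_fit))

# Path taken: which terms were dropped or added at each step
print(step_fit$anova)

# step() selects by AIC, so the chosen model should not have a higher AIC
aic_values <- c(full = AIC(full_fit), stepwise = AIC(step_fit))
print(aic_values)
```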

The function summary() reports the best set of variables for each model size. An asterisk indicates that a given variable is included in the corresponding model.

For example, the output shows that the best two-variable model contains only wt and hp (mpg ~ wt + hp), the best three-variable model is mpg ~ wt + hp + amManual, and so forth.
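
The selected terms can also be read off programmatically rather than from the asterisk matrix. The sketch below assumes the leaps package is installed and uses its coef() method for regsubsets objects; the object names and the explicit nvmax = 8 (the package default) are illustrative choices.

```r
library(leaps)

cars_df <- datasets::mtcars
cars_df$cyl  <- factor(cars_df$cyl)
cars_df$vs   <- factor(cars_df$vs)
cars_df$am   <- factor(cars_df$am, labels = c("Automatic", "Manual"))
cars_df$gear <- factor(cars_df$gear)
cars_df$carb <- factor(cars_df$carb)

subsets <- regsubsets(mpg ~ ., data = cars_df, nvmax = 8)
subsets_summary <- summary(subsets)

# Model size preferred by each criterion
best_sizes <- c(
  adjr2 = which.max(subsets_summary$adjr2),
  cp    = which.min(subsets_summary$cp),
  bic   = which.min(subsets_summary$bic)
)
print(best_sizes)

# Terms (including the intercept) in the best three-variable model
three_var_terms <- names(coef(subsets, id = 3))
print(three_var_terms)
```

Note that with factor predictors the search runs over individual dummy columns (for example amManual), not over whole factors.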

Now we need to decide which of these models should finally be used for our analysis.

Model Selection

Both the best subsets regression and the BIC criterion computed in the step above suggest that model 3 is the best. Using the tidyverse and broom packages, the code below gives an overall picture of the two models in terms of model performance metrics.

library(tidyverse)
## -- Attaching packages -------------------------------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.1     v purrr   0.3.2
## v tidyr   0.8.3     v dplyr   0.8.1
## v readr   1.3.1     v stringr 1.4.0
## v tibble  2.1.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::between()   masks data.table::between()
## x dplyr::filter()    masks stats::filter()
## x dplyr::first()     masks data.table::first()
## x dplyr::lag()       masks stats::lag()
## x dplyr::last()      masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(modelr)
library(broom)
## 
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
## 
##     bootstrap
# Metrics for model 3
glance(sub3_fit) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
adj.r.squared sigma AIC BIC p.value
0.8335561 2.458846 154.1194 161.4481 0
# Metrics for model 5
glance(sub5_fit) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
adj.r.squared sigma AIC BIC p.value
0.8417804 2.397329 154.8713 166.5972 0

The output above shows that the two models have nearly the same adjusted R2 (0.83 vs. 0.84), meaning that they explain the outcome, mpg, about equally well. Their residual standard errors are also similar (RSE, or sigma, is 2.46 vs. 2.40). However, model 3 is simpler than model 5 because it uses fewer variables, and it has the lower AIC (154.1 vs. 154.9) and BIC (161.4 vs. 166.6). When two models perform about equally well, the simpler one is generally preferred, so we select model 3.
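
The same comparison can be reproduced without broom, using only base R. This is a sketch with illustrative names (cars_df, comparison); the two formulas are the ones fitted earlier.

```r
cars_df <- datasets::mtcars
cars_df$cyl <- factor(cars_df$cyl)
cars_df$vs  <- factor(cars_df$vs)
cars_df$am  <- factor(cars_df$am, labels = c("Automatic", "Manual"))

sub3_fit <- lm(mpg ~ am + wt + qsec, data = cars_df)
sub5_fit <- lm(mpg ~ am + cyl + hp + wt + vs, data = cars_df)

models <- list(sub3_fit, sub5_fit)

# One row per model; lower AIC/BIC and fewer coefficients favour a model
comparison <- data.frame(
  model  = c("model 3", "model 5"),
  n_coef = sapply(models, function(m) length(coef(m))),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  sigma  = sapply(models, function(m) summary(m)$sigma),
  AIC    = sapply(models, AIC),
  BIC    = sapply(models, BIC)
)
print(comparison)
```

The two models are not nested (model 3 contains qsec, model 5 does not), so information criteria are used here rather than an F-test between them.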

Conclusions

From the analyses we performed, manual transmissions appear to have an advantage over automatic transmissions. This conclusion is supported by the t-test and by our final linear model. According to the final model, a manual transmission is associated with an MPG about 2.94 higher than an automatic, holding weight (wt) and quarter-mile time (qsec) constant, as shown by the coefficient for amManual.
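
The quoted coefficient, together with its uncertainty, can be extracted from the final model as sketched below. The object names are illustrative, and the confidence interval is an addition that the text above does not report.

```r
cars_df <- datasets::mtcars
cars_df$am <- factor(cars_df$am, labels = c("Automatic", "Manual"))

sub3_fit <- lm(mpg ~ am + wt + qsec, data = cars_df)

# Estimated MPG difference (Manual minus Automatic) at fixed wt and qsec
am_effect <- coef(sub3_fit)[["amManual"]]

# 95% confidence interval for that difference
am_ci <- confint(sub3_fit, "amManual", level = 0.95)

print(round(am_effect, 2))
print(am_ci)
```

Comparing this adjusted estimate with the unadjusted difference of about 7.24 MPG from the simple regression shows how much of the raw gap is attributable to weight and acceleration rather than to the transmission itself.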

The model fits well, with a p-value below 0.05 and an adjusted R2 of 0.83. According to the diagnostic plots, the model satisfies the usual regression assumptions, with one exception: the equal variance assumption is not strictly met.

Appendix

1. Plot the miles per gallon (MPG) for automatic and manual transmissions

# named mpg_boxplot so the base plot() function is not masked
mpg_boxplot <- ggplot(mtcars, aes(x = am, y = mpg)) +
    geom_boxplot(aes(fill = am)) +
    xlab("Transmission Types") +
    ylab("MPG (Miles Per Gallon)") +
    theme(legend.position = "none")
mpg_boxplot

According to the plot, MPG differs between automatic and manual transmissions. However, a t-test is needed to verify whether the difference in means is significant.
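
The numbers behind the boxplot can be tabulated with a short base R sketch such as the following (object names are illustrative):

```r
cars_df <- datasets::mtcars
cars_df$am <- factor(cars_df$am, labels = c("Automatic", "Manual"))

# Split MPG by transmission type and summarise each group
by_am <- split(cars_df$mpg, cars_df$am)
group_summary <- data.frame(
  n      = sapply(by_am, length),
  mean   = sapply(by_am, mean),
  median = sapply(by_am, median),
  sd     = sapply(by_am, sd)
)
print(group_summary)
```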

2. Standard diagnostic plots for regression analysis:

par(mfrow = c(2,2))  # Change the panel layout to 2 x 2
plot(sub3_fit, col = "red", lwd = 2)

par(mfrow=c(2,2)) lets us look at four plots at once rather than one by one. The diagnostic plots show the residuals in four different ways. Let's examine them in turn.

Residuals vs Fitted: The points are randomly scattered. Since the residuals are spread evenly around the horizontal line without distinct patterns, there is no indication of a non-linear relationship.

Normal Q-Q: The points do not deviate much from a straight line, which suggests that the normality assumption is satisfied.

Scale-Location: This plot checks the assumption of equal variance (homoscedasticity). The red smooth line is not horizontal and shows an upward trend. Additionally, the residuals spread slightly wider as the fitted values increase, so this assumption is not strictly met.

Residuals vs Leverage: There is no highly influential case. We can barely see the Cook's distance lines (red dashed lines) because all cases are well inside them.
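
The visual reading above can be supplemented with a few numeric checks from base R. The sketch below is one possible set of checks, not part of the original analysis; the rank correlation used for the spread is an informal stand-in for the Scale-Location plot rather than a formal test.

```r
cars_df <- datasets::mtcars
cars_df$am <- factor(cars_df$am, labels = c("Automatic", "Manual"))
sub3_fit <- lm(mpg ~ am + wt + qsec, data = cars_df)

# Normality of residuals (complements the Normal Q-Q plot)
shapiro <- shapiro.test(residuals(sub3_fit))
print(shapiro$p.value)

# Informal spread check: do absolute standardized residuals grow with fitted values?
spread_trend <- cor(abs(rstandard(sub3_fit)), fitted(sub3_fit), method = "spearman")
print(spread_trend)

# Influence: largest Cook's distances and leverages (complements Residuals vs Leverage)
cooks <- cooks.distance(sub3_fit)
leverage <- hatvalues(sub3_fit)
print(head(sort(cooks, decreasing = TRUE), 3))
print(head(sort(leverage, decreasing = TRUE), 3))
```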