In this project, using the mtcars dataset, we will explore how miles per gallon (MPG) is affected by different factors. In particular, we will answer the following two questions: (1) Is an automatic or manual transmission better for MPG? (2) How large is the MPG difference between automatic and manual transmissions?
We first load the libraries needed for data handling, plotting, and model selection, then read the mtcars dataset and keep an untouched copy of the original numeric data.
library(data.table)
library(ggplot2)
library(leaps)
library(printr)
Let’s fetch the data first, change the categorical variables to factors, and relabel am as Automatic and Manual.
data("mtcars")
mtcars_num <- copy(mtcars)  # keep an untouched numeric copy of the data
head(mtcars)
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
# check the class of each column
as.data.frame(t(sapply(mtcars, class)))
| mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|
| numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric |
# data preparation: convert the categorical variables to factors
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am, labels = c("Automatic","Manual"))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
t_test <- t.test(mpg ~ am, data = mtcars)
t_test
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
The t-test rejects the null hypothesis that the difference in means is equal to zero, with a p-value of 0.001374. Therefore, there is a significant difference in MPG between the transmission types, with automatic transmissions having a lower MPG.
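As a quick check (a small base-R sketch, not part of the model fits below), the raw difference in the group means equals the transmission coefficient we will obtain from the simple regression:
# difference in mean MPG between manual and automatic transmissions
group_means <- with(mtcars, tapply(mpg, am, mean))
group_means["Manual"] - group_means["Automatic"]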
The most basic way to quantify the difference is to fit a simple linear regression model with MPG as the outcome and transmission type as the single predictor.
basic_fit <- lm(mpg ~ am, mtcars)
summary(basic_fit)$coefficients
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 17.147368 | 1.124602 | 15.247492 | 0.000000 |
| amManual | 7.244939 | 1.764422 | 4.106127 | 0.000285 |
summary(basic_fit)$r.squared
## [1] 0.3597989
The basic linear regression model with a single predictor only explains about 36% of the variation in MPG, so it is worth examining the influence of the other variables, starting with stepwise regression.
full_model <- lm(mpg ~ ., mtcars)  # full model with all predictors
step_fit <- step(full_model, direction = "both", trace = FALSE)
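# optionally print the formula of the model chosen by the stepwise search
formula(step_fit)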
# now perform a best subsets regression
# reference: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/155-best-subsets-regression-essentials-in-r/
best_subsets <- regsubsets(mpg ~ ., mtcars)
# by default, regsubsets() considers models with up to nvmax = 8 predictors
best_subsets_summary <- summary(best_subsets)
adjr2 <- which.max(best_subsets_summary$adjr2)
cp <- which.min(best_subsets_summary$cp)
bic <- which.min(best_subsets_summary$bic)
best_set <- best_subsets_summary$outmat[c(adjr2, cp, bic), ]
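# print the selection matrix: an asterisk marks a variable included in a given model
best_set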
sub3_fit <- lm(mpg ~ am + wt + qsec, mtcars)
sub5_fit <- lm(mpg ~ am + cyl + hp + wt + vs, mtcars)
The summary() of the regsubsets object reports the best set of variables for each model size; an asterisk indicates that the corresponding variable is included in that model.
For example, it can be seen from the output that the best two-variable model contains only the wt and hp variables (mpg ~ wt + hp), the best three-variable model is mpg ~ wt + hp + amManual, and so forth.
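As a small optional illustration (this uses the logical $which matrix that summary() also returns; the indexing below is our own addition, not part of the output above), the variables flagged for a model of a given size can be listed programmatically:
# predictors in the best three-variable model; the first column, "(Intercept)", is dropped
names(which(best_subsets_summary$which[3, ]))[-1]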
Now we need to decide which of these models we should finally use for our analysis.
Both the best subsets regression and the BIC criterion from the stepwise selection above suggest that model 3 is the best. Using the tidyverse and broom packages, the code below gives us an overall picture of the two models in terms of model performance metrics.
library(tidyverse)
## -- Attaching packages -------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 2.1.1 v purrr 0.3.2
## v tidyr 0.8.3 v dplyr 0.8.1
## v readr 1.3.1 v stringr 1.4.0
## v tibble 2.1.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::between() masks data.table::between()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks data.table::first()
## x dplyr::lag() masks stats::lag()
## x dplyr::last() masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(modelr)
library(broom)
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
# Metrics for model 3
glance(sub3_fit) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
| adj.r.squared | sigma | AIC | BIC | p.value |
|---|---|---|---|---|
| 0.8335561 | 2.458846 | 154.1194 | 161.4481 | 0 |
# Metrics for model 5
glance(sub5_fit) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
| adj.r.squared | sigma | AIC | BIC | p.value |
|---|---|---|---|---|
| 0.8417804 | 2.397329 | 154.8713 | 166.5972 | 0 |
From the output above, it can be seen that the two models have nearly the same adjusted R2 (0.83 vs. 0.84), meaning they are practically equivalent in explaining the outcome, mpg. They also have a similar residual standard error (sigma close to 2.4), while model 3 has the lower AIC and BIC. Moreover, model 3 is simpler than model 5 because it uses fewer variables; all else being equal, the simpler model is preferred.
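As an optional cross-check (a sketch only, assuming 5 folds and an arbitrary seed), the modelr and purrr packages loaded above can compare the two candidates by cross-validated prediction error:
# 5-fold cross-validated RMSE for each candidate model
set.seed(123)                            # arbitrary seed for reproducibility
folds <- crossv_kfold(mtcars, k = 5)
cv_rmse <- function(form) {
  fits <- map(folds$train, ~ lm(form, data = .x))
  mean(map2_dbl(fits, folds$test, rmse))
}
cv_rmse(mpg ~ am + wt + qsec)            # model 3
cv_rmse(mpg ~ am + cyl + hp + wt + vs)   # model 5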
From the multiple analyses we performed, manual transmissions appear to have an advantage over automatic transmissions. This conclusion is supported by both the t-test and our final linear model. According to the final model, switching from an automatic to a manual transmission increases MPG by 2.94, as can be seen from the coefficient for amManual.
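For reference, the coefficient table of the final model, which is where the 2.94 estimate comes from, can be printed directly:
# coefficients of the final model; amManual is the estimated MPG gain of a manual transmission
summary(sub3_fit)$coefficients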
The model fits well, with a p-value of less than 0.05 and an adjusted R2 of 0.83. According to the diagnostic checks, our model satisfies the standard statistical assumptions, with the exception that the equal-variance assumption is not strictly met.
plot <- ggplot(mtcars, aes(x=am, y=mpg)) +
geom_boxplot(aes(fill = am)) +
xlab("Transmission Types") +
ylab("MPG (Miles Per Gallon") +
theme(legend.position = "none")
plot
According to the plot, there appears to be a difference between automatic and manual transmissions. However, it is necessary to perform a t-test to verify whether the difference in means is statistically significant.
par(mfrow = c(2,2)) # Change the panel layout to 2 x 2
plot(sub3_fit, col = "red", lwd = 2)
par(mfrow=c(2,2)) lets us look at all four diagnostic plots at once rather than one by one. The diagnostic plots show the residuals in four different ways:

- Residuals vs Fitted: the points are randomly scattered. Since the residuals spread equally around the horizontal line without distinct patterns, this is a good indication that we do not have non-linear relationships.
- Normal Q-Q: the points do not deviate much from the straight line, which suggests that the normality assumption is satisfied.
- Scale-Location: this plot helps us check the equal-variance (homoscedasticity) assumption. The red smooth line is not horizontal and shows an upward trend, and the residuals spread slightly wider at larger fitted values, so this assumption is not strictly met.
- Residuals vs Leverage: there is no highly influential case; we can barely see the Cook's distance lines (red dashed lines) because all cases are well inside them.
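If a more formal check is desired, the visual impressions above can be complemented with simple tests (a sketch; the Breusch-Pagan test requires the lmtest package, which is not used elsewhere in this report):
# normality of residuals (base R)
shapiro.test(residuals(sub3_fit))
# equal-variance (homoscedasticity) check
library(lmtest)
bptest(sub3_fit)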