In this project, using the mtcars dataset, we will explore how miles per gallon (MPG) is affected by different factors. In particular, we will answer the following two questions: (1) Is an automatic or manual transmission better for MPG? (2) How large is the MPG difference between automatic and manual transmissions?
We first load the libraries needed for data handling, plotting, and model selection, then read the mtcars dataset and keep an untouched copy of the original numeric data.
library(data.table)
library(ggplot2)
library(leaps)
library(printr)
Let’s fetch the data first, change the categorical variables to factors, and relabel am as Automatic and Manual.
data("mtcars")
mtcars_num <- copy(mtcars)  # keep an untouched numeric copy of the data
head(mtcars)
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
# check the class of each column
as.data.frame(t(sapply(mtcars, class)))
| mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|
| numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric |
# data preparation: convert the categorical variables to factors
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am, labels = c("Automatic","Manual"))
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
t_test <- t.test(mpg ~ am, data = mtcars)
t_test
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
The t-test rejects the null hypothesis that the difference in means is equal to zero, with a p-value of 0.001374. Therefore, there is a significant difference in MPG between the transmission types, with automatic transmissions having a lower MPG.
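As a quick check (a small base-R sketch, not part of the model fits below), the raw difference in the group means equals the transmission coefficient we will obtain from the simple regression:
# difference in mean MPG between manual and automatic transmissions
group_means <- with(mtcars, tapply(mpg, am, mean))
group_means["Manual"] - group_means["Automatic"]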
The most basic way to quantify the difference is to fit a simple linear regression model with MPG as the outcome and transmission type as the single predictor.
basic_fit <- lm(mpg ~ am, mtcars)
summary(basic_fit)$coefficients
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 17.147368 | 1.124602 | 15.247492 | 0.000000 |
| amManual | 7.244939 | 1.764422 | 4.106127 | 0.000285 |
summary(basic_fit)$r.squared
## [1] 0.3597989
The basic linear regression model with a single predictor only explains about 36% of the variation in MPG, so it is worth examining the influence of the other variables, starting with stepwise regression.
full_model <- lm(mpg ~ ., mtcars)  # full model with all predictors
step_fit <- step(full_model, direction = "both", trace = FALSE)
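# optionally print the formula of the model chosen by the stepwise search
formula(step_fit)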
# now perform a best subsets regression
# reference: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/155-best-subsets-regression-essentials-in-r/
best_subsets <- regsubsets(mpg ~ ., mtcars)
# by default, regsubsets() considers models with up to nvmax = 8 predictors
best_subsets_summary <- summary(best_subsets)
adjr2 <- which.max(best_subsets_summary$adjr2)
cp <- which.min(best_subsets_summary$cp)
bic <- which.min(best_subsets_summary$bic)
best_set <- best_subsets_summary$outmat[c(adjr2, cp, bic), ]
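# print the selection matrix: an asterisk marks a variable included in a given model
best_set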
sub3_fit <- lm(mpg ~ am + wt + qsec, mtcars)
sub5_fit <- lm(mpg ~ am + cyl + hp + wt + vs, mtcars)
The summary() of the regsubsets object reports the best set of variables for each model size; an asterisk indicates that the corresponding variable is included in that model.
For example, it can be seen from the output that the best two-variable model contains only the wt and hp variables (mpg ~ wt + hp), the best three-variable model is mpg ~ wt + hp + amManual, and so forth.
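As a small optional illustration (this uses the logical $which matrix that summary() also returns; the indexing below is our own addition, not part of the output above), the variables flagged for a model of a given size can be listed programmatically:
# predictors in the best three-variable model; the first column, "(Intercept)", is dropped
names(which(best_subsets_summary$which[3, ]))[-1]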
Now we need to decide which of these models we should finally use for our analysis.
Both the best subsets regression and the BIC criterion from the stepwise selection above suggest that model 3 is the best. Using the tidyverse and broom packages, the code below gives us an overall picture of the two models in terms of model performance metrics.
library(tidyverse)
## -- Attaching packages -------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 2.1.1 v purrr 0.3.2
## v tidyr 0.8.3 v dplyr 0.8.1
## v readr 1.3.1 v stringr 1.4.0
## v tibble 2.1.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::between() masks data.table::between()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks data.table::first()
## x dplyr::lag() masks stats::lag()
## x dplyr::last() masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
library(modelr)
library(broom)
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
# Metrics for model 3
glance(sub3_fit) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
| adj.r.squared | sigma | AIC | BIC | p.value |
|---|---|---|---|---|
| 0.8335561 | 2.458846 | 154.1194 | 161.4481 | 0 |
# Metrics for model 5
glance(sub5_fit) %>%
dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value)
| adj.r.squared | sigma | AIC | BIC | p.value |
|---|---|---|---|---|
| 0.8417804 | 2.397329 | 154.8713 | 166.5972 | 0 |
From the output above, it can be seen that the two models have nearly the same adjusted R2 (0.83 vs. 0.84), meaning they are practically equivalent in explaining the outcome, mpg. They also have a similar residual standard error (sigma close to 2.4), while model 3 has the lower AIC and BIC. Moreover, model 3 is simpler than model 5 because it uses fewer variables; all else being equal, the simpler model is preferred.
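As an optional cross-check (a sketch only, assuming 5 folds and an arbitrary seed), the modelr and purrr packages loaded above can compare the two candidates by cross-validated prediction error:
# 5-fold cross-validated RMSE for each candidate model
set.seed(123)                            # arbitrary seed for reproducibility
folds <- crossv_kfold(mtcars, k = 5)
cv_rmse <- function(form) {
  fits <- map(folds$train, ~ lm(form, data = .x))
  mean(map2_dbl(fits, folds$test, rmse))
}
cv_rmse(mpg ~ am + wt + qsec)            # model 3
cv_rmse(mpg ~ am + cyl + hp + wt + vs)   # model 5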
From the multiple analyses we performed, manual transmissions appear to have an advantage over automatic transmissions. This conclusion is supported by both the t-test and our final linear model. According to the final model, switching from an automatic to a manual transmission increases MPG by 2.94, as can be seen from the coefficient for amManual.
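For reference, the coefficient table of the final model, which is where the 2.94 estimate comes from, can be printed directly:
# coefficients of the final model; amManual is the estimated MPG gain of a manual transmission
summary(sub3_fit)$coefficients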
The model fits well, with a p-value of less than 0.05 and an adjusted R2 of 0.83. According to the diagnostic checks, our model satisfies the standard statistical assumptions, with the exception that the equal-variance assumption is not strictly met.
plot <- ggplot(mtcars, aes(x=am, y=mpg)) +
geom_boxplot(aes(fill = am)) +
xlab("Transmission Types") +
ylab("MPG (Miles Per Gallon") +
theme(legend.position = "none")
plot
According to the plot, there appears to be a difference between automatic and manual transmissions. However, it is necessary to perform a t-test to verify whether the difference in means is statistically significant.
par(mfrow = c(2,2)) # Change the panel layout to 2 x 2
plot(sub3_fit, col = "red", lwd = 2)
par(mfrow=c(2,2)) lets us look at all four diagnostic plots at once rather than one by one. The diagnostic plots show the residuals in four different ways:

- Residuals vs Fitted: the points are randomly scattered. Since the residuals spread equally around the horizontal line without distinct patterns, this is a good indication that we do not have non-linear relationships.
- Normal Q-Q: the points do not deviate much from the straight line, which suggests that the normality assumption is satisfied.
- Scale-Location: this plot helps us check the equal-variance (homoscedasticity) assumption. The red smooth line is not horizontal and shows an upward trend, and the residuals spread slightly wider at larger fitted values, so this assumption is not strictly met.
- Residuals vs Leverage: there is no highly influential case; we can barely see the Cook's distance lines (red dashed lines) because all cases are well inside them.
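If a more formal check is desired, the visual impressions above can be complemented with simple tests (a sketch; the Breusch-Pagan test requires the lmtest package, which is not used elsewhere in this report):
# normality of residuals (base R)
shapiro.test(residuals(sub3_fit))
# equal-variance (homoscedasticity) check
library(lmtest)
bptest(sub3_fit)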