Regression Models Project - Relationship Between Manual Transmission and MPG

Summary

This document analyzes data on a range of cars, with a particular interest in the relationship between MPG (miles per gallon) and the type of transmission (manual or automatic). Two questions are addressed:

1. Is an automatic or manual transmission better for MPG?
2. What is the MPG difference between automatic and manual transmissions?

The steps taken are:

1. Process the raw data.
2. Explore the data using plots to visualize relationships.
3. Select and examine models to see which one fits the data best and helps answer the questions.
4. Draw conclusions that answer the questions.

Processing

# Loading the required libraries from the start
library(datasets)
library(ggplot2)
library(GGally)
library(knitr)

Getting and Manipulating the Data

For this exercise the mtcars dataset is used, a well-known dataset that ships with base R in the datasets package. The data is loaded and the non-continuous variables are converted to factors. Before the conversion, while every column is still numeric, the correlation of mpg with each variable is computed.

# Extracting the data
data(mtcars)

# Getting all the correlations
corr_list <- cor(mtcars$mpg, mtcars)

# Converting the non-continuous variables to factor
# Converting am
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
# Converting vs
mtcars$vs <- as.factor(mtcars$vs)
levels(mtcars$vs) <- c("V-shaped", "Straight")
# Converting cyl
mtcars$cyl <- as.factor(mtcars$cyl)
# Converting gear
mtcars$gear <- as.factor(mtcars$gear)

Exploratory Data Analysis

Showing the dimensions and types of data in the dataset.

# Showing info of the data
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "V-shaped","Straight": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Visualizing the first 5 rows of the dataset.

# Visualizing the first rows of data
kable(head(mtcars, 5))

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	V-shaped	Manual	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	V-shaped	Manual	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	Straight	Manual	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	Straight	Automatic	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	V-shaped	Automatic	3	2

Plotting the relationship between the main variables for this exercise.

# Plotting the relationship between the variables of interest
p1 <- ggplot(mtcars, aes(x=am, y=mpg))
p1 <- p1 + geom_boxplot(aes(fill = am))
p1

Boxplot of the two main variables

The previous plot suggests a relationship between transmission type and MPG: cars with manual transmission appear to have a higher MPG than their automatic counterparts. However, it is good practice to look at the correlations between the variables before fitting a model.
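The difference visible in the boxplot can also be put into numbers. The following is a minimal sketch, not part of the original analysis: it rebuilds the am factor on a fresh copy of the data (the names cars, group_means and tt are illustrative) and runs a Welch two-sample t-test, which does not assume equal variances in the two groups.

```r
# Self-contained sketch: work on a fresh copy so the report's mtcars is untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean MPG for each transmission type
group_means <- tapply(cars$mpg, cars$am, mean)
group_means

# Welch two-sample t-test of mpg between the two groups
tt <- t.test(mpg ~ am, data = cars)
tt$p.value
```

A small p-value here only says that the unadjusted group means differ; it does not account for other variables such as weight.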

# Ordering by absolute correlation and keeping the variables ranked at or above am
corr_list <- corr_list[,order(-abs(corr_list[1,]))]
corr_list

##        mpg         wt        cyl       disp         hp       drat         vs 
##  1.0000000 -0.8676594 -0.8521620 -0.8475514 -0.7761684  0.6811719  0.6640389 
##         am       carb       gear       qsec 
##  0.5998324 -0.5509251  0.4802848  0.4186840

ind <- which(names(corr_list) == "am")
corr_var <- names(corr_list)[1:ind]
corr_var

## [1] "mpg"  "wt"   "cyl"  "disp" "hp"   "drat" "vs"   "am"

All the variables whose absolute correlation with mpg is at least as high as that of am are extracted, and their pairwise relationships are plotted using ggpairs.

# Figure 2
p2 <- ggpairs(data=mtcars[, corr_var],  mapping = ggplot2::aes(color = am))
p2

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Pair plot of all the highly correlated quantities

Model Selection

The first model uses mpg as the outcome and am as the only predictor.

fit1 <- lm(mpg ~ am, mtcars)
summary(fit1)

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

The am coefficient has a very small p-value, so the difference in mean MPG between the two transmission types is statistically significant. However, the R-squared is only 0.36, so transmission type alone explains just over a third of the variance in MPG.
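With a single two-level factor as predictor, the regression is just a comparison of group means. The sketch below (illustrative names cars, fit_simple, diff_means, slope and r2; not part of the original analysis) checks that the amManual coefficient equals the difference in mean MPG and that the R-squared equals the squared correlation between mpg and am.

```r
# Self-contained sketch on a fresh copy of the data
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit_simple <- lm(mpg ~ am, data = cars)

# The slope of a one-factor model is the difference between the group means
means <- tapply(cars$mpg, cars$am, mean)
diff_means <- unname(means["Manual"] - means["Automatic"])
slope <- unname(coef(fit_simple)["amManual"])
all.equal(slope, diff_means)

# R-squared of a one-predictor model is the squared correlation
r2 <- summary(fit_simple)$r.squared
all.equal(r2, cor(cars$mpg, as.numeric(cars$am))^2)
```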

Next, a model including all the variables is fitted.

fit_all <- lm(mpg ~ ., data=mtcars)
summary(fit_all)

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2015 -1.2319  0.1033  1.1953  4.3085 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 15.09262   17.13627   0.881   0.3895  
## cyl6        -1.19940    2.38736  -0.502   0.6212  
## cyl8         3.05492    4.82987   0.633   0.5346  
## disp         0.01257    0.01774   0.708   0.4873  
## hp          -0.05712    0.03175  -1.799   0.0879 .
## drat         0.73577    1.98461   0.371   0.7149  
## wt          -3.54512    1.90895  -1.857   0.0789 .
## qsec         0.76801    0.75222   1.021   0.3201  
## vsStraight   2.48849    2.54015   0.980   0.3396  
## amManual     3.34736    2.28948   1.462   0.1601  
## gear4       -0.99922    2.94658  -0.339   0.7382  
## gear5        1.06455    3.02730   0.352   0.7290  
## carb         0.78703    1.03599   0.760   0.4568  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.616 on 19 degrees of freedom
## Multiple R-squared:  0.8845, Adjusted R-squared:  0.8116 
## F-statistic: 12.13 on 12 and 19 DF,  p-value: 1.764e-06

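Although this model has a high R-squared (0.88) and explains most of the variance, none of its individual coefficients is significant at the 0.05 level, even though the overall F-test is (p-value 1.764e-06). With 12 coefficients estimated from only 32 observations, the model is likely overparameterized.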

To find a better model for our data, the step function is used, starting from the fit_all model and searching in both directions by AIC, its default criterion.

fit_best <- step(fit_all, direction="both",trace=FALSE)
summary(fit_best)

## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amManual      2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

This model has an R-squared of 0.8497, so it explains about 85% of the variance, and the p-values of all three predictors are below 0.05. The variables retained are wt, qsec and am.
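The three models can also be compared side by side. The sketch below is an illustration rather than part of the original analysis: it refits the models on a fresh copy of the data (the names cars, m_am, m_step, m_all, aic_table and cmp are illustrative), tabulates their AIC, and runs a nested F-test, which is valid here because each model's predictors are a subset of the next one's.

```r
# Self-contained sketch: repeat the factor conversions on a fresh copy
cars <- datasets::mtcars
cars$am   <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
cars$vs   <- factor(cars$vs, levels = c(0, 1), labels = c("V-shaped", "Straight"))
cars$cyl  <- factor(cars$cyl)
cars$gear <- factor(cars$gear)

m_am   <- lm(mpg ~ am, data = cars)              # transmission only
m_step <- lm(mpg ~ wt + qsec + am, data = cars)  # formula selected by step()
m_all  <- lm(mpg ~ ., data = cars)               # all variables

# Lower AIC is better
aic_table <- AIC(m_am, m_step, m_all)
aic_table

# Sequential F-tests: does each larger model improve on the previous one?
cmp <- anova(m_am, m_step, m_all)
cmp
```

In the anova table, the second row tests whether adding wt and qsec improves on the am-only model, and the third row tests whether the remaining variables add anything beyond that.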

To check whether this model is adequate, the diagnostic plots of its residuals are drawn.

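# Plotting the residual diagnostics in a 2x2 grid
par(mfrow = c(2, 2))
# plot() on an lm object draws the plots and returns nothing useful, so the result is not assigned
plot(fit_best)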
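The diagnostic plots can be complemented with a few numeric checks. This is a hedged sketch, not part of the original analysis; the names cars, fit, res, sw, lev and cook are illustrative, and the choice of a Shapiro-Wilk test for normality of the residuals is an assumption about what a reader might want to check.

```r
# Self-contained sketch: refit the selected model on a fresh copy of the data
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ wt + qsec + am, data = cars)

res <- resid(fit)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(res)
sw$p.value

# Leverage and Cook's distance: the three most influential cars by each measure
lev  <- hatvalues(fit)
cook <- cooks.distance(fit)
head(sort(lev, decreasing = TRUE), 3)
head(sort(cook, decreasing = TRUE), 3)
```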

Conclusion

The answer to the first question is that manual transmission is associated with a higher MPG than automatic. The amManual coefficient is positive in all three fitted models, although it is not statistically significant in the full model (p-value 0.16).

The second question is better answered by looking at the best fitted model (fit_best), with the formula \(mpg \sim wt + qsec + am\). This model indicates that cars with manual transmission get about 2.94 more miles per gallon than automatic cars with the same weight and quarter-mile time. The p-value of 0.047 is below our threshold of 0.05, so the difference is statistically significant, though only marginally.
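A confidence interval conveys the uncertainty around this estimate better than the point value alone. The sketch below is illustrative and not part of the original analysis (the names cars, fit and ci are hypothetical); it refits the selected model and extracts the 95% interval for the amManual coefficient with confint.

```r
# Self-contained sketch: refit the selected model on a fresh copy of the data
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ wt + qsec + am, data = cars)

# 95% confidence interval for the manual-vs-automatic difference,
# holding wt and qsec fixed
ci <- confint(fit, "amManual", level = 0.95)
ci
```

Because the p-value is just under 0.05, the lower end of the interval sits just above zero, so the data are compatible with anything from a negligible advantage to one of several miles per gallon.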

In the residuals vs. fitted values plot, the residuals appear randomly scattered with no obvious pattern in their variance. However, the residual standard error (2.459 mpg) is still sizeable.

Even though the model fits our data well, and the relationship between transmission and MPG seems plausible for the year the data was collected, I cannot draw a strong conclusion because of the low number of observations and because the automatic and manual cars barely overlap in the model's predictors.