Executive Summary

In this report we analyze which type of transmission is better for mileage. An exploratory data analysis was conducted, then a simple linear regression model with mileage as the outcome and transmission type as the predictor was fitted. Several nested multiple linear regression models were then compared, and the selected model was \(mpg \sim am + wt + cyl\). This model is significant overall (\(p < 0.001\), adjusted \(R^2 = 0.81\)). In the simple model, manual cars average 7.2 mpg more than automatic cars (\(p < 0.001\)); after adjusting for weight and number of cylinders, the estimated difference falls to 0.15 mpg and is no longer significant (\(p = 0.91\)).

Introduction

The analysis was performed using the mtcars data set. This data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Research questions

For this analysis, we are interested in the following:
1. “Is an automatic or manual transmission better for MPG (miles per gallon)?”
2. “Quantify the MPG difference between automatic and manual transmissions”

Our hypothesis is that the manual transmission is better for mileage. Our null hypothesis is that there is no difference in mileage between automatic and manual transmission.

To test our hypothesis we will use regression models.
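
One way to write these hypotheses down, assuming the simple regression of mpg on transmission type that is fitted later in this report. The symbols \(\beta_0\), \(\beta_1\) and \(\mathrm{Manual}_i\) (an indicator equal to 1 for a manual car and 0 for an automatic one) are our own notation, not part of the data set:

\[
mpg_i = \beta_0 + \beta_1 \, \mathrm{Manual}_i + \varepsilon_i, \qquad H_0 : \beta_1 = 0 \quad \text{against} \quad H_1 : \beta_1 \neq 0
\]

Under this coding \(\beta_0\) is the mean mileage of automatic cars and \(\beta_1\) is the difference in mean mileage between manual and automatic cars. The p-values that `summary()` prints for a coefficient are two-sided, which is why the alternative is written as \(\beta_1 \neq 0\) rather than \(\beta_1 > 0\).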

Exploratory Data Analysis

First, the needed libraries and the data set were loaded.

library(datasets)
library(ggplot2)
data(mtcars)

An exploratory data analysis was performed to get a general idea of the variables in the data set.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

For our analysis, some variables were changed to factors and relabeled.

# Columns 2, 8, 9, 10 and 11, selected by name: cyl, vs, am, gear, carb
factor_vars <- c("cyl", "vs", "am", "gear", "carb")
mtcars[factor_vars] <- lapply(mtcars[factor_vars], factor)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
levels(mtcars$am) <- c("Automatic", "Manual")
levels(mtcars$vs) <- c("V-shaped", "Straight")
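
Because `levels<-` assigns labels by position, it can be worth confirming that 0 really became "Automatic" and 1 became "Manual". The sketch below is one way to check this; it works on a fresh copy named `cars` (a name introduced here for illustration) so that the `mtcars` object used in the rest of the report is left untouched.

```r
# Fresh copy of the data; `cars` is a hypothetical name used only in this check
cars <- datasets::mtcars
raw_am <- cars$am                       # keep the original 0/1 coding

# Same relabeling as in the report
cars$am <- factor(cars$am)
levels(cars$am) <- c("Automatic", "Manual")

# Cross-tabulate the original coding against the new labels;
# every car should fall on the diagonal
check <- table(raw = raw_am, label = cars$am)
print(check)
```

If the labels had been supplied in the wrong order, the counts would appear off the diagonal of this table.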

We were interested in knowing if an automatic or manual transmission is better for mileage. A boxplot was made to visualize the difference in mileage between both types of transmission.

boxplot(
        mpg ~ am, 
        data = mtcars, 
        main = "Car mileage compared by transmission", 
        xlab = "Transmission", 
        ylab = "Miles per Gallon"
        )

From the plot above, cars with a manual transmission appear to have better mileage overall than cars with an automatic transmission. More exploratory plots were made to see if other variables have a relationship with mileage.
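
The visual impression from the boxplot can be backed with group summaries. This is a minimal sketch on a fresh copy of the data (`cars`, `group_means` and `group_medians` are names introduced here); the difference between the two means should match the transmission coefficient of the simple regression reported in the Analysis section.

```r
# Fresh copy so the report's `mtcars` object is not modified
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Mean and median mileage per transmission type
group_means   <- tapply(cars$mpg, cars$am, mean)
group_medians <- tapply(cars$mpg, cars$am, median)

print(round(group_means, 3))
print(group_medians)

# Manual minus automatic, in miles per gallon
print(round(diff(group_means), 3))
```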

par(mfrow = c(2, 2))
boxplot(
        mpg ~ cyl, 
        data = mtcars, 
        main = "Car mileage by Number of Cylinders", 
        xlab = "Number of cylinders", 
        ylab = "Miles per Gallon"
        )
boxplot(
        mpg ~ vs, 
        data = mtcars, 
        main = "Car mileage by Engine", 
        xlab = "Engine", 
        ylab = "Miles per Gallon"
        )
boxplot(
        mpg ~ gear, 
        data = mtcars, 
        main = "Car mileage by Number of Forward Gears", 
        xlab = "Number of Forward Gears", 
        ylab = "Miles per Gallon"
        )
boxplot(
        mpg ~ carb, 
        data = mtcars, 
        main = "Car mileage by Number of Carburetors", 
        xlab = "Number of Carburetors", 
        ylab = "Miles per Gallon"
        )

par(mfrow = c(2, 2))
plot(
        mtcars$disp, mtcars$mpg, pch = 16, col = "blue",
        main = "Mileage by displacement (cu. in)", 
        xlab = "Displacement", 
        ylab = "Miles per Gallon"
        )
plot(
        mtcars$hp, mtcars$mpg,  pch = 16, col = "red", 
        main = "Mileage by Gross Horsepower", 
        xlab = "Gross horsepower", 
        ylab = "Miles per Gallon"
        )
plot(
        mtcars$wt, mtcars$mpg,  pch = 16, col = "darkgreen", 
        main = "Mileage by weight (1000 lbs)", 
        xlab = "Weight (1000 lbs)", 
        ylab = "Miles per Gallon"
        )

There appears to be a negative correlation between the number of cylinders and mileage, and a mild negative correlation between the number of carburetors and mileage. The mileage appears to be better for straight engines than for V-shaped engines. There is an apparent negative correlation between mileage and displacement, gross horsepower and weight. Further analysis is needed to clarify and quantify the relationship between these variables and mileage.
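
As a first step toward quantifying these relationships, the Pearson correlations between mileage and the three continuous predictors can be computed directly. This sketch covers only `disp`, `hp` and `wt`; the categorical variables are left out because a Pearson correlation is not an obvious summary for them. The names `cars` and `cors` are introduced here for illustration.

```r
# Fresh copy with the original numeric columns
cars <- datasets::mtcars

# Pearson correlation of each continuous predictor with mpg
cors <- sapply(cars[, c("disp", "hp", "wt")],
               function(x) cor(x, cars$mpg))

print(round(cors, 2))
```

A negative value for each variable would be consistent with the downward trends seen in the scatter plots.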

Analysis

Simple Linear Regression model

A simple linear regression model was fitted, with mpg as the outcome and am, the transmission type, as the predictor.

fit1 <- lm(mpg ~ factor(am), data = mtcars)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        17.147      1.125  15.247 1.13e-15 ***
## factor(am)Manual    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

For this simple linear regression model an intercept of 17.147 was obtained, which is the mean mileage of the automatic cars, with an increase in the mean of 7.245 mpg when the transmission is manual, with \(p = 0.000285\) and an adjusted \(R^2 = 0.3385\). The p-value is less than 0.05, so we can reject the null hypothesis. However, the R-squared value suggests that only about a third of the variability in mileage is explained by transmission type alone.
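
A regression on a single two-level factor is equivalent to a two-sample t-test that assumes equal variances, so the two p-values should coincide. The sketch below checks this and adds a 95% confidence interval for the difference, which helps with the second research question. It refits the model on a fresh copy of the data; `cars`, `fit_simple`, `tt`, `p_lm` and `ci_am` are names introduced here.

```r
# Fresh copy so the report's objects are left untouched
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Simple regression, as in the report (am is already a factor here)
fit_simple <- lm(mpg ~ am, data = cars)

# Pooled-variance two-sample t-test on the same grouping
tt <- t.test(mpg ~ am, data = cars, var.equal = TRUE)

# The two p-values should agree up to floating-point error
p_lm <- summary(fit_simple)$coefficients["amManual", "Pr(>|t|)"]
print(all.equal(p_lm, tt$p.value))

# 95% confidence interval for the manual-minus-automatic difference in mpg
ci_am <- confint(fit_simple)["amManual", ]
print(ci_am)
```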

Multiple Regression model

Nested multiple regression models were fitted to find the model that best explains the variance in mileage. For model selection we ran an ANOVA test, with the following results.

fit2 <- lm(mpg ~ am + wt, data = mtcars)
fit3 <- lm(mpg ~ am + wt + cyl, data = mtcars)
fit4 <- lm(mpg ~ am + wt + cyl + disp , data = mtcars)
fit5 <- lm(mpg ~ am + wt + cyl + disp + hp, data = mtcars)
fit6 <- lm(mpg ~ am + wt + cyl + disp + hp + vs , data = mtcars)
anova(fit1, fit2, fit3, fit4, fit5, fit6)
## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ am + wt
## Model 3: mpg ~ am + wt + cyl
## Model 4: mpg ~ am + wt + cyl + disp
## Model 5: mpg ~ am + wt + cyl + disp + hp
## Model 6: mpg ~ am + wt + cyl + disp + hp + vs
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 278.32  1    442.58 74.2318 8.287e-09 ***
## 3     27 182.97  2     95.35  7.9965  0.002181 ** 
## 4     26 182.87  1      0.10  0.0166  0.898540    
## 5     25 150.41  1     32.46  5.4445  0.028337 *  
## 6     24 143.09  1      7.32  1.2275  0.278870    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

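The ANOVA results show that adding weight to transmission type (model 2) and then adding number of cylinders (model 3) each significantly improves the fit. Adding displacement to model 3 does not (\(p = 0.90\)); of the later additions, only horsepower reaches \(p < 0.05\). The summary of the selected multiple regression model, \(mpg \sim am + wt + cyl\), appears below.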
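
As a worked check on the table above: when `anova()` is given a sequence of models, each F statistic divides the mean square gained at that step by the residual mean square of the largest model in the sequence (model 6 here), not by that of the two models being compared. Using the RSS and Df values printed above:

\[
F_2 = \frac{442.58 / 1}{143.09 / 24} \approx \frac{442.58}{5.962} \approx 74.23, \qquad
F_3 = \frac{95.35 / 2}{143.09 / 24} \approx \frac{47.675}{5.962} \approx 8.00
\]

These agree with the reported values of 74.2318 and 7.9965 up to rounding of the printed sums of squares. One consequence is that the p-value for a given step depends on which larger models are included in the same `anova()` call.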

summary(fit3)
## 
## Call:
## lm(formula = mpg ~ am + wt + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4898 -1.3116 -0.5039  1.4162  5.7758 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.7536     2.8135  11.997  2.5e-12 ***
## amManual      0.1501     1.3002   0.115  0.90895    
## wt           -3.1496     0.9080  -3.469  0.00177 ** 
## cyl6         -4.2573     1.4112  -3.017  0.00551 ** 
## cyl8         -6.0791     1.6837  -3.611  0.00123 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.603 on 27 degrees of freedom
## Multiple R-squared:  0.8375, Adjusted R-squared:  0.8134 
## F-statistic: 34.79 on 4 and 27 DF,  p-value: 2.73e-10

This model has an adjusted \(R^2 = 0.81\), which means that about 81% of the variance in mileage is explained by the model after adjusting for the number of predictors. The p-values for weight and for both cylinder levels are all \(p < 0.01\), while the transmission coefficient is no longer significant (\(p = 0.91\)), suggesting that these variables confound the relationship between mileage and transmission type.
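
To quantify the adjusted transmission effect rather than only test it, a confidence interval for the `amManual` coefficient can be extracted. This is a minimal sketch that refits the selected model on a fresh copy of the data; `cars`, `fit_adj` and `ci_am` are names introduced here.

```r
# Fresh copy with the same factor coding as the report
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))

# Selected model: transmission adjusted for weight and number of cylinders
fit_adj <- lm(mpg ~ am + wt + cyl, data = cars)

# 95% confidence interval for the manual-minus-automatic difference,
# holding weight and cylinders fixed
ci_am <- confint(fit_adj, "amManual", level = 0.95)
print(round(ci_am, 2))
```

An interval that includes zero is consistent with the non-significant p-value for `amManual` in the summary above.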

Residuals Plot

The residual plots for the selected model are shown below.

par(mfrow = c(2, 2))
plot(fit3)

The residuals vs fitted plot suggests that the residuals are approximately homoscedastic, and the Q-Q plot suggests that the residuals are approximately normally distributed, with a few outliers.
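
The visual reading of the diagnostic plots can be complemented with numeric checks. The sketch below is one possible choice and was not part of the original analysis: a Shapiro-Wilk test on the residuals and Cook's distances to flag influential cars. It refits the selected model on a fresh copy of the data; `cars`, `fit_adj`, `sw` and `cd` are names introduced here.

```r
# Fresh copy with the same factor coding as the report
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))
fit_adj <- lm(mpg ~ am + wt + cyl, data = cars)

# Shapiro-Wilk test of normality on the residuals
# (a small p-value would argue against normality)
sw <- shapiro.test(residuals(fit_adj))
print(sw)

# Cook's distance for each car; the largest values point to the
# observations with the most influence on the fitted coefficients
cd <- cooks.distance(fit_adj)
print(head(sort(cd, decreasing = TRUE), 3))
```

With only 32 observations a formal normality test has limited power, so it is best read alongside the Q-Q plot rather than in place of it.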

Conclusions

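The null hypothesis of no difference is rejected by the simple model, in which manual cars average 7.2 mpg more than automatic cars (\(p < 0.001\)). The model that best describes the relationship between mileage and transmission type is \(mpg \sim am + wt + cyl\); once weight and number of cylinders are held fixed, the estimated difference shrinks to 0.15 mpg (\(p = 0.91\)). Its coefficients are: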

summary(fit3)$coef[1:5, 1]
## (Intercept)    amManual          wt        cyl6        cyl8 
##  33.7535920   0.1501031  -3.1495978  -4.2573185  -6.0791189
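
To read these coefficients in context, the model can be used to predict mileage for two hypothetical cars that differ only in transmission type. The weight of 3 (in 1000 lbs) and the 6-cylinder engine are arbitrary illustrative choices, and `cars`, `fit_adj`, `new_cars` and `pred` are names introduced here.

```r
# Fresh copy with the same factor coding as the report
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am  <- factor(cars$am, labels = c("Automatic", "Manual"))
fit_adj <- lm(mpg ~ am + wt + cyl, data = cars)

# Two hypothetical cars: same weight and cylinders, different transmission
new_cars <- data.frame(
  am  = factor(c("Automatic", "Manual"), levels = levels(cars$am)),
  wt  = 3,
  cyl = factor("6", levels = levels(cars$cyl))
)

# Predicted mileage for each car
pred <- predict(fit_adj, newdata = new_cars)
print(round(pred, 2))

# The gap between the two predictions equals the amManual coefficient
print(unname(diff(pred)))
```

Because the model has no interaction terms, the predicted gap between the two transmissions is the same at any weight and cylinder count.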