Executive Summary

This report addresses two main questions: whether an automatic or manual transmission is better for MPG, and how large the MPG difference between automatic and manual transmissions is.

Regarding the first question, manual cars appear to have better mpg than automatic ones. However, on closer inspection for the second question, our main finding is that the relationship between mpg and transmission is weaker than it first appears, because it was confounded by the wt and qsec variables.

Exploratory Analysis

In this analysis we use the mtcars dataset, with ggplot2 for the visualizations.

library(ggplot2)
data("mtcars")

Of all the variables in this dataset, we focus on $mpg (miles per US gallon) and $am (transmission: 0 = automatic, 1 = manual).

Exploration:

# Change $am to factor
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##          am          gear            carb      
##  Automatic:19   Min.   :3.000   Min.   :1.000  
##  Manual   :13   1st Qu.:3.000   1st Qu.:2.000  
##                 Median :4.000   Median :2.000  
##                 Mean   :3.688   Mean   :2.812  
##                 3rd Qu.:4.000   3rd Qu.:4.000  
##                 Max.   :5.000   Max.   :8.000

We test the $mpg variable for normality with the Shapiro-Wilk test.

shapiro.test(mtcars$mpg)
## 
##  Shapiro-Wilk normality test
## 
## data:  mtcars$mpg
## W = 0.94756, p-value = 0.1229

As the p-value is > 0.05, we cannot reject the hypothesis that the sample comes from a normal distribution.
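Because the comparison that follows is made between the two transmission groups, it can also be useful to look at normality within each group. The following is a minimal sketch and not part of the original analysis; the names `cars` and `p_by_group` are introduced here for illustration only.

```r
# Work on a fresh copy of the data so the sketch is self-contained.
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Shapiro-Wilk p-value for mpg within each transmission group.
p_by_group <- tapply(cars$mpg, cars$am, function(x) shapiro.test(x)$p.value)
print(p_by_group)
```

With only 19 and 13 cars per group the test has limited power, so a large p-value here is weak evidence of normality at best.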

Is an automatic or manual transmission better for MPG?

The first thing we check is a boxplot of miles per gallon, grouped by transmission type.

ggplot(aes(y = mpg, x = factor(am), fill = factor(am)), data = mtcars) +
  geom_boxplot() +
  labs(x = "Transmission type", y = "Miles per gallon") +
  scale_fill_discrete("Transmission type") +
  theme(legend.position = "bottom")

Manual cars appear to have a higher mpg distribution than automatic ones. We test this with a two-sample t-test:

t.test(subset(mtcars, am == "Automatic")$mpg, subset(mtcars, am == "Manual")$mpg)
## 
##  Welch Two Sample t-test
## 
## data:  subset(mtcars, am == "Automatic")$mpg and subset(mtcars, am == "Manual")$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean of x mean of y 
##  17.14737  24.39231

The p-value is 0.0014, so we can reject the null hypothesis of equal means and conclude that manual cars have a higher mean mpg than automatic ones. However, we still need to check whether there are confounding variables.
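To make the test above less of a black box, the Welch statistic, its degrees of freedom and the two-sided p-value can be reproduced by hand. This is an illustrative sketch; the variable names (`auto`, `manual`, `t_stat`, `df_welch`, `p_value`) are ours, and it uses the raw 0/1 coding of am from a fresh copy of the data.

```r
cars <- datasets::mtcars
auto   <- cars$mpg[cars$am == 0]  # automatic transmission
manual <- cars$mpg[cars$am == 1]  # manual transmission

# Squared standard error of each group mean (no equal-variance assumption).
se2_auto   <- var(auto) / length(auto)
se2_manual <- var(manual) / length(manual)

# Welch t statistic for the difference in means (automatic minus manual).
t_stat <- (mean(auto) - mean(manual)) / sqrt(se2_auto + se2_manual)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto) - 1) + se2_manual^2 / (length(manual) - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
c(t = t_stat, df = df_welch, p = p_value)
```

The three values should agree with the t, df and p-value lines of the t.test output shown above.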

Quantify the MPG difference between automatic and manual transmissions

First we fit a linear regression model of mpg on transmission alone:

fit1 <- lm(mpg ~ am, data = mtcars)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

This model accounts for only about 34% of the mpg variability (adjusted R-squared = 0.3385). We therefore check whether other variables can help explain mpg.
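With a single two-level factor as predictor, the regression coefficients are just group means in disguise: the intercept is the mean mpg of the reference level (Automatic) and the amManual coefficient is the difference between the two group means. A short sketch to confirm this, using names of our own choosing (`cars`, `fit`, `mean_auto`, `mean_manual`):

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ am, data = cars)

# Group means computed directly from the data.
mean_auto   <- mean(cars$mpg[cars$am == "Automatic"])
mean_manual <- mean(cars$mpg[cars$am == "Manual"])

# Intercept vs. Automatic mean; amManual vs. difference in means.
rbind(
  from_lm    = coef(fit),
  from_means = c(mean_auto, mean_manual - mean_auto)
)
```

This is why the unadjusted estimate of 7.245 mpg matches the difference between the two sample means reported by the t-test.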

We use the step function, which applies a stepwise selection algorithm based on AIC to search for a better multivariable linear regression model, starting from a model that includes all the variables of the mtcars dataset.

fit_step <- step(lm(mpg ~ ., data = mtcars), trace = 0, steps = 10000)
summary(fit_step)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amManual      2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

The three selected variables are weight, quarter-mile time and transmission, which together account for about 83.4% of the mpg variability (adjusted R-squared = 0.8336).
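One way to see why the selected model is preferred is to compare it directly with the am-only model and with the full model. The sketch below is an addition of ours, not part of the original analysis. It refits the three models explicitly (the names `fit_am`, `fit_sel` and `fit_full` are hypothetical) and compares them by AIC. Note that step() relies on extractAIC(), which for linear models fitted to the same data differs from AIC() only by an additive constant, so the ranking is the same.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit_am   <- lm(mpg ~ am, data = cars)              # transmission only
fit_sel  <- lm(mpg ~ wt + qsec + am, data = cars)  # model chosen by step()
fit_full <- lm(mpg ~ ., data = cars)               # all predictors

# Lower AIC indicates a better trade-off between fit and complexity.
print(AIC(fit_am, fit_sel, fit_full))

# F-test: do wt and qsec add explanatory power beyond am alone?
print(anova(fit_am, fit_sel))
```

The anova call is valid here because the am-only model is nested within the selected model.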

We now check the residuals of the last fitted model with the standard diagnostic plots, and nothing looks problematic.

par(mfrow = c(2,2))
plot(fit_step)
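The diagnostic plots are a visual check. As a possible complement, the same Shapiro-Wilk test used earlier for mpg could be applied to the residuals. This is a sketch under the assumption that the selected model is mpg ~ wt + qsec + am, as in the output above; `fit` and `res_test` are names introduced here.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Refit the selected model so the sketch runs on its own.
fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Formal normality test on the model residuals.
res_test <- shapiro.test(residuals(fit))
print(res_test)
```

As before, a p-value above 0.05 would mean that normality of the residuals cannot be rejected, not that it has been demonstrated.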

Conclusions

In the first model, the estimated difference between manual and automatic cars was 7.245 mpg. After adjusting for wt and qsec it drops to 2.936 mpg, which means that wt and qsec were confounding the relationship between am and mpg. The p-value of the am coefficient also rises from 0.000285 to 0.046716. The relationship between am and mpg is therefore much weaker than the unadjusted comparison suggests, and only marginally significant at the 5% level. It is hard to quantify the difference between automatic and manual transmissions exactly, as there might be other confounding variables.
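The uncertainty around the transmission effect can be made explicit with confidence intervals for the amManual coefficient in both models. This is a minimal sketch rather than part of the original analysis; `fit_unadj`, `fit_adj`, `ci_unadj` and `ci_adj` are illustrative names.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit_unadj <- lm(mpg ~ am, data = cars)              # transmission only
fit_adj   <- lm(mpg ~ wt + qsec + am, data = cars)  # adjusted for wt and qsec

# 95% confidence intervals for the Manual vs. Automatic difference in mpg.
ci_unadj <- confint(fit_unadj)["amManual", ]
ci_adj   <- confint(fit_adj)["amManual", ]
print(rbind(unadjusted = ci_unadj, adjusted = ci_adj))
```

An adjusted interval whose lower bound sits close to zero is consistent with the p-value of 0.046716 reported above.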