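This report addresses two main questions: whether a manual or an automatic transmission gives better mpg, and how large the mpg difference between the two transmission types is.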
Regarding the first question, manual cars appear to have better mpg than automatic ones. However, after further inspection for the second question, our main finding is that the relationship between mpg and transmission is much weaker than it first appears, as it was confounded by the wt and qsec variables.
In this analysis we use the mtcars dataset, and we use ggplot2 for visualizations.
library(ggplot2)
data("mtcars")
Of all the variables in this dataset, we focus on $mpg (Miles/(US) gallon) and $am (Transmission (0 = automatic, 1 = manual)).
Exploration:
# Change $am to factor
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("Automatic", "Manual")
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Automatic:19 Min. :3.000 Min. :1.000
## Manual :13 1st Qu.:3.000 1st Qu.:2.000
## Median :4.000 Median :2.000
## Mean :3.688 Mean :2.812
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :8.000
We test the $mpg variable for normality with the Shapiro-Wilk test.
shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756, p-value = 0.1229
As the p-value is > 0.05, we cannot reject the hypothesis that the sample comes from a Normal distribution.
The first thing we check is a boxplot of the miles per gallon, grouped by transmission type.
ggplot(aes(y = mpg, x = factor(am), fill = factor(am)), data = mtcars) +
  geom_boxplot() +
  labs(x = "Transmission type", y = "Miles per gallon") +
  scale_fill_discrete("Transmission type") +
  theme(legend.position = "bottom")
Manual cars appear to have higher mpg values than automatic ones. We test for this:
t.test(subset(mtcars, am == "Automatic")$mpg, subset(mtcars, am == "Manual")$mpg)
##
## Welch Two Sample t-test
##
## data: subset(mtcars, am == "Automatic")$mpg and subset(mtcars, am == "Manual")$mpg
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
The p-value is 0.0014, so we can reject the null hypothesis of equal means: in this sample, Manual cars average 24.39 mpg against 17.15 mpg for Automatic ones. However, we need to check whether there are confounding variables.
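The same Welch test can also be written with the formula interface of t.test, which makes the direction of the difference easier to read. This is a minimal sketch, not the code used in the report: the names `mt`, `tt`, `group_means` and `mean_diff` are introduced here for illustration, and the data is taken from a fresh copy of the built-in dataset so the block runs on its own.

```r
# Fresh copy of the built-in data, so the sketch does not depend on earlier steps
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Welch two-sample t-test via the formula interface.
# Groups follow the factor levels, so the difference is Automatic minus Manual.
tt <- t.test(mpg ~ am, data = mt)

# Group means and their difference, expressed as Manual minus Automatic
group_means <- tapply(mt$mpg, mt$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

mean_diff    # about 7.24 mpg
tt$conf.int  # 95% interval for Automatic minus Manual, hence both bounds negative
tt$p.value
```

Because the interval is reported for Automatic minus Manual, its negative bounds correspond to a Manual advantage of roughly 3.2 to 11.3 mpg before any adjustment for other variables.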
First we construct a linear regression model on the transmission:
fit1 <- lm(mpg ~ am, data = mtcars)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
This model accounts for only about 36% of the mpg variability (R-squared = 0.36, adjusted R-squared = 0.34). Therefore we look for other variables that could be useful.
We use the step function, which applies a stepwise algorithm to search for a better multivariate linear regression model by AIC, starting from the model that includes all variables of the mtcars dataset.
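With a single two-level factor as predictor, the regression only restates the group means: the intercept is the Automatic mean and the amManual coefficient is the difference between the two means. The sketch below checks this and extracts both R-squared values; `mt`, `fit_am`, `group_means`, `mean_diff`, `r2` and `adj_r2` are illustrative names, not objects from the report.

```r
# Fresh copy of the built-in data with am as a labelled factor
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Same model as fit1 in the report, refitted here so the block is self-contained
fit_am <- lm(mpg ~ am, data = mt)

# The slope equals the difference in group means (Manual minus Automatic)
group_means <- tapply(mt$mpg, mt$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])
all.equal(unname(coef(fit_am)["amManual"]), mean_diff)

# Share of mpg variability explained, with and without the adjustment
r2 <- summary(fit_am)$r.squared
adj_r2 <- summary(fit_am)$adj.r.squared
c(r2 = r2, adj_r2 = adj_r2)
```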
fit_step <- step(lm(mpg ~ ., data = mtcars), trace = 0, steps = 10000)
summary(fit_step)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## amManual 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
The three selected variables are weight (wt), 1/4 mile time (qsec) and transmission (am), which together account for 85.0% of the mpg variability (adjusted R-squared = 0.834).
We now check the residual diagnostic plots from the last fitted model, and everything looks ok.
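One way to double-check the selection is to compare the AIC of the full and selected models, and to test the selected model against the transmission-only model with a nested F-test. The sketch below assumes backward elimination, which is what step does by default when no scope is given; the names `mt`, `fit_full`, `fit_sel`, `fit_am` and `cmp` are introduced here for illustration.

```r
# Fresh copy of the built-in data with am as a labelled factor
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Full model, then backward stepwise selection by AIC (direction made explicit)
fit_full <- lm(mpg ~ ., data = mt)
fit_sel <- step(fit_full, direction = "backward", trace = 0)

# Terms kept by the selection
attr(terms(fit_sel), "term.labels")

# Lower AIC is better; the selected model should not be worse than the full one
AIC(fit_full, fit_sel)

# Nested F-test: does adding wt and qsec improve on the am-only model?
fit_am <- lm(mpg ~ am, data = mt)
cmp <- anova(fit_am, fit_sel)
cmp
```

A small p-value in the anova table indicates that wt and qsec explain variability that transmission alone does not.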
par(mfrow = c(2,2))
plot(fit_step)
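The diagnostic plots can be complemented with a formal check, applying the same Shapiro-Wilk test used earlier for mpg to the residuals of the selected model. This is only a sketch: the model is refitted under the illustrative name `fit_sel`, and the outcome is not reported in the original analysis, so no result is claimed here.

```r
# Fresh copy of the built-in data with am as a labelled factor
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Model chosen by the stepwise selection, refitted so the block is self-contained
fit_sel <- lm(mpg ~ wt + qsec + am, data = mt)

# Shapiro-Wilk test on the residuals; a p-value above 0.05 would mean that
# normality of the residuals cannot be rejected
res_test <- shapiro.test(resid(fit_sel))
res_test
```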
Comparing the two models, the estimated difference between Manual and Automatic drops from 7.245 mpg in the first model to 2.936 mpg once wt and qsec are included, which means that the wt and qsec variables were confounding the relationship between am and mpg. The p-value of the am coefficient also rises from 0.000285 to 0.046716, so the transmission effect is only marginally significant at the 0.05 level. It is therefore hard to quantify exactly the difference between automatic and manual transmissions, as there might be other confounding variables.
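The uncertainty around the adjusted transmission effect can be made explicit with a confidence interval, and the confounding can be illustrated by comparing car weights across the two groups. This is a minimal sketch with illustrative names (`mt`, `fit_sel`, `ci_am`, `wt_means`), not part of the original analysis.

```r
# Fresh copy of the built-in data with am as a labelled factor
mt <- datasets::mtcars
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Adjusted model from the stepwise selection
fit_sel <- lm(mpg ~ wt + qsec + am, data = mt)

# 95% confidence interval for the Manual vs Automatic difference,
# holding wt and qsec fixed
ci_am <- confint(fit_sel)["amManual", ]
ci_am

# Average weight (in 1000 lbs) by transmission type: a large gap here is
# what allows wt to confound the raw mpg comparison
wt_means <- tapply(mt$wt, mt$am, mean)
wt_means
```

The lower bound of the interval sits just above zero, consistent with the p-value of 0.0467, so the data are compatible with anything from a negligible advantage to one of several mpg for manual cars.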