Motor Trend magazine has analyzed the fuel efficiency (MPG) of several car models, contrasting the two transmission types in the data set (automatic versus manual). Two questions are addressed: is an automatic or a manual transmission better for MPG, and how large is the MPG difference between them? The results indicate that manual transmission cars achieve higher MPG than automatic transmission cars, and that the size of the difference depends not only on the transmission type but also on the relationship between transmission and car weight. A two-sample t-test confirms that mean MPG is higher for manual transmission cars.
library(datasets)
library(ggplot2)
library(ggfortify)
data("mtcars")
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
According to the exploratory analysis, mpg is most strongly correlated with “cyl”, “wt”, “disp” and “hp”.
The question is whether an automatic or a manual transmission is better for mpg. Since mpg is a continuous performance measure and transmission (am) is a discrete variable, other factors may also affect mpg. The null hypothesis (H0) is that there is no difference in mean mpg between manual and automatic transmissions, while the alternative hypothesis (H1) is that a difference exists. In other words, under the null hypothesis manual and automatic cars come from the same population, so their mpg distributions tend to be the same. To decide whether to reject the null hypothesis, I ran a two-sample t-test:
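One way to check this ranking is to compute the correlation of every variable with mpg and sort by absolute strength. This is only a minimal sketch; the name `mpg_cor` is introduced here for illustration and is not part of the original analysis.

```r
library(datasets)
data("mtcars")

# Pearson correlation of each predictor with mpg (drop mpg's self-correlation)
mpg_cor <- cor(mtcars)["mpg", -1]

# Order by absolute strength so the strongest linear relationships come first
mpg_cor <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
round(mpg_cor, 3)
```

A correlation only measures a linear association, so the pairs plot remains useful for spotting curved relationships.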
test_am <- t.test(mtcars$mpg~ mtcars$am)
test_am$p.value
## [1] 0.001373638
test_am$estimate
## mean in group 0 mean in group 1
## 17.14737 24.39231
The p-value (about 0.0014) is below 0.05, so the null hypothesis can be rejected: the two types of transmission appear to come from different populations. Mean MPG for manual transmissions is about 7.2 higher than for automatic transmissions (24.39 versus 17.15).
The first model was fit with all variables as predictors, keeping “am” as the variable of primary interest, since it is the main question.
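To make the test less of a black box, the Welch t statistic can be reproduced by hand from the group means and variances. The sketch below assumes the default unequal-variance (Welch) form that `t.test` uses; the names `auto`, `manual`, `mean_diff`, `se_diff` and `t_manual` are illustrative only.

```r
library(datasets)
data("mtcars")

auto   <- mtcars$mpg[mtcars$am == 0]  # automatic transmission
manual <- mtcars$mpg[mtcars$am == 1]  # manual transmission

# Difference in group means (automatic minus manual, matching t.test's order)
mean_diff <- mean(auto) - mean(manual)

# Welch standard error: the group variances are not pooled
se_diff <- sqrt(var(auto) / length(auto) + var(manual) / length(manual))

t_manual <- mean_diff / se_diff

# Compare with the statistic and confidence interval reported by t.test
test_am <- t.test(mpg ~ am, data = mtcars)
c(by_hand = t_manual, t_test = unname(test_am$statistic))
test_am$conf.int
```

The confidence interval is for the automatic-minus-manual difference, so an interval that lies entirely below zero agrees with manual cars having the higher mean.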
complete_model <- lm(mpg ~ ., data = mtcars)
complete_model$coefficients
## (Intercept) cyl disp hp drat wt
## 12.30337416 -0.11144048 0.01333524 -0.02148212 0.78711097 -3.71530393
## qsec vs am gear carb
## 0.82104075 0.31776281 2.52022689 0.65541302 -0.19941925
The full model has a residual standard error of 2.637 on 21 degrees of freedom and an adjusted R-squared of about 0.81, so it explains roughly 80% of the variance in MPG. However, none of the coefficients is significant at the 95% confidence level, so it is necessary to select the statistically significant variables:
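The significance claim can be checked by pulling the p-values out of the model summary rather than reading them off the printed table. This is a sketch only; `fit_summary` and `p_values` are names assumed here.

```r
library(datasets)
data("mtcars")

complete_model <- lm(mpg ~ ., data = mtcars)
fit_summary <- summary(complete_model)

# p-values for each slope (intercept row dropped)
p_values <- fit_summary$coefficients[-1, "Pr(>|t|)"]
round(sort(p_values), 3)

# TRUE if no predictor reaches the 5% significance level
all(p_values > 0.05)

# Adjusted R-squared of the full model
fit_summary$adj.r.squared
```

A high R-squared combined with no individually significant coefficient is a typical symptom of strongly correlated predictors, which is what motivates the variable selection step.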
model2 <- step(complete_model, k=log(nrow(mtcars)))
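The model was selected by stepwise search on an information criterion, which is used to determine which model fits better after penalizing model size. Note that with k = log(n) the penalty in step corresponds to the Bayesian Information Criterion (BIC) rather than the default “Akaike Information Criterion” (AIC, k = 2). In this particular case the model selected was: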
M1: MPG ~ WT + QSEC + AM
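The selection can be rerun quietly and the chosen formula refit on its own to read off its fit statistics. In this sketch `trace = 0` only suppresses the step-by-step log, and `m1` is a name assumed here for the refitted model.

```r
library(datasets)
data("mtcars")

complete_model <- lm(mpg ~ ., data = mtcars)

# Backward elimination; k = log(n) applies the BIC penalty (k = 2 would be AIC).
# trace = 0 suppresses the step-by-step log.
model2 <- step(complete_model, k = log(nrow(mtcars)), trace = 0)
formula(model2)

# Refit the selected formula directly; `m1` is a name assumed here
m1 <- lm(mpg ~ wt + qsec + am, data = mtcars)
summary(m1)$r.squared   # share of variance explained
summary(m1)$sigma       # residual standard error
```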
This model, with only 3 variables, explains about 85% of the variance in MPG (R^2 of 0.8497) and has a residual standard error of 2.459 on 28 degrees of freedom; the value 67.17 reported by step is the selection criterion for this model, not a percentage of variance. The p-values are significant for each of the variables (p-value < 0.05). However, weight and transmission are related, since automatic transmission cars tend to be heavier than manual transmission cars, so I fit another model to test an interaction between them:
m2 <- lm(mpg~qsec+am+wt:am, data = mtcars)
summary(m2)
##
## Call:
## lm(formula = mpg ~ qsec + am + wt:am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4694 -0.9707 0.1469 1.8014 4.7670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.7295 5.6328 -1.372 0.180885
## qsec 1.3681 0.3079 4.443 0.000127 ***
## am 23.7657 3.4011 6.988 1.34e-07 ***
## am:wt -6.3851 1.3950 -4.577 8.81e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.684 on 28 degrees of freedom
## Multiple R-squared: 0.8209, Adjusted R-squared: 0.8017
## F-statistic: 42.77 on 3 and 28 DF, p-value: 1.386e-10
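The relationship between weight and transmission that motivates the interaction term can be inspected directly. The sketch below compares mean weight by transmission group; the names `mean_wt` and `wt_test` are illustrative, and the t-test on weight is an added check, not part of the original analysis.

```r
library(datasets)
data("mtcars")

# Mean weight (in 1000 lbs) for automatic (0) and manual (1) cars
mean_wt <- tapply(mtcars$wt, mtcars$am, mean)
mean_wt

# Welch t-test on weight between the two transmission groups
wt_test <- t.test(wt ~ am, data = mtcars)
wt_test$p.value
```

Because `am` is coded 0/1, the term `wt:am` in m2 is zero for automatic cars, so in that model weight only affects the fitted MPG of manual cars.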
To put this model in context, I also fit a simpler model that relates the MPG variable to the AM variable alone:
mp <- lm(mpg~am, data=mtcars)
summary(mp)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
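Since mp contains a subset of the terms in m2, one possible way to compare the two is a nested-model F-test with `anova`. This is an added sketch rather than a step from the original analysis; `cmp` is a name assumed here.

```r
library(datasets)
data("mtcars")

mp <- lm(mpg ~ am, data = mtcars)                 # transmission only
m2 <- lm(mpg ~ qsec + am + wt:am, data = mtcars)  # adds qsec and weight-by-transmission

# F-test: do the extra terms in m2 reduce the residual sum of squares
# by more than chance alone would explain?
cmp <- anova(mp, m2)
cmp

# Side-by-side share of variance explained
c(mp = summary(mp)$r.squared, m2 = summary(m2)$r.squared)
```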
The results show that model m2 has an R-squared of 0.8209, with a residual standard error of 2.684 on 28 degrees of freedom. On the other hand, model mp has an R-squared of 0.3598, with a residual standard error of 4.902 on 30 degrees of freedom. Now for the selection of the final model:
m2$coefficients
## (Intercept) qsec am am:wt
## -7.729507 1.368127 23.765661 -6.385125
The coefficients show that, when QSEC remains constant, cars with manual transmission get 23.77 + (-6.39)*WT more MPG than cars with automatic transmission, where WT is measured in units of 1000 lbs. For example, a manual transmission car that weighs 2000 lbs (WT = 2) is expected to get about 11.0 more MPG than an automatic transmission car with the same weight and QSEC values.
Please refer to the Appendix section for the plots. According to the residual plots, the following underlying assumptions can be checked:
A. The Residuals vs. Fitted plot shows no consistent pattern, supporting the independence assumption.
B. The Normal Q-Q plot indicates that the residuals are approximately normally distributed, because the points lie close to the line.
C. The Scale-Location plot supports the constant variance assumption, as the points are randomly distributed.
D. The Residuals vs. Leverage plot suggests that no outliers are present, as all values fall well within the 0.5 bands.
As for the dfbetas, which measure how much an observation has affected the estimate of a regression coefficient, the result is the following:
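The arithmetic behind this interpretation can be reproduced from the fitted coefficients. In the sketch below, `manual_gain` is a hypothetical helper introduced for illustration; it returns the fitted manual-minus-automatic difference at a given weight, with weight in units of 1000 lbs as in mtcars.

```r
library(datasets)
data("mtcars")

m2 <- lm(mpg ~ qsec + am + wt:am, data = mtcars)
b  <- coef(m2)

# Manual-minus-automatic difference in expected MPG at a given weight
# (wt is in units of 1000 lbs), holding qsec fixed.
manual_gain <- function(wt) unname(b["am"] + b["am:wt"] * wt)

manual_gain(2)    # a 2000 lb car
manual_gain(3.5)  # a 3500 lb car

# Weight at which the fitted manual advantage reaches zero
unname(-b["am"] / b["am:wt"])
```

The fitted advantage shrinks as weight increases, which is why a single number for the manual-versus-automatic difference can be misleading.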
sum((abs(dfbetas(m2)))>1)
## [1] 0
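For a little more detail than a single count, the largest absolute dfbetas per coefficient can be listed, together with Cook's distance as a complementary influence measure. This is an added sketch; `infl` and `top_cook` are names assumed here, and the threshold of 1 is the same rule of thumb used above.

```r
library(datasets)
data("mtcars")

m2 <- lm(mpg ~ qsec + am + wt:am, data = mtcars)

# One row per car, one column per coefficient
infl <- dfbetas(m2)

# Largest absolute change per coefficient; values above 1 would be flagged
apply(abs(infl), 2, max)

# The three cars with the largest Cook's distance
top_cook <- head(sort(cooks.distance(m2), decreasing = TRUE), 3)
top_cook
```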
mtcars$cyl <- as.numeric(mtcars$cyl)
mtcars$vs <- as.numeric(mtcars$vs)
mtcars$am <- as.numeric(mtcars$am)
mtcars$gear <- as.numeric(mtcars$gear)
mtcars$carb <- as.numeric(mtcars$carb)
pairs(mtcars, panel = panel.smooth, main = "Pair graph Road Test MT-Cars")
boxplot(mtcars$mpg~mtcars$am, xlab="Transmission (0 = Automatic, 1 = Manual)", ylab="MPG",main="Boxplot of MPG vs. Transmission")
mtcars$am <- as.factor(mtcars$am)
# Scatter plot of MPG against weight, colored by transmission type.
# height and width are not geom_point aesthetics, so they are dropped.
plt3 <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(group = am, color = am)) +
  scale_colour_discrete(labels = c("Automatic", "Manual")) +
  xlab("Weight") +
  ggtitle("MPG related Weight by Transmission") +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"))
plt3
autoplot(m2, smooth.colour=NA)
## Warning: Removed 32 row(s) containing missing values (geom_path).
## Removed 32 row(s) containing missing values (geom_path).
## Removed 32 row(s) containing missing values (geom_path).