The goal of this analysis is to determine the impact of automatic vs. manual transmission on fuel consumption (MPG). It answers two questions: “Is an automatic or manual transmission better for MPG?” and “What is the extent of the MPG difference between automatic and manual transmissions?”
Data source: Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
A simple two-group comparison shows a significant difference in fuel consumption (mpg) by transmission type: manual cars averaged 7.245 more “miles per gallon” than automatic cars.
However, further analysis shows a very different picture: once other variables are taken into account, transmission type is no longer significant. Instead, “weight”, “horsepower” and “number of cylinders” appear to have a more significant impact on MPG.
# Load required libraries
library(dplyr)
library(tidyr)
library(ggplot2)
# Load the mtcars data
data(mtcars)
# Keeping the original data
mtcars.orig <- mtcars
# Rename the variables to more descriptive names
mtcars <- mtcars %>%
select(milespergallon = mpg,
cylinders = cyl,
displacement = disp,
horsepower = hp,
rearaxleratio = drat,
weight = wt,
quartermiletime = qsec,
vs = vs,
transmission = am,
gears = gear,
carburetors = carb)
# Convert the categorical variables into factors
mtcars$transmission <- factor(mtcars$transmission)
mtcars$cylinders <- factor(mtcars$cylinders)
mtcars$gears <- factor(mtcars$gears)
mtcars$carburetors <- factor(mtcars$carburetors)
mtcars$vs <- factor(mtcars$vs)
# Assign names to Transmission variable for easy identification
levels(mtcars$transmission) <- c("automatic", "manual")
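Because the labels are assigned by position, it is worth confirming that code 0 really became “automatic” and code 1 became “manual”. The following is a minimal sketch of such a check. It works on the original `am` column of the built-in `mtcars` data rather than the renamed data frame, and the names `am_factor` and `mapping` are introduced here for illustration only.

```r
# Recode `am` in one step; labels are matched to the levels 0 and 1 in order.
am_factor <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

# Cross-tabulate the raw codes against the new labels to confirm the mapping.
mapping <- table(raw = mtcars$am, label = am_factor)
print(mapping)
```

All cars coded 0 should fall in the “automatic” column and all cars coded 1 in the “manual” column, with zeros off the diagonal.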
Now we want to check whether the “Transmission type” has any impact on the “Miles per Gallon”.
# Density plots with semi-transparent fill
ggplot(mtcars, aes(x=milespergallon, fill=transmission)) +
geom_density(alpha=.3) +
xlab("Miles per Gallon") +
ylab("Density") +
ggtitle("Impact of Transmission on 'Miles per Gallon'")
From the graph it is visually apparent that cars with “Automatic Transmission” tend to get fewer miles per gallon than cars with “Manual Transmission”. Let’s conduct a t-test and determine the p-value. A p-value below 0.05 would indicate that the group means are likely different.
t.test(mtcars$milespergallon ~ mtcars$transmission)
##
## Welch Two Sample t-test
##
## data: mtcars$milespergallon by mtcars$transmission
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group automatic mean in group manual
## 17.14737 24.39231
Since the p-value is 0.001374, well below 0.05, there is indeed a difference in fuel consumption based on the Transmission type. Taken on its own, Manual transmission is better for MPG.
The average miles per gallon for Manual Transmission is 24.39231, which is 7.24494 higher than the average for Automatic Transmission (17.14737).
The density plot by Transmission shows considerable overlap between the two groups, indicating that Transmission may not be the best predictor of MPG. To find better predictors, a backward stepwise regression was conducted in the “appendix”, and the resulting “suggested model” is compared with the initial model, the one using only Transmission. The analysis shows that a model combining “weight”, “horsepower” and “cylinders” with Transmission fits better than “Transmission” alone.
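The difference in means and its confidence interval can also be read directly from the test object instead of being copied from the printed output. This is a small sketch using the original column names (`mpg`, `am`) of the built-in `mtcars` data. The variable names `tt`, `mean_diff` and `ci_manual_minus_auto` are illustrative.

```r
# Welch two-sample t-test of mpg by transmission code (0 = automatic, 1 = manual).
tt <- t.test(mpg ~ am, data = mtcars)

# Group means are returned in level order: automatic first, then manual.
group_means <- unname(tt$estimate)
mean_diff <- group_means[2] - group_means[1]

# t.test() reports the interval for (automatic - manual);
# negate and reverse it to express it as (manual - automatic).
ci_manual_minus_auto <- -rev(as.numeric(tt$conf.int))

print(mean_diff)
print(ci_manual_minus_auto)
```

Expressing the interval as manual minus automatic makes it easier to quote alongside the statement that manual cars get more miles per gallon.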
#--------------------------------
# Plotting with Base graphics
#--------------------------------
x <- mtcars$milespergallon
# Plotting the histogram
myhist <- hist(x, breaks=10, density=10, col="darkgrey",
xlab="Miles per Gallon",
main="Frequency Distribution of Miles per Gallon")
# Adding a vertical line for the mean
abline(v=mean(x), col="darkgreen", lwd=2)
# Plotting the density curve
multiplier <- myhist$counts / myhist$density
mydensity <- density(x)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity, col="blue", lwd=2)
# Plotting the normal curve with the same mean and Standard deviation
xfit <- seq(min(x), max(x), length=40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(myhist$mids[1:2]) * length(x)
lines(xfit, yfit, col="red", lwd=2)
# Add legend
legend('topright', c("Mean", "Density Curve", "Normal Curve"),
lty=c(1,1,1), lwd=c(2,2,2), col = c("darkgreen", "blue", "red"))
g <- ggplot(mtcars, aes(x=milespergallon))
g <- g + geom_histogram(aes(y = ..density..), fill="dark grey")
g <- g + geom_density(alpha=.3, fill="#FF6666")
g <- g + stat_function(fun = dnorm, colour = "red",
args = list(mean = mean(mtcars$milespergallon),
sd=sd(mtcars$milespergallon)))
g <- g + xlab("Miles per Gallon")
g <- g + ylab("Density")
g <- g + ggtitle("Frequency Distribution of Miles per Gallon")
g
Comparing the density curve with the normal curve, we see that the distribution is fairly close to “Normal”.
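The visual comparison can be complemented with a formal check. The sketch below applies the Shapiro-Wilk test, which is not part of the original analysis, to the `mpg` column of the built-in `mtcars` data. The null hypothesis is that the data come from a normal distribution, so a p-value above 0.05 means normality is not rejected.

```r
# Shapiro-Wilk normality test on miles per gallon.
sw <- shapiro.test(mtcars$mpg)
print(sw)

# TRUE when the test does not reject normality at the 5% level.
normality_not_rejected <- sw$p.value > 0.05
print(normality_not_rejected)
```

With only 32 observations the test has limited power, so it is best read together with the plots rather than as a replacement for them.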
pairs(mtcars.orig, col="red")
library(corrplot)
M <- cor(mtcars.orig)
corrplot.mixed(M)
From the correlation plot we can see that the variables most highly correlated with MPG are Cylinders, Displacement, Horsepower and Weight. However, we also note that among these variables, Displacement is highly correlated with Cylinders and Horsepower, so the Displacement variable can probably be dropped. We will see later whether this assumption holds.
In order to select the best model, we need to find the variables that have the biggest impact on fuel consumption, besides the transmission type. One way is to use “Backward Stepwise Regression”, which starts with all predictors and removes, one at a time, those whose removal improves the model’s AIC (the criterion that step() uses with k=2).
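The same information can be read numerically rather than from the plot. The following sketch ranks the variables by the absolute value of their correlation with `mpg`, using the original column names of the built-in `mtcars` data. The names `cors` and `ranked` are illustrative.

```r
# Correlation of every variable with mpg, excluding mpg itself.
cors <- cor(mtcars)[, "mpg"]
cors <- cors[names(cors) != "mpg"]

# Rank by strength of association, ignoring the sign.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))

# How strongly displacement overlaps with the other leading candidates.
disp_overlap <- cor(mtcars)["disp", c("cyl", "hp", "wt")]
print(round(disp_overlap, 3))
```

The second printout is the basis for the collinearity argument: if `disp` is highly correlated with the other candidates, it adds little independent information.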
full.model <- lm(milespergallon ~ ., data = mtcars)
reduced.model <- step(full.model, direction="backward", k=2, trace=0)
summary(reduced.model)
##
## Call:
## lm(formula = milespergallon ~ cylinders + horsepower + weight +
## transmission, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cylinders6 -3.03134 1.40728 -2.154 0.04068 *
## cylinders8 -2.16368 2.28425 -0.947 0.35225
## horsepower -0.03211 0.01369 -2.345 0.02693 *
## weight -2.49683 0.88559 -2.819 0.00908 **
## transmissionmanual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
If we look at the significance asterisks in the coefficient table, the model suggests that the predictors “weight”, “horsepower” and “cylinders” have the largest impact on the “Miles per Gallon”, while the transmission coefficient (p = 0.206) is not significant at the 0.05 level. The stepwise procedure removed Displacement, so that variable can indeed be dropped.
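Since `step()` with `k=2` selects on AIC, it can be useful to compare the AIC of the full and reduced models directly. The sketch below rebuilds the factors from the built-in `mtcars` data under its original column names, so the selected terms appear as `cyl`, `hp`, `wt` and `am` rather than the descriptive names used above. The data frame name `cars` is illustrative.

```r
# Rebuild the categorical variables as factors, as in the main analysis.
cars <- transform(mtcars,
                  cyl  = factor(cyl),
                  vs   = factor(vs),
                  am   = factor(am, labels = c("automatic", "manual")),
                  gear = factor(gear),
                  carb = factor(carb))

# Full model with every predictor, then backward elimination on AIC (k = 2).
full <- lm(mpg ~ ., data = cars)
reduced <- step(full, direction = "backward", k = 2, trace = 0)

# Terms that survived the elimination, and the AIC of both models.
kept_terms <- attr(terms(reduced), "term.labels")
print(kept_terms)
print(c(full = AIC(full), reduced = AIC(reduced)))
```

A lower AIC for the reduced model indicates that the dropped predictors were not earning their extra parameters. It does not by itself say that each remaining coefficient is significant.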
#-------------------------------
## Exploring the predictors using box plot
#-------------------------------
# Plotting the box-plot into 2 * 2 matrix
par(mfrow=c(2, 2))
# Impact of Transmission on the Fuel consumption
boxplot(mtcars$milespergallon ~ mtcars$transmission, xlab="Transmission")
# Impact of weight on the Fuel consumption
boxplot(mtcars$milespergallon ~ mtcars$weight, xlab="Weight")
# Impact of Horsepower on the Fuel consumption
boxplot(mtcars$milespergallon ~ mtcars$horsepower, xlab = "Horsepower")
# Impact of Cylinder on the Fuel consumption
boxplot(mtcars$milespergallon ~ mtcars$cylinders, xlab="Cylinders")
We can see that the MPG ranges of the two transmission types overlap considerably. Hence, Fuel Consumption does not depend strongly on the Type of Transmission alone. For the other predictors, however, variation in the predictor is accompanied by substantial variation in Fuel Consumption.
Hence, we continue to assume that the best-fit model should include the combination of “Weight”, “Horsepower” and “Cylinders”.
We will now use “analysis of variance (ANOVA)” to compare the “Starting Model”, which uses only “Transmission” as the predictor, with the “Suggested Model”, which adds “weight”, “horsepower” and “cylinders” to it.
# The Starting and Suggested Model
starting.model <- lm(milespergallon ~ transmission, data = mtcars)
suggested.model <- lm(milespergallon ~ cylinders + horsepower + weight +
transmission, data = mtcars)
# Conduct Analysis of Variance
anova(starting.model, suggested.model)
## Analysis of Variance Table
##
## Model 1: milespergallon ~ transmission
## Model 2: milespergallon ~ cylinders + horsepower + weight + transmission
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 26 151.03 4 569.87 24.527 1.688e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The analysis of variance (ANOVA) resulted in a p-value (1.688e-08) much lower than 0.05, which indicates that the “Suggested Model” fits significantly better than the “Starting Model”.
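The comparison can also be run in the other direction, asking whether transmission still contributes once weight, horsepower and cylinders are in the model. This is a sketch of that partial F-test, which is not part of the original analysis. It uses the original column names of the built-in `mtcars` data, and the model names `without_am` and `with_am` are illustrative.

```r
# Same factor coding as the main analysis, original column names.
cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("automatic", "manual")))

# Nested models: identical except for the transmission term.
without_am <- lm(mpg ~ cyl + hp + wt, data = cars)
with_am    <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Partial F-test for the single added term.
cmp <- anova(without_am, with_am)
print(cmp)

# p-value for adding transmission to the other three predictors.
p_am <- cmp[2, "Pr(>F)"]
print(p_am)
```

Because only one coefficient is added, this F-test is equivalent to the t-test on `transmissionmanual` in the regression summary, so the p-value should match the 0.20646 reported there.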
x <- mtcars$milespergallon
y <- resid(suggested.model)
ggplot(data.frame(x, y), aes(x,y)) +
geom_hline(yintercept=0, size=1) +
geom_point(size=3, colour="black", alpha=0.3) +
geom_point(size=2, colour="red", alpha=0.8) +
xlab("Miles per Gallon (observed)") +
ylab("Unexplained variances") +
geom_smooth(method="lm", colour="red", lwd=1)
In the residual plot we don’t see any strong pattern, suggesting that little of the remaining variance in fuel consumption can be explained by the other predictors available in the dataset.
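One caveat when reading this plot: in a least-squares fit, residuals are uncorrelated with the fitted values but are positively correlated with the observed response by construction, so a mild upward trend against observed MPG is expected and is not in itself a sign of a missing predictor. The sketch below illustrates this with the suggested model, refit on the built-in `mtcars` data under its original column names. The names `fit`, `r_obs` and `r_fit` are illustrative.

```r
# Refit the suggested model using the original column names.
cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("automatic", "manual")))
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Correlation of the residuals with the observed response...
r_obs <- cor(cars$mpg, resid(fit))
# ...and with the fitted values, which is zero up to rounding error.
r_fit <- cor(fitted(fit), resid(fit))

print(c(vs_observed = r_obs, vs_fitted = r_fit))

# For least squares with an intercept, r_obs equals sqrt(1 - R^2).
print(sqrt(1 - summary(fit)$r.squared))
```

For this reason residuals are more commonly plotted against `fitted(fit)`, where any visible trend does point to a problem with the model.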
par(mfrow=c(2, 2))
plot(suggested.model)
Points in the Q-Q plot are more or less on the line, indicating that the residuals are approximately normally distributed.