Some insights about the project
You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
1.-Is an automatic or manual transmission better for MPG?
2.-Quantify the MPG difference between automatic and manual transmissions?
Regression analysis shows that, other than transmission type, the number of cylinders, horsepower, and weight are the important factors affecting MPG.
Executive Summary
This report analyzed the relationship between transmission type
(manual or automatic) and miles per gallon (MPG). The report set out to
determine which transmission type produces a higher MPG. The
mtcars dataset was used for this analysis.
A t-test between automatic and manual transmission vehicles shows
that manual transmission vehicles average 7.245 more MPG
than automatic transmission vehicles. After fitting multiple linear
regressions, the analysis showed that transmission type contributes
much less once other variables are taken into account: an improvement of
only 1.81 MPG for manual transmission, which is not statistically
significant. Other variables, such as weight, horsepower, and number of
cylinders, contributed more significantly to the overall MPG of vehicles.
Data processing
First, load the dataset and perform some basic exploratory data analysis.
suppressMessages(library(xtable)) # Pretty printing dataframes
suppressMessages(library(ggplot2)) # Plotting
suppressMessages(library(gridExtra, warn.conflicts = FALSE))
suppressMessages(library(reshape2)) # Transforming Data Frames
data(mtcars)
Exploratory Data Analysis
Variables
knitr::kable(head(mtcars[, 1:4]), "simple",align = "lccrr")
| | mpg | cyl | disp | hp |
|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 |
| Datsun 710 | 22.8 | 4 | 108 | 93 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 |
| Valiant | 18.1 | 6 | 225 | 105 |
Data
Taking a look at the data:
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
They are all numbers, but some should be categories. I will map the
categorical ones (cyl, vs, am, gear and carb) to factors for
easier reading:
Transform data:
mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am <- factor(mtcars$am,labels=c('Automatic','Manual'))
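A quick sanity check of the conversion can be sketched as follows. It works on a copy named `cars` (a name chosen for this illustration, not one used elsewhere in the report) so that it can be run on its own.

```r
# Work on a copy so the built-in data frame is left untouched.
# `cars` and `factor_cols` are illustrative names, not part of the report.
cars <- datasets::mtcars
factor_cols <- c("cyl", "vs", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], factor)
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Column classes after the conversion
col_classes <- sapply(cars, class)
print(col_classes)

# Number of cars per transmission type
am_counts <- table(cars$am)
print(am_counts)
```

The counts show how unbalanced the two transmission groups are, which is worth keeping in mind when reading the tests further down.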
The pairwise scatter plot between all variables is also shown
(Appendix 1).
Data Visualization
Scatter plot matrix
We use the following command:
pairs(mtcars, panel = panel.smooth, main = "Motor Trend", col = "light blue")
However, we find correlation heat maps easier to read, as follows:
Correlation Heat Map
correlation_matrix <- function(data) {
numeric_data <- data[, sapply(data,is.numeric)]
matrix <- round(cor(numeric_data), 2)
matrix[upper.tri(matrix)] <- NA
matrix <- melt(matrix, na.rm = TRUE)
return(matrix)
}
correlation_heat_map <- function(data) {
matrix <- correlation_matrix(data)
ggplot(data = matrix, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), name="Pearson Correlation") +
theme(axis.text.x = element_text(angle = 90)) +
coord_fixed()
}
correlation_heat_map(mtcars)
The potential predictors for MPG seem correlated among themselves to some degree, with a few exceptions:
*gear seems weakly correlated to hp.
*drat seems weakly correlated to qsec or carb.
*wt seems weakly correlated to qsec.
We can easily explain the pairs that correlate:
*Greater wt (weight) obviously implies more fuel
consumption.
*Greater cyl (number of cylinders), disp
(displacement volume) or carb (number of carburetors)
implies a more powerful engine and therefore more hp
(horsepower). Greater hp generally implies
less fuel efficiency (you optimize for efficiency, for
power, or for a compromise between the two).
*Greater gear (number of gears) implies more freedom
to choose the appropriate gear for a given speed,
thus resulting in better fuel efficiency.
*drat (rear axle ratio) is a little trickier: the greater
the ratio, the greater the engine’s RPM (rotations per minute) required to
keep the same speed, thus more fuel consumption.
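To put numbers next to these explanations, the Pearson correlation of each variable with mpg can be listed directly. This is a minimal sketch on the untouched numeric copy of the data (`raw` is an illustrative name), since `cor()` needs numeric columns.

```r
# Untouched numeric copy of the data (all 11 columns are numeric here).
raw <- datasets::mtcars

# Correlation of every other variable with mpg
cor_with_mpg <- cor(raw)["mpg", ]
cor_with_mpg <- cor_with_mpg[names(cor_with_mpg) != "mpg"]

# Sort by absolute strength, strongest first
cor_sorted <- round(cor_with_mpg[order(abs(cor_with_mpg), decreasing = TRUE)], 2)
print(cor_sorted)
```

Sorting by absolute value keeps strong negative relationships (such as weight) at the top alongside strong positive ones.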
Box Plots
Here are some box plots for the data:
box_plot <- function(data, y_column, x_column, x_title, y_title) {
ggplot(data, aes_string(y = y_column, x = x_column)) +
geom_boxplot(aes_string(fill = x_column)) +
geom_point(position = position_jitter(width = 0.2), color = "blue", alpha = 0.2) +
xlab(x_title) +
ylab(y_title)
}
mpg_box <- box_plot(mtcars, "mpg", "am", "Transmission", "Miles per U.S. Gallon")
hp_box <- box_plot(mtcars, "hp", "am", "Transmission", "Horse Power")
gear_box <- box_plot(mtcars, "gear", "am", "Transmission", "Number of Gears")
grid.arrange(mpg_box, hp_box, gear_box, ncol = 3)
From the box plots, we indeed seem to have better fuel efficiency for vehicles with manual transmission, for the following reasons:
*The vehicles with automatic transmission in the data-set seem to have greater horsepower, which correlates to less fuel efficiency.
*The vehicles with manual transmission have a greater number of gears.
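The same comparison can be read off as group means rather than from the plots. The sketch below assumes a fresh copy of the data (`cars`, an illustrative name) in which only am is converted to a factor, so that gear can still be averaged.

```r
# Fresh copy; only `am` is turned into a factor so gear stays numeric.
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Mean mpg, horsepower and number of gears per transmission type
group_means <- aggregate(cbind(mpg, hp, gear) ~ am, data = cars, FUN = mean)
print(group_means)
```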
Predictors Analysis
Linear modelling
Let’s try different models and look at their p-values to check their
effect on the response (mpg):
lm(mpg ~ . - 1, data = mtcars)
##
## Call:
## lm(formula = mpg ~ . - 1, data = mtcars)
##
## Coefficients:
## cyl4 cyl6 cyl8 disp hp drat wt qsec
## 23.87913 21.23044 23.54297 0.03555 -0.07051 1.18283 -4.52978 0.36784
## vs1 amManual gear4 gear5 carb2 carb3 carb4 carb6
## 1.93085 1.21212 1.11435 2.52840 -0.97935 2.99964 1.09142 4.47757
## carb8
## 7.25041
knitr::kable(summary(lm(mpg ~ . - 1, data = mtcars))$coef, "simple",align = "lccrr")
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| cyl4 | 23.8791324 | 20.0658203 | 1.1900402 | 0.2525255 |
| cyl6 | 21.2304372 | 18.3341648 | 1.1579713 | 0.2649816 |
| cyl8 | 23.5429695 | 18.2224967 | 1.2919728 | 0.2159181 |
| disp | 0.0355463 | 0.0318992 | 1.1143329 | 0.2826734 |
| hp | -0.0705068 | 0.0394256 | -1.7883534 | 0.0939316 |
| drat | 1.1828302 | 2.4834846 | 0.4762784 | 0.6407392 |
| wt | -4.5297758 | 2.5387458 | -1.7842573 | 0.0946186 |
| qsec | 0.3678448 | 0.9353957 | 0.3932505 | 0.6996672 |
| vs1 | 1.9308505 | 2.8712578 | 0.6724755 | 0.5115079 |
| amManual | 1.2121157 | 3.2135451 | 0.3771896 | 0.7113157 |
| gear4 | 1.1143549 | 3.7995173 | 0.2932886 | 0.7733203 |
| gear5 | 2.5283960 | 3.7363580 | 0.6767007 | 0.5088975 |
| carb2 | -0.9793543 | 2.3179745 | -0.4225044 | 0.6786509 |
| carb3 | 2.9996387 | 4.2935461 | 0.6986390 | 0.4954678 |
| carb4 | 1.0914229 | 4.4496199 | 0.2452845 | 0.8095603 |
| carb6 | 4.4775692 | 6.3840624 | 0.7013668 | 0.4938127 |
| carb8 | 7.2504113 | 8.3605664 | 0.8672153 | 0.3994849 |
Looking at the p-values, none of the coefficients is significant at the
5% level (hp and wt come closest, with p ≈ 0.09), so for every variable we
fail to reject the null hypothesis that its coefficient is zero. The reason
is that many of these variables are correlated among themselves. For
instance, the following predictors are strongly correlated:
knitr::kable(summary(lm(mpg ~ cyl + carb + disp + hp - 1, data = mtcars))$coef, "simple",align = "lccrr")
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| cyl4 | 32.6101845 | 3.0359626 | 10.7412999 | 0.0000000 |
| cyl6 | 29.8304388 | 3.7282850 | 8.0011155 | 0.0000001 |
| cyl8 | 33.6276043 | 7.1377657 | 4.7112228 | 0.0001063 |
| carb2 | -0.9571345 | 1.6367112 | -0.5847913 | 0.5646380 |
| carb3 | -3.9535234 | 3.0248876 | -1.3069984 | 0.2047104 |
| carb4 | -1.6820143 | 2.3211040 | -0.7246613 | 0.4762969 |
| carb6 | -1.2291298 | 4.6375138 | -0.2650407 | 0.7934456 |
| carb8 | -0.8076424 | 6.2538614 | -0.1291430 | 0.8984179 |
| disp | -0.0333060 | 0.0147537 | -2.2574777 | 0.0342422 |
| hp | -0.0232682 | 0.0305283 | -0.7621848 | 0.4540443 |
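The collinearity suspected here can be quantified with variance inflation factors (VIF). The sketch below computes them by hand in base R, as 1 / (1 - R²) from regressing each predictor on all the others. It is an addition to the original analysis and, as a simplifying assumption, treats every predictor as numeric (using the untouched data, here called `raw`).

```r
# Assumption: all predictors are kept numeric, unlike the factor-coded
# version used in the report, so that each one has a single VIF.
raw <- datasets::mtcars
predictors <- setdiff(names(raw), "mpg")

# VIF of predictor p = 1 / (1 - R^2) of the regression p ~ all other predictors
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = raw))$r.squared
  1 / (1 - r2)
})

print(round(sort(vif_manual, decreasing = TRUE), 2))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign that a coefficient's standard error is heavily inflated by its correlation with the other predictors.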
The null hypothesis that the hp coefficient is zero cannot be rejected
given its p-value (0.454). Using the strong correlations identified earlier
(refer to the Correlation Heat Map section), we may come up
with the following model:
model1_gear <- lm(mpg ~ gear + hp + wt + drat, data = mtcars)
But given that the questions we need to answer are about
transmission, let’s replace gear with am, even
though gear is the variable more directly related to
mpg and am is in turn correlated with
gear:
model2_am <- lm(mpg ~ am + hp + wt + drat, data = mtcars)
summary(model2_am)
##
## Call:
## lm(formula = mpg ~ am + hp + wt + drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2882 -1.7531 -0.6827 1.1691 5.5211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.027077 6.185177 4.855 4.5e-05 ***
## amManual 1.578521 1.559281 1.012 0.320363
## hp -0.036373 0.009814 -3.706 0.000958 ***
## wt -2.726092 0.937791 -2.907 0.007209 **
## drat 0.981018 1.377101 0.712 0.482341
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.56 on 27 degrees of freedom
## Multiple R-squared: 0.8428, Adjusted R-squared: 0.8196
## F-statistic: 36.2 on 4 and 27 DF, p-value: 1.75e-10
We may automate this process using step as follows:
fit_all_model <-lm(mpg ~ ., data = mtcars)
fit_best_model <- step(fit_all_model, direction = "both", trace = FALSE)
summary(fit_best_model)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## amManual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
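Since `step` selects terms by AIC, it can help to look at the AIC of the full and the selected model side by side. This is a self-contained sketch that repeats the factor conversion on a copy; `cars`, `full_fit` and `step_fit` are names chosen for the illustration.

```r
# Rebuild the factor-coded data on a copy so the sketch runs on its own.
cars <- datasets::mtcars
for (col in c("cyl", "vs", "gear", "carb")) cars[[col]] <- factor(cars[[col]])
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

full_fit <- lm(mpg ~ ., data = cars)
step_fit <- step(full_fit, direction = "both", trace = FALSE)

# AIC of both models (lower is better) and the terms that survived selection
aic_table <- AIC(full_fit, step_fit)
print(aic_table)
selected_terms <- attr(terms(step_fit), "term.labels")
print(selected_terms)
```

Note that am stays in the selected model even though its coefficient is not significant: AIC-based selection keeps a term whenever dropping it would raise the AIC, which is a different criterion from a 5% p-value cut-off.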
The model1_gear and model2_am models fit slightly
worse than the stepwise model (model2_am has a residual standard error of 2.56 versus 2.41), but they are easier to understand.
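The three models can be put side by side with a small comparison table. The sketch below refits them on a copy of the data; `fits` and `comparison` are illustrative names, and the stepwise model is written out explicitly using the formula reported by `step`.

```r
# Rebuild the factor-coded data on a copy so the sketch runs on its own.
cars <- datasets::mtcars
for (col in c("cyl", "vs", "gear", "carb")) cars[[col]] <- factor(cars[[col]])
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fits <- list(
  model1_gear = lm(mpg ~ gear + hp + wt + drat, data = cars),
  model2_am   = lm(mpg ~ am + hp + wt + drat, data = cars),
  step_model  = lm(mpg ~ cyl + hp + wt + am, data = cars)
)

# Residual standard error and adjusted R-squared for each model
comparison <- data.frame(
  sigma  = sapply(fits, sigma),
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared)
)
print(round(comparison, 4))
```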
Residuals
residual_plot <- function(fit, title) plot(predict(fit), resid(fit), main = title)
par(mfrow = c(1, 3))
residual_plot(model1_gear, "Model1 w/ gear")
residual_plot(model2_am, "Model2 w/ am")
residual_plot(fit_best_model, "Best Model")
The residuals seem more or less randomly spread around zero, with no clear pattern against the fitted values. This suggests the models capture most of the systematic behavior of the response.
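The visual impression can be complemented with two numeric checks: a Shapiro-Wilk test for normality of the residuals and Cook's distance for influential cars. This is an optional sketch that goes beyond the original analysis; it refits the stepwise model on a copy of the data (`cars` and `fit` are illustrative names).

```r
# Refit the stepwise model on a copy so the sketch runs on its own.
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ cyl + hp + wt + am, data = cars)

# Shapiro-Wilk test: a small p-value would cast doubt on normal residuals
sw <- shapiro.test(resid(fit))
print(sw)

# The three cars with the largest influence on the fit
cooks <- cooks.distance(fit)
print(head(sort(cooks, decreasing = TRUE), 3))
```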
1.-Is an automatic or manual transmission better for MPG?
Plot a boxplot of MPG by transmission types
(Appendix 2).
From the box plot, it seems like manual transmission is better than automatic transmission for MPG.
Conduct a t-test to test the hypothesis.
t.test(mtcars$mpg~mtcars$am)
##
## Welch Two Sample t-test
##
## data: mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group Automatic and group Manual is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
Based on the results (p-value = 0.001374 < 0.05), we reject
the null hypothesis that there is no difference in mean MPG and
infer that manual transmission is better than automatic transmission for
MPG, under the assumption that all other conditions are equal between the two groups.
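The numbers used in this conclusion can also be pulled out of the test object instead of being read from the printed output. A minimal sketch, using a copy of the data (`cars`) and illustrative variable names:

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Welch two-sample t-test (the default in t.test: unequal variances)
tt <- t.test(mpg ~ am, data = cars)

# Difference in group means, Manual minus Automatic
mean_diff <- unname(tt$estimate[2] - tt$estimate[1])
print(mean_diff)

# p-value and 95% confidence interval (for Automatic minus Manual)
print(tt$p.value)
print(tt$conf.int)
```

The confidence interval is reported for Automatic minus Manual, which is why both bounds are negative.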
2.-Quantify the MPG difference between automatic and manual transmissions?
Stat Regression
We will use statistical regression to quantify this difference:
fit_am <- lm(mpg ~ am, data = mtcars)
knitr::kable(summary(fit_am)$coef, "simple",align = "lccrr")
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 17.147368 | 1.124602 | 15.247492 | 0.000000 |
| amManual | 7.244939 | 1.764422 | 4.106127 | 0.000285 |
Remarks
Let’s take the intercept and the slope for the unadjusted estimate:
intercept_am <- coefficients(fit_am)[1]
slope_am <- coefficients(fit_am)[2]
r_squared_am <- summary(fit_am)$r.squared
The intercept (17.1473684) represents the mean MPG when
am is zero (automatic). The slope (7.2449393) represents
the increase in the mean MPG when am is one
(manual), thus the mean MPG when am is one is
intercept + slope × 1, which is 24.3923077.
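This relationship between the coefficients and the group means can be verified directly. A short sketch, with `cars` and `fit_simple` as illustrative names:

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
fit_simple <- lm(mpg ~ am, data = cars)
b <- coef(fit_simple)

# Mean MPG for manual cars, once from the model and once from the raw data
manual_mean_from_model <- unname(b["(Intercept)"] + b["amManual"])
manual_mean_direct <- mean(cars$mpg[cars$am == "Manual"])

print(c(model = manual_mean_from_model, direct = manual_mean_direct))
```

With a single two-level factor as predictor, the regression simply reproduces the two group means, so the slope equals the difference reported by the t-test.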
The above model only accounts for am and doesn’t adjust for the
effect of the other predictors. Therefore the slope
itself, 7.2449393, does not isolate the difference in
MPG attributable to the transmission type (just
look at the R² of 0.3597989, which means the model explains only
about 36% of the variance in MPG).
The model2_am model uses
am, hp, wt and drat,
therefore adjusting am for the effect of hp,
wt and drat:
intercept_model2_am <- coefficients(model2_am)[1]
slope_model2_am <- coefficients(model2_am)[2]
r_squared_model2_am <- summary(model2_am)$r.squared
The R² of 0.8428442 shows that this model explains the sample data much better.
The intercept does not have a physical interpretation here,
since it would be the MPG for amAutomatic when the
remaining predictors are zero (zero horsepower, weight and drat
don’t make much sense in an experimental setup). But the slope
1.5785208 represents the increase in MPG when
switching from amAutomatic to amManual while keeping
the remaining predictors constant.
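To express the uncertainty around this adjusted estimate, a confidence interval for the amManual coefficient can be computed with `confint`. The sketch below refits the model on a copy of the data; `cars`, `adj_fit` and `ci` are illustrative names.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
adj_fit <- lm(mpg ~ am + hp + wt + drat, data = cars)

# Point estimate and 95% confidence interval for the manual-transmission effect
am_effect <- coef(adj_fit)[["amManual"]]
ci <- confint(adj_fit, "amManual", level = 0.95)
print(am_effect)
print(ci)
```

The interval contains zero, consistent with the p-value of 0.32 in the summary above: once horsepower, weight and rear axle ratio are held constant, the data do not rule out that the transmission type makes no difference.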
Appendix
Appendix 1
pairs(mtcars)
Appendix 2
boxplot(mpg~am, data = mtcars,
xlab = "Transmission",
ylab = "Miles per Gallon",
main = "MPG by Transmission Type")