Some insights about the project

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

1.-Is an automatic or manual transmission better for MPG?

2.-Quantify the MPG difference between automatic and manual transmissions?

Regression analysis shows that, in addition to transmission type, the number of cylinders, horsepower, and weight are important factors affecting MPG.


Executive Summary

This report analyzed the relationship between transmission type (manual or automatic) and miles per gallon (MPG). The report set out to determine which transmission type produces a higher MPG. The mtcars dataset was used for this analysis.

A t-test between automatic and manual transmission vehicles shows that manual transmission vehicles average 7.245 MPG more than automatic transmission vehicles. After fitting multiple linear regressions, however, the improvement attributable to manual transmission shrinks to only 1.81 MPG. Other variables, such as weight, horsepower, and number of cylinders, contribute more significantly to the overall MPG of vehicles.


Data processing

First, load the dataset and perform some basic exploratory data analysis.

suppressMessages(library(xtable))                            # Pretty printing dataframes
suppressMessages(library(ggplot2))                           # Plotting
suppressMessages(library(gridExtra, warn.conflicts = FALSE))
suppressMessages(library(reshape2))                          # Transforming Data Frames
data(mtcars)

Exploratory Data Analysis

Variables

knitr::kable(head(mtcars[, 1:4]), "simple",align = "lccrr")
                      mpg   cyl   disp    hp
------------------  -----  ----  -----  ----
Mazda RX4            21.0     6    160   110
Mazda RX4 Wag        21.0     6    160   110
Datsun 710           22.8     4    108    93
Hornet 4 Drive       21.4     6    258   110
Hornet Sportabout    18.7     8    360   175
Valiant              18.1     6    225   105

Data

Taking a look at the data:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

They are all numeric, but some should be categorical. I will map cyl, vs, gear, carb and am to factors for easier reading:

Transform data:

mtcars$cyl <- factor(mtcars$cyl)
mtcars$vs <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am <- factor(mtcars$am,labels=c('Automatic','Manual'))
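
As a quick sanity check of the conversion, the sketch below repeats it on a separate copy of the data and inspects the result. The name `cars` is hypothetical and is used only so that the `mtcars` object of this report is left untouched.

```r
# Work on a copy (hypothetical name `cars`) of the original dataset
cars <- datasets::mtcars

# Same conversion as in the report, applied to the copy
factor_cols <- c("cyl", "vs", "gear", "carb")
cars[factor_cols] <- lapply(cars[factor_cols], factor)
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Class of every column after the conversion
col_classes <- sapply(cars, class)
print(col_classes)

# Number of cars per transmission type
am_counts <- table(cars$am)
print(am_counts)
```

The labels rely on the factor levels being ordered 0, 1, so that 0 maps to Automatic and 1 to Manual.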

The pairwise scatter plot between all variables is also shown (Appendix 1).

Data Visualization

Scatter plot matrix

We use the following command:

pairs(mtcars, panel = panel.smooth, main = "Motor Trend", col = "light blue") 

However, we find correlation heat maps easier to read, as follows:

Correlation Heat Map

correlation_matrix <- function(data) {
  numeric_data <- data[, sapply(data,is.numeric)]
  matrix <- round(cor(numeric_data), 2)
  matrix[upper.tri(matrix)] <- NA
  matrix <- melt(matrix, na.rm = TRUE)
  return(matrix)
}

correlation_heat_map <- function(data) {
  matrix <- correlation_matrix(data)
  ggplot(data = matrix, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), name="Pearson Correlation") +
  theme(axis.text.x = element_text(angle = 90)) +
  coord_fixed()
}

# Use the original all-numeric data so that cyl, vs, am, gear and carb are included
correlation_heat_map(datasets::mtcars)

The potential predictors for MPG seem correlated among themselves to some degree, with a few exceptions:

*gear seems weakly correlated to hp.

*drat seems weakly correlated to qsec or carb.

*wt seems weakly correlated to qsec.
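
The weak pairs listed above can be read off the Pearson correlation matrix directly. This is a minimal sketch on the original all-numeric dataset; the name `weak_pairs` is introduced here for illustration only.

```r
# Pearson correlations on the original, all-numeric dataset
corr <- cor(datasets::mtcars)

# Pick out the pairs described as weakly correlated
weak_pairs <- c(
  gear_hp   = corr["gear", "hp"],
  drat_qsec = corr["drat", "qsec"],
  drat_carb = corr["drat", "carb"],
  wt_qsec   = corr["wt", "qsec"]
)
print(round(weak_pairs, 2))
```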

We can easily explain the pairs that correlate:

*Greater wt (weight) implies more fuel consumption.

*Greater cyl (number of cylinders), disp (displacement volume) or carb (number of carburetors) implies a more powerful engine and therefore more hp (horsepower). Greater hp generally implies less fuel efficiency: a design favors efficiency, favors power, or settles for a compromise between the two.

*Greater gear (number of gears) implies a greater number of degrees of freedom to choose the appropriate gear for a given speed, thus resulting in better fuel efficiency.

*drat (rear axle ratio) is a little trickier: the greater the ratio, the greater the engine’s RPM (rotations per minute) required to keep the same speed, thus more fuel consumption.
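
The direction of each relationship with MPG can be compared against these explanations by listing the correlation of every variable with mpg. This is a sketch on the original numeric data; `corr_with_mpg` is a name introduced here. In this sample drat turns out to be positively correlated with mpg, so the mechanical argument for drat is not what the raw correlation shows, possibly because drat is confounded with other variables.

```r
# Correlation of each variable with mpg, on the original numeric data
corr_with_mpg <- cor(datasets::mtcars)["mpg", ]

# Drop the trivial correlation of mpg with itself
corr_with_mpg <- corr_with_mpg[names(corr_with_mpg) != "mpg"]

# Sort from most negative to most positive
print(round(sort(corr_with_mpg), 2))
```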

Box Plots

Here are some box plots for the data:

box_plot <- function(data, y_column, x_column, x_title, y_title) {
  ggplot(data, aes_string(y = y_column, x = x_column)) + 
  geom_boxplot(aes_string(fill = x_column)) + 
  geom_point(position = position_jitter(width = 0.2), color = "blue", alpha = 0.2) +
  xlab(x_title) +
  ylab(y_title)
}

mpg_box <- box_plot(mtcars, "mpg", "am", "Transmission", "Miles per U.S. Gallon")
hp_box <- box_plot(mtcars, "hp", "am", "Transmission", "Horse Power")
gear_box <- box_plot(mtcars, "gear", "am", "Transmission", "Number of Gears")
grid.arrange(mpg_box, hp_box, gear_box, ncol = 3)

From the box plots, vehicles with manual transmission do indeed seem to have better fuel efficiency, for the following reasons:

*The vehicles with automatic transmission in the data-set seem to have greater horsepower, which correlates to less fuel efficiency.

*The vehicles with manual transmission have a greater number of gears.
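
Both observations can be checked numerically. The sketch below works on a copy of the data (hypothetical name `cars`) and computes the mean horsepower and the gear counts per transmission type.

```r
# Copy of the original data with a labelled transmission factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Mean horsepower per transmission type
mean_hp <- tapply(cars$hp, cars$am, mean)
print(round(mean_hp, 1))

# Number of cars for each combination of transmission and gear count
gear_counts <- table(cars$am, cars$gear)
print(gear_counts)
```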

Predictors Analysis

Linear modelling

Let’s try different models and look at their p-values to check their effect on the response (mpg):

lm(mpg ~ . - 1, data = mtcars)
## 
## Call:
## lm(formula = mpg ~ . - 1, data = mtcars)
## 
## Coefficients:
##     cyl4      cyl6      cyl8      disp        hp      drat        wt      qsec  
## 23.87913  21.23044  23.54297   0.03555  -0.07051   1.18283  -4.52978   0.36784  
##      vs1  amManual     gear4     gear5     carb2     carb3     carb4     carb6  
##  1.93085   1.21212   1.11435   2.52840  -0.97935   2.99964   1.09142   4.47757  
##    carb8  
##  7.25041
knitr::kable(summary(lm(mpg ~ . - 1, data = mtcars))$coef, "simple",align = "lccrr")
             Estimate  Std. Error     t value    Pr(>|t|)
---------  ----------  ----------  ----------  ----------
cyl4       23.8791324  20.0658203   1.1900402   0.2525255
cyl6       21.2304372  18.3341648   1.1579713   0.2649816
cyl8       23.5429695  18.2224967   1.2919728   0.2159181
disp        0.0355463   0.0318992   1.1143329   0.2826734
hp         -0.0705068   0.0394256  -1.7883534   0.0939316
drat        1.1828302   2.4834846   0.4762784   0.6407392
wt         -4.5297758   2.5387458  -1.7842573   0.0946186
qsec        0.3678448   0.9353957   0.3932505   0.6996672
vs1         1.9308505   2.8712578   0.6724755   0.5115079
amManual    1.2121157   3.2135451   0.3771896   0.7113157
gear4       1.1143549   3.7995173   0.2932886   0.7733203
gear5       2.5283960   3.7363580   0.6767007   0.5088975
carb2      -0.9793543   2.3179745  -0.4225044   0.6786509
carb3       2.9996387   4.2935461   0.6986390   0.4954678
carb4       1.0914229   4.4496199   0.2452845   0.8095603
carb6       4.4775692   6.3840624   0.7013668   0.4938127
carb8       7.2504113   8.3605664   0.8672153   0.3994849

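Looking at the p-values, none of the coefficients is significant at the 5% level: for every variable we fail to reject the null hypothesis that its coefficient is zero (hp and wt come closest, with p-values of about 0.09). The reason is that many of these variables are correlated among themselves. For instance, the following predictors are strongly correlated: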

knitr::kable(summary(lm(mpg ~ cyl + carb + disp + hp - 1, data = mtcars))$coef, "simple",align = "lccrr")
             Estimate  Std. Error     t value    Pr(>|t|)
---------  ----------  ----------  ----------  ----------
cyl4       32.6101845   3.0359626  10.7412999   0.0000000
cyl6       29.8304388   3.7282850   8.0011155   0.0000001
cyl8       33.6276043   7.1377657   4.7112228   0.0001063
carb2      -0.9571345   1.6367112  -0.5847913   0.5646380
carb3      -3.9535234   3.0248876  -1.3069984   0.2047104
carb4      -1.6820143   2.3211040  -0.7246613   0.4762969
carb6      -1.2291298   4.6375138  -0.2650407   0.7934456
carb8      -0.8076424   6.2538614  -0.1291430   0.8984179
disp       -0.0333060   0.0147537  -2.2574777   0.0342422
hp         -0.0232682   0.0305283  -0.7621848   0.4540443
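
One way to quantify how strongly the predictors overlap is the variance inflation factor (VIF), computed here by hand as 1 / (1 - R²) from regressing each predictor on all the others. This is a sketch under one assumption that differs from the models in this report: every variable is kept numeric, as in the original dataset, rather than converted to a factor. The names `cars`, `predictors` and `vif` are introduced for illustration.

```r
# Original all-numeric data (assumption: no factor conversion here)
cars <- datasets::mtcars
predictors <- setdiff(names(cars), "mpg")

# VIF of a predictor: 1 / (1 - R^2) of that predictor regressed on the others
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = cars))$r.squared
  1 / (1 - r2)
})

print(round(sort(vif, decreasing = TRUE), 1))
```

Values well above 10 are commonly read as a sign of strong multicollinearity, which would be consistent with the large standard errors in the full model.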

Given its p-value (0.454), we fail to reject the null hypothesis that the hp coefficient is zero in this model. Using the strong correlations among the predictors (refer to the Correlation Heat Map section), we may come up with the following model:

model1_gear <- lm(mpg ~ gear + hp + wt + drat, data = mtcars)

But given that the questions we need to answer concern the transmission, let’s replace gear with am, even though we know that gear is itself correlated with mpg and am is correlated with gear:

model2_am <- lm(mpg ~ am + hp + wt + drat, data = mtcars)

summary(model2_am)
## 
## Call:
## lm(formula = mpg ~ am + hp + wt + drat, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2882 -1.7531 -0.6827  1.1691  5.5211 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.027077   6.185177   4.855  4.5e-05 ***
## amManual     1.578521   1.559281   1.012 0.320363    
## hp          -0.036373   0.009814  -3.706 0.000958 ***
## wt          -2.726092   0.937791  -2.907 0.007209 ** 
## drat         0.981018   1.377101   0.712 0.482341    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.56 on 27 degrees of freedom
## Multiple R-squared:  0.8428, Adjusted R-squared:  0.8196 
## F-statistic:  36.2 on 4 and 27 DF,  p-value: 1.75e-10

We may automate this process using step as follows:

fit_all_model <- lm(mpg ~ ., data = mtcars)
fit_best_model <- step(fit_all_model, direction = "both", trace = FALSE)
summary(fit_best_model)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## amManual     1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

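The model1_gear and model2_am models fit slightly worse than the stepwise model (for model2_am, an R-squared of 0.8428 against 0.8659, and a residual standard error of 2.56 against 2.41), but they are easier to understand.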
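
The three models can be placed side by side with a small comparison table. This sketch refits them on a copy of the data (hypothetical name `cars`), taking the formula of the stepwise model from the summary output above; the AIC column is an addition for illustration, since it is the criterion that step uses by default.

```r
# Copy of the data with the factor conversions the three models rely on
cars <- datasets::mtcars
cars$cyl  <- factor(cars$cyl)
cars$gear <- factor(cars$gear)
cars$am   <- factor(cars$am, labels = c("Automatic", "Manual"))

# Refit the three models discussed in the text
models <- list(
  model1_gear    = lm(mpg ~ gear + hp + wt + drat, data = cars),
  model2_am      = lm(mpg ~ am + hp + wt + drat, data = cars),
  fit_best_model = lm(mpg ~ cyl + hp + wt + am, data = cars)
)

# One row per model: adjusted R-squared, residual standard error, AIC
comparison <- data.frame(
  adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
  residual_se   = sapply(models, function(m) summary(m)$sigma),
  aic           = sapply(models, AIC)
)
print(round(comparison, 3))
```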

Residuals

residual_plot <- function(fit, title) plot(predict(fit), resid(fit), main = title)

par(mfrow = c(1, 3))
residual_plot(model1_gear, "Model1 w/ gear")
residual_plot(model2_am, "Model2 w/ am")
residual_plot(fit_best_model, "Best Model")

The residuals appear more or less randomly scattered around zero, with no clear pattern against the fitted values. This suggests that the models capture most of the systematic behavior of the response.
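
A caveat when reading such plots: for a least-squares fit with an intercept, the residuals are linearly uncorrelated with the fitted values by construction, so the thing to look for is curvature or changing spread rather than a linear trend. The sketch below illustrates this for the am model, refitted on a copy of the data (the names `cars`, `fit` and `linear_cor` are hypothetical), and adds a lowess smoother to make any curvature visible.

```r
# Refit the am model on a copy of the data
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ am + hp + wt + drat, data = cars)

# Linear correlation between fitted values and residuals: zero up to rounding
linear_cor <- cor(fitted(fit), resid(fit))
print(linear_cor)

# Residual plot with a zero line and a lowess smoother to expose curvature
plot(fitted(fit), resid(fit), xlab = "Fitted MPG", ylab = "Residual")
abline(h = 0, lty = 2)
lines(lowess(fitted(fit), resid(fit)), col = "red")
```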

1.-Is an automatic or manual transmission better for MPG?

Plot a boxplot of MPG by transmission types (Appendix 2).

From the box plot, it seems like manual transmission is better than automatic transmission for MPG.

Conduct a t-test to test the hypothesis.

t.test(mtcars$mpg~mtcars$am)
## 
##  Welch Two Sample t-test
## 
## data:  mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means between group Automatic and group Manual is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual 
##                17.14737                24.39231

Since the p-value of 0.001374 is below 0.05, we reject the null hypothesis that there is no difference in mean MPG and infer that manual transmission is better than automatic transmission for MPG, under the assumption that all other characteristics of the cars are the same.
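
To make the test less of a black box, the Welch statistic can be reproduced by hand from the group means and variances. This is a sketch on the original numeric coding (am = 0 for automatic, 1 for manual); all variable names are introduced here for illustration.

```r
# MPG values per transmission type (0 = automatic, 1 = manual)
cars <- datasets::mtcars
auto   <- cars$mpg[cars$am == 0]
manual <- cars$mpg[cars$am == 1]

# Squared standard error of each group mean
se2_auto   <- var(auto) / length(auto)
se2_manual <- var(manual) / length(manual)

# Welch t statistic (automatic minus manual, as in the t.test output)
t_stat <- (mean(auto) - mean(manual)) / sqrt(se2_auto + se2_manual)

# Welch-Satterthwaite approximation of the degrees of freedom
welch_df <- (se2_auto + se2_manual)^2 /
  (se2_auto^2 / (length(auto) - 1) + se2_manual^2 / (length(manual) - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), welch_df)

print(c(t = t_stat, df = welch_df, p = p_value))
```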

2.-Quantify the MPG difference between automatic and manual transmissions?

Stat Regression

We will use statistical regression to quantify this difference:

fit_am <- lm(mpg ~ am, data = mtcars)

knitr::kable(summary(lm(mpg ~ am, data = mtcars))$coef, "simple",align = "lccrr")
              Estimate  Std. Error    t value  Pr(>|t|)
-----------  ---------  ----------  ---------  --------
(Intercept)  17.147368    1.124602  15.247492  0.000000
amManual      7.244939    1.764422   4.106127  0.000285

Remarks

Let’s take the intercept and the slope for the unadjusted estimate:

intercept_am <- coefficients(fit_am)[1]
slope_am <- coefficients(fit_am)[2]
r_squared_am <- summary(fit_am)$r.squared

The intercept (17.1473684) represents the mean MPG when am is zero (automatic). The slope (7.2449393) represents the increase in the mean MPG when am is one (manual); thus the mean MPG when am is one is intercept + slope × 1, which is 24.3923077.

The above model only accounts for am and doesn’t adjust for the effect of the other predictors. Therefore the slope itself, 7.2449393, does not quantify the difference attributable to the transmission alone (note the R² of 0.3597989, which means the model explains only about 36% of the variance in MPG).
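
With a single two-level factor as predictor, the regression slope is simply the difference between the two group means, which the following sketch checks on a copy of the data (the names `cars`, `fit`, `mean_diff` and `slope` are hypothetical).

```r
# Copy of the data with a labelled transmission factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Mean MPG per transmission type
group_means <- tapply(cars$mpg, cars$am, mean)

# Unadjusted model: mpg explained by transmission only
fit <- lm(mpg ~ am, data = cars)

mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])
slope     <- unname(coef(fit)["amManual"])
r_squared <- summary(fit)$r.squared

print(c(mean_diff = mean_diff, slope = slope, r_squared = r_squared))
```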

The model2_am model uses am, hp, wt and drat, therefore adjusting am for the effect of hp, wt and drat:

intercept_model2_am <- coefficients(model2_am)[1]
slope_model2_am <- coefficients(model2_am)[2]
r_squared_model2_am <- summary(model2_am)$r.squared

The R² of 0.8428442 shows that this model explains the sample data much better.

The intercept does not have a physical interpretation here, since it would be the MPG for amAutomatic when the remaining predictors are zero (zero horsepower, weight and drat don’t make much sense in an experimental setup). But the slope, 1.5785208, represents the estimated increase in MPG when switching from amAutomatic to amManual while keeping the remaining predictors constant. Note that this estimate is not statistically significant in this model (p-value of 0.32).
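
The uncertainty around the adjusted estimate can be expressed as a confidence interval for the amManual coefficient. This is a sketch that refits the am model on a copy of the data (hypothetical names `cars`, `fit` and `ci_am`) and uses confint from base R.

```r
# Refit the adjusted model on a copy of the data
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ am + hp + wt + drat, data = cars)

# 95% confidence interval for the adjusted manual-vs-automatic difference
ci_am <- confint(fit, "amManual", level = 0.95)
print(ci_am)
```

An interval that includes zero means the data are compatible with no MPG difference between the transmissions once hp, wt and drat are held constant.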

Appendix

Appendix 1

pairs(mtcars)

Appendix 2

boxplot(mpg~am, data = mtcars,
        xlab = "Transmission",
        ylab = "Miles per Gallon",
        main = "MPG by Transmission Type")