Overview

This report will be describing vehicles listed in the 1974 Motor Trends US Magazine. The assignment of interest comes from RPubs.com. The objective is to explore the relationship between mpg and the other features listed in the dataset.

Particular Questions of Interest

  1. Is an automatic or manual transmission better for mpg.
  2. Quantify the mpg difference between automatic and manual transmissions.

Libraries

library(dplyr)
library(ggplot2)
library(factoextra)
library(FactoMineR)
library(plotly)
library(broom)

Analysis

Data

The mtcars dataset is already provided in R. The Documentation can be found here

set.seed(1993)
df <- mtcars
df %>% glimpse()
## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17...
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,...
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3....
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,...
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,...
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,...

Description of Features

df_description <- tibble(
  Feature = c("mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"),
  Description = c("Miles / US Gallon",
                  "Number of Cylinders (4,6,8)",
                  "Displacement - Measures the overall volume in the engine as a factor of cylinder circumfrance, depth and total number of cylinders. 'Proxy for the total amount of power the engine can generate.'",
                  "Horsepower",
                  "Rear Axle Ratio - The number of turns of the drive shaft for every one rotation of the wheel axle",
                  "Weight (lb/1000)",
                  "1/4 time - A performance measure, primarily of acceleration",
                  "Engine Cylinder Configuration - (VShape = 0 | StraightLine = 1)",
                  "Transmission Type - (Auto = 0 | Manual = 1)",
                  "Number of Forward Gears - Auto = (3 | 4), Manual = (4 | 5)",
                  "Number of Carburetors - Engines with higher displacement typically have higher barrel configuration")
) %>% knitr::kable()

df_description
Feature Description
mpg Miles / US Gallon
cyl Number of Cylinders (4,6,8)
disp Displacement - Measures the overall volume in the engine as a factor of cylinder circumfrance, depth and total number of cylinders. ‘Proxy for the total amount of power the engine can generate.’
hp Horsepower
drat Rear Axle Ratio - The number of turns of the drive shaft for every one rotation of the wheel axle
wt Weight (lb/1000)
qsec 1/4 time - A performance measure, primarily of acceleration
vs Engine Cylinder Configuration - (VShape = 0 | StraightLine = 1)
am Transmission Type - (Auto = 0 | Manual = 1)
gear Number of Forward Gears - Auto = (3 | 4), Manual = (4 | 5)
carb Number of Carburetors - Engines with higher displacement typically have higher barrel configuration

Refer to RPubs for a more in depth description

Preprocessing

df <- df %>% mutate(
  cyl = as.factor(cyl),
  vs = as.factor(ifelse(vs == 1, "S","V")),
  am = as.factor(ifelse(am == 1, "M","A")),
  gear = as.factor(gear),
  name = rownames(df)
)
# Split to Train and Testset 
set <- sample(1:2, replace = TRUE,
              size = nrow(df),
              prob = c(0.6,0.4))
df_train <- df[set == 1,]
df_test <- df[set == 2,]
df_train %>% skimr::skim()
Data summary
Name Piped data
Number of rows 22
Number of columns 12
_______________________
Column type frequency:
character 1
factor 4
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 7 18 0 22 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
cyl 0 1 FALSE 3 8: 11, 4: 6, 6: 5
vs 0 1 FALSE 2 V: 15, S: 7
am 0 1 FALSE 2 A: 13, M: 9
gear 0 1 FALSE 3 3: 11, 4: 6, 5: 5

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 19.27 5.21 10.40 15.27 18.95 21.38 30.40 ▃▇▇▃▂
disp 0 1 247.97 122.77 75.70 145.43 250.40 350.75 472.00 ▇▅▃▆▃
hp 0 1 158.41 72.42 52.00 106.25 150.00 198.75 335.00 ▇▇▇▅▁
drat 0 1 3.57 0.55 2.76 3.15 3.66 3.89 4.93 ▇▃▇▂▁
wt 0 1 3.23 0.95 1.51 2.66 3.44 3.57 5.34 ▂▃▇▂▂
qsec 0 1 17.27 1.60 14.50 16.52 17.04 18.22 20.22 ▃▃▇▃▃
carb 0 1 3.09 1.72 1.00 2.00 2.50 4.00 8.00 ▇▁▆▁▁

Note:

  • Data split (60/40)

  • cyl, vs, am, and gear were changed to factors

Exploratory Data Analysis

PCA

# Construct PCA Object: 
df_train_PCA <- 
  df_train %>% select(-cyl,-vs,-am,-gear) %>% 
  PCA(graph = FALSE,
      # Target (MPG)
      quanti.sup = 1,
      # Vehicle Names
      quali.sup = 8)
df_train_PCA$quanti.sup$cor[,1:2] %>% round(2); df_train_PCA$quanti.sup$cos2[,1:2] %>% round(2)
## Dim.1 Dim.2 
## -0.87  0.25
## Dim.1 Dim.2 
##  0.76  0.06
df_train_PCA %>% 
  fviz_pca_var(col.quanti.sup = "navyblue", invisible = "var") + 
  theme_minimal() +
  labs(
    title = "MPG Summary",
    x = "MPG Cor: -0.877 | Quality of Representaion: 76%",
    y = "MPG Cor:  0.254 | Quality of Representaion: 6%"
  ) +
  theme(
    plot.background = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank()
  )
Figure 1

Figure 1

round(df_train_PCA$eig,2)
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1       3.40                  56.63                             56.63
## comp 2       1.77                  29.44                             86.07
## comp 3       0.39                   6.57                             92.63
## comp 4       0.32                   5.32                             97.95
## comp 5       0.10                   1.71                             99.66
## comp 6       0.02                   0.34                            100.00
df_train_PCA %>% 
  fviz_eig(addlabels = TRUE, ylim = c(0,70)) +
  labs(
    title = "Principal Components 1 and 2 explains about 86% of the variance in the data",
    x = "Principal Components",
    y = element_blank()
  ) +
  theme_minimal() +
  theme(
    plot.background = element_blank(),
    axis.text.y = element_blank()
  )
Figure 2

Figure 2

Principal Components: Total Contribution

df_train_PCA$var$contrib[,1] %>% sort(decreasing = TRUE) %>% round(2)
##    hp  disp    wt  carb  qsec  drat 
## 25.27 23.35 19.49 12.23 10.72  8.94
df_train_PCA %>% 
  fviz_contrib(choice = "var", axes = 1) + 
  labs(
    title = "",
    x = element_blank(),
    y = element_blank()
  ) +
  theme_minimal() +
  theme(
    plot.background = element_blank(),
    axis.text.y = element_blank()
  )
Figure 3.1

Figure 3.1

Principal Component 1:

  • These variables seems to correlate with the overall power of a vehicle. I would assume a vehicle with high horsepower has a high displacement. Weight would increase as well.
df_train_PCA$var$contrib[,2] %>% sort(decreasing = TRUE) %>% round(2)
##  drat  qsec  carb    wt  disp    hp 
## 28.00 27.98 21.59 12.39  6.21  3.83
df_train_PCA %>% 
  fviz_contrib(choice = "var", axes = 2) +
  labs(
    title = "",
    x = element_blank(),
    y = element_blank()
  ) +
  theme_minimal() +
  theme(
    plot.background = element_blank(),
    axis.text.y = element_blank()
  )
Figure 3.2

Figure 3.2

Principal Component 2:

  • drat “Rear axle ratio”: as this ratio goes up, towing capacity increases which in turn reduces the fuel economy of the vehicle

  • qsec “Measure of Acceleration”: edmunds.com.

  • carb “Carburetor”: used to mix air with gasoline by using an optimal amount of fire to burn the gas. A good Carburator will be able to move fuel thoughout the vehicle efficiently thus increasing power to improve speed CARBURETORS | How They Work

All in all based on these variable descriptions I would say that this PC is more associated with how a efficient a vehicle uses its fuel in terms of performance

PCA-Categories

df_train_PCA %>% fviz_pca_biplot(
  geom = "point", pointshape = 21, fill.ind = "gray", col.ind = df_train$cyl,
  col.var = "black", alpha.var = "cos2", col.quanti.sup = "navyblue",
  addEllipses = TRUE, ellipse.type = "confidence",
  palette = c("lightsteelblue4","lightskyblue","lightseagreen"),
  legend.title = "", mean.point = FALSE
) +
  labs(
    title = "Cylinder",
    x = element_blank(),
    y = element_blank(),
    alpha = "Quality of Representation"
  ) +
  theme(
    plot.background = element_blank(),
    legend.position = "top"
  )
Figure 4.1

Figure 4.1

Note:

- While the cylinder of a vehicle increases so does the Horsepower, Displacement and Weight, which in turn leads to a lower mpg

df_train_PCA %>% fviz_pca_biplot(
  geom = "point", pointshape = 21, fill.ind = "gray", col.ind = df_train$vs,
  col.var = "black", alpha.var = "cos2", col.quanti.sup = "navyblue",
  addEllipses = TRUE, ellipse.type = "confidence",
  palette = c("lightsteelblue4", "lightseagreen"),
  legend.title = "", mean.point = FALSE
) +
  labs(
    title = "Engine Configuration",
    x = element_blank(),
    y = element_blank(),
    alpha = "Quality of Representation"
  ) +
  theme(
    plot.background = element_blank(),
    legend.position = "top"
  )
Figure 4.2

Figure 4.2

Note:

- Straightline engines seem to have a lower Horsepower, Displacement, and Weight compared to V-Shaped engines, which leads to a higher mpg

df_train_PCA %>% fviz_pca_biplot(
  geom = "point", pointshape = 21, fill.ind = "gray", col.ind = df_train$gear,
  col.var = "black", alpha.var = "cos2", col.quanti.sup = "navyblue",
  addEllipses = TRUE, ellipse.type = "confidence",
  palette = c("lightsteelblue4","lightskyblue","lightseagreen"),
  legend.title = "", mean.point = FALSE
) +
  labs(
    title = "Gear",
    x = element_blank(),
    y = element_blank(),
    alpha = "Quality of Representation"
  ) +
  theme(
    plot.background = element_blank(),
    legend.position = "top"
  )
Figure 4.3

Figure 4.3

Note:

- A gear 4 vehicle is associated with a better mpg on avg than a gear 3 and 5 vehicles

- A gear 5 vehicles with higher drat “Rear Axle Ratio” leads to a higher mpg.

df_train_PCA %>% fviz_pca_biplot(
  geom = "point", pointshape = 21, fill.ind = "gray", col.ind = df_train$am,
  col.var = "black", alpha.var = "cos2", col.quanti.sup = "navyblue",
  addEllipses = TRUE, ellipse.type = "confidence",
  palette = c("tomato", "seagreen"),
  legend.title = "", mean.point = FALSE
) +
  labs(
    title = "Transmission",
    subtitle = "I used a different color scheme to highlight that this is the particular catergory of interest",
    x = element_blank(),
    y = element_blank(),
    alpha = "Quality of Representation"
  ) +
  theme(
    plot.background = element_blank(),
    legend.position = "top"
  )
Figure: 4.4

Figure: 4.4

Note:

- Manual Transmission is a associated with a better mpg than Automatic.

PCA - Biplot

df_train_PCA %>% 
  fviz_pca_biplot(invisible = "ind", col.var = "black", alpha.var = "cos2",
                  col.quanti.sup = "navyblue") %>% 
  fviz_add(df_train_PCA$quali.sup$coord, repel = TRUE,
           geom = "text", color = "lightsteelblue4") +
  labs(
    title = "Vehicle Names",
    x = element_blank(),
    y = element_blank(),
    alpha = "Quality of Represention"
  ) +
  theme(
    plot.background = element_blank(),
    legend.position = "top"
  )
Figure: 5

Figure: 5

Note:

- Honda Civic “what I drive currently :)” is a lower weighted vehicle with good mpg compared to the others in the dataset

- Cadillac Fleetwood is a higher weighted vehicle which is associated with a lower mpg compared to the others in the data

- Maserati Bora is a high carb vehicle associated with a worse mpg compared to the others in the data

- Merc 230 has a good “Measure of Acceleration” compared to the other vehicles in the data

Plots

df_train %>% select(mpg,disp,hp,wt,qsec,drat,carb) %>% 
  cor() %>% round(2)
##        mpg  disp    hp    wt  qsec  drat  carb
## mpg   1.00 -0.84 -0.74 -0.86  0.33  0.65 -0.42
## disp -0.84  1.00  0.74  0.88 -0.35 -0.63  0.27
## hp   -0.74  0.74  1.00  0.59 -0.73 -0.33  0.71
## wt   -0.86  0.88  0.59  1.00 -0.08 -0.67  0.31
## qsec  0.33 -0.35 -0.73 -0.08  1.00 -0.10 -0.70
## drat  0.65 -0.63 -0.33 -0.67 -0.10  1.00  0.04
## carb -0.42  0.27  0.71  0.31 -0.70  0.04  1.00
df_train %>% 
  plot_ly(color = ~am, 
          hoverinfo = "text",
          text = ~paste("Name:", name, "<br>", am),
          colors = c("tomato", "seagreen")) %>% 
  add_trace(
    type = "splom",
    dimensions = list(
      list(label = "mpg", values = ~mpg),
      list(label = "wt", values = ~wt),
      list(label = "drat", values = ~drat)
    )
  ) %>%
  style(showlowerhalf = FALSE, showlegend = FALSE) %>% 
  layout(title = "Correlation Plot")

Figure: 6.1

Findings: mpg by am (transmission)

  • controlling for wt, it seems as though the mpg of a vehicle varies evenly no matter the type of transmission. However their seem to be a few outliers for Automatic vehicles at the high end of wt.

  • controlling for drat the data also varies. It looks like their might be one potential outlier at the high end of drat.

  • using the lasso select tool if you highlight the potential outliers at wt(5,5), you can see that those vehicles is associated with a low drat “Rear Axle Ratio” and mpg

Note:

- If you look at Figure 3.1: PC1 was best represented by disp, hp, and wt. I choose wt because it is, even though not by very much, the most correlated with mpg

- If you look at Figure 3.2: PC2 was best represented by qsec, drat, and carb. I choose drat for similar reasons

df_train %>% 
  plot_ly(x = ~wt, y = ~drat, z = ~mpg,
          hoverinfo = "text",
          text = ~paste("Mpg:", mpg, "<br>",
                        "Weight:", wt, "<br>",
                        "Drat:", drat, "<br>",
                        "Transmission:", am, "<br>",
                        "Name:", name),
          colors = c("tomato", "seagreen")
          ) %>% 
  add_markers(color = ~am, showlegend = FALSE) %>% 
  layout(xaxis = list(title = "Weight (lb/1000)"),
         yaxis = list(title = "Miles per US Gallon"),
         title = "Automatic vs Manual",
         plot_bgcolor = toRGB("gray90"))

Figure: 6.2

Findings: mpg ~ am + drat + wt (transmission)

  • controling for drat and wt, you can see more of a relationship between Transmission Types.

  • Manual: tend to have lower weight but high drat which lends to a better mpg compared to auto

  • Automatic: tend to have high wt and lower drat which lead to a worse mpg compared to manual

This graph is also interactive to feel free to play around with it

Modeling

To Quantify the difference I will be using a regression model

Model Fit

mod <- lm(mpg ~ am + drat + wt, data = df_train)
mod %>% tidy()
## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    27.2      7.56      3.60  0.00206 
## 2 amM            -1.59     1.78     -0.890 0.385   
## 3 drat            2.00     1.68      1.20  0.247   
## 4 wt             -4.47     0.917    -4.87  0.000123

Note:

- The intercept represents Automatic

- Controlling for other the variables in the model, only wt seems to be significant

mod_2 <- lm(mpg ~ am + drat + am:wt, data = df_test)
mod_2 %>% tidy()
## # A tibble: 5 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   32.5       17.8      1.83   0.127 
## 2 amM           16.1       11.4      1.42   0.216 
## 3 drat           0.818      3.70     0.221  0.834 
## 4 amA:wt        -4.65       1.97    -2.37   0.0643
## 5 amM:wt       -10.6        4.23    -2.51   0.0538

Note:

- all terms are insignificant at the 0.05 level

mod_3 <- lm(mpg ~ am + am:wt, data = df_train)
mod_3 %>% tidy()
## # A tibble: 4 x 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)    29.3      3.24       9.04 0.0000000411
## 2 amM            13.4      4.46       2.99 0.00785     
## 3 amA:wt         -3.28     0.852     -3.85 0.00116     
## 4 amM:wt         -8.05     1.18      -6.80 0.00000229

Note:

- controlling for all the the variables in the model, when the transmission is Manual a 1 unit increase in weight leads to a decrease in mpg at a higher rate compared to Automatic.

- It looks like this is a good model at the 0.05 and 0.01 significance level, but I’ll stick to the 0.05 level

mod_3 %>% glance() %>% select(r.squared, rmse = sigma)
## # A tibble: 1 x 2
##   r.squared  rmse
##       <dbl> <dbl>
## 1     0.834  2.29

Note:

- At the 0.05 significance level this model explains 0.83 of the variance.

- Training RMSE: 2.29

Diagnostics

mod_3 %>% augment() %>% cbind(name = df_train$name) %>% 
  ggplot(aes(.fitted, .resid, color = am)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "black") +
  geom_text(aes(label = name)) +
  scale_color_manual(values = c("tomato", "seagreen")) +
  labs(
    title = "Apart from a potential outlier, Variance seems to be Homoscedastic",
    x = "Fitted Values",
    y = "Residuals",
    color = "Transmission"
  ) + 
  theme_minimal() +
  theme(legend.position = "bottom")
Figure: 7.1

Figure: 7.1

mod_3 %>% augment() %>% cbind(name = df_train$name) %>% 
  ggplot(aes(.fitted, .std.resid, color = am)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "black") +
  geom_text(aes(label = name)) +
  scale_color_manual(values = c("tomato", "seagreen")) +
  xlim(c(10,32)) +
  labs(
    title = "No Outliers - Merc 240D is still with 3 standard deviations",
    x = "Fitted Values",
    y = "Standardized Residuals",
    color = "Transmission"
  ) + 
  theme_minimal() +
  theme(legend.position = "bottom")
Figure: 7.2

Figure: 7.2

mod_3 %>% augment() %>% cbind(name = df_train$name) %>% 
  ggplot(aes(.fitted, mpg, color = am)) +
  geom_point() +
  geom_text(aes(label = name)) +
  geom_abline(intercept = 0, slope = 1, color = "black") +
  scale_color_manual(values = c("tomato", "seagreen")) +
  xlab(c(5,40)) +
  labs(
    title = "Normality Check",
    x = "Fitted Values",
    y = "Actual Values",
    color = "Transmission"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Figure 7.3

Figure 7.3

Predictions

mod_3_pred <- mod_3 %>% predict(df_test) %>% round(2)
df_test %>% 
  select(name, am, wt, mpg) %>% 
  cbind(mpg_pred = mod_3_pred) %>% 
  knitr::kable()
name am wt mpg mpg_pred
4 Hornet 4 Drive A 3.215 21.4 18.76
9 Merc 230 A 3.150 22.8 18.98
11 Merc 280C A 3.440 17.8 18.02
12 Merc 450SE A 4.070 16.4 15.96
14 Merc 450SLC A 3.780 15.2 16.91
16 Lincoln Continental A 5.424 10.4 11.51
18 Fiat 128 M 2.200 32.4 24.95
20 Toyota Corolla M 1.835 33.9 27.89
26 Fiat X1-9 M 1.935 27.3 27.09
32 Volvo 142E M 2.780 21.4 20.28
Metrics::rmse(actual = df_test$mpg,
              predicted = mod_3_pred) %>% 
  round(2)
## [1] 3.45

Note:

- Test RMSE: 3.45

Results

1. Is an automatic or manual transmission better for mpg.

Based on the sample taken from the 1974 Motor Trends Analysis.

  • If your vehicle weighs below 2,500lb (2.5 * 1000). I would say go for the manual transmission because it has a better mpg.

  • If the weight of a vehicle is above 2.500lbs there not much of a difference between manual and automatic

2. Quantify the mpg difference between automatic and manual transmissions.

mod_3
## 
## Call:
## lm(formula = mpg ~ am + am:wt, data = df_train)
## 
## Coefficients:
## (Intercept)          amM       amA:wt       amM:wt  
##      29.314       13.352       -3.282       -8.051
  • Rsquared: 0.83 | Rmse: 3.45

  • According to the model Automatic Transmission vehicles have an avg mpg of 29.31. As weight increases by 1 unit (1000/lbs), for an Automatic vehicle the mpg will drop by -3.28

  • If the vehicle has a Manual Transmission the mpg increases by 13.35. As weight increases by 1 unit (1000/lbs), for a Manual vehicle the mpg will drop by -8.05

Note: The sample size of the mtcars dataset was very small so I would be cautious to generalize these findings to the entire population