Overview

This report will be describing vehicles listed in the 1974 Motor Trends US Magazine. The assignment of interest comes from RPubs.com. The objective is to explore the relationship between mpg and the other features listed in the dataset.

Particular Questions of Interest

Is an automatic or manual transmission better for mpg.
Quantify the mpg difference between automatic and manual transmissions.

Libraries

library(dplyr)
library(ggplot2)
library(factoextra)
library(FactoMineR)
library(plotly)
library(broom)

Analysis

Data

The mtcars dataset is already provided in R. The Documentation can be found here

set.seed(1993)
df <- mtcars
df %>% glimpse()

## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17...
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,...
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3....
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,...
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,...
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,...

Description of Features

df_description <- tibble(
  Feature = c("mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"),
  Description = c("Miles / US Gallon",
                  "Number of Cylinders (4,6,8)",
                  "Displacement - Measures the overall volume in the engine as a factor of cylinder circumfrance, depth and total number of cylinders. 'Proxy for the total amount of power the engine can generate.'",
                  "Horsepower",
                  "Rear Axle Ratio - The number of turns of the drive shaft for every one rotation of the wheel axle",
                  "Weight (lb/1000)",
                  "1/4 time - A performance measure, primarily of acceleration",
                  "Engine Cylinder Configuration - (VShape = 0 | StraightLine = 1)",
                  "Transmission Type - (Auto = 0 | Manual = 1)",
                  "Number of Forward Gears - Auto = (3 | 4), Manual = (4 | 5)",
                  "Number of Carburetors - Engines with higher displacement typically have higher barrel configuration")
) %>% knitr::kable()

df_description

Feature	Description
mpg	Miles / US Gallon
cyl	Number of Cylinders (4,6,8)
disp	Displacement - Measures the overall volume in the engine as a factor of cylinder circumfrance, depth and total number of cylinders. ‘Proxy for the total amount of power the engine can generate.’
hp	Horsepower
drat	Rear Axle Ratio - The number of turns of the drive shaft for every one rotation of the wheel axle
wt	Weight (lb/1000)
qsec	1/4 time - A performance measure, primarily of acceleration
vs	Engine Cylinder Configuration - (VShape = 0 \| StraightLine = 1)
am	Transmission Type - (Auto = 0 \| Manual = 1)
gear	Number of Forward Gears - Auto = (3 \| 4), Manual = (4 \| 5)
carb	Number of Carburetors - Engines with higher displacement typically have higher barrel configuration

Refer to RPubs for a more in depth description

Preprocessing

df <- df %>% mutate(
  cyl = as.factor(cyl),
  vs = as.factor(ifelse(vs == 1, "S","V")),
  am = as.factor(ifelse(am == 1, "M","A")),
  gear = as.factor(gear),
  name = rownames(df)
)
# Split to Train and Testset 
set <- sample(1:2, replace = TRUE,
              size = nrow(df),
              prob = c(0.6,0.4))
df_train <- df[set == 1,]
df_test <- df[set == 2,]
df_train %>% skimr::skim()

Data summary
Name	Piped data
Number of rows	22
Number of columns	12
_______________________
Column type frequency:
character	1
factor	4
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
name	0	1	7	18	0	22	0

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
cyl	1	FALSE	3	8: 11, 4: 6, 6: 5
vs	1	FALSE	2	V: 15, S: 7
am	1	FALSE	2	A: 13, M: 9
gear	1	FALSE	3	3: 11, 4: 6, 5: 5

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
mpg	1	19.27	5.21	10.40	15.27	18.95	21.38	30.40	▃▇▇▃▂
disp	1	247.97	122.77	75.70	145.43	250.40	350.75	472.00	▇▅▃▆▃
hp	1	158.41	72.42	52.00	106.25	150.00	198.75	335.00	▇▇▇▅▁
drat	1	3.57	0.55	2.76	3.15	3.66	3.89	4.93	▇▃▇▂▁
wt	1	3.23	0.95	1.51	2.66	3.44	3.57	5.34	▂▃▇▂▂
qsec	1	17.27	1.60	14.50	16.52	17.04	18.22	20.22	▃▃▇▃▃
carb	1	3.09	1.72	1.00	2.00	2.50	4.00	8.00	▇▁▆▁▁

Note:

Data split (60/40)
cyl, vs, am, and gear were changed to factors

Exploratory Data Analysis

PCA

# Construct PCA Object: 
df_train_PCA <- 
  df_train %>% select(-cyl,-vs,-am,-gear) %>% 
  PCA(graph = FALSE,
      # Target (MPG)
      quanti.sup = 1,
      # Vehicle Names
      quali.sup = 8)

df_train_PCA$quanti.sup$cor[,1:2] %>% round(2); df_train_PCA$quanti.sup$cos2[,1:2] %>% round(2)

## Dim.1 Dim.2 
## -0.87  0.25

## Dim.1 Dim.2 
##  0.76  0.06

df_train_PCA %>% 
  fviz_pca_var(col.quanti.sup = "navyblue", invisible = "var") + 
  theme_minimal() +
  labs(
    title = "MPG Summary",
    x = "MPG Cor: -0.877 | Quality of Representaion: 76%",
    y = "MPG Cor:  0.254 | Quality of Representaion: 6%"
  ) +
  theme(
    plot.background = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank()
  )

Figure 1

round(df_train_PCA$eig,2)

##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1       3.40                  56.63                             56.63
## comp 2       1.77                  29.44                             86.07
## comp 3       0.39                   6.57                             92.63
## comp 4       0.32                   5.32                             97.95
## comp 5       0.10                   1.71                             99.66
## comp 6       0.02                   0.34                            100.00

df_train_PCA %>% 
  fviz_eig(addlabels = TRUE, ylim = c(0,70)) +
  labs(
    title = "Principal Components 1 and 2 explains about 86% of the variance in the data",
    x = "Principal Components",
    y = element_blank()
  ) +
  theme_minimal() +
  theme(
    plot.background = element_blank(),
    axis.text.y = element_blank()
  )

Figure 2

Principal Components: Total Contribution

df_train_PCA$var$contrib[,1] %>% sort(decreasing = TRUE) %>% round(2)

##    hp  disp    wt  carb  qsec  drat 
## 25.27 23.35 19.49 12.23 10.72  8.94

df_train_PCA %>% 
  fviz_contrib(choice = "var", axes = 1) + 
  labs(
    title = "",
    x = element_blank(),
    y = element_blank()
  ) +
  theme_minimal() +
  theme(
    plot.background = element_blank(),
    axis.text.y = element_blank()
  )

Figure 3.1

Principal Component 1:

These variables seems to correlate with the overall power of a vehicle. I would assume a vehicle with high horsepower has a high displacement. Weight would increase as well.

df_train_PCA$var$contrib[,2] %>% sort(decreasing = TRUE) %>% round(2)

##  drat  qsec  carb    wt  disp    hp 
## 28.00 27.98 21.59 12.39  6.21  3.83

df_train_PCA %>% 
  fviz_contrib(choice = "var", axes = 2) +
  labs(
    title = "",
    x = element_blank(),
    y = element_blank()
  ) +
  theme_minimal() +
  theme(
    plot.background = element_blank(),
    axis.text.y = element_blank()
  )

Figure 3.2

Principal Component 2:

drat “Rear axle ratio”: as this ratio goes up, towing capacity increases which in turn reduces the fuel economy of the vehicle
qsec “Measure of Acceleration”: edmunds.com.
carb “Carburetor”: used to mix air with gasoline by using an optimal amount of fire to burn the gas. A good Carburator will be able to move fuel thoughout the vehicle efficiently thus increasing power to improve speed CARBURETORS | How They Work

All in all based on these variable descriptions I would say that this PC is more associated with how a efficient a vehicle uses its fuel in terms of performance

PCA-Categories

df_train_PCA %>% fviz_pca_biplot(
  geom = "point", pointshape = 21, fill.ind = "gray", col.ind = df_train$cyl,
  col.var = "black", alpha.var = "cos2", col.quanti.sup = "navyblue",
  addEllipses = TRUE, ellipse.type = "confidence",
  palette = c("lightsteelblue4","lightskyblue","lightseagreen"),
  legend.title = "", mean.point = FALSE
) +
  labs(
    title = "Cylinder",
    x = element_blank(),
    y = element_blank(),
    alpha = "Quality of Representation"
  ) +
  theme(
    plot.background = element_blank(),
    legend.position = "top"
  )

Figure 4.1

Note:

- While the cylinder of a vehicle increases so does the Horsepower, Displacement and Weight, which in turn leads to a lower mpg

df_train_PCA %>% fviz_pca_biplot(
  geom = "point", pointshape = 21, fill.ind = "gray", col.ind = df_train$vs,
  col.var = "black", alpha.var = "cos2", col.quanti.sup = "navyblue",
  addEllipses = TRUE, ellipse.type = "confidence",
  palette = c("lightsteelblue4", "lightseagreen"),
  legend.title = "", mean.point = FALSE
) +
  labs(
    title = "Engine Configuration",
    x = element_blank(),
    y = element_blank(),
    alpha = "Quality of Representation"
  ) +
  theme(
    plot.background = element_blank(),
    legend.position = "top"
  )

Figure 4.2

Note:

- Straightline engines seem to have a lower Horsepower, Displacement, and Weight compared to V-Shaped engines, which leads to a higher mpg

df_train_PCA %>% fviz_pca_biplot(
  geom = "point", pointshape = 21, fill.ind = "gray", col.ind = df_train$gear,
  col.var = "black", alpha.var = "cos2", col.quanti.sup = "navyblue",
  addEllipses = TRUE, ellipse.type = "confidence",
  palette = c("lightsteelblue4","lightskyblue","lightseagreen"),
  legend.title = "", mean.point = FALSE
) +
  labs(
    title = "Gear",
    x = element_blank(),
    y = element_blank(),
    alpha = "Quality of Representation"
  ) +
  theme(
    plot.background = element_blank(),
    legend.position = "top"
  )

Figure 4.3

Note:

- A gear 4 vehicle is associated with a better mpg on avg than a gear 3 and 5 vehicles

- A gear 5 vehicles with higher drat “Rear Axle Ratio” leads to a higher mpg.

df_train_PCA %>% fviz_pca_biplot(
  geom = "point", pointshape = 21, fill.ind = "gray", col.ind = df_train$am,
  col.var = "black", alpha.var = "cos2", col.quanti.sup = "navyblue",
  addEllipses = TRUE, ellipse.type = "confidence",
  palette = c("tomato", "seagreen"),
  legend.title = "", mean.point = FALSE
) +
  labs(
    title = "Transmission",
    subtitle = "I used a different color scheme to highlight that this is the particular catergory of interest",
    x = element_blank(),
    y = element_blank(),
    alpha = "Quality of Representation"
  ) +
  theme(
    plot.background = element_blank(),
    legend.position = "top"
  )

Figure: 4.4

Note:

- Manual Transmission is a associated with a better mpg than Automatic.

PCA - Biplot

df_train_PCA %>% 
  fviz_pca_biplot(invisible = "ind", col.var = "black", alpha.var = "cos2",
                  col.quanti.sup = "navyblue") %>% 
  fviz_add(df_train_PCA$quali.sup$coord, repel = TRUE,
           geom = "text", color = "lightsteelblue4") +
  labs(
    title = "Vehicle Names",
    x = element_blank(),
    y = element_blank(),
    alpha = "Quality of Represention"
  ) +
  theme(
    plot.background = element_blank(),
    legend.position = "top"
  )

Figure: 5

Note:

- Honda Civic “what I drive currently :)” is a lower weighted vehicle with good mpg compared to the others in the dataset

- Cadillac Fleetwood is a higher weighted vehicle which is associated with a lower mpg compared to the others in the data

- Maserati Bora is a high carb vehicle associated with a worse mpg compared to the others in the data

- Merc 230 has a good “Measure of Acceleration” compared to the other vehicles in the data

Plots

df_train %>% select(mpg,disp,hp,wt,qsec,drat,carb) %>% 
  cor() %>% round(2)

##        mpg  disp    hp    wt  qsec  drat  carb
## mpg   1.00 -0.84 -0.74 -0.86  0.33  0.65 -0.42
## disp -0.84  1.00  0.74  0.88 -0.35 -0.63  0.27
## hp   -0.74  0.74  1.00  0.59 -0.73 -0.33  0.71
## wt   -0.86  0.88  0.59  1.00 -0.08 -0.67  0.31
## qsec  0.33 -0.35 -0.73 -0.08  1.00 -0.10 -0.70
## drat  0.65 -0.63 -0.33 -0.67 -0.10  1.00  0.04
## carb -0.42  0.27  0.71  0.31 -0.70  0.04  1.00

df_train %>% 
  plot_ly(color = ~am, 
          hoverinfo = "text",
          text = ~paste("Name:", name, "<br>", am),
          colors = c("tomato", "seagreen")) %>% 
  add_trace(
    type = "splom",
    dimensions = list(
      list(label = "mpg", values = ~mpg),
      list(label = "wt", values = ~wt),
      list(label = "drat", values = ~drat)
    )
  ) %>%
  style(showlowerhalf = FALSE, showlegend = FALSE) %>% 
  layout(title = "Correlation Plot")

Figure: 6.1

Findings: mpg by am (transmission)

controlling for wt, it seems as though the mpg of a vehicle varies evenly no matter the type of transmission. However their seem to be a few outliers for Automatic vehicles at the high end of wt.
controlling for drat the data also varies. It looks like their might be one potential outlier at the high end of drat.
using the lasso select tool if you highlight the potential outliers at wt(5,5), you can see that those vehicles is associated with a low drat “Rear Axle Ratio” and mpg

Note:

- If you look at Figure 3.1: PC1 was best represented by disp, hp, and wt. I choose wt because it is, even though not by very much, the most correlated with mpg

- If you look at Figure 3.2: PC2 was best represented by qsec, drat, and carb. I choose drat for similar reasons

df_train %>% 
  plot_ly(x = ~wt, y = ~drat, z = ~mpg,
          hoverinfo = "text",
          text = ~paste("Mpg:", mpg, "<br>",
                        "Weight:", wt, "<br>",
                        "Drat:", drat, "<br>",
                        "Transmission:", am, "<br>",
                        "Name:", name),
          colors = c("tomato", "seagreen")
          ) %>% 
  add_markers(color = ~am, showlegend = FALSE) %>% 
  layout(xaxis = list(title = "Weight (lb/1000)"),
         yaxis = list(title = "Miles per US Gallon"),
         title = "Automatic vs Manual",
         plot_bgcolor = toRGB("gray90"))

Figure: 6.2

Findings: mpg ~ am + drat + wt (transmission)

controling for drat and wt, you can see more of a relationship between Transmission Types.
Manual: tend to have lower weight but high drat which lends to a better mpg compared to auto
Automatic: tend to have high wt and lower drat which lead to a worse mpg compared to manual

This graph is also interactive to feel free to play around with it

Modeling

To Quantify the difference I will be using a regression model

Model Fit

mod <- lm(mpg ~ am + drat + wt, data = df_train)
mod %>% tidy()

## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    27.2      7.56      3.60  0.00206 
## 2 amM            -1.59     1.78     -0.890 0.385   
## 3 drat            2.00     1.68      1.20  0.247   
## 4 wt             -4.47     0.917    -4.87  0.000123

Note:

- The intercept represents Automatic

- Controlling for other the variables in the model, only wt seems to be significant

mod_2 <- lm(mpg ~ am + drat + am:wt, data = df_test)
mod_2 %>% tidy()

## # A tibble: 5 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   32.5       17.8      1.83   0.127 
## 2 amM           16.1       11.4      1.42   0.216 
## 3 drat           0.818      3.70     0.221  0.834 
## 4 amA:wt        -4.65       1.97    -2.37   0.0643
## 5 amM:wt       -10.6        4.23    -2.51   0.0538

Note:

- all terms are insignificant at the 0.05 level

mod_3 <- lm(mpg ~ am + am:wt, data = df_train)
mod_3 %>% tidy()

## # A tibble: 4 x 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)    29.3      3.24       9.04 0.0000000411
## 2 amM            13.4      4.46       2.99 0.00785     
## 3 amA:wt         -3.28     0.852     -3.85 0.00116     
## 4 amM:wt         -8.05     1.18      -6.80 0.00000229

Note:

- controlling for all the the variables in the model, when the transmission is Manual a 1 unit increase in weight leads to a decrease in mpg at a higher rate compared to Automatic.

- It looks like this is a good model at the 0.05 and 0.01 significance level, but I’ll stick to the 0.05 level

mod_3 %>% glance() %>% select(r.squared, rmse = sigma)

## # A tibble: 1 x 2
##   r.squared  rmse
##       <dbl> <dbl>
## 1     0.834  2.29

Note:

- At the 0.05 significance level this model explains 0.83 of the variance.

- Training RMSE: 2.29

Diagnostics

mod_3 %>% augment() %>% cbind(name = df_train$name) %>% 
  ggplot(aes(.fitted, .resid, color = am)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "black") +
  geom_text(aes(label = name)) +
  scale_color_manual(values = c("tomato", "seagreen")) +
  labs(
    title = "Apart from a potential outlier, Variance seems to be Homoscedastic",
    x = "Fitted Values",
    y = "Residuals",
    color = "Transmission"
  ) + 
  theme_minimal() +
  theme(legend.position = "bottom")

Figure: 7.1

mod_3 %>% augment() %>% cbind(name = df_train$name) %>% 
  ggplot(aes(.fitted, .std.resid, color = am)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "black") +
  geom_text(aes(label = name)) +
  scale_color_manual(values = c("tomato", "seagreen")) +
  xlim(c(10,32)) +
  labs(
    title = "No Outliers - Merc 240D is still with 3 standard deviations",
    x = "Fitted Values",
    y = "Standardized Residuals",
    color = "Transmission"
  ) + 
  theme_minimal() +
  theme(legend.position = "bottom")

Figure: 7.2

mod_3 %>% augment() %>% cbind(name = df_train$name) %>% 
  ggplot(aes(.fitted, mpg, color = am)) +
  geom_point() +
  geom_text(aes(label = name)) +
  geom_abline(intercept = 0, slope = 1, color = "black") +
  scale_color_manual(values = c("tomato", "seagreen")) +
  xlab(c(5,40)) +
  labs(
    title = "Normality Check",
    x = "Fitted Values",
    y = "Actual Values",
    color = "Transmission"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Figure 7.3

Predictions

mod_3_pred <- mod_3 %>% predict(df_test) %>% round(2)
df_test %>% 
  select(name, am, wt, mpg) %>% 
  cbind(mpg_pred = mod_3_pred) %>% 
  knitr::kable()

	name	am	wt	mpg	mpg_pred
4	Hornet 4 Drive	A	3.215	21.4	18.76
9	Merc 230	A	3.150	22.8	18.98
11	Merc 280C	A	3.440	17.8	18.02
12	Merc 450SE	A	4.070	16.4	15.96
14	Merc 450SLC	A	3.780	15.2	16.91
16	Lincoln Continental	A	5.424	10.4	11.51
18	Fiat 128	M	2.200	32.4	24.95
20	Toyota Corolla	M	1.835	33.9	27.89
26	Fiat X1-9	M	1.935	27.3	27.09
32	Volvo 142E	M	2.780	21.4	20.28

Metrics::rmse(actual = df_test$mpg,
              predicted = mod_3_pred) %>% 
  round(2)

## [1] 3.45

Note:

- Test RMSE: 3.45

Results

1. Is an automatic or manual transmission better for mpg.

Based on the sample taken from the 1974 Motor Trends Analysis.

If your vehicle weighs below 2,500lb (2.5 * 1000). I would say go for the manual transmission because it has a better mpg.
If the weight of a vehicle is above 2.500lbs there not much of a difference between manual and automatic

2. Quantify the mpg difference between automatic and manual transmissions.

mod_3

## 
## Call:
## lm(formula = mpg ~ am + am:wt, data = df_train)
## 
## Coefficients:
## (Intercept)          amM       amA:wt       amM:wt  
##      29.314       13.352       -3.282       -8.051

Rsquared: 0.83 | Rmse: 3.45
According to the model Automatic Transmission vehicles have an avg mpg of 29.31. As weight increases by 1 unit (1000/lbs), for an Automatic vehicle the mpg will drop by -3.28
If the vehicle has a Manual Transmission the mpg increases by 13.35. As weight increases by 1 unit (1000/lbs), for a Manual vehicle the mpg will drop by -8.05

Note: The sample size of the mtcars dataset was very small so I would be cautious to generalize these findings to the entire population

1974 Motor Trends Analysis

Isioma Mabatah

9/2/2020

Overview

Libraries

Analysis

Data

Description of Features

Preprocessing

Exploratory Data Analysis

PCA

Principal Components: Total Contribution

PCA-Categories

PCA - Biplot

Plots

Modeling

Model Fit

Diagnostics

Predictions

Results

1. Is an automatic or manual transmission better for mpg.

2. Quantify the mpg difference between automatic and manual transmissions.