This report will be describing vehicles listed in the 1974 Motor Trends US Magazine. The assignment of interest comes from RPubs.com. The objective is to explore the relationship between mpg and the other features listed in the dataset.
Particular Questions of Interest
library(dplyr)
library(ggplot2)
library(factoextra)
library(FactoMineR)
library(plotly)
library(broom)
The mtcars dataset is already provided in R. The Documentation can be found here
set.seed(1993)
df <- mtcars
df %>% glimpse()
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17...
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4,...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8,...
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3....
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150,...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90,...
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,...
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3,...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1,...
df_description <- tibble(
Feature = c("mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"),
Description = c("Miles / US Gallon",
"Number of Cylinders (4,6,8)",
"Displacement - Measures the overall volume in the engine as a factor of cylinder circumfrance, depth and total number of cylinders. 'Proxy for the total amount of power the engine can generate.'",
"Horsepower",
"Rear Axle Ratio - The number of turns of the drive shaft for every one rotation of the wheel axle",
"Weight (lb/1000)",
"1/4 time - A performance measure, primarily of acceleration",
"Engine Cylinder Configuration - (VShape = 0 | StraightLine = 1)",
"Transmission Type - (Auto = 0 | Manual = 1)",
"Number of Forward Gears - Auto = (3 | 4), Manual = (4 | 5)",
"Number of Carburetors - Engines with higher displacement typically have higher barrel configuration")
) %>% knitr::kable()
df_description
| Feature | Description |
|---|---|
| mpg | Miles / US Gallon |
| cyl | Number of Cylinders (4,6,8) |
| disp | Displacement - Measures the overall volume in the engine as a factor of cylinder circumfrance, depth and total number of cylinders. ‘Proxy for the total amount of power the engine can generate.’ |
| hp | Horsepower |
| drat | Rear Axle Ratio - The number of turns of the drive shaft for every one rotation of the wheel axle |
| wt | Weight (lb/1000) |
| qsec | 1/4 time - A performance measure, primarily of acceleration |
| vs | Engine Cylinder Configuration - (VShape = 0 | StraightLine = 1) |
| am | Transmission Type - (Auto = 0 | Manual = 1) |
| gear | Number of Forward Gears - Auto = (3 | 4), Manual = (4 | 5) |
| carb | Number of Carburetors - Engines with higher displacement typically have higher barrel configuration |
Refer to RPubs for a more in depth description
df <- df %>% mutate(
cyl = as.factor(cyl),
vs = as.factor(ifelse(vs == 1, "S","V")),
am = as.factor(ifelse(am == 1, "M","A")),
gear = as.factor(gear),
name = rownames(df)
)
# Split to Train and Testset
set <- sample(1:2, replace = TRUE,
size = nrow(df),
prob = c(0.6,0.4))
df_train <- df[set == 1,]
df_test <- df[set == 2,]
df_train %>% skimr::skim()
| Name | Piped data |
| Number of rows | 22 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 4 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1 | 7 | 18 | 0 | 22 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| cyl | 0 | 1 | FALSE | 3 | 8: 11, 4: 6, 6: 5 |
| vs | 0 | 1 | FALSE | 2 | V: 15, S: 7 |
| am | 0 | 1 | FALSE | 2 | A: 13, M: 9 |
| gear | 0 | 1 | FALSE | 3 | 3: 11, 4: 6, 5: 5 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| mpg | 0 | 1 | 19.27 | 5.21 | 10.40 | 15.27 | 18.95 | 21.38 | 30.40 | ▃▇▇▃▂ |
| disp | 0 | 1 | 247.97 | 122.77 | 75.70 | 145.43 | 250.40 | 350.75 | 472.00 | ▇▅▃▆▃ |
| hp | 0 | 1 | 158.41 | 72.42 | 52.00 | 106.25 | 150.00 | 198.75 | 335.00 | ▇▇▇▅▁ |
| drat | 0 | 1 | 3.57 | 0.55 | 2.76 | 3.15 | 3.66 | 3.89 | 4.93 | ▇▃▇▂▁ |
| wt | 0 | 1 | 3.23 | 0.95 | 1.51 | 2.66 | 3.44 | 3.57 | 5.34 | ▂▃▇▂▂ |
| qsec | 0 | 1 | 17.27 | 1.60 | 14.50 | 16.52 | 17.04 | 18.22 | 20.22 | ▃▃▇▃▃ |
| carb | 0 | 1 | 3.09 | 1.72 | 1.00 | 2.00 | 2.50 | 4.00 | 8.00 | ▇▁▆▁▁ |
Note:
Data split (60/40)
cyl, vs, am, and gear were changed to factors
# Construct PCA Object:
df_train_PCA <-
df_train %>% select(-cyl,-vs,-am,-gear) %>%
PCA(graph = FALSE,
# Target (MPG)
quanti.sup = 1,
# Vehicle Names
quali.sup = 8)
df_train_PCA$quanti.sup$cor[,1:2] %>% round(2); df_train_PCA$quanti.sup$cos2[,1:2] %>% round(2)
## Dim.1 Dim.2
## -0.87 0.25
## Dim.1 Dim.2
## 0.76 0.06
df_train_PCA %>%
fviz_pca_var(col.quanti.sup = "navyblue", invisible = "var") +
theme_minimal() +
labs(
title = "MPG Summary",
x = "MPG Cor: -0.877 | Quality of Representaion: 76%",
y = "MPG Cor: 0.254 | Quality of Representaion: 6%"
) +
theme(
plot.background = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank()
)
Figure 1
round(df_train_PCA$eig,2)
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.40 56.63 56.63
## comp 2 1.77 29.44 86.07
## comp 3 0.39 6.57 92.63
## comp 4 0.32 5.32 97.95
## comp 5 0.10 1.71 99.66
## comp 6 0.02 0.34 100.00
df_train_PCA %>%
fviz_eig(addlabels = TRUE, ylim = c(0,70)) +
labs(
title = "Principal Components 1 and 2 explains about 86% of the variance in the data",
x = "Principal Components",
y = element_blank()
) +
theme_minimal() +
theme(
plot.background = element_blank(),
axis.text.y = element_blank()
)
Figure 2
df_train_PCA$var$contrib[,1] %>% sort(decreasing = TRUE) %>% round(2)
## hp disp wt carb qsec drat
## 25.27 23.35 19.49 12.23 10.72 8.94
df_train_PCA %>%
fviz_contrib(choice = "var", axes = 1) +
labs(
title = "",
x = element_blank(),
y = element_blank()
) +
theme_minimal() +
theme(
plot.background = element_blank(),
axis.text.y = element_blank()
)
Figure 3.1
Principal Component 1:
df_train_PCA$var$contrib[,2] %>% sort(decreasing = TRUE) %>% round(2)
## drat qsec carb wt disp hp
## 28.00 27.98 21.59 12.39 6.21 3.83
df_train_PCA %>%
fviz_contrib(choice = "var", axes = 2) +
labs(
title = "",
x = element_blank(),
y = element_blank()
) +
theme_minimal() +
theme(
plot.background = element_blank(),
axis.text.y = element_blank()
)
Figure 3.2
Principal Component 2:
drat “Rear axle ratio”: as this ratio goes up, towing capacity increases which in turn reduces the fuel economy of the vehicle
qsec “Measure of Acceleration”: edmunds.com.
carb “Carburetor”: used to mix air with gasoline by using an optimal amount of fire to burn the gas. A good Carburator will be able to move fuel thoughout the vehicle efficiently thus increasing power to improve speed CARBURETORS | How They Work
All in all based on these variable descriptions I would say that this PC is more associated with how a efficient a vehicle uses its fuel in terms of performance
df_train_PCA %>% fviz_pca_biplot(
geom = "point", pointshape = 21, fill.ind = "gray", col.ind = df_train$cyl,
col.var = "black", alpha.var = "cos2", col.quanti.sup = "navyblue",
addEllipses = TRUE, ellipse.type = "confidence",
palette = c("lightsteelblue4","lightskyblue","lightseagreen"),
legend.title = "", mean.point = FALSE
) +
labs(
title = "Cylinder",
x = element_blank(),
y = element_blank(),
alpha = "Quality of Representation"
) +
theme(
plot.background = element_blank(),
legend.position = "top"
)
Figure 4.1
Note:
- While the cylinder of a vehicle increases so does the Horsepower, Displacement and Weight, which in turn leads to a lower mpg
df_train_PCA %>% fviz_pca_biplot(
geom = "point", pointshape = 21, fill.ind = "gray", col.ind = df_train$vs,
col.var = "black", alpha.var = "cos2", col.quanti.sup = "navyblue",
addEllipses = TRUE, ellipse.type = "confidence",
palette = c("lightsteelblue4", "lightseagreen"),
legend.title = "", mean.point = FALSE
) +
labs(
title = "Engine Configuration",
x = element_blank(),
y = element_blank(),
alpha = "Quality of Representation"
) +
theme(
plot.background = element_blank(),
legend.position = "top"
)
Figure 4.2
Note:
- Straightline engines seem to have a lower Horsepower, Displacement, and Weight compared to V-Shaped engines, which leads to a higher mpg
df_train_PCA %>% fviz_pca_biplot(
geom = "point", pointshape = 21, fill.ind = "gray", col.ind = df_train$gear,
col.var = "black", alpha.var = "cos2", col.quanti.sup = "navyblue",
addEllipses = TRUE, ellipse.type = "confidence",
palette = c("lightsteelblue4","lightskyblue","lightseagreen"),
legend.title = "", mean.point = FALSE
) +
labs(
title = "Gear",
x = element_blank(),
y = element_blank(),
alpha = "Quality of Representation"
) +
theme(
plot.background = element_blank(),
legend.position = "top"
)
Figure 4.3
Note:
- A gear 4 vehicle is associated with a better mpg on avg than a gear 3 and 5 vehicles
- A gear 5 vehicles with higher drat “Rear Axle Ratio” leads to a higher mpg.
df_train_PCA %>% fviz_pca_biplot(
geom = "point", pointshape = 21, fill.ind = "gray", col.ind = df_train$am,
col.var = "black", alpha.var = "cos2", col.quanti.sup = "navyblue",
addEllipses = TRUE, ellipse.type = "confidence",
palette = c("tomato", "seagreen"),
legend.title = "", mean.point = FALSE
) +
labs(
title = "Transmission",
subtitle = "I used a different color scheme to highlight that this is the particular catergory of interest",
x = element_blank(),
y = element_blank(),
alpha = "Quality of Representation"
) +
theme(
plot.background = element_blank(),
legend.position = "top"
)
Figure: 4.4
Note:
- Manual Transmission is a associated with a better mpg than Automatic.
df_train_PCA %>%
fviz_pca_biplot(invisible = "ind", col.var = "black", alpha.var = "cos2",
col.quanti.sup = "navyblue") %>%
fviz_add(df_train_PCA$quali.sup$coord, repel = TRUE,
geom = "text", color = "lightsteelblue4") +
labs(
title = "Vehicle Names",
x = element_blank(),
y = element_blank(),
alpha = "Quality of Represention"
) +
theme(
plot.background = element_blank(),
legend.position = "top"
)
Figure: 5
Note:
- Honda Civic “what I drive currently :)” is a lower weighted vehicle with good mpg compared to the others in the dataset
- Cadillac Fleetwood is a higher weighted vehicle which is associated with a lower mpg compared to the others in the data
- Maserati Bora is a high carb vehicle associated with a worse mpg compared to the others in the data
- Merc 230 has a good “Measure of Acceleration” compared to the other vehicles in the data
df_train %>% select(mpg,disp,hp,wt,qsec,drat,carb) %>%
cor() %>% round(2)
## mpg disp hp wt qsec drat carb
## mpg 1.00 -0.84 -0.74 -0.86 0.33 0.65 -0.42
## disp -0.84 1.00 0.74 0.88 -0.35 -0.63 0.27
## hp -0.74 0.74 1.00 0.59 -0.73 -0.33 0.71
## wt -0.86 0.88 0.59 1.00 -0.08 -0.67 0.31
## qsec 0.33 -0.35 -0.73 -0.08 1.00 -0.10 -0.70
## drat 0.65 -0.63 -0.33 -0.67 -0.10 1.00 0.04
## carb -0.42 0.27 0.71 0.31 -0.70 0.04 1.00
df_train %>%
plot_ly(color = ~am,
hoverinfo = "text",
text = ~paste("Name:", name, "<br>", am),
colors = c("tomato", "seagreen")) %>%
add_trace(
type = "splom",
dimensions = list(
list(label = "mpg", values = ~mpg),
list(label = "wt", values = ~wt),
list(label = "drat", values = ~drat)
)
) %>%
style(showlowerhalf = FALSE, showlegend = FALSE) %>%
layout(title = "Correlation Plot")
Figure: 6.1
Findings: mpg by am (transmission)
controlling for wt, it seems as though the mpg of a vehicle varies evenly no matter the type of transmission. However their seem to be a few outliers for Automatic vehicles at the high end of wt.
controlling for drat the data also varies. It looks like their might be one potential outlier at the high end of drat.
using the lasso select tool if you highlight the potential outliers at wt(5,5), you can see that those vehicles is associated with a low drat “Rear Axle Ratio” and mpg
Note:
- If you look at Figure 3.1: PC1 was best represented by disp, hp, and wt. I choose wt because it is, even though not by very much, the most correlated with mpg
- If you look at Figure 3.2: PC2 was best represented by qsec, drat, and carb. I choose drat for similar reasons
df_train %>%
plot_ly(x = ~wt, y = ~drat, z = ~mpg,
hoverinfo = "text",
text = ~paste("Mpg:", mpg, "<br>",
"Weight:", wt, "<br>",
"Drat:", drat, "<br>",
"Transmission:", am, "<br>",
"Name:", name),
colors = c("tomato", "seagreen")
) %>%
add_markers(color = ~am, showlegend = FALSE) %>%
layout(xaxis = list(title = "Weight (lb/1000)"),
yaxis = list(title = "Miles per US Gallon"),
title = "Automatic vs Manual",
plot_bgcolor = toRGB("gray90"))
Figure: 6.2
Findings: mpg ~ am + drat + wt (transmission)
controling for drat and wt, you can see more of a relationship between Transmission Types.
Manual: tend to have lower weight but high drat which lends to a better mpg compared to auto
Automatic: tend to have high wt and lower drat which lead to a worse mpg compared to manual
This graph is also interactive to feel free to play around with it
To Quantify the difference I will be using a regression model
mod <- lm(mpg ~ am + drat + wt, data = df_train)
mod %>% tidy()
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 27.2 7.56 3.60 0.00206
## 2 amM -1.59 1.78 -0.890 0.385
## 3 drat 2.00 1.68 1.20 0.247
## 4 wt -4.47 0.917 -4.87 0.000123
Note:
- The intercept represents Automatic
- Controlling for other the variables in the model, only wt seems to be significant
mod_2 <- lm(mpg ~ am + drat + am:wt, data = df_test)
mod_2 %>% tidy()
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 32.5 17.8 1.83 0.127
## 2 amM 16.1 11.4 1.42 0.216
## 3 drat 0.818 3.70 0.221 0.834
## 4 amA:wt -4.65 1.97 -2.37 0.0643
## 5 amM:wt -10.6 4.23 -2.51 0.0538
Note:
- all terms are insignificant at the 0.05 level
mod_3 <- lm(mpg ~ am + am:wt, data = df_train)
mod_3 %>% tidy()
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 29.3 3.24 9.04 0.0000000411
## 2 amM 13.4 4.46 2.99 0.00785
## 3 amA:wt -3.28 0.852 -3.85 0.00116
## 4 amM:wt -8.05 1.18 -6.80 0.00000229
Note:
- controlling for all the the variables in the model, when the transmission is Manual a 1 unit increase in weight leads to a decrease in mpg at a higher rate compared to Automatic.
- It looks like this is a good model at the 0.05 and 0.01 significance level, but I’ll stick to the 0.05 level
mod_3 %>% glance() %>% select(r.squared, rmse = sigma)
## # A tibble: 1 x 2
## r.squared rmse
## <dbl> <dbl>
## 1 0.834 2.29
Note:
- At the 0.05 significance level this model explains 0.83 of the variance.
- Training RMSE: 2.29
mod_3 %>% augment() %>% cbind(name = df_train$name) %>%
ggplot(aes(.fitted, .resid, color = am)) +
geom_point() +
geom_hline(yintercept = 0, color = "black") +
geom_text(aes(label = name)) +
scale_color_manual(values = c("tomato", "seagreen")) +
labs(
title = "Apart from a potential outlier, Variance seems to be Homoscedastic",
x = "Fitted Values",
y = "Residuals",
color = "Transmission"
) +
theme_minimal() +
theme(legend.position = "bottom")
Figure: 7.1
mod_3 %>% augment() %>% cbind(name = df_train$name) %>%
ggplot(aes(.fitted, .std.resid, color = am)) +
geom_point() +
geom_hline(yintercept = 0, color = "black") +
geom_text(aes(label = name)) +
scale_color_manual(values = c("tomato", "seagreen")) +
xlim(c(10,32)) +
labs(
title = "No Outliers - Merc 240D is still with 3 standard deviations",
x = "Fitted Values",
y = "Standardized Residuals",
color = "Transmission"
) +
theme_minimal() +
theme(legend.position = "bottom")
Figure: 7.2
mod_3 %>% augment() %>% cbind(name = df_train$name) %>%
ggplot(aes(.fitted, mpg, color = am)) +
geom_point() +
geom_text(aes(label = name)) +
geom_abline(intercept = 0, slope = 1, color = "black") +
scale_color_manual(values = c("tomato", "seagreen")) +
xlab(c(5,40)) +
labs(
title = "Normality Check",
x = "Fitted Values",
y = "Actual Values",
color = "Transmission"
) +
theme_minimal() +
theme(legend.position = "bottom")
Figure 7.3
mod_3_pred <- mod_3 %>% predict(df_test) %>% round(2)
df_test %>%
select(name, am, wt, mpg) %>%
cbind(mpg_pred = mod_3_pred) %>%
knitr::kable()
| name | am | wt | mpg | mpg_pred | |
|---|---|---|---|---|---|
| 4 | Hornet 4 Drive | A | 3.215 | 21.4 | 18.76 |
| 9 | Merc 230 | A | 3.150 | 22.8 | 18.98 |
| 11 | Merc 280C | A | 3.440 | 17.8 | 18.02 |
| 12 | Merc 450SE | A | 4.070 | 16.4 | 15.96 |
| 14 | Merc 450SLC | A | 3.780 | 15.2 | 16.91 |
| 16 | Lincoln Continental | A | 5.424 | 10.4 | 11.51 |
| 18 | Fiat 128 | M | 2.200 | 32.4 | 24.95 |
| 20 | Toyota Corolla | M | 1.835 | 33.9 | 27.89 |
| 26 | Fiat X1-9 | M | 1.935 | 27.3 | 27.09 |
| 32 | Volvo 142E | M | 2.780 | 21.4 | 20.28 |
Metrics::rmse(actual = df_test$mpg,
predicted = mod_3_pred) %>%
round(2)
## [1] 3.45
Note:
- Test RMSE: 3.45
Based on the sample taken from the 1974 Motor Trends Analysis.
If your vehicle weighs below 2,500lb (2.5 * 1000). I would say go for the manual transmission because it has a better mpg.
If the weight of a vehicle is above 2.500lbs there not much of a difference between manual and automatic
mod_3
##
## Call:
## lm(formula = mpg ~ am + am:wt, data = df_train)
##
## Coefficients:
## (Intercept) amM amA:wt amM:wt
## 29.314 13.352 -3.282 -8.051
Rsquared: 0.83 | Rmse: 3.45
According to the model Automatic Transmission vehicles have an avg mpg of 29.31. As weight increases by 1 unit (1000/lbs), for an Automatic vehicle the mpg will drop by -3.28
If the vehicle has a Manual Transmission the mpg increases by 13.35. As weight increases by 1 unit (1000/lbs), for a Manual vehicle the mpg will drop by -8.05
Note: The sample size of the mtcars dataset was very small so I would be cautious to generalize these findings to the entire population