library(tidyverse)
library(reshape2)
library(hablar)
library(RColorBrewer)
library(plotly)
library(ggbiplot) # installation library(devtools); install_github("vqv/ggbiplot")
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggbiplot_0.55 scales_1.1.1 plyr_1.8.6 plotly_4.9.2.1
## [5] RColorBrewer_1.1-2 hablar_0.3.0 reshape2_1.4.4 forcats_0.5.0
## [9] stringr_1.4.0 dplyr_0.8.5 purrr_0.3.4 readr_1.3.1
## [13] tidyr_1.1.0 tibble_3.0.1 ggplot2_3.3.1 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.14 haven_2.2.0 lattice_0.20-41
## [5] colorspace_1.4-1 vctrs_0.3.0 generics_0.0.2 viridisLite_0.3.0
## [9] htmltools_0.4.0 yaml_2.2.1 rlang_0.4.6 pillar_1.4.4
## [13] glue_1.4.1 withr_2.2.0 DBI_1.1.0 dbplyr_1.4.3
## [17] modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0
## [21] gtable_0.3.0 cellranger_1.1.0 rvest_0.3.5 htmlwidgets_1.5.1
## [25] evaluate_0.14 knitr_1.28 fansi_0.4.1 broom_0.5.6
## [29] Rcpp_1.0.4.6 backports_1.1.7 jsonlite_1.6.1 fs_1.4.1
## [33] hms_0.5.3 digest_0.6.25 stringi_1.4.6 cli_2.0.2
## [37] tools_4.0.0 magrittr_1.5 lazyeval_0.2.2 crayon_1.3.4
## [41] pkgconfig_2.0.3 ellipsis_0.3.1 data.table_1.12.8 xml2_1.3.2
## [45] reprex_0.3.0 lubridate_1.7.8 assertthat_0.2.1 rmarkdown_2.1
## [49] httr_1.4.1 rstudioapi_0.11 R6_2.4.1 nlme_3.1-147
## [53] compiler_4.0.0
In this report, we will analyze the mtcars dataset and evaluate the relationship of mpg and auto transmission (automatic/manual). We will also deep dive other variables that affects fuel efficiency such as weight of the vehicle, horsepower, qsec 1/4 mile performance test run and many others in the dataset. We will also use a package called olsrr. This tool made it easier to cycle combinations and patterns for ordinary least squares regression models and combined it with principle component analysis (prcomp). We obtained and use the highest adjusted Rsquared to find our model.
This two model selection (olsrr and pca) gave us insight on the following:
According to the help documentation ?mtcars: “The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).”
Description of variables:
data(mtcars)
This is a fast operation that provides us with the number of obersvations and variables, with their respective datatypes. We can quickly analyze opportunities that we can see using str().
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
A few opportunities I see in the above output are as follows:
Using str() provides us with a quick and powerful way in analyzing our data and making it more tidy.
According to the Buereau of Transportation, the average fuel efficiency for ligh-duty vehicle-Passenger car in 2017 is 39.4 mpg.
Also, we added another variable called countryGroup in mtcars which shows the country or continent they were made (personal assumption).
countryGroup <-
c(
rep("JAPAN", 3),
rep("USA", 4),
rep("EUROPE", 7),
rep("USA", 3),
"EUROPE",
rep("JAPAN", 3),
rep("USA", 4),
rep("EUROPE", 3),
"USA",
rep("EUROPE", 3)
)
mtcars$country <- countryGroup
quantile(mtcars$mpg, seq(0, 1, 0.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 10.40 14.34 15.20 15.98 17.92 19.20 21.00 21.47 24.08 30.09 33.90
Above table, we zoomed-in our data using quantile function. If we are to add external dataset of 39.4 mpg in mtcars dataset, this becomes the new max and a complete outlier.
There are many technological advances and changes in environmental policies that occurred between 1973 to 2017. Some advancements are lighter metals and safety equipment to increase fuel efficiency and decrease risk in accident.
There are many more advances in the field of vehicles-electrification that increases mpg-e(equivalent) and driverless vehicles (off-topic).
After 44 years, the advancement in fuel efficiency increased at an average rate of .44 mpg each year for 44 years or 19.31/44 years (note: 39.4-20.09=19.31). This is pretty bad.
Tesla Model S on average has 115 city /107 highway mpg-e.
Knowing the national average-mpg, gives us insight to the following:
Continue on the EDA…
Let’s review the relationship of variables to our vehicles:
pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)
pca_x_df <- data.frame(pca[["x"]])
pca_name <- colnames(pca_x_df)
pca.var <- pca$sdev ^ 2
pca.var.per <- round(pca.var / sum(pca.var) * 100, 1)
pca_df <- data.frame(name = pca_name, pca_percent = pca.var.per)
pca_df %>% ggplot(aes(x = name, y = pca_percent)) + geom_col()
#pca <- prcomp(mtcars[, c(1:7,10,11)],center = TRUE, scale. = TRUE)
pca_g <-
ggplotly(ggbiplot(
pca,
labels = rownames(mtcars),
obs.scale = .5,
var.scale = .5,
ellipse = TRUE,
groups = mtcars$country
))
pca_g
Not all cars are made equal. We can see from the above graph that the Japan, and Europe made vehicles tend to have a higher mpg and less hp, cyl, disp, and wt compared with the USA counterpart.
Let’s review PCA3 and PCA4:
ggplotly(ggbiplot(
pca,
labels = rownames(mtcars),
obs.scale = .5,
var.scale = .5,
ellipse = TRUE,
choices = c(3,4),
groups = mtcars$country
))
It’s hard to see any insight from the above because of the small percentage.
Let’s take a look at the breakdown of countries vs. am:
group_country_am <- mtcars %>% group_by(country, am)
group_country_am <- dplyr::summarise(group_country_am, count = n())
group_country_am <-
melt(group_country_am, id.vars = c("country", "am"))
ggplot(group_country_am, aes(x = country, y = value)) +
geom_bar(stat = "identity", alpha = .9) + facet_grid( ~ am) + labs(title = "Country by Auto Transmission (0=AT, 1=MT)", y = "Count", x = "Country/Continent")
mpg_rank <-
mtcars %>% arrange(desc(mpg)) %>% select(country, mpg, am) %>% mutate(am_abbr = ifelse(mtcars$am == 0, "AT", "MT"))
top10 <-
mpg_rank %>% mutate(ranking = rank(desc(mpg), ties.method = "first"))
head(top10 %>% select(country, mpg, am_abbr, ranking) )
Japan has a total of 5 vehicles in our sample size and the above chart shows that 4 of Japanese vehicles are part of the top 10.
Is an automatic or manual transmission better for MPG?
Data set description of “am”, from ?mtcars:
amMean <- summarise(group_by(mtcars, AM = as.factor(am)), MN = mean(mpg))
mpgMean <- mean(mtcars$mpg)
amMean
by(mtcars$mpg, INDICES = mtcars$am, summary)
## mtcars$am: 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 14.95 17.30 17.15 19.20 24.40
## ------------------------------------------------------------
## mtcars$am: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 21.00 22.80 24.39 30.40 33.90
Manual transmission has a higher min and max compared to Automatic Transmission in our dataset. Let’s create a boxplot to visualize our findings.
g <- mtcars %>% ggplot(aes(y = mpg, x = am))
g + geom_boxplot(alpha = .3) +
facet_grid( ~ mtcars$am) +
geom_hline(aes(yintercept = mean(mpg), col = "red")) +
theme(legend.position = "none") + stat_summary(fun = mean, geom = "point",
colour = "blue")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
Transmission Analysis:
Let’s review the counts of each transmission:
by_am <- mtcars %>% group_by(am)
dplyr::summarise(by_am, count = n())
Answer to question 1:
Based on the simple solution we applied, manual transmission is better than automatic transmission in fuel efficiency. We still need to keep poking our data for insights.
Quantify the MPG difference between automatic and manual transmissions?
Manual Transmission mpg is mpg higher than Automatic manual transmission mpg. We do not know what percentage belongs to AM in the difference between MT and AT.
We need to dig deeper in our analysis and keep poking our simple solution.
Hypothesis:
H0: AT mpg = MT mpg; There is no difference in fuel efficiency between Automatic transmission and Manual Transmission
Ha: AT mpg != MT mpg
t.test(mtcars$mpg ~ factor(mtcars$am),
paired = FALSE,
var.equal = TRUE)
##
## Two Sample t-test
##
## data: mtcars$mpg by factor(mtcars$am)
## t = -4.1061, df = 30, p-value = 0.000285
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.84837 -3.64151
## sample estimates:
## mean in group 0 mean in group 1
## 17.14737 24.39231
There is a .0285% chance that our null hypothesis = MT mpg = MT mpg, therefore, we reject the null hypothesis. The mean of AT != mean of MT. Let’s check out the linear model:
fit <- lm(data = mtcars, mpg ~ factor(am))
summary(fit)
##
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## factor(am)1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
As we said before we rejected the null hypothesis, the linear model shows the following:
Let us review the plot and see if it is a good fit.
rm(list = ls())
attach(mtcars)
fit <- lm(am ~ mpg)
plot(mpg,
am,
frame.plot = FALSE,
xlab = "MPG",
ylab = "Transmission type (0=AT, 1=MT)")
abline(fit, col = "blue")
lines(lowess(mpg, am), type = "l", col = "red")
The above linear model is not a good fit the blue bar crosses below 0 and above 1 and the redline crosses above 1. We need to find a better model that fits out data.
Log regression part of the GLM:
fit <- glm(factor(mtcars$am) ~ mtcars$mpg, family = binomial)
plot(
x = mpg,
y = am,
frame.plot = FALSE,
xlab = "MPG",
ylab = "Transmission type (0=AT, 1=MT)"
)
points(mpg, fit$fitted.values , col = "red")
Exponentiated coefficient.
exp(fit$coefficients)
## (Intercept) mtcars$mpg
## 0.001355579 1.359379288
The first line shows the exponentiated slope coefficient is 1.36. We estimate a 36% increase in odds that the transmission is manual transmission per 1 mpg increase.
fit <- glm(factor(mtcars$am)~mtcars$mpg, family = binomial)
summary(fit)
##
## Call:
## glm(formula = factor(mtcars$am) ~ mtcars$mpg, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5701 -0.7531 -0.4245 0.5866 2.0617
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.6035 2.3514 -2.808 0.00498 **
## mtcars$mpg 0.3070 0.1148 2.673 0.00751 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 29.675 on 30 degrees of freedom
## AIC: 33.675
##
## Number of Fisher Scoring iterations: 5
The above model has an AIC of 33.6751676 similar to the R-squared. Which means our model only accounts for 33.675% of the deviance. We need a better model.
Nested Model.
Update the datatype and scale the data:
data(mtcars)
scale_mtcars <- data.frame(scale(mtcars))
fit1 <- lm(data = scale_mtcars, mpg ~ factor(am))
fit2 <- update(fit1, mpg ~ factor(am) + disp + hp)
fit3 <- update(fit1, mpg ~ factor(am) + disp + hp + drat + wt)
fit4 <- update(fit1, mpg ~ factor(am) + disp + hp + drat + wt + qsec + factor(vs))
fit5 <-
update(
fit1,
mpg ~ factor(am) + disp + hp + drat + wt + qsec + factor(vs) + factor(cyl) +
factor(gear) + factor(carb)
)
anova(fit1, fit2, fit3, fit4, fit5)
Model 2 and Model 3 seems to be a good candidate. Let us look deeper on model three and see if there is something we are missing.
The best fit is fit2 and fit3. Let’s examine it closely.
summary(fit3)
##
## Call:
## lm(formula = mpg ~ factor(am) + disp + hp + drat + wt, data = scale_mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5535 -0.3011 -0.1028 0.1986 0.9061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.11031 0.13145 -0.839 0.40902
## factor(am)1.18990141802394 0.27154 0.26360 1.030 0.31243
## disp 0.09759 0.22272 0.438 0.66488
## hp -0.45164 0.14253 -3.169 0.00389 **
## drat 0.10185 0.12856 0.792 0.43543
## wt -0.49041 0.18924 -2.591 0.01547 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4313 on 26 degrees of freedom
## Multiple R-squared: 0.844, Adjusted R-squared: 0.814
## F-statistic: 28.13 on 5 and 26 DF, p-value: 1.037e-09
library(olsrr)
data(mtcars)
model <- lm(data = mtcars, mpg ~ factor(am): + .)
find_model <- ols_step_all_possible(model)
head(find_model %>% filter(rank(desc(adjr), ties.method = "first") <= 10))
The above table, shows us the best model. Let’s go ahead and see how it looks.
data(mtcars)
countryGroup <-
c(
rep("JAPAN", 3),
rep("USA", 4),
rep("EUROPE", 7),
rep("USA", 3),
"EUROPE",
rep("JAPAN", 3),
rep("USA", 4),
rep("EUROPE", 3),
"USA",
rep("EUROPE", 3)
)
mtcars$country <- countryGroup
bestfit <- lm(data = mtcars, mpg ~ factor(am):wt+ factor(am):hp + factor(am):qsec+ factor(am):factor(vs)+ factor(am):factor(gear)+ factor(am):carb+ factor(am):country)
plot(bestfit)
## Warning: not plotting observations with leverage one:
## 21, 29
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
confint(bestfit)
## 2.5 % 97.5 %
## (Intercept) 17.48061188 55.34077972
## factor(am)0:wt -0.35677751 5.21051370
## factor(am)1:wt -16.42668727 -4.96329415
## factor(am)0:hp 0.03674741 0.18492616
## factor(am)1:hp -0.03029621 0.08737513
## factor(am)0:qsec -2.83966754 -0.23165048
## factor(am)1:qsec 2.31069618 6.93368664
## factor(am)0:factor(vs)1 1.13191918 10.88255742
## factor(am)1:factor(vs)1 -11.99520739 0.08145159
## factor(am)0:factor(gear)4 5.80733310 19.18283666
## factor(am)1:factor(gear)4 -118.19232301 -22.33093457
## factor(am)0:factor(gear)5 NA NA
## factor(am)1:factor(gear)5 -113.36907289 -24.51058028
## factor(am)0:carb -11.01492888 -3.85110632
## factor(am)1:carb -0.68866745 2.82618705
## factor(am)0:countryJAPAN -4.13592850 5.15637238
## factor(am)1:countryJAPAN -4.54125645 1.84235226
## factor(am)0:countryUSA -7.98647328 -1.60837748
## factor(am)1:countryUSA -3.23756246 10.03960125
Conclusion
When we combine our PCA1/PCA2 graph and bestfit model, we can see the following: