
library(tidyverse)
library(reshape2)
library(hablar)
library(RColorBrewer)
library(plotly)
library(ggbiplot) # installation library(devtools); install_github("vqv/ggbiplot")
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggbiplot_0.55      scales_1.1.1       plyr_1.8.6         plotly_4.9.2.1    
##  [5] RColorBrewer_1.1-2 hablar_0.3.0       reshape2_1.4.4     forcats_0.5.0     
##  [9] stringr_1.4.0      dplyr_0.8.5        purrr_0.3.4        readr_1.3.1       
## [13] tidyr_1.1.0        tibble_3.0.1       ggplot2_3.3.1      tidyverse_1.3.0   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0  xfun_0.14         haven_2.2.0       lattice_0.20-41  
##  [5] colorspace_1.4-1  vctrs_0.3.0       generics_0.0.2    viridisLite_0.3.0
##  [9] htmltools_0.4.0   yaml_2.2.1        rlang_0.4.6       pillar_1.4.4     
## [13] glue_1.4.1        withr_2.2.0       DBI_1.1.0         dbplyr_1.4.3     
## [17] modelr_0.1.8      readxl_1.3.1      lifecycle_0.2.0   munsell_0.5.0    
## [21] gtable_0.3.0      cellranger_1.1.0  rvest_0.3.5       htmlwidgets_1.5.1
## [25] evaluate_0.14     knitr_1.28        fansi_0.4.1       broom_0.5.6      
## [29] Rcpp_1.0.4.6      backports_1.1.7   jsonlite_1.6.1    fs_1.4.1         
## [33] hms_0.5.3         digest_0.6.25     stringi_1.4.6     cli_2.0.2        
## [37] tools_4.0.0       magrittr_1.5      lazyeval_0.2.2    crayon_1.3.4     
## [41] pkgconfig_2.0.3   ellipsis_0.3.1    data.table_1.12.8 xml2_1.3.2       
## [45] reprex_0.3.0      lubridate_1.7.8   assertthat_0.2.1  rmarkdown_2.1    
## [49] httr_1.4.1        rstudioapi_0.11   R6_2.4.1          nlme_3.1-147     
## [53] compiler_4.0.0

Executive Summary

In this report, we analyze the mtcars dataset and evaluate the relationship between mpg and transmission type (automatic/manual). We also dig into other variables that affect fuel efficiency, such as vehicle weight, horsepower, the quarter-mile time (qsec), and others in the dataset. In addition, we use a package called olsrr, which makes it easy to cycle through predictor combinations for ordinary least squares regression models, and we combine it with principal component analysis (prcomp). We selected the model with the highest adjusted R-squared.

These two model-selection approaches (olsrr and PCA) gave us insight into the following:

Dataset History

According to the help documentation ?mtcars: “The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).”

Description of variables:

  1. mpg: Miles/(US) gallon
  2. cyl: Number of cylinders
  3. disp: Displacement (cu.in.)
  4. hp: Gross horsepower
  5. drat: Rear axle ratio
  6. wt: Weight (1000 lbs)
  7. qsec: 1/4 mile time
  8. vs: Engine (0 = V-shaped, 1 = straight)
  9. am: Transmission (0 = automatic, 1 = manual)
  10. gear: Number of forward gears
  11. carb: Number of carburetors

Exploratory Data Analysis

Formulate our question(s)

  1. Is an automatic or manual transmission better for MPG?
  2. Quantify the MPG difference between automatic and manual transmissions?

Load Data

data(mtcars)

Check the Packaging

This is a fast operation that gives us the number of observations and variables, along with their respective data types. Using str(), we can quickly spot opportunities for tidying.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

A few opportunities I see in the above output are as follows:

  • The disp and hp values are on a much larger scale than the remaining variables; rescaling could help the analysis.
  • Variables cyl, vs, am, gear, and carb are potential factor types (see the sketch below).

Using str() gives us a quick and powerful way to inspect our data and make it tidier.
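
As a quick sketch of that factor conversion (done on a copy of the data here, since later chunks expect the numeric columns):

# base R sketch: convert the candidate columns to factors on a copy of mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
mtcars_factored <- mtcars
mtcars_factored[factor_cols] <- lapply(mtcars_factored[factor_cols], factor)
str(mtcars_factored)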

Validate with at least one external data source

According to the Bureau of Transportation Statistics, the average fuel efficiency for light-duty vehicles (passenger cars) in 2017 was 39.4 mpg.

We also added a variable called countryGroup to mtcars (stored as mtcars$country), showing the country or continent where each car was made (a personal assumption).

countryGroup <-
  c(
    rep("JAPAN", 3),
    rep("USA", 4),
    rep("EUROPE", 7),
    rep("USA", 3),
    "EUROPE",
    rep("JAPAN", 3),
    rep("USA", 4),
    rep("EUROPE", 3),
    "USA",
    rep("EUROPE", 3)
  )
mtcars$country <- countryGroup
quantile(mtcars$mpg, seq(0, 1, 0.1))
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
## 10.40 14.34 15.20 15.98 17.92 19.20 21.00 21.47 24.08 30.09 33.90

In the table above, we zoomed in on our data using the quantile function. If we were to add the external value of 39.4 mpg to the mtcars dataset, it would become the new maximum and a clear outlier.

Many technological advances and environmental-policy changes occurred between 1973 and 2017. Some of those advancements, such as lighter metals and better safety equipment, increase fuel efficiency and decrease accident risk.

There are also advances in vehicle electrification, which increases MPGe (mpg-equivalent), and in driverless vehicles (off-topic here).

Over those 44 years, fuel efficiency improved at an average rate of about 0.44 mpg per year, i.e. 19.31 mpg / 44 years (note: 39.4 − 20.09 = 19.31). This is pretty bad.
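
The arithmetic behind that rate, as a one-line check:

# average annual improvement, 1973-2017, using our dataset mean of 20.09 mpg
(39.4 - 20.09) / (2017 - 1973)  # ~0.44 mpg per year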

The Tesla Model S, by comparison, averages 115 city / 107 highway MPGe.

Knowing the national average mpg gives us insight into the following:

  • The magnitude is right (miles per gallon, not kilometers per gallon).
  • The max mpg in our dataset is 33.90, vs. the 2017 mean of 39.4 mpg.
  • The maximum outlier in 1973 is less than the mean in 2017.
  • Adding the country of origin will give us better insight into how the cars group together.

Continue on the EDA…

Principal Component Analysis

Let’s review the relationship of the variables to our vehicles:

pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)
pca_x_df <- data.frame(pca[["x"]])
pca_name <- colnames(pca_x_df)
pca.var <- pca$sdev ^ 2
pca.var.per <- round(pca.var / sum(pca.var) * 100, 1)
pca_df <- data.frame(name = pca_name, pca_percent = pca.var.per)
pca_df %>% ggplot(aes(x = name, y = pca_percent)) + geom_col()
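
A cumulative view of the same numbers (reusing pca.var.per from above) shows how quickly the first few components account for most of the variance:

# cumulative percentage of variance explained by the first k components
cumsum(pca.var.per)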

pca_g <- ggplotly(ggbiplot(
  pca,
  labels = rownames(mtcars),
  obs.scale = .5,
  var.scale = .5,
  ellipse = TRUE,
  groups = mtcars$country
))
pca_g

Not all cars are made equal. We can see from the graph above that Japanese- and European-made vehicles tend to have higher mpg and lower hp, cyl, disp, and wt than their USA counterparts.

Let’s review PC3 and PC4:

ggplotly(ggbiplot(
  pca,
  labels = rownames(mtcars),
  obs.scale = .5,
  var.scale = .5,
  ellipse = TRUE,
  choices = c(3, 4),
  groups = mtcars$country
))

It’s hard to draw any insight from the above because these components explain only a small percentage of the variance.

Let’s take a look at the breakdown of countries vs. am:

group_country_am <- mtcars %>% group_by(country, am)
group_country_am <- dplyr::summarise(group_country_am, count = n())
group_country_am <-
  melt(group_country_am, id.vars = c("country", "am"))
ggplot(group_country_am, aes(x = country, y = value)) +
  geom_bar(stat = "identity", alpha = .9) +
  facet_grid(~ am) +
  labs(title = "Country by Auto Transmission (0=AT, 1=MT)",
       y = "Count", x = "Country/Continent")

mpg_rank <- mtcars %>%
  arrange(desc(mpg)) %>%
  select(country, mpg, am) %>%
  mutate(am_abbr = ifelse(am == 0, "AT", "MT"))  # use the arranged column, not mtcars$am

top10 <- mpg_rank %>%
  mutate(ranking = rank(desc(mpg), ties.method = "first"))
head(top10 %>% select(country, mpg, am_abbr, ranking), 10)

Japan has a total of 6 vehicles in our sample, and the table above shows that 4 of the Japanese vehicles are in the top 10.

Try a simple solution

Is an automatic or manual transmission better for MPG?

Data set description of “am”, from ?mtcars:

  • 0 = automatic transmission
  • 1 = manual transmission

amMean <- summarise(group_by(mtcars, AM = as.factor(am)), MN = mean(mpg))
mpgMean <- mean(mtcars$mpg)
amMean
by(mtcars$mpg, INDICES = mtcars$am, summary)
## mtcars$am: 0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   14.95   17.30   17.15   19.20   24.40 
## ------------------------------------------------------------ 
## mtcars$am: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   21.00   22.80   24.39   30.40   33.90

Manual transmission has a higher min, max, and mean than automatic transmission in our dataset. Let’s create a boxplot to visualize our findings.

g <- mtcars %>% ggplot(aes(y = mpg, x = factor(am)))  # factor(am) avoids the continuous-x warning
g + geom_boxplot(alpha = .3) +
  facet_grid(~ am) +
  geom_hline(yintercept = mean(mtcars$mpg), col = "red") +  # overall mean, set outside aes()
  theme(legend.position = "none") +
  stat_summary(fun = mean, geom = "point", colour = "blue")

Transmission Analysis:

  • The red line across the boxplots is the overall mpg mean, 20.090625.
  • The blue points are the respective means for AT and MT (verified in the sketch below).
  • 0 (automatic transmission) mean = 17.147.
  • 1 (manual transmission) mean = 24.392.
  • The overall mpg mean is greater than the automatic transmission mean: 20.09 > 17.15.
  • The manual transmission mean is greater than the overall mean, which is greater than the automatic transmission mean: 24.39 > 20.09 > 17.15.
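
A quick sketch verifying those means directly from the data:

mean(mtcars$mpg)                     # overall mean: ~20.09
tapply(mtcars$mpg, mtcars$am, mean)  # AT (0): ~17.15, MT (1): ~24.39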

Let’s review the counts of each transmission:

by_am <- mtcars %>% group_by(am)
dplyr::summarise(by_am, count = n())

Answer to question 1:

Based on the simple solution we applied, a manual transmission is better than an automatic transmission for fuel efficiency. We still need to keep poking at our data for insights.

Quantify the MPG difference between automatic and manual transmissions?

Manual transmission mpg is, on average, 7.245 mpg higher than automatic transmission mpg (24.392 − 17.147). We do not yet know how much of that difference is attributable to the transmission itself rather than to other variables.
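
The raw difference in group means, as a one-liner:

# MT minus AT group means; matches the lm() coefficient shown later
diff(tapply(mtcars$mpg, mtcars$am, mean))  # ~7.245 mpg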

We need to dig deeper in our analysis and keep challenging our simple solution.

Deep Dive: Challenge the “Simple Solution” and Build a Model

Hypothesis:

H0: AT mpg = MT mpg; there is no difference in fuel efficiency between automatic transmission and manual transmission.

Ha: AT mpg != MT mpg; there is a difference.

t.test(mtcars$mpg ~ factor(mtcars$am),
       paired = FALSE,
       var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mtcars$mpg by factor(mtcars$am)
## t = -4.1061, df = 30, p-value = 0.000285
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.84837  -3.64151
## sample estimates:
## mean in group 0 mean in group 1 
##        17.14737        24.39231

The p-value of 0.000285 means there is only a 0.0285% chance of observing a difference in means this large if the null hypothesis (AT mpg = MT mpg) were true, so we reject the null hypothesis: the mean of AT != the mean of MT.
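
For reference, the same test can be run with the formula interface, pulling out just the p-value:

t.test(mpg ~ factor(am), data = mtcars, var.equal = TRUE)$p.value  # ~0.000285

Next, let’s check out the linear model: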

fit <- lm(data = mtcars, mpg ~ factor(am))
summary(fit)
## 
## Call:
## lm(formula = mpg ~ factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## factor(am)1    7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

As noted above, we rejected the null hypothesis. The linear model shows the following:

  1. Automatic transmission has a mean of 17.147 (the intercept).
  2. The 7.245 is the estimated change in mpg going from automatic to manual transmission, i.e. the difference between the manual (am = 1) and automatic (am = 0) group means. Accordingly, 17.1473684 + 7.2449393 = mean(manual transmission) = 24.3923077.
  3. The adjusted R-squared is 0.3385 (extracted below), which means our model explains only about 34% of the variance.
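
The adjusted R-squared can be pulled directly from the fit (the inline expression that broke above was presumably doing this):

summary(fit)$adj.r.squared  # ~0.3385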

Let us review the plot and see if it is a good fit.

rm(list = ls())
attach(mtcars)
fit <- lm(am ~ mpg)
plot(mpg,
     am,
     frame.plot = FALSE,
     xlab = "MPG",
     ylab = "Transmission type (0=AT, 1=MT)")
abline(fit, col = "blue")
lines(lowess(mpg, am), type = "l", col = "red")

The above linear model is not a good fit: the blue regression line crosses below 0 and above 1, and the red lowess line crosses above 1. We need to find a better model that fits our data.

Logistic regression, part of the GLM family:

fit <- glm(factor(mtcars$am) ~ mtcars$mpg, family = binomial)
plot(
  x = mpg,
  y = am,
  frame.plot = FALSE,
  xlab = "MPG",
  ylab = "Transmission type (0=AT, 1=MT)"
)
points(mpg, fit$fitted.values , col = "red")

Exponentiated coefficients:

exp(fit$coefficients)
## (Intercept)  mtcars$mpg 
## 0.001355579 1.359379288

The second value is the exponentiated slope coefficient, 1.36: we estimate a 36% increase in the odds that the transmission is manual per 1 mpg increase.
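
This follows directly from the raw slope coefficient shown in the summary below:

exp(0.3070)  # ~1.359, i.e. ~36% higher odds of a manual transmission per additional mpg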

fit <- glm(factor(mtcars$am) ~ mtcars$mpg, family = binomial)
summary(fit)
## 
## Call:
## glm(formula = factor(mtcars$am) ~ mtcars$mpg, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5701  -0.7531  -0.4245   0.5866   2.0617  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -6.6035     2.3514  -2.808  0.00498 **
## mtcars$mpg    0.3070     0.1148   2.673  0.00751 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 29.675  on 30  degrees of freedom
## AIC: 33.675
## 
## Number of Fisher Scoring iterations: 5

The above model has an AIC of 33.68. Note that, unlike R-squared, AIC is not a percentage; it is a relative score for comparing models (lower is better). Comparing the residual deviance (29.68) with the null deviance (43.23), the model explains only about 31% of the deviance. We need a better model.
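
A rough, R-squared-like way to quantify this is the share of the null deviance the model removes:

# pseudo R-squared: proportion of null deviance explained by the model
1 - fit$deviance / fit$null.deviance  # ~0.31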

Model Selection

We build a sequence of nested models from the variables described in the Dataset History section above.


Update the datatype and scale the data:

data(mtcars)
scale_mtcars <- data.frame(scale(mtcars))
fit1 <- lm(data = scale_mtcars, mpg ~ factor(am))
fit2 <- update(fit1, mpg ~ factor(am) + disp + hp)
fit3 <- update(fit1, mpg ~ factor(am) + disp + hp + drat + wt)
fit4 <- update(fit1, mpg ~ factor(am) + disp + hp + drat + wt + qsec + factor(vs))
fit5 <-
  update(
    fit1,
    mpg ~ factor(am) + disp + hp + drat + wt + qsec + factor(vs) + factor(cyl) +
      factor(gear) + factor(carb)
  )
anova(fit1, fit2, fit3, fit4, fit5)
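
To complement the ANOVA comparison, a quick sketch of the adjusted R-squared for each nested fit:

sapply(list(fit1 = fit1, fit2 = fit2, fit3 = fit3, fit4 = fit4, fit5 = fit5),
       function(f) summary(f)$adj.r.squared)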

Models 2 and 3 seem to be good candidates. Let’s examine model 3 more closely and see if there is something we are missing.

summary(fit3)
## 
## Call:
## lm(formula = mpg ~ factor(am) + disp + hp + drat + wt, data = scale_mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5535 -0.3011 -0.1028  0.1986  0.9061 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                -0.11031    0.13145  -0.839  0.40902   
## factor(am)1.18990141802394  0.27154    0.26360   1.030  0.31243   
## disp                        0.09759    0.22272   0.438  0.66488   
## hp                         -0.45164    0.14253  -3.169  0.00389 **
## drat                        0.10185    0.12856   0.792  0.43543   
## wt                         -0.49041    0.18924  -2.591  0.01547 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4313 on 26 degrees of freedom
## Multiple R-squared:  0.844,  Adjusted R-squared:  0.814 
## F-statistic: 28.13 on 5 and 26 DF,  p-value: 1.037e-09
(Note: the odd coefficient name factor(am)1.18990141802394 in the summary above appears because am was scaled before being converted to a factor.)

library(olsrr)
data(mtcars)
# interaction-only model: every predictor crossed with transmission type
model <- lm(data = mtcars, mpg ~ factor(am):.)
find_model <- ols_step_all_possible(model)
head(find_model %>% filter(rank(desc(adjr), ties.method = "first") <= 10), 10)

The table above shows us the best candidate models ranked by adjusted R-squared. Let’s take the top one and see how it looks.

data(mtcars)
countryGroup <-
  c(
    rep("JAPAN", 3),
    rep("USA", 4),
    rep("EUROPE", 7),
    rep("USA", 3),
    "EUROPE",
    rep("JAPAN", 3),
    rep("USA", 4),
    rep("EUROPE", 3),
    "USA",
    rep("EUROPE", 3)
  )
mtcars$country <- countryGroup
bestfit <- lm(data = mtcars,
              mpg ~ factor(am):wt + factor(am):hp + factor(am):qsec +
                factor(am):factor(vs) + factor(am):factor(gear) +
                factor(am):carb + factor(am):country)
plot(bestfit)
## Warning: not plotting observations with leverage one:
##   21, 29

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Coefficient Explanation

The 95% confidence intervals below give the effect of each predictor separately for automatic (am = 0) and manual (am = 1) cars. For example, for manual-transmission cars each additional 1,000 lbs of weight is associated with a decrease of roughly 5.0 to 16.4 mpg, while for automatic cars the corresponding interval crosses zero.

confint(bestfit)
##                                   2.5 %       97.5 %
## (Intercept)                 17.48061188  55.34077972
## factor(am)0:wt              -0.35677751   5.21051370
## factor(am)1:wt             -16.42668727  -4.96329415
## factor(am)0:hp               0.03674741   0.18492616
## factor(am)1:hp              -0.03029621   0.08737513
## factor(am)0:qsec            -2.83966754  -0.23165048
## factor(am)1:qsec             2.31069618   6.93368664
## factor(am)0:factor(vs)1      1.13191918  10.88255742
## factor(am)1:factor(vs)1    -11.99520739   0.08145159
## factor(am)0:factor(gear)4    5.80733310  19.18283666
## factor(am)1:factor(gear)4 -118.19232301 -22.33093457
## factor(am)0:factor(gear)5            NA           NA
## factor(am)1:factor(gear)5 -113.36907289 -24.51058028
## factor(am)0:carb           -11.01492888  -3.85110632
## factor(am)1:carb            -0.68866745   2.82618705
## factor(am)0:countryJAPAN    -4.13592850   5.15637238
## factor(am)1:countryJAPAN    -4.54125645   1.84235226
## factor(am)0:countryUSA      -7.98647328  -1.60837748
## factor(am)1:countryUSA      -3.23756246  10.03960125

Conclusion

When we combine our PC1/PC2 biplot and the bestfit model, we can see the following: