Executive summary

This assignment is part of the Regression Models course project from Coursera. Given the mtcars data set, we are asked to answer two questions: 1) Is an automatic or manual transmission better for miles per gallon (mpg)? 2) What is the mpg difference between automatic and manual transmissions?

An initial exploratory analysis suggested that manual transmission was more fuel efficient; however, other variables, such as weight and the number of cylinders, had to be considered. We built an initial linear regression model, but it had collinearity and linearity issues. A second, reduced model found that the type of transmission was not significant; in addition, holding the other variables constant, heavier vehicles and V-shaped engines were less efficient. Nonetheless, this model is not reliable either, because the linearity issue persisted. Thus, the proposed questions cannot be answered with these linear regression models.

Packages

Sys.setlocale("LC_ALL","English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
library(tidyverse)
library(broom)
library(kableExtra)
library(easystats)

Exploratory data analyses

data("mtcars")

mtcars %>% 
  ggplot(aes(as_factor(am),
             mpg)) +
  geom_boxplot() +
  labs(x = "Transmission",
       y = "Miles/(US) gallon") +
  stat_summary(fun = "mean", 
               colour = "red", 
               size = 2, 
               geom = "point")+
  scale_x_discrete(labels = c("0" = "Automatic", 
                              "1" = "Manual")) +
  theme_classic()

An initial assessment shows that cars with a manual transmission average more miles per gallon than those with an automatic transmission. However, other variables must be considered before drawing conclusions.
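
To quantify this unadjusted gap, a quick Welch two-sample t-test can be run (a sketch, not part of the assignment; at this point am still has the original 0/1 coding):

# Unadjusted comparison of mpg by transmission (0 = automatic, 1 = manual).
# This ignores confounders such as weight, so it is only a first look.
t.test(mpg ~ am, data = mtcars)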

labels <- 
  c("0" = "Automatic", 
    "1" = "Manual")

mtcars %>% 
  ggplot(aes(wt,
             mpg,
             color = as_factor(vs),
             size = as_factor(cyl))) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ am, 
             labeller = labeller(am = labels)) +
  labs(x = "Weight (1000 lbs)",
       y = "Miles/(US) gallon",
       color = "Engine",
       size = "Number of cylinders") +
  scale_color_discrete(labels = c("V-shaped",
                                 "Straight")) +
  theme_classic()

Heavier cars with more cylinders are less fuel efficient. Furthermore, the consumption of manual and automatic vehicles seems to converge as weight and the number of cylinders increase. Let's prepare the data for modeling to learn more.
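
As a quick numeric check (a sketch, run before the recoding below while wt and cyl are still numeric), the correlations point the same way:

# Correlations of mpg with weight and number of cylinders (both negative).
mtcars %>% 
  summarise(cor_wt = cor(wt, mpg),
            cor_cyl = cor(cyl, mpg))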

Data wrangling

mtcars <- 
  mtcars %>% 
  mutate(vs = as_factor(vs),
         vs = fct_recode(vs,
                         V = "0",
                         S = "1"),
         vs = fct_relevel(vs,
                          "S"),
         am = as_factor(am),
         am = fct_recode(am,
                         automatic = "0",
                         manual = "1"),
         am = fct_relevel(am,
                          "manual"),
         cyl = as_factor(cyl),
         cyl = fct_inseq(cyl),
         gear = as_factor(gear),
         gear = fct_inseq(gear),
         carb = as_factor(carb),
         carb = fct_inseq(carb))
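
As an optional sanity check (a sketch), we can confirm that the recoded factors have the intended levels, with the first level as the reference category:

# List the levels of every factor column; the first level is the reference.
mtcars %>% 
  select(where(is.factor)) %>% 
  map(levels)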

Build a model

model_1 <- 
  lm(mpg ~ .,   # full model: regress mpg on all other variables
     data = mtcars)

model_1 %>% 
  glance() %>% 
  mutate(r.squared = r.squared %>% 
           round(2),
         statistic = statistic %>% 
           round(2), 
         p.value = p.value %>% 
           round(3)) %>% 
  pivot_longer(cols = everything(),
               names_to = "Statistics",
               values_to = "Values") %>%
  kable() %>% 
  kable_styling(latex_options = "striped",
                full_width = F,
                position = "left") %>% 
  scroll_box(width = "50%", height = "200px")
Statistics Values
r.squared 0.8900000
adj.r.squared 0.7790215
sigma 2.8331687
statistic 7.8300000
p.value 0.0000000
df 16.0000000
logLik -66.6077262
AIC 169.2154524
BIC 195.5986987
deviance 120.4026720
df.residual 15.0000000
nobs 32.0000000
model_1 %>% 
  tidy() %>%
  mutate(p.value = p.value %>% 
           round(3),
         estimate = estimate %>% 
           round(3)) %>%
  kable() %>% 
  kable_styling(latex_options = "striped",
                full_width = F,
                position = "left") %>% 
  scroll_box(width = "100%", height = "200px")
term estimate std.error statistic p.value
(Intercept) 27.022 19.0438676 1.4189396 0.176
cyl6 -2.649 3.0408904 -0.8710262 0.397
cyl8 -0.336 7.1595395 -0.0469532 0.963
disp 0.036 0.0318992 1.1143329 0.283
hp -0.071 0.0394256 -1.7883534 0.094
drat 1.183 2.4834846 0.4762784 0.641
wt -4.530 2.5387458 -1.7842573 0.095
qsec 0.368 0.9353957 0.3932505 0.700
vsV -1.931 2.8712578 -0.6724755 0.512
amautomatic -1.212 3.2135451 -0.3771896 0.711
gear4 1.114 3.7995173 0.2932886 0.773
gear5 2.528 3.7363580 0.6767007 0.509
carb2 -0.979 2.3179745 -0.4225044 0.679
carb3 3.000 4.2935461 0.6986390 0.495
carb4 1.091 4.4496199 0.2452845 0.810
carb6 4.478 6.3840624 0.7013668 0.494
carb8 7.250 8.3605664 0.8672153 0.399

The full model explained 89% of the variance (R2 = 0.89) and was significant: F(16, 15) = 7.83, p < .001. However, no individual predictor was significant (p < .05), a pattern that often signals multicollinearity. Let's check the linear regression assumptions.

Regression assumptions

Multicollinearity

collinearity <- 
  check_collinearity(model_1)

collinearity %>%
  kable() %>% 
  kable_styling(latex_options = "striped",
                full_width = F,
                position = "left") %>% 
  scroll_box(width = "100%", height = "200px")
Term VIF VIF_CI_low VIF_CI_high SE_factor Tolerance Tolerance_CI_low Tolerance_CI_high
cyl 128.120962 96.009237 171.085979 11.319053 0.0078051 0.0058450 0.0104157
disp 60.365687 45.311732 80.533898 7.769536 0.0165657 0.0124171 0.0220693
hp 28.219577 21.258708 37.572193 5.312210 0.0354364 0.0266154 0.0470395
drat 6.809662 5.239892 8.960623 2.609533 0.1468502 0.1115994 0.1908436
wt 23.830830 17.974903 31.706907 4.881683 0.0419625 0.0315389 0.0556331
qsec 10.790190 8.217760 14.279441 3.284842 0.0926768 0.0700308 0.1216877
vs 8.088166 6.196286 10.668848 2.843970 0.1236374 0.0937308 0.1613870
am 9.930495 7.574574 13.130634 3.151269 0.1006999 0.0761578 0.1320206
gear 50.852311 38.193417 67.819698 7.131081 0.0196648 0.0147450 0.0261825
carb 503.211851 376.669237 672.379817 22.432384 0.0019872 0.0014873 0.0026548
plot(collinearity)

Heteroscedasticity

heteroscedasticity <- 
  check_heteroscedasticity(model_1)

heteroscedasticity
## OK: Error variance appears to be homoscedastic (p = 0.417).
plot(heteroscedasticity)

Normality

normality <- 
  check_normality(model_1)

normality
## OK: residuals appear as normally distributed (p = 0.321).
plot(normality, 
     type = "qq")

Outliers and influential observations

outliers <- 
  check_outliers(model_1)

outliers
## 1 outlier detected: case 28.
## - Based on the following method and threshold: cook (1.01).
## - For variable: (Whole model).
plot(outliers)

mtcars %>% 
  rownames_to_column("car") %>%   # keep the car name as a column
  slice(28)

Linearity

augment(model_1) %>% 
  ggplot(aes(.fitted,
             .resid)) +
  geom_point() +
  stat_smooth() +
  theme_classic()

This model has collinearity and linearity problems. Moreover, although the Shapiro-Wilk test suggested that the residuals are normally distributed, visual inspection of the Q-Q plot was less convincing. In addition, case 28 (Lotus Europa) unduly influenced the model, even though it is a valid observation in the dataset. Given this, we will take three measures: 1) exclude case 28, since our objective is to evaluate the effect of the type of transmission on fuel consumption; 2) among the variables with high collinearity, keep weight (wt), as we expect it to have a substantial impact on car efficiency; and 3) log-transform the outcome variable to address the linearity (and normality) problems, as sketched below.
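
As a visual motivation for measure 3 (a sketch, not part of the assignment), we can compare mpg and log(mpg) against weight; the relationship should look closer to linear on the log scale:

# Side-by-side scatterplots of mpg and log(mpg) against weight, with linear fits.
mtcars %>% 
  mutate(log_mpg = log(mpg)) %>% 
  pivot_longer(c(mpg, log_mpg)) %>% 
  ggplot(aes(wt, value)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ name, scales = "free_y") +
  theme_classic()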

New model

mtcars_2 <-
  mtcars %>%
  select(mpg,
         am,
         drat,
         vs,
         wt) %>% 
  slice(-28)

model_2 <-
  lm(log(mpg) ~ am + drat + vs + wt,
     data = mtcars_2)

model_2 %>% 
  glance() %>% 
  mutate(r.squared = r.squared %>% 
           round(2),
         statistic = statistic %>% 
           round(2), 
         p.value = p.value %>% 
           round(3)) %>% 
  pivot_longer(cols = everything(),
               names_to = "Statistics",
               values_to = "Values") %>%
  kable() %>% 
  kable_styling(latex_options = "striped",
                full_width = F,
                position = "left") %>% 
  scroll_box(width = "50%", height = "200px")
Statistics Values
r.squared 0.8300000
adj.r.squared 0.8016808
sigma 0.1294071
statistic 31.3200000
p.value 0.0000000
df 4.0000000
logLik 22.1277698
AIC -32.2555396
BIC -23.6516163
deviance 0.4354010
df.residual 26.0000000
nobs 31.0000000
model_2 %>% 
  tidy() %>%
  mutate(p.value = p.value %>% 
           round(3),
         estimate = estimate %>% 
           round(3)) %>%
  kable() %>% 
  kable_styling(latex_options = "striped",
                full_width = F,
                position = "left") %>% 
  scroll_box(width = "100%", height = "200px")
term estimate std.error statistic p.value
(Intercept) 3.655 0.3573533 10.2276126 0.000
amautomatic -0.007 0.0807181 -0.0905634 0.929
drat 0.025 0.0749467 0.3349121 0.740
vsV -0.147 0.0622886 -2.3646349 0.026
wt -0.218 0.0440008 -4.9466971 0.000

Regression assumptions

Multicollinearity

collinearity_2 <- 
  check_collinearity(model_2)

collinearity_2
plot(collinearity_2)

Heteroscedasticity

heteroscedasticity_2 <- 
  check_heteroscedasticity(model_2)

heteroscedasticity_2
## OK: Error variance appears to be homoscedastic (p = 0.319).
plot(heteroscedasticity_2)

Normality

normality_2 <- 
  check_normality(model_2)

normality_2
## OK: residuals appear as normally distributed (p = 0.147).
plot(normality_2, 
     type = "qq")

Linearity

augment(model_2) %>% 
  ggplot(aes(.fitted,
             .resid)) +
  geom_point() +
  stat_smooth() +
  theme_classic()

The reduced model explained 83% of the variance (R2 = .83) and was significant: F(4, 26) = 31.32, p < .001. The engine shape (b = -.15, p = .03) and the vehicle weight (b = -.22, p < .001) predicted fuel consumption, while the type of transmission did not. Holding the other variables constant, heavier vehicles and V-shaped engines are less efficient. However, the model is not reliable: the linearity issue persists even after the log transformation of the dependent variable. Thus, the proposed questions cannot be answered with these linear regression models.
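
For completeness, although the model failed the linearity check, the log-scale coefficients can be back-transformed into multiplicative effects (a sketch; values are approximate):

# Each additional 1000 lbs multiplies expected mpg by exp(-0.218) ~ 0.80,
# i.e. roughly 20% lower mileage; a V-shaped engine by exp(-0.147) ~ 0.86.
exp(coef(model_2)) %>% round(3)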