Executive summary

This assignment is part of the Regression Models course project from Coursera. Given the mtcars data set, we are asked to answer two questions: 1) Is an automatic or manual transmission better for miles per gallon (mpg)? 2) What is the mpg difference between automatic and manual transmissions?

An initial exploratory analysis suggested that manual transmission was more fuel efficient; however, other variables, such as weight and the number of cylinders, had to be considered. We built an initial linear regression model, but it had collinearity and linearity issues. A second, reduced model found that the type of transmission was not significant; in addition, holding the other variables constant, heavier vehicles and V-shaped engines were less efficient. Nonetheless, this model is not reliable either, because the linearity issue persisted. Thus, the proposed questions cannot be answered with these linear regression models.

Packages

Sys.setlocale("LC_ALL","English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
library(tidyverse)
library(broom)
library(kableExtra)
library(easystats)

Exploratory data analyses

data("mtcars")

mtcars %>% 
  ggplot(aes(as_factor(am),
             mpg)) +
  geom_boxplot() +
  labs(x = "Transmission",
       y = "Miles/(US) gallon") +
  stat_summary(fun = "mean", 
               colour = "red", 
               size = 2, 
               geom = "point")+
  scale_x_discrete(labels = c("0" = "Automatic", 
                              "1" = "Manual")) +
  theme_classic()

An initial assessment shows that cars with a manual transmission average more miles per gallon than those with an automatic transmission. However, other variables must be considered before drawing conclusions.
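
To quantify this unadjusted gap, a quick Welch two-sample t-test can be run (a sketch, not part of the assignment; at this point am still has the original 0/1 coding):

# Unadjusted comparison of mpg by transmission (0 = automatic, 1 = manual).
# This ignores confounders such as weight, so it is only a first look.
t.test(mpg ~ am, data = mtcars)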

labels <- 
  c("0" = "Automatic", 
    "1" = "Manual")

mtcars %>% 
  ggplot(aes(wt,
             mpg,
             color = as_factor(vs),
             size = as_factor(cyl))) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ am, 
             labeller = labeller(am = labels)) +
  labs(x = "Weight (1000 lbs)",
       y = "Miles/(US) gallon",
       color = "Engine",
       size = "Number of cylinders") +
  scale_color_discrete(labels = c("V-shaped",
                                 "Straight")) +
  theme_classic()

Heavier cars with more cylinders are less fuel efficient. Furthermore, the consumption of manual and automatic vehicles seems to converge as weight and the number of cylinders increase. Let's prepare the data for modeling to learn more.
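
As a quick numeric check (a sketch, run before the recoding below while wt and cyl are still numeric), the correlations point the same way:

# Correlations of mpg with weight and number of cylinders (both negative).
mtcars %>% 
  summarise(cor_wt = cor(wt, mpg),
            cor_cyl = cor(cyl, mpg))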

Data wrangling

mtcars <- 
  mtcars %>% 
  mutate(vs = as_factor(vs),
         vs = fct_recode(vs,
                         V = "0",
                         S = "1"),
         vs = fct_relevel(vs,
                          "S"),
         am = as_factor(am),
         am = fct_recode(am,
                         automatic = "0",
                         manual = "1"),
         am = fct_relevel(am,
                          "manual"),
         cyl = as_factor(cyl),
         cyl = fct_inseq(cyl),
         gear = as_factor(gear),
         gear = fct_inseq(gear),
         carb = as_factor(carb),
         carb = fct_inseq(carb))
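
As an optional sanity check (a sketch), we can confirm that the recoded factors have the intended levels, with the first level as the reference category:

# List the levels of every factor column; the first level is the reference.
mtcars %>% 
  select(where(is.factor)) %>% 
  map(levels)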

Build a model

model_1 <- 
  lm(mpg ~ .,   # full model: regress mpg on all other variables
     data = mtcars)

model_1 %>% 
  glance() %>% 
  mutate(r.squared = r.squared %>% 
           round(2),
         statistic = statistic %>% 
           round(2), 
         p.value = p.value %>% 
           round(3)) %>% 
  pivot_longer(cols = everything(),
               names_to = "Statistics",
               values_to = "Values") %>%
  kable() %>% 
  kable_styling(latex_options = "striped",
                full_width = F,
                position = "left") %>% 
  scroll_box(width = "50%", height = "200px")
Statistics Values
r.squared 0.8900000
adj.r.squared 0.7790215
sigma 2.8331687
statistic 7.8300000
p.value 0.0000000
df 16.0000000
logLik -66.6077262
AIC 169.2154524
BIC 195.5986987
deviance 120.4026720
df.residual 15.0000000
nobs 32.0000000
model_1 %>% 
  tidy() %>%
  mutate(p.value = p.value %>% 
           round(3),
         estimate = estimate %>% 
           round(3)) %>%
  kable() %>% 
  kable_styling(latex_options = "striped",
                full_width = F,
                position = "left") %>% 
  scroll_box(width = "100%", height = "200px")
term estimate std.error statistic p.value
(Intercept) 27.022 19.0438676 1.4189396 0.176
cyl6 -2.649 3.0408904 -0.8710262 0.397
cyl8 -0.336 7.1595395 -0.0469532 0.963
disp 0.036 0.0318992 1.1143329 0.283
hp -0.071 0.0394256 -1.7883534 0.094
drat 1.183 2.4834846 0.4762784 0.641
wt -4.530 2.5387458 -1.7842573 0.095
qsec 0.368 0.9353957 0.3932505 0.700
vsV -1.931 2.8712578 -0.6724755 0.512
amautomatic -1.212 3.2135451 -0.3771896 0.711
gear4 1.114 3.7995173 0.2932886 0.773
gear5 2.528 3.7363580 0.6767007 0.509
carb2 -0.979 2.3179745 -0.4225044 0.679
carb3 3.000 4.2935461 0.6986390 0.495
carb4 1.091 4.4496199 0.2452845 0.810
carb6 4.478 6.3840624 0.7013668 0.494
carb8 7.250 8.3605664 0.8672153 0.399

The full model explained 89% of the variance (R2 = 0.89) and was significant: F(16, 15) = 7.83, p < .001. However, no individual predictor was significant (p < .05), a pattern that often signals multicollinearity. Let's check the linear regression assumptions.

Regression assumptions

Multicollinearity

collinearity <- 
  check_collinearity(model_1)

collinearity %>%
  kable() %>% 
  kable_styling(latex_options = "striped",
                full_width = F,
                position = "left") %>% 
  scroll_box(width = "100%", height = "200px")
Term VIF VIF_CI_low VIF_CI_high SE_factor Tolerance Tolerance_CI_low Tolerance_CI_high
cyl 128.120962 96.009237 171.085979 11.319053 0.0078051 0.0058450 0.0104157
disp 60.365687 45.311732 80.533898 7.769536 0.0165657 0.0124171 0.0220693
hp 28.219577 21.258708 37.572193 5.312210 0.0354364 0.0266154 0.0470395
drat 6.809662 5.239892 8.960623 2.609533 0.1468502 0.1115994 0.1908436
wt 23.830830 17.974903 31.706907 4.881683 0.0419625 0.0315389 0.0556331
qsec 10.790190 8.217760 14.279441 3.284842 0.0926768 0.0700308 0.1216877
vs 8.088166 6.196286 10.668848 2.843970 0.1236374 0.0937308 0.1613870
am 9.930495 7.574574 13.130634 3.151269 0.1006999 0.0761578 0.1320206
gear 50.852311 38.193417 67.819698 7.131081 0.0196648 0.0147450 0.0261825
carb 503.211851 376.669237 672.379817 22.432384 0.0019872 0.0014873 0.0026548
plot(collinearity)

Heteroscedasticity

heteroscedasticity <- 
  check_heteroscedasticity(model_1)

heteroscedasticity
## OK: Error variance appears to be homoscedastic (p = 0.417).
plot(heteroscedasticity)

Normality

normality <- 
  check_normality(model_1)

normality
## OK: residuals appear as normally distributed (p = 0.321).
plot(normality, 
     type = "qq")

Outliers and influential observations

outliers <- 
  check_outliers(model_1)

outliers
## 1 outlier detected: case 28.
## - Based on the following method and threshold: cook (1.01).
## - For variable: (Whole model).
plot(outliers)

mtcars %>% 
  rownames_to_column("car") %>%   # keep the car name as a column
  slice(28)

Linearity

augment(model_1) %>% 
  ggplot(aes(.fitted,
             .resid)) +
  geom_point() +
  stat_smooth() +
  theme_classic()

This model has collinearity and linearity problems. Moreover, although the Shapiro-Wilk test suggested that the residuals are normally distributed, visual inspection of the Q-Q plot was less convincing. In addition, case 28 (Lotus Europa) unduly influenced the model, even though it is a valid observation in the dataset. Given this, we will take three measures: 1) exclude case 28, since our objective is to evaluate the effect of the type of transmission on fuel consumption; 2) among the variables with high collinearity, keep weight (wt), as we expect it to have a substantial impact on car efficiency; and 3) log-transform the outcome variable to address the linearity (and normality) problems, as sketched below.
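
As a visual motivation for measure 3 (a sketch, not part of the assignment), we can compare mpg and log(mpg) against weight; the relationship should look closer to linear on the log scale:

# Side-by-side scatterplots of mpg and log(mpg) against weight, with linear fits.
mtcars %>% 
  mutate(log_mpg = log(mpg)) %>% 
  pivot_longer(c(mpg, log_mpg)) %>% 
  ggplot(aes(wt, value)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ name, scales = "free_y") +
  theme_classic()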

New model

mtcars_2 <-
  mtcars %>%
  select(mpg,
         am,
         drat,
         vs,
         wt) %>% 
  slice(-28)

model_2 <-
  lm(log(mpg) ~ am + drat + vs + wt,
     data = mtcars_2)

model_2 %>% 
  glance() %>% 
  mutate(r.squared = r.squared %>% 
           round(2),
         statistic = statistic %>% 
           round(2), 
         p.value = p.value %>% 
           round(3)) %>% 
  pivot_longer(cols = everything(),
               names_to = "Statistics",
               values_to = "Values") %>%
  kable() %>% 
  kable_styling(latex_options = "striped",
                full_width = F,
                position = "left") %>% 
  scroll_box(width = "50%", height = "200px")
Statistics Values
r.squared 0.8300000
adj.r.squared 0.8016808
sigma 0.1294071
statistic 31.3200000
p.value 0.0000000
df 4.0000000
logLik 22.1277698
AIC -32.2555396
BIC -23.6516163
deviance 0.4354010
df.residual 26.0000000
nobs 31.0000000
model_2 %>% 
  tidy() %>%
  mutate(p.value = p.value %>% 
           round(3),
         estimate = estimate %>% 
           round(3)) %>%
  kable() %>% 
  kable_styling(latex_options = "striped",
                full_width = F,
                position = "left") %>% 
  scroll_box(width = "100%", height = "200px")
term estimate std.error statistic p.value
(Intercept) 3.655 0.3573533 10.2276126 0.000
amautomatic -0.007 0.0807181 -0.0905634 0.929
drat 0.025 0.0749467 0.3349121 0.740
vsV -0.147 0.0622886 -2.3646349 0.026
wt -0.218 0.0440008 -4.9466971 0.000

Regression assumptions

Multicollinearity

collinearity_2 <- 
  check_collinearity(model_2)

collinearity_2
plot(collinearity_2)

Heteroscedasticity

heteroscedasticity_2 <- 
  check_heteroscedasticity(model_2)

heteroscedasticity_2
## OK: Error variance appears to be homoscedastic (p = 0.319).
plot(heteroscedasticity_2)

Normality

normality_2 <- 
  check_normality(model_2)

normality_2
## OK: residuals appear as normally distributed (p = 0.147).
plot(normality_2, 
     type = "qq")

Linearity

augment(model_2) %>% 
  ggplot(aes(.fitted,
             .resid)) +
  geom_point() +
  stat_smooth() +
  theme_classic()

The reduced model explained 83% of the variance (R2 = .83) and was significant: F(4, 26) = 31.32, p < .001. The engine shape (b = -.15, p = .03) and the vehicle weight (b = -.22, p < .001) predicted fuel consumption, while the type of transmission did not. Holding the other variables constant, heavier vehicles and V-shaped engines are less efficient. However, the model is not reliable: the linearity issue persists even after the log transformation of the dependent variable. Thus, the proposed questions cannot be answered with these linear regression models.
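
For completeness, although the model failed the linearity check, the log-scale coefficients can be back-transformed into multiplicative effects (a sketch; values are approximate):

# Each additional 1000 lbs multiplies expected mpg by exp(-0.218) ~ 0.80,
# i.e. roughly 20% lower mileage; a V-shaped engine by exp(-0.147) ~ 0.86.
exp(coef(model_2)) %>% round(3)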