This assignment is part of the regression models course project from Coursera. Given the mtcars data set, we are supposed to answer two questions:

1. Is an automatic or manual transmission better for MPG?
2. Can we quantify the MPG difference between automatic and manual transmissions?
An initial exploratory analysis suggested that manual transmission was more efficient; however, other variables, such as weight and the number of cylinders, had to be considered. We built an initial linear regression model with all available predictors, but it had collinearity and linearity problems. In a second, reduced model, the type of transmission was not a significant predictor; holding the other variables constant, heavier vehicles and those with V-shaped engines were less efficient. Nonetheless, this model is not reliable either, because the linearity issue persisted. Thus, we cannot reliably answer the proposed questions with these linear regression models.
Sys.setlocale("LC_ALL","English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
library(tidyverse)
library(broom)
library(kableExtra)
library(easystats)
data("mtcars")
mtcars %>%
  ggplot(aes(as_factor(am),
             mpg)) +
  geom_boxplot() +
  labs(x = "Transmission",
       y = "Miles/(US) gallon") +
  stat_summary(fun = "mean",
               colour = "red",
               size = 2,
               geom = "point") +
  scale_x_discrete(labels = c("0" = "Automatic",
                              "1" = "Manual")) +
  theme_classic()
An initial assessment shows that cars with a manual transmission get more miles per gallon, i.e., have lower fuel consumption, than those with an automatic transmission. However, other variables must be considered before taking a position.
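Before adjusting for anything, a simple two-sample comparison quantifies the raw gap the boxplot suggests; the snippet below is a minimal sketch added here, not part of the original analysis.
# Welch two-sample t-test of mpg by transmission type (0 = automatic, 1 = manual);
# this compares raw group means without adjusting for weight or cylinders
t.test(mpg ~ am, data = mtcars)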
labels <- c("0" = "Automatic",
            "1" = "Manual")

mtcars %>%
  ggplot(aes(wt,
             mpg,
             color = as_factor(vs),
             size = as_factor(cyl))) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ am,
             labeller = labeller(am = labels)) +
  labs(x = "Weight (1000 lbs)",
       y = "Miles/(US) gallon",
       color = "Engine",
       size = "Number of cylinders") +
  scale_color_discrete(labels = c("V-shaped",
                                  "Straight")) +
  theme_classic()
Heavier cars with more cylinders are less fuel efficient. Furthermore, the gap between manual and automatic vehicles seems to narrow as weight and the number of cylinders increase.
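As a rough numeric check of what the scatterplot suggests, the correlation matrix below is a small sketch added here, not one of the original chunks.
# pairwise correlations: mpg falls as weight and cylinder count rise
# (cyl is still numeric at this point, before the factor conversion below)
cor(mtcars[, c("mpg", "wt", "cyl")])
Let's prepare the data for modeling to learn more.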
mtcars <-
  mtcars %>%
  mutate(vs = as_factor(vs),
         vs = fct_recode(vs,
                         V = "0",
                         S = "1"),
         vs = fct_relevel(vs, "S"),
         am = as_factor(am),
         am = fct_recode(am,
                         automatic = "0",
                         manual = "1"),
         am = fct_relevel(am, "manual"),
         cyl = as_factor(cyl),
         cyl = fct_inseq(cyl),
         gear = as_factor(gear),
         gear = fct_inseq(gear),
         carb = as_factor(carb),
         carb = fct_inseq(carb))
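To confirm the recoding and releveling behaved as intended, a quick structural check can be run; this is a small sketch, not one of the original chunks.
# inspect the converted columns; "manual" and "S" should now be the reference levels
str(mtcars[, c("cyl", "vs", "am", "gear", "carb")])
levels(mtcars$am)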
model_1 <- lm(mpg ~ ., data = mtcars)
model_1 %>%
  glance() %>%
  mutate(r.squared = round(r.squared, 2),
         statistic = round(statistic, 2),
         p.value = round(p.value, 3)) %>%
  pivot_longer(cols = everything(),
               names_to = "Statistics",
               values_to = "Values") %>%
  kable() %>%
  kable_styling(latex_options = "striped",
                full_width = FALSE,
                position = "left") %>%
  scroll_box(width = "50%", height = "200px")
| Statistics | Values |
|---|---|
| r.squared | 0.8900000 |
| adj.r.squared | 0.7790215 |
| sigma | 2.8331687 |
| statistic | 7.8300000 |
| p.value | 0.0000000 |
| df | 16.0000000 |
| logLik | -66.6077262 |
| AIC | 169.2154524 |
| BIC | 195.5986987 |
| deviance | 120.4026720 |
| df.residual | 15.0000000 |
| nobs | 32.0000000 |
model_1 %>%
  tidy() %>%
  mutate(p.value = round(p.value, 3),
         estimate = round(estimate, 3)) %>%
  kable() %>%
  kable_styling(latex_options = "striped",
                full_width = FALSE,
                position = "left") %>%
  scroll_box(width = "100%", height = "200px")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 27.022 | 19.0438676 | 1.4189396 | 0.176 |
| cyl6 | -2.649 | 3.0408904 | -0.8710262 | 0.397 |
| cyl8 | -0.336 | 7.1595395 | -0.0469532 | 0.963 |
| disp | 0.036 | 0.0318992 | 1.1143329 | 0.283 |
| hp | -0.071 | 0.0394256 | -1.7883534 | 0.094 |
| drat | 1.183 | 2.4834846 | 0.4762784 | 0.641 |
| wt | -4.530 | 2.5387458 | -1.7842573 | 0.095 |
| qsec | 0.368 | 0.9353957 | 0.3932505 | 0.700 |
| vsV | -1.931 | 2.8712578 | -0.6724755 | 0.512 |
| amautomatic | -1.212 | 3.2135451 | -0.3771896 | 0.711 |
| gear4 | 1.114 | 3.7995173 | 0.2932886 | 0.773 |
| gear5 | 2.528 | 3.7363580 | 0.6767007 | 0.509 |
| carb2 | -0.979 | 2.3179745 | -0.4225044 | 0.679 |
| carb3 | 3.000 | 4.2935461 | 0.6986390 | 0.495 |
| carb4 | 1.091 | 4.4496199 | 0.2452845 | 0.810 |
| carb6 | 4.478 | 6.3840624 | 0.7013668 | 0.494 |
| carb8 | 7.250 | 8.3605664 | 0.8672153 | 0.399 |
The full model explained 89% of the variance (R2 = 0.89) and proved to be significant: F(16, 15) = 7.83, p < .001. However, no individual predictor was significant at the .05 level. A high R2 combined with no significant predictors is a classic symptom of collinearity, so let's check the linear regression assumptions.
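For reference, the easystats ecosystem also offers a single call that draws all the main diagnostic panels at once; the line below is an optional shortcut added here, while the original analysis runs each check separately.
# one-shot visual overview of the regression assumptions (performance package, attached via easystats)
check_model(model_1)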
collinearity <- check_collinearity(model_1)

collinearity %>%
  kable() %>%
  kable_styling(latex_options = "striped",
                full_width = FALSE,
                position = "left") %>%
  scroll_box(width = "100%", height = "200px")
| Term | VIF | VIF_CI_low | VIF_CI_high | SE_factor | Tolerance | Tolerance_CI_low | Tolerance_CI_high |
|---|---|---|---|---|---|---|---|
| cyl | 128.120962 | 96.009237 | 171.085979 | 11.319053 | 0.0078051 | 0.0058450 | 0.0104157 |
| disp | 60.365687 | 45.311732 | 80.533898 | 7.769536 | 0.0165657 | 0.0124171 | 0.0220693 |
| hp | 28.219577 | 21.258708 | 37.572193 | 5.312210 | 0.0354364 | 0.0266154 | 0.0470395 |
| drat | 6.809662 | 5.239892 | 8.960623 | 2.609533 | 0.1468502 | 0.1115994 | 0.1908436 |
| wt | 23.830830 | 17.974903 | 31.706907 | 4.881683 | 0.0419625 | 0.0315389 | 0.0556331 |
| qsec | 10.790190 | 8.217760 | 14.279441 | 3.284842 | 0.0926768 | 0.0700308 | 0.1216877 |
| vs | 8.088166 | 6.196286 | 10.668848 | 2.843970 | 0.1236374 | 0.0937308 | 0.1613870 |
| am | 9.930495 | 7.574574 | 13.130634 | 3.151269 | 0.1006999 | 0.0761578 | 0.1320206 |
| gear | 50.852311 | 38.193417 | 67.819698 | 7.131081 | 0.0196648 | 0.0147450 | 0.0261825 |
| carb | 503.211851 | 376.669237 | 672.379817 | 22.432384 | 0.0019872 | 0.0014873 | 0.0026548 |
plot(collinearity)
heteroscedasticity <- check_heteroscedasticity(model_1)
heteroscedasticity
## OK: Error variance appears to be homoscedastic (p = 0.417).
plot(heteroscedasticity)
normality <- check_normality(model_1)
normality
## OK: residuals appear as normally distributed (p = 0.321).
plot(normality, type = "qq")
outliers <- check_outliers(model_1)
outliers
## 1 outlier detected: case 28.
## - Based on the following method and threshold: cook (1.01).
## - For variable: (Whole model).
plot(outliers)
mtcars %>%
slice(28)
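Depending on the dplyr version, slice() may drop the data frame's row names, hiding which car case 28 is; the variant below is a sketch (not an original chunk) that makes the car name an explicit column first, assuming the earlier mutate() preserved the row names (recent dplyr versions do).
# tibble::rownames_to_column() keeps the car name visible after slicing
mtcars %>%
  rownames_to_column("car") %>%
  slice(28)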
augment(model_1) %>%
  ggplot(aes(.fitted,
             .resid)) +
  geom_point() +
  stat_smooth() +
  theme_classic()
This model has collinearity and linearity problems. Moreover, although the Shapiro-Wilk test indicates normally distributed residuals, the visual inspection of the Q-Q plot is less reassuring. In addition, case 28 (the Lotus Europa) strongly influenced the model, even though it is a valid observation in the dataset. Given this, we will take three measures: 1) exclude case 28, since our objective is to evaluate the effect of the type of transmission on fuel consumption, not to model this particular car; 2) among the variables with high collinearity, keep only weight (wt), as we expect it to have a substantial impact on car efficiency; and 3) log-transform the outcome variable to address the linearity (and normality) problems.
mtcars_2 <-
  mtcars %>%
  select(mpg, am, drat, vs, wt) %>%
  slice(-28)

model_2 <- lm(log(mpg) ~ am + drat + vs + wt,
              data = mtcars_2)
model_2 %>%
  glance() %>%
  mutate(r.squared = round(r.squared, 2),
         statistic = round(statistic, 2),
         p.value = round(p.value, 3)) %>%
  pivot_longer(cols = everything(),
               names_to = "Statistics",
               values_to = "Values") %>%
  kable() %>%
  kable_styling(latex_options = "striped",
                full_width = FALSE,
                position = "left") %>%
  scroll_box(width = "50%", height = "200px")
| Statistics | Values |
|---|---|
| r.squared | 0.8300000 |
| adj.r.squared | 0.8016808 |
| sigma | 0.1294071 |
| statistic | 31.3200000 |
| p.value | 0.0000000 |
| df | 4.0000000 |
| logLik | 22.1277698 |
| AIC | -32.2555396 |
| BIC | -23.6516163 |
| deviance | 0.4354010 |
| df.residual | 26.0000000 |
| nobs | 31.0000000 |
model_2 %>%
  tidy() %>%
  mutate(p.value = round(p.value, 3),
         estimate = round(estimate, 3)) %>%
  kable() %>%
  kable_styling(latex_options = "striped",
                full_width = FALSE,
                position = "left") %>%
  scroll_box(width = "100%", height = "200px")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.655 | 0.3573533 | 10.2276126 | 0.000 |
| amautomatic | -0.007 | 0.0807181 | -0.0905634 | 0.929 |
| drat | 0.025 | 0.0749467 | 0.3349121 | 0.740 |
| vsV | -0.147 | 0.0622886 | -2.3646349 | 0.026 |
| wt | -0.218 | 0.0440008 | -4.9466971 | 0.000 |
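Because the outcome is log(mpg), the raw coefficients are on the log scale; exponentiating them (a short sketch added here, not an original chunk) turns them into multiplicative effects on mpg.
# exp(b) is the multiplicative change in expected mpg per unit increase;
# e.g., exp(-0.218) ~ 0.80: about 20% fewer mpg per additional 1000 lbs
exp(coef(model_2)) %>% round(3)
Likewise, a V-shaped engine multiplies expected mpg by about exp(-0.147) ≈ 0.86, a reduction of roughly 14%, holding the other predictors constant.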
collinearity_2 <- check_collinearity(model_2)
collinearity_2
plot(collinearity_2)
heteroscedasticity_2 <- check_heteroscedasticity(model_2)
heteroscedasticity_2
## OK: Error variance appears to be homoscedastic (p = 0.319).
plot(heteroscedasticity_2)
normality_2 <- check_normality(model_2)
normality_2
## OK: residuals appear as normally distributed (p = 0.147).
plot(normality_2, type = "qq")
augment(model_2) %>%
  ggplot(aes(.fitted,
             .resid)) +
  geom_point() +
  stat_smooth() +
  theme_classic()
The reduced model explained 83% of the variance (R2 = .83) and proved to be significant: F(4, 26) = 31.32, p < .001. The engine's shape (b = -.15, p = .03) and the vehicle's weight (b = -.22, p < .001) predicted fuel consumption, while the type of transmission was not significant. Holding the other variables constant, heavier vehicles and those with V-shaped engines are less efficient. However, the model is still not reliable: the linearity issue persists in the residuals-versus-fitted plot, even after the log transformation of the dependent variable. Thus, we cannot reliably answer the proposed questions with these linear regression models.