A regression modeling workflow typically runs: theorizing, data collection, data cleaning, splitting into training and test sets, variable (feature) selection, refining the model by goodness of fit, checking the finalized model's assumptions, evaluating it on the test set (guarding against both underfitting and overfitting), and finally using the finalized model for prediction and causal interpretation.
knitr::opts_chunk$set(echo = TRUE)
if(!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, GGally, caret, car, lmtest)
options(digits = 3)
theme_set(theme_minimal())
set.seed(123)
# load
data("mtcars")
# partition dataset (train/test)
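# a minimal split sketch (assumes caret's createDataPartition); kept commented out
# because the rest of this walkthrough fits on all 32 rows
# idx = createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
# train = mtcars[idx, ]; test = mtcars[-idx, ]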
# encode categorical data
mtcars$cyl = factor(mtcars$cyl)
mtcars$vs = factor(mtcars$vs)
mtcars$gear = factor(mtcars$gear)
mtcars$carb = factor(mtcars$carb)
mtcars$am = factor(mtcars$am, labels = c("Automatic", "Manual"))
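# sanity check (a quick sketch): confirm the recoded columns are now factors
sapply(mtcars, class)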
# check others
# 1. missing data
# 2. near zero variance
# 3. mostly NA variable
# 4. identification variable
# 5. highly correlated variables (correlation matrix / PCA candidates)
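# quick versions of the five checks (a sketch; the cutoffs are conventions, not rules)
colSums(is.na(mtcars)) # 1 & 3: mtcars has no missing values
nearZeroVar(mtcars, saveMetrics = TRUE) # 2: flags near-zero-variance columns
# 4: no identification column here; the car model sits in the row names
cor(mtcars[sapply(mtcars, is.numeric)]) # 5: correlations among numeric variables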
# best fit
init.mod = lm(mpg ~ ., data = mtcars)
best.mod = step(init.mod, direction = "both", trace = FALSE)
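# inspect what step() kept (the diagnostics below imply mpg ~ cyl + hp + wt + am)
formula(best.mod)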
# compare with alternative models, e.g.:
# 1. anova(init.mod, best.mod)
# 2. t.test(mpg ~ am, data = mtcars)
# influential points (Cook's distance; common cutoff 4/n)
plot(best.mod, which = 4)
abline(h = 4/nrow(mtcars), lty = 2, col = "red")
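# the same points can be pulled out programmatically (a sketch; should match the
# three names read off the plot and recorded below)
best.mod %>%
  cooks.distance() %>%
  sort() %>%
  tail(3) %>%
  names()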
influence = c("Chrysler Imperial", "Toyota Corolla", "Toyota Corona")
# high-leverage points (hat values)
best.mod %>%
  hatvalues() %>%
  sort() %>%
  tail(3) %>%
  names()
## [1] "Toyota Corona" "Lincoln Continental" "Maserati Bora"
leverage = c("Toyota Corona", "Lincoln Continental", "Maserati Bora")
# outlier candidates: compare the two lists (Toyota Corona appears in both)
data.frame(influence, leverage)
## influence leverage
## 1 Chrysler Imperial Toyota Corona
## 2 Toyota Corolla Lincoln Continental
## 3 Toyota Corona Maserati Bora
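# a complementary outlier check (a sketch): studentized residuals beyond +/- 2
which(abs(rstudent(best.mod)) > 2)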
# rule of thumb: GVIF^(1/(2*Df)) above sqrt(10) signals collinearity
best.mod %>%
  vif() %>%
  as.data.frame() %>%
  mutate(Threshold = 10^(1/2)) %>%
  mutate(Collinearity = ifelse(`GVIF^(1/(2*Df))` > Threshold,
                               "Yes", "No"))
## GVIF Df GVIF^(1/(2*Df)) Threshold Collinearity
## 1 5.82 2 1.55 3.16 No
## 2 4.70 1 2.17 3.16 No
## 3 4.01 1 2.00 3.16 No
## 4 2.59 1 1.61 3.16 No
# residuals vs. fitted (constant-variance check)
plot(best.mod, which = 1)
# formal test
bptest(mpg ~ cyl + hp + wt + am, data = mtcars) # Breusch-Pagan; closely related to White's test
##
## studentized Breusch-Pagan test
##
## data: mpg ~ cyl + hp + wt + am
## BP = 8, df = 5, p-value = 0.1
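# an alternative check (a sketch): car's score test for non-constant variance
ncvTest(best.mod)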
# qq plot
plot(best.mod, which = 2)
# residual hist
hist(best.mod$residuals, probability = TRUE)
lines(density(best.mod$residuals), lwd = 2, col = "red") # kernel density estimate
lines(density(best.mod$residuals, adjust = 1.5), lwd = 2, lty = 2, col = "blue") # smoother estimate (wider bandwidth)
# check: residuals should center on zero
mean(best.mod$residuals) # close to 0
## [1] 1.3e-17
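# a formal normality check (a sketch): Shapiro-Wilk on the residuals
shapiro.test(best.mod$residuals)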
# partial autocorrelation of residuals (independence check)
acf(best.mod$residuals, type = "partial", main = "")
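# a formal independence check (a sketch): Durbin-Watson test for serial correlation
dwtest(best.mod)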
# remove the influential points identified above
# rownames(mtcars) # 17 = Chrysler Imperial, 20 = Toyota Corolla, 21 = Toyota Corona
mtcars.1 = mtcars[-c(17, 20, 21), ]
# rownames(mtcars.1)
# best fit without the removed points
init.mod.1 = lm(mpg ~ ., data = mtcars.1)
best.mod.1 = step(init.mod.1, direction = "both", trace = FALSE)
# re-check the assumptions
par(mfrow = c(3, 2)); plot(best.mod.1, which = c(1:6))
summary(best.mod.1)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt, data = mtcars.1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.400 -0.875 -0.107 0.983 5.188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.3793 1.8422 19.75 2.4e-16 ***
## cyl6 -3.0179 1.1829 -2.55 0.018 *
## cyl8 -3.0287 1.8230 -1.66 0.110
## hp -0.0219 0.0097 -2.26 0.033 *
## wt -3.5098 0.6417 -5.47 1.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.96 on 24 degrees of freedom
## Multiple R-squared: 0.898, Adjusted R-squared: 0.881
## F-statistic: 52.9 on 4 and 24 DF, p-value: 1.47e-11
Multiple R-squared: 0.898
About 90% of the variance in the dependent variable (mpg; y) is explained by the model's inputs (the independent variables; x).
Significant coefficients among the numeric variables: hp, wt
Holding the other predictors fixed, a one-unit increase in x changes y by the coefficient; for example, each additional 1,000 lb of weight (wt) is associated with a 3.51 mpg decrease.
Significant coefficient among the dummy/indicator variables: cyl6
For a factor level, the coefficient is the difference in y relative to the baseline level: compared with a 4-cylinder engine, a 6-cylinder engine is associated with about 3.02 mpg lower expected mpg, other variables held constant.
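If the train/test split sketched at the load step had been carried through, the finalized model would finally be scored on the held-out rows to guard against under- and overfitting. The lines below are a hypothetical sketch that assumes a `test` data frame from that split.
# hypothetical hold-out evaluation (assumes `test` from the earlier split sketch)
# pred = predict(best.mod.1, newdata = test)
# postResample(pred, test$mpg) # caret: RMSE, R-squared, MAE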