The regression modeling process runs from theorizing, data collection, and data cleaning, through splitting the data into training and test sets, variable selection (feature selection), and improving the goodness of fit, to checking the assumptions of the finalised model, evaluating it on the test set (neither underfitting nor overfitting), and finally using the finalised model for causal interpretation and prediction.

knitr::opts_chunk$set(echo = TRUE)
if(!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, GGally, caret, car, lmtest)
options(digits = 3)
theme_set(theme_minimal())
set.seed(123)

1. Loading & Preprocessing & Exploring

# load
data("mtcars")

# partition dataset (train/test)
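# sketch (assumption, not in the original workflow): one way to create the
# split noted above, using the already-loaded caret package; the rest of this
# analysis still fits and checks models on the full mtcars data
in.train = createDataPartition(mtcars$mpg, p = 0.7, list = FALSE)
train.set = mtcars[in.train, ]
test.set = mtcars[-in.train, ]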

# encode categorical data
mtcars$cyl = factor(mtcars$cyl)
mtcars$vs = factor(mtcars$vs)
mtcars$gear = factor(mtcars$gear)
mtcars$carb = factor(mtcars$carb)
mtcars$am = factor(mtcars$am, labels = c("Automatic", "Manual"))

# check others
# 1. missing data
# 2. near zero variance
# 3. mostly NA variable
# 4. identification variable
# 5. correlation matrix (PCA)
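
# sketch (assumption: one possible implementation of the checks listed above,
# using base R and the already-loaded caret package)
sum(is.na(mtcars))                          # 1 & 3: missing values / mostly-NA variables
nearZeroVar(mtcars, saveMetrics = TRUE)     # 2: near-zero-variance predictors
# 4: car names are stored as rownames, so there is no identification column to drop
cor(mtcars[, sapply(mtcars, is.numeric)])   # 5: correlation matrix of the numeric variables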

2. Model building

# best fit
init.mod = lm(mpg ~ ., data = mtcars)
best.mod = step(init.mod, direction = "both", trace = FALSE)

# compare other model
# 1. anova(init.mod, best.mod)
# 2. t.test(mpg ~ am, data = mtcars)
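
# sketch of comparison 1 above: a partial F-test between the stepwise model and
# the full model (H0: the predictors dropped by step() contribute nothing)
anova(best.mod, init.mod)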

3. Assumption checking

1. Outlier

# influence point
plot(best.mod, which = 4)
abline(h = 4/nrow(mtcars), lty = 2, col = "red")

influence = c("Chrysler Imperial", "Toyota Corolla", "Toyota Corona")

# leverage point
best.mod %>% 
  hatvalues() %>% 
  sort() %>% 
  tail(3) %>% 
  names()
## [1] "Toyota Corona"       "Lincoln Continental" "Maserati Bora"
leverage = c("Toyota Corona", "Lincoln Continental", "Maserati Bora")

# outlier
data.frame(influence, leverage)
##           influence            leverage
## 1 Chrysler Imperial       Toyota Corona
## 2    Toyota Corolla Lincoln Continental
## 3     Toyota Corona       Maserati Bora
  • Problem:
    • Outliers impact the fitted model directly
    • An influential point is an outlier: it has an extreme value of y, so it has the power to move the regression line
    • A high-leverage point may be an outlier: it has an extreme value of x, so it has the potential to move the regression line
  • Identification:
    • Influential points by Cook’s distance, with a rule of thumb of 4/n
    • Leverage points by the hat values (hatvalues() in R)
  • Solution:
    • Influential points are removed from the dataset
    • Leverage points are assessed by comparing the model fitted with and without them (a sketch follows this list)
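
A minimal sketch of the with/without comparison for the leverage points (an addition, assuming that refitting the selected formula without those rows is an acceptable way to do it):

no.lev = mtcars[!(rownames(mtcars) %in% leverage), ]
coef(lm(formula(best.mod), data = no.lev)) # compare against coef(best.mod)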

2. Collinearity

# rule of thumb
best.mod %>% 
  vif() %>% 
  as.data.frame() %>% 
  mutate("Threshold" = 10^(1/2)) %>% 
  mutate("Collinearity" = ifelse(("GVIF^(1/(2*Df))" > "Threshold"),
                                 "Yes", "No"))
##   GVIF Df GVIF^(1/(2*Df)) Threshold Collinearity
## 1 5.82  2            1.55      3.16           No
## 2 4.70  1            2.17      3.16           No
## 3 4.01  1            2.00      3.16           No
## 4 2.59  1            1.61      3.16           No
  • Problem:
    • Larger variance > larger SE > wider CI > larger p-value > non-significant > coefficient treated as close to 0
    • Hard to explain the individual contribution of correlated variables
    • The more the VIF/GVIF increases, the less reliable the regression model is
  • Identification:
    • The rule of thumb for VIF is 10; for GVIF^(1/(2*Df)) it is sqrt(10) = 3.16
    • If below the threshold, we conclude there is no collinearity among the independent variables (x) in the model
    • If above the threshold, we conclude there is collinearity among the independent variables (x) in the model
  • Solution:
    • PCA to combine the correlated variables (a sketch follows this list)
    • Remove one of the correlated variables
    • Increase the data size
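
A minimal sketch of the PCA option above, applied to the numeric predictors only (an assumption about how the variables would be combined; the original analysis does not use PCA):

pca = prcomp(mtcars[, c("disp", "hp", "drat", "wt", "qsec")], scale. = TRUE)
summary(pca) # proportion of variance explained by each principal component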

3. Heteroscedasticity

# plot
plot(best.mod, which = 1)

# test
bptest(mpg ~ cyl + hp + wt + am, data = mtcars) # Breusch-Pagan test; similar to White's test
## 
##  studentized Breusch-Pagan test
## 
## data:  mpg ~ cyl + hp + wt + am
## BP = 8, df = 5, p-value = 0.1
  • Problem:
    • Estimation becomes inefficient
    • SEs are wrong > CIs are wrong
  • Identification:
    • Plot: look for an equal spread of residuals along the fitted line
    • Test for statistical inference (H0: Homoscedasticity; H1: Heteroscedasticity)
  • Solution:
    • Dependent variable (y) transformation (log/sqrt)
    • Adjust to robust SEs (a sketch follows this list)
    • Segment the data and fit a model for each segment
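
A minimal sketch of the robust-SE option above, using the already-loaded car and lmtest packages (one of several possible robust estimators):

coeftest(best.mod, vcov. = hccm(best.mod, type = "hc3"))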

4. Normality

# qq plot
plot(best.mod, which = 2)

# residual hist
hist(best.mod$residuals, probability = T)
lines(density(best.mod$residuals), lwd = 2, col = "red") # kernel density, default bandwidth
lines(density(best.mod$residuals, adjust = 1.5), lwd = 2, lty = 2, col = "blue") # smoother kernel density

# check
mean(best.mod$residuals) # close to 0
## [1] 1.3e-17
  • Problem:
    • The assumption to satisfy: the errors are normally distributed with mean 0
  • Identification:
    • QQ plot: the dots should all fall on the line
    • Histogram: the residuals should be approximately normally distributed (a Shapiro-Wilk test is sketched after this list)
  • Solution:
    • Dependent variable (y) transformation (log/sqrt)
    • Often related to the outlier problem
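
A minimal sketch of a formal normality check to supplement the plots (shapiro.test is base R; H0: the residuals are normally distributed):

shapiro.test(best.mod$residuals)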

5. Autocorrelation

# plot
acf(best.mod$residuals, type = "partial", main = "")

  • Problem:
    • The assumption to satisfy: the error covariances are 0 (no serial correlation)
  • Identification:
    • ACF plot: look for spikes outside the blue confidence bands (a Durbin-Watson test is sketched after this list)
  • Solution:
    • Use a time-series approach, such as differencing
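
A minimal sketch of the Durbin-Watson test from the already-loaded lmtest package (H0: no first-order autocorrelation in the residuals):

dwtest(best.mod)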

4. Model refining

# remove outlier
# rownames(mtcars) # 17, 20, 21
mtcars.1 = mtcars[-c(17, 20, 21), ]
# rownames(mtcars.1)

# best fit with no outlier
init.mod.1 = lm(mpg ~ ., data = mtcars.1)
best.mod.1 = step(init.mod.1, direction = "both", trace = FALSE)
# check again assumption
par(mfrow = c(3, 2)); plot(best.mod.1, which = c(1:6))

5. Analysis & Inference

summary(best.mod.1)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt, data = mtcars.1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.400 -0.875 -0.107  0.983  5.188 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.3793     1.8422   19.75  2.4e-16 ***
## cyl6         -3.0179     1.1829   -2.55    0.018 *  
## cyl8         -3.0287     1.8230   -1.66    0.110    
## hp           -0.0219     0.0097   -2.26    0.033 *  
## wt           -3.5098     0.6417   -5.47  1.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.96 on 24 degrees of freedom
## Multiple R-squared:  0.898,  Adjusted R-squared:  0.881 
## F-statistic: 52.9 on 4 and 24 DF,  p-value: 1.47e-11
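
Coefficient confidence intervals and an example prediction could round out the inference with the final model; a minimal sketch (the example car's values below are assumed, purely for illustration):

confint(best.mod.1) # 95% confidence intervals for the final coefficients
predict(best.mod.1,
        newdata = data.frame(cyl = factor(6, levels = c(4, 6, 8)),
                             hp = 120, wt = 3.0),
        interval = "prediction")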