2020-10-19

Outline

1. What are underfitting and overfitting?

2. How to avoid overfitting and underfitting?

3. Other examples

1. What are underfitting and overfitting?

Definition

  • Underfitting: the model performs poorly on the training data because it is unable to capture the relationship between the input variables and the target values (Amazon Web Services, 2020).

  • Overfitting: the model performs well on the training data but does not perform well on the testing data (Amazon Web Services, 2020).


Model/Function (1/2)

A model is a function mapping inputs to outputs

  • We observe two sets of values, X and Y, that are highly correlated.

  • Assume some relationship exists between X and Y; we want to describe that relationship.

  • If we can describe it correctly in mathematical form, we can predict new values of y in the future.

Model/Function (2/2)

  • Define the mapping between them as \(f(x) = y\):

    • \(y = f(x) + \epsilon\), where \(\epsilon\) is the error term
  • Common functional forms: simple linear regression, multiple regression, and machine learning models (including deep learning)


(Koehrsen, 2018)

Objective (loss/cost) function

\(RSS(\beta) = (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)\)

  • Our goal is to find the coefficients of \(f(x)\) that minimize the objective function.

  • Using ordinary least squares (OLS) or gradient descent (GD), we find a set of parameters that minimizes the loss/residual, i.e., the value of the cost function (see the sketch below).
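
A minimal sketch of this step (simulated data and made-up variable names, not the baseball example used later in these slides): generate data from \(y = f(x) + \epsilon\) and recover the coefficients with the closed-form OLS solution \(\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\), which matches what lm() returns.

# Simulate y = f(x) + epsilon and minimize RSS(beta) in closed form (OLS).
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y_sim <- 1 + 2 * x1 - 3 * x2 + rnorm(n)          # true f plus noise
X_sim <- cbind(1, x1, x2)                        # design matrix with intercept
beta_hat <- solve(t(X_sim) %*% X_sim, t(X_sim) %*% y_sim)
cbind(closed_form = beta_hat, lm = coef(lm(y_sim ~ x1 + x2)))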

Underfitting

  • The model does not fit the training data well, so its predictive performance on the testing data is also poor.

  • Even though OLS finds the least-squares solution in the regression, the model structure itself cannot adequately match the distribution of the sample (let alone of the population) or the relationships among the explanatory variables.

  • It usually happens with simple models: the model's capacity is insufficient.

  • An underfitting model yields estimates with low variance and high bias.

  • Variance reflects how sensitive the model is to the data; bias reflects the error of the estimates (made precise in the decomposition below).

(Koehrsen, 2018)
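
These two terms can be stated precisely with the standard bias-variance decomposition of the expected squared prediction error, a textbook identity for \(y = f(x) + \epsilon\) with \(E[\epsilon] = 0\) and \(Var(\epsilon) = \sigma_\epsilon^2\):

\[E\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(f(x) - E[\hat{f}(x)]\big)^2}_{\text{bias}^2} + \underbrace{E\Big[\big(\hat{f}(x) - E[\hat{f}(x)]\big)^2\Big]}_{\text{variance}} + \underbrace{\sigma_\epsilon^2}_{\text{irreducible error}}\]

An underfitting model is dominated by the first term, an overfitting model by the second.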

How to avoid underfitting?

  • Increase the model's complexity to improve its predictive ability.

  • Add explanatory variables or use polynomial regression to improve the model's ability to capture the relationships among variables.

  • Increasing model complexity lets the fitted regression curve follow the training data points closely, so the model matches the distribution of the sample data (see the sketch below).
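
A minimal sketch of this idea on simulated data (not the baseball example used later): a straight line underfits a curved relationship, and adding a quadratic term with poly() removes most of the training error.

# A linear model underfits a curved relationship; a quadratic term fixes it.
set.seed(2)
x <- runif(200, -3, 3)
y <- 1 + x^2 + rnorm(200, sd = 0.5)
fit_lin  <- lm(y ~ x)              # too simple: high bias
fit_quad <- lm(y ~ poly(x, 2))     # extra flexibility captures the curvature
c(mse_linear    = mean(residuals(fit_lin)^2),
  mse_quadratic = mean(residuals(fit_quad)^2))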

Overfitting

  • The model performs worse than expected on testing data drawn from the same distribution; there is a large gap between training loss and testing loss.

  • \(y = f(x) + \epsilon\)

  • While searching for the optimal parameter estimates, the error term \(\epsilon\) that the model cannot explain gets fitted as well: the model over-explains the relationship between \(f\) and \(\epsilon\) in the training data rather than the true underlying relationships among the variables, so the R-square, coefficients, and p-values are misestimated (see the sketch below).

(Koehrsen, 2018)
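
Continuing the simulated sketch above, an overly flexible polynomial drives the training loss down while the testing loss (on data from the same distribution) stays high, which is exactly the gap described in this section:

# Fit polynomials of increasing degree and compare training vs. testing MSE.
set.seed(3)
n <- 60
x_tr <- runif(n, -3, 3); y_tr <- 1 + x_tr^2 + rnorm(n)
x_te <- runif(n, -3, 3); y_te <- 1 + x_te^2 + rnorm(n)
for (d in c(2, 12)) {
  fit <- lm(y_tr ~ poly(x_tr, d))
  mse_tr <- mean(residuals(fit)^2)
  mse_te <- mean((y_te - predict(fit, newdata = data.frame(x_tr = x_te)))^2)
  cat(sprintf("degree %2d: train MSE %.2f, test MSE %.2f\n", d, mse_tr, mse_te))
}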

Brief Summary

  • Underfitting: high error on both the training data and the testing data

  • Overfitting: low error on the training data but high error on the testing data

(Ghojogh & Mark Crowley, 2019)


How to check if models are overfitting or underfitting?

Cross-validation

  • K-fold cross validation: split the data set into K subsets, train the model on the different combinations of folds, and compute the average loss.

  • Leave-One-Out cross validation: remove one sample from the data set, fit the model on the remaining data, compute the loss/cost on the removed sample, repeat this for every sample, and finally average the losses (a K-fold sketch in R follows below).
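
A minimal sketch of K-fold cross-validation written out by hand on simulated data (later slides use cv.glmnet, which performs the same procedure internally):

# 5-fold CV by hand: hold out each fold, fit on the rest, average the held-out MSE.
set.seed(4)
dat <- data.frame(x = rnorm(100))
dat$y <- 1 + 2 * dat$x + rnorm(100)
K <- 5
fold <- sample(rep(1:K, length.out = nrow(dat)))   # random fold assignment
cv_mse <- sapply(1:K, function(k) {
  fit <- lm(y ~ x, data = dat[fold != k, ])        # train on the other K - 1 folds
  mean((dat$y[fold == k] -
        predict(fit, newdata = dat[fold == k, ]))^2)
})
mean(cv_mse)                                       # average cross-validated loss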

Cross-validation in regression analysis and machine learning

Regression analysis

  • Model selection/cross-validation

  • The core goal is to find independent variables (predictors) that usefully explain the variance of the dependent variable (response/target variable).

Machine learning

  • Because training an ML model is time-consuming, we usually just monitor a single model's training and validation loss during training.

  • The core goal is prediction, so a testing set is held out to evaluate how well the model handles unseen data.

Regularization in Regression

  • A commonly used method to avoid underfitting and overfitting in regression and machine learning.

  • In regularized methods, the parameter is estimated via minimizing an estimation criterion with constraints.

Non-Regularized Estimation

In many statistical techniques, the model parameter is estimated via minimizing some estimation criterion

\(min_\beta\ \mathcal{D}(\beta)\), where

  • \(\mathcal{D}(\beta)\) is an estimation criterion measuring the discrepancy between \(y_n\) and \(f(x_n)\);

  • \(\beta\): a p-dimensional model parameter determining the shape of f.

Regularized Estimation

In regularized methods, the parameter is estimated via minimizing an estimation criterion with constraints

\(min_\beta\ \mathcal{D}(\beta)\), \(subject\ to\ \mathcal{R}(\beta) \leq \mathcal{C}\),

where

  • \(\mathcal{R}(\beta)\) is a regularizer (or penalty) measuring the "complexity" of \(\beta\);

  • \(\mathcal{C}\): a positive number representing some kind of “budget”.
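
For convex \(\mathcal{D}\) and \(\mathcal{R}\), this constrained problem is equivalent to a penalized (Lagrangian) form, which is the version used in the regression slides that follow; each budget \(\mathcal{C}\) corresponds to some multiplier \(\lambda \geq 0\):

\[min_\beta\ \mathcal{D}(\beta)\ \ subject\ to\ \ \mathcal{R}(\beta) \leq \mathcal{C} \quad\Longleftrightarrow\quad min_\beta\ \mathcal{D}(\beta) + \lambda\,\mathcal{R}(\beta).\]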

Regularized regression (Elastic Net)

  • The number of independent variables that might need to enter the model can exceed the sample size (i.e., \(p > n\)); moreover, adding predictors with no explanatory power still drives \(R^2\) up, so \(R^2\) no longer reflects the model's true explanatory power.

  • Regularized regression keeps the regression coefficients under control:

    • Linear regression: \(minimize[SS_E]\) (\(SS_E\): error sum of squares)

    • Regularized regression: \(minimize[SS_E+P]\) (\(P\): penalty term)

  • Two penalty terms are common, corresponding to ridge and lasso regression.

Representative Regularization Methods

  • Ridge regression: \(minimize[SS_E+\lambda \sum_{j=1}^P \beta_j^2]\); it shrinks the coefficients and reduces the influence of noise in the data.

  • Lasso regression: \(minimize[SS_E+\lambda \sum_{j=1}^P | \beta_j |]\); it can set coefficients exactly to zero, keeping uninformative predictors out of the model (variable selection).

  • \(\lambda\) is a hyperparameter, tuned manually or chosen by cross-validation.

Ridge versus Lasso

Ridge and lasso regression: which one is better?

  • In the case of many small coefficients, ridge regression is better.
  • In the case of many zero coefficients, lasso is better.

Elastic Net Regularization

The elastic net penalty is a compromise between ridge and lasso, defined as

\[R_{elastic}(\beta) = \sum_{j=1}^{P}\big[(1 - \alpha){\beta_j}^2 + \alpha | \beta_j |\big],\]

where \(\alpha \in [0,1]\); we then \(minimize\ [SS_E + R_{elastic}(\beta)]\) (see the sketch below).
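
A minimal sketch of the elastic net on simulated data with mostly zero coefficients. In glmnet, the alpha argument mixes the two penalties (alpha = 0 for ridge, alpha = 1 for lasso); note that glmnet's internal parameterization halves the squared-penalty term, so its \(\lambda\) is not numerically identical to the formula above.

# Elastic net on simulated sparse data: count non-zero coefficients by alpha.
library(glmnet)
set.seed(5)
n <- 100; p <- 20
X_sim <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, 1.5, rep(0, p - 3))            # only 3 truly non-zero
y_sim <- as.numeric(X_sim %*% beta_true + rnorm(n))
for (a in c(0, 0.5, 1)) {
  cv_fit <- cv.glmnet(X_sim, y_sim, alpha = a)       # lambda chosen by CV
  b <- as.numeric(coef(cv_fit, s = "lambda.min"))[-1]
  cat(sprintf("alpha = %.1f: %d non-zero coefficients\n", a, sum(b != 0)))
}

Ridge (alpha = 0) keeps every coefficient non-zero, while lasso and the elastic net typically zero out most of the uninformative ones.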

Implement elastic net in R

Example: Salaries, Batting, and Master in Lahman

  • Dependent variable: Salaries of baseball players
  • Independent variables: batting statistics and biographical (Master) data of the baseball players

Load in the required packages

library(readxl)
library(dplyr)
library(Lahman)
library(glmnet)
library(Matrix)
library(GGally)
library(ggplot2)

Reshape and merge the data sets

# 2015 salaries, rescaled to units of 10,000 dollars
tbl_s <- Salaries %>% tbl_df() %>% filter(yearID == 2015) %>% select(-yearID) %>%
  mutate(salary = salary/10000)
# 2014 batting statistics, summed over stints within the season
tbl_b <- Batting %>% tbl_df() %>% filter(yearID == 2014) %>%
  dplyr::select(-yearID, -stint, -teamID, -lgID) %>%
  group_by(playerID) %>% summarise_each(funs(sum))
# years in MLB since debut, as of 2014-10-29
tbl_m <- Master %>% tbl_df() %>% 
  mutate(years_MLB = as.integer(as.Date("2014-10-29") - as.Date(debut)) / 365)
# merge, keep the predictors and the response, and replace missing values with 0
tbl_baseball <- tbl_s %>% left_join(tbl_b, by = "playerID") %>%
  left_join(tbl_m, by = "playerID") %>% 
  dplyr::select(G:GIDP, years_MLB, salary) %>%
  mutate_all(.funs = funs(replace(., is.na(.), 0)))

The structure of tbl_baseball

# A tibble: 817 x 19
       G    AB     R     H   X2B   X3B    HR   RBI    SB    CS    BB    SO   IBB
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    25    70     9    14     2     0     1     4     0     1     3    10     0
 2    22    34     0     1     0     0     0     0     0     0     0    14     0
 3     3     2     0     1     0     0     0     1     0     0     0     1     0
 4    33    54     2     6     2     0     0     2     0     0     3    15     0
 5     0     0     0     0     0     0     0     0     0     0     0     0     0
 6    19     2     0     0     0     0     0     0     0     0     0     0     0
 7    47     9     0     1     0     0     0     0     0     0     0     4     0
 8   109   406    75   122    39     1    19    69     9     3    64   110    10
 9    41   129     6    29     8     0     1     7     0     0     3    24     0
10    13     0     0     0     0     0     0     0     0     0     0     0     0
# … with 807 more rows, and 6 more variables: HBP <dbl>, SH <dbl>, SF <dbl>,
#   GIDP <dbl>, years_MLB <dbl>, salary <dbl>

Generate the X Matrix and y Vector

X <- tbl_baseball %>% model.matrix(salary ~ (.) - 1, data = .); str(X)
 num [1:817, 1:18] 25 22 3 33 0 19 47 109 41 13 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:817] "1" "2" "3" "4" ...
  ..$ : chr [1:18] "G" "AB" "R" "H" ...
 - attr(*, "assign")= int [1:18] 1 2 3 4 5 6 7 8 9 10 ...
y <- tbl_baseball$salary; str(y)
 num [1:817] 50.9 51.2 50.8 140 52.4 ...

Correlation coefficients

# correlation matrix of the predictors and the response, visualized with GGally::ggcorr
cor_matrix <- cor(cbind(X, y))
ggcorr(data = NULL, cor_matrix = cor_matrix)

Train-test Splitting

set.seed(6094028)
# randomly assign each player to the training (~70%) or testing (~30%) set
idc_train <- sample(c(T, F), nrow(tbl_baseball),
                    replace = T, prob = c(.7, .3))

Conduct Ridge Regression

set.seed(6094028)
idc_train <- sample(c(T, F), nrow(tbl_baseball),
                    replace = T, prob = c(.7, .3))
# grid of candidate lambda values on a log scale
lambda_all <- exp(seq(10, -10, length.out = 100))
# alpha = 0 gives the ridge (L2) penalty; glmnet standardizes the predictors internally
glmnet_l2_fit <- glmnet(X[idc_train, ], y[idc_train],
                        alpha = 0, lambda = lambda_all,
                        family = "gaussian", standardize = TRUE)

Plot Ridge Estimation Result (In log-\(\lambda\) Scale)

glmnet_l2_fit %>% plot(xvar = "lambda")

Other example: Data Science HW1

X: dam water-level change; Y: water flowing out. Polynomial regression is used to predict Y from X; adjust the penalty parameter \(\lambda\) and observe how the loss changes.

\(J(\theta) = \frac{1}{2m}\big[\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2\big]\)
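
A small helper sketching how this cost could be computed in R (the function name and inputs are made up for illustration; as in the formula, the intercept \(\theta_0\) is left out of the penalty):

# Hypothetical helper: regularized cost J(theta) for a polynomial hypothesis
# h_theta(x) = theta_0 + theta_1 x + ... + theta_d x^d; theta_0 is not penalized.
cost_reg_poly <- function(theta, x, y, lambda) {
  d <- length(theta) - 1
  X_poly <- outer(x, 0:d, `^`)              # columns: 1, x, x^2, ..., x^d
  m <- length(y)
  h <- as.vector(X_poly %*% theta)          # h_theta(x) for every observation
  (sum((h - y)^2) + lambda * sum(theta[-1]^2)) / (2 * m)
}

Evaluating this cost for a fixed fit over a grid of \(\lambda\) values reproduces the kind of loss-versus-\(\lambda\) curve examined in the homework.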


Conduct 5-Fold CV for Ridge Regression

set.seed(6094028)
# 5-fold CV over the lambda grid; for family = "gaussian",
# type.measure = "deviance" corresponds to the mean squared error
cv_l2_fit <- cv.glmnet(X[idc_train, ], y[idc_train],
                       alpha = 0, lambda = lambda_all,
                       family = "gaussian", nfolds = 5,
                       type.measure = "deviance")

Plot 5 Fold CV Result for Ridge

plot(cv_l2_fit)

Plot Ridge Estimates Under Selected \(\lambda\)

glmnet_l2_fit %>% plot(xvar = "lambda")
abline(v = log(cv_l2_fit$lambda.min), lty = 2)

Training and Testing Performance of Ridge Regression Under Selected \(\lambda\)

# proportion of variance explained (an R^2-type measure) on the training set
y_hat_train_l2 <- glmnet_l2_fit %>%
  predict(s = cv_l2_fit$lambda.min, newx = X[idc_train, ], type = "link")
1 - mean((y[idc_train] - y_hat_train_l2)^2) / var(y[idc_train])
[1] 0.4289935
# the same measure on the held-out testing set
y_hat_test_l2 <- glmnet_l2_fit %>%
  predict(s = cv_l2_fit$lambda.min, newx = X[!idc_train,], type = "link")
1 - mean((y[!idc_train] - y_hat_test_l2)^2) / var(y[!idc_train])
[1] 0.4311388
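
Although the section is titled after the elastic net, the model fitted above is pure ridge (alpha = 0). A sketch of the analogous lasso fit on the same training split, for comparison, could look like this (output not shown):

# Analogous lasso fit (alpha = 1): lambda chosen by 5-fold CV on the training set,
# then the same R^2-type measure computed on the held-out testing set.
set.seed(6094028)
cv_l1_fit <- cv.glmnet(X[idc_train, ], y[idc_train],
                       alpha = 1, family = "gaussian",
                       nfolds = 5, type.measure = "deviance")
y_hat_test_l1 <- predict(cv_l1_fit, s = "lambda.min",
                         newx = X[!idc_train, ], type = "link")
1 - mean((y[!idc_train] - y_hat_test_l1)^2) / var(y[!idc_train])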

Other examples 1: Regularization in Matrix Factorization

Data Science HW2: matrix factorization is used to predict each user's ratings of items; adjust the penalty coefficient \(\lambda\) and observe how the RMSE changes.

\(min_{P, Q} \frac{1}{2}\sum_{(u,i) \in R} (r_{ui} - \mathbf{p}^T_u \mathbf{q}_i)^2 + \frac{\lambda}{2}(||\mathbf{p}_u||^2 + ||\mathbf{q}_i||^2)\)
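
A compact stochastic-gradient sketch of this objective on a tiny synthetic rating matrix (illustrative only, not the HW2 code; all names and sizes are made up):

# Regularized matrix factorization by SGD: learn P (users x k) and Q (items x k)
# so that p_u' q_i approximates r_ui, with an L2 penalty of strength lambda.
set.seed(6)
n_users <- 20; n_items <- 15; k <- 3
R <- matrix(sample(c(NA, 1:5), n_users * n_items, replace = TRUE,
                   prob = c(.7, rep(.06, 5))), n_users, n_items)
obs <- which(!is.na(R), arr.ind = TRUE)              # observed (u, i) pairs
P <- matrix(rnorm(n_users * k, sd = .1), n_users, k)
Q <- matrix(rnorm(n_items * k, sd = .1), n_items, k)
lambda <- 0.1; lr <- 0.01
for (epoch in 1:50) {
  for (s in sample(nrow(obs))) {
    u <- obs[s, 1]; i <- obs[s, 2]
    p_u <- P[u, ]; q_i <- Q[i, ]
    e <- R[u, i] - sum(p_u * q_i)                    # error r_ui - p_u' q_i
    P[u, ] <- p_u + lr * (e * q_i - lambda * p_u)    # gradient steps with L2 shrinkage
    Q[i, ] <- q_i + lr * (e * p_u - lambda * q_i)
  }
}
sqrt(mean((R[obs] - rowSums(P[obs[, 1], ] * Q[obs[, 2], ]))^2))  # RMSE on observed ratings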


Other examples 2: Deep learning - classification

Data Science HW3: a neural network is used to recognize digit images; adjust the penalty coefficient \(\lambda\) and observe how the loss changes (a small sketch follows below).
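
As a small illustrative sketch (using the nnet package on a built-in dataset rather than the homework's digit images), the decay argument of nnet() plays the role of the L2 penalty coefficient \(\lambda\):

# Weight decay (L2 regularization) in a single-hidden-layer network:
# vary decay and compare held-out accuracy.
library(nnet)
set.seed(7)
idx <- sample(nrow(iris), 100)                       # 100 training rows
for (lam in c(0, 0.01, 0.1)) {
  fit <- nnet(Species ~ ., data = iris[idx, ], size = 5,
              decay = lam, maxit = 200, trace = FALSE)
  pred <- predict(fit, iris[-idx, ], type = "class")
  cat(sprintf("decay = %.2f: test accuracy %.3f\n", lam,
              mean(pred == iris$Species[-idx])))
}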

References

Thank you for listening!

Any question or comment?

邱俊維、廖傑恩、莊文明 | Institute of Data Science, National Cheng Kung University