2020-10-19
Underfitting: When the model performs poorly on the training data. This happens because the model is unable to capture the relationship between the input variables and the target values (Amazon Web Services, n.d.).
Overfitting: When the model performs well on the training data but does not perform well on the testing data (Amazon Web Services, n.d.).
Suppose two sets of observations, X and Y, are highly correlated.
Assuming some relationship exists between X and Y, we want to describe that relationship.
If the relationship can be described correctly as a mathematical expression, we can use it to predict new y values that arise in the future.
Define \(f(x) = y\) as the mapping between the two:
Common functional forms: simple linear regression, multiple regression, and machine learning (including deep learning).
(Koehrsen, 2018)
\(RSS(\beta) = (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)\)
Our goal is to find the coefficients of \(f(x)\) that minimize this objective function.
Using ordinary least squares (OLS) or gradient descent (GD), we find a set of parameters that minimizes the loss/residual (the value of the cost function), as in the sketch below.
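As a minimal sketch (with simulated data; every value and name here is illustrative, not from the original example), the OLS solution can be obtained from the normal equations \(\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\) or, equivalently, with lm():
# Minimal OLS sketch on simulated data (illustrative only)
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)                  # true model: y = 2 + 3x + noise
X <- cbind(1, x)                           # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # normal equations
beta_hat
coef(lm(y ~ x))                            # same estimates via lm()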
The model does not fit the training data well, so its predictive performance on the testing data is also poor.
Although OLS finds the least-squares solution in regression, the model's structure itself may be unable to match the distribution of the sample (or even of the population) and the relationships among the explanatory variables.
This usually happens with simple models, where the model's capacity is insufficient.
An underfitting model's estimates have low variance and high bias.
Variance represents how sensitive the estimate is to the data; bias represents the error of the estimate, as formalized in the decomposition below.
(Koehrsen, 2018)
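For reference, the standard bias-variance decomposition of the expected squared prediction error (a textbook identity, stated here for context rather than taken from the cited sources) is:
\[E\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]}_{\text{variance}} + \sigma_\epsilon^2\]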
Deepen the model's complexity to improve its predictive power.
Add explanatory variables or use polynomial regression to improve the model's ability to capture relationships among the variables.
However, raising model complexity far enough lets the fitted regression curve pass through every single training point, so the model matches the distribution of the sample data.
Such a model then performs worse than expected on testing data drawn from the same distribution, and there is a large gap between training loss and testing loss.
\(y = f(x) + \epsilon\)
While searching for the optimal parameter estimates, the model also fits the error term \(\epsilon\) that it cannot explain; it over-explains the relationship between \(f\) and \(\epsilon\) in the training data rather than the true underlying interaction among the variables, so the R-squared, coefficients, and p-values are misestimated.
(Koehrsen, 2018)
Underfitting: high error on both training data and testing data.
Overfitting: low error on training data but high error on testing data.
(Ghojogh & Crowley, 2019)
K-fold cross-validation: split the dataset into K subsets, train the model on each combination of K − 1 of them, and average the loss over the K held-out folds.
Leave-one-out cross-validation (LOOCV): remove one sample from the dataset, fit the model on the remaining data, and compute the loss on the removed sample; repeat for every sample and average the losses (see the sketch below).
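A minimal hand-rolled K-fold cross-validation sketch for a simple linear model (the simulated data and the fold count are illustrative assumptions):
# Hand-rolled K-fold CV for a simple linear model (illustrative data)
set.seed(1)
n <- 100; K <- 5
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
dat <- data.frame(x, y)
fold <- sample(rep(1:K, length.out = n))           # random fold assignment
cv_mse <- sapply(1:K, function(k) {
  fit <- lm(y ~ x, data = dat[fold != k, ])        # train on K - 1 folds
  pred <- predict(fit, newdata = dat[fold == k, ]) # predict the held-out fold
  mean((dat$y[fold == k] - pred)^2)                # fold MSE
})
mean(cv_mse)                                       # average loss across folds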
Model selection/cross-validation
In statistics, the core goal is to find independent variables (predictors) that usefully explain the variation in the dependent variable (response/target variable).
Training an ML model has a high time cost, so in practice we usually just watch a single model's training and validation loss during training.
In machine learning, the core goal is prediction, so testing data is split off to assess how the model handles data it has never seen.
A commonly used method to avoid underfitting and overfitting in regression and machine learning.
In many statistical techniques, the model parameter is estimated via minimizing some estimation criterion:
\(\min_\beta\ \mathcal{D}(\beta)\),
where
\(\mathcal{D}(\beta)\) is an estimation criterion measuring the discrepancy between \(y_n\) and \(f(x_n)\);
\(\beta\): a p-dimensional model parameter determining the shape of \(f\).
In regularized methods, the parameter is estimated via minimizing an estimation criterion with constraints:
\(\min_\beta\ \mathcal{D}(\beta)\), subject to \(\mathcal{R}(\beta) \leq \mathcal{C}\),
where
\(\mathcal{R}(\beta)\) is a regularizer (or penalty) measuring the "complexity" of \(\beta\);
\(\mathcal{C}\): a positive number representing some kind of "budget".
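Under standard convexity conditions, this constrained problem is equivalent by Lagrangian duality, for a suitable \(\lambda \geq 0\) corresponding to the budget \(\mathcal{C}\), to the penalized form used throughout the rest of this section:
\[\min_\beta\ \mathcal{D}(\beta) + \lambda \mathcal{R}(\beta)\]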
The number of predictors that may need to enter the model can exceed the sample size (i.e., \(p > n\)), and adding predictors with no explanatory power still drives \(R^2\) up without reflecting any real gain in explanatory power.
Regularized regression is used to keep the regression coefficients under control.
Linear regression: \(minimize[SS_E]\) (\(SS_E\): error sum of squares)
Regularized regression: \(minimize[SS_E+P]\) (\(P\): penalty term)
There are two common penalty terms, corresponding to ridge and lasso regression:
Ridge regression: \(minimize[SS_E+\lambda \sum_{j=1}^P \beta_j^2]\), which can dampen noise in the data
Lasso regression: \(minimize[SS_E+\lambda \sum_{j=1}^P | \beta_j |]\), which can keep predictors with no explanatory power out of the model (variable selection)
\(\lambda\) is a hyperparameter, tuned manually or selected by cross-validation.
Ridge and lasso regression, which one is better?
The elastic net penalty is a compromise between ridge and lasso, defined as
\[R_{elastic}(\beta) = \lambda\sum_{p=1}^{P}[(1 - \alpha){\beta_p}^2 + \alpha | \beta_p |],\]
where \(\alpha \in [0,1]\); we then \(minimize\ [SS_E + R_{elastic}(\beta)]\).
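In the glmnet package used in the example below, this mixing is exposed as the alpha argument: alpha = 0 gives ridge, alpha = 1 gives lasso, and intermediate values give the elastic net. A minimal sketch, assuming X and y as constructed in the baseball example that follows:
# Elastic net fit: alpha mixes ridge (0) and lasso (1)
library(glmnet)
fit_enet <- glmnet(X, y, alpha = 0.5, family = "gaussian")
plot(fit_enet, xvar = "lambda")   # coefficient paths across lambda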
Example: Salaries, Batting, and Master in Lahman
library(readxl)
library(dplyr)
library(Lahman)
library(glmnet)
library(Matrix)
library(GGally)
library(ggplot2)
tbl_s <- Salaries %>% tbl_df() %>% filter(yearID == 2015) %>% select(-yearID) %>%
mutate(salary = salary/10000)
tbl_b <- Batting %>% tbl_df() %>% filter(yearID == 2014) %>%
dplyr::select(-yearID, -stint, -teamID, -lgID) %>%
group_by(playerID) %>% summarise_each(funs(sum))
tbl_m <- Master %>% tbl_df() %>%
mutate(years_MLB = as.integer(as.Date("2014-10-29") - as.Date(debut)) / 365)
tbl_baseball <- tbl_s %>% left_join(tbl_b, by = "playerID") %>%
left_join(tbl_m, by = "playerID") %>%
dplyr::select(G:GIDP, years_MLB, salary) %>%
mutate_all(.funs = funs(replace(., is.na(.), 0)))
tbl_baseball
# A tibble: 817 x 19
G AB R H X2B X3B HR RBI SB CS BB SO IBB
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 25 70 9 14 2 0 1 4 0 1 3 10 0
2 22 34 0 1 0 0 0 0 0 0 0 14 0
3 3 2 0 1 0 0 0 1 0 0 0 1 0
4 33 54 2 6 2 0 0 2 0 0 3 15 0
5 0 0 0 0 0 0 0 0 0 0 0 0 0
6 19 2 0 0 0 0 0 0 0 0 0 0 0
7 47 9 0 1 0 0 0 0 0 0 0 4 0
8 109 406 75 122 39 1 19 69 9 3 64 110 10
9 41 129 6 29 8 0 1 7 0 0 3 24 0
10 13 0 0 0 0 0 0 0 0 0 0 0 0
# … with 807 more rows, and 6 more variables: HBP <dbl>, SH <dbl>, SF <dbl>,
# GIDP <dbl>, years_MLB <dbl>, salary <dbl>
X <- tbl_baseball %>% model.matrix(salary ~ (.) - 1, data = .)
str(X)
 num [1:817, 1:18] 25 22 3 33 0 19 47 109 41 13 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:817] "1" "2" "3" "4" ...
  ..$ : chr [1:18] "G" "AB" "R" "H" ...
 - attr(*, "assign")= int [1:18] 1 2 3 4 5 6 7 8 9 10 ...
y <- tbl_baseball$salary
str(y)
num [1:817] 50.9 51.2 50.8 140 52.4 ...
cor_matrix <- cor(cbind(X, y))
ggcorr(data = NULL, cor_matrix = cor_matrix)
set.seed(6094028)
idc_train <- sample(c(T, F), nrow(tbl_baseball),
replace = T, prob = c(.7, .3))
lambda_all <- exp(seq(10, -10, length.out = 100))
glmnet_l2_fit <- glmnet(X[idc_train, ], y[idc_train],
alpha = 0, lambda = lambda_all,
family = "gaussian", standardize = TRUE)
glmnet_l2_fit %>% plot(xvar = "lambda")
Example: X = change in dam water level, Y = water flowing out. Fit a polynomial regression predicting Y from X, adjusting the penalty parameter \(\lambda\) and observing how the loss changes (sketched in R below).
\(J(\theta) = \frac{1}{2m}\big[\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2\big]\)
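A hedged sketch of the same idea in R, with simulated stand-in data (all values and names here are illustrative assumptions): polynomial features plus an L2 penalty fit with glmnet, scanning \(\lambda\) and watching the training loss change.
# Polynomial ridge regression sketch (simulated dam-level data, illustrative)
library(glmnet)
set.seed(1)
x <- runif(50, -1, 1)                     # stand-in for dam level change
y <- sin(3 * x) + rnorm(50, sd = 0.2)     # stand-in for water out
X_poly <- poly(x, degree = 8, raw = TRUE) # degree-8 polynomial features
# Fit ridge (alpha = 0) over a grid of lambda and inspect training loss
fit <- glmnet(X_poly, y, alpha = 0, lambda = exp(seq(4, -8, length.out = 50)))
train_mse <- colMeans((predict(fit, newx = X_poly) - y)^2)
plot(log(fit$lambda), train_mse, type = "l",
     xlab = "log(lambda)", ylab = "training MSE")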
set.seed(6094028)
cv_l2_fit <- cv.glmnet(X[idc_train, ], y[idc_train],
alpha = 0, lambda = lambda_all,
family = "gaussian", nfolds = 5,
type.measure = "deviance")
plot(cv_l2_fit)
glmnet_l2_fit %>% plot(xvar = "lambda")
abline(v = log(cv_l2_fit$lambda.min), lty = 2)
y_hat_train_l2 <- glmnet_l2_fit %>%
  predict(s = cv_l2_fit$lambda.min, newx = X[idc_train, ], type = "link")
1 - mean((y[idc_train] - y_hat_train_l2)^2) / var(y[idc_train])
[1] 0.4289935
y_hat_test_l2 <- glmnet_l2_fit %>%
  predict(s = cv_l2_fit$lambda.min, newx = X[!idc_train, ], type = "link")
1 - mean((y[!idc_train] - y_hat_test_l2)^2) / var(y[!idc_train])
[1] 0.4311388
Data Science HW2: matrix factorization is used to predict each user's rating of an item; adjust the penalty coefficient \(\lambda\) and observe how the RMSE changes (a minimal sketch follows the objective below).
\(\min_{P, Q} \frac{1}{2}\sum_{(u,i) \in R} (r_{ui} - \mathbf{p}^T_u \mathbf{q}_i)^2 + \frac{\lambda}{2}(||\mathbf{p}_u||^2 + ||\mathbf{q}_i||^2)\)
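A minimal stochastic-gradient sketch of this regularized factorization (the toy rating triplets, latent dimension k, learning rate, and \(\lambda\) are all illustrative assumptions):
# Regularized matrix factorization via SGD (toy data, illustrative)
set.seed(1)
n_u <- 5; n_i <- 4; k <- 2; lambda <- 0.1; lr <- 0.01
ratings <- data.frame(u = c(1, 1, 2, 3, 4, 5),
                      i = c(1, 2, 2, 3, 4, 1),
                      r = c(5, 3, 4, 2, 1, 4))
P <- matrix(rnorm(n_u * k, sd = 0.1), n_u, k)  # user factors p_u
Q <- matrix(rnorm(n_i * k, sd = 0.1), n_i, k)  # item factors q_i
for (epoch in 1:200) {
  for (s in seq_len(nrow(ratings))) {
    u <- ratings$u[s]; i <- ratings$i[s]
    e <- ratings$r[s] - sum(P[u, ] * Q[i, ])   # error r_ui - p_u' q_i
    P[u, ] <- P[u, ] + lr * (e * Q[i, ] - lambda * P[u, ])  # gradient step on p_u
    Q[i, ] <- Q[i, ] + lr * (e * P[u, ] - lambda * Q[i, ])  # gradient step on q_i
  }
}
sqrt(mean((ratings$r - rowSums(P[ratings$u, ] * Q[ratings$i, ]))^2))  # training RMSE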
Data Science HW3: a neural network is used to recognize digit images; adjust the penalty coefficient \(\lambda\) and observe how the loss changes.
Amazon Web Services. (n.d.). Model fit: Underfitting vs. overfitting. Retrieved October 18, 2020, from https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html
Ghojogh, B., & Crowley, M. (2019). The theory behind overfitting, cross validation, regularization, bagging, and boosting: Tutorial.