빅데이터 분류분석 II: 라쏘 | 랜덤 포레스트 | 그래디언트 부스팅

glmnet 함수를 통한 라쏘 모형, 능형회귀, 변수 선택

glmnet과 모형행력

설명변수 중 범주형 변수가 있을 경우, glm() 함수는 자동을 모형 행렬을 생성한다. 이와 달리, glmnet() 함수는 모형행렬을 수동으로 만들어줘야 한다. model.matrix() 함구를 사용하면 되고, 절편항은 필요하지 않으므로’-1’을 모형식에 지정한다.

adult <- read.csv("data/adult.data", header = F, strip.white = T)  
names(adult) <- c("age", "workclass", "fnlwgt", "education", 
                  "education_num", "marial_status", "occupation",
                  "relationship", "race", "sex", "capital_gain",
                  "capital_loss", "hours_per_week", "natice_country", "wage")

set.seed(1711)
n <- nrow(adult)
idx <- 1:n
training_idx <- sample(idx, n * .60)
idx <- setdiff(idx, training_idx)
validate_idx <-sample(idx, n * .20)
test_idx <- setdiff(idx, validate_idx)
length(training_idx)

## [1] 19536

length(validate_idx)

## [1] 6512

length(test_idx)

## [1] 6513

training <- adult[training_idx, ]
validation <- adult[validate_idx, ]
test <- adult[test_idx, ]

xx <- model.matrix(wage ~ . -1, adult)
x <- xx[training_idx, ]
y <- ifelse(training$wage == ">50K", 1, 0)
dim(x)

## [1] 19536   101

모형 적합은 glmnet() 함수를 사용한다.

ad_glmnet_fit <- glmnet(x, y)
plot(ad_glmnet_fit)

자동 모형 선택, cv.glmnet

ad_cvfit <- cv.glmnet(x, y, family = "binomial")
plot(ad_cvfit)

나무 모형 적합

cvr_tr <- rpart(wage ~ ., data = training)
cvr_tr

## n= 19536 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 19536 4722 <=50K (0.75829238 0.24170762)  
##    2) relationship=Not-in-family,Other-relative,Own-child,Unmarried 10641  692 <=50K (0.93496852 0.06503148)  
##      4) capital_gain< 7073.5 10452  510 <=50K (0.95120551 0.04879449) *
##      5) capital_gain>=7073.5 189    7 >50K (0.03703704 0.96296296) *
##    3) relationship=Husband,Wife 8895 4030 <=50K (0.54693648 0.45306352)  
##      6) education=10th,11th,12th,1st-4th,5th-6th,7th-8th,9th,Assoc-acdm,Assoc-voc,HS-grad,Preschool,Some-college 6231 2120 <=50K (0.65976569 0.34023431)  
##       12) capital_gain< 5095.5 5905 1801 <=50K (0.69500423 0.30499577) *
##       13) capital_gain>=5095.5 326    7 >50K (0.02147239 0.97852761) *
##      7) education=Bachelors,Doctorate,Masters,Prof-school 2664  754 >50K (0.28303303 0.71696697) *

printcp(cvr_tr)

## 
## Classification tree:
## rpart(formula = wage ~ ., data = training)
## 
## Variables actually used in tree construction:
## [1] capital_gain education    relationship
## 
## Root node error: 4722/19536 = 0.24171
## 
## n= 19536 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.122406      0   1.00000 1.00000 0.012672
## 2 0.066074      2   0.75519 0.77446 0.011546
## 3 0.037061      3   0.68911 0.73528 0.011315
## 4 0.010000      4   0.65205 0.67874 0.010962

summary(cvr_tr)

## Call:
## rpart(formula = wage ~ ., data = training)
##   n= 19536 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.12240576      0 1.0000000 1.0000000 0.01267230
## 2 0.06607370      2 0.7551885 0.7744600 0.01154597
## 3 0.03706057      3 0.6891148 0.7352817 0.01131548
## 4 0.01000000      4 0.6520542 0.6787378 0.01096167
## 
## Variable importance
##   relationship  marial_status   capital_gain      education  education_num 
##             24             24             10              9              9 
##            sex     occupation            age hours_per_week 
##              8              7              6              3 
## 
## Node number 1: 19536 observations,    complexity param=0.1224058
##   predicted class=<=50K  expected loss=0.2417076  P(node) =1
##     class counts: 14814  4722
##    probabilities: 0.758 0.242 
##   left son=2 (10641 obs) right son=3 (8895 obs)
##   Primary splits:
##       relationship  splits as  RLLLLR, improve=1459.0090, (0 missing)
##       marial_status splits as  LRRLLLL, improve=1443.3630, (0 missing)
##       capital_gain  < 5119    to the left,  improve= 982.3756, (0 missing)
##       education     splits as  LLLLLLLLLRRLRLRL, improve= 725.5141, (0 missing)
##       education_num < 12.5    to the left,  improve= 725.5141, (0 missing)
##   Surrogate splits:
##       marial_status  splits as  LRRLLLL, agree=0.992, adj=0.983, (0 split)
##       sex            splits as  LR, agree=0.688, adj=0.316, (0 split)
##       age            < 32.5    to the left,  agree=0.652, adj=0.235, (0 split)
##       occupation     splits as  LLLRRRLLLLRRLLR, agree=0.625, adj=0.176, (0 split)
##       hours_per_week < 43.5    to the left,  agree=0.604, adj=0.131, (0 split)
## 
## Node number 2: 10641 observations,    complexity param=0.03706057
##   predicted class=<=50K  expected loss=0.06503148  P(node) =0.5446867
##     class counts:  9949   692
##    probabilities: 0.935 0.065 
##   left son=4 (10452 obs) right son=5 (189 obs)
##   Primary splits:
##       capital_gain   < 7073.5  to the left,  improve=310.28530, (0 missing)
##       education      splits as  LLLLLLLLLLRLRLRL, improve= 85.16586, (0 missing)
##       education_num  < 13.5    to the left,  improve= 85.16586, (0 missing)
##       hours_per_week < 44.5    to the left,  improve= 74.42321, (0 missing)
##       occupation     splits as  LLLLRLLLLLRLLLL, improve= 73.24884, (0 missing)
## 
## Node number 3: 8895 observations,    complexity param=0.1224058
##   predicted class=<=50K  expected loss=0.4530635  P(node) =0.4553133
##     class counts:  4865  4030
##    probabilities: 0.547 0.453 
##   left son=6 (6231 obs) right son=7 (2664 obs)
##   Primary splits:
##       education     splits as  LLLLLLLLLRRLRLRL, improve=529.7153, (0 missing)
##       education_num < 12.5    to the left,  improve=529.7153, (0 missing)
##       occupation    splits as  LL-LRLLLLLRLRRL, improve=526.1943, (0 missing)
##       capital_gain  < 5095.5  to the left,  improve=443.9957, (0 missing)
##       capital_loss  < 1782.5  to the left,  improve=159.0220, (0 missing)
##   Surrogate splits:
##       education_num  < 12.5    to the left,  agree=1.000, adj=1.000, (0 split)
##       occupation     splits as  LL-LRLLLLLRLLLL, agree=0.793, adj=0.309, (0 split)
##       capital_gain   < 10585.5 to the left,  agree=0.716, adj=0.053, (0 split)
##       natice_country splits as  LLLRLLLLLRLLLLL-LLLRRLLLRLLLLLRLLLLRRLLLLL, agree=0.708, adj=0.025, (0 split)
##       capital_loss   < 1894.5  to the left,  agree=0.706, adj=0.019, (0 split)
## 
## Node number 4: 10452 observations
##   predicted class=<=50K  expected loss=0.04879449  P(node) =0.5350123
##     class counts:  9942   510
##    probabilities: 0.951 0.049 
## 
## Node number 5: 189 observations
##   predicted class=>50K   expected loss=0.03703704  P(node) =0.009674447
##     class counts:     7   182
##    probabilities: 0.037 0.963 
## 
## Node number 6: 6231 observations,    complexity param=0.0660737
##   predicted class=<=50K  expected loss=0.3402343  P(node) =0.3189496
##     class counts:  4111  2120
##    probabilities: 0.660 0.340 
##   left son=12 (5905 obs) right son=13 (326 obs)
##   Primary splits:
##       capital_gain  < 5095.5  to the left,  improve=280.30190, (0 missing)
##       occupation    splits as  LR-LRLLLLLRRRRL, improve=140.16950, (0 missing)
##       education     splits as  LLRLLLLRR--R-L-R, improve=103.82930, (0 missing)
##       education_num < 7.5     to the left,  improve=103.82930, (0 missing)
##       capital_loss  < 1782.5  to the left,  improve= 78.02779, (0 missing)
## 
## Node number 7: 2664 observations
##   predicted class=>50K   expected loss=0.283033  P(node) =0.1363636
##     class counts:   754  1910
##    probabilities: 0.283 0.717 
## 
## Node number 12: 5905 observations
##   predicted class=<=50K  expected loss=0.3049958  P(node) =0.3022625
##     class counts:  4104  1801
##    probabilities: 0.695 0.305 
## 
## Node number 13: 326 observations
##   predicted class=>50K   expected loss=0.02147239  P(node) =0.01668714
##     class counts:     7   319
##    probabilities: 0.021 0.979

opar <- par(mfrow = c(1,1), xpd = NA)
plot(cvr_tr)
text(cvr_tr, use.n = T)

par(opar)

나무 모형 평가

y_obs <- ifelse(validation$wage == ">50K", 1, 0)
ad_glm_full <- glm(wage ~ ., data = training, family = binomial())

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

yhat_lm <- predict(ad_glm_full, newdata = validation, type = 'response')

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

yhat_tr <- predict(cvr_tr, validation)
yhat_tr <- yhat_tr[, ">50K"]
binomial_deviance(y_obs, yhat_tr)

## [1] 4780.445

pred_lm <- prediction(yhat_lm, y_obs)
perf_lm <- performance(pred_lm, measure = "tpr", x.measure = "fpr")
pred_tr <- prediction(yhat_tr, y_obs)
perf_tr <- performance(pred_tr, measure = "tpr", x.measure = "fpr")
plot(perf_lm, col = "black", main = "ROC Curve")
plot(perf_tr, col = "blue", add = T)
abline(0, 1)
legend("bottomright", inset = .1, legend = c("GLM", "Tree"),
       col = c("black", "blue"), lty = 1, lwd = 2)

performance(pred_tr, "auc")@y.values[[1]]

## [1] 0.8487345

나무 모형은 그 모형의 단순성 대문에 ROC 곡선도 부드러운 곡선이 아닌 꺽어진 직선으로 나타난다. GLM, glmnet 결과들에 비해 높은 이항편차와 낮은 AUC 값으로 부터, 예측 능력은 낮은 것을 알 수 있다.

랜덤 포레스트

set.seed(1711)
ad_rf <- randomForest(wage ~ ., training)
ad_rf

## 
## Call:
##  randomForest(formula = wage ~ ., data = training) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 17.89%
## Confusion matrix:
##       <=50K >50K class.error
## <=50K 14764   50 0.003375186
## >50K   3445 1277 0.729563744

tmp <- importance(ad_rf)
head(round(tmp[order(-tmp[,1]), 1, drop = F], 2), n = 10)

##                MeanDecreaseGini
## relationship             757.94
## capital_gain             723.89
## age                      704.82
## fnlwgt                   653.45
## occupation               602.60
## marial_status            561.30
## education                451.74
## hours_per_week           438.36
## education_num            373.36
## workclass                241.84

varImpPlot(ad_rf)

랜덤 포레스트 예측

predict(ad_rf, newdata = adult[1:5, ])

##     1     2     3     4     5 
## <=50K <=50K <=50K <=50K <=50K 
## Levels: <=50K >50K

predict(ad_rf, newdata = adult[1:5, ], type = "prob")

##   <=50K  >50K
## 1 0.980 0.020
## 2 0.866 0.134
## 3 1.000 0.000
## 4 0.938 0.062
## 5 0.758 0.242
## attr(,"class")
## [1] "matrix" "votes"

yhat_rf <- predict(ad_rf, newdata = validation, type = 'prob')[, ">50K"]
binomial_deviance(y_obs, yhat_rf)

## [1] Inf

pred_rf <- prediction(yhat_rf, y_obs)
perf_rf <- performance(pred_rf, measure = "tpr", x.measure = "fpr")

plot(perf_lm, col = "black", main = "ROC Curve")
#plot(perf_glmnet, add = T, col = "blue")
plot(perf_rf, add = T, col = "red")
abline(0, 1)
legend("bottomright", inset = .1, legend = c("GLM", "glmnet", "RF"),
       col = c("black", "blue", "red"), lty = 1, lwd = 2)

performance(pred_tr, "auc")@y.values[[1]]

## [1] 0.8487345

에측 확률값 자체의 비교

# p1 <- data.frame(yhat_gmlnet, yhat_rf) %>% 
#   ggplot(aes(yhat_glmnet, yhat_fit)) +
#   geom_point(alpha=.5) +
#   geom_abline() +
#   geom_smooth()
# 
# p2 <- reshape2::melt(data.frmae(yhat_glmnet, yhat_rf)) %>% 
#   ggplot(aes(value, fill = variable)) +
#   geom_density(alpha=.5)
# 
# grid.arrange(p1, p2, ncol = 2)