설명변수 중 범주형 변수가 있을 경우, glm() 함수는 자동을 모형 행렬을 생성한다. 이와 달리, glmnet() 함수는 모형행렬을 수동으로 만들어줘야 한다. model.matrix() 함구를 사용하면 되고, 절편항은 필요하지 않으므로’-1’을 모형식에 지정한다.
adult <- read.csv("data/adult.data", header = F, strip.white = T)
names(adult) <- c("age", "workclass", "fnlwgt", "education",
"education_num", "marial_status", "occupation",
"relationship", "race", "sex", "capital_gain",
"capital_loss", "hours_per_week", "natice_country", "wage")
set.seed(1711)
n <- nrow(adult)
idx <- 1:n
training_idx <- sample(idx, n * .60)
idx <- setdiff(idx, training_idx)
validate_idx <-sample(idx, n * .20)
test_idx <- setdiff(idx, validate_idx)
length(training_idx)
## [1] 19536
length(validate_idx)
## [1] 6512
length(test_idx)
## [1] 6513
training <- adult[training_idx, ]
validation <- adult[validate_idx, ]
test <- adult[test_idx, ]
xx <- model.matrix(wage ~ . -1, adult)
x <- xx[training_idx, ]
y <- ifelse(training$wage == ">50K", 1, 0)
dim(x)
## [1] 19536 101
모형 적합은 glmnet() 함수를 사용한다.
ad_glmnet_fit <- glmnet(x, y)
plot(ad_glmnet_fit)
ad_cvfit <- cv.glmnet(x, y, family = "binomial")
plot(ad_cvfit)
cvr_tr <- rpart(wage ~ ., data = training)
cvr_tr
## n= 19536
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 19536 4722 <=50K (0.75829238 0.24170762)
## 2) relationship=Not-in-family,Other-relative,Own-child,Unmarried 10641 692 <=50K (0.93496852 0.06503148)
## 4) capital_gain< 7073.5 10452 510 <=50K (0.95120551 0.04879449) *
## 5) capital_gain>=7073.5 189 7 >50K (0.03703704 0.96296296) *
## 3) relationship=Husband,Wife 8895 4030 <=50K (0.54693648 0.45306352)
## 6) education=10th,11th,12th,1st-4th,5th-6th,7th-8th,9th,Assoc-acdm,Assoc-voc,HS-grad,Preschool,Some-college 6231 2120 <=50K (0.65976569 0.34023431)
## 12) capital_gain< 5095.5 5905 1801 <=50K (0.69500423 0.30499577) *
## 13) capital_gain>=5095.5 326 7 >50K (0.02147239 0.97852761) *
## 7) education=Bachelors,Doctorate,Masters,Prof-school 2664 754 >50K (0.28303303 0.71696697) *
printcp(cvr_tr)
##
## Classification tree:
## rpart(formula = wage ~ ., data = training)
##
## Variables actually used in tree construction:
## [1] capital_gain education relationship
##
## Root node error: 4722/19536 = 0.24171
##
## n= 19536
##
## CP nsplit rel error xerror xstd
## 1 0.122406 0 1.00000 1.00000 0.012672
## 2 0.066074 2 0.75519 0.77446 0.011546
## 3 0.037061 3 0.68911 0.73528 0.011315
## 4 0.010000 4 0.65205 0.67874 0.010962
summary(cvr_tr)
## Call:
## rpart(formula = wage ~ ., data = training)
## n= 19536
##
## CP nsplit rel error xerror xstd
## 1 0.12240576 0 1.0000000 1.0000000 0.01267230
## 2 0.06607370 2 0.7551885 0.7744600 0.01154597
## 3 0.03706057 3 0.6891148 0.7352817 0.01131548
## 4 0.01000000 4 0.6520542 0.6787378 0.01096167
##
## Variable importance
## relationship marial_status capital_gain education education_num
## 24 24 10 9 9
## sex occupation age hours_per_week
## 8 7 6 3
##
## Node number 1: 19536 observations, complexity param=0.1224058
## predicted class=<=50K expected loss=0.2417076 P(node) =1
## class counts: 14814 4722
## probabilities: 0.758 0.242
## left son=2 (10641 obs) right son=3 (8895 obs)
## Primary splits:
## relationship splits as RLLLLR, improve=1459.0090, (0 missing)
## marial_status splits as LRRLLLL, improve=1443.3630, (0 missing)
## capital_gain < 5119 to the left, improve= 982.3756, (0 missing)
## education splits as LLLLLLLLLRRLRLRL, improve= 725.5141, (0 missing)
## education_num < 12.5 to the left, improve= 725.5141, (0 missing)
## Surrogate splits:
## marial_status splits as LRRLLLL, agree=0.992, adj=0.983, (0 split)
## sex splits as LR, agree=0.688, adj=0.316, (0 split)
## age < 32.5 to the left, agree=0.652, adj=0.235, (0 split)
## occupation splits as LLLRRRLLLLRRLLR, agree=0.625, adj=0.176, (0 split)
## hours_per_week < 43.5 to the left, agree=0.604, adj=0.131, (0 split)
##
## Node number 2: 10641 observations, complexity param=0.03706057
## predicted class=<=50K expected loss=0.06503148 P(node) =0.5446867
## class counts: 9949 692
## probabilities: 0.935 0.065
## left son=4 (10452 obs) right son=5 (189 obs)
## Primary splits:
## capital_gain < 7073.5 to the left, improve=310.28530, (0 missing)
## education splits as LLLLLLLLLLRLRLRL, improve= 85.16586, (0 missing)
## education_num < 13.5 to the left, improve= 85.16586, (0 missing)
## hours_per_week < 44.5 to the left, improve= 74.42321, (0 missing)
## occupation splits as LLLLRLLLLLRLLLL, improve= 73.24884, (0 missing)
##
## Node number 3: 8895 observations, complexity param=0.1224058
## predicted class=<=50K expected loss=0.4530635 P(node) =0.4553133
## class counts: 4865 4030
## probabilities: 0.547 0.453
## left son=6 (6231 obs) right son=7 (2664 obs)
## Primary splits:
## education splits as LLLLLLLLLRRLRLRL, improve=529.7153, (0 missing)
## education_num < 12.5 to the left, improve=529.7153, (0 missing)
## occupation splits as LL-LRLLLLLRLRRL, improve=526.1943, (0 missing)
## capital_gain < 5095.5 to the left, improve=443.9957, (0 missing)
## capital_loss < 1782.5 to the left, improve=159.0220, (0 missing)
## Surrogate splits:
## education_num < 12.5 to the left, agree=1.000, adj=1.000, (0 split)
## occupation splits as LL-LRLLLLLRLLLL, agree=0.793, adj=0.309, (0 split)
## capital_gain < 10585.5 to the left, agree=0.716, adj=0.053, (0 split)
## natice_country splits as LLLRLLLLLRLLLLL-LLLRRLLLRLLLLLRLLLLRRLLLLL, agree=0.708, adj=0.025, (0 split)
## capital_loss < 1894.5 to the left, agree=0.706, adj=0.019, (0 split)
##
## Node number 4: 10452 observations
## predicted class=<=50K expected loss=0.04879449 P(node) =0.5350123
## class counts: 9942 510
## probabilities: 0.951 0.049
##
## Node number 5: 189 observations
## predicted class=>50K expected loss=0.03703704 P(node) =0.009674447
## class counts: 7 182
## probabilities: 0.037 0.963
##
## Node number 6: 6231 observations, complexity param=0.0660737
## predicted class=<=50K expected loss=0.3402343 P(node) =0.3189496
## class counts: 4111 2120
## probabilities: 0.660 0.340
## left son=12 (5905 obs) right son=13 (326 obs)
## Primary splits:
## capital_gain < 5095.5 to the left, improve=280.30190, (0 missing)
## occupation splits as LR-LRLLLLLRRRRL, improve=140.16950, (0 missing)
## education splits as LLRLLLLRR--R-L-R, improve=103.82930, (0 missing)
## education_num < 7.5 to the left, improve=103.82930, (0 missing)
## capital_loss < 1782.5 to the left, improve= 78.02779, (0 missing)
##
## Node number 7: 2664 observations
## predicted class=>50K expected loss=0.283033 P(node) =0.1363636
## class counts: 754 1910
## probabilities: 0.283 0.717
##
## Node number 12: 5905 observations
## predicted class=<=50K expected loss=0.3049958 P(node) =0.3022625
## class counts: 4104 1801
## probabilities: 0.695 0.305
##
## Node number 13: 326 observations
## predicted class=>50K expected loss=0.02147239 P(node) =0.01668714
## class counts: 7 319
## probabilities: 0.021 0.979
opar <- par(mfrow = c(1,1), xpd = NA)
plot(cvr_tr)
text(cvr_tr, use.n = T)
par(opar)
y_obs <- ifelse(validation$wage == ">50K", 1, 0)
ad_glm_full <- glm(wage ~ ., data = training, family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
yhat_lm <- predict(ad_glm_full, newdata = validation, type = 'response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
yhat_tr <- predict(cvr_tr, validation)
yhat_tr <- yhat_tr[, ">50K"]
binomial_deviance(y_obs, yhat_tr)
## [1] 4780.445
pred_lm <- prediction(yhat_lm, y_obs)
perf_lm <- performance(pred_lm, measure = "tpr", x.measure = "fpr")
pred_tr <- prediction(yhat_tr, y_obs)
perf_tr <- performance(pred_tr, measure = "tpr", x.measure = "fpr")
plot(perf_lm, col = "black", main = "ROC Curve")
plot(perf_tr, col = "blue", add = T)
abline(0, 1)
legend("bottomright", inset = .1, legend = c("GLM", "Tree"),
col = c("black", "blue"), lty = 1, lwd = 2)
performance(pred_tr, "auc")@y.values[[1]]
## [1] 0.8487345
나무 모형은 그 모형의 단순성 대문에 ROC 곡선도 부드러운 곡선이 아닌 꺽어진 직선으로 나타난다. GLM, glmnet 결과들에 비해 높은 이항편차와 낮은 AUC 값으로 부터, 예측 능력은 낮은 것을 알 수 있다.
set.seed(1711)
ad_rf <- randomForest(wage ~ ., training)
ad_rf
##
## Call:
## randomForest(formula = wage ~ ., data = training)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 17.89%
## Confusion matrix:
## <=50K >50K class.error
## <=50K 14764 50 0.003375186
## >50K 3445 1277 0.729563744
tmp <- importance(ad_rf)
head(round(tmp[order(-tmp[,1]), 1, drop = F], 2), n = 10)
## MeanDecreaseGini
## relationship 757.94
## capital_gain 723.89
## age 704.82
## fnlwgt 653.45
## occupation 602.60
## marial_status 561.30
## education 451.74
## hours_per_week 438.36
## education_num 373.36
## workclass 241.84
varImpPlot(ad_rf)
predict(ad_rf, newdata = adult[1:5, ])
## 1 2 3 4 5
## <=50K <=50K <=50K <=50K <=50K
## Levels: <=50K >50K
predict(ad_rf, newdata = adult[1:5, ], type = "prob")
## <=50K >50K
## 1 0.980 0.020
## 2 0.866 0.134
## 3 1.000 0.000
## 4 0.938 0.062
## 5 0.758 0.242
## attr(,"class")
## [1] "matrix" "votes"
yhat_rf <- predict(ad_rf, newdata = validation, type = 'prob')[, ">50K"]
binomial_deviance(y_obs, yhat_rf)
## [1] Inf
pred_rf <- prediction(yhat_rf, y_obs)
perf_rf <- performance(pred_rf, measure = "tpr", x.measure = "fpr")
plot(perf_lm, col = "black", main = "ROC Curve")
#plot(perf_glmnet, add = T, col = "blue")
plot(perf_rf, add = T, col = "red")
abline(0, 1)
legend("bottomright", inset = .1, legend = c("GLM", "glmnet", "RF"),
col = c("black", "blue", "red"), lty = 1, lwd = 2)
performance(pred_tr, "auc")@y.values[[1]]
## [1] 0.8487345
# p1 <- data.frame(yhat_gmlnet, yhat_rf) %>%
# ggplot(aes(yhat_glmnet, yhat_fit)) +
# geom_point(alpha=.5) +
# geom_abline() +
# geom_smooth()
#
# p2 <- reshape2::melt(data.frmae(yhat_glmnet, yhat_rf)) %>%
# ggplot(aes(value, fill = variable)) +
# geom_density(alpha=.5)
#
# grid.arrange(p1, p2, ncol = 2)