Generalized Linear Model (GLM) with H2O
[Reference 1] http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html
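A generalized linear model (GLM) estimates a regression model for a response whose distribution belongs to the exponential family.
Besides the Gaussian (i.e., normal) distribution, this family includes the Poisson, binomial, and gamma distributions.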
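Each of these families is paired with a link function g that relates the mean of the response to the linear predictor, g(E[y]) = x'β. As a minimal sketch, base R's own `stats` family objects (not H2O) can be used to list the default link for each family named above; H2O's defaults are not shown here and may differ.

```r
# Default link function for each family, taken from base R's 'stats'
# family objects. This illustrates the GLM idea only; it does not query H2O.
families <- list(
  gaussian = gaussian(),
  poisson  = poisson(),
  binomial = binomial(),
  gamma    = Gamma()
)

# Extract the name of the link function stored in each family object
links <- sapply(families, function(f) f$link)
print(links)

# The binomial family's inverse link maps a linear predictor to a probability
p_at_zero <- binomial()$linkinv(0)
print(p_at_zero)
```

The binomial family with its logit link is the case fitted later in this document (logistic regression).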
library(h2o)
## Warning: package 'h2o' was built under R version 4.0.3
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 5 hours 24 minutes
## H2O cluster timezone: Asia/Seoul
## H2O data parsing timezone: UTC
## H2O cluster version: 3.32.0.1
## H2O cluster version age: 25 days
## H2O cluster name: H2O_started_from_R_user_uho906
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.96 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 4.0.2 (2020-06-22)
# Import the prostate dataset into H2O
df <- h2o.importFile("http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
## ID CAPSULE AGE RACE DPROS DCAPS PSA VOL GLEASON
## 1 1 0 65 1 2 1 1.4 0.0 6
## 2 2 0 72 1 3 2 6.7 0.0 7
## 3 3 0 70 1 1 2 4.9 0.0 6
## 4 4 0 76 2 2 1 51.2 20.0 7
## 5 5 0 69 1 1 1 12.3 55.9 6
## 6 6 1 71 1 3 2 3.3 0.0 8
##
## [380 rows x 9 columns]
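# Convert the response and the categorical columns to factors
df$CAPSULE <- as.factor(df$CAPSULE)
df$RACE <- as.factor(df$RACE)
df$DCAPS <- as.factor(df$DCAPS)
df$DPROS <- as.factor(df$DPROS)
# Set the predictor and response columns
predictors <- c("AGE", "RACE", "VOL", "GLEASON")
response <- "CAPSULE"
# Fit a logistic regression; lambda = 0 turns off regularization
# so that p-values can be computed
prostate_glm <- h2o.glm(family = "binomial",
                        x = predictors,
                        y = response,
                        training_frame = df,
                        lambda = 0,
                        compute_p_values = TRUE)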
## Model Details:
## ==============
##
## H2OBinomialModel: glm
## Model Key: GLM_model_R_1604363756625_126
## GLM Model: summary
## family link regularization number_of_predictors_total
## 1 binomial logit None 5
## number_of_active_predictors number_of_iterations training_frame
## 1 5 4 RTMP_sid_b790_5
##
## H2OBinomialMetrics: glm
## ** Reported on training data. **
##
## MSE: 0.1821583
## RMSE: 0.4268001
## LogLoss: 0.540688
## Mean Per-Class Error: 0.278584
## AUC: 0.7857534
## AUCPR: 0.722375
## Gini: 0.5715067
## R^2: 0.242646
## Residual Deviance: 410.9229
## AIC: 422.9229
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 148 79 0.348018 =79/227
## 1 32 121 0.209150 =32/153
## Totals 180 200 0.292105 =111/380
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.326318 0.685552 143
## 2 max f2 0.199660 0.803279 230
## 3 max f0point5 0.546868 0.673813 86
## 4 max accuracy 0.477010 0.736842 120
## 5 max precision 0.964240 1.000000 0
## 6 max recall 0.081948 1.000000 282
## 7 max specificity 0.964240 1.000000 0
## 8 max absolute_mcc 0.477010 0.456656 120
## 9 max min_per_class_accuracy 0.370958 0.713656 134
## 10 max mean_per_class_accuracy 0.477010 0.729665 120
## 11 max tns 0.964240 227.000000 0
## 12 max fns 0.964240 152.000000 0
## 13 max fps 0.000253 227.000000 302
## 14 max tps 0.081948 153.000000 282
## 15 max tnr 0.964240 1.000000 0
## 16 max fnr 0.964240 0.993464 0
## 17 max fpr 0.000253 1.000000 302
## 18 max tpr 0.081948 1.000000 282
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
##
##
## Scoring History:
## timestamp duration iterations negative_log_likelihood objective
## 1 2020-11-03 15:00:50 0.000 sec 0 256.14442 0.67406
## 2 2020-11-03 15:00:50 0.002 sec 1 209.72238 0.55190
## 3 2020-11-03 15:00:50 0.003 sec 2 205.57760 0.54099
## 4 2020-11-03 15:00:50 0.004 sec 3 205.46157 0.54069
## 5 2020-11-03 15:00:50 0.004 sec 4 205.46144 0.54069
## training_rmse training_logloss training_r2 training_auc training_pr_auc
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 NA NA NA NA NA
## 5 0.42680 0.54069 0.24265 NA NA
## training_lift training_classification_error
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 1.24183 0.29211
##
## Variable Importances: (Extract with `h2o.varimp`)
## =================================================
##
## variable relative_importance scaled_importance percentage
## 1 GLEASON 1.3653342 1.00000000 0.49660257
## 2 RACE.2 0.5899233 0.43207244 0.21456829
## 3 RACE.1 0.4427875 0.32430707 0.16105172
## 4 VOL 0.2345440 0.17178507 0.08530891
## 5 AGE 0.1167608 0.08551811 0.04246851
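Several of the training metrics printed above are tied together by simple identities. The sketch below recomputes them in base R from numbers copied out of the printout. The identities themselves (deviance as twice the negative log-likelihood, AIC as deviance plus twice the number of coefficients, Gini as 2·AUC − 1, and mean per-class error taken at the F1-optimal threshold) are assumptions about how H2O derives these values; they are used here because they reproduce the printed figures.

```r
# Values copied from the model printout above
n_obs  <- 380         # rows in the training frame
neg_ll <- 205.46144   # final negative_log_likelihood in the scoring history
mse    <- 0.1821583
auc    <- 0.7857534
n_coef <- 6           # intercept + 5 predictors

logloss <- neg_ll / n_obs        # compare with LogLoss
res_dev <- 2 * neg_ll            # compare with Residual Deviance
aic     <- res_dev + 2 * n_coef  # compare with AIC
rmse    <- sqrt(mse)             # compare with RMSE
gini    <- 2 * auc - 1           # compare with Gini

# Confusion matrix at the F1-optimal threshold (rows: actual, columns: predicted)
cm <- matrix(c(148, 79,
                32, 121),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Share of each actual class that was misclassified
per_class_error <- c(cm["0", "1"] / sum(cm["0", ]),
                     cm["1", "0"] / sum(cm["1", ]))
mean_per_class_error <- mean(per_class_error)
overall_error <- (cm["0", "1"] + cm["1", "0"]) / sum(cm)

print(c(logloss = logloss, res_dev = res_dev, aic = aic, rmse = rmse,
        gini = gini, mean_per_class_error = mean_per_class_error,
        overall_error = overall_error))
```

A check like this is a quick way to confirm which threshold and which sample a reported metric refers to.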
# Coefficients that apply to the original (non-standardized) data
h2o.coef(prostate_glm)
## Intercept RACE.1 RACE.2 AGE VOL GLEASON
## -6.67515539 -0.44278752 -0.58992326 -0.01788870 -0.01278335 1.25035939
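To make these coefficients concrete, the following sketch applies them by hand to the first row of the frame printed earlier (AGE 65, RACE 1, VOL 0, GLEASON 6) and converts the linear predictor to a probability with the inverse logit. It assumes that RACE level 0 is the reference level (only RACE.1 and RACE.2 have coefficients), and the helper name `predict_capsule` is made up for this illustration; it is not part of H2O.

```r
# Coefficients on the original scale, copied from the output above
coefs <- c(Intercept = -6.67515539, RACE.1 = -0.44278752, RACE.2 = -0.58992326,
           AGE = -0.01788870, VOL = -0.01278335, GLEASON = 1.25035939)

# Hypothetical helper: probability of CAPSULE = 1 for one observation.
# Assumes RACE = 0 is the reference level and contributes nothing.
predict_capsule <- function(age, race, vol, gleason, b = coefs) {
  eta <- b[["Intercept"]] +
    b[["RACE.1"]] * (race == 1) +
    b[["RACE.2"]] * (race == 2) +
    b[["AGE"]] * age +
    b[["VOL"]] * vol +
    b[["GLEASON"]] * gleason
  plogis(eta)  # inverse logit
}

# First row of the printed frame: AGE 65, RACE 1, VOL 0, GLEASON 6
p_row1 <- predict_capsule(age = 65, race = 1, vol = 0, gleason = 6)
print(p_row1)

# Odds ratio for a one-point increase in GLEASON, other predictors held fixed
or_gleason <- exp(coefs[["GLEASON"]])
print(or_gleason)
```

On the log-odds scale the GLEASON coefficient is additive; exponentiating it gives the multiplicative change in the odds per Gleason point.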
# Coefficients fitted on the standardized data (requires standardize=TRUE, which is on by default)
h2o.coef_norm(prostate_glm)
## Intercept RACE.1 RACE.2 AGE VOL GLEASON
## -0.07610006 -0.44278752 -0.58992326 -0.11676080 -0.23454402 1.36533415
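Comparing the two sets of coefficients shows how they relate. Under the usual convention that a standardized coefficient equals the raw coefficient times the predictor's standard deviation, the ratio of the two recovers that standard deviation for the numeric columns, while the RACE dummy coefficients are identical in both outputs. The sketch below also checks that the variable importances printed earlier match the absolute standardized coefficients. Both relationships are inferred from the printed numbers, not taken from the H2O documentation.

```r
# Raw and standardized coefficients for the numeric predictors (from the outputs above)
raw <- c(AGE = -0.01788870, VOL = -0.01278335, GLEASON = 1.25035939)
std <- c(AGE = -0.11676080, VOL = -0.23454402, GLEASON = 1.36533415)

# Standard deviation of each predictor implied by std = raw * sd
implied_sd <- std / raw
print(implied_sd)

# Standardized coefficients for all non-intercept terms
std_all <- c(RACE.1 = -0.44278752, RACE.2 = -0.58992326, std)

# Relative importance as the absolute standardized coefficient, largest first
rel_imp <- sort(abs(std_all), decreasing = TRUE)
percentage <- rel_imp / sum(rel_imp)
print(round(cbind(relative_importance = rel_imp, percentage = percentage), 6))
```

This is why GLEASON ranks first in the importance table even though its raw coefficient is not on the same scale as AGE or VOL.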
# Print the coefficients table
prostate_glm@model$coefficients_table
## Coefficients: glm coefficients
## names coefficients std_error z_value p_value standardized_coefficients
## 1 Intercept -6.675155 1.931760 -3.455478 0.000549 -0.076100
## 2 RACE.1 -0.442788 1.324231 -0.334373 0.738098 -0.442788
## 3 RACE.2 -0.589923 1.373466 -0.429514 0.667549 -0.589923
## 4 AGE -0.017889 0.018702 -0.956516 0.338812 -0.116761
## 5 VOL -0.012783 0.007514 -1.701191 0.088907 -0.234544
## 6 GLEASON 1.250359 0.156156 8.007103 0.000000 1.365334
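# Standard errors
prostate_glm@model$coefficients_table$std_error
## [1] 1.931760363 1.324230832 1.373465793 0.018701933 0.007514354 0.156156271
# p-values
prostate_glm@model$coefficients_table$p_value
## [1] 5.493181e-04 7.380978e-01 6.675490e-01 3.388116e-01 8.890718e-02
## [6] 1.221245e-15
# z-values
prostate_glm@model$coefficients_table$z_value
## [1] -3.4554780 -0.3343734 -0.4295143 -0.9565159 -1.7011907 8.0071033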
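The three vectors above are linked by the Wald test: each z-value is the coefficient divided by its standard error, and the p-value is the two-sided tail probability. The sketch below recomputes both in base R from the printed coefficients and standard errors. Using the standard normal distribution for the p-values is an assumption, chosen because it reproduces the first five printed values; the GLEASON p-value is so close to zero that the printed figure sits at floating-point precision, so only its order of magnitude is compared.

```r
# Coefficients and standard errors copied from the outputs above
coefs <- c(Intercept = -6.67515539, RACE.1 = -0.44278752, RACE.2 = -0.58992326,
           AGE = -0.01788870, VOL = -0.01278335, GLEASON = 1.25035939)
std_error <- c(1.931760363, 1.324230832, 1.373465793,
               0.018701933, 0.007514354, 0.156156271)

# Wald z statistic: estimate divided by its standard error
z_value <- coefs / std_error

# Two-sided p-value under a standard normal reference distribution (assumption)
p_value <- 2 * pnorm(-abs(z_value))

print(round(z_value, 6))
print(signif(p_value, 6))

# Terms significant at the 5% level
print(names(p_value)[p_value < 0.05])
```

At the 5% level only the intercept and GLEASON are significant in this fit; VOL is borderline at roughly 0.09.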