Generalized Linear Model (GLM)

0.1 Generalized Linear Model (GLM) _Supervised

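Generalized Linear Model (GLM) with H2O

[Reference 1] http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html

A generalized linear model (GLM) estimates regression models for outcomes that follow a distribution from the exponential family.

Besides the Gaussian (i.e., normal) distribution, this family includes the Poisson, binomial, and gamma distributions.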
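
As a compact restatement of the description above, the model can be written in standard GLM notation. The symbols here are assumptions for illustration and are not taken from the referenced page: g is the link function, k the number of predictors, and \pi the probability of the positive class.

```latex
g\bigl(\mathbb{E}[y \mid x]\bigr) = \beta_0 + \sum_{j=1}^{k} \beta_j x_j

% Binomial family with the logit link (the case fitted later in this document):
\log\frac{\pi}{1-\pi} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j
\quad\Longleftrightarrow\quad
\pi = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \sum_{j=1}^{k} \beta_j x_j)\bigr)}
```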

0.1.1 packages

library(h2o)

## Warning: package 'h2o' was built under R version 4.0.3

# Initial setup: start (or connect to) a local H2O cluster
h2o.init()

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         5 hours 24 minutes 
##     H2O cluster timezone:       Asia/Seoul 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.0.1 
##     H2O cluster version age:    25 days  
##     H2O cluster name:           H2O_started_from_R_user_uho906 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.96 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.2 (2020-06-22)

0.1.2 data import

# Import
df <- h2o.importFile("http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")


df

##   ID CAPSULE AGE RACE DPROS DCAPS  PSA  VOL GLEASON
## 1  1       0  65    1     2     1  1.4  0.0       6
## 2  2       0  72    1     3     2  6.7  0.0       7
## 3  3       0  70    1     1     2  4.9  0.0       6
## 4  4       0  76    2     2     1 51.2 20.0       7
## 5  5       0  69    1     1     1 12.3 55.9       6
## 6  6       1  71    1     3     2  3.3  0.0       8
## 
## [380 rows x 9 columns]

0.1.3 data munging

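# Convert the binary response and the integer-coded categorical columns to factors
df$CAPSULE <- as.factor(df$CAPSULE)
df$RACE <- as.factor(df$RACE)
df$DCAPS <- as.factor(df$DCAPS)
df$DPROS <- as.factor(df$DPROS)

# Predictor columns and the response column used by the model
predictors <- c("AGE", "RACE", "VOL", "GLEASON")
response <- "CAPSULE"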
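
To see why the integer-coded columns are converted to factors, here is a minimal base R sketch on a toy data frame. The values are hypothetical and `model.matrix` stands in for H2O's internal encoding. A numeric column contributes a single slope, while a factor contributes one indicator column per non-reference level, which matches the `RACE.1` and `RACE.2` terms in the model output.

```r
# Toy data frame (hypothetical values) mirroring the integer-coded RACE column
toy <- data.frame(RACE = c(0, 1, 2, 1), AGE = c(65, 72, 70, 76))

# As a numeric column, RACE enters the design matrix as a single slope term
X_num <- model.matrix(~ RACE + AGE, data = toy)

# As a factor, RACE expands into one indicator column per non-reference level
toy$RACE <- as.factor(toy$RACE)
X_fac <- model.matrix(~ RACE + AGE, data = toy)

print(colnames(X_num))
print(colnames(X_fac))
```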

0.1.4 modeling

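# Fit a logistic regression (binomial family).
# lambda = 0 turns regularization off; compute_p_values = TRUE requests
# standard errors, z-values, and p-values for the coefficients.
prostate_glm <- h2o.glm(family = "binomial",
                        x = predictors,
                        y = response,
                        training_frame = df,
                        lambda = 0,
                        compute_p_values = TRUE)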


summary(prostate_glm)

## Model Details:
## ==============
## 
## H2OBinomialModel: glm
## Model Key:  GLM_model_R_1604363756625_126 
## GLM Model: summary
##     family  link regularization number_of_predictors_total
## 1 binomial logit           None                          5
##   number_of_active_predictors number_of_iterations  training_frame
## 1                           5                    4 RTMP_sid_b790_5
## 
## H2OBinomialMetrics: glm
## ** Reported on training data. **
## 
## MSE:  0.1821583
## RMSE:  0.4268001
## LogLoss:  0.540688
## Mean Per-Class Error:  0.278584
## AUC:  0.7857534
## AUCPR:  0.722375
## Gini:  0.5715067
## R^2:  0.242646
## Residual Deviance:  410.9229
## AIC:  422.9229
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error      Rate
## 0      148  79 0.348018   =79/227
## 1       32 121 0.209150   =32/153
## Totals 180 200 0.292105  =111/380
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.326318   0.685552 143
## 2                       max f2  0.199660   0.803279 230
## 3                 max f0point5  0.546868   0.673813  86
## 4                 max accuracy  0.477010   0.736842 120
## 5                max precision  0.964240   1.000000   0
## 6                   max recall  0.081948   1.000000 282
## 7              max specificity  0.964240   1.000000   0
## 8             max absolute_mcc  0.477010   0.456656 120
## 9   max min_per_class_accuracy  0.370958   0.713656 134
## 10 max mean_per_class_accuracy  0.477010   0.729665 120
## 11                     max tns  0.964240 227.000000   0
## 12                     max fns  0.964240 152.000000   0
## 13                     max fps  0.000253 227.000000 302
## 14                     max tps  0.081948 153.000000 282
## 15                     max tnr  0.964240   1.000000   0
## 16                     max fnr  0.964240   0.993464   0
## 17                     max fpr  0.000253   1.000000 302
## 18                     max tpr  0.081948   1.000000 282
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## 
## 
## Scoring History: 
##             timestamp   duration iterations negative_log_likelihood objective
## 1 2020-11-03 15:00:50  0.000 sec          0               256.14442   0.67406
## 2 2020-11-03 15:00:50  0.002 sec          1               209.72238   0.55190
## 3 2020-11-03 15:00:50  0.003 sec          2               205.57760   0.54099
## 4 2020-11-03 15:00:50  0.004 sec          3               205.46157   0.54069
## 5 2020-11-03 15:00:50  0.004 sec          4               205.46144   0.54069
##   training_rmse training_logloss training_r2 training_auc training_pr_auc
## 1            NA               NA          NA           NA              NA
## 2            NA               NA          NA           NA              NA
## 3            NA               NA          NA           NA              NA
## 4            NA               NA          NA           NA              NA
## 5       0.42680          0.54069     0.24265           NA              NA
##   training_lift training_classification_error
## 1            NA                            NA
## 2            NA                            NA
## 3            NA                            NA
## 4            NA                            NA
## 5       1.24183                       0.29211
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
##   variable relative_importance scaled_importance percentage
## 1  GLEASON           1.3653342        1.00000000 0.49660257
## 2   RACE.2           0.5899233        0.43207244 0.21456829
## 3   RACE.1           0.4427875        0.32430707 0.16105172
## 4      VOL           0.2345440        0.17178507 0.08530891
## 5      AGE           0.1167608        0.08551811 0.04246851
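
Several figures in the summary above are linked by simple arithmetic. The base R sketch below recomputes them from the printed values. It assumes the reported LogLoss is the mean negative log-likelihood over the 380 training rows and that AIC counts six coefficients (the intercept plus five predictors).

```r
# Values copied from the summary output above
n       <- 380        # rows in the training frame
logloss <- 0.540688   # reported LogLoss
n_coef  <- 6          # intercept + 5 predictors

# For a binary response, residual deviance is twice the total negative log-likelihood
residual_deviance <- 2 * n * logloss
aic <- residual_deviance + 2 * n_coef

# Confusion-matrix counts at the F1-optimal threshold
tp <- 121; fp <- 79; fn <- 32; tn <- 148
precision  <- tp / (tp + fp)
recall     <- tp / (tp + fn)
f1         <- 2 * precision * recall / (precision + recall)
error_rate <- (fp + fn) / (tp + fp + fn + tn)

print(round(c(deviance = residual_deviance, aic = aic,
              f1 = f1, error = error_rate), 4))
```

The results agree with the reported Residual Deviance, AIC, max f1, and overall error rate up to rounding.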

# Coefficients that can be applied to the non-standardized data
h2o.coef(prostate_glm)

##   Intercept      RACE.1      RACE.2         AGE         VOL     GLEASON 
## -6.67515539 -0.44278752 -0.58992326 -0.01788870 -0.01278335  1.25035939
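
As a rough illustration of how these coefficients are applied, the sketch below scores the first row of the data by hand (AGE 65, RACE 1, VOL 0, GLEASON 6). It assumes the model predicts the probability of `CAPSULE = 1` and that RACE is encoded as the two indicators shown in the output. The names `x`, `eta`, and `p_capsule` are illustrative only.

```r
# Coefficients copied from the h2o.coef() output above
beta <- c(Intercept = -6.67515539, RACE.1 = -0.44278752, RACE.2 = -0.58992326,
          AGE = -0.01788870, VOL = -0.01278335, GLEASON = 1.25035939)

# Feature vector for the first data row; the leading 1 multiplies the intercept
x <- c(Intercept = 1, RACE.1 = 1, RACE.2 = 0, AGE = 65, VOL = 0, GLEASON = 6)

eta <- sum(beta * x)        # linear predictor on the log-odds scale
p_capsule <- plogis(eta)    # inverse logit: 1 / (1 + exp(-eta))

# Odds ratio for a one-unit increase in GLEASON, other predictors held fixed
or_gleason <- exp(beta[["GLEASON"]])

print(c(eta = eta, p_capsule = p_capsule, or_gleason = or_gleason))
```

In practice `h2o.predict(prostate_glm, df)` returns these probabilities for every row. The manual version only makes the logit arithmetic visible.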

# Coefficients fitted on the standardized data (requires standardize=TRUE, which is on by default)
h2o.coef_norm(prostate_glm)

##   Intercept      RACE.1      RACE.2         AGE         VOL     GLEASON 
## -0.07610006 -0.44278752 -0.58992326 -0.11676080 -0.23454402  1.36533415
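
The two coefficient sets differ only by scaling. The sketch below assumes that H2O standardizes each numeric predictor to zero mean and unit variance, so that a standardized coefficient equals the raw coefficient times the column's standard deviation. Under that assumption, the ratio of the two recovers the implied standard deviation of each numeric column. The RACE indicators are identical in both outputs, which suggests they are left unscaled.

```r
# Numeric-predictor coefficients copied from the two outputs above
coef_raw <- c(AGE = -0.01788870, VOL = -0.01278335, GLEASON = 1.25035939)
coef_std <- c(AGE = -0.11676080, VOL = -0.23454402, GLEASON = 1.36533415)

# Implied standard deviation of each column (assumption: std = raw * sd)
implied_sd <- coef_std / coef_raw
print(round(implied_sd, 3))
```

If the frame is still loaded, these values can be compared against `h2o.sd(df$AGE)` and the equivalent calls for the other columns.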

# Print the coefficients table
prostate_glm@model$coefficients_table

## Coefficients: glm coefficients
##       names coefficients std_error   z_value  p_value standardized_coefficients
## 1 Intercept    -6.675155  1.931760 -3.455478 0.000549                 -0.076100
## 2    RACE.1    -0.442788  1.324231 -0.334373 0.738098                 -0.442788
## 3    RACE.2    -0.589923  1.373466 -0.429514 0.667549                 -0.589923
## 4       AGE    -0.017889  0.018702 -0.956516 0.338812                 -0.116761
## 5       VOL    -0.012783  0.007514 -1.701191 0.088907                 -0.234544
## 6   GLEASON     1.250359  0.156156  8.007103 0.000000                  1.365334

# Print the standard errors
prostate_glm@model$coefficients_table$std_error

## [1] 1.931760363 1.324230832 1.373465793 0.018701933 0.007514354 0.156156271

# Print the p values
prostate_glm@model$coefficients_table$p_value

## [1] 5.493181e-04 7.380978e-01 6.675490e-01 3.388116e-01 8.890718e-02
## [6] 1.221245e-15

# Print the z values
prostate_glm@model$coefficients_table$z_value

## [1] -3.4554780 -0.3343734 -0.4295143 -0.9565159 -1.7011907  8.0071033
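
The z-values and p-values above follow directly from the coefficients and their standard errors. This base R sketch reproduces them, assuming a two-sided Wald test against the standard normal distribution, which is consistent with the printed numbers.

```r
# Coefficients and standard errors copied from the outputs above
coefs <- c(Intercept = -6.67515539, RACE.1 = -0.44278752, RACE.2 = -0.58992326,
           AGE = -0.01788870, VOL = -0.01278335, GLEASON = 1.25035939)
std_error <- c(1.931760363, 1.324230832, 1.373465793,
               0.018701933, 0.007514354, 0.156156271)

z_value <- coefs / std_error          # Wald statistic
p_value <- 2 * pnorm(-abs(z_value))   # two-sided p-value under N(0, 1)

print(round(cbind(z_value, p_value), 6))
```

Only GLEASON (and the intercept) is significant at the 5% level. VOL is borderline at about 0.089.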

0.1.5 graphical plot

# Retrieve a graphical plot of the standardized coefficient magnitudes
h2o.std_coef_plot(prostate_glm)
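
The magnitudes shown in this plot are the same numbers that appear in the Variable Importances table of the summary. The base R sketch below rebuilds that table from the standardized coefficients and draws a comparable bar chart. It is an approximation of the information in the plot, not a reproduction of H2O's own rendering.

```r
# Standardized coefficients (intercept excluded) copied from h2o.coef_norm()
std_coef <- c(RACE.1 = -0.44278752, RACE.2 = -0.58992326, AGE = -0.11676080,
              VOL = -0.23454402, GLEASON = 1.36533415)

# Rank predictors by absolute standardized coefficient
magnitude <- sort(abs(std_coef), decreasing = TRUE)

importance <- data.frame(
  variable            = names(magnitude),
  relative_importance = unname(magnitude),
  scaled_importance   = unname(magnitude / max(magnitude)),
  percentage          = unname(magnitude / sum(magnitude)),
  stringsAsFactors    = FALSE
)
print(importance)

# Horizontal bar chart, largest magnitude at the top
barplot(rev(magnitude), horiz = TRUE, las = 1,
        xlab = "|standardized coefficient|")
```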

updragon

2020 11 3
