naive bayes classifier

0.1 Naive Bayes Classifier (Supervised)

Naïve Bayes Classifier with H2O

[Reference 1] http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/naive-bayes.html

Naïve Bayes is a classification algorithm that applies Bayes' theorem under a strong assumption that the covariates are independent of one another.
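
To make the role of the independence assumption concrete, Bayes' theorem for a class $y$ and predictors $x_1, \dots, x_p$ can be written as below. The symbols $y$, $x_j$, and $p$ are notation introduced here for illustration and do not come from the H2O documentation. The first equality is Bayes' theorem itself; the proportionality step is where the "naïve" assumption is used, replacing the joint likelihood with a product of per-predictor terms.

```latex
P(y \mid x_1, \dots, x_p)
  = \frac{P(y)\, P(x_1, \dots, x_p \mid y)}{P(x_1, \dots, x_p)}
  \;\propto\; P(y) \prod_{j=1}^{p} P(x_j \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{j=1}^{p} P(x_j \mid y)
```

The denominator does not depend on $y$, so it can be dropped when only the most probable class is needed.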

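The Naïve Bayes classifier assumes that the predictors are independent of one another conditional on the response, and that each numeric predictor follows a Gaussian distribution whose mean and standard deviation are computed from the training dataset.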
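
The two assumptions above (conditional independence and Gaussian numeric predictors) are enough to write a small classifier by hand. The following is a minimal base R sketch, not H2O's implementation: the function names `fit_gnb` and `predict_gnb` are hypothetical, and the built-in `iris` data is used only so that the block runs without an H2O cluster.

```r
# Minimal Gaussian naive Bayes in base R (illustrative; not H2O's implementation).

# Fit: for each class, store the prior and the per-predictor mean and standard deviation.
fit_gnb <- function(X, y) {
  X <- as.data.frame(X)
  lapply(split(X, as.factor(y)), function(d) {
    list(prior = nrow(d) / nrow(X),
         mean  = vapply(d, mean, numeric(1)),
         sd    = vapply(d, sd, numeric(1)))
  })
}

# Predict: log prior plus the sum of per-predictor Gaussian log densities,
# then pick the class with the highest score for each row.
predict_gnb <- function(model, X) {
  X <- as.matrix(X)
  scores <- do.call(cbind, lapply(model, function(m) {
    log_lik <- Reduce(`+`, lapply(seq_len(ncol(X)), function(j) {
      dnorm(X[, j], mean = m$mean[j], sd = m$sd[j], log = TRUE)
    }))
    log(m$prior) + log_lik
  }))
  colnames(scores)[max.col(scores, ties.method = "first")]
}

model     <- fit_gnb(iris[, 1:4], iris$Species)
iris_pred <- predict_gnb(model, iris[, 1:4])
accuracy  <- mean(iris_pred == iris$Species)   # training accuracy
accuracy
```

Working in log space avoids numerical underflow when many small densities are multiplied together.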

Naïve Bayes models are commonly used as an alternative to decision trees for classification problems.

0.1.1 packages

library(h2o)

## Warning: package 'h2o' was built under R version 4.0.3

# Initial setup: start (or connect to) a local H2O cluster
h2o.init()

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         6 hours 17 minutes 
##     H2O cluster timezone:       Asia/Seoul 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.0.1 
##     H2O cluster version age:    25 days  
##     H2O cluster name:           H2O_started_from_R_user_uho906 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.95 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.2 (2020-06-22)

0.1.2 data import

# Import the prostate dataset into H2O:
prostate <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")

prostate

##   ID CAPSULE AGE RACE DPROS DCAPS  PSA  VOL GLEASON
## 1  1       0  65    1     2     1  1.4  0.0       6
## 2  2       0  72    1     3     2  6.7  0.0       7
## 3  3       0  70    1     1     2  4.9  0.0       6
## 4  4       0  76    2     2     1 51.2 20.0       7
## 5  5       0  69    1     1     1 12.3 55.9       6
## 6  6       1  71    1     3     2  3.3  0.0       8
## 
## [380 rows x 9 columns]

0.1.3 data munging

# Set the predictors and response; convert the response to a factor so the model is trained as a classifier:
prostate$CAPSULE <- as.factor(prostate$CAPSULE)
predictors <- c("ID", "AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON")
response <- "CAPSULE"

0.1.4 modeling

# Build and train the model:
pros_nb <- h2o.naiveBayes(x = predictors,
                          y = response,
                          training_frame = prostate,
                          laplace = 0,
                          nfolds = 5,
                          seed = 1234)
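
The `laplace` argument above sets the Laplace smoothing parameter. As a rough illustration of what additive smoothing does in the usual textbook formulation, the base R sketch below adds a constant to every category count before normalising, so a category never seen with a class no longer gets probability zero. The function `smoothed_cond_prob` and the toy vectors are hypothetical, and the block does not claim to reproduce H2O's internal computation.

```r
# Additive (Laplace) smoothing of P(category | class); illustrative only.
smoothed_cond_prob <- function(x, y, laplace = 0) {
  counts <- table(class = y, category = x)   # rows: classes, columns: categories
  k <- ncol(counts)                          # number of categories
  (counts + laplace) / (rowSums(counts) + laplace * k)
}

# Hypothetical toy data: category "c" never occurs together with class "no".
x <- factor(c("a", "a", "b", "c", "a", "b"), levels = c("a", "b", "c"))
y <- factor(c("no", "no", "no", "yes", "yes", "yes"))

unsmoothed <- smoothed_cond_prob(x, y, laplace = 0)
smoothed   <- smoothed_cond_prob(x, y, laplace = 1)

unsmoothed["no", "c"]   # zero: one unseen category would zero out the whole product
smoothed["no", "c"]     # strictly positive after smoothing
```

With `laplace = 0`, as in the model call above, no smoothing is applied.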

summary(pros_nb)

## Model Details:
## ==============
## 
## H2OBinomialModel: naivebayes
## Model Key:  NaiveBayes_model_R_1604363756625_1890 
## Model Summary: 
##   number_of_response_levels min_apriori_probability max_apriori_probability
## 1                         2                 0.40263                 0.59737
## 
## H2OBinomialMetrics: naivebayes
## ** Reported on training data. **
## 
## MSE:  0.2177341
## RMSE:  0.4666199
## LogLoss:  0.8877717
## Mean Per-Class Error:  0.2500936
## AUC:  0.8143445
## AUCPR:  0.7374712
## Gini:  0.6286891
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error      Rate
## 0      155  72 0.317181   =72/227
## 1       28 125 0.183007   =28/153
## Totals 183 197 0.263158  =100/380
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.097343   0.714286 192
## 2                       max f2  0.023689   0.813578 311
## 3                 max f0point5  0.209032   0.707351 137
## 4                 max accuracy  0.209032   0.760526 137
## 5                max precision  1.000000   1.000000   0
## 6                   max recall  0.007216   1.000000 352
## 7              max specificity  1.000000   1.000000   0
## 8             max absolute_mcc  0.209032   0.497193 137
## 9   max min_per_class_accuracy  0.126591   0.745098 166
## 10 max mean_per_class_accuracy  0.097343   0.749906 192
## 11                     max tns  1.000000 227.000000   0
## 12                     max fns  1.000000 148.000000   0
## 13                     max fps  0.000003 227.000000 375
## 14                     max tps  0.007216 153.000000 352
## 15                     max tnr  1.000000   1.000000   0
## 16                     max fnr  1.000000   0.967320   0
## 17                     max fpr  0.000003   1.000000 375
## 18                     max tpr  0.007216   1.000000 352
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## H2OBinomialMetrics: naivebayes
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.2224463
## RMSE:  0.4716421
## LogLoss:  1.006322
## Mean Per-Class Error:  0.2510869
## AUC:  0.8001785
## AUCPR:  0.7132516
## Gini:  0.600357
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      159  68 0.299559  =68/227
## 1       31 122 0.202614  =31/153
## Totals 190 190 0.260526  =99/380
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.102958   0.711370 173
## 2                       max f2  0.031904   0.812155 276
## 3                 max f0point5  0.241473   0.691729 111
## 4                 max accuracy  0.187209   0.747368 132
## 5                max precision  1.000000   0.923077   0
## 6                   max recall  0.006408   1.000000 342
## 7              max specificity  1.000000   0.995595   0
## 8             max absolute_mcc  0.102958   0.488296 173
## 9   max min_per_class_accuracy  0.132498   0.738562 155
## 10 max mean_per_class_accuracy  0.102958   0.748913 173
## 11                     max tns  1.000000 226.000000   0
## 12                     max fns  1.000000 141.000000   0
## 13                     max fps  0.000005 227.000000 363
## 14                     max tps  0.006408 153.000000 342
## 15                     max tnr  1.000000   0.995595   0
## 16                     max fnr  1.000000   0.921569   0
## 17                     max fpr  0.000005   1.000000 363
## 18                     max tpr  0.006408   1.000000 342
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                 mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy    0.759484  0.07326607  0.7794118     0.7875 0.75949365  0.6376812
## auc        0.8018616 0.028404253  0.8021978 0.82898355  0.7766798  0.7702703
## aucpr     0.73397154 0.046767615 0.72865367  0.6884452 0.69029164  0.7695594
## err         0.240516  0.07326607 0.22058824     0.2125 0.24050634 0.36231884
## err_count       18.0    4.358899       15.0       17.0       19.0       25.0
##           cv_5_valid
## accuracy   0.8333333
## auc       0.83117646
## aucpr     0.79290783
## err       0.16666667
## err_count       14.0
## 
## ---
##                    mean          sd cv_1_valid  cv_2_valid  cv_3_valid
## pr_auc       0.73397154 0.046767615 0.72865367   0.6884452  0.69029164
## precision     0.6954579  0.10696869       0.72  0.67741936  0.65909094
## r2          0.067146584 0.094245516 0.06642159 0.036379173 -0.03291186
## recall        0.7991456  0.12003771  0.6923077        0.75   0.8787879
## rmse         0.47191358 0.025731008 0.46954563  0.46821335   0.5012339
## specificity  0.71725804   0.2227794  0.8333333   0.8076923  0.67391306
##              cv_4_valid cv_5_valid
## pr_auc        0.7695594 0.79290783
## precision    0.56363636 0.85714287
## r2          0.043711007 0.22213301
## recall          0.96875  0.7058824
## rmse         0.48766473 0.43291023
## specificity  0.35135135       0.92
## 
## NULL
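
Several figures in the summary can be re-derived by hand from the training confusion matrix, which is a useful check on how to read the output. The sketch below copies the four cell counts and the training AUC from the output above; the variable names are chosen here for illustration.

```r
# Training confusion matrix at the F1-optimal threshold, copied from the summary output.
tn <- 155; fp <- 72    # actual class 0
fn <- 28;  tp <- 125   # actual class 1

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Overall error rate and the average of the two per-class error rates.
error                <- (fp + fn) / (tn + fp + fn + tp)
mean_per_class_error <- mean(c(fp / (tn + fp), fn / (fn + tp)))

# Gini coefficient from the reported training AUC: Gini = 2 * AUC - 1.
gini <- 2 * 0.8143445 - 1

round(c(f1 = f1, error = error,
        mean_per_class_error = mean_per_class_error, gini = gini), 6)
```

These values agree with the reported `max f1` (0.714286), the `Totals` error (0.263158), `Mean Per-Class Error` (0.2500936), and `Gini` (0.6286891) up to rounding.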

# Evaluate performance (with no other arguments, this returns the training metrics):
perf <- h2o.performance(pros_nb)

0.1.5 prediction

# Generate predictions; the training frame is reused here because no separate test set was created:
pred <- h2o.predict(pros_nb, newdata = prostate)

pred

##   predict           p0          p1
## 1       0 0.9415726386 0.058427361
## 2       1 0.0004140885 0.999585911
## 3       1 0.0047612605 0.995238740
## 4       1 0.0038528573 0.996147143
## 5       0 0.9950751924 0.004924808
## 6       1 0.0001460359 0.999853964
## 
## [380 rows x 3 columns]
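
The predictions above are made on the same frame the model was trained on. If a held-out evaluation is wanted, one possible approach is sketched below using `h2o.splitFrame`. The 80/20 ratio, the seed, and the names `splits`, `train`, `test`, `nb_holdout`, `perf_test`, and `pred_test` are arbitrary choices made for this illustration and are not part of the original walkthrough. The block assumes the `h2o` package is installed and the dataset URL is reachable.

```r
library(h2o)
h2o.init()

prostate <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
prostate$CAPSULE <- as.factor(prostate$CAPSULE)

# Hypothetical split; h2o.splitFrame assigns rows at random, so sizes are approximate.
splits <- h2o.splitFrame(prostate, ratios = 0.8, seed = 1234)
train  <- splits[[1]]
test   <- splits[[2]]

# Same predictors and settings as the model above, but trained on the training split only.
predictors <- c("ID", "AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON")
nb_holdout <- h2o.naiveBayes(x = predictors,
                             y = "CAPSULE",
                             training_frame = train,
                             laplace = 0,
                             seed = 1234)

# Metrics and predictions on rows the model has not seen.
perf_test <- h2o.performance(nb_holdout, newdata = test)
h2o.auc(perf_test)
h2o.confusionMatrix(perf_test)

pred_test <- h2o.predict(nb_holdout, newdata = test)
```

With only 380 rows the test split is small, so the 5-fold cross-validation metrics reported in the summary are likely the more stable estimate; the hold-out split is shown mainly to illustrate the `newdata` argument.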