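Naïve Bayes Classifier with H2O
[Reference 1] http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/naive-bayes.html
Naïve Bayes is a classification algorithm that applies Bayes' theorem under a strong assumption that the covariates are independent.
More precisely, the Naïve Bayes classifier assumes that the predictors are independent of one another conditional on the response, and that each numeric predictor follows a Gaussian distribution whose mean and standard deviation are computed from the training dataset.
Naïve Bayes models are commonly used as an alternative to decision trees for classification problems.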
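To make the two assumptions concrete, here is a minimal base-R sketch of a Gaussian Naïve Bayes classifier. It is not H2O's implementation: the function names `gaussian_nb_fit` and `gaussian_nb_predict` are hypothetical, and the built-in `iris` data merely stands in for a training set.

```r
# Illustrative Gaussian Naive Bayes in base R (a sketch, not H2O's code).

# Fit: estimate class priors plus a per-class mean and standard deviation
# for every numeric predictor.
gaussian_nb_fit <- function(x, y) {
  x <- as.matrix(x)
  y <- as.factor(y)
  classes <- levels(y)
  per_class <- function(f) {
    stats <- lapply(classes, function(k) apply(x[y == k, , drop = FALSE], 2, f))
    out <- do.call(rbind, stats)  # one row per class, one column per predictor
    rownames(out) <- classes
    out
  }
  list(classes = classes,
       prior = as.numeric(table(y)) / length(y),
       mean = per_class(mean),
       sd = per_class(sd))
}

# Predict: posterior is proportional to prior times the product of the
# per-predictor Gaussian densities (conditional independence assumption).
gaussian_nb_predict <- function(model, newx) {
  newx <- as.matrix(newx)
  n <- nrow(newx)
  log_joint <- sapply(seq_along(model$classes), function(k) {
    log_lik <- sapply(seq_len(ncol(newx)), function(j) {
      dnorm(newx[, j], mean = model$mean[k, j], sd = model$sd[k, j], log = TRUE)
    })
    log(model$prior[k]) + rowSums(matrix(log_lik, nrow = n))
  })
  log_joint <- matrix(log_joint, nrow = n)
  # Subtract the row maximum before exponentiating for numerical stability.
  post <- exp(log_joint - apply(log_joint, 1, max))
  post <- post / rowSums(post)
  colnames(post) <- model$classes
  post
}

# Usage on the built-in iris data (an assumption made purely for illustration).
fit <- gaussian_nb_fit(iris[, 1:4], iris$Species)
post <- gaussian_nb_predict(fit, iris[, 1:4])
pred_class <- colnames(post)[max.col(post)]
print(mean(pred_class == iris$Species))  # training accuracy
```

Working in log space avoids underflow when many small densities are multiplied, which is why the sketch sums log densities instead of multiplying raw ones.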
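# Load the h2o package and connect to a running H2O cluster (or start a local one).
# The calls are reconstructed from the output below; the original arguments are not shown.
library(h2o)
h2o.init()
## Warning: package 'h2o' was built under R version 4.0.3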
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 6 hours 17 minutes
## H2O cluster timezone: Asia/Seoul
## H2O data parsing timezone: UTC
## H2O cluster version: 3.32.0.1
## H2O cluster version age: 25 days
## H2O cluster name: H2O_started_from_R_user_uho906
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.95 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 4.0.2 (2020-06-22)
# Import the prostate dataset into H2O:
prostate <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
##   ID CAPSULE AGE RACE DPROS DCAPS  PSA  VOL GLEASON
## 1  1       0  65    1     2     1  1.4  0.0       6
## 2  2       0  72    1     3     2  6.7  0.0       7
## 3  3       0  70    1     1     2  4.9  0.0       6
## 4  4       0  76    2     2     1 51.2 20.0       7
## 5  5       0  69    1     1     1 12.3 55.9       6
## 6  6       1  71    1     3     2  3.3  0.0       8
##
## [380 rows x 9 columns]
# Set the predictors and response; set the factors:
prostate$CAPSULE <- as.factor(prostate$CAPSULE)
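# ID is only a row identifier; it is kept in the predictor list because the results below were produced with it.
predictors <- c("ID", "AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON")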
response <- "CAPSULE"
# Build and train the model:
pros_nb <- h2o.naiveBayes(x = predictors,
                          y = response,
                          training_frame = prostate,
                          laplace = 0,
                          nfolds = 5,
                          seed = 1234)
## Model Details:
## ==============
##
## H2OBinomialModel: naivebayes
## Model Key: NaiveBayes_model_R_1604363756625_1890
## Model Summary:
##   number_of_response_levels min_apriori_probability max_apriori_probability
## 1                         2                 0.40263                 0.59737
##
## H2OBinomialMetrics: naivebayes
## ** Reported on training data. **
##
## MSE: 0.2177341
## RMSE: 0.4666199
## LogLoss: 0.8877717
## Mean Per-Class Error: 0.2500936
## AUC: 0.8143445
## AUCPR: 0.7374712
## Gini: 0.6286891
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error      Rate
## 0      155  72 0.317181   =72/227
## 1       28 125 0.183007   =28/153
## Totals 183 197 0.263158  =100/380
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##    metric                      threshold      value idx
##  1 max f1                       0.097343   0.714286 192
##  2 max f2                       0.023689   0.813578 311
##  3 max f0point5                 0.209032   0.707351 137
##  4 max accuracy                 0.209032   0.760526 137
##  5 max precision                1.000000   1.000000   0
##  6 max recall                   0.007216   1.000000 352
##  7 max specificity              1.000000   1.000000   0
##  8 max absolute_mcc             0.209032   0.497193 137
##  9 max min_per_class_accuracy   0.126591   0.745098 166
## 10 max mean_per_class_accuracy  0.097343   0.749906 192
## 11 max tns                      1.000000 227.000000   0
## 12 max fns                      1.000000 148.000000   0
## 13 max fps                      0.000003 227.000000 375
## 14 max tps                      0.007216 153.000000 352
## 15 max tnr                      1.000000   1.000000   0
## 16 max fnr                      1.000000   0.967320   0
## 17 max fpr                      0.000003   1.000000 375
## 18 max tpr                      0.007216   1.000000 352
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: naivebayes
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.2224463
## RMSE: 0.4716421
## LogLoss: 1.006322
## Mean Per-Class Error: 0.2510869
## AUC: 0.8001785
## AUCPR: 0.7132516
## Gini: 0.600357
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      159  68 0.299559  =68/227
## 1       31 122 0.202614  =31/153
## Totals 190 190 0.260526  =99/380
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##    metric                      threshold      value idx
##  1 max f1                       0.102958   0.711370 173
##  2 max f2                       0.031904   0.812155 276
##  3 max f0point5                 0.241473   0.691729 111
##  4 max accuracy                 0.187209   0.747368 132
##  5 max precision                1.000000   0.923077   0
##  6 max recall                   0.006408   1.000000 342
##  7 max specificity              1.000000   0.995595   0
##  8 max absolute_mcc             0.102958   0.488296 173
##  9 max min_per_class_accuracy   0.132498   0.738562 155
## 10 max mean_per_class_accuracy  0.102958   0.748913 173
## 11 max tns                      1.000000 226.000000   0
## 12 max fns                      1.000000 141.000000   0
## 13 max fps                      0.000005 227.000000 363
## 14 max tps                      0.006408 153.000000 342
## 15 max tnr                      1.000000   0.995595   0
## 16 max fnr                      1.000000   0.921569   0
## 17 max fpr                      0.000005   1.000000 363
## 18 max tpr                      0.006408   1.000000 342
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
##                     mean           sd   cv_1_valid   cv_2_valid   cv_3_valid   cv_4_valid   cv_5_valid
## accuracy        0.759484   0.07326607    0.7794118       0.7875   0.75949365    0.6376812    0.8333333
## auc            0.8018616  0.028404253    0.8021978   0.82898355    0.7766798    0.7702703   0.83117646
## aucpr         0.73397154  0.046767615   0.72865367    0.6884452   0.69029164    0.7695594   0.79290783
## err             0.240516   0.07326607   0.22058824       0.2125   0.24050634   0.36231884   0.16666667
## err_count           18.0     4.358899         15.0         17.0         19.0         25.0         14.0
##
## ---
##                     mean           sd   cv_1_valid   cv_2_valid   cv_3_valid   cv_4_valid   cv_5_valid
## pr_auc        0.73397154  0.046767615   0.72865367    0.6884452   0.69029164    0.7695594   0.79290783
## precision      0.6954579   0.10696869         0.72   0.67741936   0.65909094   0.56363636   0.85714287
## r2           0.067146584  0.094245516   0.06642159  0.036379173  -0.03291186  0.043711007   0.22213301
## recall         0.7991456   0.12003771    0.6923077         0.75    0.8787879      0.96875    0.7058824
## rmse          0.47191358  0.025731008   0.46954563   0.46821335    0.5012339   0.48766473   0.43291023
## specificity   0.71725804    0.2227794    0.8333333    0.8076923   0.67391306   0.35135135         0.92
##
## NULL
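Several figures in the model output above are tied together by simple arithmetic. As a rough sanity check, the base-R sketch below recomputes a few of them using only numbers copied from that output. The variable names are illustrative, and nothing here calls H2O.

```r
# Counts from the training confusion matrix at the F1-optimal threshold.
tn <- 155; fp <- 72; fn <- 28; tp <- 125

err_class0 <- fp / (tn + fp)   # error rate among actual 0s (72/227)
err_class1 <- fn / (fn + tp)   # error rate among actual 1s (28/153)
mean_per_class_error <- (err_class0 + err_class1) / 2

precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)

# Gini is a rescaling of AUC, and RMSE is the square root of MSE.
gini <- 2 * 0.8143445 - 1
rmse <- sqrt(0.2177341)

# Per-fold accuracies from the cross-validation summary; the "mean" and "sd"
# columns correspond to the mean and sample standard deviation of the folds.
cv_accuracy <- c(0.7794118, 0.7875, 0.75949365, 0.6376812, 0.8333333)
cv_mean <- mean(cv_accuracy)
cv_sd <- sd(cv_accuracy)

print(round(c(mean_per_class_error = mean_per_class_error, f1 = f1,
              gini = gini, rmse = rmse, cv_mean = cv_mean, cv_sd = cv_sd), 6))
```

The recomputed values should agree with the reported Mean Per-Class Error, max f1, Gini, RMSE, and the accuracy row of the cross-validation summary up to rounding.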
# Generate predictions (here on the training frame itself, since no separate test set is used):
pred <- h2o.predict(pros_nb, newdata = prostate)
##   predict           p0          p1
## 1       0 0.9415726386 0.058427361
## 2       1 0.0004140885 0.999585911
## 3       1 0.0047612605 0.995238740
## 4       1 0.0038528573 0.996147143
## 5       0 0.9950751924 0.004924808
## 6       1 0.0001460359 0.999853964
##
## [380 rows x 3 columns]
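The `predict` column is a hard label derived from the class probabilities `p0` and `p1`. The sketch below rests on two assumptions that the output itself does not state: that the label comes from comparing `p1` with a single cutoff, and that the cutoff is the max-F1 threshold reported for the training data (0.097343). `pred_head` and `threshold` are illustrative names, and the six rows are copied from the output above.

```r
# First six prediction rows, copied from the printed H2OFrame.
pred_head <- data.frame(
  predict = c(0, 1, 1, 1, 0, 1),
  p0 = c(0.9415726386, 0.0004140885, 0.0047612605,
         0.0038528573, 0.9950751924, 0.0001460359),
  p1 = c(0.058427361, 0.999585911, 0.995238740,
         0.996147143, 0.004924808, 0.999853964)
)

# Assumed cutoff: the training max-F1 threshold from the model output.
threshold <- 0.097343

# Label a row as class 1 when its class-1 probability reaches the cutoff.
label <- as.integer(pred_head$p1 >= threshold)

# Compare the derived labels with the reported ones, and check that the
# two class probabilities sum to one for every row.
labels_match <- all(label == pred_head$predict)
probs_sum_to_one <- all(abs(pred_head$p0 + pred_head$p1 - 1) < 1e-6)
print(c(labels_match = labels_match, probs_sum_to_one = probs_sum_to_one))
```

With a cutoff this far below 0.5, a row can be labelled 1 even when `p0` exceeds `p1`, so the label is not simply the more probable class.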