The data can be downloaded from https://www.kaggle.com/dalpozz/creditcardfraud. You need to unzip the downloaded archive before reading it.
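Assuming the archive is saved as creditcardfraud.zip (the actual file name on your machine may differ), it can be unzipped directly from R:
# hypothetical file name; adjust to match what Kaggle gives you
unzip("creditcardfraud.zip")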
# read.csv on ~285,000 rows is quite slow, but OK
credit_card = read.csv("creditcard.csv")
Let’s take a look at the data:
str(credit_card)
## 'data.frame': 284807 obs. of 31 variables:
## $ Time : num 0 0 1 1 2 2 4 7 7 9 ...
## $ V1 : num -1.36 1.192 -1.358 -0.966 -1.158 ...
## $ V2 : num -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
## $ V3 : num 2.536 0.166 1.773 1.793 1.549 ...
## $ V4 : num 1.378 0.448 0.38 -0.863 0.403 ...
## $ V5 : num -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
## $ V6 : num 0.4624 -0.0824 1.8005 1.2472 0.0959 ...
## $ V7 : num 0.2396 -0.0788 0.7915 0.2376 0.5929 ...
## $ V8 : num 0.0987 0.0851 0.2477 0.3774 -0.2705 ...
## $ V9 : num 0.364 -0.255 -1.515 -1.387 0.818 ...
## $ V10 : num 0.0908 -0.167 0.2076 -0.055 0.7531 ...
## $ V11 : num -0.552 1.613 0.625 -0.226 -0.823 ...
## $ V12 : num -0.6178 1.0652 0.0661 0.1782 0.5382 ...
## $ V13 : num -0.991 0.489 0.717 0.508 1.346 ...
## $ V14 : num -0.311 -0.144 -0.166 -0.288 -1.12 ...
## $ V15 : num 1.468 0.636 2.346 -0.631 0.175 ...
## $ V16 : num -0.47 0.464 -2.89 -1.06 -0.451 ...
## $ V17 : num 0.208 -0.115 1.11 -0.684 -0.237 ...
## $ V18 : num 0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
## $ V19 : num 0.404 -0.146 -2.262 -1.233 0.803 ...
## $ V20 : num 0.2514 -0.0691 0.525 -0.208 0.4085 ...
## $ V21 : num -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
## $ V22 : num 0.27784 -0.63867 0.77168 0.00527 0.79828 ...
## $ V23 : num -0.11 0.101 0.909 -0.19 -0.137 ...
## $ V24 : num 0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
## $ V25 : num 0.129 0.167 -0.328 0.647 -0.206 ...
## $ V26 : num -0.189 0.126 -0.139 -0.222 0.502 ...
## $ V27 : num 0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
## $ V28 : num -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
## $ Amount: num 149.62 2.69 378.66 123.5 69.99 ...
## $ Class : int 0 0 0 0 0 0 0 0 0 0 ...
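Note how extremely imbalanced the classes are: frauds (Class = 1) make up well under 1% of the 284,807 transactions (492 frauds in total, as the confusion matrices further down confirm). A quick way to check the class distribution before modelling (not part of the original output):
# raw counts and proportions of legitimate (0) vs fraudulent (1) transactions
table(credit_card$Class)
prop.table(table(credit_card$Class))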
Let’s convert the Class column to a factor:
credit_card$Class = as.factor(credit_card$Class)
library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
h2o.init()
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## /var/folders/sn/2lcvwckj4lzgrv63m0k0fvr80000gn/T//RtmpkdIQS5/h2o_qdang_started_from_r.out
## /var/folders/sn/2lcvwckj4lzgrv63m0k0fvr80000gn/T//RtmpkdIQS5/h2o_qdang_started_from_r.err
##
##
## Starting H2O JVM and connecting: .. Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 391 milliseconds
## H2O cluster version: 3.10.3.6
## H2O cluster version age: 1 month and 1 day
## H2O cluster name: H2O_started_from_R_qdang_elw304
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.56 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.3.2 (2016-10-31)
##
## Note: As started, H2O is limited to the CRAN default of 2 CPUs.
## Shut down and restart H2O as shown below to use all your CPUs.
## > h2o.shutdown()
## > h2o.init(nthreads = -1)
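As the startup note says, H2O launched with the CRAN defaults is limited to 2 CPUs. If you want the random forest below to use all available cores, you can optionally restart the cluster first, exactly as the note suggests:
# shut down the 2-CPU cluster and restart with all available cores
h2o.shutdown(prompt = FALSE)
h2o.init(nthreads = -1)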
# First RF model: copy the data into the H2O cluster
h2o_credit = as.h2o(credit_card)
divide = h2o.splitFrame(h2o_credit, ratios = 0.8)  # roughly 80% train / 20% validation
train = divide[[1]]
test = divide[[2]]
# columns 1-30 (Time, V1-V28, Amount) are predictors; column 31 is Class
rf1 = h2o.randomForest(x = 1:30, y = 31, training_frame = train, validation_frame = test, ntrees = 500)
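Note that h2o.splitFrame samples rows at random, so the split is only approximately 80/20 and is not stratified on Class. With so few fraud cases it can be worth confirming that both frames actually contain positives; a minimal check (not part of the original run) could be:
# class counts in each split
h2o.table(train["Class"])
h2o.table(test["Class"])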
Let’s see what we achieve:
rf1
## Model Details:
## ==============
##
## H2OBinomialModel: drf
## Model ID: DRF_model_R_1490256548604_1
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 500 500 1210747 18
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 20 19.99000 110 241 187.10600
##
##
## H2OBinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
##
## MSE: 0.0003856468
## RMSE: 0.01963789
## LogLoss: 0.002653083
## Mean Per-Class Error: 0.08694796
## AUC: 0.9718499
## Gini: 0.9436998
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 227282 21 0.000092 =21/227303
## 1 69 328 0.173804 =69/397
## Totals 227351 349 0.000395 =90/227700
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.367021 0.879357 157
## 2 max f2 0.278351 0.857580 174
## 3 max f0point5 0.517242 0.928196 139
## 4 max accuracy 0.517242 0.999605 139
## 5 max precision 1.000000 1.000000 0
## 6 max recall 0.000008 1.000000 399
## 7 max specificity 1.000000 1.000000 0
## 8 max absolute_mcc 0.367021 0.880991 157
## 9 max min_per_class_accuracy 0.000322 0.929471 386
## 10 max mean_per_class_accuracy 0.006192 0.951290 347
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: drf
## ** Reported on validation data. **
##
## MSE: 0.0004829403
## RMSE: 0.0219759
## LogLoss: 0.003279189
## Mean Per-Class Error: 0.110614
## AUC: 0.9751498
## Gini: 0.9502997
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 57002 10 0.000175 =10/57012
## 1 21 74 0.221053 =21/95
## Totals 57023 84 0.000543 =31/57107
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.336000 0.826816 65
## 2 max f2 0.097960 0.807453 82
## 3 max f0point5 0.550000 0.909091 50
## 4 max accuracy 0.550000 0.999475 50
## 5 max precision 1.000000 1.000000 0
## 6 max recall 0.000022 1.000000 393
## 7 max specificity 1.000000 1.000000 0
## 8 max absolute_mcc 0.336000 0.828115 65
## 9 max min_per_class_accuracy 0.000246 0.915544 368
## 10 max mean_per_class_accuracy 0.010401 0.923150 222
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
On the validation data, this very first random forest already achieves an AUC of 0.9751 and a maximum F1-score of 0.8268.
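These figures can also be pulled out of the model programmatically rather than read off the printed summary; a minimal sketch using standard h2o accessors:
# metrics computed on the validation frame
perf = h2o.performance(rf1, valid = TRUE)
h2o.auc(perf)               # validation AUC
h2o.F1(perf)                # F1 at every threshold; its maximum matches the summary above
h2o.confusionMatrix(perf)   # confusion matrix at the F1-optimal threshold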