Use the random forest algorithm to predict the probability that a customer defaults. The input data is the german_credit dataset.
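# packages used in this section
library(mlr3); library(mlr3learners); library(mlr3tuning); library(mlr3hyperband)
library(paradox); library(data.table); library(dplyr); library(knitr)
task_classif <- tsk('german_credit')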
task_classif
## <TaskClassif:german_credit> (1000 x 21)
## * Target: credit_risk
## * Properties: twoclass
## * Features (20):
## - fct (14): credit_history, employment_duration, foreign_worker,
## housing, job, other_debtors, other_installment_plans,
## people_liable, personal_status_sex, property, purpose, savings,
## status, telephone
## - int (3): age, amount, duration
## - ord (3): installment_rate, number_credits, present_residence
classif_learner <- lrn('classif.ranger')
classif_learner
## <LearnerClassifRanger:classif.ranger>
## * Model: -
## * Parameters: num.threads=1
## * Packages: ranger
## * Predict Type: response
## * Feature types: logical, integer, numeric, character, factor, ordered
## * Properties: importance, multiclass, oob_error, twoclass, weights
# choose the prediction type: return class probabilities instead of labels
classif_learner$predict_type <- "prob"
classif_learner$param_set %>%
  as.data.table() %>%
  select(id, class, lower, upper, levels) %>%
  kable()
| id | class | lower | upper | levels |
|---|---|---|---|---|
| num.trees | ParamInt | 1 | Inf | NULL |
| mtry | ParamInt | 1 | Inf | NULL |
| importance | ParamFct | NA | NA | none , impurity , impurity_corrected, permutation |
| write.forest | ParamLgl | NA | NA | TRUE, FALSE |
| min.node.size | ParamInt | 1 | Inf | NULL |
| replace | ParamLgl | NA | NA | TRUE, FALSE |
| sample.fraction | ParamDbl | 0 | 1 | NULL |
| class.weights | ParamDbl | -Inf | Inf | NULL |
| splitrule | ParamFct | NA | NA | gini , extratrees |
| num.random.splits | ParamInt | 1 | Inf | NULL |
| split.select.weights | ParamDbl | 0 | 1 | NULL |
| always.split.variables | ParamUty | NA | NA | NULL |
| respect.unordered.factors | ParamFct | NA | NA | ignore , order , partition |
| scale.permutation.importance | ParamLgl | NA | NA | TRUE, FALSE |
| keep.inbag | ParamLgl | NA | NA | TRUE, FALSE |
| holdout | ParamLgl | NA | NA | TRUE, FALSE |
| num.threads | ParamInt | 1 | Inf | NULL |
| save.memory | ParamLgl | NA | NA | TRUE, FALSE |
| verbose | ParamLgl | NA | NA | TRUE, FALSE |
| oob.error | ParamLgl | NA | NA | TRUE, FALSE |
| max.depth | ParamInt | -Inf | Inf | NULL |
| alpha | ParamDbl | -Inf | Inf | NULL |
| min.prop | ParamDbl | -Inf | Inf | NULL |
| regularization.factor | ParamUty | NA | NA | NULL |
| regularization.usedepth | ParamLgl | NA | NA | TRUE, FALSE |
| seed | ParamInt | -Inf | Inf | NULL |
| minprop | ParamDbl | -Inf | Inf | NULL |
| se.method | ParamFct | NA | NA | jack , infjack |
ps_ranger = ps(
  num.trees = p_int(300, 800, tags = "budget"),
  mtry = p_int(8, 15),
  sample.fraction = p_dbl(0.7, 0.8)
)
# cross-validation with 5 folds
resampling_inner = rsmp("cv", folds = 5)
resampling_inner
## <ResamplingCV> with 5 iterations
## * Instantiated: FALSE
## * Parameters: folds=5
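To make the resampling step concrete, here is a minimal base R sketch of what a 5-fold split of 1000 rows looks like. It is not how mlr3 implements `rsmp("cv")` internally; the seed and the helper names (`fold_id`, `splits`) are assumptions made only for illustration.

```r
# Sketch of 5-fold cross-validation splits in base R (illustrative only).
set.seed(42)     # hypothetical seed, only to make the sketch reproducible
n_rows  <- 1000  # german_credit has 1000 rows
n_folds <- 5

# Assign every row to exactly one fold, in random order.
fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))

# For fold k, rows in fold k form the test set; all other rows form the train set.
splits <- lapply(seq_len(n_folds), function(k) {
  list(train = which(fold_id != k), test = which(fold_id == k))
})

# Size of each train/test set per fold.
sapply(splits, function(s) c(train = length(s$train), test = length(s$test)))
```

Each row is used for testing exactly once, so the AUC reported during tuning is an average over five models, each trained on 800 rows and scored on the remaining 200.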
measure = msr("classif.auc")
measure
## <MeasureBinarySimple:classif.auc>
## * Packages: mlr3measures
## * Range: [0, 1]
## * Minimize: FALSE
## * Parameters: list()
## * Properties: -
## * Predict type: prob
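The AUC needs probabilities rather than labels because it measures how well the scores rank positives above negatives. The sketch below computes it with the rank-sum (Mann-Whitney) formula on a toy example; `auc_rank` is a hypothetical helper, the five observations are made up for illustration, and this is not the code that `mlr3measures` runs.

```r
# AUC via the rank-sum formula (base R sketch, illustrative only).
auc_rank <- function(truth, prob, positive) {
  is_pos <- truth == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r <- rank(prob)  # average ranks, so tied scores count as half a win
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (hypothetical): predicted probability of class "good".
truth     <- c("good", "bad", "good", "good", "bad")
prob_good <- c(0.90, 0.20, 0.95, 0.60, 0.70)

auc_toy <- auc_rank(truth, prob_good, positive = "good")
auc_toy  # 5 of the 6 good/bad pairs are ranked correctly, so this is 5/6
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect ranking, which is why the measure above reports `Minimize: FALSE`.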
tuner = tnr("hyperband", eta = 2)
tuner
## <TunerHyperband>
## * Parameters: eta=2
## * Parameter classes: ParamLgl, ParamInt, ParamDbl, ParamFct
## * Properties: dependencies, single-crit, multi-crit
## * Packages: -
tune_single_crit = TuningInstanceSingleCrit$new(
  task = task_classif,
  learner = classif_learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = trm("none"), # hyperband terminates itself
  search_space = ps_ranger
)
tune_single_crit
## <TuningInstanceSingleCrit>
## * State: Not optimized
## * Objective: <ObjectiveTuning:classif.ranger_on_german_credit>
## * Search Space:
## <ParamSet>
## id class lower upper nlevels default value
## 1: num.trees ParamInt 300.0 800.0 501 <NoDefault[3]>
## 2: mtry ParamInt 8.0 15.0 8 <NoDefault[3]>
## 3: sample.fraction ParamDbl 0.7 0.8 Inf <NoDefault[3]>
## * Terminator: <TerminatorNone>
## * Terminated: FALSE
## * Archive:
## <ArchiveTuning>
## Null data.table (0 rows and 0 cols)
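# run the tuner; this fills the archive with the evaluated configurations
tuner$optimize(tune_single_crit)
tune_single_crit$archive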
## <ArchiveTuning>
## mtry sample.fraction num.trees bracket bracket_stage budget_scaled
## 1: 11 0.72 400 1 0 1.3
## 2: 9 0.76 400 1 0 1.3
## 3: 9 0.76 800 1 1 2.7
## 4: 9 0.76 800 0 0 2.7
## 5: 14 0.78 800 0 0 2.7
## budget_real n_configs classif.auc timestamp batch_nr
## 1: 400 2 0.79 2021-11-24 23:18:25 1
## 2: 400 2 0.79 2021-11-24 23:18:25 1
## 3: 800 1 0.79 2021-11-24 23:18:27 2
## 4: 800 2 0.79 2021-11-24 23:18:33 3
## 5: 800 2 0.79 2021-11-24 23:18:33 3
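The `bracket`, `bracket_stage`, `n_configs`, and budget columns in the archive follow from the hyperband schedule. The base R sketch below reconstructs that schedule from the budget range of `num.trees` (300 to 800) and `eta = 2`, using the standard Hyperband formulas. `hyperband_schedule` is a hypothetical helper written for illustration, not a function from mlr3hyperband, and the rounding of the real budget is an assumption.

```r
# Reconstruct a hyperband schedule from the budget range and eta (sketch).
hyperband_schedule <- function(budget_lower, budget_upper, eta) {
  r_max <- budget_upper / budget_lower           # maximum budget, scaled by the lower bound
  s_max <- floor(log(r_max) / log(eta) + 1e-8)   # largest bracket index (epsilon guards rounding)
  rows <- list()
  for (s in s_max:0) {
    n0 <- ceiling((s_max + 1) / (s + 1) * eta^s) # configurations entering bracket s
    for (stage in 0:s) {
      budget_scaled <- r_max * eta^(stage - s)   # budget grows by eta at each stage
      rows[[length(rows) + 1]] <- data.frame(
        bracket       = s,
        bracket_stage = stage,
        n_configs     = floor(n0 / eta^stage),   # only the best 1/eta survive each stage
        budget_scaled = budget_scaled,
        budget_real   = round(budget_scaled * budget_lower)
      )
    }
  }
  do.call(rbind, rows)
}

schedule <- hyperband_schedule(300, 800, eta = 2)
print(schedule)
```

With these inputs the sketch yields two brackets: bracket 1 evaluates 2 configurations at 400 trees and promotes 1 to 800 trees, while bracket 0 evaluates 2 configurations directly at 800 trees. That matches the five archive rows shown above. A wider budget range or a larger number of brackets would explore more configurations.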
Apply the best hyperparameters and train on the entire dataset
tuned_learner <- classif_learner$clone()
tuned_learner$param_set$values = tune_single_crit$result_learner_param_vals # best parameters
tuned_learner$train(task_classif)
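Prediction results on the training data (in-sample, so they are more optimistic than results on unseen customers would be)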
tuned_learner$predict(task_classif)
## <PredictionClassif> for 1000 observations:
## row_ids truth response prob.good prob.bad
## 1 good good 0.8959236 0.10407639
## 2 bad bad 0.2060665 0.79393353
## 3 good good 0.9492312 0.05076885
## ---
## 998 good good 0.9708611 0.02913889
## 999 bad bad 0.1580159 0.84198413
## 1000 good good 0.6234355 0.37656448
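The `response` column is derived from the probabilities, so a stricter cutoff changes which customers are labelled `good`. The sketch below reuses the six rows printed above; the 0.7 cutoff and the helper `label_at` are assumptions chosen purely for illustration. In mlr3 itself, a `PredictionClassif` object has a `set_threshold()` method for the same purpose.

```r
# How the predicted label depends on the probability threshold (base R sketch).
pred <- data.frame(
  row_id    = c(1, 2, 3, 998, 999, 1000),
  truth     = c("good", "bad", "good", "good", "bad", "good"),
  prob_good = c(0.8959236, 0.2060665, 0.9492312, 0.9708611, 0.1580159, 0.6234355)
)

# Label a customer "good" only if prob_good reaches the threshold.
label_at <- function(prob_good, threshold) {
  ifelse(prob_good >= threshold, "good", "bad")
}

pred$response_default <- label_at(pred$prob_good, 0.5)  # same as picking the likelier class
pred$response_strict  <- label_at(pred$prob_good, 0.7)  # hypothetical, more conservative cutoff
print(pred)
```

At the default cutoff all six rows match the truth; at 0.7, row 1000 (probability of `good` about 0.62) flips to `bad`. A lender who considers a missed default more costly than a rejected good customer might prefer such a stricter cutoff.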