1 Giới thiệu

Việc ứng dụng Machine Learning vào các bài toán dự báo kinh tế hiện nay đang rất phổ biến và được sử dụng rộng rãi. Các phương pháp classification phải kể đến như: Logistic, Decision Tree, Random Forest, Boosting family, Support Vector Machine, Neural Network… Tuy nhiên trong một số trường hợp, việc áp dụng thuật toán tuy mang tính chính xác cao nhưng lại không đạt được điều mà bài toán thực sự mong muốn. Đó là trường hợp về Imbalanced Data. Một ví dụ thực tế: trong Card Attrition trong ngân hàng, mỗi tháng chỉ có 1% thẻ tín dụng bị đóng mỗi tháng, do vậy một model với tỉ lệ cutoff cao có thể dự báo tất các các thẻ sẽ không đóng mang lại accuracy là 99% tuy nhiên lại không đi vào được mục tiêu là tìm được những thẻ đóng. Điều quan trọng trong bài toán này là phải có bước xử lý imbalanced Data trước khi dự báo với một số phương pháp hữu hiệu như sau:

Oversampling: Nhân số lượng dữ liệu bị chênh lệch lên nhiều lần để có một data cân xứng hơn
Undersampling: Giảm số lượng dữ liệu chênh lệch xuống, tuy nhiên chỉ nên áp dụng với trường hợp dữ liệu lớn
Both oversampling and undersampling: Sử dụng cả 2 phương pháp nêu trên
SMOTE: Cách thức giống với oversampling tuy nhiên dữ liệu được thêm vào sẽ là dữ liệu nhân tạo được tính dựa trên phương pháp boostrap và K-nearest neighbors

Bài viết này là một project dự báo thu nhập của các khách hàng dựa trên 15 biến độc lập liệu rằng những khách hàng trên sẽ có thu nhập lớn hơn 50000 USD/năm hay thấp hơn.

2 Xử lý dữ liệu

Data:

Train: https://www.analyticsvidhya.com/wp-content/uploads/2016/09/train.zip

Test : https://www.analyticsvidhya.com/wp-content/uploads/2016/09/test.zip

#load data:
train <- read.csv("D:/4. BA - R/4. LECTURE & TRAINING/2. TRAINING/10. IMBALANCED DATA/train.csv")
test <- read.csv("D:/4. BA - R/4. LECTURE & TRAINING/2. TRAINING/10. IMBALANCED DATA/test.csv")

Xem kết cấu của bảng dữ liệu

train %>% str

## 'data.frame':    199523 obs. of  41 variables:
##  $ age                             : int  73 58 18 9 10 48 42 28 47 34 ...
##  $ class_of_worker                 : Factor w/ 9 levels "Federal government",..: 4 7 4 4 4 5 5 5 2 5 ...
##  $ industry_code                   : int  0 4 0 0 0 40 34 4 43 4 ...
##  $ occupation_code                 : int  0 34 0 0 0 10 3 40 26 37 ...
##  $ education                       : Factor w/ 17 levels "10th grade","11th grade",..: 13 17 1 11 11 17 10 13 17 17 ...
##  $ wage_per_hour                   : int  0 0 0 0 0 1200 0 0 876 0 ...
##  $ enrolled_in_edu_inst_lastwk     : Factor w/ 3 levels "College or university",..: 3 3 2 3 3 3 3 3 3 3 ...
##  $ marital_status                  : Factor w/ 7 levels "Divorced","Married-A F spouse present",..: 7 1 5 5 5 3 3 5 3 3 ...
##  $ major_industry_code             : Factor w/ 24 levels "Agriculture",..: 15 5 15 15 15 7 8 5 6 5 ...
##  $ major_occupation_code           : Factor w/ 15 levels "Adm support including clerical",..: 7 9 7 7 7 11 3 5 1 6 ...
##  $ race                            : Factor w/ 5 levels "Amer Indian Aleut or Eskimo",..: 5 5 2 5 5 1 5 5 5 5 ...
##  $ hispanic_origin                 : Factor w/ 9 levels "All other","Central or South American",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex                             : Factor w/ 2 levels "Female","Male": 1 2 1 1 1 1 2 1 1 2 ...
##  $ member_of_labor_union           : Factor w/ 3 levels "No","Not in universe",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ reason_for_unemployment         : Factor w/ 6 levels "Job leaver","Job loser - on layoff",..: 4 4 4 4 4 4 4 2 4 4 ...
##  $ full_parttime_employment_stat   : Factor w/ 8 levels "Children or Armed Forces",..: 3 1 3 1 1 2 1 7 2 1 ...
##  $ capital_gains                   : int  0 0 0 0 0 0 5178 0 0 0 ...
##  $ capital_losses                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ dividend_from_Stocks            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tax_filer_status                : Factor w/ 6 levels "Head of household",..: 5 1 5 5 5 3 3 6 3 3 ...
##  $ region_of_previous_residence    : Factor w/ 6 levels "Abroad","Midwest",..: 4 5 4 4 4 4 4 4 4 4 ...
##  $ state_of_previous_residence     : Factor w/ 50 levels "Abroad","Alabama",..: 36 5 36 36 36 36 36 36 36 36 ...
##  $ d_household_family_stat         : Factor w/ 38 levels "Child <18 ever marr not in subfamily",..: 30 21 8 3 3 37 21 36 37 21 ...
##  $ d_household_summary             : Factor w/ 8 levels "Child 18 or older",..: 7 5 1 3 3 8 5 6 8 5 ...
##  $ migration_msa                   : Factor w/ 9 levels "Abroad to MSA",..: NA 3 NA 5 5 NA 5 NA NA 5 ...
##  $ migration_reg                   : Factor w/ 8 levels "Abroad","Different county same state",..: NA 8 NA 6 6 NA 6 NA NA 6 ...
##  $ migration_within_reg            : Factor w/ 9 levels "Abroad","Different county same state",..: NA 9 NA 7 7 NA 7 NA NA 7 ...
##  $ live_1_year_ago                 : Factor w/ 3 levels "No","Not in universe under 1 year old",..: 2 1 2 3 3 2 3 2 2 3 ...
##  $ migration_sunbelt               : Factor w/ 3 levels "No","Not in universe",..: NA 3 NA 2 2 NA 2 NA NA 2 ...
##  $ num_person_Worked_employer      : int  0 1 0 0 0 1 6 4 5 6 ...
##  $ family_members_under_18         : Factor w/ 5 levels "Both parents present",..: 5 5 5 1 1 5 5 5 5 5 ...
##  $ country_father                  : Factor w/ 42 levels "Cambodia","Canada",..: 40 40 41 40 40 31 40 40 40 40 ...
##  $ country_mother                  : Factor w/ 42 levels "Cambodia","Canada",..: 40 40 41 40 40 40 40 40 40 40 ...
##  $ country_self                    : Factor w/ 42 levels "Cambodia","Canada",..: 40 40 41 40 40 40 40 40 40 40 ...
##  $ citizenship                     : Factor w/ 5 levels "Foreign born- Not a citizen of U S",..: 5 5 1 5 5 5 5 5 5 5 ...
##  $ business_or_self_employed       : int  0 0 0 0 0 2 0 0 0 0 ...
##  $ fill_questionnaire_veteran_admin: Factor w/ 3 levels "No","Not in universe",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ veterans_benefits               : int  2 2 2 0 0 2 2 2 2 2 ...
##  $ weeks_worked_in_year            : int  0 52 0 0 0 52 52 30 52 52 ...
##  $ year                            : int  95 94 95 94 94 95 94 95 95 94 ...
##  $ income_level                    : int  -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 -50000 ...

Biến cần dự báo là biến income_level, tuy nhiên tên giữa 2 tập train và test là khác nhau.

train$income_level %>% unique

## [1] -50000  50000

test$income_level %>% unique

## [1] -50000    50000+.
## Levels: -50000  50000+.

Ta sẽ biến đổi về dạng 0 và 1, với 1 là những người có mức thu nhập cao hơn 50000 USD và 0 là thấp hơn 50000 USD

train$income_level <- train$income_level %>% 
  as.factor
test$income_level <- test$income_level %>% 
  as.factor

train$income_level <- recode_factor(train$income_level, 
                         "-50000" = "0", 
                         "50000" = "1")
test$income_level <- recode_factor(test$income_level, 
                                    "-50000" = "0", 
                                    " 50000+." = "1")

Tỉ lệ giữa các thành phần trong biến cần dự báo - income_level

round(prop.table(table(train$income_level))*100)

## 
##  0  1 
## 94  6

Như vậy tỉ lệ giữa 2 factor trong biến income_level đang có sự chênh lệch rất lớn: 94% và 6%. Do vậy ta có thể kết luận rằng dữ liệu trên đang có sự chênh lệch: imbalanced Data.

Ta phân chia tập dữ liệu trên thành 2 phần gồm những biến numeric và biến factor để có thể xử lý riêng biệt

nums <- sapply(train, is.numeric)
cat <- sapply(train, is.factor)

cat_train <- train[,cat]
num_train <- train[,nums]

cat_test <- test[,cat]
num_test <- test[,nums]

Check xem liệu rằng dữ liệu có missing value trên tập numeric

table(is.na(num_train))

## 
##   FALSE 
## 2394276

table(is.na(num_test))

## 
##   FALSE 
## 1197144

Như vậy, không có biến nào có missing value. Bước tiếp theo chúng ta sẽ loại bỏ những biến có correlation cao hơn 0.7 so với biến phụ thuộc.

high.correlation <- findCorrelation(x = cor(num_train), cutoff = 0.7)
num_train <- num_train[,-high.correlation] 
num_test <- num_test[,-high.correlation]

Xử lý missing value của tập factor. Ta kiểm tra xem giữa các biến có tỉ lệ bao nhiêu % value bị missing

factor.missing.train <- sapply(cat_train, function(x){sum(is.na(x))/length(x)})*100
factor.missing.test <- sapply(cat_test, function(x){sum(is.na(x)/length(x))}*100)
factor.missing.train

##                  class_of_worker                        education 
##                        0.0000000                        0.0000000 
##      enrolled_in_edu_inst_lastwk                   marital_status 
##                        0.0000000                        0.0000000 
##              major_industry_code            major_occupation_code 
##                        0.0000000                        0.0000000 
##                             race                  hispanic_origin 
##                        0.0000000                        0.4380447 
##                              sex            member_of_labor_union 
##                        0.0000000                        0.0000000 
##          reason_for_unemployment    full_parttime_employment_stat 
##                        0.0000000                        0.0000000 
##                 tax_filer_status     region_of_previous_residence 
##                        0.0000000                        0.0000000 
##      state_of_previous_residence          d_household_family_stat 
##                        0.3548463                        0.0000000 
##              d_household_summary                    migration_msa 
##                        0.0000000                       49.9671717 
##                    migration_reg             migration_within_reg 
##                       49.9671717                       49.9671717 
##                  live_1_year_ago                migration_sunbelt 
##                        0.0000000                       49.9671717 
##          family_members_under_18                   country_father 
##                        0.0000000                        3.3645244 
##                   country_mother                     country_self 
##                        3.0668144                        1.7005558 
##                      citizenship fill_questionnaire_veteran_admin 
##                        0.0000000                        0.0000000 
##                     income_level 
##                        0.0000000

factor.missing.test

##                  class_of_worker                        education 
##                                0                                0 
##      enrolled_in_edu_inst_lastwk                   marital_status 
##                                0                                0 
##              major_industry_code            major_occupation_code 
##                                0                                0 
##                             race                  hispanic_origin 
##                                0                                0 
##                              sex            member_of_labor_union 
##                                0                                0 
##          reason_for_unemployment    full_parttime_employment_stat 
##                                0                                0 
##                 tax_filer_status     region_of_previous_residence 
##                                0                                0 
##      state_of_previous_residence          d_household_family_stat 
##                                0                                0 
##              d_household_summary                    migration_msa 
##                                0                                0 
##                    migration_reg             migration_within_reg 
##                                0                                0 
##                  live_1_year_ago                migration_sunbelt 
##                                0                                0 
##          family_members_under_18                   country_father 
##                                0                                0 
##                   country_mother                     country_self 
##                                0                                0 
##                      citizenship fill_questionnaire_veteran_admin 
##                                0                                0 
##                     income_level 
##                                0

Ta sẽ loại bỏ những biến có tỉ lệ missing cao hơn 5%

delete <- subset(cat_train, select = factor.missing.train >= 5 )
cat_train <- subset(cat_train, select = factor.missing.train < 5 )

cat_test <- cat_test %>% 
  select(-migration_msa, -migration_reg, -migration_within_reg, -migration_sunbelt)

Đối với những biến có missing chúng ta sẽ ghi nhận là Unavailable trên tập train

#convert to characters on train
indx <- sapply(cat_train, is.factor)
cat_train[indx] <- lapply(cat_train[indx], function(x) {as.character(x)})
#filling missing = "unavailable"
for (i in seq_along(cat_train)) {
  set(cat_train, i=which(is.na(cat_train[[i]])), j=i, value = "Unavailable")
}
#back to factor data
cat_train[indx] <- lapply(cat_train[indx], function(x) {as.factor(x)})

Đối với những biến có missing chúng ta sẽ ghi nhận là Unavailable trên tập test

#convert to characters on test
indx1 <- sapply(cat_test, is.factor)
cat_test[indx1] <- lapply(cat_test[indx1], function(x)  {as.character(x)})

#filling missing = "unavailable"
for (i in seq_along(cat_test)) {
  set(cat_test, i=which(is.na(cat_test[[i]])), j=i, value="Unavailable")
}  

#back to factor data
cat_test[indx1] <- lapply(cat_test[indx1], function(x) {as.factor(x)})

Trong các biến factor của tập factor data có rất nhiều biến với low frequency, do vậy những giá trị này sẽ không có nhiều impact đến model. Ta nên combine những biến này về chung một loại giá trị Other để thể hiện các giá trị này. Những biến có frequency dưới 5% sẽ được combine về giá trị Other

#train
for(i in names(cat_train)){
  p <- 5/100
  ld <- names(which(prop.table(table(cat_train[[i]])) < p))
  levels(cat_train[[i]])[levels(cat_train[[i]]) %in% ld] <- "Other"
}
#test
for(i in names(cat_test)){
  p <- 5/100
  ld <- names(which(prop.table(table(cat_test[[i]])) < p))
  levels(cat_test[[i]])[levels(cat_test[[i]]) %in% ld] <- "Other"
}

Thay đổi biến tuổi thành dạng factor trong khoảng giá trị

#feature engineering
num_train$age1[num_train$age <= 30] <- "young"
num_train$age1[num_train$age <= 60 & num_train$age > 30] <- "adult"
num_train$age1[num_train$age > 60] <- "old"
num_train$age1 <- num_train$age1 %>% as.factor

num_test$age1[num_test$age <= 30] <- "young"
num_test$age1[num_test$age <= 60 & num_test$age > 30] <- "adult"
num_test$age1[num_test$age > 60] <- "old"
num_test$age1 <- num_test$age1 %>% as.factor

Trong tập numeric có những biến với Zero Mean Variance. Đây là những biến không có tác động đến model, ngoài việc xử lý bằng cách xóa khỏi tập dữ liệu, ta có thể biến đổi dựa trên tính chất của biến

nearzeromean <- nearZeroVar(num_train)
names(num_train)[nearzeromean]

## [1] "wage_per_hour"        "capital_gains"        "capital_losses"      
## [4] "dividend_from_Stocks"

num_train %>% 
  select (nearzeromean) %>% 
  summary

##  wage_per_hour     capital_gains     capital_losses   
##  Min.   :   0.00   Min.   :    0.0   Min.   :   0.00  
##  1st Qu.:   0.00   1st Qu.:    0.0   1st Qu.:   0.00  
##  Median :   0.00   Median :    0.0   Median :   0.00  
##  Mean   :  55.43   Mean   :  434.7   Mean   :  37.31  
##  3rd Qu.:   0.00   3rd Qu.:    0.0   3rd Qu.:   0.00  
##  Max.   :9999.00   Max.   :99999.0   Max.   :4608.00  
##  dividend_from_Stocks
##  Min.   :    0.0     
##  1st Qu.:    0.0     
##  Median :    0.0     
##  Mean   :  197.5     
##  3rd Qu.:    0.0     
##  Max.   :99999.0

Do đa phần những biến trên có giá trị chủ yếu là 0 nên ta có thể biến những biến trên về dạng factor với 2 giá trị: Zero và Morethanzero

num_train$wage_per_hour1 <- ifelse(num_train$wage_per_hour == 0, "Zero", "MorethanZero")
num_train$wage_per_hour1 <- num_train$wage_per_hour1 %>% as.factor

num_test$wage_per_hour1 <- ifelse(num_test$wage_per_hour == 0, "Zero", "MorethanZero")
num_test$wage_per_hour1 <- num_test$wage_per_hour1 %>% as.factor

num_train$capital_gains1 <- ifelse(num_train$capital_gains == 0, "Zero", "MorethanZero")
num_train$capital_gains1 <- num_train$capital_gains1 %>% as.factor

num_test$capital_gains1 <- ifelse(num_test$capital_gains == 0, "Zero", "MorethanZero")
num_test$capital_gains1 <- num_test$capital_gains1 %>% as.factor

num_train$capital_losses1 <- ifelse(num_train$capital_losses == 0, "Zero", "MorethanZero")
num_train$capital_losses1 <- num_train$capital_losses1 %>% as.factor

num_test$capital_losses1 <- ifelse(num_test$capital_losses == 0, "Zero", "MorethanZero")
num_test$capital_losses1 <- num_test$capital_losses1 %>% as.factor

num_train$dividend_from_Stocks1 <- ifelse(num_train$dividend_from_Stocks == 0, "Zero", "MorethanZero")
num_train$dividend_from_Stocks1 <- num_train$dividend_from_Stocks1 %>% as.factor

num_test$dividend_from_Stocks1 <- ifelse(num_test$dividend_from_Stocks == 0, "Zero", "MorethanZero")
num_test$dividend_from_Stocks1 <- num_test$dividend_from_Stocks1 %>% as.factor

num_train <- num_train %>% 
  select(-wage_per_hour, -capital_gains, -capital_losses, -dividend_from_Stocks, -age)

num_test <- num_test %>% 
  select(-wage_per_hour, -capital_gains, -capital_losses, -dividend_from_Stocks, -age)

Combine data từ 2 tập numeric và factor

d_train <- cbind(num_train,cat_train)
d_test <- cbind(num_test,cat_test)
#save memory
rm(num_train,num_test,cat_train,cat_test, delete, test, train)

Vì có lỗi trong giá trị của data train và data test, khi data test có thừa giá trị Space, nên ta phải điều chỉnh cả 2 data để có tính unique

fix_data <- function(x){
  str_replace_all(x,
                  c(" "), "") %>% as.factor
}

train.fix <- map_df(d_train, fix_data)
test.fix <- map_df(d_test, fix_data)

3 Xử lý có Imbalanced Data bằng các phương pháp

3.1 Chạy 4 phương pháp xử lý Imbalanced data

Oversampling

#oversampling
data_balanced_over <- ovun.sample(income_level ~ ., data = d_train, method = "over",N = 320000)$data
table(data_balanced_over$income_level)

## 
##      0      1 
## 187141 132859

Undersampling

#undersampling
data_balanced_under <- ovun.sample(income_level ~ ., data = d_train, method = "under", N = 50000, seed = 1)$data
table(data_balanced_under$income_level)

## 
##     0     1 
## 37618 12382

Both

#both
data_balanced_both <- ovun.sample(income_level ~ ., data = d_train, method = "both", p=0.5,N=200000, seed = 1)$data
table(data_balanced_both$income_level)

## 
##      0      1 
##  99838 100162

SMOTE

#SMOTE (data synthetically)
data.rose <- ROSE(income_level ~ ., data = d_train, seed = 1)$data
table(data.rose$income_level)

## 
##     0     1 
## 99614 99909

3.2 Khởi động H2O

#h2o
h2o.init(ip = "localhost", 
         port = 54321, 
         nthreads= -1,
         max_mem_size = "6g") #Đặt mức RAM tối đa

## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\thanhnm3\AppData\Local\Temp\Rtmpm6UAhp/h2o_thanhnm3_started_from_r.out
##     C:\Users\thanhnm3\AppData\Local\Temp\Rtmpm6UAhp/h2o_thanhnm3_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: ... Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         8 seconds 713 milliseconds 
##     H2O cluster version:        3.14.0.3 
##     H2O cluster version age:    1 month and 16 days  
##     H2O cluster name:           H2O_started_from_R_thanhnm3_las221 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   5.33 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.3.1 (2016-06-21)

3.3 Biến đổi data sang H2O data

data.ori <- as.h2o(d_train)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

data.rose <- as.h2o(data.rose)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

data.over <- as.h2o(data_balanced_over)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

data.under <- as.h2o(data_balanced_under)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

data.both <- as.h2o(data_balanced_both)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

h2o.test <- as.h2o(d_test)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

3.4 Build mô hình dựa trên các tập data này

set.seed(9999)
rf.model.ori <- h2o.randomForest(training_frame = data.ori,
                              x = 1:35, y = 36,
                              balance_classes = T)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   2%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |====                                                             |   6%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=========                                                        |  14%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |============                                                     |  18%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |==============                                                   |  22%
  |                                                                       
  |=================                                                |  26%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |====================                                             |  30%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |======================                                           |  34%
  |                                                                       
  |=========================                                        |  38%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |===========================                                      |  42%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |===================================                              |  54%
  |                                                                       
  |======================================                           |  58%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |========================================                         |  62%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |==============================================                   |  70%
  |                                                                       
  |================================================                 |  74%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |===================================================              |  78%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=====================================================            |  82%
  |                                                                       
  |========================================================         |  86%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |==========================================================       |  90%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |=============================================================    |  94%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%

set.seed(9999)
rf.model.rose <- h2o.randomForest(training_frame = data.rose,
                              x = 1:35, y = 36)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   2%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |====                                                             |   6%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |=========                                                        |  14%
  |                                                                       
  |============                                                     |  18%
  |                                                                       
  |==============                                                   |  22%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |====================                                             |  30%
  |                                                                       
  |======================                                           |  34%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |===================================                              |  54%
  |                                                                       
  |======================================                           |  58%
  |                                                                       
  |========================================                         |  62%
  |                                                                       
  |===========================================                      |  66%
  |                                                                       
  |==============================================                   |  70%
  |                                                                       
  |================================================                 |  74%
  |                                                                       
  |===================================================              |  78%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=====================================================            |  82%
  |                                                                       
  |========================================================         |  86%
  |                                                                       
  |==========================================================       |  90%
  |                                                                       
  |=============================================================    |  94%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%

set.seed(9999)
rf.model.over <- h2o.randomForest(training_frame = data.over,
                              x = 1:35, y = 36)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=                                                                |   2%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |====                                                             |   6%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=========                                                        |  14%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |============                                                     |  18%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |==============                                                   |  22%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |====================                                             |  30%
  |                                                                       
  |======================                                           |  34%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |=========================                                        |  38%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |===========================                                      |  42%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |==============================                                   |  46%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |===================================                              |  54%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |======================================                           |  58%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |===========================================                      |  66%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |================================================                 |  74%
  |                                                                       
  |===================================================              |  78%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=====================================================            |  82%
  |                                                                       
  |========================================================         |  86%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |=============================================================    |  94%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%

set.seed(9999)
rf.model.under <- h2o.randomForest(training_frame = data.under,
                              x = 1:35, y = 36)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |==============                                                   |  22%
  |                                                                       
  |====================                                             |  30%
  |                                                                       
  |==============================                                   |  46%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=================================================================| 100%

set.seed(9999)
rf.model.both <- h2o.randomForest(training_frame = data.both,
                              x = 1:35, y = 36)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |====                                                             |   6%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |======================                                           |  34%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |=========================                                        |  38%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |===================================                              |  54%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |================================================                 |  74%
  |                                                                       
  |===================================================              |  78%
  |                                                                       
  |=====================================================            |  82%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%

3.5 Test trên tập test

set.seed(9999)
ori.perf <- h2o.performance(model = rf.model.ori,
                             newdata = h2o.test)
set.seed(9999)
rose.perf <- h2o.performance(model = rf.model.rose,
                             newdata = h2o.test)
set.seed(9999)
over.perf <- h2o.performance(model = rf.model.over,
                             newdata = h2o.test)
set.seed(9999)
under.perf <- h2o.performance(model = rf.model.under,
                             newdata = h2o.test)
set.seed(9999)
both.perf <- h2o.performance(model = rf.model.both,
                             newdata = h2o.test)

3.6 AUC của từng tập

h2o.auc(ori.perf)

## [1] 0.9037655

h2o.auc(rose.perf)

## [1] 0.8851047

h2o.auc(over.perf)

## [1] 0.9069427

h2o.auc(under.perf)

## [1] 0.9136847

h2o.auc(both.perf)

## [1] 0.9060167

h2o.shutdown(prompt = F)

## [1] TRUE

Như vậy ta có thể thấy phương phát sử dụng cả Over-sampling và Under-sampling mang lại chất lượng mô hình tốt hơn so với các phương pháp khác

Imbalanced data classification

Nguyen Minh Thanh

November 11, 2017