Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

1.Introduction

This report addresses the problem of an imbalanced binary dataset, also known as the class imbalance problem. When the binary response variable is imbalanced, so that the proportion of failures heavily dominates the proportion of successes, a logistic model cannot perform well at detecting the rarer outcome. In other words, the sensitivity of the model is very low, which means the probability of detecting success cases is unsatisfactorily low.

To illustrate the situation, I first fit a logit model. Then I illustrate one standard approach, downsampling or oversampling, using the ROSE package. Finally, I experiment with an ensemble method to see whether it can outperform the standard downsampling/oversampling approach.

2.Data

The data is the same as that used in this video. The video illustrates the ROSE package and downsampling very lucidly, so it serves as a check on my calculations.

require(knitr) # for better tables in the Markdown
require(caTools) # for sample.split function
require(ROCR) # for the ROC curve 
require(caret) # for confusionMatrix()
require(ROSE) # for downsampling
require(rpart) # for decision tree 
data <- read.csv("/Users/Shaahin/Downloads/binary.csv")
data$rank <- factor(data$rank)
data$admit <- factor(data$admit)

kable(head(data)) 
admit  gre   gpa  rank
    0  380  3.61     3
    1  660  3.67     3
    1  800  4.00     1
    1  640  3.19     4
    0  520  2.93     4
    1  760  3.00     2
summary(data)
##  admit        gre             gpa        rank   
##  0:273   Min.   :220.0   Min.   :2.260   1: 61  
##  1:127   1st Qu.:520.0   1st Qu.:3.130   2:151  
##          Median :580.0   Median :3.395   3:121  
##          Mean   :587.7   Mean   :3.390   4: 67  
##          3rd Qu.:660.0   3rd Qu.:3.670          
##          Max.   :800.0   Max.   :4.000

The dataset has 4 variables and 400 records. The variable admit is the response variable, which is predicted from the gre, gpa and rank variables.

As can be seen, class 1 of the response variable occurs less often than class 0. Here are the proportions of the two classes.

prop.table(x = table(data$admit))
## 
##      0      1 
## 0.6825 0.3175

The class imbalance is not severe compared to what is seen, for instance, in the mobile game industry or in digital marketing.

Let’s see how logistic regression performs on this dataset. However, first I split the dataset into train and test.

set.seed(7)

train_index <- sample.split(Y = data$admit , SplitRatio = 0.7)

train_data <- data[train_index, ]
test_data <- data[!train_index, ]
set.seed(7)
logit_model <- glm(data = train_data ,
                   formula = admit~. ,
                   family = "binomial" )

logit_pred <- predict(object = logit_model,
                      newdata = test_data ,
                      type = "response" )

table(test_data$admit, logit_pred>0.5)
##    
##     FALSE TRUE
##   0    72   10
##   1    31    7
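#Or more sophisticated confusionMatrix()
# (recent caret versions require `data` and `reference` to be factors with the same levels)
confusionMatrix(data = factor(as.integer(logit_pred>0.5), levels = c(0, 1)) ,
                reference =  test_data$admit,
                positive = "1")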
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 72 31
##          1 10  7
##                                           
##                Accuracy : 0.6583          
##                  95% CI : (0.5662, 0.7424)
##     No Information Rate : 0.6833          
##     P-Value [Acc > NIR] : 0.755842        
##                                           
##                   Kappa : 0.0731          
##  Mcnemar's Test P-Value : 0.001787        
##                                           
##             Sensitivity : 0.18421         
##             Specificity : 0.87805         
##          Pos Pred Value : 0.41176         
##          Neg Pred Value : 0.69903         
##              Prevalence : 0.31667         
##          Detection Rate : 0.05833         
##    Detection Prevalence : 0.14167         
##       Balanced Accuracy : 0.53113         
##                                           
##        'Positive' Class : 1               
## 
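
The headline statistics in this output follow directly from the four cell counts of the table. As a check, here is a minimal base R sketch that recomputes them by hand; the variable names `tp`, `fn`, `tn` and `fp` are mine and are not part of caret.

```r
# Cell counts copied from the confusion matrix at threshold 0.5
tp <- 7   # actual 1, predicted 1
fn <- 31  # actual 1, predicted 0
tn <- 72  # actual 0, predicted 0
fp <- 10  # actual 0, predicted 1

sensitivity <- tp / (tp + fn)                    # share of actual 1s detected
specificity <- tn / (tn + fp)                    # share of actual 0s detected
accuracy    <- (tp + tn) / (tp + fn + tn + fp)   # share of all cases classified correctly
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(sensitivity = sensitivity,
        specificity = specificity,
        accuracy = accuracy,
        balanced_accuracy = balanced_accuracy), 4)
```

The rounded values should agree with the Sensitivity, Specificity, Accuracy and Balanced Accuracy rows printed by `confusionMatrix()`.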

As can be seen, the sensitivity, i.e. the power of the model to detect success cases, is very low, at around 18%, when the threshold is 0.5. But what about other thresholds?

roc_pred <- prediction(predictions = logit_pred  , labels = test_data$admit)
roc_perf <- performance(roc_pred , "tpr" , "fpr")
plot(roc_perf,
     colorize = TRUE,
     print.cutoffs.at= seq(0,1,0.05),
     text.adj=c(-0.2,1.7))

# AUC:
as.numeric(performance(roc_pred, "auc")@y.values)
## [1] 0.5872914
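
Each point on a ROC curve is simply the (false positive rate, true positive rate) pair obtained at one cutoff. The following base R sketch shows that sweep on a tiny example; the helper `rates_at()` and the toy vectors are hypothetical and only illustrate the mechanics, they are not the admission data.

```r
# Sensitivity and specificity of the rule "predict 1 when prob > cutoff"
rates_at <- function(prob, truth, cutoff) {
  pred <- as.integer(prob > cutoff)
  c(cutoff      = cutoff,
    sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
}

# Toy predicted probabilities and true labels (made up for illustration)
toy_prob  <- c(0.15, 0.25, 0.35, 0.45, 0.55, 0.65)
toy_truth <- c(0,    0,    1,    0,    1,    1)

# One row per cutoff: the lower cutoff has higher sensitivity and lower specificity
sweep_tab <- t(sapply(c(0.3, 0.5), rates_at, prob = toy_prob, truth = toy_truth))
sweep_tab
```

Running the same helper over a fine grid of cutoffs, and plotting sensitivity against one minus specificity, would trace out a curve of the kind ROCR draws.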

So by reducing the threshold, we increase the number of cases predicted as “positive”, and thus increase sensitivity and decrease specificity. Maybe 0.3 is a good trade-off for our threshold.

confusionMatrix(data = factor(as.integer(logit_pred>0.3), levels = c(0, 1)) ,
                reference =  test_data$admit,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 40 14
##          1 42 24
##                                           
##                Accuracy : 0.5333          
##                  95% CI : (0.4401, 0.6249)
##     No Information Rate : 0.6833          
##     P-Value [Acc > NIR] : 0.9997872       
##                                           
##                   Kappa : 0.0997          
##  Mcnemar's Test P-Value : 0.0003085       
##                                           
##             Sensitivity : 0.6316          
##             Specificity : 0.4878          
##          Pos Pred Value : 0.3636          
##          Neg Pred Value : 0.7407          
##              Prevalence : 0.3167          
##          Detection Rate : 0.2000          
##    Detection Prevalence : 0.5500          
##       Balanced Accuracy : 0.5597          
##                                           
##        'Positive' Class : 1               
## 

The sensitivity is now around 63%, but the specificity has dropped to around 49%. Not bad after all. The overall accuracy is 53%, so the misclassification rate is around 47%.

Now let’s try the undersampling method, in order to see whether it improves the sensitivity of the model.

3.Undersampling
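
Random undersampling keeps the minority class and discards majority rows at random until the classes are about the same size. The base R sketch below shows the simplest version of that idea on made-up data; the helper `undersample()` is hypothetical, it assumes a binary target, and it balances the classes exactly, whereas `ovun.sample()` in ROSE may differ in detail (its result is close to, but not exactly, balanced).

```r
# Random undersampling: keep all minority rows, sample as many majority rows
undersample <- function(df, target) {
  counts   <- table(df[[target]])
  minority <- names(counts)[which.min(counts)]
  n_min    <- min(counts)
  is_min   <- df[[target]] == minority
  maj_rows <- which(!is_min)
  # Draw n_min majority rows without replacement
  keep_maj <- maj_rows[sample.int(length(maj_rows), n_min)]
  df[sort(c(which(is_min), keep_maj)), , drop = FALSE]
}

# Toy imbalanced data (made up): 70 zeros and 30 ones
set.seed(7)
toy <- data.frame(y = factor(rep(c(0, 1), times = c(70, 30))),
                  x = rnorm(100))
toy_balanced <- undersample(toy, "y")
table(toy_balanced$y)
```

Only the training data should be resampled this way; the test set keeps its original class proportions so that the evaluation reflects the real imbalance.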

#downsampling the training data
data_downsample <- ovun.sample(admit~. ,
                               data = train_data ,
                               method = "under")$data
# check the balance
table(data_downsample$admit)
## 
##  0  1 
## 93 89
ds_model <- glm(admit~. , data = data_downsample, family = "binomial")
ds_pred <- predict(object = ds_model, newdata = test_data , type = "response")

roc.curve(test_data$admit, logit_pred)
## Area under the curve (AUC): 0.587
roc.curve(test_data$admit, ds_pred, add.roc=TRUE, col=2)

## Area under the curve (AUC): 0.575
roc_pred <- prediction(predictions = ds_pred  , labels = test_data$admit)
roc_perf <- performance(roc_pred , "tpr" , "fpr")
plot(roc_perf,
     colorize = TRUE,
     print.cutoffs.at= seq(0,1,0.05),
     text.adj=c(-0.2,1.7))

# AUC:
as.numeric(performance(roc_pred, "auc")@y.values)
## [1] 0.5750963

Interestingly, the AUC is now lower than it was before downsampling. Maybe this approach is not suitable for this dataset.

However, there is one difference between the two models. Here, by choosing a threshold of 0.47, we can get better results than before.

confusionMatrix(data = factor(as.integer(ds_pred>0.47), levels = c(0, 1)) ,
                reference =  test_data$admit,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 40 12
##          1 42 26
##                                           
##                Accuracy : 0.55            
##                  95% CI : (0.4565, 0.6409)
##     No Information Rate : 0.6833          
##     P-Value [Acc > NIR] : 0.9992          
##                                           
##                   Kappa : 0.1419          
##  Mcnemar's Test P-Value : 7.933e-05       
##                                           
##             Sensitivity : 0.6842          
##             Specificity : 0.4878          
##          Pos Pred Value : 0.3824          
##          Neg Pred Value : 0.7692          
##              Prevalence : 0.3167          
##          Detection Rate : 0.2167          
##    Detection Prevalence : 0.5667          
##       Balanced Accuracy : 0.5860          
##                                           
##        'Positive' Class : 1               
## 

As we can see, at the same level of specificity (about 49%) we now have a higher sensitivity, i.e. 68% instead of 63%. This is a point in favour of this model.

4.ROSE method

ROSE (Random Over-Sampling Examples) aids the task of binary classification in the presence of rare classes. It produces a synthetic, possibly balanced, sample of data simulated according to a smoothed-bootstrap approach.
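
To give a feel for what a smoothed bootstrap does, here is a minimal base R sketch: pick a class at random, pick a real row from that class, then add a little Gaussian noise around it. This is an illustration under my own assumptions and not the ROSE implementation: the function `smoothed_bootstrap()` is hypothetical, it handles numeric columns only, and the bandwidth rule is a simple rule of thumb chosen for the example.

```r
# Smoothed bootstrap for numeric features: resample a row, then jitter it
smoothed_bootstrap <- function(x, y, n_out) {
  x <- as.matrix(x)
  y <- as.character(y)
  classes <- unique(y)
  # Draw the class of each synthetic row with equal probability
  new_y <- sample(classes, n_out, replace = TRUE)
  new_x <- matrix(NA_real_, nrow = n_out, ncol = ncol(x),
                  dimnames = list(NULL, colnames(x)))
  for (cls in classes) {
    rows <- which(y == cls)
    idx  <- which(new_y == cls)
    if (length(idx) == 0) next
    # Rule-of-thumb bandwidth per column (an assumption of this sketch)
    h <- apply(x[rows, , drop = FALSE], 2, sd) *
      length(rows)^(-1 / (ncol(x) + 4))
    # Real rows used as centres of the synthetic points
    picked  <- rows[sample.int(length(rows), length(idx), replace = TRUE)]
    centres <- x[picked, , drop = FALSE]
    # Gaussian noise, scaled column by column with the bandwidth
    noise <- matrix(rnorm(length(idx) * ncol(x)), ncol = ncol(x))
    noise <- t(t(noise) * h)
    new_x[idx, ] <- centres + noise
  }
  data.frame(new_x, y = factor(new_y))
}

# Toy imbalanced data (made up): 70 zeros and 30 ones, two numeric features
set.seed(7)
toy_x <- data.frame(x1 = rnorm(100), x2 = rnorm(100, mean = 3))
toy_y <- rep(c(0, 1), times = c(70, 30))
synthetic <- smoothed_bootstrap(toy_x, toy_y, n_out = 100)
table(synthetic$y)
```

Because every synthetic row is a perturbed copy of a real one, the new sample is roughly balanced without repeating minority rows verbatim, which is the main difference from plain oversampling.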

# generate new balanced data by ROSE
data_rose <- ROSE(admit~. , data=train_data)$data
# check (im)balance of new data
table(data_rose$admit)
## 
##   0   1 
## 157 123
# train logistic regression on balanced data
rose_model <- glm(admit ~ ., data=data_rose, family="binomial")
# use the trained model to predict test data
rose_pred <- predict(rose_model, newdata=test_data,
                     type="response")

roc_pred <- prediction(predictions = rose_pred  , labels = test_data$admit)
roc_perf <- performance(roc_pred , "tpr" , "fpr")

plot(roc_perf,
     colorize = TRUE,
     print.cutoffs.at= seq(0,1,0.05),
     text.adj=c(-0.2,1.7))

roc.curve(test_data$admit, logit_pred)
## Area under the curve (AUC): 0.587
roc.curve(test_data$admit, ds_pred, add.roc=TRUE, col=2)
## Area under the curve (AUC): 0.575
roc.curve(test_data$admit, rose_pred, add.roc=TRUE, col=3)

## Area under the curve (AUC): 0.639
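
An AUC value can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The base R sketch below computes it that way through the rank-sum (Mann-Whitney) identity; `auc_rank()` and the toy vectors are hypothetical and are not the ROSE or ROCR code.

```r
# AUC from ranks: share of (positive, negative) pairs ordered correctly
auc_rank <- function(score, label) {
  r  <- rank(score)          # average ranks, so ties count as half
  n1 <- sum(label == 1)      # number of positives
  n0 <- sum(label == 0)      # number of negatives
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy scores and labels (made up for illustration)
toy_score <- c(0.10, 0.40, 0.35, 0.80)
toy_label <- c(0,    0,    1,    1)

auc_toy <- auc_rank(toy_score, toy_label)
auc_toy  # 3 of the 4 positive/negative pairs are ordered correctly
```

Applied to the test-set predictions, e.g. `auc_rank(rose_pred, test_data$admit)`, this should agree with the values reported above up to rounding.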

According to the AUC values, the ROSE model is even better than the downsampling model. Now, if we choose a threshold of 0.45, we get:

confusionMatrix(data = factor(as.integer(rose_pred>0.45), levels = c(0, 1)) ,
                reference =  test_data$admit,
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 49 13
##          1 33 25
##                                           
##                Accuracy : 0.6167          
##                  95% CI : (0.5235, 0.7039)
##     No Information Rate : 0.6833          
##     P-Value [Acc > NIR] : 0.950512        
##                                           
##                   Kappa : 0.2238          
##  Mcnemar's Test P-Value : 0.005088        
##                                           
##             Sensitivity : 0.6579          
##             Specificity : 0.5976          
##          Pos Pred Value : 0.4310          
##          Neg Pred Value : 0.7903          
##              Prevalence : 0.3167          
##          Detection Rate : 0.2083          
##    Detection Prevalence : 0.4833          
##       Balanced Accuracy : 0.6277          
##                                           
##        'Positive' Class : 1               
## 

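So even at a higher level of specificity (about 60% instead of 49%), the ROSE model reaches a sensitivity of about 66%. That is roughly 3 percentage points above the plain logit model at threshold 0.3, and close to the 68% of the downsampled model.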

5.An ensemble approach to the problem

Until now, every attempt was a standard approach to the problem of binary classification. Now, I want to try something new. I want to segment the feature space and fit a logistic regression model in each segment. So the model would have two steps: first, determining the segment of a new observation, then predicting with the logistic regression that corresponds to that segment.
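
To make the two-step idea concrete, here is a minimal base R sketch on made-up data. Everything in it is an assumption for illustration: the segmentation rule (a split at the median of one feature), the helper names `segment_of()` and `predict_segmented()`, and the toy data itself. It is not the method developed in the sequel.

```r
# Toy data (made up): the effect of x2 on y differs between regions of x1
set.seed(7)
n   <- 400
toy <- data.frame(x1 = runif(n), x2 = rnorm(n))
slope <- ifelse(toy$x1 > 0.5, 2, -1)
toy$y <- rbinom(n, 1, plogis(-0.5 + slope * toy$x2))

# Hypothetical segmentation rule: split the feature space at the median of x1
cut_point  <- median(toy$x1)
segment_of <- function(df) ifelse(df$x1 > cut_point, "high", "low")

# Fit one logistic regression per segment
toy$segment <- segment_of(toy)
models <- lapply(split(toy, toy$segment),
                 function(d) glm(y ~ x1 + x2, data = d, family = "binomial"))

# Two-step prediction: find the segment, then use that segment's model
predict_segmented <- function(newdata) {
  seg  <- segment_of(newdata)
  prob <- numeric(nrow(newdata))
  for (s in names(models)) {
    rows <- seg == s
    if (any(rows)) {
      prob[rows] <- predict(models[[s]], newdata[rows, , drop = FALSE],
                            type = "response")
    }
  }
  prob
}

new_obs  <- data.frame(x1 = c(0.2, 0.9), x2 = c(1, 1))
seg_prob <- predict_segmented(new_obs)
seg_prob
```

The two new observations share the same `x2` but fall into different segments, so they are scored by different models. A single global logistic regression could not represent the opposite slopes in the two regions.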

Nonetheless, in order to keep this report concise, I present the ensemble method in a sequel report.

6.Summary

Class imbalance problems are very common in the digital world. From advertising companies, whose click-through rate is very low, to mobile gaming companies, whose paying users make up a small fraction of all users, they all have to deal with class imbalance. The problem stems from the inability of the usual classifiers, e.g. logistic regression models, to achieve high sensitivity in such situations. So our goal was to improve sensitivity while keeping specificity at a reasonable level.

Here, undersampling and the ROSE method were compared to the logit model, and an innovative idea for an ensemble method was outlined as well. In the next part of the study, this ensemble method will be devised, implemented and evaluated step by step.