This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

require(knitr) # for better tables in the Markdown
require(caTools) # for sample.split function
require(ROCR) # for the ROC curve 
require(caret) # for confusionmatrix() 
require(ROSE) # for downsampling
require(rpart) # for decision tree 
require(party) # for decision tree 
require(rpart.plot) # for better plotting the rpart trees
require(dplyr) # for data manipulation

1.Introduction

Class imbalance problems are sort of classification problems in which one of the classes of the response variable heavily dominates the others. In such situations, classifiers such as logistic regression have difficulties to detect the observations of the dominated class.

In the previous report, undersampling and random oversampling examples were used to mitigate the problem. Here one ensamble method would be presented.

In this method, first the feature space is segmented using a decision tree, then in each segment a logistic regression is fitted. The result would be evaluated and compared to the previous methods.

2.Data

The data includes 400 observations and 4 variables. The response variable is “admit”, whether an observation/student is admitted or not.

data <- read.csv("/Users/Shaahin/Downloads/binary.csv")
data$rank <- factor(data$rank)
data$admit <- factor(data$admit)

kable(head(data))

admit	gre	gpa	rank
0	380	3.61	3
1	660	3.67	3
1	800	4.00	1
1	640	3.19	4
0	520	2.93	4
1	760	3.00	2

summary(data)

##  admit        gre             gpa        rank   
##  0:273   Min.   :220.0   Min.   :2.260   1: 61  
##  1:127   1st Qu.:520.0   1st Qu.:3.130   2:151  
##          Median :580.0   Median :3.395   3:121  
##          Mean   :587.7   Mean   :3.390   4: 67  
##          3rd Qu.:660.0   3rd Qu.:3.670          
##          Max.   :800.0   Max.   :4.000

We first split this dataset into training and test subsets.

set.seed(7)

train_index <- sample.split(Y = data$admit , SplitRatio = 0.7)

train_data <- data[train_index, ]
test_data <- data[!train_index, ]

3.Segmentation

In order to segmentize the feature space, I would like to use decision trees. They are very understandable, and they may work well here.

tree_model <- rpart(admit ~ . , data = train_data, control =  )
rpart.plot(tree_model)

tree_model_pruned <- rpart(admit ~ . , data = train_data,
                    cp= tree_model$cptable[which.min(tree_model$cptable[,"xerror"]),"CP"] )
rpart.plot(tree_model_pruned)

Not very hopeful about the method at this point. The main nodes do not have high purity, so the segmentation has not yielded what I expected. But anyway, let’s go foreward and fit regression models in each of these segments.

tree_model_pruned

## n= 280 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 280 89 0 (0.6821429 0.3178571)  
##    2) gpa< 3.415 153 29 0 (0.8104575 0.1895425) *
##    3) gpa>=3.415 127 60 0 (0.5275591 0.4724409)  
##      6) rank=2,3,4 102 40 0 (0.6078431 0.3921569)  
##       12) gpa< 3.945 88 30 0 (0.6590909 0.3409091) *
##       13) gpa>=3.945 14  4 1 (0.2857143 0.7142857) *
##      7) rank=1 25  5 1 (0.2000000 0.8000000) *

train_data_seg <- train_data %>% 
        mutate( segment = ifelse(gpa<3.415 ,
                                 yes = "S1",
                                 no = ifelse(rank == 1 
                                             , yes = "S4", 
                                                no = ifelse(gpa<3.945,
                                                            yes = "S2",
                                                            no = "S3"))
                                 )
                )

4.fitting regression models

Now we need four logistic regression models for four segments.

# this part can be automated using parallel computing loop 

model_s1 <- 
        train_data_seg %>%
        filter(segment == "S1") %>%
        select(-segment) %>%
        glm(formula = admit ~ . , family = "binomial"  )

model_s2 <- 
        train_data_seg %>%
        filter(segment == "S2") %>%
        select(-segment) %>%
        glm(formula = admit ~ . , family = "binomial"  )

model_s3 <- 
        train_data_seg %>%
        filter(segment == "S3") %>%
        select(-segment) %>%
        glm(formula = admit ~ . , family = "binomial"  )

model_s4 <- 
        train_data_seg %>%
        filter(segment == "S4") %>%
        select(-c(rank , segment)) %>%
        glm(formula = admit ~ . , family = "binomial"  )

Now let’s evaluate these models on the test data. But first, test data must be labelled.

test_data_seg <- test_data %>% 
        mutate( segment = ifelse(gpa<3.415 ,
                                 yes = "S1",
                                 no = ifelse(rank == 1 
                                             , yes = "S4", 
                                                no = ifelse(gpa<3.945,
                                                            yes = "S2",
                                                            no = "S3"))
                                 )
                )


table(test_data_seg$admit[test_data_seg$segment == "S1"])

## 
##  0  1 
## 39 16

Model evaluation on the test data:

pred_s1 <- test_data_seg %>% 
        filter(segment == "S1") %>%
        predict(object = model_s1 , type = "response")

pred_s2 <- test_data_seg %>% 
        filter(segment == "S2") %>%
        predict(object = model_s1 , type = "response")

pred_s3 <- test_data_seg %>% 
        filter(segment == "S3") %>%
        predict(object = model_s1 , type = "response")

pred_s4 <- test_data_seg %>% 
        filter(segment == "S4") %>%
        predict(object = model_s1 , type = "response")

# confusionMatrix(data = as.integer(logit_pred>0.5) ,
#                 reference =  test_data$admit,
#                 positive = "1")

confusionMatrix(data = as.integer(pred_s1>0.2),
                reference = test_data_seg$admit[test_data_seg$segment == "S1"],
                positive = "1"
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 25  5
##          1 14 11
##                                           
##                Accuracy : 0.6545          
##                  95% CI : (0.5142, 0.7776)
##     No Information Rate : 0.7091          
##     P-Value [Acc > NIR] : 0.85044         
##                                           
##                   Kappa : 0.2818          
##  Mcnemar's Test P-Value : 0.06646         
##                                           
##             Sensitivity : 0.6875          
##             Specificity : 0.6410          
##          Pos Pred Value : 0.4400          
##          Neg Pred Value : 0.8333          
##              Prevalence : 0.2909          
##          Detection Rate : 0.2000          
##    Detection Prevalence : 0.4545          
##       Balanced Accuracy : 0.6643          
##                                           
##        'Positive' Class : 1               
##

confusionMatrix(data = as.integer(pred_s2>0.18),
                reference = test_data_seg$admit[test_data_seg$segment == "S2"],
                positive = "1"
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 17  6
##          1 12  9
##                                           
##                Accuracy : 0.5909          
##                  95% CI : (0.4325, 0.7366)
##     No Information Rate : 0.6591          
##     P-Value [Acc > NIR] : 0.8665          
##                                           
##                   Kappa : 0.1698          
##  Mcnemar's Test P-Value : 0.2386          
##                                           
##             Sensitivity : 0.6000          
##             Specificity : 0.5862          
##          Pos Pred Value : 0.4286          
##          Neg Pred Value : 0.7391          
##              Prevalence : 0.3409          
##          Detection Rate : 0.2045          
##    Detection Prevalence : 0.4773          
##       Balanced Accuracy : 0.5931          
##                                           
##        'Positive' Class : 1               
##

confusionMatrix(data = as.integer(pred_s3>0.14),
                reference = test_data_seg$admit[test_data_seg$segment == "S3"],
                positive = "1"
)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 6 1
##          1 5 2
##                                           
##                Accuracy : 0.5714          
##                  95% CI : (0.2886, 0.8234)
##     No Information Rate : 0.7857          
##     P-Value [Acc > NIR] : 0.9830          
##                                           
##                   Kappa : 0.1429          
##  Mcnemar's Test P-Value : 0.2207          
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.5455          
##          Pos Pred Value : 0.2857          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.2143          
##          Detection Rate : 0.1429          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.6061          
##                                           
##        'Positive' Class : 1               
##

confusionMatrix(data = as.integer(pred_s4>0.1),
                reference = test_data_seg$admit[test_data_seg$segment == "S4"],
                positive = "1"
)

## Warning in confusionMatrix.default(data = as.integer(pred_s4 > 0.1),
## reference = test_data_seg$admit[test_data_seg$segment == : Levels are not
## in the same order for reference and data. Refactoring data to match.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 0 0
##          1 3 4
##                                          
##                Accuracy : 0.5714         
##                  95% CI : (0.1841, 0.901)
##     No Information Rate : 0.5714         
##     P-Value [Acc > NIR] : 0.6531         
##                                          
##                   Kappa : 0              
##  Mcnemar's Test P-Value : 0.2482         
##                                          
##             Sensitivity : 1.0000         
##             Specificity : 0.0000         
##          Pos Pred Value : 0.5714         
##          Neg Pred Value :    NaN         
##              Prevalence : 0.5714         
##          Detection Rate : 0.5714         
##    Detection Prevalence : 1.0000         
##       Balanced Accuracy : 0.5000         
##                                          
##        'Positive' Class : 1              
##

The results show that my idea does not work, at least for this dataset.

5.Summary

In order to improve the sensitivity of model in imbalanced classification problem, I imagined it would be a good idea to segmentize the feature space first using a decision tree, then fit a logistic regression to each node. However, the decision tree did not work well, and the purity of the nodes was not appealing. Fitted logistic regression models did not yield good results for the test data either.

Still, I think it is a good idea to segmentize the space and fit models on each segment. It is similar to non-linear step-wise models, but this idea can perform better because of the power of segmentization method. However, this was not the case here.

It would be great if I could find better results, and I could improve the logit model by my idea. However, looking one level deeper into an idea and not finding interesting results are precious steps.

to be able to say that we looked one layer deeper, and found nothing, is a definite step forward–though not as far as to be able to say that we looked deeper and found thus-and-such. (Tukey, 1977)

Study of Class Imabalance Problems - Part2

Shahin Ashkiani: Contact@Shahin-Ashkiani.com

15/1/2018

1.Introduction

2.Data

3.Segmentation

4.fitting regression models

5.Summary