This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
require(knitr) # for better tables in the Markdown
require(caTools) # for sample.split function
require(ROCR) # for the ROC curve
require(caret) # for confusionmatrix()
require(ROSE) # for downsampling
require(rpart) # for decision tree
require(party) # for decision tree
require(rpart.plot) # for better plotting the rpart trees
require(dplyr) # for data manipulation
require(cluster) # for clustering
#require(flexclust) # for cluster prediction
#require(SwarmSVM) # for cluster prediction
require(clue) # for cluster prediction
Class imbalance problems are classification problems in which one class of the response variable heavily dominates the other(s). In such situations, classifiers such as logistic regression have difficulty detecting observations of the dominated class.
In the first report, undersampling and random oversampling were used to mitigate the problem; ROSE showed a 5% improvement over the plain logit.
In the second report, a heuristic approach was tried: segmenting the feature space with a decision tree and then fitting a logit on each segment. The results were poor.
This is the third part of the study. Here, I first cluster the observations of the training set and then fit a logit on each cluster. Finally, the logits are evaluated on the test data. For clusters with 100% purity, the logit is replaced by a simple constant-output model.
The data includes 400 observations and 4 variables. The response variable is “admit”: whether a student was admitted or not.
data <- read.csv("/Users/Shaahin/Downloads/binary.csv")
data$rank <- factor(data$rank)
data$admit <- factor(data$admit)
kable(head(data))
| admit | gre | gpa | rank |
|---|---|---|---|
| 0 | 380 | 3.61 | 3 |
| 1 | 660 | 3.67 | 3 |
| 1 | 800 | 4.00 | 1 |
| 1 | 640 | 3.19 | 4 |
| 0 | 520 | 2.93 | 4 |
| 1 | 760 | 3.00 | 2 |
summary(data)
## admit gre gpa rank
## 0:273 Min. :220.0 Min. :2.260 1: 61
## 1:127 1st Qu.:520.0 1st Qu.:3.130 2:151
## Median :580.0 Median :3.395 3:121
## Mean :587.7 Mean :3.390 4: 67
## 3rd Qu.:660.0 3rd Qu.:3.670
## Max. :800.0 Max. :4.000
We first split this dataset into training and test subsets.
set.seed(7)
train_index <- sample.split(Y = data$admit , SplitRatio = 0.7)
train_data <- data[train_index, ]
test_data <- data[!train_index, ]
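Since sample.split() stratifies on the response, both subsets should keep roughly the original admit ratio; a quick sanity check (output not shown):
# the class ratio should be (approximately) preserved in both subsets
prop.table(table(train_data$admit))
prop.table(table(test_data$admit))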
A new idea came to mind: why not cluster the observations first and then fit a logit within each cluster? So instead of segmenting the feature space, we segment the observations and fit a logit on each cluster.
Which clustering method should be used? The data is small, so let’s have a look at hierarchical clustering. But the data is a mix of categorical and continuous variables, so hclust() won’t work on the raw data; we first need a dissimilarity measure that handles mixed types, such as the Gower distance computed by daisy().
train_prox <- daisy(x = train_data ,
metric = "gower",
stand = TRUE ,
type = list(asymm = 1))
# type = list(asymm = 1): treat the first variable (admit) as an asymmetric binary variable, since it is imbalanced
With the dissimilarity matrix in hand, we need to choose a clustering algorithm. Here, partitioning around medoids (PAM) is a good choice; however, hclust() and even kmeans() may also work.
train_hclust<- hclust(d = train_prox , method = "ward.D" )
plot(train_hclust)
The dendrogram suggests that 4 clusters would be a good choice. But what does PAM say?
set.seed(7)
# from here https://www.r-bloggers.com/clustering-mixed-data-types-in-r/
sil_width <- c(NA)
for(i in 2:10){
pam_fit <- pam(train_prox,
diss = TRUE,
k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
plot(1:10, sil_width,
xlab = "Number of clusters",
ylab = "Silhouette Width")
lines(1:10, sil_width)
The silhouette width suggests 7 clusters for the PAM algorithm. Which one should I choose? I trust the latter more, but the former is easier to implement. For the sake of brevity, I follow the hclust result.
NOTE: I just noticed that hclust cannot predict the cluster of a new observation.
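For completeness, here is one possible workaround, sketched but not used in the rest of this report: cut the dendrogram into 4 clusters and give each test observation the cluster of its nearest training observation under the same Gower dissimilarity.
# a sketch (not used below): 1-nearest-neighbour assignment of test observations
# to the 4 hclust clusters, using the same Gower dissimilarity
hc_labels <- cutree(train_hclust, k = 4)
d_all <- daisy(x = rbind(train_data, test_data),
               metric = "gower",
               stand = TRUE,
               type = list(asymm = 1))
n_tr <- nrow(train_data)
# rows = training observations, columns = test observations
d_mat <- as.matrix(d_all)[1:n_tr, -(1:n_tr)]
hc_test_labels <- hc_labels[apply(d_mat, MARGIN = 2, FUN = which.min)]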
Now let’s have a look at k-means, which is expected to be very similar to PAM. To guess the number of clusters, I use the withinss/betweenss ratio.
set.seed(7)
w_b_index <- c(NA)
for (i in 2:10){
  t <- kmeans(x = train_prox , centers = i )
  # withinss is a per-cluster vector, so sum it before dividing by betweenss
  w_b_index[i] <- sum(t$withinss) / t$betweenss
}
plot(1:10, w_b_index,
xlab = "Number of clusters",
ylab = "withinSS/betweenSS index")
lines(1:10, w_b_index)
This index suggests using 3 clusters, or maybe 6. The difference does not seem drastic.
Having evaluated three clustering algorithms, I choose to continue with PAM. However, 7 clusters would yield an over-fitted model with very pure clusters with respect to the admit variable. Thus, I choose 5 clusters for now.
It is possible to tune the optimal number of clusters based on the evaluation of the final model on the test data (a sketch of such a tuning loop is given at the end of this report).
set.seed(7)
train_pam <- pam(x = train_prox , diss = TRUE , k = 5 )
# clusters and their member counts
table(train_pam$clustering)
##
## 1 2 3 4 5
## 57 46 41 67 69
# the medoids of the clusters
train_pam$medoids
## [1] "27" "126" "95" "87" "77"
Now we have to label the observations of the test dataset.
That is, we need to assign each test observation to one of the 5 PAM clusters fitted above. Unfortunately, I could not find a proper ready-to-use function for this; neither clue::cl_predict nor flexclust::kcca works out here.
So I write my own predictor. It is based on the Gower distance of the test observations from the medoids of the training set.
train_data$cluster <- NULL
test_data$cluster <- NULL
medoids <- train_data[train_pam$medoids,]
t <- daisy(x = rbind(medoids, test_data) ,
metric = "gower",
stand = TRUE ,
type = list(asymm = 1) )
# rows 1-5 of the full matrix are the medoids, the remaining columns are the test observations
dist_mat <- (as.matrix(t)[1:5,-c(1:5)])
#apply(X = dist_mat , MARGIN = 2 ,FUN = min)
# assign each test observation to its nearest medoid
test_clust_labels <- apply(X = dist_mat , MARGIN = 2 , FUN = which.min)
table(test_clust_labels)
## test_clust_labels
## 1 2 3 4 5
## 22 15 21 30 32
test_data$cluster <- factor(test_clust_labels)
Now that we have the cluster assignments, what is the next step? Fitting logits on the clusters of the training data, and then evaluating them on the corresponding test data.
train_data$cluster <- train_pam$clustering
for (i in 1:5) {
print(paste0("cluster ",i," :"))
train_data %>%
filter(cluster == i) %>%
select(admit) %>%
table() %>%
print()
}
## [1] "cluster 1 :"
## .
## 0 1
## 20 37
## [1] "cluster 2 :"
## .
## 0 1
## 42 4
## [1] "cluster 3 :"
## .
## 0 1
## 0 41
## [1] "cluster 4 :"
## .
## 0 1
## 67 0
## [1] "cluster 5 :"
## .
## 0 1
## 62 7
As can be seen, two of the 5 clusters are totally pure. Clusters 3 and 4 contain only one class, so we cannot fit a regression model on them. Cluster 1 is more balanced than the whole training dataset, while clusters 2 and 5 are very imbalanced. I fit three logit models to clusters 1, 2 and 5, and then evaluate the whole model on the test data.
logit_c1 <- train_data %>%
filter(cluster == 1) %>%
select(-cluster) %>%
glm(formula = admit ~ . , family = "binomial")
# since all but one level of the `rank` variable are empty within this cluster,
# I remove it from the logit model of cluster 2
logit_c2 <- train_data %>%
filter(cluster == 2) %>%
select(admit, gre, gpa) %>%
glm(formula = admit ~ . , family = "binomial")
# the same issue with `rank` occurs here,
# so I remove it from the logit model of cluster 5
logit_c5 <- train_data %>%
filter(cluster == 5) %>%
select(admit, gre, gpa) %>%
glm(formula = admit ~ . , family = "binomial")
Now, the evaluation of the model on the test set. The evaluation is mainly based on the TPR, since sensitivity is the main concern here.
c1_pred <- test_data %>%
filter(cluster == 1) %>%
select(-cluster) %>%
predict(object = logit_c1, type = "response")
c2_pred <- test_data %>%
filter(cluster == 2) %>%
select(admit,gre,gpa) %>%
predict(object = logit_c2, type = "response")
# clusters 3 and 4 were pure in the training data, so predict the constant class
c3_pred <- rep(1,nrow(test_data[test_data$cluster==3,]))
c4_pred <- rep(0,nrow(test_data[test_data$cluster==4,]))
c5_pred <- test_data %>%
filter(cluster == 5) %>%
select(admit,gre,gpa) %>%
predict(object = logit_c5, type = "response")
table(test_data$cluster)
##
## 1 2 3 4 5
## 22 15 21 30 32
Now, the evaluation:
confusionMatrix(data = as.integer(c1_pred>0.5),
reference = test_data$admit[test_data$cluster == 1],
positive = "1"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5 4
## 1 3 10
##
## Accuracy : 0.6818
## 95% CI : (0.4513, 0.8614)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : 0.4203
##
## Kappa : 0.3304
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.7143
## Specificity : 0.6250
## Pos Pred Value : 0.7692
## Neg Pred Value : 0.5556
## Prevalence : 0.6364
## Detection Rate : 0.4545
## Detection Prevalence : 0.5909
## Balanced Accuracy : 0.6696
##
## 'Positive' Class : 1
##
confusionMatrix(data = as.integer(c2_pred>0.5),
reference = test_data$admit[test_data$cluster == 2],
positive = "1"
)
## Warning in confusionMatrix.default(data = as.integer(c2_pred > 0.5),
## reference = test_data$admit[test_data$cluster == : Levels are not in the
## same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13 2
## 1 0 0
##
## Accuracy : 0.8667
## 95% CI : (0.5954, 0.9834)
## No Information Rate : 0.8667
## P-Value [Acc > NIR] : 0.6771
##
## Kappa : 0
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8667
## Prevalence : 0.1333
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 1
##
confusionMatrix(data = c3_pred,
reference = test_data$admit[test_data$cluster == 3],
positive = "1"
)
## Warning in confusionMatrix.default(data = c3_pred, reference = test_data
## $admit[test_data$cluster == : Levels are not in the same order for
## reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 0 21
##
## Accuracy : 1
## 95% CI : (0.8389, 1)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : NaN
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 1
## Detection Rate : 1
## Detection Prevalence : 1
## Balanced Accuracy : NA
##
## 'Positive' Class : 1
##
confusionMatrix(data = c4_pred,
reference = test_data$admit[test_data$cluster == 4],
positive = "1"
)
## Warning in confusionMatrix.default(data = c4_pred, reference = test_data
## $admit[test_data$cluster == : Levels are not in the same order for
## reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 30 0
## 1 0 0
##
## Accuracy : 1
## 95% CI : (0.8843, 1)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : NaN
## Mcnemar's Test P-Value : NA
##
## Sensitivity : NA
## Specificity : 1
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 0
## Detection Rate : 0
## Detection Prevalence : 0
## Balanced Accuracy : NA
##
## 'Positive' Class : 1
##
confusionMatrix(data = as.integer(c5_pred>0.5),
reference = test_data$admit[test_data$cluster == 5],
positive = "1"
)
## Warning in confusionMatrix.default(data = as.integer(c5_pred > 0.5),
## reference = test_data$admit[test_data$cluster == : Levels are not in the
## same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 31 1
## 1 0 0
##
## Accuracy : 0.9688
## 95% CI : (0.8378, 0.9992)
## No Information Rate : 0.9688
## P-Value [Acc > NIR] : 0.7358
##
## Kappa : 0
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.96875
## Prevalence : 0.03125
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : 1
##
Now I calculate the overall TPR as \(TPR_{overall} = \sum TP / \sum P\). In other words, the overall sensitivity is the total number of true positives over all clusters divided by the total number of actual positives over all clusters.
# true positives in clusters 1-5, read off the confusion matrices above
overall_TP <- (10+0+21+0+0)
# actual positives in clusters 1-5
overall_P <- (14+2+21+0+1)
overall_TPR <- overall_TP/overall_P
overall_TPR
## [1] 0.8157895
The overall TPR shows a huge improvement over the plain logit and the ROSE models!
But what is the overall TNR, i.e. the specificity?
# true negatives in clusters 1-5, read off the confusion matrices above
overall_TN <- sum(5, 13, 0, 30, 31)
# actual negatives in clusters 1-5
overall_N <- sum(8, 13, 0, 30, 31)
overall_TNR <- overall_TN / overall_N
overall_TNR
## [1] 0.9634146
Very impressive. It is important to note that these are evaluations on the test data!
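The by-hand sums above can also be automated by pooling the per-cluster predictions; a small sketch, assuming the vectors c1_pred, ..., c5_pred and test_data$cluster from the chunks above (it should reproduce the two numbers just computed):
# pool the per-cluster predictions, in cluster order
all_pred <- c(as.integer(c1_pred > 0.5),
              as.integer(c2_pred > 0.5),
              c3_pred,
              c4_pred,
              as.integer(c5_pred > 0.5))
# the corresponding true labels, in the same cluster order
all_ref <- unlist(lapply(1:5, function(i)
  as.integer(as.character(test_data$admit[test_data$cluster == i]))))
# overall sensitivity (TPR) and specificity (TNR)
sum(all_pred == 1 & all_ref == 1) / sum(all_ref == 1)
sum(all_pred == 0 & all_ref == 0) / sum(all_ref == 0)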
In order to improve the sensitivity of a model in an imbalanced classification problem, there are standard solutions such as ROSE or downsampling, or using methods that are more sensitive than logistic regression, such as random forest.
This study is based on an innovative approach to the problem. The method has two steps: first, clustering using partitioning around medoids (PAM); second, fitting a logistic regression on each cluster wherever possible.
Having fitted the logit models on the clusters, the test data were used to evaluate the performance of the method, and the results were impressive even compared to the ROSE approach.
There are still many aspects that can be improved. Data-wise, a real dataset with a more severe imbalance should be used to evaluate the method. Clustering-wise, cluster validation such as that provided by the fpc package can be used. Model-wise, instead of a logit, we could use a random forest, for instance, to possibly get even better results. In addition, the programming of the last part of this study is very naive and can be improved and automated, especially the calculation of \(TPR_{overall}\). Finally, the optimal number of clusters can be determined using the overall evaluations on the test dataset. Also, cross-validation is a better choice than a single random split of the dataset into two subsets.
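As a closing illustration, here is a rough sketch of how the tuning of the number of clusters on the test set could be automated. It assumes the objects built above (train_data, test_data, train_prox) are still in the workspace, and it simplifies the modelling step: pure training clusters get a constant prediction, and every other cluster gets a logit on gre and gpa only, so it will not exactly reproduce the hand-built model above.
# a rough sketch: overall test-set TPR as a function of the number of clusters k
overall_tpr_for_k <- function(k) {
  pam_fit <- pam(x = train_prox, diss = TRUE, k = k)
  vars <- c("admit", "gre", "gpa", "rank")
  medoids <- train_data[pam_fit$medoids, vars]
  # assign each test observation to its nearest medoid (Gower distance)
  d <- daisy(x = rbind(medoids, test_data[, vars]),
             metric = "gower", stand = TRUE, type = list(asymm = 1))
  d_mat <- as.matrix(d)[1:k, -(1:k)]
  test_cl <- apply(d_mat, MARGIN = 2, FUN = which.min)
  train_cl <- pam_fit$clustering
  pred <- rep(NA_real_, nrow(test_data))
  for (i in seq_len(k)) {
    tr <- train_data[train_cl == i, vars]
    if (sum(test_cl == i) == 0) next
    if (length(unique(tr$admit)) == 1) {
      # pure training cluster: constant prediction
      pred[test_cl == i] <- as.numeric(as.character(unique(tr$admit)))
    } else {
      # simplified per-cluster logit on gre and gpa only
      fit <- glm(admit ~ gre + gpa, family = "binomial", data = tr)
      pred[test_cl == i] <- predict(fit, newdata = test_data[test_cl == i, vars],
                                    type = "response")
    }
  }
  truth <- as.integer(as.character(test_data$admit))
  sum(pred > 0.5 & truth == 1) / sum(truth == 1)
}
set.seed(7)
sapply(3:7, overall_tpr_for_k)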