Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Introduction

In the previous analytical report, I used logistic regression on a Framingham Heart Study (FHS) dataset to predict heart disease in patients based on 15 demographic, behavioral, and medical variables. The measure I chose to evaluate the performance of the model was accuracy, and the error was the misclassification error. The model showed above 85% accuracy and I was happy that it worked very well, although the main goal of that project was to work with a couple of classifiers, such as logistic regression, and some exploratory methods, such as the Goodman-Kruskal tau.

A couple of days ago, I realized that I had been deceived by looking at the performance from a very narrow perspective. While the accuracy of the logit model was very high, the TPR was very low. In other words, the model was good at correctly labeling negative instances, while it was awful at detecting the actually positive instances. The problem becomes grave in medical studies like this one, where failing to detect positive cases can cost lives.

In general, the problem was due to the low proportion of positive instances in the dataset. Rare-event datasets can cause problems for classifiers such as logistic regression. That said, this report is a sequel to the previous one, and it aims to test whether the logit works well on an FHS dataset whose response variable has a very low proportion of successes. Moreover, I am going to apply two remedies for this problem and evaluate the results.

Let’s get started.

require(knitr)
require(readr)

require(boot)
require(caTools)
require(ggplot2)
require(ROCR)
require(RColorBrewer)
require(zeligverse)
require(brglm)
require(logistf)
require(dplyr)

require(randomForest)
require(gbm)
#dataset <- read.csv(file = "https://d37djvu3ytnwxt.cloudfront.net/assets/courseware/v1/7022cf016eefb6d3747447589423dab0/asset-v1:MITx+15.071x_3+1T2016+type@asset+block/framingham.csv")
url <- "https://d37djvu3ytnwxt.cloudfront.net/assets/courseware/v1/7022cf016eefb6d3747447589423dab0/asset-v1:MITx+15.071x_3+1T2016+type@asset+block/framingham.csv"
dataset <- read_csv(file = url ,
                    col_types = list(
        male = col_logical(),
        age = col_integer(),
        education = col_factor(levels = c(1,2,3,4)),
        currentSmoker = col_logical(),
        BPMeds = col_logical(), 
        prevalentStroke = col_logical(), 
        prevalentHyp =  col_logical(), 
        diabetes = col_logical(), 
        TenYearCHD = col_logical()
      
))


## Removing the instances with missing values
dataset <- dataset %>% filter(complete.cases(dataset))

Now let's have a glimpse of the dataset.

kable(head(dataset))
male age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
TRUE 39 4 FALSE 0 FALSE FALSE FALSE FALSE 195 106.0 70 26.97 80 77 FALSE
FALSE 46 2 FALSE 0 FALSE FALSE FALSE FALSE 250 121.0 81 28.73 95 76 FALSE
TRUE 48 1 TRUE 20 FALSE FALSE FALSE FALSE 245 127.5 80 25.34 75 70 FALSE
FALSE 61 3 TRUE 30 FALSE FALSE TRUE FALSE 225 150.0 95 28.58 65 103 TRUE
FALSE 46 3 TRUE 23 FALSE FALSE FALSE FALSE 285 130.0 84 23.10 85 85 FALSE
FALSE 43 2 FALSE 0 FALSE FALSE TRUE FALSE 228 180.0 110 30.30 77 99 FALSE

The response variable is TenYearCHD. Now let’s have a look at the proportions of success and failure.

## the number of success and failures
 table(dataset$TenYearCHD)
## 
## FALSE  TRUE 
##  3101   557
## the proportions of success
dataset %>% dplyr::select(TenYearCHD) %>% summarize(proportion = sum( (TenYearCHD==TRUE))/n())
## # A tibble: 1 x 1
##   proportion
##        <dbl>
## 1   0.152269

The response variable is quite imbalanced, though not as much as the example we discussed during the interview.

Logistic Regression

Now let’s build a logit model on it.

set.seed(7)

logit_model <- glm(formula = TenYearCHD ~ . , data = dataset , family = binomial )

logit_pred <- predict(object = logit_model , newdata = dataset , type = "response")

conf_mat <- table( dataset$TenYearCHD,logit_pred>0.5)

## Accuracy
(conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
## [1] 0.8556588
## misclassification error 
1-(conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
## [1] 0.1443412
## sensitivity
sen <- conf_mat[2,2]/sum(conf_mat[2,])
sen
## [1] 0.08617594
## specificity 
spec <- conf_mat[1,1]/sum(conf_mat[1,])
spec
## [1] 0.9938729
## precision
prec <- conf_mat[2,2]/sum(conf_mat[,2])
prec 
## [1] 0.7164179
## F1 
F1 <- (2*prec * sen)/(prec+sen)
F1
## [1] 0.1538462

So the model works on the data, but not well at all! Indeed, if we consider sensitivity, we can see that only about 9% of the truly positive instances are marked as positive by the model. So while the model is very good at detecting negative instances, it is very weak at detecting positive ones. It is slightly better than the baseline model, but it is almost a null classifier. This is especially dangerous for medical cases like this one, where missing a potential CHD case can cost a life. In such cases it is better to have low specificity and high sensitivity, rather than the other way around.
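Since these indices come up again and again below, a small hypothetical helper such as class_indices() (a sketch based on the formulas above; the rest of the report keeps computing them inline) captures their definitions in one place:

## a sketch: classification indices from a 2x2 confusion matrix
## (rows = actual, columns = predicted, as produced by table() above)
class_indices <- function(conf_mat) {
        sen  <- conf_mat[2,2]/sum(conf_mat[2,])               # sensitivity / TPR
        spec <- conf_mat[1,1]/sum(conf_mat[1,])               # specificity / TNR
        prec <- conf_mat[2,2]/sum(conf_mat[,2])               # precision
        F1   <- (2*prec*sen)/(prec+sen)
        acc  <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
        c(sensitivity = sen, specificity = spec, precision = prec, F1 = F1, accuracy = acc)
}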

All the above results were achieved by setting the threshold to 0.5. However, can changing the threshold improve the situation?

Let's sweep the threshold over a range of values and see.

t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = 6 , nrow = length(t_seq))

counter <- 0 
for (treshold in t_seq ) {
        counter <- counter + 1 
       conf_mat <- table( dataset$TenYearCHD , logit_pred>treshold) 
       sen <- conf_mat[2,2]/sum(conf_mat[2,])
       spec <- conf_mat[1,1]/sum(conf_mat[1,])
       prec <- conf_mat[2,2]/sum(conf_mat[,2])
       F1 <- (2*prec * sen)/(prec+sen)
       accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
       index_mat[counter,] <-  c(treshold,sen,spec,prec,F1,accuracy  )
      
       
       
        
}

index_df <- data.frame(index_mat)
colnames(index_df) <- c("treshold","sensitivity","specificity","precision","F1","accuracy")


ggplot(data = index_df) +
        geom_line(aes(x = treshold, y = sensitivity, color = "Sensitivity")  ) +
        geom_point(aes(x = treshold, y = sensitivity, color = "Sensitivity")  ) +
        geom_line(aes(x = treshold, y = specificity, color = "Specificity") ) +
        geom_point(aes(x = treshold, y = specificity, color = "Specificity") ) + 
        geom_line(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) + 
        geom_point(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) + 
        geom_line(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) + 
        geom_point(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) + 
        geom_line(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) + 
        geom_point(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) + 
        scale_x_continuous(name="Treshold", limits=c(0.1, 0.9), breaks = t_seq) + 
        scale_y_continuous(name = "Index", limits = c(0,1), breaks = seq(0,1,0.1)) +
        scale_color_brewer(palette = "Set1") 

As we can see, the sensitivity is at best around 0.85, where the specificity is below 0.5. At this point the precision is around 0.22, which means only 22% of the cases labeled as positive are actually positive.

Let’s have a look at the receiver operating characteristic (ROC) curve to focus on the trade-off between the true positive rate and the false positive rate.

perf <- performance(prediction(predictions = logit_pred,labels = dataset$TenYearCHD),"tpr","fpr")
plot(perf,colorize = TRUE, print.cutoffs.at= seq(0,1,0.1), text.adj=c(-0.2,1.7))

And here is the trade-off between sensitivity and specificity:

perf <- performance(prediction(predictions = logit_pred,labels = dataset$TenYearCHD),"tpr","tnr")
plot(perf,colorize = TRUE, print.cutoffs.at= seq(0,1,0.1), text.adj=c(-0.2,1.7))

As we can see, the best choice of threshold is perhaps something around 0.15, where sensitivity is around 0.70 and specificity is around 0.6. However, we saw before that at such a threshold the precision is around 0.28, which means over 70% of the cases labeled as positive are not actually positive. In other words, the majority of the positive labels are wrong!
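If we prefer to pick such a threshold programmatically rather than by eye, one common heuristic is to maximize Youden's J statistic (sensitivity + specificity - 1) over the ROC cutoffs. Here is a minimal sketch using the slots of the ROCR performance object built above (the names roc_perf, cutoffs and best are new, and the cutoff it returns need not match the visual estimate exactly):

## a sketch: the cutoff that maximizes Youden's J = TPR - FPR
roc_perf <- performance(prediction(predictions = logit_pred, labels = dataset$TenYearCHD),"tpr","fpr")
fpr     <- roc_perf@x.values[[1]]
tpr     <- roc_perf@y.values[[1]]
cutoffs <- roc_perf@alpha.values[[1]]
best    <- which.max(tpr - fpr)
cutoffs[best]                                  # candidate threshold
c(sensitivity = tpr[best], specificity = 1 - fpr[best])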

However, it is better to split the dataset into training and test sets, and then evaluate the model on the held-out data.

set.seed(9)
split_vec <- sample.split(Y = dataset$TenYearCHD , SplitRatio = 0.8)
train <- dataset %>% filter(split_vec == TRUE)
test <- dataset %>% filter(split_vec == FALSE )

## number of instances in the training sample
nrow(train)
## [1] 2927
## number of instances in the test sample
nrow(test)
## [1] 731
## success proportion in training set 
train %>% dplyr::select(TenYearCHD) %>% summarise( sum(train$TenYearCHD==TRUE)/nrow(train))
## # A tibble: 1 x 1
##   `sum(train$TenYearCHD == TRUE)/nrow(train)`
##                                         <dbl>
## 1                                   0.1523744
## success proportion in test set 
test %>% dplyr::select(TenYearCHD) %>% summarise( sum(test$TenYearCHD==TRUE)/nrow(test))
## # A tibble: 1 x 1
##   `sum(test$TenYearCHD == TRUE)/nrow(test)`
##                                       <dbl>
## 1                                 0.1518468
logit_train <- glm(formula = TenYearCHD ~ . , data = train , family = "binomial")

test_pred <- predict(object = logit_train, newdata = test, type = "response")

## having a look at the success/failure of the test set
table(test$TenYearCHD)
## 
## FALSE  TRUE 
##   620   111
## confusion matrix
table(test$TenYearCHD,test_pred>0.5)
##        
##         FALSE TRUE
##   FALSE   614    6
##   TRUE     99   12

The confusion matrix of the predictions on the test set does not bring any good news. The model is awfully weak at detecting the positive cases at a threshold of 0.5. Let’s have a look at the behaviour of the model over a range of thresholds.

t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = 6 , nrow = length(t_seq))

counter <- 0 
for (treshold in t_seq ) {
        counter <- counter + 1 
       conf_mat <- table( test$TenYearCHD , factor(test_pred>treshold, levels = c("FALSE","TRUE"))) 
       sen <- conf_mat[2,2]/sum(conf_mat[2,])
       spec <- conf_mat[1,1]/sum(conf_mat[1,])
       prec <- conf_mat[2,2]/sum(conf_mat[,2])
       F1 <- (2*prec * sen)/(prec+sen)
       accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
       index_mat[counter,] <-  c(treshold,sen,spec,prec,F1,accuracy  )
       
       
        
}

index_df <- data.frame(index_mat)
colnames(index_df) <- c("treshold","sensitivity","specificity","precision","F1","accuracy")


ggplot(data = index_df) +
        geom_line(aes(x = treshold, y = sensitivity, color = "Sensitivity")  ) +
        geom_point(aes(x = treshold, y = sensitivity, color = "Sensitivity")  ) +
        geom_line(aes(x = treshold, y = specificity, color = "Specificity") ) +
        geom_point(aes(x = treshold, y = specificity, color = "Specificity") ) + 
        geom_line(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) + 
        geom_point(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) + 
        geom_line(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) + 
        geom_point(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) + 
        geom_line(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) + 
        geom_point(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) + 
        scale_x_continuous(name="Treshold", limits=c(0.1, 0.9), breaks = t_seq) + 
        scale_y_continuous(name = "Index", limits = c(0,1), breaks = seq(0,1,0.1)) +
        scale_color_brewer(palette = "Set1") 

As can be seen, the model is very bad at labeling the actually positive cases. The best threshold seems to be 0.2, where the true positive rate is about 0.5, which means that 50% of the positive cases are correctly labeled by the model. The F1 score is very low, not even reaching 0.4.

What can be done to alleviate this problem?

Regular logistic regression estimates the parameters of the model by maximum likelihood, but this method is easily affected by rare events. The phenomenon is known as “logistic regression in rare events data”, and there are some remedies for it.

Exact logistic regression is one approach, but it appears to be very computationally demanding and is apparently intended for small sample sizes. Moreover, the predictors are categorical in the cited example.

Penalized maximum likelihood is another approach.

King and Zeng (2001) is probably the most famous response to the problem, with over 2000 citations.
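Since the zeligverse package loaded above ships King and Zeng's estimator as Zelig's "relogit" model, fitting it would look roughly like the sketch below. This is hedged: I have not run it in this report, the object name z_out is hypothetical, and the interface differs between Zelig versions.

## a rough sketch, not evaluated here: rare-events logit via Zelig
## (the response may need to be recoded as 0/1 depending on the Zelig version)
z_out <- zelig(TenYearCHD ~ . , data = train, model = "relogit")
summary(z_out)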

A Bayesian approach is another alternative to this problem, according to this reference.

Downsampling seems problematic; however, it is still a valid approach. The coefficients of a model fitted on the downsampled data are correct, but the intercept needs a correction (see the sketch below).
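For reference, the usual prior correction (as in King and Zeng, 2001) adjusts only the intercept, using the true proportion of positives and the proportion of positives in the downsampled data. A hedged sketch follows, with hypothetical names correct_intercept, tau and y_bar:

## prior correction of the intercept after downsampling (a sketch):
##   beta0_corrected = beta0_sample - log( ((1 - tau)/tau) * (y_bar/(1 - y_bar)) )
## tau   = proportion of positives in the full data
## y_bar = proportion of positives in the downsampled data (0.5 for a 1:1 sample)
correct_intercept <- function(beta0_sample, tau, y_bar = 0.5) {
        beta0_sample - log(((1 - tau)/tau) * (y_bar/(1 - y_bar)))
}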

Nevertheless, many sources insist that logistic regression would do well, if we choose a proper threshold.

Firth’s bias-reduced penalized-likelihood logit

brglm_model <- brglm(formula = TenYearCHD ~ . , family = "binomial", data = train )
brglm_pred <- predict(object = brglm_model, newdata = test , type = "response")

# now the indices 
t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = 6 , nrow = length(t_seq))

counter <- 0 
for (treshold in t_seq ) {
        counter <- counter + 1 
       conf_mat <- table( test$TenYearCHD , factor(brglm_pred>treshold, levels = c("FALSE","TRUE"))) 
       sen <- conf_mat[2,2]/sum(conf_mat[2,])
       spec <- conf_mat[1,1]/sum(conf_mat[1,])
       prec <- conf_mat[2,2]/sum(conf_mat[,2])
       F1 <- (2*prec * sen)/(prec+sen)
       accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
       index_mat[counter,] <-  c(treshold,sen,spec,prec,F1,accuracy  )
       
       
        
}

index_df <- data.frame(index_mat)
colnames(index_df) <- c("treshold","sensitivity","specificity","precision","F1","accuracy")

ggplot(data = index_df) +
        geom_line(aes(x = treshold, y = sensitivity, color = "Sensitivity")  ) +
        geom_point(aes(x = treshold, y = sensitivity, color = "Sensitivity")  ) +
        geom_line(aes(x = treshold, y = specificity, color = "Specificity") ) +
        geom_point(aes(x = treshold, y = specificity, color = "Specificity") ) + 
        geom_line(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) + 
        geom_point(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) + 
        geom_line(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) + 
        geom_point(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) + 
        geom_line(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) + 
        geom_point(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) + 
        scale_x_continuous(name="Treshold", limits=c(0.1, 0.9), breaks = t_seq) + 
        scale_y_continuous(name = "Index", limits = c(0,1), breaks = seq(0,1,0.1)) +
        scale_color_brewer(palette = "Set1") 

table( test$TenYearCHD , brglm_pred>0.5) 
##        
##         FALSE TRUE
##   FALSE   614    6
##   TRUE     99   12
firth_model <- logistf(TenYearCHD ~ . , data = train)

## extract the penalized-likelihood coefficients and compute the predicted
## probabilities on the test set manually via the logistic function
betas <- coef(firth_model)
class(betas)
## [1] "numeric"
X <- model.matrix(firth_model, data = test)
probs <- 1 / (1 + exp(-X %*% betas))


t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = 6 , nrow = length(t_seq))

counter <- 0 
for (treshold in t_seq ) {
        counter <- counter + 1 
       conf_mat <- table( test$TenYearCHD , factor(probs>treshold, levels = c("FALSE","TRUE")) )  
       sen <- conf_mat[2,2]/sum(conf_mat[2,])
       spec <- conf_mat[1,1]/sum(conf_mat[1,])
       prec <- conf_mat[2,2]/sum(conf_mat[,2])
       F1 <- (2*prec * sen)/(prec+sen)
       accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
       index_mat[counter,] <-  c(treshold,sen,spec,prec,F1,accuracy  )
       
       
        
}

index_df <- data.frame(index_mat)
colnames(index_df) <- c("treshold","sensitivity","specificity","precision","F1","accuracy")

ggplot(data = index_df) +
        geom_line(aes(x = treshold, y = sensitivity, color = "Sensitivity")  ) +
        geom_point(aes(x = treshold, y = sensitivity, color = "Sensitivity")  ) +
        geom_line(aes(x = treshold, y = specificity, color = "Specificity") ) +
        geom_point(aes(x = treshold, y = specificity, color = "Specificity") ) + 
        geom_line(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) + 
        geom_point(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) + 
        geom_line(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) + 
        geom_point(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) + 
        geom_line(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) + 
        geom_point(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) + 
        scale_x_continuous(name="Treshold", limits=c(0.1, 0.9), breaks = t_seq) + 
        scale_y_continuous(name = "Index", limits = c(0,1), breaks = seq(0,1,0.1)) +
        scale_color_brewer(palette = "Set1") 

table( test$TenYearCHD , probs>0.5) 
##        
##         FALSE TRUE
##   FALSE   614    6
##   TRUE     99   12

As mentioned here, the predictions of brglm() and logistf() are identical!

However, being identical is not a problem in itself; the problem is that logistf() does not work as well as expected!

Downsampling

The downsampling is done on the training set obtained earlier.

## keep all the positive cases and an equally sized random subsample of the negatives
t <- train %>% filter(TenYearCHD == FALSE)

down_index <- sample(x = 1:nrow(t), size = table(train$TenYearCHD)[2] , replace = FALSE)
down_train <- rbind(t[down_index, ], train %>% filter(TenYearCHD == TRUE) ) 

## proportion of positives in the original training set
proportion <- table(train$TenYearCHD)[2]/nrow(train)

#proportion <- train %>% summarise( sum(TenYearCHD==TRUE) / n())

## fit on the balanced sample, then shift the intercept to account for the
## changed base rate
down_model <- glm(data = down_train, formula = TenYearCHD ~ . , family = binomial)
betas <- coef(down_model)
betas[1] <- coef(down_model)[1] - log(proportion/(1-proportion))


X <- model.matrix(down_model, data=test)

probs <- 1 / (1 + exp(-X %*% betas))


#
t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = 6 , nrow = length(t_seq))

counter <- 0 
for (treshold in t_seq ) {
        counter <- counter + 1 
       conf_mat <- table( test$TenYearCHD , factor(probs>treshold, levels = c("FALSE","TRUE")) )  
       sen <- conf_mat[2,2]/sum(conf_mat[2,])
       spec <- conf_mat[1,1]/sum(conf_mat[1,])
       prec <- conf_mat[2,2]/sum(conf_mat[,2])
       F1 <- (2*prec * sen)/(prec+sen)
       accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
       index_mat[counter,] <-  c(treshold,sen,spec,prec,F1,accuracy  )
       
       
        
}

index_df <- data.frame(index_mat)
colnames(index_df) <- c("treshold","sensitivity","specificity","precision","F1","accuracy")

ggplot(data = index_df) +
        geom_line(aes(x = treshold, y = sensitivity, color = "Sensitivity")  ) +
        geom_point(aes(x = treshold, y = sensitivity, color = "Sensitivity")  ) +
        geom_line(aes(x = treshold, y = specificity, color = "Specificity") ) +
        geom_point(aes(x = treshold, y = specificity, color = "Specificity") ) + 
        geom_line(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) + 
        geom_point(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) + 
        geom_line(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) + 
        geom_point(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) + 
        geom_line(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) + 
        geom_point(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) + 
        scale_x_continuous(name="Treshold", limits=c(0.1, 0.9), breaks = t_seq) + 
        scale_y_continuous(name = "Index", limits = c(0,1), breaks = seq(0,1,0.1)) +
        scale_color_brewer(palette = "Set1") 

table( test$TenYearCHD , probs>0.5) 
##        
##         FALSE TRUE
##   FALSE     4  616
##   TRUE      0  111
#matplot(index_df[,1], index_df[,-1], type = "b")

While the sensitivity is improved drastically, at a chosen threshold of 0.85 the precision is still quite low, which means that almost 70% of the cases labeled as positive are not actually positive instances.

A try on Bagging/RandomForest

I expect that the tree-based methods will not work well here, since the rarity of positive events in the response variable does not give the trees much to grow on.

set.seed(8)

n <- 500
train <- data.frame(train)
test <- data.frame(test)
rf_model <- randomForest(formula = TenYearCHD ~ . ,
                         ntree = n, x = train[,-16],
                         y = factor(train[,16]),
                         xtest = test[,-16], ytest = factor(test[,16]))
rf_model
## 
## Call:
##  randomForest(x = train[, -16], y = factor(train[, 16]), xtest = test[,      -16], ytest = factor(test[, 16]), ntree = n, formula = TenYearCHD ~      .) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 15.41%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE  2456   25  0.01007658
## TRUE    426   20  0.95515695
##                 Test set error rate: 15.32%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE   613    7  0.01129032
## TRUE    105    6  0.94594595

As expected, the TPR is very low: based on the test-set error, the model misses about 95% of the actually positive instances! The OOB estimate is slightly worse, missing about 96% of the actually positive instances.
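One possible follow-up, left here as a sketch (not run, and the object names n_pos and rf_balanced are hypothetical), is to let randomForest balance the classes inside each tree via its strata and sampsize arguments, which often helps with imbalanced responses:

## a sketch (not evaluated): draw a balanced bootstrap sample for every tree
n_pos <- sum(train$TenYearCHD == TRUE)
rf_balanced <- randomForest(x = train[,-16], y = factor(train[,16]),
                            xtest = test[,-16], ytest = factor(test[,16]),
                            ntree = n,
                            strata = factor(train[,16]),
                            sampsize = c(n_pos, n_pos))
rf_balanced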

Conclusion

The lesson learned is to evaluate a model’s performance from different perspectives! By different perspectives, I mean different indices. The accuracy was very misleading and insufficient in the absence of the other measures. Indices such as accuracy are aggregations of two or more measures, and thus their values might not tell the whole story about the model.

Surprisingly, the penalised maximum likelihood logit did not work well on this problem, while downsampling worked better. However, the downsampling approach could be done more comprehensively by selecting different subsets of the negative cases over several iterations and averaging the final measure(s), as sketched below. Nevertheless, the precision of the model for positive cases is very low even after downsampling.
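A minimal sketch of that idea (hypothetical and not run in this report; the names neg, pos, balanced, fit and avg_probs are new): fit a logit on several balanced subsamples and average the predicted probabilities on the test set.

## repeated downsampling (a sketch): average test-set probabilities over
## several balanced subsamples of the training data
set.seed(11)
neg <- train %>% filter(TenYearCHD == FALSE)
pos <- train %>% filter(TenYearCHD == TRUE)
avg_probs <- rowMeans(sapply(1:25, function(i) {
        balanced <- rbind(neg[sample(nrow(neg), nrow(pos)), ], pos)
        fit <- glm(TenYearCHD ~ . , data = balanced, family = binomial)
        predict(fit, newdata = test, type = "response")
}))
## these probabilities are not corrected for the true base rate, so the
## threshold (or the intercept) would still need adjusting
table(test$TenYearCHD, avg_probs > 0.5)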

I have seen the reluctance of the tree model to grow on this dataset. The reason might be the imbalanced response variable, or some special condition of this dataset.

In the next report, I will move on to better classifiers for this dataset.