This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
In the previous analytical report, I used logistic regression on a Framingham Heart Study (FHS) dataset to predict heart disease in patients based on 15 demographic, behavioral, and medical variables. The measure I chose to evaluate the performance of the model was accuracy, and the error was the misclassification error. The model showed above 85% accuracy and I was happy that it worked very well, although the main goal of that analytical project was to work with a couple of classifiers, such as logistic regression, and some exploratory methods, such as Goodman-Kruskal’s tau.
A couple of days ago, I realized I had been deceived by looking at the performance from a very narrow perspective. While the accuracy of the logit model was very high, the true positive rate (TPR) was very low. In other words, the model was good at correctly labelling negative instances, while it was awful at detecting the actually positive ones. The problem becomes grave in medical studies like this one, where a missed positive case can cost a life.
In general, the problem was due to the low proportion of positive instances in the dataset. Rare-event datasets can cause problems for classifiers such as logistic regression. Having said that, this report is a sequel to the previous one, and aims to test whether the logit model works well on an FHS dataset whose response variable has a very low proportion of successes. Moreover, I am going to try two remedies for this problem and evaluate the results.
Let’s get started.
require(knitr)
require(readr)
require(boot)
require(caTools)
require(ggplot2)
require(ROCR)
require(RColorBrewer)
require(zeligverse)
require(brglm)
require(logistf)
require(dplyr)
require(randomForest)
require(gbm)
#dataset <- read.csv(file = "https://d37djvu3ytnwxt.cloudfront.net/assets/courseware/v1/7022cf016eefb6d3747447589423dab0/asset-v1:MITx+15.071x_3+1T2016+type@asset+block/framingham.csv")
url <- "https://d37djvu3ytnwxt.cloudfront.net/assets/courseware/v1/7022cf016eefb6d3747447589423dab0/asset-v1:MITx+15.071x_3+1T2016+type@asset+block/framingham.csv"
dataset <- read_csv(file = url ,
col_types = list(
male = col_logical(),
age = col_integer(),
education = col_factor(levels = c("1","2","3","4")),
currentSmoker = col_logical(),
BPMeds = col_logical(),
prevalentStroke = col_logical(),
prevalentHyp = col_logical(),
diabetes = col_logical(),
TenYearCHD = col_logical()
))
## Removing the instances with missing values
dataset <- dataset %>% filter(complete.cases(dataset))
Now let’s have a glimpse of the dataset.
kable(head(dataset))
male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TRUE | 39 | 4 | FALSE | 0 | FALSE | FALSE | FALSE | FALSE | 195 | 106.0 | 70 | 26.97 | 80 | 77 | FALSE |
FALSE | 46 | 2 | FALSE | 0 | FALSE | FALSE | FALSE | FALSE | 250 | 121.0 | 81 | 28.73 | 95 | 76 | FALSE |
TRUE | 48 | 1 | TRUE | 20 | FALSE | FALSE | FALSE | FALSE | 245 | 127.5 | 80 | 25.34 | 75 | 70 | FALSE |
FALSE | 61 | 3 | TRUE | 30 | FALSE | FALSE | TRUE | FALSE | 225 | 150.0 | 95 | 28.58 | 65 | 103 | TRUE |
FALSE | 46 | 3 | TRUE | 23 | FALSE | FALSE | FALSE | FALSE | 285 | 130.0 | 84 | 23.10 | 85 | 85 | FALSE |
FALSE | 43 | 2 | FALSE | 0 | FALSE | FALSE | TRUE | FALSE | 228 | 180.0 | 110 | 30.30 | 77 | 99 | FALSE |
The response variable is TenYearCHD
. Now let’s have a look at the proportions of success and failure.
## the number of success and failures
table(dataset$TenYearCHD)
##
## FALSE TRUE
## 3101 557
## the proportions of success
dataset %>% dplyr::select(TenYearCHD) %>% summarize(proportion = sum( (TenYearCHD==TRUE))/n())
## # A tibble: 1 x 1
## proportion
## <dbl>
## 1 0.152269
The response variable is quite imbalanced, though not as much as our example during the interview.
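For context, a naive baseline classifier that always predicts FALSE would already reach roughly 85% accuracy on this data, so any useful model has to beat that:
## accuracy of the naive baseline that labels every patient as FALSE
mean(dataset$TenYearCHD == FALSE)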
Now let’s build a logit model on it.
set.seed(7)
logit_model <- glm(formula = TenYearCHD ~ . , data = dataset , family = binomial )
logit_pred <- predict(object = logit_model , newdata = dataset , type = "response")
conf_mat <- table( dataset$TenYearCHD,logit_pred>0.5)
## Accuracy
(conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
## [1] 0.8556588
## misclasification error
1-(conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
## [1] 0.1443412
## sensitivity
sen <- conf_mat[2,2]/sum(conf_mat[2,])
sen
## [1] 0.08617594
## specificity
spec <- conf_mat[1,1]/sum(conf_mat[1,])
spec
## [1] 0.9938729
## precision
prec <- conf_mat[2,2]/sum(conf_mat[,2])
prec
## [1] 0.7164179
## F1
F1 <- (2*prec * sen)/(prec+sen)
F1
## [1] 0.1538462
So the model works on the data, but not well at all! Indeed, if we consider the sensitivity, we can see that only about 9% of the truly positive instances are marked as positive by the model. So while the model is very good at detecting negative instances, it is very weak at detecting positive ones. It is slightly better than the baseline model, but it is almost a null classifier. This is especially dangerous in medical cases such as this one, where missing a potential CHD case can cost a life. In such cases it is better to have low specificity and high sensitivity, rather than the other way around.
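Since these indices are recomputed several times below, it is handy to wrap them in a small helper. This is only a sketch; the name confusion_metrics is mine, and the code that follows keeps computing the indices inline as before.
## sketch of a helper that computes the usual indices from labels and predicted probabilities
confusion_metrics <- function(actual, probs, threshold = 0.5) {
  pred <- factor(probs > threshold, levels = c("FALSE", "TRUE"))
  cm   <- table(actual, pred)
  sen  <- cm[2, 2] / sum(cm[2, ])   ## sensitivity (TPR)
  spec <- cm[1, 1] / sum(cm[1, ])   ## specificity (TNR)
  prec <- cm[2, 2] / sum(cm[, 2])   ## precision
  c(sensitivity = sen, specificity = spec, precision = prec,
    F1 = 2 * prec * sen / (prec + sen),
    accuracy = sum(diag(cm)) / sum(cm))
}
## e.g. confusion_metrics(dataset$TenYearCHD, logit_pred, threshold = 0.5)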
All of the above results were obtained by setting the threshold to 0.5. Can changing the threshold improve the situation? Let’s play with the threshold and see.
t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = 6 , nrow = length(t_seq))
counter <- 0
for (treshold in t_seq ) {
counter <- counter + 1
conf_mat <- table( dataset$TenYearCHD , logit_pred>treshold)
sen <- conf_mat[2,2]/sum(conf_mat[2,])
spec <- conf_mat[1,1]/sum(conf_mat[1,])
prec <- conf_mat[2,2]/sum(conf_mat[,2])
F1 <- (2*prec * sen)/(prec+sen)
accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
index_mat[counter,] <- c(treshold,sen,spec,prec,F1,accuracy )
}
index_df <- data.frame(index_mat)
colnames(index_df) <- c("treshold","sensitivity","specificity","precision","F1","accuracy")
ggplot(data = index_df) +
geom_line(aes(x = treshold, y = sensitivity, color = "Sensitivity") ) +
geom_point(aes(x = treshold, y = sensitivity, color = "Sensitivity") ) +
geom_line(aes(x = treshold, y = specificity, color = "Specificity") ) +
geom_point(aes(x = treshold, y = specificity, color = "Specificity") ) +
geom_line(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) +
geom_point(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) +
geom_line(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) +
geom_point(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) +
geom_line(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) +
geom_point(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) +
scale_x_continuous(name="Treshold", limits=c(0.1, 0.9), breaks = t_seq) +
scale_y_continuous(name = "Index", limits = c(0,1), breaks = seq(0,1,0.1)) +
scale_color_brewer(palette = "Set1")
As we can see, the sensitivity is at best around 0.85, where the specificity is below 0.5. At that point the precision is around 0.22, which means only 22% of the cases labeled as positive are actually positive.
Let’s have a look at the receiver operating characteristic (ROC) curve to focus on the trade-off between the true positive rate and the false positive rate.
perf <- performance(prediction(predictions = logit_pred,labels = dataset$TenYearCHD),"tpr","fpr")
plot(perf,colorize = TRUE, print.cutoffs.at= seq(0,1,0.1), text.adj=c(-0.2,1.7))
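A single-number summary of the ROC curve is the area under it (AUC), which with ROCR can be obtained from the same kind of prediction object (I only show how to compute it here):
## area under the ROC curve for the logit model on the full dataset
pred_obj <- prediction(predictions = logit_pred, labels = dataset$TenYearCHD)
performance(pred_obj, measure = "auc")@y.values[[1]]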
And also at the trade-off between sensitivity and specificity:
perf <- performance(prediction(predictions = logit_pred,labels = dataset$TenYearCHD),"tpr","tnr")
plot(perf,colorize = TRUE, print.cutoffs.at= seq(0,1,0.1), text.adj=c(-0.2,1.7))
As we can see, the best choice of threshold is perhaps something around 0.15, where the sensitivity is around 0.70 and the specificity is around 0.6. However, we saw before that at such a threshold the precision is around 0.28, which means over 70% of the cases labeled as positive are not actually positive. In other words, the majority of the positive labels are wrong!
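One common way to pick a cut-off in this trade-off is to maximize Youden's J statistic, J = sensitivity + specificity - 1 = TPR - FPR. A short sketch with ROCR:
## threshold that maximizes Youden's J on the full-data ROC curve
roc_perf <- performance(prediction(logit_pred, dataset$TenYearCHD), "tpr", "fpr")
youden   <- roc_perf@y.values[[1]] - roc_perf@x.values[[1]]
roc_perf@alpha.values[[1]][which.max(youden)]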
But it is better to split the dataset into training and test sets, and then evaluate the model on the held-out data.
set.seed(9)
split_vec <- sample.split(Y = dataset$TenYearCHD , SplitRatio = 0.8)
train <- dataset %>% filter(split_vec == TRUE)
test <- dataset %>% filter(split_vec == FALSE )
## number of instances in the training sample
nrow(train)
## [1] 2927
## number of instances in the test sample
nrow(test)
## [1] 731
## success proportion in training set
train %>% dplyr::select(TenYearCHD) %>% summarise( sum(train$TenYearCHD==TRUE)/nrow(train))
## # A tibble: 1 x 1
## `sum(train$TenYearCHD == TRUE)/nrow(train)`
## <dbl>
## 1 0.1523744
## success proportion in test set
test %>% dplyr::select(TenYearCHD) %>% summarise( sum(test$TenYearCHD==TRUE)/nrow(test))
## # A tibble: 1 x 1
## `sum(test$TenYearCHD == TRUE)/nrow(test)`
## <dbl>
## 1 0.1518468
logit_train <- glm(formula = TenYearCHD ~ . , data = train , family = "binomial")
test_pred <- predict(object = logit_train, newdata = test, type = "response")
## having a look at the success/failure of the test set
table(test$TenYearCHD)
##
## FALSE TRUE
## 620 111
## confusion matrix
table(test$TenYearCHD,test_pred>0.5)
##
## FALSE TRUE
## FALSE 614 6
## TRUE 99 12
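For the record, the sensitivity implied by this matrix is only 12/111, about 0.11:
## test-set sensitivity at threshold 0.5, straight from the confusion matrix above
12 / (99 + 12)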
The confusion matrix of the predictions on the test set does not bring any good news. The model is awfully weak at detecting the positive cases at threshold = 0.5. Let’s have a look at the behaviour of the model over a range of thresholds.
t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = 6 , nrow = length(t_seq))
counter <- 0
for (treshold in t_seq ) {
counter <- counter + 1
conf_mat <- table( test$TenYearCHD , factor(test_pred>treshold, levels = c("FALSE","TRUE")))
sen <- conf_mat[2,2]/sum(conf_mat[2,])
spec <- conf_mat[1,1]/sum(conf_mat[1,])
prec <- conf_mat[2,2]/sum(conf_mat[,2])
F1 <- (2*prec * sen)/(prec+sen)
accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
index_mat[counter,] <- c(treshold,sen,spec,prec,F1,accuracy )
}
index_df <- data.frame(index_mat)
colnames(index_df) <- c("treshold","sensitivity","specificity","precision","F1","accuracy")
ggplot(data = index_df) +
geom_line(aes(x = treshold, y = sensitivity, color = "Sensitivity") ) +
geom_point(aes(x = treshold, y = sensitivity, color = "Sensitivity") ) +
geom_line(aes(x = treshold, y = specificity, color = "Specificity") ) +
geom_point(aes(x = treshold, y = specificity, color = "Specificity") ) +
geom_line(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) +
geom_point(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) +
geom_line(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) +
geom_point(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) +
geom_line(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) +
geom_point(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) +
scale_x_continuous(name="Treshold", limits=c(0.1, 0.9), breaks = t_seq) +
scale_y_continuous(name = "Index", limits = c(0,1), breaks = seq(0,1,0.1)) +
scale_color_brewer(palette = "Set1")
As can be seen, the model is very bad at labeling the actually positive cases. The best threshold seems to be around 0.2, where the true positive rate equals 0.5, meaning that only 50% of the positive cases are correctly labeled by the model. The F1 score is very low, not even reaching 0.4.
Regular logistic regression estimates the parameters of the model by maximum likelihood, but this method is easily affected by rare events. The phenomenon is known as “logistic regression in rare events data”, and there are some remedies for it.
Exact logistic regression is one approach, but it seems to be computationally very demanding and apparently intended for small sample sizes; moreover, the predictors are categorical in the cited example.
Penalized maximum likelihood is another approach.
King and Zeng (2001) is probably the most famous response to the problem, with over 2000 citations.
A Bayesian approach is another alternative to this problem, according to this.
Down-sampling seems problematic, but it is still a valid approach: the coefficients of a model fitted on the down-sampled data are correct, but the intercept needs a correction (a sketch of this correction is given just before the down-sampling step below).
Nevertheless, many sources insist that logistic regression would do fine if we choose a proper threshold.
brglm_model <- brglm(formula = TenYearCHD ~ . , family = "binomial", data = train )
brglm_pred <- predict(object = brglm_model, newdata = test , type = "response")
# now the indices
t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = 6 , nrow = length(t_seq))
counter <- 0
for (treshold in t_seq ) {
counter <- counter + 1
conf_mat <- table( test$TenYearCHD , factor(brglm_pred>treshold, levels = c("FALSE","TRUE")))
sen <- conf_mat[2,2]/sum(conf_mat[2,])
spec <- conf_mat[1,1]/sum(conf_mat[1,])
prec <- conf_mat[2,2]/sum(conf_mat[,2])
F1 <- (2*prec * sen)/(prec+sen)
accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
index_mat[counter,] <- c(treshold,sen,spec,prec,F1,accuracy )
}
index_df <- data.frame(index_mat)
colnames(index_df) <- c("treshold","sensitivity","specificity","precision","F1","accuracy")
ggplot(data = index_df) +
geom_line(aes(x = treshold, y = sensitivity, color = "Sensitivity") ) +
geom_point(aes(x = treshold, y = sensitivity, color = "Sensitivity") ) +
geom_line(aes(x = treshold, y = specificity, color = "Specificity") ) +
geom_point(aes(x = treshold, y = specificity, color = "Specificity") ) +
geom_line(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) +
geom_point(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) +
geom_line(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) +
geom_point(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) +
geom_line(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) +
geom_point(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) +
scale_x_continuous(name="Treshold", limits=c(0.1, 0.9), breaks = t_seq) +
scale_y_continuous(name = "Index", limits = c(0,1), breaks = seq(0,1,0.1)) +
scale_color_brewer(palette = "Set1")
table( test$TenYearCHD , brglm_pred>0.5)
##
## FALSE TRUE
## FALSE 614 6
## TRUE 99 12
firth_model <- logistf(TenYearCHD ~ . , data=train)
betas <- coef(firth_model)
class(betas)
## [1] "numeric"
X <- model.matrix(firth_model, data=test)
probs <- 1 / (1 + exp(-X %*% betas))
t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = 6 , nrow = length(t_seq))
counter <- 0
for (treshold in t_seq ) {
counter <- counter + 1
conf_mat <- table( test$TenYearCHD , factor(probs>treshold, levels = c("FALSE","TRUE")) )
sen <- conf_mat[2,2]/sum(conf_mat[2,])
spec <- conf_mat[1,1]/sum(conf_mat[1,])
prec <- conf_mat[2,2]/sum(conf_mat[,2])
F1 <- (2*prec * sen)/(prec+sen)
accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
index_mat[counter,] <- c(treshold,sen,spec,prec,F1,accuracy )
}
index_df <- data.frame(index_mat)
colnames(index_df) <- c("treshold","sensitivity","specificity","precision","F1","accuracy")
ggplot(data = index_df) +
geom_line(aes(x = treshold, y = sensitivity, color = "Sensitivity") ) +
geom_point(aes(x = treshold, y = sensitivity, color = "Sensitivity") ) +
geom_line(aes(x = treshold, y = specificity, color = "Specificity") ) +
geom_point(aes(x = treshold, y = specificity, color = "Specificity") ) +
geom_line(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) +
geom_point(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) +
geom_line(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) +
geom_point(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) +
geom_line(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) +
geom_point(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) +
scale_x_continuous(name="Treshold", limits=c(0.1, 0.9), breaks = t_seq) +
scale_y_continuous(name = "Index", limits = c(0,1), breaks = seq(0,1,0.1)) +
scale_color_brewer(palette = "Set1")
table( test$TenYearCHD , probs>0.5)
##
## FALSE TRUE
## FALSE 614 6
## TRUE 99 12
As was mentioned here, the results of brglm() and the predictions of logistf() are identical! Being identical is not in itself a problem; the problem is that logistf does not work as expected!
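A likely explanation is that with roughly 2,900 training rows and about 450 positive cases, the Firth-type penalty barely moves the estimates away from ordinary maximum likelihood, so the predicted probabilities hardly change. A quick way to check is to put the coefficient vectors side by side:
## compare ordinary ML, bias-reduced (brglm) and Firth (logistf) coefficient estimates
round(cbind(glm     = coef(logit_train),
            brglm   = coef(brglm_model),
            logistf = coef(firth_model)), 4)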
The down-sampling is done on the training set obtained earlier.
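Before fitting, it helps to write down the intercept correction mentioned above. Under the prior-correction of King and Zeng (2001), with tau the fraction of positives in the original (training) data and ybar the fraction of positives in the down-sampled data, the corrected intercept is beta0_sample - log(((1 - tau)/tau) * (ybar/(1 - ybar))). A small sketch; prior_correct_intercept is a name I am introducing here for illustration only:
## King and Zeng (2001) prior correction for the intercept after case-control style down-sampling
## beta0_corrected = beta0_sample - log(((1 - tau)/tau) * (ybar/(1 - ybar)))
prior_correct_intercept <- function(beta0_sample, tau, ybar = 0.5) {
  beta0_sample - log(((1 - tau) / tau) * (ybar / (1 - ybar)))
}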
t <- train %>% filter(TenYearCHD == FALSE)
down_index <- sample(x = 1:nrow(t), size = table(train$TenYearCHD)[2] , replace = FALSE)
down_train <- rbind(t[down_index, ], train %>% filter(TenYearCHD == TRUE) )
proportion<- table(train$TenYearCHD)[2]/nrow(train)
#proportion <- train %>% summarise( sum(TenYearCHD==TRUE) / n())
down_model <- glm(data = down_train, formula = TenYearCHD ~ . , family = binomial)
betas <- coef(down_model)
betas[1] <- coef(down_model)[1] - log(proportion/(1-proportion))
X <- model.matrix(down_model, data=test)
probs <- 1 / (1 + exp(-X %*% betas))
#
t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = 6 , nrow = length(t_seq))
counter <- 0
for (treshold in t_seq ) {
counter <- counter + 1
conf_mat <- table( test$TenYearCHD , factor(probs>treshold, levels = c("FALSE","TRUE")) )
sen <- conf_mat[2,2]/sum(conf_mat[2,])
spec <- conf_mat[1,1]/sum(conf_mat[1,])
prec <- conf_mat[2,2]/sum(conf_mat[,2])
F1 <- (2*prec * sen)/(prec+sen)
accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
index_mat[counter,] <- c(treshold,sen,spec,prec,F1,accuracy )
}
index_df <- data.frame(index_mat)
colnames(index_df) <- c("treshold","sensitivity","specificity","precision","F1","accuracy")
ggplot(data = index_df) +
geom_line(aes(x = treshold, y = sensitivity, color = "Sensitivity") ) +
geom_point(aes(x = treshold, y = sensitivity, color = "Sensitivity") ) +
geom_line(aes(x = treshold, y = specificity, color = "Specificity") ) +
geom_point(aes(x = treshold, y = specificity, color = "Specificity") ) +
geom_line(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) +
geom_point(aes(x = treshold , y = precision, color = "Precision"), alpha = 0.5) +
geom_line(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) +
geom_point(aes(x = treshold, y = F1, color = "F1"), alpha = 0.5) +
geom_line(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) +
geom_point(aes(x = treshold, y = accuracy, color = "accuracy"), alpha = 0.5) +
scale_x_continuous(name="Treshold", limits=c(0.1, 0.9), breaks = t_seq) +
scale_y_continuous(name = "Index", limits = c(0,1), breaks = seq(0,1,0.1)) +
scale_color_brewer(palette = "Set1")
table( test$TenYearCHD , probs>0.5)
##
## FALSE TRUE
## FALSE 4 616
## TRUE 0 111
#matplot(index_df[,1], index_df[,-1], type = "b")
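At the default threshold of 0.5 the corrected model labels almost every test instance as positive, so its precision there, read straight off the confusion matrix above, is only about 0.15:
## precision at threshold 0.5: TP / (TP + FP), taken from the confusion matrix above
111 / (111 + 616)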
While the sensitivity improves drastically, at a chosen threshold of 0.85 the precision is still quite low, meaning that almost 70% of the cases labeled as success are not actually positive instances.
I expect that the tree-based methods will not work well here, since the rarity of positive events in the response variable does not let a tree grow.
set.seed(8)
n = 500
train <- data.frame(train)
test <- data.frame(test)
rf_model <- randomForest( formula = TenYearCHD ~ . ,
ntree = n, x = train[,-16] ,
y = factor(train[,16]),
xtest = test[,-16], ytest = factor(test[,16]) )
rf_model
##
## Call:
## randomForest(x = train[, -16], y = factor(train[, 16]), xtest = test[, -16], ytest = factor(test[, 16]), ntree = n, formula = TenYearCHD ~ .)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 15.41%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 2456 25 0.01007658
## TRUE 426 20 0.95515695
## Test set error rate: 15.32%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 613 7 0.01129032
## TRUE 105 6 0.94594595
As expected, the TPR is very low: based on the test error, the model misses about 95% of the actually positive instances! The OOB estimate is slightly worse, missing about 96% of them.
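One possible remedy, which I have not evaluated in this report, is to balance each tree's bootstrap sample using the sampsize and strata arguments of randomForest. A hedged sketch:
## sketch: draw an equal number of positives and negatives for every tree
n_pos <- sum(train$TenYearCHD == TRUE)
rf_balanced <- randomForest(x = train[, -16], y = factor(train[, 16]),
                            ntree = 500,
                            strata = factor(train[, 16]),
                            sampsize = c(n_pos, n_pos))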
The lesson learned is to evaluate a model’s performance from different perspectives! By different perspectives, I mean different indices. The accuracy was very misleading and insufficient in the absence of the other measures. Indices such as accuracy aggregate two or more quantities, and thus their values might not tell the whole story about the model.
Surprisingly, the penalized maximum likelihood logit did not work well on this problem, while down-sampling worked better. However, the down-sampling could be done more comprehensively by covering all negative cases over different iterations and averaging the final measure(s). Nevertheless, the precision of the model for positive cases remains very low even after down-sampling.
I have also seen the reluctance of the tree model to grow on this dataset. The reason might be the imbalanced response variable, or some special condition of this dataset.
In the next report, I will move on to better classifiers for this dataset.