In binary classification problems, both Accuracy and ROC-AUC have limitations that may impact the evaluation of model performance, particularly when working with imbalanced datasets.
Accuracy is often misleading when the classes are imbalanced. In datasets where the majority class dominates, a model can achieve high accuracy simply by predicting the majority class most of the time. This happens because accuracy weights every correct prediction equally and does not distinguish true positives from true negatives, making it a poor choice when the minority class (often the class of interest) is the one that matters most (Fawcett, 2006; Saito & Rehmsmeier, 2015). In scenarios like fraud detection or disease diagnosis, accuracy can therefore overstate the usefulness of a model that fails to catch the rare but critical cases.
ROC-AUC has its own issues, especially with imbalanced datasets. The ROC curve plots the true positive rate against the false positive rate across different thresholds, but when the data contain a very large number of negatives, the false positive rate stays small even if the model produces many false positives in absolute terms. As a result, the AUC can look misleadingly high even when the model fails to detect the minority class (Bradley, 1997; Saito & Rehmsmeier, 2015). ROC-AUC may therefore overstate a model's practical utility on imbalanced data, where metrics like Precision-Recall AUC are often more suitable (Fawcett, 2006).
For these reasons, it is crucial to choose evaluation metrics based on the specific requirements of the problem, especially when working with imbalanced data.
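As a quick illustration of the accuracy problem, consider the following minimal sketch (synthetic labels, not the credit data analyzed below): a rule that always predicts the majority class on a 95/5 split scores 95% accuracy while detecting none of the minority cases.
# A deliberately trivial "always predict the majority class" rule on a 95/5 split:
actual <- c(rep("Good", 950), rep("Bad", 50))   # 5% minority ("Bad") class
predicted <- rep("Good", 1000)                  # majority-class predictor
accuracy <- mean(predicted == actual)                        # 0.95
sensitivity_bad <- mean(predicted[actual == "Bad"] == "Bad") # 0: no minority case detected
c(Accuracy = accuracy, Sensitivity = sensitivity_bad)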
As mentioned above, in many cases (if not most), Accuracy and even ROC-AUC are not ideal performance metrics. In addition to imbalanced data situations, another common scenario is Credit Scoring at banks and lending institutions. The goal of these institutions is typically to maximize profit, not simply to maximize Accuracy or ROC-AUC.
The reasoning is straightforward: if a potential borrower is actually creditworthy but is incorrectly classified as a bad credit risk (False Negative), the bank only loses the opportunity to earn interest (e.g., 30% on the loan). However, if an unqualified borrower is classified as creditworthy (False Positive), the bank risks losing the entire loan amount. In other words, the cost of a False Positive (FP) is typically far higher than the cost of a False Negative (FN), and the two should not be weighted equally.
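To make this asymmetry concrete, here is a back-of-the-envelope sketch that follows the FP/FN convention of the previous paragraph; the loan amount of 10,000 is an illustrative assumption, and the 30% rate mirrors the profit rule used in the code further below.
# Illustrative cost of one misclassified loan of 10,000 at a 30% interest rate:
amount <- 10000
r <- 0.30
cost_fn <- amount * r   # False Negative: creditworthy borrower rejected, forgone interest = 3,000
cost_fp <- amount       # False Positive: bad borrower approved, principal lost = 10,000
cost_fp / cost_fn       # one FP costs roughly 3.3 times as much as one FN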
Using the German Credit dataset, I will present empirical evidence that both Accuracy and ROC-AUC (metrics commonly used to compare binary classification models, including on platforms like Kaggle) can be misleading.
In this case, I will use 9 classification models, with Logistic Regression serving as the baseline for comparing the classification performance of various algorithms. The dataset will be split into two parts: 80% for training the models and 20% for validating the classification results.
Below is the R code used to produce the empirical results.
#==============================================================================
# Project Name: ROC-AUC, Accuracy or Profit: Which Metric Is More Important?
#==============================================================================
# Clear R environment:
rm(list = ls())
# Load some R packages:
library(dplyr)
library(tidyr)
library(ggplot2)
library(pROC)
library(kableExtra)
library(caret)
# Load data:
data("GermanCredit")
scaledData <- GermanCredit %>%
select(-Purpose.Vacation, -Personal.Female.Single) %>%
mutate_if(is.numeric, function(x) {(x - min(x)) / (max(x) - min(x))})
# Split data:
set.seed(1)
id <- createDataPartition(y = scaledData$Class, p = 0.8, list = FALSE)
df_train_ml <- scaledData[id, ]
df_test_ml <- scaledData[-id, ]
# Use Parallel computing:
library(doParallel)
registerDoParallel(cores = detectCores() - 1)
# Set conditions for training model and cross-validation:
set.seed(1)
number <- 5
repeats <- 5
control <- trainControl(method = "repeatedcv",
number = number,
repeats = repeats,
classProbs = TRUE,
savePredictions = "final",
index = createResample(df_train_ml$Class, repeats*number),
summaryFunction = multiClassSummary,
allowParallel = TRUE)
# 9 ML models selected:
my_models <- c("glm", "rf", "gam", "svmRadial", "knn", "xgbTree", "C5.0", "nnet", "ranger")
# Train these ML Models:
library(caretEnsemble)
set.seed(1)
system.time(model_list1 <- caretList(Class ~.,
data = df_train_ml,
trControl = control,
metric = "ROC",
methodList = my_models))
## # weights: 62
## initial value 608.922464
## iter 10 value 393.359125
## iter 20 value 375.520866
## iter 30 value 366.000205
## iter 40 value 363.624360
## iter 50 value 363.330005
## iter 60 value 363.313530
## iter 70 value 363.310460
## final value 363.310448
## converged
## user system elapsed
## 13.69 0.54 145.01
# Function calculates some metrics (including profit) at given threshold and model selected:
some_metrics_with_threshold <- function(model_selected) {
myModel <- model_list1[[model_selected]]
# Calculate PD by model selected:
df_pd <- predict(myModel, df_test_ml, type = "prob")
df_pd %>% pull(Bad) -> pd
actual_labels <- df_test_ml$Class
roc(actual_labels, pd)$auc %>% as.numeric() -> myROC
# A function:
modelMetricThreshold <- function(threshold) {
# Create data frame that contains actual and predicted labels:
labels_predicted <- case_when(pd >= threshold ~ "Bad", TRUE ~ "Good")
# Create actual - predicted data frame for purpose of comparison:
tibble(actual = actual_labels, predicted = labels_predicted) -> df_compared
# Calculate Accuracy metric:
acc_metric <- sum(labels_predicted == actual_labels) / length(labels_predicted)
# Calculate Sensitivity metric:
df_compared %>% filter(actual == "Bad") -> df_bad_sen
df_bad_sen %>%
filter(predicted == "Bad") %>%
nrow() / nrow(df_bad_sen) -> sen_metric
# Calculate Specificity metric:
df_compared %>% filter(actual == "Good") -> df_good_spec
df_good_spec %>%
filter(predicted == "Good") %>%
nrow() / nrow(df_good_spec) -> spec_metric
# Calculate profit and some metrics at given threshold:
df_compared %>%
mutate(Amount = GermanCredit %>% slice(-id) %>% pull(Amount)) %>%
mutate(r = 0.3) %>%
mutate(profit = case_when(actual == "Good" & predicted == "Good" ~ Amount*r,
actual == "Bad" & predicted == "Good" ~ -1*Amount,
TRUE ~ 0)) -> df_profit
df_profit %>%
pull(profit) %>%
sum() -> prof
# Final results in DF form:
tibble(Accuracy = acc_metric,
Sensitivity = sen_metric,
Specificity = spec_metric,
Profit = prof,
ROC = myROC,
Threshold = threshold,
Model = model_selected) -> df_report
return(df_report)
}
lapply(seq(0.1, 0.8, by = 0.025), modelMetricThreshold) -> reports
do.call("bind_rows", reports) %>% return()
}
lapply(my_models, some_metrics_with_threshold) -> resultsCompared
do.call("bind_rows", resultsCompared) -> resultsCompared
resultsCompared %>%
pivot_longer(cols = c("Accuracy", "Sensitivity", "Specificity", "Profit"),
names_to = "Metric",
values_to = "value") -> dfLong
dfLong %>%
ggplot(aes(x = Threshold, y = value, color = Model)) +
geom_line() +
facet_wrap(~ Metric, scales = "free") +
theme_minimal() +
theme(axis.title = element_blank()) +
scale_x_continuous(breaks = seq(0.1, 0.8, 0.1)) +
theme(panel.grid.minor = element_blank()) +
labs(title = "Figure 1: Model Performance by Threshold for Classification",
caption = "Source: Author's Calulations")Figure 1 highlights how different metrics vary with respect to the selected classification threshold. Two key observations emerge from this analysis:
Trade-off between Sensitivity and Specificity: As the threshold changes, sensitivity and specificity typically exhibit an inverse relationship. Increasing sensitivity (i.e., correctly identifying positive cases) often results in a decrease in specificity (i.e., correctly identifying negative cases), and vice versa. This trade-off must be managed carefully based on the context of the problem.
U-shaped behavior of Accuracy and Profit: Both accuracy and profit demonstrate a concave (inverted U-shaped) pattern as the threshold varies. This implies that for each classification model, there exists a threshold that maximizes either Accuracy or Profit. However, the threshold that maximizes Profit may not be the same as the one that maximizes Accuracy. In practice, this means that focusing solely on achieving the highest accuracy or ROC-AUC might not align with the goal of maximizing profit, which is typically the priority for financial institutions like banks.
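The second observation can be checked directly from the resultsCompared data frame built above. The following sketch (assuming the objects created by the preceding code are still in the workspace, and that thresholds are restricted to the 0.1–0.8 grid used there) extracts, for each model, the threshold that maximizes Accuracy and the threshold that maximizes Profit, so the two can be compared side by side.
# For each model, compare the Accuracy-maximizing and the Profit-maximizing threshold:
resultsCompared %>%
group_by(Model) %>%
summarise(threshold_max_accuracy = Threshold[which.max(Accuracy)],
threshold_max_profit = Threshold[which.max(Profit)],
.groups = "drop")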
resultsCompared %>%
group_by(Model) %>%
slice(which.max(Profit)) %>%
arrange(-Profit) %>%
mutate_if(is.numeric, function(x) {round(x, 3)}) %>%
kbl(caption = "Table 1: Model Performance in Descending Order of Maximum Profit", escape = TRUE) %>%
kable_classic(full_width = FALSE, html_font = "Cambria")

| Accuracy | Sensitivity | Specificity | Profit | ROC | Threshold | Model |
|---|---|---|---|---|---|---|
| 0.635 | 0.917 | 0.514 | 43819.5 | 0.813 | 0.150 | xgbTree |
| 0.730 | 0.833 | 0.686 | 39038.1 | 0.802 | 0.200 | glm |
| 0.725 | 0.717 | 0.729 | 38248.0 | 0.775 | 0.350 | C5.0 |
| 0.640 | 0.867 | 0.543 | 34804.5 | 0.798 | 0.225 | rf |
| 0.685 | 0.783 | 0.643 | 34235.1 | 0.796 | 0.300 | ranger |
| 0.690 | 0.833 | 0.629 | 33876.2 | 0.806 | 0.225 | svmRadial |
| 0.670 | 0.883 | 0.579 | 32727.0 | 0.815 | 0.175 | gam |
| 0.705 | 0.783 | 0.671 | 31615.3 | 0.790 | 0.250 | nnet |
| 0.585 | 0.900 | 0.450 | 27564.7 | 0.736 | 0.125 | knn |
Table 1 presents the performance of the models ranked by their maximum profit. The xgbTree model achieves the highest profit of 43,819.5, followed by glm and C5.0. xgbTree stands out with high sensitivity (0.917) and comparatively low specificity (0.514), meaning it catches most of the bad-credit cases while misclassifying more of the good ones. The glm model balances sensitivity (0.833) and specificity (0.686) while still reaching a profit of 39,038.1, and C5.0 ranks third with a similar balance. The weakest performer, knn, yields the lowest profit of 27,564.7 with moderate sensitivity and specificity. In terms of ROC-AUC, all models perform reasonably well, with values ranging from 0.736 to 0.815, and the profit-maximizing threshold varies across models, shaping each one's trade-off between sensitivity and specificity.
Clearly, the model with the highest Accuracy or ROC-AUC is not necessarily the most profitable one for the bank: in Table 1, gam has the highest ROC-AUC (0.815) yet ranks only seventh by maximum profit, while glm posts the highest accuracy at its profit-maximizing threshold (0.730) but earns less than xgbTree. This reinforces the importance of selecting metrics aligned with business objectives, such as maximizing profit rather than purely predictive performance.
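This divergence is easy to verify from the same results. The sketch below (again reusing resultsCompared from the code above) lists each model's ROC-AUC next to the maximum profit it can reach across thresholds, sorted by ROC-AUC, so the two orderings can be compared directly.
# Rank models by ROC-AUC and list the maximum profit each can reach:
resultsCompared %>%
group_by(Model) %>%
summarise(ROC = max(ROC),          # ROC is constant per model, so max() just returns it
max_profit = max(Profit),
.groups = "drop") %>%
arrange(desc(ROC))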
In conclusion, both Accuracy and ROC-AUC have inherent limitations when applied to binary classification problems, particularly in scenarios involving imbalanced datasets. Accuracy can be deceptive by overestimating performance in datasets where one class is dominant, as it does not distinguish between the importance of true positives and true negatives. On the other hand, ROC-AUC may present an inflated view of model performance in imbalanced datasets because the high number of true negatives keeps the false positive rate low, even when the model fails to correctly classify minority instances. Hence, it is essential to consider the specific context and objectives of the task when selecting evaluation metrics. For imbalanced datasets, metrics such as the Precision-Recall AUC may provide more relevant insights into model performance.
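To illustrate why Precision-Recall AUC can be the more informative summary, here is a small synthetic sketch; the 5% positive rate and the score distributions are arbitrary assumptions, not taken from the credit data. With only modest score separation, the ROC-AUC looks comfortable while the (manually computed) PR-AUC typically remains much lower.
# Synthetic imbalanced example: ROC-AUC versus a manually computed PR-AUC
library(pROC)
set.seed(123)
n_neg <- 9500; n_pos <- 500                                    # 5% positive class
labels <- c(rep(0, n_neg), rep(1, n_pos))
scores <- c(rnorm(n_neg, mean = 0), rnorm(n_pos, mean = 1.5))  # modest separation
roc_auc <- as.numeric(roc(labels, scores)$auc)
# Step-wise PR-AUC: sweep the cutoff over all observed scores (descending):
ord <- order(scores, decreasing = TRUE)
tp <- cumsum(labels[ord] == 1)
fp <- cumsum(labels[ord] == 0)
precision <- tp / (tp + fp)
recall <- tp / n_pos
pr_auc <- sum(diff(c(0, recall)) * precision)
round(c(ROC_AUC = roc_auc, PR_AUC = pr_auc), 3)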
These limitations underscore the importance of choosing the right evaluation metric for the task at hand, especially in fields like healthcare or fraud detection, where the cost of false negatives can be significant.
Fawcett, T., 2006. An introduction to ROC analysis. Pattern Recognition Letters, 27(8), pp.861-874.
Saito, T. and Rehmsmeier, M., 2015. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3), p.e0118432.
Bradley, A.P., 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), pp.1145-1159.