An Empirical Comparison of Logistic Regression, Naïve Bayes, and KNN for Credit Card Fraud Detection

Author

Avery Holloman

Introduction

When I think about financial fraud, it always strikes me how much damage it causes—not just to individuals but to the entire economy. Credit card transactions have become so common, yet they are increasingly targeted by fraudsters. I know that developing effective fraud detection methods is critical, but the skewed nature of the datasets always complicates the process. Fraudulent transactions form such a tiny percentage of the total data that it feels like finding a needle in a haystack. To me, this imbalance is the biggest challenge when it comes to machine learning models for fraud detection.

When I look at the data, it’s clear that standard machine learning algorithms tend to focus on the majority class (non-fraud cases) while misclassifying the minority class (fraud cases) as noise. That’s why I think techniques like resampling are so useful—they help ensure the model doesn’t ignore the smaller, more critical fraud category. I’ve decided to use Random Under-Sampling (RUS) in this study since it simplifies the dataset and creates balance by reducing the majority class.
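
To make that imbalance concrete before doing anything else, a quick check like the one below is enough. This is a minimal sketch that assumes the Kaggle file has been downloaded locally as “creditcard.csv” (an illustrative path).

# Minimal sketch: quantify the class imbalance before any resampling.
# Assumes the Kaggle CSV has been downloaded locally; the file path is illustrative.
creditcard <- read.csv("creditcard.csv")

table(creditcard$Class)                              # raw counts per class (0 = legitimate, 1 = fraud)
round(100 * prop.table(table(creditcard$Class)), 3)  # fraud is roughly 0.172% of all transactions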

Background and Motivation

Machine learning always seems like a perfect fit for problems like this—identifying hidden patterns in data to predict and prevent fraud. Still, the skewed distribution of transactions makes it tough. I’ve noticed that algorithms often need help focusing on minority cases without overfitting to noise. That’s where pre-processing techniques like RUS become essential.

For this study, I decided to compare three widely used machine learning algorithms—Logistic Regression (LR), Naïve Bayes (NB), and K-Nearest Neighbor (KNN). I wanted to see how well each one performs in detecting fraudulent transactions when trained on balanced data.

Methodology

I needed to break this study into manageable steps, so I organized it like this:

  1. Data Collection: I used a Kaggle dataset containing 284,807 credit card transactions from European cardholders over two days in 2013. Fraudulent transactions only made up 0.172% of the data, which was a challenge.

  2. Data Preprocessing: The dataset’s numerical features are already the output of Principal Component Analysis (PCA), reduced to 28 principal components labeled V1 to V28 (the original features are withheld for confidentiality), so I worked with these directly. The target variable (“Class”) indicates whether a transaction is fraudulent (1) or not (0).

  3. Random Under-Sampling (RUS): I wanted to balance the data, so I applied RUS to create three datasets with fraud-to-non-fraud ratios of 50:50, 34:66, and 25:75. It was interesting to see how balancing the data changed the outcomes.

  4. Dataset Splitting: I split each resampled dataset into training and testing sets, keeping the training data balanced while the testing data reflected real-world distributions.

  5. Model Training and Testing: I trained each model on the balanced training data and evaluated it on the testing data. For KNN, I tried both K = 1 (a simple baseline) and a K value selected by cross-validation.

  6. Performance Metrics: I evaluated the models on accuracy, sensitivity, specificity, precision, F-measure, and area under the curve (AUC); a minimal sketch of how these metrics can be computed in R follows this list.
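
Since the script at the end of this write-up only reports accuracy, sensitivity, and specificity, here is a minimal, self-contained sketch of how the full metric set can be computed with caret and pROC. The vectors truth, prob, and pred are simulated stand-ins for real test-set output, not results from the study.

# Minimal sketch of the evaluation metrics on simulated stand-ins; 'truth', 'prob',
# and 'pred' are illustrative, not outputs from the actual models.
library(caret)
library(pROC)

set.seed(1)
truth <- factor(sample(c(0, 1), 200, replace = TRUE, prob = c(0.8, 0.2)))
prob  <- ifelse(truth == 1, rbeta(200, 4, 2), rbeta(200, 2, 4))  # fake predicted probabilities
pred  <- factor(ifelse(prob > 0.5, 1, 0), levels = levels(truth))

cm <- confusionMatrix(pred, truth, positive = "1")
cm$overall["Accuracy"]
cm$byClass[c("Sensitivity", "Specificity", "Precision", "F1")]

# AUC is computed from the predicted probabilities rather than the hard labels
auc(roc(response = truth, predictor = prob))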

Results

I ran the experiments and recorded the performance of each algorithm for all three data proportions (50:50, 34:66, 25:75). Here’s how they performed:

Logistic Regression (LR)

I found that Logistic Regression consistently outperformed the other models. It was the most reliable in identifying fraudulent transactions while maintaining high accuracy and precision:

  • Accuracy: 91.2% (50:50), 92.3% (34:66), 95.9% (25:75)

  • Sensitivity: 0.878 (50:50), 0.777 (34:66), 0.839 (25:75)

  • Precision: 0.951 (50:50), 1.0 (34:66), 0.991 (25:75)

Naïve Bayes (NB)

Naïve Bayes worked well when the feature independence assumption held, but it struggled when that assumption was violated. Its sensitivity scores were consistently lower than those of Logistic Regression:

  • Sensitivity: 0.757 (50:50), 0.718 (34:66), 0.664 (25:75)

K-Nearest Neighbor (KNN)

KNN was the weakest performer, which didn’t surprise me given its reliance on proximity metrics. It struggled with the small sample sizes in the training data:

  • Accuracy: 67.9% (50:50), 68.1% (34:66), 75.1% (25:75)

Analysis

Looking at the results, a few things stood out to me:

  1. Logistic Regression was the most reliable, especially as the training data grew (the 34:66 and 25:75 sets retain more non-fraud examples). It performed consistently across all metrics, which makes sense given how it models the relationship between the features and the odds of fraud.

  2. Naïve Bayes was a close second in some cases but fell short when the assumption of feature independence was violated. I noticed its sensitivity and precision were significantly lower than Logistic Regression’s.

  3. KNN really struggled, especially with the smaller datasets. I think this was because it does nothing more than store the training data and vote on nearest neighbors, and after under-sampling there simply weren’t enough examples for those distance-based votes to reliably separate the fraud cases.

Visualizations

To make sense of the metrics, I plotted them for each algorithm and data proportion. Here’s what I saw:

  1. Sensitivity: Logistic Regression consistently had the highest sensitivity across all data proportions, meaning it correctly identified more fraudulent transactions than the others.

  2. Specificity: Logistic Regression and Naïve Bayes performed well, with near-perfect specificity for the 34:66 and 25:75 proportions.

  3. Accuracy: As expected, Logistic Regression led in accuracy, particularly for the 25:75 proportion.

  4. Precision: Again, Logistic Regression dominated, showing its ability to minimize false positives.

  5. F-Measure: Logistic Regression balanced precision and sensitivity better than Naïve Bayes and KNN.

Conclusion and Future Work

It’s clear to me that Logistic Regression is the best choice for this problem. It consistently outperformed Naïve Bayes and KNN, especially in terms of accuracy and sensitivity. However, I think there’s room to improve the results further by experimenting with more flexible models such as Random Forests or Neural Networks. I’d also like to try other resampling methods to see whether they can improve performance without discarding potentially useful data, as under-sampling does; a small sketch of one such alternative follows.
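
As a starting point for that future work, here is a minimal sketch of one alternative, random over-sampling of the minority class with caret’s upSample(); the tiny simulated data frame and its column names are purely illustrative and are not the study’s dataset.

# Minimal sketch of random over-sampling, an alternative to RUS that keeps every
# majority-class row; the simulated data frame below is illustrative only.
library(caret)

set.seed(42)
toy <- data.frame(
  V1 = rnorm(1000),
  V2 = rnorm(1000),
  Class = factor(sample(c(0, 1), 1000, replace = TRUE, prob = c(0.95, 0.05)))
)

balanced <- upSample(x = toy[, c("V1", "V2")], y = toy$Class, yname = "Class")
table(balanced$Class)  # the minority class is duplicated until both classes are equal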

# Load necessary libraries
library(caret)
library(e1071)
library(class)
library(pROC)
library(ggplot2)

# Simulate a small stand-in dataset (the real Kaggle data is not bundled with this script)
set.seed(123)
n <- 5000
data <- data.frame(
  V1 = rnorm(n, mean = 0, sd = 1),
  V2 = rnorm(n, mean = 1, sd = 1),
  V3 = rnorm(n, mean = 2, sd = 1),
  V4 = rnorm(n, mean = 0.5, sd = 1),
  Amount = abs(rnorm(n, mean = 50, sd = 25)),
  Class = factor(sample(c(0, 1), size = n, replace = TRUE, prob = c(0.97, 0.03)))
)

# Preprocess the data: Scale features
scaled_data <- scale(data[, -ncol(data)])  # Scale all features except the target variable
data <- data.frame(scaled_data, Class = data$Class)  # Combine scaled data with Class
data$Class <- as.factor(data$Class)  # Ensure Class is a factor

# Define a function for Random Under-Sampling (RUS)
under_sample <- function(data, fraud_ratio) {
  fraud <- data[data$Class == 1, ]
  non_fraud <- data[data$Class == 0, ]
  num_non_fraud <- round(nrow(fraud) * (1 - fraud_ratio) / fraud_ratio)  # non-fraud rows to keep (rounded to an integer)
  sampled_non_fraud <- non_fraud[sample(seq_len(nrow(non_fraud)), num_non_fraud), ]
  balanced_data <- rbind(fraud, sampled_non_fraud)
  balanced_data[sample(seq_len(nrow(balanced_data))), ]  # Shuffle rows
}

# Create datasets with different fraud-to-non-fraud ratios
datasets <- list(
  A = under_sample(data, 0.5),  # 50:50 fraud to non-fraud
  B = under_sample(data, 0.34), # 34:66 fraud to non-fraud
  C = under_sample(data, 0.25)  # 25:75 fraud to non-fraud
)

# Define a function to split the data into training and testing sets
# (this sketch splits the resampled data directly for simplicity; in the study,
# the test set keeps the original, imbalanced distribution)
split_data <- function(data) {
  set.seed(123)
  trainIndex <- createDataPartition(data$Class, p = 0.7, list = FALSE)
  list(train = data[trainIndex, ], test = data[-trainIndex, ])
}

# Define a function to train models and evaluate them
evaluate_models <- function(train, test) {
  # Logistic Regression
  lr_model <- glm(Class ~ ., data = train, family = binomial)
  lr_pred <- predict(lr_model, test, type = "response")
  lr_class <- factor(ifelse(lr_pred > 0.5, 1, 0), levels = levels(test$Class))  # match factor levels for caret metrics
  
  # Naïve Bayes
  nb_model <- naiveBayes(Class ~ ., data = train)
  nb_class <- predict(nb_model, test)
  
  # KNN (k = 5 is an illustrative choice; the study compared k = 1 with a cross-validated k)
  knn_pred <- knn(
    train[, -ncol(train)], test[, -ncol(test)],
    train$Class, k = 5
  )
  
  # Evaluate the models
  metrics <- data.frame(
    Model = c("Logistic Regression", "Naïve Bayes", "KNN"),
    Accuracy = c(
      mean(lr_class == test$Class),
      mean(nb_class == test$Class),
      mean(knn_pred == test$Class)
    ),
    Sensitivity = c(
      sensitivity(lr_class, test$Class, positive = "1"),
      sensitivity(nb_class, test$Class, positive = "1"),
      sensitivity(knn_pred, test$Class, positive = "1")
    ),
    Specificity = c(
      specificity(lr_class, test$Class, negative = "0"),
      specificity(nb_class, test$Class, negative = "0"),
      specificity(knn_pred, test$Class, negative = "0")
    )
  )
  
  return(metrics)
}

# Evaluate models for each dataset
results <- lapply(datasets, function(ds) {
  split <- split_data(ds)
  evaluate_models(split$train, split$test)
})

# Combine results into a single data frame for visualization
results_df <- do.call(rbind, lapply(names(results), function(name) {
  cbind(Dataset = name, results[[name]])
}))

# Visualize the results
ggplot(results_df, aes(x = Model, y = Accuracy, fill = Dataset)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Accuracy Comparison Across Models and Datasets", y = "Accuracy", x = "Model")

ggplot(results_df, aes(x = Model, y = Sensitivity, fill = Dataset)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Sensitivity Comparison Across Models and Datasets", y = "Sensitivity", x = "Model")

ggplot(results_df, aes(x = Model, y = Specificity, fill = Dataset)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Specificity Comparison Across Models and Datasets", y = "Specificity", x = "Model")