1. Introduction

This notebook presents an Exploratory Data Analysis (EDA) of the IEEE-CIS Fraud Detection dataset, originally provided by Vesta Corporation for a Kaggle competition. The dataset contains anonymized transactional and identity information with the aim of building models to detect fraudulent transactions.

The main goals of this EDA are:

  • Understand the structure and quality of the dataset, including missing values and class balance.
  • Explore interpretable transactional, categorical, and temporal features and how they relate to fraud.
  • Examine correlations, preliminary feature importance, and dimensional structure (PCA) to guide feature selection and later modeling.

2. Libraries

We’ll need the following libraries for data manipulation and visualization.

# Core libraries for data analysis and visualization
library(tidyverse)   # Includes dplyr, ggplot2, tibble, readr, etc.
library(data.table)  # Efficient data loading and manipulation for large datasets

# Additional visualization tools
library(gridExtra)   # Arranging multiple ggplot2 plots side by side
library(skimr)     # Quick overview of data frames, including missing values and distributions

library(ggcorrplot)   # Correlation matrix visualization
library(ranger)       # Fast implementation of random forests
library(naniar)       # Tools for visualizing and handling missing data

3. Data Loading

The dataset is composed of two main training files:

  • train_transaction.csv: transaction-level information (e.g., TransactionAmt, card and address fields, email domains, and the anonymized C, D, and V features).
  • train_identity.csv: identity and device information (the id_ features, DeviceType, DeviceInfo), available only for a subset of transactions.

To perform a comprehensive analysis, we merged both datasets using the common key TransactionID. We applied a left join (all.x = TRUE), meaning all rows from train_transaction are preserved, even if no identity information is available. This approach ensures that:

  • Every transaction is retained for analysis, including the majority that have no matching identity record.
  • Identity and device fields simply appear as NA where no match exists, instead of those rows being dropped.

# Load training datasets
train_transaction <- fread("../data/ieee-fraud-detection/train_transaction.csv")
train_identity    <- fread("../data/ieee-fraud-detection/train_identity.csv")

# Merge datasets by TransactionID
train_data <- merge(train_transaction, train_identity, by = "TransactionID", all.x = TRUE)

# Check merged dimensions
dim(train_data)
## [1] 590540    434

train_transaction contains 590,540 rows and 394 columns, while train_identity has 144,233 rows and 41 columns.

This results in a merged dataset named train_data, with:

  • 590,540 rows (one per transaction in train_transaction)
  • 434 columns (394 + 41, with the shared TransactionID key counted once)

Note: Although train_transaction and train_identity contain 394 and 41 columns respectively, they both include the TransactionID column. During the merge, this common column is used as the key and is not duplicated, resulting in a final dataset with 434 unique columns.

Due to the high dimensionality and anonymized nature of the dataset (434 features), we selectively explore only a subset of features that are either commonly interpretable (e.g., TransactionAmt, DeviceType, email domains) or structurally relevant (e.g., the C1–C14 and D1–D15 families). Exploring all features manually is not scalable, and later stages of this project will incorporate variable importance and dimensionality reduction techniques (e.g., Random Forest, PCA) to guide modeling.

4. Data Overview

Let’s analyze the dataset to understand its structure, missing values, and class balance.

4.1 Missing Values

To assess the quality of the data, we calculate the proportion of missing values in each column using summarise_all(~ mean(is.na(.))). This provides a quick overview of how much information is missing from each feature.

# Compute missing value percentage per column
missing_data <- train_data %>%
  summarise_all(~ mean(is.na(.))) %>%   # Calculate the percentage of NAs per column.
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_perc") %>%  # Converts the result to a long format (variable, missing_perc) for analysis and visualization.
  arrange(desc(missing_perc))  # Order the variables from highest to lowest number of missing values.

# View top 20 variables with the highest percentage of missing values
head(missing_data, 20)
## # A tibble: 20 × 2
##    variable missing_perc
##    <chr>           <dbl>
##  1 id_24           0.992
##  2 id_25           0.991
##  3 id_07           0.991
##  4 id_08           0.991
##  5 id_21           0.991
##  6 id_26           0.991
##  7 id_22           0.991
##  8 dist2           0.936
##  9 D7              0.934
## 10 id_18           0.924
## 11 D13             0.895
## 12 D14             0.895
## 13 D12             0.890
## 14 id_03           0.888
## 15 id_04           0.888
## 16 D6              0.876
## 17 D8              0.873
## 18 D9              0.873
## 19 id_09           0.873
## 20 id_10           0.873

From the results above, we observe that several variables (especially id_24, id_25, id_07, id_08, and many features from the id_ and D families) have extremely high missingness, roughly 85–99%. This is critical because:

  • Such high missingness may suggest that these features are either optional, sparsely recorded, or system-dependent.
  • These features may not be useful for modeling without appropriate imputation, encoding, or dimensionality reduction.
  • We must decide whether to drop these variables or retain them by filling missing values (e.g., using domain knowledge or imputation strategies).

This information will be key during the feature selection and preprocessing phases.
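As a simple illustration of how this could feed into preprocessing, the snippet below drops columns above a chosen missingness threshold; the 90% cutoff is an arbitrary assumption for the sketch, not a decision made in this notebook.

# Sketch: drop columns with more than 90% missing values (threshold is illustrative)
high_na_cols <- missing_data %>%
  filter(missing_perc > 0.9) %>%
  pull(variable)

train_data_reduced <- train_data %>%
  select(-all_of(high_na_cols))

dim(train_data_reduced)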

4.2 Class Balance

This dataset is highly imbalanced, with a strong predominance of non-fraudulent transactions. Understanding this imbalance is crucial before training machine learning models, as it can bias the models toward the majority class.

# Distribution of target variable
train_data %>%
  count(isFraud) %>%
  mutate(percentage = n / sum(n))
## Key: <TransactionID>
##    isFraud      n percentage
##      <int>  <int>      <num>
## 1:       0 569877 0.96500999
## 2:       1  20663 0.03499001

Observation:

The target variable isFraud is heavily imbalanced:

  • Non-fraudulent transactions (isFraud = 0): 569,877 (~96.5%)
  • Fraudulent transactions (isFraud = 1): 20,663 (~3.5%)

This class imbalance can lead machine learning models to be biased toward predicting the majority class. To address this, we will consider appropriate strategies such as:

  • Resampling techniques: oversampling the minority class (e.g. SMOTE) or undersampling the majority class.
  • Class weighting: penalizing misclassification of minority class more.
  • Anomaly detection approaches: modeling the rare class as outliers.
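As a minimal sketch of the class-weighting idea (an option we may or may not adopt later), weights inversely proportional to class frequency could be computed as follows and passed to ranger() via its case.weights argument:

# Sketch: observation weights inversely proportional to class frequency
class_freq <- table(train_data$isFraud)
obs_weights <- ifelse(train_data$isFraud == 1,
                      1 / class_freq["1"],   # minority class receives a larger weight
                      1 / class_freq["0"])
# e.g. ranger(isFraud ~ ., data = ..., case.weights = obs_weights)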

5. Exploratory Visualizations

5.1 Transaction Amount Distribution

ggplot(train_data, aes(x = TransactionAmt, fill = factor(isFraud))) +
  geom_histogram(bins = 100, position = "identity", alpha = 0.6) +
  scale_x_log10() +
  labs(
    title = "Transaction Amount Distribution (Log Scale)",
    x = "Transaction Amount (log scale)",
    y = "Count",
    fill = "Is Fraud" 
  )

Observation:

  • The majority of transactions are clustered between 10 and 500 USD.
  • There is a heavy class imbalance, which is reflected in the plot: fraudulent transactions (label 1) are visually scarce compared to non-fraudulent ones.
  • Using log10 scaling on the x-axis helps reveal distribution patterns that would otherwise be compressed due to skewness.
  • Peaks around certain amounts (like ~$100) suggest transaction rounding behavior, which may be informative for feature engineering.
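A quick, hedged sketch of how the rounding observation could be checked and later turned into a feature (IsRoundAmount is a hypothetical name, not used elsewhere in this notebook):

# Sketch: flag whole-dollar amounts and compare their frequency by class
train_data %>%
  mutate(IsRoundAmount = TransactionAmt %% 1 == 0) %>%
  count(isFraud, IsRoundAmount)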

5.2 Device Info

# Compute top 10 most common DeviceInfo values
train_data %>%
  count(DeviceInfo) %>%                          # Count occurrences of each DeviceInfo value
  slice_max(n, n = 10) %>%                       # Select top 10 by count (modern dplyr alternative to top_n)
  ggplot(aes(x = reorder(DeviceInfo, n), y = n)) +  # Reorder device names by frequency
  geom_col(fill = "steelblue") +                 # Draw bars
  coord_flip() +                                 # Flip coordinates so bars are horizontal
  labs(
    title = "Top 10 Device Info", 
    x = "Device Info", 
    y = "Count"
  )

This plot displays the 10 most frequent DeviceInfo values in the training dataset. Notably, over 400,000 entries are missing this value, labeled as NA. Windows and iOS devices are the most common among non-missing values. This suggests the need for careful handling of this feature, potentially through imputation or categorical encoding.

Observation on Device Info:

The bar chart shows the top 10 most frequent values of the DeviceInfo column. The most common entry is clearly "NA", which represents missing values. However, we also observe an unlabeled bar just below "Windows", indicating the presence of empty strings ("").

This distinction is important:

  • NA represents missing data in R.
  • "" represents a present but empty string, often due to incomplete logging or device masking.

These two cases should ideally be treated consistently, so we may consider replacing empty strings with NA before modeling or imputation.
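A possible harmonization step, shown here only as a sketch of what that replacement could look like (not applied to the analyses below):

# Sketch: treat empty DeviceInfo strings as missing
train_data %>%
  mutate(DeviceInfo = na_if(DeviceInfo, "")) %>%
  summarise(n_missing_after = sum(is.na(DeviceInfo)))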

5.3 Temporal Features Analysis

We start this section by converting the TransactionDT column to a more interpretable date-time format. The dataset begins at a reference point, which we define as “2017-12-01”. This allows us to derive meaningful temporal features such as the hour of the transaction and the day of the week.

# Convert TransactionDT to hours and days
# Note: the dataset starts at a reference point, we define a fake 'origin'
origin_time <- as.POSIXct("2017-12-01", tz = "UTC")
train_data$TransactionDate <- origin_time + train_data$TransactionDT

train_data$TransactionHour <- lubridate::hour(train_data$TransactionDate)
train_data$TransactionDay  <- lubridate::wday(train_data$TransactionDate, label = TRUE)

The next step is to visualize the distribution of transactions by hour and day of the week. This helps us understand whether there are specific times or days when fraudulent transactions are more likely to occur.

# Plot fraud rate by hour
ggplot(train_data, aes(x = TransactionHour, fill = factor(isFraud))) +
  geom_histogram(binwidth = 1, position = "fill") +
  labs(
    title = "Fraud Ratio by Hour of Day",
    x = "Hour of Day",
    y = "Proportion",
    fill = "Is Fraud"
  ) +
  scale_x_continuous(breaks = 0:23)

Observation:

This plot shows the proportion of fraudulent vs. non-fraudulent transactions across different hours of the day.

  • Most transactions occur between 5 a.m. and 12 p.m. (UTC reference).
  • Fraudulent activity is most frequent between 6 a.m. and 9 a.m., with a peak around 7 a.m.
  • This temporal concentration may suggest:
    • Automated fraud attempts during low supervision periods.
    • Scheduled transaction scripts or bot activity.
    • Behavioral patterns specific to a time zone (which we can’t determine precisely due to anonymization, but worth noting).

This information could be useful to:

  • Engineer features like IsMorning, IsNight, etc. (see the sketch below).
  • Apply time-aware models or rules.
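A minimal sketch of such flags (the hour cut-offs below are assumptions for illustration only):

# Sketch: simple time-of-day flags derived from TransactionHour
train_data %>%
  mutate(
    IsMorning = TransactionHour >= 6 & TransactionHour < 12,
    IsNight   = TransactionHour < 6
  ) %>%
  count(isFraud, IsMorning, IsNight)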

ggplot(train_data, aes(x = TransactionDay, fill = factor(isFraud))) +
  geom_bar(position = "fill") +
  labs(
    title = "Fraud Ratio by Day of Week",
    x = "Day of Week",
    y = "Proportion",
    fill = "Is Fraud"
  )

Observation:

This bar plot shows the proportion of fraudulent transactions by day of the week.

  • The fraud rate remains remarkably stable across all days, with no major peaks or drops.
  • This uniformity may imply:
    • The fraud system is automated, running independently of the calendar.
    • There is no strong behavioral signal in the day-of-week feature.

As a result, TransactionDay may have limited predictive power on its own, but could still be useful when combined with other temporal or categorical features.

5.4 Categorical Features Analysis

We will explore some categorical features that may provide insights into the fraud detection process. The following features are of particular interest:

  • ProductCD: Product code associated with the transaction.
  • card4: Card type (e.g., Visa, Mastercard).
  • card6: Card type (e.g., credit, debit).
  • P_emaildomain: Email domain of the payer.
  • R_emaildomain: Email domain of the recipient.
  • DeviceType: Type of device used for the transaction.

We analyze the fraud proportions across these categories using proportion-based bar plots (position = "fill" in ggplot2) to better visualize class imbalance across each level.

# List of selected categorical features to analyze
cat_vars <- c("ProductCD", "card4", "card6", "P_emaildomain", "R_emaildomain", "DeviceType")
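The cat_vars vector can also drive these plots in a single loop; a sketch similar to the individual plots below (the email-domain plots additionally restrict to the top 10 domains):

# Optional: loop over the selected categorical variables
for (v in cat_vars) {
  p <- ggplot(train_data, aes(x = .data[[v]], fill = factor(isFraud))) +
    geom_bar(position = "fill") +
    labs(title = paste("Fraud Proportion by", v),
         x = v, y = "Proportion", fill = "Is Fraud")
  print(p)
}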
  • ProductCD
ggplot(train_data, aes(x = ProductCD, fill = factor(isFraud))) +
  geom_bar(position = "fill") +
  labs(
    title = "Fraud Proportion by ProductCD",
    x = "ProductCD",
    y = "Proportion",
    fill = "Is Fraud"
  )

Observation:

  • Product code C has a noticeably higher fraud rate compared to the others.
  • This may suggest that some product categories are more prone to fraud, perhaps due to their price range, popularity, or nature of transactions.

  • card4
ggplot(train_data, aes(x = card4, fill = factor(isFraud))) +
  geom_bar(position = "fill") +
  labs(
    title = "Fraud Proportion by Card Type (card4)",
    x = "Card Type",
    y = "Proportion",
    fill = "Is Fraud"
  )

Observation:

  • Among card types, Discover shows the highest fraud proportion.
  • Visa and Mastercard dominate in volume, but their fraud ratio is relatively lower.

  • card6
ggplot(train_data, aes(x = card6, fill = factor(isFraud))) +
  geom_bar(position = "fill") +
  labs(
    title = "Fraud Proportion by Card Type (card6)",
    x = "Card Type (card6)",
    y = "Proportion",
    fill = "Is Fraud"
  )

Observation:

  • Fraud is more frequent for credit cards than for debit or charge cards.
  • Transactions labeled "debit or credit" are rare and show lower fraud prevalence; this may be due to redundant labeling, since separate "credit" and "debit" categories already exist.

  • P_emaildomain
# Top 10 most common payer email domains
train_data %>%
  count(P_emaildomain) %>%
  slice_max(n, n = 10) %>%
  pull(P_emaildomain) -> top_p_domains

ggplot(train_data %>% filter(P_emaildomain %in% top_p_domains),
       aes(x = P_emaildomain, fill = factor(isFraud))) +
  geom_bar(position = "fill") +
  labs(
    title = "Fraud Proportion by Payer Email Domain",
    x = "Payer Email Domain (Top 10)",
    y = "Proportion",
    fill = "Is Fraud"
  ) +
  coord_flip()

Observation:

  • The most common payer domains (e.g., gmail.com, hotmail.com, outlook.com, yahoo.com) show relatively low fraud rates, although outlook.com and hotmail.com have a slightly higher fraud ratio than gmail.com or icloud.com.
  • Domains like anonymous.com or rare email providers may require additional attention.

  • R_emaildomain
# Top 10 most common recipient email domains
train_data %>%
  count(R_emaildomain) %>%
  slice_max(n, n = 10) %>%
  pull(R_emaildomain) -> top_r_domains

ggplot(train_data %>% filter(R_emaildomain %in% top_r_domains),
       aes(x = R_emaildomain, fill = factor(isFraud))) +
  geom_bar(position = "fill") +
  labs(
    title = "Fraud Proportion by Recipient Email Domain",
    x = "Recipient Email Domain (Top 10)",
    y = "Proportion",
    fill = "Is Fraud"
  ) +
  coord_flip()

Observation:

  • Recipient email domains exhibit wider variation.
  • Notably, domains such as outlook.com, icloud.com, and gmail.com have higher fraud proportions, suggesting possible use in fraudulent operations.

  • DeviceType
ggplot(train_data, aes(x = DeviceType, fill = factor(isFraud))) +
  geom_bar(position = "fill") +
  labs(
    title = "Fraud Proportion by Device Type",
    x = "Device Type",
    y = "Proportion",
    fill = "Is Fraud"
  )

Observation:

  • Mobile transactions exhibit a slightly higher proportion of fraud compared to desktop.
  • This aligns with common patterns of fraud behavior exploiting less secure or unsupervised mobile environments.

5.5 Continuous Features Analysis - C1 to C14

The dataset includes a family of anonymized continuous variables labeled C1 through C14. These features likely capture aggregated behavioral metrics or historical transaction patterns.

We will begin by visualizing C1 and C2 in detail, and then generalize the analysis to the entire set of C variables.

5.5.1 C1 and C2 – Initial Exploration

# Gather variables into long format for facetting
train_data %>%
  select(C1, C2, isFraud) %>%
  pivot_longer(cols = c(C1, C2), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value, fill = factor(isFraud))) +
  geom_histogram(bins = 100, position = "identity", alpha = 0.5) +
  facet_wrap(~ Variable, scales = "free") +
  scale_x_continuous(trans = "log1p") +
  labs(
    title = "Distribution of C1 and C2 by Fraud Status",
    x = "Value (log scale)",
    y = "Count",
    fill = "Is Fraud"
  )

Observation:

  • The distributions are highly skewed to the right.
  • Fraudulent and non-fraudulent transactions overlap heavily in this space, though fine-grained differences might still be exploitable with modeling.
  • A log transformation improves visualization but still shows long-tailed distributions.

To reduce noise from extreme outliers and improve interpretability, we apply filtering below the 99th percentile.

# Calculating 99th percentiles for C1 and C2
c1_99 <- quantile(train_data$C1, 0.99, na.rm = TRUE)
c2_99 <- quantile(train_data$C2, 0.99, na.rm = TRUE)

# Filtering and plotting
train_data %>%
  filter(C1 < c1_99, C2 < c2_99) %>%
  select(C1, C2, isFraud) %>%
  pivot_longer(cols = c(C1, C2), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value, fill = factor(isFraud))) +
  geom_histogram(bins = 60, position = "identity", alpha = 0.5) +
  facet_wrap(~ Variable, scales = "free") +
  labs(
    title = "Distribution of C1 and C2 by Fraud Status (Filtered < 99th Percentile)",
    x = "Value",
    y = "Count",
    fill = "Is Fraud"
  )

Note: While we remove outliers for visualization clarity, they are retained for modeling, as they may encode meaningful anomalies.

5.5.2 General Analysis of C1–C14

Now, we extend the distribution analysis to the full set of C variables.

# Identify C variables
c_vars <- names(train_data)[grepl("^C\\d+$", names(train_data))]

# Convert to long format
c_long <- train_data %>%
  select(isFraud, all_of(c_vars)) %>%
  pivot_longer(cols = all_of(c_vars), names_to = "Variable", values_to = "Value")

# Calculate thresholds by variable
thresholds_c <- c_long %>%
  group_by(Variable) %>%
  summarise(threshold = quantile(Value, 0.99, na.rm = TRUE), .groups = "drop")

# Join and filter
c_long <- c_long %>%
  left_join(thresholds_c, by = "Variable") %>%
  filter(is.na(Value) | Value < threshold)

# Plot
ggplot(c_long, aes(x = Value, fill = factor(isFraud))) +
  geom_histogram(bins = 60, position = "identity", alpha = 0.5) +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  labs(
    title = "Distribution of C1–C14 Variables (Filtered < 99th Percentile)",
    x = "Value", y = "Count", fill = "Is Fraud"
  ) +
  theme_minimal()

plot_c_block <- function(vars_subset) {
  c_long_block <- c_long %>%
    filter(Variable %in% vars_subset) %>%
    mutate(Variable = factor(Variable, levels = vars_subset)) 

  ggplot(c_long_block, aes(x = Value, fill = factor(isFraud))) +
    geom_histogram(bins = 60, position = "identity", alpha = 0.5) +
    facet_wrap(~ Variable, scales = "free", ncol = 2) +
    labs(
      title = paste0("Distribution of ", paste(vars_subset, collapse = ", "), " (Filtered < 99th Percentile)"),
      x = "Value", y = "Count", fill = "Is Fraud"
    ) +
    theme_minimal()
}
plot_c_block(paste0("C", 1:5))

Note: Variable C3 does not appear because it is almost entirely zeros (its 99th percentile is 0, so the < 99th percentile filter removes essentially all of its rows); a near-constant column offers no useful variation for distribution plots or predictive modeling.
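A quick check of this (output omitted) could be:

# Verify that C3 is almost entirely zeros
summary(train_data$C3)
mean(train_data$C3 == 0, na.rm = TRUE)  # share of exact zeros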

plot_c_block(paste0("C", 6:10))

plot_c_block(paste0("C", 11:14))

skim(select(train_data, all_of(c_vars)))
Data summary
Name select(train_data, all_of…
Number of rows 590540
Number of columns 14
Key TransactionID
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
C1 0 1 14.09 133.57 0 1 1 3 4685 ▇▁▁▁▁
C2 0 1 15.27 154.67 0 1 1 3 5691 ▇▁▁▁▁
C3 0 1 0.01 0.15 0 0 0 0 26 ▇▁▁▁▁
C4 0 1 4.09 68.85 0 0 0 0 2253 ▇▁▁▁▁
C5 0 1 5.57 25.79 0 0 0 1 349 ▇▁▁▁▁
C6 0 1 9.07 71.51 0 1 1 2 2253 ▇▁▁▁▁
C7 0 1 2.85 61.73 0 0 0 0 2255 ▇▁▁▁▁
C8 0 1 5.14 95.38 0 0 0 0 3331 ▇▁▁▁▁
C9 0 1 4.48 16.67 0 0 1 2 210 ▇▁▁▁▁
C10 0 1 5.24 95.58 0 0 0 0 3257 ▇▁▁▁▁
C11 0 1 10.24 94.34 0 1 1 2 3188 ▇▁▁▁▁
C12 0 1 4.08 86.67 0 0 0 0 3188 ▇▁▁▁▁
C13 0 1 32.54 129.36 0 1 3 12 2918 ▇▁▁▁▁
C14 0 1 8.30 49.54 0 1 1 2 1429 ▇▁▁▁▁

5.6 Continuous Features Analysis - D1 to D15

Let’s first check which variables start with the letter “D”.

# List all variables starting with "D"
names(train_data)[grepl("^D", names(train_data))]
##  [1] "D1"         "D2"         "D3"         "D4"         "D5"        
##  [6] "D6"         "D7"         "D8"         "D9"         "D10"       
## [11] "D11"        "D12"        "D13"        "D14"        "D15"       
## [16] "DeviceType" "DeviceInfo"
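Note that the "^D" pattern also matches DeviceType and DeviceInfo, which are not part of the D1–D15 family. A stricter pattern (analogous to the one used for the C variables) keeps only the numeric features:

# Restrict to the numeric D1–D15 features only
names(train_data)[grepl("^D\\d+$", names(train_data))]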

5.6.1 Distribution of D1 to D5 (Filtered < 99th Percentile)

The D variables are anonymous temporal or delay-related features. Since their meaning is unknown, we group and visualize them in blocks of five, filtering extreme outliers (above 99th percentile) to highlight more common patterns.

# Step 1: define the D1 to D5 variables
d_vars_1 <- paste0("D", 1:5)

# Step 2: calculate 99th percentile thresholds
d_thresholds_1 <- sapply(d_vars_1, function(var) {
  quantile(train_data[[var]], 0.99, na.rm = TRUE)
})
names(d_thresholds_1) <- d_vars_1

# Step 3: create filtered version of the dataset
train_data_d1_d5 <- train_data
for (var in d_vars_1) {
  threshold <- d_thresholds_1[[var]]
  train_data_d1_d5 <- train_data_d1_d5 %>%
    filter(is.na(.data[[var]]) | .data[[var]] < threshold)
}

# Step 4: reshape for plotting
d_long_1 <- train_data_d1_d5 %>%
  select(isFraud, all_of(d_vars_1)) %>%
  pivot_longer(cols = all_of(d_vars_1), names_to = "variable", values_to = "value") %>%
  filter(!is.na(value))

# Optional: enforce variable order for facet
d_long_1$variable <- factor(d_long_1$variable, levels = d_vars_1)

# Step 5: plot
ggplot(d_long_1, aes(x = value, fill = factor(isFraud))) +
  geom_histogram(bins = 50, position = "identity", alpha = 0.6) +
  facet_wrap(~ variable, scales = "free", ncol = 2) +
  labs(
    title = "Distribution of D1–D5 Variables (Filtered < 99th Percentile)",
    x = "Value",
    y = "Count",
    fill = "Is Fraud"
  ) +
  theme_minimal()

# Step 1: define the D6 to D10 variables
d_vars_2 <- paste0("D", 6:10)

# Step 2: calculate 99th percentile thresholds
d_thresholds_2 <- sapply(d_vars_2, function(var) {
  quantile(train_data[[var]], 0.99, na.rm = TRUE)
})
names(d_thresholds_2) <- d_vars_2

# Step 3: create filtered version of the dataset
train_data_d6_d10 <- train_data
for (var in d_vars_2) {
  threshold <- d_thresholds_2[[var]]
  train_data_d6_d10 <- train_data_d6_d10 %>%
    filter(is.na(.data[[var]]) | .data[[var]] < threshold)
}

# Step 4: reshape for plotting
d_long_2 <- train_data_d6_d10 %>%
  select(isFraud, all_of(d_vars_2)) %>%
  pivot_longer(cols = all_of(d_vars_2), names_to = "variable", values_to = "value") %>%
  filter(!is.na(value))

# Optional: enforce variable order for facet
d_long_2$variable <- factor(d_long_2$variable, levels = d_vars_2)

# Step 5: plot
ggplot(d_long_2, aes(x = value, fill = factor(isFraud))) +
  geom_histogram(bins = 50, position = "identity", alpha = 0.6) +
  facet_wrap(~ variable, scales = "free", ncol = 2) +
  labs(
    title = "Distribution of D6–D10 Variables (Filtered < 99th Percentile)",
    x = "Value",
    y = "Count",
    fill = "Is Fraud"
  ) +
  theme_minimal()

# Step 1: define the D11 to D15 variables
d_vars_3 <- paste0("D", 11:15)

# Step 2: calculate 99th percentile thresholds
d_thresholds_3 <- sapply(d_vars_3, function(var) {
  quantile(train_data[[var]], 0.99, na.rm = TRUE)
})
names(d_thresholds_3) <- d_vars_3

# Step 3: create filtered version of the dataset
train_data_d11_d15 <- train_data
for (var in d_vars_3) {
  threshold <- d_thresholds_3[[var]]
  train_data_d11_d15 <- train_data_d11_d15 %>%
    filter(is.na(.data[[var]]) | .data[[var]] < threshold)
}

# Step 4: reshape for plotting
d_long_3 <- train_data_d11_d15 %>%
  select(isFraud, all_of(d_vars_3)) %>%
  pivot_longer(cols = all_of(d_vars_3), names_to = "variable", values_to = "value") %>%
  filter(!is.na(value))

# Optional: enforce variable order for facet
d_long_3$variable <- factor(d_long_3$variable, levels = d_vars_3)

# Step 5: plot
ggplot(d_long_3, aes(x = value, fill = factor(isFraud))) +
  geom_histogram(bins = 50, position = "identity", alpha = 0.6) +
  facet_wrap(~ variable, scales = "free", ncol = 2) +
  labs(
    title = "Distribution of D11–D15 Variables (Filtered < 99th Percentile)",
    x = "Value",
    y = "Count",
    fill = "Is Fraud"
  ) +
  theme_minimal()
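The three blocks above repeat the same filter / reshape / plot steps. A helper analogous to plot_c_block could remove this duplication; the sketch below filters each variable independently by its own 99th percentile (slightly different from the row-wise filtering used above), so it is an alternative rather than a drop-in replacement:

plot_d_block <- function(vars_subset) {
  d_long <- train_data %>%
    select(isFraud, all_of(vars_subset)) %>%
    pivot_longer(cols = all_of(vars_subset), names_to = "variable", values_to = "value") %>%
    filter(!is.na(value)) %>%
    group_by(variable) %>%
    filter(value < quantile(value, 0.99, na.rm = TRUE)) %>%  # per-variable 99th percentile
    ungroup() %>%
    mutate(variable = factor(variable, levels = vars_subset))

  ggplot(d_long, aes(x = value, fill = factor(isFraud))) +
    geom_histogram(bins = 50, position = "identity", alpha = 0.6) +
    facet_wrap(~ variable, scales = "free", ncol = 2) +
    labs(
      title = paste0("Distribution of ", paste(vars_subset, collapse = ", "),
                     " (Filtered < 99th Percentile)"),
      x = "Value", y = "Count", fill = "Is Fraud"
    ) +
    theme_minimal()
}

# e.g. plot_d_block(paste0("D", 1:5))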

# Summary statistics for D variables
d_vars <- paste0("D", 1:15)
skim(select(train_data, all_of(d_vars)))
Data summary
Name select(train_data, all_of…
Number of rows 590540
Number of columns 15
Key TransactionID
_______________________
Column type frequency:
numeric 15
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
D1 1269 1.00 94.35 157.66 0 0.00 3.00 122.00 640.00 ▇▁▁▁▁
D2 280797 0.52 169.56 177.32 0 26.00 97.00 276.00 640.00 ▇▂▂▁▁
D3 262878 0.55 28.34 62.38 0 1.00 8.00 27.00 819.00 ▇▁▁▁▁
D4 168922 0.71 140.00 191.10 -122 0.00 26.00 253.00 869.00 ▇▂▂▂▁
D5 309841 0.48 42.34 89.00 0 1.00 10.00 32.00 819.00 ▇▁▁▁▁
D6 517353 0.12 69.81 143.67 -83 0.00 0.00 40.00 873.00 ▇▁▁▁▁
D7 551623 0.07 41.64 99.74 0 0.00 0.00 17.00 843.00 ▇▁▁▁▁
D8 515614 0.13 146.06 231.66 0 0.96 37.88 187.96 1707.79 ▇▁▁▁▁
D9 515614 0.13 0.56 0.32 0 0.21 0.67 0.83 0.96 ▆▂▂▇▇
D10 76022 0.87 123.98 182.62 0 0.00 15.00 197.00 876.00 ▇▁▁▁▁
D11 279287 0.53 146.62 186.04 -53 0.00 43.00 274.00 670.00 ▇▂▂▂▁
D12 525823 0.11 54.04 124.27 -83 0.00 0.00 13.00 648.00 ▇▁▁▁▁
D13 528588 0.10 17.90 67.61 0 0.00 0.00 0.00 847.00 ▇▁▁▁▁
D14 528353 0.11 57.72 136.31 -193 0.00 0.00 2.00 878.00 ▇▁▁▁▁
D15 89113 0.85 163.74 202.73 -83 0.00 52.00 314.00 879.00 ▇▂▂▂▁

Now that we have explored the individual distributions of continuous features, the next step is to evaluate pairwise correlations to detect redundancy and guide feature selection for the baseline model.

5.7 Correlation Analysis

In this section, we examine linear relationships between numeric variables to identify:

  • Redundancy or multicollinearity
  • Structural patterns and feature clusters
  • Variables that may be useful (or harmful) in modeling

These insights can inform feature selection, regularization, and dimensionality reduction strategies such as PCA or Lasso.

5.7.1 Selection of Numerical Features

We restrict the analysis to numeric variables with fewer than 20% missing values to ensure robustness of correlations.

# Identify numeric columns with low NA ratio
# Exclude TransactionID explicitly
numeric_vars <- train_data %>%
  select(where(is.numeric)) %>%
  select(-TransactionID) %>% 
  summarise_all(~ mean(is.na(.))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "NA_Percent") %>%
  filter(NA_Percent < 0.2) %>%
  pull(Variable)

# Filter and clean
train_data_num <- train_data %>%
  select(all_of(numeric_vars)) %>%
  drop_na()

5.7.2 Computing Correlation Matrix

We compute the Pearson correlation matrix for the cleaned numeric dataset.

# Compute Pearson correlation matrix
cor_matrix <- cor(train_data_num, use = "complete.obs", method = "pearson")

# Quick check
dim(cor_matrix)
## [1] 178 178
head(cor_matrix[, 1:5])
##                     isFraud TransactionDT TransactionAmt         card1
## isFraud         1.000000000 -0.0018940039    0.038327320  0.0019043194
## TransactionDT  -0.001894004  1.0000000000    0.006214577 -0.0005050166
## TransactionAmt  0.038327320  0.0062145769    1.000000000 -0.0021376019
## card1           0.001904319 -0.0005050166   -0.002137602  1.0000000000
## card2           0.018776409 -0.0056159505    0.030696590  0.0008656832
## card3           0.035755266 -0.0634543974   -0.026487992 -0.0002535563
##                        card2
## isFraud         0.0187764094
## TransactionDT  -0.0056159505
## TransactionAmt  0.0306965897
## card1           0.0008656832
## card2           1.0000000000
## card3           0.0080829847

5.7.3 Visualizing Correlation Matrix

This global heatmap offers a high-level view of how numerical variables relate to each other.

# Plot
ggcorrplot(cor_matrix,
           hc.order = TRUE, type = "lower", lab = FALSE,
           colors = c("steelblue", "white", "darkred"),
           title = "Correlation Matrix of Selected Numerical Variables",
           ggtheme = theme_minimal())

Note: Due to the high number of features, the full matrix may be hard to interpret directly. The following sections highlight key patterns.

5.7.4 Top Correlations with isFraud

We extract the variables most linearly associated with the target.

# Ordering by absolute correlation with isFraud
cor_target <- cor_matrix[, "isFraud"]
cor_target_sorted <- sort(abs(cor_target), decreasing = TRUE)

# View top correlations
head(cor_target_sorted, 20)
##    isFraud       V283       V281       V282       V292       V315       V289 
## 1.00000000 0.14837641 0.13693604 0.11916314 0.09441988 0.08961929 0.08775409 
##       V291       V312       V288       V313        V90        V29        V74 
## 0.08304392 0.08250770 0.08109182 0.08015370 0.07973573 0.07951064 0.07922080 
##        V69        V56       V314       V284        V70        V91 
## 0.07894555 0.07863102 0.07792122 0.07598742 0.07586851 0.07567217

Observation:

  • No feature has a strong linear correlation with isFraud (> 0.15).
  • The most correlated variables include V283, V281, V292, V291, V315, among others.
  • This suggests that fraud detection may require nonlinear patterns or interactions.

5.7.5 Correlation Among Top Predictors

We isolate and visualize the top 15 variables most correlated with the target.

# Obtain the top 15 variables with highest absolute correlation with isFraud (except isFraud itself)
top_vars <- names(cor_target_sorted)[2:16]

# Subset the correlation matrix for these variables
cor_subset <- cor_matrix[top_vars, top_vars]

# Heatmap
ggcorrplot(cor_subset,
           hc.order = TRUE, type = "lower", lab = TRUE,
           colors = c("steelblue", "white", "darkred"),
           title = "Correlation Matrix of Top 15 Variables",
           ggtheme = theme_minimal())

Observation:

  • Variables such as V282–V283, V313–V315, and V291–V292 show strong mutual correlations, indicating:
    • Possible feature redundancy
    • Opportunities for aggregation or dimensionality reduction

5.7.6 Detecting Multicollinearity (Corr > 0.90)

To identify highly redundant variables, we extract all variable pairs with correlation above 0.90 (excluding self-correlation).

high_corr_pairs <- which(abs(cor_matrix) > 0.9 & abs(cor_matrix) < 1, arr.ind = TRUE)

# Remove symmetric duplicates
high_corr_df <- as.data.frame(high_corr_pairs)
high_corr_df <- high_corr_df[high_corr_df$row < high_corr_df$col, ]

# Add variable names and correlation values
high_corr_df <- high_corr_df %>%
  mutate(var1 = rownames(cor_matrix)[row],
         var2 = colnames(cor_matrix)[col],
         corr = cor_matrix[cbind(row, col)]) %>%
  arrange(desc(abs(corr)))

head(high_corr_df, 10)
##        row col var1 var2      corr
## C7.3    16  21   C7  C12 0.9989629
## C8.2    17  19   C8  C10 0.9952333
## C1.2    10  20   C1  C11 0.9944343
## V101.2  98 149 V101 V293 0.9938229
## V95     92  98  V95 V101 0.9922605
## V103.4 100 151 V103 V295 0.9920565
## V279.2 135 149 V279 V293 0.9905360
## V57     54  55  V57  V58 0.9901596
## V95.1   92 135  V95 V279 0.9900867
## C7.2    16  19   C7  C10 0.9881490

Observations

  • The top variables correlated with isFraud (e.g., V283, V281, V292, V315, V289) exhibit relatively low Pearson correlation coefficients (mostly under 0.15). This suggests that no single variable is linearly predictive of fraud on its own, reaffirming the complex and subtle nature of fraud detection.
  • A subset of features (e.g., V69, V90, V291) show some moderate correlation, but these are likely to be more effective when used together or transformed (e.g., via non-linear models or interactions).
  • Several pairs of variables exhibit very high mutual correlation (above 0.99), such as:
    • C7 ↔︎ C12
    • C10 ↔︎ C8
    • C1 ↔︎ C11
    • V101 ↔︎ V293
    • V95 ↔︎ V101, V279

These high correlations indicate redundancy or multicollinearity, which could impact model stability and interpretability, particularly for linear models.

TransactionID was excluded from the analysis, as it is an identifier and does not convey predictive information about the target.

Next steps (planning)

  • Use dimensionality reduction techniques like PCA or autoencoders to handle correlated variables and reduce noise in the feature space.
  • Apply regularized models (e.g., Lasso, Ridge, Elastic Net) that can manage multicollinearity and perform variable selection automatically.
  • Consider dropping or consolidating highly correlated variables, especially where domain knowledge is lacking due to anonymization. Groupings like Vxxx or Cxx can also be explored via unsupervised clustering or feature aggregation.
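As one crude illustration of the third point (an assumption for the sketch, not a decision made here), the pairs found above could be pruned by dropping the second variable of each highly correlated pair:

# Sketch: drop one variable from each pair with |corr| > 0.9
vars_to_drop <- unique(high_corr_df$var2)
train_data_pruned <- train_data_num %>%
  select(-any_of(vars_to_drop))

length(vars_to_drop)
dim(train_data_pruned)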

5.8 Feature Importance (Baseline Model)

In this section, we estimate the relative importance of features using a Random Forest classifier. This approach allows us to identify which variables contribute most to the prediction of fraudulent transactions, even without optimizing the model for accuracy yet.

We use the ranger package, a fast implementation of Random Forests, to compute variable importance based on Gini impurity reduction.

5.8.1 Preparing data

We reuse the cleaned numeric dataset (train_data_num) built in the correlation section. Since ranger requires the target variable to be a factor, we convert isFraud accordingly.

# Add isFraud as a factor (classification requirement)
train_data_rf <- train_data_num %>%
  mutate(isFraud = factor(isFraud))  # Make sure it's a factor for ranking

5.8.2 Training a Random Forest Model

We train a lightweight model with 100 trees, using default settings and importance = "impurity" to retrieve Gini-based feature rankings.

# Training simple model to estimate importance
rf_model <- ranger(
  formula = isFraud ~ .,
  data = train_data_rf,
  importance = "impurity",     
  num.trees = 100,              
  seed = 123                    
)
## Growing trees.. Progress: 29%. Estimated remaining time: 1 minute, 18 seconds.
## Growing trees.. Progress: 59%. Estimated remaining time: 45 seconds.
## Growing trees.. Progress: 89%. Estimated remaining time: 11 seconds.

Note: This model is not meant for final prediction but for preliminary insight into variable contributions.

5.8.3 Extracting and inspecting importance

We extract the feature importances into a tidy dataframe, sorted in descending order.

# Convert ordered importance to a tidy dataframe
importance_df <- as.data.frame(rf_model$variable.importance) %>%
  tibble::rownames_to_column("Variable") %>%
  rename(Importance = `rf_model$variable.importance`) %>%
  arrange(desc(Importance))

# show top 20 features
head(importance_df, 20)
##           Variable Importance
## 1    TransactionDT   606.9360
## 2            card1   582.9908
## 3            addr1   510.8864
## 4   TransactionAmt   507.5807
## 5            card2   491.6512
## 6              D15   398.7465
## 7              D10   342.2649
## 8              C13   339.6097
## 9               D1   336.5991
## 10 TransactionHour   331.4401
## 11            V307   279.7027
## 12           card5   258.1799
## 13              C6   250.6636
## 14            V310   245.7405
## 15              C9   233.9571
## 16              C1   228.9751
## 17            V308   226.7537
## 18              C2   220.3989
## 19            V127   217.4486
## 20             C14   202.2062

5.8.4 Ranking visualization

The top 20 variables by importance are shown below.

importance_df %>%
  slice_max(order_by = Importance, n = 20) %>%
  ggplot(aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Feature Importances (Random Forest)",
    x = "Variable",
    y = "Importance"
  ) +
  theme_minimal()

Observations:

  • The most important features for distinguishing fraud include:
    • TransactionDT (time-related)
    • card1, card2, card5 (card identifiers)
    • addr1 (billing region)
    • TransactionAmt (amount of the transaction)
    • Several anonymized behavioral or delay features (D10, D15, C13, C1, etc.)
  • Variables like TransactionHour (engineered from the timestamp) also appear high in the ranking, confirming the relevance of temporal patterns.
  • These insights can guide:
    • Feature selection (e.g., reducing dimensionality)
    • Feature engineering (e.g., combinations or interactions)
    • Model interpretation and post-hoc analysis (e.g., SHAP, PDP)

5.9 Cross-feature exploratory analysis

To better understand how fraud is distributed across categories, we compute the fraud rate for selected categorical variables. This highlights which categories are more likely to be associated with fraudulent behavior.

# Helper function to plot fraud rates for a categorical variable
plot_fraud_rate <- function(data, var) {
  data %>%
    group_by(.data[[var]]) %>%
    summarise(
      Count = n(),
      FraudRate = mean(isFraud, na.rm = TRUE)
    ) %>%
    filter(Count > 200) %>%  # Optional: filter rare categories
    ggplot(aes(x = reorder(.data[[var]], -FraudRate), y = FraudRate)) +
    geom_col(fill = "steelblue") +
    labs(
      title = paste("Fraud Rate by", var),
      x = var,
      y = "Fraud Rate"
    ) +
    theme_minimal()
}

5.9.1 Fraud Rate by ProductCD

# Plot for ProductCD
plot_fraud_rate(train_data, "ProductCD")

Observations:

  • Category C has the highest fraud rate (nearly 12%), while W has the lowest.
  • This suggests that transactions involving product type C should be treated with increased scrutiny during model training or rule design.

5.9.2 Fraud Rate by card4

# Plot for card4
plot_fraud_rate(train_data, "card4")

Observations:

  • Discover cards exhibit the highest fraud rate (around 8%), compared to others like Visa and Mastercard.
  • This pattern may reflect underlying behavioral or systemic differences in transactions processed by these networks.

5.9.3 Fraud Rate by card6

# Plot for card6
plot_fraud_rate(train_data, "card6")

Observations:

  • Credit cards show a significantly higher fraud rate than debit cards.
  • This might relate to the differing risk profiles or spending habits associated with each type.

5.9.4 Summary

This cross-feature fraud rate analysis reinforces that some categorical variables are informative for fraud prediction. Product type (ProductCD), card provider (card4), and card type (card6) show meaningful variation in fraud rates across categories. These features may be especially valuable for downstream modeling, and further interaction effects could be explored.
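As a sketch of what such an interaction check could look like (the ProductCD × card6 combination is chosen arbitrarily for illustration):

# Fraud rate across combinations of ProductCD and card6
train_data %>%
  group_by(ProductCD, card6) %>%
  summarise(Count = n(), FraudRate = mean(isFraud), .groups = "drop") %>%
  filter(Count > 200) %>%
  arrange(desc(FraudRate))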

5.10 Dimensionality Reduction with PCA

Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of high-dimensional data by transforming correlated variables into a smaller number of uncorrelated variables called principal components. Although PCA is not ideal for interpretability, it is useful for visualizing potential separation between fraudulent and non-fraudulent transactions.

5.10.1 Data Preparation for PCA

We use the cleaned numerical dataset from the correlation analysis (train_data_num) and apply z-score standardization (mean = 0, std. dev. = 1) to each variable before computing the principal components.

# Ensure all numeric values are scaled
numeric_scaled <- train_data_num %>%
  select(-isFraud) %>%
  mutate_all(~ scale(.) %>% as.vector())

# Attach target back
pca_data <- cbind(numeric_scaled, isFraud = train_data_num$isFraud)

Note:

  • Out of the 177 numeric features available (excluding isFraud), 176 were used in PCA.
  • The variable V305 was excluded because it had zero variance (i.e., a constant value across samples). Variables with zero variance do not contribute to PCA and must be removed.

5.10.2 PCA Execution

To make PCA computationally tractable, we randomly sample 10,000 observations from the dataset and remove variables with zero variance.

# Step 1: sample 10,000 rows (with isFraud)
set.seed(123)
pca_sample <- train_data_num %>%
  sample_n(10000)

# Step 2: drop isFraud and remove constant columns (zero variance)
scaled_sample <- pca_sample %>%
  select(-isFraud) %>%
  select(where(~ var(.x, na.rm = TRUE) > 0)) %>%
  mutate_all(~ scale(.) %>% as.vector())

# Step 3: apply PCA
pca_result <- prcomp(scaled_sample, center = TRUE, scale. = TRUE)

# Step 4: create plot data
pca_df <- as.data.frame(pca_result$x[, 1:2])
pca_df$isFraud <- factor(pca_sample$isFraud)

# Step 5: plot
ggplot(pca_df, aes(x = PC1, y = PC2, color = isFraud)) +
  geom_point(alpha = 0.5, size = 1) +
  labs(
    title = "PCA (Sample of 10K)",
    x = "PC1", y = "PC2",
    color = "Is Fraud"
  ) +
  theme_minimal()

Interpretation:

  • Each point represents a transaction, projected into a 2D space formed by the two directions of greatest variance in the dataset: PC1 and PC2.
  • Transactions are not clearly separated by fraud status, confirming that fraud patterns are subtle and not captured by linear combinations alone.
  • Some outliers lie far along the principal axes, suggesting transactions that deviate significantly from typical behavior. These may be interesting candidates for anomaly detection.
  • Most fraudulent transactions are embedded within the large cloud of normal transactions, again reinforcing the challenge of detecting fraud via simple linear boundaries.

5.10.3 Variance Explained by Principal Components

We now assess how much variance each principal component captures.

# Show variance explained by each component
summary(pca_result)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5    PC6     PC7
## Standard deviation     5.1425 4.5066 3.03917 2.88263 2.68403 2.5758 2.36057
## Proportion of Variance 0.1503 0.1154 0.05248 0.04721 0.04093 0.0377 0.03166
## Cumulative Proportion  0.1503 0.2656 0.31813 0.36534 0.40628 0.4440 0.47563
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     2.25449 2.17135 2.07532 1.98786 1.88113 1.83180 1.80247
## Proportion of Variance 0.02888 0.02679 0.02447 0.02245 0.02011 0.01907 0.01846
## Cumulative Proportion  0.50451 0.53130 0.55577 0.57823 0.59833 0.61740 0.63586
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     1.79871 1.70719 1.62377 1.55728 1.54799 1.47625 1.46329
## Proportion of Variance 0.01838 0.01656 0.01498 0.01378 0.01362 0.01238 0.01217
## Cumulative Proportion  0.65424 0.67080 0.68578 0.69956 0.71317 0.72556 0.73772
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     1.36702 1.31024 1.29528 1.24126 1.21929 1.19460 1.16834
## Proportion of Variance 0.01062 0.00975 0.00953 0.00875 0.00845 0.00811 0.00776
## Cumulative Proportion  0.74834 0.75809 0.76763 0.77638 0.78483 0.79294 0.80069
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     1.11454 1.05855 1.03422 1.01995 1.01169 1.00100 0.98569
## Proportion of Variance 0.00706 0.00637 0.00608 0.00591 0.00582 0.00569 0.00552
## Cumulative Proportion  0.80775 0.81412 0.82019 0.82610 0.83192 0.83761 0.84313
##                           PC36    PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.97888 0.96356 0.95498 0.94370 0.93291 0.92044 0.90789
## Proportion of Variance 0.00544 0.00528 0.00518 0.00506 0.00495 0.00481 0.00468
## Cumulative Proportion  0.84858 0.85385 0.85903 0.86409 0.86904 0.87385 0.87854
##                           PC43    PC44    PC45   PC46    PC47    PC48    PC49
## Standard deviation     0.88397 0.86265 0.83973 0.8181 0.81116 0.78244 0.76745
## Proportion of Variance 0.00444 0.00423 0.00401 0.0038 0.00374 0.00348 0.00335
## Cumulative Proportion  0.88298 0.88720 0.89121 0.8950 0.89875 0.90223 0.90558
##                           PC50    PC51    PC52    PC53   PC54   PC55    PC56
## Standard deviation     0.76407 0.74789 0.74469 0.72143 0.7149 0.7138 0.69530
## Proportion of Variance 0.00332 0.00318 0.00315 0.00296 0.0029 0.0029 0.00275
## Cumulative Proportion  0.90889 0.91207 0.91522 0.91818 0.9211 0.9240 0.92673
##                           PC57    PC58    PC59    PC60    PC61    PC62    PC63
## Standard deviation     0.68798 0.67949 0.67065 0.66653 0.65336 0.64526 0.62831
## Proportion of Variance 0.00269 0.00262 0.00256 0.00252 0.00243 0.00237 0.00224
## Cumulative Proportion  0.92942 0.93204 0.93459 0.93712 0.93954 0.94191 0.94415
##                           PC64    PC65    PC66    PC67    PC68    PC69    PC70
## Standard deviation     0.60621 0.59718 0.58791 0.57647 0.56673 0.56126 0.55216
## Proportion of Variance 0.00209 0.00203 0.00196 0.00189 0.00182 0.00179 0.00173
## Cumulative Proportion  0.94624 0.94827 0.95023 0.95212 0.95394 0.95573 0.95747
##                          PC71    PC72    PC73    PC74   PC75    PC76    PC77
## Standard deviation     0.5308 0.52561 0.50808 0.49850 0.4965 0.49027 0.48475
## Proportion of Variance 0.0016 0.00157 0.00147 0.00141 0.0014 0.00137 0.00134
## Cumulative Proportion  0.9591 0.96064 0.96210 0.96352 0.9649 0.96628 0.96762
##                           PC78    PC79    PC80    PC81    PC82    PC83    PC84
## Standard deviation     0.46959 0.46175 0.45647 0.43407 0.42677 0.42427 0.41541
## Proportion of Variance 0.00125 0.00121 0.00118 0.00107 0.00103 0.00102 0.00098
## Cumulative Proportion  0.96887 0.97008 0.97127 0.97234 0.97337 0.97439 0.97537
##                           PC85    PC86   PC87    PC88    PC89    PC90   PC91
## Standard deviation     0.40688 0.39933 0.3982 0.39196 0.37975 0.37784 0.3744
## Proportion of Variance 0.00094 0.00091 0.0009 0.00087 0.00082 0.00081 0.0008
## Cumulative Proportion  0.97631 0.97722 0.9781 0.97899 0.97981 0.98063 0.9814
##                           PC92    PC93    PC94   PC95    PC96    PC97    PC98
## Standard deviation     0.36478 0.35869 0.35517 0.3501 0.34361 0.34269 0.33472
## Proportion of Variance 0.00076 0.00073 0.00072 0.0007 0.00067 0.00067 0.00064
## Cumulative Proportion  0.98218 0.98291 0.98363 0.9843 0.98499 0.98566 0.98630
##                           PC99   PC100  PC101   PC102   PC103   PC104   PC105
## Standard deviation     0.33052 0.32822 0.3255 0.32324 0.31825 0.30783 0.30704
## Proportion of Variance 0.00062 0.00061 0.0006 0.00059 0.00058 0.00054 0.00054
## Cumulative Proportion  0.98692 0.98753 0.9881 0.98872 0.98930 0.98984 0.99037
##                          PC106   PC107   PC108   PC109   PC110   PC111  PC112
## Standard deviation     0.29942 0.29454 0.29009 0.28739 0.28617 0.27207 0.2640
## Proportion of Variance 0.00051 0.00049 0.00048 0.00047 0.00047 0.00042 0.0004
## Cumulative Proportion  0.99088 0.99138 0.99185 0.99232 0.99279 0.99321 0.9936
##                          PC113   PC114   PC115   PC116   PC117   PC118  PC119
## Standard deviation     0.26155 0.24839 0.24188 0.24119 0.23946 0.23507 0.2309
## Proportion of Variance 0.00039 0.00035 0.00033 0.00033 0.00033 0.00031 0.0003
## Cumulative Proportion  0.99399 0.99435 0.99468 0.99501 0.99533 0.99565 0.9960
##                          PC120   PC121   PC122   PC123   PC124   PC125  PC126
## Standard deviation     0.21818 0.21165 0.20692 0.20400 0.19685 0.19099 0.1868
## Proportion of Variance 0.00027 0.00025 0.00024 0.00024 0.00022 0.00021 0.0002
## Cumulative Proportion  0.99622 0.99648 0.99672 0.99696 0.99718 0.99738 0.9976
##                          PC127   PC128   PC129   PC130   PC131   PC132   PC133
## Standard deviation     0.18423 0.17031 0.16841 0.16066 0.15537 0.15339 0.14808
## Proportion of Variance 0.00019 0.00016 0.00016 0.00015 0.00014 0.00013 0.00012
## Cumulative Proportion  0.99777 0.99794 0.99810 0.99825 0.99838 0.99852 0.99864
##                          PC134   PC135   PC136  PC137  PC138   PC139   PC140
## Standard deviation     0.14503 0.14010 0.13706 0.1319 0.1295 0.12376 0.12038
## Proportion of Variance 0.00012 0.00011 0.00011 0.0001 0.0001 0.00009 0.00008
## Cumulative Proportion  0.99876 0.99887 0.99898 0.9991 0.9992 0.99926 0.99934
##                          PC141   PC142   PC143   PC144   PC145   PC146   PC147
## Standard deviation     0.11364 0.10747 0.10728 0.10513 0.10323 0.09613 0.09512
## Proportion of Variance 0.00007 0.00007 0.00007 0.00006 0.00006 0.00005 0.00005
## Cumulative Proportion  0.99942 0.99948 0.99955 0.99961 0.99967 0.99972 0.99978
##                          PC148   PC149   PC150   PC151   PC152   PC153   PC154
## Standard deviation     0.07832 0.07094 0.06686 0.06293 0.06216 0.05921 0.05172
## Proportion of Variance 0.00003 0.00003 0.00003 0.00002 0.00002 0.00002 0.00002
## Cumulative Proportion  0.99981 0.99984 0.99986 0.99989 0.99991 0.99993 0.99994
##                          PC155   PC156   PC157   PC158   PC159   PC160   PC161
## Standard deviation     0.03950 0.03821 0.03756 0.03705 0.03368 0.03073 0.02888
## Proportion of Variance 0.00001 0.00001 0.00001 0.00001 0.00001 0.00001 0.00000
## Cumulative Proportion  0.99995 0.99996 0.99997 0.99998 0.99998 0.99999 0.99999
##                          PC162   PC163   PC164    PC165    PC166    PC167
## Standard deviation     0.02304 0.01983 0.01272 0.007694 0.004664 0.003767
## Proportion of Variance 0.00000 0.00000 0.00000 0.000000 0.000000 0.000000
## Cumulative Proportion  1.00000 1.00000 1.00000 1.000000 1.000000 1.000000
##                           PC168    PC169     PC170     PC171    PC172     PC173
## Standard deviation     0.001825 0.001339 4.816e-09 2.151e-09 1.11e-09 8.273e-10
## Proportion of Variance 0.000000 0.000000 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion  1.000000 1.000000 1.000e+00 1.000e+00 1.00e+00 1.000e+00
##                           PC174     PC175 PC176
## Standard deviation     5.51e-15 3.928e-15 2e-15
## Proportion of Variance 0.00e+00 0.000e+00 0e+00
## Cumulative Proportion  1.00e+00 1.000e+00 1e+00

We can also visualize this more clearly:

# Variance proportion
pca_var <- summary(pca_result)$importance[2, 1:10]  # Proportion of variance

# Barplot
barplot(pca_var,
        names.arg = paste0("PC", 1:10),
        main = "Variance Explained by Top 10 Principal Components",
        xlab = "Principal Component",
        ylab = "Proportion of Variance",
        col = "steelblue")

Interpretation:

  • PC1 captures about 15% of the total variance, followed by PC2 with around 11.5%.
  • The first 2 components together explain approximately 27% of the total variability.
  • Additional components (PC3 to PC10) each add between roughly 2% and 5%, showing that the variance is moderately distributed across many components.
  • This implies that the data’s structure is complex and not dominated by a single axis of variation.

Conclusion:

PCA confirms that:

  • Fraudulent transactions do not linearly separate from non-fraudulent ones in the first two components.
  • However, PCA remains valuable for:
    • Anomaly detection and visual exploration,
    • Dimensionality reduction before clustering or linear modeling,
    • Reducing noise and multicollinearity in downstream tasks.

If used for modeling, a larger number of principal components should be retained to preserve enough information. Future steps could include training models on the top k components or integrating PCA into a pipeline with clustering techniques.
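A minimal sketch of how the top k components could be extracted as model inputs (k = 30 is an arbitrary assumption):

# Sketch: keep the first k principal components as features
k <- 30
pca_features <- as.data.frame(pca_result$x[, 1:k])
pca_features$isFraud <- factor(pca_sample$isFraud)
dim(pca_features)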

5.11 Missing Data Analysis (NA Patterns)

While we’ve previously reviewed missing value percentages across variables, visualizing their distribution across observations can reveal structured patterns—such as whether certain blocks of features tend to be missing together, or whether missingness is concentrated in specific transaction types.

To do this, we generate a missing data heatmap of the top 50 variables with the highest proportion of missing values. Due to computational limits, we sample 5,000 rows from the full dataset.

# Step 1: select top 50 variables with highest NA percentage
na_top_vars <- train_data %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "na_rate") %>%
  arrange(desc(na_rate)) %>%
  slice(1:50) %>%
  pull(variable)

Now we have selected the top 50 variables with the highest missingness. Next, we will visualize the missing data patterns.

# step 2: Sample for plotting
set.seed(123)
train_data_na_subset <- train_data %>%
  select(all_of(na_top_vars)) %>%
  slice_sample(n = 5000)

# step3: Plot heatmap with rotated labels
vis_miss(train_data_na_subset, sort_miss = TRUE) +
  labs(title = "Missing Data Heatmap - Top 50 Variables (Sample of 5,000 Rows)") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, size = 7),
    plot.title = element_text(size = 14, face = "bold")
  )

Interpretation:

  • The majority of the top 50 variables exhibit high missingness, often above 85%, with some exceeding 95%.
  • There is clear horizontal banding, which suggests that missingness is not completely random—it may be associated with particular transaction types or sources.
  • Blocks of variables (e.g., many id_ or V features) appear to be missing together, possibly reflecting system-level logging gaps or different data collection pipelines.

Conclusion:

This visual supports the previous numeric summaries and reinforces the need for:

  • Careful imputation strategies or selective removal of high-missingness features.
  • Exploring whether missingness itself is informative for predicting fraud (e.g., through missingness indicators).
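As a sketch of the missingness-indicator idea (the two columns chosen below are illustrative, not a recommendation):

# Fraud rate by missingness pattern of two high-NA features
train_data %>%
  transmute(
    isFraud,
    id_24_missing = is.na(id_24),
    dist2_missing = is.na(dist2)
  ) %>%
  group_by(id_24_missing, dist2_missing) %>%
  summarise(Count = n(), FraudRate = mean(isFraud), .groups = "drop")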

6. Summary and Next Steps

This EDA provided deep insights into the structure, quality, and behavioral patterns of the IEEE-CIS Fraud Detection dataset. By analyzing categorical, temporal, and continuous features—as well as missing data and dimensional structure—we established a comprehensive foundation for building effective fraud detection models.

Key Findings:

  • The target variable isFraud is highly imbalanced (~3.5% fraud), which must be addressed during modeling.
  • Many features, especially in the id_ and D families, show very high missingness (often above 85%), and the missingness appears structured rather than random.
  • No single numeric feature is strongly linearly correlated with isFraud, and PCA shows no clean linear separation, suggesting that nonlinear models and feature interactions will be needed.
  • Several variable groups (e.g., C and V features) are highly redundant, motivating feature selection or dimensionality reduction.
  • Categorical and temporal features (ProductCD, card4, card6, DeviceType, transaction hour) show meaningful differences in fraud rates, and a baseline Random Forest highlights TransactionDT, card identifiers, addr1, and TransactionAmt as most important.

Next steps:

  1. Feature Engineering
  2. Handle Missing Data
  3. Baseline Modeling
  4. Model Evaluation
  5. Pipeline Integration