This notebook presents an Exploratory Data Analysis (EDA) of the IEEE-CIS Fraud Detection dataset, originally provided by Vesta Corporation for a Kaggle competition. The dataset contains anonymized transactional and identity information with the aim of building models to detect fraudulent transactions.
The main goals of this EDA are:
- Understand the structure of the data and the extent of missing values.
- Assess the class imbalance of the target variable isFraud.
- Explore interpretable categorical, temporal, and continuous features.
- Identify correlations, redundancy, and preliminary feature importance to guide preprocessing and modeling.
We will need the following libraries for data manipulation and visualization.
# Core libraries for data analysis and visualization
library(tidyverse) # Includes dplyr, ggplot2, tibble, readr, etc.
library(data.table) # Efficient data loading and manipulation for large datasets
# Additional visualization tools
library(gridExtra) # Arranging multiple ggplot2 plots side by side
library(skimr) # Quick overview of data frames, including missing values and distributions
library(ggcorrplot) # Correlation matrix visualization
library(ranger) # Fast implementation of random forests
library(naniar) # Tools for visualizing and handling missing data
The dataset is composed of two main training files:
- train_transaction.csv: contains transactional data for online purchases, including variables such as transaction amount, product codes, card and address information, and whether the transaction was flagged as fraudulent (isFraud).
- train_identity.csv: contains identity-related features (e.g., device info, IP address, browser, email domain) for a subset of transactions. Only ~24% of the transaction records have corresponding identity information.
To perform a comprehensive analysis, we merged both datasets on the common key TransactionID. We applied a left join (all.x = TRUE), so all rows from train_transaction are preserved even when no identity information is available. This approach ensures that no transactions are lost and that identity features simply appear as NA where they were not recorded.
# Load training datasets
train_transaction <- fread("../data/ieee-fraud-detection/train_transaction.csv")
train_identity <- fread("../data/ieee-fraud-detection/train_identity.csv")
# Merge datasets by TransactionID
train_data <- merge(train_transaction, train_identity, by = "TransactionID", all.x = TRUE)
# Check merged dimensions
dim(train_data)
## [1] 590540 434
train_transaction contains 590,540 rows and 394 columns, while train_identity has 144,233 rows and 41 columns. This results in a merged dataset named train_data with 590,540 rows (the same as train_transaction) and 434 columns.
Note: Although train_transaction and train_identity contain 394 and 41 columns respectively, they both include the TransactionID column. During the merge, this common column is used as the key and is not duplicated, resulting in a final dataset with 434 unique columns.
Due to the high dimensionality and anonymized nature of the dataset
(434 features), we selectively explore only a subset of features that
are either commonly interpretable (e.g., TransactionAmt,
DeviceType, Email Domains) or structurally
relevant (e.g., C1-C2, D1-D15). Exploring all
features manually is not scalable, and later stages of this project will
incorporate variable importance and dimensionality reduction techniques
(e.g., Random Forest, PCA) to guide modeling.
Let’s analyze the dataset to understand its structure, missing values, and class balance.
To assess the quality of the data, we calculated the percentage of
missing values in each column using the
summarise_all(~ mean(is.na(.))) function chain. This
provides a quick overview of how much information is missing in each
feature.
# Compute missing value percentage per column
missing_data <- train_data %>%
summarise_all(~ mean(is.na(.))) %>% # Calculate the percentage of NAs per column.
pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_perc") %>% # Converts the result to a long format (variable, missing_perc) for analysis and visualization.
arrange(desc(missing_perc)) # Order the variables from highest to lowest number of missing values.
# View top 20 variables with the highest percentage of missing values
head(missing_data, 20)
## # A tibble: 20 × 2
## variable missing_perc
## <chr> <dbl>
## 1 id_24 0.992
## 2 id_25 0.991
## 3 id_07 0.991
## 4 id_08 0.991
## 5 id_21 0.991
## 6 id_26 0.991
## 7 id_22 0.991
## 8 dist2 0.936
## 9 D7 0.934
## 10 id_18 0.924
## 11 D13 0.895
## 12 D14 0.895
## 13 D12 0.890
## 14 id_03 0.888
## 15 id_04 0.888
## 16 D6 0.876
## 17 D8 0.873
## 18 D9 0.873
## 19 id_09 0.873
## 20 id_10 0.873
From the results above, we observe that several variables (especially id_24, id_25, id_07, id_08, and many features from the id_ and D families) have over 85–99% missing values. This is critical because:
- Features that are almost entirely missing carry very little usable information on their own.
- Naive imputation of such features could introduce more noise than signal.
- They are strong candidates for removal, or for replacement with simple missingness indicators.
This information will be key during the feature selection and preprocessing phases.
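As a concrete example of how this table could feed preprocessing, the sketch below drops columns whose missingness exceeds a threshold; the 90% cutoff is an illustrative assumption, not a decision taken in this analysis.
# Sketch: drop columns with more than 90% missing values (cutoff is illustrative)
na_threshold <- 0.90
cols_to_drop <- missing_data %>%
  filter(missing_perc > na_threshold) %>%
  pull(variable)
# Keep only the lower-missingness columns
train_data_reduced <- train_data %>%
  select(-all_of(cols_to_drop))
dim(train_data_reduced)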
This dataset is highly imbalanced, with a strong predominance of non-fraudulent transactions. Understanding this imbalance is crucial before training machine learning models, as it can bias the models toward the majority class.
# Distribution of target variable
train_data %>%
count(isFraud) %>%
mutate(percentage = n / sum(n))
## Key: <TransactionID>
## isFraud n percentage
## <int> <int> <num>
## 1: 0 569877 0.96500999
## 2: 1 20663 0.03499001
Observation:
The target variable isFraud is heavily imbalanced: about 96.5% of transactions are legitimate (isFraud = 0) and only about 3.5% are fraudulent (isFraud = 1).
This class imbalance can lead machine learning models to be biased toward predicting the majority class. To address this, we will consider appropriate strategies such as resampling (undersampling the majority class or oversampling the minority class), class weighting, and evaluation metrics that are robust to imbalance (e.g., ROC AUC or precision-recall AUC) rather than plain accuracy. One such strategy is sketched below.
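As one illustration, the sketch below randomly downsamples the majority class; the 5:1 ratio is an arbitrary assumption for demonstration, not a modeling decision made here.
# Sketch: random downsampling of the majority class (illustrative 5:1 ratio)
set.seed(123)
fraud_cases <- train_data %>% filter(isFraud == 1)
nonfraud_sample <- train_data %>%
  filter(isFraud == 0) %>%
  slice_sample(n = 5 * nrow(fraud_cases))
train_balanced <- bind_rows(fraud_cases, nonfraud_sample)
# Check the new class balance
table(train_balanced$isFraud)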
ggplot(train_data, aes(x = TransactionAmt, fill = factor(isFraud))) +
geom_histogram(bins = 100, position = "identity", alpha = 0.6) +
scale_x_log10() +
labs(
title = "Transaction Amount Distribution (Log Scale)",
x = "Transaction Amount (log scale)",
y = "Count",
fill = "Is Fraud"
)
Observation:
# Compute top 10 most common device types
train_data %>%
count(DeviceInfo) %>% # Count occurrences of each device type
slice_max(n, n = 10) %>% # Select top 10 by count (modern dplyr alternative to top_n)
ggplot(aes(x = reorder(DeviceInfo, n), y = n)) + # Reorder device names by frequency
geom_col(fill = "steelblue") + # Horizontal bar chart
coord_flip() + # Flip coordinates for readability
labs(
title = "Top 10 Device Info",
x = "Device Info",
y = "Count"
)
This plot displays the 10 most frequent DeviceInfo values in the training dataset. Notably, over 400,000 entries are missing this value, labeled as NA. Windows and iOS devices are the most common among non-missing values. This suggests the need for careful handling of this feature, potentially through imputation or categorical encoding.
Observation on Device Info:
The bar chart shows the top 10 most frequent values of the
DeviceInfo column. The most common entry is clearly
"NA", which represents missing values. However, we also
observe an unlabeled bar just below "Windows", indicating
the presence of empty strings ("").
This distinction is important:
- NA represents missing data in R.
- "" represents a present but empty string, often due to incomplete logging or device masking.
These two cases should ideally be treated consistently, so we may consider replacing empty strings with NA before modeling or imputation (see the sketch below).
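A minimal sketch of that cleanup, applied here only to DeviceInfo, could look like this:
# Sketch: treat empty strings in DeviceInfo as missing values
train_data <- train_data %>%
  mutate(DeviceInfo = na_if(DeviceInfo, ""))
# The same idea can be extended to all character columns:
# mutate(across(where(is.character), ~ na_if(.x, "")))
sum(is.na(train_data$DeviceInfo))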
We start this section by converting the TransactionDT
column to a more interpretable date-time format. The dataset begins at a
reference point, which we define as “2017-12-01”. This allows us to
derive meaningful temporal features such as the hour of the transaction
and the day of the week.
# Convert TransactionDT to hours and days
# Note: the dataset starts at a reference point, we define a fake 'origin'
origin_time <- as.POSIXct("2017-12-01", tz = "UTC")
train_data$TransactionDate <- origin_time + train_data$TransactionDT
train_data$TransactionHour <- lubridate::hour(train_data$TransactionDate)
train_data$TransactionDay <- lubridate::wday(train_data$TransactionDate, label = TRUE)
The next step is to visualize the distribution of transactions by hour and day of the week. This helps us understand whether there are specific times or days when fraudulent transactions are more likely to occur.
# Plot fraud rate by hour
ggplot(train_data, aes(x = TransactionHour, fill = factor(isFraud))) +
geom_histogram(binwidth = 1, position = "fill") +
labs(
title = "Fraud Ratio by Hour of Day",
x = "Hour of Day",
y = "Proportion",
fill = "Is Fraud"
) +
scale_x_continuous(breaks = 0:23)
Observation:
This plot shows the proportion of fraudulent vs. non-fraudulent transactions across different hours of the day. The fraud ratio is relatively higher during the early hours of the day.
This information could be useful to:
- Engineer features like IsMorning, IsNight, etc. (see the sketch below).
- Apply time-aware models or rules.
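For example, a simple indicator could be derived from TransactionHour as in the sketch below; the 2–7 a.m. "night" window is a hypothetical choice.
# Sketch: hypothetical time-of-day indicator (night window chosen arbitrarily)
train_data <- train_data %>%
  mutate(IsNight = as.integer(TransactionHour >= 2 & TransactionHour <= 7))
# Compare fraud rates inside vs. outside the night window
train_data %>%
  group_by(IsNight) %>%
  summarise(n = n(), fraud_rate = mean(isFraud))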
ggplot(train_data, aes(x = TransactionDay, fill = factor(isFraud))) +
geom_bar(position = "fill") +
labs(
title = "Fraud Ratio by Day of Week",
x = "Day of Week",
y = "Proportion",
fill = "Is Fraud"
)
Observation:
This bar plot shows the proportion of fraudulent transactions by day of the week. The fraud ratio is fairly similar across days, so TransactionDay may have limited predictive power on its own, but it could still be useful when combined with other temporal or categorical features.
We will explore some categorical features that may provide insights into the fraud detection process. The following features are of particular interest:
- ProductCD: product code associated with the transaction.
- card4: card network (e.g., Visa, Mastercard).
- card6: card type (e.g., credit, debit).
- P_emaildomain: email domain of the payer.
- R_emaildomain: email domain of the recipient.
- DeviceType: type of device used for the transaction.
We analyze the fraud proportions across these categories using proportion-based bar plots (position = "fill" in ggplot2) to better visualize class imbalance across each level.
# List of selected categorical features to analyze
cat_vars <- c("ProductCD", "card4", "card6", "P_emaildomain", "R_emaildomain", "DeviceType")
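Each plot below is built the same way, so they could also be generated programmatically; the helper below is a sketch of that approach rather than the code used for the figures that follow.
# Sketch: generic fraud-proportion bar plot for any categorical variable
plot_fraud_proportion <- function(data, var) {
  ggplot(data, aes(x = .data[[var]], fill = factor(isFraud))) +
    geom_bar(position = "fill") +
    labs(
      title = paste("Fraud Proportion by", var),
      x = var,
      y = "Proportion",
      fill = "Is Fraud"
    )
}
# Example usage: plot_fraud_proportion(train_data, "ProductCD")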
ProductCD
ggplot(train_data, aes(x = ProductCD, fill = factor(isFraud))) +
geom_bar(position = "fill") +
labs(
title = "Fraud Proportion by ProductCD",
x = "ProductCD",
y = "Proportion",
fill = "Is Fraud"
)
Observation:
- Product code C has a noticeably higher fraud rate compared to the others.
- This may suggest that some product categories are more prone to fraud, perhaps due to their price range, popularity, or nature of transactions.
card4
ggplot(train_data, aes(x = card4, fill = factor(isFraud))) +
geom_bar(position = "fill") +
labs(
title = "Fraud Proportion by Card Type (card4)",
x = "Card Type",
y = "Proportion",
fill = "Is Fraud"
)
Observation:
- Among card types, Discover shows the highest fraud proportion.
- Visa and Mastercard dominate in volume, but their fraud ratio is relatively lower.
card6
ggplot(train_data, aes(x = card6, fill = factor(isFraud))) +
geom_bar(position = "fill") +
labs(
title = "Fraud Proportion by Card Type (card6)",
x = "Card Type (card6)",
y = "Proportion",
fill = "Is Fraud"
)
Observation:
- Fraud is more frequent for credit cards compared to debit or charge cards.
- Transactions labeled as "debit or credit" are rare and also show lower fraud prevalence; this may be due to the redundancy of that category, since separate credit and debit options already exist.
P_emaildomain
# Top 10 most common payer email domains
train_data %>%
count(P_emaildomain) %>%
slice_max(n, n = 10) %>%
pull(P_emaildomain) -> top_p_domains
ggplot(train_data %>% filter(P_emaildomain %in% top_p_domains),
aes(x = P_emaildomain, fill = factor(isFraud))) +
geom_bar(position = "fill") +
labs(
title = "Fraud Proportion by Payer Email Domain",
x = "Payer Email Domain (Top 10)",
y = "Proportion",
fill = "Is Fraud"
) +
coord_flip()
Observation:
- The most common payer domains (e.g., outlook.com, hotmail.com, gmail.com, yahoo.com) show relatively low fraud rates.
- Among them, outlook.com and hotmail.com show a slightly higher fraud ratio than gmail.com or icloud.com.
- Domains like anonymous.com or rare email providers may require additional attention.
R_emaildomain
# Top 10 most common recipient email domains
train_data %>%
count(R_emaildomain) %>%
slice_max(n, n = 10) %>%
pull(R_emaildomain) -> top_r_domains
ggplot(train_data %>% filter(R_emaildomain %in% top_r_domains),
aes(x = R_emaildomain, fill = factor(isFraud))) +
geom_bar(position = "fill") +
labs(
title = "Fraud Proportion by Recipient Email Domain",
x = "Recipient Email Domain (Top 10)",
y = "Proportion",
fill = "Is Fraud"
) +
coord_flip()
Observation:
- Recipient email domains exhibit wider variation.
- Notably, domains such as outlook.com, icloud.com, and gmail.com have higher fraud proportions, suggesting possible use in fraudulent operations.
DeviceType
ggplot(train_data, aes(x = DeviceType, fill = factor(isFraud))) +
geom_bar(position = "fill") +
labs(
title = "Fraud Proportion by Device Type",
x = "Device Type",
y = "Proportion",
fill = "Is Fraud"
)
Observation:
- Mobile transactions exhibit a slightly higher proportion of fraud compared to desktop.
- This aligns with common patterns of fraud exploiting less secure or unsupervised mobile environments.
The dataset includes a family of anonymized continuous variables
labeled C1 through C14. These features likely
capture aggregated behavioral metrics or historical transaction
patterns.
We will begin by visualizing C1 and C2 in
detail, and then generalize the analysis to the entire set of C
variables.
# Gather variables into long format for facetting
train_data %>%
select(C1, C2, isFraud) %>%
pivot_longer(cols = c(C1, C2), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value, fill = factor(isFraud))) +
geom_histogram(bins = 100, position = "identity", alpha = 0.5) +
facet_wrap(~ Variable, scales = "free") +
scale_x_continuous(trans = "log1p") +
labs(
title = "Distribution of C1 and C2 by Fraud Status",
x = "Value (log scale)",
y = "Count",
fill = "Is Fraud"
)
Observation:
- The distributions are highly skewed to the right.
- Fraudulent and non-fraudulent transactions overlap heavily in this space, though fine-grained differences might still be exploitable with modeling.
- A log transformation improves visualization but still shows long-tailed distributions.
To reduce noise from extreme outliers and improve interpretability, we apply filtering below the 99th percentile.
# Calculating 99th percentiles for C1 and C2
c1_99 <- quantile(train_data$C1, 0.99, na.rm = TRUE)
c2_99 <- quantile(train_data$C2, 0.99, na.rm = TRUE)
# Filtering and plotting
train_data %>%
filter(C1 < c1_99, C2 < c2_99) %>%
select(C1, C2, isFraud) %>%
pivot_longer(cols = c(C1, C2), names_to = "Variable", values_to = "Value") %>%
ggplot(aes(x = Value, fill = factor(isFraud))) +
geom_histogram(bins = 60, position = "identity", alpha = 0.5) +
facet_wrap(~ Variable, scales = "free") +
labs(
title = "Distribution of C1 and C2 by Fraud Status (Filtered < 99th Percentile)",
x = "Value",
y = "Count",
fill = "Is Fraud"
)
Note: While we remove outliers for visualization clarity, they are retained for modeling, as they may encode meaningful anomalies.
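If those extreme values do carry signal, one option (a sketch only, not a step applied in this analysis) is to encode them explicitly as indicator features using the 99th-percentile cutoffs computed above:
# Sketch: hypothetical flags for extreme C1/C2 values (reusing c1_99 and c2_99)
train_data <- train_data %>%
  mutate(
    C1_extreme = as.integer(C1 >= c1_99),
    C2_extreme = as.integer(C2 >= c2_99)
  )
# Fraud rate among flagged vs. non-flagged transactions
train_data %>%
  group_by(C1_extreme) %>%
  summarise(n = n(), fraud_rate = mean(isFraud))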
Now, we extend the distribution analysis to the full set of C variables.
# Identify C variables
c_vars <- names(train_data)[grepl("^C\\d+$", names(train_data))]
# Convert to long format
c_long <- train_data %>%
select(isFraud, all_of(c_vars)) %>%
pivot_longer(cols = all_of(c_vars), names_to = "Variable", values_to = "Value")
# Calculate thresholds by variable
thresholds_c <- c_long %>%
group_by(Variable) %>%
summarise(threshold = quantile(Value, 0.99, na.rm = TRUE), .groups = "drop")
# Join and filter
c_long <- c_long %>%
left_join(thresholds_c, by = "Variable") %>%
filter(is.na(Value) | Value < threshold)
# Plot
ggplot(c_long, aes(x = Value, fill = factor(isFraud))) +
geom_histogram(bins = 60, position = "identity", alpha = 0.5) +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
labs(
title = "Distribution of C1–C14 Variables (Filtered < 99th Percentile)",
x = "Value", y = "Count", fill = "Is Fraud"
) +
theme_minimal()
plot_c_block <- function(vars_subset) {
c_long_block <- c_long %>%
filter(Variable %in% vars_subset) %>%
mutate(Variable = factor(Variable, levels = vars_subset))
ggplot(c_long_block, aes(x = Value, fill = factor(isFraud))) +
geom_histogram(bins = 60, position = "identity", alpha = 0.5) +
facet_wrap(~ Variable, scales = "free", ncol = 2) +
labs(
title = paste0("Distribution of ", paste(vars_subset, collapse = ", "), " (Filtered < 99th Percentile)"),
x = "Value", y = "Count", fill = "Is Fraud"
) +
theme_minimal()
}
plot_c_block(paste0("C", 1:5))
Note: Variable C3 is not shown because it is essentially constant (inspecting the data frame, the column is almost entirely zeros), which offers no variation for distribution plots or predictive modeling.
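A quick check of the number of distinct values per C variable (sketch below) confirms which columns are effectively constant:
# Sketch: count distinct values per C variable to spot near-constant columns
train_data %>%
  summarise(across(all_of(c_vars), n_distinct)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_distinct") %>%
  arrange(n_distinct)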
plot_c_block(paste0("C", 6:10))
plot_c_block(paste0("C", 11:14))
skim(select(train_data, all_of(c_vars)))
| Name | select(train_data, all_of… |
| Number of rows | 590540 |
| Number of columns | 14 |
| Key | TransactionID |
| _______________________ | |
| Column type frequency: | |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| C1 | 0 | 1 | 14.09 | 133.57 | 0 | 1 | 1 | 3 | 4685 | ▇▁▁▁▁ |
| C2 | 0 | 1 | 15.27 | 154.67 | 0 | 1 | 1 | 3 | 5691 | ▇▁▁▁▁ |
| C3 | 0 | 1 | 0.01 | 0.15 | 0 | 0 | 0 | 0 | 26 | ▇▁▁▁▁ |
| C4 | 0 | 1 | 4.09 | 68.85 | 0 | 0 | 0 | 0 | 2253 | ▇▁▁▁▁ |
| C5 | 0 | 1 | 5.57 | 25.79 | 0 | 0 | 0 | 1 | 349 | ▇▁▁▁▁ |
| C6 | 0 | 1 | 9.07 | 71.51 | 0 | 1 | 1 | 2 | 2253 | ▇▁▁▁▁ |
| C7 | 0 | 1 | 2.85 | 61.73 | 0 | 0 | 0 | 0 | 2255 | ▇▁▁▁▁ |
| C8 | 0 | 1 | 5.14 | 95.38 | 0 | 0 | 0 | 0 | 3331 | ▇▁▁▁▁ |
| C9 | 0 | 1 | 4.48 | 16.67 | 0 | 0 | 1 | 2 | 210 | ▇▁▁▁▁ |
| C10 | 0 | 1 | 5.24 | 95.58 | 0 | 0 | 0 | 0 | 3257 | ▇▁▁▁▁ |
| C11 | 0 | 1 | 10.24 | 94.34 | 0 | 1 | 1 | 2 | 3188 | ▇▁▁▁▁ |
| C12 | 0 | 1 | 4.08 | 86.67 | 0 | 0 | 0 | 0 | 3188 | ▇▁▁▁▁ |
| C13 | 0 | 1 | 32.54 | 129.36 | 0 | 1 | 3 | 12 | 2918 | ▇▁▁▁▁ |
| C14 | 0 | 1 | 8.30 | 49.54 | 0 | 1 | 1 | 2 | 1429 | ▇▁▁▁▁ |
Let’s check which variables start with “D”.
# List all variables starting with "D"
names(train_data)[grepl("^D", names(train_data))]
## [1] "D1" "D2" "D3" "D4" "D5"
## [6] "D6" "D7" "D8" "D9" "D10"
## [11] "D11" "D12" "D13" "D14" "D15"
## [16] "DeviceType" "DeviceInfo"
The D variables (D1–D15) are anonymous temporal or delay-related features; note that the pattern above also matches DeviceType and DeviceInfo, which are excluded from this part of the analysis. Since their meaning is unknown, we group and visualize them in blocks of five, filtering extreme outliers (above the 99th percentile) to highlight more common patterns.
# Step 1: define the D1 to D5 variables
d_vars_1 <- paste0("D", 1:5)
# Step 2: calculate 99th percentile thresholds
d_thresholds_1 <- sapply(d_vars_1, function(var) {
quantile(train_data[[var]], 0.99, na.rm = TRUE)
})
names(d_thresholds_1) <- d_vars_1
# Step 3: create filtered version of the dataset
train_data_d1_d5 <- train_data
for (var in d_vars_1) {
threshold <- d_thresholds_1[[var]]
train_data_d1_d5 <- train_data_d1_d5 %>%
filter(is.na(.data[[var]]) | .data[[var]] < threshold)
}
# Step 4: reshape for plotting
d_long_1 <- train_data_d1_d5 %>%
select(isFraud, all_of(d_vars_1)) %>%
pivot_longer(cols = all_of(d_vars_1), names_to = "variable", values_to = "value") %>%
filter(!is.na(value))
# Optional: enforce variable order for facet
d_long_1$variable <- factor(d_long_1$variable, levels = d_vars_1)
# Step 5: plot
ggplot(d_long_1, aes(x = value, fill = factor(isFraud))) +
geom_histogram(bins = 50, position = "identity", alpha = 0.6) +
facet_wrap(~ variable, scales = "free", ncol = 2) +
labs(
title = "Distribution of D1–D5 Variables (Filtered < 99th Percentile)",
x = "Value",
y = "Count",
fill = "Is Fraud"
) +
theme_minimal()
# Step 1: define the D6 to D10 variables
d_vars_2 <- paste0("D", 6:10)
# Step 2: calculate 99th percentile thresholds
d_thresholds_2 <- sapply(d_vars_2, function(var) {
quantile(train_data[[var]], 0.99, na.rm = TRUE)
})
names(d_thresholds_2) <- d_vars_2
# Step 3: create filtered version of the dataset
train_data_d6_d10 <- train_data
for (var in d_vars_2) {
threshold <- d_thresholds_2[[var]]
train_data_d6_d10 <- train_data_d6_d10 %>%
filter(is.na(.data[[var]]) | .data[[var]] < threshold)
}
# Step 4: reshape for plotting
d_long_2 <- train_data_d6_d10 %>%
select(isFraud, all_of(d_vars_2)) %>%
pivot_longer(cols = all_of(d_vars_2), names_to = "variable", values_to = "value") %>%
filter(!is.na(value))
# Optional: enforce variable order for facet
d_long_2$variable <- factor(d_long_2$variable, levels = d_vars_2)
# Step 5: plot
ggplot(d_long_2, aes(x = value, fill = factor(isFraud))) +
geom_histogram(bins = 50, position = "identity", alpha = 0.6) +
facet_wrap(~ variable, scales = "free", ncol = 2) +
labs(
title = "Distribution of D6–D10 Variables (Filtered < 99th Percentile)",
x = "Value",
y = "Count",
fill = "Is Fraud"
) +
theme_minimal()
# Step 1: define the D11 to D15 variables
d_vars_3 <- paste0("D", 11:15)
# Step 2: calculate 99th percentile thresholds
d_thresholds_3 <- sapply(d_vars_3, function(var) {
quantile(train_data[[var]], 0.99, na.rm = TRUE)
})
names(d_thresholds_3) <- d_vars_3
# Step 3: create filtered version of the dataset
train_data_d11_d15 <- train_data
for (var in d_vars_3) {
threshold <- d_thresholds_3[[var]]
train_data_d11_d15 <- train_data_d11_d15 %>%
filter(is.na(.data[[var]]) | .data[[var]] < threshold)
}
# Step 4: reshape for plotting
d_long_3 <- train_data_d11_d15 %>%
select(isFraud, all_of(d_vars_3)) %>%
pivot_longer(cols = all_of(d_vars_3), names_to = "variable", values_to = "value") %>%
filter(!is.na(value))
# Optional: enforce variable order for facet
d_long_3$variable <- factor(d_long_3$variable, levels = d_vars_3)
# Step 5: plot
ggplot(d_long_3, aes(x = value, fill = factor(isFraud))) +
geom_histogram(bins = 50, position = "identity", alpha = 0.6) +
facet_wrap(~ variable, scales = "free", ncol = 2) +
labs(
title = "Distribution of D11–D15 Variables (Filtered < 99th Percentile)",
x = "Value",
y = "Count",
fill = "Is Fraud"
) +
theme_minimal()
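The three blocks above repeat the same filter–reshape–plot logic; a helper like the following sketch (not the code used above) could factor out that repetition.
# Sketch: generic helper reproducing the filter/reshape/plot steps for a block of D variables
plot_d_block_generic <- function(data, d_vars) {
  # 99th percentile threshold for each variable in the block
  thresholds <- sapply(d_vars, function(var) quantile(data[[var]], 0.99, na.rm = TRUE, names = FALSE))
  # Keep rows that are NA or below the threshold for every variable
  for (var in d_vars) {
    data <- data %>% filter(is.na(.data[[var]]) | .data[[var]] < thresholds[[var]])
  }
  data %>%
    select(isFraud, all_of(d_vars)) %>%
    pivot_longer(cols = all_of(d_vars), names_to = "variable", values_to = "value") %>%
    filter(!is.na(value)) %>%
    mutate(variable = factor(variable, levels = d_vars)) %>%
    ggplot(aes(x = value, fill = factor(isFraud))) +
    geom_histogram(bins = 50, position = "identity", alpha = 0.6) +
    facet_wrap(~ variable, scales = "free", ncol = 2) +
    labs(x = "Value", y = "Count", fill = "Is Fraud") +
    theme_minimal()
}
# Example usage: plot_d_block_generic(train_data, paste0("D", 1:5))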
# Summary statistics for D variables
d_vars <- paste0("D", 1:15)
skim(select(train_data, all_of(d_vars)))
| Name | select(train_data, all_of… |
| Number of rows | 590540 |
| Number of columns | 15 |
| Key | TransactionID |
| _______________________ | |
| Column type frequency: | |
| numeric | 15 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 1269 | 1.00 | 94.35 | 157.66 | 0 | 0.00 | 3.00 | 122.00 | 640.00 | ▇▁▁▁▁ |
| D2 | 280797 | 0.52 | 169.56 | 177.32 | 0 | 26.00 | 97.00 | 276.00 | 640.00 | ▇▂▂▁▁ |
| D3 | 262878 | 0.55 | 28.34 | 62.38 | 0 | 1.00 | 8.00 | 27.00 | 819.00 | ▇▁▁▁▁ |
| D4 | 168922 | 0.71 | 140.00 | 191.10 | -122 | 0.00 | 26.00 | 253.00 | 869.00 | ▇▂▂▂▁ |
| D5 | 309841 | 0.48 | 42.34 | 89.00 | 0 | 1.00 | 10.00 | 32.00 | 819.00 | ▇▁▁▁▁ |
| D6 | 517353 | 0.12 | 69.81 | 143.67 | -83 | 0.00 | 0.00 | 40.00 | 873.00 | ▇▁▁▁▁ |
| D7 | 551623 | 0.07 | 41.64 | 99.74 | 0 | 0.00 | 0.00 | 17.00 | 843.00 | ▇▁▁▁▁ |
| D8 | 515614 | 0.13 | 146.06 | 231.66 | 0 | 0.96 | 37.88 | 187.96 | 1707.79 | ▇▁▁▁▁ |
| D9 | 515614 | 0.13 | 0.56 | 0.32 | 0 | 0.21 | 0.67 | 0.83 | 0.96 | ▆▂▂▇▇ |
| D10 | 76022 | 0.87 | 123.98 | 182.62 | 0 | 0.00 | 15.00 | 197.00 | 876.00 | ▇▁▁▁▁ |
| D11 | 279287 | 0.53 | 146.62 | 186.04 | -53 | 0.00 | 43.00 | 274.00 | 670.00 | ▇▂▂▂▁ |
| D12 | 525823 | 0.11 | 54.04 | 124.27 | -83 | 0.00 | 0.00 | 13.00 | 648.00 | ▇▁▁▁▁ |
| D13 | 528588 | 0.10 | 17.90 | 67.61 | 0 | 0.00 | 0.00 | 0.00 | 847.00 | ▇▁▁▁▁ |
| D14 | 528353 | 0.11 | 57.72 | 136.31 | -193 | 0.00 | 0.00 | 2.00 | 878.00 | ▇▁▁▁▁ |
| D15 | 89113 | 0.85 | 163.74 | 202.73 | -83 | 0.00 | 52.00 | 314.00 | 879.00 | ▇▂▂▂▁ |
Now that we have explored the individual distributions of continuous features, the next step is to evaluate pairwise correlations to detect redundancy and guide feature selection for the baseline model.
In this section, we examine linear relationships between numeric variables to identify:
- pairs of highly correlated (redundant) features, and
- variables with the strongest linear association with the target isFraud.
These insights can inform feature selection, regularization, and dimensionality reduction strategies such as PCA or Lasso.
We restrict the analysis to numeric variables with fewer than 20% missing values to ensure robustness of correlations.
# Identify numeric columns with low NA ratio
# Exclude TransactionID explicitly
numeric_vars <- train_data %>%
select(where(is.numeric)) %>%
select(-TransactionID) %>%
summarise_all(~ mean(is.na(.))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "NA_Percent") %>%
filter(NA_Percent < 0.2) %>%
pull(Variable)
# Filter and clean
train_data_num <- train_data %>%
select(all_of(numeric_vars)) %>%
drop_na()
We compute the Pearson correlation matrix for the cleaned numeric dataset.
# Compute Pearson correlation matrix
cor_matrix <- cor(train_data_num, use = "complete.obs", method = "pearson")
# Quick check
dim(cor_matrix)
## [1] 178 178
head(cor_matrix[, 1:5])
## isFraud TransactionDT TransactionAmt card1
## isFraud 1.000000000 -0.0018940039 0.038327320 0.0019043194
## TransactionDT -0.001894004 1.0000000000 0.006214577 -0.0005050166
## TransactionAmt 0.038327320 0.0062145769 1.000000000 -0.0021376019
## card1 0.001904319 -0.0005050166 -0.002137602 1.0000000000
## card2 0.018776409 -0.0056159505 0.030696590 0.0008656832
## card3 0.035755266 -0.0634543974 -0.026487992 -0.0002535563
## card2
## isFraud 0.0187764094
## TransactionDT -0.0056159505
## TransactionAmt 0.0306965897
## card1 0.0008656832
## card2 1.0000000000
## card3 0.0080829847
This global heatmap offers a high-level view of how numerical variables relate to each other.
# Plot
ggcorrplot(cor_matrix,
hc.order = TRUE, type = "lower", lab = FALSE,
colors = c("steelblue", "white", "darkred"),
title = "Correlation Matrix of Selected Numerical Variables",
ggtheme = theme_minimal())
Note: Due to the high number of features, the full matrix may be hard to interpret directly. The following sections highlight key patterns.
isFraud
We extract the variables most linearly associated with the target.
# Ordering by absolute correlation with isFraud
cor_target <- cor_matrix[, "isFraud"]
cor_target_sorted <- sort(abs(cor_target), decreasing = TRUE)
# View top correlations
head(cor_target_sorted, 20)
## isFraud V283 V281 V282 V292 V315 V289
## 1.00000000 0.14837641 0.13693604 0.11916314 0.09441988 0.08961929 0.08775409
## V291 V312 V288 V313 V90 V29 V74
## 0.08304392 0.08250770 0.08109182 0.08015370 0.07973573 0.07951064 0.07922080
## V69 V56 V314 V284 V70 V91
## 0.07894555 0.07863102 0.07792122 0.07598742 0.07586851 0.07567217
Observation: even the variables most correlated with isFraud (e.g., V283, V281, V282) have absolute Pearson correlations below 0.15, so no single numeric feature is strongly linearly related to the target.
We isolate and visualize the top 15 variables most correlated with the target.
# Obtain the top 15 variables with highest absolute correlation with isFraud (except isFraud itself)
top_vars <- names(cor_target_sorted)[2:16]
# Subset the correlation matrix for these variables
cor_subset <- cor_matrix[top_vars, top_vars]
# Heatmap
ggcorrplot(cor_subset,
hc.order = TRUE, type = "lower", lab = TRUE,
colors = c("steelblue", "white", "darkred"),
title = "Correlation Matrix of Top 15 Variables",
ggtheme = theme_minimal())
Observation:
Pairs such as V282–V283, V313–V315, and V291–V292 show strong mutual correlations, indicating redundancy within these blocks of anonymized V features: they likely encode overlapping information and could be consolidated during feature selection.
To identify highly redundant variables, we extract all variable pairs with correlation above 0.90 (excluding self-correlation).
high_corr_pairs <- which(abs(cor_matrix) > 0.9 & abs(cor_matrix) < 1, arr.ind = TRUE)
# Remove symmetric duplicates
high_corr_df <- as.data.frame(high_corr_pairs)
high_corr_df <- high_corr_df[high_corr_df$row < high_corr_df$col, ]
# Add variable names and correlation values
high_corr_df <- high_corr_df %>%
mutate(var1 = rownames(cor_matrix)[row],
var2 = colnames(cor_matrix)[col],
corr = cor_matrix[cbind(row, col)]) %>%
arrange(desc(abs(corr)))
head(high_corr_df, 10)
## row col var1 var2 corr
## C7.3 16 21 C7 C12 0.9989629
## C8.2 17 19 C8 C10 0.9952333
## C1.2 10 20 C1 C11 0.9944343
## V101.2 98 149 V101 V293 0.9938229
## V95 92 98 V95 V101 0.9922605
## V103.4 100 151 V103 V295 0.9920565
## V279.2 135 149 V279 V293 0.9905360
## V57 54 55 V57 V58 0.9901596
## V95.1 92 135 V95 V279 0.9900867
## C7.2 16 19 C7 C10 0.9881490
Observations
- The variables most correlated with isFraud (e.g., V283, V281, V292, V315, V289) exhibit relatively low Pearson correlation coefficients (mostly under 0.15). This suggests that no single variable is linearly predictive of fraud on its own, reaffirming the complex and subtle nature of fraud detection.
- Other variables (e.g., V69, V90, V291) show some moderate correlation, but these are likely to be more effective when used together or transformed (e.g., via non-linear models or interactions).
- Several feature pairs are almost perfectly correlated, for example C7 and C12, C8 and C10, C1 and C11, V101 and V293, and V95 with V101 and V279. These indicate redundancy or multicollinearity, which could impact model stability and interpretability, particularly for linear models.
- TransactionID was excluded from the analysis, as it is an identifier and does not convey predictive information about the target.
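One way to act on this redundancy later is to drop one variable from each highly correlated pair. The sketch below uses caret::findCorrelation for that purpose; note that caret is an additional dependency not loaded in the setup chunk.
# Sketch: flag candidate variables to drop due to high pairwise correlation
library(caret)  # extra dependency, used only for findCorrelation()
# Exclude the target from the redundancy search
pred_vars <- setdiff(colnames(cor_matrix), "isFraud")
vars_to_drop <- findCorrelation(cor_matrix[pred_vars, pred_vars], cutoff = 0.90, names = TRUE)
# Reduced numeric dataset for later modeling experiments
train_data_num_reduced <- train_data_num %>%
  select(-all_of(vars_to_drop))
length(vars_to_drop)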
Next steps (planning)
- For highly correlated pairs, consider keeping only one variable (or a combined version) to reduce redundancy.
- Groups of anonymized Vxxx or Cxx features can also be explored via unsupervised clustering or feature aggregation.
In this section, we estimate the relative importance of features using a Random Forest classifier. This approach allows us to identify which variables contribute most to the prediction of fraudulent transactions, even without optimizing the model for accuracy yet.
We use the ranger package, a fast implementation of
Random Forests, to compute variable importance based on
Gini impurity reduction.
We reuse the cleaned numeric dataset (train_data_num)
built in the correlation section. Since ranger requires the target
variable to be a factor, we convert isFraud
accordingly.
# Add isFraud as a factor (classification requirement)
train_data_rf <- train_data_num %>%
mutate(isFraud = factor(isFraud)) # Make sure it's a factor for ranking
We train a lightweight model with 100 trees, using default settings
and importance = "impurity" to retrieve Gini-based
feature rankings.
# Training simple model to estimate importance
rf_model <- ranger(
formula = isFraud ~ .,
data = train_data_rf,
importance = "impurity",
num.trees = 100,
seed = 123
)
## Growing trees.. Progress: 29%. Estimated remaining time: 1 minute, 18 seconds.
## Growing trees.. Progress: 59%. Estimated remaining time: 45 seconds.
## Growing trees.. Progress: 89%. Estimated remaining time: 11 seconds.
Note: This model is not meant for final prediction but for preliminary insight into variable contributions.
We extract the feature importances into a tidy
dataframe, sorted in descending order.
# Convert ordered importance to a tidy dataframe
importance_df <- as.data.frame(rf_model$variable.importance) %>%
tibble::rownames_to_column("Variable") %>%
rename(Importance = `rf_model$variable.importance`) %>%
arrange(desc(Importance))
# show top 20 features
head(importance_df, 20)
## Variable Importance
## 1 TransactionDT 606.9360
## 2 card1 582.9908
## 3 addr1 510.8864
## 4 TransactionAmt 507.5807
## 5 card2 491.6512
## 6 D15 398.7465
## 7 D10 342.2649
## 8 C13 339.6097
## 9 D1 336.5991
## 10 TransactionHour 331.4401
## 11 V307 279.7027
## 12 card5 258.1799
## 13 C6 250.6636
## 14 V310 245.7405
## 15 C9 233.9571
## 16 C1 228.9751
## 17 V308 226.7537
## 18 C2 220.3989
## 19 V127 217.4486
## 20 C14 202.2062
The top 20 variables by importance are shown below.
importance_df %>%
slice_max(order_by = Importance, n = 20) %>%
ggplot(aes(x = reorder(Variable, Importance), y = Importance)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Top 20 Feature Importances (Random Forest)",
x = "Variable",
y = "Importance"
) +
theme_minimal()
Observations
- The most important features for distinguishing fraud include:
  - TransactionDT (time-related)
  - card1, card2, card5 (card identifiers)
  - addr1 (billing region)
  - TransactionAmt (amount of the transaction)
  - several anonymized behavioral or delay features (D10, D15, C13, C1, etc.)
- Variables like TransactionHour (engineered from the timestamp) also appear high in the ranking, confirming the relevance of temporal patterns.
- These insights can guide:
  - feature selection (e.g., reducing dimensionality; see the sketch below)
  - feature engineering (e.g., combinations or interactions)
  - model interpretation and post-hoc analysis (e.g., SHAP, PDP)
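As one example of how this ranking could guide feature selection, the sketch below keeps only the top 30 features by importance; the cutoff of 30 is an arbitrary illustrative choice.
# Sketch: keep the 30 most important features for a leaner baseline dataset (cutoff is illustrative)
top_features <- importance_df %>%
  slice_max(order_by = Importance, n = 30) %>%
  pull(Variable)
train_data_top <- train_data_rf %>%
  select(isFraud, all_of(top_features))
dim(train_data_top)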
To better understand how fraud is distributed across categories, we compute the fraud rate for selected categorical variables. This highlights which categories are more likely to be associated with fraudulent behavior.
# Helper function to plot fraud rates for a categorical variable
plot_fraud_rate <- function(data, var) {
data %>%
group_by(.data[[var]]) %>%
summarise(
Count = n(),
FraudRate = mean(isFraud, na.rm = TRUE)
) %>%
filter(Count > 200) %>% # Optional: filter rare categories
ggplot(aes(x = reorder(.data[[var]], -FraudRate), y = FraudRate)) +
geom_col(fill = "steelblue") +
labs(
title = paste("Fraud Rate by", var),
x = var,
y = "Fraud Rate"
) +
theme_minimal()
}
ProductCD
# Plot for ProductCD
plot_fraud_rate(train_data, "ProductCD")
Observations:
- Product code C has the highest fraud rate (nearly 12%), while W has the lowest.
- C should be treated with increased scrutiny during model training or rule design.
card4
# Plot for card4
plot_fraud_rate(train_data, "card4")
Observations:
- Discover cards exhibit the highest fraud rate (around 8%), compared to others like Visa and Mastercard.
card6
# Plot for card6
plot_fraud_rate(train_data, "card6")
Observations:
- Credit cards show a significantly higher fraud rate than debit cards.
This cross-feature fraud rate analysis reinforces that some categorical variables are informative for fraud prediction. Product type (ProductCD), card provider (card4), and card type (card6) show meaningful variation in fraud rates across categories. These features may be especially valuable for downstream modeling, and further interaction effects could be explored (see the sketch below).
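As a small illustration of such an interaction, the sketch below tabulates fraud rates across combinations of ProductCD and card6:
# Sketch: fraud rate across ProductCD x card6 combinations
train_data %>%
  group_by(ProductCD, card6) %>%
  summarise(Count = n(), FraudRate = mean(isFraud, na.rm = TRUE), .groups = "drop") %>%
  filter(Count > 200) %>% # ignore very rare combinations
  arrange(desc(FraudRate)) %>%
  head(10)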
Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of high-dimensional data by transforming correlated variables into a smaller number of uncorrelated variables called principal components. Although PCA is not ideal for interpretability, it is useful for visualizing potential separation between fraudulent and non-fraudulent transactions.
We use the cleaned numerical dataset from the correlation analysis
(train_data_num) and apply z-score
standardization (mean = 0, std. dev. = 1) to each variable
before computing the principal components.
# Ensure all numeric values are scaled
numeric_scaled <- train_data_num %>%
select(-isFraud) %>%
mutate_all(~ scale(.) %>% as.vector())
# Attach target back
pca_data <- cbind(numeric_scaled, isFraud = train_data_num$isFraud)
Note:
- Out of 177 numeric features available (excluding isFraud), 176 were used in PCA.
- The variable V305 was excluded because it had zero variance (i.e., a constant value across samples). Variables with zero variance do not contribute to PCA and must be removed.
To make PCA computationally tractable, we randomly sample 10,000 observations from the dataset and remove variables with zero variance.
# Step 1: sample 10,000 rows (with isFraud)
set.seed(123)
pca_sample <- train_data_num %>%
sample_n(10000)
# Step 2: drop isFraud and remove constant columns (zero variance)
scaled_sample <- pca_sample %>%
select(-isFraud) %>%
select(where(~ var(.x, na.rm = TRUE) > 0)) %>%
mutate_all(~ scale(.) %>% as.vector())
# Step 3: apply PCA
pca_result <- prcomp(scaled_sample, center = TRUE, scale. = TRUE)
# Step 4: create plot data
pca_df <- as.data.frame(pca_result$x[, 1:2])
pca_df$isFraud <- factor(pca_sample$isFraud)
# Step 5: plot
ggplot(pca_df, aes(x = PC1, y = PC2, color = isFraud)) +
geom_point(alpha = 0.5, size = 1) +
labs(
title = "PCA (Sample of 10K)",
x = "PC1", y = "PC2",
color = "Is Fraud"
) +
theme_minimal()
Interpretation: there is no clear visual separation between fraudulent and non-fraudulent transactions along PC1 and PC2; fraud cases are scattered throughout the point cloud.
We now assess how much variance each principal component captures.
# Show variance explained by each component
summary(pca_result)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 5.1425 4.5066 3.03917 2.88263 2.68403 2.5758 2.36057
## Proportion of Variance 0.1503 0.1154 0.05248 0.04721 0.04093 0.0377 0.03166
## Cumulative Proportion 0.1503 0.2656 0.31813 0.36534 0.40628 0.4440 0.47563
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 2.25449 2.17135 2.07532 1.98786 1.88113 1.83180 1.80247
## Proportion of Variance 0.02888 0.02679 0.02447 0.02245 0.02011 0.01907 0.01846
## Cumulative Proportion 0.50451 0.53130 0.55577 0.57823 0.59833 0.61740 0.63586
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.79871 1.70719 1.62377 1.55728 1.54799 1.47625 1.46329
## Proportion of Variance 0.01838 0.01656 0.01498 0.01378 0.01362 0.01238 0.01217
## Cumulative Proportion 0.65424 0.67080 0.68578 0.69956 0.71317 0.72556 0.73772
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 1.36702 1.31024 1.29528 1.24126 1.21929 1.19460 1.16834
## Proportion of Variance 0.01062 0.00975 0.00953 0.00875 0.00845 0.00811 0.00776
## Cumulative Proportion 0.74834 0.75809 0.76763 0.77638 0.78483 0.79294 0.80069
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 1.11454 1.05855 1.03422 1.01995 1.01169 1.00100 0.98569
## Proportion of Variance 0.00706 0.00637 0.00608 0.00591 0.00582 0.00569 0.00552
## Cumulative Proportion 0.80775 0.81412 0.82019 0.82610 0.83192 0.83761 0.84313
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.97888 0.96356 0.95498 0.94370 0.93291 0.92044 0.90789
## Proportion of Variance 0.00544 0.00528 0.00518 0.00506 0.00495 0.00481 0.00468
## Cumulative Proportion 0.84858 0.85385 0.85903 0.86409 0.86904 0.87385 0.87854
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.88397 0.86265 0.83973 0.8181 0.81116 0.78244 0.76745
## Proportion of Variance 0.00444 0.00423 0.00401 0.0038 0.00374 0.00348 0.00335
## Cumulative Proportion 0.88298 0.88720 0.89121 0.8950 0.89875 0.90223 0.90558
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 0.76407 0.74789 0.74469 0.72143 0.7149 0.7138 0.69530
## Proportion of Variance 0.00332 0.00318 0.00315 0.00296 0.0029 0.0029 0.00275
## Cumulative Proportion 0.90889 0.91207 0.91522 0.91818 0.9211 0.9240 0.92673
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.68798 0.67949 0.67065 0.66653 0.65336 0.64526 0.62831
## Proportion of Variance 0.00269 0.00262 0.00256 0.00252 0.00243 0.00237 0.00224
## Cumulative Proportion 0.92942 0.93204 0.93459 0.93712 0.93954 0.94191 0.94415
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.60621 0.59718 0.58791 0.57647 0.56673 0.56126 0.55216
## Proportion of Variance 0.00209 0.00203 0.00196 0.00189 0.00182 0.00179 0.00173
## Cumulative Proportion 0.94624 0.94827 0.95023 0.95212 0.95394 0.95573 0.95747
## PC71 PC72 PC73 PC74 PC75 PC76 PC77
## Standard deviation 0.5308 0.52561 0.50808 0.49850 0.4965 0.49027 0.48475
## Proportion of Variance 0.0016 0.00157 0.00147 0.00141 0.0014 0.00137 0.00134
## Cumulative Proportion 0.9591 0.96064 0.96210 0.96352 0.9649 0.96628 0.96762
## PC78 PC79 PC80 PC81 PC82 PC83 PC84
## Standard deviation 0.46959 0.46175 0.45647 0.43407 0.42677 0.42427 0.41541
## Proportion of Variance 0.00125 0.00121 0.00118 0.00107 0.00103 0.00102 0.00098
## Cumulative Proportion 0.96887 0.97008 0.97127 0.97234 0.97337 0.97439 0.97537
## PC85 PC86 PC87 PC88 PC89 PC90 PC91
## Standard deviation 0.40688 0.39933 0.3982 0.39196 0.37975 0.37784 0.3744
## Proportion of Variance 0.00094 0.00091 0.0009 0.00087 0.00082 0.00081 0.0008
## Cumulative Proportion 0.97631 0.97722 0.9781 0.97899 0.97981 0.98063 0.9814
## PC92 PC93 PC94 PC95 PC96 PC97 PC98
## Standard deviation 0.36478 0.35869 0.35517 0.3501 0.34361 0.34269 0.33472
## Proportion of Variance 0.00076 0.00073 0.00072 0.0007 0.00067 0.00067 0.00064
## Cumulative Proportion 0.98218 0.98291 0.98363 0.9843 0.98499 0.98566 0.98630
## PC99 PC100 PC101 PC102 PC103 PC104 PC105
## Standard deviation 0.33052 0.32822 0.3255 0.32324 0.31825 0.30783 0.30704
## Proportion of Variance 0.00062 0.00061 0.0006 0.00059 0.00058 0.00054 0.00054
## Cumulative Proportion 0.98692 0.98753 0.9881 0.98872 0.98930 0.98984 0.99037
## PC106 PC107 PC108 PC109 PC110 PC111 PC112
## Standard deviation 0.29942 0.29454 0.29009 0.28739 0.28617 0.27207 0.2640
## Proportion of Variance 0.00051 0.00049 0.00048 0.00047 0.00047 0.00042 0.0004
## Cumulative Proportion 0.99088 0.99138 0.99185 0.99232 0.99279 0.99321 0.9936
## PC113 PC114 PC115 PC116 PC117 PC118 PC119
## Standard deviation 0.26155 0.24839 0.24188 0.24119 0.23946 0.23507 0.2309
## Proportion of Variance 0.00039 0.00035 0.00033 0.00033 0.00033 0.00031 0.0003
## Cumulative Proportion 0.99399 0.99435 0.99468 0.99501 0.99533 0.99565 0.9960
## PC120 PC121 PC122 PC123 PC124 PC125 PC126
## Standard deviation 0.21818 0.21165 0.20692 0.20400 0.19685 0.19099 0.1868
## Proportion of Variance 0.00027 0.00025 0.00024 0.00024 0.00022 0.00021 0.0002
## Cumulative Proportion 0.99622 0.99648 0.99672 0.99696 0.99718 0.99738 0.9976
## PC127 PC128 PC129 PC130 PC131 PC132 PC133
## Standard deviation 0.18423 0.17031 0.16841 0.16066 0.15537 0.15339 0.14808
## Proportion of Variance 0.00019 0.00016 0.00016 0.00015 0.00014 0.00013 0.00012
## Cumulative Proportion 0.99777 0.99794 0.99810 0.99825 0.99838 0.99852 0.99864
## PC134 PC135 PC136 PC137 PC138 PC139 PC140
## Standard deviation 0.14503 0.14010 0.13706 0.1319 0.1295 0.12376 0.12038
## Proportion of Variance 0.00012 0.00011 0.00011 0.0001 0.0001 0.00009 0.00008
## Cumulative Proportion 0.99876 0.99887 0.99898 0.9991 0.9992 0.99926 0.99934
## PC141 PC142 PC143 PC144 PC145 PC146 PC147
## Standard deviation 0.11364 0.10747 0.10728 0.10513 0.10323 0.09613 0.09512
## Proportion of Variance 0.00007 0.00007 0.00007 0.00006 0.00006 0.00005 0.00005
## Cumulative Proportion 0.99942 0.99948 0.99955 0.99961 0.99967 0.99972 0.99978
## PC148 PC149 PC150 PC151 PC152 PC153 PC154
## Standard deviation 0.07832 0.07094 0.06686 0.06293 0.06216 0.05921 0.05172
## Proportion of Variance 0.00003 0.00003 0.00003 0.00002 0.00002 0.00002 0.00002
## Cumulative Proportion 0.99981 0.99984 0.99986 0.99989 0.99991 0.99993 0.99994
## PC155 PC156 PC157 PC158 PC159 PC160 PC161
## Standard deviation 0.03950 0.03821 0.03756 0.03705 0.03368 0.03073 0.02888
## Proportion of Variance 0.00001 0.00001 0.00001 0.00001 0.00001 0.00001 0.00000
## Cumulative Proportion 0.99995 0.99996 0.99997 0.99998 0.99998 0.99999 0.99999
## PC162 PC163 PC164 PC165 PC166 PC167
## Standard deviation 0.02304 0.01983 0.01272 0.007694 0.004664 0.003767
## Proportion of Variance 0.00000 0.00000 0.00000 0.000000 0.000000 0.000000
## Cumulative Proportion 1.00000 1.00000 1.00000 1.000000 1.000000 1.000000
## PC168 PC169 PC170 PC171 PC172 PC173
## Standard deviation 0.001825 0.001339 4.816e-09 2.151e-09 1.11e-09 8.273e-10
## Proportion of Variance 0.000000 0.000000 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion 1.000000 1.000000 1.000e+00 1.000e+00 1.00e+00 1.000e+00
## PC174 PC175 PC176
## Standard deviation 5.51e-15 3.928e-15 2e-15
## Proportion of Variance 0.00e+00 0.000e+00 0e+00
## Cumulative Proportion 1.00e+00 1.000e+00 1e+00
You can also visualize this more clearly:
# Variance proportion
pca_var <- summary(pca_result)$importance[2, 1:10] # Proportion of variance
# Barplot
barplot(pca_var,
names.arg = paste0("PC", 1:10),
main = "Variance Explained by Top 10 Principal Components",
xlab = "Principal Component",
ylab = "Proportion of Variance",
col = "steelblue")
Interpretation: PC1 and PC2 explain only about 15% and 11.5% of the variance respectively (roughly 27% combined), so the first two components capture a limited share of the dataset's overall structure.
Conclusion:
PCA confirms that:
- Fraudulent transactions do not linearly separate from non-fraudulent ones in the first two components.
- However, PCA remains valuable for:
  - anomaly detection and visual exploration,
  - dimensionality reduction before clustering or linear modeling,
  - reducing noise and multicollinearity in downstream tasks.
If used for modeling, a larger number of principal components should be retained to preserve enough information. Future steps could include training models on the top k components or integrating PCA into a pipeline with clustering techniques.
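A quick way to choose k is to retain enough components to reach a target share of variance; the 90% threshold in the sketch below is an illustrative choice.
# Sketch: number of components needed to reach 90% cumulative variance (threshold is illustrative)
cum_var <- summary(pca_result)$importance["Cumulative Proportion", ]
k <- which(cum_var >= 0.90)[1]
k
# Reduced representation of the 10K sample for downstream experiments
pca_reduced <- as.data.frame(pca_result$x[, 1:k])
pca_reduced$isFraud <- pca_sample$isFraud
dim(pca_reduced)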
While we’ve previously reviewed missing value percentages across variables, visualizing their distribution across observations can reveal structured patterns—such as whether certain blocks of features tend to be missing together, or whether missingness is concentrated in specific transaction types.
To do this, we generate a missing data heatmap of the top 50 variables with the highest proportion of missing values. Due to computational limits, we sample 5,000 rows from the full dataset.
# Step 1: select top 50 variables with highest NA percentage
na_top_vars <- train_data %>%
summarise(across(everything(), ~ mean(is.na(.)))) %>%
pivot_longer(everything(), names_to = "variable", values_to = "na_rate") %>%
arrange(desc(na_rate)) %>%
slice(1:50) %>%
pull(variable)
Now we have selected the top 50 variables with the highest missingness. Next, we will visualize the missing data patterns.
# step 2: Sample for plotting
set.seed(123)
train_data_na_subset <- train_data %>%
select(all_of(na_top_vars)) %>%
slice_sample(n = 5000)
# step3: Plot heatmap with rotated labels
vis_miss(train_data_na_subset, sort_miss = TRUE) +
labs(title = "Missing Data Heatmap - Top 50 Variables (Sample of 5,000 Rows)") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, size = 7),
plot.title = element_text(size = 14, face = "bold")
)
Interpretation: the heatmap shows that high-missingness variables tend to be missing together in blocks of rows (for example, the identity-related id_ features are typically absent for the same transactions), rather than missing at random.
Conclusion:
This visual supports previous numeric summaries and reinforces the need for:
- careful imputation strategies or selective removal of high-missingness features, and
- exploring whether missingness itself is informative for predicting fraud (e.g., through missingness indicators; see the sketch below).
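As a sketch of the missingness-indicator idea, a single high-missingness variable can be turned into a flag and compared against the fraud rate; dist2 is used purely as an example.
# Sketch: missingness indicator for one high-NA variable (dist2 as an example)
train_data %>%
  mutate(dist2_missing = is.na(dist2)) %>%
  group_by(dist2_missing) %>%
  summarise(n = n(), fraud_rate = mean(isFraud))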
This EDA provided deep insights into the structure, quality, and behavioral patterns of the IEEE-CIS Fraud Detection dataset. By analyzing categorical, temporal, and continuous features—as well as missing data and dimensional structure—we established a comprehensive foundation for building effective fraud detection models.
Key Findings:
- The target is highly imbalanced: only about 3.5% of transactions are fraudulent.
- Fraud concentrates in specific product codes (ProductCD = C), card networks (Discover), and time windows (early hours of the day).
- Many features have very high missingness, and several groups of anonymized variables are redundant or near-constant.
Next steps:
- Engineer additional features (e.g., IsMorning, IsWeekend, IsHighAmount).
- Handle missing values, empty strings, and zero-variance variables (e.g., V305) before modeling.
- Use the correlation and importance results to select a reduced feature set for a baseline model.