This notebook examines a dataset of credit card transactions, taken from a recent Kaggle competition. Specifically, the analysis focuses on the statistical structure of the Amount variable (dollar amount charged) across roughly 285k transactions. Without getting too deep into the statistical weeds, it also explores whether this collection of charges conforms to the Pareto distribution and Benford’s law.
The Pareto Distribution looks like this:
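As a rough illustration of that shape, the sketch below draws a Pareto density from its formula, f(x) = alpha * x_min^alpha / x^(alpha + 1) for x >= x_min. The helper `pareto_density` and the parameter values (`alpha = 1.5`, `x_min = 1`) are illustrative assumptions, not values fitted to the transaction data.

```r
# Pareto density: alpha * x_min^alpha / x^(alpha + 1) for x >= x_min, 0 otherwise
# NOTE: pareto_density, alpha and x_min are illustrative, not part of the original analysis
pareto_density <- function(x, alpha = 1.5, x_min = 1) {
  ifelse(x >= x_min, alpha * x_min^alpha / x^(alpha + 1), 0)
}

# evaluate on a grid and draw the curve with base graphics
grid <- seq(1, 10, by = 0.05)
plot(grid, pareto_density(grid), type = "l",
     xlab = "x", ylab = "density",
     main = "Pareto density (alpha = 1.5, x_min = 1)")
```

The curve starts at its peak and decays quickly, with a long right tail: many small values, few large ones.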
Prepare the environment and read in the data:
# clear environment, load packages
rm(list=ls())
library(dplyr)
library(plotly)
# disable scientific notation
options(scipen=999)
# read in data, view a selection of variables
fraud <- read.csv("creditcard.csv")
head(fraud %>% dplyr::select(Time, V1, V2, V3, V28, Amount, Class))
##   Time    V1    V2   V3   V28 Amount Class
## 1    0 -1.36 -0.07 2.54 -0.02 149.62     0
## 2    0  1.19  0.27 0.17  0.01   2.69     0
## 3    1 -1.36 -1.34 1.77 -0.06 378.66     0
## 4    1 -0.97 -0.19 1.79  0.06 123.50     0
## 5    2 -1.16  0.88 1.55  0.22  69.99     0
## 6    2 -0.43  0.96 1.14  0.08   3.67     0
One of the first steps in understanding the Amount variable is to examine the raw distribution of the dollar amount spent across all transactions. Charges of 0.00 were withheld, and the data were restricted to amounts no greater than the mean plus one standard deviation. This filter removes outliers: a number of large charges create a very long right tail. The result below looks strikingly similar to the Pareto Distribution. This is not surprising, as it is logical for consumers to make many more small purchases than large ones.
# create data frame
df <- fraud %>%
  dplyr::select(Amount) %>%
  filter(Amount != 0 & Amount <= mean(fraud$Amount) + sd(fraud$Amount))
#%>% mutate(Amount = replace(Amount, Amount < 1, 1))
x <- df[['Amount']]
# calculate density distribution
fit <- density(x)
# visualize
plot_ly(x = x, type = "histogram", name = "Histogram") %>%
  add_trace(x = ~fit$x,
            y = ~fit$y,
            type = "scatter",
            mode = "lines",
            fill = "tozeroy",
            yaxis = "y2", name = "Density") %>%
  layout(title = "Amount (ex. $0)",
         yaxis2 = list(overlaying = "y",
                       side = "right",
                       showgrid = FALSE))
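One informal way to check the Pareto resemblance beyond eyeballing the histogram is to look at the empirical survival function on log-log axes, where a Pareto tail appears as a roughly straight line with slope -alpha. The sketch below is an illustration under stated assumptions, not part of the original analysis: `tail_slope` is a hypothetical helper, and simulated Pareto draws stand in for `fraud$Amount` so the block runs on its own.

```r
# Hypothetical helper: slope of log(survival share) against log(x)
tail_slope <- function(x) {
  x <- sort(x[x > 0])
  n <- length(x)
  surv <- (n:1) / n   # share of observations at or above each sorted value
  unname(coef(lm(log(surv) ~ log(x)))[2])
}

# simulated Pareto sample via inverse-transform sampling (alpha = 2, x_min = 1)
set.seed(42)
sim <- runif(10000)^(-1 / 2)

tail_slope(sim)   # should land near -alpha, i.e. roughly -2
```

On the real data, `tail_slope(fraud$Amount)` would give a comparable rough slope; strong curvature in the log-log plot would argue against a pure Pareto form.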
It is also prudent to do a quick examination of the distribution of only the fraudulent transaction amounts. This requires filtering to Class == 1. The resulting frequencies also bear a resemblance to the Pareto Distribution.
# create data frame
df <- fraud %>%
  dplyr::select(Amount, Class) %>%
  filter(Class == 1)
x <- df[['Amount']]
# calculate density distribution
fit <- density(x)
# visualize
plot_ly(x = x, type = "histogram", name = "Histogram") %>%
  add_trace(x = ~fit$x,
            y = ~fit$y,
            type = "scatter",
            mode = "lines",
            fill = "tozeroy",
            yaxis = "y2", name = "Density") %>%
  layout(title = "Fraudulent Amounts (Class = 1)",
         yaxis2 = list(overlaying = "y",
                       side = "right",
                       showgrid = FALSE))
While it seems curious that the single most common amount among the fraudulent charges (~23%) was $1, this may be explained by criminals testing a card to ensure it works before making larger charges.
# tabulate fraudulent amounts and rank them by share of fraudulent charges
f <- as.data.frame(table(fraud %>%
                           filter(Class == 1) %>%
                           dplyr::select(Amount)))
f %>%
  mutate(prop = Freq / sum(Freq)) %>%
  arrange(desc(prop))
Benford’s law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small.[1] For example, in sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. If the digits were distributed uniformly, they would each occur about 11.1% of the time.[2] Benford’s law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on. (Source: Wikipedia)
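The first-digit probabilities quoted above follow from the formula P(d) = log10(1 + 1/d) for d = 1, ..., 9. A minimal sketch that reproduces them in R (the name `benford_probs` is just an illustrative choice):

```r
# Benford's law: expected probability that the leading digit equals d
digits <- 1:9
benford_probs <- log10(1 + 1 / digits)

# named vector of probabilities, rounded to three decimals
round(setNames(benford_probs, digits), 3)
```

The nine probabilities sum to 1, because the sum of log10((d + 1) / d) over d = 1, ..., 9 telescopes to log10(10).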
To examine whether this phenomenon is found in our sample of transactions, a vector of numbers is created by extracting the first digit from each Amount spent. A decision was made to remove values less than 1 for the sake of simplicity. The resulting vector has 9 distinct values and ~268k observations.
# create amount vector, filter out amounts less than 1
amount <- fraud %>%
  dplyr::filter(Amount >= 1) %>%
  dplyr::transmute(first_digit = as.numeric(substr(as.character(Amount), 1, 1)))
# display unique values
sort(unique(amount$first_digit))
## [1] 1 2 3 4 5 6 7 8 9
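Amounts below 1 were dropped here for simplicity. If they were to be kept, one option (an assumption, not what this notebook does) is to format each value in scientific notation, where the first character is always the leading significant digit. `leading_digit` is a hypothetical helper:

```r
# Hypothetical helper: leading significant digit of positive numbers.
# formatC(..., format = "e") yields strings such as "7.000000e-02",
# so the first character is the leading digit even for values below 1.
leading_digit <- function(x) {
  stopifnot(all(x > 0))
  as.numeric(substr(formatC(x, format = "e", digits = 6), 1, 1))
}

leading_digit(c(149.62, 2.69, 0.07, 0.30, 1000))
## [1] 1 2 7 3 1
```

Charges of exactly 0.00 have no leading significant digit, so they would still need to be excluded before calling the helper.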
The next step is to examine a simple frequency table of the first-digits, as compared to the Benford frequencies. The resulting frequency distribution is notably similar to that of Benford’s Distribution:
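# create frequency table (table() orders the rows by digit, 1 through 9)
freq <- as.data.frame(table(amount$first_digit)) %>%
  mutate(first_digit_prop = round(Freq / sum(Freq), 3))
colnames(freq)[1] <- 'first_digit'
colnames(freq)[2] <- 'frequency'
# include Benford frequencies for digits 1 through 9; attach them while the rows
# are still in digit order so each digit is paired with its own expected value
benford_dist <- data.frame(benford = c(0.301, 0.176, 0.125, 0.097, 0.079, 0.067, 0.058, 0.051, 0.046))
freq <- cbind(freq, benford_dist)
# sort by observed proportion, largest first
freq <- freq %>% arrange(desc(first_digit_prop))
freq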
# view histogram
plot_ly(x = amount$first_digit, type = "histogram", name = "Histogram") %>%
  layout(title = "Transaction Amounts: First Digit Frequency",
         xaxis = list(title = "First Digits"))
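To put a number on how similar the two distributions are, one simple option is the mean absolute deviation (MAD) between the observed and expected first-digit proportions. The sketch below is illustrative only: `benford_mad` is a hypothetical helper, and the digits are simulated so the block runs without `creditcard.csv`.

```r
# Hypothetical helper: mean absolute deviation between observed first-digit
# proportions and the Benford expectation log10(1 + 1/d)
benford_mad <- function(first_digit) {
  observed <- as.numeric(table(factor(first_digit, levels = 1:9))) / length(first_digit)
  expected <- log10(1 + 1 / (1:9))
  mean(abs(observed - expected))
}

# simulated digits drawn from the Benford probabilities (stand-in for amount$first_digit)
set.seed(1)
sim_digits <- sample(1:9, size = 10000, replace = TRUE,
                     prob = log10(1 + 1 / (1:9)))

benford_mad(sim_digits)               # small: the sample follows Benford by construction
benford_mad(rep(1:9, times = 1000))   # larger: uniform digits deviate from Benford
```

On the real data this would be `benford_mad(amount$first_digit)`; smaller values indicate closer agreement, and no particular cutoff is implied here.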
### Conclusion
The Amount variable does appear to resemble both the Pareto and Benford Distributions. This is a valuable piece of information when it comes to future fraud prediction efforts. Understanding the underlying distribution of your response variable can only serve to strengthen your statistical modeling efforts.