General Instructions

Each lab in this course will have multiple components. First, there will be a piece like the document below, which includes instructions, tutorials, and problems to be addressed in your write-up. Any part of the document below marked with a star, \(\star\), is a problem for your write-up. Second, there will be an R script where you will do all of the computations required for the lab. And third, you will complete a write-up to be turned in on the Friday that you do your lab.

If you are comfortable doing so, I strongly suggest using RMarkdown to type your lab write-up. However, if you are new to R, you may handwrite your write-up (I’m also happy to work with you to learn RMarkdown!). All of your computational work will be done in RStudio Cloud, and both your lab write-up and your R script will be considered when grading your work.

Lab Overview

In this lab, you will learn about a surprising distribution that comes from a result known as Benford’s Law. This distribution is frequently used as one tool for identifying financial fraud, and you will use it here to investigate a financial data set that includes both fraudulent charges and legitimate charges.

As part of this lab, you will develop and practice skills related to plotting the probability distributions of discrete random variables, and you will use your visualizations as part of your fraud investigation.

R Tutorial

We need a little tutorial on plotting the probability distributions of discrete random variables using the R package ggplot2. If you are already comfortable with the ggplot2 package in R, feel free to skip this section.

Below is an example of a discrete distribution for a random variable \(Z\) that can take the possible values \(z=13, 14, 15, 16, 17, 18\). We’ll imagine that the variable p represents the true probabilities and p_est represents some estimated values.
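
(The code that builds this data frame is not shown in the lab; if you would like to follow along in your own session, one way to construct it is sketched below.)

# Illustrative sketch: one way to build the example data frame printed below
dist <- data.frame(z     = 13:18,
                   p     = c(0.10, 0.40, 0.15, 0.20, 0.10, 0.05),
                   p_est = c(0.08, 0.42, 0.15, 0.16, 0.11, 0.08))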

dist
##    z    p p_est
## 1 13 0.10  0.08
## 2 14 0.40  0.42
## 3 15 0.15  0.15
## 4 16 0.20  0.16
## 5 17 0.10  0.11
## 6 18 0.05  0.08

We can plot the distribution using the package ggplot2. The main idea behind the ggplot2 syntax is that you tell R what data you want it to plot, then add layers for aesthetic settings and graphical elements. The code below does just this. After loading the package, the first line of the plotting code tells R the data, and then specifies which variables to plot in aes (aesthetics). The next layer, telling R that you want a bar plot, is added with a \(+\) symbol.

library(ggplot2)

ggplot(data = dist, aes(x = z, y = p)) +
        geom_bar(stat = "identity")

Notice that everything looks pretty good so far, but not every bar has a label. We can change this by having R interpret the z values as factors, and then renaming the x-axis label (otherwise the label would say “factor(z)”).

ggplot(data = dist, aes(x = factor(z), y = p)) +
        geom_bar(stat = "identity") +
        xlab("z")

We can also add a second set of bars that will allow us to quickly compare the actual probabilities and the estimated probabilities. But for this to work well, we need to reformat the dataframe using the gather function in the package tidyr. The gather function code below will make two new columns while keeping the z column. The first, “prob”, will correspond to the names of the two original columns, p and p_est. The second new column, “val”, will correspond to the values of p and p_est. Thus, each row of the new dataframe corresponds to a combination of a value of z and either a value of p or a value of p_est from the original dataframe.

library(tidyr)

dist_long <- gather(dist, "prob", "val", -z)
dist_long
##     z  prob  val
## 1  13     p 0.10
## 2  14     p 0.40
## 3  15     p 0.15
## 4  16     p 0.20
## 5  17     p 0.10
## 6  18     p 0.05
## 7  13 p_est 0.08
## 8  14 p_est 0.42
## 9  15 p_est 0.15
## 10 16 p_est 0.16
## 11 17 p_est 0.11
## 12 18 p_est 0.08
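
As a side note, gather is still fine to use here, but newer versions of tidyr recommend pivot_longer as its replacement. If you prefer that function, an equivalent call (producing the same long-format columns) would look something like this:

# alternative to gather() in more recent versions of tidyr
dist_long <- pivot_longer(dist, cols = -z, names_to = "prob", values_to = "val")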

We can now make our grouped bar chart by adding a fill aesthetic term in ggplot. To get the bars side-by-side, we position them using the position argument in geom_bar.

ggplot(data = dist_long, aes(x = factor(z), y = val, fill = prob)) +
        geom_bar(stat = "identity", position = "dodge") +
        xlab("z") +
        ylab("probability")

Background and Theory

In the early 1900s, physicist Frank Benford noticed a curious pattern in a data set, and subsequently found the same pattern in a wide variety of other data sets (although he was not the first to make this observation). Specifically, he looked at the leading digit (1-9) of every number in a data set and counted how many times each digit occurred. A reasonable assumption is that the leading digits should appear about the same number of times (i.e., they should be distributed uniformly). But this doesn’t always happen! In particular, 1 occurred most often, with each successive digit appearing less often. Benford was able to write down a formula for this distribution, \[p(d) = P(D = d) = \log_{10}\left(\frac{d+1}{d}\right), \quad d=1,2,\ldots,9,\] where \(D\) is the random variable representing the leading digit.

The table below shows the value of the distribution for each leading digit:

d    p(d)
1    0.30103000
2    0.17609126
3    0.12493874
4    0.09691001
5    0.07918125
6    0.06694679
7    0.05799195
8    0.05115252
9    0.04575749
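
If you want to check these values for yourself, they can be computed directly from the formula above (a quick sketch; the object name benford_p is just illustrative):

# Benford probabilities for the leading digits 1 through 9
d <- 1:9
benford_p <- log10((d + 1) / d)
benford_p
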
  1. (\(\star\)) Show that \(p(d)\) is a probability distribution.

  2. (\(\star\)) In your R script, make a bar plot of the Benford distribution, and show it in your write-up.

One very big caveat for Benford’s Law is that it typically applies only when your data spans several orders of magnitude. As an example, if you consider the heights of players in the NBA in inches, all heights are two-digit numbers, and some digits (e.g., 1) will never appear as the leading digit in this data set. So, if you want to apply Benford’s Law, you need to check this condition.

Fraud Detection

The general idea behind using Benford’s Law to help detect fraud is that people are generally pretty bad at making up fake data that behaves like real data actually would. In our quest to make things look “random,” we often make fake data that doesn’t fit the same distribution as true data. Benford’s distribution is one possible distribution to check suspected fraudulent data against.

We will use a data set obtained from Kaggle that consists of 284,807 credit card transactions from bank customers in Europe during a period in 2013. Of these transactions, 492 fraudulent charges were found. This data set has been preprocessed for you.
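
As a rough illustration of the general workflow (not the required solution), the sketch below shows how one might extract leading digits and tabulate their frequencies. It assumes the raw Kaggle layout, with an Amount column holding the transaction values and a Class column equal to 1 for fraudulent charges; the preprocessed lab data may be organized differently, so adapt the names as needed.

# Illustrative sketch only -- assumes a data frame `cc` with columns
# Amount (transaction value) and Class (1 = fraud, 0 = legitimate),
# as in the raw Kaggle data; the preprocessed lab data may differ.
leading_digit <- function(x) {
        # keep only the digits 1-9 and take the first one;
        # amounts of exactly 0 have no leading digit and become NA
        s <- gsub("[^1-9]", "", format(x, scientific = FALSE))
        as.integer(substr(s, 1, 1))
}

fraud_digits <- leading_digit(cc$Amount[cc$Class == 1])
fraud_freq   <- prop.table(table(factor(fraud_digits, levels = 1:9)))

# combine with the Benford probabilities in the same shape as `dist`,
# which can then be reshaped with gather() and plotted as in the tutorial
fraud_dist <- data.frame(d        = 1:9,
                         benford  = log10((1:9 + 1) / (1:9)),
                         observed = as.numeric(fraud_freq))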

  1. (\(\star\)) In your R script, plot a grouped bar chart comparing the fraudulent data with the Benford distribution. Do the same for the legitimate data.

  2. (\(\star\)) What observations can you make based on the bar charts you produced? What does this suggest about how Benford’s Law can be used in fraud detection? What are some potential limitations or drawbacks of this method?