Benford’s Law with R

Definition

Benford’s Law (or the first-digit law) is the observation that small leading digits are more common than large ones in many real-world data sets.
In other words, for many naturally occurring sets of data, we would expect a leading digit of 1 more often than a leading digit of 9.
Examples: voting records, business transactions, powers of 2, factorials, and many more.

Each leading digit $d$ has a probability of $P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$
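
As a quick illustration of the formula, the sketch below tabulates the expected probabilities for digits 1 to 9 and compares them with the leading digits of the first 1000 powers of 2, one of the examples listed above. It uses base R only; the variable names and the choice of 1000 powers are arbitrary assumptions for the sake of the example.

```r
digits <- 1:9
benford_prob <- log10(1 + 1 / digits)  # P(d) for d = 1..9

# Leading digit of each power of 2, read off its scientific-notation string
powers <- 2^(1:1000)
lead <- as.integer(substr(sprintf("%.10e", powers), 1, 1))

# Share of powers of 2 that start with each digit
observed <- tabulate(lead, nbins = 9) / length(lead)

round(data.frame(digit = digits, benford = benford_prob, powers_of_2 = observed), 4)
```

The two columns should agree to within a fraction of a percentage point, and the expected probabilities sum to 1.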

US Census by City/Town

Here is an example using the population size of cities and towns in the US.
Our data set has 19509 records ranging from 1 to $8.391881 \times 10^{6}$.
Visually, the trend seems to follow Benford’s Law (red line) very closely.

How to

To get started, we use the R library “benford.analysis” to make things a bit easier.
Here we can see the results:

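library(benford.analysis)
# Load the bundled census data and analyse it (first two digits by default)
data(census.2009)
benford(census.2009$pop.2009)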

## 
## Benford object:
##  
## Data: census.2009$pop.2009 
## Number of observations used = 19509 
## Number of obs. for second order = 7950 
## First digits analysed = 2
## 
## Mantissa: 
## 
##    Statistic  Value
##         Mean  0.503
##          Var  0.084
##  Ex.Kurtosis -1.207
##     Skewness -0.013
## 
## 
## The 5 largest deviations: 
## 
##   digits absolute.diff
## 1     15         45.81
## 2     32         36.28
## 3     60         33.95
## 4     11         33.22
## 5     28         26.68
## 
## Stats:
## 
##  Pearson's Chi-squared test
## 
## data:  census.2009$pop.2009
## X-squared = 107.16, df = 89, p-value = 0.09222
## 
## 
##  Mantissa Arc Test
## 
## data:  census.2009$pop.2009
## L2 = 4.198e-05, df = 2, p-value = 0.4409
## 
## Mean Absolute Deviation (MAD): 0.0006134141
## MAD Conformity - Nigrini (2012): Close conformity
## Distortion Factor: 0.7404623
## 
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!

Then we plot it…

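library(plotly)

# Assumption: `trends` holds a first-digit-only analysis (nine bins, digits 1 to 9)
trends = benford(census.2009$pop.2009, number.of.digits = 1)

x = seq(1,9,1)
p = log10(1 + (1/x))  # expected Benford proportions for digits 1 to 9

df = data.frame(x=trends$bfd$digits,y=trends$bfd$data.dist,p=trends$bfd$benford.dist)
plot_ly(df,x=df$x,y=df$y,type="bar",name="US Census") %>%
    add_lines(x=x,y=p,name="benford", line = list(shape = "spline")) %>%
    add_markers(x=x,y=p,name="benford") %>%
    add_markers(x=x,y=df$y,name="US Census")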

And run fit tests…

Here we take a look at the first-digit proportions and run the chi-square test and the Mean Absolute Deviation (MAD) to see how well the data fits.

digit	1	2	3	4	5	6	7	8	9
population	0.2941	0.1815	0.1200	0.0947	0.0799	0.0702	0.0598	0.0535	0.0463
benford	0.3010	0.1761	0.1249	0.0969	0.0792	0.0669	0.0580	0.0512	0.0458

$\chi^2 = \sum_{i=1}^{K} \frac{(P_i - P0_i)^2}{P0_i}$

Where:

$K =$ number of bins (the test has $K - 1$ degrees of freedom)

$P_i =$ actual count

$P0_i =$ expected count

## 
##  Pearson's Chi-squared test
## 
## data:  census.2009$pop.2009
## X-squared = 17.524, df = 8, p-value = 0.0251
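
As a rough check, the statistic can be recomputed by hand from the proportions in the table above, using the standard Pearson form that divides each squared difference by the expected count. This is only a sketch: the table's proportions are rounded to four decimals, so the result should land near the reported 17.524 rather than exactly on it. The variable names are illustrative.

```r
# First-digit proportions copied from the table above (rounded to four decimals)
population <- c(0.2941, 0.1815, 0.1200, 0.0947, 0.0799, 0.0702, 0.0598, 0.0535, 0.0463)
benford_prob <- log10(1 + 1 / 1:9)
n <- 19509  # number of records in the census data

# Convert proportions to (approximate) counts
observed <- population * n
expected <- benford_prob * n

# Pearson chi-square statistic and its p-value with 9 - 1 = 8 degrees of freedom
chi_square <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi_square, df = length(observed) - 1, lower.tail = FALSE)

c(chi_square = chi_square, p_value = p_value)
```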

$MAD = \frac{\sum_{i=1}^{K} |P_i - P0_i|}{K}$

Where:

$K =$ Number of bins

$P_i =$ actual proportion

$P0_i =$ expected proportion

Mean Absolute Deviation (MAD): 0.0031193

MAD Conformity - Nigrini (2012): Close conformity

Distortion Factor: 0.7404623
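
The MAD is simple enough to recompute directly from the proportions in the table above. The sketch below does so in base R; because the table is rounded to four decimals, the result should be close to, but not identical to, the 0.0031193 reported by the package. The variable names are illustrative.

```r
# First-digit proportions copied from the table above (rounded to four decimals)
population <- c(0.2941, 0.1815, 0.1200, 0.0947, 0.0799, 0.0702, 0.0598, 0.0535, 0.0463)
benford_prob <- log10(1 + 1 / 1:9)

# Mean of the absolute differences across the nine digit bins
mad_value <- mean(abs(population - benford_prob))
mad_value
```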

History

Benford’s Law was first noticed by an astronomer named Simon Newcomb, who saw that the logarithm tables he was using showed significantly more wear on the earlier pages (pages with 1’s, then 2’s, etc.).
Newcomb developed a hypothesis and even discovered that the law holds for second digits as well!
Since its discovery, Benford’s law has been found to work for many sets including:
- River lengths
- Population sizes
- Physical constants
- Numbers found in magazines/newspapers/etc.
- Genome markers
- And more…

What sets are likely to fit?

Attributes that make for a set that is likely to fit well:

Sets that span several magnitudes:
- Good: Records of bank transactions, which could include tens, hundreds, thousands or more dollars in each transaction.
- Bad: Grocery store prices are less likely to fit, simply because most items cost less than $20.
Sets that are not limited by specific rules:
- Good: Street addresses. This one can be tricky, because addresses are generally assigned in numerical order. However, since roads tend to be very long, the addresses will span multiple magnitudes and offset the artificiality.
- Bad: Phone numbers. These are also assigned numerically, but since all phone numbers have the same number of digits, the leading digits will be spread fairly evenly.
Sets which are not produced by active human creation:
- Good: Ecological measurements
- Bad: Polling people on their favorite numbers
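
The first attribute, spanning several magnitudes, can be illustrated with simulated data. The sketch below is an assumption-laden toy: the "transactions" and "grocery prices" are random numbers, not real records, and `first_digit_mad` is a hypothetical helper written for this example. It compares the first-digit MAD of a set spread over six orders of magnitude with one confined to values between 1 and 20.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

benford_prob <- log10(1 + 1 / 1:9)

# Hypothetical helper: first-digit MAD of a positive numeric vector
first_digit_mad <- function(values) {
  # Leading digit read off the scientific-notation string
  lead <- as.integer(substr(sprintf("%.10e", values), 1, 1))
  observed <- tabulate(lead, nbins = 9) / length(lead)
  mean(abs(observed - benford_prob))
}

# Simulated "transactions" spread evenly on a log scale from 1 to 1,000,000
wide <- 10^runif(10000, min = 0, max = 6)
# Simulated "grocery prices" spread evenly between 1 and 20
narrow <- runif(10000, min = 1, max = 20)

mad_wide <- first_digit_mad(wide)
mad_narrow <- first_digit_mad(narrow)
c(wide = mad_wide, narrow = mad_narrow)
```

The wide set should show a much smaller MAD than the narrow one, where more than half of the values start with a 1.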

Applications

Benford’s Law has found many real world uses.
It is great for detecting biases in data.
- Note that a deviation from Benford’s Law can flag bias, but conformity does not prove that the data is unbiased.
Notably, it has been used for fraud detection because people are bad at tracking their leading digits and tend to try to stay under certain thresholds.
- For instance, if a company required approval for purchases over $5000, and their purchases had an abnormally high number of 4’s in the leading digit, it would suggest that purchases were being kept just under the threshold, a possible sign of fraud.
It was used to detect election fraud in Iran’s 2009 presidential election.
It was used to uncover a Russian bot network on Twitter.
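
The purchase-approval example above can be sketched with simulated data. Everything here is hypothetical: the ledger is random, and the sizes (5000 ordinary purchases, 300 padded ones) are arbitrary assumptions chosen to make the effect visible. The idea is to compare observed first-digit shares with Benford’s expected shares and look for the digit with the largest excess.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

benford_prob <- log10(1 + 1 / 1:9)

# Hypothetical ledger: 5000 ordinary purchases between $10 and $10,000
ordinary <- 10^runif(5000, min = 1, max = 4)
# plus 300 purchases kept just under a $5000 approval limit
padded <- runif(300, min = 4000, max = 4999)
purchases <- c(ordinary, padded)

# Observed share of each leading digit
lead <- as.integer(substr(sprintf("%.10e", purchases), 1, 1))
observed <- tabulate(lead, nbins = 9) / length(lead)

# Positive excess means the digit appears more often than Benford's Law expects
excess <- observed - benford_prob
round(data.frame(digit = 1:9, observed, benford = benford_prob, excess), 4)
which.max(excess)
```

In this toy ledger the digit 4 should stand out with the largest excess. On real data, a spike like this is a reason to look closer, not proof of wrongdoing.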