Benford’s Law in R

Introduction

We are all surrounded by numbers. They connect everything - even humans - we are all part of a mathematical formula/design or a pattern. These numbers, which are merely a way to quantify reality, seem to obey what is known as The Benford’s Law.

Take a bunch of naturally-occurring numbers, it could be numbers on the first page of the Financial Times or the population of countries or the tax fillings - and you strip away everything but the first digit and it will generally follow a pattern. You might think that the chances of any digit between 1 and 9 must be equal. However, we end up seeing starting-digit 1 to occur more than 2, starting-digit 2 more than 3 and so on. Not only more, we see a precise percentage. About 30.1% of the numbers start with 1 and about 4.6% start with a 9. We can graph it out to see this beautiful curve.

It almost seems to be wrong! But that’s the beauty of Benford’s Law.

Benford’s law comes form the empirical observation that in many datasets the leading digits of numbers are more likely to be small than large, for instance 1 is more likely to occur as the leading digit than 2. This observation was first published by Newcomb in 1881, and given experimental support in 1983 by Benford who analysed over 20,000 numbers collected from naturally occurring data sets such as the area of the riverbeds, atomic weights of elements, etc.

Required Libraries

library(benford.analysis)
library(tuneR)                        # for music
library(rvest)                        # for web-scraping
library(dplyr)
library(tidyverse)
library(imagerExtra); library(imager) # for images
library(gt); library(glue)            # for tables

Expected Digit Frequency

First Digit ($D_1$) Proportion:

$P(D_1=d_1) = log_{10}(1+ 1/d_1)$
For e.g. $P(D_1=1) = log_{10}(2) = 0.30103$

First Two Digit ($D_1D_2$) Proportion:

$P(D_1D_2 = d_1d_2) = log_{10}(1 + 1/d_1d_2)$
For e.g. $P(D_1D_2=19) = log_{10}(1 + 1/19) = 0.022276$

Second Digit ($D_2$) Proportion:

$P(D_2 = d_2) = \sum_{d_1=1}^9 log_{10}(1 + 1/d_1d_2)$
For e.g. $P(D_2=4) = log_{10}(1 + 1/14) + log_{10}(1 + 1/24)+ ... +log_{10}(1 + 1/94) = 0.1003$

Here, $d_1 \in \{1,2, ...,9\}$ and $d_2 \in \{0,1, ...,9\}$ and $d_1d_2 \in \{10,11, ...,99\}$. We can easily extend these formula to include more digits.

# first digit
benford1 = log10(1+ 1/1:9)
# first two digits
benford12 = log10(1+ 1/10:99)
# second digit
aux = 10:99 - floor(10:99/10)*10
benford2 = (data.frame(v1 = benford12,v2 = aux) %>% group_by(v2) %>% summarise(v1 = sum(v1)))$v1

Geometric Progression

Let’s start is a Geometric Progression (GP) Series. We can work out 2 series based on common ratios of 2 and 0.8 (500 terms each).

The $n^{th}$ term of a GP Series is given by : $a_n = ar^{n-1}$ where, ‘a’ is the first term and ‘r’ is the common ratio.

#Series 1 with a = 1 and r = 2
gp.series_1 = rep()
common.ratio_1 = 2
first.term_1 = 1
for (i in 1:500) {gp.series_1[i]=first.term_1*common.ratio_1^(i-1)}
trends.gp_1 = benford(gp.series_1, number.of.digits = 1,discrete = T, sign = "positive")
#Series 1 with a = 1 and r = 0.8
gp.series_2 = rep()
common.ratio_2 = 0.8
first.term_2 = 1
for (i in 1:500) {gp.series_2[i]=first.term_2*common.ratio_2^(i-1)}
trends.gp_2 = benford(gp.series_2, number.of.digits = 1,discrete = T, sign = "positive")

[1] "MAD = 0.00093 implying Close conformity"

[1] "MAD = 0.0017 implying Close conformity"

Exponential Expansion

Let’s work out different values of x in $e^x$ from 0, 1, …, 500

exp.series = rep()
for (i in 1:500) {exp.series[i]=exp(i)}
trends.exp = benford(exp.series, number.of.digits = 1,discrete = T, sign = "positive")

[1] "MAD = 0.0011 implying Close conformity"

Fibonacci Sequence

The Fibonacci Sequence is the series of numbers: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34 …

The 1st two terms are 0 and 1 and the third term is the sum of these. Thus, we have $n^{th} = (n-1)^{th} + (n-2)^{th}$ term.

To test if the Fibonacci sequence conforms to Benford’s Law, we test the first 500 numbers of the series. We can simply generate it using a for loop.

n = c(0,1,rep())
for (i in 3:501) { n[i]= n[i-1]+n[i-2] }
n = n[-1] #remove 0 
trends.f = benford(n, number.of.digits = 1,discrete = T, sign = "positive")

[1] "MAD = 0.0013 implying Close conformity"

Sin Cos Tan

Let’s take up a trigonometric functions like $sin(x), cos(x), tan(x)$ for values of x going from 0 to 100 increasing by 0.01 and see if they comply to Benford’s law

x = seq(0,100,0.01)
trends.sin = benford(sin(x),number.of.digits = 1, sign = "both",discrete = T)
trends.cos = benford(cos(x),number.of.digits = 1, sign = "both",discrete = T)
trends.tan = benford(tan(x),number.of.digits = 1, sign = "both",discrete = T)

[1] "For sin(x) : MAD = 0.0898 implying Nonconformity"

[1] "For cos(x) : MAD = 0.0888 implying Nonconformity"

[1] "For tan(x) : MAD = 0.0041 implying Close conformity"

We can see the proportions by calling bfd$data.dist.

     sin    cos    tan benford
1 0.0708 0.0720 0.3101  0.3010
2 0.0721 0.0732 0.1687  0.1761
3 0.0743 0.0756 0.1183  0.1249
4 0.0776 0.0786 0.0934  0.0969
5 0.0837 0.0838 0.0784  0.0792
6 0.0914 0.0917 0.0686  0.0669
7 0.1043 0.1042 0.0604  0.0580
8 0.1301 0.1296 0.0537  0.0512
9 0.2957 0.2914 0.0484  0.0458

One reason why sin and cos don’t comply whereas tan does comply could be their respective ranges. On one hand, $sin(x)$ and $cos(x)$ lie in the range [-1,1], $tan(x)$ lies in the range of [-Inf, Inf].

Probability Distributions

All long-tailed Probability Distributions comply to Benford’s Law. You can see below the graphs digit distribution of randomly generated sample of size 50,000 (bar graphs) and benford’s proportions (line graph).

Below table shows the proportion of respective distributions.

There are some restrictions on parameters of the distributions for yielding conformity to Benford’s law, for e.g. :-

Log-Normal : sigma > 0.8
F_m,n : m <= 12
Student’s t : degree of freedom < 6
Chi-square : degree of freedom < 2
Weibull : shape parameter < 1
Gamma : shape parameter < 1
Beta : shape1 <= 1 -and- shape2 > 4
Pareto : lambda >= 4.06
Burr : gamma >= 1.286

To visualise how changing parameter values change the underlying distribution, check out my R shiny dashboard

Music

Yes! Music, Songs and even Symphonies conform to the Benford’s Law.

The music of Beethoven, Bach, Tchaikovsky and Seedhe Maut have something in common. They all obey to Benford’s Law. What researchers did was they measured how long each note is played cumulatively over the entire song… and out popped Benford every single time.

We are surrounded by music. Music is made up of oscillating signals that are usually collected at 44 kHz. What it means is we collect about 44 thousand samples every second. It was also theorized and shown that even in the music, we see Benford’s Law. In a paper authored by Smith (1997), recommended to transform the data using Fourier Transforms for better conformity.

To test this theory, we use the song named Kranti by Seedhe Maut, an India Rapper Duo.

The music file is in MP3 format, so to extract numerical values we convert it to Wave format - load it in and then perform FFT. We then take modulus to convert complex numbers to a numeric value before applying the Benford’s test.

#Step(1) Convert from MP3 to Wave
# r = readMP3("Kranti Seedhe Maut.mp3")
# writeWave(r,"tmp.wav",extensible = FALSE)

#Step(2) Take FFT and modulus
w = readWave("tmp.wav")
data = w@left %>% fft() %>% Mod() 

#Step(3) Test for Benford's law
trends.m = benford(data, number.of.digits = 1, discrete = T, sign = "both")

[1] "MAD = 0.0094 implying Acceptable conformity"

To understand why use Fast Fourier Transform click here.

Painting

Saturn Devouring His Son by Francisco De Goya

In the recent years, we have seen application of Benford’s law in forensic image analysis to see if photos were tampered. When applying Benford’s law to images, the process is a bit different.

The image used is a painting by the renowned artist named Francisco De Goya. The painting is named Saturn Devouring His Son which depicts the Greek myth of Titan Cronus. Do check out the story behind this painting!

#Step(1) Load image to R workspace in grey-scale
im = load.image("Painting3.jpg") %>% grayscale()

#Step(2) Perform Discrete Cosine Transform on data
im_df = DCT2D(im) %>% as.data.frame()

#Step(3) Test Transformed Data for Benford's law
trends.im = benford(im_df$value, number.of.digits = 1, discrete = T, round = 1, sign = "both")

[1] "MAD = 0.0044 implying Close conformity"

To create any black-&-white image of size 8 by 8 we need a combination of all these different grids at the same time. These grids are the 64 different cosine functions which combine to form any 8 by 8 image. Each of these is weighted by some coefficient which is essentially a number representing the contribution of each of these individual blocks to the whole image. To calculate these coefficients we use Direct Cosine Transformation.

To understand what Direct Cosine Transformation does in order to convert a picture into numbers click here.

Shiny App for Your Images

Indian Population

Here, we apply Benford’s Law for village-wise population collected during the Population Census of 2011. The data is collected from the National Data and Analytics Platform (NDAP), click here.

india = read.csv("population.csv")
trends.population = benford(india$population, number.of.digits = 1, discrete = T, sign = "positive")

[1] "MAD = 0.0051 implying Close conformity"

Indian Land Use

Here, we pick up Land use statistics classified into categories: Data Available here

Forest Land Area
Land Area Under Non-Agriculture Uses
Baren and Unculturable Land Area
Permanent Pasture & Other Grazing Land Area
Land Under Miscellaneous Use
Culturable Waste Land Area
Fallow Lands Other than Current Fallow Lands
Current Fallow Land Area
Net Sown Land Area
Cropped Land Area
Land Area Sown More Than Once

#Load CSV File
data = read.csv("data.csv")
#Format a bit
data$State = as.factor(data$State)
data$District = as.factor(data$District)
data$Year = as.factor(data$Year)
#Loop to convert all other columns into numeric
for (i in 4:15) {
  a = i
  data[,a] = as.numeric(data[,a])
}
#Split the data based on year into different dataframes
x = split(data, data$Year)
#Loop for Benford
# 1st run of the loop : a = 4 i.e. Forest Land Area column is selected and tested for all years
# and so on ...
for (i in 4:15) {
  a = i
  print(names(data[a]))
  print(benford(x$`2011-12`[,a],number.of.digits = 1,sign = "positive",discrete = T)$MAD.conformity)
  print(benford(x$`2012-13`[,a],number.of.digits = 1,sign = "positive",discrete = T)$MAD.conformity)
  print(benford(x$`2013-14`[,a],number.of.digits = 1,sign = "positive",discrete = T)$MAD.conformity)
  print(benford(x$`2014-15`[,a],number.of.digits = 1,sign = "positive",discrete = T)$MAD.conformity)
  print(benford(x$`2015-16`[,a],number.of.digits = 1,sign = "positive",discrete = T)$MAD.conformity)
  print(benford(x$`2016-17`[,a],number.of.digits = 1,sign = "positive",discrete = T)$MAD.conformity)
  print(benford(x$`2017-18`[,a],number.of.digits = 1,sign = "positive",discrete = T)$MAD.conformity)
  print(benford(x$`2018-19`[,a],number.of.digits = 1,sign = "positive",discrete = T)$MAD.conformity)
}

Clearly, for data where the result is ‘Non-conformity’ - it is safe to assume that there must be some error in data collection or the dataset is incomplete/inaccurate. The year 2018 has the highest non-conformity among all the years.

However, Forest Land Area and Permanent Pasture & Other Grazing Land Area, Land Area sown more than once and most of the other categories comply with Benford’s Law and that is bizarre!

Let’s look at the plots for the year 2013-14.

Indian Banks’ Total Assets

Benford’s law is most commonly applied on macroeconomic and financial data to detect accounting fraud. Here, we take the total assets held by Indian Scheduled Banks over the years 2005 to 2021.

We have over 169 different banks categorised as public sector, private sector, foreign, small finance banks, nationalised banks and so on. For downloading the data click here. You can access the entire balance sheet of these banks under the section Statistical Tables Relating to Banks in India.

banks = read.csv("indian.banks.csv")
banks$Banks = as.factor(banks$Banks)
summary(banks$Total.Assets) #figures are in crores INR

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     30    2583   23777  113165  109007 4534430

trends.banks = benford(banks$Total.Assets,number.of.digits = 1,sign = "positive",discrete = T)

[1] "MAD =  0.0056 implying Close conformity"

National Stock Exchange

You can grab the dataset from the National Stock Exchange India website.

trades = read.csv("NSETrades.csv")
trends.trades = benford(trades$No.of.Trades, number.of.digits = 1, sign = "positive",discrete = T)

[1] "MAD =  0.0056 implying Close conformity"

Next time you execute a buy/sell of a stock feeling certain that it is entirely your decision, remember that you’re decision contributes to the beautiful Benford’s law curve.

Total Area of All Countries

Roger Pinkham (1961) showed that Benford’s Law is scale-invariant under multiplication. This means that if all numbers in a field that conformed to Benford’s law multiplied by a non-zero constant, the new list would also be a Benford Set.

The scale-invariance grants Benford’s Law Universality

Here, we use the Total Area (Land and Water) of 233 countries. We check if the area-figures conform to Benford’s Law in kilometre-square and then mile-square. To go from km-sq to mile-sq we multiply by 0.3861, a non-zero constant.

#Step(1) Web-scrape Country-wise Total Area
col_link1 = "https://www.worldometers.info/geography/largest-countries-in-the-world/"
col_page1 = read_html(col_link1)
col_table1 = col_page1 %>% html_nodes("table") %>% html_table()
col_table1 = col_table1[[1]]

#Step(2) Data Cleaning
km = col_table1[,3]  
km = as.numeric(gsub(",","",km$`Tot. Area (Km²)`)) 
km = km[-c(234,235)] #last two entries are zero - drop them
#
mile = col_table1[,4]
mile = as.numeric(gsub(",","",mile$`Tot. Area (mi²)`))
mile = mile[-c(234,235)] #last two entries are zero - drop them

#Step(3) Test for Benford's law
trends.km = benford(km, number.of.digits = 1, discrete = T, sign = "both")
trends.mile = benford(mile, number.of.digits = 1, discrete = T, sign = "positive")

[1] "For sq-km MAD = 0.0064 implying Acceptable conformity"

[1] "For sq-mile MAD = 0.0148 implying Marginally acceptable conformity"

The area of countries conform to Benford’s Law in both units km-sq and mile-sq. Thus, the scale-invariance property holds.

Sachin Tendulkar’s Runs

When it comes to sports it all boils down to a split-second decision. Imagine you are in the shoes of the greatest of all Sachin Tendulkar. You don’t know whether the next ball will be a full-toss, bouncer, wide-ball or a no-ball. You don’t know with certainty the direction in which the ball will end up going in. Whether it will be a six or a catch-out, whether the fielders will stop it or will the ball hit the boundary, whether you will finishing running-between the wickets or get run-out, or an over-throw.

From the conditions of the weather to the conditions of the pitch. Even your psychological status and bowel movements - all factors play a role to effect what runs are scored on an individual ball. It seems so random when we think about it. Yet, as we will see below, these runs scored comply to Benford’s Law.

So next time you’re watching the IPL, I hope your mind wanders off thinking about Benford’s Law for a brief moment, like mine does.

Let’s have a look at the runs-scored by the ‘The God of Cricket’ here. These runs are scored over different game formats of Test-series, ODI and T-20 between 1989 to 2013. Spanning over matches played against countries like Australia, Bangladesh, England, New Zealand, Pakistan, South Africa, Sri Lanka, West Indies - to many a few.

sachin = read.csv("sachin.csv",header = T)
trends.sachin = benford(sachin$Runs, number.of.digits = 1, discrete = T, sign = "positive")

## [1] "MAD = 0.0138 implying Marginally acceptable conformity"

Marginally acceptable conformity implies that the dataset must have some restrictions which did not allow us to observe close conformity. One restriction could be those matches where Sachin was not-out. In such cases, we have a censoring mechanism/ an artificial barrier preventing us to know what the actual runs could be if the player was bowled out. However, real-life data sets will never conform perfectly to Benford’s Law. It will always be an approximation - which is in itself quite bizarre!

Distance of Stars

Yes, Stars! Let’s see if Benford’s law holds on astronomical distances of ‘heavenly bodies’ from the Sun. For this we use the HYG3.0 Database which contains ALL stars that in Hipparcos, Yale Bright Star, and Gliese catalogs.

We delete the entries with a value >= 100,000 because it indicates missing or dubious parallax data in Hipparcos. The star’s distance in parsecs, the most common unit in astrometry. To convert parsecs to light years, multiply by 3.262.

The Github Link for the dataset click here

stars = read.csv("stars.csv")
trends.stars = benford(stars$parsecs, number.of.digits = 1, discrete = T, sign = "positive")

## [1] "MAD = 0.0114 implying Acceptable conformity"

Logarithmic Basis of Benford’s Law

The mathematical basis of Benford’s law is Mantissa of the logs which are expected to be uniformly distributed. This will occur if the numbers when ranked from smallest to largest, form a geometric sequence. Our real-life data needs to only approximate a geometric sequence. As a test of the logarithmic property and the benfordness, the expectation is that the logs form a straight line. Even though the digit patterns pass every test in the book, the geometric basis of the law may not be followed as closely. The line may not always be straight throughout. It is seldom that real financial or science-based data will have the straight line pattern throughout.

Consideration while graphing logs - dealing with negative values : ignore them, have a minimum threshold like 10 or take log of absolute values. This will not affect the process because the mantissa/ fractional part of log is not impacted.

A necessary condition to test the mantissa for U[0,1) is that the mean is 0.5 and variance is 1/12. Data that satisfies only the mean and variance requirements might have little or no conformity to Benford’s law.

Non-conformity to Benford’s Law could indicate - (1) incomplete data set (2) sample not being representative of the population (3) excessive rounding of data (4) data errors, inconsistencies or anomalies and/or (5) conformity to power law with a large exponent

Understanding R’s output

As per Mean Absolute Deviation, 0.44% we concluded Close conformity to Benford’s Law.

print(trends.im)

The Chi-Square Test is based on the premise that we are drawing a random sample from an ‘infinite’ set of data that follows our expected distribution. If our sample is small (say with 500 records), the chi-square test will tolerate some deviations. As the sample size increases, the chi-square test looks for the results to tend towards a perfect Benford Set. The chi-square test is too sensitive to small deviations from Benford’s Law for large data sets to be used for a conformity assessment.

Mean Absolute Deviation, $MAD = \sum_{i=1}^K |Actual-Expected|/K$ where K is the number of bins. Below we can see the MAD figures and their respective conclusions.

We can have a look at the Mantissa and MAD figures. Based on the expected values of Mean, Variance, Kurtosis and Skewness.

We can plot the results.

plot(trends.im)

The plots show the first and second digits. In both cases, we can see that the data (histogram) fits the Benford’s Law (in red).

Other Examples

Some Sources

In a conversation with Anand Gandhi on instagram, after I was done telling him about this beautiful law, he replied - what the underlying cause is - Entropy or Complexity!