Benford’s Law - Analysis and Applications

library(reshape2)
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#install.packages('benford.analysis')
library(benford.analysis)
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.6.2

## ── Attaching packages ────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ stringr 1.4.0
## ✓ tidyr   1.2.0     ✓ forcats 0.5.1
## ✓ readr   2.1.2

## Warning: package 'ggplot2' was built under R version 3.6.2

## Warning: package 'tibble' was built under R version 3.6.2

## Warning: package 'tidyr' was built under R version 3.6.2

## Warning: package 'readr' was built under R version 3.6.2

## Warning: package 'purrr' was built under R version 3.6.2

## Warning: package 'forcats' was built under R version 3.6.2

## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Understanding Benford package

Digits Distribution: Checking the distribution of the first digit (in this case) with the expected Benford theoretical distribution, outlined previously

Second-Order Test: data is sorted from smallest to largest and the difference of each consecutive pair is taken, first digit distribution is then created from these list of differences

Summation Test: data is grouped by the records of the first digit and the sum is computed for each group, observe if the sums follow a uniform distribution

Theoretical Benford’s Law

Simple visualization of a theoretical Benford distribution with the given probabilities.

theory_probs = c(30.1, 17.6, 12.5, 9.7, 7.9, 6.7, 5.8, 5.1, 4.6)
digits = c(1:9)
theory_df <- data.frame(digits, theory_probs)

ggplot(theory_df, aes(x=digits, y=theory_probs)) + geom_col(mapping=aes(x=digits)) + geom_text(aes(label = theory_probs), vjust = -0.2) + xlab('Digits (1-9)') + ylab('Theoretical Probabilities (%)') + scale_x_discrete(labels=(c(1,2,3,4,5,6,7,8,9))) + ggtitle("Theoretical Benford's Distribution")

Random Number Generator (with equally likely probabilities)

It is important to notice that a random number generator with equally likely probabilities does not create a Benford distribution as expected since all possible values (thus the first digits) are weighed equally. Notice the uniform nature of the distribution.

random_numb_gen <- sample(x=1:10000, size=1000, replace=TRUE)
getFirstDigit <- function(x) {
    #floor(x / (10 ^ floor(log10(x))))
    char = as.character(x)
    substr(char, 1, 1)
}

first_digits = c()
for (i in random_numb_gen){
  first_digits = c(first_digits, as.numeric(getFirstDigit(i)))
}

ggplot(mapping = aes(first_digits)) + geom_bar() + ylab('Frequency') + xlab('First Digit (1-9)') + scale_x_discrete(labels=(c(1,2,3,4,5,6,7,8,9))) + ggtitle('Distribution of First Digits')

Observe the second-order and summation tests with the red-dotted lines being the expectation for a Benford distribution.

#random_numb_gen <- sample(x=1:10000, size=1000, replace=TRUE)
rand_benford <- benford(random_numb_gen, number.of.digits = 1, sign = 'positive')
plot(rand_benford)

Personal Income Dataset

Observe the distribution of the first digit in personal income values being close to Benford.

per_income <- read.csv('/Users/jackboydell/Desktop/MATH 266/Personal Income.csv', header =TRUE)
per_income <- per_income[,c(1,2)]
head(per_income)

##   observation_date
## 1       1959-01-01
## 2       1959-02-01
## 3       1959-03-01
## 4       1959-04-01
## 5       1959-05-01
## 6       1959-06-01
##   Personal.Income..Billions.of.Dollars..Monthly..Seasonally.Adjusted.Annual.Rate
## 1                                                                          391.8
## 2                                                                          393.7
## 3                                                                          396.5
## 4                                                                          399.9
## 5                                                                          402.4
## 6                                                                          404.8

per_inc_benford <- benford(per_income[,2], number.of.digits=1, sign='positive')
plot(per_inc_benford)

Google Stock Volume

Observe the distribution of the first digit in Google stock volume being roughly Benford with slight deviations in a few of the middle range digits (2,5,6).

Google_stock_df <- read.csv('/Users/jackboydell/Desktop/MATH 266/FAANG Stock Datasets/Google.csv', header=TRUE)
head(Google_stock_df)

##         Date     Open     High      Low    Close Adj.Close   Volume
## 1 2004-08-19 50.05005 52.08208 48.02803 50.22022  50.22022 44659000
## 2 2004-08-20 50.55556 54.59459 50.30030 54.20921  54.20921 22834300
## 3 2004-08-23 55.43043 56.79680 54.57958 54.75475  54.75475 18256100
## 4 2004-08-24 55.67567 55.85585 51.83684 52.48749  52.48749 15247300
## 5 2004-08-25 52.53253 54.05405 51.99199 53.05306  53.05306  9188600
## 6 2004-08-26 52.52753 54.02903 52.38238 54.00901  54.00901  7094800

volume <- Google_stock_df[,c(6)]
Google_volumne_benf <- benford(volume, number.of.digits=1, sign='positive')
plot(Google_volumne_benf)