ANOMALY DETECTION IN PUBLIC EXPENDITURES: THE SÃO PAULO DEPARTMENT OF PUBLIC HEALTH CASE

John Vasconcelos

RESEARCH INTEREST

Applicability of machine learning and AI in identifying disruptive patterns in public and nonprofit budgeting, finance, and accounting.

1 CONTEXT

  • While AI is still expensive, ML is scalable and accessible
  • There are multiple algorithms available
  • Algorithms are complex and general
  • Algorithms must be studied for specialized uses
  • Anomaly detection models can optimize public oversight by minimizing false positives

2 OBJECTIVE

To assess the efficacy of the Isolation Forest (iForest) algorithm in identifying public expenditure anomalies that merit investigation by governmental oversight bodies.

3 RESEARCH QUESTION

Can iForest leverage granular expenditure data to objectively uncover disruptive patterns of fraud or errors?

4 HYPOTHESIS

  • iForest is effective at identifying anomalous expenditure amounts based on public expenditure elements
  • iForest is less effective than simple aggregation at identifying inaccurate ledger entries based on public expenditure categorization

5 REVIEW OF LITERATURE

  • Most research is ML-centered or private-sector oriented
  • Most research applied to the public sector lacks validation because labeled data are unavailable
  • Most research applies a single method rather than comparing methods to find the most effective one for each case

6 CONCEPTS

  • ML is a field of AI that enables computers to “learn” patterns from data, without being explicitly programmed for specific tasks.

  • Supervised learning requires “labeled” training datasets. Unsupervised learning does not require such labels, working on internal patterns.

7 METHODOLOGY

7.1 Tools

  • The industry standard for tabular ML is the scikit-learn package in Python. For Deep Learning and AI, the standards are TensorFlow and PyTorch.
  • R offers ML packages, but they are not as popular
  • Stata is not in the game
  • Presentation: Quarto

7.2 Data

  • Census of public expenditures downloaded from the open data website of the state government of São Paulo, Brazil.
  • The datasets utilized covered a period of 10 years, from 2014 to 2023.
  • The dataset was composed of approximately 1.6 million rows across 36 features, of which 5 were used in the model after cleaning and reshaping. The remaining features were useful for identifying the flagged transactions.

7.3 Dataset overview - Clean data


7.4 Expenditure categorization in Brazil

7.5 iForest algorithm

  • One of the most popular in anomaly detection
  • Builds an ensemble of decision trees that can be trained in parallel
  • Each tree repeatedly splits the data on category or value thresholds, seeking to isolate observations
  • Anomalous observations are isolated after fewer splits
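The "fewer splits" idea above can be illustrated with a minimal one-dimensional sketch. This is not the scikit-learn implementation: the function `isolation_depth`, the amounts, and the number of trials are all hypothetical and chosen only for illustration.

```python
import random
import statistics


def isolation_depth(values, target, rng):
    """Count the random splits needed before `target` is alone in its partition."""
    partition = list(values)
    depth = 0
    while len(partition) > 1:
        lo, hi = min(partition), max(partition)
        if lo == hi:  # only duplicates of the target remain; no further split possible
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target.
        if target < split:
            partition = [v for v in partition if v < split]
        else:
            partition = [v for v in partition if v >= split]
        depth += 1
    return depth


rng = random.Random(0)

# Hypothetical amounts: 200 ordinary values plus one extreme value.
ordinary = [rng.gauss(100.0, 10.0) for _ in range(200)]
extreme = 5000.0
amounts = ordinary + [extreme]
typical = sorted(ordinary)[len(ordinary) // 2]  # a value near the median

# Average the depth over many random "trees" for each of the two points.
mean_depth_extreme = statistics.mean(
    isolation_depth(amounts, extreme, rng) for _ in range(500)
)
mean_depth_typical = statistics.mean(
    isolation_depth(amounts, typical, rng) for _ in range(500)
)

print(f"extreme value: {mean_depth_extreme:.2f} splits on average")
print(f"typical value: {mean_depth_typical:.2f} splits on average")
```

The extreme value sits far from the rest, so almost any random split separates it at once, while a value in the dense center needs many more splits.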

7.6 iForest choice

  • Unsupervised algorithm
  • Fast processing, good scalability
  • Good for millions of rows and multiple dimensions
  • No need to determine the number of clusters in advance (it is not a clustering algorithm)
  • Good for outlier identification
  • Nonparametric
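As a rough illustration of these properties, the sketch below fits scikit-learn's `IsolationForest` without labels and without a cluster count. The data, the `contamination` value, and the other parameters are assumptions made for the example, not the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical feature matrix: 200 ordinary rows with 2 numeric features,
# plus one extreme row appended at the end.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[50.0, 50.0]]])

# `contamination` is the assumed share of anomalies; it only sets the flagging threshold.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)

labels = model.fit_predict(X)    # -1 = flagged as anomalous, 1 = not flagged
scores = model.score_samples(X)  # lower score = more anomalous

print("rows flagged:", int((labels == -1).sum()))
print("most anomalous row index:", int(scores.argmin()))
```

The choice of `contamination` directly controls how many rows are returned for human review, which matters when comparing runs that flag different numbers of observations.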

7.7 Observational design

  • Summarized category tables to reveal unique combinations of expenditure categories
  • Unusual combinations could reveal error or fraud
  • Will the algorithm spot these changes?
  • Assumption: Anomalous category combinations are not restricted by the financial systems
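The grouping step described above can be sketched as a simple frequency count of category combinations. The category and element names, the row counts, and the 1% rarity threshold are hypothetical and serve only to show the idea.

```python
from collections import Counter

# Hypothetical ledger rows: (expenditure category, expense element).
rows = (
    [("current", "supplies")] * 60
    + [("current", "services")] * 35
    + [("capital", "equipment")] * 24
    + [("capital", "supplies")] * 1  # an unusual combination
)

combo_counts = Counter(rows)
total = sum(combo_counts.values())

# Assumed rule of thumb: combinations below 1% of all rows go to human review.
rare_combos = [combo for combo, n in combo_counts.items() if n / total < 0.01]

for combo, n in combo_counts.most_common():
    print(combo, n)
print("for review:", rare_combos)
```

A table like this still needs a person to judge whether a rare combination is an error, a fraud signal, or simply a legitimate but infrequent expense.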

7.8 Quasi-experimental design

  • Group T1: three sets of 20 randomly selected transactions had their accrued value manually multiplied by 20, 100, and 500, respectively, totaling 60 transactions
  • Group T2: 60 randomly selected transactions had their expense element changed to another element inconsistent with their category (current expenses with capital elements, and vice versa)
  • Control group: Remaining transactions
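A minimal sketch of how such treatments could be injected reproducibly is shown below. The synthetic ledger, the two-valued `element_type` column, the seed, and the assumption that T1 and T2 do not overlap are illustrative choices, not details taken from the study.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the ledger: an accrued amount and an element type per row.
n_rows = 10_000
amounts = rng.lognormal(mean=8.0, sigma=1.5, size=n_rows)
element_type = rng.choice(np.array(["current", "capital"]), size=n_rows)

treated_amounts = amounts.copy()
treated_element_type = element_type.copy()

# T1: 60 distinct rows, split into three sets of 20, one set per multiplier.
t1_rows = rng.choice(n_rows, size=60, replace=False)
for multiplier, rows in zip((20, 100, 500), np.split(t1_rows, 3)):
    treated_amounts[rows] *= multiplier

# T2: 60 further rows (assumed disjoint from T1) get an element type that
# contradicts the original one: current becomes capital and vice versa.
remaining = np.setdiff1d(np.arange(n_rows), t1_rows)
t2_rows = rng.choice(remaining, size=60, replace=False)
treated_element_type[t2_rows] = np.where(
    element_type[t2_rows] == "current", "capital", "current"
)

# Ground-truth flags, kept aside to evaluate what the model later returns.
is_treated = np.zeros(n_rows, dtype=bool)
is_treated[t1_rows] = True
is_treated[t2_rows] = True

print("treated rows:", int(is_treated.sum()), "control rows:", int((~is_treated).sum()))
```

Keeping the treated row indices separate from the model inputs gives the labels that the otherwise unlabeled data lack.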

8 RESULTS

Distribution of categories

Selected measure

Descriptive of elements by year (In BRL)

Descriptive of categories by year (In BRL)

Descriptive per bidding type by year (In BRL)

Descriptive - Categories (In BRL)

Descriptive - Main elements (In BRL)

Visual normality test

The Shapiro-Wilk test does not handle big data. The Anderson-Darling test is too sensitive with big data.
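One common visual check that scales to large samples is a normal Q-Q plot built from a fixed number of quantiles. The sketch below assumes that kind of check; the slides do not state which plot was used. The helper names, the simulated data, and the 1,000-point grid are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def qq_points(sample, n_points=1000):
    """Return (theoretical normal quantiles, observed quantiles) on a fixed grid."""
    probs = (np.arange(1, n_points + 1) - 0.5) / n_points
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    observed = np.quantile(sample, probs)
    return theoretical, observed


def qq_correlation(sample):
    """Correlation of the Q-Q points; values near 1 mean the plot is close to a line."""
    theoretical, observed = qq_points(sample)
    return float(np.corrcoef(theoretical, observed)[0, 1])


rng = np.random.default_rng(0)
normal_like = rng.normal(1000.0, 50.0, size=100_000)
right_skewed = rng.lognormal(mean=8.0, sigma=1.5, size=100_000)  # amount-like data

corr_normal = qq_correlation(normal_like)
corr_skewed = qq_correlation(right_skewed)
print(f"normal-like sample:  {corr_normal:.4f}")
print(f"right-skewed sample: {corr_skewed:.4f}")
```

Plotting `observed` against `theoretical` gives the usual Q-Q picture, and because the grid size is fixed the cost stays manageable at millions of rows.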

Histograms: entire period (In BRL)

Anomaly score distribution

9 CONCLUSION

Numerous simulations with different parameters revealed that the algorithm (i) was ineffective at identifying errors in expenditure categorization and (ii) had low effectiveness in identifying numerical outliers.

  • The simulation that detected the most errors returned 8 of the treated transactions among the 1,580 observations flagged as possible outliers. Of these 8, 6 were transactions multiplied by 500, one by 20, and one by 100.

  • The most efficient simulation returned 4 treated transactions out of 158 flagged observations.

  • No transaction treated with a categorization change was identified
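The reported counts can be restated as precision (treated transactions among those flagged) and recall (treated transactions found among the 60 in group T1). The calculation below uses only the figures above. It assumes that every hit in both runs came from T1, which follows from the categorization result if that result holds for both runs.

```python
# Figures reported in the conclusion.
runs = {
    "most errors detected": {"hits": 8, "flagged": 1580},
    "most efficient": {"hits": 4, "flagged": 158},
}
t1_treated = 60  # amount-multiplied transactions (group T1)

summary = {}
for name, run in runs.items():
    summary[name] = {
        # Share of flagged observations that were treated transactions.
        "precision": run["hits"] / run["flagged"],
        # Share of the T1 transactions that the run recovered.
        "recall_t1": run["hits"] / t1_treated,
    }

for name, metrics in summary.items():
    print(f"{name}: precision {metrics['precision']:.2%}, T1 recall {metrics['recall_t1']:.2%}")
```

Flagging ten times fewer observations raised precision about fivefold, but it halved the number of treated transactions recovered.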

Grouping was more effective for identifying incorrect categorization, although it required intense human observation.

  • The descriptive analysis showed that the values per element were dense, with a high occurrence of outliers. These outliers were nearly evenly distributed except for extreme outliers.
  • The model easily spots outliers that are separated from the rest of the distribution by “blank space”
  • In practical applications, errors and fraud are not necessarily so extreme

  • Data on line items might have allowed for more accurate anomaly isolation due to shorter and more even distribution ranges
  • For financial outliers, the model might be useful for narrowing down the scope for human analysis
  • Although applicable to categorization, mere grouping does not address individualized outlier behavior in the context of each group

10 VALIDITY

  • Census-based internal assessment. Internal validity is not an issue.
  • Country and state were not randomly selected: selection bias hurts external validity
  • Not adjusted for inflation

11 FURTHER RESEARCH

Complementary data:

  • Employees posting the entries, role, age, income
  • Date, time and location of the entries
  • Managers and directors approving

Comparative research:

  • Replicate the model between states and local governments
  • Apply different ML models, including clustering and supervised

Experimental research:

  • Partnerships with government to test identified anomalies
  • With labels available, test supervised models

Questions?