ANOMALY DETECTION IN PUBLIC EXPENDITURES: THE SÃO PAULO DEPARTMENT OF PUBLIC HEALTH CASE
RESEARCH INTEREST
Applicability of machine learning and AI in identifying disruptive patterns in public and nonprofit budgeting, finance, and accounting.
1 CONTEXT
- While AI is still expensive, ML is scalable and accessible
- There are multiple algorithms available
- Algorithms are complex and general
- Algorithms must be studied for specialized uses
- Anomaly detection models can optimize public oversight by minimizing false positives
2 OBJECTIVE
To assess the efficacy of the Isolation Forest (iForest) algorithm in identifying public expenditure anomalies that merit investigation by governmental oversight bodies.
3 RESEARCH QUESTION
Can iForest leverage granular expenditure data to objectively uncover disruptive patterns of fraud or errors?
4 HYPOTHESES
- iForest is effective at identifying anomalous expenditure amounts based on public expenditure elements
- iForest is not as effective as simple aggregation at identifying inaccurate ledger entries based on public expenditure categorization
5 REVIEW OF LITERATURE
- Most research is ML-centered or private-sector oriented
- Most research applied to the public sector lacks evidence because labels are unavailable
- Most research applies a single method rather than comparing methods to find the most effective one for each case
6 CONCEPTS
ML is a field of AI that enables computers to “learn” patterns from data, without being explicitly programmed for specific tasks.
Supervised learning requires “labeled” training datasets. Unsupervised learning does not require such labels and works instead from patterns internal to the data.
7 METHODOLOGY
- The industry standard for tabular ML is the scikit-learn package in Python. For Deep Learning and AI, the standards are TensorFlow and PyTorch.
- R offers ML packages, but they are not as popular
- Stata is not in the game
- Presentation: Quarto
7.2 Data
- Census of public expenditures downloaded from the open data website of the state government of São Paulo, Brazil.
- The datasets cover a period of 10 years, from 2014 to 2023.
- The dataset comprised approximately 1.6 million rows and 36 features, of which 5 were used in the model after cleaning and reshaping. The remaining features were useful for identifying the flagged transactions.
7.3 Dataset overview - Clean data
7.4 Expenditure categorization in Brazil
7.5 iForest algorithm
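- One of the most popular algorithms in anomaly detection
- Works with an ensemble of decision trees that can be built in parallel
- Each tree recursively splits the data on randomly chosen features and split values, seeking to isolate observations
- Anomalous observations are isolated after fewer splits than typical ones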
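The isolation idea above can be illustrated with a toy, one-dimensional sketch. This is not the iForest implementation itself: the function `isolation_depth` and the synthetic amounts are hypothetical, and the code only counts how many random splits it takes before a chosen value is alone in its partition.

```python
import numpy as np

def isolation_depth(values, target_index, rng, max_depth=64):
    """Count the random splits needed before the target value sits alone."""
    current = np.asarray(values, dtype=float)
    target = current[target_index]
    depth = 0
    while len(current) > 1 and depth < max_depth:
        low, high = current.min(), current.max()
        if low == high:  # only duplicates left, cannot split further
            break
        split = rng.uniform(low, high)
        # Keep only the side of the split that contains the target.
        current = current[current < split] if target < split else current[current >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
# Synthetic data (assumption): 200 typical amounts plus one extreme amount.
amounts = np.append(rng.normal(100.0, 10.0, size=200), 10_000.0)
outlier_idx = len(amounts) - 1
typical_idx = int(np.argsort(amounts)[len(amounts) // 2])  # value at the median

trials = 200
avg_outlier = np.mean([isolation_depth(amounts, outlier_idx, rng) for _ in range(trials)])
avg_typical = np.mean([isolation_depth(amounts, typical_idx, rng) for _ in range(trials)])
print(f"average splits: outlier={avg_outlier:.2f}, typical={avg_typical:.2f}")
```

Averaged over many random trees, the extreme amount needs far fewer splits than the median amount, which is the property iForest turns into an anomaly score.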
7.6 iForest choice
- Unsupervised algorithm
- Fast processing, good scalability
- Good for millions of rows and multiple dimensions
- No need to determine the number of clusters beforehand (it is not a clustering algorithm)
- Good for outlier identification
- Nonparametric
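As a minimal sketch of how these properties look in practice, the snippet below fits scikit-learn's `IsolationForest` without labels and without a cluster count. The synthetic amounts and the parameter values (`n_estimators=100`, `contamination=0.01`) are assumptions for illustration, not the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic data (assumption): 1,000 skewed "typical" amounts plus 5 extreme ones.
typical = rng.lognormal(mean=8.0, sigma=1.0, size=(1000, 1))
extreme = np.array([[5e7], [8e7], [1e8], [2e8], [5e8]])
X = np.vstack([typical, extreme])

# No labels and no cluster count are required; contamination sets the share
# of observations that will be flagged.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
labels = model.fit_predict(X)      # -1 = flagged as anomaly, 1 = inlier
scores = model.score_samples(X)    # lower score = more anomalous

flagged = np.where(labels == -1)[0]
print(f"{len(flagged)} of {len(X)} observations flagged")
```

Because `contamination` fixes the share of flagged observations in advance, the number of candidates handed to human reviewers is a modeling choice rather than a finding.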
7.7 Observational design
- Summarized category tables to reveal unique combinations of expenditure categories
- Unusual combinations could reveal error or fraud
- Will the algorithm spot these changes?
- Assumption: Anomalous category combinations are not restricted by the financial systems
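The summarization step above could be sketched with a pandas group-by. The column names (`category`, `element`, `amount`) and the toy rows are hypothetical; the actual dataset's fields are not listed in these slides.

```python
import pandas as pd

# Toy ledger (assumption): one row per transaction.
df = pd.DataFrame({
    "category": ["current", "current", "current", "current", "current",
                 "capital", "capital", "current"],
    "element": ["supplies", "supplies", "supplies", "services", "services",
                "equipment", "equipment", "equipment"],
    "amount": [100.0, 250.0, 90.0, 80.0, 60.0, 5000.0, 7000.0, 120.0],
})

# One row per unique category/element combination, with its frequency and total.
combos = (
    df.groupby(["category", "element"], as_index=False)
      .agg(n_transactions=("amount", "size"), total_amount=("amount", "sum"))
      .sort_values("n_transactions")
)

# Combinations that occur only once are candidates for human review.
rare = combos[combos["n_transactions"] == 1]
print(rare)
```

In this toy table the only rare combination is a capital-type element booked under a current-expense category, the kind of entry a reviewer would then inspect by hand.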
7.8 Quasi-experimental design
- Group T1: Three sets of 20 randomly selected transactions had their accrued value manually multiplied by 20, 100, and 500, respectively, totaling 60 transactions
- Group T2: 60 randomly selected transactions had their expense element changed to another element inconsistent with their category (current expenses with capital elements, and vice versa)
- Control group: Remaining transactions
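A sketch of how such treatments might be injected into a dataset is shown below. The synthetic data, the column names, and the element lists are hypothetical, and drawing T1 and T2 from disjoint transactions is an assumption, since the slides do not say whether the groups could overlap.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical mapping of categories to their valid elements.
elements = {"current": ["supplies", "services"], "capital": ["equipment", "works"]}
other = {"current": "capital", "capital": "current"}

# Synthetic ledger (assumption).
df = pd.DataFrame({
    "category": rng.choice(["current", "capital"], size=n, p=[0.9, 0.1]),
    "amount": rng.lognormal(8.0, 1.0, size=n),
})
df["element"] = [rng.choice(elements[c]) for c in df["category"]]
df["group"] = "control"

# Draw 120 distinct transactions: 60 for T1 and 60 for T2.
treated = rng.choice(df.index.to_numpy(), size=120, replace=False)
t1, t2 = treated[:60], treated[60:]

# T1: three sets of 20 transactions, multiplied by 20, 100 and 500.
for chunk, factor in zip(np.split(t1, 3), (20, 100, 500)):
    df.loc[chunk, "amount"] *= factor
df.loc[t1, "group"] = "T1"

# T2: replace the element with one that belongs to the opposite category.
df.loc[t2, "element"] = [rng.choice(elements[other[c]]) for c in df.loc[t2, "category"]]
df.loc[t2, "group"] = "T2"

print(df["group"].value_counts())
```

Keeping the `group` column out of the model features and using it only afterwards makes it possible to check how many treated transactions the model flags.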
8 RESULTS
Distribution of categories
Selected measure
Descriptive of elements by year (In BRL)
Descriptive of categories by year (In BRL)
Descriptive per bidding type by year (In BRL)
Descriptive - Categories (In BRL)
Descriptive - Main elements (In BRL)
Visual normality test
The Shapiro-Wilk test does not handle big data. The Anderson-Darling test is too sensitive with big data.
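One common visual normality check is a Q-Q plot; the slides do not state which plot was used, so the following is only a sketch under that assumption. It uses `scipy.stats.probplot` on synthetic, skewed amounts and reports the correlation of the Q-Q points, which is close to 1 when the points fall on a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic, right-skewed amounts (assumption), similar in shape to expenditure data.
amounts = rng.lognormal(mean=8.0, sigma=1.5, size=100_000)

# probplot returns the Q-Q points and a least-squares fit (slope, intercept, r).
(_, _), (_, _, r_raw) = stats.probplot(amounts, dist="norm")
(_, _), (_, _, r_log) = stats.probplot(np.log(amounts), dist="norm")

print(f"Q-Q correlation, raw amounts: {r_raw:.3f}")
print(f"Q-Q correlation, log amounts: {r_log:.3f}")

# To draw the plot itself, pass a matplotlib Axes via the plot= argument,
# e.g. stats.probplot(amounts, dist="norm", plot=ax).
```

Unlike a formal test, this check does not produce a p-value that shrinks mechanically with sample size, which is why a visual approach suits a dataset of this scale.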
Histograms: entire period (In BRL)
Anomaly score distribution
9 CONCLUSION
Numerous simulations, with different parameters, revealed that the algorithm (i) was ineffective at identifying errors in expenditure categorization, and (ii) had low effectiveness in identifying numerical outliers.
The simulation that detected the most errors returned 8 of the treated transactions, out of a total of 1,580 possible outliers. Of these 8, 6 were transactions multiplied by 500, one by 20, and one by 100.
The most efficient simulation returned 4 treated transactions out of 158 possible outliers.
No transaction treated for categorization was identified.
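The figures above can be turned into simple rates. The sketch below assumes that "possible outliers" means the number of transactions flagged by the model and that the relevant denominator for recall is the 60 value-treated (T1) transactions; both readings are assumptions. Flagged transactions that were not treated are not necessarily false alarms, since the real data may contain genuine anomalies.

```python
def rates(hits, flagged, treated):
    """Share of flagged items that were treated, and share of treated items recovered."""
    return hits / flagged, hits / treated

TREATED_T1 = 60  # value-treated transactions in the quasi-experimental design

# Simulation that detected the most treated transactions: 8 out of 1,580 flagged.
share_a, recall_a = rates(hits=8, flagged=1_580, treated=TREATED_T1)

# Most efficient simulation: 4 out of 158 flagged.
share_b, recall_b = rates(hits=4, flagged=158, treated=TREATED_T1)

print(f"most detections: {share_a:.2%} of flagged were treated, {recall_a:.1%} of T1 recovered")
print(f"most efficient:  {share_b:.2%} of flagged were treated, {recall_b:.1%} of T1 recovered")
```

Under these assumptions, shrinking the flagged set by a factor of ten raises the share of treated transactions among flagged ones about fivefold while halving the number of treated transactions recovered.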
Grouping was more effective for identifying incorrect categorization, although it required intensive human review.
- The descriptive analysis showed that the values per element were dense, with a high occurrence of outliers. These outliers were nearly evenly distributed, except for the extreme ones.
- The model easily identifies outliers separated by “blank space” in the distribution
- In practical applications, errors and fraud are not necessarily so extreme
- Data on line items might have allowed for more accurate anomaly isolation due to narrower and more even distribution ranges
- For financial outliers, the model might be useful for narrowing down the scope for human analysis
- Although applicable to categorization, mere grouping does not address individualized outlier behavior in the context of each group
10 VALIDITY
- Census-based internal assessment: internal validity is not an issue.
- Country and state were not randomly selected: selection bias, which hurts external validity
- Not adjusted for inflation
11 FURTHER RESEARCH
Complementary data:
- Employees posting the entries, role, age, income
- Date, time and location of the entries
- Managers and directors approving
Comparative research:
- Replicate the model between states and local governments
- Apply different ML models, including clustering and supervised models
Experimental research:
- Partnerships with government to test identified anomalies
- With labels available, test supervised models