1 Data Information

This preliminary study uses a public, synthetically generated dataset of financial data for fraud detection, which is available here. The dataset is stratified as described below and, as the pie chart shows, its classes are extremely imbalanced.

1.1 Synthetic Class Balancing

Class balancing is crucial for training machine learning algorithms and must be dealt with before the learning process. For this preliminary study, and given limited access to computing resources, I opted for random undersampling, which randomly draws samples from the majority class (i.e. non-fraud) until it matches the size of the minority class, yielding a balanced data set.
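As a rough illustration of random undersampling, here is a minimal sketch using only the Python standard library. It is not the code used in this study: the function name `random_undersample`, its parameters, and the toy data are all hypothetical, and it assumes that `majority_label` really is the larger class.

```python
import random

def random_undersample(rows, labels, majority_label=0, seed=0):
    """Randomly drop majority-class rows until both classes have the same size.

    Assumes `majority_label` marks the larger class (here: non-fraud).
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    minority_idx = [i for i, y in enumerate(labels) if y != majority_label]
    # Draw, without replacement, as many majority rows as there are minority rows.
    kept_majority = rng.sample(majority_idx, k=len(minority_idx))
    keep = kept_majority + minority_idx
    rng.shuffle(keep)  # avoid leaving the classes grouped together
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Hypothetical toy data: 1000 transactions, the first 10 labelled as fraud (1).
toy_labels = [1] * 10 + [0] * 990
toy_rows = list(range(1000))

balanced_rows, balanced_labels = random_undersample(toy_rows, toy_labels)
print(sum(balanced_labels), len(balanced_labels))  # 10 fraud rows out of 20 in total
```

All minority (fraud) rows are kept, so the cost of this approach is the discarded majority data rather than any loss of fraud examples.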

2 Machine Learning

2.1 Supervised Classifiers

2.1.1 Accuracy Results

2.1.2 Log-loss Results

3 What’s next?

Preliminary results indicate that, using the random undersampling technique, supervised learning algorithms were able to generate mapping functions for detecting financial fraud with favorable predictive performance. Furthermore, discrepancies in the performance of the algorithms were observed when more than one classification metric was taken into account, which stresses the need to evaluate the algorithms with several classification metrics.
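To see why accuracy and log-loss can rank classifiers differently, consider the following small sketch. The labels and predicted probabilities are invented for illustration and are not results from this study; `accuracy` and `log_loss` are hand-written helpers following the standard definitions, with a 0.5 decision threshold assumed.

```python
import math

def accuracy(y_true, p_fraud, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = [1 if p >= threshold else 0 for p in p_fraud]
    return sum(int(y == pred) for y, pred in zip(y_true, predictions)) / len(y_true)

def log_loss(y_true, p_fraud, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_fraud):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

# Hypothetical labels (1 = fraud) and predicted fraud probabilities.
y_true = [1, 1, 0, 0]
model_a = [0.90, 0.40, 0.10, 0.20]  # one fraud case missed, the rest confident
model_b = [0.55, 0.55, 0.45, 0.45]  # every case right, but only barely

for name, probs in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(y_true, probs), round(log_loss(y_true, probs), 3))
```

In this toy case model B has the higher accuracy while model A has the lower (better) log-loss, because log-loss rewards well-calibrated, confident probabilities and accuracy only looks at which side of the threshold each prediction falls.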

Future work will focus on modeling and training neural networks for predicting financial fraud using this synthetically balanced data set. Ultimately, the predictive performance of all the algorithms used here will be compared.

3.1 Neural Networks

3.1.1 Accuracy Results

3.1.2 Log-loss Results