Before enlisting in the military, I wanted to get out of my comfort zone, so I participated in the Data Hackathon organized by the city of Suwon, South Korea. The hackathon involved submitting a project proposal, presenting it in front of the panelists, and doing a Q&A session.
While I don’t reside in Suwon, I found that its crime rate was statistically significantly higher than that of other cities, so I developed a prototype web application that predicts when and where crime is likely to break out in the near future. To be honest, I was expecting compliments from the panelists. However, they were rather lukewarm about my proposal. One of them said, “It is not realistic to implement this, as we’ve already installed many surveillance cameras and stationed patrols around the city. These days, physical crimes like murder or rape are not that common, but we’ve seen a rising number of voice phishing scams, drug deals, and sex trafficking cases.”
I consoled myself with the fact that I had gained experience working with a new dataset and had a unique opportunity to present my findings in front of a panel. On the way home, I had back-to-back questions running through my head: ‘How do criminals commit crimes like drug dealing and sex trafficking? Do they deal strictly in cash? Or do they use bank transactions? Is there a way to find a specific transaction tied to suspicious activity?’ I started looking for data related to bank transactions, and that’s how this anomaly detection project started.
========================================================================================================================
Each record is a unique transaction.
Each column is fairly self-explanatory from its name, so I’ll just skim through them.
If Account A (balance $1000) sends $300 to Account B (balance $2000), the record will show amount = 300, oldbalanceOrig = 1000, newbalanceOrig = 700, oldbalanceDest = 2000, and newbalanceDest = 2300.
ErrorBalanceOrig & ErrorBalanceDest indicate transactions whose account balance didn’t change by exactly the transaction amount. For example, if amount = 400, oldbalanceOrig = 1000, and newbalanceOrig = 800, the balance dropped by 200 instead of 400, so ErrorBalanceOrig is flagged (“-1” in this encoding; see the sketch below).
step represents the time the transaction occurred.
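As a rough illustration, here is one common way to derive the two error-balance columns in R. This is a sketch over a hypothetical `txns` data frame with the columns described above; the exact encoding (e.g., the “-1” flag) may differ from mine.

```r
# Gap between the recorded balance change and the transaction amount;
# 0 means the balances add up exactly, nonzero flags a discrepancy
txns$ErrorBalanceOrig <- txns$newbalanceOrig + txns$amount - txns$oldbalanceOrig
txns$ErrorBalanceDest <- txns$oldbalanceDest + txns$amount - txns$newbalanceDest
```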
========================================================================================================================
Normal transactions usually occur during the daytime, while fraudulent transactions show no particular pattern across the hours of the day.
This time-of-day variable will probably have some impact on our predictive model.
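A quick sketch of deriving that variable, assuming `step` is an hourly index:

```r
# If 'step' counts hours since the start of the data,
# step %% 24 gives the hour of day
txns$hourOfDay <- txns$step %% 24

# Compare when normal vs. fraudulent transactions occur
table(txns$hourOfDay, txns$isFraud)
```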
========================================================================================================================
Fraudulent transactions only occur with types TRANSFER and CASH_OUT, which a simple cross-tab (sketched below) confirms.
Fraudulent transactions’ amounts vary from small sums all the way up to 10,000,000. From this we can infer that the data is synthetic.
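Here is a minimal sketch of that check, using the same hypothetical `txns` data frame as above:

```r
# Fraudulent rows (isFraud = 1) should appear only under TRANSFER and CASH_OUT
table(txns$type, txns$isFraud)
```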
========================================================================================================================
Let’s now run two algorithms (Logistic Regression vs. Random Forest) for the anomaly detection model and compare which one performs better. Note that I used a simple undersampling method to balance out the imbalanced dataset.
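Below is a minimal sketch of that pass: undersample the majority class, fit a Logistic Regression, and score it with pROC. The seed, feature set, and held-out `test` data frame are all assumptions for illustration, not the original settings. The AUC printed underneath comes from the Logistic Regression model.

```r
library(pROC)

# Simple undersampling: keep every fraud case plus an equal-sized
# random sample of normal cases (seed is arbitrary)
set.seed(42)
fraud    <- txns[txns$isFraud == 1, ]
normal   <- txns[txns$isFraud == 0, ]
balanced <- rbind(fraud, normal[sample(nrow(normal), nrow(fraud)), ])

# Logistic Regression on the balanced set; predictors are illustrative
lr_fit  <- glm(isFraud ~ amount + oldbalanceOrig + newbalanceOrig +
                 ErrorBalanceOrig + ErrorBalanceDest + step,
               data = balanced, family = binomial)

# Score a held-out test set and compute the AUC
lr_prob <- predict(lr_fit, newdata = test, type = "response")
roc(test$isFraud, lr_prob)
```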
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8436
========================================================================================================================
Compared to Logistic Regression, Random Forest performs better, with an AUC of 0.9964.
The plot below shows each variable’s impact on the model in descending order: newbalanceOrig and the feature-engineered ErrorBalanceOrig have the greatest impact, while variables like step (time of transaction) matter much less.
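A sketch of the Random Forest fit and the importance plot described above (hyperparameters are assumptions, not the original settings):

```r
library(randomForest)

# Random Forest on the same balanced training set
rf_fit <- randomForest(factor(isFraud) ~ amount + oldbalanceOrig +
                         newbalanceOrig + ErrorBalanceOrig +
                         ErrorBalanceDest + step,
                       data = balanced, ntree = 100)

# Variable importance, plotted in descending order
varImpPlot(rf_fit)
```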
========================================================================================================================
The plot below shows the ROC curves comparing the performance of the two models.
Random Forest (red) clearly outperforms Logistic Regression (green).
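A sketch of how such an overlay can be produced with pROC, reusing `lr_prob` from above; the `"1"` column assumes the fraud label’s factor levels are "0" and "1":

```r
# Predicted fraud probabilities from the Random Forest
rf_prob <- predict(rf_fit, newdata = test, type = "prob")[, "1"]

# Overlay the two ROC curves with the colors used in the plot
plot(roc(test$isFraud, lr_prob), col = "green")
plot(roc(test$isFraud, rf_prob), col = "red", add = TRUE)
legend("bottomright", legend = c("Random Forest", "Logistic Regression"),
       col = c("red", "green"), lwd = 2)
```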
========================================================================================================================
##               Predicted: NO   Predicted: YES
## Actual: NO           275648              393
## Actual: YES               2              838
## [1] "Accuracy: 0.998573394346308"
## [1] "Recall: 0.997619047619048"
## [1] "Precision: 0.680747359870024"
========================================================================================================================
FN: a false negative is where the model predicted the transaction to be normal (isFraud = 0), but it actually turns out to be fraudulent (isFraud = 1).
FP: a false positive is where the model predicted the transaction to be fraudulent (isFraud = 1), but it actually turns out to be normal (isFraud = 0).
The 0/1 prediction is determined by comparing the predicted probability against a threshold. The default threshold is 0.5, but in a real-life setting we weigh the trade-off between FN and FP and adjust it. In our anomaly detection case, an FN costs far more than an FP, so we take a conservative threshold of 0.01 to minimize FNs.
Accordingly, our Random Forest model has the following performance: Accuracy: 0.9986353, Recall: 0.9929329, Precision: 0.6938272.
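A minimal sketch of applying the lowered threshold to the Random Forest probabilities from above:

```r
# Flag any transaction with predicted fraud probability above 1%,
# trading precision for recall to keep false negatives low
threshold <- 0.01
rf_pred   <- ifelse(rf_prob >= threshold, 1, 0)
table(Actual = test$isFraud, Predicted = rf_pred)
```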
========================================================================================================================
I am curious what real fraud-detection data looks like. For instance, it may include information such as location, device type, IP address, customer information, and customer transaction history. Given such data, we would be able to build a more complex yet better-performing model.
My understanding is that a transaction is labeled as fraudulent only after it has occurred, and I wonder what process is used to apply that label.
That slight feeling of disappointment, together with the panelists’ feedback on my initial proposal, laid the groundwork and motivated me to do this side project on fraudulent transactions. Just as Steve Jobs put it in his “connecting the dots” speech, I think of this project as drawing one more dot in my career. And that does it for this post.