Use tree-based machine learning methods to accurately identify fraudulent transactions.
| Variable | Data Type | Definition |
|---|---|---|
| step | Interval | Maps a unit of time in the real world. In this case, 1 step is 1 hour of time. |
| type | Categorical | CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER |
| amount | Numerical | amount of the transaction in local currency |
| nameOrig | Categorical | customer who started the transaction |
| oldbalanceOrg | Numerical | initial balance before the transaction |
| newbalanceOrig | Numerical | customer’s balance after the transaction. |
| nameDest | Categorical | recipient ID of the transaction. |
| oldbalanceDest | Numerical | initial recipient balance before the transaction. |
| newbalanceDest | Numerical | recipient’s balance after the transaction. |
| isFraud | Binary | identifies a fraudulent transaction (1) or a non-fraudulent one (0) |
| isFlaggedFraud | Binary | flags illegal attempts to transfer more than 200,000 units in a single transaction. |
# Import the data; fread() comes from the data.table package
library(data.table)
transactions <- fread("C:/Users/pamel/Downloads/input/PS_20174392719_1491204439457_log.csv", header = TRUE, stringsAsFactors = TRUE)
transactions <- as.data.frame(transactions)
The output below gives us an idea of what our data looks like:
head(transactions)
The dataset contains 11 attributes and 6,362,620 transaction observations:
dim(transactions)
## [1] 6362620 11
Now we will take a look at a summary of the data. The table below provides descriptive information for each variable, including the number of missing values, mean, median, min, max, and number of factor levels:
# summarizeColumns() comes from the mlr package
summarizeColumns(transactions) %>%
  select(name, type, na, mean, median, min, max, nlevs)
From the table above, we get a sense of each variable's scale, range, and number of factor levels.
Now that we have a better idea of what we are dealing with, let's see if we can uncover any interesting insights through data visualizations.
One common issue with fraud detection data is that fraud is typically a rare event compared to non-fraud cases. Let's look at the number of fraudulent vs. non-fraudulent cases to see whether that is an issue in our data.
# Using the isFraud variable, count the number of fraud vs not fraud transactions
fraud_count<- transactions %>% count(isFraud)
print(fraud_count)
## # A tibble: 2 x 2
## isFraud n
## <int> <int>
## 1 0 6354407
## 2 1 8213
# Let's visualize this with a frequency plot
g <- ggplot(transactions, aes(isFraud))
# Number of cases in Fraud vs Not Fraud:
g + geom_bar() +
  # Label each bar with its share of all observations; after_stat() is the
  # modern replacement for the old ..count.. notation
  geom_label(stat = 'count', aes(label = after_stat(paste0(round(count/sum(count), 4)*100, "%")))) +
  labs(x = "Fraud vs Not Fraud", y = "Frequency", title = "Frequency of Fraud", subtitle = "Labels as Percent of Total Observations")
The data set is extremely imbalanced, with only 0.13% of the transactions being fraudulent. To reduce model bias, we should consider sampling methods, such as undersampling, before getting into our modeling stage.
First, let's look at transaction types for fraud and non-fraud cases.
ggplot(data = transactions, aes(x = type, fill = as.factor(isFraud))) +
  geom_bar() +
  labs(title = "Frequency of Transaction Type", subtitle = "Fraud vs Not Fraud", x = 'Transaction Type', y = 'No. of transactions') +
  theme_classic()
The plot above is not very informative for fraud transactions, since they are so rare. Let's instead plot the frequency of fraud transactions for each transaction type:
Fraud_trans_type <- transactions %>%
  group_by(type) %>%
  summarise(fraud_transactions = sum(isFraud))
ggplot(data = Fraud_trans_type, aes(x = type, y = fraud_transactions)) +
  geom_col(aes(fill = type), show.legend = FALSE) +  # fill = type (unquoted) colors each bar by its type
  labs(title = 'Fraud Transactions per Type', x = 'Transaction type', y = 'No. of Fraud Transactions') +
  geom_label(aes(label = fraud_transactions)) +
  theme_classic()
We can see from the plot above that fraud transactions consist only of the CASH_OUT and TRANSFER transaction types. This will be important later, as we can simplify our analysis by keeping only these two levels of type.
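A quick tabulation confirms this; a one-line sketch using the same dplyr verbs as above:
# Count fraud cases by transaction type; only CASH_OUT and TRANSFER should appear
transactions %>% filter(isFraud == 1) %>% count(type)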
ggplot(data = transactions[transactions$isFraud == 1, ], aes(x = amount)) +
  geom_histogram(bins = 30, fill = 'steelblue') +
  labs(title = 'Fraud Transaction Amount Distribution', y = 'No. of Fraud transactions', x = 'Amount (local currency)')
The distribution of amount for fraud transactions is heavily right-skewed, suggesting that the majority of fraud transactions involve smaller amounts.
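Because of the heavy skew, the same histogram is easier to read on a log axis. A minimal sketch using ggplot2's scale_x_log10(), which assumes all fraud amounts are positive:
# Log-scale view of fraud transaction amounts
ggplot(data = transactions[transactions$isFraud == 1, ], aes(x = amount)) +
  geom_histogram(bins = 30, fill = 'steelblue') +
  scale_x_log10() +
  labs(title = 'Fraud Transaction Amounts (log scale)', y = 'No. of Fraud transactions', x = 'Amount (local currency)')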
p1 <- ggplot(data = transactions, aes(x = factor(isFraud), y = log1p(oldbalanceOrg), fill = factor(isFraud))) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = 'Old Balance in Sender Accounts', x = 'isFraud', y = 'log1p(Balance)') +
  theme_classic()
p2 <- ggplot(data = transactions, aes(x = factor(isFraud), y = log1p(oldbalanceDest), fill = factor(isFraud))) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = 'Old Balance in Receiver Accounts', x = 'isFraud', y = 'log1p(Balance)') +
  theme_classic()
grid.arrange(p1, p2, nrow = 1)
In the majority of fraud transactions, the old balance of the origin account (the account the payment is made from) is higher than in non-fraud transactions, while the old balance of the destination account is lower.
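The same comparison can be made numerically; a short sketch computing the median old balances by fraud status (run it to see the actual values):
# Median old balances for sender (origin) and receiver (destination) accounts
transactions %>%
  group_by(isFraud) %>%
  summarise(median_old_orig = median(oldbalanceOrg),
            median_old_dest = median(oldbalanceDest))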
Each step represents one hour of real-world time, and there are 743 steps in total, covering roughly 30 days of data. Let's convert the steps into the hour of the day (0-23), so that the daily pattern repeats:
# Convert step to hour of day (0-23); %% is base R's modulo operator
transactions$hour <- transactions$step %% 24
#Plot newly formatted data
p5 <- ggplot(data = transactions, aes(x = hour)) +
  geom_bar(fill = 'steelblue') +
  labs(title = 'Total transactions at different hours', y = 'No. of transactions') +
  theme_classic()
p6 <- ggplot(data = transactions[transactions$isFraud == 1, ], aes(x = hour)) +
  geom_bar(fill = 'firebrick') +
  labs(title = 'Fraud transactions at different hours', y = 'No. of fraud transactions') +
  theme_classic()
grid.arrange(p5, p6, ncol = 1, nrow = 2)
The total number of transactions occurring between hours 0 and 9 is very low, but this is not the case for fraud transactions. We can conclude that fraud transactions happen disproportionately often between 12 a.m. and 9 a.m.
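We can make this precise by computing the share of transactions that are fraudulent in each hour; a short sketch using the same dplyr verbs as before (run it to see the rates):
# Fraud rate per hour of the day: fraud count divided by total transactions
transactions %>%
  group_by(hour) %>%
  summarise(n = n(), frauds = sum(isFraud), fraud_rate = frauds/n) %>%
  arrange(desc(fraud_rate))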
Let's also check whether there are transactions where the transaction amount is greater than the balance available in the origin account:
head(transactions[(transactions$amount > transactions$oldbalanceOrg)& (transactions$newbalanceDest > transactions$oldbalanceDest), c("amount","oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", "isFraud")], 10)
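To gauge how common transactions exceeding the sender's balance are overall, we can count them directly; a quick sketch (run it to see the counts):
# Count rows where the amount exceeds the sender's available balance...
sum(transactions$amount > transactions$oldbalanceOrg)
# ...and how many of those rows are labeled as fraud
sum(transactions$amount > transactions$oldbalanceOrg & transactions$isFraud == 1)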
As noted earlier, all fraud transactions occur in the CASH_OUT and TRANSFER types, so we filter the data to include only those two. We can also drop nameOrig and nameDest, since they have far too many unique levels to turn into dummy variables, and isFlaggedFraud, which is not useful for modeling. Finally, we remove the step column, since it was only used to derive the hour attribute.
# Filtering transactions and drop irrelevant features
transactions1<- transactions %>%
select( -one_of('step','nameOrig', 'nameDest', 'isFlaggedFraud')) %>%
filter(type %in% c('CASH_OUT','TRANSFER'))
Since transaction type is categorical, we need to create dummy variables so that all predictors are numerical:
# dummy_cols() comes from the fastDummies package; note the unused factor
# levels (CASH_IN, DEBIT, PAYMENT) still generate all-zero dummy columns
transactions1 <- dummy_cols(transactions1)
transactions1$isFraud <- as.factor(transactions1$isFraud)
transactions1 <- transactions1[,-1]  # drop the original type column
transactions1 <- as.data.frame(transactions1)
#summarizeColumns(transactions1)
We know that fraud transactions are a rare event. Since fraudulent transactions make up only 0.13% of the data, duplicating them to balance the data (oversampling) is not the best technique; it makes more sense to sample down the non-event cases through undersampling.
# 70/30 train/validation split
set.seed(12345)
train_id <- sample(seq_len(nrow(transactions1)), size = floor(0.7*nrow(transactions1)))
train <- transactions1[train_id,]
valid <- transactions1[-train_id,]
table(train$isFraud)
##
## 0 1
## 1933579 5707
table(valid$isFraud)
##
## 0 1
## 828617 2506
suppressMessages(library(unbalanced))
set.seed(12345)
prop.table(table(train$isFraud))
##
## 0 1
## 0.997057164 0.002942836
inputs <- train[,-6]  # predictors (isFraud is column 6)
target <- train[,6]   # response
# ubUnder keeps every minority (fraud) case and randomly drops majority cases
under_sam <- ubUnder(X = inputs, Y = target)
train_u <- cbind(under_sam$X, isFraud = under_sam$Y)
table(train_u$isFraud)
##
## 0 1
## 5707 5707
prop.table(table(train_u$isFraud))
##
## 0 1
## 0.5 0.5
Using the undersampling method, we obtain a balanced training data set with 5,707 observations of each class. Our validation data, however, remains imbalanced, so our evaluation will reflect the real-world class distribution.
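Note that the unbalanced package has been archived on CRAN, so it may not install cleanly; if it is unavailable, an equivalent undersample can be built with dplyr alone. A minimal sketch, assuming the same train data frame as above (n_fraud and train_u_alt are just illustrative names):
# Keep all fraud cases and sample an equal number of non-fraud cases
set.seed(12345)
n_fraud <- sum(train$isFraud == 1)
train_u_alt <- train %>%
  group_by(isFraud) %>%
  slice_sample(n = n_fraud) %>%  # fraud group keeps all rows; non-fraud group is sampled down
  ungroup() %>%
  as.data.frame()
table(train_u_alt$isFraud)  # should show n_fraud observations of each class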
We consider tree-based modeling methods, Decision Trees and Random Forests, to predict fraudulent transactions. First, we fit a decision tree using the undersampled data:
# Fit a decision tree with rpart and plot it with rpart.plot's prp()
model_dt <- rpart(isFraud ~ ., data = train_u)
prp(model_dt)
predict_dt <- predict(model_dt, valid, type = "class")
# Confusion matrix from the caret package
confusionMatrix(valid$isFraud, predict_dt)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 775510 53107
## 1 125 2381
##
## Accuracy : 0.936
## 95% CI : (0.9354, 0.9365)
## No Information Rate : 0.9332
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.0768
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.99984
## Specificity : 0.04291
## Pos Pred Value : 0.93591
## Neg Pred Value : 0.95012
## Prevalence : 0.93324
## Detection Rate : 0.93309
## Detection Prevalence : 0.99698
## Balanced Accuracy : 0.52137
##
## 'Positive' Class : 0
##
The accuracy of the model is 93.6%, but with validation data this imbalanced, accuracy alone can be misleading.
The F1 score is the harmonic mean of Precision and Recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). We can calculate this metric from the caret output above, since Sensitivity is the Recall and Pos Pred Value is the Precision:
# Recall = Sensitivity and Precision = Pos Pred Value in the caret output above
Recall_1 <- 0.99984
Precision_1 <- 0.93591
F1 <- 2*(Recall_1*Precision_1)/(Recall_1 + Precision_1)
F1*100
## [1] 96.68193
The F1 score for the decision tree model is 96.68%, which looks very good. Keep in mind, however, that caret treated 0 (non-fraud) as the positive class ('Positive' Class: 0 above), so this score mostly reflects performance on the majority class.
Now we will consider another tree-based modeling method: let's look at a random forest model with the default parameters.
# Create a Random Forest model with default parameters
rf1 <- randomForest(isFraud ~ ., data = train_u, importance = TRUE)
rf1
##
## Call:
## randomForest(formula = isFraud ~ ., data = train_u, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 2.26%
## Confusion matrix:
## 0 1 class.error
## 0 5575 132 0.02312949
## 1 126 5581 0.02207815
The OOB estimate of the error rate is 2.26%; let's see if we can improve this by tuning the parameters. In random forest models, the two main tuning parameters are ntree, the number of trees grown, and mtry, the number of predictors randomly sampled as split candidates at each node. A search sketch for mtry follows below.
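Rather than guessing mtry, randomForest's tuneRF() can search for it automatically, growing a small forest at each candidate mtry and stopping when the OOB error stops improving. A minimal sketch, assuming the undersampled train_u from above (tune_res is just an illustrative name):
# Search for a good mtry, starting from the default and stepping by 1.5x
set.seed(12345)
tune_res <- tuneRF(
  x = train_u[, setdiff(names(train_u), "isFraud")],  # predictors only
  y = train_u$isFraud,
  ntreeTry = 500,    # trees grown per candidate mtry
  stepFactor = 1.5,  # multiplicative step between mtry candidates
  improve = 0.01,    # minimum relative OOB improvement to continue searching
  trace = TRUE
)
tune_res  # matrix of (mtry, OOB error) pairs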
Here we fine-tune the parameters of the random forest model, raising mtry from 3 to 6:
rf2 <- randomForest(isFraud ~ ., data = train_u, ntree = 500, mtry = 6, importance = TRUE)
rf2
##
## Call:
## randomForest(formula = isFraud ~ ., data = train_u, ntree = 500, mtry = 6, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 1.15%
## Confusion matrix:
## 0 1 class.error
## 0 5605 102 0.017872788
## 1 29 5678 0.005081479
The OOB estimate of the error rate is 1.15%, which is better than the first model. Let's consider one more random forest scenario.
In this model, we reduce the number of trees in the forest but keep the number of predictors considered at each node split:
rf3 <- randomForest(isFraud ~ ., data = train_u, ntree = 200, mtry = 6, importance = TRUE)
rf3
##
## Call:
## randomForest(formula = isFraud ~ ., data = train_u, ntree = 200, mtry = 6, importance = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 1.1%
## Confusion matrix:
## 0 1 class.error
## 0 5614 93 0.016295777
## 1 32 5675 0.005607149
The OOB estimate of the error rate is 1.1% in this scenario, essentially unchanged from model 2 despite growing far fewer trees. We will keep model 2, which has more trees and therefore more stable estimates, and use it to score our validation data.
Here, we check how well our random forest model predicts on the training data set:
# Predicting on train set
predTrain <- predict(rf2, train, type = "class")
# Checking classification accuracy
table(predTrain, train$isFraud)
##
## predTrain 0 1
## 0 1906590 0
## 1 26989 5707
The model recovers all 5,707 fraud cases in the training data, at the cost of 26,989 false alarms on non-fraud cases. Now let's find out how well it does on the validation set.
# Predicting on Validation set
predValid <- predict(rf2, valid, type = "class")
# Checking classification accuracy
mean(predValid == valid$isFraud)
## [1] 0.9860442
table(predValid,valid$isFraud)
##
## predValid 0 1
## 0 817029 11
## 1 11588 2495
Using accuracy, the random forest model performed very well at 98.6%; it also misses only 11 of the 2,506 fraud cases in the validation set.
Let's calculate the F1 score for our random forest model, reading the counts off the confusion matrix above (with non-fraud, 0, as the positive class, matching caret's earlier convention):
# Counts read off the confusion matrix above, with non-fraud (0) as the
# positive class to match caret's earlier convention
TP <- 817029  # non-fraud predicted as non-fraud
TN <- 2495    # fraud predicted as fraud
FP <- 11      # fraud predicted as non-fraud
FN <- 11588   # non-fraud predicted as fraud
#Calculate Recall
Recall <- TP/(TP + FN)
Recall*100
## [1] 98.60153
#Calculate Precision
Precision <- TP/(TP + FP)
#Calculate F1
F1_rf <- 2*(Recall*Precision)/(Recall + Precision)
F1_rf*100
## [1] 99.29518
The random forest model has an F1 score of 99.3%, noticeably higher than the decision tree's 96.68%.
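For completeness, caret can compute all of these metrics directly. A one-line sketch that treats fraud (1) as the positive class, so that Precision, Recall, and F1 describe the fraud class itself rather than the majority class (run it to see the values):
# mode = "prec_recall" makes confusionMatrix report Precision, Recall, and F1
confusionMatrix(predValid, valid$isFraud, positive = "1", mode = "prec_recall")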
# To check important variables
importance(rf2)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## amount 35.57795116 39.50215 47.12195 777.08510
## oldbalanceOrg 105.44569497 67.92024 124.15724 2745.65796
## newbalanceOrig 116.79263642 67.34054 115.94945 467.37422
## oldbalanceDest 49.88811238 31.01897 51.63019 282.64662
## newbalanceDest 30.76430827 54.88612 58.74716 976.36467
## hour 37.39136275 20.16681 39.19163 324.08880
## type_CASH_IN 0.00000000 0.00000 0.00000 0.00000
## type_CASH_OUT 0.14048173 20.74286 20.76799 64.98086
## type_DEBIT 0.00000000 0.00000 0.00000 0.00000
## type_PAYMENT 0.00000000 0.00000 0.00000 0.00000
## type_TRANSFER 0.09748201 21.57515 21.42828 64.30536
varImpPlot(rf2)
The plots above rank the variables in our random forest model from most to least important when predicting the response. This helps with the interpretability issue that comes with using random forest models.
Our decision tree model achieved 93.6% accuracy and an F1 score of 96.68% on the validation data; our random forest model achieved 98.6% accuracy and an F1 score of 99.3%. Typically, in this scenario I would use the F1 score to determine which model to use, since F1 is a good balance of Precision and Recall, and by that measure the random forest clearly outperforms the decision tree. I also prefer Recall over accuracy, even though both point the same way, because Recall is a good measure when there is a high cost associated with False Negatives. Since the cost of falsely assuming a transaction is not fraud is much higher for the company than flagging a legitimate transaction as fraud, Recall is an appropriate lens here. The company should use the random forest model in order to predict fraud.
PaySim provided simulated data for fraudulent mobile money transactions. The goal of this project was to accurately predict fraudulent transactions using tree-based machine learning algorithms. Both models performed much better than expected, possibly due to the nature of the simulated data.