1 Goal

Use tree-based machine learning methods to accurately identify fraudulent transactions.

3 About the Data

The dataset used in this project is from https://www.kaggle.com/ntnu-testimon/paysim1. PaySim provides synthetic mobile money transaction data covering a one-month period. The dataset contains roughly 6.3 million observations of 11 attributes. A summary of variable characteristics is provided below:
Variable       | Data Type   | Definition
step           | Interval    | Maps a unit of time in the real world; here 1 step is 1 hour.
type           | Categorical | CASH_IN, CASH_OUT, DEBIT, PAYMENT, or TRANSFER.
amount         | Numerical   | Amount of the transaction in local currency.
nameOrig       | Categorical | ID of the customer who started the transaction.
oldbalanceOrg  | Numerical   | Sender's balance before the transaction.
newbalanceOrig | Numerical   | Sender's balance after the transaction.
nameDest       | Categorical | ID of the recipient of the transaction.
oldbalanceDest | Numerical   | Recipient's balance before the transaction.
newbalanceDest | Numerical   | Recipient's balance after the transaction.
isFraud        | Binary      | Identifies fraudulent (1) and non-fraudulent (0) transactions.
isFlaggedFraud | Binary      | Flags illegal attempts to transfer more than 200,000 units in a single transaction.

4 Data Preprocessing

4.1 Load Libraries and Data

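We first load the packages used throughout this analysis. (The package list is inferred from the functions called later in this report, e.g. fastDummies for dummy_cols and mlr for summarizeColumns, so treat it as an assumption rather than the original setup chunk.)

#load libraries
library(data.table)    # fread
library(dplyr)         # pipes and data manipulation
library(ggplot2)       # plots
library(gridExtra)     # grid.arrange
library(mlr)           # summarizeColumns
library(fastDummies)   # dummy_cols
library(rpart)         # decision trees
library(rpart.plot)    # prp
library(randomForest)  # random forests
library(caret)         # confusionMatrix
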
#import data
transactions <- fread("C:/Users/pamel/Downloads/input/PS_20174392719_1491204439457_log.csv",
                      header = TRUE, stringsAsFactors = TRUE)
transactions <- as.data.frame(transactions)

4.2 Overview of Data

The output below gives us an idea of what our data looks like:

head(transactions)

The dataset contains 11 attributes and 6,362,620 transaction observations.

dim(transactions)
## [1] 6362620      11

Now we will take a look at a summary of the data. The table below provides descriptive information for each variable, including:

  • Type of Variable: type
  • Missing Values: na
  • Factors (for categorical variables): nlevs
summarizeColumns(transactions) %>%
  select(name,type,na,mean,median,min,max,nlevs)

4.3 Key Takeaways

From the table above, we can draw the following takeaways:

  • Our dataset contains over 6.3 million observations.
  • The data has 11 attributes, including 2 binary variables and 3 categorical variables. We will need to address the categorical variables to prepare the data for modeling.
  • There are no missing values! Since none of the data is missing, we do not have to consider any methods for imputing missing values. (A quick check is sketched below.)
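
To verify the na column of the summary, we can count missing values directly; a one-line check (given the summary above, it should return 0):

# Count missing values across the entire data frame (expected: 0)
sum(is.na(transactions))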

5 Exploratory Data Analysis

Now that we have a better idea of what we are dealing with, let's see if we can uncover any interesting insights through data visualizations.

5.1 How many transactions are fraudulent?

One of the common issues with fraud detection data is that fraud is typically a rare event compared to non-fraud cases. Let's look at the number of fraudulent vs. non-fraudulent cases to see whether that is an issue in our data.

# Using the isFraud variable, count the number of fraud vs not fraud transactions
fraud_count<- transactions %>% count(isFraud)
print(fraud_count)
## # A tibble: 2 x 2
##   isFraud       n
##     <int>   <int>
## 1       0 6354407
## 2       1    8213
# Let's visualize this through a frequency plot
g <- ggplot(transactions, aes(isFraud))
# Number of cases in Fraud vs Not Fraud:
g + geom_bar() +
  # add the percentage of each class as a label
  geom_label(stat = 'count', aes(label = paste0(round((..count..) / sum(..count..), 4) * 100, "%"))) +
  labs(x = "Fraud vs Not Fraud", y = "Frequency", title = "Frequency of Fraud",
       subtitle = "Labels as Percent of Total Observations")

The dataset is extremely imbalanced, with only 0.13% of the transactions being fraudulent. To reduce model bias, we should consider sampling methods, such as undersampling, before moving to the modeling stage.

5.2 What Types of Transactions correspond with Fraud?

First, let's look at transaction types for fraud and non-fraud cases.

ggplot(data = transactions, aes(x = type, fill = as.factor(isFraud))) +
  geom_bar() +
  labs(title = "Frequency of Transaction Type", subtitle = "Fraud vs Not Fraud",
       x = 'Transaction Type', y = 'No. of transactions') +
  theme_classic()

Because fraud cases are so rare, the plot above tells us little about them. Let's plot the transaction types of fraud cases only.

5.2.1 Transaction Types for Fraudulent cases

Here we plot the number of fraudulent transactions for each transaction type:

# Count fraud transactions by transaction type
Fraud_trans_type <- transactions %>%
  group_by(type) %>%
  summarise(fraud_transactions = sum(isFraud))

ggplot(data = Fraud_trans_type, aes(x = type, y = fraud_transactions)) +
  geom_col(aes(fill = type), show.legend = FALSE) +  # one color per type
  labs(title = 'Fraud Transactions per Type', x = 'Transaction type',
       y = 'No. of Fraud Transactions') +
  geom_label(aes(label = fraud_transactions)) +
  theme_classic()

We can see from the plot above that fraudulent transactions occur only in the CASH_OUT and TRANSFER transaction types. This will be important later, as we can simplify our analysis by keeping only these two levels of type.

5.3 What does the distribution of Transaction Amount look like for Fraudulent cases?

ggplot(data = transactions[transactions$isFraud == 1, ], aes(x = amount)) +
  geom_histogram(bins = 30, fill = 'salmon') +
  labs(title = 'Fraud Transaction Amount Distribution',
       y = 'No. of Fraud Transactions', x = 'Amount (local currency)')

The distribution of amount for fraud transactions is heavily right-skewed, suggesting that the majority of fraudulent transactions involve relatively small amounts. (A quick numeric check is sketched below.)
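
One way to put numbers behind this observation (output not shown) is a five-number summary of amount for the fraud cases only:

# Summary statistics of transaction amounts for fraud cases
summary(transactions$amount[transactions$isFraud == 1])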

5.4 oldbalanceOrg vs oldbalanceDest

p1 <- ggplot(data = transactions, aes(x = factor(isFraud), y = log1p(oldbalanceOrg), fill = factor(isFraud))) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = 'Old Balance in Sender Accounts', x = 'isFraud', y = 'log1p(Balance Amount)') +
  theme_classic()

p2 <- ggplot(data = transactions, aes(x = factor(isFraud), y = log1p(oldbalanceDest), fill = factor(isFraud))) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = 'Old Balance in Receiver Accounts', x = 'isFraud', y = 'log1p(Balance Amount)') +
  theme_classic()

grid.arrange(p1, p2, nrow = 1)

In the majority of fraud transactions, the old balance of the origin account (the sender) is higher than in non-fraud transactions, while the old balance of the destination account is lower.

5.5 Does Fraud occur more often at a certain time of day?

Each step represents one hour of real-world time, and there are 743 steps covering roughly 30 days of data. Let's convert step into an hour-of-day variable (0-23), so that the daily pattern repeats across days.

# Convert step to hour of day (0-23) using the modulo operator
transactions$hour <- transactions$step %% 24

#Plot newly formatted data
p5 <- ggplot(data = transactions, aes(x = hour)) +
  geom_bar(aes(fill = 'isFraud'), show.legend = FALSE) +
  labs(title = 'Total transactions at different Hours', y = 'No. of transactions') +
  theme_classic()

p6 <- ggplot(data = transactions[transactions$isFraud == 1, ], aes(x = hour)) +
  geom_bar(aes(fill = 'isFraud'), show.legend = FALSE) +
  labs(title = 'Fraud transactions at different Hours', y = 'No. of fraud transactions') +
  theme_classic()

grid.arrange(p5, p6, ncol = 1, nrow = 2)

The total number of transactions between hours 0 and 9 is very low, but that is not the case for fraud transactions. We can conclude that fraud occurs disproportionately often between 12 a.m. and 9 a.m.

5.6 Important Insights to Consider

  1. The data is heavily imbalanced across the target classes. We should consider sampling methods, like undersampling, to reduce model bias.
  2. We can filter our transactions to include only the CASH_OUT and TRANSFER types, since these are the only transaction types with fraudulent cases.
  3. Fraudulent transactions tend to involve smaller amounts.
  4. Fraudulent transactions tend to occur between 12 a.m. and 9 a.m.

6 Feature Engineering and Data Cleaning

First, let's check whether there are transactions where the amount exceeds the balance available in the origin account while the destination balance still increases:

# Example rows where the amount exceeds the sender's balance yet the
# recipient's balance increases
head(transactions[(transactions$amount > transactions$oldbalanceOrg) &
                  (transactions$newbalanceDest > transactions$oldbalanceDest),
                  c("amount", "oldbalanceOrg", "newbalanceOrig",
                    "oldbalanceDest", "newbalanceDest", "isFraud")], 10)

6.1 Filter data

As noted earlier, we will filter our data by type to include only CASH_OUT and TRANSFER, since these are the only types containing fraudulent cases. We will also drop nameOrig and nameDest, which have too many unique levels to turn into dummy variables; step, which was only needed to derive the hour attribute; and isFlaggedFraud, which adds little information.

# Keep only CASH_OUT and TRANSFER transactions and drop irrelevant features
transactions1 <- transactions %>%
  select(-one_of('step', 'nameOrig', 'nameDest', 'isFlaggedFraud')) %>%
  filter(type %in% c('CASH_OUT', 'TRANSFER'))

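As a quick sanity check (a minimal sketch; output not shown), we can confirm that the filter kept all 8,213 fraud cases:

# All fraud cases should survive the type filter (expect TRUE)
sum(transactions1$isFraud) == sum(transactions$isFraud)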

6.2 Encoding Dummy variables for transaction type

Since transaction type is categorical, we need to encode it as dummy variables so that all model inputs are numeric. (A toy illustration of dummy_cols follows the code below.)

# Create a 0/1 dummy column for each level of type
transactions1 <- dummy_cols(transactions1)

# Convert the target to a factor for classification
transactions1$isFraud <- as.factor(transactions1$isFraud)
# Drop the original type column (column 1) now that it is encoded
transactions1 <- transactions1[, -1]
transactions1 <- as.data.frame(transactions1)
#summarizeColumns(transactions1)
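
To see what dummy_cols does, here is a toy example on a hypothetical two-row data frame (not part of the original analysis):

# dummy_cols appends one 0/1 column per level of each factor column
toy <- data.frame(type = c('TRANSFER', 'CASH_OUT'))
dummy_cols(toy)
# result gains type_TRANSFER = (1, 0) and type_CASH_OUT = (0, 1)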

6.3 Train and Validation Data

We know that fraudulent transactions are a rare event. Since they make up only 0.13% of the data, duplicating fraudulent transactions (oversampling) to balance the data is not the best technique here. It makes more sense to sample down the non-event cases through undersampling.

set.seed(12345)
# 70/30 split into training and validation sets
train_id <- sample(seq_len(nrow(transactions1)), size = floor(0.7 * nrow(transactions1)))

train <- transactions1[train_id, ]
valid <- transactions1[-train_id, ]

table(train$isFraud)
## 
##       0       1 
## 1933579    5707
table(valid$isFraud)
## 
##      0      1 
## 828617   2506

6.3.1 Undersampling

suppressMessages(library(unbalanced))
set.seed(12345)
prop.table(table(train$isFraud))
## 
##           0           1 
## 0.997057164 0.002942836
# Separate predictors from the target (column 6 is isFraud)
inputs <- train[, -6]
target <- train[, 6]

# Undersample the majority class down to a 50/50 balance
under_sam <- ubUnder(X = inputs, Y = target)
train_u <- cbind(under_sam$X, under_sam$Y)
# Restore the target column's original name
train_u$isFraud <- train_u$`under_sam$Y`
train_u$`under_sam$Y` <- NULL

table(train_u$isFraud)
## 
##    0    1 
## 5707 5707
prop.table(table(train_u$isFraud))
## 
##   0   1 
## 0.5 0.5

Using the undersampling method we obtain a balanced training set with 5,707 observations in each class. The validation data, however, remains imbalanced, which reflects the class distribution the model will face in practice. (An alternative base R sketch of the same undersampling is given below.)
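
If the unbalanced package is unavailable (it has at times been archived from CRAN), the same 50/50 undersampling can be reproduced in base R; a minimal sketch under that assumption:

# Base R equivalent of ubUnder: keep every fraud row plus an equally
# sized random sample of non-fraud rows
set.seed(12345)
fraud_rows    <- which(train$isFraud == 1)
nonfraud_rows <- sample(which(train$isFraud == 0), length(fraud_rows))
train_u_alt   <- train[sample(c(fraud_rows, nonfraud_rows)), ]
table(train_u_alt$isFraud)  # expect 5707 in each class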

7 Modeling

We consider tree-based modeling methods such as Decision Trees and Random Forests to predict fraudulent transactions.

7.1 Decision Tree

We fit a decision tree using the undersampled training data and plot it:

# Fit a decision tree on the undersampled training data
model_dt <- rpart(isFraud ~ ., data = train_u)
prp(model_dt)

predict_dt <- predict(model_dt, valid, type = "class")
# Note: confusionMatrix() expects (predictions, reference); passing the
# truth first, as here, transposes the table and swaps Sensitivity with
# Pos Pred Value. The F1 score computed below is unaffected because it
# is symmetric in those two quantities.
confusionMatrix(valid$isFraud, predict_dt)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 775510  53107
##          1    125   2381
##                                           
##                Accuracy : 0.936           
##                  95% CI : (0.9354, 0.9365)
##     No Information Rate : 0.9332          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.0768          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.99984         
##             Specificity : 0.04291         
##          Pos Pred Value : 0.93591         
##          Neg Pred Value : 0.95012         
##              Prevalence : 0.93324         
##          Detection Rate : 0.93309         
##    Detection Prevalence : 0.99698         
##       Balanced Accuracy : 0.52137         
##                                           
##        'Positive' Class : 0               
## 
#Accuracy of the model is 93.6%

7.1.1 Calculate the F1 score

The F1 score is the harmonic mean of precision and recall. We can calculate this metric using the information provided above (noting that caret has treated 0, the non-fraud class, as the positive class), since:

  • Recall = Sensitivity
  • Precision = Pos Pred Value
#Recall=Sensitivity
Recall_1<-0.99984
Precision_1<-0.93591

F1<-2*(Recall_1*Precision_1)/(Recall_1+Precision_1)
F1*100
## [1] 96.68193

The F1 score for the decision tree model is 96.68%, which looks very good; keep in mind, though, that with 0 as the positive class this score describes how well the model identifies non-fraud transactions.
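
Since we will compute F1 again for the random forest, a small helper function keeps the calculation in one place (a convenience function, not part of the original workflow):

# F1 = harmonic mean of precision and recall
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}
f1_score(Precision_1, Recall_1) * 100  # ~96.68, matching the value above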

7.2 Random Forest

Now we will consider another tree-based modeling method.

Let's fit a random forest model with the default parameters:

# Create a Random Forest model with default parameters
rf1 <- randomForest(isFraud ~ ., data = train_u, importance = TRUE)
rf1
## 
## Call:
##  randomForest(formula = isFraud ~ ., data = train_u, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 2.26%
## Confusion matrix:
##      0    1 class.error
## 0 5575  132  0.02312949
## 1  126 5581  0.02207815

The OOB estimate of the error rate is 2.26%; let's see if we can improve on this by tuning the parameters. In random forest models:

  • ntree = the number of trees in the forest
  • mtry = the number of predictor variables randomly sampled as split candidates at each node

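Instead of picking mtry by hand, the randomForest package also provides tuneRF to search over mtry values using OOB error; a sketch of how that could look here (the ntreeTry, stepFactor, and improve values are illustrative choices, not from the original analysis):

# Search mtry by OOB error, stepping up/down while improvement > 1%
set.seed(12345)
tuneRF(x = train_u[, setdiff(names(train_u), "isFraud")],
       y = train_u$isFraud,
       ntreeTry = 200, stepFactor = 1.5, improve = 0.01)
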
Let's fine-tune the parameters of the random forest model:

# Increase mtry from the default of 3 to 6
rf2 <- randomForest(isFraud ~ ., data = train_u, ntree = 500, mtry = 6, importance = TRUE)
rf2
## 
## Call:
##  randomForest(formula = isFraud ~ ., data = train_u, ntree = 500,      mtry = 6, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 1.15%
## Confusion matrix:
##      0    1 class.error
## 0 5605  102 0.017872788
## 1   29 5678 0.005081479

The OOB error estimate is 1.15%, which is better than the first model. Let's consider one more random forest scenario.

In this model we reduce the number of trees in the forest while keeping the number of predictors considered at each split:

# Reduce the forest to 200 trees, keeping mtry = 6
rf3 <- randomForest(isFraud ~ ., data = train_u, ntree = 200, mtry = 6, importance = TRUE)
rf3
## 
## Call:
##  randomForest(formula = isFraud ~ ., data = train_u, ntree = 200,      mtry = 6, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 1.1%
## Confusion matrix:
##      0    1 class.error
## 0 5614   93 0.016295777
## 1   32 5675 0.005607149

The OOB error estimate is 1.1% in this scenario, essentially on par with model 2 (1.15%). We will use model 2 to check our validation data.

Here we check how well our random forest model predicts on the training data set:

# Predicting on train set
predTrain <- predict(rf2, train, type = "class")
# Checking classification accuracy
table(predTrain, train$isFraud)
##          
## predTrain       0       1
##         0 1906590       0
##         1   26989    5707

The model performs well on the training set, catching all 5,707 fraud cases. Now let's find out how well it does on the validation data:

# Predicting on Validation set
predValid <- predict(rf2, valid, type = "class")
# Checking classification accuracy
mean(predValid == valid$isFraud)                    
## [1] 0.9860442
table(predValid,valid$isFraud)
##          
## predValid      0      1
##         0 817029     11
##         1  11588   2495

The random forest model performed very well, reaching 98.6% accuracy on the validation set.

Let's calculate the F1 score for our random forest model, reading the cell counts from the confusion table above and keeping caret's convention of 0 as the positive class:

# Cell counts from the validation table above (positive class = 0)
TP <- 817029  # predicted 0, actual 0
FN <- 11588   # predicted 1, actual 0
FP <- 11      # predicted 0, actual 1
TN <- 2495    # predicted 1, actual 1
#Calculate Recall
Recall <- TP / (TP + FN)
Recall * 100
## [1] 98.60153
#Calculate Precision
Precision <- TP / (TP + FP)
Precision * 100
## [1] 99.99865
#Calculate F1
F1_rf <- 2 * (Recall * Precision) / (Recall + Precision)
F1_rf * 100
## [1] 99.29518

The random forest model has an F1 score of 99.30% on the validation set.

# To check important variables
importance(rf2)        
##                           0        1 MeanDecreaseAccuracy MeanDecreaseGini
## amount          35.57795116 39.50215             47.12195        777.08510
## oldbalanceOrg  105.44569497 67.92024            124.15724       2745.65796
## newbalanceOrig 116.79263642 67.34054            115.94945        467.37422
## oldbalanceDest  49.88811238 31.01897             51.63019        282.64662
## newbalanceDest  30.76430827 54.88612             58.74716        976.36467
## hour            37.39136275 20.16681             39.19163        324.08880
## type_CASH_IN     0.00000000  0.00000              0.00000          0.00000
## type_CASH_OUT    0.14048173 20.74286             20.76799         64.98086
## type_DEBIT       0.00000000  0.00000              0.00000          0.00000
## type_PAYMENT     0.00000000  0.00000              0.00000          0.00000
## type_TRANSFER    0.09748201 21.57515             21.42828         64.30536
varImpPlot(rf2)

The table and plot above rank the variables in our random forest model from most to least important in predicting the response, which helps with the interpretability issues that come with random forest models. (The type_CASH_IN, type_DEBIT, and type_PAYMENT dummies have zero importance because they are constant zero after filtering to CASH_OUT and TRANSFER.)

7.3 Model Conclusions

Our decision tree model has:

  • accuracy of 93.6%
  • F1 score of 96.68%
  • recall of 99.98%

Our random forest model has:

  • accuracy of 98.6%
  • F1 score of 99.30%
  • recall of 98.60%

(Both recall figures follow caret's convention of treating 0, non-fraud, as the positive class.) Typically, in this scenario I would use the F1 score to choose between models, since F1 balances precision and recall and is far more informative than raw accuracy on data this imbalanced; by that measure the random forest wins. It is also worth looking directly at the cost of false negatives, since falsely assuming a transaction is not fraud costs the company much more than accusing a legitimate transaction of being fraud. On the validation data the random forest missed only 11 of the 2,506 fraudulent transactions, versus 125 for the decision tree. The company should use the random forest model to predict fraud.

8 Conclusion

PaySim provided simulated mobile money transaction data that includes fraudulent behavior. The goal of this project was to accurately predict fraudulent transactions using tree-based machine learning algorithms. Both models performed much better than expected, possibly due to the nature of the simulated data.

8.1 Suggested next steps

  • Consider additional methods for tuning model parameters; in addition, consider using clustering to find a more optimal random forest model.
  • Consider additional machine learning techniques such as XGBoost and Support Vector Machine algorithms (a minimal XGBoost sketch is given below).
  • Dive deeper into why fraudulent transactions appear only in the TRANSFER and CASH_OUT transaction types.
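
As a starting point for the XGBoost suggestion, here is a minimal sketch on the undersampled training data (the xgboost package and every parameter value below are illustrative assumptions, not part of the original analysis):

# Minimal XGBoost sketch: gradient-boosted trees for binary classification
library(xgboost)
X <- as.matrix(train_u[, setdiff(names(train_u), "isFraud")])
y <- as.numeric(as.character(train_u$isFraud))  # factor -> 0/1
bst <- xgboost(data = X, label = y, nrounds = 100,
               objective = "binary:logistic",
               max_depth = 6, eta = 0.1, verbose = 0)
# Predicted fraud probabilities on the validation set
Xv <- as.matrix(valid[, setdiff(names(valid), "isFraud")])
pred_prob <- predict(bst, Xv)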