This Rmarkdown data science project explores and detects fraudulent activity in a synthetically created fraud dataset found on Kaggle.

Here are the details from Kaggle:

This dataset contains 10,000 synthetic financial transactions designed for fraud detection research and model development. It simulates realistic user behavior and fraudulent patterns to provide a safe environment for testing machine learning models without exposing any real sensitive data.

Key features include:

transaction_id: Unique identifier for each transaction user_id: Identifier for the user performing the transaction amount: Transaction amount (in local currency) transaction_type: Type of transaction (POS, Online, ATM, QR) merchant_category: Category of the merchant country: Country where the transaction took place hour: Hour of the transaction (0–23) device_risk_score and ip_risk_score: Risk indicators for the device and IP is_fraud: Label indicating if the transaction is fraudulent (1) or legitimate (0) Fraud patterns simulated include: Transactions with unusually high amounts Transactions in unusual countries for the user Night-time transactions Rapid multiple transactions Transactions from newly created accounts High-risk devices or IP addresses

Lets call in some packages to use. Lets use the randomForest package we just read about and used in last posted Rpubs publication instead of caret.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
financial <- read.csv('synthetic_fraud_dataset.csv', header=T, na.strings=c('',' ','na','NA'))
str(financial)
## 'data.frame':    10000 obs. of  10 variables:
##  $ transaction_id   : int  9608 456 4747 6934 1646 2183 1919 3479 6796 5129 ...
##  $ user_id          : int  363 692 587 445 729 944 829 845 129 249 ...
##  $ amount           : num  4922.6 48 136.9 80.5 120 ...
##  $ transaction_type : chr  "ATM" "QR" "Online" "POS" ...
##  $ merchant_category: chr  "Travel" "Food" "Travel" "Clothing" ...
##  $ country          : chr  "TR" "US" "TR" "TR" ...
##  $ hour             : int  12 21 14 23 16 17 12 7 16 6 ...
##  $ device_risk_score: num  0.9923 0.1686 0.2961 0.1248 0.0981 ...
##  $ ip_risk_score    : num  0.9479 0.2241 0.1251 0.1592 0.0275 ...
##  $ is_fraud         : int  1 0 0 0 0 0 0 0 0 0 ...
financial$transaction_type <- as.factor(financial$transaction_type)
financial$merchant_category <- as.factor(financial$merchant_category)
financial$country <- as.factor(financial$country)
str(financial)
## 'data.frame':    10000 obs. of  10 variables:
##  $ transaction_id   : int  9608 456 4747 6934 1646 2183 1919 3479 6796 5129 ...
##  $ user_id          : int  363 692 587 445 729 944 829 845 129 249 ...
##  $ amount           : num  4922.6 48 136.9 80.5 120 ...
##  $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 1 4 2 3 2 3 2 2 4 4 ...
##  $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 5 3 5 1 4 1 5 4 3 4 ...
##  $ country          : Factor w/ 6 levels "DE","FR","NG",..: 4 6 4 4 2 1 5 6 1 5 ...
##  $ hour             : int  12 21 14 23 16 17 12 7 16 6 ...
##  $ device_risk_score: num  0.9923 0.1686 0.2961 0.1248 0.0981 ...
##  $ ip_risk_score    : num  0.9479 0.2241 0.1251 0.1592 0.0275 ...
##  $ is_fraud         : int  1 0 0 0 0 0 0 0 0 0 ...
summary(financial$is_fraud)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    0.05    0.00    1.00

The target is the is_fraud feature that we must predict the case as fraud or not. The details in the background data say that this feature is fraud if 1 and not fraud if 0. We need this to be a classification, so we can substitute the 1 for yes and 0 for no.

financial$is_fraud <- gsub('1','yes',financial$is_fraud)
financial$is_fraud <- gsub('0','no', financial$is_fraud)
financial$is_fraud <- as.factor(financial$is_fraud)
str(financial$is_fraud)
##  Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
summary(financial)
##  transaction_id    user_id          amount         transaction_type
##  Min.   :   0   Min.   :  0.0   Min.   :    1.00   ATM   :2529     
##  1st Qu.:2500   1st Qu.:247.0   1st Qu.:   65.08   Online:2397     
##  Median :5000   Median :503.0   Median :  101.69   POS   :2568     
##  Mean   :5000   Mean   :500.1   Mean   :  178.14   QR    :2506     
##  3rd Qu.:7499   3rd Qu.:750.2   3rd Qu.:  138.28                   
##  Max.   :9999   Max.   :999.0   Max.   :11628.21                   
##    merchant_category country        hour       device_risk_score  
##  Clothing   :1982    DE:1930   Min.   : 0.00   Min.   :3.027e-05  
##  Electronics:2007    FR:2027   1st Qu.:10.00   1st Qu.:7.572e-02  
##  Food       :2023    NG: 100   Median :14.00   Median :1.566e-01  
##  Grocery    :1973    TR:1928   Mean   :14.25   Mean   :1.838e-01  
##  Travel     :2015    UK:1965   3rd Qu.:19.00   3rd Qu.:2.349e-01  
##                      US:2050   Max.   :23.00   Max.   :9.987e-01  
##  ip_risk_score       is_fraud  
##  Min.   :0.0000088   no :9500  
##  1st Qu.:0.0777621   yes: 500  
##  Median :0.1582898             
##  Mean   :0.1846686             
##  3rd Qu.:0.2369681             
##  Max.   :0.9996027

The data says that 500 of the 10,000 observations are fraud and the rest are not. Lets start modeling our data by splitting into testing and training sets and then running the tuner and randomForest model building function. And see how well it can perform on very large data.

set.seed(12345)
intrain <- sample(1:10000, .80*10000)
length(intrain)
## [1] 8000

We got 8,000 random samples to put in our training set and now will make the training set to train model and the testing set to predict the model accuracy.

training <- financial[intrain,]
testing <- financial[-intrain,]
training %>% group_by(is_fraud) %>% count(is_fraud)
## # A tibble: 2 × 2
## # Groups:   is_fraud [2]
##   is_fraud     n
##   <fct>    <int>
## 1 no        7593
## 2 yes        407

About 80% of the fraud cases are in the training data set. And that leaves the testing set with the remaining 20% of fraud cases.

testing %>% group_by(is_fraud) %>% count(is_fraud)
## # A tibble: 2 × 2
## # Groups:   is_fraud [2]
##   is_fraud     n
##   <fct>    <int>
## 1 no        1907
## 2 yes         93

Now lets see how many features or mtry to use in our tuner before running our random forest model with randomForest.

# tune1 <- tuneRF(training[,-10], training[,10], mtryStart=7, ntreeTry=50, stepFactor=2, improve=0.01,
#         trace=TRUE, plot=TRUE, doBest=TRUE)

Error produced with above code, so maybe a mix of factors and numeric regression features to make a classification are giving the error.

mtry = 7 OOB error = 0% Searching left … mtry = 4 OOB error = 0% NaN 0.01 Error in if (Improve > improve) { : missing value where TRUE/FALSE needed

After looking at the data better, I realized the factors for transaction ID and user ID are not adding to this predictive model, so we can remove them.

training <- training[,-c(1:2)]
testing <- testing[,-c(1:2)]
summary(training)
##      amount         transaction_type   merchant_category country  
##  Min.   :    1.00   ATM   :2012      Clothing   :1587    DE:1527  
##  1st Qu.:   65.02   Online:1933      Electronics:1585    FR:1615  
##  Median :  101.74   POS   :2046      Food       :1620    NG:  88  
##  Mean   :  179.53   QR    :2009      Grocery    :1597    TR:1538  
##  3rd Qu.:  138.82                    Travel     :1611    UK:1579  
##  Max.   :11628.21                                        US:1653  
##       hour       device_risk_score   ip_risk_score       is_fraud  
##  Min.   : 0.00   Min.   :3.865e-05   Min.   :0.0000088   no :7593  
##  1st Qu.:10.00   1st Qu.:7.598e-02   1st Qu.:0.0790742   yes: 407  
##  Median :14.00   Median :1.569e-01   Median :0.1590170             
##  Mean   :14.22   Mean   :1.846e-01   Mean   :0.1857001             
##  3rd Qu.:19.00   3rd Qu.:2.348e-01   3rd Qu.:0.2375397             
##  Max.   :23.00   Max.   :9.987e-01   Max.   :0.9996027
summary(testing)
##      amount         transaction_type   merchant_category country 
##  Min.   :    1.00   ATM   :517       Clothing   :395     DE:403  
##  1st Qu.:   65.44   Online:464       Electronics:422     FR:412  
##  Median :  101.46   POS   :522       Food       :403     NG: 12  
##  Mean   :  172.60   QR    :497       Grocery    :376     TR:390  
##  3rd Qu.:  136.00                    Travel     :404     UK:386  
##  Max.   :11085.08                                        US:397  
##       hour       device_risk_score   ip_risk_score       is_fraud  
##  Min.   : 0.00   Min.   :3.027e-05   Min.   :0.0003205   no :1907  
##  1st Qu.:10.00   1st Qu.:7.420e-02   1st Qu.:0.0736583   yes:  93  
##  Median :14.00   Median :1.540e-01   Median :0.1527330             
##  Mean   :14.36   Mean   :1.806e-01   Mean   :0.1805427             
##  3rd Qu.:19.00   3rd Qu.:2.350e-01   3rd Qu.:0.2348973             
##  Max.   :23.00   Max.   :9.985e-01   Max.   :0.9983842
# tune1 <- tuneRF(training[,-8], training[,8],  mtryStart=2, ntreeTry=50, stepFactor=2, improve=0.0001,
#          trace=TRUE, plot=TRUE, doBest=FALSE)
# 

Still getting error, so the NaN must be for missing values if it divides by zero in modeling oob or upvotes. Lets see if there are any incomplete cases in the data.

training1 <- training[complete.cases(training),]
summary(training1)
##      amount         transaction_type   merchant_category country  
##  Min.   :    1.00   ATM   :2012      Clothing   :1587    DE:1527  
##  1st Qu.:   65.02   Online:1933      Electronics:1585    FR:1615  
##  Median :  101.74   POS   :2046      Food       :1620    NG:  88  
##  Mean   :  179.53   QR    :2009      Grocery    :1597    TR:1538  
##  3rd Qu.:  138.82                    Travel     :1611    UK:1579  
##  Max.   :11628.21                                        US:1653  
##       hour       device_risk_score   ip_risk_score       is_fraud  
##  Min.   : 0.00   Min.   :3.865e-05   Min.   :0.0000088   no :7593  
##  1st Qu.:10.00   1st Qu.:7.598e-02   1st Qu.:0.0790742   yes: 407  
##  Median :14.00   Median :1.569e-01   Median :0.1590170             
##  Mean   :14.22   Mean   :1.846e-01   Mean   :0.1857001             
##  3rd Qu.:19.00   3rd Qu.:2.348e-01   3rd Qu.:0.2375397             
##  Max.   :23.00   Max.   :9.987e-01   Max.   :0.9996027

There are not any NA values. It must be due to the mix of factors and of numeric values. Lets make these numeric values factors. The percent values are between 0 and 1 for device_risk_score and ip_risk_score. Lets put a threshold up so that if a risk it is greater than 60% and not a risk otherwise. There will be 2 factors of risk or not based on this threshold for both features. The hour is between 0.00 and 23.00 with most transactions occuring at hour 14 or 2 pm. Lets create a four factor class for this feature with dividing 23 hours into portions of the day: 0-6 will be early am, 7-11 am, 12-17 afternoon, and 18-23 evening. For the amount, most transactions are centered around 100 USD and the 3rd quantile is less than 140 USD meaning about 75% of these transactions are less than 140 USD. So, lets make that a 4 class feature as well with less than 100 small, 101-139 medium, 140-300 large, and greater than 300 very large. Lets make these changes to the original data we made changes to before splitting data into testing and training sets.

low <- subset(financial, financial$amount<100)
medium <- subset(financial, financial$amount >= 100 & financial$amount <= 140) 
large <- subset(financial, financial$amount > 140 & financial$amount <= 300)
veryLarge <- subset(financial, financial$amount > 300)

Now lets add a column called Amount to the data with separate class of data before rbinding the data. I did the math and all this adds to 10,000 observations.

low$Amount <- 'low'
medium$Amount <- 'medium'
large$Amount <- 'large'
veryLarge$Amount <- 'very large'

Amount <- rbind(low,medium,large,veryLarge)
str(Amount)
## 'data.frame':    10000 obs. of  11 variables:
##  $ transaction_id   : int  456 6934 2183 3479 6796 5129 8538 8410 8201 6484 ...
##  $ user_id          : int  692 445 944 845 129 249 679 559 564 984 ...
##  $ amount           : num  48 80.5 97.1 96.5 83.3 ...
##  $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 3 2 4 4 2 2 1 4 ...
##  $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 3 1 1 4 3 4 3 3 1 5 ...
##  $ country          : Factor w/ 6 levels "DE","FR","NG",..: 6 4 1 6 1 5 6 1 4 5 ...
##  $ hour             : int  21 23 17 7 16 6 10 10 20 19 ...
##  $ device_risk_score: num  0.1686 0.1248 0.2354 0.0823 0.0218 ...
##  $ ip_risk_score    : num  0.224 0.159 0.105 0.034 0.28 ...
##  $ is_fraud         : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Amount           : chr  "low" "low" "low" "low" ...

Lets drop the first 3 features and change the Amount feature into a factor before changing the hour, device_risk_score, and ip_risk_score into factor features.

Amount1 <- Amount[,-c(1:3)]
Amount1$Amount <- as.factor(Amount1$Amount)
str(Amount1)
## 'data.frame':    10000 obs. of  8 variables:
##  $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 3 2 4 4 2 2 1 4 ...
##  $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 3 1 1 4 3 4 3 3 1 5 ...
##  $ country          : Factor w/ 6 levels "DE","FR","NG",..: 6 4 1 6 1 5 6 1 4 5 ...
##  $ hour             : int  21 23 17 7 16 6 10 10 20 19 ...
##  $ device_risk_score: num  0.1686 0.1248 0.2354 0.0823 0.0218 ...
##  $ ip_risk_score    : num  0.224 0.159 0.105 0.034 0.28 ...
##  $ is_fraud         : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Amount           : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...

Lets put a threshold up so that if a risk it is greater than 60% and not a risk otherwise. There will be 2 factors of risk or not based on this threshold for both features.

lowRisk_device <- subset(Amount1, Amount1$device_risk_score <= 0.60)
highRisk_device <- subset(Amount1, Amount1$device_risk_score > 0.60)

lowRisk_device$device_risk <- 'low risk'
highRisk_device$device_risk <- 'high risk'

Amount2 <- rbind(lowRisk_device,highRisk_device)
Amount2$device_risk <- as.factor(Amount2$device_risk)

str(Amount2)
## 'data.frame':    10000 obs. of  9 variables:
##  $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 3 2 4 4 2 2 1 4 ...
##  $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 3 1 1 4 3 4 3 3 1 5 ...
##  $ country          : Factor w/ 6 levels "DE","FR","NG",..: 6 4 1 6 1 5 6 1 4 5 ...
##  $ hour             : int  21 23 17 7 16 6 10 10 20 19 ...
##  $ device_risk_score: num  0.1686 0.1248 0.2354 0.0823 0.0218 ...
##  $ ip_risk_score    : num  0.224 0.159 0.105 0.034 0.28 ...
##  $ is_fraud         : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Amount           : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ device_risk      : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...

Drop the device_risk_score we just made into a factor from numeric type.

Amount3 <- Amount2[,-5]
str(Amount3)
## 'data.frame':    10000 obs. of  8 variables:
##  $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 3 2 4 4 2 2 1 4 ...
##  $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 3 1 1 4 3 4 3 3 1 5 ...
##  $ country          : Factor w/ 6 levels "DE","FR","NG",..: 6 4 1 6 1 5 6 1 4 5 ...
##  $ hour             : int  21 23 17 7 16 6 10 10 20 19 ...
##  $ ip_risk_score    : num  0.224 0.159 0.105 0.034 0.28 ...
##  $ is_fraud         : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Amount           : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ device_risk      : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...

Now make the feature for ip risk a factor with threshold 60% as well. We may have inadvertantly discovered the risk was 60% device risk due to the number of observations in the low risk subset being 500 and observations in the high risk subset were 9,500 by device risk, but lets continue. This is a synthetic financial data set.

lowRisk_ip <- subset(Amount3, Amount3$ip_risk_score < 0.60)
highRisk_ip <- subset(Amount3, Amount3$ip_risk_score >= 0.60)

lowRisk_ip$ip_risk <- 'low risk'
highRisk_ip$ip_risk <- 'high risk'

Amount4 <- rbind(lowRisk_ip,highRisk_ip)

Amount4$ip_risk <- as.factor(Amount4$ip_risk)
str(Amount4)
## 'data.frame':    10000 obs. of  9 variables:
##  $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 3 2 4 4 2 2 1 4 ...
##  $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 3 1 1 4 3 4 3 3 1 5 ...
##  $ country          : Factor w/ 6 levels "DE","FR","NG",..: 6 4 1 6 1 5 6 1 4 5 ...
##  $ hour             : int  21 23 17 7 16 6 10 10 20 19 ...
##  $ ip_risk_score    : num  0.224 0.159 0.105 0.034 0.28 ...
##  $ is_fraud         : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Amount           : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ device_risk      : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ip_risk          : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...

drop the ip_risk_score numeric feature we just made into a factor feature of 2 classes high and low risk.

Amount5 <- Amount4[,-5]
str(Amount5)
## 'data.frame':    10000 obs. of  8 variables:
##  $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 3 2 4 4 2 2 1 4 ...
##  $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 3 1 1 4 3 4 3 3 1 5 ...
##  $ country          : Factor w/ 6 levels "DE","FR","NG",..: 6 4 1 6 1 5 6 1 4 5 ...
##  $ hour             : int  21 23 17 7 16 6 10 10 20 19 ...
##  $ is_fraud         : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Amount           : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ device_risk      : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ip_risk          : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...

Now, lets make the hour into a 4 class feature. The hour is between 0.00 and 23.00 with most transactions occuring at hour 14 or 2 pm. Lets create a four factor class for this feature with dividing 23 hours into portions of the day: 0-6 will be early am, 7-11 will be am, 12-17 will be afternoon, and 18-23 will be evening.

earlyAM <- subset(Amount5, Amount5$hour >= 0 & Amount5$hour <= 6)
AM <- subset(Amount5, Amount5$hour >= 7 & Amount5$hour <= 11)
afternoon <- subset(Amount5, Amount5$hour >= 12 & Amount5$hour <= 17)
evening <- subset(Amount5, Amount5$hour >= 18 & Amount5$hour <= 23)

earlyAM$Hour <- 'early AM'
AM$Hour <- 'AM'
afternoon$Hour <- 'afternoon'
evening$Hour <- 'evening'

Amount6 <- rbind(earlyAM,AM,afternoon, evening)
Amount6$Hour <- as.factor(Amount6$Hour)
str(Amount6)
## 'data.frame':    10000 obs. of  9 variables:
##  $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 1 3 2 4 1 1 2 2 ...
##  $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 4 4 3 4 4 5 5 4 4 3 ...
##  $ country          : Factor w/ 6 levels "DE","FR","NG",..: 5 4 2 5 1 1 2 6 1 1 ...
##  $ hour             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ is_fraud         : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Amount           : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ device_risk      : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ip_risk          : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Hour             : Factor w/ 4 levels "afternoon","AM",..: 3 3 3 3 3 3 3 3 3 3 ...

Lets drop the hour feature we made into a factor of 4 levels.

Amount7 <- Amount6[,-4]
str(Amount7)
## 'data.frame':    10000 obs. of  8 variables:
##  $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 1 3 2 4 1 1 2 2 ...
##  $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 4 4 3 4 4 5 5 4 4 3 ...
##  $ country          : Factor w/ 6 levels "DE","FR","NG",..: 5 4 2 5 1 1 2 6 1 1 ...
##  $ is_fraud         : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Amount           : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ device_risk      : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ip_risk          : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Hour             : Factor w/ 4 levels "afternoon","AM",..: 3 3 3 3 3 3 3 3 3 3 ...

Now we have a data frame of 8 features that are all factor level. Lets see if we can predict the fraud risk with the 7 factor features in our data.

set.seed(123)

intrain <- sample(1:10000,.8*10000)
training <- Amount7[intrain,]
testing <- Amount7[-intrain,]
tune1 <- tuneRF(training[,-8], training[,8], mtryStart=5, ntreeTry=50, stepFactor=2, improve=0.01,
         trace=TRUE, plot=TRUE, doBest=TRUE)
## mtry = 5  OOB error = 65.9% 
## Searching left ...
## mtry = 3     OOB error = 66.36% 
## -0.007018209 0.01 
## Searching right ...
## mtry = 7     OOB error = 66.09% 
## -0.00284522 0.01

So the mtry seems to be the lowest with 50 trees to start and under 65% out of bag error. Lets use that in our random forest model.

set.seed(123)
finance_rf <- randomForest( is_fraud~ ., data=training, ntree=1000, mtry=5,
                          keep.forest=TRUE, importance=TRUE)

The results if the Rstudio version is different than knitr html version output: Call: randomForest(formula = is_fraud ~ ., data = training, ntree = 1000, mtry = 5, keep.forest = TRUE, importance = TRUE) Type of random forest: classification Number of trees: 1000 No. of variables tried at each split: 5

    OOB estimate of  error rate: 0%

Confusion matrix: no yes class.error no 7590 0 0 yes 0 410 0

Looks like perfect accuracy on the training model. Lets look at the features by importance in this model.

finance_rf['importance']
## $importance
##                             no       yes MeanDecreaseAccuracy MeanDecreaseGini
## transaction_type  0.0000000000 0.0000000         0.000000e+00       0.00000000
## merchant_category 0.0000000000 0.0000000         0.000000e+00       0.00000000
## country           0.0000104639 0.0000000         9.887937e-06       0.05263088
## Amount            0.0017515566 0.0000000         1.661435e-03      25.99341148
## device_risk       0.0262005476 0.4684148         4.893406e-02     381.99777220
## ip_risk           0.0251636285 0.4478594         4.682421e-02     367.39194169
## Hour              0.0000000000 0.0000000         0.000000e+00       0.00000000

Looks like the device risk and then the ip risk were the most important in this model accuracy and we saw earlier that these features must have been what split the data into 500 fraud cases and 9,500 legit cases when the data set was made synthetically. The transaction type, merchant category, and hour had no importance in this model’s prediction, but very little importance was played in the country and amount withdrawn or spent in this model’s prediction accuracy.

Lets see how well it predicts on unseen data in our testing set the hold out 20% of the factorized features of financial data.

prediction_rf <- predict(finance_rf, testing)
summary(prediction_rf)
##   no  yes 
## 1910   90
finance_results_df <- data.frame(prediction_rf, type=testing$is_fraud)
head(finance_results_df, 100)
##      prediction_rf type
## 103             no   no
## 526             no   no
## 717             no   no
## 1138            no   no
## 1142            no   no
## 1214            no   no
## 1228            no   no
## 1316            no   no
## 1519            no   no
## 1562            no   no
## 1658            no   no
## 1678            no   no
## 2464            no   no
## 2577            no   no
## 2792            no   no
## 2860            no   no
## 2900            no   no
## 2956            no   no
## 3072            no   no
## 4091            no   no
## 4256            no   no
## 4338            no   no
## 4373            no   no
## 4535            no   no
## 4597            no   no
## 5081            no   no
## 5115            no   no
## 5119            no   no
## 5406            no   no
## 5489            no   no
## 5639            no   no
## 5705            no   no
## 5862            no   no
## 5970            no   no
## 6276            no   no
## 6281            no   no
## 6359            no   no
## 6377            no   no
## 6546            no   no
## 6576            no   no
## 6649            no   no
## 6685            no   no
## 6950            no   no
## 7282            no   no
## 7297            no   no
## 7368            no   no
## 7543            no   no
## 7570            no   no
## 7782            no   no
## 7802            no   no
## 8009            no   no
## 8247            no   no
## 8535            no   no
## 8676            no   no
## 8681            no   no
## 9270            no   no
## 9493            no   no
## 289             no   no
## 1221            no   no
## 1251            no   no
## 1557            no   no
## 1886            no   no
## 2071            no   no
## 2412            no   no
## 3267            no   no
## 3319            no   no
## 3461            no   no
## 4177            no   no
## 4793            no   no
## 5007            no   no
## 5098            no   no
## 5531            no   no
## 5586            no   no
## 5690            no   no
## 5904            no   no
## 6519            no   no
## 7426            no   no
## 7428            no   no
## 7444            no   no
## 8248            no   no
## 8303            no   no
## 8586            no   no
## 8832            no   no
## 8883            no   no
## 9791            no   no
## 9840            no   no
## 178             no   no
## 714             no   no
## 983             no   no
## 1178            no   no
## 1721            no   no
## 1938            no   no
## 2301            no   no
## 2608            no   no
## 2669            no   no
## 3695            no   no
## 3870            no   no
## 4008            no   no
## 4519            no   no
## 4999            no   no
tail(finance_results_df, 100)
##      prediction_rf type
## 4527            no   no
## 4562            no   no
## 4574            no   no
## 4590            no   no
## 4603            no   no
## 4690            no   no
## 4734            no   no
## 4747            no   no
## 4789            no   no
## 4884            no   no
## 4956            no   no
## 5108            no   no
## 5202            no   no
## 5214            no   no
## 5236            no   no
## 5260            no   no
## 5368            no   no
## 5413            no   no
## 5424            no   no
## 5495            no   no
## 5558            no   no
## 5570            no   no
## 5576            no   no
## 5633            no   no
## 5682            no   no
## 5734            no   no
## 5790            no   no
## 5856            no   no
## 5955            no   no
## 6054            no   no
## 6366            no   no
## 6408            no   no
## 6506            no   no
## 6516            no   no
## 6596            no   no
## 6637            no   no
## 6785            no   no
## 6964            no   no
## 7092            no   no
## 7166            no   no
## 7214            no   no
## 7266            no   no
## 7303            no   no
## 7432            no   no
## 7714            no   no
## 7752            no   no
## 7842            no   no
## 7859            no   no
## 7871            no   no
## 8157            no   no
## 8311            no   no
## 8314            no   no
## 8371            no   no
## 8394            no   no
## 8415            no   no
## 8478            no   no
## 8506            no   no
## 8609            no   no
## 8679            no   no
## 8731            no   no
## 8807            no   no
## 8838            no   no
## 8899            no   no
## 9117            no   no
## 9140            no   no
## 9142            no   no
## 9154            no   no
## 9168            no   no
## 9345            no   no
## 9387            no   no
## 9439            no   no
## 9452            no   no
## 9464            no   no
## 9484            no   no
## 9485            no   no
## 9663            no   no
## 9748            no   no
## 9825            no   no
## 9904            no   no
## 9934            no   no
## 9970            no   no
## 951            yes  yes
## 6093           yes  yes
## 2894           yes  yes
## 6082           yes  yes
## 209            yes  yes
## 1392           yes  yes
## 2221           yes  yes
## 2543           yes  yes
## 3945           yes  yes
## 4765           yes  yes
## 5178           yes  yes
## 8192           yes  yes
## 8317           yes  yes
## 8701           yes  yes
## 8876           yes  yes
## 9093           yes  yes
## 9097           yes  yes
## 9545           yes  yes
## 9638           yes  yes

Looks like twins all down the line the entries for prediction match the testing class. This model worked well in random forest with the randomForest package in R with 100% accuracy. Lets just look at the results.

sum(finance_results_df$prediction_rf==finance_results_df$type)/length(finance_results_df$type)
## [1] 1

The results are 1 or 100% as output in Rstudio. We will see if same results within the knitr html outputs published.

Thanks for following along as we explored this financial data after making class thresholds from numeric and integer data to predict if the transaction was fraud or not. Continue checking in as we move along many data science projects to discover truth in the data. From here and last publication, it is clear that randomForest is a robust function to model data but works the best with large amounts of samples, and this was created data but made to be similar to realistic financial data. Many of the decision trees use threshold or upvote values and sometimes it benefits to alter the data without changing its true value into a threshold of numeric data that is best for mathematical regression formulas to get numeric values when finding best fit for a numeric value like market rates, insurance rates, tax assessments, and so on