This Rmarkdown data science project explores and detects fraudulent activity in a synthetically created fraud dataset found on Kaggle.
Here are the details from Kaggle:
This dataset contains 10,000 synthetic financial transactions designed for fraud detection research and model development. It simulates realistic user behavior and fraudulent patterns to provide a safe environment for testing machine learning models without exposing any real sensitive data.
Key features include:
transaction_id: Unique identifier for each transaction
user_id: Identifier for the user performing the transaction
amount: Transaction amount (in local currency)
transaction_type: Type of transaction (POS, Online, ATM, QR)
merchant_category: Category of the merchant
country: Country where the transaction took place
hour: Hour of the transaction (0–23)
device_risk_score and ip_risk_score: Risk indicators for the device and IP
is_fraud: Label indicating if the transaction is fraudulent (1) or legitimate (0)

Fraud patterns simulated include:
Transactions with unusually high amounts
Transactions in unusual countries for the user
Night-time transactions
Rapid multiple transactions
Transactions from newly created accounts
High-risk devices or IP addresses
Let's load some packages. Instead of caret, we'll use the randomForest package covered in the last RPubs publication.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
financial <- read.csv('synthetic_fraud_dataset.csv', header=T, na.strings=c('',' ','na','NA'))
str(financial)
## 'data.frame': 10000 obs. of 10 variables:
## $ transaction_id : int 9608 456 4747 6934 1646 2183 1919 3479 6796 5129 ...
## $ user_id : int 363 692 587 445 729 944 829 845 129 249 ...
## $ amount : num 4922.6 48 136.9 80.5 120 ...
## $ transaction_type : chr "ATM" "QR" "Online" "POS" ...
## $ merchant_category: chr "Travel" "Food" "Travel" "Clothing" ...
## $ country : chr "TR" "US" "TR" "TR" ...
## $ hour : int 12 21 14 23 16 17 12 7 16 6 ...
## $ device_risk_score: num 0.9923 0.1686 0.2961 0.1248 0.0981 ...
## $ ip_risk_score : num 0.9479 0.2241 0.1251 0.1592 0.0275 ...
## $ is_fraud : int 1 0 0 0 0 0 0 0 0 0 ...
financial$transaction_type <- as.factor(financial$transaction_type)
financial$merchant_category <- as.factor(financial$merchant_category)
financial$country <- as.factor(financial$country)
str(financial)
## 'data.frame': 10000 obs. of 10 variables:
## $ transaction_id : int 9608 456 4747 6934 1646 2183 1919 3479 6796 5129 ...
## $ user_id : int 363 692 587 445 729 944 829 845 129 249 ...
## $ amount : num 4922.6 48 136.9 80.5 120 ...
## $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 1 4 2 3 2 3 2 2 4 4 ...
## $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 5 3 5 1 4 1 5 4 3 4 ...
## $ country : Factor w/ 6 levels "DE","FR","NG",..: 4 6 4 4 2 1 5 6 1 5 ...
## $ hour : int 12 21 14 23 16 17 12 7 16 6 ...
## $ device_risk_score: num 0.9923 0.1686 0.2961 0.1248 0.0981 ...
## $ ip_risk_score : num 0.9479 0.2241 0.1251 0.1592 0.0275 ...
## $ is_fraud : int 1 0 0 0 0 0 0 0 0 0 ...
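As an aside, the same three as.factor() conversions can be written in one step with dplyr's across(); a minimal sketch that should give an equivalent result:

# Convert every remaining character column to a factor in one step
financial <- financial %>%
  mutate(across(where(is.character), as.factor))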
summary(financial$is_fraud)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 0.05 0.00 1.00
The target is the is_fraud feature, which we need to predict as fraud or not. The dataset description says this feature is 1 for fraud and 0 for legitimate. Since we want a classification, we can substitute 'yes' for 1 and 'no' for 0.
financial$is_fraud <- gsub('1','yes',financial$is_fraud)
financial$is_fraud <- gsub('0','no', financial$is_fraud)
financial$is_fraud <- as.factor(financial$is_fraud)
str(financial$is_fraud)
## Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
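For reference, the same recoding can be done in a single step with factor() and labels; a minimal sketch, applied to the original 0/1 column rather than after the gsub() calls:

# One-step alternative to the gsub() approach (run on the original 0/1 column)
financial$is_fraud <- factor(financial$is_fraud, levels = c(0, 1), labels = c('no', 'yes'))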
summary(financial)
## transaction_id user_id amount transaction_type
## Min. : 0 Min. : 0.0 Min. : 1.00 ATM :2529
## 1st Qu.:2500 1st Qu.:247.0 1st Qu.: 65.08 Online:2397
## Median :5000 Median :503.0 Median : 101.69 POS :2568
## Mean :5000 Mean :500.1 Mean : 178.14 QR :2506
## 3rd Qu.:7499 3rd Qu.:750.2 3rd Qu.: 138.28
## Max. :9999 Max. :999.0 Max. :11628.21
## merchant_category country hour device_risk_score
## Clothing :1982 DE:1930 Min. : 0.00 Min. :3.027e-05
## Electronics:2007 FR:2027 1st Qu.:10.00 1st Qu.:7.572e-02
## Food :2023 NG: 100 Median :14.00 Median :1.566e-01
## Grocery :1973 TR:1928 Mean :14.25 Mean :1.838e-01
## Travel :2015 UK:1965 3rd Qu.:19.00 3rd Qu.:2.349e-01
## US:2050 Max. :23.00 Max. :9.987e-01
## ip_risk_score is_fraud
## Min. :0.0000088 no :9500
## 1st Qu.:0.0777621 yes: 500
## Median :0.1582898
## Mean :0.1846686
## 3rd Qu.:0.2369681
## Max. :0.9996027
The data shows that 500 of the 10,000 observations are fraud and the rest are not. Let's start modeling by splitting the data into training and testing sets, then running the tuner and the randomForest model-building function to see how well it performs on this data.
set.seed(12345)
intrain <- sample(1:10000, .80*10000)
length(intrain)
## [1] 8000
We drew 8,000 random indices for our training set; now we build the training set to fit the model and the testing set to estimate its accuracy.
training <- financial[intrain,]
testing <- financial[-intrain,]
training %>% group_by(is_fraud) %>% count(is_fraud)
## # A tibble: 2 × 2
## # Groups: is_fraud [2]
## is_fraud n
## <fct> <int>
## 1 no 7593
## 2 yes 407
About 80% of the fraud cases landed in the training set, leaving the testing set with the remaining 20%.
testing %>% group_by(is_fraud) %>% count(is_fraud)
## # A tibble: 2 × 2
## # Groups: is_fraud [2]
## is_fraud n
## <fct> <int>
## 1 no 1907
## 2 yes 93
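Because fraud is only about 5% of the data, a plain random split can drift from the overall fraud rate; a stratified split fixes the proportion in each set. A minimal sketch in base R (we keep using the random split above):

# Stratified 80/20 split: sample indices within each is_fraud class
set.seed(12345)
idx_by_class <- split(seq_len(nrow(financial)), financial$is_fraud)
intrain_strat <- sort(unlist(lapply(idx_by_class, function(i) sample(i, floor(0.8 * length(i))))))
training_strat <- financial[intrain_strat, ]
testing_strat  <- financial[-intrain_strat, ]
table(training_strat$is_fraud)   # exactly 80% of each class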
Now let's use tuneRF to find how many features (mtry) to try at each split before running our random forest model with randomForest.
# tune1 <- tuneRF(training[,-10], training[,10], mtryStart=7, ntreeTry=50, stepFactor=2, improve=0.01,
# trace=TRUE, plot=TRUE, doBest=TRUE)
The code above produced an error when run interactively, perhaps because of the mix of factor and numeric features feeding a classification target. The trace looked like this:

mtry = 7  OOB error = 0%
Searching left ...
mtry = 4  OOB error = 0%
NaN 0.01
Error in if (Improve > improve) { : missing value where TRUE/FALSE needed
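The NaN in the trace comes from the way tuneRF measures relative improvement in the OOB error: with a starting OOB error of 0%, the ratio is 0/0. A sketch of that arithmetic:

# Relative improvement is roughly (old_error - new_error) / old_error
old_oob <- 0; new_oob <- 0
(old_oob - new_oob) / old_oob   # NaN, which breaks the if (Improve > improve) test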
After looking at the data more closely, I realized the transaction ID and user ID columns add nothing to this predictive model, so we can remove them.
training <- training[,-c(1:2)]
testing <- testing[,-c(1:2)]
summary(training)
## amount transaction_type merchant_category country
## Min. : 1.00 ATM :2012 Clothing :1587 DE:1527
## 1st Qu.: 65.02 Online:1933 Electronics:1585 FR:1615
## Median : 101.74 POS :2046 Food :1620 NG: 88
## Mean : 179.53 QR :2009 Grocery :1597 TR:1538
## 3rd Qu.: 138.82 Travel :1611 UK:1579
## Max. :11628.21 US:1653
## hour device_risk_score ip_risk_score is_fraud
## Min. : 0.00 Min. :3.865e-05 Min. :0.0000088 no :7593
## 1st Qu.:10.00 1st Qu.:7.598e-02 1st Qu.:0.0790742 yes: 407
## Median :14.00 Median :1.569e-01 Median :0.1590170
## Mean :14.22 Mean :1.846e-01 Mean :0.1857001
## 3rd Qu.:19.00 3rd Qu.:2.348e-01 3rd Qu.:0.2375397
## Max. :23.00 Max. :9.987e-01 Max. :0.9996027
summary(testing)
## amount transaction_type merchant_category country
## Min. : 1.00 ATM :517 Clothing :395 DE:403
## 1st Qu.: 65.44 Online:464 Electronics:422 FR:412
## Median : 101.46 POS :522 Food :403 NG: 12
## Mean : 172.60 QR :497 Grocery :376 TR:390
## 3rd Qu.: 136.00 Travel :404 UK:386
## Max. :11085.08 US:397
## hour device_risk_score ip_risk_score is_fraud
## Min. : 0.00 Min. :3.027e-05 Min. :0.0003205 no :1907
## 1st Qu.:10.00 1st Qu.:7.420e-02 1st Qu.:0.0736583 yes: 93
## Median :14.00 Median :1.540e-01 Median :0.1527330
## Mean :14.36 Mean :1.806e-01 Mean :0.1805427
## 3rd Qu.:19.00 3rd Qu.:2.350e-01 3rd Qu.:0.2348973
## Max. :23.00 Max. :9.985e-01 Max. :0.9983842
# tune1 <- tuneRF(training[,-8], training[,8], mtryStart=2, ntreeTry=50, stepFactor=2, improve=0.0001,
# trace=TRUE, plot=TRUE, doBest=FALSE)
#
Still getting the error. The NaN could come from missing values, or from a division by zero when the OOB improvement is computed. Let's see if there are any incomplete cases in the data.
training1 <- training[complete.cases(training),]
summary(training1)
## amount transaction_type merchant_category country
## Min. : 1.00 ATM :2012 Clothing :1587 DE:1527
## 1st Qu.: 65.02 Online:1933 Electronics:1585 FR:1615
## Median : 101.74 POS :2046 Food :1620 NG: 88
## Mean : 179.53 QR :2009 Grocery :1597 TR:1538
## 3rd Qu.: 138.82 Travel :1611 UK:1579
## Max. :11628.21 US:1653
## hour device_risk_score ip_risk_score is_fraud
## Min. : 0.00 Min. :3.865e-05 Min. :0.0000088 no :7593
## 1st Qu.:10.00 1st Qu.:7.598e-02 1st Qu.:0.0790742 yes: 407
## Median :14.00 Median :1.569e-01 Median :0.1590170
## Mean :14.22 Mean :1.846e-01 Mean :0.1857001
## 3rd Qu.:19.00 3rd Qu.:2.348e-01 3rd Qu.:0.2375397
## Max. :23.00 Max. :9.987e-01 Max. :0.9996027
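The summary shows no NAs; a couple of quicker programmatic checks (a sketch):

# Confirm there are no missing values without printing the full summary
anyNA(training)             # expect FALSE
colSums(is.na(training))    # per-column NA counts, expect all zeros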
There are no NA values, so the problem must be the mix of factor and numeric features. Let's convert the numeric values into factors. The device_risk_score and ip_risk_score are proportions between 0 and 1, so let's apply a threshold: a score greater than 60% counts as a risk and anything else does not, giving two levels (risk or not) for each of these features. The hour runs from 0 to 23, with most transactions occurring around hour 14 (2 pm). Let's create a four-level factor by dividing the day into portions: 0-6 is early AM, 7-11 is AM, 12-17 is afternoon, and 18-23 is evening. For the amount, most transactions are centered around 100 USD and the 3rd quartile is below 140 USD, meaning about 75% of transactions are under 140 USD. So let's make that a four-level feature as well: under 100 is low, 100-140 is medium, 140-300 is large, and over 300 is very large. Let's make these changes to the original data we modified before splitting into testing and training sets.
low <- subset(financial, financial$amount<100)
medium <- subset(financial, financial$amount >= 100 & financial$amount <= 140)
large <- subset(financial, financial$amount > 140 & financial$amount <= 300)
veryLarge <- subset(financial, financial$amount > 300)
Now let's add a column called Amount to each subset with its class label before rbinding the subsets back together. I did the math and the subset counts add up to 10,000 observations, so no rows are lost.
low$Amount <- 'low'
medium$Amount <- 'medium'
large$Amount <- 'large'
veryLarge$Amount <- 'very large'
Amount <- rbind(low,medium,large,veryLarge)
str(Amount)
## 'data.frame': 10000 obs. of 11 variables:
## $ transaction_id : int 456 6934 2183 3479 6796 5129 8538 8410 8201 6484 ...
## $ user_id : int 692 445 944 845 129 249 679 559 564 984 ...
## $ amount : num 48 80.5 97.1 96.5 83.3 ...
## $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 3 2 4 4 2 2 1 4 ...
## $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 3 1 1 4 3 4 3 3 1 5 ...
## $ country : Factor w/ 6 levels "DE","FR","NG",..: 6 4 1 6 1 5 6 1 4 5 ...
## $ hour : int 21 23 17 7 16 6 10 10 20 19 ...
## $ device_risk_score: num 0.1686 0.1248 0.2354 0.0823 0.0218 ...
## $ ip_risk_score : num 0.224 0.159 0.105 0.034 0.28 ...
## $ is_fraud : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Amount : chr "low" "low" "low" "low" ...
Let's drop the first 3 features and change the Amount feature into a factor, before converting hour, device_risk_score, and ip_risk_score into factor features as well.
Amount1 <- Amount[,-c(1:3)]
Amount1$Amount <- as.factor(Amount1$Amount)
str(Amount1)
## 'data.frame': 10000 obs. of 8 variables:
## $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 3 2 4 4 2 2 1 4 ...
## $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 3 1 1 4 3 4 3 3 1 5 ...
## $ country : Factor w/ 6 levels "DE","FR","NG",..: 6 4 1 6 1 5 6 1 4 5 ...
## $ hour : int 21 23 17 7 16 6 10 10 20 19 ...
## $ device_risk_score: num 0.1686 0.1248 0.2354 0.0823 0.0218 ...
## $ ip_risk_score : num 0.224 0.159 0.105 0.034 0.28 ...
## $ is_fraud : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Amount : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...
Now let's apply the 60% threshold, starting with device_risk_score: scores above 0.60 are 'high risk' and everything else is 'low risk', giving a two-level factor.
lowRisk_device <- subset(Amount1, Amount1$device_risk_score <= 0.60)
highRisk_device <- subset(Amount1, Amount1$device_risk_score > 0.60)
lowRisk_device$device_risk <- 'low risk'
highRisk_device$device_risk <- 'high risk'
Amount2 <- rbind(lowRisk_device,highRisk_device)
Amount2$device_risk <- as.factor(Amount2$device_risk)
str(Amount2)
## 'data.frame': 10000 obs. of 9 variables:
## $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 3 2 4 4 2 2 1 4 ...
## $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 3 1 1 4 3 4 3 3 1 5 ...
## $ country : Factor w/ 6 levels "DE","FR","NG",..: 6 4 1 6 1 5 6 1 4 5 ...
## $ hour : int 21 23 17 7 16 6 10 10 20 19 ...
## $ device_risk_score: num 0.1686 0.1248 0.2354 0.0823 0.0218 ...
## $ ip_risk_score : num 0.224 0.159 0.105 0.034 0.28 ...
## $ is_fraud : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Amount : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ device_risk : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
Drop the numeric device_risk_score now that we have turned it into a factor.
Amount3 <- Amount2[,-5]
str(Amount3)
## 'data.frame': 10000 obs. of 8 variables:
## $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 3 2 4 4 2 2 1 4 ...
## $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 3 1 1 4 3 4 3 3 1 5 ...
## $ country : Factor w/ 6 levels "DE","FR","NG",..: 6 4 1 6 1 5 6 1 4 5 ...
## $ hour : int 21 23 17 7 16 6 10 10 20 19 ...
## $ ip_risk_score : num 0.224 0.159 0.105 0.034 0.28 ...
## $ is_fraud : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Amount : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ device_risk : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
Now make the IP risk feature a factor with the same 60% threshold. We may have inadvertently discovered how the data were generated: the high-risk device subset held roughly 500 observations and the low-risk subset roughly 9,500, mirroring the fraud/legitimate split. But let's continue; this is a synthetic financial data set.
lowRisk_ip <- subset(Amount3, Amount3$ip_risk_score < 0.60)
highRisk_ip <- subset(Amount3, Amount3$ip_risk_score >= 0.60)
lowRisk_ip$ip_risk <- 'low risk'
highRisk_ip$ip_risk <- 'high risk'
Amount4 <- rbind(lowRisk_ip,highRisk_ip)
Amount4$ip_risk <- as.factor(Amount4$ip_risk)
str(Amount4)
## 'data.frame': 10000 obs. of 9 variables:
## $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 3 2 4 4 2 2 1 4 ...
## $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 3 1 1 4 3 4 3 3 1 5 ...
## $ country : Factor w/ 6 levels "DE","FR","NG",..: 6 4 1 6 1 5 6 1 4 5 ...
## $ hour : int 21 23 17 7 16 6 10 10 20 19 ...
## $ ip_risk_score : num 0.224 0.159 0.105 0.034 0.28 ...
## $ is_fraud : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Amount : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ device_risk : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
## $ ip_risk : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
Drop the numeric ip_risk_score feature now that we have the two-level ip_risk factor (high and low risk).
Amount5 <- Amount4[,-5]
str(Amount5)
## 'data.frame': 10000 obs. of 8 variables:
## $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 3 2 4 4 2 2 1 4 ...
## $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 3 1 1 4 3 4 3 3 1 5 ...
## $ country : Factor w/ 6 levels "DE","FR","NG",..: 6 4 1 6 1 5 6 1 4 5 ...
## $ hour : int 21 23 17 7 16 6 10 10 20 19 ...
## $ is_fraud : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Amount : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ device_risk : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
## $ ip_risk : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
Now let's make the hour a four-level factor. The hour runs from 0 to 23, with most transactions occurring around hour 14 (2 pm). We divide the day into portions: 0-6 is early AM, 7-11 is AM, 12-17 is afternoon, and 18-23 is evening.
earlyAM <- subset(Amount5, Amount5$hour >= 0 & Amount5$hour <= 6)
AM <- subset(Amount5, Amount5$hour >= 7 & Amount5$hour <= 11)
afternoon <- subset(Amount5, Amount5$hour >= 12 & Amount5$hour <= 17)
evening <- subset(Amount5, Amount5$hour >= 18 & Amount5$hour <= 23)
earlyAM$Hour <- 'early AM'
AM$Hour <- 'AM'
afternoon$Hour <- 'afternoon'
evening$Hour <- 'evening'
Amount6 <- rbind(earlyAM,AM,afternoon, evening)
Amount6$Hour <- as.factor(Amount6$Hour)
str(Amount6)
## 'data.frame': 10000 obs. of 9 variables:
## $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 1 3 2 4 1 1 2 2 ...
## $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 4 4 3 4 4 5 5 4 4 3 ...
## $ country : Factor w/ 6 levels "DE","FR","NG",..: 5 4 2 5 1 1 2 6 1 1 ...
## $ hour : int 6 6 6 6 6 6 6 6 6 6 ...
## $ is_fraud : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Amount : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ device_risk : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
## $ ip_risk : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
## $ Hour : Factor w/ 4 levels "afternoon","AM",..: 3 3 3 3 3 3 3 3 3 3 ...
Let's drop the numeric hour feature now that we have the four-level Hour factor.
Amount7 <- Amount6[,-4]
str(Amount7)
## 'data.frame': 10000 obs. of 8 variables:
## $ transaction_type : Factor w/ 4 levels "ATM","Online",..: 4 3 1 3 2 4 1 1 2 2 ...
## $ merchant_category: Factor w/ 5 levels "Clothing","Electronics",..: 4 4 3 4 4 5 5 4 4 3 ...
## $ country : Factor w/ 6 levels "DE","FR","NG",..: 5 4 2 5 1 1 2 6 1 1 ...
## $ is_fraud : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Amount : Factor w/ 4 levels "large","low",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ device_risk : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
## $ ip_risk : Factor w/ 2 levels "high risk","low risk": 2 2 2 2 2 2 2 2 2 2 ...
## $ Hour : Factor w/ 4 levels "afternoon","AM",..: 3 3 3 3 3 3 3 3 3 3 ...
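For reference, the whole recoding above (amount bins, risk thresholds, and hour-of-day buckets) can be written as one dplyr pipeline; a minimal sketch that should give an equivalent frame, up to factor level ordering and the handling of exact bin boundaries:

# One-pipeline alternative to the subset()/rbind() steps above
Amount_alt <- financial %>%
  select(-transaction_id, -user_id) %>%
  mutate(Amount      = cut(amount, breaks = c(-Inf, 100, 140, 300, Inf),
                           labels = c('low', 'medium', 'large', 'very large')),
         device_risk = factor(ifelse(device_risk_score > 0.60, 'high risk', 'low risk')),
         ip_risk     = factor(ifelse(ip_risk_score >= 0.60, 'high risk', 'low risk')),
         Hour        = cut(hour, breaks = c(-1, 6, 11, 17, 23),
                           labels = c('early AM', 'AM', 'afternoon', 'evening'))) %>%
  select(-amount, -hour, -device_risk_score, -ip_risk_score)
str(Amount_alt)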
Now we have a data frame of 8 features that are all factors. Let's see if we can predict fraud with the 7 predictor features in our data.
set.seed(123)
intrain <- sample(1:10000,.8*10000)
training <- Amount7[intrain,]
testing <- Amount7[-intrain,]
tune1 <- tuneRF(training[,-8], training[,8], mtryStart=5, ntreeTry=50, stepFactor=2, improve=0.01,
trace=TRUE, plot=TRUE, doBest=TRUE)
## mtry = 5 OOB error = 65.9%
## Searching left ...
## mtry = 3 OOB error = 66.36%
## -0.007018209 0.01
## Searching right ...
## mtry = 7 OOB error = 66.09%
## -0.00284522 0.01
The OOB error is lowest at mtry = 5 with the 50-tree search (about 66%), so let's use that mtry in our random forest model.
set.seed(123)
finance_rf <- randomForest( is_fraud~ ., data=training, ntree=1000, mtry=5,
keep.forest=TRUE, importance=TRUE)
The results, pasted from the RStudio console in case the knitted HTML output differs:

Call:
 randomForest(formula = is_fraud ~ ., data = training, ntree = 1000, mtry = 5, keep.forest = TRUE, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 5

        OOB estimate of error rate: 0%
Confusion matrix:
      no yes class.error
no  7590   0           0
yes    0 410           0
It looks like perfect accuracy on the training data, with a 0% OOB error estimate. Let's look at the features by importance in this model.
finance_rf['importance']
## $importance
## no yes MeanDecreaseAccuracy MeanDecreaseGini
## transaction_type 0.0000000000 0.0000000 0.000000e+00 0.00000000
## merchant_category 0.0000000000 0.0000000 0.000000e+00 0.00000000
## country 0.0000104639 0.0000000 9.887937e-06 0.05263088
## Amount 0.0017515566 0.0000000 1.661435e-03 25.99341148
## device_risk 0.0262005476 0.4684148 4.893406e-02 381.99777220
## ip_risk 0.0251636285 0.4478594 4.682421e-02 367.39194169
## Hour 0.0000000000 0.0000000 0.000000e+00 0.00000000
It looks like device risk and then IP risk were the most important features for this model's accuracy; as we saw earlier, these features likely drove the split into 500 fraud cases and 9,500 legitimate cases when the data set was generated synthetically. Transaction type, merchant category, and hour contributed nothing to the prediction, while country and amount contributed very little.
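For a quick visual of the same ranking, the randomForest package includes varImpPlot(); a minimal sketch:

# Plot mean decrease in accuracy and Gini for each feature
varImpPlot(finance_rf, main = 'Variable importance for the fraud random forest')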
Let's see how well it predicts on unseen data: the held-out 20% of the factorized financial data in our testing set.
prediction_rf <- predict(finance_rf, testing)
summary(prediction_rf)
## no yes
## 1910 90
finance_results_df <- data.frame(prediction_rf, type=testing$is_fraud)
head(finance_results_df, 100)
## prediction_rf type
## 103 no no
## 526 no no
## 717 no no
## 1138 no no
## 1142 no no
## 1214 no no
## 1228 no no
## 1316 no no
## 1519 no no
## 1562 no no
## 1658 no no
## 1678 no no
## 2464 no no
## 2577 no no
## 2792 no no
## 2860 no no
## 2900 no no
## 2956 no no
## 3072 no no
## 4091 no no
## 4256 no no
## 4338 no no
## 4373 no no
## 4535 no no
## 4597 no no
## 5081 no no
## 5115 no no
## 5119 no no
## 5406 no no
## 5489 no no
## 5639 no no
## 5705 no no
## 5862 no no
## 5970 no no
## 6276 no no
## 6281 no no
## 6359 no no
## 6377 no no
## 6546 no no
## 6576 no no
## 6649 no no
## 6685 no no
## 6950 no no
## 7282 no no
## 7297 no no
## 7368 no no
## 7543 no no
## 7570 no no
## 7782 no no
## 7802 no no
## 8009 no no
## 8247 no no
## 8535 no no
## 8676 no no
## 8681 no no
## 9270 no no
## 9493 no no
## 289 no no
## 1221 no no
## 1251 no no
## 1557 no no
## 1886 no no
## 2071 no no
## 2412 no no
## 3267 no no
## 3319 no no
## 3461 no no
## 4177 no no
## 4793 no no
## 5007 no no
## 5098 no no
## 5531 no no
## 5586 no no
## 5690 no no
## 5904 no no
## 6519 no no
## 7426 no no
## 7428 no no
## 7444 no no
## 8248 no no
## 8303 no no
## 8586 no no
## 8832 no no
## 8883 no no
## 9791 no no
## 9840 no no
## 178 no no
## 714 no no
## 983 no no
## 1178 no no
## 1721 no no
## 1938 no no
## 2301 no no
## 2608 no no
## 2669 no no
## 3695 no no
## 3870 no no
## 4008 no no
## 4519 no no
## 4999 no no
tail(finance_results_df, 100)
## prediction_rf type
## 4527 no no
## 4562 no no
## 4574 no no
## 4590 no no
## 4603 no no
## 4690 no no
## 4734 no no
## 4747 no no
## 4789 no no
## 4884 no no
## 4956 no no
## 5108 no no
## 5202 no no
## 5214 no no
## 5236 no no
## 5260 no no
## 5368 no no
## 5413 no no
## 5424 no no
## 5495 no no
## 5558 no no
## 5570 no no
## 5576 no no
## 5633 no no
## 5682 no no
## 5734 no no
## 5790 no no
## 5856 no no
## 5955 no no
## 6054 no no
## 6366 no no
## 6408 no no
## 6506 no no
## 6516 no no
## 6596 no no
## 6637 no no
## 6785 no no
## 6964 no no
## 7092 no no
## 7166 no no
## 7214 no no
## 7266 no no
## 7303 no no
## 7432 no no
## 7714 no no
## 7752 no no
## 7842 no no
## 7859 no no
## 7871 no no
## 8157 no no
## 8311 no no
## 8314 no no
## 8371 no no
## 8394 no no
## 8415 no no
## 8478 no no
## 8506 no no
## 8609 no no
## 8679 no no
## 8731 no no
## 8807 no no
## 8838 no no
## 8899 no no
## 9117 no no
## 9140 no no
## 9142 no no
## 9154 no no
## 9168 no no
## 9345 no no
## 9387 no no
## 9439 no no
## 9452 no no
## 9464 no no
## 9484 no no
## 9485 no no
## 9663 no no
## 9748 no no
## 9825 no no
## 9904 no no
## 9934 no no
## 9970 no no
## 951 yes yes
## 6093 yes yes
## 2894 yes yes
## 6082 yes yes
## 209 yes yes
## 1392 yes yes
## 2221 yes yes
## 2543 yes yes
## 3945 yes yes
## 4765 yes yes
## 5178 yes yes
## 8192 yes yes
## 8317 yes yes
## 8701 yes yes
## 8876 yes yes
## 9093 yes yes
## 9097 yes yes
## 9545 yes yes
## 9638 yes yes
The predictions match the testing class all the way down the line. The random forest model built with the randomForest package achieved what looks like 100% accuracy on the test set. Let's compute that directly.
sum(finance_results_df$prediction_rf==finance_results_df$type)/length(finance_results_df$type)
## [1] 1
The result is 1, or 100% accuracy, as output in RStudio. We will see whether the same result appears in the published knitr HTML output.
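Because fraud is such a small minority of the 2,000 test cases, per-class metrics are worth a look alongside raw accuracy; a minimal sketch:

# Confusion matrix plus recall and precision for the fraud ('yes') class
conf_mat <- table(predicted = prediction_rf, actual = testing$is_fraud)
conf_mat
recall    <- conf_mat['yes', 'yes'] / sum(conf_mat[, 'yes'])
precision <- conf_mat['yes', 'yes'] / sum(conf_mat['yes', ])
c(recall = recall, precision = precision)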
Thanks for following along as we explored this financial data, converting numeric and integer features into class thresholds to predict whether a transaction was fraudulent. Keep checking in as we move through more data science projects to discover the truth in the data. From this and the last publication, it is clear that randomForest is a robust modeling function that works best with large numbers of samples; this data set was synthetic, but it was built to resemble realistic financial data. Many decision trees rely on threshold values, and it can help to recode numeric data into thresholds without changing its underlying meaning, whereas mathematical regression formulas remain the better fit when predicting numeric values such as market rates, insurance rates, tax assessments, and so on.