Case Study 1:
Below is a dataset containing synthetic transactions and some transactions are marked as fraudulent. We would like you to perform the following using the language of your choice:
• Describe the dataset and any issues with it. • Generate a minimum of 5 visualizations using the data and write a brief description of your observations. Additionally, all attempts should be made to make the visualizations visually appealing • Create a feature set and perform prediction of fraudulent transactions using at least 2 algorithms. Describe any data cleansing that must be performed. • Visualize the test results and propose what could be done to improve results. Also describe assumptions you made and your approach.
Dataset: https://www.kaggle.com/ntnu-testimon/paysim1
Output:
An HTML website hosting all visualizations and documenting all visualizations and descriptions. All code hosted on GitHub for viewing. Please provide URL’s to both the output and the GitHub repo.
We begin by loading the packages that will be required throughout the course of our analysis.
library(tidyr)
library(DT)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(kableExtra)
library(lubridate)
library(readxl)
library(highcharter)
library(lubridate)
library(scales)
library(RColorBrewer)
library(wesanderson)
library(plotly)
library(shiny)
library(readxl)
library(readr)
library(scales)
library(stringr)
library(boot)
library(reshape2)
library(gridExtra)
After loading the required packages, we move on to the data perparation step which would start by loading the data into our R-studio.
#data <- data.table::fread("PS_20174392719_1491204439457_log.csv")
data <- read.csv('PS_20174392719_1491204439457_log.csv')
Now, I am converting the character variables to factor for my analysis.
data <- mutate_if(data, is.character, as.factor)
str(data)
## 'data.frame': 6362620 obs. of 11 variables:
## $ step : int 1 1 1 1 1 1 1 1 1 1 ...
## $ type : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
## $ amount : num 9840 1864 181 181 11668 ...
## $ nameOrig : Factor w/ 6353307 levels "C1000000639",..: 757870 2188999 1002157 5828263 3445982 6026526 1805948 2999172 869141 5407277 ...
## $ oldbalanceOrg : num 170136 21249 181 181 41554 ...
## $ newbalanceOrig: num 160296 19385 0 0 29886 ...
## $ nameDest : Factor w/ 2722362 levels "C1000004082",..: 1662095 1733925 439686 391697 828920 2247219 2063364 2314009 768941 282961 ...
## $ oldbalanceDest: num 0 0 0 21182 0 ...
## $ newbalanceDest: num 0 0 0 0 0 ...
## $ isFraud : int 0 0 1 1 0 0 0 0 0 0 ...
## $ isFlaggedFraud: int 0 0 0 0 0 0 0 0 0 0 ...
summary(data)
## step type amount nameOrig
## Min. : 1.0 CASH_IN :1399284 Min. : 0 C1065307291: 3
## 1st Qu.:156.0 CASH_OUT:2237500 1st Qu.: 13390 C1462946854: 3
## Median :239.0 DEBIT : 41432 Median : 74872 C1530544995: 3
## Mean :243.4 PAYMENT :2151495 Mean : 179862 C1677795071: 3
## 3rd Qu.:335.0 TRANSFER: 532909 3rd Qu.: 208721 C1784010646: 3
## Max. :743.0 Max. :92445517 C1832548028: 3
## (Other) :6362602
## oldbalanceOrg newbalanceOrig nameDest
## Min. : 0 Min. : 0 C1286084959: 113
## 1st Qu.: 0 1st Qu.: 0 C985934102 : 109
## Median : 14208 Median : 0 C665576141 : 105
## Mean : 833883 Mean : 855114 C2083562754: 102
## 3rd Qu.: 107315 3rd Qu.: 144258 C1590550415: 101
## Max. :59585040 Max. :49585040 C248609774 : 101
## (Other) :6361989
## oldbalanceDest newbalanceDest isFraud isFlaggedFraud
## Min. : 0 Min. : 0 Min. :0.000000 Min. :0.0e+00
## 1st Qu.: 0 1st Qu.: 0 1st Qu.:0.000000 1st Qu.:0.0e+00
## Median : 132706 Median : 214661 Median :0.000000 Median :0.0e+00
## Mean : 1100702 Mean : 1224996 Mean :0.001291 Mean :2.5e-06
## 3rd Qu.: 943037 3rd Qu.: 1111909 3rd Qu.:0.000000 3rd Qu.:0.0e+00
## Max. :356015889 Max. :356179279 Max. :1.000000 Max. :1.0e+00
##
glimpse(data)
## Rows: 6,362,620
## Columns: 11
## $ step <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ type <fct> PAYMENT, PAYMENT, TRANSFER, CASH_OUT, PAYMENT, PAYME...
## $ amount <dbl> 9839.64, 1864.28, 181.00, 181.00, 11668.14, 7817.71,...
## $ nameOrig <fct> C1231006815, C1666544295, C1305486145, C840083671, C...
## $ oldbalanceOrg <dbl> 170136.0, 21249.0, 181.0, 181.0, 41554.0, 53860.0, 1...
## $ newbalanceOrig <dbl> 160296.36, 19384.72, 0.00, 0.00, 29885.86, 46042.29,...
## $ nameDest <fct> M1979787155, M2044282225, C553264065, C38997010, M12...
## $ oldbalanceDest <dbl> 0, 0, 0, 21182, 0, 0, 0, 0, 0, 41898, 10845, 0, 0, 0...
## $ newbalanceDest <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ isFraud <int> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ isFlaggedFraud <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
So, there are 11 columns/observations. There are 6,362,620 rows. The Distribution, Summary and Glimpse of the Data is available above for a better understanding.
Next we will check for missing values and deal with them accordingly.
summary(is.na(data))
## step type amount nameOrig
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:6362620 FALSE:6362620 FALSE:6362620 FALSE:6362620
## oldbalanceOrg newbalanceOrig nameDest oldbalanceDest
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:6362620 FALSE:6362620 FALSE:6362620 FALSE:6362620
## newbalanceDest isFraud isFlaggedFraud
## Mode :logical Mode :logical Mode :logical
## FALSE:6362620 FALSE:6362620 FALSE:6362620
colSums(is.na(data))
## step type amount nameOrig oldbalanceOrg
## 0 0 0 0 0
## newbalanceOrig nameDest oldbalanceDest newbalanceDest isFraud
## 0 0 0 0 0
## isFlaggedFraud
## 0
The Data set do not have any Null Values.
The final data set (top 20) can be found below in an interactive table.
datatable(head(data, 20), class = 'cell-border stripe')
#No. of fraud records
data %>% count(isFraud)
## isFraud n
## 1 0 6354407
## 2 1 8213
#percent of fraudulent records
prop.table(table(data$isFraud))*100
##
## 0 1
## 99.870918 0.129082
There are 8213 number of fraudulent transactions. Which is equivalent to around 0.1290%.
data %>%
count(type, sort = TRUE) %>%
ggplot(aes(x = reorder(type, +n), y = n)) +
geom_col(stat ="identity", color = "black", fill="#641E16") +
coord_flip() +
theme_gray() +
geom_text(aes(label = n), hjust = 2.0, color = "white", size = 3.5) +
ggtitle("Transactions as per Type", subtitle = "") +
xlab('Transaction Type') +
ylab('No of transactions') +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
Above is the visual representations of the Transactions per type.
# Number of Fraudulent Transactions
data %>% count(isFraud)
## isFraud n
## 1 0 6354407
## 2 1 8213
#percent of fraudulent records
prop.table(table(data$isFraud))*100
##
## 0 1
## 99.870918 0.129082
#Finding out the category/type from which more no. of fraud transactions took place
fraud_trans <- data %>%
group_by(type) %>%
summarise(fraud_transactions = sum(isFraud))
fraud_trans
## # A tibble: 5 x 2
## type fraud_transactions
## <fct> <int>
## 1 CASH_IN 0
## 2 CASH_OUT 4116
## 3 DEBIT 0
## 4 PAYMENT 0
## 5 TRANSFER 4097
fraud_trans %>%
ggplot(aes(x = reorder(type, +fraud_transactions), y = fraud_transactions)) +
geom_col(stat ="identity", color = "black", fill="#641E16") +
coord_flip() +
theme_gray() +
geom_text(aes(label = fraud_transactions), hjust = 2.0, color = "white", size = 3.5) +
ggtitle("Fraud Transactions Per Type", subtitle = "") +
xlab('Transaction Type') +
ylab('No of Fraud Transactions') +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
Cash Out and Transfer are the ones with all the numbers of Fraudulent Transactions with 4116 and 4097 transactions respectively,
#frequency distribution of amount in fradulent transactions
data %>%
filter(isFraud == 1) %>%
ggplot(aes(x = amount, fill = amount)) +
geom_histogram(bins = 50, aes(fill = 'Amt'), fill="#641E16") +
theme_gray() +
ggtitle('Fraud transaction Amount distribution', subtitle = "") +
xlab('Amount in Dollars') +
ylab('No of Fraud Transactions') +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
The Distribution is rightly screwed which means that the fraudulent transactions are generally of small amount.
#understanding Step
summary(data$step)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 156.0 239.0 243.4 335.0 743.0
summary(is.na(data$step))
## Mode FALSE
## logical 6362620
Step - Maps a unit of time in the real world. In this case 1 step is 1 hour of time.
#Distribution of transactions at different time intervals
timeInt <- data %>%
ggplot(aes(x = step)) +
geom_histogram(bins = 150, aes(fill = 'isFraud'), show.legend = FALSE, fill="#641E16") +
theme_gray() +
ggtitle('Total transactions at different time interval', subtitle = "") +
xlab('Steps') +
ylab('No. of transactions') +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
fradtimInt <- data %>%
filter(isFraud == 1) %>%
ggplot(aes(x = step)) +
geom_histogram(bins =150, aes(fill = 'isFraud'), show.legend = FALSE, fill="#641E16") +
theme_gray() +
ggtitle('Fraud transactions at different time interval', subtitle = "") +
xlab('Steps') +
ylab('No.of fraud transactions') +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
grid.arrange(timeInt, fradtimInt, nrow=2, ncol=1)
These shows the Number of Fraudulent transactions and no. of Transactions step-wise. The overall transactions drop after 400 but the fradulent transactions remain the same.
#creating new dataset with filtering and subsetting
data_filt <- data %>%
select(-c('step','nameOrig', 'nameDest', 'isFlaggedFraud')) %>%
filter(type %in% c('CASH_OUT', 'TRANSFER'))
table(data_filt$type)
##
## CASH_IN CASH_OUT DEBIT PAYMENT TRANSFER
## 0 2237500 0 0 532909
summary(data_filt)
## type amount oldbalanceOrg newbalanceOrig
## CASH_IN : 0 Min. : 0 Min. : 0 Min. : 0
## CASH_OUT:2237500 1st Qu.: 82974 1st Qu.: 0 1st Qu.: 0
## DEBIT : 0 Median : 171261 Median : 304 Median : 0
## PAYMENT : 0 Mean : 317536 Mean : 47643 Mean : 16092
## TRANSFER: 532909 3rd Qu.: 306791 3rd Qu.: 30997 3rd Qu.: 0
## Max. :92445517 Max. :59585040 Max. :49585040
## oldbalanceDest newbalanceDest isFraud
## Min. : 0 Min. : 0 Min. :0.000000
## 1st Qu.: 128073 1st Qu.: 327289 1st Qu.:0.000000
## Median : 555710 Median : 828054 Median :0.000000
## Mean : 1703551 Mean : 2049734 Mean :0.002965
## 3rd Qu.: 1735900 3rd Qu.: 2122198 3rd Qu.:0.000000
## Max. :356015889 Max. :356179279 Max. :1.000000
Cconverting isFraud to Factor for my Analysis.
#convert isFraud variable into factor variable for developing model
data_filt$isFraud <- as.factor(data_filt$isFraud)
str(data_filt)
## 'data.frame': 2770409 obs. of 7 variables:
## $ type : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 5 2 2 5 5 2 2 2 2 5 ...
## $ amount : num 181 181 229134 215310 311686 ...
## $ oldbalanceOrg : num 181 181 15325 705 10835 ...
## $ newbalanceOrig: num 0 0 0 0 0 ...
## $ oldbalanceDest: num 0 21182 5083 22425 6267 ...
## $ newbalanceDest: num 0 0 51513 0 2719173 ...
## $ isFraud : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
head(data_filt)
## type amount oldbalanceOrg newbalanceOrig oldbalanceDest newbalanceDest
## 1 TRANSFER 181.0 181.00 0 0 0.00
## 2 CASH_OUT 181.0 181.00 0 21182 0.00
## 3 CASH_OUT 229133.9 15325.00 0 5083 51513.44
## 4 TRANSFER 215310.3 705.00 0 22425 0.00
## 5 TRANSFER 311685.9 10835.00 0 6267 2719172.89
## 6 CASH_OUT 110414.7 26845.41 0 288800 2415.16
## isFraud
## 1 1
## 2 1
## 3 0
## 4 0
## 5 0
## 6 0
#splitting the data into train and test
set.seed(99999)
index <- sample(2, nrow(data_filt), replace = T, prob = c(.8,.2))
train <- data_filt[index==1,]
test <- data_filt[index==2,]
dim(train)
## [1] 2216587 7
dim(test)
## [1] 553822 7
#Developing a model on train data with Random Forest algorithm
library(randomForest)
memory.limit(1000000)
## [1] 1e+06
fit_forest <- randomForest(isFraud ~ ., data = train, ntree=20)
I developed a Model on Train Data with Random Forest Algorithm.
#printing
print(fit_forest)
##
## Call:
## randomForest(formula = isFraud ~ ., data = train, ntree = 20)
## Type of random forest: classification
## Number of trees: 20
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.08%
## Confusion matrix:
## 0 1 class.error
## 0 2209628 140 6.335507e-05
## 1 1699 4906 2.572294e-01
Printing the Observation of the Fitted Model.
#prediction on training data
p1 <- predict(fit_forest, train)
Performed Prediction on the training Model. he Values can be seen here.
#Confusion Matrix on train data
library(caret)
confusionMatrix(train$isFraud, p1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2209936 45
## 1 1431 5175
##
## Accuracy : 0.9993
## 95% CI : (0.9993, 0.9994)
## No Information Rate : 0.9976
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8749
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9994
## Specificity : 0.9914
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.7834
## Prevalence : 0.9976
## Detection Rate : 0.9970
## Detection Prevalence : 0.9970
## Balanced Accuracy : 0.9954
##
## 'Positive' Class : 0
##
#prediction on test data
p2 <- predict(fit_forest, test)
#confusion matrix on test data
confusionMatrix(test$isFraud, p2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 552186 29
## 1 439 1168
##
## Accuracy : 0.9992
## 95% CI : (0.9991, 0.9992)
## No Information Rate : 0.9978
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8327
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9992
## Specificity : 0.9758
## Pos Pred Value : 0.9999
## Neg Pred Value : 0.7268
## Prevalence : 0.9978
## Detection Rate : 0.9970
## Detection Prevalence : 0.9971
## Balanced Accuracy : 0.9875
##
## 'Positive' Class : 0
##
#Tuning parameters
#optimal no. of trees
plot(fit_forest)
hist(treesize(fit_forest),
main = "No. of Nodes for the Trees",
col = "#641E16")
#feature importance
fit_forest$importance
## MeanDecreaseGini
## type 1590.4526
## amount 1613.2493
## oldbalanceOrg 3330.4982
## newbalanceOrig 896.1162
## oldbalanceDest 738.8763
## newbalanceDest 2957.5363
varImpPlot(fit_forest)
importance_matrix <- data.frame(Variables = rownames(fit_forest$importance), fit_forest$importance, row.names = NULL)
importance_matrix %>%
ggplot( aes(y = MeanDecreaseGini , x = Variables,))+
geom_col(stat ="identity", color = "black", fill="#641E16") +
coord_flip() +
theme_gray() +
geom_text(aes(label = round(MeanDecreaseGini,2)), hjust = 2.0, color = "white", size = 3.5) +
ggtitle('Variiable importance plot', subtitle = "") +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())
dim(test)
## [1] 553822 7
model <- glm(isFraud ~.,family=binomial(link='logit'),data=train)
summary(model)
##
## Call:
## glm(formula = isFraud ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.0441 -0.0147 -0.0029 8.4904
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.593e+00 2.741e-02 -204.04 <2e-16 ***
## typeTRANSFER 1.659e+00 3.620e-02 45.84 <2e-16 ***
## amount -1.739e-05 4.522e-07 -38.45 <2e-16 ***
## oldbalanceOrg 2.942e-05 4.625e-07 63.62 <2e-16 ***
## newbalanceOrig -3.234e-05 4.789e-07 -67.54 <2e-16 ***
## oldbalanceDest 4.684e-06 9.797e-08 47.81 <2e-16 ***
## newbalanceDest -5.022e-06 9.694e-08 -51.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 90030 on 2216586 degrees of freedom
## Residual deviance: 35495 on 2216580 degrees of freedom
## AIC: 35509
##
## Number of Fisher Scoring iterations: 19
anova(model, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: isFraud
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 2216586 90030
## type 1 3142.7 2216585 86887 < 2.2e-16 ***
## amount 1 903.7 2216584 85983 < 2.2e-16 ***
## oldbalanceOrg 1 21020.7 2216583 64963 < 2.2e-16 ***
## newbalanceOrig 1 26371.5 2216582 38591 < 2.2e-16 ***
## oldbalanceDest 1 778.2 2216581 37813 < 2.2e-16 ***
## newbalanceDest 1 2318.6 2216580 35495 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better.