1. Synopsis

Case Study 1:

Below is a dataset containing synthetic transactions and some transactions are marked as fraudulent. We would like you to perform the following using the language of your choice:

• Describe the dataset and any issues with it. • Generate a minimum of 5 visualizations using the data and write a brief description of your observations. Additionally, all attempts should be made to make the visualizations visually appealing • Create a feature set and perform prediction of fraudulent transactions using at least 2 algorithms. Describe any data cleansing that must be performed. • Visualize the test results and propose what could be done to improve results. Also describe assumptions you made and your approach.

Dataset: https://www.kaggle.com/ntnu-testimon/paysim1

Output:

An HTML website hosting all visualizations and documenting all visualizations and descriptions. All code hosted on GitHub for viewing. Please provide URL’s to both the output and the GitHub repo.

2. Packages Required

We begin by loading the packages that will be required throughout the course of our analysis.

library(tidyr)
library(DT)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(kableExtra)
library(lubridate)
library(readxl)
library(highcharter)
library(lubridate)
library(scales)
library(RColorBrewer)
library(wesanderson)
library(plotly)
library(shiny)
library(readxl)
library(readr)
library(scales)
library(stringr)
library(boot)
library(reshape2)
library(gridExtra)

3. Data Preparation

After loading the required packages, we move on to the data perparation step which would start by loading the data into our R-studio.

3.1 Loading and Reading the Data

#data <- data.table::fread("PS_20174392719_1491204439457_log.csv")

data <- read.csv('PS_20174392719_1491204439457_log.csv')

Now, I am converting the character variables to factor for my analysis.

data <- mutate_if(data, is.character, as.factor)

Summary and Glimpse of the Data

str(data)
## 'data.frame':    6362620 obs. of  11 variables:
##  $ step          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ type          : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 4 4 5 2 4 4 4 4 4 3 ...
##  $ amount        : num  9840 1864 181 181 11668 ...
##  $ nameOrig      : Factor w/ 6353307 levels "C1000000639",..: 757870 2188999 1002157 5828263 3445982 6026526 1805948 2999172 869141 5407277 ...
##  $ oldbalanceOrg : num  170136 21249 181 181 41554 ...
##  $ newbalanceOrig: num  160296 19385 0 0 29886 ...
##  $ nameDest      : Factor w/ 2722362 levels "C1000004082",..: 1662095 1733925 439686 391697 828920 2247219 2063364 2314009 768941 282961 ...
##  $ oldbalanceDest: num  0 0 0 21182 0 ...
##  $ newbalanceDest: num  0 0 0 0 0 ...
##  $ isFraud       : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ isFlaggedFraud: int  0 0 0 0 0 0 0 0 0 0 ...
summary(data)
##       step             type             amount                nameOrig      
##  Min.   :  1.0   CASH_IN :1399284   Min.   :       0   C1065307291:      3  
##  1st Qu.:156.0   CASH_OUT:2237500   1st Qu.:   13390   C1462946854:      3  
##  Median :239.0   DEBIT   :  41432   Median :   74872   C1530544995:      3  
##  Mean   :243.4   PAYMENT :2151495   Mean   :  179862   C1677795071:      3  
##  3rd Qu.:335.0   TRANSFER: 532909   3rd Qu.:  208721   C1784010646:      3  
##  Max.   :743.0                      Max.   :92445517   C1832548028:      3  
##                                                        (Other)    :6362602  
##  oldbalanceOrg      newbalanceOrig            nameDest      
##  Min.   :       0   Min.   :       0   C1286084959:    113  
##  1st Qu.:       0   1st Qu.:       0   C985934102 :    109  
##  Median :   14208   Median :       0   C665576141 :    105  
##  Mean   :  833883   Mean   :  855114   C2083562754:    102  
##  3rd Qu.:  107315   3rd Qu.:  144258   C1590550415:    101  
##  Max.   :59585040   Max.   :49585040   C248609774 :    101  
##                                        (Other)    :6361989  
##  oldbalanceDest      newbalanceDest         isFraud         isFlaggedFraud   
##  Min.   :        0   Min.   :        0   Min.   :0.000000   Min.   :0.0e+00  
##  1st Qu.:        0   1st Qu.:        0   1st Qu.:0.000000   1st Qu.:0.0e+00  
##  Median :   132706   Median :   214661   Median :0.000000   Median :0.0e+00  
##  Mean   :  1100702   Mean   :  1224996   Mean   :0.001291   Mean   :2.5e-06  
##  3rd Qu.:   943037   3rd Qu.:  1111909   3rd Qu.:0.000000   3rd Qu.:0.0e+00  
##  Max.   :356015889   Max.   :356179279   Max.   :1.000000   Max.   :1.0e+00  
## 
glimpse(data)
## Rows: 6,362,620
## Columns: 11
## $ step           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ type           <fct> PAYMENT, PAYMENT, TRANSFER, CASH_OUT, PAYMENT, PAYME...
## $ amount         <dbl> 9839.64, 1864.28, 181.00, 181.00, 11668.14, 7817.71,...
## $ nameOrig       <fct> C1231006815, C1666544295, C1305486145, C840083671, C...
## $ oldbalanceOrg  <dbl> 170136.0, 21249.0, 181.0, 181.0, 41554.0, 53860.0, 1...
## $ newbalanceOrig <dbl> 160296.36, 19384.72, 0.00, 0.00, 29885.86, 46042.29,...
## $ nameDest       <fct> M1979787155, M2044282225, C553264065, C38997010, M12...
## $ oldbalanceDest <dbl> 0, 0, 0, 21182, 0, 0, 0, 0, 0, 41898, 10845, 0, 0, 0...
## $ newbalanceDest <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ isFraud        <int> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ isFlaggedFraud <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

Observations

So, there are 11 columns/observations. There are 6,362,620 rows. The Distribution, Summary and Glimpse of the Data is available above for a better understanding.

Checking for NA

Next we will check for missing values and deal with them accordingly.

summary(is.na(data))
##     step            type           amount         nameOrig      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:6362620   FALSE:6362620   FALSE:6362620   FALSE:6362620  
##  oldbalanceOrg   newbalanceOrig   nameDest       oldbalanceDest 
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:6362620   FALSE:6362620   FALSE:6362620   FALSE:6362620  
##  newbalanceDest   isFraud        isFlaggedFraud 
##  Mode :logical   Mode :logical   Mode :logical  
##  FALSE:6362620   FALSE:6362620   FALSE:6362620
colSums(is.na(data))
##           step           type         amount       nameOrig  oldbalanceOrg 
##              0              0              0              0              0 
## newbalanceOrig       nameDest oldbalanceDest newbalanceDest        isFraud 
##              0              0              0              0              0 
## isFlaggedFraud 
##              0

Observations

The Data set do not have any Null Values.

3.3 Cleaned Dataset

The final data set (top 20) can be found below in an interactive table.

datatable(head(data, 20), class = 'cell-border stripe')

4. Task

Counting the Number of Fraudulent Cases

#No. of fraud records
data %>% count(isFraud)
##   isFraud       n
## 1       0 6354407
## 2       1    8213
#percent of fraudulent records
prop.table(table(data$isFraud))*100
## 
##         0         1 
## 99.870918  0.129082

Observations

There are 8213 number of fraudulent transactions. Which is equivalent to around 0.1290%.

data %>% 
  count(type, sort = TRUE) %>%
  ggplot(aes(x = reorder(type, +n), y = n)) +
  geom_col(stat ="identity", color = "black", fill="#641E16") +
  coord_flip() +
  theme_gray() +
  geom_text(aes(label = n), hjust = 2.0, color = "white", size = 3.5) +
  ggtitle("Transactions as per Type", subtitle = "") + 
  xlab('Transaction Type') + 
  ylab('No of transactions') +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Observations

Above is the visual representations of the Transactions per type.

# Number of Fraudulent Transactions
data %>% count(isFraud)
##   isFraud       n
## 1       0 6354407
## 2       1    8213
#percent of fraudulent records
prop.table(table(data$isFraud))*100
## 
##         0         1 
## 99.870918  0.129082
#Finding out the category/type from which more no. of fraud transactions took place
fraud_trans <- data %>% 
  group_by(type) %>% 
  summarise(fraud_transactions = sum(isFraud))

fraud_trans
## # A tibble: 5 x 2
##   type     fraud_transactions
##   <fct>                 <int>
## 1 CASH_IN                   0
## 2 CASH_OUT               4116
## 3 DEBIT                     0
## 4 PAYMENT                   0
## 5 TRANSFER               4097
fraud_trans %>% 
  ggplot(aes(x = reorder(type, +fraud_transactions), y = fraud_transactions)) +
  geom_col(stat ="identity", color = "black", fill="#641E16") +
  coord_flip() +
  theme_gray() +
  geom_text(aes(label = fraud_transactions), hjust = 2.0, color = "white", size = 3.5) +
  ggtitle("Fraud Transactions Per Type", subtitle = "") + 
  xlab('Transaction Type') + 
  ylab('No of Fraud Transactions') +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Observations

Cash Out and Transfer are the ones with all the numbers of Fraudulent Transactions with 4116 and 4097 transactions respectively,

#frequency distribution of amount in fradulent transactions

data %>% 
  filter(isFraud == 1) %>% 
  ggplot(aes(x = amount,  fill = amount)) +
  geom_histogram(bins = 50, aes(fill = 'Amt'), fill="#641E16") +
  theme_gray() +
  ggtitle('Fraud transaction Amount distribution', subtitle = "") + 
  xlab('Amount in Dollars') + 
  ylab('No of Fraud Transactions') +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Observations

The Distribution is rightly screwed which means that the fraudulent transactions are generally of small amount.

#understanding Step

summary(data$step)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   156.0   239.0   243.4   335.0   743.0
summary(is.na(data$step))
##    Mode   FALSE 
## logical 6362620

Observations

Step - Maps a unit of time in the real world. In this case 1 step is 1 hour of time.

#Distribution of transactions at different time intervals
timeInt <- data %>% 
  ggplot(aes(x = step)) +
  geom_histogram(bins = 150, aes(fill = 'isFraud'), show.legend = FALSE, fill="#641E16") +
  theme_gray() +
  ggtitle('Total transactions at different time interval', subtitle = "") + 
  xlab('Steps') + 
  ylab('No. of transactions') +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

fradtimInt <- data %>% 
  filter(isFraud == 1) %>% 
  ggplot(aes(x = step)) +
  geom_histogram(bins =150, aes(fill = 'isFraud'), show.legend = FALSE, fill="#641E16") +
  theme_gray() +
  ggtitle('Fraud transactions at different time interval', subtitle = "") + 
  xlab('Steps') + 
  ylab('No.of fraud transactions') +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

grid.arrange(timeInt, fradtimInt, nrow=2, ncol=1)

Observations

These shows the Number of Fraudulent transactions and no. of Transactions step-wise. The overall transactions drop after 400 but the fradulent transactions remain the same.

Random Forest Algorithm

#creating new dataset with filtering and subsetting
data_filt <- data %>% 
  select(-c('step','nameOrig', 'nameDest', 'isFlaggedFraud')) %>%
  filter(type %in% c('CASH_OUT', 'TRANSFER'))

table(data_filt$type)
## 
##  CASH_IN CASH_OUT    DEBIT  PAYMENT TRANSFER 
##        0  2237500        0        0   532909
summary(data_filt)
##        type             amount         oldbalanceOrg      newbalanceOrig    
##  CASH_IN :      0   Min.   :       0   Min.   :       0   Min.   :       0  
##  CASH_OUT:2237500   1st Qu.:   82974   1st Qu.:       0   1st Qu.:       0  
##  DEBIT   :      0   Median :  171261   Median :     304   Median :       0  
##  PAYMENT :      0   Mean   :  317536   Mean   :   47643   Mean   :   16092  
##  TRANSFER: 532909   3rd Qu.:  306791   3rd Qu.:   30997   3rd Qu.:       0  
##                     Max.   :92445517   Max.   :59585040   Max.   :49585040  
##  oldbalanceDest      newbalanceDest         isFraud        
##  Min.   :        0   Min.   :        0   Min.   :0.000000  
##  1st Qu.:   128073   1st Qu.:   327289   1st Qu.:0.000000  
##  Median :   555710   Median :   828054   Median :0.000000  
##  Mean   :  1703551   Mean   :  2049734   Mean   :0.002965  
##  3rd Qu.:  1735900   3rd Qu.:  2122198   3rd Qu.:0.000000  
##  Max.   :356015889   Max.   :356179279   Max.   :1.000000

Cconverting isFraud to Factor for my Analysis.

#convert isFraud variable into factor variable for developing model
data_filt$isFraud <- as.factor(data_filt$isFraud)

str(data_filt)
## 'data.frame':    2770409 obs. of  7 variables:
##  $ type          : Factor w/ 5 levels "CASH_IN","CASH_OUT",..: 5 2 2 5 5 2 2 2 2 5 ...
##  $ amount        : num  181 181 229134 215310 311686 ...
##  $ oldbalanceOrg : num  181 181 15325 705 10835 ...
##  $ newbalanceOrig: num  0 0 0 0 0 ...
##  $ oldbalanceDest: num  0 21182 5083 22425 6267 ...
##  $ newbalanceDest: num  0 0 51513 0 2719173 ...
##  $ isFraud       : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
head(data_filt)
##       type   amount oldbalanceOrg newbalanceOrig oldbalanceDest newbalanceDest
## 1 TRANSFER    181.0        181.00              0              0           0.00
## 2 CASH_OUT    181.0        181.00              0          21182           0.00
## 3 CASH_OUT 229133.9      15325.00              0           5083       51513.44
## 4 TRANSFER 215310.3        705.00              0          22425           0.00
## 5 TRANSFER 311685.9      10835.00              0           6267     2719172.89
## 6 CASH_OUT 110414.7      26845.41              0         288800        2415.16
##   isFraud
## 1       1
## 2       1
## 3       0
## 4       0
## 5       0
## 6       0
#splitting the data into train and test
set.seed(99999)
index <- sample(2, nrow(data_filt), replace = T, prob = c(.8,.2))

train <- data_filt[index==1,]
test <- data_filt[index==2,]

dim(train)
## [1] 2216587       7
dim(test)
## [1] 553822      7
#Developing a model on train data with Random Forest algorithm
library(randomForest)
memory.limit(1000000)
## [1] 1e+06
fit_forest <- randomForest(isFraud ~ ., data = train, ntree=20)

Observations

I developed a Model on Train Data with Random Forest Algorithm.

#printing

print(fit_forest)
## 
## Call:
##  randomForest(formula = isFraud ~ ., data = train, ntree = 20) 
##                Type of random forest: classification
##                      Number of trees: 20
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.08%
## Confusion matrix:
##         0    1  class.error
## 0 2209628  140 6.335507e-05
## 1    1699 4906 2.572294e-01

Observations

Printing the Observation of the Fitted Model.

#prediction on training data
p1 <- predict(fit_forest, train)

Observations

Performed Prediction on the training Model. he Values can be seen here.

#Confusion Matrix on train data
library(caret)
confusionMatrix(train$isFraud, p1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction       0       1
##          0 2209936      45
##          1    1431    5175
##                                           
##                Accuracy : 0.9993          
##                  95% CI : (0.9993, 0.9994)
##     No Information Rate : 0.9976          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8749          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9994          
##             Specificity : 0.9914          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.7834          
##              Prevalence : 0.9976          
##          Detection Rate : 0.9970          
##    Detection Prevalence : 0.9970          
##       Balanced Accuracy : 0.9954          
##                                           
##        'Positive' Class : 0               
## 
#prediction on test data
p2 <- predict(fit_forest, test)
#confusion matrix on test data
confusionMatrix(test$isFraud, p2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 552186     29
##          1    439   1168
##                                           
##                Accuracy : 0.9992          
##                  95% CI : (0.9991, 0.9992)
##     No Information Rate : 0.9978          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8327          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9992          
##             Specificity : 0.9758          
##          Pos Pred Value : 0.9999          
##          Neg Pred Value : 0.7268          
##              Prevalence : 0.9978          
##          Detection Rate : 0.9970          
##    Detection Prevalence : 0.9971          
##       Balanced Accuracy : 0.9875          
##                                           
##        'Positive' Class : 0               
## 
#Tuning parameters
#optimal no. of trees
plot(fit_forest)

hist(treesize(fit_forest),
     main = "No. of Nodes for the Trees",
     col = "#641E16")

#feature importance
fit_forest$importance
##                MeanDecreaseGini
## type                  1590.4526
## amount                1613.2493
## oldbalanceOrg         3330.4982
## newbalanceOrig         896.1162
## oldbalanceDest         738.8763
## newbalanceDest        2957.5363
varImpPlot(fit_forest)

importance_matrix <- data.frame(Variables = rownames(fit_forest$importance), fit_forest$importance, row.names = NULL)

importance_matrix %>% 
  ggplot( aes(y = MeanDecreaseGini , x = Variables,))+
  geom_col(stat ="identity", color = "black", fill="#641E16") + 
  coord_flip() +
  theme_gray() +
  geom_text(aes(label = round(MeanDecreaseGini,2)), hjust = 2.0, color = "white", size = 3.5) +
  ggtitle('Variiable importance plot', subtitle = "") + 
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Logistic Regression Algorithm

dim(test)
## [1] 553822      7

Building a Logistic Regression Model

model <- glm(isFraud ~.,family=binomial(link='logit'),data=train)

Summary of the Logistic Model

summary(model)
## 
## Call:
## glm(formula = isFraud ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.0441  -0.0147  -0.0029   8.4904  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -5.593e+00  2.741e-02 -204.04   <2e-16 ***
## typeTRANSFER    1.659e+00  3.620e-02   45.84   <2e-16 ***
## amount         -1.739e-05  4.522e-07  -38.45   <2e-16 ***
## oldbalanceOrg   2.942e-05  4.625e-07   63.62   <2e-16 ***
## newbalanceOrig -3.234e-05  4.789e-07  -67.54   <2e-16 ***
## oldbalanceDest  4.684e-06  9.797e-08   47.81   <2e-16 ***
## newbalanceDest -5.022e-06  9.694e-08  -51.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 90030  on 2216586  degrees of freedom
## Residual deviance: 35495  on 2216580  degrees of freedom
## AIC: 35509
## 
## Number of Fisher Scoring iterations: 19

Anova of the Logistic Model

anova(model, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: isFraud
## 
## Terms added sequentially (first to last)
## 
## 
##                Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                         2216586      90030              
## type            1   3142.7   2216585      86887 < 2.2e-16 ***
## amount          1    903.7   2216584      85983 < 2.2e-16 ***
## oldbalanceOrg   1  21020.7   2216583      64963 < 2.2e-16 ***
## newbalanceOrig  1  26371.5   2216582      38591 < 2.2e-16 ***
## oldbalanceDest  1    778.2   2216581      37813 < 2.2e-16 ***
## newbalanceDest  1   2318.6   2216580      35495 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observations

The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better.