Exploratory Analysis of Customer Risk

We have received a dataset of around 1765 Codigos. The name of the dataset is the Arca Data. Our first analysis is to see if we can predict somehow the Customer Risk Label. For this analysis I considered the Customer Avg Difference days between transactions, the Number of Total transactions, The Number of In and Out transactions, the MaxAmount, the AvgAmount, AvgAmountIn, AvgAmountOut, the Number of Beneficiary Accounts the Standard Deviation of the Amount the Median Amount and the Customer Type. At the beginning we applied a Principal Component Analysis and we tried to represent the Risk Labels in two and three dimensions by taking the top 2-3 most explanatory PC. As we can see from the plot below, there is an overalp between the Risk Lables, actualy each one seems to be a subset of the other. Thus indicates that those variables cannot explain well the Risk Levels.

The table below represents the Centers of the Risk Levels. More particiularly is the Average of Each Measure By CODIGO. For example if in Risk=1 there exists there CODIGOS with AvgAmount 100, 200, 300 the Average Amount of Risk=1 will be equal to 200.

As we can see is the Risk=1 gets higher values in all measure than Risk=2, and Risk=2 gets higher values than Risk=3.

CustomerRisk Avg_Diff N_In_Tr N_Out_Tr N_Tr MaxAmount AvgAmount AvgAmountIn AvgAmountOut NumberOfBeneficiaryAccounts StDevAmount MedianAmount
1 28.03817 7.350516 12.628866 19.979381 337652.92 68867.899 82922.474 99383.650 6.876289 91349.095 40351.397
2 29.93685 3.853470 5.026992 8.880463 67542.39 25052.658 27337.631 19273.004 3.363753 18797.486 19413.636
3 27.79898 2.737500 2.923214 5.660714 12109.65 5249.971 3795.745 4835.867 2.208929 3017.811 4401.577

PCA Analysis and Visualization

From the PCA Analysis we can see that the Risk Labels appear to be a subset of each other. The problem is that Risk=1 gets higher Values in some variables like MaxAmount, Avgmount etc but they have also vary big variance and this is the reason why the “ellipses” in the plot are subsets of each others.

Summary_Variables<-Summary_Codigo_Risk%>%dplyr::select(-CustomerRisk, -Customer_Type, -CODIGO)

Variables_PCA<-prcomp(Summary_Variables, center = TRUE, scale. = TRUE)

summary(Variables_PCA)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.1349 1.7564 0.99869 0.91289 0.76226 0.70538
## Proportion of Variance 0.4144 0.2804 0.09067 0.07576 0.05282 0.04523
## Cumulative Proportion  0.4144 0.6948 0.78548 0.86124 0.91406 0.95929
##                            PC7     PC8     PC9    PC10      PC11
## Standard deviation     0.50228 0.36198 0.22263 0.12205 6.846e-15
## Proportion of Variance 0.02293 0.01191 0.00451 0.00135 0.000e+00
## Cumulative Proportion  0.98223 0.99414 0.99865 1.00000 1.000e+00
g <- ggbiplot(Variables_PCA, obs.scale = 1, var.scale = 1, 
groups = as.factor(Summary_Codigo_Risk$CustomerRisk), ellipse = TRUE)

g <- g + scale_color_discrete(name = '')+ggtitle("PCA on Customer Risk")
print(g)

PCA Analysis and Visualization in 3D

knit_hooks$set(webgl = hook_webgl)
plot3d(Variables_PCA$x, col=(Summary_Codigo_Risk$CustomerRisk))

You must enable Javascript to view this page properly.

Decision Trees

We tried to apply a Decision Tree in order to set rules regarding the Risk Labels based on Customers History. The Decision Tree suggests than on the AvgAmountIn can explain sufficiently the Risk Labels. Finally there is no Rule for Risk Label = 1

set.seed(101)
df<-Summary_Codigo_Risk%>%dplyr::select(-CODIGO)
df$CustomerRisk<-as.factor(df$CustomerRisk)
tree <- rpart(CustomerRisk ~., data = df, control = rpart.control(cp = 0.001) )
bestcp <- tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"]
tree.pruned <- prune(tree, cp = bestcp)

tree.pruned
## n= 1995 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1995 875 3 (0.04862155 0.38997494 0.56140351)  
##   2) AvgAmountIn>=4979.25 558 211 2 (0.11648746 0.62186380 0.26164875) *
##   3) AvgAmountIn< 4979.25 1437 463 3 (0.02226862 0.29993041 0.67780097) *
prp(tree.pruned, faclen = 0, cex = 0.8, extra = 1)

fancyRpartPlot(tree.pruned, sub="")

How well does the Model explain the Data

The Error Rate of the Model is around 34%

predictions<-predict(tree.pruned, df, "class")
actual<-df$CustomerRisk
table(predictions, actual)
##            actual
## predictions   1   2   3
##           1   0   0   0
##           2  65 347 146
##           3  32 431 974
err_rate_tree<-length(predictions[!predictions==actual])/nrow(df)
err_rate_tree
## [1] 0.3378446

Analysis of Beneficiary Accounts

Check if One CODIGO is sending money to multiple Accounts

avg_send_accounts<-df%>%select(CODIGO, BeneficiaryAccount, Amount, TransactionType)%>%filter(TransactionType=='out')%>%group_by(CODIGO)%>%summarise(Number_of_Beneficiary_Accounts=length(unique(BeneficiaryAccount)))%>%ungroup%>%summarise(AvgAccounts=mean(Number_of_Beneficiary_Accounts))

kable(avg_send_accounts, format="markdown")
AvgAccounts
2.838446
send<-df%>%select(CODIGO, BeneficiaryAccount, Amount, TransactionType)%>%filter(TransactionType=='out')%>%group_by(CODIGO)%>%summarise(Number_of_Beneficiary_Accounts=length(unique(BeneficiaryAccount)))%>%arrange(Number_of_Beneficiary_Accounts)%>%ungroup()%>%group_by(Number_of_Beneficiary_Accounts)%>%summarise(Number_of_CODIGOS=n())%>%mutate(Proportion=Number_of_CODIGOS/sum(Number_of_CODIGOS))

kable(send, format="markdown")
Number_of_Beneficiary_Accounts Number_of_CODIGOS Proportion
1 824 0.5616905
2 232 0.1581459
3 111 0.0756646
4 65 0.0443081
5 53 0.0361282
6 44 0.0299932
7 34 0.0231766
8 20 0.0136333
9 18 0.0122699
10 14 0.0095433
11 11 0.0074983
12 9 0.0061350
13 4 0.0027267
14 2 0.0013633
15 3 0.0020450
16 3 0.0020450
17 1 0.0006817
19 2 0.0013633
20 1 0.0006817
21 2 0.0013633
26 1 0.0006817
28 2 0.0013633
30 1 0.0006817
31 1 0.0006817
32 1 0.0006817
35 2 0.0013633
36 1 0.0006817
40 1 0.0006817
44 1 0.0006817
55 1 0.0006817
57 1 0.0006817
58 1 0.0006817

Check if One CODIGO is receiving money from multiple Accounts

avg_receive_accounts<-df%>%select(CODIGO, BeneficiaryAccount, Amount, TransactionType)%>%filter(TransactionType=='in')%>%group_by(CODIGO)%>%summarise(Number_of_Beneficiary_Accounts=length(unique(BeneficiaryAccount)))%>%ungroup%>%summarise(AvgAccounts=mean(Number_of_Beneficiary_Accounts))

kable(avg_receive_accounts, format="markdown")
AvgAccounts
1.293251
receive<-df%>%select(CODIGO, BeneficiaryAccount, Amount, TransactionType)%>%filter(TransactionType=='in')%>%group_by(CODIGO)%>%summarise(Number_of_Beneficiary_Accounts=length(unique(BeneficiaryAccount)))%>%arrange(Number_of_Beneficiary_Accounts)%>%ungroup()%>%group_by(Number_of_Beneficiary_Accounts)%>%summarise(Number_of_CODIGOS=n())%>%mutate(Proportion=Number_of_CODIGOS/sum(Number_of_CODIGOS))

kable(receive, format="markdown")
Number_of_Beneficiary_Accounts Number_of_CODIGOS Proportion
1 978 0.7587277
2 254 0.1970520
3 49 0.0380140
4 6 0.0046548
5 2 0.0015516

List of CODIGOS who Send money to Beneficiary Accounts

SEE beneficiary_send_detailed.csv

List of CODIGOS who Recieve money to Beneficiary Accounts

SEE beneficiary_receive_detailed.csv