Bank Analysis

Exploratory Analysis of Customer Risk

We have received a dataset of around 1765 Codigos. The name of the dataset is the Arca Data. Our first analysis is to see if we can predict somehow the Customer Risk Label. For this analysis I considered the Customer Avg Difference days between transactions, the Number of Total transactions, The Number of In and Out transactions, the MaxAmount, the AvgAmount, AvgAmountIn, AvgAmountOut, the Number of Beneficiary Accounts the Standard Deviation of the Amount the Median Amount and the Customer Type. At the beginning we applied a Principal Component Analysis and we tried to represent the Risk Labels in two and three dimensions by taking the top 2-3 most explanatory PC. As we can see from the plot below, there is an overalp between the Risk Lables, actualy each one seems to be a subset of the other. Thus indicates that those variables cannot explain well the Risk Levels.

The table below represents the Centers of the Risk Levels. More particiularly is the Average of Each Measure By CODIGO. For example if in Risk=1 there exists there CODIGOS with AvgAmount 100, 200, 300 the Average Amount of Risk=1 will be equal to 200.

As we can see is the Risk=1 gets higher values in all measure than Risk=2, and Risk=2 gets higher values than Risk=3.

CustomerRisk	Avg_Diff	N_In_Tr	N_Out_Tr	N_Tr	MaxAmount	AvgAmount	AvgAmountIn	AvgAmountOut	NumberOfBeneficiaryAccounts	StDevAmount	MedianAmount
1	28.03817	7.350516	12.628866	19.979381	337652.92	68867.899	82922.474	99383.650	6.876289	91349.095	40351.397
2	29.93685	3.853470	5.026992	8.880463	67542.39	25052.658	27337.631	19273.004	3.363753	18797.486	19413.636
3	27.79898	2.737500	2.923214	5.660714	12109.65	5249.971	3795.745	4835.867	2.208929	3017.811	4401.577

PCA Analysis and Visualization

From the PCA Analysis we can see that the Risk Labels appear to be a subset of each other. The problem is that Risk=1 gets higher Values in some variables like MaxAmount, Avgmount etc but they have also vary big variance and this is the reason why the “ellipses” in the plot are subsets of each others.

Summary_Variables<-Summary_Codigo_Risk%>%dplyr::select(-CustomerRisk, -Customer_Type, -CODIGO)

Variables_PCA<-prcomp(Summary_Variables, center = TRUE, scale. = TRUE)

summary(Variables_PCA)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.1349 1.7564 0.99869 0.91289 0.76226 0.70538
## Proportion of Variance 0.4144 0.2804 0.09067 0.07576 0.05282 0.04523
## Cumulative Proportion  0.4144 0.6948 0.78548 0.86124 0.91406 0.95929
##                            PC7     PC8     PC9    PC10      PC11
## Standard deviation     0.50228 0.36198 0.22263 0.12205 6.846e-15
## Proportion of Variance 0.02293 0.01191 0.00451 0.00135 0.000e+00
## Cumulative Proportion  0.98223 0.99414 0.99865 1.00000 1.000e+00

g <- ggbiplot(Variables_PCA, obs.scale = 1, var.scale = 1, 
groups = as.factor(Summary_Codigo_Risk$CustomerRisk), ellipse = TRUE)

g <- g + scale_color_discrete(name = '')+ggtitle("PCA on Customer Risk")
print(g)

PCA Analysis and Visualization in 3D

knit_hooks$set(webgl = hook_webgl)
plot3d(Variables_PCA$x, col=(Summary_Codigo_Risk$CustomerRisk))

You must enable Javascript to view this page properly.

Decision Trees

We tried to apply a Decision Tree in order to set rules regarding the Risk Labels based on Customers History. The Decision Tree suggests than on the AvgAmountIn can explain sufficiently the Risk Labels. Finally there is no Rule for Risk Label = 1

set.seed(101)
df<-Summary_Codigo_Risk%>%dplyr::select(-CODIGO)
df$CustomerRisk<-as.factor(df$CustomerRisk)
tree <- rpart(CustomerRisk ~., data = df, control = rpart.control(cp = 0.001) )
bestcp <- tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"]
tree.pruned <- prune(tree, cp = bestcp)

tree.pruned

## n= 1995 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1995 875 3 (0.04862155 0.38997494 0.56140351)  
##   2) AvgAmountIn>=4979.25 558 211 2 (0.11648746 0.62186380 0.26164875) *
##   3) AvgAmountIn< 4979.25 1437 463 3 (0.02226862 0.29993041 0.67780097) *

prp(tree.pruned, faclen = 0, cex = 0.8, extra = 1)

fancyRpartPlot(tree.pruned, sub="")

How well does the Model explain the Data

The Error Rate of the Model is around 34%

predictions<-predict(tree.pruned, df, "class")
actual<-df$CustomerRisk
table(predictions, actual)

##            actual
## predictions   1   2   3
##           1   0   0   0
##           2  65 347 146
##           3  32 431 974

err_rate_tree<-length(predictions[!predictions==actual])/nrow(df)
err_rate_tree

## [1] 0.3378446

Analysis of Beneficiary Accounts

Check if One CODIGO is sending money to multiple Accounts

avg_send_accounts<-df%>%select(CODIGO, BeneficiaryAccount, Amount, TransactionType)%>%filter(TransactionType=='out')%>%group_by(CODIGO)%>%summarise(Number_of_Beneficiary_Accounts=length(unique(BeneficiaryAccount)))%>%ungroup%>%summarise(AvgAccounts=mean(Number_of_Beneficiary_Accounts))

kable(avg_send_accounts, format="markdown")

AvgAccounts
2.838446

send<-df%>%select(CODIGO, BeneficiaryAccount, Amount, TransactionType)%>%filter(TransactionType=='out')%>%group_by(CODIGO)%>%summarise(Number_of_Beneficiary_Accounts=length(unique(BeneficiaryAccount)))%>%arrange(Number_of_Beneficiary_Accounts)%>%ungroup()%>%group_by(Number_of_Beneficiary_Accounts)%>%summarise(Number_of_CODIGOS=n())%>%mutate(Proportion=Number_of_CODIGOS/sum(Number_of_CODIGOS))

kable(send, format="markdown")

Number_of_Beneficiary_Accounts	Number_of_CODIGOS	Proportion
1	824	0.5616905
2	232	0.1581459
3	111	0.0756646
4	65	0.0443081
5	53	0.0361282
6	44	0.0299932
7	34	0.0231766
8	20	0.0136333
9	18	0.0122699
10	14	0.0095433
11	11	0.0074983
12	9	0.0061350
13	4	0.0027267
14	2	0.0013633
15	3	0.0020450
16	3	0.0020450
17	1	0.0006817
19	2	0.0013633
20	1	0.0006817
21	2	0.0013633
26	1	0.0006817
28	2	0.0013633
30	1	0.0006817
31	1	0.0006817
32	1	0.0006817
35	2	0.0013633
36	1	0.0006817
40	1	0.0006817
44	1	0.0006817
55	1	0.0006817
57	1	0.0006817
58	1	0.0006817

Check if One CODIGO is receiving money from multiple Accounts

avg_receive_accounts<-df%>%select(CODIGO, BeneficiaryAccount, Amount, TransactionType)%>%filter(TransactionType=='in')%>%group_by(CODIGO)%>%summarise(Number_of_Beneficiary_Accounts=length(unique(BeneficiaryAccount)))%>%ungroup%>%summarise(AvgAccounts=mean(Number_of_Beneficiary_Accounts))

kable(avg_receive_accounts, format="markdown")

AvgAccounts
1.293251

receive<-df%>%select(CODIGO, BeneficiaryAccount, Amount, TransactionType)%>%filter(TransactionType=='in')%>%group_by(CODIGO)%>%summarise(Number_of_Beneficiary_Accounts=length(unique(BeneficiaryAccount)))%>%arrange(Number_of_Beneficiary_Accounts)%>%ungroup()%>%group_by(Number_of_Beneficiary_Accounts)%>%summarise(Number_of_CODIGOS=n())%>%mutate(Proportion=Number_of_CODIGOS/sum(Number_of_CODIGOS))

kable(receive, format="markdown")

Number_of_Beneficiary_Accounts	Number_of_CODIGOS	Proportion
1	978	0.7587277
2	254	0.1970520
3	49	0.0380140
4	6	0.0046548
5	2	0.0015516

List of CODIGOS who Send money to Beneficiary Accounts

SEE beneficiary_send_detailed.csv

List of CODIGOS who Recieve money to Beneficiary Accounts

SEE beneficiary_receive_detailed.csv