We have received a dataset of around 1765 Codigos. The name of the dataset is the Arca Data. Our first analysis is to see if we can predict somehow the Customer Risk Label. For this analysis I considered the Customer Avg Difference days between transactions, the Number of Total transactions, The Number of In and Out transactions, the MaxAmount, the AvgAmount, AvgAmountIn, AvgAmountOut, the Number of Beneficiary Accounts the Standard Deviation of the Amount the Median Amount and the Customer Type. At the beginning we applied a Principal Component Analysis and we tried to represent the Risk Labels in two and three dimensions by taking the top 2-3 most explanatory PC. As we can see from the plot below, there is an overalp between the Risk Lables, actualy each one seems to be a subset of the other. Thus indicates that those variables cannot explain well the Risk Levels.
The table below represents the Centers of the Risk Levels. More particiularly is the Average of Each Measure By CODIGO. For example if in Risk=1 there exists there CODIGOS with AvgAmount 100, 200, 300 the Average Amount of Risk=1 will be equal to 200.
As we can see is the Risk=1 gets higher values in all measure than Risk=2, and Risk=2 gets higher values than Risk=3.
| CustomerRisk | Avg_Diff | N_In_Tr | N_Out_Tr | N_Tr | MaxAmount | AvgAmount | AvgAmountIn | AvgAmountOut | NumberOfBeneficiaryAccounts | StDevAmount | MedianAmount |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 28.03817 | 7.350516 | 12.628866 | 19.979381 | 337652.92 | 68867.899 | 82922.474 | 99383.650 | 6.876289 | 91349.095 | 40351.397 |
| 2 | 29.93685 | 3.853470 | 5.026992 | 8.880463 | 67542.39 | 25052.658 | 27337.631 | 19273.004 | 3.363753 | 18797.486 | 19413.636 |
| 3 | 27.79898 | 2.737500 | 2.923214 | 5.660714 | 12109.65 | 5249.971 | 3795.745 | 4835.867 | 2.208929 | 3017.811 | 4401.577 |
From the PCA Analysis we can see that the Risk Labels appear to be a subset of each other. The problem is that Risk=1 gets higher Values in some variables like MaxAmount, Avgmount etc but they have also vary big variance and this is the reason why the “ellipses” in the plot are subsets of each others.
Summary_Variables<-Summary_Codigo_Risk%>%dplyr::select(-CustomerRisk, -Customer_Type, -CODIGO)
Variables_PCA<-prcomp(Summary_Variables, center = TRUE, scale. = TRUE)
summary(Variables_PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.1349 1.7564 0.99869 0.91289 0.76226 0.70538
## Proportion of Variance 0.4144 0.2804 0.09067 0.07576 0.05282 0.04523
## Cumulative Proportion 0.4144 0.6948 0.78548 0.86124 0.91406 0.95929
## PC7 PC8 PC9 PC10 PC11
## Standard deviation 0.50228 0.36198 0.22263 0.12205 6.846e-15
## Proportion of Variance 0.02293 0.01191 0.00451 0.00135 0.000e+00
## Cumulative Proportion 0.98223 0.99414 0.99865 1.00000 1.000e+00
g <- ggbiplot(Variables_PCA, obs.scale = 1, var.scale = 1,
groups = as.factor(Summary_Codigo_Risk$CustomerRisk), ellipse = TRUE)
g <- g + scale_color_discrete(name = '')+ggtitle("PCA on Customer Risk")
print(g)
knit_hooks$set(webgl = hook_webgl)
plot3d(Variables_PCA$x, col=(Summary_Codigo_Risk$CustomerRisk))
You must enable Javascript to view this page properly.
We tried to apply a Decision Tree in order to set rules regarding the Risk Labels based on Customers History. The Decision Tree suggests than on the AvgAmountIn can explain sufficiently the Risk Labels. Finally there is no Rule for Risk Label = 1
set.seed(101)
df<-Summary_Codigo_Risk%>%dplyr::select(-CODIGO)
df$CustomerRisk<-as.factor(df$CustomerRisk)
tree <- rpart(CustomerRisk ~., data = df, control = rpart.control(cp = 0.001) )
bestcp <- tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"]
tree.pruned <- prune(tree, cp = bestcp)
tree.pruned
## n= 1995
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1995 875 3 (0.04862155 0.38997494 0.56140351)
## 2) AvgAmountIn>=4979.25 558 211 2 (0.11648746 0.62186380 0.26164875) *
## 3) AvgAmountIn< 4979.25 1437 463 3 (0.02226862 0.29993041 0.67780097) *
prp(tree.pruned, faclen = 0, cex = 0.8, extra = 1)
fancyRpartPlot(tree.pruned, sub="")
The Error Rate of the Model is around 34%
predictions<-predict(tree.pruned, df, "class")
actual<-df$CustomerRisk
table(predictions, actual)
## actual
## predictions 1 2 3
## 1 0 0 0
## 2 65 347 146
## 3 32 431 974
err_rate_tree<-length(predictions[!predictions==actual])/nrow(df)
err_rate_tree
## [1] 0.3378446
avg_send_accounts<-df%>%select(CODIGO, BeneficiaryAccount, Amount, TransactionType)%>%filter(TransactionType=='out')%>%group_by(CODIGO)%>%summarise(Number_of_Beneficiary_Accounts=length(unique(BeneficiaryAccount)))%>%ungroup%>%summarise(AvgAccounts=mean(Number_of_Beneficiary_Accounts))
kable(avg_send_accounts, format="markdown")
| AvgAccounts |
|---|
| 2.838446 |
send<-df%>%select(CODIGO, BeneficiaryAccount, Amount, TransactionType)%>%filter(TransactionType=='out')%>%group_by(CODIGO)%>%summarise(Number_of_Beneficiary_Accounts=length(unique(BeneficiaryAccount)))%>%arrange(Number_of_Beneficiary_Accounts)%>%ungroup()%>%group_by(Number_of_Beneficiary_Accounts)%>%summarise(Number_of_CODIGOS=n())%>%mutate(Proportion=Number_of_CODIGOS/sum(Number_of_CODIGOS))
kable(send, format="markdown")
| Number_of_Beneficiary_Accounts | Number_of_CODIGOS | Proportion |
|---|---|---|
| 1 | 824 | 0.5616905 |
| 2 | 232 | 0.1581459 |
| 3 | 111 | 0.0756646 |
| 4 | 65 | 0.0443081 |
| 5 | 53 | 0.0361282 |
| 6 | 44 | 0.0299932 |
| 7 | 34 | 0.0231766 |
| 8 | 20 | 0.0136333 |
| 9 | 18 | 0.0122699 |
| 10 | 14 | 0.0095433 |
| 11 | 11 | 0.0074983 |
| 12 | 9 | 0.0061350 |
| 13 | 4 | 0.0027267 |
| 14 | 2 | 0.0013633 |
| 15 | 3 | 0.0020450 |
| 16 | 3 | 0.0020450 |
| 17 | 1 | 0.0006817 |
| 19 | 2 | 0.0013633 |
| 20 | 1 | 0.0006817 |
| 21 | 2 | 0.0013633 |
| 26 | 1 | 0.0006817 |
| 28 | 2 | 0.0013633 |
| 30 | 1 | 0.0006817 |
| 31 | 1 | 0.0006817 |
| 32 | 1 | 0.0006817 |
| 35 | 2 | 0.0013633 |
| 36 | 1 | 0.0006817 |
| 40 | 1 | 0.0006817 |
| 44 | 1 | 0.0006817 |
| 55 | 1 | 0.0006817 |
| 57 | 1 | 0.0006817 |
| 58 | 1 | 0.0006817 |
avg_receive_accounts<-df%>%select(CODIGO, BeneficiaryAccount, Amount, TransactionType)%>%filter(TransactionType=='in')%>%group_by(CODIGO)%>%summarise(Number_of_Beneficiary_Accounts=length(unique(BeneficiaryAccount)))%>%ungroup%>%summarise(AvgAccounts=mean(Number_of_Beneficiary_Accounts))
kable(avg_receive_accounts, format="markdown")
| AvgAccounts |
|---|
| 1.293251 |
receive<-df%>%select(CODIGO, BeneficiaryAccount, Amount, TransactionType)%>%filter(TransactionType=='in')%>%group_by(CODIGO)%>%summarise(Number_of_Beneficiary_Accounts=length(unique(BeneficiaryAccount)))%>%arrange(Number_of_Beneficiary_Accounts)%>%ungroup()%>%group_by(Number_of_Beneficiary_Accounts)%>%summarise(Number_of_CODIGOS=n())%>%mutate(Proportion=Number_of_CODIGOS/sum(Number_of_CODIGOS))
kable(receive, format="markdown")
| Number_of_Beneficiary_Accounts | Number_of_CODIGOS | Proportion |
|---|---|---|
| 1 | 978 | 0.7587277 |
| 2 | 254 | 0.1970520 |
| 3 | 49 | 0.0380140 |
| 4 | 6 | 0.0046548 |
| 5 | 2 | 0.0015516 |
SEE beneficiary_send_detailed.csv
SEE beneficiary_receive_detailed.csv