df_clients <- readRDS("data/test_client.rds")
df_loans <- readRDS("data/test_loans.rds")
df_txn <- readRDS("data/test_txn.rds")
str(df_clients)
## Classes 'tbl_df', 'tbl' and 'data.frame': 34755 obs. of 10 variables:
## $ clientID_A : int 816 909 910 913 914 915 916 972 973 975 ...
## ..- attr(*, "description")= chr "Andafin's surrogate keys"
## ..- attr(*, "source")= chr ""
## ..- attr(*, "values")= chr "Auto increment"
## ..- attr(*, "displayName")= chr ""
## ..- attr(*, "shortName")= chr ""
## $ dateOfBirth : Date, format: "1970-06-13" "1985-10-25" ...
## $ gender : Factor w/ 3 levels "Missing","Male",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ ageGroup : Factor w/ 7 levels "Missing","60+",..: 4 5 7 3 5 5 4 6 4 4 ...
## $ newSector_A : Factor w/ 7 levels "Missing/other",..: 7 1 2 1 4 6 5 1 2 3 ...
## $ maritalStatus: Factor w/ 4 levels "Missing","Divorced",..: 3 1 1 3 3 1 1 3 1 1 ...
## $ regionID_A : num 11 11 11 11 11 11 11 11 11 11 ...
## $ firstLoanDate: Date, format: "2014-08-25" "2014-03-20" ...
## $ HYcohort : Factor w/ 15 levels "12-H2","13-H1",..: 5 4 4 5 5 5 5 4 4 4 ...
## $ branchID_A : num 4 4 4 4 4 4 4 4 4 4 ...
summary(df_clients)
## clientID_A dateOfBirth gender ageGroup
## Min. : 816 Min. :1753-12-25 Missing: 1 Missing: 745
## 1st Qu.: 28938 1st Qu.:1968-10-31 Male : 140 60+ : 1844
## Median : 44865 Median :1978-04-20 Female :34614 50-59 : 6684
## Mean : 80925 Mean :1977-02-28 40-49 : 9251
## 3rd Qu.: 54518 3rd Qu.:1985-09-17 30-39 :10826
## Max. :498279 Max. :2016-08-11 25-29 : 4040
## <25 : 1365
## newSector_A maritalStatus regionID_A
## Missing/other:7588 Missing :28027 Min. :11.00
## Trades :5186 Divorced: 29 1st Qu.:11.00
## Services :1258 Married : 6503 Median :12.00
## Manufacturing:8131 Single : 196 Mean :11.93
## Livestock :6530 3rd Qu.:13.00
## Fishing :2167 Max. :14.00
## Agriculture :3895
## firstLoanDate HYcohort branchID_A
## Min. :2014-01-01 14-H2 :22706 Min. : 1.00
## 1st Qu.:2014-05-26 14-H1 :12049 1st Qu.: 6.00
## Median :2014-08-11 12-H2 : 0 Median :12.00
## Mean :2014-08-01 13-H1 : 0 Mean :13.79
## 3rd Qu.:2014-10-22 13-H2 : 0 3rd Qu.:18.00
## Max. :2014-12-31 15-H1 : 0 Max. :59.00
## (Other): 0
Checking regionID_A
RegionID_A could be a factor not a number, so let’s check:
unique(df_clients$regionID_A)
## [1] 11 13 14 12
regionID_A ranging fromm 11 to 14, so we convert that field to factor:
df_clients$regionID_A = as.factor(df_clients$regionID_A)
head(df_clients$regionID_A)
## [1] 11 11 11 11 11 11
## Levels: 11 12 13 14
Checking branchID_A
branchID_A could be a factor not a number, so let’s check:
unique(df_clients$branchID_A)
## [1] 4 27 8 42 22 18 54 5 31 1 34 6 24 48 25 16 19 56 7 38 36 9 20
## [24] 23 26 2 15 10 11 51 55 12 57 29 13 59 21 14 58 17 43 44
branchID_A ranging fromm 4 to 58, so we convert that field to factor:
df_clients$branchID_A = as.factor(df_clients$branchID_A)
head(df_clients$branchID_A)
## [1] 4 4 4 4 4 4
## 42 Levels: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ... 59
Checking clientID_A
We check if it exists duplicate client ID, and there is no duplicate entry.
df_clients[duplicated(df_clients$clientID_A),]
## # A tibble: 0 x 10
## # … with 10 variables: clientID_A <int>, dateOfBirth <date>, gender <fct>,
## # ageGroup <fct>, newSector_A <fct>, maritalStatus <fct>,
## # regionID_A <fct>, firstLoanDate <date>, HYcohort <fct>,
## # branchID_A <fct>
Checking missing values
plot_missing(df_clients)
It is no missing values on the plot.
str(df_loans)
## Classes 'tbl_df', 'tbl' and 'data.frame': 77454 obs. of 15 variables:
## $ loanID_A : num 20949 20950 20951 20952 20953 ...
## $ loanType_A : Factor w/ 5 levels "Micro","Fortnight",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ loanDate : Date, format: "2014-01-01" "2014-01-01" ...
## $ amount : num 50000 25000 25000 25000 25000 25000 25000 50000 25000 25000 ...
## $ system_A : chr "Micro Finance" "Micro Finance" "Micro Finance" "Micro Finance" ...
## $ loanSizeGroup : Factor w/ 6 levels ">200k","100-200k",..: 4 5 5 5 5 5 5 4 5 5 ...
## $ isRescheduled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ isRestructuredLoans: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ memberID_A : num 442 464 463 465 466 ...
## $ clientID_A : int 1032 1054 1053 1055 1056 1057 1058 2401 2511 2512 ...
## $ groupID_A : num 3056 3063 3063 3063 3064 ...
## $ centerID_A : num 294 294 294 294 294 294 294 315 312 312 ...
## $ fieldOfficerID_A : chr "371" "371" "371" "371" ...
## $ branchID_A : num 4 4 4 4 4 4 4 4 4 4 ...
## $ regionID_A : num 11 11 11 11 11 11 11 11 11 11 ...
## - attr(*, "creator")= chr "sinh"
## - attr(*, "description")= chr "MF Loans current variables for analysis"
## - attr(*, "createdAt")= POSIXct, format: "2019-11-04 10:15:35"
## - attr(*, "updatedAt")= POSIXct, format: "2019-11-04 10:15:35"
## - attr(*, "updateHistory")=Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 3 variables:
## ..$ modifier : chr "sinh"
## ..$ timestamp: POSIXct, format: "2019-11-04 10:15:35"
## ..$ note : chr "MF Loans current variables for analysis"
summary(df_loans)
## loanID_A loanType_A loanDate
## Min. : 495 Micro :55323 Min. :2014-01-01
## 1st Qu.: 56266 Fortnight : 1702 1st Qu.:2014-05-12
## Median : 87324 Cultivation: 302 Median :2014-08-21
## Mean : 88643 Housing : 0 Mean :2014-08-06
## 3rd Qu.:109032 Others :20127 3rd Qu.:2014-10-28
## Max. :499738 Max. :2014-12-31
##
## amount system_A loanSizeGroup isRescheduled
## Min. : 500 Length:77454 >200k : 13 Mode :logical
## 1st Qu.: 10000 Class :character 100-200k: 227 FALSE:77452
## Median : 25000 Mode :character 75-100k : 2958 TRUE :2
## Mean : 30116 50-75k :12750
## 3rd Qu.: 35000 25-50k :38357
## Max. :750000 <25k :23149
##
## isRestructuredLoans memberID_A clientID_A groupID_A
## Mode :logical Min. : 153 Min. : 743 Min. : 1
## FALSE:77454 1st Qu.: 17538 1st Qu.: 18171 1st Qu.: 5065
## Median : 34205 Median : 34890 Median : 9663
## Mean : 35027 Mean : 53782 Mean :10592
## 3rd Qu.: 46691 3rd Qu.: 47411 3rd Qu.:13854
## Max. :231340 Max. :498279 Max. :80617
## NA's :418 NA's :418
## centerID_A fieldOfficerID_A branchID_A regionID_A
## Min. : 1 Length:77454 Min. : 1.00 Min. :11.0
## 1st Qu.: 511 Class :character 1st Qu.: 5.00 1st Qu.:11.0
## Median :1098 Mode :character Median : 9.00 Median :12.0
## Mean :1408 Mean :10.98 Mean :12.1
## 3rd Qu.:2071 3rd Qu.:15.00 3rd Qu.:13.0
## Max. :7683 Max. :59.00 Max. :14.0
## NA's :418
Checking system_A
unique(df_loans$system_A)
## [1] "Micro Finance" "Non-Micro Finance"
System A consists two values “Micro Finance”, “Non-Micro Finance”. So we convert it to factor:
df_loans$system_A <- as.factor(df_loans$system_A)
summary(df_loans)
## loanID_A loanType_A loanDate
## Min. : 495 Micro :55323 Min. :2014-01-01
## 1st Qu.: 56266 Fortnight : 1702 1st Qu.:2014-05-12
## Median : 87324 Cultivation: 302 Median :2014-08-21
## Mean : 88643 Housing : 0 Mean :2014-08-06
## 3rd Qu.:109032 Others :20127 3rd Qu.:2014-10-28
## Max. :499738 Max. :2014-12-31
##
## amount system_A loanSizeGroup
## Min. : 500 Micro Finance :77036 >200k : 13
## 1st Qu.: 10000 Non-Micro Finance: 418 100-200k: 227
## Median : 25000 75-100k : 2958
## Mean : 30116 50-75k :12750
## 3rd Qu.: 35000 25-50k :38357
## Max. :750000 <25k :23149
##
## isRescheduled isRestructuredLoans memberID_A clientID_A
## Mode :logical Mode :logical Min. : 153 Min. : 743
## FALSE:77452 FALSE:77454 1st Qu.: 17538 1st Qu.: 18171
## TRUE :2 Median : 34205 Median : 34890
## Mean : 35027 Mean : 53782
## 3rd Qu.: 46691 3rd Qu.: 47411
## Max. :231340 Max. :498279
## NA's :418
## groupID_A centerID_A fieldOfficerID_A branchID_A
## Min. : 1 Min. : 1 Length:77454 Min. : 1.00
## 1st Qu.: 5065 1st Qu.: 511 Class :character 1st Qu.: 5.00
## Median : 9663 Median :1098 Mode :character Median : 9.00
## Mean :10592 Mean :1408 Mean :10.98
## 3rd Qu.:13854 3rd Qu.:2071 3rd Qu.:15.00
## Max. :80617 Max. :7683 Max. :59.00
## NA's :418 NA's :418
## regionID_A
## Min. :11.0
## 1st Qu.:11.0
## Median :12.0
## Mean :12.1
## 3rd Qu.:13.0
## Max. :14.0
##
Checking fieldOfficerID_A
unique(df_loans$fieldOfficerID_A)
## [1] "371"
## [2] "364"
## [3] "552"
## [4] "370"
## [5] "56"
## [6] "712"
## [7] "881"
## [8] "668"
## [9] "775"
## [10] "842"
## [11] "330"
## [12] "724"
## [13] "44"
## [14] "614"
## [15] "343"
## [16] "375"
## [17] "884"
## [18] "452"
## [19] "448"
## [20] "562"
## [21] "58"
## [22] "210"
## [23] "676"
## [24] "675"
## [25] "718"
## [26] "388"
## [27] "853"
## [28] "381"
## [29] "399"
## [30] "447"
## [31] "556"
## [32] "123"
## [33] "892"
## [34] "839"
## [35] "802"
## [36] "701"
## [37] "811"
## [38] "476"
## [39] "406"
## [40] "835"
## [41] "817"
## [42] "901"
## [43] "262"
## [44] "331"
## [45] "329"
## [46] "82"
## [47] "29"
## [48] "450"
## [49] "314"
## [50] "512"
## [51] "651"
## [52] "806"
## [53] "779"
## [54] "716"
## [55] "885"
## [56] NA
## [57] "354"
## [58] "569"
## [59] "563"
## [60] "179"
## [61] "605"
## [62] "875"
## [63] "816"
## [64] "603"
## [65] "890"
## [66] "78"
## [67] "60"
## [68] "32"
## [69] "744"
## [70] "297"
## [71] "528"
## [72] "42"
## [73] "591"
## [74] "742"
## [75] "65"
## [76] "288"
## [77] "317"
## [78] "362"
## [79] "361"
## [80] "567"
## [81] "527"
## [82] "13"
## [83] "231"
## [84] "909"
## [85] "680"
## [86] "311"
## [87] "714"
## [88] "363"
## [89] "221"
## [90] "719"
## [91] "79"
## [92] "889"
## [93] "E674EF76-5111-427E-98C0-2F6B18BA28CA"
## [94] "232"
## [95] "59"
## [96] "327"
## [97] "711"
## [98] "459"
## [99] "40E0D868-500B-4244-A396-F591B3B338AF"
## [100] "184"
## [101] "206"
## [102] "688"
## [103] "687"
## [104] "9AB1A76B-C9FC-422F-BDEF-AFAC632743AD"
## [105] "41"
## [106] "3743701E-05FF-44E4-BA01-47B76AAEAC1E"
## [107] "152"
## [108] "309"
## [109] "543"
## [110] "544"
## [111] "47"
## [112] "218"
## [113] "236"
## [114] "AC40277B-A84C-45FF-BF15-1ADF85DC8694"
## [115] "851"
## [116] "573"
## [117] "217"
## [118] "627"
## [119] "483"
## [120] "620"
## [121] "589"
## [122] "564"
## [123] "765"
## [124] "439"
## [125] "441"
## [126] "253"
## [127] "867"
## [128] "358"
## [129] "D2E421F6-BA2D-4CBE-A968-1DACF63E9BB2"
## [130] "873"
## [131] "326"
## [132] "368"
## [133] "574"
## [134] "638"
## [135] "588"
## [136] "713"
## [137] "629"
## [138] "449"
## [139] "834CDACD-B889-4629-B28D-069F4FE86AB6"
## [140] "8B1B234E-A222-48B3-A668-237B6FA05EC9"
## [141] "772"
## [142] "437"
## [143] "674"
## [144] "D0CDE2BE-42C4-4968-A7BC-E0AD61D26633"
## [145] "894"
## [146] "76"
## [147] "8A45C473-C826-46C2-ABF4-10DEC118D220"
## [148] "176"
These includes numbers and characters, few are very long string of characters. We could leave it there to see if it could affect the data analysis later.
Checking branchID_A
branchID_A could be a factor not a number, so let’s check:
unique(df_loans$branchID_A)
## [1] 4 5 6 1 19 7 8 21 2 11 56 10 12 13 9 14 57 58 44 20 15 16 17
## [24] 18 59
branchID_A ranging fromm 4 to 59, so we convert that field to factor:
df_loans$branchID_A = as.factor(df_loans$branchID_A)
head(df_loans$branchID_A)
## [1] 4 4 4 4 4 4
## 25 Levels: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 44 ... 59
Checking regionID_A
unique(df_loans$regionID_A)
## [1] 11 14 12 13
df_loans$regionID_A = as.factor(df_loans$regionID_A)
head(df_loans$regionID_A)
## [1] 11 11 11 11 11 11
## Levels: 11 12 13 14
Checking missing values
plot_missing(df_loans)
Fields fieldOfficerID_A, centerID_A, groupID_A, memberID_A are having missing values, the total of missing values are fromm more than 300 to less than 500, which account for 0.42% to 0.54%.
str(df_txn)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5116997 obs. of 4 variables:
## $ receiptID_A: num 1e+09 1e+09 1e+09 1e+09 1e+09 ...
## $ receiptDate: Date, format: "2014-10-31" "2014-10-31" ...
## $ amount : num 1120 1120 1120 1120 1120 1120 1120 1120 1120 1120 ...
## $ loanID_A : num 1e+05 1e+05 1e+05 1e+05 1e+05 ...
summary(df_txn)
## receiptID_A receiptDate amount
## Min. :2.100e+05 Min. :2014-01-02 Min. : 0
## 1st Qu.:1.140e+09 1st Qu.:2015-03-20 1st Qu.: 950
## Median :1.543e+09 Median :2015-07-13 Median : 1100
## Mean :1.582e+09 Mean :2015-06-25 Mean : 1805
## 3rd Qu.:2.055e+09 3rd Qu.:2015-10-07 3rd Qu.: 2000
## Max. :3.137e+09 Max. :2015-12-31 Max. :18573863
## loanID_A
## Min. : 21
## 1st Qu.:113987
## Median :154254
## Mean :158176
## 3rd Qu.:205492
## Max. :313694
Check the missing data
plot_missing(df_txn)
There is no missing data
Summmary
There are three datasets: clients, loans and receipts
Clients data consists of ID, birth of date, gender, age group, sector, martial status, region, first loan date, HYcohort and branch as below:
summary(df_clients)
## clientID_A dateOfBirth gender ageGroup
## Min. : 816 Min. :1753-12-25 Missing: 1 Missing: 745
## 1st Qu.: 28938 1st Qu.:1968-10-31 Male : 140 60+ : 1844
## Median : 44865 Median :1978-04-20 Female :34614 50-59 : 6684
## Mean : 80925 Mean :1977-02-28 40-49 : 9251
## 3rd Qu.: 54518 3rd Qu.:1985-09-17 30-39 :10826
## Max. :498279 Max. :2016-08-11 25-29 : 4040
## <25 : 1365
## newSector_A maritalStatus regionID_A firstLoanDate
## Missing/other:7588 Missing :28027 11:15350 Min. :2014-01-01
## Trades :5186 Divorced: 29 12:10156 1st Qu.:2014-05-26
## Services :1258 Married : 6503 13: 5667 Median :2014-08-11
## Manufacturing:8131 Single : 196 14: 3582 Mean :2014-08-01
## Livestock :6530 3rd Qu.:2014-10-22
## Fishing :2167 Max. :2014-12-31
## Agriculture :3895
## HYcohort branchID_A
## 14-H2 :22706 4 : 3491
## 14-H1 :12049 9 : 2771
## 12-H2 : 0 21 : 2597
## 13-H1 : 0 18 : 2503
## 13-H2 : 0 20 : 2247
## 15-H1 : 0 12 : 1680
## (Other): 0 (Other):19466
Loans data consists of ID, type of loan, date, amount, system, size group, member ID, client ID, group ID, center ID, office, branch, region as below:
summary(df_loans)
## loanID_A loanType_A loanDate
## Min. : 495 Micro :55323 Min. :2014-01-01
## 1st Qu.: 56266 Fortnight : 1702 1st Qu.:2014-05-12
## Median : 87324 Cultivation: 302 Median :2014-08-21
## Mean : 88643 Housing : 0 Mean :2014-08-06
## 3rd Qu.:109032 Others :20127 3rd Qu.:2014-10-28
## Max. :499738 Max. :2014-12-31
##
## amount system_A loanSizeGroup
## Min. : 500 Micro Finance :77036 >200k : 13
## 1st Qu.: 10000 Non-Micro Finance: 418 100-200k: 227
## Median : 25000 75-100k : 2958
## Mean : 30116 50-75k :12750
## 3rd Qu.: 35000 25-50k :38357
## Max. :750000 <25k :23149
##
## isRescheduled isRestructuredLoans memberID_A clientID_A
## Mode :logical Mode :logical Min. : 153 Min. : 743
## FALSE:77452 FALSE:77454 1st Qu.: 17538 1st Qu.: 18171
## TRUE :2 Median : 34205 Median : 34890
## Mean : 35027 Mean : 53782
## 3rd Qu.: 46691 3rd Qu.: 47411
## Max. :231340 Max. :498279
## NA's :418
## groupID_A centerID_A fieldOfficerID_A branchID_A
## Min. : 1 Min. : 1 Length:77454 4 : 6901
## 1st Qu.: 5065 1st Qu.: 511 Class :character 6 : 6667
## Median : 9663 Median :1098 Mode :character 7 : 6234
## Mean :10592 Mean :1408 2 : 5065
## 3rd Qu.:13854 3rd Qu.:2071 21 : 4864
## Max. :80617 Max. :7683 9 : 4840
## NA's :418 NA's :418 (Other):42883
## regionID_A
## 11:26070
## 12:28017
## 13:12960
## 14:10407
##
##
##
Receipts: having all receipts info: ID, date, amount.
summary(df_txn)
## receiptID_A receiptDate amount
## Min. :2.100e+05 Min. :2014-01-02 Min. : 0
## 1st Qu.:1.140e+09 1st Qu.:2015-03-20 1st Qu.: 950
## Median :1.543e+09 Median :2015-07-13 Median : 1100
## Mean :1.582e+09 Mean :2015-06-25 Mean : 1805
## 3rd Qu.:2.055e+09 3rd Qu.:2015-10-07 3rd Qu.: 2000
## Max. :3.137e+09 Max. :2015-12-31 Max. :18573863
## loanID_A
## Min. : 21
## 1st Qu.:113987
## Median :154254
## Mean :158176
## 3rd Qu.:205492
## Max. :313694
Relationship:
| Data | Data | Connection |
|---|---|---|
| Clients | Loans | Client ID |
| Receipt | Loans | Loan ID |
###Checking the data relationship
temp <- df_loans %>%
anti_join(df_clients, by="clientID_A") %>%
count()
temp
## # A tibble: 1 x 1
## n
## <int>
## 1 29757
temp / count(df_loans) * 100
## n
## 1 38.41893
df_loans %>%
anti_join(df_clients, by="clientID_A") %>%
distinct(clientID_A) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 16729
There are 29757 / 38% of loan transactions with 16729 clients ID having no existence on clients table. Probably we should ask for missing clients on client data table.
Additionally, we can find that there exist repeated information of regionID_A and branchID_A on loans table, which already exist on clients table. We could skip these info on clients table and take into account of these info on clients table.
temp <- df_txn %>%
anti_join(df_loans, by="loanID_A") %>%
count()
temp
## # A tibble: 1 x 1
## n
## <int>
## 1 3446562
temp / count(df_txn) * 100
## n
## 1 67.35517
We could find that 3,446,562 / about 67% of receipts having no its ID on loan table. Probably we should ask for missing these data.
Clients data
df_loans %>%
ggplot( aes(x=loanSizeGroup, y=amount, fill=loanSizeGroup)) +
geom_boxplot() +
ggtitle("Amount by loan size group")
df_loans %>% arrange(desc(amount))
## # A tibble: 77,454 x 15
## loanID_A loanType_A loanDate amount system_A loanSizeGroup
## <dbl> <fct> <date> <dbl> <fct> <fct>
## 1 61293 Micro 2014-06-11 750000 Micro F… >200k
## 2 73697 Micro 2014-06-23 750000 Micro F… >200k
## 3 554 Cultivati… 2014-03-04 400000 Non-Mic… >200k
## 4 109141 Micro 2014-07-14 350000 Micro F… >200k
## 5 115898 Micro 2014-12-17 350000 Micro F… >200k
## 6 115899 Micro 2014-12-17 350000 Micro F… >200k
## 7 115900 Micro 2014-12-17 350000 Micro F… >200k
## 8 37603 Micro 2014-04-07 300000 Micro F… >200k
## 9 573 Cultivati… 2014-08-29 300000 Non-Mic… >200k
## 10 111721 Micro 2014-11-07 300000 Micro F… >200k
## # … with 77,444 more rows, and 9 more variables: isRescheduled <lgl>,
## # isRestructuredLoans <lgl>, memberID_A <dbl>, clientID_A <int>,
## # groupID_A <dbl>, centerID_A <dbl>, fieldOfficerID_A <chr>,
## # branchID_A <fct>, regionID_A <fct>
At the size group greater than 200k, there could be one outlier transaction over 700k of amount. We could consider this later.
df_txn %>% group_by(loanID_A) %>%
summarise(n=n()) %>%
ggplot( aes(y=n)) +
geom_boxplot() +
ggtitle("Distribution of number of receipts per loan transaction")
df_txn %>% group_by(loanID_A) %>%
summarise(n=n()) %>%
arrange(desc(n)) %>%
ggplot( aes(x=n)) +
stat_bin(breaks=seq(0, 60, 5), fill="#69b3a2", color="#e9ecef", alpha=0.9) +
ggtitle("Distribution of number of receipts per loan transaction")
The number of receipts per loan transaction is ranging from 5 to 60, with few outliers having value greater than 55. Also, the highest distribution placed on 0-5 receipts and 15-20 receipts accordingly.
df_txn %>%
ggplot( aes(y=amount)) +
geom_boxplot() +
ggtitle("Amount by loan size group")
cnt_over_2000 <- df_txn %>% filter(amount > 2000) %>% count()
cnt <- df_txn %>% count()
cnt_over_2000 / cnt * 100
## n
## 1 21.12141
summary(df_txn$amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 950 1100 1805 2000 18573863
From the plot, we can see that there are few outliers over 200k, many outliers greater than the third quantiles, 2000. Those amount of outliers account for over 21% of total receipts. The most receipts are distributed from 950 to 2000 with the mean of 1805.
Check if the total paid match with the total amount of loan
temp <- df_txn %>%
group_by(loanID_A) %>%
summarise(totalPaid = sum(amount), n = n()) %>%
inner_join(df_loans, by="loanID_A") %>%
mutate(amount_change = totalPaid - amount,
percent = amount_change / amount * 100)
temp %>%
ggplot( aes(y=amount_change)) +
geom_boxplot() +
ggtitle("Amount paid vs amount of loan")
temp %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 73520
temp %>%
filter(amount_change < 0) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 95
summary(temp$amount_change)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -399910 1400 7500 7934 10400 99340
summary(temp$percent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -100.00 15.00 26.00 23.27 26.00 191.40
There are 73520 loan transactions, in which 95 transaction was not being paid fully or in debt. The distribution of subtraction of total paid and loan amount is from 1400 to 7934, or 15% to 26%, with the mean of 7934 or 23%.
For cleanliness of data, we could focus on few points below:
Clients data:
Loans data:
There are 29757 / 38% of loan transactions with 16729 clients ID having no existence on clients table. Probably we should ask for missing clients on client data table.
Fields fieldOfficerID_A, centerID_A, groupID_A, memberID_A are having missing values, the total ofmissing values are fromm more than 300 to less than 500, which account for 0.42% to 0.54%
Loans vs Clients: