steve-test-interview

Plan to do

Data dictionary
Checking missing values
Checking the relationship between datasets
Checking outliers

Evaluate cleanliness of the dataset

df_clients <- readRDS("data/test_client.rds")
df_loans <- readRDS("data/test_loans.rds")
df_txn <- readRDS("data/test_txn.rds")

1. Clients data

str(df_clients)

## Classes 'tbl_df', 'tbl' and 'data.frame':    34755 obs. of  10 variables:
##  $ clientID_A   : int  816 909 910 913 914 915 916 972 973 975 ...
##   ..- attr(*, "description")= chr "Andafin's surrogate keys"
##   ..- attr(*, "source")= chr ""
##   ..- attr(*, "values")= chr "Auto increment"
##   ..- attr(*, "displayName")= chr ""
##   ..- attr(*, "shortName")= chr ""
##  $ dateOfBirth  : Date, format: "1970-06-13" "1985-10-25" ...
##  $ gender       : Factor w/ 3 levels "Missing","Male",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ ageGroup     : Factor w/ 7 levels "Missing","60+",..: 4 5 7 3 5 5 4 6 4 4 ...
##  $ newSector_A  : Factor w/ 7 levels "Missing/other",..: 7 1 2 1 4 6 5 1 2 3 ...
##  $ maritalStatus: Factor w/ 4 levels "Missing","Divorced",..: 3 1 1 3 3 1 1 3 1 1 ...
##  $ regionID_A   : num  11 11 11 11 11 11 11 11 11 11 ...
##  $ firstLoanDate: Date, format: "2014-08-25" "2014-03-20" ...
##  $ HYcohort     : Factor w/ 15 levels "12-H2","13-H1",..: 5 4 4 5 5 5 5 4 4 4 ...
##  $ branchID_A   : num  4 4 4 4 4 4 4 4 4 4 ...

summary(df_clients)

##    clientID_A      dateOfBirth             gender         ageGroup    
##  Min.   :   816   Min.   :1753-12-25   Missing:    1   Missing:  745  
##  1st Qu.: 28938   1st Qu.:1968-10-31   Male   :  140   60+    : 1844  
##  Median : 44865   Median :1978-04-20   Female :34614   50-59  : 6684  
##  Mean   : 80925   Mean   :1977-02-28                   40-49  : 9251  
##  3rd Qu.: 54518   3rd Qu.:1985-09-17                   30-39  :10826  
##  Max.   :498279   Max.   :2016-08-11                   25-29  : 4040  
##                                                        <25    : 1365  
##         newSector_A    maritalStatus     regionID_A   
##  Missing/other:7588   Missing :28027   Min.   :11.00  
##  Trades       :5186   Divorced:   29   1st Qu.:11.00  
##  Services     :1258   Married : 6503   Median :12.00  
##  Manufacturing:8131   Single  :  196   Mean   :11.93  
##  Livestock    :6530                    3rd Qu.:13.00  
##  Fishing      :2167                    Max.   :14.00  
##  Agriculture  :3895                                   
##  firstLoanDate           HYcohort       branchID_A   
##  Min.   :2014-01-01   14-H2  :22706   Min.   : 1.00  
##  1st Qu.:2014-05-26   14-H1  :12049   1st Qu.: 6.00  
##  Median :2014-08-11   12-H2  :    0   Median :12.00  
##  Mean   :2014-08-01   13-H1  :    0   Mean   :13.79  
##  3rd Qu.:2014-10-22   13-H2  :    0   3rd Qu.:18.00  
##  Max.   :2014-12-31   15-H1  :    0   Max.   :59.00  
##                       (Other):    0

Gender is missing 1 client
ageGroup missing 745 clients, we could use date of birth to refill these missing infos.
newSector_A missing 7588, but we could consider these missing infos as other sector.

Checking regionID_A

RegionID_A could be a factor not a number, so let’s check:

unique(df_clients$regionID_A)

## [1] 11 13 14 12

regionID_A ranging fromm 11 to 14, so we convert that field to factor:

df_clients$regionID_A = as.factor(df_clients$regionID_A)

head(df_clients$regionID_A)

## [1] 11 11 11 11 11 11
## Levels: 11 12 13 14

Checking branchID_A

branchID_A could be a factor not a number, so let’s check:

unique(df_clients$branchID_A)

##  [1]  4 27  8 42 22 18 54  5 31  1 34  6 24 48 25 16 19 56  7 38 36  9 20
## [24] 23 26  2 15 10 11 51 55 12 57 29 13 59 21 14 58 17 43 44

branchID_A ranging fromm 4 to 58, so we convert that field to factor:

df_clients$branchID_A = as.factor(df_clients$branchID_A)

head(df_clients$branchID_A)

## [1] 4 4 4 4 4 4
## 42 Levels: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ... 59

Checking clientID_A

We check if it exists duplicate client ID, and there is no duplicate entry.

df_clients[duplicated(df_clients$clientID_A),]

## # A tibble: 0 x 10
## # … with 10 variables: clientID_A <int>, dateOfBirth <date>, gender <fct>,
## #   ageGroup <fct>, newSector_A <fct>, maritalStatus <fct>,
## #   regionID_A <fct>, firstLoanDate <date>, HYcohort <fct>,
## #   branchID_A <fct>

Checking missing values

plot_missing(df_clients)

It is no missing values on the plot.

2. Loans data

str(df_loans)

## Classes 'tbl_df', 'tbl' and 'data.frame':    77454 obs. of  15 variables:
##  $ loanID_A           : num  20949 20950 20951 20952 20953 ...
##  $ loanType_A         : Factor w/ 5 levels "Micro","Fortnight",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ loanDate           : Date, format: "2014-01-01" "2014-01-01" ...
##  $ amount             : num  50000 25000 25000 25000 25000 25000 25000 50000 25000 25000 ...
##  $ system_A           : chr  "Micro Finance" "Micro Finance" "Micro Finance" "Micro Finance" ...
##  $ loanSizeGroup      : Factor w/ 6 levels ">200k","100-200k",..: 4 5 5 5 5 5 5 4 5 5 ...
##  $ isRescheduled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ isRestructuredLoans: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ memberID_A         : num  442 464 463 465 466 ...
##  $ clientID_A         : int  1032 1054 1053 1055 1056 1057 1058 2401 2511 2512 ...
##  $ groupID_A          : num  3056 3063 3063 3063 3064 ...
##  $ centerID_A         : num  294 294 294 294 294 294 294 315 312 312 ...
##  $ fieldOfficerID_A   : chr  "371" "371" "371" "371" ...
##  $ branchID_A         : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ regionID_A         : num  11 11 11 11 11 11 11 11 11 11 ...
##  - attr(*, "creator")= chr "sinh"
##  - attr(*, "description")= chr "MF Loans current variables for analysis"
##  - attr(*, "createdAt")= POSIXct, format: "2019-11-04 10:15:35"
##  - attr(*, "updatedAt")= POSIXct, format: "2019-11-04 10:15:35"
##  - attr(*, "updateHistory")=Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  3 variables:
##   ..$ modifier : chr "sinh"
##   ..$ timestamp: POSIXct, format: "2019-11-04 10:15:35"
##   ..$ note     : chr "MF Loans current variables for analysis"

summary(df_loans)

##     loanID_A            loanType_A       loanDate         
##  Min.   :   495   Micro      :55323   Min.   :2014-01-01  
##  1st Qu.: 56266   Fortnight  : 1702   1st Qu.:2014-05-12  
##  Median : 87324   Cultivation:  302   Median :2014-08-21  
##  Mean   : 88643   Housing    :    0   Mean   :2014-08-06  
##  3rd Qu.:109032   Others     :20127   3rd Qu.:2014-10-28  
##  Max.   :499738                       Max.   :2014-12-31  
##                                                           
##      amount         system_A          loanSizeGroup   isRescheduled  
##  Min.   :   500   Length:77454       >200k   :   13   Mode :logical  
##  1st Qu.: 10000   Class :character   100-200k:  227   FALSE:77452    
##  Median : 25000   Mode  :character   75-100k : 2958   TRUE :2        
##  Mean   : 30116                      50-75k  :12750                  
##  3rd Qu.: 35000                      25-50k  :38357                  
##  Max.   :750000                      <25k    :23149                  
##                                                                      
##  isRestructuredLoans   memberID_A       clientID_A       groupID_A    
##  Mode :logical       Min.   :   153   Min.   :   743   Min.   :    1  
##  FALSE:77454         1st Qu.: 17538   1st Qu.: 18171   1st Qu.: 5065  
##                      Median : 34205   Median : 34890   Median : 9663  
##                      Mean   : 35027   Mean   : 53782   Mean   :10592  
##                      3rd Qu.: 46691   3rd Qu.: 47411   3rd Qu.:13854  
##                      Max.   :231340   Max.   :498279   Max.   :80617  
##                      NA's   :418                       NA's   :418    
##    centerID_A   fieldOfficerID_A     branchID_A      regionID_A  
##  Min.   :   1   Length:77454       Min.   : 1.00   Min.   :11.0  
##  1st Qu.: 511   Class :character   1st Qu.: 5.00   1st Qu.:11.0  
##  Median :1098   Mode  :character   Median : 9.00   Median :12.0  
##  Mean   :1408                      Mean   :10.98   Mean   :12.1  
##  3rd Qu.:2071                      3rd Qu.:15.00   3rd Qu.:13.0  
##  Max.   :7683                      Max.   :59.00   Max.   :14.0  
##  NA's   :418

Checking system_A

unique(df_loans$system_A)

## [1] "Micro Finance"     "Non-Micro Finance"

System A consists two values “Micro Finance”, “Non-Micro Finance”. So we convert it to factor:

df_loans$system_A <- as.factor(df_loans$system_A)
summary(df_loans)

##     loanID_A            loanType_A       loanDate         
##  Min.   :   495   Micro      :55323   Min.   :2014-01-01  
##  1st Qu.: 56266   Fortnight  : 1702   1st Qu.:2014-05-12  
##  Median : 87324   Cultivation:  302   Median :2014-08-21  
##  Mean   : 88643   Housing    :    0   Mean   :2014-08-06  
##  3rd Qu.:109032   Others     :20127   3rd Qu.:2014-10-28  
##  Max.   :499738                       Max.   :2014-12-31  
##                                                           
##      amount                    system_A      loanSizeGroup  
##  Min.   :   500   Micro Finance    :77036   >200k   :   13  
##  1st Qu.: 10000   Non-Micro Finance:  418   100-200k:  227  
##  Median : 25000                             75-100k : 2958  
##  Mean   : 30116                             50-75k  :12750  
##  3rd Qu.: 35000                             25-50k  :38357  
##  Max.   :750000                             <25k    :23149  
##                                                             
##  isRescheduled   isRestructuredLoans   memberID_A       clientID_A    
##  Mode :logical   Mode :logical       Min.   :   153   Min.   :   743  
##  FALSE:77452     FALSE:77454         1st Qu.: 17538   1st Qu.: 18171  
##  TRUE :2                             Median : 34205   Median : 34890  
##                                      Mean   : 35027   Mean   : 53782  
##                                      3rd Qu.: 46691   3rd Qu.: 47411  
##                                      Max.   :231340   Max.   :498279  
##                                      NA's   :418                      
##    groupID_A       centerID_A   fieldOfficerID_A     branchID_A   
##  Min.   :    1   Min.   :   1   Length:77454       Min.   : 1.00  
##  1st Qu.: 5065   1st Qu.: 511   Class :character   1st Qu.: 5.00  
##  Median : 9663   Median :1098   Mode  :character   Median : 9.00  
##  Mean   :10592   Mean   :1408                      Mean   :10.98  
##  3rd Qu.:13854   3rd Qu.:2071                      3rd Qu.:15.00  
##  Max.   :80617   Max.   :7683                      Max.   :59.00  
##  NA's   :418     NA's   :418                                      
##    regionID_A  
##  Min.   :11.0  
##  1st Qu.:11.0  
##  Median :12.0  
##  Mean   :12.1  
##  3rd Qu.:13.0  
##  Max.   :14.0  
##

Checking fieldOfficerID_A

unique(df_loans$fieldOfficerID_A)

##   [1] "371"                                 
##   [2] "364"                                 
##   [3] "552"                                 
##   [4] "370"                                 
##   [5] "56"                                  
##   [6] "712"                                 
##   [7] "881"                                 
##   [8] "668"                                 
##   [9] "775"                                 
##  [10] "842"                                 
##  [11] "330"                                 
##  [12] "724"                                 
##  [13] "44"                                  
##  [14] "614"                                 
##  [15] "343"                                 
##  [16] "375"                                 
##  [17] "884"                                 
##  [18] "452"                                 
##  [19] "448"                                 
##  [20] "562"                                 
##  [21] "58"                                  
##  [22] "210"                                 
##  [23] "676"                                 
##  [24] "675"                                 
##  [25] "718"                                 
##  [26] "388"                                 
##  [27] "853"                                 
##  [28] "381"                                 
##  [29] "399"                                 
##  [30] "447"                                 
##  [31] "556"                                 
##  [32] "123"                                 
##  [33] "892"                                 
##  [34] "839"                                 
##  [35] "802"                                 
##  [36] "701"                                 
##  [37] "811"                                 
##  [38] "476"                                 
##  [39] "406"                                 
##  [40] "835"                                 
##  [41] "817"                                 
##  [42] "901"                                 
##  [43] "262"                                 
##  [44] "331"                                 
##  [45] "329"                                 
##  [46] "82"                                  
##  [47] "29"                                  
##  [48] "450"                                 
##  [49] "314"                                 
##  [50] "512"                                 
##  [51] "651"                                 
##  [52] "806"                                 
##  [53] "779"                                 
##  [54] "716"                                 
##  [55] "885"                                 
##  [56] NA                                    
##  [57] "354"                                 
##  [58] "569"                                 
##  [59] "563"                                 
##  [60] "179"                                 
##  [61] "605"                                 
##  [62] "875"                                 
##  [63] "816"                                 
##  [64] "603"                                 
##  [65] "890"                                 
##  [66] "78"                                  
##  [67] "60"                                  
##  [68] "32"                                  
##  [69] "744"                                 
##  [70] "297"                                 
##  [71] "528"                                 
##  [72] "42"                                  
##  [73] "591"                                 
##  [74] "742"                                 
##  [75] "65"                                  
##  [76] "288"                                 
##  [77] "317"                                 
##  [78] "362"                                 
##  [79] "361"                                 
##  [80] "567"                                 
##  [81] "527"                                 
##  [82] "13"                                  
##  [83] "231"                                 
##  [84] "909"                                 
##  [85] "680"                                 
##  [86] "311"                                 
##  [87] "714"                                 
##  [88] "363"                                 
##  [89] "221"                                 
##  [90] "719"                                 
##  [91] "79"                                  
##  [92] "889"                                 
##  [93] "E674EF76-5111-427E-98C0-2F6B18BA28CA"
##  [94] "232"                                 
##  [95] "59"                                  
##  [96] "327"                                 
##  [97] "711"                                 
##  [98] "459"                                 
##  [99] "40E0D868-500B-4244-A396-F591B3B338AF"
## [100] "184"                                 
## [101] "206"                                 
## [102] "688"                                 
## [103] "687"                                 
## [104] "9AB1A76B-C9FC-422F-BDEF-AFAC632743AD"
## [105] "41"                                  
## [106] "3743701E-05FF-44E4-BA01-47B76AAEAC1E"
## [107] "152"                                 
## [108] "309"                                 
## [109] "543"                                 
## [110] "544"                                 
## [111] "47"                                  
## [112] "218"                                 
## [113] "236"                                 
## [114] "AC40277B-A84C-45FF-BF15-1ADF85DC8694"
## [115] "851"                                 
## [116] "573"                                 
## [117] "217"                                 
## [118] "627"                                 
## [119] "483"                                 
## [120] "620"                                 
## [121] "589"                                 
## [122] "564"                                 
## [123] "765"                                 
## [124] "439"                                 
## [125] "441"                                 
## [126] "253"                                 
## [127] "867"                                 
## [128] "358"                                 
## [129] "D2E421F6-BA2D-4CBE-A968-1DACF63E9BB2"
## [130] "873"                                 
## [131] "326"                                 
## [132] "368"                                 
## [133] "574"                                 
## [134] "638"                                 
## [135] "588"                                 
## [136] "713"                                 
## [137] "629"                                 
## [138] "449"                                 
## [139] "834CDACD-B889-4629-B28D-069F4FE86AB6"
## [140] "8B1B234E-A222-48B3-A668-237B6FA05EC9"
## [141] "772"                                 
## [142] "437"                                 
## [143] "674"                                 
## [144] "D0CDE2BE-42C4-4968-A7BC-E0AD61D26633"
## [145] "894"                                 
## [146] "76"                                  
## [147] "8A45C473-C826-46C2-ABF4-10DEC118D220"
## [148] "176"

These includes numbers and characters, few are very long string of characters. We could leave it there to see if it could affect the data analysis later.

Checking branchID_A

branchID_A could be a factor not a number, so let’s check:

unique(df_loans$branchID_A)

##  [1]  4  5  6  1 19  7  8 21  2 11 56 10 12 13  9 14 57 58 44 20 15 16 17
## [24] 18 59

branchID_A ranging fromm 4 to 59, so we convert that field to factor:

df_loans$branchID_A = as.factor(df_loans$branchID_A)

head(df_loans$branchID_A)

## [1] 4 4 4 4 4 4
## 25 Levels: 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 44 ... 59

Checking regionID_A

unique(df_loans$regionID_A)

## [1] 11 14 12 13

df_loans$regionID_A = as.factor(df_loans$regionID_A)

head(df_loans$regionID_A)

## [1] 11 11 11 11 11 11
## Levels: 11 12 13 14

Checking missing values

plot_missing(df_loans)

Fields fieldOfficerID_A, centerID_A, groupID_A, memberID_A are having missing values, the total of missing values are fromm more than 300 to less than 500, which account for 0.42% to 0.54%.

2. Receipt data

str(df_txn)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5116997 obs. of  4 variables:
##  $ receiptID_A: num  1e+09 1e+09 1e+09 1e+09 1e+09 ...
##  $ receiptDate: Date, format: "2014-10-31" "2014-10-31" ...
##  $ amount     : num  1120 1120 1120 1120 1120 1120 1120 1120 1120 1120 ...
##  $ loanID_A   : num  1e+05 1e+05 1e+05 1e+05 1e+05 ...

summary(df_txn)

##   receiptID_A         receiptDate             amount        
##  Min.   :2.100e+05   Min.   :2014-01-02   Min.   :       0  
##  1st Qu.:1.140e+09   1st Qu.:2015-03-20   1st Qu.:     950  
##  Median :1.543e+09   Median :2015-07-13   Median :    1100  
##  Mean   :1.582e+09   Mean   :2015-06-25   Mean   :    1805  
##  3rd Qu.:2.055e+09   3rd Qu.:2015-10-07   3rd Qu.:    2000  
##  Max.   :3.137e+09   Max.   :2015-12-31   Max.   :18573863  
##     loanID_A     
##  Min.   :    21  
##  1st Qu.:113987  
##  Median :154254  
##  Mean   :158176  
##  3rd Qu.:205492  
##  Max.   :313694

Check the missing data

plot_missing(df_txn)

There is no missing data

Summmary

There are three datasets: clients, loans and receipts

Clients data consists of ID, birth of date, gender, age group, sector, martial status, region, first loan date, HYcohort and branch as below:

summary(df_clients)

##    clientID_A      dateOfBirth             gender         ageGroup    
##  Min.   :   816   Min.   :1753-12-25   Missing:    1   Missing:  745  
##  1st Qu.: 28938   1st Qu.:1968-10-31   Male   :  140   60+    : 1844  
##  Median : 44865   Median :1978-04-20   Female :34614   50-59  : 6684  
##  Mean   : 80925   Mean   :1977-02-28                   40-49  : 9251  
##  3rd Qu.: 54518   3rd Qu.:1985-09-17                   30-39  :10826  
##  Max.   :498279   Max.   :2016-08-11                   25-29  : 4040  
##                                                        <25    : 1365  
##         newSector_A    maritalStatus   regionID_A firstLoanDate       
##  Missing/other:7588   Missing :28027   11:15350   Min.   :2014-01-01  
##  Trades       :5186   Divorced:   29   12:10156   1st Qu.:2014-05-26  
##  Services     :1258   Married : 6503   13: 5667   Median :2014-08-11  
##  Manufacturing:8131   Single  :  196   14: 3582   Mean   :2014-08-01  
##  Livestock    :6530                               3rd Qu.:2014-10-22  
##  Fishing      :2167                               Max.   :2014-12-31  
##  Agriculture  :3895                                                   
##     HYcohort       branchID_A   
##  14-H2  :22706   4      : 3491  
##  14-H1  :12049   9      : 2771  
##  12-H2  :    0   21     : 2597  
##  13-H1  :    0   18     : 2503  
##  13-H2  :    0   20     : 2247  
##  15-H1  :    0   12     : 1680  
##  (Other):    0   (Other):19466

Loans data consists of ID, type of loan, date, amount, system, size group, member ID, client ID, group ID, center ID, office, branch, region as below:

summary(df_loans)

##     loanID_A            loanType_A       loanDate         
##  Min.   :   495   Micro      :55323   Min.   :2014-01-01  
##  1st Qu.: 56266   Fortnight  : 1702   1st Qu.:2014-05-12  
##  Median : 87324   Cultivation:  302   Median :2014-08-21  
##  Mean   : 88643   Housing    :    0   Mean   :2014-08-06  
##  3rd Qu.:109032   Others     :20127   3rd Qu.:2014-10-28  
##  Max.   :499738                       Max.   :2014-12-31  
##                                                           
##      amount                    system_A      loanSizeGroup  
##  Min.   :   500   Micro Finance    :77036   >200k   :   13  
##  1st Qu.: 10000   Non-Micro Finance:  418   100-200k:  227  
##  Median : 25000                             75-100k : 2958  
##  Mean   : 30116                             50-75k  :12750  
##  3rd Qu.: 35000                             25-50k  :38357  
##  Max.   :750000                             <25k    :23149  
##                                                             
##  isRescheduled   isRestructuredLoans   memberID_A       clientID_A    
##  Mode :logical   Mode :logical       Min.   :   153   Min.   :   743  
##  FALSE:77452     FALSE:77454         1st Qu.: 17538   1st Qu.: 18171  
##  TRUE :2                             Median : 34205   Median : 34890  
##                                      Mean   : 35027   Mean   : 53782  
##                                      3rd Qu.: 46691   3rd Qu.: 47411  
##                                      Max.   :231340   Max.   :498279  
##                                      NA's   :418                      
##    groupID_A       centerID_A   fieldOfficerID_A     branchID_A   
##  Min.   :    1   Min.   :   1   Length:77454       4      : 6901  
##  1st Qu.: 5065   1st Qu.: 511   Class :character   6      : 6667  
##  Median : 9663   Median :1098   Mode  :character   7      : 6234  
##  Mean   :10592   Mean   :1408                      2      : 5065  
##  3rd Qu.:13854   3rd Qu.:2071                      21     : 4864  
##  Max.   :80617   Max.   :7683                      9      : 4840  
##  NA's   :418     NA's   :418                       (Other):42883  
##  regionID_A
##  11:26070  
##  12:28017  
##  13:12960  
##  14:10407  
##            
##            
##

Receipts: having all receipts info: ID, date, amount.

summary(df_txn)

##   receiptID_A         receiptDate             amount        
##  Min.   :2.100e+05   Min.   :2014-01-02   Min.   :       0  
##  1st Qu.:1.140e+09   1st Qu.:2015-03-20   1st Qu.:     950  
##  Median :1.543e+09   Median :2015-07-13   Median :    1100  
##  Mean   :1.582e+09   Mean   :2015-06-25   Mean   :    1805  
##  3rd Qu.:2.055e+09   3rd Qu.:2015-10-07   3rd Qu.:    2000  
##  Max.   :3.137e+09   Max.   :2015-12-31   Max.   :18573863  
##     loanID_A     
##  Min.   :    21  
##  1st Qu.:113987  
##  Median :154254  
##  Mean   :158176  
##  3rd Qu.:205492  
##  Max.   :313694

Relationship:

Data	Data	Connection
Clients	Loans	Client ID
Receipt	Loans	Loan ID

###Checking the data relationship

temp <- df_loans %>%
  anti_join(df_clients, by="clientID_A") %>% 
  count()

temp

## # A tibble: 1 x 1
##       n
##   <int>
## 1 29757

temp / count(df_loans) * 100

##          n
## 1 38.41893

df_loans %>%
  anti_join(df_clients, by="clientID_A") %>% 
  distinct(clientID_A) %>% 
  count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1 16729

There are 29757 / 38% of loan transactions with 16729 clients ID having no existence on clients table. Probably we should ask for missing clients on client data table.

Additionally, we can find that there exist repeated information of regionID_A and branchID_A on loans table, which already exist on clients table. We could skip these info on clients table and take into account of these info on clients table.

temp <- df_txn %>%
  anti_join(df_loans, by="loanID_A") %>% 
  count()

temp

## # A tibble: 1 x 1
##         n
##     <int>
## 1 3446562

temp / count(df_txn) * 100

##          n
## 1 67.35517

We could find that 3,446,562 / about 67% of receipts having no its ID on loan table. Probably we should ask for missing these data.

Check outliers

Clients data

df_loans %>%
  ggplot( aes(x=loanSizeGroup, y=amount, fill=loanSizeGroup)) +
    geom_boxplot() +
    ggtitle("Amount by loan size group")

df_loans %>% arrange(desc(amount))

## # A tibble: 77,454 x 15
##    loanID_A loanType_A loanDate   amount system_A loanSizeGroup
##       <dbl> <fct>      <date>      <dbl> <fct>    <fct>        
##  1    61293 Micro      2014-06-11 750000 Micro F… >200k        
##  2    73697 Micro      2014-06-23 750000 Micro F… >200k        
##  3      554 Cultivati… 2014-03-04 400000 Non-Mic… >200k        
##  4   109141 Micro      2014-07-14 350000 Micro F… >200k        
##  5   115898 Micro      2014-12-17 350000 Micro F… >200k        
##  6   115899 Micro      2014-12-17 350000 Micro F… >200k        
##  7   115900 Micro      2014-12-17 350000 Micro F… >200k        
##  8    37603 Micro      2014-04-07 300000 Micro F… >200k        
##  9      573 Cultivati… 2014-08-29 300000 Non-Mic… >200k        
## 10   111721 Micro      2014-11-07 300000 Micro F… >200k        
## # … with 77,444 more rows, and 9 more variables: isRescheduled <lgl>,
## #   isRestructuredLoans <lgl>, memberID_A <dbl>, clientID_A <int>,
## #   groupID_A <dbl>, centerID_A <dbl>, fieldOfficerID_A <chr>,
## #   branchID_A <fct>, regionID_A <fct>

At the size group greater than 200k, there could be one outlier transaction over 700k of amount. We could consider this later.

df_txn %>% group_by(loanID_A) %>%
    summarise(n=n()) %>%
    ggplot( aes(y=n)) +
      geom_boxplot() +
      ggtitle("Distribution of number of receipts per loan transaction")

df_txn %>% group_by(loanID_A) %>%
    summarise(n=n()) %>%
    arrange(desc(n)) %>%
    ggplot( aes(x=n)) +
      stat_bin(breaks=seq(0, 60, 5), fill="#69b3a2", color="#e9ecef", alpha=0.9) +
      ggtitle("Distribution of number of receipts per loan transaction")

The number of receipts per loan transaction is ranging from 5 to 60, with few outliers having value greater than 55. Also, the highest distribution placed on 0-5 receipts and 15-20 receipts accordingly.

df_txn %>%
  ggplot( aes(y=amount)) +
    geom_boxplot() +
    ggtitle("Amount by loan size group")

cnt_over_2000 <- df_txn %>% filter(amount > 2000) %>% count()
cnt <- df_txn %>% count()

cnt_over_2000 / cnt * 100

##          n
## 1 21.12141

summary(df_txn$amount)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0      950     1100     1805     2000 18573863

From the plot, we can see that there are few outliers over 200k, many outliers greater than the third quantiles, 2000. Those amount of outliers account for over 21% of total receipts. The most receipts are distributed from 950 to 2000 with the mean of 1805.

Few insights

Check if the total paid match with the total amount of loan

temp <- df_txn %>% 
  group_by(loanID_A) %>%
  summarise(totalPaid = sum(amount), n = n()) %>%
  inner_join(df_loans, by="loanID_A") %>%
  mutate(amount_change = totalPaid - amount, 
         percent = amount_change / amount * 100) 


temp %>%
  ggplot( aes(y=amount_change)) +
    geom_boxplot() +
    ggtitle("Amount paid vs amount of loan")

temp %>% count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1 73520

temp %>% 
  filter(amount_change < 0) %>%
  count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1    95

summary(temp$amount_change)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -399910    1400    7500    7934   10400   99340

summary(temp$percent)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -100.00   15.00   26.00   23.27   26.00  191.40

There are 73520 loan transactions, in which 95 transaction was not being paid fully or in debt. The distribution of subtraction of total paid and loan amount is from 1400 to 7934, or 15% to 26%, with the mean of 7934 or 23%.

Summary

For cleanliness of data, we could focus on few points below:

Clients data:

ageGroup missing 745 clients, we could use date of birth to refill these missing infos.

Loans data:

There are 29757 / 38% of loan transactions with 16729 clients ID having no existence on clients table. Probably we should ask for missing clients on client data table.
Fields fieldOfficerID_A, centerID_A, groupID_A, memberID_A are having missing values, the total ofmissing values are fromm more than 300 to less than 500, which account for 0.42% to 0.54%

Loans vs Clients:

We could find that 3,446,562 / about 67% of receipts having no its ID on loan table. Probably we should ask for missing these data.
With the loans, at the size group greater than 200k, there could be few transactions over 700k of amount, which is big outlier.
There are 73520 loan transactions, in which 95 transaction was not being paid fully or in debt. The distribution of subtraction of total paid and loan amount is from 1400 to 7934, or 15% to 26%, with the mean of 7934 or 23%.