Credit risk is the risk borne by a person or institution when extending credit - usually in the form of money - to other individuals or parties.
The risk materializes when the principal and interest on the loan cannot be recovered, resulting in the following losses:
disruption of cash flow, which in turn disrupts working capital.
increased operational costs to chase the overdue payments (collection).
To minimize credit risk, a process called credit scoring or credit rating is usually carried out on the borrower. The output of this process becomes the basis for deciding whether a new loan application is accepted or rejected.
A credit score is the risk value assigned to an individual or organization applying for a loan, based on their track record of borrowing and repayment. The process of assigning a credit score is usually referred to as credit scoring.
Credit scores are usually calculated from historical data on how long payments were overdue and on borrowers who never paid at all (bad debt). Bad debt usually forces credit institutions to seize collateral or write off the loan.
Credit scores usually vary between institutions. However, many have adopted the FICO Score model, which has a value range of 300 to 850. The higher the score, the better the person's or institution's ability to repay loans.
Many institutions instead use a risk rating, or level of risk. In contrast to a credit score, a higher rating indicates higher risk.
In addition, the coding is often kept simpler than a wide numeric range so that decisions can be made faster, for example letter combinations such as AAA, AA+, P-1, and so on. For internal use, many lending institutions categorize with just a small number range, for example 1 to 5.
The following is an example of risk rating data generated from historical data on how long the loan repayment process took. Pay attention to the risk_rating column, which contains the numbers 1 to 5, indicating the lowest to the highest risk.
You can download the full dataset at https://storage.googleapis.com/dqlab-dataset/credit_scoring_dqlab.xlsx.
The risk_rating column is derived directly from the rata_rata_overdue column, i.e. the average late-payment column.
If the delay is up to 30 days (0 - 30 days), it is given a value of 1.
If the delay is 31 to 45 days (31 - 45 days), it is given a value of 2.
And so on.
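As an illustration, this mapping from overdue days to a risk rating could be sketched in R as follows. The breakpoints 30, 45, 60, and 90 days follow the pattern above and the rata_rata_overdue ranges in the dataset; treating anything above 90 days as rating 5 is an assumption, not the authoritative rule used to build the data.
# minimal sketch: map the number of overdue days to a risk rating 1-5
# breakpoints are taken from the ranges described above; the >90-days rule is assumed
overdue_to_rating <- function(overdue_days) {
  cut(overdue_days,
      breaks = c(-Inf, 30, 45, 60, 90, Inf),
      labels = c(1, 2, 3, 4, 5))
}
overdue_to_rating(c(10, 40, 75))
## [1] 1 2 4
## Levels: 1 2 3 4 5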
The analyst also takes several other columns to look for patterns relevant to this rating, namely:
annual income in millions (pendapatan_setahun_juta).
loan duration in months (durasi_pinjaman_bulan).
number of dependents (jumlah_tanggungan).
whether there is an active mortgage or not (kpr_aktif).
Still using the previous data, but now with the complete dataset, DQLab will illustrate the follow-up activities on the data with the following example scenario.
An analyst searches the data for patterns. Here are the findings:
if the number of dependents is more than 4, the risk tendency is very high (ratings 4 and 5).
if the loan duration is longer than 24 months, then the risk tendency also increases (ratings 4 and 5).
From these two findings, the analyst forms rules to guide decision making (a decision-making model) for new loan applications, as follows:
if the number of dependents is less than 5 people, and the loan duration is less than 24 months, the rating is given a value of 2 and the loan application is accepted.
if the number of dependents is more than 4 people and the loan duration is more than 24 months, then the rating is given a value of 5 and the loan application is rejected.
if the number of dependents is less than 5 and the loan duration is less than 36 months, then the rating is given a value of 3 and the loan application is accepted.
Now, these three rules form a model to predict the risk rating and become the basis for making decisions on new loan applications.
With such a model, lending institutions can make decisions faster and with fewer decision-making errors.
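As an illustration, these three rules could be written directly in R before any machine learning is involved. This is only a rough sketch: the function name score_application is made up here, and applications not covered by the three rules simply return NA.
# the analyst's three manual rules, in the order given above
# (illustrative only; the machine learning model below replaces this)
score_application <- function(jumlah_tanggungan, durasi_pinjaman_bulan) {
  if (jumlah_tanggungan > 4 && durasi_pinjaman_bulan > 24) {
    list(risk_rating = 5, decision = "rejected")
  } else if (jumlah_tanggungan < 5 && durasi_pinjaman_bulan < 24) {
    list(risk_rating = 2, decision = "accepted")
  } else if (jumlah_tanggungan < 5 && durasi_pinjaman_bulan < 36) {
    list(risk_rating = 3, decision = "accepted")
  } else {
    list(risk_rating = NA, decision = "not covered by the three rules")
  }
}
score_application(jumlah_tanggungan = 2, durasi_pinjaman_bulan = 12)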
The analysis and decision-making model above can actually be represented as a decision tree structure, as shown visually below.
A decision tree is a suitable output model for analysts to produce to help identify risk ratings. Fortunately, this model can be generated automatically by machine learning algorithms from historical credit data as input, as demonstrated below with an algorithm named C5.0.
library("openxlsx")
library("C50")
#data preparation
dataCreditRating <- read.xlsx(xlsxFile = "https://storage.googleapis.com/dqlab-dataset/credit_scoring_dqlab.xlsx")
dataCreditRating$risk_rating <- as.factor(dataCreditRating$risk_rating)
#use C5.0 algorithm
drop_columns <- c("kpr_aktif", "pendapatan_setahun_juta", "risk_rating", "rata_rata_overdue")
datafeed <- dataCreditRating[ , !(names(dataCreditRating) %in% drop_columns)]
modelKu <- C5.0(datafeed, as.factor(dataCreditRating$risk_rating))
summary(modelKu)
##
## Call:
## C5.0.default(x = datafeed, y = as.factor(dataCreditRating$risk_rating))
##
##
## C5.0 [Release 2.07 GPL Edition] Fri Aug 27 13:32:26 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 900 cases (4 attributes) from undefined.data
##
## Decision tree:
##
## jumlah_tanggungan > 4:
## :...durasi_pinjaman_bulan <= 24: 4 (112/30)
## : durasi_pinjaman_bulan > 24: 5 (140/55)
## jumlah_tanggungan <= 4:
## :...jumlah_tanggungan > 2: 3 (246/22)
## jumlah_tanggungan <= 2:
## :...durasi_pinjaman_bulan <= 36: 1 (294/86)
## durasi_pinjaman_bulan > 36:
## :...jumlah_tanggungan <= 0: 2 (41/8)
## jumlah_tanggungan > 0: 3 (67/4)
##
##
## Evaluation on training data (900 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 6 205(22.8%) <<
##
##
## (a) (b) (c) (d) (e) <-classified as
## ---- ---- ---- ---- ----
## 208 2 5 6 6 (a): class 1
## 86 33 21 6 13 (b): class 2
## 4 287 (c): class 3
## 2 82 36 (d): class 4
## 18 85 (e): class 5
##
##
## Attribute usage:
##
## 100.00% jumlah_tanggungan
## 72.67% durasi_pinjaman_bulan
##
##
## Time: 0.1 secs
C5.0 is the name of a decision tree algorithm. There are many other algorithms, such as random forest, CART, CHAID, MARS, and others. However, C5.0 is very popular because it performs very well in terms of both speed and accuracy. This algorithm is categorized as classification in machine learning, where the goal is to categorize or classify something – in our example, the risk rating – based on other input data.
We will now walk through predicting the risk rating from this dataset step by step.
library("openxlsx")
dataCreditRating <- read.xlsx(xlsxFile = "https://storage.googleapis.com/dqlab-dataset/credit_scoring_dqlab.xlsx")
str(dataCreditRating)
## 'data.frame': 900 obs. of 7 variables:
## $ kode_kontrak : chr "AGR-000001" "AGR-000011" "AGR-000030" "AGR-000043" ...
## $ pendapatan_setahun_juta: num 295 271 159 210 165 220 70 88 163 100 ...
## $ kpr_aktif : chr "YA" "YA" "TIDAK" "YA" ...
## $ durasi_pinjaman_bulan : num 48 36 12 12 36 24 36 48 48 36 ...
## $ jumlah_tanggungan : num 5 5 0 3 0 5 3 3 5 6 ...
## $ rata_rata_overdue : chr "61 - 90 days" "61 - 90 days" "0 - 30 days" "46 - 60 days" ...
## $ risk_rating : num 4 4 1 3 2 1 2 2 2 2 ...
In the R implementation of the C5.0 algorithm, the class variable must always be a factor. If it is read in as another data type, it must first be converted to a factor.
Our class variable, risk_rating, is still read in as numeric. To be used as the class variable in the C5.0 algorithm, it needs to be converted to a factor. This can be done with the following command.
dataCreditRating$risk_rating <- as.factor(dataCreditRating$risk_rating)
str(dataCreditRating)
## 'data.frame': 900 obs. of 7 variables:
## $ kode_kontrak : chr "AGR-000001" "AGR-000011" "AGR-000030" "AGR-000043" ...
## $ pendapatan_setahun_juta: num 295 271 159 210 165 220 70 88 163 100 ...
## $ kpr_aktif : chr "YA" "YA" "TIDAK" "YA" ...
## $ durasi_pinjaman_bulan : num 48 36 12 12 36 24 36 48 48 36 ...
## $ jumlah_tanggungan : num 5 5 0 3 0 5 3 3 5 6 ...
## $ rata_rata_overdue : chr "61 - 90 days" "61 - 90 days" "0 - 30 days" "46 - 60 days" ...
## $ risk_rating : Factor w/ 5 levels "1","2","3","4",..: 4 4 1 3 2 1 2 2 2 2 ...
Not all variables should be used as inputs, especially rata_rata_overdue, which risk_rating is derived from directly. We will discard this variable. The process is known as feature selection. Since we are using a data frame as the input data type for C5.0, we can simply select the columns we want to use.
input_columns <- c("durasi_pinjaman_bulan", "jumlah_tanggungan")
datafeed <- dataCreditRating[ , input_columns ]
str(datafeed)
## 'data.frame': 900 obs. of 2 variables:
## $ durasi_pinjaman_bulan: num 48 36 12 12 36 24 36 48 48 36 ...
## $ jumlah_tanggungan : num 5 5 0 3 0 5 3 3 5 6 ...
Note: kode_kontrak is not selected here because it is unique for every row and therefore cannot form any pattern. It was included in the earlier example only to show that C5.0 can automatically discard irrelevant input variables.
To build a machine learning model and assess its accuracy, our dataset usually needs to be divided into two parts:
Training set: the portion of the dataset used by the algorithm for analysis and as input for building the model.
Testing set: the portion of the dataset that is not used to build the model, but to test the model that has been created.
The split is usually made by random selection. We will divide our dataset into 800 rows for the training set and 100 rows for the testing set.
#set random index portion for training and testing set
set.seed(100) #fix the random seed so the sampling is reproducible
indeks_training_set <- sample(900, 800) #draw 800 random index values from the range 1 to 900
#create and show training and testing set
input_training_set <- datafeed[indeks_training_set,]
class_training_set <- dataCreditRating[indeks_training_set,]$risk_rating
input_testing_set <- datafeed[-indeks_training_set,]
str(input_training_set)
## 'data.frame': 800 obs. of 2 variables:
## $ durasi_pinjaman_bulan: num 36 24 36 36 36 24 12 48 48 12 ...
## $ jumlah_tanggungan : num 1 1 5 1 5 3 3 3 0 0 ...
str(class_training_set)
## Factor w/ 5 levels "1","2","3","4",..: 1 1 4 1 5 3 3 3 2 1 ...
str(input_testing_set)
## 'data.frame': 100 obs. of 2 variables:
## $ durasi_pinjaman_bulan: num 12 36 48 36 48 48 12 12 12 12 ...
## $ jumlah_tanggungan : num 0 0 3 3 6 5 0 0 0 4 ...
With the previous preparations done, it is time to use the C5.0 algorithm to generate a decision tree model, using a function that is also named C5.0. This function requires an R package named "C50".
library("C50")
risk_rating_model <- C5.0(input_training_set, class_training_set)
#overview model
summary(risk_rating_model)
##
## Call:
## C5.0.default(x = input_training_set, y = class_training_set)
##
##
## C5.0 [Release 2.07 GPL Edition] Fri Aug 27 13:32:27 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 800 cases (3 attributes) from undefined.data
##
## Decision tree:
##
## jumlah_tanggungan > 4:
## :...durasi_pinjaman_bulan <= 24: 4 (105/30)
## : durasi_pinjaman_bulan > 24: 5 (120/51)
## jumlah_tanggungan <= 4:
## :...jumlah_tanggungan > 2: 3 (216/20)
## jumlah_tanggungan <= 2:
## :...durasi_pinjaman_bulan <= 36: 1 (264/80)
## durasi_pinjaman_bulan > 36:
## :...jumlah_tanggungan <= 0: 2 (37/7)
## jumlah_tanggungan > 0: 3 (58/4)
##
##
## Evaluation on training data (800 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 6 192(24.0%) <<
##
##
## (a) (b) (c) (d) (e) <-classified as
## ---- ---- ---- ---- ----
## 184 2 5 6 6 (a): class 1
## 80 30 19 6 11 (b): class 2
## 3 250 (c): class 3
## 2 75 34 (d): class 4
## 18 69 (e): class 5
##
##
## Attribute usage:
##
## 100.00% jumlah_tanggungan
## 73.00% durasi_pinjaman_bulan
##
##
## Time: 0.0 secs
In addition to the text output from the previous practice, we can also render the decision tree in graphical form. It only takes one line of code:
plot(risk_rating_model)
Class specified by attribute `outcome' means that our class variable is labeled or named outcome. If we want a more representative label, namely "Risk Rating", we can add a control parameter in the form of the C5.0Control function with its label parameter, as follows.
risk_rating_model <- C5.0(
input_training_set,
class_training_set,
control = C5.0Control(label="Risk Rating")
)
summary(risk_rating_model)
##
## Call:
## C5.0.default(x = input_training_set, y = class_training_set, control
## = C5.0Control(label = "Risk Rating"))
##
##
## C5.0 [Release 2.07 GPL Edition] Fri Aug 27 13:32:29 2021
## -------------------------------
##
## Class specified by attribute `Risk Rating'
##
## Read 800 cases (3 attributes) from undefined.data
##
## Decision tree:
##
## jumlah_tanggungan > 4:
## :...durasi_pinjaman_bulan <= 24: 4 (105/30)
## : durasi_pinjaman_bulan > 24: 5 (120/51)
## jumlah_tanggungan <= 4:
## :...jumlah_tanggungan > 2: 3 (216/20)
## jumlah_tanggungan <= 2:
## :...durasi_pinjaman_bulan <= 36: 1 (264/80)
## durasi_pinjaman_bulan > 36:
## :...jumlah_tanggungan <= 0: 2 (37/7)
## jumlah_tanggungan > 0: 3 (58/4)
##
##
## Evaluation on training data (800 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 6 192(24.0%) <<
##
##
## (a) (b) (c) (d) (e) <-classified as
## ---- ---- ---- ---- ----
## 184 2 5 6 6 (a): class 1
## 80 30 19 6 11 (b): class 2
## 3 250 (c): class 3
## 2 75 34 (d): class 4
## 18 69 (e): class 5
##
##
## Attribute usage:
##
## 100.00% jumlah_tanggungan
## 73.00% durasi_pinjaman_bulan
##
##
## Time: 0.0 secs
Read 800 cases (3 attributes) from undefined.data means 800 rows of data were read. This is because we took 800 of our 900 rows. The 3 attributes part means we have three variables, namely:
input variables: durasi_pinjaman_bulan and jumlah_tanggungan.
class variable: risk_rating.
The undefined.data part can be ignored; it would normally contain the .data file name from the original C5.0 program. If you want to know more about this, see https://www.rulequest.com/see5-unix.html and focus on the data preparation section.
This is what the coloring means:
the blue color marks a node and its split condition. Connections between nodes (connectors) are drawn with colons and repeated dots (:…).
the red color marks a leaf node and its classification result.
the purple color marks the statistics in the form (number of cases / number of misclassified cases).
Evaluation on training data (800 cases):
Decision Tree
----------------
Size Errors
6 192(24.0%) <<
The information contained in this output is:
800 cases is the number of rows of data (cases) that are processed.
Size = 6 is the number of leaf nodes (end nodes) of the decision tree.
Errors = 192(24.0%): 192 is the number of misclassified records and 24.0% is its ratio to the entire population.
(a) (b) (c) (d) (e) <-classified as
---- ---- ---- ---- ----
184 2 5 6 6 (a): class 1
80 30 19 6 11 (b): class 2
3 250 (c): class 3
2 75 34 (d): class 4
18 69 (e): class 5
A confusion matrix, or error matrix, is a table that compares the classifications made by the model with the actual classes in the data, thereby showing how accurately the model classifies or predicts.
A confusion matrix has the same number of columns and rows, where the row and column headers represent the class variable values - in our example, the risk_rating values. Since there are 5 class values, the table is 5 x 5, as shown above.
The column headers indicate the risk_rating value predicted or classified by the model, using the labels (a), (b), (c), and so on.
The row headers show the risk_rating value in the actual data, also represented by (a), (b), (c), (d), and (e). Here, however, each label is annotated with the risk_rating value it represents: (a) represents risk_rating 1, (b) represents risk_rating 2, and so on.
Each cell at the intersection of a row and a column counts the cases whose actual class is the row's class and whose predicted class is the column's class.
Finally, let's add up all these numbers:
number of correct predictions: 608 (184 + 30 + 250 + 75 + 69).
number of wrong predictions: 192 (2 + 5 + 6 + 6 + 80 + 19 + 6 + 11 + 3 + 2 + 34 + 18).
The total is 800 cases, matching the size of the training set, and the 192 errors are consistent with the Errors figure in the output.
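The same totals can be recomputed directly in R from the training set. This is a quick sketch; the object names prediksi_training and tabel_latih are just illustrative.
# recompute the training-set confusion matrix and error count
prediksi_training <- predict(risk_rating_model, input_training_set)
tabel_latih <- table(aktual = class_training_set, prediksi = prediksi_training)
tabel_latih
sum(diag(tabel_latih))                      # correctly classified cases
sum(tabel_latih) - sum(diag(tabel_latih))   # misclassified cases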
We can relabel the class variable values with the syntax below. Note that this value-by-value replacement only works while risk_rating is still a character or numeric column; if it has already been converted to a factor, assigning labels this way produces NAs.
dataCreditRating$risk_rating[dataCreditRating$risk_rating == "1"] <- "satu"
dataCreditRating$risk_rating[dataCreditRating$risk_rating == "2"] <- "dua"
dataCreditRating$risk_rating[dataCreditRating$risk_rating == "3"] <- "tiga"
dataCreditRating$risk_rating[dataCreditRating$risk_rating == "4"] <- "empat"
dataCreditRating$risk_rating[dataCreditRating$risk_rating == "5"] <- "lima"
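Alternatively, if risk_rating has already been converted to a factor (as in the earlier step), the same relabeling can be done in one step through its levels. A short sketch:
# relabel the factor levels directly (works when risk_rating is already a factor)
levels(dataCreditRating$risk_rating) <- c("satu", "dua", "tiga", "empat", "lima")
str(dataCreditRating$risk_rating)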
The last part of the output is the list of determinant variables used in the decision tree model.
Attribute usage:
100.00% jumlah_tanggungan
73.00% durasi_pinjaman_bulan
The output tells us how heavily each variable is used. Here, jumlah_tanggungan ranks first with 100% and durasi_pinjaman_bulan follows with 73.00%. This also explains why jumlah_tanggungan occupies the root node of our model.
Here is the plot of the C5.0 decision tree, annotated with colors, with the explanation following the picture.
The red color shows the nodes and their numbering
red circle 1 is the first level node which is the root node with a determinant variable jumlah_tanggungan.
red circle 2 is the second level node with a determinant variable jumlah_tanggungan.
red circle 3 is Node 7 which is the leaf node for risk_rating classification.
The blue color indicates the split condition to the next nodes
blue circle 4 indicates a split condition where the loan duration is less than or equal to 24 months.
blue circle 5 indicates a split condition where the loan duration is more than 24 months.
The green color indicates the amount of data that has been classified
green circle 6 shows the classification results of 98 data.
the green circle 7 shows the classification results of 129 data.
The purple color indicates the classification result and its distribution (as ratios between 0 and 1)
purple circle 8 shows that the majority risk_rating at that node is 4, so the model takes 4 as its classification. Data with risk_rating 5, 1, and 2 also actually end up in the condition that leads to Node 10.
purple circle 9 shows that the majority risk_rating at that node is 5, so the model takes 5 as its classification. Data with risk_rating 4, 2, and 1 also actually end up in the condition that leads to Node 11.
The confusion matrix in the output of our previous model is an evaluation of the model on the training set. However, we also need to evaluate the model on the testing set we have prepared.
The C50 package provides a predict method, called as predict(model, new_data), which makes predictions from the model and the test data. The full call looks as follows.
predict(risk_rating_model, input_testing_set)
## [1] 1 1 3 3 5 5 1 1 1 3 1 2 1 1 3 3 1 3 3 3 3 3 1 5 1 1 3 1 3 5 1 1 2 1 5 1 1
## [38] 5 3 3 3 3 4 3 3 1 3 5 2 3 2 5 3 5 1 5 4 5 3 4 1 3 4 4 3 5 5 5 3 1 1 1 1 3
## [75] 5 1 4 5 3 1 3 3 3 3 3 1 3 3 5 4 5 3 3 3 1 1 5 5 3 3
## Levels: 1 2 3 4 5
It can be seen that the predictions correspond, in order, to the rows of the testing set, and the predicted values fall within the risk_rating range of 1 to 5.
We will store the risk_rating from the original dataset and the prediction results as two additional columns in the input_testing_set data frame, named risk_rating and hasil_prediksi.
input_testing_set$risk_rating <- dataCreditRating[-indeks_training_set,]$risk_rating #save the original value of risk_rating into column risk_rating
input_testing_set$hasil_prediksi <- predict(risk_rating_model, input_testing_set) #save the predicted value into column hasil_prediksi
print(input_testing_set)
## durasi_pinjaman_bulan jumlah_tanggungan risk_rating hasil_prediksi
## 3 12 0 1 1
## 5 36 0 2 1
## 8 48 3 2 3
## 40 36 3 2 3
## 41 48 6 2 5
## 44 48 5 2 5
## 58 12 0 1 1
## 70 12 0 1 1
## 109 12 0 1 1
## 110 12 4 3 3
## 122 12 0 1 1
## 151 48 0 2 2
## 179 36 1 1 1
## 180 36 1 2 1
## 182 24 4 3 3
## 195 48 3 3 3
## 200 24 0 1 1
## 217 12 4 3 3
## 230 48 2 3 3
## 231 12 3 3 3
## 234 24 3 3 3
## 236 24 4 3 3
## 238 24 0 1 1
## 245 36 5 4 5
## 252 24 0 1 1
## 253 24 0 1 1
## 260 48 1 3 3
## 265 36 0 2 1
## 275 12 3 3 3
## 279 36 6 5 5
## 285 36 1 1 1
## 295 24 0 1 1
## 317 48 0 2 2
## 343 24 0 1 1
## 350 48 6 5 5
## 352 12 1 1 1
## 356 36 2 2 1
## 369 48 6 5 5
## 373 48 3 3 3
## 375 48 2 3 3
## 384 24 3 3 3
## 388 36 3 3 3
## 399 24 6 4 4
## 419 48 3 3 3
## 433 24 4 3 3
## 437 36 1 1 1
## 446 24 3 3 3
## 455 48 5 5 5
## 493 48 0 2 2
## 496 12 3 3 3
## 501 48 0 3 2
## 521 48 5 4 5
## 524 48 2 3 3
## 527 36 5 5 5
## 534 36 1 1 1
## 536 48 6 5 5
## 544 12 5 4 4
## 548 48 6 5 5
## 561 12 3 3 3
## 565 12 6 4 4
## 574 24 1 1 1
## 577 48 2 3 3
## 587 12 6 4 4
## 594 12 6 4 4
## 612 24 4 3 3
## 616 48 6 5 5
## 621 36 5 5 5
## 632 48 6 5 5
## 641 36 4 3 3
## 645 12 2 2 1
## 657 12 2 1 1
## 675 12 2 1 1
## 687 12 2 1 1
## 697 36 4 3 3
## 704 48 6 5 5
## 707 12 2 1 1
## 716 12 5 4 4
## 721 36 5 5 5
## 729 48 1 3 3
## 737 12 2 1 1
## 743 36 3 3 3
## 748 48 1 3 3
## 749 36 4 3 3
## 786 48 1 3 3
## 799 12 3 3 3
## 801 24 2 1 1
## 806 24 4 3 3
## 814 36 3 3 3
## 825 36 6 5 5
## 831 24 6 4 4
## 861 48 5 5 5
## 863 12 3 3 3
## 869 48 3 3 3
## 870 48 3 3 3
## 872 24 2 1 1
## 880 36 1 2 1
## 888 48 5 5 5
## 890 48 5 5 5
## 893 48 3 3 3
## 897 48 2 3 3
Note: -indeks_training_set (with a minus sign in front) selects all the rows whose index is not in the training set, i.e. the rows of the testing set.
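Negative indexing in R drops the listed positions and keeps the rest, which is why -indeks_training_set yields exactly the testing-set rows. A tiny standalone illustration:
# negative indexing: drop positions 1 and 3, keep the rest
x <- c(10, 20, 30, 40, 50)
x[-c(1, 3)]
## [1] 20 40 50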
After the predictions for the testing set are complete, the next step is to look at the distribution of correct and incorrect predictions. We do this with a confusion matrix. To create it, we can use the dcast(row_variable ~ column_variable, data = dataframe) function from the reshape2 package.
library("reshape2")
dcast(hasil_prediksi ~ risk_rating, data=input_testing_set)
## Using hasil_prediksi as value column: use value.var to override.
## Aggregation function missing: defaulting to length
## hasil_prediksi 1 2 3 4 5
## 1 1 24 6 0 0 0
## 2 2 0 3 1 0 0
## 3 3 0 2 37 0 0
## 4 4 0 0 0 7 0
## 5 5 0 2 0 2 16
The row headers show the predicted risk_rating values (hasil_prediksi), while the column headers show the actual risk_rating values.
To calculate the error percentage, we first count the number of correct predictions. A prediction is correct if risk_rating is the same as hasil_prediksi. Written as code, this is as follows.
input_testing_set$risk_rating==input_testing_set$hasil_prediksi
The next step is to filter the data frame with this logical vector, using the following syntax.
input_testing_set[input_testing_set$risk_rating==input_testing_set$hasil_prediksi,]
We then count the number of rows returned by this filter by wrapping the expression in nrow(), as follows.
nrow(input_testing_set[input_testing_set$risk_rating==input_testing_set$hasil_prediksi,])
## [1] 87
How about incorrect predictions? We simply use the != operator instead.
nrow(input_testing_set[input_testing_set$risk_rating!=input_testing_set$hasil_prediksi,])
## [1] 13
It can be seen that the number of prediction errors is 13. This is consistent with the 87 correct predictions, since the two together total 100 – the size of the testing set.
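The accuracy on the testing set can also be computed in a single line, using the two columns added earlier:
# proportion of correct predictions on the testing set (87 out of 100)
mean(input_testing_set$risk_rating == input_testing_set$hasil_prediksi)
## [1] 0.87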
The new application data needs to be formed as a data frame whose variable names exactly match those used as model input. From the beginning of the modeling we have used two variables, namely:
jumlah_tanggungan
durasi_pinjaman_bulan
Both are of numeric data type (numbers). The following is an example of creating a data frame with these two variables.
aplikasi_baru <- data.frame(jumlah_tanggungan = 6, durasi_pinjaman_bulan = 12)
print(aplikasi_baru)
## jumlah_tanggungan durasi_pinjaman_bulan
## 1 6 12
We will now predict the risk_rating of the new application data created above using the predict() function.
predict(risk_rating_model, aplikasi_baru)
## [1] 4
## Levels: 1 2 3 4 5
This means that the predicted risk_rating for this new application is 4, out of the possible values 1, 2, 3, 4, and 5. A rating of 4 is a fairly high risk, so this new application may be rejected, depending on the policy of the lending institution.
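If desired, the institution's policy can be attached directly to the prediction. The sketch below assumes a hypothetical cut-off – reject predicted ratings of 4 or 5, accept anything lower – which is only an illustration, not a rule taken from the dataset.
# hypothetical acceptance policy: reject predicted ratings 4 and 5
prediksi_baru <- predict(risk_rating_model, aplikasi_baru)
if (as.integer(as.character(prediksi_baru)) >= 4) "rejected" else "accepted"
## [1] "rejected"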
Above, we learned how to make predictions from a data frame based on the model we created. Now let's try predicting for input values that do not appear in the dataset used to build the model.
#create a new application data frame
aplikasi_baru <- data.frame(jumlah_tanggungan = 6, durasi_pinjaman_bulan = 64)
#make the prediction
predict(risk_rating_model, aplikasi_baru)
## [1] 5
## Levels: 1 2 3 4 5
This means that the predicted risk_rating for this new application is 5, out of the possible values 1, 2, 3, 4, and 5. A rating of 5 is a very high risk value. Note that a loan duration of 64 months does not appear in the data the model was built from, yet the model still produces a prediction by following the matching branch of the tree (jumlah_tanggungan > 4 and durasi_pinjaman_bulan > 24).
We have predicted the credit risk value (risk_rating) of the new application data.
The function used is very simple, but we need to be strict about the conditions when preparing the input:
the input is a dataframe
the fields in the dataframe must match the input used to generate the model
If these conditions are not met – for example, if the columns of the data frame do not match those used to build the model – the prediction can fail with an error.
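For example, several new applications can be scored at once as long as the data frame carries exactly the two predictor columns. The values below (and the name aplikasi_batch) are made up purely for illustration; the predictions follow the tree shown earlier.
# score several hypothetical applications at once
aplikasi_batch <- data.frame(
  jumlah_tanggungan     = c(0, 3, 6),
  durasi_pinjaman_bulan = c(12, 36, 48)
)
predict(risk_rating_model, aplikasi_batch)
## [1] 1 3 5
## Levels: 1 2 3 4 5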