Credit risk is the risk borne by a person or institution when extending credit - usually in the form of money - to other individuals or parties.
The risk materializes when the principal and interest on the loan cannot be recovered, resulting in the following losses:
disruption of cash flow, which in turn disrupts working capital.
increased operational costs to chase the overdue payments (collection).
To minimize credit risk, a process called credit scoring or credit rating is usually carried out on the borrower. The output of this process becomes the basis for deciding whether a new loan application is accepted or rejected.
A credit score is the risk value assigned to an individual or organization applying for a loan, based on their track record of borrowing and repayment. The process of assigning a credit score is usually referred to as credit scoring.
Credit scores are usually calculated from historical data on how long payments were overdue and on borrowers who never paid at all (bad debt). Bad debt usually forces credit institutions to seize collateral or write off the loan.
Credit scores usually vary between institutions. However, many have adopted the FICO Score model, which has a value range of 300 to 850. The higher the score, the better the person's or institution's ability to repay loans.
Many institutions instead use a risk rating, or level of risk. In contrast to a credit score, a higher rating indicates higher risk.
In addition, the coding is often kept simpler than a wide numeric range so that decisions can be made faster, for example letter combinations such as AAA, AA+, P-1, and so on. For internal use, many lending institutions categorize with just a small number range, for example 1 to 5.
The following is an example of risk rating data generated from historical data on how long the loan repayment process took. Pay attention to the risk_rating column, which contains the numbers 1 to 5, indicating the lowest to the highest risk.
You can download the full dataset at https://storage.googleapis.com/dqlab-dataset/credit_scoring_dqlab.xlsx.
The risk_rating column is derived directly from the rata_rata_overdue column, i.e. the average late-payment column.
If the delay is up to 30 days (0 - 30 days), it is given a value of 1.
If the delay is 31 to 45 days (31 - 45 days), it is given a value of 2.
And so on.
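As an illustration, this mapping from overdue days to a risk rating could be sketched in R as follows. The breakpoints 30, 45, 60, and 90 days follow the pattern above and the rata_rata_overdue ranges in the dataset; treating anything above 90 days as rating 5 is an assumption, not the authoritative rule used to build the data.
# minimal sketch: map the number of overdue days to a risk rating 1-5
# breakpoints are taken from the ranges described above; the >90-days rule is assumed
overdue_to_rating <- function(overdue_days) {
  cut(overdue_days,
      breaks = c(-Inf, 30, 45, 60, 90, Inf),
      labels = c(1, 2, 3, 4, 5))
}
overdue_to_rating(c(10, 40, 75))
## [1] 1 2 4
## Levels: 1 2 3 4 5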
The analyst also takes several other columns to look for patterns relevant to this rating, namely:
annual income in millions (pendapatan_setahun_juta).
loan duration in months (durasi_pinjaman_bulan).
number of dependents (jumlah_tanggungan).
whether there is an active mortgage or not (kpr_aktif).
Still using the previous data, but now with the complete dataset, DQLab will illustrate the follow-up activities on the data with the following example scenario.
An analyst searches the data for patterns. Here are the findings:
if the number of dependents is more than 4, the risk tendency is very high (ratings 4 and 5).
if the loan duration is longer than 24 months, then the risk tendency also increases (ratings 4 and 5).
From these two findings, the analyst forms rules to guide decision making (a decision-making model) for new loan applications, as follows:
if the number of dependents is less than 5 people, and the loan duration is less than 24 months, the rating is given a value of 2 and the loan application is accepted.
if the number of dependents is more than 4 people and the loan duration is more than 24 months, then the rating is given a value of 5 and the loan application is rejected.
if the number of dependents is less than 5 and the loan duration is less than 36 months, then the rating is given a value of 3 and the loan application is accepted.
Now, these three rules form a model to predict the risk rating and become the basis for making decisions on new loan applications.
With such a model, lending institutions can make decisions faster and with fewer decision-making errors.
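As an illustration, these three rules could be written directly in R before any machine learning is involved. This is only a rough sketch: the function name score_application is made up here, and applications not covered by the three rules simply return NA.
# the analyst's three manual rules, in the order given above
# (illustrative only; the machine learning model below replaces this)
score_application <- function(jumlah_tanggungan, durasi_pinjaman_bulan) {
  if (jumlah_tanggungan > 4 && durasi_pinjaman_bulan > 24) {
    list(risk_rating = 5, decision = "rejected")
  } else if (jumlah_tanggungan < 5 && durasi_pinjaman_bulan < 24) {
    list(risk_rating = 2, decision = "accepted")
  } else if (jumlah_tanggungan < 5 && durasi_pinjaman_bulan < 36) {
    list(risk_rating = 3, decision = "accepted")
  } else {
    list(risk_rating = NA, decision = "not covered by the three rules")
  }
}
score_application(jumlah_tanggungan = 2, durasi_pinjaman_bulan = 12)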
The analysis and decision-making model above can actually be represented as a decision tree structure, as shown visually below.
A decision tree is a suitable output model for analysts to produce to help identify risk ratings. Fortunately, this model can be generated automatically by machine learning algorithms from historical credit data as input, as demonstrated below with an algorithm named C5.0.
library("openxlsx")
library("C50")
#data preparation
dataCreditRating <- read.xlsx(xlsxFile = "https://storage.googleapis.com/dqlab-dataset/credit_scoring_dqlab.xlsx")
dataCreditRating$risk_rating <- as.factor(dataCreditRating$risk_rating)
#use C5.0 algorithm
drop_columns <- c("kpr_aktif", "pendapatan_setahun_juta", "risk_rating", "rata_rata_overdue")
datafeed <- dataCreditRating[ , !(names(dataCreditRating) %in% drop_columns)]
modelKu <- C5.0(datafeed, as.factor(dataCreditRating$risk_rating))
summary(modelKu)
##
## Call:
## C5.0.default(x = datafeed, y = as.factor(dataCreditRating$risk_rating))
##
##
## C5.0 [Release 2.07 GPL Edition] Fri Aug 27 13:32:26 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 900 cases (4 attributes) from undefined.data
##
## Decision tree:
##
## jumlah_tanggungan > 4:
## :...durasi_pinjaman_bulan <= 24: 4 (112/30)
## : durasi_pinjaman_bulan > 24: 5 (140/55)
## jumlah_tanggungan <= 4:
## :...jumlah_tanggungan > 2: 3 (246/22)
## jumlah_tanggungan <= 2:
## :...durasi_pinjaman_bulan <= 36: 1 (294/86)
## durasi_pinjaman_bulan > 36:
## :...jumlah_tanggungan <= 0: 2 (41/8)
## jumlah_tanggungan > 0: 3 (67/4)
##
##
## Evaluation on training data (900 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 6 205(22.8%) <<
##
##
## (a) (b) (c) (d) (e) <-classified as
## ---- ---- ---- ---- ----
## 208 2 5 6 6 (a): class 1
## 86 33 21 6 13 (b): class 2
## 4 287 (c): class 3
## 2 82 36 (d): class 4
## 18 85 (e): class 5
##
##
## Attribute usage:
##
## 100.00% jumlah_tanggungan
## 72.67% durasi_pinjaman_bulan
##
##
## Time: 0.1 secs
C5.0 is the name of a decision tree algorithm. There are many other algorithms, such as random forest, CART, CHAID, MARS, and others. However, C5.0 is very popular because it performs very well in terms of both speed and accuracy. This algorithm is categorized as classification in machine learning, where the goal is to categorize or classify something – in our example, the risk rating – based on other input data.
We will now walk through predicting the risk rating from this dataset step by step.
library("openxlsx")
dataCreditRating <- read.xlsx(xlsxFile = "https://storage.googleapis.com/dqlab-dataset/credit_scoring_dqlab.xlsx")
str(dataCreditRating)
## 'data.frame': 900 obs. of 7 variables:
## $ kode_kontrak : chr "AGR-000001" "AGR-000011" "AGR-000030" "AGR-000043" ...
## $ pendapatan_setahun_juta: num 295 271 159 210 165 220 70 88 163 100 ...
## $ kpr_aktif : chr "YA" "YA" "TIDAK" "YA" ...
## $ durasi_pinjaman_bulan : num 48 36 12 12 36 24 36 48 48 36 ...
## $ jumlah_tanggungan : num 5 5 0 3 0 5 3 3 5 6 ...
## $ rata_rata_overdue : chr "61 - 90 days" "61 - 90 days" "0 - 30 days" "46 - 60 days" ...
## $ risk_rating : num 4 4 1 3 2 1 2 2 2 2 ...
In the R implementation of the C5.0 algorithm, the class variable must always be a factor. If it is read in as another data type, it must first be converted to a factor.
Our class variable, risk_rating, is still read in as numeric. To be used as the class variable in the C5.0 algorithm, it needs to be converted to a factor. This can be done with the following command.
dataCreditRating$risk_rating <- as.factor(dataCreditRating$risk_rating)
str(dataCreditRating)
## 'data.frame': 900 obs. of 7 variables:
## $ kode_kontrak : chr "AGR-000001" "AGR-000011" "AGR-000030" "AGR-000043" ...
## $ pendapatan_setahun_juta: num 295 271 159 210 165 220 70 88 163 100 ...
## $ kpr_aktif : chr "YA" "YA" "TIDAK" "YA" ...
## $ durasi_pinjaman_bulan : num 48 36 12 12 36 24 36 48 48 36 ...
## $ jumlah_tanggungan : num 5 5 0 3 0 5 3 3 5 6 ...
## $ rata_rata_overdue : chr "61 - 90 days" "61 - 90 days" "0 - 30 days" "46 - 60 days" ...
## $ risk_rating : Factor w/ 5 levels "1","2","3","4",..: 4 4 1 3 2 1 2 2 2 2 ...
Not all variables should be used as inputs, especially rata_rata_overdue, which risk_rating is derived from directly. We will discard this variable. The process is known as feature selection. Since we are using a data frame as the input data type for C5.0, we can simply select the columns we want to use.
input_columns <- c("durasi_pinjaman_bulan", "jumlah_tanggungan")
datafeed <- dataCreditRating[ , input_columns ]
str(datafeed)
## 'data.frame': 900 obs. of 2 variables:
## $ durasi_pinjaman_bulan: num 48 36 12 12 36 24 36 48 48 36 ...
## $ jumlah_tanggungan : num 5 5 0 3 0 5 3 3 5 6 ...
Note: kode_kontrak is not selected here because it is unique for every row and therefore cannot form any pattern. It was included in the earlier example only to show that C5.0 can automatically discard irrelevant input variables.
To build a machine learning model and assess its accuracy, our dataset usually needs to be divided into two parts:
Training set: the portion of the dataset used by the algorithm for analysis and as input for building the model.
Testing set: the portion of the dataset that is not used to build the model, but to test the model that has been created.
The split is usually made by random selection. We will divide our dataset into 800 rows for the training set and 100 rows for the testing set.
#set random index portion for training and testing set
set.seed(100) #fix the random seed so the sampling is reproducible
indeks_training_set <- sample(900, 800) #draw 800 random index values from the range 1 to 900
#create and show training and testing set
input_training_set <- datafeed[indeks_training_set,]
class_training_set <- dataCreditRating[indeks_training_set,]$risk_rating
input_testing_set <- datafeed[-indeks_training_set,]
str(input_training_set)
## 'data.frame': 800 obs. of 2 variables:
## $ durasi_pinjaman_bulan: num 36 24 36 36 36 24 12 48 48 12 ...
## $ jumlah_tanggungan : num 1 1 5 1 5 3 3 3 0 0 ...
str(class_training_set)
## Factor w/ 5 levels "1","2","3","4",..: 1 1 4 1 5 3 3 3 2 1 ...
str(input_testing_set)
## 'data.frame': 100 obs. of 2 variables:
## $ durasi_pinjaman_bulan: num 12 36 48 36 48 48 12 12 12 12 ...
## $ jumlah_tanggungan : num 0 0 3 3 6 5 0 0 0 4 ...
With the previous preparations done, it is time to use the C5.0 algorithm to generate a decision tree model, using a function that is also named C5.0. This function requires an R package named "C50".
library("C50")
risk_rating_model <- C5.0(input_training_set, class_training_set)
#overview model
summary(risk_rating_model)
##
## Call:
## C5.0.default(x = input_training_set, y = class_training_set)
##
##
## C5.0 [Release 2.07 GPL Edition] Fri Aug 27 13:32:27 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 800 cases (3 attributes) from undefined.data
##
## Decision tree:
##
## jumlah_tanggungan > 4:
## :...durasi_pinjaman_bulan <= 24: 4 (105/30)
## : durasi_pinjaman_bulan > 24: 5 (120/51)
## jumlah_tanggungan <= 4:
## :...jumlah_tanggungan > 2: 3 (216/20)
## jumlah_tanggungan <= 2:
## :...durasi_pinjaman_bulan <= 36: 1 (264/80)
## durasi_pinjaman_bulan > 36:
## :...jumlah_tanggungan <= 0: 2 (37/7)
## jumlah_tanggungan > 0: 3 (58/4)
##
##
## Evaluation on training data (800 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 6 192(24.0%) <<
##
##
## (a) (b) (c) (d) (e) <-classified as
## ---- ---- ---- ---- ----
## 184 2 5 6 6 (a): class 1
## 80 30 19 6 11 (b): class 2
## 3 250 (c): class 3
## 2 75 34 (d): class 4
## 18 69 (e): class 5
##
##
## Attribute usage:
##
## 100.00% jumlah_tanggungan
## 73.00% durasi_pinjaman_bulan
##
##
## Time: 0.0 secs
In addition to the text output from the previous practice, we can also render the decision tree in graphical form. It only takes one line of code:
plot(risk_rating_model)
Class specified by attribute `outcome' means that our class variable is labeled or named outcome. If we want a more representative label, namely "Risk Rating", we can add a control parameter in the form of the C5.0Control function with its label parameter, as follows.
risk_rating_model <- C5.0(
input_training_set,
class_training_set,
control = C5.0Control(label="Risk Rating")
)
summary(risk_rating_model)
##
## Call:
## C5.0.default(x = input_training_set, y = class_training_set, control
## = C5.0Control(label = "Risk Rating"))
##
##
## C5.0 [Release 2.07 GPL Edition] Fri Aug 27 13:32:29 2021
## -------------------------------
##
## Class specified by attribute `Risk Rating'
##
## Read 800 cases (3 attributes) from undefined.data
##
## Decision tree:
##
## jumlah_tanggungan > 4:
## :...durasi_pinjaman_bulan <= 24: 4 (105/30)
## : durasi_pinjaman_bulan > 24: 5 (120/51)
## jumlah_tanggungan <= 4:
## :...jumlah_tanggungan > 2: 3 (216/20)
## jumlah_tanggungan <= 2:
## :...durasi_pinjaman_bulan <= 36: 1 (264/80)
## durasi_pinjaman_bulan > 36:
## :...jumlah_tanggungan <= 0: 2 (37/7)
## jumlah_tanggungan > 0: 3 (58/4)
##
##
## Evaluation on training data (800 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 6 192(24.0%) <<
##
##
## (a) (b) (c) (d) (e) <-classified as
## ---- ---- ---- ---- ----
## 184 2 5 6 6 (a): class 1
## 80 30 19 6 11 (b): class 2
## 3 250 (c): class 3
## 2 75 34 (d): class 4
## 18 69 (e): class 5
##
##
## Attribute usage:
##
## 100.00% jumlah_tanggungan
## 73.00% durasi_pinjaman_bulan
##
##
## Time: 0.0 secs
Read 800 cases (3 attributes) from undefined.data means 800 rows of data were read. This is because we took 800 of our 900 rows. The 3 attributes part means we have three variables, namely:
input variables: durasi_pinjaman_bulan and jumlah_tanggungan.
class variable: risk_rating.
The undefined.data part can be ignored; it would normally contain the .data file name from the original C5.0 program. If you want to know more about this, see https://www.rulequest.com/see5-unix.html and focus on the data preparation section.
This is what the coloring means:
the blue color marks a node and its split condition. Connections between nodes (connectors) are drawn with colons and repeated dots (:…).
the red color marks a leaf node and its classification result.
the purple color marks the statistics in the form (number of cases / number of misclassified cases).
Evaluation on training data (800 cases):
Decision Tree
----------------
Size Errors
6 192(24.0%) <<
The information contained in this output is:
800 cases is the number of rows of data (cases) that are processed.
Size = 6 is the number of leaf nodes (end nodes) of the decision tree.
Errors = 192(24.0%): 192 is the number of misclassified records and 24.0% is its ratio to the entire population.
(a) (b) (c) (d) (e) <-classified as
---- ---- ---- ---- ----
184 2 5 6 6 (a): class 1
80 30 19 6 11 (b): class 2
3 250 (c): class 3
2 75 34 (d): class 4
18 69 (e): class 5
A confusion matrix, or error matrix, is a table that compares the classifications made by the model with the actual classes in the data, thereby showing how accurately the model classifies or predicts.
A confusion matrix has the same number of columns and rows, where the row and column headers represent the class variable values - in our example, the risk_rating values. Since there are 5 class values, the table is 5 x 5, as shown above.
The column headers indicate the risk_rating value predicted or classified by the model, using the labels (a), (b), (c), and so on.
The row headers show the risk_rating value in the actual data, also represented by (a), (b), (c), (d), and (e). Here, however, each label is annotated with the risk_rating value it represents: (a) represents risk_rating 1, (b) represents risk_rating 2, and so on.
Each cell at the intersection of a row and a column counts the cases whose actual class is the row's class and whose predicted class is the column's class.
Finally, let's add up all these numbers:
number of correct predictions: 608 (184 + 30 + 250 + 75 + 69).
number of wrong predictions: 192 (2 + 5 + 6 + 6 + 80 + 19 + 6 + 11 + 3 + 2 + 34 + 18).
The total is 800 cases, matching the size of the training set, and the 192 errors are consistent with the Errors figure in the output.
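The same totals can be recomputed directly in R from the training set. This is a quick sketch; the object names prediksi_training and tabel_latih are just illustrative.
# recompute the training-set confusion matrix and error count
prediksi_training <- predict(risk_rating_model, input_training_set)
tabel_latih <- table(aktual = class_training_set, prediksi = prediksi_training)
tabel_latih
sum(diag(tabel_latih))                      # correctly classified cases
sum(tabel_latih) - sum(diag(tabel_latih))   # misclassified cases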
We can relabel the class variable values with the syntax below. Note that this value-by-value replacement only works while risk_rating is still a character or numeric column; if it has already been converted to a factor, assigning labels this way produces NAs.
dataCreditRating$risk_rating[dataCreditRating$risk_rating == "1"] <- "satu"
dataCreditRating$risk_rating[dataCreditRating$risk_rating == "2"] <- "dua"
dataCreditRating$risk_rating[dataCreditRating$risk_rating == "3"] <- "tiga"
dataCreditRating$risk_rating[dataCreditRating$risk_rating == "4"] <- "empat"
dataCreditRating$risk_rating[dataCreditRating$risk_rating == "5"] <- "lima"
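Alternatively, if risk_rating has already been converted to a factor (as in the earlier step), the same relabeling can be done in one step through its levels. A short sketch:
# relabel the factor levels directly (works when risk_rating is already a factor)
levels(dataCreditRating$risk_rating) <- c("satu", "dua", "tiga", "empat", "lima")
str(dataCreditRating$risk_rating)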
The last part of the output is the list of determinant variables used in the decision tree model.
Attribute usage:
100.00% jumlah_tanggungan
73.00% durasi_pinjaman_bulan
The output tells us how heavily each variable is used. Here, jumlah_tanggungan ranks first with 100% and durasi_pinjaman_bulan follows with 73.00%. This also explains why jumlah_tanggungan occupies the root node of our model.
Here is the plot of the C5.0 decision tree, annotated with colors, with the explanation following the picture.
The red color shows the nodes and their numbering
red circle 1 is the first level node which is the root node with a determinant variable jumlah_tanggungan.
red circle 2 is the second level node with a determinant variable jumlah_tanggungan.
red circle 3 is Node 7 which is the leaf node for risk_rating classification.
The blue color indicates the split condition to the next nodes
blue circle 4 indicates a split condition where the loan duration is less than or equal to 24 months.
blue circle 5 indicates a split condition where the loan duration is more than 24 months.
The green color indicates the amount of data that has been classified
green circle 6 shows the classification results of 98 data.
the green circle 7 shows the classification results of 129 data.
The purple color indicates the classification result and its distribution (as ratios between 0 and 1)
purple circle 8 shows that the majority risk_rating at that node is 4, so the model takes 4 as its classification. Data with risk_rating 5, 1, and 2 also actually end up in the condition that leads to Node 10.
purple circle 9 shows that the majority risk_rating at that node is 5, so the model takes 5 as its classification. Data with risk_rating 4, 2, and 1 also actually end up in the condition that leads to Node 11.
The confusion matrix in the output of our previous model is an evaluation of the model on the training set. However, we also need to evaluate the model on the testing set we have prepared.
The C50 package provides a predict method, called as predict(model, new_data), which makes predictions from the model and the test data. The full call looks as follows.
predict(risk_rating_model, input_testing_set)
## [1] 1 1 3 3 5 5 1 1 1 3 1 2 1 1 3 3 1 3 3 3 3 3 1 5 1 1 3 1 3 5 1 1 2 1 5 1 1
## [38] 5 3 3 3 3 4 3 3 1 3 5 2 3 2 5 3 5 1 5 4 5 3 4 1 3 4 4 3 5 5 5 3 1 1 1 1 3
## [75] 5 1 4 5 3 1 3 3 3 3 3 1 3 3 5 4 5 3 3 3 1 1 5 5 3 3
## Levels: 1 2 3 4 5
It can be seen that the predictions correspond, in order, to the rows of the testing set, and the predicted values fall within the risk_rating range of 1 to 5.
We will store the risk_rating from the original dataset and the prediction results as two additional columns in the input_testing_set data frame, named risk_rating and hasil_prediksi.
input_testing_set$risk_rating <- dataCreditRating[-indeks_training_set,]$risk_rating #save the original value of risk_rating into column risk_rating
input_testing_set$hasil_prediksi <- predict(risk_rating_model, input_testing_set) #save the predicted value into column hasil_prediksi
print(input_testing_set)
## durasi_pinjaman_bulan jumlah_tanggungan risk_rating hasil_prediksi
## 3 12 0 1 1
## 5 36 0 2 1
## 8 48 3 2 3
## 40 36 3 2 3
## 41 48 6 2 5
## 44 48 5 2 5
## 58 12 0 1 1
## 70 12 0 1 1
## 109 12 0 1 1
## 110 12 4 3 3
## 122 12 0 1 1
## 151 48 0 2 2
## 179 36 1 1 1
## 180 36 1 2 1
## 182 24 4 3 3
## 195 48 3 3 3
## 200 24 0 1 1
## 217 12 4 3 3
## 230 48 2 3 3
## 231 12 3 3 3
## 234 24 3 3 3
## 236 24 4 3 3
## 238 24 0 1 1
## 245 36 5 4 5
## 252 24 0 1 1
## 253 24 0 1 1
## 260 48 1 3 3
## 265 36 0 2 1
## 275 12 3 3 3
## 279 36 6 5 5
## 285 36 1 1 1
## 295 24 0 1 1
## 317 48 0 2 2
## 343 24 0 1 1
## 350 48 6 5 5
## 352 12 1 1 1
## 356 36 2 2 1
## 369 48 6 5 5
## 373 48 3 3 3
## 375 48 2 3 3
## 384 24 3 3 3
## 388 36 3 3 3
## 399 24 6 4 4
## 419 48 3 3 3
## 433 24 4 3 3
## 437 36 1 1 1
## 446 24 3 3 3
## 455 48 5 5 5
## 493 48 0 2 2
## 496 12 3 3 3
## 501 48 0 3 2
## 521 48 5 4 5
## 524 48 2 3 3
## 527 36 5 5 5
## 534 36 1 1 1
## 536 48 6 5 5
## 544 12 5 4 4
## 548 48 6 5 5
## 561 12 3 3 3
## 565 12 6 4 4
## 574 24 1 1 1
## 577 48 2 3 3
## 587 12 6 4 4
## 594 12 6 4 4
## 612 24 4 3 3
## 616 48 6 5 5
## 621 36 5 5 5
## 632 48 6 5 5
## 641 36 4 3 3
## 645 12 2 2 1
## 657 12 2 1 1
## 675 12 2 1 1
## 687 12 2 1 1
## 697 36 4 3 3
## 704 48 6 5 5
## 707 12 2 1 1
## 716 12 5 4 4
## 721 36 5 5 5
## 729 48 1 3 3
## 737 12 2 1 1
## 743 36 3 3 3
## 748 48 1 3 3
## 749 36 4 3 3
## 786 48 1 3 3
## 799 12 3 3 3
## 801 24 2 1 1
## 806 24 4 3 3
## 814 36 3 3 3
## 825 36 6 5 5
## 831 24 6 4 4
## 861 48 5 5 5
## 863 12 3 3 3
## 869 48 3 3 3
## 870 48 3 3 3
## 872 24 2 1 1
## 880 36 1 2 1
## 888 48 5 5 5
## 890 48 5 5 5
## 893 48 3 3 3
## 897 48 2 3 3
Note: -indeks_training_set (with a minus sign in front) selects all the rows whose index is not in the training set, i.e. the rows of the testing set.
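Negative indexing in R drops the listed positions and keeps the rest, which is why -indeks_training_set yields exactly the testing-set rows. A tiny standalone illustration:
# negative indexing: drop positions 1 and 3, keep the rest
x <- c(10, 20, 30, 40, 50)
x[-c(1, 3)]
## [1] 20 40 50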
After the predictions for the testing set are complete, the next step is to look at the distribution of correct and incorrect predictions. We do this with a confusion matrix. To create it, we can use the dcast(row_variable ~ column_variable, data = dataframe) function from the reshape2 package.
library("reshape2")
dcast(hasil_prediksi ~ risk_rating, data=input_testing_set)
## Using hasil_prediksi as value column: use value.var to override.
## Aggregation function missing: defaulting to length
## hasil_prediksi 1 2 3 4 5
## 1 1 24 6 0 0 0
## 2 2 0 3 1 0 0
## 3 3 0 2 37 0 0
## 4 4 0 0 0 7 0
## 5 5 0 2 0 2 16
The row headers show the predicted risk_rating values (hasil_prediksi), while the column headers show the actual risk_rating values.
To calculate the error percentage, we first count the number of correct predictions. A prediction is correct if risk_rating is the same as hasil_prediksi. Written as code, this is as follows.
input_testing_set$risk_rating==input_testing_set$hasil_prediksi
The next step is to filter the data frame with this logical vector, using the following syntax.
input_testing_set[input_testing_set$risk_rating==input_testing_set$hasil_prediksi,]
We then count the number of rows returned by this filter by wrapping the expression in nrow(), as follows.
nrow(input_testing_set[input_testing_set$risk_rating==input_testing_set$hasil_prediksi,])
## [1] 87
How about incorrect predictions? We simply use the != operator instead.
nrow(input_testing_set[input_testing_set$risk_rating!=input_testing_set$hasil_prediksi,])
## [1] 13
It can be seen that the number of prediction errors is 13. This is consistent with the 87 correct predictions, since the two together total 100 – the size of the testing set.
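The accuracy on the testing set can also be computed in a single line, using the two columns added earlier:
# proportion of correct predictions on the testing set (87 out of 100)
mean(input_testing_set$risk_rating == input_testing_set$hasil_prediksi)
## [1] 0.87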
The new application data needs to be formed as a data frame whose variable names exactly match those used as model input. From the beginning of the modeling we have used two variables, namely:
jumlah_tanggungan
durasi_pinjaman_bulan
Both are of numeric data type (numbers). The following is an example of creating a data frame with these two variables.
aplikasi_baru <- data.frame(jumlah_tanggungan = 6, durasi_pinjaman_bulan = 12)
print(aplikasi_baru)
## jumlah_tanggungan durasi_pinjaman_bulan
## 1 6 12
We will now predict the risk_rating of the new application data created above using the predict() function.
predict(risk_rating_model, aplikasi_baru)
## [1] 4
## Levels: 1 2 3 4 5
This means that the predicted risk_rating for this new application is 4, out of the possible values 1, 2, 3, 4, and 5. A rating of 4 is a fairly high risk, so this new application may be rejected, depending on the policy of the lending institution.
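If desired, the institution's policy can be attached directly to the prediction. The sketch below assumes a hypothetical cut-off – reject predicted ratings of 4 or 5, accept anything lower – which is only an illustration, not a rule taken from the dataset.
# hypothetical acceptance policy: reject predicted ratings 4 and 5
prediksi_baru <- predict(risk_rating_model, aplikasi_baru)
if (as.integer(as.character(prediksi_baru)) >= 4) "rejected" else "accepted"
## [1] "rejected"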
Above, we learned how to make predictions from a data frame based on the model we created. Now let's try predicting for input values that do not appear in the dataset used to build the model.
#create a new application data frame
aplikasi_baru <- data.frame(jumlah_tanggungan = 6, durasi_pinjaman_bulan = 64)
#make the prediction
predict(risk_rating_model, aplikasi_baru)
## [1] 5
## Levels: 1 2 3 4 5
This means that the predicted risk_rating for this new application is 5, out of the possible values 1, 2, 3, 4, and 5. A rating of 5 is a very high risk value. Note that a loan duration of 64 months does not appear in the data the model was built from, yet the model still produces a prediction by following the matching branch of the tree (jumlah_tanggungan > 4 and durasi_pinjaman_bulan > 24).
We have predicted the credit risk value (risk_rating) of the new application data.
The function used is very simple, but we need to be strict about the conditions when preparing the input:
the input is a dataframe
the fields in the dataframe must match the input used to generate the model
If these conditions are not met – for example, if the columns of the data frame do not match those used to build the model – the prediction can fail with an error.
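For example, several new applications can be scored at once as long as the data frame carries exactly the two predictor columns. The values below (and the name aplikasi_batch) are made up purely for illustration; the predictions follow the tree shown earlier.
# score several hypothetical applications at once
aplikasi_batch <- data.frame(
  jumlah_tanggungan     = c(0, 3, 6),
  durasi_pinjaman_bulan = c(12, 36, 48)
)
predict(risk_rating_model, aplikasi_batch)
## [1] 1 3 5
## Levels: 1 2 3 4 5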