Introduction
A corporate credit rating is an opinion from an independent agency about the likelihood that a corporation will fully meet its financial obligations (that is, the terms of its contracts) as they come due. A corporation's credit rating therefore signals its ability to pay its creditors. It is important to remember that credit ratings are opinions, not facts.
Corporate credit ratings, provided by specialized agencies, evaluate a company's creditworthiness and offer an essential financial cue to potential investors: they give investors a clearer picture of the risk attached to investing in the company's debt. Every business strives for a high credit rating in order to attract more investment and to borrow at lower interest rates.
Most credit rating agencies use their own discrete ordinal rating scale. The three most prominent agencies are Standard & Poor's (S&P), Moody's, and Fitch. The S&P scale uses the following grades: AAA, AA+, AA, AA-, A+, A, A-, BBB+, BBB, BBB-, BB+, BB, BB-, B+, B, B-, CCC+, CCC, CCC-, CC, C, D. There are 22 grades in total, with AAA being the most creditworthy and D the riskiest.
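Because the scale is ordinal, it can be encoded in R as an ordered factor. The short sketch below is illustrative only and simply lists the 22 S&P grades named above.
# illustrative sketch: encode the S&P scale as an ordered factor, best to worst
sp_grades <- c("AAA", "AA+", "AA", "AA-", "A+", "A", "A-",
               "BBB+", "BBB", "BBB-", "BB+", "BB", "BB-",
               "B+", "B", "B-", "CCC+", "CCC", "CCC-", "CC", "C", "D")
rating_example <- factor(c("A-", "BBB", "D"), levels = sp_grades, ordered = TRUE)
rating_example < "BBB-"  # TRUE TRUE FALSE: with AAA as the first level, better grades compare as "smaller"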
The data can be accessed at https://www.kaggle.com/datasets/kirtandelwadia/corporate-credit-rating-with-financial-ratios
For more information about the financial indicators, visit https://financialmodelingprep.com/market-indexes-major-markets. The additional (non-ratio) features are Name, Symbol (for trading), Rating Agency Name, Date, and Sector.
Business Problem
This project aims to predict the credit rating grade of companies, using a grade scale based on the S&P rating scale. Because the target is a discrete grade, this is a multiclass classification task, which is part of supervised learning. Credit rating grade is the target variable, while the financial ratios are the features.
Data Processing
Load Libraries
library(dplyr)
library(tidyr)
library(glue)
library(caret)
library(ggplot2)
library(lubridate)
library(corrplot)
library(tibble)
library(GGally)
library(plotly)
library(nnet)
library(randomForest)
library(stargazer)
library(DMwR)
library(party)
library(e1071)
library(kernlab)
library(scales)
library(class)
library(psych)
library(knitr)
library(reshape2)
library(pROC)
Read Data
# read data
data <- read.csv("data/ratio.csv")
A set of credit ratings issued by major agencies such as Standard & Poor's to large US firms (traded on the NYSE or Nasdaq) from 2010 to 2016. The loaded file contains 7,805 rating records with 25 columns, of which 16 are financial ratios. Financial indicators of this kind are commonly grouped into:
Liquidity Measurement Ratios: currentRatio, quickRatio, cashRatio, daysOfSalesOutstanding
Profitability Indicator Ratios: grossProfitMargin, operatingProfitMargin, pretaxProfitMargin, netProfitMargin, effectiveTaxRate, returnOnAssets, returnOnEquity, returnOnCapitalEmployed
Debt Ratios: debtRatio, debtEquityRatio
Operating Performance Ratios: assetTurnover
Cash Flow Indicator Ratios: operatingCashFlowPerShare, freeCashFlowPerShare, cashPerShare, operatingCashFlowSalesRatio, freeCashFlowOperatingCashFlowRatio
To make sure the data is fully prepared, we apply data transformations, scaling, outlier handling, and other statistical strategies where needed. It is best practice to preprocess the data before any analysis: it must be cleaned and transformed before it is used for analysis and modeling.
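As a minimal illustration (this is not the exact pipeline used below, which handles outliers and scaling by hand), the caret package loaded above can centre and scale all numeric predictors in a few lines:
# hedged sketch: centre and scale the numeric predictors with caret::preProcess
num_data <- data %>% select_if(is.numeric)
pp <- preProcess(num_data, method = c("center", "scale"))
num_scaled <- predict(pp, num_data)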
Pre-processing
# data structure
glimpse(data)
Rows: 7,805
Columns: 25
$ Rating.Agency <chr> "Standard & Poor's Ratings Services", "S…
$ Corporation <chr> "American States Water Co.", "Automatic …
$ Rating <chr> "A-", "AAA", "BBB-", "AA-", "A", "BBB+",…
$ Rating.Date <chr> "2010-07-30", "2010-09-16", "2010-11-23"…
$ CIK <int> 1056903, 8670, 8858, 1035201, 721371, 72…
$ Binary.Rating <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1…
$ SIC.Code <dbl> 4941, 7374, 5065, 4941, 5122, 5122, 3312…
$ Sector <chr> "Utils", "BusEq", "Shops", "Utils", "Sho…
$ Ticker <chr> "AWR", "ADP", "AVT", "CWT", "CAH", "CAH"…
$ Current.Ratio <dbl> 1.1507, 1.1129, 1.9276, 0.8358, 1.2931, …
$ Long.term.Debt...Capital <dbl> 0.4551, 0.0072, 0.2924, 0.4708, 0.2644, …
$ Debt.Equity.Ratio <dbl> 0.8847, 0.0073, 0.4255, 0.9491, 0.4036, …
$ Gross.Margin <dbl> 77.6230, 43.6619, 11.9008, 64.5096, 3.83…
$ Operating.Margin <dbl> 19.4839, 19.8327, 3.3173, 18.4549, 1.326…
$ EBIT.Margin <dbl> 19.4839, 19.8327, 3.3173, 18.4549, 1.326…
$ EBITDA.Margin <dbl> 28.9834, 23.9379, 3.6338, 27.9377, 1.584…
$ Pre.Tax.Profit.Margin <dbl> 13.6093, 20.8699, 3.0536, 15.1135, 1.230…
$ Net.Profit.Margin <dbl> 8.3224, 13.5690, 2.1418, 9.0246, 0.6518,…
$ Asset.Turnover <dbl> 0.3173, 0.3324, 2.4620, 0.2946, 4.9276, …
$ ROE...Return.On.Equity <dbl> 8.1724, 22.0354, 13.6376, 9.6412, 11.125…
$ Return.On.Tangible.Equity <dbl> 8.1978, 47.2858, 16.7991, 9.7015, 19.418…
$ ROA...Return.On.Assets <dbl> 2.6385, 4.4944, 5.2731, 2.6583, 2.9364, …
$ ROI...Return.On.Investment <dbl> 4.4530, 21.8765, 9.6494, 5.1018, 8.1844,…
$ Operating.Cash.Flow.Per.Share <dbl> 1.9957, 0.2501, -7.6079, 1.7438, 1.9725,…
$ Free.Cash.Flow.Per.Share <dbl> -0.1333, 0.3132, -7.3231, -0.8999, 2.417…
Map the credit rating into a grade scale based on the S&P rating grades, and remove unused columns.
# change columns format
data <- data %>% mutate(Rating = as.factor(Rating),
Sector = as.factor(Sector),
Rating.Agency = as.factor(Rating.Agency),
Rating.Date = ymd(Rating.Date),
Grade = case_when(Rating == 'AAA'~ "Lowest Risk",
Rating %in% c('AA+' , 'AA', 'AA-') ~ "Very Low Risk",
Rating %in% c("A+" , 'A' , "A-") ~ "Low Credit Risk",
Rating %in% c('BBB+' , 'BBB' , 'BBB-') ~ "Moderate Credit Risk",
Rating %in% c('BB+' , 'BB' , 'BB-') ~ "Substantial Credit Risk",
Rating %in% c('B+' , 'B' , 'B-') ~ "High Credit Risk",
Rating %in% c('CCC+', 'CCC' , 'CCC-') ~ "Very High Credit Risk",
Rating %in% c('CC' , 'C' , 'CC+', 'D') ~ "In or Near Default"),
Grade = as.factor(Grade)) %>% select(-c(CIK, Binary.Rating, SIC.Code)) %>% rename(FCFPS = Free.Cash.Flow.Per.Share,
OCFPS = Operating.Cash.Flow.Per.Share ,
ROI = ROI...Return.On.Investment,
ROA = ROA...Return.On.Assets,
ROTE = Return.On.Tangible.Equity,
ROE = ROE...Return.On.Equity,
LTDC = Long.term.Debt...Capital,
PTPM = Pre.Tax.Profit.Margin,
DER = Debt.Equity.Ratio,
NPM = Net.Profit.Margin)
#data$Rating <- ordered(data$Rating, levels = c("AAA", "AA+", "AA", "AA-", "A+" , 'A' , "A-", "BBB+", "BBB", "BBB-", "BB+", "BB", "BB-", 'B+' , 'B' , 'B-', "CCC+", "CCC", "CCC-", "CC+", "CC", "C", "D"))
#data$Grade <- ordered(data$Grade, levels = c("Lowest Risk", "Very Low Risk", "Low Credit Risk", "Moderate Credit Risk", "Substantial Credit Risk", "High Credit Risk" , "Very High Credit Risk", "In or Near Default", "In Default"))
Check missing data
# check missing value
colSums(is.na(data))
Rating.Agency Corporation Rating Rating.Date
0 0 0 0
Sector Ticker Current.Ratio LTDC
0 0 0 0
DER Gross.Margin Operating.Margin EBIT.Margin
0 0 0 0
EBITDA.Margin PTPM NPM Asset.Turnover
0 0 0 0
ROE ROTE ROA ROI
0 0 0 0
OCFPS FCFPS Grade
0 0 0
Remove duplicates and missing data
# remove duplicate rows
data <- unique(data)
# remove row containing NA value
data <- data %>% filter(complete.cases(.))
Check data distribution of each predictor
data %>%
select_if(is.numeric) %>%
boxplot(main = 'Distribution of Each Predictor', xlab = 'Predictor', ylab = 'Values')
Handling Outliers
Our data can be visually examined to identify whether any outliers are present. Outliers, as the name suggests, lie far outside the bulk of the data, and forcing the model to accommodate them can distort the fit and reduce its resolving power. The boxplots show that some variables, such as Return.On.Tangible.Equity (ROTE) and Debt.Equity.Ratio (DER), have noticeable outliers.
boxplot(data$DER)
boxplot(data$ROI)
boxplot(data$ROTE)
boxplot(data$ROE)
boxplot(data$FCFPS)
Create a data_outliers helper function that flags extreme outliers using the 1.5 * IQR rule
data_outliers <- function(x) {
Q1 <- quantile(x, probs=.25)
Q3 <- quantile(x, probs=.75)
iqr = Q3-Q1
upper_limit = Q3 + (iqr*1.5)
lower_limit = Q1 - (iqr*1.5)
x > upper_limit | x < lower_limit
}
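As a quick check on a made-up vector, only the value lying outside the 1.5 * IQR fences is flagged:
# toy example (illustrative only)
x <- c(1, 2, 2, 3, 3, 4, 50)
data_outliers(x)  # FALSE FALSE FALSE FALSE FALSE FALSE TRUE: only 50 lies outside the fences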
#outlier remover function
remove_outliers <- function(df, cols = names(df)) {
for (col in cols) {
df <- df[!data_outliers(df[[col]]),]
}
df
}
Remove the extreme outliers
data_no_outlier <- remove_outliers(data, c("ROTE", "ROE", "ROI", "NPM", "OCFPS", "FCFPS"))
#remove_outliers(data, c("Current.Ratio", "Long.term.Debt...Capital","Debt.Equity.Ratio", "Gross.Margin" ,"Operating.Margin", "EBIT.Margin", "EBITDA.Margin", "Pre.Tax.Profit.Margin","Net.Profit.Margin", "Asset.Turnover", "ROE...Return.On.Equity", "Return.On.Tangible.Equity", "ROA...Return.On.Assets", "ROI...Return.On.Investment", "Operating.Cash.Flow.Per.Share", "Free.Cash.Flow.Per.Share"))
Find remaining outliers with z-scores
To remove the remaining, less extreme outliers, we use the z-score of one of the numeric variables (Current.Ratio), since the target variable itself is categorical.
# Check the outlier and remove after scaling using zscore threshold point = 3
data_no_outlier <- data_no_outlier %>%
mutate(zscore = (Current.Ratio - mean(Current.Ratio)) / sd(Current.Ratio)) %>%
filter(zscore <=3) %>%
select(-c(zscore))
data_no_outlier %>%
group_by(Grade) %>%
summarise(freq = n())
# A tibble: 8 × 2
Grade freq
<fct> <int>
1 High Credit Risk 431
2 In or Near Default 14
3 Low Credit Risk 1210
4 Lowest Risk 53
5 Moderate Credit Risk 1557
6 Substantial Credit Risk 793
7 Very High Credit Risk 87
8 Very Low Risk 280
Outlier on Original Data Distribution
# check if the outliers has an influence to targeted variable
data %>%
mutate(Date = as.numeric(Rating.Date)) %>%
ggplot(aes(x = Rating.Date, y = Grade)) +
geom_point() +
geom_point(data = data_no_outlier, aes(x = Rating.Date, y = Grade), col = 'red') +
labs(
title = 'Distribution of Credit Rating : Original vs Outlier (Black)',
x = NULL,
y = 'Rating') +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
Distribution on Each Predictor
The distribution plot shows that Current.Ratio, Operating.Margin, ROA, ROE, ROTE, and EBIT.Margin all have fairly regular shapes, which suggests these variables can be used together without further transformation.
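That plot is not reproduced here; a minimal sketch of how such a per-predictor distribution plot could be drawn (assuming data_no_outlier and the tidyverse packages loaded earlier) is:
# sketch: histogram of every numeric predictor, each on its own scale
data_no_outlier %>%
  select_if(is.numeric) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ predictor, scales = "free")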
Below we can find the main statistical moments for the numeric variables:
| Variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Current.Ratio | 1 | 4425 | 1.78 | 1.05 | 1.52 | 1.62 | 0.79 | 0.21 | 6.55 | 6.34 | 1.77 | 3.99 | 0.02 |
| LTDC | 2 | 4425 | 0.41 | 0.24 | 0.41 | 0.40 | 0.16 | -1.72 | 3.71 | 5.44 | 5.73 | 76.66 | 0.00 |
| DER | 3 | 4425 | 0.96 | 1.19 | 0.75 | 0.80 | 0.49 | -16.51 | 21.28 | 37.80 | 3.62 | 73.58 | 0.02 |
| Gross.Margin | 4 | 4425 | 41.02 | 22.20 | 38.47 | 39.94 | 25.90 | 2.07 | 100.00 | 97.93 | 0.42 | -0.43 | 0.33 |
| Operating.Margin | 5 | 4425 | 13.51 | 9.22 | 12.28 | 12.88 | 9.15 | -16.51 | 53.97 | 70.47 | 0.63 | 0.57 | 0.14 |
| EBIT.Margin | 6 | 4425 | 13.56 | 9.29 | 12.30 | 12.91 | 9.10 | -16.51 | 53.97 | 70.47 | 0.68 | 0.75 | 0.14 |
| EBITDA.Margin | 7 | 4425 | 21.27 | 12.97 | 18.66 | 20.24 | 12.60 | -8.36 | 88.35 | 96.71 | 0.85 | 0.88 | 0.19 |
| PTPM | 8 | 4425 | 10.85 | 8.79 | 9.76 | 10.29 | 8.37 | -23.27 | 42.49 | 65.77 | 0.52 | 0.59 | 0.13 |
| NPM | 9 | 4425 | 7.81 | 6.56 | 6.97 | 7.45 | 6.52 | -10.68 | 25.88 | 36.57 | 0.43 | 0.07 | 0.10 |
| Asset.Turnover | 10 | 4425 | 0.82 | 0.72 | 0.64 | 0.68 | 0.40 | 0.07 | 8.50 | 8.44 | 3.23 | 16.46 | 0.01 |
| ROE | 11 | 4425 | 12.42 | 8.70 | 11.84 | 12.34 | 7.69 | -13.06 | 37.90 | 50.96 | 0.08 | 0.20 | 0.13 |
| ROTE | 12 | 4425 | 10.02 | 26.35 | 11.66 | 10.50 | 19.19 | -69.74 | 85.58 | 155.32 | -0.21 | 0.77 | 0.40 |
| ROA | 13 | 4425 | 4.79 | 3.50 | 4.31 | 4.67 | 3.27 | -5.04 | 17.47 | 22.51 | 0.34 | 0.13 | 0.05 |
| ROI | 14 | 4425 | 7.38 | 5.36 | 6.74 | 7.23 | 5.00 | -6.82 | 21.97 | 28.79 | 0.27 | -0.11 | 0.08 |
| OCFPS | 15 | 4425 | 0.21 | 1.15 | 0.18 | 0.21 | 0.92 | -2.86 | 3.41 | 6.26 | -0.01 | 0.25 | 0.02 |
| FCFPS | 16 | 4425 | 0.05 | 1.20 | 0.07 | 0.05 | 1.03 | -2.98 | 3.10 | 6.08 | -0.03 | 0.06 | 0.02 |
library(zoo)
recap <- data_no_outlier %>%
mutate(year = year(data_no_outlier$Rating.Date),
year = as.factor(year),
grade = case_when(Rating == 'AAA'~ "Lowest Risk",
Rating %in% c('AA+', 'AA','AA-') ~ "Very Low Risk",
Rating %in% c('A+', 'A','A-') ~ "Low Credit Risk",
Rating %in% c('BBB+', 'BBB','BBB-') ~ "Moderate Credit Risk",
Rating %in% c('BB+', 'BB','BB-') ~ "Substantial Credit Risk",
Rating %in% c('B+', 'B','B-') ~ "High Credit Risk",
Rating %in% c('CCC+', 'CCC','CCC-') ~ "Very High Credit Risk",
Rating %in% c('CC', 'C', 'CC+', 'D') ~ "In or Near Default")) %>%
group_by(grade, year) %>%
summarise(count = n()) %>%
ggplot(aes(x = year, y = count)) +
geom_line(aes(group = grade, col = grade)) +
geom_point(aes(col = grade, group = grade)) +
labs(title = "Rating Recap",
x = NULL,
y = NULL) +
theme(axis.text.x = element_text(hjust = 1, angle = 45),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line(colour = "grey"))
ggplotly(recap) %>% plotly::layout(legend=list(x=0,xanchor='left',
yanchor='top',
orientation='h'))
library(glue)
plotcount <-data_no_outlier %>%
mutate(year = year(data_no_outlier$Rating.Date),
year = as.factor(year),
grade = case_when(Rating == 'AAA'~ "Lowest Risk",
Rating %in% c('AA+', 'AA','AA-') ~ "Very Low Risk",
Rating %in% c('A+', 'A','A-') ~ "Low Credit Risk",
Rating %in% c('BBB+', 'BBB','BBB-') ~ "Moderate Credit Risk",
Rating %in% c('BB+', 'BB','BB-') ~ "Substantial Credit Risk",
Rating %in% c('B+', 'B','B-') ~ "High Credit Risk",
Rating %in% c('CCC+', 'CCC','CCC-') ~ "Very High Credit Risk",
Rating %in% c('CC', 'C', 'CC+', "D") ~ "In or Near Default")) %>%
group_by(grade, year) %>%
summarise(freq = n()) %>%
mutate(label = glue("Total : {freq}
Grade : {grade}")) %>%
ggplot(aes(x=year, y = freq, fill = grade, text = label)) +
geom_col(position = "dodge") +
labs(title = "Credit Grade based on S&P",
x = NULL,
y = NULL,
fill = "Grade") +
theme(legend.title = element_blank(),
axis.text.x = element_text(hjust = 1),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line(colour = "grey"))
ggplotly(plotcount, tooltip = "text") %>% plotly::layout(legend=list(x=0,xanchor='left',
yanchor='top',
orientation='h'))
plotcount <-data_no_outlier %>%
mutate(year = year(data_no_outlier$Rating.Date),
year = as.factor(year),
grade = case_when(Rating == 'AAA'~ "Lowest Risk",
Rating %in% c('AA+', 'AA','AA-') ~ "Very Low Risk",
Rating %in% c('A+', 'A','A-') ~ "Low Credit Risk",
Rating %in% c('BBB+', 'BBB','BBB-') ~ "Moderate Credit Risk",
Rating %in% c('BB+', 'BB','BB-') ~ "Substantial Credit Risk",
Rating %in% c('B+', 'B','B-') ~ "High Credit Risk",
Rating %in% c('CCC+', 'CCC','CCC-') ~ "Very High Credit Risk",
Rating %in% c('CC', 'C', 'CC+', 'D') ~ "In or Near Default")) %>%
group_by(Rating.Agency, year) %>%
summarise(freq = n()) %>%
mutate(label = glue("Total : {freq}
Rating.Agency : {Rating.Agency}")) %>%
ggplot(aes(x=year, y = freq, fill = Rating.Agency, text = label)) +
geom_col(position = "dodge") +
  labs(title = "Rating Counts by Agency",
       x = NULL,
       y = NULL,
       fill = "Rating Agency") +
theme(legend.title = element_blank(),
axis.text.x = element_text(hjust = 1),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line(colour = "grey"))
ggplotly(plotcount, tooltip = "text") %>% plotly::layout(legend=list(x=0,xanchor='left',
yanchor='top',
orientation='h'))
data_no_outlier %>% select_if(is.numeric) %>% describe()
vars n mean sd median trimmed mad min max range
Current.Ratio 1 4425 1.78 1.05 1.52 1.62 0.79 0.21 6.55 6.34
LTDC 2 4425 0.41 0.24 0.41 0.40 0.16 -1.72 3.71 5.44
DER 3 4425 0.96 1.19 0.75 0.80 0.49 -16.51 21.28 37.80
Gross.Margin 4 4425 41.02 22.20 38.47 39.94 25.90 2.07 100.00 97.93
Operating.Margin 5 4425 13.51 9.22 12.28 12.88 9.15 -16.51 53.97 70.47
skew kurtosis se
Current.Ratio 1.77 3.99 0.02
LTDC 5.73 76.66 0.00
DER 3.62 73.58 0.02
Gross.Margin 0.42 -0.43 0.33
Operating.Margin 0.63 0.57 0.14
[ reached 'max' / getOption("max.print") -- omitted 11 rows ]
Correlation between features
We will check the correlation between all the numerical variables, selected with select_if(is.numeric). First, the correlation is inspected visually with a corrplot chart: most variables are not highly correlated, as there are few values close to dark blue or dark red.
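The corrplot chart itself is not shown here; a minimal sketch of how it can be produced from the cleaned data is:
# sketch: visualise the correlation matrix of the numeric predictors
data_no_outlier %>%
  select_if(is.numeric) %>%
  cor() %>%
  corrplot(method = "color", type = "upper", tl.cex = 0.7)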
data_no_outlier %>% select_if(is.numeric) %>% cor() %>% as.data.frame()
Current.Ratio LTDC DER Gross.Margin
Current.Ratio 1.0000000 -0.178515061 -0.15364205 -0.167916846
LTDC -0.1785151 1.000000000 0.39163198 -0.003271595
DER -0.1536421 0.391631981 1.00000000 0.034986713
Gross.Margin -0.1679168 -0.003271595 0.03498671 1.000000000
Operating.Margin EBIT.Margin EBITDA.Margin PTPM
Current.Ratio -0.088624193 -0.09044379 -0.17917273 0.003487013
LTDC 0.013978920 0.01262490 0.05588717 -0.142907279
DER 0.009805031 0.01098823 0.06963403 -0.150852342
Gross.Margin 0.586646423 0.58348155 0.67830637 0.497289641
NPM Asset.Turnover ROE ROTE ROA
Current.Ratio 0.0819301 0.0241215 -0.01415236 0.143857063 0.20619358
LTDC -0.1448751 -0.1151765 -0.04885294 -0.176340053 -0.29771236
DER -0.1435755 -0.1098951 -0.01506373 -0.128222423 -0.23817451
Gross.Margin 0.4494412 -0.5339575 0.02653752 -0.007552834 0.05511231
ROI OCFPS FCFPS
Current.Ratio 0.08396988 -0.06386495 -0.04264660
LTDC -0.32259601 -0.04368007 -0.05034773
DER -0.23991519 0.01861499 -0.02716629
Gross.Margin 0.01467265 0.04844019 0.06352412
[ reached 'max' / getOption("max.print") -- omitted 12 rows ]
To be sure, we also check the maximum and minimum correlations; for the maximum, the value 1 must be excluded because it corresponds to each variable correlated with itself:
Maximum correlation: 0.99692
Minimum correlation: -0.57921
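A small sketch of how these two values can be extracted, excluding the diagonal of self-correlations:
cor_mat <- data_no_outlier %>% select_if(is.numeric) %>% cor()
diag(cor_mat) <- NA          # drop the self-correlations, which are always 1
max(cor_mat, na.rm = TRUE)   # maximum correlation between two different variables
min(cor_mat, na.rm = TRUE)   # minimum correlation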
The stronger the correlation, i.e. the closer it is to 1 or -1, the more closely related the predictors are. In our dataset, Operating.Margin and EBIT.Margin have the highest positive correlation (about 0.997, so they essentially duplicate each other), while the strongest negative correlation is about -0.58.
Model Fitting
Data Splitting
We now split the data into train and validation sets. The training set is used to train the model, which is checked against the validation set.
remove unused columns
data_clean <- data_no_outlier %>% select(-c(Corporation, Rating.Agency, Rating.Date, Sector, Ticker, Rating))
set.seed(123)
index <-createDataPartition(data_clean$Grade, p = 0.8, list = FALSE)
train <- data_clean[index,]
test <- data_clean[-index,]
#BalTrain <- droplevels.data.frame(train)
#BalTest <- droplevels.data.frame(test)
Check the Data Split
table(test$Grade)
High Credit Risk In or Near Default Low Credit Risk
86 2 242
Lowest Risk Moderate Credit Risk Substantial Credit Risk
10 311 158
Very High Credit Risk Very Low Risk
17 56
table(train$Grade)
High Credit Risk In or Near Default Low Credit Risk
345 12 968
Lowest Risk Moderate Credit Risk Substantial Credit Risk
43 1246 635
Very High Credit Risk Very Low Risk
70 224
library(performanceEstimation)
# The class counts above are highly imbalanced (e.g. only 12 training rows are
# "In or Near Default"), so oversample the minority grades with SMOTE
set.seed(1234)
train_smote <- smote(Grade ~ ., train, perc.over = 40, perc.under = 10)
table(train_smote$Grade)
High Credit Risk In or Near Default Low Credit Risk
496 492 1302
Lowest Risk Moderate Credit Risk Substantial Credit Risk
69 1639 881
Very High Credit Risk Very Low Risk
92 321
# Save training set
write.csv(train_smote, file="train_smote.csv", row.names=FALSE)
Model Fitting
MLR - Multinomial Logistic Regression:
The first model to be applied is multinomial logistic regression, a classification method that generalizes logistic regression to multiclass problems; here the target has eight grade levels. The first step in applying any classification method is to train the model, i.e. to fit the algorithm to the training sample, in this case train_smote.
model_lr <- multinom(formula = Grade ~ ., data = train_smote)
# weights: 144 (119 variable)
initial value 13749.267474
iter 10 value 11002.318919
iter 20 value 9986.627702
iter 30 value 9604.165893
iter 40 value 9337.170956
iter 50 value 9216.755819
iter 60 value 9091.369768
iter 70 value 8848.188956
iter 80 value 8620.417284
iter 90 value 8481.465769
iter 100 value 8360.355987
final value 8360.355987
stopped after 100 iterations
saveRDS(model_lr, "model_lr.rds")
# Wald z-statistics and two-tailed p-values for the multinomial coefficients
zscore_lr <- summary(model_lr)$coefficients/summary(model_lr)$standard.errors
((1 - pnorm(abs(zscore_lr), 0, 1)) * 2) %>% as.data.frame()
(Intercept) Current.Ratio
In or Near Default 0.000000000000000000000 0.000000000
Low Credit Risk 0.000000000000009769963 0.000000000
Lowest Risk 0.622580089402264835741 0.005210372
Moderate Credit Risk 0.000000000000000000000 0.000000000
LTDC DER Gross.Margin
In or Near Default 0.00000000000007394085 0.122483130204410 0.000002840239
Low Credit Risk 0.00089274131094319031 0.000000005128242 0.000123048164
Lowest Risk 0.08467737756918447545 0.783427056096841 0.008853508845
Moderate Credit Risk 0.00000000001263167348 0.000000005090008 0.503642339906
Operating.Margin EBIT.Margin EBITDA.Margin
In or Near Default 0.02215274 0.1577790 0.000000000000
Low Credit Risk 0.55326972 0.6055659 0.000002393161
Lowest Risk 0.27434224 0.3131979 0.034093847215
Moderate Credit Risk 0.46510011 0.4348954 0.001543974953
PTPM NPM Asset.Turnover
In or Near Default 0.04765292871907 0.0011571379333 0.0000000000000002220446
Low Credit Risk 0.00000004119937 0.0000001364062 0.1688071111847531113881
Lowest Risk 0.00038817130160 0.5748345135941 0.0007336048805883788049
Moderate Credit Risk 0.00131090051759 0.0049422676723 0.0000002381285177932568
ROE ROTE ROA
In or Near Default 0.0000000000000 0.00000005200874 0.20898953067587778
Low Credit Risk 0.8795474889081 0.00579935247972 0.00001129185601001
Lowest Risk 0.0000138753251 0.00003314986346 0.00000000007071588
Moderate Credit Risk 0.0000001283812 0.02126714269237 0.00004177999758515
ROI OCFPS
In or Near Default 0.0000000000000000000000 0.00000012533
Low Credit Risk 0.0000012688581576547620 0.00007172481
Lowest Risk 0.0000000000000004440892 0.00189551614
Moderate Credit Risk 0.0530720660680894518180 0.00274928093
FCFPS
In or Near Default 0.000000000000567546
Low Credit Risk 0.000005093438031345
Lowest Risk 0.035244138286956828
Moderate Credit Risk 0.000181802596967762
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
# exponentiated coefficients: relative risk ratios against the reference grade
exp(coef(model_lr)) %>% as.data.frame()
(Intercept) Current.Ratio LTDC DER
In or Near Default 590.0562626 0.2245488 0.02027662 1.1161291
Low Credit Risk 12.5792129 0.4216018 0.17941020 0.5263733
Lowest Risk 0.6077025 0.6753355 0.03002144 0.9135379
Moderate Credit Risk 58.2972970 0.5875833 0.03220894 0.5465358
Gross.Margin Operating.Margin EBIT.Margin EBITDA.Margin
In or Near Default 1.0265805 6.4403690 0.3152593 0.5308937
Low Credit Risk 1.0139895 0.8234632 1.1849971 0.9489040
Lowest Risk 1.0211796 0.6899649 1.4060398 0.9448835
Moderate Credit Risk 0.9977182 0.7879500 1.2913378 0.9698804
PTPM NPM Asset.Turnover ROE ROTE
In or Near Default 1.076986 0.8872874 0.16979868 0.6298140 1.023172
Low Credit Risk 1.132438 1.1116516 0.86892745 0.9970949 1.007120
Lowest Risk 1.209161 1.0296362 0.09427143 0.7346197 1.025885
Moderate Credit Risk 1.065679 1.0536375 0.58494319 1.0924984 1.005703
ROA ROI OCFPS FCFPS
In or Near Default 0.8727437 2.111726 0.4976056 2.369864
Low Credit Risk 0.7144567 1.325395 0.7607762 1.341585
Lowest Risk 0.4705117 2.779813 0.5885482 1.397981
Moderate Credit Risk 0.7375788 1.111440 0.8242089 1.254240
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
# predict grades for the test set
pred_model_lr <- predict(object = model_lr, newdata = test)
confusionMatrix(pred_model_lr, test$Grade)
Confusion Matrix and Statistics
Reference
Prediction High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 18 0 1
In or Near Default 2 1 7
Low Credit Risk 4 0 96
Lowest Risk 2 0 1
Moderate Credit Risk 24 0 132
Substantial Credit Risk 36 1 3
Very High Credit Risk 0 0 0
Very Low Risk 0 0 2
Reference
Prediction Lowest Risk Moderate Credit Risk
High Credit Risk 0 1
In or Near Default 3 7
Low Credit Risk 3 57
Lowest Risk 0 1
Moderate Credit Risk 3 210
Substantial Credit Risk 0 29
Very High Credit Risk 0 1
Very Low Risk 1 5
Reference
Prediction Substantial Credit Risk Very High Credit Risk
High Credit Risk 12 2
In or Near Default 2 4
Low Credit Risk 14 2
Lowest Risk 0 0
Moderate Credit Risk 68 4
Substantial Credit Risk 61 4
Very High Credit Risk 0 0
Very Low Risk 1 1
Reference
Prediction Very Low Risk
High Credit Risk 1
In or Near Default 1
Low Credit Risk 31
Lowest Risk 0
Moderate Credit Risk 19
Substantial Credit Risk 1
Very High Credit Risk 0
Very Low Risk 3
Overall Statistics
Accuracy : 0.441
95% CI : (0.408, 0.4745)
No Information Rate : 0.3526
P-Value [Acc > NIR] : 0.00000003725
Kappa : 0.223
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Credit Risk Class: In or Near Default
Sensitivity 0.20930 0.500000
Specificity 0.97864 0.970455
Pos Pred Value 0.51429 0.037037
Neg Pred Value 0.91972 0.998830
Prevalence 0.09751 0.002268
Detection Rate 0.02041 0.001134
Detection Prevalence 0.03968 0.030612
Balanced Accuracy 0.59397 0.735227
Class: Low Credit Risk Class: Lowest Risk
Sensitivity 0.3967 0.000000
Specificity 0.8266 0.995413
Pos Pred Value 0.4638 0.000000
Neg Pred Value 0.7837 0.988610
Prevalence 0.2744 0.011338
Detection Rate 0.1088 0.000000
Detection Prevalence 0.2347 0.004535
Balanced Accuracy 0.6116 0.497706
Class: Moderate Credit Risk Class: Substantial Credit Risk
Sensitivity 0.6752 0.38608
Specificity 0.5622 0.89779
Pos Pred Value 0.4565 0.45185
Neg Pred Value 0.7607 0.87015
Prevalence 0.3526 0.17914
Detection Rate 0.2381 0.06916
Detection Prevalence 0.5215 0.15306
Balanced Accuracy 0.6187 0.64193
Class: Very High Credit Risk Class: Very Low Risk
Sensitivity 0.000000 0.053571
Specificity 0.998844 0.987893
Pos Pred Value 0.000000 0.230769
Neg Pred Value 0.980704 0.939010
Prevalence 0.019274 0.063492
Detection Rate 0.000000 0.003401
Detection Prevalence 0.001134 0.014739
Balanced Accuracy 0.499422 0.520732
K-Nearest Neighbors
# training predictors: all numeric columns (these will be scaled)
train_x <- train_smote %>% select_if(is.numeric)
# training target
train_y <- train_smote %>% select(Grade)
# test predictors
test_x <- test %>% select_if(is.numeric)
# test target
test_y <- test %>% select(Grade)
train_smote %>%
pivot_longer(cols = -c(Grade), names_to = 'predictor') %>%
group_by(predictor) %>%
summarize(value = max(value)) %>%
ggplot(aes(x = predictor, y = value)) +
geom_col(fill = 'pink') +
labs(
title = 'Data Range Before Scaling',
x = 'Variable',
y = 'Value') +
theme(legend.title = element_blank(),
axis.text.x = element_text(hjust = 1, angle = 45),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line(colour = "grey"))
# KNN is distance based, so scale the predictors; run this chunk only once
train_x <- scale(train_x)
test_x <- scale(test_x,
                center = attr(train_x, "scaled:center"), # training means
                scale = attr(train_x, "scaled:scale"))   # training standard deviations
# rule of thumb: choose k close to the square root of the number of training rows
sqrt(nrow(train_x))
[1] 81.31421
library(class)
model_knn <- knn(train = train_x,    # training predictors
                 test = test_x,      # test predictors
                 cl = train_y$Grade, # training labels
                 k = 81)             # number of neighbours used for classification
saveRDS(model_knn, "model_knn.rds")
confusionMatrix(data=model_knn, reference = test_y$Grade, positive="Lowest Risk")
Confusion Matrix and Statistics
Reference
Prediction High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 15 0 0
In or Near Default 4 1 15
Low Credit Risk 8 0 141
Lowest Risk 0 0 0
Moderate Credit Risk 28 1 78
Substantial Credit Risk 31 0 3
Very High Credit Risk 0 0 0
Very Low Risk 0 0 5
Reference
Prediction Lowest Risk Moderate Credit Risk
High Credit Risk 0 5
In or Near Default 2 17
Low Credit Risk 8 95
Lowest Risk 0 0
Moderate Credit Risk 0 171
Substantial Credit Risk 0 21
Very High Credit Risk 0 0
Very Low Risk 0 2
Reference
Prediction Substantial Credit Risk Very High Credit Risk
High Credit Risk 12 1
In or Near Default 4 4
Low Credit Risk 20 3
Lowest Risk 0 0
Moderate Credit Risk 70 4
Substantial Credit Risk 52 5
Very High Credit Risk 0 0
Very Low Risk 0 0
Reference
Prediction Very Low Risk
High Credit Risk 0
In or Near Default 3
Low Credit Risk 31
Lowest Risk 0
Moderate Credit Risk 15
Substantial Credit Risk 2
Very High Credit Risk 0
Very Low Risk 5
Overall Statistics
Accuracy : 0.4365
95% CI : (0.4035, 0.47)
No Information Rate : 0.3526
P-Value [Acc > NIR] : 0.0000001647
Kappa : 0.2284
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Credit Risk Class: In or Near Default
Sensitivity 0.17442 0.500000
Specificity 0.97739 0.944318
Pos Pred Value 0.45455 0.020000
Neg Pred Value 0.91637 0.998798
Prevalence 0.09751 0.002268
Detection Rate 0.01701 0.001134
Detection Prevalence 0.03741 0.056689
Balanced Accuracy 0.57590 0.722159
Class: Low Credit Risk Class: Lowest Risk
Sensitivity 0.5826 0.00000
Specificity 0.7422 1.00000
Pos Pred Value 0.4608 NaN
Neg Pred Value 0.8247 0.98866
Prevalence 0.2744 0.01134
Detection Rate 0.1599 0.00000
Detection Prevalence 0.3469 0.00000
Balanced Accuracy 0.6624 0.50000
Class: Moderate Credit Risk Class: Substantial Credit Risk
Sensitivity 0.5498 0.32911
Specificity 0.6567 0.91436
Pos Pred Value 0.4659 0.45614
Neg Pred Value 0.7282 0.86198
Prevalence 0.3526 0.17914
Detection Rate 0.1939 0.05896
Detection Prevalence 0.4161 0.12925
Balanced Accuracy 0.6033 0.62174
Class: Very High Credit Risk Class: Very Low Risk
Sensitivity 0.00000 0.089286
Specificity 1.00000 0.991525
Pos Pred Value NaN 0.416667
Neg Pred Value 0.98073 0.941379
Prevalence 0.01927 0.063492
Detection Rate 0.00000 0.005669
Detection Prevalence 0.00000 0.013605
Balanced Accuracy 0.50000 0.540406
Naive Bayes
library(e1071)
model_naive <- naiveBayes(formula = Grade ~ .,
                          data = train_smote,
                          laplace = 1) # Laplace smoothing avoids zero class-conditional probabilities
pred_naive <- predict(model_naive, newdata = test, type = "class")
confusionMatrix(data = pred_naive,      # predicted grades
                reference = test$Grade) # actual grades
Confusion Matrix and Statistics
Reference
Prediction High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 29 0 10
In or Near Default 1 1 7
Low Credit Risk 4 0 107
Lowest Risk 0 0 32
Moderate Credit Risk 18 1 65
Substantial Credit Risk 27 0 3
Very High Credit Risk 4 0 0
Very Low Risk 3 0 18
Reference
Prediction Lowest Risk Moderate Credit Risk
High Credit Risk 0 23
In or Near Default 1 10
Low Credit Risk 4 82
Lowest Risk 2 15
Moderate Credit Risk 0 133
Substantial Credit Risk 0 31
Very High Credit Risk 0 0
Very Low Risk 3 17
Reference
Prediction Substantial Credit Risk Very High Credit Risk
High Credit Risk 34 7
In or Near Default 6 1
Low Credit Risk 20 2
Lowest Risk 1 1
Moderate Credit Risk 43 3
Substantial Credit Risk 45 0
Very High Credit Risk 2 2
Very Low Risk 7 1
Reference
Prediction Very Low Risk
High Credit Risk 1
In or Near Default 0
Low Credit Risk 20
Lowest Risk 17
Moderate Credit Risk 6
Substantial Credit Risk 3
Very High Credit Risk 0
Very Low Risk 9
Overall Statistics
Accuracy : 0.3719
95% CI : (0.3399, 0.4047)
No Information Rate : 0.3526
P-Value [Acc > NIR] : 0.1227
Kappa : 0.1939
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Credit Risk Class: In or Near Default
Sensitivity 0.33721 0.500000
Specificity 0.90578 0.970455
Pos Pred Value 0.27885 0.037037
Neg Pred Value 0.92674 0.998830
Prevalence 0.09751 0.002268
Detection Rate 0.03288 0.001134
Detection Prevalence 0.11791 0.030612
Balanced Accuracy 0.62149 0.735227
Class: Low Credit Risk Class: Lowest Risk
Sensitivity 0.4421 0.200000
Specificity 0.7937 0.924312
Pos Pred Value 0.4477 0.029412
Neg Pred Value 0.7900 0.990172
Prevalence 0.2744 0.011338
Detection Rate 0.1213 0.002268
Detection Prevalence 0.2710 0.077098
Balanced Accuracy 0.6179 0.562156
Class: Moderate Credit Risk Class: Substantial Credit Risk
Sensitivity 0.4277 0.28481
Specificity 0.7618 0.91160
Pos Pred Value 0.4944 0.41284
Neg Pred Value 0.7096 0.85382
Prevalence 0.3526 0.17914
Detection Rate 0.1508 0.05102
Detection Prevalence 0.3050 0.12358
Balanced Accuracy 0.5947 0.59821
Class: Very High Credit Risk Class: Very Low Risk
Sensitivity 0.117647 0.16071
Specificity 0.993064 0.94068
Pos Pred Value 0.250000 0.15517
Neg Pred Value 0.982838 0.94296
Prevalence 0.019274 0.06349
Detection Rate 0.002268 0.01020
Detection Prevalence 0.009070 0.06576
Balanced Accuracy 0.555355 0.55070
Decision Tree
model_ctree <- ctree(formula = Grade ~ .,
                     data = train_smote,
                     control = ctree_control(mincriterion = 0.1, # relaxed split criterion
                                             minsplit = 10,
                                             minbucket = 10))
# predict classes on the test set
pred_ctree <- predict(object = model_ctree,
                      newdata = test,
                      type = "response")
# confusion matrix on the test set
confusionMatrix(pred_ctree, reference = test$Grade, positive = "Lowest Risk")
Confusion Matrix and Statistics
Reference
Prediction High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 41 0 1
In or Near Default 2 1 0
Low Credit Risk 3 0 147
Lowest Risk 0 0 3
Moderate Credit Risk 17 0 56
Substantial Credit Risk 17 0 8
Very High Credit Risk 4 1 1
Very Low Risk 2 0 26
Reference
Prediction Lowest Risk Moderate Credit Risk
High Credit Risk 0 12
In or Near Default 0 2
Low Credit Risk 2 86
Lowest Risk 1 1
Moderate Credit Risk 1 170
Substantial Credit Risk 0 24
Very High Credit Risk 0 2
Very Low Risk 6 14
Reference
Prediction Substantial Credit Risk Very High Credit Risk
High Credit Risk 32 6
In or Near Default 1 1
Low Credit Risk 18 0
Lowest Risk 0 0
Moderate Credit Risk 37 3
Substantial Credit Risk 67 1
Very High Credit Risk 0 3
Very Low Risk 3 3
Reference
Prediction Very Low Risk
High Credit Risk 0
In or Near Default 0
Low Credit Risk 28
Lowest Risk 2
Moderate Credit Risk 5
Substantial Credit Risk 1
Very High Credit Risk 0
Very Low Risk 20
Overall Statistics
Accuracy : 0.5102
95% CI : (0.4767, 0.5437)
No Information Rate : 0.3526
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.3524
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Credit Risk Class: In or Near Default
Sensitivity 0.47674 0.500000
Specificity 0.93593 0.993182
Pos Pred Value 0.44565 0.142857
Neg Pred Value 0.94304 0.998857
Prevalence 0.09751 0.002268
Detection Rate 0.04649 0.001134
Detection Prevalence 0.10431 0.007937
Balanced Accuracy 0.70634 0.746591
Class: Low Credit Risk Class: Lowest Risk
Sensitivity 0.6074 0.100000
Specificity 0.7859 0.993119
Pos Pred Value 0.5176 0.142857
Neg Pred Value 0.8411 0.989714
Prevalence 0.2744 0.011338
Detection Rate 0.1667 0.001134
Detection Prevalence 0.3220 0.007937
Balanced Accuracy 0.6967 0.546560
Class: Moderate Credit Risk Class: Substantial Credit Risk
Sensitivity 0.5466 0.42405
Specificity 0.7916 0.92956
Pos Pred Value 0.5882 0.56780
Neg Pred Value 0.7622 0.88089
Prevalence 0.3526 0.17914
Detection Rate 0.1927 0.07596
Detection Prevalence 0.3277 0.13379
Balanced Accuracy 0.6691 0.67680
Class: Very High Credit Risk Class: Very Low Risk
Sensitivity 0.176471 0.35714
Specificity 0.990751 0.93462
Pos Pred Value 0.272727 0.27027
Neg Pred Value 0.983927 0.95545
Prevalence 0.019274 0.06349
Detection Rate 0.003401 0.02268
Detection Prevalence 0.012472 0.08390
Balanced Accuracy 0.583611 0.64588
plot(model_ctree, type = "simple")
The data has more than 10 features, so the full decision tree plot is hard to read.
Model Evaluation
The decision tree has the highest accuracy of all the classification models built so far. It also keeps a reasonable balance between sensitivity and specificity for most classes, except Lowest Risk and Very High Credit Risk, which were heavily under-represented from the start.
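As a quick cross-check (a sketch, assuming the prediction objects created above are still in memory), the raw test-set accuracy of each model can be tabulated directly:
# sketch: raw test-set accuracy per model
sapply(list(multinom    = pred_model_lr,
            knn         = model_knn,
            naive_bayes = pred_naive,
            ctree       = pred_ctree),
       function(p) mean(as.character(p) == as.character(test$Grade)))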
Model Improvement
We do not transform the training data any further, because cleaning, outlier removal, and scaling (for the distance-based model) were already handled earlier, and we leave the target variable on its original scale, since a transformed target would have to be converted back before the test results could be interpreted.
Random Forest
Create a random forest model as model_rf
set.seed(123)
# fit a random forest with 500 trees; the OOB error reported below is an
# internal estimate of the generalization error
model_rf <- randomForest(x = train_smote %>% select(-Grade),
                         y = train_smote$Grade,
                         ntree = 500)
model_rf
Call:
randomForest(x = train_smote %>% select(-Grade), y = train_smote$Grade, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 18.78%
Confusion matrix:
High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 406 0 3
In or Near Default 0 489 0
Low Credit Risk 6 0 1027
Lowest Risk 0 0 12
Moderate Credit Risk 6 1 217
Substantial Credit Risk 64 1 12
Very High Credit Risk 22 4 0
Very Low Risk 3 0 68
Lowest Risk Moderate Credit Risk
High Credit Risk 0 12
In or Near Default 0 0
Low Credit Risk 3 196
Lowest Risk 42 0
Moderate Credit Risk 0 1351
Substantial Credit Risk 0 95
Very High Credit Risk 0 2
Very Low Risk 14 19
Substantial Credit Risk Very High Credit Risk
High Credit Risk 61 13
In or Near Default 1 2
Low Credit Risk 11 0
Lowest Risk 0 0
Moderate Credit Risk 52 1
Substantial Credit Risk 705 4
Very High Credit Risk 3 61
Very Low Risk 0 0
Very Low Risk class.error
High Credit Risk 1 0.181451613
In or Near Default 0 0.006097561
Low Credit Risk 59 0.211213518
Lowest Risk 15 0.391304348
Moderate Credit Risk 11 0.175716901
Substantial Credit Risk 0 0.199772985
Very High Credit Risk 0 0.336956522
Very Low Risk 217 0.323987539
Check the predictor contribution to the target variable
# model_rf$finalModel is NULL because model_rf was fitted with randomForest()
# directly rather than through caret::train()
model_rf$finalModel
NULL
varImp(model_rf)
Overall
Current.Ratio 292.1237
LTDC 334.5292
DER 288.9476
Gross.Margin 241.8301
Operating.Margin 178.1447
EBIT.Margin 168.2291
EBITDA.Margin 219.7983
PTPM 218.6801
NPM 196.7420
Asset.Turnover 217.5991
ROE 203.7833
ROTE 203.0130
ROA 204.7714
ROI 222.5107
OCFPS 161.5081
FCFPS 156.3341
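If a graphical view is preferred, randomForest provides a built-in importance plot; a one-line sketch:
# sketch: plot the same importance scores instead of printing them
varImpPlot(model_rf, main = "Variable importance (model_rf)")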
Model Random Forest - Evaluation
pred_rf <- predict(object = model_rf, newdata = test)
confusionMatrix(data = pred_rf,          # predicted grades
                reference = test$Grade,  # actual grades
                positive = "Lowest Risk")
Confusion Matrix and Statistics
Reference
Prediction High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 29 0 10
In or Near Default 1 1 7
Low Credit Risk 4 0 107
Lowest Risk 0 0 32
Moderate Credit Risk 18 1 65
Substantial Credit Risk 27 0 3
Very High Credit Risk 4 0 0
Very Low Risk 3 0 18
Reference
Prediction Lowest Risk Moderate Credit Risk
High Credit Risk 0 23
In or Near Default 1 10
Low Credit Risk 4 82
Lowest Risk 2 15
Moderate Credit Risk 0 133
Substantial Credit Risk 0 31
Very High Credit Risk 0 0
Very Low Risk 3 17
Reference
Prediction Substantial Credit Risk Very High Credit Risk
High Credit Risk 34 7
In or Near Default 6 1
Low Credit Risk 20 2
Lowest Risk 1 1
Moderate Credit Risk 43 3
Substantial Credit Risk 45 0
Very High Credit Risk 2 2
Very Low Risk 7 1
Reference
Prediction Very Low Risk
High Credit Risk 1
In or Near Default 0
Low Credit Risk 20
Lowest Risk 17
Moderate Credit Risk 6
Substantial Credit Risk 3
Very High Credit Risk 0
Very Low Risk 9
Overall Statistics
Accuracy : 0.3719
95% CI : (0.3399, 0.4047)
No Information Rate : 0.3526
P-Value [Acc > NIR] : 0.1227
Kappa : 0.1939
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Credit Risk Class: In or Near Default
Sensitivity 0.33721 0.500000
Specificity 0.90578 0.970455
Pos Pred Value 0.27885 0.037037
Neg Pred Value 0.92674 0.998830
Prevalence 0.09751 0.002268
Detection Rate 0.03288 0.001134
Detection Prevalence 0.11791 0.030612
Balanced Accuracy 0.62149 0.735227
Class: Low Credit Risk Class: Lowest Risk
Sensitivity 0.4421 0.200000
Specificity 0.7937 0.924312
Pos Pred Value 0.4477 0.029412
Neg Pred Value 0.7900 0.990172
Prevalence 0.2744 0.011338
Detection Rate 0.1213 0.002268
Detection Prevalence 0.2710 0.077098
Balanced Accuracy 0.6179 0.562156
Class: Moderate Credit Risk Class: Substantial Credit Risk
Sensitivity 0.4277 0.28481
Specificity 0.7618 0.91160
Pos Pred Value 0.4944 0.41284
Neg Pred Value 0.7096 0.85382
Prevalence 0.3526 0.17914
Detection Rate 0.1508 0.05102
Detection Prevalence 0.3050 0.12358
Balanced Accuracy 0.5947 0.59821
Class: Very High Credit Risk Class: Very Low Risk
Sensitivity 0.117647 0.16071
Specificity 0.993064 0.94068
Pos Pred Value 0.250000 0.15517
Neg Pred Value 0.982838 0.94296
Prevalence 0.019274 0.06349
Detection Rate 0.002268 0.01020
Detection Prevalence 0.009070 0.06576
Balanced Accuracy 0.555355 0.55070
The variable importance scores reported earlier show how much each predictor influences the model; the top three predictors are LTDC, DER, and Current.Ratio.
Support Vector Machine
model_svm <- svm(Grade ~ ., data = train_smote)
model_svm
Call:
svm(formula = Grade ~ ., data = train_smote)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 4417
pred_svm <- predict(object = model_svm, newdata = test)
confusionMatrix(data = pred_svm,        # predicted grades
                reference = test$Grade) # actual grades
Confusion Matrix and Statistics
Reference
Prediction High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 32 0 0
In or Near Default 0 1 0
Low Credit Risk 1 0 167
Lowest Risk 0 0 0
Moderate Credit Risk 27 0 64
Substantial Credit Risk 24 0 4
Very High Credit Risk 2 1 0
Very Low Risk 0 0 7
Reference
Prediction Lowest Risk Moderate Credit Risk
High Credit Risk 0 6
In or Near Default 0 2
Low Credit Risk 7 94
Lowest Risk 0 0
Moderate Credit Risk 2 183
Substantial Credit Risk 0 23
Very High Credit Risk 0 0
Very Low Risk 1 3
Reference
Prediction Substantial Credit Risk Very High Credit Risk
High Credit Risk 23 8
In or Near Default 0 1
Low Credit Risk 7 2
Lowest Risk 0 0
Moderate Credit Risk 66 3
Substantial Credit Risk 61 0
Very High Credit Risk 0 3
Very Low Risk 1 0
Reference
Prediction Very Low Risk
High Credit Risk 1
In or Near Default 0
Low Credit Risk 39
Lowest Risk 0
Moderate Credit Risk 11
Substantial Credit Risk 1
Very High Credit Risk 0
Very Low Risk 4
Overall Statistics
Accuracy : 0.5113
95% CI : (0.4778, 0.5448)
No Information Rate : 0.3526
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3279
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Credit Risk Class: In or Near Default
Sensitivity 0.37209 0.500000
Specificity 0.95226 0.996591
Pos Pred Value 0.45714 0.250000
Neg Pred Value 0.93350 0.998861
Prevalence 0.09751 0.002268
Detection Rate 0.03628 0.001134
Detection Prevalence 0.07937 0.004535
Balanced Accuracy 0.66218 0.748295
Class: Low Credit Risk Class: Lowest Risk
Sensitivity 0.6901 0.00000
Specificity 0.7656 1.00000
Pos Pred Value 0.5268 NaN
Neg Pred Value 0.8673 0.98866
Prevalence 0.2744 0.01134
Detection Rate 0.1893 0.00000
Detection Prevalence 0.3594 0.00000
Balanced Accuracy 0.7279 0.50000
Class: Moderate Credit Risk Class: Substantial Credit Risk
Sensitivity 0.5884 0.38608
Specificity 0.6970 0.92818
Pos Pred Value 0.5140 0.53982
Neg Pred Value 0.7567 0.87386
Prevalence 0.3526 0.17914
Detection Rate 0.2075 0.06916
Detection Prevalence 0.4036 0.12812
Balanced Accuracy 0.6427 0.65713
Class: Very High Credit Risk Class: Very Low Risk
Sensitivity 0.176471 0.071429
Specificity 0.996532 0.985472
Pos Pred Value 0.500000 0.250000
Neg Pred Value 0.984018 0.939954
Prevalence 0.019274 0.063492
Detection Rate 0.003401 0.004535
Detection Prevalence 0.006803 0.018141
Balanced Accuracy 0.586501 0.528450
Model Evaluation
The Support Vector Machine has the highest accuracy of all the models we have evaluated on the test set. It also keeps a reasonable balance between sensitivity and specificity for most classes, except Lowest Risk, Very Low Risk, and Very High Credit Risk, which were heavily under-represented from the start.
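One possible next step, not run here, would be to tune the radial kernel's cost and gamma with e1071's built-in grid search; a hedged sketch:
# sketch only: grid-search the SVM hyperparameters with cross-validation
set.seed(123)
svm_tune <- tune.svm(Grade ~ ., data = train_smote,
                     gamma = 2^(-3:1), cost = 2^(0:4))
summary(svm_tune)  # best gamma/cost combination by cross-validated error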
Conclusion
In this research project, we examined companies carrying a range of credit ratings and developed models that relate their financial ratios to the assigned rating grade. Using the best-performing model as a framework, new observations can be scored to predict their rating grade.
Throughout this project, we have employed a Support Vector Machine model. Compared to the multinomial logistic regression baseline, it describes the data better, and despite being more complicated, it remains a model that can be understood. The prediction model model_svm obtained an accuracy of about 51% on the test set.