Introduction
A corporate credit rating is an opinion from an independent agency about the likelihood that a corporation will fully meet its financial obligations (that is, the terms of its contracts) as they come due. A corporation's credit rating therefore signals its ability to pay its creditors. It is important to remember that credit ratings are opinions, not facts.
Corporate credit ratings, provided by specialized agencies, evaluate a company's creditworthiness and offer an essential financial cue to potential investors: they give investors a clearer picture of the risk attached to investing in the company's debt. Every business strives for a high credit rating in order to attract more investment and to borrow at lower interest rates.
Most credit rating agencies use their own discrete ordinal rating scale. The three most prominent agencies are Standard & Poor's (S&P), Moody's, and Fitch. The S&P scale uses the following grades: AAA, AA+, AA, AA-, A+, A, A-, BBB+, BBB, BBB-, BB+, BB, BB-, B+, B, B-, CCC+, CCC, CCC-, CC, C, D. There are 22 grades in total, with AAA being the most creditworthy and D the riskiest.
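Because the scale is ordinal, it can be encoded in R as an ordered factor. The short sketch below is illustrative only and simply lists the 22 S&P grades named above.
# illustrative sketch: encode the S&P scale as an ordered factor, best to worst
sp_grades <- c("AAA", "AA+", "AA", "AA-", "A+", "A", "A-",
               "BBB+", "BBB", "BBB-", "BB+", "BB", "BB-",
               "B+", "B", "B-", "CCC+", "CCC", "CCC-", "CC", "C", "D")
rating_example <- factor(c("A-", "BBB", "D"), levels = sp_grades, ordered = TRUE)
rating_example < "BBB-"  # TRUE TRUE FALSE: with AAA as the first level, better grades compare as "smaller"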
The data can be accessed at https://www.kaggle.com/datasets/kirtandelwadia/corporate-credit-rating-with-financial-ratios
For more information about the financial indicators, visit https://financialmodelingprep.com/market-indexes-major-markets. The additional (non-ratio) features are Name, Symbol (for trading), Rating Agency Name, Date, and Sector.
Business Problem
This project aims to predict the credit rating grade of companies, using a grade scale based on the S&P rating scale. Because the target is a discrete grade, this is a multiclass classification task, which is part of supervised learning. Credit rating grade is the target variable, while the financial ratios are the features.
Data Processing
Load Libraries
library(dplyr)
library(tidyr)
library(glue)
library(caret)
library(ggplot2)
library(lubridate)
library(corrplot)
library(tibble)
library(GGally)
library(plotly)
library(nnet)
library(randomForest)
library(stargazer)
library(DMwR)
library(party)
library(e1071)
library(kernlab)
library(scales)
library(class)
library(psych)
library(knitr)
library(reshape2)
library(pROC)
Read Data
# read data
data <- read.csv("data/ratio.csv")
A set of credit ratings issued by major agencies such as Standard & Poor's to large US firms (traded on the NYSE or Nasdaq) from 2010 to 2016. The loaded file contains 7,805 rating records with 25 columns, of which 16 are financial ratios. Financial indicators of this kind are commonly grouped into:
Liquidity Measurement Ratios: currentRatio, quickRatio, cashRatio, daysOfSalesOutstanding
Profitability Indicator Ratios: grossProfitMargin, operatingProfitMargin, pretaxProfitMargin, netProfitMargin, effectiveTaxRate, returnOnAssets, returnOnEquity, returnOnCapitalEmployed
Debt Ratios: debtRatio, debtEquityRatio
Operating Performance Ratios: assetTurnover
Cash Flow Indicator Ratios: operatingCashFlowPerShare, freeCashFlowPerShare, cashPerShare, operatingCashFlowSalesRatio, freeCashFlowOperatingCashFlowRatio
To make sure the data is fully prepared, we apply data transformations, scaling, outlier handling, and other statistical strategies where needed. It is best practice to preprocess the data before any analysis: it must be cleaned and transformed before it is used for analysis and modeling.
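As a minimal illustration (this is not the exact pipeline used below, which handles outliers and scaling by hand), the caret package loaded above can centre and scale all numeric predictors in a few lines:
# hedged sketch: centre and scale the numeric predictors with caret::preProcess
num_data <- data %>% select_if(is.numeric)
pp <- preProcess(num_data, method = c("center", "scale"))
num_scaled <- predict(pp, num_data)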
Pre-processing
# data structure
glimpse(data)
Rows: 7,805
Columns: 25
$ Rating.Agency <chr> "Standard & Poor's Ratings Services", "S…
$ Corporation <chr> "American States Water Co.", "Automatic …
$ Rating <chr> "A-", "AAA", "BBB-", "AA-", "A", "BBB+",…
$ Rating.Date <chr> "2010-07-30", "2010-09-16", "2010-11-23"…
$ CIK <int> 1056903, 8670, 8858, 1035201, 721371, 72…
$ Binary.Rating <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1…
$ SIC.Code <dbl> 4941, 7374, 5065, 4941, 5122, 5122, 3312…
$ Sector <chr> "Utils", "BusEq", "Shops", "Utils", "Sho…
$ Ticker <chr> "AWR", "ADP", "AVT", "CWT", "CAH", "CAH"…
$ Current.Ratio <dbl> 1.1507, 1.1129, 1.9276, 0.8358, 1.2931, …
$ Long.term.Debt...Capital <dbl> 0.4551, 0.0072, 0.2924, 0.4708, 0.2644, …
$ Debt.Equity.Ratio <dbl> 0.8847, 0.0073, 0.4255, 0.9491, 0.4036, …
$ Gross.Margin <dbl> 77.6230, 43.6619, 11.9008, 64.5096, 3.83…
$ Operating.Margin <dbl> 19.4839, 19.8327, 3.3173, 18.4549, 1.326…
$ EBIT.Margin <dbl> 19.4839, 19.8327, 3.3173, 18.4549, 1.326…
$ EBITDA.Margin <dbl> 28.9834, 23.9379, 3.6338, 27.9377, 1.584…
$ Pre.Tax.Profit.Margin <dbl> 13.6093, 20.8699, 3.0536, 15.1135, 1.230…
$ Net.Profit.Margin <dbl> 8.3224, 13.5690, 2.1418, 9.0246, 0.6518,…
$ Asset.Turnover <dbl> 0.3173, 0.3324, 2.4620, 0.2946, 4.9276, …
$ ROE...Return.On.Equity <dbl> 8.1724, 22.0354, 13.6376, 9.6412, 11.125…
$ Return.On.Tangible.Equity <dbl> 8.1978, 47.2858, 16.7991, 9.7015, 19.418…
$ ROA...Return.On.Assets <dbl> 2.6385, 4.4944, 5.2731, 2.6583, 2.9364, …
$ ROI...Return.On.Investment <dbl> 4.4530, 21.8765, 9.6494, 5.1018, 8.1844,…
$ Operating.Cash.Flow.Per.Share <dbl> 1.9957, 0.2501, -7.6079, 1.7438, 1.9725,…
$ Free.Cash.Flow.Per.Share <dbl> -0.1333, 0.3132, -7.3231, -0.8999, 2.417…
Map the credit rating into a grade scale based on the S&P rating grades, and remove unused columns.
# change columns format
data <- data %>% mutate(Rating = as.factor(Rating),
Sector = as.factor(Sector),
Rating.Agency = as.factor(Rating.Agency),
Rating.Date = ymd(Rating.Date),
Grade = case_when(Rating == 'AAA'~ "Lowest Risk",
Rating %in% c('AA+' , 'AA', 'AA-') ~ "Very Low Risk",
Rating %in% c("A+" , 'A' , "A-") ~ "Low Credit Risk",
Rating %in% c('BBB+' , 'BBB' , 'BBB-') ~ "Moderate Credit Risk",
Rating %in% c('BB+' , 'BB' , 'BB-') ~ "Substantial Credit Risk",
Rating %in% c('B+' , 'B' , 'B-') ~ "High Credit Risk",
Rating %in% c('CCC+', 'CCC' , 'CCC-') ~ "Very High Credit Risk",
Rating %in% c('CC' , 'C' , 'CC+', 'D') ~ "In or Near Default"),
Grade = as.factor(Grade)) %>% select(-c(CIK, Binary.Rating, SIC.Code)) %>% rename(FCFPS = Free.Cash.Flow.Per.Share,
OCFPS = Operating.Cash.Flow.Per.Share ,
ROI = ROI...Return.On.Investment,
ROA = ROA...Return.On.Assets,
ROTE = Return.On.Tangible.Equity,
ROE = ROE...Return.On.Equity,
LTDC = Long.term.Debt...Capital,
PTPM = Pre.Tax.Profit.Margin,
DER = Debt.Equity.Ratio,
NPM = Net.Profit.Margin)
#data$Rating <- ordered(data$Rating, levels = c("AAA", "AA+", "AA", "AA-", "A+" , 'A' , "A-", "BBB+", "BBB", "BBB-", "BB+", "BB", "BB-", 'B+' , 'B' , 'B-', "CCC+", "CCC", "CCC-", "CC+", "CC", "C", "D"))
#data$Grade <- ordered(data$Grade, levels = c("Lowest Risk", "Very Low Risk", "Low Credit Risk", "Moderate Credit Risk", "Substantial Credit Risk", "High Credit Risk" , "Very High Credit Risk", "In or Near Default", "In Default"))
Check missing data
# check missing value
colSums(is.na(data))
Rating.Agency Corporation Rating Rating.Date
0 0 0 0
Sector Ticker Current.Ratio LTDC
0 0 0 0
DER Gross.Margin Operating.Margin EBIT.Margin
0 0 0 0
EBITDA.Margin PTPM NPM Asset.Turnover
0 0 0 0
ROE ROTE ROA ROI
0 0 0 0
OCFPS FCFPS Grade
0 0 0
Remove duplicates and missing data
# remove duplicate rows
data <- unique(data)
# remove row containing NA value
data <- data %>% filter(complete.cases(.))
Check data distribution of each predictor
data %>%
select_if(is.numeric) %>%
boxplot(main = 'Distribution of Each Predictor', xlab = 'Predictor', ylab = 'Values')
Handling Outliers
Our data can be visually examined to identify whether any outliers are present. Outliers, as the name suggests, lie far outside the bulk of the data, and forcing the model to accommodate them can distort the fit and reduce its resolving power. The boxplots show that some variables, such as Return.On.Tangible.Equity (ROTE) and Debt.Equity.Ratio (DER), have noticeable outliers.
boxplot(data$DER)
boxplot(data$ROI)
boxplot(data$ROTE)
boxplot(data$ROE)
boxplot(data$FCFPS)
Create a data_outliers helper function that flags extreme outliers using the 1.5 * IQR rule
data_outliers <- function(x) {
Q1 <- quantile(x, probs=.25)
Q3 <- quantile(x, probs=.75)
iqr = Q3-Q1
upper_limit = Q3 + (iqr*1.5)
lower_limit = Q1 - (iqr*1.5)
x > upper_limit | x < lower_limit
}
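As a quick check on a made-up vector, only the value lying outside the 1.5 * IQR fences is flagged:
# toy example (illustrative only)
x <- c(1, 2, 2, 3, 3, 4, 50)
data_outliers(x)  # FALSE FALSE FALSE FALSE FALSE FALSE TRUE: only 50 lies outside the fences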
#outlier remover function
remove_outliers <- function(df, cols = names(df)) {
for (col in cols) {
df <- df[!data_outliers(df[[col]]),]
}
df
}
Remove the extreme outliers
data_no_outlier <- remove_outliers(data, c("ROTE", "ROE", "ROI", "NPM", "OCFPS", "FCFPS"))
#remove_outliers(data, c("Current.Ratio", "Long.term.Debt...Capital","Debt.Equity.Ratio", "Gross.Margin" ,"Operating.Margin", "EBIT.Margin", "EBITDA.Margin", "Pre.Tax.Profit.Margin","Net.Profit.Margin", "Asset.Turnover", "ROE...Return.On.Equity", "Return.On.Tangible.Equity", "ROA...Return.On.Assets", "ROI...Return.On.Investment", "Operating.Cash.Flow.Per.Share", "Free.Cash.Flow.Per.Share"))
Find remaining outliers with z-scores
To remove the remaining, less extreme outliers, we use the z-score of one of the numeric variables (Current.Ratio), since the target variable itself is categorical.
# Check the outlier and remove after scaling using zscore threshold point = 3
data_no_outlier <- data_no_outlier %>%
mutate(zscore = (Current.Ratio - mean(Current.Ratio)) / sd(Current.Ratio)) %>%
filter(zscore <=3) %>%
select(-c(zscore))
data_no_outlier %>%
group_by(Grade) %>%
summarise(freq = n())
# A tibble: 8 × 2
Grade freq
<fct> <int>
1 High Credit Risk 431
2 In or Near Default 14
3 Low Credit Risk 1210
4 Lowest Risk 53
5 Moderate Credit Risk 1557
6 Substantial Credit Risk 793
7 Very High Credit Risk 87
8 Very Low Risk 280
Outlier on Original Data Distribution
# check if the outliers has an influence to targeted variable
data %>%
mutate(Date = as.numeric(Rating.Date)) %>%
ggplot(aes(x = Rating.Date, y = Grade)) +
geom_point() +
geom_point(data = data_no_outlier, aes(x = Rating.Date, y = Grade), col = 'red') +
labs(
title = 'Distribution of Credit Rating : Original vs Outlier (Black)',
x = NULL,
y = 'Rating') +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
Distribution on Each Predictor
The distribution plot shows that Current.Ratio, Operating.Margin, ROA, ROE, ROTE, and EBIT.Margin all have fairly regular shapes, which suggests these variables can be used together without further transformation.
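That plot is not reproduced here; a minimal sketch of how such a per-predictor distribution plot could be drawn (assuming data_no_outlier and the tidyverse packages loaded earlier) is:
# sketch: histogram of every numeric predictor, each on its own scale
data_no_outlier %>%
  select_if(is.numeric) %>%
  pivot_longer(everything(), names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ predictor, scales = "free")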
Below we can find the main statistical moments for the numeric variables:
| Variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Current.Ratio | 1 | 4425 | 1.78 | 1.05 | 1.52 | 1.62 | 0.79 | 0.21 | 6.55 | 6.34 | 1.77 | 3.99 | 0.02 |
| LTDC | 2 | 4425 | 0.41 | 0.24 | 0.41 | 0.40 | 0.16 | -1.72 | 3.71 | 5.44 | 5.73 | 76.66 | 0.00 |
| DER | 3 | 4425 | 0.96 | 1.19 | 0.75 | 0.80 | 0.49 | -16.51 | 21.28 | 37.80 | 3.62 | 73.58 | 0.02 |
| Gross.Margin | 4 | 4425 | 41.02 | 22.20 | 38.47 | 39.94 | 25.90 | 2.07 | 100.00 | 97.93 | 0.42 | -0.43 | 0.33 |
| Operating.Margin | 5 | 4425 | 13.51 | 9.22 | 12.28 | 12.88 | 9.15 | -16.51 | 53.97 | 70.47 | 0.63 | 0.57 | 0.14 |
| EBIT.Margin | 6 | 4425 | 13.56 | 9.29 | 12.30 | 12.91 | 9.10 | -16.51 | 53.97 | 70.47 | 0.68 | 0.75 | 0.14 |
| EBITDA.Margin | 7 | 4425 | 21.27 | 12.97 | 18.66 | 20.24 | 12.60 | -8.36 | 88.35 | 96.71 | 0.85 | 0.88 | 0.19 |
| PTPM | 8 | 4425 | 10.85 | 8.79 | 9.76 | 10.29 | 8.37 | -23.27 | 42.49 | 65.77 | 0.52 | 0.59 | 0.13 |
| NPM | 9 | 4425 | 7.81 | 6.56 | 6.97 | 7.45 | 6.52 | -10.68 | 25.88 | 36.57 | 0.43 | 0.07 | 0.10 |
| Asset.Turnover | 10 | 4425 | 0.82 | 0.72 | 0.64 | 0.68 | 0.40 | 0.07 | 8.50 | 8.44 | 3.23 | 16.46 | 0.01 |
| ROE | 11 | 4425 | 12.42 | 8.70 | 11.84 | 12.34 | 7.69 | -13.06 | 37.90 | 50.96 | 0.08 | 0.20 | 0.13 |
| ROTE | 12 | 4425 | 10.02 | 26.35 | 11.66 | 10.50 | 19.19 | -69.74 | 85.58 | 155.32 | -0.21 | 0.77 | 0.40 |
| ROA | 13 | 4425 | 4.79 | 3.50 | 4.31 | 4.67 | 3.27 | -5.04 | 17.47 | 22.51 | 0.34 | 0.13 | 0.05 |
| ROI | 14 | 4425 | 7.38 | 5.36 | 6.74 | 7.23 | 5.00 | -6.82 | 21.97 | 28.79 | 0.27 | -0.11 | 0.08 |
| OCFPS | 15 | 4425 | 0.21 | 1.15 | 0.18 | 0.21 | 0.92 | -2.86 | 3.41 | 6.26 | -0.01 | 0.25 | 0.02 |
| FCFPS | 16 | 4425 | 0.05 | 1.20 | 0.07 | 0.05 | 1.03 | -2.98 | 3.10 | 6.08 | -0.03 | 0.06 | 0.02 |
library(zoo)
recap <- data_no_outlier %>%
mutate(year = year(data_no_outlier$Rating.Date),
year = as.factor(year),
grade = case_when(Rating == 'AAA'~ "Lowest Risk",
Rating %in% c('AA+', 'AA','AA-') ~ "Very Low Risk",
Rating %in% c('A+', 'A','A-') ~ "Low Credit Risk",
Rating %in% c('BBB+', 'BBB','BBB-') ~ "Moderate Credit Risk",
Rating %in% c('BB+', 'BB','BB-') ~ "Substantial Credit Risk",
Rating %in% c('B+', 'B','B-') ~ "High Credit Risk",
Rating %in% c('CCC+', 'CCC','CCC-') ~ "Very High Credit Risk",
Rating %in% c('CC', 'C', 'CC+', 'D') ~ "In or Near Default")) %>%
group_by(grade, year) %>%
summarise(count = n()) %>%
ggplot(aes(x = year, y = count)) +
geom_line(aes(group = grade, col = grade)) +
geom_point(aes(col = grade, group = grade)) +
labs(title = "Rating Recap",
x = NULL,
y = NULL) +
theme(axis.text.x = element_text(hjust = 1, angle = 45),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line(colour = "grey"))
ggplotly(recap) %>% plotly::layout(legend=list(x=0,xanchor='left',
yanchor='top',
orientation='h'))
library(glue)
plotcount <-data_no_outlier %>%
mutate(year = year(data_no_outlier$Rating.Date),
year = as.factor(year),
grade = case_when(Rating == 'AAA'~ "Lowest Risk",
Rating %in% c('AA+', 'AA','AA-') ~ "Very Low Risk",
Rating %in% c('A+', 'A','A-') ~ "Low Credit Risk",
Rating %in% c('BBB+', 'BBB','BBB-') ~ "Moderate Credit Risk",
Rating %in% c('BB+', 'BB','BB-') ~ "Substantial Credit Risk",
Rating %in% c('B+', 'B','B-') ~ "High Credit Risk",
Rating %in% c('CCC+', 'CCC','CCC-') ~ "Very High Credit Risk",
Rating %in% c('CC', 'C', 'CC+', "D") ~ "In or Near Default")) %>%
group_by(grade, year) %>%
summarise(freq = n()) %>%
mutate(label = glue("Total : {freq}
Grade : {grade}")) %>%
ggplot(aes(x=year, y = freq, fill = grade, text = label)) +
geom_col(position = "dodge") +
labs(title = "Credit Grade based on S&P",
x = NULL,
y = NULL,
fill = "Grade") +
theme(legend.title = element_blank(),
axis.text.x = element_text(hjust = 1),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line(colour = "grey"))
ggplotly(plotcount, tooltip = "text") %>% plotly::layout(legend=list(x=0,xanchor='left',
yanchor='top',
orientation='h'))
plotcount <-data_no_outlier %>%
mutate(year = year(data_no_outlier$Rating.Date),
year = as.factor(year),
grade = case_when(Rating == 'AAA'~ "Lowest Risk",
Rating %in% c('AA+', 'AA','AA-') ~ "Very Low Risk",
Rating %in% c('A+', 'A','A-') ~ "Low Credit Risk",
Rating %in% c('BBB+', 'BBB','BBB-') ~ "Moderate Credit Risk",
Rating %in% c('BB+', 'BB','BB-') ~ "Substantial Credit Risk",
Rating %in% c('B+', 'B','B-') ~ "High Credit Risk",
Rating %in% c('CCC+', 'CCC','CCC-') ~ "Very High Credit Risk",
Rating %in% c('CC', 'C', 'CC+', 'D') ~ "In or Near Default")) %>%
group_by(Rating.Agency, year) %>%
summarise(freq = n()) %>%
mutate(label = glue("Total : {freq}
Rating.Agency : {Rating.Agency}")) %>%
ggplot(aes(x=year, y = freq, fill = Rating.Agency, text = label)) +
geom_col(position = "dodge") +
  labs(title = "Rating Counts by Agency",
       x = NULL,
       y = NULL,
       fill = "Rating Agency") +
theme(legend.title = element_blank(),
axis.text.x = element_text(hjust = 1),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line(colour = "grey"))
ggplotly(plotcount, tooltip = "text") %>% plotly::layout(legend=list(x=0,xanchor='left',
yanchor='top',
orientation='h'))
data_no_outlier %>% select_if(is.numeric) %>% describe()
vars n mean sd median trimmed mad min max range
Current.Ratio 1 4425 1.78 1.05 1.52 1.62 0.79 0.21 6.55 6.34
LTDC 2 4425 0.41 0.24 0.41 0.40 0.16 -1.72 3.71 5.44
DER 3 4425 0.96 1.19 0.75 0.80 0.49 -16.51 21.28 37.80
Gross.Margin 4 4425 41.02 22.20 38.47 39.94 25.90 2.07 100.00 97.93
Operating.Margin 5 4425 13.51 9.22 12.28 12.88 9.15 -16.51 53.97 70.47
skew kurtosis se
Current.Ratio 1.77 3.99 0.02
LTDC 5.73 76.66 0.00
DER 3.62 73.58 0.02
Gross.Margin 0.42 -0.43 0.33
Operating.Margin 0.63 0.57 0.14
[ reached 'max' / getOption("max.print") -- omitted 11 rows ]
Correlation between features
We will check the correlation between all the numerical variables, selected with select_if(is.numeric). First, the correlation is inspected visually with a corrplot chart: most variables are not highly correlated, as there are few values close to dark blue or dark red.
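The corrplot chart itself is not shown here; a minimal sketch of how it can be produced from the cleaned data is:
# sketch: visualise the correlation matrix of the numeric predictors
data_no_outlier %>%
  select_if(is.numeric) %>%
  cor() %>%
  corrplot(method = "color", type = "upper", tl.cex = 0.7)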
data_no_outlier %>% select_if(is.numeric) %>% cor() %>% as.data.frame()
Current.Ratio LTDC DER Gross.Margin
Current.Ratio 1.0000000 -0.178515061 -0.15364205 -0.167916846
LTDC -0.1785151 1.000000000 0.39163198 -0.003271595
DER -0.1536421 0.391631981 1.00000000 0.034986713
Gross.Margin -0.1679168 -0.003271595 0.03498671 1.000000000
Operating.Margin EBIT.Margin EBITDA.Margin PTPM
Current.Ratio -0.088624193 -0.09044379 -0.17917273 0.003487013
LTDC 0.013978920 0.01262490 0.05588717 -0.142907279
DER 0.009805031 0.01098823 0.06963403 -0.150852342
Gross.Margin 0.586646423 0.58348155 0.67830637 0.497289641
NPM Asset.Turnover ROE ROTE ROA
Current.Ratio 0.0819301 0.0241215 -0.01415236 0.143857063 0.20619358
LTDC -0.1448751 -0.1151765 -0.04885294 -0.176340053 -0.29771236
DER -0.1435755 -0.1098951 -0.01506373 -0.128222423 -0.23817451
Gross.Margin 0.4494412 -0.5339575 0.02653752 -0.007552834 0.05511231
ROI OCFPS FCFPS
Current.Ratio 0.08396988 -0.06386495 -0.04264660
LTDC -0.32259601 -0.04368007 -0.05034773
DER -0.23991519 0.01861499 -0.02716629
Gross.Margin 0.01467265 0.04844019 0.06352412
[ reached 'max' / getOption("max.print") -- omitted 12 rows ]
To be sure, we also check the maximum and minimum correlations; for the maximum, the value 1 must be excluded because it corresponds to each variable correlated with itself:
Maximum correlation: 0.99692
Minimum correlation: -0.57921
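A small sketch of how these two values can be extracted, excluding the diagonal of self-correlations:
cor_mat <- data_no_outlier %>% select_if(is.numeric) %>% cor()
diag(cor_mat) <- NA          # drop the self-correlations, which are always 1
max(cor_mat, na.rm = TRUE)   # maximum correlation between two different variables
min(cor_mat, na.rm = TRUE)   # minimum correlation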
The stronger the correlation, i.e. the closer it is to 1 or -1, the more closely related the predictors are. In our dataset, Operating.Margin and EBIT.Margin have the highest positive correlation (about 0.997, so they essentially duplicate each other), while the strongest negative correlation is about -0.58.
Model Fitting
Data Splitting
We now split the data into train and validation sets. The training set is used to train the model, which is checked against the validation set.
remove unused columns
data_clean <- data_no_outlier %>% select(-c(Corporation, Rating.Agency, Rating.Date, Sector, Ticker, Rating))
set.seed(123)
index <-createDataPartition(data_clean$Grade, p = 0.8, list = FALSE)
train <- data_clean[index,]
test <- data_clean[-index,]
#BalTrain <- droplevels.data.frame(train)
#BalTest <- droplevels.data.frame(test)
Check the Data Split
table(test$Grade)
High Credit Risk In or Near Default Low Credit Risk
86 2 242
Lowest Risk Moderate Credit Risk Substantial Credit Risk
10 311 158
Very High Credit Risk Very Low Risk
17 56
table(train$Grade)
High Credit Risk In or Near Default Low Credit Risk
345 12 968
Lowest Risk Moderate Credit Risk Substantial Credit Risk
43 1246 635
Very High Credit Risk Very Low Risk
70 224
library(performanceEstimation)
# The class counts above are highly imbalanced (e.g. only 12 training rows are
# "In or Near Default"), so oversample the minority grades with SMOTE
set.seed(1234)
train_smote <- smote(Grade ~ ., train, perc.over = 40, perc.under = 10)
table(train_smote$Grade)
High Credit Risk In or Near Default Low Credit Risk
496 492 1302
Lowest Risk Moderate Credit Risk Substantial Credit Risk
69 1639 881
Very High Credit Risk Very Low Risk
92 321
# Save training set
write.csv(train_smote, file="train_smote.csv", row.names=FALSE)
Model Fitting
MLR - Multinomial Logistic Regression:
The first model to be applied is multinomial logistic regression, a classification method that generalizes logistic regression to multiclass problems; here the target has eight grade levels. The first step in applying any classification method is to train the model, i.e. to fit the algorithm to the training sample, in this case train_smote.
model_lr <- multinom(formula = Grade ~ ., data = train_smote)
# weights: 144 (119 variable)
initial value 13749.267474
iter 10 value 11002.318919
iter 20 value 9986.627702
iter 30 value 9604.165893
iter 40 value 9337.170956
iter 50 value 9216.755819
iter 60 value 9091.369768
iter 70 value 8848.188956
iter 80 value 8620.417284
iter 90 value 8481.465769
iter 100 value 8360.355987
final value 8360.355987
stopped after 100 iterations
saveRDS(model_lr, "model_lr.rds")
# Wald z-statistics and two-tailed p-values for the multinomial coefficients
zscore_lr <- summary(model_lr)$coefficients/summary(model_lr)$standard.errors
((1 - pnorm(abs(zscore_lr), 0, 1)) * 2) %>% as.data.frame()
(Intercept) Current.Ratio
In or Near Default 0.000000000000000000000 0.000000000
Low Credit Risk 0.000000000000009769963 0.000000000
Lowest Risk 0.622580089402264835741 0.005210372
Moderate Credit Risk 0.000000000000000000000 0.000000000
LTDC DER Gross.Margin
In or Near Default 0.00000000000007394085 0.122483130204410 0.000002840239
Low Credit Risk 0.00089274131094319031 0.000000005128242 0.000123048164
Lowest Risk 0.08467737756918447545 0.783427056096841 0.008853508845
Moderate Credit Risk 0.00000000001263167348 0.000000005090008 0.503642339906
Operating.Margin EBIT.Margin EBITDA.Margin
In or Near Default 0.02215274 0.1577790 0.000000000000
Low Credit Risk 0.55326972 0.6055659 0.000002393161
Lowest Risk 0.27434224 0.3131979 0.034093847215
Moderate Credit Risk 0.46510011 0.4348954 0.001543974953
PTPM NPM Asset.Turnover
In or Near Default 0.04765292871907 0.0011571379333 0.0000000000000002220446
Low Credit Risk 0.00000004119937 0.0000001364062 0.1688071111847531113881
Lowest Risk 0.00038817130160 0.5748345135941 0.0007336048805883788049
Moderate Credit Risk 0.00131090051759 0.0049422676723 0.0000002381285177932568
ROE ROTE ROA
In or Near Default 0.0000000000000 0.00000005200874 0.20898953067587778
Low Credit Risk 0.8795474889081 0.00579935247972 0.00001129185601001
Lowest Risk 0.0000138753251 0.00003314986346 0.00000000007071588
Moderate Credit Risk 0.0000001283812 0.02126714269237 0.00004177999758515
ROI OCFPS
In or Near Default 0.0000000000000000000000 0.00000012533
Low Credit Risk 0.0000012688581576547620 0.00007172481
Lowest Risk 0.0000000000000004440892 0.00189551614
Moderate Credit Risk 0.0530720660680894518180 0.00274928093
FCFPS
In or Near Default 0.000000000000567546
Low Credit Risk 0.000005093438031345
Lowest Risk 0.035244138286956828
Moderate Credit Risk 0.000181802596967762
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
# exponentiated coefficients: relative risk ratios against the reference grade
exp(coef(model_lr)) %>% as.data.frame()
(Intercept) Current.Ratio LTDC DER
In or Near Default 590.0562626 0.2245488 0.02027662 1.1161291
Low Credit Risk 12.5792129 0.4216018 0.17941020 0.5263733
Lowest Risk 0.6077025 0.6753355 0.03002144 0.9135379
Moderate Credit Risk 58.2972970 0.5875833 0.03220894 0.5465358
Gross.Margin Operating.Margin EBIT.Margin EBITDA.Margin
In or Near Default 1.0265805 6.4403690 0.3152593 0.5308937
Low Credit Risk 1.0139895 0.8234632 1.1849971 0.9489040
Lowest Risk 1.0211796 0.6899649 1.4060398 0.9448835
Moderate Credit Risk 0.9977182 0.7879500 1.2913378 0.9698804
PTPM NPM Asset.Turnover ROE ROTE
In or Near Default 1.076986 0.8872874 0.16979868 0.6298140 1.023172
Low Credit Risk 1.132438 1.1116516 0.86892745 0.9970949 1.007120
Lowest Risk 1.209161 1.0296362 0.09427143 0.7346197 1.025885
Moderate Credit Risk 1.065679 1.0536375 0.58494319 1.0924984 1.005703
ROA ROI OCFPS FCFPS
In or Near Default 0.8727437 2.111726 0.4976056 2.369864
Low Credit Risk 0.7144567 1.325395 0.7607762 1.341585
Lowest Risk 0.4705117 2.779813 0.5885482 1.397981
Moderate Credit Risk 0.7375788 1.111440 0.8242089 1.254240
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
# predict grades for the test set
pred_model_lr <- predict(object = model_lr, newdata = test)
confusionMatrix(pred_model_lr, test$Grade)
Confusion Matrix and Statistics
Reference
Prediction High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 18 0 1
In or Near Default 2 1 7
Low Credit Risk 4 0 96
Lowest Risk 2 0 1
Moderate Credit Risk 24 0 132
Substantial Credit Risk 36 1 3
Very High Credit Risk 0 0 0
Very Low Risk 0 0 2
Reference
Prediction Lowest Risk Moderate Credit Risk
High Credit Risk 0 1
In or Near Default 3 7
Low Credit Risk 3 57
Lowest Risk 0 1
Moderate Credit Risk 3 210
Substantial Credit Risk 0 29
Very High Credit Risk 0 1
Very Low Risk 1 5
Reference
Prediction Substantial Credit Risk Very High Credit Risk
High Credit Risk 12 2
In or Near Default 2 4
Low Credit Risk 14 2
Lowest Risk 0 0
Moderate Credit Risk 68 4
Substantial Credit Risk 61 4
Very High Credit Risk 0 0
Very Low Risk 1 1
Reference
Prediction Very Low Risk
High Credit Risk 1
In or Near Default 1
Low Credit Risk 31
Lowest Risk 0
Moderate Credit Risk 19
Substantial Credit Risk 1
Very High Credit Risk 0
Very Low Risk 3
Overall Statistics
Accuracy : 0.441
95% CI : (0.408, 0.4745)
No Information Rate : 0.3526
P-Value [Acc > NIR] : 0.00000003725
Kappa : 0.223
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Credit Risk Class: In or Near Default
Sensitivity 0.20930 0.500000
Specificity 0.97864 0.970455
Pos Pred Value 0.51429 0.037037
Neg Pred Value 0.91972 0.998830
Prevalence 0.09751 0.002268
Detection Rate 0.02041 0.001134
Detection Prevalence 0.03968 0.030612
Balanced Accuracy 0.59397 0.735227
Class: Low Credit Risk Class: Lowest Risk
Sensitivity 0.3967 0.000000
Specificity 0.8266 0.995413
Pos Pred Value 0.4638 0.000000
Neg Pred Value 0.7837 0.988610
Prevalence 0.2744 0.011338
Detection Rate 0.1088 0.000000
Detection Prevalence 0.2347 0.004535
Balanced Accuracy 0.6116 0.497706
Class: Moderate Credit Risk Class: Substantial Credit Risk
Sensitivity 0.6752 0.38608
Specificity 0.5622 0.89779
Pos Pred Value 0.4565 0.45185
Neg Pred Value 0.7607 0.87015
Prevalence 0.3526 0.17914
Detection Rate 0.2381 0.06916
Detection Prevalence 0.5215 0.15306
Balanced Accuracy 0.6187 0.64193
Class: Very High Credit Risk Class: Very Low Risk
Sensitivity 0.000000 0.053571
Specificity 0.998844 0.987893
Pos Pred Value 0.000000 0.230769
Neg Pred Value 0.980704 0.939010
Prevalence 0.019274 0.063492
Detection Rate 0.000000 0.003401
Detection Prevalence 0.001134 0.014739
Balanced Accuracy 0.499422 0.520732
K-Nearest Neighbors
# training predictors: all numeric columns (these will be scaled)
train_x <- train_smote %>% select_if(is.numeric)
# training target
train_y <- train_smote %>% select(Grade)
# test predictors
test_x <- test %>% select_if(is.numeric)
# test target
test_y <- test %>% select(Grade)
train_smote %>%
pivot_longer(cols = -c(Grade), names_to = 'predictor') %>%
group_by(predictor) %>%
summarize(value = max(value)) %>%
ggplot(aes(x = predictor, y = value)) +
geom_col(fill = 'pink') +
labs(
title = 'Data Range Before Scaling',
x = 'Variable',
y = 'Value') +
theme(legend.title = element_blank(),
axis.text.x = element_text(hjust = 1, angle = 45),
plot.title = element_text(face = "bold"),
panel.background = element_rect(fill = "#ffffff"),
axis.line.y = element_line(colour = "grey"),
axis.line.x = element_line(colour = "grey"))
# KNN is distance based, so scale the predictors; run this chunk only once
train_x <- scale(train_x)
test_x <- scale(test_x,
                center = attr(train_x, "scaled:center"), # training means
                scale = attr(train_x, "scaled:scale"))   # training standard deviations
# rule of thumb: choose k close to the square root of the number of training rows
sqrt(nrow(train_x))
[1] 81.31421
library(class)
model_knn <- knn(train = train_x,    # training predictors
                 test = test_x,      # test predictors
                 cl = train_y$Grade, # training labels
                 k = 81)             # number of neighbours used for classification
saveRDS(model_knn, "model_knn.rds")
confusionMatrix(data=model_knn, reference = test_y$Grade, positive="Lowest Risk")
Confusion Matrix and Statistics
Reference
Prediction High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 15 0 0
In or Near Default 4 1 15
Low Credit Risk 8 0 141
Lowest Risk 0 0 0
Moderate Credit Risk 28 1 78
Substantial Credit Risk 31 0 3
Very High Credit Risk 0 0 0
Very Low Risk 0 0 5
Reference
Prediction Lowest Risk Moderate Credit Risk
High Credit Risk 0 5
In or Near Default 2 17
Low Credit Risk 8 95
Lowest Risk 0 0
Moderate Credit Risk 0 171
Substantial Credit Risk 0 21
Very High Credit Risk 0 0
Very Low Risk 0 2
Reference
Prediction Substantial Credit Risk Very High Credit Risk
High Credit Risk 12 1
In or Near Default 4 4
Low Credit Risk 20 3
Lowest Risk 0 0
Moderate Credit Risk 70 4
Substantial Credit Risk 52 5
Very High Credit Risk 0 0
Very Low Risk 0 0
Reference
Prediction Very Low Risk
High Credit Risk 0
In or Near Default 3
Low Credit Risk 31
Lowest Risk 0
Moderate Credit Risk 15
Substantial Credit Risk 2
Very High Credit Risk 0
Very Low Risk 5
Overall Statistics
Accuracy : 0.4365
95% CI : (0.4035, 0.47)
No Information Rate : 0.3526
P-Value [Acc > NIR] : 0.0000001647
Kappa : 0.2284
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Credit Risk Class: In or Near Default
Sensitivity 0.17442 0.500000
Specificity 0.97739 0.944318
Pos Pred Value 0.45455 0.020000
Neg Pred Value 0.91637 0.998798
Prevalence 0.09751 0.002268
Detection Rate 0.01701 0.001134
Detection Prevalence 0.03741 0.056689
Balanced Accuracy 0.57590 0.722159
Class: Low Credit Risk Class: Lowest Risk
Sensitivity 0.5826 0.00000
Specificity 0.7422 1.00000
Pos Pred Value 0.4608 NaN
Neg Pred Value 0.8247 0.98866
Prevalence 0.2744 0.01134
Detection Rate 0.1599 0.00000
Detection Prevalence 0.3469 0.00000
Balanced Accuracy 0.6624 0.50000
Class: Moderate Credit Risk Class: Substantial Credit Risk
Sensitivity 0.5498 0.32911
Specificity 0.6567 0.91436
Pos Pred Value 0.4659 0.45614
Neg Pred Value 0.7282 0.86198
Prevalence 0.3526 0.17914
Detection Rate 0.1939 0.05896
Detection Prevalence 0.4161 0.12925
Balanced Accuracy 0.6033 0.62174
Class: Very High Credit Risk Class: Very Low Risk
Sensitivity 0.00000 0.089286
Specificity 1.00000 0.991525
Pos Pred Value NaN 0.416667
Neg Pred Value 0.98073 0.941379
Prevalence 0.01927 0.063492
Detection Rate 0.00000 0.005669
Detection Prevalence 0.00000 0.013605
Balanced Accuracy 0.50000 0.540406
Naive Bayes
library(e1071)
model_naive <- naiveBayes(formula = Grade ~ .,
                          data = train_smote,
                          laplace = 1) # Laplace smoothing avoids zero class-conditional probabilities
pred_naive <- predict(model_naive, newdata = test, type = "class")
confusionMatrix(data = pred_naive,      # predicted grades
                reference = test$Grade) # actual grades
Confusion Matrix and Statistics
Reference
Prediction High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 29 0 10
In or Near Default 1 1 7
Low Credit Risk 4 0 107
Lowest Risk 0 0 32
Moderate Credit Risk 18 1 65
Substantial Credit Risk 27 0 3
Very High Credit Risk 4 0 0
Very Low Risk 3 0 18
Reference
Prediction Lowest Risk Moderate Credit Risk
High Credit Risk 0 23
In or Near Default 1 10
Low Credit Risk 4 82
Lowest Risk 2 15
Moderate Credit Risk 0 133
Substantial Credit Risk 0 31
Very High Credit Risk 0 0
Very Low Risk 3 17
Reference
Prediction Substantial Credit Risk Very High Credit Risk
High Credit Risk 34 7
In or Near Default 6 1
Low Credit Risk 20 2
Lowest Risk 1 1
Moderate Credit Risk 43 3
Substantial Credit Risk 45 0
Very High Credit Risk 2 2
Very Low Risk 7 1
Reference
Prediction Very Low Risk
High Credit Risk 1
In or Near Default 0
Low Credit Risk 20
Lowest Risk 17
Moderate Credit Risk 6
Substantial Credit Risk 3
Very High Credit Risk 0
Very Low Risk 9
Overall Statistics
Accuracy : 0.3719
95% CI : (0.3399, 0.4047)
No Information Rate : 0.3526
P-Value [Acc > NIR] : 0.1227
Kappa : 0.1939
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Credit Risk Class: In or Near Default
Sensitivity 0.33721 0.500000
Specificity 0.90578 0.970455
Pos Pred Value 0.27885 0.037037
Neg Pred Value 0.92674 0.998830
Prevalence 0.09751 0.002268
Detection Rate 0.03288 0.001134
Detection Prevalence 0.11791 0.030612
Balanced Accuracy 0.62149 0.735227
Class: Low Credit Risk Class: Lowest Risk
Sensitivity 0.4421 0.200000
Specificity 0.7937 0.924312
Pos Pred Value 0.4477 0.029412
Neg Pred Value 0.7900 0.990172
Prevalence 0.2744 0.011338
Detection Rate 0.1213 0.002268
Detection Prevalence 0.2710 0.077098
Balanced Accuracy 0.6179 0.562156
Class: Moderate Credit Risk Class: Substantial Credit Risk
Sensitivity 0.4277 0.28481
Specificity 0.7618 0.91160
Pos Pred Value 0.4944 0.41284
Neg Pred Value 0.7096 0.85382
Prevalence 0.3526 0.17914
Detection Rate 0.1508 0.05102
Detection Prevalence 0.3050 0.12358
Balanced Accuracy 0.5947 0.59821
Class: Very High Credit Risk Class: Very Low Risk
Sensitivity 0.117647 0.16071
Specificity 0.993064 0.94068
Pos Pred Value 0.250000 0.15517
Neg Pred Value 0.982838 0.94296
Prevalence 0.019274 0.06349
Detection Rate 0.002268 0.01020
Detection Prevalence 0.009070 0.06576
Balanced Accuracy 0.555355 0.55070
Decision Tree
model_ctree <- ctree(formula = Grade ~ .,
                     data = train_smote,
                     control = ctree_control(mincriterion = 0.1, # relaxed split criterion
                                             minsplit = 10,
                                             minbucket = 10))
# predict classes on the test set
pred_ctree <- predict(object = model_ctree,
                      newdata = test,
                      type = "response")
# confusion matrix on the test set
confusionMatrix(pred_ctree, reference = test$Grade, positive = "Lowest Risk")
Confusion Matrix and Statistics
Reference
Prediction High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 41 0 1
In or Near Default 2 1 0
Low Credit Risk 3 0 147
Lowest Risk 0 0 3
Moderate Credit Risk 17 0 56
Substantial Credit Risk 17 0 8
Very High Credit Risk 4 1 1
Very Low Risk 2 0 26
Reference
Prediction Lowest Risk Moderate Credit Risk
High Credit Risk 0 12
In or Near Default 0 2
Low Credit Risk 2 86
Lowest Risk 1 1
Moderate Credit Risk 1 170
Substantial Credit Risk 0 24
Very High Credit Risk 0 2
Very Low Risk 6 14
Reference
Prediction Substantial Credit Risk Very High Credit Risk
High Credit Risk 32 6
In or Near Default 1 1
Low Credit Risk 18 0
Lowest Risk 0 0
Moderate Credit Risk 37 3
Substantial Credit Risk 67 1
Very High Credit Risk 0 3
Very Low Risk 3 3
Reference
Prediction Very Low Risk
High Credit Risk 0
In or Near Default 0
Low Credit Risk 28
Lowest Risk 2
Moderate Credit Risk 5
Substantial Credit Risk 1
Very High Credit Risk 0
Very Low Risk 20
Overall Statistics
Accuracy : 0.5102
95% CI : (0.4767, 0.5437)
No Information Rate : 0.3526
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.3524
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Credit Risk Class: In or Near Default
Sensitivity 0.47674 0.500000
Specificity 0.93593 0.993182
Pos Pred Value 0.44565 0.142857
Neg Pred Value 0.94304 0.998857
Prevalence 0.09751 0.002268
Detection Rate 0.04649 0.001134
Detection Prevalence 0.10431 0.007937
Balanced Accuracy 0.70634 0.746591
Class: Low Credit Risk Class: Lowest Risk
Sensitivity 0.6074 0.100000
Specificity 0.7859 0.993119
Pos Pred Value 0.5176 0.142857
Neg Pred Value 0.8411 0.989714
Prevalence 0.2744 0.011338
Detection Rate 0.1667 0.001134
Detection Prevalence 0.3220 0.007937
Balanced Accuracy 0.6967 0.546560
Class: Moderate Credit Risk Class: Substantial Credit Risk
Sensitivity 0.5466 0.42405
Specificity 0.7916 0.92956
Pos Pred Value 0.5882 0.56780
Neg Pred Value 0.7622 0.88089
Prevalence 0.3526 0.17914
Detection Rate 0.1927 0.07596
Detection Prevalence 0.3277 0.13379
Balanced Accuracy 0.6691 0.67680
Class: Very High Credit Risk Class: Very Low Risk
Sensitivity 0.176471 0.35714
Specificity 0.990751 0.93462
Pos Pred Value 0.272727 0.27027
Neg Pred Value 0.983927 0.95545
Prevalence 0.019274 0.06349
Detection Rate 0.003401 0.02268
Detection Prevalence 0.012472 0.08390
Balanced Accuracy 0.583611 0.64588
plot(model_ctree, type = "simple")
The data has more than 10 features, so the full decision tree plot is hard to read.
Model Evaluation
The decision tree has the highest accuracy of all the classification models built so far. It also keeps a reasonable balance between sensitivity and specificity for most classes, except Lowest Risk and Very High Credit Risk, which were heavily under-represented from the start.
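As a quick cross-check (a sketch, assuming the prediction objects created above are still in memory), the raw test-set accuracy of each model can be tabulated directly:
# sketch: raw test-set accuracy per model
sapply(list(multinom    = pred_model_lr,
            knn         = model_knn,
            naive_bayes = pred_naive,
            ctree       = pred_ctree),
       function(p) mean(as.character(p) == as.character(test$Grade)))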
Model Improvement
We do not transform the training data any further, because cleaning, outlier removal, and scaling (for the distance-based model) were already handled earlier, and we leave the target variable on its original scale, since a transformed target would have to be converted back before the test results could be interpreted.
Random Forest
Create a random forest model as model_rf
set.seed(123)
# fit a random forest with 500 trees; the OOB error reported below is an
# internal estimate of the generalization error
model_rf <- randomForest(x = train_smote %>% select(-Grade),
                         y = train_smote$Grade,
                         ntree = 500)
model_rf
Call:
randomForest(x = train_smote %>% select(-Grade), y = train_smote$Grade, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 18.78%
Confusion matrix:
High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 406 0 3
In or Near Default 0 489 0
Low Credit Risk 6 0 1027
Lowest Risk 0 0 12
Moderate Credit Risk 6 1 217
Substantial Credit Risk 64 1 12
Very High Credit Risk 22 4 0
Very Low Risk 3 0 68
Lowest Risk Moderate Credit Risk
High Credit Risk 0 12
In or Near Default 0 0
Low Credit Risk 3 196
Lowest Risk 42 0
Moderate Credit Risk 0 1351
Substantial Credit Risk 0 95
Very High Credit Risk 0 2
Very Low Risk 14 19
Substantial Credit Risk Very High Credit Risk
High Credit Risk 61 13
In or Near Default 1 2
Low Credit Risk 11 0
Lowest Risk 0 0
Moderate Credit Risk 52 1
Substantial Credit Risk 705 4
Very High Credit Risk 3 61
Very Low Risk 0 0
Very Low Risk class.error
High Credit Risk 1 0.181451613
In or Near Default 0 0.006097561
Low Credit Risk 59 0.211213518
Lowest Risk 15 0.391304348
Moderate Credit Risk 11 0.175716901
Substantial Credit Risk 0 0.199772985
Very High Credit Risk 0 0.336956522
Very Low Risk 217 0.323987539
Check the predictor contribution to the target variable
# model_rf$finalModel is NULL because model_rf was fitted with randomForest()
# directly rather than through caret::train()
model_rf$finalModel
NULL
varImp(model_rf)
Overall
Current.Ratio 292.1237
LTDC 334.5292
DER 288.9476
Gross.Margin 241.8301
Operating.Margin 178.1447
EBIT.Margin 168.2291
EBITDA.Margin 219.7983
PTPM 218.6801
NPM 196.7420
Asset.Turnover 217.5991
ROE 203.7833
ROTE 203.0130
ROA 204.7714
ROI 222.5107
OCFPS 161.5081
FCFPS 156.3341
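If a graphical view is preferred, randomForest provides a built-in importance plot; a one-line sketch:
# sketch: plot the same importance scores instead of printing them
varImpPlot(model_rf, main = "Variable importance (model_rf)")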
Model Random Forest - Evaluation
pred_rf <- predict(object = model_rf, newdata = test)
confusionMatrix(data = pred_rf,          # predicted grades
                reference = test$Grade,  # actual grades
                positive = "Lowest Risk")
Confusion Matrix and Statistics
Reference
Prediction High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 29 0 10
In or Near Default 1 1 7
Low Credit Risk 4 0 107
Lowest Risk 0 0 32
Moderate Credit Risk 18 1 65
Substantial Credit Risk 27 0 3
Very High Credit Risk 4 0 0
Very Low Risk 3 0 18
Reference
Prediction Lowest Risk Moderate Credit Risk
High Credit Risk 0 23
In or Near Default 1 10
Low Credit Risk 4 82
Lowest Risk 2 15
Moderate Credit Risk 0 133
Substantial Credit Risk 0 31
Very High Credit Risk 0 0
Very Low Risk 3 17
Reference
Prediction Substantial Credit Risk Very High Credit Risk
High Credit Risk 34 7
In or Near Default 6 1
Low Credit Risk 20 2
Lowest Risk 1 1
Moderate Credit Risk 43 3
Substantial Credit Risk 45 0
Very High Credit Risk 2 2
Very Low Risk 7 1
Reference
Prediction Very Low Risk
High Credit Risk 1
In or Near Default 0
Low Credit Risk 20
Lowest Risk 17
Moderate Credit Risk 6
Substantial Credit Risk 3
Very High Credit Risk 0
Very Low Risk 9
Overall Statistics
Accuracy : 0.3719
95% CI : (0.3399, 0.4047)
No Information Rate : 0.3526
P-Value [Acc > NIR] : 0.1227
Kappa : 0.1939
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Credit Risk Class: In or Near Default
Sensitivity 0.33721 0.500000
Specificity 0.90578 0.970455
Pos Pred Value 0.27885 0.037037
Neg Pred Value 0.92674 0.998830
Prevalence 0.09751 0.002268
Detection Rate 0.03288 0.001134
Detection Prevalence 0.11791 0.030612
Balanced Accuracy 0.62149 0.735227
Class: Low Credit Risk Class: Lowest Risk
Sensitivity 0.4421 0.200000
Specificity 0.7937 0.924312
Pos Pred Value 0.4477 0.029412
Neg Pred Value 0.7900 0.990172
Prevalence 0.2744 0.011338
Detection Rate 0.1213 0.002268
Detection Prevalence 0.2710 0.077098
Balanced Accuracy 0.6179 0.562156
Class: Moderate Credit Risk Class: Substantial Credit Risk
Sensitivity 0.4277 0.28481
Specificity 0.7618 0.91160
Pos Pred Value 0.4944 0.41284
Neg Pred Value 0.7096 0.85382
Prevalence 0.3526 0.17914
Detection Rate 0.1508 0.05102
Detection Prevalence 0.3050 0.12358
Balanced Accuracy 0.5947 0.59821
Class: Very High Credit Risk Class: Very Low Risk
Sensitivity 0.117647 0.16071
Specificity 0.993064 0.94068
Pos Pred Value 0.250000 0.15517
Neg Pred Value 0.982838 0.94296
Prevalence 0.019274 0.06349
Detection Rate 0.002268 0.01020
Detection Prevalence 0.009070 0.06576
Balanced Accuracy 0.555355 0.55070
The variable importance scores reported earlier show how much each predictor influences the model; the top three predictors are LTDC, DER, and Current.Ratio.
Support Vector Machine
model_svm <- svm(Grade ~ ., data = train_smote)
model_svm
Call:
svm(formula = Grade ~ ., data = train_smote)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 4417
pred_svm <- predict(object = model_svm, newdata = test)
confusionMatrix(data = pred_svm,        # predicted grades
                reference = test$Grade) # actual grades
Confusion Matrix and Statistics
Reference
Prediction High Credit Risk In or Near Default Low Credit Risk
High Credit Risk 32 0 0
In or Near Default 0 1 0
Low Credit Risk 1 0 167
Lowest Risk 0 0 0
Moderate Credit Risk 27 0 64
Substantial Credit Risk 24 0 4
Very High Credit Risk 2 1 0
Very Low Risk 0 0 7
Reference
Prediction Lowest Risk Moderate Credit Risk
High Credit Risk 0 6
In or Near Default 0 2
Low Credit Risk 7 94
Lowest Risk 0 0
Moderate Credit Risk 2 183
Substantial Credit Risk 0 23
Very High Credit Risk 0 0
Very Low Risk 1 3
Reference
Prediction Substantial Credit Risk Very High Credit Risk
High Credit Risk 23 8
In or Near Default 0 1
Low Credit Risk 7 2
Lowest Risk 0 0
Moderate Credit Risk 66 3
Substantial Credit Risk 61 0
Very High Credit Risk 0 3
Very Low Risk 1 0
Reference
Prediction Very Low Risk
High Credit Risk 1
In or Near Default 0
Low Credit Risk 39
Lowest Risk 0
Moderate Credit Risk 11
Substantial Credit Risk 1
Very High Credit Risk 0
Very Low Risk 4
Overall Statistics
Accuracy : 0.5113
95% CI : (0.4778, 0.5448)
No Information Rate : 0.3526
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3279
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Credit Risk Class: In or Near Default
Sensitivity 0.37209 0.500000
Specificity 0.95226 0.996591
Pos Pred Value 0.45714 0.250000
Neg Pred Value 0.93350 0.998861
Prevalence 0.09751 0.002268
Detection Rate 0.03628 0.001134
Detection Prevalence 0.07937 0.004535
Balanced Accuracy 0.66218 0.748295
Class: Low Credit Risk Class: Lowest Risk
Sensitivity 0.6901 0.00000
Specificity 0.7656 1.00000
Pos Pred Value 0.5268 NaN
Neg Pred Value 0.8673 0.98866
Prevalence 0.2744 0.01134
Detection Rate 0.1893 0.00000
Detection Prevalence 0.3594 0.00000
Balanced Accuracy 0.7279 0.50000
Class: Moderate Credit Risk Class: Substantial Credit Risk
Sensitivity 0.5884 0.38608
Specificity 0.6970 0.92818
Pos Pred Value 0.5140 0.53982
Neg Pred Value 0.7567 0.87386
Prevalence 0.3526 0.17914
Detection Rate 0.2075 0.06916
Detection Prevalence 0.4036 0.12812
Balanced Accuracy 0.6427 0.65713
Class: Very High Credit Risk Class: Very Low Risk
Sensitivity 0.176471 0.071429
Specificity 0.996532 0.985472
Pos Pred Value 0.500000 0.250000
Neg Pred Value 0.984018 0.939954
Prevalence 0.019274 0.063492
Detection Rate 0.003401 0.004535
Detection Prevalence 0.006803 0.018141
Balanced Accuracy 0.586501 0.528450
Model Evaluation
The Support Vector Machine has the highest accuracy of all the models we have evaluated on the test set. It also keeps a reasonable balance between sensitivity and specificity for most classes, except Lowest Risk, Very Low Risk, and Very High Credit Risk, which were heavily under-represented from the start.
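One possible next step, not run here, would be to tune the radial kernel's cost and gamma with e1071's built-in grid search; a hedged sketch:
# sketch only: grid-search the SVM hyperparameters with cross-validation
set.seed(123)
svm_tune <- tune.svm(Grade ~ ., data = train_smote,
                     gamma = 2^(-3:1), cost = 2^(0:4))
summary(svm_tune)  # best gamma/cost combination by cross-validated error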
Conclusion
In this research project, we examined companies carrying a range of credit ratings and developed models that relate their financial ratios to the assigned rating grade. Using the best-performing model as a framework, new observations can be scored to predict their rating grade.
Throughout this project, we have employed a Support Vector Machine model. Compared to the multinomial logistic regression baseline, it describes the data better, and despite being more complicated, it remains a model that can be understood. The prediction model model_svm obtained an accuracy of about 51% on the test set.