Credit Score Analysis MSCI 3230

Erin Dane

2022-04-06

Objectives

The objective of this analysis is finding the best way to predict the credit score of different companies. A sample of the data set can be found below. The given data set contains the following variables.

Variables

Fleverage: Numeric
Profit: Numeric
Networth: Numeric
Sales: Numeric
Years: Numeric
Score: Numeric
RiskLevel: 2 binary dummy variables
Industry: Binary
IsAbvOrBelAvgScore: Computed variable (for use in contingency table)

> knitr::kable(head(Credit))

Score	Years	Sales	Networth	Profit	FLeverage	Industry	RiskLevel	IsHigh	IsAbvOrBelAvgScore
119	14	1600	3400	240	5.71	Non-IT	HIGH	1	above
103	7	640	180	60	2.35	IT	HIGH	1	below
126	16	1120	1870	450	2.34	Non-IT	HIGH	1	above
92	6	360	700	110	2.15	Non-IT	HIGH	1	below
102	6	370	790	45	1.96	IT	HIGH	1	below
105	24	1400	1850	240	1.87	Non-IT	HIGH	1	below

Descriptive Analysis

Normality Test of Distribution

The null hypothesis is that the data is normally distributed. In this case the pvalue is greater than 0.05 meaning that the null hypothesis is not rejected. From observing the QQPlot all the data is distributed within the 95% CI. In the box plot the median is slightly lower in the box and the bottom whisker is slightly shorter, this would indicate a small positive skew. The histogram also displays that price is slightly right skewed as tail is on the right. Given the normality test’s null hypotheis could not be rejected and that the skew is very slight it should not have an impact on the regression analysis.

> normalityTest(~Score, test="shapiro.test", data=Credit)


    Shapiro-Wilk normality test

data:  Score
W = 0.96191, p-value = 0.2321

> with(Credit, qqPlot(Score, dist="norm", id=list(method="y", n=2, labels=rownames(Credit))))

[1] 29 31

> Boxplot( ~ Score, data=Credit, id=list(method="y"))

> hist(Credit$Score, main="Histogram for Credit Score", 
+      xlab="Score")

Transformation

Lambda is 0 which would indicate a log transformation. This log transformation was used just to see if it would change the normality test drastically. Since the initial pvalue in the normality test was suitable for analysis no transformation needed to be completed. However, to conduct a full in depth analysis powerTransform and log transformation were used.

> TF <- (powerTransform(Score ~ 1, data=Credit, family="bcPower"))
> summary(TF)

bcPower Transformation to Normality 
   Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1   -0.0249           1      -1.3568        1.307

Likelihood ratio test that transformation parameter is equal to 0
 (log transformation)
                             LRT df    pval
LR test, lambda = (0) 0.00134556  1 0.97074

Likelihood ratio test that no transformation is needed
                           LRT df    pval
LR test, lambda = (1) 2.201093  1 0.13791

> lamb <- TF$lambda 
> XCredit <- Credit$Score^TF$lamb
> hist(XCredit)

> shapiro.test(XCredit)


    Shapiro-Wilk normality test

data:  XCredit
W = 0.98248, p-value = 0.8147

Summary Statistics

The summary statistics can be used to further identify distribution. Since the mean price is greater than the median price, price is positively skewed. \[mean = 111.32\] \[median = 105 \] \[111.32>105 = positiveskew\]

> numSummary(Credit[,c("FLeverage", "Networth", "Profit", "Sales", "Score", "Years"), drop=FALSE], statistics=c("mean", "sd", "IQR", "quantiles"), quantiles=c(0,.25,.5,.75,1))

                 mean           sd     IQR    0%    25%    50%     75%    100%
FLeverage    1.376486    0.9019707    0.79  0.15   0.84   1.24    1.63    5.71
Networth  1078.567568 1215.3490477 1380.00 14.00 220.00 800.00 1600.00 5600.00
Profit     180.486486  184.9114836  219.00  6.00  46.00 110.00  265.00  720.00
Sales      710.675676  678.0887788  675.00 45.00 175.00 640.00  850.00 3200.00
Score      111.324324   23.9235519   31.00 66.00  95.00 105.00  126.00  165.00
Years       11.243243    6.8451337    9.00  2.00   6.00   8.00   15.00   28.00
           n
FLeverage 37
Networth  37
Profit    37
Sales     37
Score     37
Years     37

Correlation

Correlation is shown through a correlation matrix and a correlation plot. All the bold values in the matrix are significant (great than 0,4 or less than -0.4). This was achieved using pander and emphasize.strong.cells functions. The variables with the strongest correlation with the response variable score are net worth, profit, sales and years. All of these have significant postive correlations with score. In the correlation plot blue means positive correlation and red means negative. The more narrow the oval the stronger the correlation. The corr plot again shows that score, sales, networth and profits relationship is very strong. Some other things to note is that there is information overlap between variables something like sales and profit have a lot of similar factors.

> library(pander)
> library(corrplot)

corrplot 0.92 loaded

> C <- cor(Credit[,c("FLeverage","Networth","Profit","Sales","Score","Years")], use="complete")
> emphasize.strong.cells(which(C > 0.4 & C != 1 | C < -0.4, arr.ind = TRUE))
> pander(C)

Table continues below
	FLeverage	Networth	Profit	Sales	Score
FLeverage	1	0.2015	-0.03863	0.1485	-0.1122
Networth	0.2015	1	0.8295	0.9682	0.7466
Profit	-0.03863	0.8295	1	0.8742	0.8097
Sales	0.1485	0.9682	0.8742	1	0.7644
Score	-0.1122	0.7466	0.8097	0.7644	1
Years	-0.105	0.3272	0.3916	0.3445	0.5053

	Years
FLeverage	-0.105
Networth	0.3272
Profit	0.3916
Sales	0.3445
Score	0.5053
Years	1

> corrplot(C, order="FPC", method = "ellipse")

Contingency Table

Two contingency tables were created. The first compares the risk of the firm with the industry of the firm. It seems that non-IT firms are more commonly high risk than IT industry. Non-IT firms have an even spread across High, Medium and low, while IT firms are mostly medium risk with a few low risk and a couple high risk. The second table includes a computed column that ranks the credit score as above or below average. In this table it is evident that non-IT firms have far more companies above average credit scores while IT companies have a lot of firms below average credit score. This is intresting as non-IT firms ahve more high risk firms than IT industry.

> local({
+   .Table <- xtabs(~Industry+RiskLevel, data=Credit)
+   cat("\nFrequency table:\n")
+   print(.Table)
+   .Test <- chisq.test(.Table, correct=FALSE)
+   print(.Test)
+ })


Frequency table:
        RiskLevel
Industry HIGH LOW MEDIUM
  IT        2   5     11
  Non-IT    7   6      6

    Pearson's Chi-squared test

data:  .Table
X-squared = 4.3154, df = 2, p-value = 0.1156

The second table includes a computed column that ranks the credit score as above or below average. In this table it is evident that non-IT firms have far more companies above average credit scores while IT companies have a lot of firms below average credit score. This is intresting as non-IT firms ahve more high risk firms than IT industry.

> local({
+   .Table <- xtabs(~IsAbvOrBelAvgScore+Industry, data=Credit)
+   cat("\nFrequency table:\n")
+   print(.Table)
+   .Test <- chisq.test(.Table, correct=FALSE)
+   print(.Test)
+ })


Frequency table:
                  Industry
IsAbvOrBelAvgScore IT Non-IT
             above  2     13
             below 16      6

    Pearson's Chi-squared test

data:  .Table
X-squared = 12.594, df = 1, p-value = 0.0003869

Predictive Analysis

Fit Models

The regression model skews positively. This can be noticed by identifying the postive median value and the deviation of Q1 vs Q3 from the median. An adjusted R squared below 0.4 is seen as a low correlation and above 0.7 is seen as a very high correlation. From the Adjusted R squared in this model is can be concluded there is a strong correlation.

Formula before the stepwise regression is

\[y = 0.057627(Profit)+0.720434(Years)+0.007413(Networth)-3.804567(FLeverage)-0.001386(Sales)+91.049792\]

Taking into account standard error

\[max = 0.084439(Profit)+1.075721(Years)+0.014947(Networth)-1.070918(FLeverage)+0.013829(Sales)+97.146112\] \[min = 0.030815(Profit)+0.365147(Years)-0.000121(Networth)-6.538216(FLeverage)-0.016601(Sales)+84.9534742\]

Using the stepwise analysis the optimal equation is

\[y = 0.056469(Profit)+0.720375(Years)+0.006811(Networth)-3.805008(FLeverage)+90.924808 \]

Overall the best model was the one using profit, years, networth and FLeverage. So a perdicted credit score can be estimated by inputing the before mentioned variables into the equation formulated by the stepwise mutiple regression analysis.

> CreditReg <- lm(Score~FLeverage+Networth+Profit+Sales+Years, data=Credit)
> summary(CreditReg)


Call:
lm(formula = Score ~ FLeverage + Networth + Profit + Sales + 
    Years, data = Credit)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.8301 -10.3265   0.4754   7.3336  28.2362 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 91.049792   6.096318  14.935 1.04e-15 ***
FLeverage   -3.804567   2.733649  -1.392   0.1739    
Networth     0.007413   0.007534   0.984   0.3327    
Profit       0.057627   0.026812   2.149   0.0395 *  
Sales       -0.001386   0.015215  -0.091   0.9280    
Years        0.720434   0.355287   2.028   0.0513 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.35 on 31 degrees of freedom
Multiple R-squared:  0.7321,    Adjusted R-squared:  0.6888 
F-statistic: 16.94 on 5 and 31 DF,  p-value: 4.553e-08

> library(MASS, pos=22)
> stepwise(CreditReg, direction='forward', criterion='AIC')


Direction:  forward
Criterion:  AIC 

Start:  AIC=235.93
Score ~ 1

            Df Sum of Sq     RSS    AIC
+ Profit     1   13509.1  7095.0 198.48
+ Sales      1   12038.9  8565.2 205.45
+ Networth   1   11484.0  9120.1 207.77
+ Years      1    5261.4 15342.7 227.02
<none>                   20604.1 235.93
+ FLeverage  1     259.5 20344.6 237.46

Step:  AIC=198.48
Score ~ Profit

            Df Sum of Sq    RSS    AIC
+ Years      1    862.08 6232.9 195.69
<none>                   7095.0 198.48
+ Networth   1    370.71 6724.3 198.50
+ Sales      1    279.56 6815.4 198.99
+ FLeverage  1    135.19 6959.8 199.77

Step:  AIC=195.69
Score ~ Profit + Years

            Df Sum of Sq    RSS    AIC
+ Networth   1    365.53 5867.4 195.45
<none>                   6232.9 195.69
+ Sales      1    274.81 5958.1 196.02
+ FLeverage  1     77.46 6155.5 197.22

Step:  AIC=195.45
Score ~ Profit + Years + Networth

            Df Sum of Sq    RSS    AIC
+ FLeverage  1    345.04 5522.3 195.21
<none>                   5867.4 195.45
+ Sales      1      1.56 5865.8 197.44

Step:  AIC=195.21
Score ~ Profit + Years + Networth + FLeverage

        Df Sum of Sq    RSS    AIC
<none>               5522.3 195.21
+ Sales  1    1.4774 5520.9 197.20


Call:
lm(formula = Score ~ Profit + Years + Networth + FLeverage, data = Credit)

Coefficients:
(Intercept)       Profit        Years     Networth    FLeverage  
  90.924808     0.056469     0.720375     0.006811    -3.805008