The objective of this analysis is finding the best way to predict the credit score of different companies. A sample of the data set can be found below. The given data set contains the following variables.
> knitr::kable(head(Credit))
| Score | Years | Sales | Networth | Profit | FLeverage | Industry | RiskLevel | IsHigh | IsMedium | IsAbvOrBelAvgScore |
|---|---|---|---|---|---|---|---|---|---|---|
| 119 | 14 | 1600 | 3400 | 240 | 5.71 | Non-IT | HIGH | 1 | 0 | above |
| 103 | 7 | 640 | 180 | 60 | 2.35 | IT | HIGH | 1 | 0 | below |
| 126 | 16 | 1120 | 1870 | 450 | 2.34 | Non-IT | HIGH | 1 | 0 | above |
| 92 | 6 | 360 | 700 | 110 | 2.15 | Non-IT | HIGH | 1 | 0 | below |
| 102 | 6 | 370 | 790 | 45 | 1.96 | IT | HIGH | 1 | 0 | below |
| 105 | 24 | 1400 | 1850 | 240 | 1.87 | Non-IT | HIGH | 1 | 0 | below |
The null hypothesis is that the data is normally distributed. In this case the pvalue is greater than 0.05 meaning that the null hypothesis is not rejected. From observing the QQPlot all the data is distributed within the 95% CI. In the box plot the median is slightly lower in the box and the bottom whisker is slightly shorter, this would indicate a small positive skew. The histogram also displays that price is slightly right skewed as tail is on the right. Given the normality test’s null hypotheis could not be rejected and that the skew is very slight it should not have an impact on the regression analysis.
> normalityTest(~Score, test="shapiro.test", data=Credit)
Shapiro-Wilk normality test
data: Score
W = 0.96191, p-value = 0.2321
> with(Credit, qqPlot(Score, dist="norm", id=list(method="y", n=2, labels=rownames(Credit))))
[1] 29 31
> Boxplot( ~ Score, data=Credit, id=list(method="y"))
> hist(Credit$Score, main="Histogram for Credit Score",
+ xlab="Score")
Lambda is 0 which would indicate a log transformation. This log transformation was used just to see if it would change the normality test drastically. Since the initial pvalue in the normality test was suitable for analysis no transformation needed to be completed. However, to conduct a full in depth analysis powerTransform and log transformation were used.
> TF <- (powerTransform(Score ~ 1, data=Credit, family="bcPower"))
> summary(TF)
bcPower Transformation to Normality
Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1 -0.0249 1 -1.3568 1.307
Likelihood ratio test that transformation parameter is equal to 0
(log transformation)
LRT df pval
LR test, lambda = (0) 0.00134556 1 0.97074
Likelihood ratio test that no transformation is needed
LRT df pval
LR test, lambda = (1) 2.201093 1 0.13791
> lamb <- TF$lambda
> XCredit <- Credit$Score^TF$lamb
> hist(XCredit)
> shapiro.test(XCredit)
Shapiro-Wilk normality test
data: XCredit
W = 0.98248, p-value = 0.8147
The summary statistics can be used to further identify distribution. Since the mean price is greater than the median price, price is positively skewed. \[mean = 111.32\] \[median = 105 \] \[111.32>105 = positiveskew\]
> numSummary(Credit[,c("FLeverage", "Networth", "Profit", "Sales", "Score", "Years"), drop=FALSE], statistics=c("mean", "sd", "IQR", "quantiles"), quantiles=c(0,.25,.5,.75,1))
mean sd IQR 0% 25% 50% 75% 100%
FLeverage 1.376486 0.9019707 0.79 0.15 0.84 1.24 1.63 5.71
Networth 1078.567568 1215.3490477 1380.00 14.00 220.00 800.00 1600.00 5600.00
Profit 180.486486 184.9114836 219.00 6.00 46.00 110.00 265.00 720.00
Sales 710.675676 678.0887788 675.00 45.00 175.00 640.00 850.00 3200.00
Score 111.324324 23.9235519 31.00 66.00 95.00 105.00 126.00 165.00
Years 11.243243 6.8451337 9.00 2.00 6.00 8.00 15.00 28.00
n
FLeverage 37
Networth 37
Profit 37
Sales 37
Score 37
Years 37
Correlation is shown through a correlation matrix and a correlation plot. All the bold values in the matrix are significant (great than 0,4 or less than -0.4). This was achieved using pander and emphasize.strong.cells functions. The variables with the strongest correlation with the response variable score are net worth, profit, sales and years. All of these have significant postive correlations with score. In the correlation plot blue means positive correlation and red means negative. The more narrow the oval the stronger the correlation. The corr plot again shows that score, sales, networth and profits relationship is very strong. Some other things to note is that there is information overlap between variables something like sales and profit have a lot of similar factors.
> library(pander)
> library(corrplot)
corrplot 0.92 loaded
> C <- cor(Credit[,c("FLeverage","Networth","Profit","Sales","Score","Years")], use="complete")
> emphasize.strong.cells(which(C > 0.4 & C != 1 | C < -0.4, arr.ind = TRUE))
> pander(C)
| Â | FLeverage | Networth | Profit | Sales | Score |
|---|---|---|---|---|---|
| FLeverage | 1 | 0.2015 | -0.03863 | 0.1485 | -0.1122 |
| Networth | 0.2015 | 1 | 0.8295 | 0.9682 | 0.7466 |
| Profit | -0.03863 | 0.8295 | 1 | 0.8742 | 0.8097 |
| Sales | 0.1485 | 0.9682 | 0.8742 | 1 | 0.7644 |
| Score | -0.1122 | 0.7466 | 0.8097 | 0.7644 | 1 |
| Years | -0.105 | 0.3272 | 0.3916 | 0.3445 | 0.5053 |
| Â | Years |
|---|---|
| FLeverage | -0.105 |
| Networth | 0.3272 |
| Profit | 0.3916 |
| Sales | 0.3445 |
| Score | 0.5053 |
| Years | 1 |
> corrplot(C, order="FPC", method = "ellipse")
Two contingency tables were created. The first compares the risk of the firm with the industry of the firm. It seems that non-IT firms are more commonly high risk than IT industry. Non-IT firms have an even spread across High, Medium and low, while IT firms are mostly medium risk with a few low risk and a couple high risk. The second table includes a computed column that ranks the credit score as above or below average. In this table it is evident that non-IT firms have far more companies above average credit scores while IT companies have a lot of firms below average credit score. This is intresting as non-IT firms ahve more high risk firms than IT industry.
> local({
+ .Table <- xtabs(~Industry+RiskLevel, data=Credit)
+ cat("\nFrequency table:\n")
+ print(.Table)
+ .Test <- chisq.test(.Table, correct=FALSE)
+ print(.Test)
+ })
Frequency table:
RiskLevel
Industry HIGH LOW MEDIUM
IT 2 5 11
Non-IT 7 6 6
Pearson's Chi-squared test
data: .Table
X-squared = 4.3154, df = 2, p-value = 0.1156
The second table includes a computed column that ranks the credit score as above or below average. In this table it is evident that non-IT firms have far more companies above average credit scores while IT companies have a lot of firms below average credit score. This is intresting as non-IT firms ahve more high risk firms than IT industry.
> local({
+ .Table <- xtabs(~IsAbvOrBelAvgScore+Industry, data=Credit)
+ cat("\nFrequency table:\n")
+ print(.Table)
+ .Test <- chisq.test(.Table, correct=FALSE)
+ print(.Test)
+ })
Frequency table:
Industry
IsAbvOrBelAvgScore IT Non-IT
above 2 13
below 16 6
Pearson's Chi-squared test
data: .Table
X-squared = 12.594, df = 1, p-value = 0.0003869
The regression model skews positively. This can be noticed by identifying the postive median value and the deviation of Q1 vs Q3 from the median. An adjusted R squared below 0.4 is seen as a low correlation and above 0.7 is seen as a very high correlation. From the Adjusted R squared in this model is can be concluded there is a strong correlation.
\[y = 0.057627(Profit)+0.720434(Years)+0.007413(Networth)-3.804567(FLeverage)-0.001386(Sales)+91.049792\]
\[max = 0.084439(Profit)+1.075721(Years)+0.014947(Networth)-1.070918(FLeverage)+0.013829(Sales)+97.146112\] \[min = 0.030815(Profit)+0.365147(Years)-0.000121(Networth)-6.538216(FLeverage)-0.016601(Sales)+84.9534742\]
\[y = 0.056469(Profit)+0.720375(Years)+0.006811(Networth)-3.805008(FLeverage)+90.924808 \]
> CreditReg <- lm(Score~FLeverage+Networth+Profit+Sales+Years, data=Credit)
> summary(CreditReg)
Call:
lm(formula = Score ~ FLeverage + Networth + Profit + Sales +
Years, data = Credit)
Residuals:
Min 1Q Median 3Q Max
-21.8301 -10.3265 0.4754 7.3336 28.2362
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 91.049792 6.096318 14.935 1.04e-15 ***
FLeverage -3.804567 2.733649 -1.392 0.1739
Networth 0.007413 0.007534 0.984 0.3327
Profit 0.057627 0.026812 2.149 0.0395 *
Sales -0.001386 0.015215 -0.091 0.9280
Years 0.720434 0.355287 2.028 0.0513 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.35 on 31 degrees of freedom
Multiple R-squared: 0.7321, Adjusted R-squared: 0.6888
F-statistic: 16.94 on 5 and 31 DF, p-value: 4.553e-08
> library(MASS, pos=22)
> stepwise(CreditReg, direction='forward', criterion='AIC')
Direction: forward
Criterion: AIC
Start: AIC=235.93
Score ~ 1
Df Sum of Sq RSS AIC
+ Profit 1 13509.1 7095.0 198.48
+ Sales 1 12038.9 8565.2 205.45
+ Networth 1 11484.0 9120.1 207.77
+ Years 1 5261.4 15342.7 227.02
<none> 20604.1 235.93
+ FLeverage 1 259.5 20344.6 237.46
Step: AIC=198.48
Score ~ Profit
Df Sum of Sq RSS AIC
+ Years 1 862.08 6232.9 195.69
<none> 7095.0 198.48
+ Networth 1 370.71 6724.3 198.50
+ Sales 1 279.56 6815.4 198.99
+ FLeverage 1 135.19 6959.8 199.77
Step: AIC=195.69
Score ~ Profit + Years
Df Sum of Sq RSS AIC
+ Networth 1 365.53 5867.4 195.45
<none> 6232.9 195.69
+ Sales 1 274.81 5958.1 196.02
+ FLeverage 1 77.46 6155.5 197.22
Step: AIC=195.45
Score ~ Profit + Years + Networth
Df Sum of Sq RSS AIC
+ FLeverage 1 345.04 5522.3 195.21
<none> 5867.4 195.45
+ Sales 1 1.56 5865.8 197.44
Step: AIC=195.21
Score ~ Profit + Years + Networth + FLeverage
Df Sum of Sq RSS AIC
<none> 5522.3 195.21
+ Sales 1 1.4774 5520.9 197.20
Call:
lm(formula = Score ~ Profit + Years + Networth + FLeverage, data = Credit)
Coefficients:
(Intercept) Profit Years Networth FLeverage
90.924808 0.056469 0.720375 0.006811 -3.805008