MSCI 3230 Test - Credit

Filip Dragicevic

2022-04-06

> setwd("C:/Users/filip/OneDrive/Desktop/MSCI 3230")
> credit <- read.table("C:/Users/filip/OneDrive/Desktop/MSCI 3230/credit.csv",
+    header=TRUE, stringsAsFactors=TRUE, sep=",", na.strings="NA", dec=".", 
+   strip.white=TRUE)
> summary(credit)
     Score           Years           Sales           Networth   
 Min.   : 66.0   Min.   : 2.00   Min.   :  45.0   Min.   :  14  
 1st Qu.: 95.0   1st Qu.: 6.00   1st Qu.: 175.0   1st Qu.: 220  
 Median :105.0   Median : 8.00   Median : 640.0   Median : 800  
 Mean   :111.3   Mean   :11.24   Mean   : 710.7   Mean   :1079  
 3rd Qu.:126.0   3rd Qu.:15.00   3rd Qu.: 850.0   3rd Qu.:1600  
 Max.   :165.0   Max.   :28.00   Max.   :3200.0   Max.   :5600  
     Profit        FLeverage       Industry   RiskLevel 
 Min.   :  6.0   Min.   :0.150   IT    :18   HIGH  : 9  
 1st Qu.: 46.0   1st Qu.:0.840   Non-IT:19   LOW   :11  
 Median :110.0   Median :1.240               MEDIUM:17  
 Mean   :180.5   Mean   :1.376                          
 3rd Qu.:265.0   3rd Qu.:1.630                          
 Max.   :720.0   Max.   :5.710                          

Introduction

- The credit dataset comprises of 37 rows & 9 columns and is used to identify factors that influence the credit score of a company.

- It contains columns by the names of: Score, Years, Sales, Networth, Profit, FLeverage, Industry, and Risk Level.

- From the data summary, we can see the important values for the numerical columns, such as Score and Years, and the counts for the string columns, such as Industry and RiskLevel.

Preprocessing

- In this dataset, there are no missing values that need to be accounted for.

- As seen from the boxplot below, there is also only one major outlier, the FLeverage score of 5.71. However, since we can see that they are high risk, this value should be left as is.

> Boxplot( ~ FLeverage, data=credit, id=list(method="y"))

[1] "1"

Exploratory Analysis - Scatter Plots

- For all scatter plots below, ‘Score’ will be used as the dependent variable, as it is the outcome we want to predict using the other explanatory values.

> scatterplot(FLeverage~Score, regLine=FALSE, smooth=FALSE, boxplots=FALSE, 
+   data=credit)

- FLeverage does not have a very distinct pattern, as some of the highest credit scores have lower financial leverage.

- The line of best fit would be slightly negative, meaning there is a negative correlation.

- However the relationship is linear, meaning it can potentially be used for further linear regression analysis.

- There is also one outlier, the FLeverage value above 5.

> scatterplot(Networth~Score, regLine=FALSE, smooth=FALSE, boxplots=FALSE, 
+   data=credit)

- Networth has a positive correlation, as the line of best that would have a postive slope.

- This means that as Networth increases, so does the credit score

- The relationship is linear and can potentially be used for linear regression.

- Although most values are clustered below 2000, there are three cases of Networths being above 3000.

> scatterplot(Profit~Score, regLine=FALSE, smooth=FALSE, boxplots=FALSE, 
+   data=credit)

> scatterplot(Sales~Score, regLine=FALSE, smooth=FALSE, boxplots=FALSE, 
+   data=credit)

- Similar to Networth, Profit and Sales both also have very similar relationships to the credit score.

- As these values increases, so does the credit score, both having a very positive correlation with positive sloped lines of best fit.

- Both are also linear and can be used for linear regression, with only a few cases being much higher than others.

> scatterplot(Years~Score, regLine=FALSE, smooth=FALSE, boxplots=FALSE, 
+   data=credit)

- Years does not have a very distinct pattern. Although a positive line of best can be drawn, the values are much more scattered than other factors.

- This means that there may not be a very strong correlation between years and credit score.

Exploratory Analysis - Correlation Matrix

> cor(credit[,c("FLeverage","Networth","Profit","Sales","Score","Years")], 
+   use="complete")
            FLeverage  Networth      Profit     Sales      Score      Years
FLeverage  1.00000000 0.2015277 -0.03863209 0.1485221 -0.1122239 -0.1049561
Networth   0.20152766 1.0000000  0.82947923 0.9681539  0.7465679  0.3272236
Profit    -0.03863209 0.8294792  1.00000000 0.8741540  0.8097232  0.3916367
Sales      0.14852214 0.9681539  0.87415399 1.0000000  0.7643921  0.3445221
Score     -0.11222393 0.7465679  0.80972320 0.7643921  1.0000000  0.5053267
Years     -0.10495610 0.3272236  0.39163668 0.3445221  0.5053267  1.0000000

- A correlation matrix using all numerical values shows us which variables are positively correlated to Score.

- We can conclude that Profit, Sales, and Networth are all better indicators of credit score, as they have the highest scores (0.81, 0.76, 0.75).

- Years also has a slightly positive correlation, however is just above 0.5 and not a strong indicator.

- FLeverage is negatively correlated, meaning the opposite is true.

Exploratory Analysis - Test of Normality

- We can confirm the test of normality by using a shapiro-wilk analysis for Profit, Sales, and Networth.

- Since the p-value is less the 0.05 for all, we reject the null the null hypothesis that the distribution is not normal, confirming it is normally distributed and can be used for linear regression.

> normalityTest(~Profit, test="shapiro.test", data=credit)

    Shapiro-Wilk normality test

data:  Profit
W = 0.83663, p-value = 7.789e-05
> normalityTest(~Sales, test="shapiro.test", data=credit)

    Shapiro-Wilk normality test

data:  Sales
W = 0.79775, p-value = 1.175e-05
> normalityTest(~Networth, test="shapiro.test", data=credit)

    Shapiro-Wilk normality test

data:  Networth
W = 0.76359, p-value = 2.611e-06

Predictive Analysis - Creating Dummy Variables

- When using dummy variables in linear regression, always choose 1 less than ?n? in your model and use the other as a benchmark

- For RiskLevel, create (n-1 = 3-1 = 2) dummy variables (IsHIGH, IsMEDIUM) and compare them to LOW.

> credit$IsHIGH <- with(credit, ifelse(RiskLevel=="HIGH",1,0))
> credit$IsMEDIUM <- with(credit, ifelse(RiskLevel=="MEDIUM",1,0))

- For Insdustry, create 1, NotIT, and compare to if it is IT industry.

> credit$NotIT <- with(credit, ifelse(Industry=="Non-IT",1,0))

Predictive Analysis - Linear Regression Models

> RegModel.2 <- lm(Score~IsHIGH+IsMEDIUM+NotIT, data=credit)
> summary(RegModel.2)

Call:
lm(formula = Score ~ IsHIGH + IsMEDIUM + NotIT, data = credit)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.781 -13.781   4.265  14.265  27.265 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  105.718      6.099  17.335  < 2e-16 ***
IsHIGH       -18.953      7.806  -2.428   0.0208 *  
IsMEDIUM     -13.548      6.712  -2.018   0.0518 .  
NotIT         32.016      5.981   5.353 6.51e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.09 on 33 degrees of freedom
Multiple R-squared:  0.5322,    Adjusted R-squared:  0.4897 
F-statistic: 12.52 on 3 and 33 DF,  p-value: 1.259e-05

- High risk is likely to have a worse credit score by about 19 compared to low risk.

- Medium risk is likely to have a worse credit score by about 13 compared to low risk.

- These values naturally make sense.

- Non-IT industries are likely to have a better credit score by about 32 compared to IT.

- P-Value is less than 0.05, however, Adjusted R-Squared is about 0.49, which is not very strong, meaning this model is not useful.

> RegModel.3 <- lm(Score~Networth+Profit+Sales, data=credit)
> summary(RegModel.3)

Call:
lm(formula = Score ~ Networth + Profit + Sales, data = credit)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.762 -10.573  -0.117   6.016  33.724 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 92.093819   3.633116  25.348  < 2e-16 ***
Networth     0.005322   0.007894   0.674  0.50489    
Profit       0.080132   0.026748   2.996  0.00516 ** 
Sales       -0.001368   0.016273  -0.084  0.93349    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.27 on 33 degrees of freedom
Multiple R-squared:  0.6737,    Adjusted R-squared:  0.6441 
F-statistic: 22.71 on 3 and 33 DF,  p-value: 3.676e-08

- When using all 3 of our strongest variables, we get an Adjusted R-Sqaured of 0.64 and a P-Value less than 0.05. This model can be used, but can also be stronger.

> RegModel.4 <- lm(Score~Profit+Sales, data=credit)
> summary(RegModel.4)

Call:
lm(formula = Score ~ Profit + Sales, data = credit)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.639  -9.712  -0.275   6.745  37.275 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 91.298544   3.408607  26.785  < 2e-16 ***
Profit       0.077635   0.026277   2.955  0.00565 ** 
Sales        0.008462   0.007166   1.181  0.24582    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.16 on 34 degrees of freedom
Multiple R-squared:  0.6692,    Adjusted R-squared:  0.6498 
F-statistic: 34.39 on 2 and 34 DF,  p-value: 6.795e-09

- When using Profit and Sales, we get an Adjusted R-Sqaured of 0.65 and a P-Value less than 0.05. This model can be used, but can also be stronger.

> RegModel.5 <- lm(Score~Profit, data=credit)
> summary(RegModel.5)

Call:
lm(formula = Score ~ Profit, data = credit)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.130  -9.950  -0.559   7.765  36.536 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 92.41643    3.29294  28.065  < 2e-16 ***
Profit       0.10476    0.01283   8.163 1.29e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.24 on 35 degrees of freedom
Multiple R-squared:  0.6557,    Adjusted R-squared:  0.6458 
F-statistic: 66.64 on 1 and 35 DF,  p-value: 1.288e-09

- When using just profit, the model also has an Adjusted R-Squared of 0.65, meaning no difference between this and the previous.

- Therefore, the predictors to get the best model are Profit and Sales.