MSCI 3230 Test - Credit
Filip Dragicevic
2022-04-06
> setwd("C:/Users/filip/OneDrive/Desktop/MSCI 3230")
> credit <- read.table("C:/Users/filip/OneDrive/Desktop/MSCI 3230/credit.csv",
+ header=TRUE, stringsAsFactors=TRUE, sep=",", na.strings="NA", dec=".",
+ strip.white=TRUE)
> summary(credit)
Score Years Sales Networth
Min. : 66.0 Min. : 2.00 Min. : 45.0 Min. : 14
1st Qu.: 95.0 1st Qu.: 6.00 1st Qu.: 175.0 1st Qu.: 220
Median :105.0 Median : 8.00 Median : 640.0 Median : 800
Mean :111.3 Mean :11.24 Mean : 710.7 Mean :1079
3rd Qu.:126.0 3rd Qu.:15.00 3rd Qu.: 850.0 3rd Qu.:1600
Max. :165.0 Max. :28.00 Max. :3200.0 Max. :5600
Profit FLeverage Industry RiskLevel
Min. : 6.0 Min. :0.150 IT :18 HIGH : 9
1st Qu.: 46.0 1st Qu.:0.840 Non-IT:19 LOW :11
Median :110.0 Median :1.240 MEDIUM:17
Mean :180.5 Mean :1.376
3rd Qu.:265.0 3rd Qu.:1.630
Max. :720.0 Max. :5.710
Introduction
- The credit dataset comprises of 37 rows & 9 columns and is used to identify factors that influence the credit score of a company.
- It contains columns by the names of: Score, Years, Sales, Networth, Profit, FLeverage, Industry, and Risk Level.
- From the data summary, we can see the important values for the numerical columns, such as Score and Years, and the counts for the string columns, such as Industry and RiskLevel.
Preprocessing
- In this dataset, there are no missing values that need to be accounted for.
- As seen from the boxplot below, there is also only one major outlier, the FLeverage score of 5.71. However, since we can see that they are high risk, this value should be left as is.
> Boxplot( ~ FLeverage, data=credit, id=list(method="y"))

[1] "1"
Exploratory Analysis - Scatter Plots
- For all scatter plots below, ‘Score’ will be used as the dependent variable, as it is the outcome we want to predict using the other explanatory values.
> scatterplot(FLeverage~Score, regLine=FALSE, smooth=FALSE, boxplots=FALSE,
+ data=credit)

- FLeverage does not have a very distinct pattern, as some of the highest credit scores have lower financial leverage.
- The line of best fit would be slightly negative, meaning there is a negative correlation.
- However the relationship is linear, meaning it can potentially be used for further linear regression analysis.
- There is also one outlier, the FLeverage value above 5.
> scatterplot(Networth~Score, regLine=FALSE, smooth=FALSE, boxplots=FALSE,
+ data=credit)

- Networth has a positive correlation, as the line of best that would have a postive slope.
- This means that as Networth increases, so does the credit score
- The relationship is linear and can potentially be used for linear regression.
- Although most values are clustered below 2000, there are three cases of Networths being above 3000.
> scatterplot(Profit~Score, regLine=FALSE, smooth=FALSE, boxplots=FALSE,
+ data=credit)

> scatterplot(Sales~Score, regLine=FALSE, smooth=FALSE, boxplots=FALSE,
+ data=credit)

- Similar to Networth, Profit and Sales both also have very similar relationships to the credit score.
- As these values increases, so does the credit score, both having a very positive correlation with positive sloped lines of best fit.
- Both are also linear and can be used for linear regression, with only a few cases being much higher than others.
> scatterplot(Years~Score, regLine=FALSE, smooth=FALSE, boxplots=FALSE,
+ data=credit)

- Years does not have a very distinct pattern. Although a positive line of best can be drawn, the values are much more scattered than other factors.
- This means that there may not be a very strong correlation between years and credit score.
Exploratory Analysis - Correlation Matrix
> cor(credit[,c("FLeverage","Networth","Profit","Sales","Score","Years")],
+ use="complete")
FLeverage Networth Profit Sales Score Years
FLeverage 1.00000000 0.2015277 -0.03863209 0.1485221 -0.1122239 -0.1049561
Networth 0.20152766 1.0000000 0.82947923 0.9681539 0.7465679 0.3272236
Profit -0.03863209 0.8294792 1.00000000 0.8741540 0.8097232 0.3916367
Sales 0.14852214 0.9681539 0.87415399 1.0000000 0.7643921 0.3445221
Score -0.11222393 0.7465679 0.80972320 0.7643921 1.0000000 0.5053267
Years -0.10495610 0.3272236 0.39163668 0.3445221 0.5053267 1.0000000
- A correlation matrix using all numerical values shows us which variables are positively correlated to Score.
- We can conclude that Profit, Sales, and Networth are all better indicators of credit score, as they have the highest scores (0.81, 0.76, 0.75).
- Years also has a slightly positive correlation, however is just above 0.5 and not a strong indicator.
- FLeverage is negatively correlated, meaning the opposite is true.
Exploratory Analysis - Test of Normality
- We can confirm the test of normality by using a shapiro-wilk analysis for Profit, Sales, and Networth.
- Since the p-value is less the 0.05 for all, we reject the null the null hypothesis that the distribution is not normal, confirming it is normally distributed and can be used for linear regression.
> normalityTest(~Profit, test="shapiro.test", data=credit)
Shapiro-Wilk normality test
data: Profit
W = 0.83663, p-value = 7.789e-05
> normalityTest(~Sales, test="shapiro.test", data=credit)
Shapiro-Wilk normality test
data: Sales
W = 0.79775, p-value = 1.175e-05
> normalityTest(~Networth, test="shapiro.test", data=credit)
Shapiro-Wilk normality test
data: Networth
W = 0.76359, p-value = 2.611e-06
Predictive Analysis - Creating Dummy Variables
- When using dummy variables in linear regression, always choose 1 less than ?n? in your model and use the other as a benchmark
- For RiskLevel, create (n-1 = 3-1 = 2) dummy variables (IsHIGH, IsMEDIUM) and compare them to LOW.
> credit$IsHIGH <- with(credit, ifelse(RiskLevel=="HIGH",1,0))
> credit$IsMEDIUM <- with(credit, ifelse(RiskLevel=="MEDIUM",1,0))
- For Insdustry, create 1, NotIT, and compare to if it is IT industry.
> credit$NotIT <- with(credit, ifelse(Industry=="Non-IT",1,0))
Predictive Analysis - Linear Regression Models
> RegModel.2 <- lm(Score~IsHIGH+IsMEDIUM+NotIT, data=credit)
> summary(RegModel.2)
Call:
lm(formula = Score ~ IsHIGH + IsMEDIUM + NotIT, data = credit)
Residuals:
Min 1Q Median 3Q Max
-26.781 -13.781 4.265 14.265 27.265
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 105.718 6.099 17.335 < 2e-16 ***
IsHIGH -18.953 7.806 -2.428 0.0208 *
IsMEDIUM -13.548 6.712 -2.018 0.0518 .
NotIT 32.016 5.981 5.353 6.51e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.09 on 33 degrees of freedom
Multiple R-squared: 0.5322, Adjusted R-squared: 0.4897
F-statistic: 12.52 on 3 and 33 DF, p-value: 1.259e-05
- High risk is likely to have a worse credit score by about 19 compared to low risk.
- Medium risk is likely to have a worse credit score by about 13 compared to low risk.
- These values naturally make sense.
- Non-IT industries are likely to have a better credit score by about 32 compared to IT.
- P-Value is less than 0.05, however, Adjusted R-Squared is about 0.49, which is not very strong, meaning this model is not useful.
> RegModel.3 <- lm(Score~Networth+Profit+Sales, data=credit)
> summary(RegModel.3)
Call:
lm(formula = Score ~ Networth + Profit + Sales, data = credit)
Residuals:
Min 1Q Median 3Q Max
-26.762 -10.573 -0.117 6.016 33.724
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.093819 3.633116 25.348 < 2e-16 ***
Networth 0.005322 0.007894 0.674 0.50489
Profit 0.080132 0.026748 2.996 0.00516 **
Sales -0.001368 0.016273 -0.084 0.93349
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.27 on 33 degrees of freedom
Multiple R-squared: 0.6737, Adjusted R-squared: 0.6441
F-statistic: 22.71 on 3 and 33 DF, p-value: 3.676e-08
- When using all 3 of our strongest variables, we get an Adjusted R-Sqaured of 0.64 and a P-Value less than 0.05. This model can be used, but can also be stronger.
> RegModel.4 <- lm(Score~Profit+Sales, data=credit)
> summary(RegModel.4)
Call:
lm(formula = Score ~ Profit + Sales, data = credit)
Residuals:
Min 1Q Median 3Q Max
-26.639 -9.712 -0.275 6.745 37.275
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 91.298544 3.408607 26.785 < 2e-16 ***
Profit 0.077635 0.026277 2.955 0.00565 **
Sales 0.008462 0.007166 1.181 0.24582
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.16 on 34 degrees of freedom
Multiple R-squared: 0.6692, Adjusted R-squared: 0.6498
F-statistic: 34.39 on 2 and 34 DF, p-value: 6.795e-09
- When using Profit and Sales, we get an Adjusted R-Sqaured of 0.65 and a P-Value less than 0.05. This model can be used, but can also be stronger.
> RegModel.5 <- lm(Score~Profit, data=credit)
> summary(RegModel.5)
Call:
lm(formula = Score ~ Profit, data = credit)
Residuals:
Min 1Q Median 3Q Max
-28.130 -9.950 -0.559 7.765 36.536
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.41643 3.29294 28.065 < 2e-16 ***
Profit 0.10476 0.01283 8.163 1.29e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.24 on 35 degrees of freedom
Multiple R-squared: 0.6557, Adjusted R-squared: 0.6458
F-statistic: 66.64 on 1 and 35 DF, p-value: 1.288e-09
- When using just profit, the model also has an Adjusted R-Squared of 0.65, meaning no difference between this and the previous.
- Therefore, the predictors to get the best model are Profit and Sales.