setwd("E:/YR4 SEM2/Data Science/Lab Test 3")
credit <- read.csv("credit.csv", header = TRUE)

Computing New Variables

Non-numerical variables are converted to numeric 0/1 indicator variables so that they can be used directly as predictors in the linear regression models.

credit$IsIndustry <- with(credit, ifelse(Industry=="IT",1,0))

This chunk creates a column called “IsIndustry”, which is 1 for every row whose industry is IT and 0 otherwise.
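As a quick check on an indicator like this, the new column can be cross-tabulated against the original one. The sketch below uses a small made-up data frame (the `toy` object and its values are hypothetical, not the credit data).

```r
# Hypothetical toy data standing in for credit.csv
toy <- data.frame(Industry = c("IT", "Non-IT", "IT", "Non-IT", "IT"))

# Same encoding as above: 1 for IT, 0 for everything else
toy$IsIndustry <- with(toy, ifelse(Industry == "IT", 1, 0))

# Each Industry level should map to exactly one indicator value
check <- table(toy$Industry, toy$IsIndustry)
print(check)
```

If a level shows counts in both the 0 and the 1 column, the comparison string probably does not match the data exactly (for example because of stray spaces or different capitalisation).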

credit$IsHigh <- with(credit, ifelse(RiskLevel=="HIGH",1,0))
credit$IsMedium <- with(credit, ifelse(RiskLevel=="MEDIUM",1,0))
credit$IsLow <- with(credit, ifelse(RiskLevel=="LOW",1,0))

These next three new variables add three columns to the dataset, each indicating whether an individual's credit falls into a given risk level. The three new columns are:

  1. IsHigh (1,0)(Y,N)
  2. IsMedium (1,0)(Y,N)
  3. IsLow (1,0)(Y,N)
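The same kind of 0/1 columns can also be generated in one step with `model.matrix()`. This is only a sketch on made-up data (the `toy` data frame is hypothetical), and the column names it produces differ from the `IsHigh`/`IsMedium`/`IsLow` names used in this report.

```r
# Hypothetical toy data standing in for the RiskLevel column
toy <- data.frame(RiskLevel = c("HIGH", "LOW", "MEDIUM", "LOW", "HIGH"))

# "- 1" drops the intercept so that every level gets its own 0/1 column
dummies <- model.matrix(~ RiskLevel - 1, data = toy)
print(dummies)

# Exactly one indicator is 1 in each row
rowSums(dummies)
```

Because exactly one indicator is 1 in every row, the three columns always sum to 1, which matters later when they are used in a regression with an intercept.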

Plots

Normality Tests

Next, we will check the normality of the Score variable using a histogram and a quantile comparison (Q-Q) plot.

# Base hist() plots frequencies by default and has no scale argument
with(credit, hist(Score, breaks="Sturges", 
  col="darkgray"))

The qqPlot function (from the car package) would not knit in my R Markdown file, so here is the code and an image of the result.

library(car)  # provides qqPlot
with(credit, qqPlot(Score, dist="norm", id=list(method="y", n=2, labels=rownames(credit))))

qqplot

Further, a Shapiro-Wilk normality test is conducted as well; the results are as follows.

The Shapiro-Wilk test would not knit either, so I will provide the code and results below.

library(RcmdrMisc)  # provides normalityTest
normalityTest(~Score, test="shapiro.test", data=credit)

The results of the Shapiro-Wilk test:

data: Score
W = 0.96191, p-value = 0.2321

The two plots and the Shapiro-Wilk test suggest that the data is fairly normally distributed. Visually, the histogram and Q-Q plot look roughly normal, although an interpretation of a slight left skew could be reasonable as well. The p-value of 0.2321 is greater than 0.05, so we fail to reject the null hypothesis that Score is normally distributed; the test gives no significant evidence of a departure from normality.
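The decision rule for the Shapiro-Wilk test can be written out explicitly. This is a minimal sketch: `normality_decision` is a hypothetical helper name, the 0.05 threshold is the conventional choice, and the second half runs base R's `shapiro.test()` on simulated values rather than on the credit data.

```r
# Hypothetical helper: turns a p-value into a decision about H0 (normality)
normality_decision <- function(p, alpha = 0.05) {
  if (p > alpha) "fail to reject normality" else "reject normality"
}

# Decision for the p-value reported above
normality_decision(0.2321)

# Same rule applied to simulated data (not the credit data)
set.seed(1)                           # arbitrary seed for reproducibility
x <- rnorm(37, mean = 100, sd = 15)   # made-up sample of 37 values
res <- shapiro.test(x)                # base R (stats package)
normality_decision(res$p.value)
```

Since `shapiro.test()` is part of base R, it may be a convenient fallback when the `normalityTest()` wrapper is not available while knitting.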

Contingency Table

A contingency table can be computed using the risk level and industry variables to compare data between the two categorical columns.

local({
  .Table <- xtabs(~Industry+RiskLevel, data=credit)
  cat("\nFrequency table:\n")
  print(.Table)
  .Test <- chisq.test(.Table, correct=FALSE)
  print(.Test)
})
## 
## Frequency table:
##         RiskLevel
## Industry HIGH LOW MEDIUM
##   IT        2   5     11
##   Non-IT    7   6      6
## 
##  Pearson's Chi-squared test
## 
## data:  .Table
## X-squared = 4.3154, df = 2, p-value = 0.1156

Since the p-value (0.1156) is greater than 0.05, we do not reject the null hypothesis of independence, meaning that there is no statistically significant evidence of a relationship between risk level and industry.
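The chi-squared statistic can be reproduced by hand from the counts printed in the frequency table. The sketch below re-enters those counts manually (the variable names are my own) and computes the expected counts, the statistic, and the p-value step by step.

```r
# Observed counts copied from the frequency table above
observed <- matrix(c(2, 5, 11,
                     7, 6, 6),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Industry = c("IT", "Non-IT"),
                                   RiskLevel = c("HIGH", "LOW", "MEDIUM")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic, degrees of freedom, and upper-tail p-value
x_squared <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

round(c(X.squared = x_squared, df = df, p.value = p_value), 4)
round(expected, 2)
```

Two of the expected counts (both in the HIGH column) come out below 5, so the chi-squared approximation is somewhat rough for a table this small; an exact test such as `fisher.test()` could be used as a cross-check.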

Linear Regression

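To avoid multicollinearity (the three risk-level indicators always sum to 1, so they cannot all be entered alongside the intercept), I will create 3 initial regression models, each including 1 of the following new variables: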
  1. IsHigh
  2. IsMedium
  3. IsLow
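The collinearity problem can be seen directly on a small example. This is a sketch on simulated data (the `toy` data frame and its Score values are made up): when all three indicators are entered together with an intercept, `lm()` cannot estimate the last one and reports it as `NA`.

```r
set.seed(42)  # arbitrary seed; all values below are simulated
risk <- rep(c("HIGH", "MEDIUM", "LOW"), each = 10)
toy <- data.frame(
  Score    = rnorm(30, mean = 100, sd = 10),  # made-up response
  IsHigh   = as.numeric(risk == "HIGH"),
  IsMedium = as.numeric(risk == "MEDIUM"),
  IsLow    = as.numeric(risk == "LOW")
)

# All three indicators plus the intercept: perfectly collinear
fit_all <- lm(Score ~ IsHigh + IsMedium + IsLow, data = toy)
coef(fit_all)   # IsLow is NA because it is aliased with the other terms

# Dropping one indicator removes the redundancy
fit_two <- lm(Score ~ IsHigh + IsMedium, data = toy)
coef(fit_two)
```

In `fit_two` the omitted level (LOW) acts as the baseline, so each remaining coefficient is a difference from that baseline.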

RegModel.1

  • This model includes “IsHigh”
RegModel.1 <- 
  lm(Score~FLeverage+IsHigh+IsIndustry+Networth+Profit+Sales+Years, 
  data=credit)
summary(RegModel.1)
## 
## Call:
## lm(formula = Score ~ FLeverage + IsHigh + IsIndustry + Networth + 
##     Profit + Sales + Years, data = credit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.1741  -8.7140   0.7981   8.1980  25.1072 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 98.508964   9.217589  10.687 1.43e-11 ***
## FLeverage   -4.901261   3.818160  -1.284    0.209    
## IsHigh       1.311580   7.984201   0.164    0.871    
## IsIndustry  -7.061991   7.418803  -0.952    0.349    
## Networth     0.006605   0.008407   0.786    0.438    
## Profit       0.039517   0.032093   1.231    0.228    
## Sales        0.001388   0.016745   0.083    0.934    
## Years        0.661317   0.368333   1.795    0.083 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.51 on 29 degrees of freedom
## Multiple R-squared:  0.7431, Adjusted R-squared:  0.6811 
## F-statistic: 11.98 on 7 and 29 DF,  p-value: 4.443e-07
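To make the coefficient table easier to read, the t value and p-value of a single row can be recomputed from its estimate and standard error. The sketch below uses the Years row of RegModel.1 and the 29 residual degrees of freedom, with the numbers typed in by hand from the output above.

```r
# Values copied from the Years row and the residual degrees of freedom above
estimate  <- 0.661317
std_error <- 0.368333
df_resid  <- 29

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

round(c(t = t_value, p = p_value), 3)
```

The p-value is above 0.05 but below 0.1, which is why the row is flagged with "." rather than "*".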

RegModel.2

  • This model includes “IsMedium”
RegModel.2 <- 
  lm(Score~FLeverage+IsIndustry+IsMedium+Networth+Profit+Sales+Years, 
  data=credit)
summary(RegModel.2)
## 
## Call:
## lm(formula = Score ~ FLeverage + IsIndustry + IsMedium + Networth + 
##     Profit + Sales + Years, data = credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.873  -7.948   1.161   9.009  24.170 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 99.974505   9.505537  10.518 2.08e-11 ***
## FLeverage   -4.520376   2.822140  -1.602   0.1200    
## IsIndustry  -6.959225   6.873718  -1.012   0.3197    
## IsMedium    -2.580373   4.757180  -0.542   0.5917    
## Networth     0.006491   0.007730   0.840   0.4079    
## Profit       0.040355   0.031893   1.265   0.2158    
## Sales        0.001006   0.015875   0.063   0.9499    
## Years        0.635446   0.364572   1.743   0.0919 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.45 on 29 degrees of freedom
## Multiple R-squared:  0.7455, Adjusted R-squared:  0.684 
## F-statistic: 12.13 on 7 and 29 DF,  p-value: 3.919e-07

RegModel.3

  • This model includes “IsLow”
RegModel.3 <- 
  lm(Score~FLeverage+IsIndustry+IsLow+Networth+Profit+Sales+Years, 
  data=credit)
summary(RegModel.3)
## 
## Call:
## lm(formula = Score ~ FLeverage + IsIndustry + IsLow + Networth + 
##     Profit + Sales + Years, data = credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.600  -8.715   0.476   9.129  24.482 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 97.091208   9.495720  10.225 3.98e-11 ***
## FLeverage   -3.329562   3.421079  -0.973    0.338    
## IsIndustry  -8.069348   6.840667  -1.180    0.248    
## IsLow        3.772968   6.354331   0.594    0.557    
## Networth     0.005121   0.007842   0.653    0.519    
## Profit       0.039511   0.031776   1.243    0.224    
## Sales        0.003071   0.015726   0.195    0.847    
## Years        0.603324   0.372287   1.621    0.116    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.43 on 29 degrees of freedom
## Multiple R-squared:  0.746,  Adjusted R-squared:  0.6846 
## F-statistic: 12.17 on 7 and 29 DF,  p-value: 3.813e-07

Interpretation of Regression Models

All three models have similar adjusted R-squared values: 0.6811, 0.684, and 0.6846. “IsHigh” and “IsLow” have positive coefficients, whereas “IsMedium” has a negative coefficient, although no predictor, including these three, is significant at the 0.05 level in any of the models. The variables carried forward to the next model are:

  • “Years”, which consistently has the lowest p-value (closest to 0.05)
  • “Profit”
  • “IsHigh”
  • “IsLow”
  • “Networth”
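The adjusted R-squared values can be checked against the multiple R-squared values in the three summaries. This sketch assumes n = 37 observations (29 residual degrees of freedom plus 7 predictors plus the intercept); `adj_r2` is a hypothetical helper name, and the inputs are the rounded figures from the output, so the results should agree with the reported values only to about three decimal places.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 37  # 29 residual df + 7 predictors + 1 intercept

# Multiple R-squared values copied from the three summaries above
r2 <- c(RegModel.1 = 0.7431, RegModel.2 = 0.7455, RegModel.3 = 0.7460)

adjusted <- adj_r2(r2, n = n, p = 7)
round(adjusted, 3)
```

Because all three models have the same number of predictors, the ranking by adjusted R-squared is the same as the ranking by plain R-squared; the adjustment only becomes informative when models of different sizes are compared.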

Let's create a new regression model.

RegModel.5 takes the observations above into account; here is the result:

RegModel.5 <- lm(Score~IsHigh+IsLow+Networth+Profit+Years, data=credit)
summary(RegModel.5)
## 
## Call:
## lm(formula = Score ~ IsHigh + IsLow + Networth + Profit + Years, 
##     data = credit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.0864  -8.9585  -0.9796   7.9085  29.9183 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 85.080391   4.699825  18.103  < 2e-16 ***
## IsHigh      -0.441658   5.684478  -0.078  0.93857    
## IsLow        5.818117   5.523965   1.053  0.30037    
## Networth     0.004786   0.003331   1.437  0.16078    
## Profit       0.065742   0.022388   2.936  0.00621 ** 
## Years        0.675448   0.366097   1.845  0.07461 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.46 on 31 degrees of freedom
## Multiple R-squared:  0.7274, Adjusted R-squared:  0.6834 
## F-statistic: 16.54 on 5 and 31 DF,  p-value: 5.896e-08

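This is the model that should be used to predict credit score: it uses fewer predictors than the earlier models while keeping a similar adjusted R-squared (0.6834), and Profit is significant at the 0.01 level.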
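Once a model has been chosen, predictions come from `predict()`. The sketch below is only an illustration on simulated data: the `toy` data frame, its coefficients, and the `new_firm` values are all made up, and the model uses just two predictors to stay short.

```r
set.seed(123)  # arbitrary seed; all data below are simulated
n <- 40
toy <- data.frame(
  Profit = runif(n, 0, 500),
  Years  = sample(1:30, n, replace = TRUE)
)
toy$Score <- 85 + 0.06 * toy$Profit + 0.7 * toy$Years + rnorm(n, sd = 10)

fit <- lm(Score ~ Profit + Years, data = toy)

# newdata must contain every predictor used in the formula
new_firm <- data.frame(Profit = 250, Years = 10)

# Point prediction with a 95% prediction interval
pred <- predict(fit, newdata = new_firm, interval = "prediction", level = 0.95)
print(pred)
```

For RegModel.5 the same call would need a `newdata` data frame with the columns IsHigh, IsLow, Networth, Profit, and Years.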