setwd("E:/YR4 SEM2/Data Science/Lab Test 3")
credit <- read.csv("credit.csv", header = TRUE)
credit$IsIndustry <- with(credit, ifelse(Industry=="IT",1,0))
This chunk creates a column called “IsIndustry”, which takes the value 1 for every record whose industry is IT and 0 otherwise.
credit$IsHigh <- with(credit, ifelse(RiskLevel=="HIGH",1,0))
credit$IsMedium <- with(credit, ifelse(RiskLevel=="MEDIUM",1,0))
credit$IsLow <- with(credit, ifelse(RiskLevel=="LOW",1,0))
These three lines add three indicator columns to the dataset, one for each risk level. For every record, exactly one of them is 1 and the other two are 0. The three new columns are “IsHigh”, “IsMedium”, and “IsLow”.
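The three `ifelse` calls above can also be written in a single step. The following is a minimal sketch on a made-up data frame (`toy` and its four rows are assumptions for illustration, not the credit data), using base R's `model.matrix` to build one 0/1 column per risk level.

```r
# Toy data standing in for credit$RiskLevel (values are illustrative only)
toy <- data.frame(RiskLevel = c("HIGH", "LOW", "MEDIUM", "LOW"))

# One indicator column per level; "- 1" drops the intercept so every level is kept
dummies <- model.matrix(~ RiskLevel - 1, data = toy)

# Levels are sorted alphabetically: HIGH, LOW, MEDIUM
colnames(dummies) <- c("IsHigh", "IsLow", "IsMedium")

toy <- cbind(toy, dummies)
print(toy)

# Every row belongs to exactly one risk level
stopifnot(all(rowSums(dummies) == 1))
```

Because the three indicators always sum to 1, at most two of them can enter a regression that also contains an intercept.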
library(car)        # qqPlot()
library(RcmdrMisc)  # Hist(), normalityTest()
with(credit, Hist(Score, scale="frequency", breaks="Sturges",
col="darkgray"))
with(credit, qqPlot(Score, dist="norm", id=list(method="y", n=2, labels=rownames(credit))))
[Figure: normal Q-Q plot of Score]
normalityTest(~Score, test="shapiro.test", data=credit)
The results of the Shapiro-Wilk test:
## data: Score
## W = 0.96191, p-value = 0.2321
These two exploratory plots show that the data is fairly normally distributed: both the histogram and the Q-Q plot look approximately normal, although reading a slight left skew into the histogram would also be reasonable. The Shapiro-Wilk p-value of 0.2321 is greater than 0.05, so we fail to reject the null hypothesis that Score is normally distributed. The test gives no evidence of skew.
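The decision rule used in the paragraph above can be sketched as a small helper. `normality_decision` is a hypothetical function name introduced here for illustration, and the simulated sample is made up; only the p-value 0.2321 comes from the output above.

```r
# Decision rule for a normality test at significance level alpha.
# H0: the sample comes from a normal distribution.
normality_decision <- function(p_value, alpha = 0.05) {
  if (p_value > alpha) "fail to reject normality" else "reject normality"
}

# p-value reported above for Score
print(normality_decision(0.2321))

# The same rule applied to a simulated sample (hypothetical data, not the credit file)
set.seed(1)
toy_scores <- rnorm(37, mean = 100, sd = 20)
res <- shapiro.test(toy_scores)   # base R Shapiro-Wilk test from the stats package
print(normality_decision(res$p.value))
```

Failing to reject the null does not prove normality. It only means that a sample of this size shows no detectable departure from it.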
local({
.Table <- xtabs(~Industry+RiskLevel, data=credit)
cat("\nFrequency table:\n")
print(.Table)
.Test <- chisq.test(.Table, correct=FALSE)
print(.Test)
})
##
## Frequency table:
## RiskLevel
## Industry HIGH LOW MEDIUM
## IT 2 5 11
## Non-IT 7 6 6
##
## Pearson's Chi-squared test
##
## data: .Table
## X-squared = 4.3154, df = 2, p-value = 0.1156
Since the p-value (0.1156) is greater than 0.05, we do not reject the null hypothesis of independence. The data therefore give no significant evidence of a relationship between risk level and industry.
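To make the test above easier to follow, here is a sketch that rebuilds the statistic by hand from the printed frequency table. The counts are copied from the output; the variable names (`tab`, `expected`, `x2`) are chosen here for illustration.

```r
# Counts copied from the frequency table printed above
tab <- matrix(c(2, 5, 11,
                7, 6, 6),
              nrow = 2, byrow = TRUE,
              dimnames = list(Industry = c("IT", "Non-IT"),
                              RiskLevel = c("HIGH", "LOW", "MEDIUM")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic and its p-value on (rows - 1) * (cols - 1) degrees of freedom
x2 <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p  <- pchisq(x2, df = df, lower.tail = FALSE)

cat("X-squared =", round(x2, 4), " df =", df, " p-value =", round(p, 4), "\n")
print(round(expected, 2))
```

This should reproduce the X-squared, df, and p-value reported by `chisq.test`. Two of the six expected counts (the HIGH column) fall slightly below 5, so the chi-squared approximation is only rough here; Fisher's exact test (`fisher.test`) is a common alternative for small tables.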
RegModel.1 <-
lm(Score~FLeverage+IsHigh+IsIndustry+Networth+Profit+Sales+Years,
data=credit)
summary(RegModel.1)
##
## Call:
## lm(formula = Score ~ FLeverage + IsHigh + IsIndustry + Networth +
## Profit + Sales + Years, data = credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.1741 -8.7140 0.7981 8.1980 25.1072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.508964 9.217589 10.687 1.43e-11 ***
## FLeverage -4.901261 3.818160 -1.284 0.209
## IsHigh 1.311580 7.984201 0.164 0.871
## IsIndustry -7.061991 7.418803 -0.952 0.349
## Networth 0.006605 0.008407 0.786 0.438
## Profit 0.039517 0.032093 1.231 0.228
## Sales 0.001388 0.016745 0.083 0.934
## Years 0.661317 0.368333 1.795 0.083 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.51 on 29 degrees of freedom
## Multiple R-squared: 0.7431, Adjusted R-squared: 0.6811
## F-statistic: 11.98 on 7 and 29 DF, p-value: 4.443e-07
RegModel.2 <-
lm(Score~FLeverage+IsIndustry+IsMedium+Networth+Profit+Sales+Years,
data=credit)
summary(RegModel.2)
##
## Call:
## lm(formula = Score ~ FLeverage + IsIndustry + IsMedium + Networth +
## Profit + Sales + Years, data = credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.873 -7.948 1.161 9.009 24.170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.974505 9.505537 10.518 2.08e-11 ***
## FLeverage -4.520376 2.822140 -1.602 0.1200
## IsIndustry -6.959225 6.873718 -1.012 0.3197
## IsMedium -2.580373 4.757180 -0.542 0.5917
## Networth 0.006491 0.007730 0.840 0.4079
## Profit 0.040355 0.031893 1.265 0.2158
## Sales 0.001006 0.015875 0.063 0.9499
## Years 0.635446 0.364572 1.743 0.0919 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.45 on 29 degrees of freedom
## Multiple R-squared: 0.7455, Adjusted R-squared: 0.684
## F-statistic: 12.13 on 7 and 29 DF, p-value: 3.919e-07
RegModel.3 <-
lm(Score~FLeverage+IsIndustry+IsLow+Networth+Profit+Sales+Years,
data=credit)
summary(RegModel.3)
##
## Call:
## lm(formula = Score ~ FLeverage + IsIndustry + IsLow + Networth +
## Profit + Sales + Years, data = credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.600 -8.715 0.476 9.129 24.482
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.091208 9.495720 10.225 3.98e-11 ***
## FLeverage -3.329562 3.421079 -0.973 0.338
## IsIndustry -8.069348 6.840667 -1.180 0.248
## IsLow 3.772968 6.354331 0.594 0.557
## Networth 0.005121 0.007842 0.653 0.519
## Profit 0.039511 0.031776 1.243 0.224
## Sales 0.003071 0.015726 0.195 0.847
## Years 0.603324 0.372287 1.621 0.116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.43 on 29 degrees of freedom
## Multiple R-squared: 0.746, Adjusted R-squared: 0.6846
## F-statistic: 12.17 on 7 and 29 DF, p-value: 3.813e-07
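All three models have similar adjusted R-squared values: 0.6811, 0.684, and 0.6846. “IsHigh” and “IsLow” have positive coefficients, whereas the coefficient of “IsMedium” is negative, although none of the three is statistically significant. In fact, no individual predictor reaches the 0.05 level in any of the models. The variables carried forward to the next model are:

* “Years”, which consistently has the lowest p-value (0.083 to 0.116, the closest to 0.05)
* “Profit”
* “IsHigh”
* “IsLow”
* “Networth”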
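The adjusted R-squared values compared above can be checked from the printed multiple R-squared values. This is a sketch using the standard formula, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1); the sample size n = 37 is inferred from the 29 residual degrees of freedom plus 7 predictors plus the intercept, and `adj_r2` is a helper name introduced for illustration.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 29 + 7 + 1   # residual df + predictors + intercept = 37 observations

# Multiple R-squared values copied from the three summaries above
r2 <- c(RegModel.1 = 0.7431, RegModel.2 = 0.7455, RegModel.3 = 0.746)

adj <- adj_r2(r2, n, p = 7)
print(round(adj, 4))
```

The results agree with the printed values to about three decimal places; small differences in the last digit come from the rounding of the printed R-squared inputs.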
RegModel.5 takes the observations above into account. Here is the result:
RegModel.5 <- lm(Score~IsHigh+IsLow+Networth+Profit+Years, data=credit)
summary(RegModel.5)
##
## Call:
## lm(formula = Score ~ IsHigh + IsLow + Networth + Profit + Years,
## data = credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.0864 -8.9585 -0.9796 7.9085 29.9183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.080391 4.699825 18.103 < 2e-16 ***
## IsHigh -0.441658 5.684478 -0.078 0.93857
## IsLow 5.818117 5.523965 1.053 0.30037
## Networth 0.004786 0.003331 1.437 0.16078
## Profit 0.065742 0.022388 2.936 0.00621 **
## Years 0.675448 0.366097 1.845 0.07461 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.46 on 31 degrees of freedom
## Multiple R-squared: 0.7274, Adjusted R-squared: 0.6834
## F-statistic: 16.54 on 5 and 31 DF, p-value: 5.896e-08
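This is the model that should be used to predict credit score. Its adjusted R-squared (0.6834) is comparable to that of the larger models while using two fewer predictors, and “Profit” is now significant (p = 0.00621).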
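As an illustration of how this model produces a prediction, the sketch below applies the fitted equation by hand. The coefficients are copied from the RegModel.5 summary; `predict_score` and the example company's values are hypothetical, and the units of Networth and Profit are assumed to match those in credit.csv.

```r
# Coefficients copied from the RegModel.5 summary above
coefs <- c(Intercept = 85.080391, IsHigh = -0.441658, IsLow = 5.818117,
           Networth = 0.004786, Profit = 0.065742, Years = 0.675448)

# Linear predictor: intercept plus coefficient * value for each variable
predict_score <- function(IsHigh, IsLow, Networth, Profit, Years) {
  x <- c(1, IsHigh, IsLow, Networth, Profit, Years)
  sum(coefs * x)
}

# Hypothetical company: low risk, net worth 1000, profit 200, 10 years
print(predict_score(IsHigh = 0, IsLow = 1, Networth = 1000, Profit = 200, Years = 10))
```

With the fitted object available, `predict(RegModel.5, newdata = ...)` on a data frame with the same column names does the same calculation and can also return prediction intervals.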