download.file("https://www.biz.uiowa.edu/faculty/jledolter/datamining/LoanData.csv", "LoanData.csv",method="curl")
ld<-read.csv("https://www.biz.uiowa.edu/faculty/jledolter/datamining/LoanData.csv")
attach(ld)
## Status Credit.Grade Amount Age
## Current:5186 HR :1217 Min. : 1000 Min. : 0.000
## Default: 75 E :1129 1st Qu.: 2025 1st Qu.: 2.000
## Late : 350 D : 927 Median : 3001 Median : 4.000
## C : 843 Mean : 4817 Mean : 4.504
## B : 553 3rd Qu.: 6000 3rd Qu.: 7.000
## AA : 451 Max. :25000 Max. :14.000
## (Other): 491
## Borrower.Rate Debt.To.Income.Ratio
## Min. :0.0000 Min. : 0.00
## 1st Qu.:0.1425 1st Qu.: 0.09
## Median :0.1950 Median : 0.16
## Mean :0.1937 Mean : 45.38
## 3rd Qu.:0.2500 3rd Qu.: 0.25
## Max. :0.4975 Max. :51280.07
##
## Amount Age Borrower.Rate Debt.To.Income.Ratio Current Credit.AA Credit.B
## 1 5000 4 0.1500 0.040 1 0 0
## 2 1900 6 0.2650 0.020 1 0 0
## 3 1000 3 0.1500 0.020 1 0 0
## 4 1000 5 0.2900 0.020 1 0 0
## 5 2550 8 0.0795 0.033 1 1 0
## 6 1500 2 0.2600 0.030 1 0 0
## Credit.C Credit.D Credit.E Credit.HR
## 1 1 0 0 0
## 2 0 0 0 1
## 3 0 0 0 1
## 4 0 0 0 1
## 5 0 0 0 0
## 6 0 0 0 0
table(ld$Current,ld$Credit.AA)
##
## 0 1
## 0 420 5
## 1 4740 446
table(ld$Current,ld$Credit.B)
##
## 0 1
## 0 402 23
## 1 4656 530
table(ld$Current,ld$Credit.C)
##
## 0 1
## 0 394 31
## 1 4374 812
table(ld$Current,ld$Credit.D)
##
## 0 1
## 0 379 46
## 1 4305 881
table(ld$Current,ld$Credit.E)
##
## 0 1
## 0 314 111
## 1 4168 1018
table(ld$Current,ld$Credit.HR)
##
## 0 1
## 0 241 184
## 1 4153 1033
plot(ld$Current~ld$Amount,data=ld)
plot(ld$Current~ld$Age,data=ld)
plot(ld$Current~ld$Borrower.Rate,data=ld)
plot(ld$Current~ld$Debt.To.Income.Ratio,data=ld)
These graphs plot the various variables against the dependent variable, if a loan is current. It is somewhat difficult to truly understand what patterns these graphs show. The tables plot various credit ratings against the dependent variables as well. The tables do not really show any patterns either.
xxx=cbind(Current=ld$Current,Amount=ld$Amount,Age=ld$Age,Borrower.Rate=ld$Borrower.Rate,DI.Ratio=ld$Debt.To.Income.Ratio,
AA=ld$Credit.AA,B=ld$Credit.B,C=ld$Credit.C,D=ld$Credit.D,E=ld$Credit.E,HR=ld$Credit.HR)
n=dim(ld)[1]
n1=floor(n*(0.6))
n2=n-n1
train=sample(1:n,n1)
m1=glm(Current~.,family=binomial,data=data.frame(xxx))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m1)
##
## Call:
## glm(formula = Current ~ ., family = binomial, data = data.frame(xxx))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3480 0.1393 0.2405 0.3951 2.2457
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.113e+00 3.624e-01 19.626 <2e-16 ***
## Amount -3.123e-05 1.537e-05 -2.032 0.0421 *
## Age -3.621e-01 2.218e-02 -16.326 <2e-16 ***
## Borrower.Rate -1.254e+01 1.239e+00 -10.118 <2e-16 ***
## DI.Ratio 2.742e-01 2.712e-01 1.011 0.3121
## AA 1.104e+00 5.076e-01 2.176 0.0296 *
## B 2.713e-01 3.092e-01 0.877 0.3803
## C 6.813e-01 2.917e-01 2.336 0.0195 *
## D 6.754e-01 2.840e-01 2.378 0.0174 *
## E 1.619e-01 2.807e-01 0.577 0.5641
## HR -3.289e-01 2.857e-01 -1.151 0.2497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3010.3 on 5610 degrees of freedom
## Residual deviance: 2381.8 on 5600 degrees of freedom
## AIC: 2403.8
##
## Number of Fisher Scoring iterations: 16
This is a model of all of the fitted data against the dependent variable. According to this model, the significant variables are amount, age, borrower rate, and three of the different credit classifications. This is a pretty good model because many of the variables are significant.
xxx=xxx[,-1]
xtrain <- xxx[train,]
xnew <- xxx[-train,]
ytrain <- ld$Current[train]
ynew <- ld$Current[-train]
m2=glm(Current~.,family=binomial,data=data.frame(Current=ytrain,xtrain))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m2)
##
## Call:
## glm(formula = Current ~ ., family = binomial, data = data.frame(Current = ytrain,
## xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4503 0.1317 0.2339 0.3894 2.3157
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.711e+00 5.161e-01 14.941 < 2e-16 ***
## Amount -4.471e-05 1.999e-05 -2.236 0.0253 *
## Age -3.572e-01 2.895e-02 -12.341 < 2e-16 ***
## Borrower.Rate -1.385e+01 1.714e+00 -8.080 6.5e-16 ***
## DI.Ratio 3.370e-01 4.511e-01 0.747 0.4550
## AA 1.011e+00 7.872e-01 1.285 0.1988
## B 6.885e-02 4.338e-01 0.159 0.8739
## C 2.348e-01 4.006e-01 0.586 0.5579
## D 3.766e-01 4.061e-01 0.927 0.3537
## E -8.452e-02 4.054e-01 -0.208 0.8349
## HR -5.578e-01 4.148e-01 -1.345 0.1787
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1801.1 on 3365 degrees of freedom
## Residual deviance: 1406.1 on 3355 degrees of freedom
## AIC: 1428.1
##
## Number of Fisher Scoring iterations: 16
When put into the test set, the model does not do as well as in the full data set. Only the Age and the Borrower Rate are significant. This shows that the model is not good enough to reliably predict the current status of loans.
ptest=predict(m2,newdata=data.frame(xnew),type="response")
plot(ynew~ptest)
gg1=floor(ptest+0.5)
ttt=table(ynew,gg1)
ttt
## gg1
## ynew 0 1
## 0 15 156
## 1 13 2061
error=(ttt[1,2]+ttt[2,1])/n2
error
## [1] 0.0752784
This classification matrix shows that the model is good at predicting successful loans, as evidenced by the relatively low error rate. It is also evidenced by the amount of current loans that the model correctly predicted.
bb=cbind(ptest,ynew)
bb1=bb[order(ptest,decreasing=TRUE),]
xbar=mean(ynew)
xbar
## [1] 0.9238307
axis=dim(n2)
ax=dim(n2)
ay=dim(n2)
axis[1]=1
ax[1]=xbar
ay[1]=bb1[1,2]
for (i in 2:n2) {
axis[i]=i
ax[i]=xbar*i
ay[i]=ay[i-1]+bb1[i,2]
}
aaa=cbind(bb1[,1],bb1[,2],ay,ax)
aaa[1:10,]
## ay ax
## 2235 1 1 1 0.9238307
## 2236 1 1 2 1.8476615
## 2237 1 1 3 2.7714922
## 2238 1 1 4 3.6953229
## 2239 1 1 5 4.6191537
## 2240 1 1 6 5.5429844
## 2241 1 1 7 6.4668151
## 2242 1 1 8 7.3906459
## 2243 1 1 9 8.3144766
## 2244 1 1 10 9.2383073
plot(axis,ay,xlab="number of cases",ylab="number of good loans",main="Lift: Cum good loans sorted by pred val/success prob")
points(axis,ax,type="l")
This creates a lift curve to see the predictive value of the model. The more area beneath the curve, the better the model is. However, there is very little space between the curve and the dividing line.
tnc <- read.csv("Titanic.csv")
tnc=tnc[,c(-9,-10,-11)]
head(tnc)
## Class Survived Gender Age NumbSibOrSpsAbd NumbParOrChildAbd Fare
## 1 1 1 female 29.0000 0 0 211.3375
## 2 1 1 male 0.9167 1 2 151.5500
## 3 1 0 female 2.0000 1 2 151.5500
## 4 1 0 male 30.0000 1 2 151.5500
## 5 1 0 female 25.0000 1 2 151.5500
## 6 1 1 male 48.0000 0 0 26.5500
## EmbarkedAt
## 1 S
## 2 S
## 3 S
## 4 S
## 5 S
## 6 S
Rows 9, 10, and 11 talk about the passengers’ destinations and such, which are irrelevant to their survival. Because of this, they can be eliminated.
plot(tnc$Survived~tnc$Class,data=tnc)
plot(tnc$Survived~tnc$Gender,data=tnc)
plot(tnc$Survived~tnc$Age,data=tnc)
plot(tnc$Survived~tnc$NumbSibOrSpsAbd,data=tnc)
plot(tnc$Survived~tnc$NumbParOrChildAbd,data=tnc)
plot(tnc$Survived~tnc$Fare,data=tnc)
plot(tnc$Survived~tnc$EmbarkedAt,data=tnc)
These graphs plot the different variables against the dependent variable, survival. From these graphs, it is difficult to identify any patterns.
v5=rep(1,length(tnc$EmbarkedAt))
v6=rep(0,length(tnc$EmbarkedAt))
tnc$EmbarkedC=ifelse(tnc$EmbarkedAt=="C",v5,v6)
tnc$EmbarkedQ=ifelse(tnc$EmbarkedAt=="Q",v5,v6)
v7=rep(1,length(tnc$Gender))
v8=rep(0,length(tnc$Gender))
tnc$Female=ifelse(tnc$Gender=="Female",v7,v8)
tnc=tnc[,c(-3,-8)]
head(tnc)
## Class Survived Age NumbSibOrSpsAbd NumbParOrChildAbd Fare
## 1 1 1 29.0000 0 0 211.3375
## 2 1 1 0.9167 1 2 151.5500
## 3 1 0 2.0000 1 2 151.5500
## 4 1 0 30.0000 1 2 151.5500
## 5 1 0 25.0000 1 2 151.5500
## 6 1 1 48.0000 0 0 26.5500
## EmbarkedC EmbarkedQ Female
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
Here, we need to make the categorical variables into binomials for the logistical regression. The original categorical variable columns are also taken out.
xx=cbind(Survive=tnc$Survived,Class=tnc$Class,Age=tnc$Age,SSAbd=tnc$NumbSibOrSpsAbd,PCAbd=tnc$NumbParOrChildAbd,
Fare=tnc$Fare,EmbC=tnc$EmbarkedC,EmbQ=tnc$EmbarkedQ,Female=tnc$Female)
n=dim(tnc)[1]
n1=floor(n*(0.6))
n2=n-n1
train=sample(1:n,n1)
m1=glm(Survive~.,family=binomial,data=data.frame(xx))
summary(m1)
##
## Call:
## glm(formula = Survive ~ ., family = binomial, data = data.frame(xx))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2028 -0.8732 -0.6469 1.0363 2.3298
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.614409 0.398189 6.566 5.18e-11 ***
## Class -0.953013 0.117082 -8.140 3.96e-16 ***
## Age -0.038422 0.005832 -6.589 4.44e-11 ***
## SSAbd -0.248844 0.087697 -2.838 0.004546 **
## PCAbd 0.317959 0.089043 3.571 0.000356 ***
## Fare 0.002073 0.001806 1.148 0.250946
## EmbC 0.635211 0.179920 3.531 0.000415 ***
## EmbQ 0.043374 0.344657 0.126 0.899854
## Female NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1410.0 on 1042 degrees of freedom
## Residual deviance: 1218.9 on 1035 degrees of freedom
## AIC: 1234.9
##
## Number of Fisher Scoring iterations: 4
This is a good model because almost all of the variables are significant. These include class, age, and gender. This model will be put into a training set to see how it predicts variables.
xx=xx[,-1]
xtrain <- xx[train,]
xnew <- xx[-train,]
ytrain <- tnc$Survived[train]
ynew <- tnc$Survived[-train]
m2=glm(Survive~.,family=binomial,data=data.frame(Survive=ytrain,xtrain))
summary(m2)
##
## Call:
## glm(formula = Survive ~ ., family = binomial, data = data.frame(Survive = ytrain,
## xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2179 -0.8826 -0.6282 1.0552 2.2317
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.056273 0.536291 5.699 1.21e-08 ***
## Class -1.023459 0.158391 -6.462 1.04e-10 ***
## Age -0.048946 0.007950 -6.156 7.44e-10 ***
## SSAbd -0.201351 0.111770 -1.801 0.07163 .
## PCAbd 0.240935 0.125954 1.913 0.05576 .
## Fare 0.001021 0.002712 0.376 0.70656
## EmbC 0.691990 0.234554 2.950 0.00318 **
## EmbQ 0.423941 0.419161 1.011 0.31182
## Female NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 842.07 on 624 degrees of freedom
## Residual deviance: 728.19 on 617 degrees of freedom
## AIC: 744.19
##
## Number of Fisher Scoring iterations: 4
According to this regression, the model is still good, even in a new data set. Many of the same variables are still significant in this second model.
ptest=predict(m2,newdata=data.frame(xnew),type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
gg1=floor(ptest+0.5)
ttt=table(ynew,gg1)
ttt
## gg1
## ynew 0 1
## 0 212 32
## 1 86 88
error=(ttt[1,2]+ttt[2,1])/n2
error
## [1] 0.2822967
According to this classification matrix, the model is relatively good at predicting not surviving. However, it is not good at predicting survival. That contributes to the relatively high error rate. Especially with a smaller sample.
bb=cbind(ptest,ynew)
bb1=bb[order(ptest,decreasing=TRUE),]
plot(ynew~ptest)
bb=cbind(ptest,ynew)
bb1=bb[order(ptest,decreasing=TRUE),]
xbar=mean(ynew)
axis=dim(n2)
ax=dim(n2)
ay=dim(n2)
axis[1]=1
ax[1]=xbar
ay[1]=bb1[1,2]
for (i in 2:n2) {
axis[i]=i
ax[i]=xbar*i
ay[i]=ay[i-1]+bb1[i,2]
}
aaa=cbind(bb1[,1],bb1[,2],ay,ax)
aaa[1:10,]
## ay ax
## 94 0.9547963 1 1 0.4162679
## 101 0.9232818 1 2 0.8325359
## 2 0.9185790 1 3 1.2488038
## 60 0.9039402 1 4 1.6650718
## 179 0.8977113 1 5 2.0813397
## 110 0.8910061 0 5 2.4976077
## 8 0.8853136 0 5 2.9138756
## 171 0.8781867 1 6 3.3301435
## 53 0.8753992 1 7 3.7464115
## 25 0.8696016 1 8 4.1626794
plot(axis,ay,xlab="number of cases",ylab="number of survivals",main="Lift: Cum Survivals sorted by pred val/success prob")
points(axis,ax,type="l")
This graph is a lift curve that shows the predictive value of the model. This lift curve is relatively good because there is more area underneath the curve. This indicates that the model is okay at predicting the outcomes related to survival, but it is not the best either.