Loan Data

download.file("https://www.biz.uiowa.edu/faculty/jledolter/datamining/LoanData.csv", "LoanData.csv",method="curl")
ld<-read.csv("LoanData.csv")   # read the local copy just downloaded, rather than fetching the URL twice
attach(ld)
summary(ld)
##      Status      Credit.Grade      Amount           Age        
##  Current:5186   HR     :1217   Min.   : 1000   Min.   : 0.000  
##  Default:  75   E      :1129   1st Qu.: 2025   1st Qu.: 2.000  
##  Late   : 350   D      : 927   Median : 3001   Median : 4.000  
##                 C      : 843   Mean   : 4817   Mean   : 4.504  
##                 B      : 553   3rd Qu.: 6000   3rd Qu.: 7.000  
##                 AA     : 451   Max.   :25000   Max.   :14.000  
##                 (Other): 491                                   
##  Borrower.Rate    Debt.To.Income.Ratio
##  Min.   :0.0000   Min.   :    0.00    
##  1st Qu.:0.1425   1st Qu.:    0.09    
##  Median :0.1950   Median :    0.16    
##  Mean   :0.1937   Mean   :   45.38    
##  3rd Qu.:0.2500   3rd Qu.:    0.25    
##  Max.   :0.4975   Max.   :51280.07    
## 
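
The command that recodes the raw columns is not shown: Status must be collapsed into the 0/1 variable Current, and Credit.Grade into the dummy columns that appear in the head() output below. A minimal sketch of that step, assuming Current = 1 when Status is "Current" and one indicator per grade (the original code may differ):

# Hypothetical reconstruction of the recoding step (not shown in the original):
ld$Current <- ifelse(ld$Status == "Current", 1, 0)
for (g in c("AA", "B", "C", "D", "E", "HR")) {
  ld[[paste0("Credit.", g)]] <- ifelse(ld$Credit.Grade == g, 1, 0)
}
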
##   Amount Age Borrower.Rate Debt.To.Income.Ratio Current Credit.AA Credit.B
## 1   5000   4        0.1500                0.040       1         0        0
## 2   1900   6        0.2650                0.020       1         0        0
## 3   1000   3        0.1500                0.020       1         0        0
## 4   1000   5        0.2900                0.020       1         0        0
## 5   2550   8        0.0795                0.033       1         1        0
## 6   1500   2        0.2600                0.030       1         0        0
##   Credit.C Credit.D Credit.E Credit.HR
## 1        1        0        0         0
## 2        0        0        0         1
## 3        0        0        0         1
## 4        0        0        0         1
## 5        0        0        0         0
## 6        0        0        0         0
table(ld$Current,ld$Credit.AA)
##    
##        0    1
##   0  420    5
##   1 4740  446
table(ld$Current,ld$Credit.B)
##    
##        0    1
##   0  402   23
##   1 4656  530
table(ld$Current,ld$Credit.C)
##    
##        0    1
##   0  394   31
##   1 4374  812
table(ld$Current,ld$Credit.D)
##    
##        0    1
##   0  379   46
##   1 4305  881
table(ld$Current,ld$Credit.E)
##    
##        0    1
##   0  314  111
##   1 4168 1018
table(ld$Current,ld$Credit.HR)
##    
##        0    1
##   0  241  184
##   1 4153 1033
plot(Current~Amount,data=ld)

plot(Current~Age,data=ld)

plot(Current~Borrower.Rate,data=ld)

plot(Current~Debt.To.Income.Ratio,data=ld)

These graphs plot each predictor against the dependent variable, whether a loan is current. It is difficult to read any clear pattern out of them. The tables cross-tabulate each credit-grade dummy against the same dependent variable, and they do not reveal obvious patterns either: every grade contains far more current loans than problem loans.
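
Proportions are easier to read than the raw counts in these tables; for example, the share of current loans inside and outside the HR grade (using the same objects as above):

# Column proportions of the cross-tab computed earlier:
# the share of loans that are current within each value of Credit.HR.
prop.table(table(ld$Current, ld$Credit.HR), margin = 2)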

Building the Model Matrix and the Training/Test Split

xxx=cbind(Current=ld$Current,Amount=ld$Amount,Age=ld$Age,Borrower.Rate=ld$Borrower.Rate,DI.Ratio=ld$Debt.To.Income.Ratio,
         AA=ld$Credit.AA,B=ld$Credit.B,C=ld$Credit.C,D=ld$Credit.D,E=ld$Credit.E,HR=ld$Credit.HR)

n=dim(ld)[1]           # total number of loans
n1=floor(n*0.6)        # use 60% of the data for training
n2=n-n1                # hold out the remaining 40% for evaluation
train=sample(1:n,n1)   # random training indices; calling set.seed() first would make the split reproducible

Model Fitted on All Data

m1=glm(Current~.,family=binomial,data=data.frame(xxx))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m1)
## 
## Call:
## glm(formula = Current ~ ., family = binomial, data = data.frame(xxx))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3480   0.1393   0.2405   0.3951   2.2457  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    7.113e+00  3.624e-01  19.626   <2e-16 ***
## Amount        -3.123e-05  1.537e-05  -2.032   0.0421 *  
## Age           -3.621e-01  2.218e-02 -16.326   <2e-16 ***
## Borrower.Rate -1.254e+01  1.239e+00 -10.118   <2e-16 ***
## DI.Ratio       2.742e-01  2.712e-01   1.011   0.3121    
## AA             1.104e+00  5.076e-01   2.176   0.0296 *  
## B              2.713e-01  3.092e-01   0.877   0.3803    
## C              6.813e-01  2.917e-01   2.336   0.0195 *  
## D              6.754e-01  2.840e-01   2.378   0.0174 *  
## E              1.619e-01  2.807e-01   0.577   0.5641    
## HR            -3.289e-01  2.857e-01  -1.151   0.2497    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3010.3  on 5610  degrees of freedom
## Residual deviance: 2381.8  on 5600  degrees of freedom
## AIC: 2403.8
## 
## Number of Fisher Scoring iterations: 16

This model is fitted on the full data set. Amount, Age, Borrower Rate, and three of the credit-grade dummies (AA, C, and D) are significant. The warning about fitted probabilities of numerically 0 or 1 means some cases are predicted almost perfectly (quasi-separation), so coefficients should be read with some care. Still, with most predictors significant, the model appears to capture real structure.
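
The warning can be checked directly by counting fitted probabilities that are numerically 0 or 1; a quick look at the fitted model m1:

# Observations whose fitted probability is numerically 1 or 0,
# which is what the glm.fit warning refers to.
sum(fitted(m1) > 1 - 1e-8 | fitted(m1) < 1e-8)
range(fitted(m1))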

Creating a Training Set

xxx=xxx[,-1]            # drop the Current column, leaving only the predictors
xtrain <- xxx[train,]   # predictor rows for training
xnew <- xxx[-train,]    # held-out predictor rows for evaluation
ytrain <- ld$Current[train]
ynew <- ld$Current[-train]

Model Fitted on the Training Data Set

m2=glm(Current~.,family=binomial,data=data.frame(Current=ytrain,xtrain))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m2)
## 
## Call:
## glm(formula = Current ~ ., family = binomial, data = data.frame(Current = ytrain, 
##     xtrain))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4503   0.1317   0.2339   0.3894   2.3157  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    7.711e+00  5.161e-01  14.941  < 2e-16 ***
## Amount        -4.471e-05  1.999e-05  -2.236   0.0253 *  
## Age           -3.572e-01  2.895e-02 -12.341  < 2e-16 ***
## Borrower.Rate -1.385e+01  1.714e+00  -8.080  6.5e-16 ***
## DI.Ratio       3.370e-01  4.511e-01   0.747   0.4550    
## AA             1.011e+00  7.872e-01   1.285   0.1988    
## B              6.885e-02  4.338e-01   0.159   0.8739    
## C              2.348e-01  4.006e-01   0.586   0.5579    
## D              3.766e-01  4.061e-01   0.927   0.3537    
## E             -8.452e-02  4.054e-01  -0.208   0.8349    
## HR            -5.578e-01  4.148e-01  -1.345   0.1787    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1801.1  on 3365  degrees of freedom
## Residual deviance: 1406.1  on 3355  degrees of freedom
## AIC: 1428.1
## 
## Number of Fisher Scoring iterations: 16

Refitted on the training set alone, the model looks weaker than on the full data: only Amount, Age, and Borrower Rate remain significant, and all of the credit-grade dummies lose significance. Some of this is the smaller sample, but it casts doubt on how reliably the model can predict the current status of loans.
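
Putting the two fits side by side makes the loss of significance easy to see; a quick comparison of the estimates:

# Coefficients from the full-data fit (m1) next to the training-set fit (m2).
round(cbind(full = coef(m1), train = coef(m2)), 4)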

Predicting

ptest=predict(m2,newdata=data.frame(xnew),type="response")   # predicted probabilities for the held-out loans
plot(ynew~ptest)

Classification Matrix

gg1=floor(ptest+0.5)   # round: classify a loan as current when its predicted probability is at least 0.5
ttt=table(ynew,gg1)
ttt
##     gg1
## ynew    0    1
##    0   15  156
##    1   13 2061
error=(ttt[1,2]+ttt[2,1])/n2
error
## [1] 0.0752784

The error rate looks low, but mostly because the test set is imbalanced: about 92% of the loans are current. The model predicts almost every loan as current, so it gets the 2061 current loans right while flagging only 15 of the 171 problem loans. It is good at confirming successful loans but poor at catching the ones that go late or default.
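
The per-class rates make this explicit; computed from the matrix ttt above:

# Fraction of problem loans correctly flagged, and fraction of
# current loans correctly identified, from the classification matrix.
c(problem.loans = ttt[1,1]/sum(ttt[1,]), current.loans = ttt[2,2]/sum(ttt[2,]))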

overall success probability in evaluation (test) data set

bb=cbind(ptest,ynew)                    # predicted probability alongside the actual outcome
bb1=bb[order(ptest,decreasing=TRUE),]   # sort cases from highest to lowest predicted probability
xbar=mean(ynew)                         # overall success rate in the test set
xbar
## [1] 0.9238307
axis=numeric(n2)   # case index 1..n2
ax=numeric(n2)     # expected cumulative good loans under random ordering
ay=numeric(n2)     # actual cumulative good loans, sorted by predicted probability
axis[1]=1
ax[1]=xbar
ay[1]=bb1[1,2]
for (i in 2:n2) {
  axis[i]=i
  ax[i]=xbar*i
  ay[i]=ay[i-1]+bb1[i,2]
}

aaa=cbind(bb1[,1],bb1[,2],ay,ax)
aaa[1:10,]
##          ay        ax
## 2235 1 1  1 0.9238307
## 2236 1 1  2 1.8476615
## 2237 1 1  3 2.7714922
## 2238 1 1  4 3.6953229
## 2239 1 1  5 4.6191537
## 2240 1 1  6 5.5429844
## 2241 1 1  7 6.4668151
## 2242 1 1  8 7.3906459
## 2243 1 1  9 8.3144766
## 2244 1 1 10 9.2383073
plot(axis,ay,xlab="number of cases",ylab="number of good loans",main="Lift: Cum good loans sorted by pred val/success prob")
points(axis,ax,type="l") 

This builds a lift curve to assess the predictive value of the model: the straight line is the number of good loans expected if cases were picked at random, and the curve is the cumulative count when cases are sorted by predicted probability. The wider the gap between curve and line, the better the ranking. Here there is very little space between them, so the model adds little over random selection.
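
One way to put a number on the lift is the share of good loans captured in, say, the top 20% of cases ranked by predicted probability (an arbitrary cutoff, using the ay vector built above); random selection would capture about 20%:

# Cumulative good loans in the top 20% of ranked cases,
# as a share of all good loans in the test set.
top <- floor(0.2 * n2)
ay[top] / sum(ynew)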

Titanic Data

tnc <- read.csv("Titanic.csv")
tnc=tnc[,c(-9,-10,-11)]
head(tnc)
##   Class Survived Gender     Age NumbSibOrSpsAbd NumbParOrChildAbd     Fare
## 1     1        1 female 29.0000               0                 0 211.3375
## 2     1        1   male  0.9167               1                 2 151.5500
## 3     1        0 female  2.0000               1                 2 151.5500
## 4     1        0   male 30.0000               1                 2 151.5500
## 5     1        0 female 25.0000               1                 2 151.5500
## 6     1        1   male 48.0000               0                 0  26.5500
##   EmbarkedAt
## 1          S
## 2          S
## 3          S
## 4          S
## 5          S
## 6          S

Columns 9, 10, and 11 describe the passengers’ destinations and similar details that have no plausible bearing on survival, so they are dropped.

Plotting Variables

plot(Survived~Class,data=tnc)

plot(Survived~Gender,data=tnc)

plot(Survived~Age,data=tnc)

plot(Survived~NumbSibOrSpsAbd,data=tnc)

plot(Survived~NumbParOrChildAbd,data=tnc)

plot(Survived~Fare,data=tnc)

plot(Survived~EmbarkedAt,data=tnc)

These graphs plot the different variables against the dependent variable, survival. From these graphs, it is difficult to identify any patterns.
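
Group means can reveal patterns that the plots hide; a quick check on the same variables, computed before any columns are dropped:

# Survival rate by passenger class and by gender.
tapply(tnc$Survived, tnc$Class, mean)
tapply(tnc$Survived, tnc$Gender, mean)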

Recoding Variables

v5=rep(1,length(tnc$EmbarkedAt))
v6=rep(0,length(tnc$EmbarkedAt))
tnc$EmbarkedC=ifelse(tnc$EmbarkedAt=="C",v5,v6)  # 1 if embarked at Cherbourg
tnc$EmbarkedQ=ifelse(tnc$EmbarkedAt=="Q",v5,v6)  # 1 if embarked at Queenstown
v7=rep(1,length(tnc$Gender))
v8=rep(0,length(tnc$Gender))
tnc$Female=ifelse(tnc$Gender=="Female",v7,v8)    # BUG: Gender is stored lowercase ("female"), so this never matches and Female is all zeros
tnc=tnc[,c(-3,-8)]                               # drop the original Gender and EmbarkedAt columns
head(tnc)
##   Class Survived     Age NumbSibOrSpsAbd NumbParOrChildAbd     Fare
## 1     1        1 29.0000               0                 0 211.3375
## 2     1        1  0.9167               1                 2 151.5500
## 3     1        0  2.0000               1                 2 151.5500
## 4     1        0 30.0000               1                 2 151.5500
## 5     1        0 25.0000               1                 2 151.5500
## 6     1        1 48.0000               0                 0  26.5500
##   EmbarkedC EmbarkedQ Female
## 1         0         0      0
## 2         0         0      0
## 3         0         0      0
## 4         0         0      0
## 5         0         0      0
## 6         0         0      0

Here, the categorical variables are recoded as binary dummy variables for the logistic regression, and the original categorical columns are removed. Note one problem: the data stores Gender as lowercase “female”/“male”, so the comparison against “Female” never matches and the new Female column is all zeros. This is visible in the head() output above, where the female passengers in rows 1, 3, and 5 still show Female = 0, and it is why Female comes back NA in the models below.
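
A corrected version of the gender dummy, matching the lowercase levels actually in the data; it would have to run before the Gender column is dropped:

# Correct recoding: compare against the lowercase level "female".
tnc$Female <- ifelse(tnc$Gender == "female", 1, 0)
# Equivalent, more compact form:
# tnc$Female <- as.numeric(tnc$Gender == "female")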

split the data set into training and test (evaluation) set

xx=cbind(Survive=tnc$Survived,Class=tnc$Class,Age=tnc$Age,SSAbd=tnc$NumbSibOrSpsAbd,PCAbd=tnc$NumbParOrChildAbd,
         Fare=tnc$Fare,EmbC=tnc$EmbarkedC,EmbQ=tnc$EmbarkedQ,Female=tnc$Female)

n=dim(tnc)[1]
n1=floor(n*(0.6))
n2=n-n1
train=sample(1:n,n1)

model fitted on all data

m1=glm(Survive~.,family=binomial,data=data.frame(xx))
summary(m1)
## 
## Call:
## glm(formula = Survive ~ ., family = binomial, data = data.frame(xx))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2028  -0.8732  -0.6469   1.0363   2.3298  
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.614409   0.398189   6.566 5.18e-11 ***
## Class       -0.953013   0.117082  -8.140 3.96e-16 ***
## Age         -0.038422   0.005832  -6.589 4.44e-11 ***
## SSAbd       -0.248844   0.087697  -2.838 0.004546 ** 
## PCAbd        0.317959   0.089043   3.571 0.000356 ***
## Fare         0.002073   0.001806   1.148 0.250946    
## EmbC         0.635211   0.179920   3.531 0.000415 ***
## EmbQ         0.043374   0.344657   0.126 0.899854    
## Female             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1410.0  on 1042  degrees of freedom
## Residual deviance: 1218.9  on 1035  degrees of freedom
## AIC: 1234.9
## 
## Number of Fisher Scoring iterations: 4

This is a reasonably good model: Class, Age, the two family-size counts, and embarking at Cherbourg are all significant. Gender, however, is not estimated at all; Female is the all-zero dummy created above, which is what the note “(1 not defined because of singularities)” refers to. The model is next refitted on a training set to see how well it predicts out of sample.
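
The singularity can be confirmed by looking at the Female column itself; under the recoding shown earlier it is constant:

# An all-zero (constant) column carries no information, so glm drops it.
table(tnc$Female)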

Training Data Set

xx=xx[,-1]
xtrain <- xx[train,]
xnew <- xx[-train,]
ytrain <- tnc$Survived[train]
ynew <- tnc$Survived[-train]
m2=glm(Survive~.,family=binomial,data=data.frame(Survive=ytrain,xtrain))
summary(m2)
## 
## Call:
## glm(formula = Survive ~ ., family = binomial, data = data.frame(Survive = ytrain, 
##     xtrain))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2179  -0.8826  -0.6282   1.0552   2.2317  
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.056273   0.536291   5.699 1.21e-08 ***
## Class       -1.023459   0.158391  -6.462 1.04e-10 ***
## Age         -0.048946   0.007950  -6.156 7.44e-10 ***
## SSAbd       -0.201351   0.111770  -1.801  0.07163 .  
## PCAbd        0.240935   0.125954   1.913  0.05576 .  
## Fare         0.001021   0.002712   0.376  0.70656    
## EmbC         0.691990   0.234554   2.950  0.00318 ** 
## EmbQ         0.423941   0.419161   1.011  0.31182    
## Female             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 842.07  on 624  degrees of freedom
## Residual deviance: 728.19  on 617  degrees of freedom
## AIC: 744.19
## 
## Number of Fisher Scoring iterations: 4

Refitted on the training subset, the model holds up reasonably well: Class, Age, and embarking at Cherbourg remain significant, while the two family-size counts slip to marginal significance.

Classification Matrix

ptest=predict(m2,newdata=data.frame(xnew),type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
gg1=floor(ptest+0.5)
ttt=table(ynew,gg1)
ttt
##     gg1
## ynew   0   1
##    0 212  32
##    1  86  88
error=(ttt[1,2]+ttt[2,1])/n2
error
## [1] 0.2822967

According to this classification matrix, the model is relatively good at predicting non-survival (212 of 244 correct) but much weaker at predicting survival (88 of 174), which drives the relatively high error rate of about 28%; the smaller test sample also makes these rates noisier. The rank-deficiency warning from predict() is the all-zero Female column again, which means the model never uses gender at all.
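
Lowering the classification cutoff trades some accuracy on non-survivors for better detection of survivors; an illustration with an arbitrary cutoff of 0.4:

# Reclassify at a 0.4 cutoff instead of 0.5 and re-tabulate.
gg2 <- ifelse(ptest > 0.4, 1, 0)
table(ynew, gg2)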

overall success probability in evaluation (test) data set

plot(ynew~ptest)

bb=cbind(ptest,ynew)
bb1=bb[order(ptest,decreasing=TRUE),]
xbar=mean(ynew)

axis=numeric(n2)   # case index 1..n2
ax=numeric(n2)     # expected cumulative survivors under random ordering
ay=numeric(n2)     # actual cumulative survivors, sorted by predicted probability
axis[1]=1
ax[1]=xbar
ay[1]=bb1[1,2]
for (i in 2:n2) {
  axis[i]=i
  ax[i]=xbar*i
  ay[i]=ay[i-1]+bb1[i,2]
}

aaa=cbind(bb1[,1],bb1[,2],ay,ax)
aaa[1:10,]
##                 ay        ax
## 94  0.9547963 1  1 0.4162679
## 101 0.9232818 1  2 0.8325359
## 2   0.9185790 1  3 1.2488038
## 60  0.9039402 1  4 1.6650718
## 179 0.8977113 1  5 2.0813397
## 110 0.8910061 0  5 2.4976077
## 8   0.8853136 0  5 2.9138756
## 171 0.8781867 1  6 3.3301435
## 53  0.8753992 1  7 3.7464115
## 25  0.8696016 1  8 4.1626794

Lift Curve

plot(axis,ay,xlab="number of cases",ylab="number of survivals",main="Lift: Cum Survivals sorted by pred val/success prob")
points(axis,ax,type="l") 

This lift curve shows the predictive value of the model: the cumulative count of survivors, taken in order of predicted probability, climbs well above the random-selection baseline before the two meet. That gap indicates the model ranks passengers by survival chances reasonably well, though it is far from perfect.
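
A rough numeric summary of the curve is the average gap between the cumulative survivors and the random baseline (using the ay and ax vectors from above); larger values indicate a better ranking:

# Mean vertical gap between the lift curve and the random-selection line.
mean(ay - ax)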