For assignment 5, the data we are going to analyze credit data. Here are library packages I use on this analysis.

library(dplyr)
library(rpart)
library(rpart.plot)
library(ROCR)
library(ggplot2)

Part A

(I) Use the first half of this dataset to be the training set and the second half of this dataset to be the test set. Your task is to predict the credit rating of an individual.

mydata <- read.csv("c:/Users/tresz/Desktop/DNSC6211/ass05/creditdata.csv")
dim.data.frame(mydata)
## [1] 4446   14
str(mydata)
## 'data.frame':    4446 obs. of  14 variables:
##  $ rating    : Factor w/ 2 levels "bad","good": 2 2 1 2 2 2 2 2 2 1 ...
##  $ experience: int  9 17 10 0 0 1 29 9 0 0 ...
##  $ homeown   : Factor w/ 6 levels "ignore","other",..: 6 6 3 6 6 3 3 4 3 4 ...
##  $ loandurn  : int  60 60 36 60 36 60 60 12 60 48 ...
##  $ age       : int  30 58 46 24 26 36 44 27 32 41 ...
##  $ mstat     : Factor w/ 5 levels "divorced","married",..: 2 5 2 4 4 2 2 4 2 2 ...
##  $ rcds      : Factor w/ 2 levels "no_rec","yes_rec": 1 1 2 1 1 1 1 1 1 1 ...
##  $ jtype     : Factor w/ 4 levels "fixed","freelance",..: 2 1 2 1 1 1 1 1 2 4 ...
##  $ explvl    : int  73 48 90 63 46 75 75 35 90 90 ...
##  $ inc       : int  129 131 200 182 107 214 125 80 107 80 ...
##  $ assts     : int  0 0 3000 2500 0 3500 10000 0 15000 0 ...
##  $ debt      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ loanamount: int  800 1000 2000 900 310 650 1600 200 1200 1200 ...
##  $ purchprice: int  846 1658 2985 1325 910 1645 1800 1093 1957 1468 ...
mydata.train <- mydata[1:2223,]
mydata.test <- mydata[2224:4446,]

We can see that this dataset has 4446 rows and 14 column. So I assigned first half of dataset as training data, and the second half of dataset as testing data.

(II)Please ensure that the data are clean.

sapply(mydata, function(x) sum(is.na(x)) )
##     rating experience    homeown   loandurn        age      mstat 
##          0          0          0          0          0          0 
##       rcds      jtype     explvl        inc      assts       debt 
##          0          0          0          0          0          0 
## loanamount purchprice 
##          0          0

Each column of this dataset does not have any NA data. So the data are clean and ready to analyse.

(III)Please provide the relevant exploratory and descriptive analysis

table(mydata.train$rating)
## 
##  bad good 
##  628 1595
table(mydata.train$homeown) 
## 
##  ignore   other   owner parents    priv    rent 
##       9     160    1072     362     104     516
table(mydata.train$mstat)
## 
##  divorced   married separated    single     widow 
##        20      1647        59       463        34
table(mydata.train$rcds)
## 
##  no_rec yes_rec 
##    1896     327
table(mydata.train$jtype)
## 
##     fixed freelance    others   partime 
##      1413       535        72       203
table(mydata.train$loandurn)
## 
##   6  12  18  24  30  36  42  48  54  60 
##  14  81  49 170  25 480  18 403  11 972
summary(mydata.train$experience)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   5.000   7.853  12.000  43.000
summary(mydata.train$loandurn)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   36.00   48.00   46.36   60.00   60.00
summary(mydata.train$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   28.00   35.00   37.07   45.00   68.00
summary(mydata.train$explvl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.00   45.00   60.00   60.91   75.00  173.00
summary(mydata.train$inc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    93.0   130.0   148.6   180.0   959.0
summary(mydata.train$assts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0    3000    5467    6000  300000
summary(mydata.train$debt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0   353.8     0.0 30000.0
summary(mydata.train$loanamount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     100     700    1000    1028    1300    5000
summary(mydata.train$purchprice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     105    1111    1386    1440    1683    8800
g1 <- ggplot(data = mydata.train, aes(rating, fill = rating)) +
  geom_bar() +
  scale_fill_discrete(name="Rating", labels=c("Good", "Bad")) +
  scale_x_discrete(breaks=c("good","bad"),
                   labels=c("Good","Bad"))
g1

g2 <- ggplot(data = mydata.train, aes(homeown, fill = homeown)) +
  geom_bar() +
  scale_fill_discrete(name="Home Ownership", labels=c("Ignore", "Other", "Owner", "Parents", "Priv", "Rent")) 
g2

g4 <- ggplot(data = mydata.train, aes(mstat, fill = mstat)) +
  geom_bar() +
  scale_fill_discrete(name="Marital Status") 

g4

g5 <- ggplot(data = mydata.train, aes(rcds, fill = rcds)) +
  geom_bar() +
  scale_fill_discrete(name="Existence of Records", labels=c("Yes", "No")) 

g5

g5 <- ggplot(data = mydata.train, aes(jtype, fill = jtype)) +
  geom_bar() +
  scale_fill_discrete(name="Job Type", labels=c("Fixed", "Freelance","Others", "Part time")) 

g5

ggplot(mydata.train, aes(homeown)) + geom_bar(aes(fill=factor(rating)))

ggplot(mydata.train, aes(mstat)) + geom_bar(aes(fill=factor(rating)))

ggplot(mydata.train, aes(rcds)) + geom_bar(aes(fill=factor(rating)))

ggplot(mydata.train, aes(jtype)) + geom_bar(aes(fill=factor(rating)))

(IV) Please explain how this descriptive analysis informed your next steps.

Descriptive statistics helps us in analyzing data by showing or summarizing data patterns, also tell us which variable is useful to predict. This will make us to understand more about each variables and the relationship between variables.

Part B

(I) Please make a predictive model using logistic regression

train.logistic <- glm(formula = rating ~ ., family =binomial(link='logit'), data = mydata.train)
summary(train.logistic)
## 
## Call:
## glm(formula = rating ~ ., family = binomial(link = "logit"), 
##     data = mydata.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1915  -0.6031   0.3956   0.6973   2.5219  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.738e-01  9.990e-01  -0.274 0.784032    
## experience      7.646e-02  1.048e-02   7.293 3.03e-13 ***
## homeownother    3.322e-01  7.807e-01   0.425 0.670501    
## homeownowner    1.733e+00  7.607e-01   2.278 0.022701 *  
## homeownparents  1.404e+00  7.730e-01   1.816 0.069310 .  
## homeownpriv     1.019e+00  7.911e-01   1.288 0.197857    
## homeownrent     9.726e-01  7.646e-01   1.272 0.203349    
## loandurn       -5.372e-03  4.882e-03  -1.100 0.271222    
## age            -1.172e-02  6.999e-03  -1.675 0.094032 .  
## mstatmarried    1.474e+00  5.415e-01   2.722 0.006481 ** 
## mstatseparated  1.728e-01  6.154e-01   0.281 0.778852    
## mstatsingle     9.441e-01  5.443e-01   1.735 0.082814 .  
## mstatwidow      1.976e+00  8.173e-01   2.417 0.015639 *  
## rcdsyes_rec    -1.874e+00  1.539e-01 -12.176  < 2e-16 ***
## jtypefreelance -8.816e-01  1.429e-01  -6.168 6.93e-10 ***
## jtypeothers    -1.113e+00  3.032e-01  -3.671 0.000241 ***
## jtypepartime   -1.275e+00  1.813e-01  -7.034 2.00e-12 ***
## explvl         -1.271e-02  3.813e-03  -3.334 0.000855 ***
## inc             6.200e-03  8.841e-04   7.012 2.34e-12 ***
## assts           2.638e-05  9.123e-06   2.891 0.003834 ** 
## debt           -1.920e-04  4.640e-05  -4.138 3.50e-05 ***
## loanamount     -1.975e-03  2.468e-04  -8.002 1.23e-15 ***
## purchprice      1.022e-03  1.869e-04   5.467 4.57e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2646.7  on 2222  degrees of freedom
## Residual deviance: 1954.8  on 2200  degrees of freedom
## AIC: 2000.8
## 
## Number of Fisher Scoring iterations: 5

(II) Which variables in the logistic regression model are significant or important?

To consider which variable in model is significant or not, we can consider by the p-value of each variable. If the p-value is less than 0.05, the variable would be significant. Experience, homeowner factor, mstamarried factor, mstatwidow factor, rcdsyes_rec(Existence of records) factor, all jtype(Job type) factor, explvl, inc, assts, debt, loanamount,purchprice are significant to the logistic regression model.

(III) Please make a predictive model using a decision tree

tree.model <- rpart(rating ~ ., mydata.train, method = "class")
plot <- rpart.plot(tree.model, type=1, extra = 102, tweak = 1.15)

(IV) Describe and explain the tree. How will you explain the tree to your client (less than 250 words)?

We can see on the top of the decision tree, out of 2,223 sample 1,595 borrowers has good credit. For two down branches from the top, left branch indicate that the borrower has rcds=yes_rec, and the right branch indicate the borrowers who rcds is not yes_rec or mean doesn’t have records. The left branch, 194 borrower out of 327 borrowers which has records has bad rating. Furthermore, among 1,896 borrowers who doesn’t have records, 1,462 borrowers has good records. And the lower brunches explain rating of borrowers with each variable from the nods before. Finally, after all the related variables, we use the bottom nodes from decision tree to determine and predict borrowers rating. For example, if the borrower has a record and has experience less than 2.5 years, the possibility of that borrow would have bad rating is 5%.

Part C

(I) Please help your client understand which model is better for predicting purposes.

#Accuracy for logistic
test.logistic.fitted <- predict(train.logistic,mydata.test,type='response')

test.logistic.fitted <- ifelse(test.logistic.fitted > 0.5,"good","bad") 
misClassificationError1 <- mean(test.logistic.fitted != mydata.test$rating) #run thie how many true there are
print(paste('Accuracy',1-misClassificationError1))
## [1] "Accuracy 0.791722896986055"
p1 <- predict(train.logistic, mydata.test, type="response")
pr1 <- prediction(p1, mydata.test$rating)
prf1 <- performance(pr1, measure = "tpr", x.measure = "fpr")
plot(prf1)
abline(a=0, b= 1)

auc1 <- performance(pr1, measure = "auc")
auc1 <- auc1@y.values[[1]]
auc1
## [1] 0.8305952

For logistic regression model, after calculating the accuracy of the model. We found that the logistic regression model has only 79% accuracy. Then, calculating ROC curve and area under the curve, AUC of logistic regress model is 0.83.

#decision tree model
probs1 <- predict(tree.model, mydata.test, type = "prob")[,2]

pred1 <- prediction(probs1, mydata.test$rating)
perf1 <- performance(pred1, "tpr" , "fpr")
plot(perf1)
abline(a=0, b= 1)

auc2 <- performance(pred1, measure = "auc")
auc2 <- auc2@y.values[[1]]
auc2
## [1] 0.7518058

For decision tree model, after plot ROC and calculate AUC, we found that AUC of decision tree model is 0.75.

Comparing AUC form both models, logistic regression model’s AUC is more than decision tree’s AUC. So we can conclude that logistic regression model is better for predicting purposes.

plot(prf1, col="red")
par(new=TRUE)
plot(perf1, col="blue")
legend("bottomright", legend=c("Logistic model","Decision tree"), lty=c(1,1),col=c("red","blue"))
abline(a=0, b= 1)

This graph show the ROC curve for logistic regression and decision tree model. We can see that logisrtic regression ROC line has more area under the curve than area under the curve of decision tree model.

(II)Find a way to explain sensitivity, specificity and auc in the ROC to your client (in less than 250 words).

ROC or Receiver Operating Curve use to analysis binary variables. ROC is create under two variables which are True positive rate(TPR) and False positive rate(FPR). TPR is the proportion for the data match with the TRUE value correctly in this data means the percentage of borrowers who are correctly indentified as having good rating. TPR also called as “Sensitivity”. FPR or “Specificity” is the proportion for the negative data which are identified correctly. In this data, FPR is the percentage of borrowers who are correctly indentified as having bad rating. AUC(Area under the curve) in the ROC is an ability of the model to indicate between borrower who has good rating or bad rating.

(III)Provide a brief write-up of your findings. How will you present these findings to your clients in less than 250 words?

From the credit data, we use two models to predict credit rating for the potential borrowers which are logistic regression model and decision tree model. Logistic regression model analysis shows that not every variables in the dataset is significantly related to the rating of borrowers. The accuracy of the logistic regression model is only 79.17% which is a medium accuracy. Then we analyze by area under the curve of ROC, auc in ROC of logistic regression model is 0.83. The decision tree model provides illustrative pgrap which is easy to understand. However, when we calculate the auc in ROc of decision tree model which is only 0.75. The logistic regression will be a better indicator to predict credit rating of potential borrowers. The important variables for the bank to predict credit rating are experience, existence of records, job type, quantum of expenses, income, loan amount, and purchase price of item.