For assignment 5, the data we are going to analyze credit data. Here are library packages I use on this analysis.
library(dplyr)
library(rpart)
library(rpart.plot)
library(ROCR)
library(ggplot2)
Part A
mydata <- read.csv("c:/Users/tresz/Desktop/DNSC6211/ass05/creditdata.csv")
dim.data.frame(mydata)
## [1] 4446 14
str(mydata)
## 'data.frame': 4446 obs. of 14 variables:
## $ rating : Factor w/ 2 levels "bad","good": 2 2 1 2 2 2 2 2 2 1 ...
## $ experience: int 9 17 10 0 0 1 29 9 0 0 ...
## $ homeown : Factor w/ 6 levels "ignore","other",..: 6 6 3 6 6 3 3 4 3 4 ...
## $ loandurn : int 60 60 36 60 36 60 60 12 60 48 ...
## $ age : int 30 58 46 24 26 36 44 27 32 41 ...
## $ mstat : Factor w/ 5 levels "divorced","married",..: 2 5 2 4 4 2 2 4 2 2 ...
## $ rcds : Factor w/ 2 levels "no_rec","yes_rec": 1 1 2 1 1 1 1 1 1 1 ...
## $ jtype : Factor w/ 4 levels "fixed","freelance",..: 2 1 2 1 1 1 1 1 2 4 ...
## $ explvl : int 73 48 90 63 46 75 75 35 90 90 ...
## $ inc : int 129 131 200 182 107 214 125 80 107 80 ...
## $ assts : int 0 0 3000 2500 0 3500 10000 0 15000 0 ...
## $ debt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ loanamount: int 800 1000 2000 900 310 650 1600 200 1200 1200 ...
## $ purchprice: int 846 1658 2985 1325 910 1645 1800 1093 1957 1468 ...
mydata.train <- mydata[1:2223,]
mydata.test <- mydata[2224:4446,]
We can see that this dataset has 4446 rows and 14 column. So I assigned first half of dataset as training data, and the second half of dataset as testing data.
sapply(mydata, function(x) sum(is.na(x)) )
## rating experience homeown loandurn age mstat
## 0 0 0 0 0 0
## rcds jtype explvl inc assts debt
## 0 0 0 0 0 0
## loanamount purchprice
## 0 0
Each column of this dataset does not have any NA data. So the data are clean and ready to analyse.
table(mydata.train$rating)
##
## bad good
## 628 1595
table(mydata.train$homeown)
##
## ignore other owner parents priv rent
## 9 160 1072 362 104 516
table(mydata.train$mstat)
##
## divorced married separated single widow
## 20 1647 59 463 34
table(mydata.train$rcds)
##
## no_rec yes_rec
## 1896 327
table(mydata.train$jtype)
##
## fixed freelance others partime
## 1413 535 72 203
table(mydata.train$loandurn)
##
## 6 12 18 24 30 36 42 48 54 60
## 14 81 49 170 25 480 18 403 11 972
summary(mydata.train$experience)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 5.000 7.853 12.000 43.000
summary(mydata.train$loandurn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 36.00 48.00 46.36 60.00 60.00
summary(mydata.train$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 28.00 35.00 37.07 45.00 68.00
summary(mydata.train$explvl)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.00 45.00 60.00 60.91 75.00 173.00
summary(mydata.train$inc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 93.0 130.0 148.6 180.0 959.0
summary(mydata.train$assts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 3000 5467 6000 300000
summary(mydata.train$debt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 353.8 0.0 30000.0
summary(mydata.train$loanamount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100 700 1000 1028 1300 5000
summary(mydata.train$purchprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 105 1111 1386 1440 1683 8800
g1 <- ggplot(data = mydata.train, aes(rating, fill = rating)) +
geom_bar() +
scale_fill_discrete(name="Rating", labels=c("Good", "Bad")) +
scale_x_discrete(breaks=c("good","bad"),
labels=c("Good","Bad"))
g1
g2 <- ggplot(data = mydata.train, aes(homeown, fill = homeown)) +
geom_bar() +
scale_fill_discrete(name="Home Ownership", labels=c("Ignore", "Other", "Owner", "Parents", "Priv", "Rent"))
g2
g4 <- ggplot(data = mydata.train, aes(mstat, fill = mstat)) +
geom_bar() +
scale_fill_discrete(name="Marital Status")
g4
g5 <- ggplot(data = mydata.train, aes(rcds, fill = rcds)) +
geom_bar() +
scale_fill_discrete(name="Existence of Records", labels=c("Yes", "No"))
g5
g5 <- ggplot(data = mydata.train, aes(jtype, fill = jtype)) +
geom_bar() +
scale_fill_discrete(name="Job Type", labels=c("Fixed", "Freelance","Others", "Part time"))
g5
ggplot(mydata.train, aes(homeown)) + geom_bar(aes(fill=factor(rating)))
ggplot(mydata.train, aes(mstat)) + geom_bar(aes(fill=factor(rating)))
ggplot(mydata.train, aes(rcds)) + geom_bar(aes(fill=factor(rating)))
ggplot(mydata.train, aes(jtype)) + geom_bar(aes(fill=factor(rating)))
Descriptive statistics helps us in analyzing data by showing or summarizing data patterns, also tell us which variable is useful to predict. This will make us to understand more about each variables and the relationship between variables.
Part B
train.logistic <- glm(formula = rating ~ ., family =binomial(link='logit'), data = mydata.train)
summary(train.logistic)
##
## Call:
## glm(formula = rating ~ ., family = binomial(link = "logit"),
## data = mydata.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1915 -0.6031 0.3956 0.6973 2.5219
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.738e-01 9.990e-01 -0.274 0.784032
## experience 7.646e-02 1.048e-02 7.293 3.03e-13 ***
## homeownother 3.322e-01 7.807e-01 0.425 0.670501
## homeownowner 1.733e+00 7.607e-01 2.278 0.022701 *
## homeownparents 1.404e+00 7.730e-01 1.816 0.069310 .
## homeownpriv 1.019e+00 7.911e-01 1.288 0.197857
## homeownrent 9.726e-01 7.646e-01 1.272 0.203349
## loandurn -5.372e-03 4.882e-03 -1.100 0.271222
## age -1.172e-02 6.999e-03 -1.675 0.094032 .
## mstatmarried 1.474e+00 5.415e-01 2.722 0.006481 **
## mstatseparated 1.728e-01 6.154e-01 0.281 0.778852
## mstatsingle 9.441e-01 5.443e-01 1.735 0.082814 .
## mstatwidow 1.976e+00 8.173e-01 2.417 0.015639 *
## rcdsyes_rec -1.874e+00 1.539e-01 -12.176 < 2e-16 ***
## jtypefreelance -8.816e-01 1.429e-01 -6.168 6.93e-10 ***
## jtypeothers -1.113e+00 3.032e-01 -3.671 0.000241 ***
## jtypepartime -1.275e+00 1.813e-01 -7.034 2.00e-12 ***
## explvl -1.271e-02 3.813e-03 -3.334 0.000855 ***
## inc 6.200e-03 8.841e-04 7.012 2.34e-12 ***
## assts 2.638e-05 9.123e-06 2.891 0.003834 **
## debt -1.920e-04 4.640e-05 -4.138 3.50e-05 ***
## loanamount -1.975e-03 2.468e-04 -8.002 1.23e-15 ***
## purchprice 1.022e-03 1.869e-04 5.467 4.57e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2646.7 on 2222 degrees of freedom
## Residual deviance: 1954.8 on 2200 degrees of freedom
## AIC: 2000.8
##
## Number of Fisher Scoring iterations: 5
To consider which variable in model is significant or not, we can consider by the p-value of each variable. If the p-value is less than 0.05, the variable would be significant. Experience, homeowner factor, mstamarried factor, mstatwidow factor, rcdsyes_rec(Existence of records) factor, all jtype(Job type) factor, explvl, inc, assts, debt, loanamount,purchprice are significant to the logistic regression model.
tree.model <- rpart(rating ~ ., mydata.train, method = "class")
plot <- rpart.plot(tree.model, type=1, extra = 102, tweak = 1.15)
We can see on the top of the decision tree, out of 2,223 sample 1,595 borrowers has good credit. For two down branches from the top, left branch indicate that the borrower has rcds=yes_rec, and the right branch indicate the borrowers who rcds is not yes_rec or mean doesn’t have records. The left branch, 194 borrower out of 327 borrowers which has records has bad rating. Furthermore, among 1,896 borrowers who doesn’t have records, 1,462 borrowers has good records. And the lower brunches explain rating of borrowers with each variable from the nods before. Finally, after all the related variables, we use the bottom nodes from decision tree to determine and predict borrowers rating. For example, if the borrower has a record and has experience less than 2.5 years, the possibility of that borrow would have bad rating is 5%.
Part C
#Accuracy for logistic
test.logistic.fitted <- predict(train.logistic,mydata.test,type='response')
test.logistic.fitted <- ifelse(test.logistic.fitted > 0.5,"good","bad")
misClassificationError1 <- mean(test.logistic.fitted != mydata.test$rating) #run thie how many true there are
print(paste('Accuracy',1-misClassificationError1))
## [1] "Accuracy 0.791722896986055"
p1 <- predict(train.logistic, mydata.test, type="response")
pr1 <- prediction(p1, mydata.test$rating)
prf1 <- performance(pr1, measure = "tpr", x.measure = "fpr")
plot(prf1)
abline(a=0, b= 1)
auc1 <- performance(pr1, measure = "auc")
auc1 <- auc1@y.values[[1]]
auc1
## [1] 0.8305952
For logistic regression model, after calculating the accuracy of the model. We found that the logistic regression model has only 79% accuracy. Then, calculating ROC curve and area under the curve, AUC of logistic regress model is 0.83.
#decision tree model
probs1 <- predict(tree.model, mydata.test, type = "prob")[,2]
pred1 <- prediction(probs1, mydata.test$rating)
perf1 <- performance(pred1, "tpr" , "fpr")
plot(perf1)
abline(a=0, b= 1)
auc2 <- performance(pred1, measure = "auc")
auc2 <- auc2@y.values[[1]]
auc2
## [1] 0.7518058
For decision tree model, after plot ROC and calculate AUC, we found that AUC of decision tree model is 0.75.
Comparing AUC form both models, logistic regression model’s AUC is more than decision tree’s AUC. So we can conclude that logistic regression model is better for predicting purposes.
plot(prf1, col="red")
par(new=TRUE)
plot(perf1, col="blue")
legend("bottomright", legend=c("Logistic model","Decision tree"), lty=c(1,1),col=c("red","blue"))
abline(a=0, b= 1)
This graph show the ROC curve for logistic regression and decision tree model. We can see that logisrtic regression ROC line has more area under the curve than area under the curve of decision tree model.
ROC or Receiver Operating Curve use to analysis binary variables. ROC is create under two variables which are True positive rate(TPR) and False positive rate(FPR). TPR is the proportion for the data match with the TRUE value correctly in this data means the percentage of borrowers who are correctly indentified as having good rating. TPR also called as “Sensitivity”. FPR or “Specificity” is the proportion for the negative data which are identified correctly. In this data, FPR is the percentage of borrowers who are correctly indentified as having bad rating. AUC(Area under the curve) in the ROC is an ability of the model to indicate between borrower who has good rating or bad rating.
From the credit data, we use two models to predict credit rating for the potential borrowers which are logistic regression model and decision tree model. Logistic regression model analysis shows that not every variables in the dataset is significantly related to the rating of borrowers. The accuracy of the logistic regression model is only 79.17% which is a medium accuracy. Then we analyze by area under the curve of ROC, auc in ROC of logistic regression model is 0.83. The decision tree model provides illustrative pgrap which is easy to understand. However, when we calculate the auc in ROc of decision tree model which is only 0.75. The logistic regression will be a better indicator to predict credit rating of potential borrowers. The important variables for the bank to predict credit rating are experience, existence of records, job type, quantum of expenses, income, loan amount, and purchase price of item.