headers = read.csv("E:\\GWU\\DNSC 6211\\Week 5\\creditdata.csv", header = FALSE, nrows = 1)
df.training <- read.csv("E:\\GWU\\DNSC 6211\\Week 5\\creditdata.csv", header = TRUE, nrows = 2223)
df.testing = read.csv("E:\\GWU\\DNSC 6211\\Week 5\\creditdata.csv", header = FALSE, skip = 2224 ,nrows = 2223)
colnames(df.testing) = unlist(headers) # set the column names for testing dataset
str(df.testing)
## 'data.frame': 2223 obs. of 14 variables:
## $ rating : Factor w/ 2 levels "bad","good": 2 2 2 2 1 2 2 2 2 2 ...
## $ experience: int 3 1 22 30 2 1 5 17 14 3 ...
## $ homeown : Factor w/ 6 levels "ignore","other",..: 3 6 4 3 2 4 6 3 2 6 ...
## $ loandurn : int 60 60 36 48 24 48 48 60 48 36 ...
## $ age : int 30 35 38 59 37 40 34 57 46 23 ...
## $ mstat : Factor w/ 5 levels "divorced","married",..: 2 5 2 2 2 2 1 2 2 2 ...
## $ rcds : Factor w/ 2 levels "no_rec","yes_rec": 1 1 1 1 2 1 1 1 1 1 ...
## $ jtype : Factor w/ 4 levels "fixed","freelance",..: 1 1 1 1 2 1 2 1 1 4 ...
## $ explvl : int 45 45 75 45 60 75 57 60 60 53 ...
## $ inc : int 200 100 110 150 134 121 120 106 100 79 ...
## $ assts : int 10000 0 2000 4000 0 0 9000 4000 148 0 ...
## $ debt : int 2969 0 0 0 0 0 0 0 1 0 ...
## $ loanamount: int 1650 1250 780 660 1000 1320 950 870 800 600 ...
## $ purchprice: int 1750 1400 1785 933 1795 1600 1290 979 1651 650 ...
str(df.training)
## 'data.frame': 2223 obs. of 14 variables:
## $ rating : Factor w/ 2 levels "bad","good": 2 2 1 2 2 2 2 2 2 1 ...
## $ experience: int 9 17 10 0 0 1 29 9 0 0 ...
## $ homeown : Factor w/ 6 levels "ignore","other",..: 6 6 3 6 6 3 3 4 3 4 ...
## $ loandurn : int 60 60 36 60 36 60 60 12 60 48 ...
## $ age : int 30 58 46 24 26 36 44 27 32 41 ...
## $ mstat : Factor w/ 5 levels "divorced","married",..: 2 5 2 4 4 2 2 4 2 2 ...
## $ rcds : Factor w/ 2 levels "no_rec","yes_rec": 1 1 2 1 1 1 1 1 1 1 ...
## $ jtype : Factor w/ 4 levels "fixed","freelance",..: 2 1 2 1 1 1 1 1 2 4 ...
## $ explvl : int 73 48 90 63 46 75 75 35 90 90 ...
## $ inc : int 129 131 200 182 107 214 125 80 107 80 ...
## $ assts : int 0 0 3000 2500 0 3500 10000 0 15000 0 ...
## $ debt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ loanamount: int 800 1000 2000 900 310 650 1600 200 1200 1200 ...
## $ purchprice: int 846 1658 2985 1325 910 1645 1800 1093 1957 1468 ...
sapply(df.training, function(x) sum(is.na(x)) )
## rating experience homeown loandurn age mstat
## 0 0 0 0 0 0
## rcds jtype explvl inc assts debt
## 0 0 0 0 0 0
## loanamount purchprice
## 0 0
sapply(df.testing, function(x) sum(is.na(x)) )
## rating experience homeown loandurn age mstat
## 0 0 0 0 0 0
## rcds jtype explvl inc assts debt
## 0 0 0 0 0 0
## loanamount purchprice
## 0 0
# no Null in both datasets; they are clean already
# summaries of variables
table(df.training$rating)
##
## bad good
## 628 1595
table(df.training$homeown)
##
## ignore other owner parents priv rent
## 9 160 1072 362 104 516
table(df.training$mstat)
##
## divorced married separated single widow
## 20 1647 59 463 34
table(df.training$rcds)
##
## no_rec yes_rec
## 1896 327
table(df.training$jtype)
##
## fixed freelance others partime
## 1413 535 72 203
summary(df.training$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 28.00 35.00 37.07 45.00 68.00
summary(df.training$experience)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 5.000 7.853 12.000 43.000
summary(df.training$loandurn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 36.00 48.00 46.36 60.00 60.00
summary(df.training$explvl)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.00 45.00 60.00 60.91 75.00 173.00
summary(df.training$assts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 3000 5467 6000 300000
summary(df.training$loanamount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100 700 1000 1028 1300 5000
summary(df.training$purchprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 105 1111 1386 1440 1683 8800
library(ggplot2)
# rating
g <- ggplot(data = df.training, aes(rating, fill = rating)) +
geom_bar() +
scale_fill_discrete(name="Rating", labels=c("Bad", "Good")) +
scale_x_discrete(breaks=c("1","2"),
labels=c("Bad","Good"))
g
# three way tables
table(df.training$rating,df.training$rcds, df.training$jtype, dnn = c("rating", "rcds", "jtype"))
## , , jtype = fixed
##
## rcds
## rating no_rec yes_rec
## bad 200 109
## good 1016 88
##
## , , jtype = freelance
##
## rcds
## rating no_rec yes_rec
## bad 112 61
## good 323 39
##
## , , jtype = others
##
## rcds
## rating no_rec yes_rec
## bad 25 8
## good 38 1
##
## , , jtype = partime
##
## rcds
## rating no_rec yes_rec
## bad 97 16
## good 85 5
table(df.training$rating,df.training$mstat, df.training$homeown, dnn = c("rating", "mstat", "homeown"))
## , , homeown = ignore
##
## mstat
## rating divorced married separated single widow
## bad 0 4 0 0 0
## good 0 4 0 1 0
##
## , , homeown = other
##
## mstat
## rating divorced married separated single widow
## bad 0 55 2 28 1
## good 1 48 1 23 1
##
## , , homeown = owner
##
## mstat
## rating divorced married separated single widow
## bad 2 168 7 17 2
## good 4 771 10 66 25
##
## , , homeown = parents
##
## mstat
## rating divorced married separated single widow
## bad 2 30 5 71 0
## good 2 84 10 158 0
##
## , , homeown = priv
##
## mstat
## rating divorced married separated single widow
## bad 0 32 0 3 0
## good 0 60 1 7 1
##
## , , homeown = rent
##
## mstat
## rating divorced married separated single widow
## bad 7 137 16 37 2
## good 2 254 7 52 2
ggplot(df.training, aes(homeown)) + geom_bar(aes(fill=factor(rating)))
ggplot(df.training, aes(rcds)) + geom_bar(aes(fill=factor(rating)))
ggplot(df.training, aes(jtype)) + geom_bar(aes(fill=factor(rating)))
ggplot(df.training, aes(mstat)) + geom_bar(aes(fill=factor(rating)))
Based on the descriptive analysis, the plot for mstat does not patterns and not every categories in homeown seems to be related with rating. Compared with other integer variables, age may not be a significant variable because bad credit customers do not vary based on their ages.
Thus, I will drop mstat, homeown and age in the following model.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#model
df.training <- select(df.training, -mstat, -age, -homeown)
model.logis <- glm(formula = rating ~ ., family =binomial(link='logit'), data = df.training)
summary(model.logis)
##
## Call:
## glm(formula = rating ~ ., family = binomial(link = "logit"),
## data = df.training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1231 -0.7086 0.4282 0.7250 2.3749
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.348e+00 2.929e-01 4.602 4.18e-06 ***
## experience 7.984e-02 9.699e-03 8.232 < 2e-16 ***
## loandurn -1.682e-03 4.749e-03 -0.354 0.723240
## rcdsyes_rec -1.901e+00 1.488e-01 -12.776 < 2e-16 ***
## jtypefreelance -8.198e-01 1.379e-01 -5.944 2.78e-09 ***
## jtypeothers -1.024e+00 2.774e-01 -3.690 0.000224 ***
## jtypepartime -1.203e+00 1.755e-01 -6.854 7.17e-12 ***
## explvl -7.576e-03 2.793e-03 -2.712 0.006684 **
## inc 5.782e-03 8.589e-04 6.732 1.68e-11 ***
## assts 4.261e-05 9.875e-06 4.316 1.59e-05 ***
## debt -1.609e-04 4.146e-05 -3.881 0.000104 ***
## loanamount -2.074e-03 2.441e-04 -8.494 < 2e-16 ***
## purchprice 1.107e-03 1.863e-04 5.941 2.83e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2646.7 on 2222 degrees of freedom
## Residual deviance: 2045.9 on 2210 degrees of freedom
## AIC: 2071.9
##
## Number of Fisher Scoring iterations: 5
model.logistic.fit <- predict(model.logis,df.testing,type='response')
# fit the testing dataset into the model
model.logistic.fit <- ifelse(model.logistic.fit > 0.5,"good","bad")
# categorize the results into good and bad
misClassificationError <- mean(model.logistic.fit != df.testing$rating)
print(paste('Accuracy',1-misClassificationError))
## [1] "Accuracy 0.785874943769681"
# find the accuracy rate
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
p1 <- predict(model.logis, df.testing, type="response")
# generate results of the testing dataset
pr1 <- prediction(p1, df.testing$rating)
# generate the function
prf1 <- performance(pr1, measure = "tpr", x.measure = "fpr")
plot(prf1)
# generate the plot for the function
abline(a=0, b= 1)
auc <- performance(pr1, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8260005
In the model, significant variables are: experience, rcdsyes_rec, jtype, explvl, inc, assts, debt, loanamount and purchprice.
library(rpart)
library(rpart.plot)
model.tree <- rpart(rating ~ ., df.training, method = "class")
rpart.plot(model.tree, type=1)
probs <- predict(model.tree, df.testing, type = "prob")[,2]
pred <- prediction(probs, df.testing$rating)
# Make a performance object: perf
perf <- performance(pred, "tpr" , "fpr")
# Plot this curve
plot(perf)
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7518058
The decision tree shows splits that maximize the “separation” of the classes. A split is any test that divides the data in two, yes or no. The left side of each separation indicates yes. In each category, it shows the probability of good or bad and the corresponding proportion in the dataset. Overall, there are eight splits in four levels and 72% of customers have good credit. My analysis reveals that, the most significant split is experience. Then asset, job type and income are splits under experience. On the third level, assets under job type and explvl under income are significant. When capturing good credit customers in the highest accuracy, 46% of total customers who have experience larger than 1.5 and inc larger than 102 will have good credit 89% chance. One level up, 84% probability to be a good credit customer with experience larger than 1.5. Also, those with experience over 1.5, inc < 102 and explvl less than 82 are good credit ones 74% chance. To find bad credit customers in the highest probability (46%), we should focus on those with experience below 1.5 and jobs of freelance, partime and others. When we have a new customer, we can try to fit into a subcategory and get the rating result in probabilities.
My analysis shows that the logistic regression performs better because by dropping three variables, its accuracy of 0.7859 outnumbers that of the decision tree, which is 0.752.
It means that the logistic regression model predicts correctly on the testing dataset by 0.7859.
Sensitivity is the proportion of true positives (predicted correctly as good in this case). The Y-axis indicates the sensitivity. Since the model can effectively find good credit customers by making errors around 20% chance, the model is reliable.
Specificity is the proportion of true negatives (predicted correctly as bad in this case). Bad credit customers are fewer than good credit ones and the specificity is calculated as 1 - X. When sensitivity gets higher, specificity gets lower. Thus, the model performs better when identifying good credit customers.
Accuracy is measured by the area under the ROC curve. An area of 1 represents a perfect test; an area of .5 represents a worthless test. An auc of 0.79 means the model can correctly categorize a customer as good or bad in 79% chance. Since auc for the model is larger than .5, it is useful and performs better than guessing.
To effectively identifying if a customer has good or bad credit rating, I used the dataset provided and applied logistic regression model and decision tree for the prediction. Fortunately, by using half of the data as training set and the other half as testing set, the regression model can correctly predict the ratings 79% of the time, which are based on variables as experience, rcdsyes_rec, jtype, explvl, inc, assts, debt, loanamount and purchprice. It means once we have the information of a customer, the model would be useful. On the other hand, my decision tree performed closely well, with accuracy rate of 0.75. This approach allows us to target on specific categories of customers based on their info and predict the ratings. Performances of everal subcategories even surpassed the regression model. For example, those with experience larger than 1.5 and inc larger than 102 will have good credit 89% chance. Last but not least, the two approaches are quite effective in predicting ratings, each with pros and cons. The regression model has a better performance on the overall population. However, the decision tree can be very accurate when customers fall in specific groups.