Part A

i

headers = read.csv("E:\\GWU\\DNSC 6211\\Week 5\\creditdata.csv", header = FALSE, nrows = 1)
df.training <- read.csv("E:\\GWU\\DNSC 6211\\Week 5\\creditdata.csv", header = TRUE, nrows = 2223)
df.testing = read.csv("E:\\GWU\\DNSC 6211\\Week 5\\creditdata.csv", header = FALSE, skip = 2224 ,nrows = 2223)

colnames(df.testing) = unlist(headers) # set the column names for testing dataset
str(df.testing)
## 'data.frame':    2223 obs. of  14 variables:
##  $ rating    : Factor w/ 2 levels "bad","good": 2 2 2 2 1 2 2 2 2 2 ...
##  $ experience: int  3 1 22 30 2 1 5 17 14 3 ...
##  $ homeown   : Factor w/ 6 levels "ignore","other",..: 3 6 4 3 2 4 6 3 2 6 ...
##  $ loandurn  : int  60 60 36 48 24 48 48 60 48 36 ...
##  $ age       : int  30 35 38 59 37 40 34 57 46 23 ...
##  $ mstat     : Factor w/ 5 levels "divorced","married",..: 2 5 2 2 2 2 1 2 2 2 ...
##  $ rcds      : Factor w/ 2 levels "no_rec","yes_rec": 1 1 1 1 2 1 1 1 1 1 ...
##  $ jtype     : Factor w/ 4 levels "fixed","freelance",..: 1 1 1 1 2 1 2 1 1 4 ...
##  $ explvl    : int  45 45 75 45 60 75 57 60 60 53 ...
##  $ inc       : int  200 100 110 150 134 121 120 106 100 79 ...
##  $ assts     : int  10000 0 2000 4000 0 0 9000 4000 148 0 ...
##  $ debt      : int  2969 0 0 0 0 0 0 0 1 0 ...
##  $ loanamount: int  1650 1250 780 660 1000 1320 950 870 800 600 ...
##  $ purchprice: int  1750 1400 1785 933 1795 1600 1290 979 1651 650 ...
str(df.training)
## 'data.frame':    2223 obs. of  14 variables:
##  $ rating    : Factor w/ 2 levels "bad","good": 2 2 1 2 2 2 2 2 2 1 ...
##  $ experience: int  9 17 10 0 0 1 29 9 0 0 ...
##  $ homeown   : Factor w/ 6 levels "ignore","other",..: 6 6 3 6 6 3 3 4 3 4 ...
##  $ loandurn  : int  60 60 36 60 36 60 60 12 60 48 ...
##  $ age       : int  30 58 46 24 26 36 44 27 32 41 ...
##  $ mstat     : Factor w/ 5 levels "divorced","married",..: 2 5 2 4 4 2 2 4 2 2 ...
##  $ rcds      : Factor w/ 2 levels "no_rec","yes_rec": 1 1 2 1 1 1 1 1 1 1 ...
##  $ jtype     : Factor w/ 4 levels "fixed","freelance",..: 2 1 2 1 1 1 1 1 2 4 ...
##  $ explvl    : int  73 48 90 63 46 75 75 35 90 90 ...
##  $ inc       : int  129 131 200 182 107 214 125 80 107 80 ...
##  $ assts     : int  0 0 3000 2500 0 3500 10000 0 15000 0 ...
##  $ debt      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ loanamount: int  800 1000 2000 900 310 650 1600 200 1200 1200 ...
##  $ purchprice: int  846 1658 2985 1325 910 1645 1800 1093 1957 1468 ...

ii

sapply(df.training, function(x) sum(is.na(x)) )
##     rating experience    homeown   loandurn        age      mstat 
##          0          0          0          0          0          0 
##       rcds      jtype     explvl        inc      assts       debt 
##          0          0          0          0          0          0 
## loanamount purchprice 
##          0          0
sapply(df.testing, function(x) sum(is.na(x)) )
##     rating experience    homeown   loandurn        age      mstat 
##          0          0          0          0          0          0 
##       rcds      jtype     explvl        inc      assts       debt 
##          0          0          0          0          0          0 
## loanamount purchprice 
##          0          0
  # no Null in both datasets; they are clean already

iii

# summaries of variables
table(df.training$rating) 
## 
##  bad good 
##  628 1595
table(df.training$homeown)
## 
##  ignore   other   owner parents    priv    rent 
##       9     160    1072     362     104     516
table(df.training$mstat)
## 
##  divorced   married separated    single     widow 
##        20      1647        59       463        34
table(df.training$rcds)
## 
##  no_rec yes_rec 
##    1896     327
table(df.training$jtype)
## 
##     fixed freelance    others   partime 
##      1413       535        72       203
summary(df.training$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   28.00   35.00   37.07   45.00   68.00
summary(df.training$experience)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   5.000   7.853  12.000  43.000
summary(df.training$loandurn)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   36.00   48.00   46.36   60.00   60.00
summary(df.training$explvl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.00   45.00   60.00   60.91   75.00  173.00
summary(df.training$assts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0    3000    5467    6000  300000
summary(df.training$loanamount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     100     700    1000    1028    1300    5000
summary(df.training$purchprice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     105    1111    1386    1440    1683    8800
library(ggplot2)
# rating
g <- ggplot(data = df.training, aes(rating, fill = rating)) +
  geom_bar() +
  scale_fill_discrete(name="Rating", labels=c("Bad", "Good")) +
  scale_x_discrete(breaks=c("1","2"),
                   labels=c("Bad","Good"))
g

# three way tables
table(df.training$rating,df.training$rcds, df.training$jtype, dnn = c("rating", "rcds", "jtype"))
## , , jtype = fixed
## 
##       rcds
## rating no_rec yes_rec
##   bad     200     109
##   good   1016      88
## 
## , , jtype = freelance
## 
##       rcds
## rating no_rec yes_rec
##   bad     112      61
##   good    323      39
## 
## , , jtype = others
## 
##       rcds
## rating no_rec yes_rec
##   bad      25       8
##   good     38       1
## 
## , , jtype = partime
## 
##       rcds
## rating no_rec yes_rec
##   bad      97      16
##   good     85       5
table(df.training$rating,df.training$mstat, df.training$homeown, dnn = c("rating", "mstat", "homeown"))
## , , homeown = ignore
## 
##       mstat
## rating divorced married separated single widow
##   bad         0       4         0      0     0
##   good        0       4         0      1     0
## 
## , , homeown = other
## 
##       mstat
## rating divorced married separated single widow
##   bad         0      55         2     28     1
##   good        1      48         1     23     1
## 
## , , homeown = owner
## 
##       mstat
## rating divorced married separated single widow
##   bad         2     168         7     17     2
##   good        4     771        10     66    25
## 
## , , homeown = parents
## 
##       mstat
## rating divorced married separated single widow
##   bad         2      30         5     71     0
##   good        2      84        10    158     0
## 
## , , homeown = priv
## 
##       mstat
## rating divorced married separated single widow
##   bad         0      32         0      3     0
##   good        0      60         1      7     1
## 
## , , homeown = rent
## 
##       mstat
## rating divorced married separated single widow
##   bad         7     137        16     37     2
##   good        2     254         7     52     2
ggplot(df.training, aes(homeown)) + geom_bar(aes(fill=factor(rating)))

ggplot(df.training, aes(rcds)) + geom_bar(aes(fill=factor(rating)))

ggplot(df.training, aes(jtype)) + geom_bar(aes(fill=factor(rating)))

ggplot(df.training, aes(mstat)) + geom_bar(aes(fill=factor(rating)))

iv

Based on the descriptive analysis, the plot for mstat does not patterns and not every categories in homeown seems to be related with rating. Compared with other integer variables, age may not be a significant variable because bad credit customers do not vary based on their ages.

Thus, I will drop mstat, homeown and age in the following model.

Part B

i

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
  #model
df.training <- select(df.training, -mstat, -age, -homeown)

model.logis <- glm(formula = rating ~ ., family =binomial(link='logit'), data = df.training)
summary(model.logis)
## 
## Call:
## glm(formula = rating ~ ., family = binomial(link = "logit"), 
##     data = df.training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1231  -0.7086   0.4282   0.7250   2.3749  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     1.348e+00  2.929e-01   4.602 4.18e-06 ***
## experience      7.984e-02  9.699e-03   8.232  < 2e-16 ***
## loandurn       -1.682e-03  4.749e-03  -0.354 0.723240    
## rcdsyes_rec    -1.901e+00  1.488e-01 -12.776  < 2e-16 ***
## jtypefreelance -8.198e-01  1.379e-01  -5.944 2.78e-09 ***
## jtypeothers    -1.024e+00  2.774e-01  -3.690 0.000224 ***
## jtypepartime   -1.203e+00  1.755e-01  -6.854 7.17e-12 ***
## explvl         -7.576e-03  2.793e-03  -2.712 0.006684 ** 
## inc             5.782e-03  8.589e-04   6.732 1.68e-11 ***
## assts           4.261e-05  9.875e-06   4.316 1.59e-05 ***
## debt           -1.609e-04  4.146e-05  -3.881 0.000104 ***
## loanamount     -2.074e-03  2.441e-04  -8.494  < 2e-16 ***
## purchprice      1.107e-03  1.863e-04   5.941 2.83e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2646.7  on 2222  degrees of freedom
## Residual deviance: 2045.9  on 2210  degrees of freedom
## AIC: 2071.9
## 
## Number of Fisher Scoring iterations: 5
model.logistic.fit <- predict(model.logis,df.testing,type='response')
  # fit the testing dataset into the model
model.logistic.fit <- ifelse(model.logistic.fit > 0.5,"good","bad")
  # categorize the results into good and bad
misClassificationError <- mean(model.logistic.fit != df.testing$rating)
print(paste('Accuracy',1-misClassificationError))
## [1] "Accuracy 0.785874943769681"
  # find the accuracy rate

library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
p1 <- predict(model.logis, df.testing, type="response")
  # generate results of the testing dataset
pr1 <- prediction(p1, df.testing$rating)
  # generate the function
prf1 <- performance(pr1, measure = "tpr", x.measure = "fpr")
plot(prf1)
  # generate the plot for the function
abline(a=0, b= 1)

auc <- performance(pr1, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8260005

ii

In the model, significant variables are: experience, rcdsyes_rec, jtype, explvl, inc, assts, debt, loanamount and purchprice.

iii

library(rpart)
library(rpart.plot)

model.tree <- rpart(rating ~ ., df.training, method = "class")
rpart.plot(model.tree, type=1)

probs <- predict(model.tree, df.testing, type = "prob")[,2]

pred <- prediction(probs, df.testing$rating)
  # Make a performance object: perf
perf <- performance(pred, "tpr" , "fpr")
  # Plot this curve
plot(perf)

auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7518058

iv

The decision tree shows splits that maximize the “separation” of the classes. A split is any test that divides the data in two, yes or no. The left side of each separation indicates yes. In each category, it shows the probability of good or bad and the corresponding proportion in the dataset. Overall, there are eight splits in four levels and 72% of customers have good credit. My analysis reveals that, the most significant split is experience. Then asset, job type and income are splits under experience. On the third level, assets under job type and explvl under income are significant. When capturing good credit customers in the highest accuracy, 46% of total customers who have experience larger than 1.5 and inc larger than 102 will have good credit 89% chance. One level up, 84% probability to be a good credit customer with experience larger than 1.5. Also, those with experience over 1.5, inc < 102 and explvl less than 82 are good credit ones 74% chance. To find bad credit customers in the highest probability (46%), we should focus on those with experience below 1.5 and jobs of freelance, partime and others. When we have a new customer, we can try to fit into a subcategory and get the rating result in probabilities.

Part C

i

My analysis shows that the logistic regression performs better because by dropping three variables, its accuracy of 0.7859 outnumbers that of the decision tree, which is 0.752.

It means that the logistic regression model predicts correctly on the testing dataset by 0.7859.

ii

Sensitivity is the proportion of true positives (predicted correctly as good in this case). The Y-axis indicates the sensitivity. Since the model can effectively find good credit customers by making errors around 20% chance, the model is reliable.

Specificity is the proportion of true negatives (predicted correctly as bad in this case). Bad credit customers are fewer than good credit ones and the specificity is calculated as 1 - X. When sensitivity gets higher, specificity gets lower. Thus, the model performs better when identifying good credit customers.

Accuracy is measured by the area under the ROC curve. An area of 1 represents a perfect test; an area of .5 represents a worthless test. An auc of 0.79 means the model can correctly categorize a customer as good or bad in 79% chance. Since auc for the model is larger than .5, it is useful and performs better than guessing.

iii

To effectively identifying if a customer has good or bad credit rating, I used the dataset provided and applied logistic regression model and decision tree for the prediction. Fortunately, by using half of the data as training set and the other half as testing set, the regression model can correctly predict the ratings 79% of the time, which are based on variables as experience, rcdsyes_rec, jtype, explvl, inc, assts, debt, loanamount and purchprice. It means once we have the information of a customer, the model would be useful. On the other hand, my decision tree performed closely well, with accuracy rate of 0.75. This approach allows us to target on specific categories of customers based on their info and predict the ratings. Performances of everal subcategories even surpassed the regression model. For example, those with experience larger than 1.5 and inc larger than 102 will have good credit 89% chance. Last but not least, the two approaches are quite effective in predicting ratings, each with pros and cons. The regression model has a better performance on the overall population. However, the decision tree can be very accurate when customers fall in specific groups.