Assignment 05

Importing the dataset

#Importing the dataset
library (readr)
mydf <- read_csv(file.choose())

## Parsed with column specification:
## cols(
##   rating = col_character(),
##   experience = col_integer(),
##   homeown = col_character(),
##   loandurn = col_integer(),
##   age = col_integer(),
##   mstat = col_character(),
##   rcds = col_character(),
##   jtype = col_character(),
##   explvl = col_integer(),
##   inc = col_integer(),
##   assts = col_integer(),
##   debt = col_integer(),
##   loanamount = col_integer(),
##   purchprice = col_integer()
## )

PART A

See some data

################################PART A##################################
#See some data
head(mydf)

## # A tibble: 6 x 14
##   rating experience homeown loandurn   age   mstat    rcds     jtype
##    <chr>      <int>   <chr>    <int> <int>   <chr>   <chr>     <chr>
## 1   good          9    rent       60    30 married  no_rec freelance
## 2   good         17    rent       60    58   widow  no_rec     fixed
## 3    bad         10   owner       36    46 married yes_rec freelance
## 4   good          0    rent       60    24  single  no_rec     fixed
## 5   good          0    rent       36    26  single  no_rec     fixed
## 6   good          1   owner       60    36 married  no_rec     fixed
## # ... with 6 more variables: explvl <int>, inc <int>, assts <int>,
## #   debt <int>, loanamount <int>, purchprice <int>

Explore structure of data

#Explore structure of data
str(mydf)

## Classes 'tbl_df', 'tbl' and 'data.frame':    4446 obs. of  14 variables:
##  $ rating    : chr  "good" "good" "bad" "good" ...
##  $ experience: int  9 17 10 0 0 1 29 9 0 0 ...
##  $ homeown   : chr  "rent" "rent" "owner" "rent" ...
##  $ loandurn  : int  60 60 36 60 36 60 60 12 60 48 ...
##  $ age       : int  30 58 46 24 26 36 44 27 32 41 ...
##  $ mstat     : chr  "married" "widow" "married" "single" ...
##  $ rcds      : chr  "no_rec" "no_rec" "yes_rec" "no_rec" ...
##  $ jtype     : chr  "freelance" "fixed" "freelance" "fixed" ...
##  $ explvl    : int  73 48 90 63 46 75 75 35 90 90 ...
##  $ inc       : int  129 131 200 182 107 214 125 80 107 80 ...
##  $ assts     : int  0 0 3000 2500 0 3500 10000 0 15000 0 ...
##  $ debt      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ loanamount: int  800 1000 2000 900 310 650 1600 200 1200 1200 ...
##  $ purchprice: int  846 1658 2985 1325 910 1645 1800 1093 1957 1468 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 14
##   .. ..$ rating    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ experience: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ homeown   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ loandurn  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ age       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ mstat     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ rcds      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ jtype     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ explvl    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ inc       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ assts     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ debt      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ loanamount: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ purchprice: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

Now, I have to change the class of categorical and ordinal variables into factors

#Now, I have to change the class of categorical and ordinal variables into factors
mydf$rating <- as.factor(mydf$rating)
mydf$homeown <- as.factor(mydf$homeown)
mydf$mstat <- as.factor(mydf$mstat)
mydf$rcds <- as.factor(mydf$rcds)
mydf$jtype <- as.factor(mydf$jtype)
str(mydf)

## Classes 'tbl_df', 'tbl' and 'data.frame':    4446 obs. of  14 variables:
##  $ rating    : Factor w/ 2 levels "bad","good": 2 2 1 2 2 2 2 2 2 1 ...
##  $ experience: int  9 17 10 0 0 1 29 9 0 0 ...
##  $ homeown   : Factor w/ 6 levels "ignore","other",..: 6 6 3 6 6 3 3 4 3 4 ...
##  $ loandurn  : int  60 60 36 60 36 60 60 12 60 48 ...
##  $ age       : int  30 58 46 24 26 36 44 27 32 41 ...
##  $ mstat     : Factor w/ 5 levels "divorced","married",..: 2 5 2 4 4 2 2 4 2 2 ...
##  $ rcds      : Factor w/ 2 levels "no_rec","yes_rec": 1 1 2 1 1 1 1 1 1 1 ...
##  $ jtype     : Factor w/ 4 levels "fixed","freelance",..: 2 1 2 1 1 1 1 1 2 4 ...
##  $ explvl    : int  73 48 90 63 46 75 75 35 90 90 ...
##  $ inc       : int  129 131 200 182 107 214 125 80 107 80 ...
##  $ assts     : int  0 0 3000 2500 0 3500 10000 0 15000 0 ...
##  $ debt      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ loanamount: int  800 1000 2000 900 310 650 1600 200 1200 1200 ...
##  $ purchprice: int  846 1658 2985 1325 910 1645 1800 1093 1957 1468 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 14
##   .. ..$ rating    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ experience: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ homeown   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ loandurn  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ age       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ mstat     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ rcds      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ jtype     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ explvl    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ inc       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ assts     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ debt      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ loanamount: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ purchprice: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

Now, I am looking for missing values

#Now, I am looking for missing values
sum(is.na(mydf))

## [1] 0

Good for me!! we have no missing values

Now it is time to split the data set into training and testing subsets

#Now it is time to split the data set into training and testing subsets
train <- mydf[0:(0.5*nrow(mydf)),]
test <- mydf[((0.5*nrow(mydf))+1):nrow(mydf),]
head(train)

## # A tibble: 6 x 14
##   rating experience homeown loandurn   age   mstat    rcds     jtype
##   <fctr>      <int>  <fctr>    <int> <int>  <fctr>  <fctr>    <fctr>
## 1   good          9    rent       60    30 married  no_rec freelance
## 2   good         17    rent       60    58   widow  no_rec     fixed
## 3    bad         10   owner       36    46 married yes_rec freelance
## 4   good          0    rent       60    24  single  no_rec     fixed
## 5   good          0    rent       36    26  single  no_rec     fixed
## 6   good          1   owner       60    36 married  no_rec     fixed
## # ... with 6 more variables: explvl <int>, inc <int>, assts <int>,
## #   debt <int>, loanamount <int>, purchprice <int>

head(test)

## # A tibble: 6 x 14
##   rating experience homeown loandurn   age   mstat    rcds     jtype
##   <fctr>      <int>  <fctr>    <int> <int>  <fctr>  <fctr>    <fctr>
## 1   good          3   owner       60    30 married  no_rec     fixed
## 2   good          1    rent       60    35   widow  no_rec     fixed
## 3   good         22 parents       36    38 married  no_rec     fixed
## 4   good         30   owner       48    59 married  no_rec     fixed
## 5    bad          2   other       24    37 married yes_rec freelance
## 6   good          1 parents       48    40 married  no_rec     fixed
## # ... with 6 more variables: explvl <int>, inc <int>, assts <int>,
## #   debt <int>, loanamount <int>, purchprice <int>

From now on, we work on training set. It is time for exoloratory and discriptive analysis. we use table method for categorical and ordinal variables and summary method for continuous variables

#From now on, we work on training set. It is time for exoloratory and discriptive analysis. we use table method for categorical and ordinal variables and summary method for continuous variables
table(train$rating)

## 
##  bad good 
##  628 1595

table(train$homeown)

## 
##  ignore   other   owner parents    priv    rent 
##       9     160    1072     362     104     516

table(train$mstat)

## 
##  divorced   married separated    single     widow 
##        20      1647        59       463        34

table(train$rcds)

## 
##  no_rec yes_rec 
##    1896     327

table(train$jtype)

## 
##     fixed freelance    others   partime 
##      1413       535        72       203

summary(train$experience)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   5.000   7.853  12.000  43.000

summary(train$loandurn)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   36.00   48.00   46.36   60.00   60.00

summary(train$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   28.00   35.00   37.07   45.00   68.00

summary(train$explvl)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.00   45.00   60.00   60.91   75.00  173.00

summary(train$inc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    93.0   130.0   148.6   180.0   959.0

summary(train$assts)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0    3000    5467    6000  300000

summary(train$debt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0   353.8     0.0 30000.0

summary(train$loanamount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     100     700    1000    1028    1300    5000

summary(train$purchprice)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     105    1111    1386    1440    1683    8800

For a better analysis, we use ggplot plot the bar plots for those ordinal variables which are as factors, now:

#For a better analysis, we use ggplot plot the bar plots for those ordinal variables which are as factors, now:
library(ggplot2)

g <- ggplot(data = train, aes(rating, fill = rating)) +
  geom_bar() +
  scale_fill_discrete(name="Rating", labels=c("Bad", "good")) +
  scale_x_discrete(breaks=c("0","1"),
                   labels=c("Bad","Good"))
g

g <- ggplot(data = train, aes(homeown, fill = homeown)) +
  geom_bar() +
  scale_fill_discrete(name="Homeown", labels=c("Ignore", "Other", "Owner", "Parents", "Private", "Rent")) +
  scale_x_discrete(breaks=c("0","1","2","3","4","5"),
                   labels=c("Ignore", "Other", "Owner", "Parents", "Private", "Rent"))
g

g <- ggplot(data = train, aes(mstat, fill = mstat)) +
  geom_bar() +
  scale_fill_discrete(name="mstat", labels=c("divorced", "married", "separated", "single", "widow")) +
  scale_x_discrete(breaks=c("0","1","2","3","4"),
                   labels=c("divorced", "married", "separated", "single", "widow"))
g

g <- ggplot(data = train, aes(rcds, fill = rcds)) +
  geom_bar() +
  scale_fill_discrete(name="rcds", labels=c(" no_rec", "yes_rec")) +
  scale_x_discrete(breaks=c("0","1"),
                   labels=c("no_rec","yes_rec"))
g

g <- ggplot(data = train, aes(jtype, fill = jtype)) +
  geom_bar() +
  scale_fill_discrete(name="jtype", labels=c("fixed", "freelance", "others", "partime")) +
  scale_x_discrete(breaks=c("0","1","2","3"),
                   labels=c("fixed", "freelance", "others", "partime"))
g

The rating shows the exact values of what we are going to predict. It is observed that: 1. most of the people in our training dataset have fixed jobs. 2. majority of the them have no records. 3. majority of them are married. 4. and they are mostly home owners. This analysis helps us to improve our model in the next steps and gives us an overview of how the varibales are distributed among records. we have the correlation coefficient for analyzing how the continuous variables affect our regression model, but we cannot use it for categorical variables. this analysis gives us that understanding before building our regression model. also logestic regression model requires quite large sample sizes. Because maximum likelihood estimates are less powerful than ordinary least squares (e.g., simple linear regression, multiple linear regression);so this analysis can ensure we have enough observations for prediction.

Two way analysis:

ggplot(train, aes(x=experience, y=rating)) + geom_point(size=2, shape=1)

ggplot(train, aes(x=loandurn, y=rating)) + geom_point(size=2, shape=1)

ggplot(train, aes(x=age, y=rating)) + geom_point(size=2, shape=1)

ggplot(train, aes(x=explvl, y=rating)) + geom_point(size=2, shape=1)

ggplot(train, aes(x=inc, y=rating)) + geom_point(size=2, shape=1)

ggplot(train, aes(x=assts, y=rating)) + geom_point(size=2, shape=1)

ggplot(train, aes(x=debt, y=rating)) + geom_point(size=2, shape=1)

ggplot(train, aes(x=loanamount, y=rating)) + geom_point(size=2, shape=1)

ggplot(train, aes(x=purchprice, y=rating)) + geom_point(size=2, shape=1)

But it does not bring us meaningful conclusions

PART B

Here we build our logestic model for predicting “Rating” based on other 13 predictors:

#Here we build our logestic model for predicting "Rating" based on other 13 predictors:
model.logistic <- glm(formula = rating ~ ., family =binomial(link='logit'), data = train)
summary(model.logistic)

## 
## Call:
## glm(formula = rating ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1915  -0.6031   0.3956   0.6973   2.5219  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.738e-01  9.990e-01  -0.274 0.784032    
## experience      7.646e-02  1.048e-02   7.293 3.03e-13 ***
## homeownother    3.322e-01  7.807e-01   0.425 0.670501    
## homeownowner    1.733e+00  7.607e-01   2.278 0.022701 *  
## homeownparents  1.404e+00  7.730e-01   1.816 0.069310 .  
## homeownpriv     1.019e+00  7.911e-01   1.288 0.197857    
## homeownrent     9.726e-01  7.646e-01   1.272 0.203349    
## loandurn       -5.372e-03  4.882e-03  -1.100 0.271222    
## age            -1.172e-02  6.999e-03  -1.675 0.094032 .  
## mstatmarried    1.474e+00  5.415e-01   2.722 0.006481 ** 
## mstatseparated  1.728e-01  6.154e-01   0.281 0.778852    
## mstatsingle     9.441e-01  5.443e-01   1.735 0.082814 .  
## mstatwidow      1.976e+00  8.173e-01   2.417 0.015639 *  
## rcdsyes_rec    -1.874e+00  1.539e-01 -12.176  < 2e-16 ***
## jtypefreelance -8.816e-01  1.429e-01  -6.168 6.93e-10 ***
## jtypeothers    -1.113e+00  3.032e-01  -3.671 0.000241 ***
## jtypepartime   -1.275e+00  1.813e-01  -7.034 2.00e-12 ***
## explvl         -1.271e-02  3.813e-03  -3.334 0.000855 ***
## inc             6.200e-03  8.841e-04   7.012 2.34e-12 ***
## assts           2.638e-05  9.123e-06   2.891 0.003834 ** 
## debt           -1.920e-04  4.640e-05  -4.138 3.50e-05 ***
## loanamount     -1.975e-03  2.468e-04  -8.002 1.23e-15 ***
## purchprice      1.022e-03  1.869e-04   5.467 4.57e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2646.7  on 2222  degrees of freedom
## Residual deviance: 1954.8  on 2200  degrees of freedom
## AIC: 2000.8
## 
## Number of Fisher Scoring iterations: 5

Now we conclude that the variables which matter for predicting the “Rating” variable are : “Purchprice”, “loanmount”, “debt”, “assts”, “inc”, “explvl”, “jty”, “experience”, “mstat”

Now we apply the test dataset to our model and plot the decision tree:

library(rpart)
library(rpart.plot)


model.tree <- rpart(rating ~ ., train, method = "class")
rpart.plot(model.tree, type=1, extra = 102)

probs <- predict(model.tree, test, type = "prob")[,2]

Decision tree interpretation: the leaves of the decision tree show the distribution of the chance of having a good or bad credit rating among different values of significant variables we considered. If records exist for the applicants and their experience level is less than 2.5 years, the credit rating will be bad by chance of 94/120. It holds 5% of the credit ratings chance and is bad. But if their experience level is greater than 2.5 years while their quantum of assets is less than 574, the credit rating will be bad by chance of 52/82. It holds 4% of the credit ratings chance and is bad. But if the quantum of assets for those is greater than 574, the credit rating will be good by chance of 77/125. For the right stem, if records do not exist for the applicants, their experience level is greater than 1.5 years and their level of income is greater than 102, the credit rating will be good by chance of 911/1021. While if all is the same but the level of income is less than 102, it depends on the quantum of expenses. If it is greater than 82, the credit rating will be bad by chance of 29/44 while if the quantum of expenses is less than 82, the credit rating will be good by chance of 275/371. Finally, if records do not exist for the applicants, their experience level is less than 1.5 years and they have a fixed job, the credit rating will be good by chance of 144/206. But if they don’t have a fixed job but their assets are more than 2750, they still have the chance of a good credit rating, equal to 60/93. And if they don’t have assets greater than 3750, they will have a bad credit rating by the chance of 104/161.

PARTC

The most efficient model for predicting the credit rating can be derived by different cut offs but a simple decision can be made by considering those coeffienct with the greatest coefficient values and greater level of significance. the greater level of significance means better the effect of that variable is considered in the regression model and the greater coefficient value means the more probability of good credit rating.

model.logistic.efficient <- glm(formula = rating ~ experience+rcds+mstat+jtype+inc, family =binomial(link='logit'), data = train)
summary(model.logistic.efficient)

## 
## Call:
## glm(formula = rating ~ experience + rcds + mstat + jtype + inc, 
##     family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8020  -0.8220   0.5026   0.7435   2.1265  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.6652624  0.5134002  -1.296  0.19505    
## experience      0.0797039  0.0092390   8.627  < 2e-16 ***
## rcdsyes_rec    -1.8706648  0.1409575 -13.271  < 2e-16 ***
## mstatmarried    1.1769886  0.5019141   2.345  0.01903 *  
## mstatseparated  0.2515748  0.5772756   0.436  0.66298    
## mstatsingle     1.0687477  0.5095907   2.097  0.03597 *  
## mstatwidow      2.0756001  0.7442350   2.789  0.00529 ** 
## jtypefreelance -0.7177196  0.1265894  -5.670 1.43e-08 ***
## jtypeothers    -0.8633203  0.2697631  -3.200  0.00137 ** 
## jtypepartime   -1.1545577  0.1689692  -6.833 8.32e-12 ***
## inc             0.0044754  0.0007503   5.965 2.45e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2646.7  on 2222  degrees of freedom
## Residual deviance: 2183.4  on 2212  degrees of freedom
## AIC: 2205.4
## 
## Number of Fisher Scoring iterations: 5

ROC Plot:

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

library(gplots)
library(ggplot2)
p <- predict(model.logistic, test, type="response")
pr <- prediction(p, test$rating)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
p2 <- predict(model.logistic.efficient, test, type="response")
pr2 <- prediction(p2, test$rating)
prf2 <- performance(pr2, measure = "tpr", x.measure = "fpr")
plot(prf2, main="ROC Curve", xlab="1-Specificity", ylab="Sensitivity")
par(new=TRUE)
plot(prf, xlab="", ylab="")
par(new=TRUE)
text(.4, .6,
     "model with all the variables")
par(new=TRUE)
text(.1, .8,
     "efficient model")
abline(a=0, b= 1)

What if you predict a credit rating as GOOD but it is BAD? What if you predict a credit rating as BAD but it is GOOD? There is an ideal model which predicts with %100 accuracy and classifies the credit ratings but how much similar to that is our model? The best model is the one which has the %100 accuracy for all the predictions. so if the client selects a model with AUC=1, the model has sensitivity and specificity of %100. In other words, the model prediction is correct for all the records on the test dataset. our efficient model has more AUC( Area Under Curve) and consequently better sensitivity on the same specificity which means it is more accurate. it means we can specify more records for the same sensitivity and is more accurate and gives better predictions.

There are 13 things which affect your prediction of the credit rating of a person who comes in. some of them have higher effects while others are not that effective. we cannot consider all those factors as long as your data history does not consist enough records for each of them. We found out that loan duration, age, quantum of expenses, assets, debt, loan amount requested and purchased amount of items does not clearly affect the credit rating. When we build the most efficient model based on those predictors, we found out that as long as we know the job experience (in years), records existence or non-existence, marital status, job type and level of income, we can predict accurately if the credit rating for that person is good or bad. Our model shows the order of good credit chance for the same people with different marital statuses is: Widow > married > single > separated >divorced Also, the order of good credit chance for the same people with different job types is: Fixed job > freelance > others > part time Moreover, the more level of income and job experience, the higher chance of good credit rating. Also no records greatly improve the chance of good credit rating.

Assignment 05

Keivan Najafian Zahmatkeshan

October 3, 2017