Importing the dataset
#Importing the dataset
library (readr)
mydf <- read_csv(file.choose())
## Parsed with column specification:
## cols(
## rating = col_character(),
## experience = col_integer(),
## homeown = col_character(),
## loandurn = col_integer(),
## age = col_integer(),
## mstat = col_character(),
## rcds = col_character(),
## jtype = col_character(),
## explvl = col_integer(),
## inc = col_integer(),
## assts = col_integer(),
## debt = col_integer(),
## loanamount = col_integer(),
## purchprice = col_integer()
## )
PART A
See some data
################################PART A##################################
#See some data
head(mydf)
## # A tibble: 6 x 14
## rating experience homeown loandurn age mstat rcds jtype
## <chr> <int> <chr> <int> <int> <chr> <chr> <chr>
## 1 good 9 rent 60 30 married no_rec freelance
## 2 good 17 rent 60 58 widow no_rec fixed
## 3 bad 10 owner 36 46 married yes_rec freelance
## 4 good 0 rent 60 24 single no_rec fixed
## 5 good 0 rent 36 26 single no_rec fixed
## 6 good 1 owner 60 36 married no_rec fixed
## # ... with 6 more variables: explvl <int>, inc <int>, assts <int>,
## # debt <int>, loanamount <int>, purchprice <int>
Explore structure of data
#Explore structure of data
str(mydf)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4446 obs. of 14 variables:
## $ rating : chr "good" "good" "bad" "good" ...
## $ experience: int 9 17 10 0 0 1 29 9 0 0 ...
## $ homeown : chr "rent" "rent" "owner" "rent" ...
## $ loandurn : int 60 60 36 60 36 60 60 12 60 48 ...
## $ age : int 30 58 46 24 26 36 44 27 32 41 ...
## $ mstat : chr "married" "widow" "married" "single" ...
## $ rcds : chr "no_rec" "no_rec" "yes_rec" "no_rec" ...
## $ jtype : chr "freelance" "fixed" "freelance" "fixed" ...
## $ explvl : int 73 48 90 63 46 75 75 35 90 90 ...
## $ inc : int 129 131 200 182 107 214 125 80 107 80 ...
## $ assts : int 0 0 3000 2500 0 3500 10000 0 15000 0 ...
## $ debt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ loanamount: int 800 1000 2000 900 310 650 1600 200 1200 1200 ...
## $ purchprice: int 846 1658 2985 1325 910 1645 1800 1093 1957 1468 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 14
## .. ..$ rating : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ experience: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ homeown : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ loandurn : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ age : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ mstat : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ rcds : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ jtype : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ explvl : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ inc : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ assts : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ debt : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ loanamount: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ purchprice: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
Now, I have to change the class of categorical and ordinal variables into factors
#Now, I have to change the class of categorical and ordinal variables into factors
mydf$rating <- as.factor(mydf$rating)
mydf$homeown <- as.factor(mydf$homeown)
mydf$mstat <- as.factor(mydf$mstat)
mydf$rcds <- as.factor(mydf$rcds)
mydf$jtype <- as.factor(mydf$jtype)
str(mydf)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4446 obs. of 14 variables:
## $ rating : Factor w/ 2 levels "bad","good": 2 2 1 2 2 2 2 2 2 1 ...
## $ experience: int 9 17 10 0 0 1 29 9 0 0 ...
## $ homeown : Factor w/ 6 levels "ignore","other",..: 6 6 3 6 6 3 3 4 3 4 ...
## $ loandurn : int 60 60 36 60 36 60 60 12 60 48 ...
## $ age : int 30 58 46 24 26 36 44 27 32 41 ...
## $ mstat : Factor w/ 5 levels "divorced","married",..: 2 5 2 4 4 2 2 4 2 2 ...
## $ rcds : Factor w/ 2 levels "no_rec","yes_rec": 1 1 2 1 1 1 1 1 1 1 ...
## $ jtype : Factor w/ 4 levels "fixed","freelance",..: 2 1 2 1 1 1 1 1 2 4 ...
## $ explvl : int 73 48 90 63 46 75 75 35 90 90 ...
## $ inc : int 129 131 200 182 107 214 125 80 107 80 ...
## $ assts : int 0 0 3000 2500 0 3500 10000 0 15000 0 ...
## $ debt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ loanamount: int 800 1000 2000 900 310 650 1600 200 1200 1200 ...
## $ purchprice: int 846 1658 2985 1325 910 1645 1800 1093 1957 1468 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 14
## .. ..$ rating : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ experience: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ homeown : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ loandurn : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ age : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ mstat : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ rcds : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ jtype : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ explvl : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ inc : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ assts : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ debt : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ loanamount: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ purchprice: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
Now, I am looking for missing values
#Now, I am looking for missing values
sum(is.na(mydf))
## [1] 0
Good for me!! we have no missing values
Now it is time to split the data set into training and testing subsets
#Now it is time to split the data set into training and testing subsets
train <- mydf[0:(0.5*nrow(mydf)),]
test <- mydf[((0.5*nrow(mydf))+1):nrow(mydf),]
head(train)
## # A tibble: 6 x 14
## rating experience homeown loandurn age mstat rcds jtype
## <fctr> <int> <fctr> <int> <int> <fctr> <fctr> <fctr>
## 1 good 9 rent 60 30 married no_rec freelance
## 2 good 17 rent 60 58 widow no_rec fixed
## 3 bad 10 owner 36 46 married yes_rec freelance
## 4 good 0 rent 60 24 single no_rec fixed
## 5 good 0 rent 36 26 single no_rec fixed
## 6 good 1 owner 60 36 married no_rec fixed
## # ... with 6 more variables: explvl <int>, inc <int>, assts <int>,
## # debt <int>, loanamount <int>, purchprice <int>
head(test)
## # A tibble: 6 x 14
## rating experience homeown loandurn age mstat rcds jtype
## <fctr> <int> <fctr> <int> <int> <fctr> <fctr> <fctr>
## 1 good 3 owner 60 30 married no_rec fixed
## 2 good 1 rent 60 35 widow no_rec fixed
## 3 good 22 parents 36 38 married no_rec fixed
## 4 good 30 owner 48 59 married no_rec fixed
## 5 bad 2 other 24 37 married yes_rec freelance
## 6 good 1 parents 48 40 married no_rec fixed
## # ... with 6 more variables: explvl <int>, inc <int>, assts <int>,
## # debt <int>, loanamount <int>, purchprice <int>
From now on, we work on training set. It is time for exoloratory and discriptive analysis. we use table method for categorical and ordinal variables and summary method for continuous variables
#From now on, we work on training set. It is time for exoloratory and discriptive analysis. we use table method for categorical and ordinal variables and summary method for continuous variables
table(train$rating)
##
## bad good
## 628 1595
table(train$homeown)
##
## ignore other owner parents priv rent
## 9 160 1072 362 104 516
table(train$mstat)
##
## divorced married separated single widow
## 20 1647 59 463 34
table(train$rcds)
##
## no_rec yes_rec
## 1896 327
table(train$jtype)
##
## fixed freelance others partime
## 1413 535 72 203
summary(train$experience)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 5.000 7.853 12.000 43.000
summary(train$loandurn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 36.00 48.00 46.36 60.00 60.00
summary(train$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 28.00 35.00 37.07 45.00 68.00
summary(train$explvl)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.00 45.00 60.00 60.91 75.00 173.00
summary(train$inc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 93.0 130.0 148.6 180.0 959.0
summary(train$assts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 3000 5467 6000 300000
summary(train$debt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 353.8 0.0 30000.0
summary(train$loanamount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100 700 1000 1028 1300 5000
summary(train$purchprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 105 1111 1386 1440 1683 8800
For a better analysis, we use ggplot plot the bar plots for those ordinal variables which are as factors, now:
#For a better analysis, we use ggplot plot the bar plots for those ordinal variables which are as factors, now:
library(ggplot2)
g <- ggplot(data = train, aes(rating, fill = rating)) +
geom_bar() +
scale_fill_discrete(name="Rating", labels=c("Bad", "good")) +
scale_x_discrete(breaks=c("0","1"),
labels=c("Bad","Good"))
g
g <- ggplot(data = train, aes(homeown, fill = homeown)) +
geom_bar() +
scale_fill_discrete(name="Homeown", labels=c("Ignore", "Other", "Owner", "Parents", "Private", "Rent")) +
scale_x_discrete(breaks=c("0","1","2","3","4","5"),
labels=c("Ignore", "Other", "Owner", "Parents", "Private", "Rent"))
g
g <- ggplot(data = train, aes(mstat, fill = mstat)) +
geom_bar() +
scale_fill_discrete(name="mstat", labels=c("divorced", "married", "separated", "single", "widow")) +
scale_x_discrete(breaks=c("0","1","2","3","4"),
labels=c("divorced", "married", "separated", "single", "widow"))
g
g <- ggplot(data = train, aes(rcds, fill = rcds)) +
geom_bar() +
scale_fill_discrete(name="rcds", labels=c(" no_rec", "yes_rec")) +
scale_x_discrete(breaks=c("0","1"),
labels=c("no_rec","yes_rec"))
g
g <- ggplot(data = train, aes(jtype, fill = jtype)) +
geom_bar() +
scale_fill_discrete(name="jtype", labels=c("fixed", "freelance", "others", "partime")) +
scale_x_discrete(breaks=c("0","1","2","3"),
labels=c("fixed", "freelance", "others", "partime"))
g
The rating shows the exact values of what we are going to predict. It is observed that: 1. most of the people in our training dataset have fixed jobs. 2. majority of the them have no records. 3. majority of them are married. 4. and they are mostly home owners. This analysis helps us to improve our model in the next steps and gives us an overview of how the varibales are distributed among records. we have the correlation coefficient for analyzing how the continuous variables affect our regression model, but we cannot use it for categorical variables. this analysis gives us that understanding before building our regression model. also logestic regression model requires quite large sample sizes. Because maximum likelihood estimates are less powerful than ordinary least squares (e.g., simple linear regression, multiple linear regression);so this analysis can ensure we have enough observations for prediction.
Two way analysis:
ggplot(train, aes(x=experience, y=rating)) + geom_point(size=2, shape=1)
ggplot(train, aes(x=loandurn, y=rating)) + geom_point(size=2, shape=1)
ggplot(train, aes(x=age, y=rating)) + geom_point(size=2, shape=1)
ggplot(train, aes(x=explvl, y=rating)) + geom_point(size=2, shape=1)
ggplot(train, aes(x=inc, y=rating)) + geom_point(size=2, shape=1)
ggplot(train, aes(x=assts, y=rating)) + geom_point(size=2, shape=1)
ggplot(train, aes(x=debt, y=rating)) + geom_point(size=2, shape=1)
ggplot(train, aes(x=loanamount, y=rating)) + geom_point(size=2, shape=1)
ggplot(train, aes(x=purchprice, y=rating)) + geom_point(size=2, shape=1)
But it does not bring us meaningful conclusions
PART B
Here we build our logestic model for predicting “Rating” based on other 13 predictors:
#Here we build our logestic model for predicting "Rating" based on other 13 predictors:
model.logistic <- glm(formula = rating ~ ., family =binomial(link='logit'), data = train)
summary(model.logistic)
##
## Call:
## glm(formula = rating ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1915 -0.6031 0.3956 0.6973 2.5219
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.738e-01 9.990e-01 -0.274 0.784032
## experience 7.646e-02 1.048e-02 7.293 3.03e-13 ***
## homeownother 3.322e-01 7.807e-01 0.425 0.670501
## homeownowner 1.733e+00 7.607e-01 2.278 0.022701 *
## homeownparents 1.404e+00 7.730e-01 1.816 0.069310 .
## homeownpriv 1.019e+00 7.911e-01 1.288 0.197857
## homeownrent 9.726e-01 7.646e-01 1.272 0.203349
## loandurn -5.372e-03 4.882e-03 -1.100 0.271222
## age -1.172e-02 6.999e-03 -1.675 0.094032 .
## mstatmarried 1.474e+00 5.415e-01 2.722 0.006481 **
## mstatseparated 1.728e-01 6.154e-01 0.281 0.778852
## mstatsingle 9.441e-01 5.443e-01 1.735 0.082814 .
## mstatwidow 1.976e+00 8.173e-01 2.417 0.015639 *
## rcdsyes_rec -1.874e+00 1.539e-01 -12.176 < 2e-16 ***
## jtypefreelance -8.816e-01 1.429e-01 -6.168 6.93e-10 ***
## jtypeothers -1.113e+00 3.032e-01 -3.671 0.000241 ***
## jtypepartime -1.275e+00 1.813e-01 -7.034 2.00e-12 ***
## explvl -1.271e-02 3.813e-03 -3.334 0.000855 ***
## inc 6.200e-03 8.841e-04 7.012 2.34e-12 ***
## assts 2.638e-05 9.123e-06 2.891 0.003834 **
## debt -1.920e-04 4.640e-05 -4.138 3.50e-05 ***
## loanamount -1.975e-03 2.468e-04 -8.002 1.23e-15 ***
## purchprice 1.022e-03 1.869e-04 5.467 4.57e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2646.7 on 2222 degrees of freedom
## Residual deviance: 1954.8 on 2200 degrees of freedom
## AIC: 2000.8
##
## Number of Fisher Scoring iterations: 5
Now we conclude that the variables which matter for predicting the “Rating” variable are : “Purchprice”, “loanmount”, “debt”, “assts”, “inc”, “explvl”, “jty”, “experience”, “mstat”
Now we apply the test dataset to our model and plot the decision tree:
library(rpart)
library(rpart.plot)
model.tree <- rpart(rating ~ ., train, method = "class")
rpart.plot(model.tree, type=1, extra = 102)
probs <- predict(model.tree, test, type = "prob")[,2]
Decision tree interpretation: the leaves of the decision tree show the distribution of the chance of having a good or bad credit rating among different values of significant variables we considered. If records exist for the applicants and their experience level is less than 2.5 years, the credit rating will be bad by chance of 94/120. It holds 5% of the credit ratings chance and is bad. But if their experience level is greater than 2.5 years while their quantum of assets is less than 574, the credit rating will be bad by chance of 52/82. It holds 4% of the credit ratings chance and is bad. But if the quantum of assets for those is greater than 574, the credit rating will be good by chance of 77/125. For the right stem, if records do not exist for the applicants, their experience level is greater than 1.5 years and their level of income is greater than 102, the credit rating will be good by chance of 911/1021. While if all is the same but the level of income is less than 102, it depends on the quantum of expenses. If it is greater than 82, the credit rating will be bad by chance of 29/44 while if the quantum of expenses is less than 82, the credit rating will be good by chance of 275/371. Finally, if records do not exist for the applicants, their experience level is less than 1.5 years and they have a fixed job, the credit rating will be good by chance of 144/206. But if they don’t have a fixed job but their assets are more than 2750, they still have the chance of a good credit rating, equal to 60/93. And if they don’t have assets greater than 3750, they will have a bad credit rating by the chance of 104/161.
PARTC
The most efficient model for predicting the credit rating can be derived by different cut offs but a simple decision can be made by considering those coeffienct with the greatest coefficient values and greater level of significance. the greater level of significance means better the effect of that variable is considered in the regression model and the greater coefficient value means the more probability of good credit rating.
model.logistic.efficient <- glm(formula = rating ~ experience+rcds+mstat+jtype+inc, family =binomial(link='logit'), data = train)
summary(model.logistic.efficient)
##
## Call:
## glm(formula = rating ~ experience + rcds + mstat + jtype + inc,
## family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8020 -0.8220 0.5026 0.7435 2.1265
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6652624 0.5134002 -1.296 0.19505
## experience 0.0797039 0.0092390 8.627 < 2e-16 ***
## rcdsyes_rec -1.8706648 0.1409575 -13.271 < 2e-16 ***
## mstatmarried 1.1769886 0.5019141 2.345 0.01903 *
## mstatseparated 0.2515748 0.5772756 0.436 0.66298
## mstatsingle 1.0687477 0.5095907 2.097 0.03597 *
## mstatwidow 2.0756001 0.7442350 2.789 0.00529 **
## jtypefreelance -0.7177196 0.1265894 -5.670 1.43e-08 ***
## jtypeothers -0.8633203 0.2697631 -3.200 0.00137 **
## jtypepartime -1.1545577 0.1689692 -6.833 8.32e-12 ***
## inc 0.0044754 0.0007503 5.965 2.45e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2646.7 on 2222 degrees of freedom
## Residual deviance: 2183.4 on 2212 degrees of freedom
## AIC: 2205.4
##
## Number of Fisher Scoring iterations: 5
ROC Plot:
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(gplots)
library(ggplot2)
p <- predict(model.logistic, test, type="response")
pr <- prediction(p, test$rating)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
p2 <- predict(model.logistic.efficient, test, type="response")
pr2 <- prediction(p2, test$rating)
prf2 <- performance(pr2, measure = "tpr", x.measure = "fpr")
plot(prf2, main="ROC Curve", xlab="1-Specificity", ylab="Sensitivity")
par(new=TRUE)
plot(prf, xlab="", ylab="")
par(new=TRUE)
text(.4, .6,
"model with all the variables")
par(new=TRUE)
text(.1, .8,
"efficient model")
abline(a=0, b= 1)
What if you predict a credit rating as GOOD but it is BAD? What if you predict a credit rating as BAD but it is GOOD? There is an ideal model which predicts with %100 accuracy and classifies the credit ratings but how much similar to that is our model? The best model is the one which has the %100 accuracy for all the predictions. so if the client selects a model with AUC=1, the model has sensitivity and specificity of %100. In other words, the model prediction is correct for all the records on the test dataset. our efficient model has more AUC( Area Under Curve) and consequently better sensitivity on the same specificity which means it is more accurate. it means we can specify more records for the same sensitivity and is more accurate and gives better predictions.
There are 13 things which affect your prediction of the credit rating of a person who comes in. some of them have higher effects while others are not that effective. we cannot consider all those factors as long as your data history does not consist enough records for each of them. We found out that loan duration, age, quantum of expenses, assets, debt, loan amount requested and purchased amount of items does not clearly affect the credit rating. When we build the most efficient model based on those predictors, we found out that as long as we know the job experience (in years), records existence or non-existence, marital status, job type and level of income, we can predict accurately if the credit rating for that person is good or bad. Our model shows the order of good credit chance for the same people with different marital statuses is: Widow > married > single > separated >divorced Also, the order of good credit chance for the same people with different job types is: Fixed job > freelance > others > part time Moreover, the more level of income and job experience, the higher chance of good credit rating. Also no records greatly improve the chance of good credit rating.