Load in the relevant packages

library(dplyr)
library(readr)
library(stringr)
library(tidyr)

Part 1: Data Loading and Cleanup

# set the working directory and load in the datasets
setwd("~/Desktop/OIDD245/Lab 4")
train <- read.csv("LoanStats3c.csv", skip = 1)
test <- read.csv("LoanStats3d.csv", skip = 1)

# keep untouched backup copies of the raw data
train.untouched <- train
test.untouched <- test

# check the dimensions and remove the last two rows
dim(train)
## [1] 235631    111
train <- train[-c(nrow(train) - 1, nrow(train)), ]
dim(train)
## [1] 235629    111
dim(test)
## [1] 421097    111
test <- test[-c(nrow(test) - 1, nrow(test)), ]
dim(test)
## [1] 421095    111
# set.seed() so that random draws produce the same results each time
set.seed(0)

Part 2: Descriptive Statistics

a. Using the training data, create a binary variable called highgrade that takes the value 1 when a loan has received an “A” or “B” grade and 0 otherwise. This is the response variable that you will predict.

table(train$grade)
## 
##           A     B     C     D     E     F     G 
##     0 36108 61935 66565 42992 20121  6223  1685
train$highgrade <- 0
train$highgrade[train$grade == "A" | train$grade == "B"] <- 1
table(train$highgrade)
## 
##      0      1 
## 137586  98043
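
An equivalent one-liner produces the same 0/1 variable (a sketch using %in%):

train$highgrade <- as.integer(train$grade %in% c("A", "B"))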

b. What is the proportion of loans in the training data that receive either an “A” or “B” grade?

table(train$highgrade)
## 
##      0      1 
## 137586  98043
prop.table(table(train$highgrade, train$highgrade == 1))
##    
##         FALSE      TRUE
##   0 0.5839095 0.0000000
##   1 0.0000000 0.4160905

Ans: 41.61% of loans in the training data received either an “A” or “B” grade.
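
Because highgrade is coded 0/1, the same proportion can also be read off as a simple mean (a one-line sketch):

mean(train$highgrade) # 0.4160905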

c. Using the t.test command, report results from t-tests indicating whether the proportion of loans that are highgrade is significantly different depending upon:


# create grouping columns for the comparisons we are interested in
train <- train %>% 
  mutate(abovemedincome = ifelse(annual_inc > median(annual_inc), 'above median income', 'below median income')) %>% 
  mutate(aboveloanrequest = ifelse(loan_amnt > median(loan_amnt), 'above median loan amount', 'below median loan amount')) %>% 
  mutate(renthome = ifelse(home_ownership == "RENT", 'rents home', 'does not rent home'))

c.1 Whether the debtor is above or below the median income level:
t.test(train$highgrade ~ train$abovemedincome) 
## 
##  Welch Two Sample t-test
## 
## data:  train$highgrade by train$abovemedincome
## t = 45.042, df = 231180, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.08738990 0.09534142
## sample estimates:
## mean in group above median income mean in group below median income 
##                         0.4642354                         0.3728698

Ans: The p-value < 2.2e-16 is effectively zero, so the difference is statistically significant: the proportion of loans that are highgrade differs depending on whether the debtor is above or below the median annual income. In particular, the mean in the above-median-income group is 46.4%, versus 37.3% in the below-median-income group.


c.2 Whether the loan request is above or below the median loan amount:

t.test(train$highgrade ~ train$aboveloanrequest) 
## 
##  Welch Two Sample t-test
## 
## data:  train$highgrade by train$aboveloanrequest
## t = -34.091, df = 235540, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07303066 -0.06508972
## sample estimates:
## mean in group above median loan amount 
##                              0.3814771 
## mean in group below median loan amount 
##                              0.4505373

Ans: Similar to the earlier part, the proportion of loans that are highgrade is significantly different depending on whether the loan request is above or below the median loan amount. In particular, the mean in the above-median-loan-amount group is 38.1%, while the mean in the below-median group is 45.1%.


c.3 Whether the debtor rents their home or not:

t.test(train$highgrade ~ train$renthome) 
## 
##  Welch Two Sample t-test
## 
## data:  train$highgrade by train$renthome
## t = 14.688, df = 199444, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.02638399 0.03450975
## sample estimates:
## mean in group does not rent home         mean in group rents home 
##                        0.4280667                        0.3976199

Ans: Similar to the earlier parts, the proportion of highgrade loans differs significantly between debtors who rent and those who do not. In particular, the highgrade proportion among renters is 39.8%, while among non-renters it is 42.8%.
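
The three tests above can also be run in one loop rather than call by call (a sketch, assuming the three columns created by the mutate chain in part c):

# run the same t-test for each grouping variable
for (v in c("abovemedincome", "aboveloanrequest", "renthome")) {
  print(t.test(train$highgrade ~ train[[v]]))
}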

Part 3: Building a Logistic Regression Model

a. Using the glm command, perform a logistic regression that predicts highgrade using annual income, home ownership, the loan amount, verification status, and purpose as the predictors. Use the summary command to view the output.

table(train$verification_status)
## 
##                    Not Verified Source Verified        Verified 
##               0           70659           97741           67229
table(train$purpose)
## 
##                                   car        credit_card 
##                  0               1832              55522 
## debt_consolidation   home_improvement              house 
##             143006              13045                750 
##     major_purchase            medical             moving 
##               3858               2331               1328 
##              other   renewable_energy     small_business 
##              10371                123               2277 
##           vacation            wedding 
##               1178                  8
model.lm <- glm(highgrade ~ annual_inc + home_ownership + loan_amnt + verification_status + purpose, data = train, family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model.lm)
## 
## Call:
## glm(formula = highgrade ~ annual_inc + home_ownership + loan_amnt + 
##     verification_status + purpose, family = binomial(), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.9499  -0.7030   1.1244   2.6029  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)                         8.188e+00  2.666e+01   0.307   0.7588
## annual_inc                          8.547e-06  1.216e-07  70.261  < 2e-16
## home_ownershipMORTGAGE             -8.055e+00  2.666e+01  -0.302   0.7626
## home_ownershipOWN                  -8.071e+00  2.666e+01  -0.303   0.7621
## home_ownershipRENT                 -8.180e+00  2.666e+01  -0.307   0.7590
## loan_amnt                          -3.895e-05  6.762e-07 -57.601  < 2e-16
## verification_statusSource Verified -6.533e-01  1.090e-02 -59.928  < 2e-16
## verification_statusVerified        -9.497e-01  1.245e-02 -76.262  < 2e-16
## purposecredit_card                  8.271e-01  4.978e-02  16.617  < 2e-16
## purposedebt_consolidation          -8.011e-02  4.925e-02  -1.627   0.1038
## purposehome_improvement            -3.269e-01  5.256e-02  -6.219 5.02e-10
## purposehouse                       -2.032e+00  1.385e-01 -14.673  < 2e-16
## purposemajor_purchase              -1.265e-01  5.963e-02  -2.121   0.0339
## purposemedical                     -1.177e+00  7.063e-02 -16.659  < 2e-16
## purposemoving                      -2.159e+00  1.037e-01 -20.814  < 2e-16
## purposeother                       -1.173e+00  5.481e-02 -21.394  < 2e-16
## purposerenewable_energy            -2.306e+00  3.299e-01  -6.990 2.74e-12
## purposesmall_business              -1.844e+00  8.677e-02 -21.251  < 2e-16
## purposevacation                    -1.294e+00  8.797e-02 -14.712  < 2e-16
## purposewedding                     -4.688e-01  7.629e-01  -0.614   0.5389
##                                       
## (Intercept)                           
## annual_inc                         ***
## home_ownershipMORTGAGE                
## home_ownershipOWN                     
## home_ownershipRENT                    
## loan_amnt                          ***
## verification_statusSource Verified ***
## verification_statusVerified        ***
## purposecredit_card                 ***
## purposedebt_consolidation             
## purposehome_improvement            ***
## purposehouse                       ***
## purposemajor_purchase              *  
## purposemedical                     ***
## purposemoving                      ***
## purposeother                       ***
## purposerenewable_energy            ***
## purposesmall_business              ***
## purposevacation                    ***
## purposewedding                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 319984  on 235628  degrees of freedom
## Residual deviance: 290586  on 235609  degrees of freedom
## AIC: 290626
## 
## Number of Fisher Scoring iterations: 6
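
Since the logistic regression reports coefficients on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret (a sketch using the fitted model above):

# convert log-odds coefficients to odds ratios
round(exp(coef(model.lm)), 3)
# e.g. exp(-0.9497) ≈ 0.387: a 'Verified' loan has roughly 0.39 times the odds
# of being highgrade compared to the 'Not Verified' baseline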

b. Use the predict command to generate a vector of the probabilities predicted by your logistic regression.

# type = "response" returns predicted probabilities rather than log-odds
train$outcomes_glm <- predict(model.lm, newdata = train, type = "response")

c. Create a new variable that classifies loans as being highgrade or not (1 or 0), based on your predicted probabilities. To do this, choose a probability threshold above which to classify loans as being high grade. Although – as we discussed in class – there are a variety of criteria to use when choosing this value, for now, try different values to select one that roughly optimizes the accuracy of the classifier (you do not need to go beyond the two-digit level), where accuracy is defined in the next bullet.

# let x be the probability threshold above which we classify a loan as highgrade

# Start near the 3rd quartile of the predicted probabilities for a first guess.
summary(train$outcomes_glm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02375 0.28624 0.39207 0.41609 0.53422 1.00000
# first try
x <- 0.51281
train$threshold <- 0
train$threshold[train$outcomes_glm >= x] <- 1
mean(train$threshold == train$highgrade) 
## [1] 0.6621511
# second try
x <- 0.6
train$threshold <- 0
train$threshold[train$outcomes_glm >= x] <- 1
mean(train$threshold == train$highgrade) 
## [1] 0.6421366
# we should not increase the threshold from the first try. 

# third try
x <- 0.5
train$threshold <- 0
train$threshold[train$outcomes_glm >= x] <- 1
mean(train$threshold == train$highgrade) 
## [1] 0.66241
# let's test if we should decrease it

# fourth try
x <- 0.45
train$threshold <- 0
train$threshold[train$outcomes_glm >= x] <- 1
mean(train$threshold == train$highgrade) 
## [1] 0.6561501
# we should keep it as 0.5 

Ans: The optimal probability threshold for classifying a loan as highgrade is 0.5, since it yields the highest proportion of rows where the predicted class matches the actual highgrade value. The accuracy of this classifier on the training data is 0.66241, or 66.24%.
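
The manual search above can be automated; a minimal sketch that sweeps thresholds at two-digit resolution and picks the accuracy maximizer:

# sweep candidate thresholds and compute accuracy at each
cands <- seq(0.01, 0.99, by = 0.01)
accs <- sapply(cands, function(x) mean((train$outcomes_glm >= x) == train$highgrade))
cands[which.max(accs)] # best threshold (~0.5, consistent with the trials above)
max(accs)              # accuracy at that threshold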

To compare:

# assigning 0 
x <- 0
train$threshold <- 0
mean(train$threshold == train$highgrade) # 0.5839095
## [1] 0.5839095
# assigning 0 and 1 randomly 
train$threshold <- sample(0:1, nrow(train), TRUE)
table(train$threshold)
## 
##      0      1 
## 117777 117852
mean(train$threshold == train$highgrade) # ~0.50; this value will change every time the code is run
## [1] 0.5002271

For comparison, the accuracy of a classifier that assigns a value of 0 to all rows is only 58.39%, and the accuracy of a classifier that randomly assigns 0 and 1 values is roughly 50%. Both are well below the 66.24% accuracy of the logistic regression classifier.

Part 4: Building a Classification Tree Model

library(rpart)
# install.packages("rpart.plot")
library(rpart.plot)
# where rpart stands for "Recursive partitioning and regression trees"
# install.packages("rattle")
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
model.classtree <- rpart(highgrade ~ annual_inc + home_ownership + loan_amnt + verification_status + purpose, data = train, method = "class")
summary(model.classtree)
## Call:
## rpart(formula = highgrade ~ annual_inc + home_ownership + loan_amnt + 
##     verification_status + purpose, data = train, method = "class")
##   n= 235629 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.10758545      0 1.0000000 1.0000000 0.002440419
## 2 0.02272472      1 0.8924146 0.8924146 0.002392148
## 3 0.01000000      3 0.8469651 0.8524831 0.002368711
## 
## Variable importance
##             purpose verification_status          annual_inc 
##                  54                  34                   8 
##           loan_amnt      home_ownership 
##                   3                   1 
## 
## Node number 1: 235629 observations,    complexity param=0.1075854
##   predicted class=0  expected loss=0.4160905  P(node) =1
##     class counts: 137586 98043
##    probabilities: 0.584 0.416 
##   left son=2 (180107 obs) right son=3 (55522 obs)
##   Primary splits:
##       purpose             splits as  -LRLLLLLLLLLLL, improve=4649.521, (0 missing)
##       verification_status splits as  -RLL, improve=4298.594, (0 missing)
##       loan_amnt           < 28012.5 to the right, improve=1446.233, (0 missing)
##       annual_inc          < 57988.1 to the left,  improve=1046.181, (0 missing)
##       home_ownership      splits as  -RRLL, improve= 114.601, (0 missing)
## 
## Node number 2: 180107 observations,    complexity param=0.02272472
##   predicted class=0  expected loss=0.360941  P(node) =0.7643669
##     class counts: 115099 65008
##    probabilities: 0.639 0.361 
##   left son=4 (127355 obs) right son=5 (52752 obs)
##   Primary splits:
##       verification_status splits as  -RLL, improve=2961.2500, (0 missing)
##       purpose             splits as  -R-RRLRLLLLLLR, improve=1247.7670, (0 missing)
##       loan_amnt           < 28012.5 to the right, improve= 967.6973, (0 missing)
##       annual_inc          < 59991   to the left,  improve= 882.0200, (0 missing)
##       home_ownership      splits as  -RRLL, improve= 140.4077, (0 missing)
##   Surrogate splits:
##       loan_amnt < 4987.5  to the right, agree=0.724, adj=0.057, (0 split)
## 
## Node number 3: 55522 observations
##   predicted class=1  expected loss=0.4050106  P(node) =0.2356331
##     class counts: 22487 33035
##    probabilities: 0.405 0.595 
## 
## Node number 4: 127355 observations
##   predicted class=0  expected loss=0.3025873  P(node) =0.5404895
##     class counts: 88819 38536
##    probabilities: 0.697 0.303 
## 
## Node number 5: 52752 observations,    complexity param=0.02272472
##   predicted class=1  expected loss=0.4981802  P(node) =0.2238774
##     class counts: 26280 26472
##    probabilities: 0.498 0.502 
##   left son=10 (22636 obs) right son=11 (30116 obs)
##   Primary splits:
##       annual_inc     < 54977   to the left,  improve=730.9184, (0 missing)
##       loan_amnt      < 4987.5  to the left,  improve=727.7945, (0 missing)
##       purpose        splits as  -R-RRLRLLLLLLL, improve=646.1673, (0 missing)
##       home_ownership splits as  --RRL, improve= 19.6824, (0 missing)
##   Surrogate splits:
##       loan_amnt      < 7987.5  to the left,  agree=0.617, adj=0.107, (0 split)
##       home_ownership splits as  --RLL, agree=0.614, adj=0.102, (0 split)
##       purpose        splits as  -R-RRRRRLRLRRL, agree=0.571, adj=0.001, (0 split)
## 
## Node number 10: 22636 observations
##   predicted class=0  expected loss=0.4058137  P(node) =0.09606627
##     class counts: 13450  9186
##    probabilities: 0.594 0.406 
## 
## Node number 11: 30116 observations
##   predicted class=1  expected loss=0.4260194  P(node) =0.1278111
##     class counts: 12830 17286
##    probabilities: 0.426 0.574
plot(model.classtree)
text(model.classtree)

fancyRpartPlot(model.classtree,
               type = 4)

# predicted classes from the tree on the training data
train$outcomes_ct <- predict(model.classtree, type = "class")
table(train$outcomes_ct)
## 
##      0      1 
## 149991  85638
mean(train$outcomes_ct == train$highgrade) # 0.6475858
## [1] 0.6475858

The accuracy of this tree-based classifier on the training data is 64.76%, slightly below the 66.24% accuracy of the logistic regression model (about 1.5 percentage points lower).
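
The CP table in the summary above also supports pruning; a sketch that selects the complexity parameter with the lowest cross-validated error (xerror) and prunes the tree:

# prune at the cp value with the minimum cross-validated error
cp.best <- model.classtree$cptable[which.min(model.classtree$cptable[, "xerror"]), "CP"]
model.pruned <- prune(model.classtree, cp = cp.best)
# here the lowest xerror already belongs to the full tree (cp = 0.01), so pruning changes nothing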

Part 5: Model Performance on the Test Data

a. Apply the logistic regression and classification tree models to the testing dataset.

# recode the 'educational' category of 'purpose' to NA: it never appears in the training data, so the fitted models cannot score it
test$purpose[test$purpose == "educational"] <- NA
table(test$purpose)
## 
##                                   car        credit_card 
##                  0               3466             102025 
## debt_consolidation        educational   home_improvement 
##             250020                  0              25293 
##              house     major_purchase            medical 
##               1438               7449               3938 
##             moving              other   renewable_energy 
##               2420              19204                224 
##     small_business           vacation            wedding 
##               3364               2249                  4
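
The recoding above is needed because 'educational' never occurs in the training data (see the purpose table in Part 3), so the fitted models cannot score that level. Run before the recoding, a sketch like this would flag any unseen levels:

# purpose levels present in the test data but absent from training
setdiff(unique(as.character(test$purpose)), unique(as.character(train$purpose)))
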
# create a dummy variable for highgrade in testing dataset
test$highgrade <- 0
test$highgrade[test$grade == "A" | test$grade == "B"] <- 1
table(test$highgrade)
## 
##      0      1 
## 230153 190942
# predict using the testing dataset
test$outcomes_lm = predict(model.lm, newdata = test, type = 'response')
test$outcomes_lm %>% head(5)
## [1] 0.2065605 0.1816427 0.5533284 0.5768101 0.7501927
test$outcomes_ct = predict(model.classtree, newdata = test, type="class")
test$outcomes_ct %>% head(5)
## [1] 0 0 1 1 1
## Levels: 0 1
# apply the 0.5 threshold chosen on the training data
x <- 0.5
test$threshold <- 0
test$threshold[test$outcomes_lm >= x] <- 1
mean(test$threshold == test$highgrade) # 0.6485995: the logistic regression model with the 0.5 threshold has an accuracy of 0.6485995 on the test data
## [1] 0.6485995
ct.accuracy <- nrow(test[which(test$outcomes_ct == test$highgrade), ]) / nrow(test)
print(ct.accuracy) # 0.6290053: the classification tree model has an accuracy of 0.6290053 on the test data
## [1] 0.6290053

b. As a benchmark, what is the accuracy of a classifier that flips a coin for each row in the test data, and predicts a 1 if it lands head and a 0 if it lands tails? As another benchmark, what is the accuracy of a classifier that simply assigns a value of 0 to all rows of the test data?

# coin-flip classifier: assign 0 or 1 at random to each row (via sample)

test$rolldice <- sample(0:1,nrow(test),replace=T)
mean(test$rolldice == test$highgrade) # ~0.50: the accuracy of a classifier that flips a coin for each row; this value will change each time the code is run.
## [1] 0.4993956
# all-zeros classifier: assign 0 to every row

x <- 0
test$threshold <- 0
mean(test$threshold == test$highgrade) # 0.5465584: the accuracy of a classifier that simply assigns 0 to all rows of the test data.
## [1] 0.5465584
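
Note that the all-zeros benchmark is simply the share of non-highgrade loans in the test data:

mean(test$highgrade == 0) # 0.5465584 (230153 / 421095)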

Part 6: Accuracy, Precision and Recall

a. Compute the precision (fraction of rows you identify as highgrade that are actually highgrade) and compute the recall (fraction of all loans that are highgrade that you identify as highgrade).

foo <- test %>% 
  mutate(outcomes_lm_highgrade = outcomes_lm > 0.5) %>% 
  mutate(highgrade_foo = ifelse(highgrade == 1, 'TRUE', 'FALSE')) %>% 
  select(outcomes_lm_highgrade,highgrade_foo) %>% 
  table() 
foo
##                      highgrade_foo
## outcomes_lm_highgrade  FALSE   TRUE
##                 FALSE 185835 103654
##                 TRUE   44318  87287
#                                    highgrade (actual)
# outcomes_lm_highgrade (predicted)  FALSE (negative)          TRUE (positive)
#   FALSE (negative)                 185835 (true negatives)   103654 (false negatives)
#   TRUE (positive)                  44318 (false positives)   87287 (true positives)

# Precision
# precision = true.positive/(false.positive+true.positive)
87287/(44318+87287) #0.6632499
## [1] 0.6632499
# Recall
# recall = true.positive/(false.negative+true.positive)
87287/(103654+87287) #0.4571412
## [1] 0.4571412
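
The same numbers can be pulled out of the confusion table programmatically rather than typed by hand (a sketch, assuming foo holds predictions in rows and actuals in columns, as built above):

# precision = TP / (TP + FP); recall = TP / (TP + FN)
precision <- foo["TRUE", "TRUE"] / sum(foo["TRUE", ]) # 0.6632499
recall <- foo["TRUE", "TRUE"] / sum(foo[, "TRUE"])    # 0.4571412
c(precision = precision, recall = recall)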

b. Among accuracy, precision, and recall, which do you think would be most important to focus on for this context (i.e. online lending)? What are the costs and benefits of these types of market transactions that would be important to consider when making this decision?

Answer:
First, let’s understand the meaning of accuracy, precision and recall in the context of online lending.

  • For accuracy, it is the ratio of correctly predicted observations (both highgrade and not) to the total observations: of all loans, how many did we classify correctly? The value we got from the test data was 0.6485995, or 64.86%.

  • For precision, it is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all loans labeled as high grade, how many were actually high grade? The value we got from the test data was 66.32%.

  • For recall, it is the ratio of correctly predicted positive observations to all observations that are actually positive. The question recall answers is: of all the loans that were actually high grade, how many did we identify? The value was 45.71%.

In my opinion, precision is the most important metric to focus on in this context.

  • Online P2P lending is riskier than bank lending; there is a higher chance of a borrower defaulting on the loan, since he/she is not scrutinized by a traditional banker.

  • Hence, it is very important that the model achieves high precision; otherwise we will fund loans that are actually low grade but were predicted to be highgrade. High precision ensures a lower default rate and a safer lending environment for the lender.

  • The benefit of these types of market transactions is lower interest rates, since no financial institution is needed as an intermediary and operational costs are therefore lower.

  • The cost is a higher default rate, since the checks are less stringent (e.g., a borrower might fabricate details on the online lending platform).

Completed by: Pitchaya Liewchanpatana, for OIDD245 Lab 4 P2P Lending. Thank you!