Lab 10: Naive Bayes II: Lending Club Data

Learning objectives

bin quantitative data using cut()
clean or consolidate categorical/factor variables
make predictions using naiveBayes()
turn probability into classification
evaluate predictions using confusionMatrix()
understand sensitivity and specificity

1. Introduction

Let’s use the Lending Club data again. Clean it. Calculate fico etc.

library(dplyr)
#loan <- read.csv("https://www.dropbox.com/s/vljs4z2r4wixful/LoanStats3a_securev1.csv?raw=1", skip=1)
loan <- read.csv("C:/Users/dvorakt/Documents/teaching/data analytics/case studies/lending club/LoanStats3a_securev1.csv", skip=1)
loan <- filter(loan, loan_status!="")
loan$status <- ifelse(loan$loan_status=="Current" | 
                      loan$loan_status=="Fully Paid" |
                          loan$loan_status=="Does not meet the credit policy.  Status:Fully Paid","good","bad")
loan$status <- as.factor(loan$status)
loan$fico <- (loan$fico_range_high+loan$fico_range_low)/2

2. Bin/dicretize quantitative variables

Naive Bayes calculates conditional probabilities given certain things happening. While in principle, it could calculate probability of default given that fico is 600 and 601 and 602 etc, these calculations would be highly imprecise given that there may be only one observation with a particular value of fico. Therefore, when using naive Bayes we turn quantitative or continuous variables into categorical variables. In this case, let’s create a new variable for both fico and dti that indicate whether the value is in the bottom 25%, middle 50% or top 25%.

summary(loan$fico)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   612.0   687.0   712.0   715.1   742.0   827.0

loan$ficocat <- cut(loan$fico, breaks=c(0,687,742,1000),
                    labels=c("bottom 25%","middle 50%", "top 25%"))
table(loan$ficocat)

## 
## bottom 25% middle 50%    top 25% 
##      12280      20950       9305

summary(loan$dti)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.20   13.47   13.37   18.68   29.99

loan$dticat <- cut(loan$dti, breaks=c(0,8.2,18.68,100),
                    labels=c("bottom 25%","middle 50%", "top 25%"))
table(loan$dticat)

## 
## bottom 25% middle 50%    top 25% 
##      10436      21266      10627

3. ‘Clean’ categorical/factor variables

Naive Bayes works with categorical variables. Let’s consider the purpose of the loan:

table(loan$purpose)

## 
##                                   car        credit_card 
##                  0               1615               5477 
## debt_consolidation        educational   home_improvement 
##              19776                422               3199 
##              house     major_purchase            medical 
##                426               2311                753 
##             moving              other   renewable_energy 
##                629               4425                106 
##     small_business           vacation            wedding 
##               1992                400               1004

We see that Lending Club borrowers use money for a variety of purposes. Debt consolidation is the most common purpose of a Lending Club loan. Also, it appears that purpose has a factor level that is blank “” and has zero observations. Recall that we filtered out some observations with missing values above. The blank factor level however remained. Let’s take the blank level out, by re-encoding purpose as a factor using the factor() function. When applied it will create levels only if the level has observations.

loan$purpose <- factor(loan$purpose)
levels(loan$purpose)

##  [1] "car"                "credit_card"        "debt_consolidation"
##  [4] "educational"        "home_improvement"   "house"             
##  [7] "major_purchase"     "medical"            "moving"            
## [10] "other"              "renewable_energy"   "small_business"    
## [13] "vacation"           "wedding"

4. Consolidate/combine factor categories

We also see that some categories have a pretty low number of cases (e.g. ‘renewable_energy’). We may want to consolidate large number of categories with few observations into fewer categories with larger number of observations (especially if the categories are similar). In this case, let’s consolidate ‘renewable_energy’ into ‘other.’ We will use function fct_collapse() from package forcats to accomplish that.

library(forcats)
loan$purpose <- fct_collapse(loan$purpose, other=c("renewable_energy", "other"))
levels(loan$purpose)

##  [1] "car"                "credit_card"        "debt_consolidation"
##  [4] "educational"        "home_improvement"   "house"             
##  [7] "major_purchase"     "medical"            "moving"            
## [10] "other"              "small_business"     "vacation"          
## [13] "wedding"

table(loan$purpose)

## 
##                car        credit_card debt_consolidation 
##               1615               5477              19776 
##        educational   home_improvement              house 
##                422               3199                426 
##     major_purchase            medical             moving 
##               2311                753                629 
##              other     small_business           vacation 
##               4531               1992                400 
##            wedding 
##               1004

5. Split loan data into test and train. Use 80-20 split.

loan <- select(loan, status, ficocat, dticat, purpose)
set.seed(364)
loan$rand <- runif(nrow(loan))
train <- filter(loan, rand<=0.8)
test <- filter(loan, rand>0.8)

6. Estimate naive Bayes and make predictions

library(e1071)
classifier <- naiveBayes(status ~ ficocat+dticat+purpose,train)
classifier

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##       bad      good 
## 0.1501923 0.8498077 
## 
## Conditional probabilities:
##       ficocat
## Y      bottom 25% middle 50%   top 25%
##   bad   0.4089132  0.4749805 0.1161063
##   good  0.2688361  0.4937990 0.2373648
## 
##       dticat
## Y      bottom 25% middle 50%   top 25%
##   bad   0.2153392  0.5063913 0.2782694
##   good  0.2541214  0.5011280 0.2447506
## 
##       purpose
## Y              car credit_card debt_consolidation educational
##   bad  0.027560594 0.091868647        0.474589523 0.013682565
##   good 0.039555049 0.134694442        0.462500432 0.009672850
##       purpose
## Y      home_improvement       house major_purchase     medical      moving
##   bad       0.068217357 0.011336982    0.041634089 0.020914777 0.017200938
##   good      0.076450064 0.009776488    0.057726189 0.017583860 0.014163817
##       purpose
## Y            other small_business    vacation     wedding
##   bad  0.119038311    0.087763878 0.008991400 0.017200938
##   good 0.103879504    0.039969600 0.009776488 0.024251218

prediction <- predict(classifier, select(test, ficocat, dticat, purpose), type="raw")
summary(prediction)

##       bad               good       
##  Min.   :0.04759   Min.   :0.5984  
##  1st Qu.:0.10488   1st Qu.:0.8169  
##  Median :0.14986   Median :0.8501  
##  Mean   :0.15043   Mean   :0.8496  
##  3rd Qu.:0.18310   3rd Qu.:0.8951  
##  Max.   :0.40160   Max.   :0.9524

Prediction is a matrix with two columns (one for each class) and 8,507 rows (one for each loan in the test data frame). Doing a summary on this matrix tells us that the probability of a loan being good ranges from 0.61 to 0.95.

7. Turn probability into classification

Now that we have the probabilities of a loan being good, we need to decide the threshold for classifying a loan as good. This depends on the contest of the problem. In our case let’s declare a loan as good if there is more than 75% probability that is will be good.

test$status_pred <- ifelse(prediction[,"good"] > 0.75, "good", "bad")
table(test$status_pred)

## 
##  bad good 
##  341 8131

library(descr)
crosstab(test$status, test$status_pred, prop.t = TRUE, plot = FALSE)

##    Cell Contents 
## |-------------------------|
## |                   Count | 
## |           Total Percent | 
## |-------------------------|
## 
## ===================================
##                test$status_pred
## test$status     bad    good   Total
## -----------------------------------
## bad             98    1157    1255 
##                1.2%   13.7%        
## -----------------------------------
## good           243    6974    7217 
##                2.9%   82.3%        
## -----------------------------------
## Total          341    8131    8472 
## ===================================

Our accuracy is about the same as the kNN algorithm in lab 8.

8. Evaluating predictions using `confusionMatrix()` (kappa, sensitivity, specificity)

Accuracy is not the only way of assessing our predictions. Function confusionMatrix() from the caret package calculates a number of additional metrics. It takes the vector of class predictions as the first argument and vector of actual classes as the second argument.

library(caret)
confusionMatrix(test$status_pred, test$status)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  bad good
##       bad    98  243
##       good 1157 6974
##                                           
##                Accuracy : 0.8347          
##                  95% CI : (0.8267, 0.8426)
##     No Information Rate : 0.8519          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0635          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.07809         
##             Specificity : 0.96633         
##          Pos Pred Value : 0.28739         
##          Neg Pred Value : 0.85771         
##              Prevalence : 0.14814         
##          Detection Rate : 0.01157         
##    Detection Prevalence : 0.04025         
##       Balanced Accuracy : 0.52221         
##                                           
##        'Positive' Class : bad             
##

We see that the confusionMatrix function calculates accuracy (83%). Kappa is a statistic that summarizes how much of our accuracy is due to chance (i. e. just picking the most common class). Kappa ranges from 0 to 1. A kappa value of 0.06 is pretty low. There are two other statistics we will pay attention to. Sensitivity tells us the percentage of positives that were correctly classified. In our case positive class is ‘bad’, i.e. we are trying to detect bad loans. Of the 1255 bad loans we detected 98 making our sensitivity 7.8%. The other statistic is specificity which tells us the percentage correctly identified negatives. In our case the negative is a good loan. Of the 7,217 good loans we correctly identified 6,974 as good making our specificity 97%.

9. Trade-off between sensitivity and specificity (IN-CLASS EXERCISE)

What happens when we raise the cut off probability for classifying loans as good?

Exercises:

Read this New York Times article entitled “How Companies Learn Your Secrets” Why is Target interested in identifying pregnant customers?
What is the strategy that Andrew Pole used to accomplish the Target marketers’ task?
The data at this link has information on 2,000 fictitious Target shoppers. The variables include the following:

variable	description
gender	gender of shopper, M=male, F=female, U=unknown
address	address of shopper, H=home, A=apartment, P=POBox
Pregnancy.Test	= 1 if recently purchased pregnancy test, = 0 otherwise
Birth.Control	= 1 if recently purchased birth control, = 0 otherwise
…	…
pregnant	= 1 if someone in the shopper’s household is pregnant, = 0 otherwise

Load this data into your R session. Summarize the data. Do the values seem reasonable? Are there any missing values? If so, eliminate observations with missing values.

Calculate the share of shoppers with pregnancy in the household.
Split the data into test and train. Use 75/25 split. Set seed to 364.
Estimate naive Bayes classifier using the recent purchase of a pregnancy test as the only predictor. Choose your own probability cut off for classifying pregnancy. How well does your algorithm do? (Hint: When using confusionMatix() use option , positive="1" to tell the function that positive result is “1”, i.e. pregnancy.)
Suppose you really wanted to avoid marketing baby products to someone who is not pregnant. How should you set the cut off for probability? Would this mean that your sensitivity will be high? What about your specificity?
Have fun with this data. Use whichever predictors you like, and however many predictors you like. Can you improve upon the first algorithm?