Lab 6: Naive Bayes II: Lending Club Data

Learning objectives

bin quantitative data using cut()
clean or consolidate categorical/factor variables
make predictions using naiveBayes()
turn probability into classification
evaluate predictions using confusionMatrix()
understand sensitivity and specificity

1. Introduction

Let’s use the Lending Club data again. Clean it. Calculate fico etc.

library(dplyr)
library(ggplot2)
library(stargazer)
setwd("C:/Users/dvorakt/Documents/teaching/data analytics/case studies/lending club")
#loan <- read.csv("https://www.dropbox.com/s/vljs4z2r4wixful/LoanStats3a_securev1.csv?raw=1", skip=1)
loan <- read.csv("LoanStats3a_securev1.csv", skip=1)
loan <- filter(loan, loan_status!="")
loan$good <- ifelse(loan$loan_status=="Current" | 
                      loan$loan_status=="Fully Paid" |
                          loan$loan_status=="Does not meet the credit policy.  Status:Fully Paid","good","bad")
loan$good <- as.factor(loan$good)
loan$fico <- (loan$fico_range_high+loan$fico_range_low)/2

2. Bin/dicretize quantitative variables

Naive Bayes calculates conditional probabilities given certain things happening. While in principle, it could calculate probability of default given that fico is 600 and 601 and 602 etc, these calculations would be highly imprecise given that there may be only one observation with a particular value of fico. Therefore, when using naive Bayes we turn quantitative or continuous variables into categorical variables. In this case, let’s create a new variable for both fico and dti that indicate whether the value is in the bottom 25%, middle 50% or top 25%.

summary(loan$fico)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   612.0   687.0   712.0   715.1   742.0   827.0

loan$ficocat <- cut(loan$fico, breaks=c(0,687,742,1000),
                    labels=c("bottom 25%","middle 50%", "top 25%"))
table(loan$ficocat)

## 
## bottom 25% middle 50%    top 25% 
##      12280      20950       9305

summary(loan$dti)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.20   13.47   13.37   18.68   29.99

loan$dticat <- cut(loan$dti, breaks=c(0,8.2,18.68,100),
                    labels=c("bottom 25%","middle 50%", "top 25%"))
table(loan$dticat)

## 
## bottom 25% middle 50%    top 25% 
##      10436      21266      10627

3. ‘Clean’ categorical/factor variables

Naive Bayes works with categorical variables. Let’s consider the purpose of the loan:

table(loan$purpose)

## 
##                                   car        credit_card 
##                  0               1615               5477 
## debt_consolidation        educational   home_improvement 
##              19776                422               3199 
##              house     major_purchase            medical 
##                426               2311                753 
##             moving              other   renewable_energy 
##                629               4425                106 
##     small_business           vacation            wedding 
##               1992                400               1004

levels(loan$purpose)

##  [1] ""                   "car"                "credit_card"       
##  [4] "debt_consolidation" "educational"        "home_improvement"  
##  [7] "house"              "major_purchase"     "medical"           
## [10] "moving"             "other"              "renewable_energy"  
## [13] "small_business"     "vacation"           "wedding"

We see that Lending Club borrowers use money for a variety of purposes. Debt consolidation is the most common purpose of a Lending Club loan. Also, it appears that purpose has a factor level that is blank “” and has zero observations. Recall that we filtered out some observations with missing values above. The blank factor level however remained. Let’s take the blank level out, by re-encoding purpose as a factor using the factor() function. When applied it will create levels only if the level has observations.

loan$purpose <- factor(loan$purpose)
levels(loan$purpose)

##  [1] "car"                "credit_card"        "debt_consolidation"
##  [4] "educational"        "home_improvement"   "house"             
##  [7] "major_purchase"     "medical"            "moving"            
## [10] "other"              "renewable_energy"   "small_business"    
## [13] "vacation"           "wedding"

4. Consolidate/combine factor categories

We also see that some categories have a pretty low number of cases (e.g. ‘renewable_energy’). We may want to consolidate large number of categories with few observations into fewer categories with larger number of observations (especially if the categories are similar). In this case, let’s consolidate ‘renewable_energy’ into ‘other.’ This is a little cumbersome to do, but the code below accomplishes that.

#create a character vector with level 'labels' where 'renewable energy' is replaced with 'other'
loan$test <- ifelse(loan$purpose=="renewable_energy", "other", as.character(loan$purpose))
#turn the character vector back into a factor
loan$purpose <- as.factor(loan$test)
table(loan$purpose)

## 
##                car        credit_card debt_consolidation 
##               1615               5477              19776 
##        educational   home_improvement              house 
##                422               3199                426 
##     major_purchase            medical             moving 
##               2311                753                629 
##              other     small_business           vacation 
##               4531               1992                400 
##            wedding 
##               1004

levels(loan$purpose)

##  [1] "car"                "credit_card"        "debt_consolidation"
##  [4] "educational"        "home_improvement"   "house"             
##  [7] "major_purchase"     "medical"            "moving"            
## [10] "other"              "small_business"     "vacation"          
## [13] "wedding"

5. Split loan data into test and train. Use 80-20 split.

loan <- select(loan, good, ficocat, dticat, purpose)
set.seed(364)
sample <- sample(nrow(loan),floor(nrow(loan)*0.8))
train <- loan[sample,]
test <- loan[-sample,]

6. Estimate naive Bayes and make predictions

library(e1071)
classifier <- naiveBayes(good ~ ficocat+dticat+purpose,train)
classifier

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##       bad      good 
## 0.1506406 0.8493594 
## 
## Conditional probabilities:
##       ficocat
## Y      bottom 25% middle 50%   top 25%
##   bad   0.4044089  0.4802965 0.1152946
##   good  0.2684589  0.4947062 0.2368348
## 
##       dticat
## Y      bottom 25% middle 50%   top 25%
##   bad   0.2124485  0.5105046 0.2770469
##   good  0.2527729  0.5030423 0.2441848
## 
##       purpose
## Y              car credit_card debt_consolidation educational
##   bad  0.026531409 0.097932111        0.473858759 0.014241124
##   good 0.039305238 0.134800360        0.461144557 0.009445713
##       purpose
## Y      home_improvement       house major_purchase     medical      moving
##   bad       0.068084276 0.010729614    0.039797113 0.020093640 0.017362466
##   good      0.077572486 0.009653311    0.057331673 0.017542039 0.014324268
##       purpose
## Y            other small_business    vacation     wedding
##   bad  0.119781506    0.085056574 0.009364027 0.017167382
##   good 0.103937444    0.040758425 0.009307314 0.024877171

prediction <- predict(classifier, select(test, ficocat, dticat, purpose), type="raw")
summary(prediction)

##       bad              good       
##  Min.   :0.0467   Min.   :0.6125  
##  1st Qu.:0.1082   1st Qu.:0.8162  
##  Median :0.1522   Median :0.8478  
##  Mean   :0.1511   Mean   :0.8489  
##  3rd Qu.:0.1838   3rd Qu.:0.8918  
##  Max.   :0.3875   Max.   :0.9533

Prediction is a matrix with two columns (one for each class) and 8,507 rows (one for each loan in the test data frame). Doing a summary on this matrix tells us that the probability of a loan being good ranges from 0.61 to 0.95.

7. Turn probability into classification

Now that we have the probabilities of a loan being good, we need to decide the threshold for classifying a loan as good. This depends on the contest of the problem. In our case let’s declare a loan as good if there is more than 75% probability that is will be good.

test$good_pred <- ifelse(prediction[,"good"] > 0.75, "good", "bad")
table(test$good_pred)

## 
##  bad good 
##  346 8161

library(gmodels)
CrossTable(x=test$good, y=test$good_pred, prop.chisq=FALSE, prop.r=FALSE, prop.c=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  8507 
## 
##  
##              | test$good_pred 
##    test$good |       bad |      good | Row Total | 
## -------------|-----------|-----------|-----------|
##          bad |       117 |      1128 |      1245 | 
##              |     0.014 |     0.133 |           | 
## -------------|-----------|-----------|-----------|
##         good |       229 |      7033 |      7262 | 
##              |     0.027 |     0.827 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       346 |      8161 |      8507 | 
## -------------|-----------|-----------|-----------|
## 
##

We get accuracy of (117+7033)/8507= 84 %. (This is a tiny bit better than with the kNN algorithm in lab 4.)

8. Evaluating predictions using `confusionMatrix()` (kappa, sensitivity, specificity)

Accuracy is not the only way of assessing our predictions. Function confusionMatrix() from the caret package calculates a number of additional metrics. It takes the vector of class predictions as the first argument and vector of actual classes as the second argument.

library(caret)
confusionMatrix(test$good_pred, test$good)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  bad good
##       bad   117  229
##       good 1128 7033
##                                           
##                Accuracy : 0.8405          
##                  95% CI : (0.8325, 0.8482)
##     No Information Rate : 0.8536          
##     P-Value [Acc > NIR] : 0.9997          
##                                           
##                   Kappa : 0.0891          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.09398         
##             Specificity : 0.96847         
##          Pos Pred Value : 0.33815         
##          Neg Pred Value : 0.86178         
##              Prevalence : 0.14635         
##          Detection Rate : 0.01375         
##    Detection Prevalence : 0.04067         
##       Balanced Accuracy : 0.53122         
##                                           
##        'Positive' Class : bad             
##

We see that the confusionMatrix function calculates accuracy (84%). Kappa is a statistic that summarizes how much of our accuracy is due to chance (i. e. just picking the most common class). Kappa ranges from 0 to 1. A kappa value of 0.089 is pretty low. There are two other statistics we will pay attention to. Sensitivity tells us the percentage of positives that were correctly classified. In our case positive class is ‘bad’, i.e. we are trying to detect bad loans. Of the 1245 bad loans we detected 117 making our sensitivity 9.4%. The other statistic is specificity which tells us the percentage correctly identified negatives. In our case the negative is a good loan. Of the 7,262 good loans we correctly identified 7,033 as good making our specificity 97%.

9. Trade-off between sensitivity and specificity (IN CLASS EXERCISE)

What happens when we raise the cut off probability for classifying loans as good?

test$good_pred <- ifelse(prediction[,"good"] > 0.85, "good", "bad")
confusionMatrix(test$good_pred, test$good)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  bad good
##       bad   904 3867
##       good  341 3395
##                                          
##                Accuracy : 0.5053         
##                  95% CI : (0.4947, 0.516)
##     No Information Rate : 0.8536         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0891         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.7261         
##             Specificity : 0.4675         
##          Pos Pred Value : 0.1895         
##          Neg Pred Value : 0.9087         
##              Prevalence : 0.1464         
##          Detection Rate : 0.1063         
##    Detection Prevalence : 0.5608         
##       Balanced Accuracy : 0.5968         
##                                          
##        'Positive' Class : bad            
##

Exercises

Read this New York Times article entitled “How Companies Learn Your Secrets” Why is Target interested in identifying pregnant customers?
What is the strategy that Andrew Pole used to accomplish the Target marketer’s task?
The data at this link has information on 2,000 fictitious Target shoppers. The variables include the following:

variable	description
gender	gender of shopper, M=male, F=female, U=unknown
address	address of shopper, H=home, A=apartment, P=POBox
Pregnancy.Test	= 1 if recently purchased pregnancy test, = 0 otherwise
Birth.Control	= 1 if recently purchased birth control, = 0 otherwise
…	…
pregnant	= 1 if someone in the shopper’s household is pregnant, = 0 otherwise

Load this data into your R session. Summarize the data. Do the values seem reasonable? Are there any missing values? If so, eliminate observations with missing values.

Calculate the share of shoppers with pregnancy in the household.
Split the data into test and train. Use 75/25 split. Set seed to 364.
Estimate naive Bayes classifier using the recent purchase of a pregnancy test as the only predictor. Choose your own probability cut off for classifying pregnancy. How well does your algorithm do? (Hint: When using confusionMatix() use option , positive="1" to tell the function that positive result is “1”“, i.e. pregnancy.)
Suppose you really wanted to avoid marketing baby products to someone who is not pregnant. How should you set the cut off for probability? Would this mean that your sensitivity will be high? What about your specificity?
Have fun with this data. Use whichever predictors you like, and however many predictors you like. Can you improve upon the first algorithm?