cut()naiveBayes()confusionMatrix()Let’s use the Lending Club data again. Clean it. Calculate fico etc.
library(dplyr)
#loan <- read.csv("https://www.dropbox.com/s/vljs4z2r4wixful/LoanStats3a_securev1.csv?raw=1", skip=1)
loan <- read.csv("C:/Users/dvorakt/Documents/teaching/data analytics/case studies/lending club/LoanStats3a_securev1.csv", skip=1)
loan <- filter(loan, loan_status!="")
loan$status <- ifelse(loan$loan_status=="Current" |
loan$loan_status=="Fully Paid" |
loan$loan_status=="Does not meet the credit policy. Status:Fully Paid","good","bad")
loan$status <- as.factor(loan$status)
loan$fico <- (loan$fico_range_high+loan$fico_range_low)/2
Naive Bayes calculates conditional probabilities given certain things happening. While in principle, it could calculate probability of default given that fico is 600 and 601 and 602 etc, these calculations would be highly imprecise given that there may be only one observation with a particular value of fico. Therefore, when using naive Bayes we turn quantitative or continuous variables into categorical variables. In this case, let’s create a new variable for both fico and dti that indicate whether the value is in the bottom 25%, middle 50% or top 25%.
summary(loan$fico)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 612.0 687.0 712.0 715.1 742.0 827.0
loan$ficocat <- cut(loan$fico, breaks=c(0,687,742,1000),
labels=c("bottom 25%","middle 50%", "top 25%"))
table(loan$ficocat)
##
## bottom 25% middle 50% top 25%
## 12280 20950 9305
summary(loan$dti)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 8.20 13.47 13.37 18.68 29.99
loan$dticat <- cut(loan$dti, breaks=c(0,8.2,18.68,100),
labels=c("bottom 25%","middle 50%", "top 25%"))
table(loan$dticat)
##
## bottom 25% middle 50% top 25%
## 10436 21266 10627
Naive Bayes works with categorical variables. Let’s consider the purpose of the loan:
table(loan$purpose)
##
## car credit_card
## 0 1615 5477
## debt_consolidation educational home_improvement
## 19776 422 3199
## house major_purchase medical
## 426 2311 753
## moving other renewable_energy
## 629 4425 106
## small_business vacation wedding
## 1992 400 1004
We see that Lending Club borrowers use money for a variety of purposes. Debt consolidation is the most common purpose of a Lending Club loan. Also, it appears that purpose has a factor level that is blank “” and has zero observations. Recall that we filtered out some observations with missing values above. The blank factor level however remained. Let’s take the blank level out, by re-encoding purpose as a factor using the factor() function. When applied it will create levels only if the level has observations.
loan$purpose <- factor(loan$purpose)
levels(loan$purpose)
## [1] "car" "credit_card" "debt_consolidation"
## [4] "educational" "home_improvement" "house"
## [7] "major_purchase" "medical" "moving"
## [10] "other" "renewable_energy" "small_business"
## [13] "vacation" "wedding"
We also see that some categories have a pretty low number of cases (e.g. ‘renewable_energy’). We may want to consolidate large number of categories with few observations into fewer categories with larger number of observations (especially if the categories are similar). In this case, let’s consolidate ‘renewable_energy’ into ‘other.’ We will use function fct_collapse() from package forcats to accomplish that.
library(forcats)
loan$purpose <- fct_collapse(loan$purpose, other=c("renewable_energy", "other"))
levels(loan$purpose)
## [1] "car" "credit_card" "debt_consolidation"
## [4] "educational" "home_improvement" "house"
## [7] "major_purchase" "medical" "moving"
## [10] "other" "small_business" "vacation"
## [13] "wedding"
table(loan$purpose)
##
## car credit_card debt_consolidation
## 1615 5477 19776
## educational home_improvement house
## 422 3199 426
## major_purchase medical moving
## 2311 753 629
## other small_business vacation
## 4531 1992 400
## wedding
## 1004
loan <- select(loan, status, ficocat, dticat, purpose)
set.seed(364)
loan$rand <- runif(nrow(loan))
train <- filter(loan, rand<=0.8)
test <- filter(loan, rand>0.8)
library(e1071)
classifier <- naiveBayes(status ~ ficocat+dticat+purpose,train)
classifier
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## bad good
## 0.1501923 0.8498077
##
## Conditional probabilities:
## ficocat
## Y bottom 25% middle 50% top 25%
## bad 0.4089132 0.4749805 0.1161063
## good 0.2688361 0.4937990 0.2373648
##
## dticat
## Y bottom 25% middle 50% top 25%
## bad 0.2153392 0.5063913 0.2782694
## good 0.2541214 0.5011280 0.2447506
##
## purpose
## Y car credit_card debt_consolidation educational
## bad 0.027560594 0.091868647 0.474589523 0.013682565
## good 0.039555049 0.134694442 0.462500432 0.009672850
## purpose
## Y home_improvement house major_purchase medical moving
## bad 0.068217357 0.011336982 0.041634089 0.020914777 0.017200938
## good 0.076450064 0.009776488 0.057726189 0.017583860 0.014163817
## purpose
## Y other small_business vacation wedding
## bad 0.119038311 0.087763878 0.008991400 0.017200938
## good 0.103879504 0.039969600 0.009776488 0.024251218
prediction <- predict(classifier, select(test, ficocat, dticat, purpose), type="raw")
summary(prediction)
## bad good
## Min. :0.04759 Min. :0.5984
## 1st Qu.:0.10488 1st Qu.:0.8169
## Median :0.14986 Median :0.8501
## Mean :0.15043 Mean :0.8496
## 3rd Qu.:0.18310 3rd Qu.:0.8951
## Max. :0.40160 Max. :0.9524
Prediction is a matrix with two columns (one for each class) and 8,507 rows (one for each loan in the test data frame). Doing a summary on this matrix tells us that the probability of a loan being good ranges from 0.61 to 0.95.
Now that we have the probabilities of a loan being good, we need to decide the threshold for classifying a loan as good. This depends on the contest of the problem. In our case let’s declare a loan as good if there is more than 75% probability that is will be good.
test$status_pred <- ifelse(prediction[,"good"] > 0.75, "good", "bad")
table(test$status_pred)
##
## bad good
## 341 8131
library(descr)
crosstab(test$status, test$status_pred, prop.t = TRUE, plot = FALSE)
## Cell Contents
## |-------------------------|
## | Count |
## | Total Percent |
## |-------------------------|
##
## ===================================
## test$status_pred
## test$status bad good Total
## -----------------------------------
## bad 98 1157 1255
## 1.2% 13.7%
## -----------------------------------
## good 243 6974 7217
## 2.9% 82.3%
## -----------------------------------
## Total 341 8131 8472
## ===================================
Our accuracy is about the same as the kNN algorithm in lab 8.
confusionMatrix() (kappa, sensitivity, specificity)Accuracy is not the only way of assessing our predictions. Function confusionMatrix() from the caret package calculates a number of additional metrics. It takes the vector of class predictions as the first argument and vector of actual classes as the second argument.
library(caret)
confusionMatrix(test$status_pred, test$status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 98 243
## good 1157 6974
##
## Accuracy : 0.8347
## 95% CI : (0.8267, 0.8426)
## No Information Rate : 0.8519
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0635
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.07809
## Specificity : 0.96633
## Pos Pred Value : 0.28739
## Neg Pred Value : 0.85771
## Prevalence : 0.14814
## Detection Rate : 0.01157
## Detection Prevalence : 0.04025
## Balanced Accuracy : 0.52221
##
## 'Positive' Class : bad
##
We see that the confusionMatrix function calculates accuracy (83%). Kappa is a statistic that summarizes how much of our accuracy is due to chance (i. e. just picking the most common class). Kappa ranges from 0 to 1. A kappa value of 0.06 is pretty low. There are two other statistics we will pay attention to. Sensitivity tells us the percentage of positives that were correctly classified. In our case positive class is ‘bad’, i.e. we are trying to detect bad loans. Of the 1255 bad loans we detected 98 making our sensitivity 7.8%. The other statistic is specificity which tells us the percentage correctly identified negatives. In our case the negative is a good loan. Of the 7,217 good loans we correctly identified 6,974 as good making our specificity 97%.
What happens when we raise the cut off probability for classifying loans as good?
Read this New York Times article entitled “How Companies Learn Your Secrets” Why is Target interested in identifying pregnant customers?
What is the strategy that Andrew Pole used to accomplish the Target marketers’ task?
The data at this link has information on 2,000 fictitious Target shoppers. The variables include the following:
| variable | description |
|---|---|
| gender | gender of shopper, M=male, F=female, U=unknown |
| address | address of shopper, H=home, A=apartment, P=POBox |
| Pregnancy.Test | = 1 if recently purchased pregnancy test, = 0 otherwise |
| Birth.Control | = 1 if recently purchased birth control, = 0 otherwise |
| … | … |
| pregnant | = 1 if someone in the shopper’s household is pregnant, = 0 otherwise |
Load this data into your R session. Summarize the data. Do the values seem reasonable? Are there any missing values? If so, eliminate observations with missing values.
Calculate the share of shoppers with pregnancy in the household.
Split the data into test and train. Use 75/25 split. Set seed to 364.
Estimate naive Bayes classifier using the recent purchase of a pregnancy test as the only predictor. Choose your own probability cut off for classifying pregnancy. How well does your algorithm do? (Hint: When using confusionMatix() use option , positive="1" to tell the function that positive result is “1”, i.e. pregnancy.)
Suppose you really wanted to avoid marketing baby products to someone who is not pregnant. How should you set the cut off for probability? Would this mean that your sensitivity will be high? What about your specificity?
Have fun with this data. Use whichever predictors you like, and however many predictors you like. Can you improve upon the first algorithm?