This toy example is taken from Machine Learning with R by Brett Lantz. Suppose we have four things with the following characteristics:
things <- data.frame(ingredient = c("grape", "green bean", "nuts", "orange"),
                     sweetness = c(8, 3, 3, 7),
                     crunchiness = c(5, 7, 6, 3),
                     class = c("fruit", "vegetable", "protein", "fruit"))
things
##   ingredient sweetness crunchiness     class
## 1      grape         8           5     fruit
## 2 green bean         3           7 vegetable
## 3       nuts         3           6   protein
## 4     orange         7           3     fruit
Suppose we wanted to classify a tomato into one of these classes. Is a tomato a fruit, a vegetable, or a protein? The tomato has the following characteristics:
tomato <- data.frame(ingredient = "tomato",
                     sweetness = 6,
                     crunchiness = 4)
We now have two data frames: the things data frame, where the items are already classified (the training data set), and the tomato data frame, where we don’t know the class of the item (the test data set). The kNN algorithm will classify tomato by calculating the distance between tomato and each item in the training data set. The item closest to tomato determines how tomato is classified: if the closest item is a fruit, tomato will be classified as a fruit; if the closest item is a vegetable, tomato will be classified as a vegetable, and so on. The distance between items is measured as the Euclidean distance in terms of sweetness and crunchiness.
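As a quick sanity check (this calculation is an addition to the original example), we can compute the Euclidean distance between the tomato and each of the four items by hand:
# distance from tomato (sweetness 6, crunchiness 4) to each item in things
sqrt((things$sweetness - tomato$sweetness)^2 +
       (things$crunchiness - tomato$crunchiness)^2)
The distances are roughly 2.24 (grape), 4.24 (green bean), 3.61 (nuts) and 1.41 (orange), so the orange is the nearest neighbor.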
We will use the knn() function from the class package. The function has three arguments: the training data frame, the test data frame, and a vector that gives the class of each item in the training data. Note that the training and test data frames must contain only the characteristics we use for classification - no other variables. The knn() function returns a vector of predictions, one for each item in the test data set. In our case there is only one item in the test data set.
library(class) #contains knn function
library(dplyr)
pred <- knn(select(things, sweetness, crunchiness),
            select(tomato, sweetness, crunchiness), things$class, k = 1)
pred
## [1] fruit
## Levels: fruit protein vegetable
OK, so knn says that tomato is a fruit. In terms of sweetness and crunchiness, tomato is closest to an orange, and since an orange is a fruit, we say tomato is a fruit as well. Let’s classify another item, say a carrot, by adding it to the tomato in our test data set.
unknown <- data.frame(ingredient = c("tomato", "carrot"),
                      sweetness = c(6, 4),
                      crunchiness = c(4, 9))
unknown
##   ingredient sweetness crunchiness
## 1     tomato         6           4
## 2     carrot         4           9
pred <- knn(select(things, sweetness, crunchiness),
            select(unknown, sweetness, crunchiness), things$class, k = 1)
pred
## [1] fruit vegetable
## Levels: fruit protein vegetable
Since we have two items in the test data set, kNN returned two predictions, one for the tomato and one for the carrot. The carrot is classified as a vegetable.
Sometimes we don’t want to classify an item based on just its single nearest neighbor, but rather on several neighbors, determining the class by majority vote among the k nearest neighbors. Let’s set k equal to 4 and see what happens.
pred <- knn(select(things, sweetness, crunchiness),
            select(unknown, sweetness, crunchiness), things$class, k = 4)
pred
## [1] fruit fruit
## Levels: fruit protein vegetable
Both tomato and carrot were classified as fruit. This is because the algorithm took the four nearest neighbors and chose the majority class among them. Since the four nearest neighbors are the entire training data set and two out of the four items are fruit, knn classified both tomato and carrot as fruit.
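To see the majority vote at work (again an added check, not part of the original example), we can compute the carrot’s distance to each training item:
# distance from carrot (sweetness 4, crunchiness 9) to each item in things
sqrt((things$sweetness - 4)^2 + (things$crunchiness - 9)^2)
The distances are roughly 5.66 (grape), 2.24 (green bean), 3.16 (nuts) and 6.71 (orange). With k=1 only the green bean (a vegetable) votes, but with k=4 all four items vote: two fruits, one vegetable and one protein, so the majority class is fruit.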
In this lab we will work with data on loans from the peer-to-peer lending website Lending Club. Our goal is to use the Nearest Neighbor algorithm to predict which loans will default. The data is publicly available on Lending Club’s website. (Though you may have to be a member to get the full data set.) We will work with data on loans issued between 2007 and 2011. Since the longest loan term is 5 years, most of these loans should by now be either paid off or delinquent. The first row of the .csv file contains a statement from Lending Club, so we skip it by adding skip=1 as an option in the read.csv() function.
library(dplyr)
library(ggplot2)
library(stargazer)
setwd("C:/Users/dvorakt/Documents/teaching/data analytics/case studies/lending club")
#loan <- read.csv("https://www.dropbox.com/s/vljs4z2r4wixful/LoanStats3a_securev1.csv?raw=1", skip=1)
loan <- read.csv("LoanStats3a_securev1.csv", skip=1)
#str(loan)
We see that this is a rather rich data set with over 40 thousand loans and 56 variables. The description of these variables is here.
The variable we would like to predict is loan_status. Let’s see what values it takes.
table(loan$loan_status)
## 
##                                                                  3
## Charged Off                                                   5310
## Current                                                       4012
## Default                                                         10
## Does not meet the credit policy. Status:Charged Off            755
## Does not meet the credit policy. Status:Current                 74
## Does not meet the credit policy. Status:Fully Paid            1913
## Does not meet the credit policy. Status:In Grace Period          1
## Does not meet the credit policy. Status:Late (16-30 days)        1
## Does not meet the credit policy. Status:Late (31-120 days)       5
## Fully Paid                                                   30239
## In Grace Period                                                 64
## Late (16-30 days)                                               12
## Late (31-120 days)                                             139
Clearly, it is not as simple as good loan versus bad loan. Also, there are three loans with a blank “” loan status. Looking at these observations we see that many variables are missing for these three loans. Therefore, we will filter these loans out.
loan <- filter(loan, loan_status!="")
For our purposes we will consider as “good” those loans with a status of “Fully Paid”, “Current”, or “Does not meet the credit policy. Status:Fully Paid”. Loans with any other status will be considered “bad”. Notice that the syntax for “or” in R is “|”.
loan$good <- ifelse(loan$loan_status == "Current" |
                    loan$loan_status == "Fully Paid" |
                    loan$loan_status == "Does not meet the credit policy. Status:Fully Paid",
                    "good", "bad")
table(loan$good)
##
## bad good
## 6371 36164
This gives us about 36 thousand good and 6 thousand bad loans.
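It is also worth noting how unbalanced the classes are. A quick look at the class shares (an added check) makes this clear:
prop.table(table(loan$good))
Only about 15% of the loans are bad; this base rate is worth keeping in mind when we judge the accuracy of our predictions later.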
Let’s take a look at two candidate variables that may predict good loans: the debt-to-income ratio and the FICO score. For the FICO score let’s use the average of the high and low values.
loan$fico <- (loan$fico_range_high+loan$fico_range_low)/2
The descriptive statistics of the two quantitative variables for the good loans are below. (Notice that the filter() function is nested inside select(), which is nested inside stargazer().)
stargazer(select(filter(loan, good == "good"),dti, fico), median = TRUE, type = "text")
##
## =====================================================
## Statistic N Mean St. Dev. Min Median Max
## -----------------------------------------------------
## dti 36,164 13.259 6.733 0.000 13.310 29.990
## fico 36,164 717.247 36.425 612 712 827
## -----------------------------------------------------
And the same summary for the bad loans:
stargazer(select(filter(loan, good == "bad"), dti, fico), median = TRUE, type = "text")
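As a side note, the same nested call can be written more readably with dplyr’s pipe operator; this rewrite is equivalent and is included here only as an illustration:
loan %>%
  filter(good == "good") %>%
  select(dti, fico) %>%
  stargazer(median = TRUE, type = "text")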
Comparing the two summaries, we see that debt-to-income ratios are lower for good loans, while FICO scores are higher. This is to be expected.
Let’s also plot the densities of these two variables for good and bad loans. When plotting densities, the key aesthetic is the x variable (the variable whose density we want to plot). By adding the color= aesthetic, ggplot will plot observations belonging to different values of the specified variable in different colors.
ggplot(aes(x = dti, color = factor(good)), data = loan) + geom_density()
ggplot(aes(x = fico, color = factor(good)), data = loan) + geom_density()
The graphs confirm that debt-to-income ratios tend to be higher for bad loans, and FICO scores tend to be lower. We also see a sharp drop-off below a FICO score of about 650, suggesting that Lending Club does not approve loans from borrowers with FICO scores below 650.
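We can also confirm the pattern numerically with a quick dplyr summary by loan quality (an added check, not part of the original lab):
loan %>%
  group_by(good) %>%
  summarize(mean_dti = mean(dti, na.rm = TRUE),
            mean_fico = mean(fico, na.rm = TRUE))
The bad loans should show a higher average dti and a lower average fico, consistent with the density plots.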
Let’s simplify our loan data set by keeping only the variables we plan to use.
loan <- loan %>% select(good, fico, dti)
Since the k-NN algorithm uses Euclidean distance, it is sensitive to the scale of the different variables. A variable measured in millions has a much larger influence on the overall Euclidean distance than a variable measured in tens. Therefore, it is typical to normalize or re-scale all variables so that their magnitudes are comparable. So that we don’t have to retype a long formula several times, we will write our own function and then apply it to our predictors.
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
This bit of code defines a new function which we called normalize(). This function takes a vector and returns a vector whose elements have been transformed according to the formula inside the function. The formula ensures that the new vector has elements between zero and one. Let’s apply this to our quantitative variables:
loan$fico_n <- normalize(loan$fico)
loan$dti_n <- normalize(loan$dti)
summary(loan[,c("fico", "fico_n")])
## fico fico_n
## Min. :612.0 Min. :0.0000
## 1st Qu.:687.0 1st Qu.:0.3488
## Median :712.0 Median :0.4651
## Mean :715.1 Mean :0.4793
## 3rd Qu.:742.0 3rd Qu.:0.6047
## Max. :827.0 Max. :1.0000
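We can run the same kind of check (an addition) on the normalized debt-to-income ratio; its range should also be from zero to one:
summary(loan[, c("dti", "dti_n")])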
We will use 80% of the loan data to train our model and the rest we will use to test our predictions. Since selecting the test and train observations is random, we will ‘set seed’ so that the computer generates the same set of random numbers each time we run this program. This makes our results reproducible.
set.seed(364)
We will use the function sample() to create a vector of random numbers. We will set it so that the numbers range from 1 to the number of observations in the loan data set (42,535, or nrow(loan)). We will also ask that the quantity of numbers (i.e. the length of this vector) be 80% of 42,535. By default sample() samples without replacement.
sample <- sample(nrow(loan),floor(nrow(loan)*0.8))
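A quick check (added here) confirms that sample has the expected length, namely 80% of the 42,535 loans:
length(sample)  # should equal floor(nrow(loan) * 0.8), i.e. 34,028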
Now we can subset the loan data set to create a train data set that contains the rows listed in the vector sample, and a test data set that contains the rows of loan not listed in sample.
train <- loan[sample,]
test <- loan[-sample,]
Let’s check that we have roughly the same proportion of good loans in both test and train data frames.
prop.table(table(train$good))
##
## bad good
## 0.1506406 0.8493594
prop.table(table(test$good))
##
## bad good
## 0.1463501 0.8536499
This looks good – the proportion of bad loans is roughly the same in train and test data.
The syntax of the knn() function requires that the first two arguments be the train and test data frames. These data frames should contain only the variables we want to use in the prediction. Therefore, we create two ‘clean’ versions of the train and test data sets by selecting just the variables we want to use as predictors.
train_knn <- select(train, fico_n, dti_n)
test_knn <- select(test, fico_n, dti_n)
The third argument is a vector of the values we are predicting (i.e. good vs. bad) in the training data set. Finally, we are ready to run the algorithm. The output is a set of predictions for the test data set. We will ask for predictions based on five nearest neighbors, i.e. k=5.
library(class)
pred <- knn(train_knn, test_knn, train$good, k = 5)
head(pred)
## [1] good good good good good good
## Levels: bad good
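As a quick check (an addition to the original), the prediction vector should have one element for each observation in the test data:
length(pred)  # should equal nrow(test), i.e. 8,507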
The vector pred has 8,507 elements. This is exactly the number of observations in the test data. The vector elements take on values of either “good” or “bad”. The k-NN algorithm took the characteristics of each loan in the test data set, calculated the Euclidean distance to the loans in the training data set, found the 5 closest ones, and, if the majority of those 5 were good, classified the loan as good. How did k-NN actually do?
We can evaluate the model by cross-tabulating the predictions against the actual class of the loans in the test data set.
library(gmodels) #contains CrossTable function
CrossTable(x = test$good, y = pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 8507
##
##
## | pred
## test$good | bad | good | Row Total |
## -------------|-----------|-----------|-----------|
## bad | 51 | 1194 | 1245 |
## | 0.041 | 0.959 | 0.146 |
## | 0.192 | 0.145 | |
## | 0.006 | 0.140 | |
## -------------|-----------|-----------|-----------|
## good | 214 | 7048 | 7262 |
## | 0.029 | 0.971 | 0.854 |
## | 0.808 | 0.855 | |
## | 0.025 | 0.828 | |
## -------------|-----------|-----------|-----------|
## Column Total | 265 | 8242 | 8507 |
## | 0.031 | 0.969 | |
## -------------|-----------|-----------|-----------|
##
##
We define accuracy as the percentage of cases correctly classified. In our case we classified 51 bad loans correctly as bad, and 7,048 good loans correctly as good. Thus, our accuracy is (51+7048)/8507=83.45%.
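The same accuracy can be computed directly from the prediction vector; this one-liner is an addition rather than part of the original output:
mean(pred == test$good)  # proportion of test loans classified correctly, about 0.8345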
Use the kNN algorithm and the things toy example to classify apple (sweetness=8, crunchiness=8) as either fruit, vegetable or protein (use k=1).
Load in the Lending Club data from this url “https://www.dropbox.com/s/vljs4z2r4wixful/LoanStats3a_securev1.csv?raw=1” (make sure to include the skip=1 option in your read.csv() call). Drop loans with an empty loan_status. Create the good variable as we did in class. Calculate the average FICO score as we did in class.
Do you think loan size (variable loan_amnt) would be a good feature for our model predicting defaults? Present evidence to support your answer.
Normalize the three variables/features (dti, fico and loan_amnt). Check that their range is from zero to one.
Split loan data into test and train. Use 80-20 split. (Use set.seed(364) so that we all get the same results.) Estimate predictions using the k-NN algorithm and the three predictors. Use k=5. (Keep in mind that the knn() function wants a ‘clean’ training and test data frames, i.e. data frames with just the predictor variables.)
Evaluate your predictions. Is the model with loan amount more accurate than the model without?
How well are we predicting bad loans? What percentage of bad loans were we able to correctly predict?
Suppose you change k from 5 to 200. What happens to your predictions? Can you explain why?
Change k to 2. What happens to your accuracy? What happens to your ability to detect bad loans? Do you think this algorithm is better than when k=5?