Lending Club is an online, peer-to-peer marketplace that connects borrowers and investors. This assignment asks you to do some predictive modeling on a dataset of past Lending Club loans, including loan details and information about the borrowers. A full data dictionary can be found in LCDataDictionary.xlsx.
The goal of this assignment is to get hands-on practice with data cleaning, feature engineering, and predictive modeling algorithms beyond the basics, including classification trees, kNNs, and regularized logistic regression. You will also practice creating and interpreting ROC and lift curves.
You will be predicting whether loans were paid in full or not. Your intended use case is to help an organization decide which loans to “flag” as potentially risky investments.
RUBRIC: There are three possible grades on this assignment: Fail (F), Pass (P), and High Pass (H). If you receive an F then you will have one more chance to turn it in to receive a P. If you receive H on 3 out of the 4 assignments this semester you’ll get a bonus point on your final average.
Note that this assignment is somewhat open-ended and there are many ways to answer these questions. I don’t require that we have exactly the same answers in order for you to receive full credit.
The following code block does some intial setup, including:
lc <- read_csv("LendingClub_LoanStats_2011_v2.csv") #read the Lending Club dataset in R
#create target variable: fully paid
#remove any rows where y is NA
lc <- lc %>%
mutate(y = as.factor(ifelse(loan_status == "Fully Paid", "Paid", "Not Paid"))) %>%
filter(!is.na(y))
#set seed and randomly downsample 15k instances
#(otherwise training kNN will take hours)
set.seed(1)
lc_small <- sample(nrow(lc), 15000)
lc <- lc[lc_small,]
#then calculate the training/validation row numbers, but don't yet split
va_inst <- sample(nrow(lc), .3*nrow(lc))
What is the mean loan amount in this dataset??
ANSWER: The loan amount in this dataset is $10,993.60.
loan_mean <- lc %>%
summarise(mean_amt = mean(loan_amnt))
ANSWER TO QUESTION 1a HERE:
data_clean <- lc %>%#Selecting the variables that are needed to be processed and cleaned
select('grade', 'sub_grade', 'home_ownership', 'addr_state', 'loan_amnt', 'emp_length', 'annual_inc','purpose',
'dti', 'mths_since_last_delinq', 'int_rate', 'y') %>%
mutate(#Checking for NAs and adding NULL values if present
grade = as.factor(ifelse(is.na(grade), NULL, grade)),
sub_grade = as.factor(ifelse(is.na(sub_grade), NULL, sub_grade)),
home_ownership = as.factor(ifelse(is.na(home_ownership), NULL, home_ownership)),
addr_state = as.factor(ifelse(is.na(addr_state), NULL, addr_state)),
#Replacing NAs with mean value
loan_amnt = ifelse(is.na(loan_amnt), mean(loan_amnt, na.rm = TRUE), loan_amnt),
#Grouping into bins
emp_length = as.factor(case_when(
emp_length %in% c('< 1 year') ~ '< 1 year',
emp_length %in% c('1 year', '2 years', '3 years') ~ '1-3 years',
emp_length %in% c('4 years', '5 years', '6 years') ~ '4-6 years',
emp_length %in% c('7 years', '8 years', '9 years') ~ '7-9 years',
emp_length %in% c('10+ years') ~ '10+ years',
TRUE ~ 'unknown')),
#Replacing NAs with mean value and then grouping into bins based on quartiles
annual_inc = ifelse(is.na(annual_inc), mean(annual_inc, na.rm = TRUE), annual_inc),
annual_inc = as.factor(ntile(annual_inc, 4)),
#Grouping into five equally-sized bins
dti = as.factor(cut(dti, breaks = 5)),
#Grouping into bins
mths_since_last_delinq = as.factor(case_when(
mths_since_last_delinq < 12 ~ "<1 year",
mths_since_last_delinq >=12 & mths_since_last_delinq <= 24 ~ "1-2 years",
mths_since_last_delinq > 24 & mths_since_last_delinq <= 36 ~ "2-3 years",
mths_since_last_delinq > 36 ~ "3+ years",
TRUE ~ "never")),
#Splitting the values and processing the numeric part of the string. Checking for NAs and
#replacing it with the mean value
int_rate = as.numeric(strsplit(int_rate, "%")),
int_rate = ifelse(is.na(int_rate), mean(int_rate, na.rm = TRUE), int_rate),
#Converting target variable into a factor
y = as.factor(y)) %>%
#Grouping the data w.r.t 'purpose' and binning based on count
group_by(purpose) %>%
mutate(purpose = ifelse(n() < 200, 'other', purpose),
purpose = as.factor(ifelse((purpose == 'credit_card' | purpose == 'debt_consolidation'),
'debt',purpose))) %>%
ungroup()
#Summarizing the cleaned and processed features
summary(data_clean)
## grade sub_grade home_ownership addr_state loan_amnt
## A:3593 A4 :1041 MORTGAGE:6637 CA :2650 Min. : 500
## B:4295 B3 :1037 NONE : 3 NY :1423 1st Qu.: 5000
## C:3108 A5 : 994 OTHER : 50 FL :1086 Median : 9600
## D:2134 B5 : 983 OWN :1152 TX :1007 Mean :10994
## E:1212 B4 : 890 RENT :7158 NJ : 689 3rd Qu.:15000
## F: 467 C1 : 821 IL : 602 Max. :35000
## G: 191 (Other):9234 (Other):7543
## emp_length annual_inc purpose dti
## < 1 year :1806 1:3750 debt :8845 (-0.03,6]:2486
## 1-3 years:4480 2:3750 other :2090 (6,12] :3942
## 10+ years:3332 3:3750 home_improvement:1141 (12,18] :4351
## 4-6 years:3355 4:3750 major_purchase : 815 (18,24] :3510
## 7-9 years:1653 small_business : 706 (24,30] : 711
## unknown : 374 car : 542
## (Other) : 861
## mths_since_last_delinq int_rate y
## <1 year : 985 Min. : 5.42 Not Paid: 3333
## 1-2 years:1077 1st Qu.: 9.62 Paid :11667
## 2-3 years: 960 Median :11.99
## 3+ years :2509 Mean :12.18
## never :9469 3rd Qu.:14.74
## Max. :24.11
##
ANSWER TO QUESTION 1b HERE:
Initially, there were 154 dummy variables. However, after dropping one level from each factor variable including y = “Paid”, we get 143 dummy variables.
#One Hot Encoding
dummy <- dummyVars(~.+annual_inc:emp_length, data = data_clean)
ohe_data_clean <- data.frame(predict(dummy, newdata = data_clean))
#Number of dummy variables
ncol(ohe_data_clean)
## [1] 154
#Removing one level from each factor variable to treat it as the baseline factor
drops <- c('y.Paid', 'grade.G', 'sub_grade.G5', 'home_ownership.NONE', 'addr_state.MS', 'emp_length.unknown', 'annual_inc.1', 'purpose.other', 'dti..24.30.', 'mths_since_last_delinq.never', 'emp_lengthunknown.annual_inc4')
ohe_data_clean <- ohe_data_clean[ , !(names(ohe_data_clean) %in% drops)]
#Number of dummy variables
ncol(ohe_data_clean)
## [1] 143
#Converting into a factor
ohe_data_clean$y.Not.Paid <- as.factor(ohe_data_clean$y.Not.Paid)
#Splitting cleaned data into training set and validation set
train <- ohe_data_clean[-va_inst,]
test <- ohe_data_clean[va_inst,]
ANSWER TO QUESTION 2a HERE:
There are 125 terminal nodes in the full tree.
int_rate has the highest information gain because the first variable used to make the split is always the one with the lowest Gini impurity.
#Initializing a full tree (unpruned)
mycontrol = tree.control(nrow(train), mincut = 5, minsize = 10, mindev = 0.0005)
lc.full.tree = tree(y.Not.Paid ~ .,control = mycontrol, train)
#Summary of full tree
summary(lc.full.tree)
##
## Classification tree:
## tree(formula = y.Not.Paid ~ ., data = train, control = mycontrol)
## Variables actually used in tree construction:
## [1] "int_rate" "loan_amnt"
## [3] "dti..18.24." "emp_length7.9.years.annual_inc1"
## [5] "emp_lengthunknown.annual_inc1" "addr_state.TX"
## [7] "grade.C" "purpose.debt"
## [9] "mths_since_last_delinq..1.year" "sub_grade.B2"
## [11] "emp_length7.9.years.annual_inc3" "purpose.wedding"
## [13] "emp_length...1.year" "addr_state.NJ"
## [15] "home_ownership.OWN" "home_ownership.RENT"
## [17] "addr_state.MO" "addr_state.FL"
## [19] "emp_length1.3.years.annual_inc4" "mths_since_last_delinq.1.2.years"
## [21] "dti..6.12." "emp_length..1.year.annual_inc1"
## [23] "emp_length4.6.years.annual_inc1" "emp_length4.6.years.annual_inc3"
## [25] "emp_length.4.6.years" "addr_state.NY"
## [27] "grade.E" "grade.D"
## [29] "purpose.small_business" "grade.F"
## [31] "home_ownership.MORTGAGE" "purpose.major_purchase"
## [33] "mths_since_last_delinq.3..years" "emp_lengthunknown.annual_inc2"
## [35] "mths_since_last_delinq.2.3.years" "annual_inc.4"
## [37] "addr_state.KY" "emp_length10..years.annual_inc3"
## [39] "dti..12.18." "grade.B"
## [41] "emp_length10..years.annual_inc2" "emp_length1.3.years.annual_inc1"
## [43] "sub_grade.C2" "sub_grade.D1"
## [45] "sub_grade.D2" "sub_grade.D5"
## [47] "addr_state.PA" "emp_length10..years.annual_inc4"
## [49] "sub_grade.E1" "emp_length1.3.years.annual_inc2"
## [51] "addr_state.CA" "addr_state.MA"
## [53] "purpose.medical" "annual_inc.3"
## Number of terminal nodes: 125
## Residual mean deviance: 0.8745 = 9073 / 10380
## Misclassification error rate: 0.1954 = 2052 / 10500
ANSWER TO QUESTION 2b HERE:
sizes = c(2, 4, 6, 8, 10, 15, 20, 25, 30, 35, 40) #Tree Sizes
cutoff = 0.5
tree_tr_acc = rep(0, length(sizes)) #Empty Vector to store training accuracies
tree_tst_acc = rep(0, length(sizes)) ##Empty Vector to store validation accuracies
#Function to calculate accuracy
accuracy <- function(classifications, actuals){
correct_classifications <- ifelse(classifications == actuals, 1, 0)
acc <- sum(correct_classifications)/length(classifications)
return(acc)
}
#Function for prediction and evaluation
predictor <- function(treename, pred_data, cutoff){
predictions <- predict(treename, newdata = pred_data) #make predictions
probs <- predictions[,2] #get the probabilities (second column)
classifications <- ifelse(probs > cutoff, 1, 0) #do the classifications
acc <- accuracy(classifications, pred_data$y.Not.Paid) #compute the accuracy
return(acc)
}
#Loop to calculate accuracy for different tree sizes
for (i in 1:length(sizes)){
size = sizes[i]
pruned_tree = prune.tree(lc.full.tree, best = size)
tr_acc <- predictor(pruned_tree, train, cutoff)
tst_acc <- predictor(pruned_tree, test, cutoff)
tree_tr_acc[i] <- tr_acc
tree_tst_acc[i] <- tst_acc
}
#Plot the Tree Size vs. Accuracy
plot(sizes, tree_tr_acc, type = 'l', col = 'red', main = "Fitting Curve (Decision Tree)",
xlab = 'Tree Size', ylab = 'Accuracy', ylim = c(0.7,0.8))
lines(sizes, tree_tst_acc, col = 'blue')
legend("bottomright", inset=0.05, legend=c("Training Data", "Validation Data"),col=c("red", "blue"), lty=1)
ANSWER TO QUESTION 2c HERE:
Observing the above fitting curve, it can be concluded that the accuracy is highest for tree sizes 8 and 10.
In my opinion, a tree size of 8 is the best because of the high accuracy and less complex tree.
cat("Accuracies for Validation Data: \n\n", tree_tst_acc)
## Accuracies for Validation Data:
##
## 0.7726667 0.7788889 0.7788889 0.784 0.784 0.7828889 0.7835556 0.7791111 0.7808889 0.7808889 0.78
#Storing predictions of the tree with highest accuracy in a separate data frame
best.tree <- prune.tree(lc.full.tree, best = 8)
best.tree.prediction <- predict(best.tree, newdata = test)
best.tree.preds <- best.tree.prediction[,2]
+Note: you will need to separate your training and validation sets into X and y. +Note: Be patient - it will take several minutes for kNN to make its predictions!
ANSWER TO QUESTION 3a HERE:
kval <- c(2, 4, 6, 8, 10, 15, 20) #K-Values
cutoff = 0.5
knn_tr_acc = rep(0, length(kval)) #Empty Vector to store training accuracies
knn_tst_acc = rep(0, length(kval)) #Empty Vector to store validation accuracies
#Storing all variables except for labelled value into x data frame
train.X = train[,!names(train) %in% c("y.Not.Paid")]
test.X = test[,!names(test) %in% c("y.Not.Paid")]
#Storing the labelled data into the y data frame
train.y= train$y.Not.Paid
test.y= test$y.Not.Paid
for(i in 1:length(kval)){
k <- kval[i] #get the ith k in kval
#compute predictions using train.X as the potential neighbors
#new points to be classified are first validation, then training points
tr_preds <- knn(train.X, train.X, train.y, k = k)
tst_preds <- knn(train.X, test.X, train.y, k = k)
#compute the accuracy for each set of predictions
tst_accuracy <- accuracy(tst_preds, test.y)
tr_accuracy <- accuracy(tr_preds, train.y)
#store in the appropriate place
knn_tr_acc[i] <- tr_accuracy
knn_tst_acc[i] <- tst_accuracy
}
#Plot the K-Value vs. Accuracy
plot(kval, knn_tr_acc, type = 'l', col = 'red', main = "Fitting Curve (KNN)", xlab = 'K-Value', ylab = 'Accuracy',
ylim = c(0.5,1))
lines(kval, knn_tst_acc, col = 'blue')
legend("bottomright", inset=0.05, legend=c("Training Data", "Validation Data"),col=c("red", "blue"), lty=1)
+Note: you’ll need to convert these probabilities from the probability of the majority-class vote to the probability that y = the positive class.
ANSWER TO QUESTION 3b HERE:
Observing the above fitting curve, it can be concluded that the accuracy is highest for k-value = 20
Therefore, k = 20 is the best value to go with.
cat("Accuracies for Validation Data: \n\n", knn_tst_acc)
## Accuracies for Validation Data:
##
## 0.6897778 0.7317778 0.746 0.7537778 0.7544444 0.7655556 0.7691111
#Storing predictions of the tree with highest accuracy in a separate data frame
best.knn <- knn(train.X, test.X, train.y, k=20, prob = TRUE)
best.knn.probs <- attr(best.knn, "prob")
best.knn.preds <- ifelse(best.knn == 1, best.knn.probs, 1-best.knn.probs)
ANSWER TO QUESTION 4a HERE:
prediction_best_tree <- prediction(best.tree.preds, test$y.Not.Paid)
prediction_best_knn <- prediction(best.knn.preds, test$y.Not.Paid)
#Creating a ROCR performance object
roc_tree <- performance(prediction_best_tree, "tpr", "fpr")
roc_knn <- performance(prediction_best_knn, "tpr", "fpr")
#Plot the TPR vs FPR performance
plot(roc_tree, col = "red", lwd = 2, xlab = 'False Positive Rate', ylab = 'True Positive Rate', main = 'ROC Curve')
plot(roc_knn, add=T, col = "darkgreen", lwd = 2)
legend("bottomright", inset=0.05, legend=c("Best Decision Tree Model", "Best KNN Model"),col=c("red", "darkgreen"), lty=1)
ANSWER TO QUESTION 4b HERE:
The Decision Tree Model has the highest AUC of 0.7164188. Yes, it is the same model as the one with the higher accuracy. From ROC Curve, you can observe that the highest AUC model has the highest TPR for every cutoff.
#Computing the AUC for tree model
performance(prediction_best_tree, measure = "auc")@y.values[[1]]
## [1] 0.7164188
#Computing the AUC for knn model
performance(prediction_best_knn, measure = "auc")@y.values[[1]]
## [1] 0.6605081
ANSWER TO QUESTION 4c HERE:
lift_curve <- performance(prediction_best_tree, "lift", "rpp")
#Plotting the lift curve
plot(lift_curve, col = "purple", lwd = 2 ,main="Lift Curve", xlab="Percent of Instances Targetted", ylab="Lift Value")
+Note: you can answer approximately by reading your lift chart, no need to calculate the exact amounts.
ANSWER TO QUESTION 4d HERE:
If we decide to flag the top 10% of loans most likely to be “Not Paid”, the lift will be 2.8 approximately.
If we decide to flag the top 50% of loans most likely to be “Not Paid”, the lift will be 1.6 approximately.
If we want to achieve a lift of at least 2.0, We must flag the top 22% or less.
Can you improve on the best-performing model with either more/different features, a different model specification, or a different tuning parameter (or all of the above)? Report your best validation performance and give details on your best model.
ANSWER TO QUESTION 5 HERE:
The accuracy of the tree model was improved from 78.4% to 89.3%. The accuracy improved due to the following changes: + Added extra variables: funded_amnt, total_pymnt, open_acc, and total_rec_int + Tuned the tree control parameters: Changed mindev to 0.0001, Chose a tree size of 60
#Selecting additional columns for increasing accuracy
data_clean <- lc %>%#Selecting the variables that are needed to be processed and cleaned
select('grade', 'sub_grade', 'home_ownership', 'addr_state', 'loan_amnt', 'emp_length', 'annual_inc','purpose',
'dti', 'mths_since_last_delinq', 'int_rate', 'funded_amnt', 'total_pymnt', 'open_acc', 'total_rec_int', 'y') %>%
mutate(#Checking for NAs and adding NULL values if present
grade = as.factor(ifelse(is.na(grade), NULL, grade)),
sub_grade = as.factor(ifelse(is.na(sub_grade), NULL, sub_grade)),
home_ownership = as.factor(ifelse(is.na(home_ownership), NULL, home_ownership)),
addr_state = as.factor(ifelse(is.na(addr_state), NULL, addr_state)),
#Replacing NAs with mean value
loan_amnt = ifelse(is.na(loan_amnt), mean(loan_amnt, na.rm = TRUE), loan_amnt),
#Grouping into bins
emp_length = as.factor(case_when(
emp_length %in% c('< 1 year') ~ '< 1 year',
emp_length %in% c('1 year', '2 years', '3 years') ~ '1-3 years',
emp_length %in% c('4 years', '5 years', '6 years') ~ '4-6 years',
emp_length %in% c('7 years', '8 years', '9 years') ~ '7-9 years',
emp_length %in% c('10+ years') ~ '10+ years',
TRUE ~ 'unknown')),
#Replacing NAs with mean value and then grouping into bins based on quartiles
annual_inc = ifelse(is.na(annual_inc), mean(annual_inc, na.rm = TRUE), annual_inc),
annual_inc = as.factor(ntile(annual_inc, 4)),
#Grouping into five equally-sized bins
dti = as.factor(cut(dti, breaks = 5)),
#Grouping into bins
mths_since_last_delinq = as.factor(case_when(
mths_since_last_delinq < 12 ~ "<1 year",
mths_since_last_delinq >=12 & mths_since_last_delinq <= 24 ~ "1-2 years",
mths_since_last_delinq > 24 & mths_since_last_delinq <= 36 ~ "2-3 years",
mths_since_last_delinq > 36 ~ "3+ years",
TRUE ~ "never")),
#Splitting the values and processing the numeric part of the string. Checking for NAs and
#replacing it with the mean value
int_rate = as.numeric(strsplit(int_rate, "%")),
int_rate = ifelse(is.na(int_rate), mean(int_rate, na.rm = TRUE), int_rate),
#Converting target variable into a factor
y = as.factor(y)) %>%
#Grouping the data w.r.t 'purpose' and binning based on count
group_by(purpose) %>%
mutate(purpose = ifelse(n() < 200, 'other', purpose),
purpose = as.factor(ifelse((purpose == 'credit_card' | purpose == 'debt_consolidation'),
'debt',purpose))) %>%
ungroup()
#One Hot Encoding
dummy <- dummyVars(~.+annual_inc:emp_length, data = data_clean)
ohe_data_clean <- data.frame(predict(dummy, newdata = data_clean))
#Removing one level from each factor variable to treat it as the baseline factor
drops <- c('y.Paid', 'grade.G', 'sub_grade.G5', 'home_ownership.NONE', 'addr_state.MS', 'emp_length.unknown', 'annual_inc.1', 'purpose.other', 'dti..24.30.', 'mths_since_last_delinq.never', 'emp_lengthunknown.annual_inc4')
ohe_data_clean <- ohe_data_clean[ , !(names(ohe_data_clean) %in% drops)]
#Converting into a factor
ohe_data_clean$y.Not.Paid <- as.factor(ohe_data_clean$y.Not.Paid)
#Splitting cleaned data into training set and validation set
train <- ohe_data_clean[-va_inst,]
test <- ohe_data_clean[va_inst,]
#Initializing a full tree (unpruned)
mycontrol = tree.control(nrow(train), mincut = 5, minsize = 10, mindev = 0.0001)
lc.full.tree = tree(y.Not.Paid ~ .,control = mycontrol, train)
#Summary of full tree
summary(lc.full.tree)
##
## Classification tree:
## tree(formula = y.Not.Paid ~ ., data = train, control = mycontrol)
## Variables actually used in tree construction:
## [1] "total_pymnt" "funded_amnt"
## [3] "int_rate" "total_rec_int"
## [5] "annual_inc.4" "grade.B"
## [7] "dti..12.18." "grade.D"
## [9] "open_acc" "emp_length10..years.annual_inc4"
## [11] "loan_amnt" "purpose.debt"
## [13] "emp_length.1.3.years" "home_ownership.MORTGAGE"
## [15] "emp_length.10..years" "grade.C"
## [17] "home_ownership.RENT" "grade.A"
## [19] "grade.E" "emp_length.7.9.years"
## [21] "annual_inc.3" "annual_inc.2"
## [23] "addr_state.IL" "mths_since_last_delinq.2.3.years"
## [25] "mths_since_last_delinq.3..years" "addr_state.NJ"
## [27] "addr_state.FL" "addr_state.VA"
## [29] "mths_since_last_delinq..1.year" "emp_length10..years.annual_inc3"
## [31] "dti...0.03.6." "addr_state.CA"
## [33] "purpose.small_business" "emp_length1.3.years.annual_inc3"
## [35] "sub_grade.B2" "emp_length.4.6.years"
## [37] "dti..6.12." "home_ownership.OWN"
## [39] "sub_grade.B3" "emp_length4.6.years.annual_inc1"
## [41] "mths_since_last_delinq.1.2.years" "sub_grade.C2"
## [43] "emp_length..1.year.annual_inc1" "purpose.home_improvement"
## [45] "sub_grade.C1" "emp_length1.3.years.annual_inc2"
## [47] "addr_state.TX" "sub_grade.D5"
## [49] "sub_grade.D3" "dti..18.24."
## [51] "sub_grade.B5" "sub_grade.C3"
## [53] "emp_length...1.year" "grade.F"
## [55] "sub_grade.D4" "addr_state.AZ"
## [57] "sub_grade.C4" "sub_grade.C5"
## [59] "sub_grade.E2" "emp_length1.3.years.annual_inc4"
## [61] "emp_length7.9.years.annual_inc2" "sub_grade.D2"
## [63] "sub_grade.B4"
## Number of terminal nodes: 552
## Residual mean deviance: 0.2012 = 1999 / 9936
## Misclassification error rate: 0.04872 = 511 / 10488
#Determining the accuracy
size = 60
pruned_tree = prune.tree(lc.full.tree, best = size)
tr_acc <- predictor(pruned_tree, train, cutoff)
tst_acc <- predictor(pruned_tree, test, cutoff)
tst_acc
## [1] 0.8928889