Assignment 2 File

Due March 29, 2022

Worth 40 points total

Problem Overview

Lending Club is an online, peer-to-peer marketplace that connects borrowers and investors. This assignment asks you to do some predictive modeling on a dataset of past Lending Club loans, including loan details and information about the borrowers. A full data dictionary can be found in LCDataDictionary.xlsx.

The goal of this assignment is to get hands-on practice with data cleaning, feature engineering, and predictive modeling algorithms beyond the basics, including classification trees, kNNs, and regularized logistic regression. You will also practice creating and interpreting ROC and lift curves.

You will be predicting whether loans were paid in full or not. Your intended use case is to help an organization decide which loans to “flag” as potentially risky investments.

RUBRIC: There are three possible grades on this assignment: Fail (F), Pass (P), and High Pass (H). If you receive an F then you will have one more chance to turn it in to receive a P. If you receive H on 3 out of the 4 assignments this semester you’ll get a bonus point on your final average.

Turn in a well-formatted compiled HTML document using R markdown. If you turn in a different file type or your code doesn’t compile, you will be asked to redo the assignment.
Provide clear answers to the questions and the correct R commands as necessary, in the appropriate places. You may answer up to three sub-questions incorrectly and still receive a P on this assignment (for example, 1(a) counts as one sub-question). If you answer all sub-questions correctly on your first submission you will receive an H.
The entire document must be clear, concise, readable, and well-formatted. If your assignment is unreadable or if you include more output than necessary to answer the questions you will be asked to redo the assignment.

Note that this assignment is somewhat open-ended and there are many ways to answer these questions. I don’t require that we have exactly the same answers in order for you to receive full credit.

The following code block does some intial setup, including:

Reading the dataset (make sure to set your working diretory)
Creating the target variable
Setting the random seed
Selecting a subsample of 15k observations in order to speed up training time
Selecting the training and validation row numbers

lc <- read_csv("LendingClub_LoanStats_2011_v2.csv")  #read the Lending Club dataset in R

#create target variable: fully paid
#remove any rows where y is NA
lc <- lc %>%
  mutate(y = as.factor(ifelse(loan_status == "Fully Paid", "Paid", "Not Paid"))) %>%
  filter(!is.na(y))

#set seed and randomly downsample 15k instances 
#(otherwise training kNN will take hours)
set.seed(1)
lc_small <- sample(nrow(lc), 15000)
lc <- lc[lc_small,]

#then calculate the training/validation row numbers, but don't yet split
va_inst <- sample(nrow(lc), .3*nrow(lc))

0: Example answer

What is the mean loan amount in this dataset??

ANSWER: The loan amount in this dataset is $10,993.60.

loan_mean <- lc %>%
  summarise(mean_amt = mean(loan_amnt))

1: Data Cleaning and Feature Engineering

Clean and process the following variables in the Lending Club dataset:

grade, sub_grade, home_ownership, addr_state: check if there are NAs and add a NULL value if so
loan_amnt: check if there are NAs, and if there are, replace with the mean loan amount (by grade)
emp_length: group into bins: <1 year, 1-3 years, 4-6 years, 7-9 years, 10+ years, and “unknown”
annual_inc: replace NAs with the average value, then group into four bins based on quartile values
purpose: any factor levels with fewer than 200 instances should get grouped into “other”. Also, combine credit_card and debt_consolidation into “debt”.
dti: group into five equally-sized bins
mths_since_last_delinq: group into five bins (< 1 year, 1-2 years, 2-3 years, 3+ years, never)
int_rate: make sure this has the correct data type, process it if not. Check for NAs and replace by the mean.
y: convert into a factor

ANSWER TO QUESTION 1a HERE:

data_clean <- lc %>%#Selecting the variables that are needed to be processed and cleaned
  select('grade', 'sub_grade', 'home_ownership', 'addr_state', 'loan_amnt', 'emp_length', 'annual_inc','purpose',
         'dti', 'mths_since_last_delinq', 'int_rate', 'y') %>%
  mutate(#Checking for NAs and adding NULL values if present
         grade = as.factor(ifelse(is.na(grade), NULL, grade)),
         sub_grade = as.factor(ifelse(is.na(sub_grade), NULL, sub_grade)),
         home_ownership = as.factor(ifelse(is.na(home_ownership), NULL, home_ownership)),
         addr_state = as.factor(ifelse(is.na(addr_state), NULL, addr_state)),
         #Replacing NAs with mean value
         loan_amnt = ifelse(is.na(loan_amnt), mean(loan_amnt, na.rm = TRUE), loan_amnt),
         #Grouping into bins
         emp_length = as.factor(case_when(
           emp_length %in% c('< 1 year') ~ '< 1 year',
           emp_length %in% c('1 year', '2 years', '3 years') ~ '1-3 years',
           emp_length %in% c('4 years', '5 years', '6 years') ~ '4-6 years',
           emp_length %in% c('7 years', '8 years', '9 years') ~ '7-9 years',
           emp_length %in% c('10+ years') ~ '10+ years',
           TRUE ~ 'unknown')),
         #Replacing NAs with mean value and then grouping into bins based on quartiles
         annual_inc = ifelse(is.na(annual_inc), mean(annual_inc, na.rm = TRUE), annual_inc),
         annual_inc = as.factor(ntile(annual_inc, 4)),
         #Grouping into five equally-sized bins
         dti = as.factor(cut(dti, breaks = 5)),
         #Grouping into bins
         mths_since_last_delinq = as.factor(case_when(
           mths_since_last_delinq < 12 ~ "<1 year",
           mths_since_last_delinq >=12 & mths_since_last_delinq <= 24 ~ "1-2 years",
           mths_since_last_delinq > 24 & mths_since_last_delinq <= 36 ~ "2-3 years",
           mths_since_last_delinq > 36 ~ "3+ years",
           TRUE ~ "never")),
         #Splitting the values and processing the numeric part of the string. Checking for NAs and 
         #replacing it with the mean value
         int_rate = as.numeric(strsplit(int_rate, "%")),
         int_rate = ifelse(is.na(int_rate), mean(int_rate, na.rm = TRUE), int_rate),
         #Converting target variable into a factor
         y = as.factor(y)) %>%
  #Grouping the data w.r.t 'purpose' and binning based on count
  group_by(purpose) %>%
  mutate(purpose = ifelse(n() < 200, 'other', purpose),
         purpose = as.factor(ifelse((purpose == 'credit_card' | purpose == 'debt_consolidation'),
                                    'debt',purpose))) %>% 
  ungroup()
#Summarizing the cleaned and processed features
summary(data_clean)

##  grade      sub_grade     home_ownership   addr_state     loan_amnt    
##  A:3593   A4     :1041   MORTGAGE:6637   CA     :2650   Min.   :  500  
##  B:4295   B3     :1037   NONE    :   3   NY     :1423   1st Qu.: 5000  
##  C:3108   A5     : 994   OTHER   :  50   FL     :1086   Median : 9600  
##  D:2134   B5     : 983   OWN     :1152   TX     :1007   Mean   :10994  
##  E:1212   B4     : 890   RENT    :7158   NJ     : 689   3rd Qu.:15000  
##  F: 467   C1     : 821                   IL     : 602   Max.   :35000  
##  G: 191   (Other):9234                   (Other):7543                  
##      emp_length   annual_inc             purpose            dti      
##  < 1 year :1806   1:3750     debt            :8845   (-0.03,6]:2486  
##  1-3 years:4480   2:3750     other           :2090   (6,12]   :3942  
##  10+ years:3332   3:3750     home_improvement:1141   (12,18]  :4351  
##  4-6 years:3355   4:3750     major_purchase  : 815   (18,24]  :3510  
##  7-9 years:1653              small_business  : 706   (24,30]  : 711  
##  unknown  : 374              car             : 542                   
##                              (Other)         : 861                   
##  mths_since_last_delinq    int_rate            y        
##  <1 year  : 985         Min.   : 5.42   Not Paid: 3333  
##  1-2 years:1077         1st Qu.: 9.62   Paid    :11667  
##  2-3 years: 960         Median :11.99                   
##  3+ years :2509         Mean   :12.18                   
##  never    :9469         3rd Qu.:14.74                   
##                         Max.   :24.11                   
##

At this point you should have 12 cleaned variables (including y). +Convert the set of cleaned and processed variables into dummy variables. Also create dummy variables that interact the (binned) annual income and (binned) employment length variables. +You will end up with dummy variables for both y = “Paid” and y = “Not Paid”. Drop the y = “Paid” dummy variable and convert the “Not Paid” dummy variable into a factor. +How many variables do you have after converting to dummies and dropping the y = Paid dummy? +Finally, partition your dataset into “train” and “test” using the va_inst row numbers sampled above.

ANSWER TO QUESTION 1b HERE:

Initially, there were 154 dummy variables. However, after dropping one level from each factor variable including y = “Paid”, we get 143 dummy variables.

#One Hot Encoding
dummy <- dummyVars(~.+annual_inc:emp_length, data = data_clean)
ohe_data_clean <- data.frame(predict(dummy, newdata = data_clean))

#Number of dummy variables
ncol(ohe_data_clean)

## [1] 154

#Removing one level from each factor variable to treat it as the baseline factor
drops <- c('y.Paid', 'grade.G', 'sub_grade.G5', 'home_ownership.NONE', 'addr_state.MS', 'emp_length.unknown', 'annual_inc.1', 'purpose.other', 'dti..24.30.', 'mths_since_last_delinq.never', 'emp_lengthunknown.annual_inc4')
ohe_data_clean <- ohe_data_clean[ , !(names(ohe_data_clean) %in% drops)]

#Number of dummy variables
ncol(ohe_data_clean)

## [1] 143

#Converting into a factor
ohe_data_clean$y.Not.Paid <- as.factor(ohe_data_clean$y.Not.Paid)

#Splitting cleaned data into training set and validation set
train <- ohe_data_clean[-va_inst,]
test <- ohe_data_clean[va_inst,]

2: Trees

Use the following code to create an unpruned tree (replace YOUR_Y_VAR and YOUR_TRAINING_DATA with the appropriate variable names, then uncomment the line starting with “lc.full.tree=tree…”). How many terminal nodes are in the full tree? Which variable has the highest-information gain (leads to the biggest decrease in impurity)? How do you know?

ANSWER TO QUESTION 2a HERE:

There are 125 terminal nodes in the full tree.
int_rate has the highest information gain because the first variable used to make the split is always the one with the lowest Gini impurity.

#Initializing a full tree (unpruned)
mycontrol = tree.control(nrow(train), mincut = 5, minsize = 10, mindev = 0.0005)
lc.full.tree = tree(y.Not.Paid ~ .,control = mycontrol, train)

#Summary of full tree
summary(lc.full.tree)

## 
## Classification tree:
## tree(formula = y.Not.Paid ~ ., data = train, control = mycontrol)
## Variables actually used in tree construction:
##  [1] "int_rate"                         "loan_amnt"                       
##  [3] "dti..18.24."                      "emp_length7.9.years.annual_inc1" 
##  [5] "emp_lengthunknown.annual_inc1"    "addr_state.TX"                   
##  [7] "grade.C"                          "purpose.debt"                    
##  [9] "mths_since_last_delinq..1.year"   "sub_grade.B2"                    
## [11] "emp_length7.9.years.annual_inc3"  "purpose.wedding"                 
## [13] "emp_length...1.year"              "addr_state.NJ"                   
## [15] "home_ownership.OWN"               "home_ownership.RENT"             
## [17] "addr_state.MO"                    "addr_state.FL"                   
## [19] "emp_length1.3.years.annual_inc4"  "mths_since_last_delinq.1.2.years"
## [21] "dti..6.12."                       "emp_length..1.year.annual_inc1"  
## [23] "emp_length4.6.years.annual_inc1"  "emp_length4.6.years.annual_inc3" 
## [25] "emp_length.4.6.years"             "addr_state.NY"                   
## [27] "grade.E"                          "grade.D"                         
## [29] "purpose.small_business"           "grade.F"                         
## [31] "home_ownership.MORTGAGE"          "purpose.major_purchase"          
## [33] "mths_since_last_delinq.3..years"  "emp_lengthunknown.annual_inc2"   
## [35] "mths_since_last_delinq.2.3.years" "annual_inc.4"                    
## [37] "addr_state.KY"                    "emp_length10..years.annual_inc3" 
## [39] "dti..12.18."                      "grade.B"                         
## [41] "emp_length10..years.annual_inc2"  "emp_length1.3.years.annual_inc1" 
## [43] "sub_grade.C2"                     "sub_grade.D1"                    
## [45] "sub_grade.D2"                     "sub_grade.D5"                    
## [47] "addr_state.PA"                    "emp_length10..years.annual_inc4" 
## [49] "sub_grade.E1"                     "emp_length1.3.years.annual_inc2" 
## [51] "addr_state.CA"                    "addr_state.MA"                   
## [53] "purpose.medical"                  "annual_inc.3"                    
## Number of terminal nodes:  125 
## Residual mean deviance:  0.8745 = 9073 / 10380 
## Misclassification error rate: 0.1954 = 2052 / 10500

Create pruned trees of size 2, 4, 6, 8, 10, 15, 20, 25, 30, 35, and 40. Plot fitting curves consisting of the accuracy in the validation and training sets for each pruned tree (assuming a cutoff of 0.5). Make sure the two sets of points are different colors.

ANSWER TO QUESTION 2b HERE:

sizes = c(2, 4, 6, 8, 10, 15, 20, 25, 30, 35, 40) #Tree Sizes
cutoff = 0.5
tree_tr_acc = rep(0, length(sizes)) #Empty Vector to store training accuracies
tree_tst_acc = rep(0, length(sizes)) ##Empty Vector to store validation accuracies

#Function to calculate accuracy
accuracy <- function(classifications, actuals){
  correct_classifications <- ifelse(classifications == actuals, 1, 0)
  acc <- sum(correct_classifications)/length(classifications)
  return(acc)
}

#Function for prediction and evaluation
predictor <- function(treename, pred_data, cutoff){
  predictions <- predict(treename, newdata = pred_data) #make predictions
  probs <- predictions[,2] #get the probabilities (second column)
  classifications <- ifelse(probs > cutoff, 1, 0) #do the classifications
  acc <- accuracy(classifications, pred_data$y.Not.Paid) #compute the accuracy
  return(acc)
}

#Loop to calculate accuracy for different tree sizes
for (i in 1:length(sizes)){
  size = sizes[i]
  pruned_tree = prune.tree(lc.full.tree, best = size)
  tr_acc <- predictor(pruned_tree, train, cutoff)
  tst_acc <- predictor(pruned_tree, test, cutoff)
  
  tree_tr_acc[i] <- tr_acc
  tree_tst_acc[i] <- tst_acc 
}

#Plot the Tree Size vs. Accuracy
plot(sizes, tree_tr_acc, type = 'l', col = 'red', main = "Fitting Curve (Decision Tree)",
     xlab = 'Tree Size', ylab = 'Accuracy', ylim = c(0.7,0.8))
lines(sizes, tree_tst_acc, col = 'blue')
legend("bottomright", inset=0.05, legend=c("Training Data", "Validation Data"),col=c("red", "blue"), lty=1)

Which tree size is the best, and how did you select the best one? Store the vector of probabilities estimated by your best tree in the validation set as a variable called best.tree.preds. We’ll use these later.

ANSWER TO QUESTION 2c HERE:

Observing the above fitting curve, it can be concluded that the accuracy is highest for tree sizes 8 and 10.
In my opinion, a tree size of 8 is the best because of the high accuracy and less complex tree.

cat("Accuracies for Validation Data: \n\n", tree_tst_acc)

## Accuracies for Validation Data: 
## 
##  0.7726667 0.7788889 0.7788889 0.784 0.784 0.7828889 0.7835556 0.7791111 0.7808889 0.7808889 0.78

#Storing predictions of the tree with highest accuracy in a separate data frame
best.tree <- prune.tree(lc.full.tree, best = 8)
best.tree.prediction <- predict(best.tree, newdata = test)
best.tree.preds <- best.tree.prediction[,2]

3: kNN

Compute kNN estimates in the training and validation data using k values of 2, 4, 6, 8, 10, 15, and 20. Assume a cutoff of 0.5. Plot the accuracy in the validation and training sets for each k value. Make sure the two sets of points are different colors!

+Note: you will need to separate your training and validation sets into X and y. +Note: Be patient - it will take several minutes for kNN to make its predictions!

ANSWER TO QUESTION 3a HERE:

kval <- c(2, 4, 6, 8, 10, 15, 20) #K-Values
cutoff = 0.5

knn_tr_acc = rep(0, length(kval)) #Empty Vector to store training accuracies
knn_tst_acc = rep(0, length(kval)) #Empty Vector to store validation accuracies

#Storing all variables except for labelled value into x data frame
train.X = train[,!names(train) %in% c("y.Not.Paid")]
test.X = test[,!names(test) %in% c("y.Not.Paid")]

#Storing the labelled data into the y data frame
train.y= train$y.Not.Paid
test.y= test$y.Not.Paid

for(i in 1:length(kval)){
  k <- kval[i] #get the ith k in kval
  
  #compute predictions using train.X as the potential neighbors
  #new points to be classified are first validation, then training points
  tr_preds <- knn(train.X, train.X, train.y, k = k)
  tst_preds <- knn(train.X, test.X, train.y, k = k)
  
  #compute the accuracy for each set of predictions
  tst_accuracy <- accuracy(tst_preds, test.y)
  tr_accuracy <- accuracy(tr_preds, train.y)
  
  #store in the appropriate place
  knn_tr_acc[i] <- tr_accuracy
  knn_tst_acc[i] <- tst_accuracy
}

#Plot the K-Value vs. Accuracy
plot(kval, knn_tr_acc, type = 'l', col = 'red', main = "Fitting Curve (KNN)", xlab = 'K-Value', ylab = 'Accuracy', 
     ylim = c(0.5,1))
lines(kval, knn_tst_acc, col = 'blue')
legend("bottomright", inset=0.05, legend=c("Training Data", "Validation Data"),col=c("red", "blue"), lty=1)

Which k is the best, and how did you select the best one? Store the vector of probabilities estimated by your best k value in the validation set as a variable called best.knn.preds. +

+Note: you’ll need to convert these probabilities from the probability of the majority-class vote to the probability that y = the positive class.

ANSWER TO QUESTION 3b HERE:

Observing the above fitting curve, it can be concluded that the accuracy is highest for k-value = 20
Therefore, k = 20 is the best value to go with.

cat("Accuracies for Validation Data: \n\n", knn_tst_acc)

## Accuracies for Validation Data: 
## 
##  0.6897778 0.7317778 0.746 0.7537778 0.7544444 0.7655556 0.7691111

#Storing predictions of the tree with highest accuracy in a separate data frame
best.knn <- knn(train.X, test.X, train.y, k=20, prob = TRUE)
best.knn.probs <- attr(best.knn, "prob")
best.knn.preds <- ifelse(best.knn == 1, best.knn.probs, 1-best.knn.probs)

4: ROC and Lift

Plot the ROC curves for the probability estimates generated by your best tree and kNN models on the same chart (make sure that your reader can tell which line is which).

ANSWER TO QUESTION 4a HERE:

prediction_best_tree <- prediction(best.tree.preds, test$y.Not.Paid)
prediction_best_knn <- prediction(best.knn.preds, test$y.Not.Paid)

#Creating a ROCR performance object
roc_tree <- performance(prediction_best_tree, "tpr", "fpr")
roc_knn <- performance(prediction_best_knn, "tpr", "fpr")

#Plot the TPR vs FPR performance
plot(roc_tree, col = "red", lwd = 2, xlab = 'False Positive Rate', ylab = 'True Positive Rate', main = 'ROC Curve')
plot(roc_knn, add=T, col = "darkgreen", lwd = 2)
legend("bottomright", inset=0.05, legend=c("Best Decision Tree Model", "Best KNN Model"),col=c("red", "darkgreen"), lty=1)

Compute the AUC for each model (in the validation data, of course. Which model has the highest AUC? Is this the same model as the one with the higher accuracy? Does the highest-AUC model have the highest TPR for every cutoff?

ANSWER TO QUESTION 4b HERE:

The Decision Tree Model has the highest AUC of 0.7164188. Yes, it is the same model as the one with the higher accuracy. From ROC Curve, you can observe that the highest AUC model has the highest TPR for every cutoff.

#Computing the AUC for tree model
performance(prediction_best_tree, measure = "auc")@y.values[[1]]

## [1] 0.7164188

#Computing the AUC for knn model
performance(prediction_best_knn, measure = "auc")@y.values[[1]]

## [1] 0.6605081

Plot the lift curve for your highest-AUC model.

ANSWER TO QUESTION 4c HERE:

lift_curve <- performance(prediction_best_tree, "lift", "rpp")

#Plotting the lift curve
plot(lift_curve, col = "purple", lwd = 2 ,main="Lift Curve", xlab="Percent of Instances Targetted", ylab="Lift Value")

If we decide to flag the top 10% of loans most likely to be “Not Paid”, what will our lift be? What if we flag the top 50% of loans? If we want to achieve a lift of at least 2.0, how many loans should we flag?

+Note: you can answer approximately by reading your lift chart, no need to calculate the exact amounts.

ANSWER TO QUESTION 4d HERE:

If we decide to flag the top 10% of loans most likely to be “Not Paid”, the lift will be 2.8 approximately.
If we decide to flag the top 50% of loans most likely to be “Not Paid”, the lift will be 1.6 approximately.
If we want to achieve a lift of at least 2.0, We must flag the top 22% or less.

5: OPTIONAL (Extra Challenge)

Can you improve on the best-performing model with either more/different features, a different model specification, or a different tuning parameter (or all of the above)? Report your best validation performance and give details on your best model.

ANSWER TO QUESTION 5 HERE:

The accuracy of the tree model was improved from 78.4% to 89.3%. The accuracy improved due to the following changes: + Added extra variables: funded_amnt, total_pymnt, open_acc, and total_rec_int + Tuned the tree control parameters: Changed mindev to 0.0001, Chose a tree size of 60

#Selecting additional columns for increasing accuracy

data_clean <- lc %>%#Selecting the variables that are needed to be processed and cleaned
  select('grade', 'sub_grade', 'home_ownership', 'addr_state', 'loan_amnt', 'emp_length', 'annual_inc','purpose',
         'dti', 'mths_since_last_delinq', 'int_rate', 'funded_amnt', 'total_pymnt', 'open_acc', 'total_rec_int', 'y') %>%
  mutate(#Checking for NAs and adding NULL values if present
         grade = as.factor(ifelse(is.na(grade), NULL, grade)),
         sub_grade = as.factor(ifelse(is.na(sub_grade), NULL, sub_grade)),
         home_ownership = as.factor(ifelse(is.na(home_ownership), NULL, home_ownership)),
         addr_state = as.factor(ifelse(is.na(addr_state), NULL, addr_state)),
         #Replacing NAs with mean value
         loan_amnt = ifelse(is.na(loan_amnt), mean(loan_amnt, na.rm = TRUE), loan_amnt),
         #Grouping into bins
         emp_length = as.factor(case_when(
           emp_length %in% c('< 1 year') ~ '< 1 year',
           emp_length %in% c('1 year', '2 years', '3 years') ~ '1-3 years',
           emp_length %in% c('4 years', '5 years', '6 years') ~ '4-6 years',
           emp_length %in% c('7 years', '8 years', '9 years') ~ '7-9 years',
           emp_length %in% c('10+ years') ~ '10+ years',
           TRUE ~ 'unknown')),
         #Replacing NAs with mean value and then grouping into bins based on quartiles
         annual_inc = ifelse(is.na(annual_inc), mean(annual_inc, na.rm = TRUE), annual_inc),
         annual_inc = as.factor(ntile(annual_inc, 4)),
         #Grouping into five equally-sized bins
         dti = as.factor(cut(dti, breaks = 5)),
         #Grouping into bins
         mths_since_last_delinq = as.factor(case_when(
           mths_since_last_delinq < 12 ~ "<1 year",
           mths_since_last_delinq >=12 & mths_since_last_delinq <= 24 ~ "1-2 years",
           mths_since_last_delinq > 24 & mths_since_last_delinq <= 36 ~ "2-3 years",
           mths_since_last_delinq > 36 ~ "3+ years",
           TRUE ~ "never")),
         #Splitting the values and processing the numeric part of the string. Checking for NAs and 
         #replacing it with the mean value
         int_rate = as.numeric(strsplit(int_rate, "%")),
         int_rate = ifelse(is.na(int_rate), mean(int_rate, na.rm = TRUE), int_rate),
         #Converting target variable into a factor
         y = as.factor(y)) %>%
  #Grouping the data w.r.t 'purpose' and binning based on count
  group_by(purpose) %>%
  mutate(purpose = ifelse(n() < 200, 'other', purpose),
         purpose = as.factor(ifelse((purpose == 'credit_card' | purpose == 'debt_consolidation'),
                                    'debt',purpose))) %>% 
  ungroup()

#One Hot Encoding
dummy <- dummyVars(~.+annual_inc:emp_length, data = data_clean)
ohe_data_clean <- data.frame(predict(dummy, newdata = data_clean))

#Removing one level from each factor variable to treat it as the baseline factor
drops <- c('y.Paid', 'grade.G', 'sub_grade.G5', 'home_ownership.NONE', 'addr_state.MS', 'emp_length.unknown', 'annual_inc.1', 'purpose.other', 'dti..24.30.', 'mths_since_last_delinq.never', 'emp_lengthunknown.annual_inc4')
ohe_data_clean <- ohe_data_clean[ , !(names(ohe_data_clean) %in% drops)]

#Converting into a factor
ohe_data_clean$y.Not.Paid <- as.factor(ohe_data_clean$y.Not.Paid)

#Splitting cleaned data into training set and validation set
train <- ohe_data_clean[-va_inst,]
test <- ohe_data_clean[va_inst,]

#Initializing a full tree (unpruned)
mycontrol = tree.control(nrow(train), mincut = 5, minsize = 10, mindev = 0.0001)
lc.full.tree = tree(y.Not.Paid ~ .,control = mycontrol, train)

#Summary of full tree
summary(lc.full.tree)

## 
## Classification tree:
## tree(formula = y.Not.Paid ~ ., data = train, control = mycontrol)
## Variables actually used in tree construction:
##  [1] "total_pymnt"                      "funded_amnt"                     
##  [3] "int_rate"                         "total_rec_int"                   
##  [5] "annual_inc.4"                     "grade.B"                         
##  [7] "dti..12.18."                      "grade.D"                         
##  [9] "open_acc"                         "emp_length10..years.annual_inc4" 
## [11] "loan_amnt"                        "purpose.debt"                    
## [13] "emp_length.1.3.years"             "home_ownership.MORTGAGE"         
## [15] "emp_length.10..years"             "grade.C"                         
## [17] "home_ownership.RENT"              "grade.A"                         
## [19] "grade.E"                          "emp_length.7.9.years"            
## [21] "annual_inc.3"                     "annual_inc.2"                    
## [23] "addr_state.IL"                    "mths_since_last_delinq.2.3.years"
## [25] "mths_since_last_delinq.3..years"  "addr_state.NJ"                   
## [27] "addr_state.FL"                    "addr_state.VA"                   
## [29] "mths_since_last_delinq..1.year"   "emp_length10..years.annual_inc3" 
## [31] "dti...0.03.6."                    "addr_state.CA"                   
## [33] "purpose.small_business"           "emp_length1.3.years.annual_inc3" 
## [35] "sub_grade.B2"                     "emp_length.4.6.years"            
## [37] "dti..6.12."                       "home_ownership.OWN"              
## [39] "sub_grade.B3"                     "emp_length4.6.years.annual_inc1" 
## [41] "mths_since_last_delinq.1.2.years" "sub_grade.C2"                    
## [43] "emp_length..1.year.annual_inc1"   "purpose.home_improvement"        
## [45] "sub_grade.C1"                     "emp_length1.3.years.annual_inc2" 
## [47] "addr_state.TX"                    "sub_grade.D5"                    
## [49] "sub_grade.D3"                     "dti..18.24."                     
## [51] "sub_grade.B5"                     "sub_grade.C3"                    
## [53] "emp_length...1.year"              "grade.F"                         
## [55] "sub_grade.D4"                     "addr_state.AZ"                   
## [57] "sub_grade.C4"                     "sub_grade.C5"                    
## [59] "sub_grade.E2"                     "emp_length1.3.years.annual_inc4" 
## [61] "emp_length7.9.years.annual_inc2"  "sub_grade.D2"                    
## [63] "sub_grade.B4"                    
## Number of terminal nodes:  552 
## Residual mean deviance:  0.2012 = 1999 / 9936 
## Misclassification error rate: 0.04872 = 511 / 10488

#Determining the accuracy
size = 60
pruned_tree = prune.tree(lc.full.tree, best = size)
tr_acc <- predictor(pruned_tree, train, cutoff)
tst_acc <- predictor(pruned_tree, test, cutoff)

tst_acc

## [1] 0.8928889