This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
This exercise is under construction. Please report any errors at https://forms.gle/2W4tffs4YJA1jeBv9
Goal: Understand and experience decision trees regression to predict the probability of loan default (due to fraud or other reasons). Build skills and confidence to search for online help.
Background: The data for this question contains information about borrowers, loans, and the outcome (defaulted or paid). We are concerned about minimize loss, and not concerned whether the default was intentional or unintentional. I developed this assignment to walk you through the process because I couldn’t find any assignment at this level that can balance fundamentals and practical aspects. The data has been adapted from https://campus.datacamp.com/courses/credit-risk-modeling-in-r/ (but my approach is quite different).
Before starting: 1. You are not allowed to: 1a. Search for solutions to this assignment 2b. Subcontract your assignment to someone else 2. You are allowed to: 2a. Search information about packages and functions you may use 2b. Consult with your team mates.
Individual assignment only: 50 total points (Rmd and html solution) Team assignment: None
Start by entering your name and today’s date in Lines 3 and 4, respectively, to indicate your compliance with the Fuqua Honor Code. Then, run the chunk of code below by clicking on the green arrow (that points to the right) on the top right of the chunk. Tip: I numbered code chunks corresponding to their numbers. Chunk 1 specifies the knitting parameters.
Read and store the data from the file LoanData.rds into a variable called loanData. Then, inspect the data using 2 or more R commands. Tip: Use Google to learn about rds file format and how to read it into R. You’ll likely need some libraries and packages. Rubric: 1 point for storing; 1 points each for using 2 R commands for inspecting.*
#install.packages("tidyverse")
library(dplyr)
##
## 载入程辑包:'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
loanData = read_rds("LoanData.rds")
glimpse(loanData) #Glimpse is a dplyr function that is a smarter variation of str.
## Rows: 29,092
## Columns: 8
## $ isLoanDefault <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, …
## $ loanAmount <int> 5000, 2400, 10000, 5000, 3000, 12000, 9000, 3000, 1000…
## $ interestRate <dbl> 10.65, NA, 13.49, NA, NA, 12.69, 13.49, 9.91, 10.65, 1…
## $ creditGrade <fct> B, C, C, A, E, B, C, B, B, D, C, A, B, A, B, B, B, B, …
## $ employmentYears <int> 10, 25, 13, 3, 9, 11, 0, 3, 3, 0, 4, 13, 1, 6, 17, 13,…
## $ homeLiving <fct> RENT, RENT, RENT, RENT, RENT, OWN, RENT, RENT, RENT, R…
## $ incomeAnnual <dbl> 24000.00, 12252.00, 49200.00, 36000.00, 48000.00, 7500…
## $ ageYears <int> 33, 31, 24, 39, 24, 28, 22, 22, 28, 22, 23, 27, 30, 24…
summary(loanData)
## isLoanDefault loanAmount interestRate creditGrade employmentYears
## Min. :0.0000 Min. : 500 Min. : 5.42 A:9649 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 5000 1st Qu.: 7.90 B:9329 1st Qu.: 2.000
## Median :0.0000 Median : 8000 Median :10.99 C:5748 Median : 4.000
## Mean :0.1109 Mean : 9594 Mean :11.00 D:3231 Mean : 6.145
## 3rd Qu.:0.0000 3rd Qu.:12250 3rd Qu.:13.47 E: 868 3rd Qu.: 8.000
## Max. :1.0000 Max. :35000 Max. :23.22 F: 211 Max. :62.000
## NA's :2776 G: 56 NA's :809
## homeLiving incomeAnnual ageYears
## MORTGAGE:12002 Min. : 4000 Min. : 20.0
## OTHER : 97 1st Qu.: 40000 1st Qu.: 23.0
## OWN : 2301 Median : 56424 Median : 26.0
## RENT :14692 Mean : 67169 Mean : 27.7
## 3rd Qu.: 80000 3rd Qu.: 30.0
## Max. :6000000 Max. :144.0
##
Before running regression, run the following code to clean the data, and then generate the training (loanTrain) and testing (loanTest) data.
### Do not modify this code - just run it!
loanDataNoOutliers = loanData[!is.na(loanData$incomeAnnual) & !(2500000 < loanData$incomeAnnual), ]
# All the columns have less than 10% missing values. Only two have any missing values at all. These are interestRate and employmentYears.
isMissingInterestRate = is.na(loanDataNoOutliers$interestRate)
isMissingEmploymentYears = is.na(loanDataNoOutliers$employmentYears)
#make a copy of loanDataNoOutliers
loanDataNoOutliersNA = loanDataNoOutliers
loanDataNoOutliersNA$interestRate[isMissingInterestRate] = median(loanDataNoOutliersNA$interestRate, na.rm = TRUE)
naInterestRate = as.integer(isMissingInterestRate)
loanDataNoOutliersNA = cbind(loanDataNoOutliersNA, naInterestRate)
loanDataNoOutliersNA$employmentYears[isMissingEmploymentYears] = median(loanDataNoOutliersNA$employmentYears, na.rm = TRUE)
naEmploymentYears = as.integer(isMissingEmploymentYears)
loanDataNoOutliersNA = cbind(loanDataNoOutliersNA, naEmploymentYears)
#summary(loanDataNoOutliersNA)
set.seed(2020)
loanTestIndices = sample(1:nrow(loanDataNoOutliersNA), nrow(loanDataNoOutliersNA)*1/3, replace = FALSE) # replace = FALSE is not needed by R but lowers my anxiety level
loanTest = loanDataNoOutliersNA[loanTestIndices, ]
loanTrain = loanDataNoOutliersNA[-loanTestIndices, ]
#str(loanTest)
#str(loanTrain)
Suppose an independent data aggregator is selling a new logical feature called xyz (that is 0 or 1) along with the default data for each of your cases. Use Gini to determine whether it is useful or not to purchase the feature X based on the following data: a. There are 1000 defaults and 8015 non-defaults cases when xyz is 1 (“left” cases). b. There are 2227 defaults and 17850 non-default cases when xyz is 0 (“right” cases). Tip: I have already calculated the Gini of the root for you. 1. Compute the gini_left and gini_right using the getGini function already written for you. 2. Compute the gini_gain based on the following formula: gini_root - (count(left)/count(total) * gini_left) - (count(right)/count(total) * gini_right)) Tip: See BVV pp. 138-189 for an example. Rubric: 2 each point for the gini_left and gini_right, 2 points for gini_gain
###Do not modify the code in the next 8 lines
getGini = function(good, bad) {
total = good+bad
return(2*(bad/total) * (good/ total))
}
defaults = sum(loanData$isLoanDefault)
non_defaults = sum(0 == loanData$isLoanDefault)
gini_root = getGini(non_defaults,defaults)
###Do not modify the code above this line
###Add your code below this line
# Given information
defaults_left = 1000
non_defaults_left = 8015
defaults_right = 2227
non_defaults_right = 17850
# Gini index for the left and right cases
gini_left = getGini(non_defaults_left, defaults_left)
gini_right = getGini(non_defaults_right, defaults_right)
# Compute the Gini gain
gini_gain = gini_root - ((non_defaults_left + defaults_left) / (non_defaults + defaults) * gini_left) -
((non_defaults_right + defaults_right) / (non_defaults + defaults) * gini_right)
# Output the results
cat("Gini index for left cases:", gini_left, "\n")
## Gini index for left cases: 0.1972432
cat("Gini index for right cases:", gini_right, "\n")
## Gini index for right cases: 0.1972381
cat("Gini gain:", gini_gain, "\n")
## Gini gain: 4.622219e-12
Based on the gini_gain in the above section, how much value do you ascribe to the data for feature X
### This section doesn't require code but feel free to reprint any critical values.
#Gini gain is very close to zero (4.622219e-12), which suggests that the new feature X doesn't contribute significantly to the overall predictive power of the model.
Load the package rpart to use decision trees without writing all the code.
library(rpart)
Run the following code to use rpart to build a decision tree.
###Do not modify the code in this chunk
tree_defaults = rpart(isLoanDefault ~ ., method="class", data = loanTrain)
Run the following code to use to plot the decision tree you generated. Then, explain the results. Tip: You will likely get an error and will have to comment out this code line to knit. To explain consider what’s peculiar about the distribution of fraud data.
###Do not modify the code in this chunk
#plot(tree_defaults)
#Error in plot.rpart(tree_defaults, uniform = TRUE) :
#fit is not a tree, just a root
### Explain your results below...
# I think there is a significant imbalance, because no-default has 25865cases but default only has 3227cases, it can lead to this issue.
Now, we will trick R by giving it a balanced dataset where 1/3 of the cases are default and 2/3 are are not default. Run the following code and then enter a comment before each line to explain what that line is doing in plain English. Rubric: 5 points for the explanation (1 for each line).
###Enter comments without modifying the code in this chunk
#This line filters rows from the loanData dataframe where the isLoanDefault = 1 (indicating a default).
loanTrain_default = loanData[1 ==loanData$isLoanDefault, ]
#This line filters rows from the loanData dataframe where the isLoanDefault = 0 (indicating no default).
loanTrain_nodefault = loanData[0 ==loanData$isLoanDefault, ]
#This line indices will be used to create a balanced training dataset by sampling(default:nodefault=1:2).
loanTestIndices = sample(1:nrow(loanTrain_nodefault), 2*nrow(loanTrain_default), replace = FALSE) # replace = FALSE is not needed by R but lowers my anxiety level
#This line ensures that the number of non-defaulted loan cases is equal to twice the number of defaulted cases.
loanTrain_nodefault = loanTrain_nodefault[loanTestIndices,]
#This line combines default and nodefault to create a balanced training dataset where 1/3 are defaulted and 2/3 are non-defaulted.
loanTrain_balanced = rbind(loanTrain_default,loanTrain_nodefault)
Now, call rpart again using loanTrain_balanced and also add this parameter setting to rpart control = rpart.control(cp = 0.001). Then, plot the tree again using plot command. Tip: The parameter cp is the complexity parameter that sets the threshold value for a decrease in overall lack of fit for any split. If cp is not met, further splits will no longer be pursued. cp’s default value is 0.01, but for complex problems, it is advised to relax cp (to a lower value). Tip: In your call to plot set uniform = TRUE. Rubric: 4 points for rpart, 2 points for plot.
tree_defaults_balanced = rpart(isLoanDefault ~ ., method="class", data = loanTrain_balanced, control = rpart.control(cp = 0.001)) #Add your "control" parameter code here#
plot(tree_defaults_balanced, uniform = TRUE)
Now, stop and admire the beauty of what you just did for as long as you want! This is our tree and we will help it grow. Then, explain what was the effect of uniform = TRUE.
### This section doesn't require code but feel free to reprint any critical values.
# The command ensures that the decision tree is visualized in a more standardized and readable format, which can be particularly useful for understanding the structure of the tree and conveying information to others.Setting uniform to TRUE balances the tree for better visualization with equal-sized branches.
Let’s nurture our tree so it looks even better starting by adding the parameter parms = list(prior = c(0.67, 0.33). This changes the proportion of non-defaults to 0.67, and of defaults to 0.33 (they should always sum up to 1). Also, add labels to the decision tree by entering text(tree_defaults_balanced_parms) in the line after the plot. Tip: Don’t delete control = rpart.control(cp = 0.001). Tip: Don’t worry about text readability. Rubric: 4 points for rpart, 2 points for plot with labels.
tree_defaults_balanced_parms = rpart(isLoanDefault ~ ., method="class", data = loanTrain_balanced,
control = rpart.control(cp = 0.001),
parms = list(prior = c(0.67, 0.33)))
plot(tree_defaults_balanced_parms, uniform = TRUE)
text(tree_defaults_balanced_parms)
Let’s continue to nurture our tree by including a loss matrix. You can do this by adding the following to the parms: parms = list(loss = matrix(c(0, M, 1, 0), ncol = 2)), where M is greater than 1. (This means an actual 1, predicted as 0 (i.e., a false negative) costs M times more that a false positive!) Choose an M that helps you improve the tree. Store the result in tree_defaults_balanced_parms_lossmatrix. Tip: Your parms parameter will now look like this: parms = list(loss = matrix(c(0, M, 1, 0), ncol = 2), prior = c(0.67, 0.33)). Tip: The loss matrix changes the relative importance of misclassification of a default as a non-default versus a non-default as a default. We are stressing that misclassifying a default as a non-default should be penalized more heavily. Including a loss matrix can again be done in the argument parms in the loss matrix: parms = list(loss = matrix(c(0, cost_def_as_nondef, cost_nondef_as_def, 0), ncol=2)). This constructs a 2x2-matrix with zeroes on the diagonal and changed loss penalties off-diagonal (by making the second parameter is higher). The default loss matrix is all ones off-diagonal, i.e, c(0, 1, 1, 0)). I suggest choosing something more than 5 for sure. Also, add cex parameter to the plot to scale the text size. plot(tree_defaults_balanced_parms_lossmatrix) text(tree_defaults_balanced_parms, cex = 0.25) Rubric: 4 points for rpart, 2 points for the plot.
# Choose an appropriate value for M (greater than 1)
M = 10
tree_defaults_balanced_parms_lossmatrix = rpart(isLoanDefault ~ ., method = "class", data = loanTrain_balanced,
control = rpart.control(cp = 0.001),
parms = list(loss = matrix(c(0, M, 1, 0), ncol = 2), prior = c(0.67, 0.33)))
plot(tree_defaults_balanced_parms_lossmatrix, cex = 0.75)
text(tree_defaults_balanced_parms, cex = 0.25)
Now, prune the tree_defaults_balanced_parms_lossmatrix tree using prune with a parameter between 0.001 and 0.004 and store the result in tree_lovely. Then, plot tree_lovely with text. Tip: Lots of work went into giving you that range. Don’t worry about that. Just choose a value that looks good to you. Rubric: 4 points for prune, 2 points for the plot.
tree_lovely = prune(tree_defaults_balanced_parms_lossmatrix, cp = 0.0033)
plot(tree_lovely)
text(tree_defaults_balanced_parms, cex = 0.5)
Knit to html after eliminating all the errors. Submit both the Rmd and html files. Tip: Do not worry about minor formatting issues.* Tip: This will take some time as you are processing medium size data sets.
### This section doesn't require code. Just knit and submit the Rmd and html files.###