If any issues, questions or suggestions feel free to reach me out via e-mail wieczynskipawel@gmail.com or Linkedin. You can also visit my Github.
This is R replication of the code and exercises from the Udemy course “Credit Risk Modeling in Python 2022”.
if(!require('pacman')) install.packages('pacman')
pacman::p_load(dplyr, stringr, lubridate, fastDummies)
Dataset comes from kaggle.com and consists of data on 466285 consumer loans and has 75 attributes.
data <- read.csv('loan_data_2007_2014.csv')
head(data)
First, we convert attribute emp_length to numeric, but instead of using multiple times str_extract function (as is done in the course) we utilize regular expression here.
data$emp_length %>% unique()
[1] "10+ years" "< 1 year" "1 year" "3 years" "8 years" "9 years" "4 years" "5 years" "6 years"
[10] "2 years" "7 years" ""
data <- data %>%
mutate(emp_length = if_else(
emp_length %in% c('', '< 1 year')
,0L
,str_extract(emp_length, '[0-9]{1,2}') %>% as.integer())
)
data$emp_length %>% unique()
[1] 10 0 1 3 8 9 4 5 6 2 7
Similarly as above with the attribute term, but this part was left as an exercise for the learner.
data$term %>% unique()
[1] " 36 months" " 60 months"
data <- data %>%
mutate(term = str_extract(term, '[0-9]{2}') %>% as.integer())
data$term %>% unique()
[1] 36 60
Here we convert attribute earliest_cr_line to the date class and then calculate number of months between this attribute and 01-Dec-2017. Some results are negative bcos e.g. Jan-68 is converted to January 2068. In such cases we set months_since_earliest_cr_line to max of this attribute.
data<- data %>%
mutate(earliest_cr_line = as.Date(paste0('01-', earliest_cr_line), format = '%d-%b-%y')) %>%
mutate(months_since_earliest_cr_line = interval(earliest_cr_line, ymd(171201)) %/% months(1)) %>%
mutate(months_since_earliest_cr_line = if_else(months_since_earliest_cr_line < 0
,max(months_since_earliest_cr_line)
,months_since_earliest_cr_line))
summary(data$months_since_earliest_cr_line)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
73.0 183.0 225.0 241.5 285.0 587.0 1198
Similarly as above with the attribute issue_d, but this part was left as an exercise for the learner.
data <- data %>%
mutate(issue_d = as.Date(paste0('01-', issue_d), format = '%d-%b-%y')) %>%
mutate(months_since_issue_d = interval(issue_d, ymd(171201)) %/% months(1))
summary(data$months_since_issue_d)
Min. 1st Qu. Median Mean 3rd Qu. Max.
36.00 41.00 47.00 51.26 57.00 126.00
Few categorical columns are converted to dummy variables using function dummy_cols from the package fastDummies.
data <- dummy_cols(data, select_columns = c('grade', 'sub_grade', 'home_ownership', 'verification_status', 'loan_status', 'purpose', 'addr_state', 'initial_list_status'))
colnames(data)[78:ncol(data)] %>% cat()
grade_A grade_B grade_C grade_D grade_E grade_F grade_G sub_grade_A1 sub_grade_A2 sub_grade_A3 sub_grade_A4 sub_grade_A5 sub_grade_B1 sub_grade_B2 sub_grade_B3 sub_grade_B4 sub_grade_B5 sub_grade_C1 sub_grade_C2 sub_grade_C3 sub_grade_C4 sub_grade_C5 sub_grade_D1 sub_grade_D2 sub_grade_D3 sub_grade_D4 sub_grade_D5 sub_grade_E1 sub_grade_E2 sub_grade_E3 sub_grade_E4 sub_grade_E5 sub_grade_F1 sub_grade_F2 sub_grade_F3 sub_grade_F4 sub_grade_F5 sub_grade_G1 sub_grade_G2 sub_grade_G3 sub_grade_G4 sub_grade_G5 home_ownership_ANY home_ownership_MORTGAGE home_ownership_NONE home_ownership_OTHER home_ownership_OWN home_ownership_RENT verification_status_Not Verified verification_status_Source Verified verification_status_Verified loan_status_Charged Off loan_status_Current loan_status_Default loan_status_Does not meet the credit policy. Status:Charged Off loan_status_Does not meet the credit policy. Status:Fully Paid loan_status_Fully Paid loan_status_In Grace Period loan_status_Late (16-30 days) loan_status_Late (31-120 days) purpose_car purpose_credit_card purpose_debt_consolidation purpose_educational purpose_home_improvement purpose_house purpose_major_purchase purpose_medical purpose_moving purpose_other purpose_renewable_energy purpose_small_business purpose_vacation purpose_wedding addr_state_AK addr_state_AL addr_state_AR addr_state_AZ addr_state_CA addr_state_CO addr_state_CT addr_state_DC addr_state_DE addr_state_FL addr_state_GA addr_state_HI addr_state_IA addr_state_ID addr_state_IL addr_state_IN addr_state_KS addr_state_KY addr_state_LA addr_state_MA addr_state_MD addr_state_ME addr_state_MI addr_state_MN addr_state_MO addr_state_MS addr_state_MT addr_state_NC addr_state_NE addr_state_NH addr_state_NJ addr_state_NM addr_state_NV addr_state_NY addr_state_OH addr_state_OK addr_state_OR addr_state_PA addr_state_RI addr_state_SC addr_state_SD addr_state_TN addr_state_TX addr_state_UT addr_state_VA addr_state_VT addr_state_WA addr_state_WI addr_state_WV addr_state_WY initial_list_status_f initial_list_status_w
Imputing missing values left as an exercise for the learner:
- annual_income missing values replaced with average
income
- missing values for few other attributes replaced with 0.
data <- data %>%
mutate(annual_inc = if_else(is.na(annual_inc), mean(annual_inc, na.rm = TRUE), annual_inc))
cols <- c('months_since_earliest_cr_line', 'acc_now_delinq', 'total_acc', 'pub_rec'
,'open_acc', 'inq_last_6mths', 'delinq_2yrs')
data %>% select(cols) %>% sapply(function(x) is.na(x) %>% sum())
months_since_earliest_cr_line acc_now_delinq total_acc
1198 29 29
pub_rec open_acc inq_last_6mths
29 29 29
delinq_2yrs
29
for (col in cols) {
data[[col]] <- if_else(is.na(data[[col]]), 0, data[[col]])
}
data %>% select(cols) %>% sapply(function(x) is.na(x) %>% sum())
months_since_earliest_cr_line acc_now_delinq total_acc
0 0 0
pub_rec open_acc inq_last_6mths
0 0 0
delinq_2yrs
0
write.csv(data, 'preprocessed_loan_data_2007_2014.csv', row.names = FALSE)