Loan Data Analysis

Introduction

The objective of this analysis is to explore the Lending Club loan data set for Q1 2017 provided on Kaggle. We will go through data cleaning and then explore potential relationships between features and bad loans, in order to find a way of predicting bad loans. This is useful not only for avoiding bad loans, but also for personalizing services. The data can be found at https://data.world/lpetrocelli/lendingclub-loan-data-2017-q-1

Loading the data set

The CSV file is comma-separated with values wrapped in quotes, and on top of that we need to clean some non-standard characters, so loading it is a bit more involved than usual.
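
The code relies on several packages. A plausible set of library calls, inferred from the functions used throughout this analysis (not part of the original document), is:

library(stringr)       # str_replace_all
library(dplyr)         # select, group_by, summarize, filter, mutate_if
library(ggplot2)       # ggplot-based charts
library(caret)         # findCorrelation
library(corrplot)      # corrplot
library(FactoMineR)    # PCA
library(factoextra)    # fviz_eig, fviz_cluster
library(randomForest)  # randomForest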

# read the raw lines, strip non-printable characters and surrounding quotes, then parse as CSV
lines <- readLines("LoanStats_2017Q1 2.csv")
lines <- str_replace_all(lines, "[^[:graph:]]", " ")
lines <- gsub('(^"|"$)', "", lines)
lines <- gsub("\"", "", lines)
loans <- read.csv(textConnection(lines), sep = ',', stringsAsFactors = FALSE)

Transforming the data

Unnecessary columns

Let us remove some columns we believe are not relevant at this stage. For now we drop the IDs and the free-text employment title, keeping a backup copy of the identifier columns (IDs, employment title, zip code and state) in case we need them later. We also drop the URL, description, purpose and title fields, as well as the secondary-applicant columns.

removedColumns<-select(loans, c(id,member_id,emp_title, zip_code,addr_state))
loans<-select(loans, -c(id,member_id,emp_title))
loans<-select(loans, -c(url,desc,purpose,title,
                        sec_app_earliest_cr_line,
                        sec_app_inq_last_6mths,
                        sec_app_mths_since_last_major_derog))

Convert percentage columns

Strip the % sign from the percentage columns and convert them to numeric values.

loans$int_rate<-as.numeric(gsub("[\\%]", "", loans$int_rate))
loans$revol_util<-as.numeric(gsub("[\\%]", "", loans$revol_util))
loans$total_acc<-as.numeric(gsub("[\\%]", "", loans$total_acc))

Clean employment length

Strip the text from the employment length field, keeping only the digits (plus the '<' and '+' signs), then convert it to a factor.

loans$emp_length<-gsub("[^0-9+<]","",loans$emp_length)
loans$emp_length<-as.factor(loans$emp_length)
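
To verify the cleaning, we can tabulate the resulting levels (a quick check whose output is not shown here):

table(loans$emp_length)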

Identify numeric columns

Identify the columns that should be numeric and convert them.

numCols=c("annual_inc","loan_amnt","funded_amnt", "out_prncp","out_prncp_inv","dti","delinq_2yrs",
        "mths_since_last_delinq","mths_since_last_record","earliest_cr_line",
        "inq_last_6mths","total_pymnt","last_pymnt_amnt", "annual_inc_joint","dti_joint","mths_since_last_major_derog")
          
for (col in numCols) {
  # non-numeric entries (e.g. dates stored as text) are coerced to NA here
  loans[[col]]<-as.numeric(loans[[col]])
}

Replace null values

Replace NA values in the columns identified so far with 0. Then identify the numeric columns for further analysis.

# remove NA values from known columns so far and replace by 0
loans$num_accts_ever_120_pd[is.na(loans$num_accts_ever_120_pd)]<-0  
loans$num_tl_120dpd_2m[is.na(loans$num_tl_120dpd_2m)]<-0  
loans$num_tl_30dpd[is.na(loans$num_tl_30dpd)]<-0  
loans$num_tl_90g_dpd_24m[is.na(loans$num_tl_90g_dpd_24m)]<-0 
loans$num_tl_op_past_12m[is.na(loans$num_tl_op_past_12m)]<-0 
loans$delinq_2yrs[is.na(loans$delinq_2yrs)]<-0 
loans$mths_since_last_delinq[is.na(loans$mths_since_last_delinq)]<-0
loans$mths_since_last_major_derog[is.na(loans$mths_since_last_major_derog)]<-0
# figure out which columns are numeric so that we can look at the distribution
numeric_cols <- sapply(loans, is.numeric)
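
The same zero-filling can be written more compactly as a loop over the affected column names; this is an equivalent sketch, not part of the original code:

# columns to zero-fill, matching the assignments above
zeroFillCols <- c("num_accts_ever_120_pd", "num_tl_120dpd_2m", "num_tl_30dpd",
                  "num_tl_90g_dpd_24m", "num_tl_op_past_12m", "delinq_2yrs",
                  "mths_since_last_delinq", "mths_since_last_major_derog")
for (col in zeroFillCols) {
  loans[[col]][is.na(loans[[col]])] <- 0
}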

Identify bad loans

Identify bad loans and add a flag based on the status values found in the loan_status column.

# 'bad' statuses
badIndicators <- c("Charged Off",
                    "Default",
                    "Does not meet the credit policy. Status:Charged Off",
                    "In Grace Period", 
                    "Default Receiver",
                    "Late (16-30 days)",
                    "Late (31-120 days)")

# assign certain statuses to a 'bad' ('1') group
loans$is_bad <- ifelse(loans$loan_status %in% badIndicators, 1,0)
loans$is_bad<-as.factor(loans$is_bad)
loans<-select(loans,-loan_status)
ggplot(loans, aes(x=is_bad)) + geom_bar(fill = "slateblue")+labs(x="Bad loan flag", y="Count")

We have a very small percentage of bad loans, so the data set is highly imbalanced.
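
To quantify the imbalance, we can look at the share of each class (a quick check; the output is not shown in the original):

# proportion of good (0) vs. bad (1) loans
prop.table(table(loans$is_bad))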

Missing values

First we determine the percentage of NAs in each column, then remove the columns where NA values represent more than 80% of rows. For the remaining columns we replace NA values with 0 for now.

naPercentage <-sapply(loans, function(y) sum(length(which(is.na(y))))/length(y))
hist(naPercentage[naPercentage>0], col = "slateblue", xlab = "Percentage of null values",
     main = "Frequency of null values")

# let us remove all columns with more than 80% NAs
AllNA <- which(naPercentage>0.80)
loans<-loans[,-AllNA]
naPercentage <-sapply(loans, function(y) sum(length(which(is.na(y))))/length(y))
hist(naPercentage[naPercentage>0], col = "slateblue", xlab = "Percentage of null values",
     main = "Frequency of null values after removal")

colnames(loans)[colSums(is.na(loans)) > 0]
##  [1] "loan_amnt"                      "funded_amnt"                   
##  [3] "funded_amnt_inv"                "int_rate"                      
##  [5] "installment"                    "annual_inc"                    
##  [7] "dti"                            "inq_last_6mths"                
##  [9] "open_acc"                       "pub_rec"                       
## [11] "revol_bal"                      "revol_util"                    
## [13] "total_acc"                      "out_prncp"                     
## [15] "out_prncp_inv"                  "total_pymnt"                   
## [17] "total_pymnt_inv"                "total_rec_prncp"               
## [19] "total_rec_int"                  "total_rec_late_fee"            
## [21] "recoveries"                     "collection_recovery_fee"       
## [23] "last_pymnt_amnt"                "tot_coll_amt"                  
## [25] "tot_cur_bal"                    "open_acc_6m"                   
## [27] "open_il_6m"                     "open_il_12m"                   
## [29] "open_il_24m"                    "mths_since_rcnt_il"            
## [31] "total_bal_il"                   "il_util"                       
## [33] "open_rv_12m"                    "open_rv_24m"                   
## [35] "max_bal_bc"                     "all_util"                      
## [37] "total_rev_hi_lim"               "inq_fi"                        
## [39] "total_cu_tl"                    "inq_last_12m"                  
## [41] "acc_open_past_24mths"           "avg_cur_bal"                   
## [43] "bc_open_to_buy"                 "bc_util"                       
## [45] "chargeoff_within_12_mths"       "delinq_amnt"                   
## [47] "mo_sin_old_il_acct"             "mo_sin_old_rev_tl_op"          
## [49] "mo_sin_rcnt_rev_tl_op"          "mo_sin_rcnt_tl"                
## [51] "mort_acc"                       "mths_since_recent_bc"          
## [53] "mths_since_recent_bc_dlq"       "mths_since_recent_inq"         
## [55] "mths_since_recent_revol_delinq" "num_actv_bc_tl"                
## [57] "num_actv_rev_tl"                "num_bc_sats"                   
## [59] "num_bc_tl"                      "num_il_tl"                     
## [61] "num_op_rev_tl"                  "num_rev_accts"                 
## [63] "num_rev_tl_bal_gt_0"            "num_sats"                      
## [65] "pct_tl_nvr_dlq"                 "percent_bc_gt_75"              
## [67] "pub_rec_bankruptcies"           "tax_liens"                     
## [69] "tot_hi_cred_lim"                "total_bal_ex_mort"             
## [71] "total_bc_limit"                 "total_il_high_credit_limit"
loans[is.na(loans)] <- 0

Convert characters to factors

Convert character columns to factors for later use

loans <- mutate_if(loans, is.character, as.factor)

Exploring the data

Let us look at a few distributions: first the loan amount, then the installment amount.

hist(loans$loan_amnt,col = 'slateblue', xlab='Loan Amount', main = 'Distribution by loan amount')

hist(loans$installment,col = 'slateblue', xlab='Installment Amount', main = 'Distribution by installment amount')

Let us look at interest rate by grade

plot(int_rate ~ grade, data = loans, col = 'slateblue', xlab = "Loan grade", ylab = "Interest rate")

Unsurprisingly, interest rates are higher for lower grades.
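
To quantify the trend, we can also compute the mean interest rate per grade (an illustrative check, not part of the original output):

loans %>% group_by(grade) %>% summarize(mean_int_rate = mean(int_rate, na.rm = TRUE))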

Let us look at the number of loans by credit sub-grade

fdc<- loans%>% group_by(sub_grade) %>% summarize(count=n())
ggplot(fdc,aes(x=sub_grade,y=count))+geom_col(fill = "slateblue")+labs(x="Loan sub-grade", y="Count", title="Total number of loans by sub-grade")+theme(
      panel.border = element_blank(),  
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      panel.background = element_blank(),
      axis.line = element_line(colour = "grey")
    )

Let us look at accounts ever 120+ days past due as a function of loan sub-grade, averaged over the number of loans.

# accounts ever 120+ days past due, averaged over the number of loans in each sub-grade
fdc<- loans%>% group_by(sub_grade) %>% summarize(PDMean=sum(num_accts_ever_120_pd)/n()) %>% arrange(PDMean)
ggplot(fdc, aes(x=sub_grade, y=PDMean))+geom_col(fill = "slateblue")+labs(x="Loan sub-grade", y="Accounts ever 120+ days past due (mean per loan)",title="Average number of accounts ever 120+ days past due by loan sub-grade")+theme(
  panel.border = element_blank(),  
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  panel.background = element_blank(),
  axis.line = element_line(colour = "grey")
)

Let us look at delinquencies in the last 2 years as a function of loan sub-grade, averaged over the number of loans.

fdc<- loans%>% group_by(sub_grade) %>% summarize(delinq=sum(delinq_2yrs)/n()) %>% arrange(delinq)
ggplot(fdc, aes(x=sub_grade, y=delinq))+geom_col(fill = "slateblue")+labs(x="Loan sub-grade", y="Delinquencies in past 2 years (mean per loan)",title="Average number of delinquencies in the past 2 years by loan sub-grade")+theme(
  panel.border = element_blank(),  
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  panel.background = element_blank(),
  axis.line = element_line(colour = "grey")
)

Surprisingly, there is no clear relationship between the loan sub-grade and the number of delinquencies.
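
One way to put a number on this is a rank correlation between the sub-grade order and the average delinquency (a rough check; it assumes the sub_grade factor levels follow the A1 to G5 order):

# Spearman correlation between sub-grade level codes and mean delinquency
cor(as.numeric(fdc$sub_grade), fdc$delinq, method = "spearman")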

Let us look at bad loans as a function of employment length.

fdc<- loans%>% group_by(emp_length) %>% summarize(badloan=sum(as.numeric(is_bad==1))/n())                                 
                             
ggplot(fdc, aes(x=emp_length, y=badloan))+geom_col(fill = "slateblue")+labs(x="Length of employment", y="Share of bad loans",title="Share of bad loans by employment length")+
  theme(
  panel.border = element_blank(),  
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  panel.background = element_blank(),
  axis.line = element_line(colour = "grey"))

As we can see, there are some invalid employment-length values that we need to remove or correct; let us do that to eliminate the noise.

Remove rows with employment length equal to 911 or 1954.

loans<-loans %>% filter(!emp_length %in% c("911","1954"))

It would also be interesting to look at the number of bad loans in relation to annual income.

fdc<- loans%>% group_by(annual_inc) %>% summarize(badloan=sum(as.numeric(is_bad==1)))                                 
ggplot(subset(fdc,badloan>0), aes(x=annual_inc, y=badloan))+geom_point(col = "slateblue")+labs(x="Annual income", y="Number of bad loans",title="Number of bad loans by annual income")+
  theme(
    panel.border = element_blank(),  
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "grey"))+
    scale_x_continuous(labels = scales::comma)

It seems we have more bad loans for incomes below 200k, but this is also related to the larger number of loans in that segment, so no conclusion can be drawn from this alone.
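
To control for the number of loans in each income segment, we can look at the bad-loan rate per income bracket rather than the raw count; the brackets below are illustrative:

# bad-loan rate by income bracket (the rate normalizes for segment size)
loans %>%
  mutate(inc_bracket = cut(annual_inc,
                           breaks = c(0, 50000, 100000, 150000, 200000, Inf),
                           include.lowest = TRUE)) %>%
  group_by(inc_bracket) %>%
  summarize(bad_rate = mean(is_bad == 1), n = n())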

Further data exploration

Numeric columns

Figure out which columns are numeric so that we can do some modeling with numeric features, and keep only the numeric columns. At the same time, we keep the non-numeric columns, which we will check later on.

numericCols <- sapply(loans, is.numeric)
nonumeric_cols<- sapply(loans, is.factor)

loansnum<-loans[, numericCols]
loansnonum<-loans[,nonumeric_cols]

Removing correlated variables

Highly correlated features convey little extra information, so it is reasonable to remove them. Let us take a look.

# correlations to drop

indexesToDrop <- findCorrelation(cor(loansnum), cutoff = 0.8)
corrplot(cor(loansnum[,indexesToDrop]),type = "upper",order="hclust",
         col = cm.colors(100), tl.col = "black")

loansnum<-loansnum[,-indexesToDrop]

We take a cutoff of 0.8 and remove the correlated columns. For example, it is quite clear that the loan installment is highly correlated with the loan amount.
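
As a quick confirmation of that example, we can compute the correlation directly (a sanity check; the output is not shown):

# correlation between loan amount and installment in the full data set
cor(loans$loan_amnt, loans$installment)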

PCA (Principal Component Analysis)

Since we have 80 columns in the data set, let us see if we can reduce the dimensionality of the matrix through PCA.

loansnum[is.na(loansnum)] <- 0
fullTrainMatrix<-as.matrix(as.data.frame(lapply(loansnum, as.numeric)))
fullTrainMatrix<-scale(fullTrainMatrix)
res.PCA<-PCA(fullTrainMatrix, scale.unit = TRUE, ncp = 5, graph = FALSE)
fviz_eig(res.PCA, addlabels = TRUE, ylim = c(0, 50))

It is difficult to identify a few dimensions that account for a large portion of the variance, hence PCA does not bring much in our case.
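
To back this up, we can inspect the cumulative percentage of variance explained by the components, using factoextra's get_eigenvalue (output not shown here):

# eigenvalues with percentage and cumulative percentage of explained variance
head(get_eigenvalue(res.PCA))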

KMeans

Identifying ideal number of clusters for numeric columns

Using the elbow method we can try to identify the ideal number of clusters if any.

loansnum[is.na(loansnum)] <- 0
fullTrainMatrix<-as.matrix(as.data.frame(lapply(loansnum, as.numeric)))
fullTrainMatrix<-scale(fullTrainMatrix)
# Initialize total within sum of squares error: wss
wss <- 0
gc()
##            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
## Ncells  2960359 158.2    4783452 255.5         NA  4783452 255.5
## Vcells 33638606 256.7   65858486 502.5      16384 65837302 502.3
# For 1 to 10 cluster centers
for (i in 1:10) {
  km.out <- kmeans(fullTrainMatrix, centers = i, nstart=10)
  # Save total within sum of squares to wss variable
  wss[i] <- km.out$tot.withinss
}

# Plot total within sum of squares vs. number of clusters
plot(1:10, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")

Let us try 3 clusters based on the elbow plot, although it does not look like the data contains well-separated clusters.

cluster<-kmeans(fullTrainMatrix, centers=3, iter.max = 20)
fviz_cluster(cluster, data = fullTrainMatrix,
             #palette = c("#00AFBB","#2E9FDF", "#E7B800", "#FC4E07"),
             ggtheme = theme_minimal(),
             main = "Partitioning Clustering Plot")

As we can see, there are no real clusters; the data is simply split into 3 groups arbitrarily.
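
A simple way to quantify how weak the clustering is uses the ratio of between-cluster to total sum of squares from the kmeans fit; values close to 1 would indicate well-separated clusters:

# share of total variance captured by the 3-cluster partition
cluster$betweenss / cluster$totss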

Random Forest

Random forests allow us to find good predictors within a set of features. We will start with the numeric columns only.

Numeric columns

Using the random forest variable importance results, let us identify the strongest predictors in the data set. We start by training a random forest.

# random forest
loansnum$isBad<-loans$is_bad
loansnum[ , "isBad"][is.na(loansnum[ , "isBad"])]<-0

loansnonum$isBad<-loansnum$isBad
set.seed(101)
modelRF <- randomForest(isBad ~ ., data=loansnum, ntree=20, importance = TRUE)
plot(modelRF$err.rate[, 1], type = "l", xlab = "Number of trees", ylab = "Prediction Error")

The prediction error decreases with the number of trees, which is a good sign that a random forest can predict bad loans from this data set.
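
Given the class imbalance, the overall error rate can be misleading; the per-class out-of-bag confusion matrix of the fitted forest is more informative (a quick check; the output is not shown):

# OOB confusion matrix with per-class error rates
modelRF$confusion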

We can now check which variables are most important for the prediction. We display only the first 10 below.

important <- importance(modelRF, type=1 )
Important_Features <- data.frame(Feature = row.names(important), Importance = important[, 1])
Important_Features <-Important_Features[order(-Important_Features$Importance),]
ggplot(Important_Features[1:10,], 
                aes(x= reorder(Feature,
                               Importance) , y = Importance) ) +
  geom_bar(stat = "identity", 
           fill = "slateblue") +
  coord_flip() +
  theme_light() +
  xlab("") + 
  ylab("Importance")+
  ggtitle("Important Features in Random Forest\n")

In order of predictive importance, the top 5 features are:

1. Ratio of total current balance to high credit/credit limit for all bankcard accounts
2. Annual income
3. Late fees received to date
4. Remaining outstanding principal for the portion of the total amount funded by investors
5. Number of open credit lines in the borrower's credit file

Non numeric features

Let us also find the important features among the non-numeric variables, again using a random forest.

loansnonum<-select(loansnonum,-c("verification_status","policy_code"))
loansnonum<-select(loansnonum,-c("next_pymnt_d"))
loansnonum<-select(loansnonum,-c("initial_list_status"))
loansnonum<-select(loansnonum,-c("last_pymnt_d"))
loansnonum<-select(loansnonum,-c("issue_d"))
loansnonum$zip_code<-as.numeric(loansnonum$zip_code)
loansnonum$addr_state<-as.numeric(loansnonum$addr_state)

loansnonum<-select(loansnonum,-is_bad)
loansnonum<-loansnonum[rowSums(is.na(loansnonum)) == 0,]
set.seed(101)
modelRF <- randomForest(isBad ~ ., data=loansnonum, ntree=20, importance = TRUE)
important <- importance(modelRF, type=1 )
Important_Features <- data.frame(Feature = row.names(important), Importance = important[, 1])
Important_Features <-Important_Features[order(-Important_Features$Importance),]
ggplot(Important_Features, 
                aes(x= reorder(Feature,
                               Importance) , y = Importance) ) +
  geom_bar(stat = "identity", 
           fill = "slateblue") +
  coord_flip() +
  theme_light() +
  xlab("") + 
  ylab("Importance")+
  ggtitle("Important Features in Random Forest\n")

Conclusion

As we can see, most of these features already encode the fact that the loan is past due or delinquent, so they do not help us detect in advance whether a loan is going to default. Hence we need to identify features that are fully independent of the outcome, for example annual income, zip code, home ownership and so on. Let us try to build a predictive model based on these observations.
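
A minimal sketch of such a model, assuming the listed application-time columns are still present after the cleaning above (the feature choice is illustrative, not a final selection), could be a simple logistic regression:

# keep only application-time features plus the target flag
appCols <- intersect(c("annual_inc", "home_ownership", "emp_length", "dti", "is_bad"),
                     names(loans))
appData <- loans[, appCols]
# logistic regression of the bad-loan flag on application-time features
modelGLM <- glm(is_bad ~ ., data = appData, family = binomial)
summary(modelGLM)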