1. Introduction

The Lending Club data contains complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. Financial decision making regarding to the credit risks is one of the crucial operations for the lending businesses. In this analysis, I present here exploratory data analysis, visualizations, word cloud, loan outcome prediction and lots of other interesting insights.

# read data file 
data <- read.csv("loan.csv")
# See dimension of dataset
dim(data)
[1] 887379     74

1.1 Data Overview

The data is sourced from the Kaggle website https://www.kaggle.com/wendykan/lending-club-loan-data

The LendingClub is a matrix of about 890 thousand observations and 74 variables. Some important features:

  • loan_amnt - Amount of money requested by the borrower.

  • int_rate - Interest rate of the loan.

  • grade - Loan grade with categories A, B, C, D, E, F, G.

  • loan_status - Current loan status, i.e. approved, default, late, etc.

  • annual_inc - Borrowers annual income.

  • purpose - The primary purpose of borrowing.

2. Risk Modelling

In this section, I will fit logistics regression model for loan outcome and predict future loan status. I have divided it into four subsections through different steps.

2.1 Select Variables

The original dataset contains 74 columns, which are not necessarily useful and relevant in regression model. In addition, there are some missing values. Thus I did feature selection and data cleaning first. I select the following features:

  • loan_status - Current loan status, binary variable.

  • loan_amnt - Amount of money requested by the borrower.

  • int_rate - Interest rate of the loan.

  • emp_length - Borrowers’ employment length.

  • grade - Loan grade with categories A, B, C, D, E, F, G.

  • annual_inc - Borrowers’ annual income.

  • home_ownership - A form of housing tenure: rent/own/mortgage/none, factors.

  • term - Regular payments over a set period of time, factors.

loan <- data %>%
        select(loan_status, loan_amnt, int_rate, grade, emp_length, home_ownership, annual_inc, term)
loan <- loan %>%
        filter(!is.na(annual_inc) , 
               !(home_ownership %in% c('NONE' , 'ANY')) , 
               emp_length != 'n/a')
loan$grade <- as.factor(loan$grade)
loan$home_ownership <- as.factor(loan$home_ownership)
loan$term <- as.factor(loan$term)

2.2 Binary Outcomes

I neet to establish a binary outcome, i.e. 0 as default, 1 as non-default. we have 10 outcomes at indecated above, I select “Fully paid” as 1; then “Charged off”,“Late(31-120 days)”,“Late(16-30days)”,“Default”, “In Grace Period” as 0. The next step is to select loan data only with 0 and 1.

loan <- loan %>%
        mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off' , 'Default', 'Late (16-30 days)','Late (31-120 days)','In Grace Period') , 0, 
                                     ifelse(loan_status == 'Fully Paid' , 1 , 'No info')))
loan2 <- loan %>%select(-loan_status) %>%filter(loan_outcome %in% c(0 , 1))

2.3 Fit Model

Split the dataset by 6:4, as training set and test set. The new dataset has 263281 rows, thus training set has 151446 rows, and test set has 100962.

dim(loan2)
[1] 263281      8
loan2$loan_outcome = as.numeric(loan2$loan_outcome)
training <- loan2[1:151446, ]
test <- loan2[151447:263281, ]
# Fit logistic regression
model1 <- glm(loan_outcome ~ loan_amnt + int_rate + grade + emp_length + home_ownership + annual_inc + term, data = training, family = binomial(link = 'logit'))
summary(model1)

Call:
glm(formula = loan_outcome ~ loan_amnt + int_rate + grade + emp_length + 
    home_ownership + annual_inc + term, family = binomial(link = "logit"), 
    data = training)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.5671   0.3449   0.5380   0.6911   1.3784  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          3.404e+00  5.179e-02  65.735  < 2e-16 ***
loan_amnt           -1.364e-05  1.045e-06 -13.050  < 2e-16 ***
int_rate            -1.361e-01  4.930e-03 -27.605  < 2e-16 ***
gradeB              -1.764e-01  3.436e-02  -5.132 2.87e-07 ***
gradeC              -1.769e-01  4.601e-02  -3.846  0.00012 ***
gradeD              -8.686e-02  5.863e-02  -1.482  0.13843    
gradeE               8.298e-02  7.214e-02   1.150  0.25005    
gradeF               2.365e-01  8.576e-02   2.757  0.00583 ** 
gradeG               3.116e-01  1.081e-01   2.881  0.00396 ** 
emp_length1 year     3.033e-02  3.463e-02   0.876  0.38113    
emp_length10+ years -4.355e-02  2.665e-02  -1.634  0.10222    
emp_length2 years    2.540e-02  3.179e-02   0.799  0.42424    
emp_length3 years   -2.321e-03  3.279e-02  -0.071  0.94357    
emp_length4 years    4.186e-02  3.490e-02   1.199  0.23043    
emp_length5 years    2.602e-02  3.320e-02   0.784  0.43319    
emp_length6 years   -4.306e-02  3.494e-02  -1.232  0.21779    
emp_length7 years   -3.291e-02  3.598e-02  -0.915  0.36042    
emp_length8 years   -6.148e-02  3.871e-02  -1.588  0.11226    
emp_length9 years   -6.663e-02  4.138e-02  -1.610  0.10733    
home_ownershipOTHER -2.798e-01  2.230e-01  -1.254  0.20973    
home_ownershipOWN   -7.105e-02  2.626e-02  -2.705  0.00682 ** 
home_ownershipRENT  -1.030e-01  1.507e-02  -6.835 8.20e-12 ***
annual_inc           6.633e-06  2.274e-07  29.172  < 2e-16 ***
term 60 months      -4.863e-01  1.835e-02 -26.497  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 150151  on 151445  degrees of freedom
Residual deviance: 138273  on 151422  degrees of freedom
AIC: 138321

Number of Fisher Scoring iterations: 5

loan_amnt, int_rate, grade B, grade C, grade F, grade G, home_ownership type own, home_ownership type rent, annual_inc and term 60 months are clearly significant (the significance p-value is indicated by *** in the summary of the fit).

  • The coefficients loan_amnt, int_rate, grade B, grade C, home_ownership type own, home_ownership type rent, and term 60 months are negative, these coefficients will negative effect the probability of non-default.

  • The coefficients annual_inc, grade F and grade G are positive, these coefficients will positive effect the probability of non-default.

    • An interesting observation is that the coefficients of grade F and grade G are positive, and I need more detailed analysis on these.

The interpretation can be provided in terms of log-odds. Applying model1 to our training dataset, each estimated coefficient is the expected change in the log odds of default for a unit increase in the corresponding predictor variable holding the other predictor variables constant at certain value. Each exponentiated coefficient is the ratio of two odds, or the change in odds in the multiplicative scale for a unit increase in the corresponding predictor variable holding other variables at certain value. Since I set 0 as default, 1 as non-default, the interpretation is a little bit counter-intuitive.

exp(coef(model1))
        (Intercept)           loan_amnt            int_rate              gradeB              gradeC 
         30.0934106           0.9999864           0.8727535           0.8383189           0.8378266 
             gradeD              gradeE              gradeF              gradeG    emp_length1 year 
          0.9168021           1.0865189           1.2667901           1.3655773           1.0307939 
emp_length10+ years   emp_length2 years   emp_length3 years   emp_length4 years   emp_length5 years 
          0.9573850           1.0257297           0.9976816           1.0427457           1.0263591 
  emp_length6 years   emp_length7 years   emp_length8 years   emp_length9 years home_ownershipOTHER 
          0.9578548           0.9676293           0.9403701           0.9355377           0.7559713 
  home_ownershipOWN  home_ownershipRENT          annual_inc      term 60 months 
          0.9314146           0.9021214           1.0000066           0.6149161 
  1. For example, the odds of a borrower will not default in the future with term 60 months over the odds of default with term 36 months is exp(-4.863e-01) = 0.6149161, holding other variables unchanged. In terms of percent change, we can say that the odds for non-default with long-term loan is 61.49% of short-term loan, i.e. long-term loan is more likely default in the future.

2.4 Predict Model On Test Set

# Prediction on test set
predict_model1 <- predict.glm(model1, newdata = test, type = "response")
summary(predict_model1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3048  0.7182  0.8126  0.7876  0.8771  1.0000 

Here I plot the ROC curve to see the sensitivity and specificity. As a result, it gives Area under the curve: 67.12%.

library(pROC)
# Area Under Curve
roc(test$loan_outcome, predict_model1, percent=T, boot.n=1000, ci.alpha=0.9, stratified=T, plot=TRUE, grid=TRUE, show.thres=TRUE, legacy.axes = TRUE, reuse.auc = TRUE, print.auc = TRUE, print.thres.col = "blue", ci=TRUE, ci.type="bars", print.thres.cex = 0.7, col = 'red', main = paste("ROC curve using","(N = ",nrow(test),")") )

Call:
roc.default(response = test$loan_outcome, predictor = predict_model1,     percent = T, ci = TRUE, plot = TRUE, boot.n = 1000, ci.alpha = 0.9,     stratified = T, grid = TRUE, show.thres = TRUE, legacy.axes = TRUE,     reuse.auc = TRUE, print.auc = TRUE, print.thres.col = "blue",     ci.type = "bars", print.thres.cex = 0.7, col = "red", main = paste("ROC curve using",         "(N = ", nrow(test), ")"))

Data: predict_model1 in 33185 controls (test$loan_outcome 0) < 78650 cases (test$loan_outcome 1).
Area under the curve: 67.13%
95% CI: 66.79%-67.47% (DeLong)

3. Exploratory Data Analysis for Borrowers

In this section, I will detail exploratory data analysis and visualization for borrowers. I have divided it into six subsections through a variety of different visualization.

3.1 Borrowers Home Ownership Type

In total, there are six type of home ownership types. Most borrowers have mortgages(443557) and 356117 borrowers rent house/apartment.

data %>%
  group_by(home_ownership) %>%
  summarize(freq = n()) %>%
  ggplot(aes(reorder(home_ownership, freq), y = freq, fill = freq)) +   
  geom_bar(stat = "identity", position = "dodge") +
  xlab("home_ownership") +
  ylab("Frequency") +
  theme_fivethirtyeight() + 
  theme(legend.position ='none', axis.text.x = element_text(size = 15)) + 
  geom_text(aes(label = freq), vjust = -0.1, size = 4.5) +
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) +
  ggtitle("Home Ownership")

3.2 Borrowers Employment Length

I look at the borrower’s employment length. Most borrowers have 10-year+ employment length.

data %>%
  group_by(emp_length) %>%
  summarize(freq = n()) %>%
  top_n(50) %>%
  ggplot(aes(reorder(emp_length, freq), y = freq, fill = freq)) +   
  geom_bar(stat = "identity", position = "dodge") +
  xlab("Employment Length") +
  ylab("Frequency") +
  coord_flip() +
  theme_fivethirtyeight() + 
  theme(legend.position ='none', axis.text.y = element_text(size = 12)) + 
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) +
  ggtitle("Employment Length")
Selecting by freq

3.3 Borrowers’ Grade

There are seven grades. Most borrowers have grade B.

data %>%
  group_by(grade) %>%
  summarize(freq = n()) %>%
  ggplot(aes(reorder(grade, freq), y = freq, fill = freq)) +   
  geom_bar(stat = "identity", position = "dodge") +
  xlab("grade") +
  ylab("Frequency") +
  theme_fivethirtyeight() + 
  theme(legend.position ='none', axis.text.x = element_text(size = 15)) + 
  geom_text(aes(label = freq), vjust = -0.1, size = 4.5) +
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) +
  ggtitle("Grade")

3.4 Distribution of the loan amount by borrower’s grade

I seperate loan amount by different grades, we can see that the lower the borrower’s grade, the higher amount of borrowing.

ggplot(data, aes(x=loan_amnt,fill=grade))+
  geom_density(alpha = 0.3) + 
  theme(legend.position="none") +
  theme_fivethirtyeight() + 
  xlab("Loan Amount") + 
  ggtitle("Loan Amount by Borrower's Grade") + 
  facet_grid(grade ~ ., scales = "free")

3.5 Borrower’s Annual Income

Here I create bins to examine the annual income distribution.

summary(data$annual_inc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0   45000   65000   75028   90000 9500000       4 
# set up boundaries for intervals/bins
breaks <- c(0,20000,40000,60000,80000,90000,100000,150000,300000, 9500000)
# specify interval/bin labels
labels <- c("<20000", "20000-40000)", "40000-60000)", "60000-80000)", "80000-90000)", "90000-100000)", "100000-150000)", "150000-300000)", "<=9500000")
# bucketing data points into bins
bins <- cut(data$annual_inc, breaks, include.lowest = T, right=FALSE, labels=labels)
# inspect bins
bin <- summary(bins)
freq_inc <- c(10204, 129409, 241593, 203729, 7258i, 54965, 123426, 45865, 5603)
annual_income <- data.frame(labels, freq_inc)
plot(bins, main="Annual Income", ylab="Annual Income")

3.6 Borrower’s Purpose

We observe that debt consolidation is the most frequently used reason to borrow.

data %>%
  group_by(purpose) %>%
  summarize(freq = n()) %>%
  top_n(50) %>%
  ggplot(aes(reorder(purpose, freq), y = freq, fill = freq)) +   
  geom_bar(stat = "identity", position = "dodge") +
  xlab("Purpose") +
  ylab("Frequency") +
  coord_flip() +
  theme_fivethirtyeight() + 
  theme(legend.position ='none', axis.text.y = element_text(size = 12)) + 
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) +
  ggtitle("purpose")

4. Exploratory Data Analysis for Loans

In this section, I will detail exploratory data analysis and visualization for loans. I have divided it into five subsections through a variety of different visualization.

4.1 Loan Amount Summary

I create a line chart to view the loan amount. The median loan amount is 13000 and mean is 14755.

summary(data$loan_amnt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    500    8000   13000   14755   20000   35000 
ggplot(data, aes(x=loan_amnt)) + 
  geom_density() + 
  theme(legend.position="none") +
  xlab("Loan Amount") + 
  ggtitle("Loan Amount") + 
  geom_vline(aes(xintercept=mean(loan_amnt)), color="black", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(loan_amnt)), color="red", linetype="dashed", size=1) + 
  geom_vline(aes(xintercept=quantile(loan_amnt, 0.25)), color="blue", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=quantile(loan_amnt, 0.75)), color="yellow", linetype="dashed", size=1)

4.2 Interest Rate Summary

Here I examine the distribution of interest rate. The median interest rate is 12.99 and mean is 13.25.

summary(data$int_rate)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.32    9.99   12.99   13.25   16.20   28.99 
ggplot(data, aes(x=int_rate)) + 
  geom_density() + 
  theme(legend.position="none") +
  xlab("Interest Rate") + 
  ggtitle("Interest Rate") + 
  geom_vline(aes(xintercept=mean(int_rate)), color="black", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(int_rate)), color="red", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=quantile(int_rate, 0.25)), color="blue", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=quantile(int_rate, 0.75)), color="yellow", linetype="dashed", size=1)

4.3 Loan Term Length

There are two term length, 3-year and 5-year. Most borrowers have loan in 3 years.

data %>%
  group_by(term) %>%
  summarize(freq = n()) %>%
  ggplot(aes(reorder(term, freq), y = freq, fill = freq)) +   
  geom_bar(stat = "identity", position = "dodge", width = 0.4) +
  xlab("Term") +
  ylab("Frequency") +
  theme_fivethirtyeight() + 
  theme(legend.position ='none', axis.text.x = element_text(size = 15)) + 
  geom_text(aes(label = freq), vjust = -0.1, size = 4.5) +
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) +
  ggtitle("Term Length")

4.4 Loan Distribution by Geographic Region

Here I plot the loan volume by different US states.

data$region <- data$addr_state
data$region <- as.factor(data$region)
levels(data$region)<- c("alaska", "alabama","arkansas", "arizona", "california","colorado","connecticut","district of columbia","delaware","florida","georgia","hawaii","iowa","idaho","illinois","indiana","kansas","kentucky","louisiana","massachusetts","maryland","maine","michigan","minnesota","missouri","mississippi","montana","north carolina","north dakota","nebraska","new hampshire","new jersey","new mexico","nevada","new york","ohio","oklahoma","oregon","pennsylvania","rhode island","south carolina","south dakota","tennessee","texas","utah","virginia","vermont","washington","wisconsin","west virginia","wyoming")
us_states <- map_data("state")
loanvolume <-data %>% group_by(region) %>%
                summarise(value = sum(loan_amnt, na.rm=TRUE))
loanvolume$region <- as.character(loanvolume$region)
loanvolume
loandata <- merge(us_states, loanvolume, by="region")
ggplot() + 
  geom_polygon(data=loandata, aes(x=long, y=lat, group = group, fill=loandata$value)) +
  theme_void() +
  labs(fill = "Gradient" ,title = "Loan Volume by States", x="", y="")

4.5 Loan Status

We observe that most loan status are current, which means this loan still has one or more scheduled future payments. It also reflect that some late and default loans.

data %>%
  group_by(loan_status) %>%
  summarize(freq = n()) %>%
  ggplot(aes(reorder(loan_status, freq), y = freq, fill = freq)) +   
  geom_bar(stat = "identity", position = "dodge", width=0.4) +
  xlab("Loan Status") +
  ylab("Frequency") +
  theme_fivethirtyeight() + 
  coord_flip() +
  theme(legend.position ='none', axis.text.x = element_text(size = 7)) + 
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(11,'Spectral'))) +
  ggtitle("Loan Status")

summary(data$loan_status)
                                        Charged Off                                             Current 
                                              45248                                              601779 
                                            Default Does not meet the credit policy. Status:Charged Off 
                                               1219                                                 761 
 Does not meet the credit policy. Status:Fully Paid                                          Fully Paid 
                                               1988                                              207723 
                                    In Grace Period                                              Issued 
                                               6253                                                8460 
                                  Late (16-30 days)                                  Late (31-120 days) 
                                               2357                                               11591 

5. Word Cloud

The WordCloud generator presents interesting insights.

5.1 Word Cloud By Title

loanCorpus <- Corpus(VectorSource(data$title))
loanCorpus <- tm_map(loanCorpus, content_transformer(tolower))
loanCorpus <- tm_map(loanCorpus, removePunctuation)
wordcloud(loanCorpus, max.words = 100, random.order = FALSE,rot.per=0, 
          colors=brewer.pal(8, "Dark2"))

5.2 Word Cloud By Employment Title

loantitleCorpus <- Corpus(VectorSource(data$emp_title))
loantitleCorpus <- tm_map(loantitleCorpus, content_transformer(tolower))
loantitleCorpus <- tm_map(loantitleCorpus, removePunctuation)
wordcloud(loantitleCorpus, max.words = 100, random.order = FALSE,rot.per=0, 
          colors=brewer.pal(8, "Dark2"))

---
title: "LendingClub Data Analysis"
author: "Mengyao Xue"
date: '2019-03-08'
output:
  html_notebook: default
  word_document: default
  pdf_document: default
  html_document:
    toc: true
    df_print: paged
---

#1. Introduction

The Lending Club data contains complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. Financial decision making regarding to the credit risks is one of the crucial operations for the lending businesses. In this analysis, I present here exploratory data analysis, visualizations, word cloud, loan outcome prediction and lots of other interesting insights.


```{r load, echo=FALSE, warning=FALSE, message=FALSE}
library(choroplethr)
library(choroplethrMaps)
library(DescTools)
library(dplyr)
library(readr)
library("ggplot2")
library("RColorBrewer")
library("ggthemes")
library("ggrepel")
library("gridExtra")
library("lubridate")
library(usmap)
library("tm")
library("SnowballC")
library("wordcloud")
library(tidyverse)
library(ggthemes)
library(corrplot)
library(GGally)
library(DT)
```


```{r load data}
# read data file 
data <- read.csv("loan.csv")

# See dimension of dataset
dim(data)
```


##1.1 Data Overview
The data is sourced from the Kaggle website `https://www.kaggle.com/wendykan/lending-club-loan-data`

The LendingClub is a matrix of about 890 thousand observations and 74 variables.
Some important features:

* `loan_amnt` - Amount of money requested by the borrower.

* `int_rate` - Interest rate of the loan.

* `grade` - Loan grade with categories A, B, C, D, E, F, G.

* `loan_status` - Current loan status, i.e. approved, default, late, etc.

* `annual_inc` - Borrowers annual income.

* `purpose` - The primary purpose of borrowing.



#2. Risk Modelling 
In this section, I will fit logistics regression model for loan outcome and predict future loan status. I have divided it into four subsections through different steps. 

###2.1 Select Variables

The original dataset contains 74 columns, which are not necessarily useful and relevant in regression model. In addition, there are some missing values. Thus I did feature selection and data cleaning first. I select the following features:

* `loan_status` - Current loan status, binary variable.

* `loan_amnt` - Amount of money requested by the borrower.

* `int_rate` - Interest rate of the loan.

* `emp_length` - Borrowers' employment length.

* `grade` - Loan grade with categories A, B, C, D, E, F, G.

* `annual_inc` - Borrowers' annual income.

* `home_ownership` - A form of housing tenure: rent/own/mortgage/none, factors.

* `term` - Regular payments over a set period of time, factors.

```{r feature selection}
loan <- data %>%
        select(loan_status, loan_amnt, int_rate, grade, emp_length, home_ownership, annual_inc, term)

loan <- loan %>%
        filter(!is.na(annual_inc) , 
               !(home_ownership %in% c('NONE' , 'ANY')) , 
               emp_length != 'n/a')

loan$grade <- as.factor(loan$grade)
loan$home_ownership <- as.factor(loan$home_ownership)
loan$term <- as.factor(loan$term)
```

###2.2 Binary Outcomes

I neet to establish a binary outcome, i.e. 0 as default, 1 as non-default. we have 10 outcomes at indecated above, I select "Fully paid" as 1; then "Charged off","Late(31-120 days)","Late(16-30days)","Default", "In Grace Period" as 0. The next step is to select loan data only with 0 and 1.

```{r binary variables}
loan <- loan %>%
        mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off' , 'Default', 'Late (16-30 days)','Late (31-120 days)','In Grace Period') , 0, 
                                     ifelse(loan_status == 'Fully Paid' , 1 , 'No info')))

loan2 <- loan %>%select(-loan_status) %>%filter(loan_outcome %in% c(0 , 1))

```

###2.3 Fit Model

Split the dataset by 6:4, as training set and test set. The new dataset has 263281 rows, thus training set has 151446 rows, and test set has 100962.
```{r fit glm, warning=FALSE}
dim(loan2)
loan2$loan_outcome = as.numeric(loan2$loan_outcome)
training <- loan2[1:151446, ]
test <- loan2[151447:263281, ]

# Fit logistic regression
model1 <- glm(loan_outcome ~ loan_amnt + int_rate + grade + emp_length + home_ownership + annual_inc + term, data = training, family = binomial(link = 'logit'))
summary(model1)
```
`loan_amnt`, `int_rate`, `grade B`, `grade C`, `grade F`, `grade G`, `home_ownership type own`, `home_ownership type rent`, `annual_inc` and `term 60 months` are clearly significant (the significance p-value is indicated by *** in the summary of the fit).

*  The coefficients `loan_amnt`, `int_rate`, `grade B`, `grade C`, `home_ownership type own`, `home_ownership type rent`, and `term 60 months` are negative, these coefficients will negative effect the probability of non-default.

*  The coefficients `annual_inc`, `grade F` and `grade G` are positive, these coefficients will positive effect the probability of non-default. 

    + An interesting observation is that the coefficients of `grade F` and `grade G` are positive, and I need more detailed analysis on these.


The interpretation can be provided in terms of log-odds. Applying model1 to our training dataset, each estimated coefficient is the expected change in the log odds of default for a unit increase in the corresponding predictor variable holding the other predictor variables constant at certain value. Each exponentiated coefficient is the ratio of two odds, or the change in odds in the multiplicative scale for a unit increase in the corresponding predictor variable holding other variables at certain value. Since I set 0 as default, 1 as non-default, the interpretation is a little bit counter-intuitive. 

```{r interpretation and coef}
exp(coef(model1))
```
1. For example, the odds of a borrower will not default in the future with term 60 months over the odds of default with term 36 months is exp(-4.863e-01) = 0.6149161, holding other variables unchanged. In terms of percent change, we can say that the odds for non-default with long-term loan is 61.49% of short-term loan, i.e. long-term loan is more likely default in the future.


###2.4 Predict Model On Test Set
```{r predict glm}
# Prediction on test set
predict_model1 <- predict.glm(model1, newdata = test, type = "response")
summary(predict_model1)
```




Here I plot the ROC curve to see the sensitivity and specificity. As a result, it gives Area under the curve: 67.12%. 
```{r warning=FALSE, message=FALSE}
library(pROC)

# Area Under Curve
roc(test$loan_outcome, predict_model1, percent=T, boot.n=1000, ci.alpha=0.9, stratified=T, plot=TRUE, grid=TRUE, show.thres=TRUE, legacy.axes = TRUE, reuse.auc = TRUE, print.auc = TRUE, print.thres.col = "blue", ci=TRUE, ci.type="bars", print.thres.cex = 0.7, col = 'red', main = paste("ROC curve using","(N = ",nrow(test),")") )
```



#3. Exploratory Data Analysis for Borrowers

In this section, I will detail exploratory data analysis and visualization for borrowers. I have divided it into six subsections through a variety of different visualization.

###3.1 Borrowers Home Ownership Type 

In total, there are six type of home ownership types. Most borrowers have mortgages(443557) and 356117 borrowers rent house/apartment.  
```{r home_ownership, message=FALSE, warning=FALSE}
data %>%
  group_by(home_ownership) %>%
  summarize(freq = n()) %>%
  ggplot(aes(reorder(home_ownership, freq), y = freq, fill = freq)) +   
  geom_bar(stat = "identity", position = "dodge") +
  xlab("home_ownership") +
  ylab("Frequency") +
  theme_fivethirtyeight() + 
  theme(legend.position ='none', axis.text.x = element_text(size = 15)) + 
  geom_text(aes(label = freq), vjust = -0.1, size = 4.5) +
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) +
  ggtitle("Home Ownership")
```


###3.2 Borrowers Employment Length

I look at the borrower's employment length. Most borrowers have 10-year+ employment length.
```{r emp_length, message=FALSE, warning=FALSE}
data %>%
  group_by(emp_length) %>%
  summarize(freq = n()) %>%
  top_n(50) %>%
  ggplot(aes(reorder(emp_length, freq), y = freq, fill = freq)) +   
  geom_bar(stat = "identity", position = "dodge") +
  xlab("Employment Length") +
  ylab("Frequency") +
  coord_flip() +
  theme_fivethirtyeight() + 
  theme(legend.position ='none', axis.text.y = element_text(size = 12)) + 
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) +
  ggtitle("Employment Length")
```

###3.3 Borrowers' Grade

There are seven grades. Most borrowers have grade B.
```{r loan grade}
data %>%
  group_by(grade) %>%
  summarize(freq = n()) %>%
  ggplot(aes(reorder(grade, freq), y = freq, fill = freq)) +   
  geom_bar(stat = "identity", position = "dodge") +
  xlab("grade") +
  ylab("Frequency") +
  theme_fivethirtyeight() + 
  theme(legend.position ='none', axis.text.x = element_text(size = 15)) + 
  geom_text(aes(label = freq), vjust = -0.1, size = 4.5) +
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) +
  ggtitle("Grade")
```


###3.4 Distribution of the loan amount by borrower's grade

I seperate loan amount by different grades, we can see that the lower the borrower's grade, the higher amount of borrowing. 
```{r loan amount distribution by borrowers grade}
ggplot(data, aes(x=loan_amnt,fill=grade))+
  geom_density(alpha = 0.3) + 
  theme(legend.position="none") +
  theme_fivethirtyeight() + 
  xlab("Loan Amount") + 
  ggtitle("Loan Amount by Borrower's Grade") + 
  facet_grid(grade ~ ., scales = "free")

```


###3.5 Borrower's Annual Income 
Here I create bins to examine the annual income distribution.
```{r Annual Income}
summary(data$annual_inc)

# set up boundaries for intervals/bins
breaks <- c(0,20000,40000,60000,80000,90000,100000,150000,300000, 9500000)
# specify interval/bin labels
labels <- c("<20000", "20000-40000)", "40000-60000)", "60000-80000)", "80000-90000)", "90000-100000)", "100000-150000)", "150000-300000)", "<=9500000")
# bucketing data points into bins
bins <- cut(data$annual_inc, breaks, include.lowest = T, right=FALSE, labels=labels)
# inspect bins
bin <- summary(bins)
freq_inc <- c(10204, 129409, 241593, 203729, 7258i, 54965, 123426, 45865, 5603)
annual_income <- data.frame(labels, freq_inc)
plot(bins, main="Annual Income", ylab="Annual Income")

```

###3.6 Borrower's Purpose

We observe that *debt consolidation* is the most frequently used reason to borrow. 
```{r purpose, message=FALSE, warning=FALSE}
data %>%
  group_by(purpose) %>%
  summarize(freq = n()) %>%
  top_n(50) %>%
  ggplot(aes(reorder(purpose, freq), y = freq, fill = freq)) +   
  geom_bar(stat = "identity", position = "dodge") +
  xlab("Purpose") +
  ylab("Frequency") +
  coord_flip() +
  theme_fivethirtyeight() + 
  theme(legend.position ='none', axis.text.y = element_text(size = 12)) + 
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) +
  ggtitle("purpose")
```



#4. Exploratory Data Analysis for Loans

In this section, I will detail exploratory data analysis and visualization for loans. I have divided it into five subsections through a variety of different visualization.

###4.1 Loan Amount Summary

I create a line chart to view the loan amount. The median loan amount is 13000 and mean is 14755.
```{r Loan Amount}
summary(data$loan_amnt)

ggplot(data, aes(x=loan_amnt)) + 
  geom_density() + 
  theme(legend.position="none") +
  xlab("Loan Amount") + 
  ggtitle("Loan Amount") + 
  geom_vline(aes(xintercept=mean(loan_amnt)), color="black", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(loan_amnt)), color="red", linetype="dashed", size=1) + 
  geom_vline(aes(xintercept=quantile(loan_amnt, 0.25)), color="blue", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=quantile(loan_amnt, 0.75)), color="yellow", linetype="dashed", size=1)

```
###4.2 Interest Rate Summary

Here I examine the distribution of interest rate. The median interest rate is 12.99 and mean is 13.25. 
```{r Interest Rate, warning=FALSE}
summary(data$int_rate)
ggplot(data, aes(x=int_rate)) + 
  geom_density() + 
  theme(legend.position="none") +
  xlab("Interest Rate") + 
  ggtitle("Interest Rate") + 
  geom_vline(aes(xintercept=mean(int_rate)), color="black", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(int_rate)), color="red", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=quantile(int_rate, 0.25)), color="blue", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=quantile(int_rate, 0.75)), color="yellow", linetype="dashed", size=1)
```

###4.3 Loan Term Length

There are two term length, 3-year and 5-year. Most borrowers have loan in 3 years.
```{r Loan Term Length}
data %>%
  group_by(term) %>%
  summarize(freq = n()) %>%
  ggplot(aes(reorder(term, freq), y = freq, fill = freq)) +   
  geom_bar(stat = "identity", position = "dodge", width = 0.4) +
  xlab("Term") +
  ylab("Frequency") +
  theme_fivethirtyeight() + 
  theme(legend.position ='none', axis.text.x = element_text(size = 15)) + 
  geom_text(aes(label = freq), vjust = -0.1, size = 4.5) +
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(10,'Spectral'))) +
  ggtitle("Term Length")
```

###4.4 Loan Distribution by Geographic Region

Here I plot the loan volume by different US states. 
```{r Geographic Region}
data$region <- data$addr_state
data$region <- as.factor(data$region)
levels(data$region)<- c("alaska", "alabama","arkansas", "arizona", "california","colorado","connecticut","district of columbia","delaware","florida","georgia","hawaii","iowa","idaho","illinois","indiana","kansas","kentucky","louisiana","massachusetts","maryland","maine","michigan","minnesota","missouri","mississippi","montana","north carolina","north dakota","nebraska","new hampshire","new jersey","new mexico","nevada","new york","ohio","oklahoma","oregon","pennsylvania","rhode island","south carolina","south dakota","tennessee","texas","utah","virginia","vermont","washington","wisconsin","west virginia","wyoming")

us_states <- map_data("state")
loanvolume <-data %>% group_by(region) %>%
                summarise(value = sum(loan_amnt, na.rm=TRUE))
loanvolume$region <- as.character(loanvolume$region)
loandata <- merge(us_states, loanvolume, by="region")
ggplot() + 
  geom_polygon(data=loandata, aes(x=long, y=lat, group = group, fill=loandata$value)) +
  theme_void() +
  labs(fill = "Gradient" ,title = "Loan Volume by States", x="", y="")
```

###4.5 Loan Status

We observe that most loan status are current, which means this loan still has one or more scheduled future payments. It also reflect that some late and default loans.
```{r Loan Status}
data %>%
  group_by(loan_status) %>%
  summarize(freq = n()) %>%
  ggplot(aes(reorder(loan_status, freq), y = freq, fill = freq)) +   
  geom_bar(stat = "identity", position = "dodge", width=0.4) +
  xlab("Loan Status") +
  ylab("Frequency") +
  theme_fivethirtyeight() + 
  coord_flip() +
  theme(legend.position ='none', axis.text.x = element_text(size = 7)) + 
  scale_fill_gradientn(name = '',colours = rev(brewer.pal(11,'Spectral'))) +
  ggtitle("Loan Status")

summary(data$loan_status)
```


#5. Word Cloud

The WordCloud generator presents interesting insights.

###5.1 Word Cloud By Title
```{r word cloud by title, warning=FALSE}
loanCorpus <- Corpus(VectorSource(data$title))
loanCorpus <- tm_map(loanCorpus, content_transformer(tolower))
loanCorpus <- tm_map(loanCorpus, removePunctuation)
wordcloud(loanCorpus, max.words = 100, random.order = FALSE,rot.per=0, 
          colors=brewer.pal(8, "Dark2"))
```

###5.2 Word Cloud By Employment Title
```{r word cloud by emp title, warning=FALSE}
loantitleCorpus <- Corpus(VectorSource(data$emp_title))
loantitleCorpus <- tm_map(loantitleCorpus, content_transformer(tolower))
loantitleCorpus <- tm_map(loantitleCorpus, removePunctuation)
wordcloud(loantitleCorpus, max.words = 100, random.order = FALSE,rot.per=0, 
          colors=brewer.pal(8, "Dark2"))
```


