DATA 606. Project Proposal

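library(dplyr) #provides %>% and select(), used below

#load data from Amazon; stringsAsFactors=TRUE keeps text columns as factors (the default before R 4.0), which the levels() calls below rely on
data_2007_2011 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3a.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)
data_2014 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3c%2520%25281%2529.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)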

#load more data from Amazon later (header=TRUE so the column names match the data frames loaded above)
#data_2015 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3d.csv", header=TRUE, sep=",")
#data_2016_q1 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q1.csv", header=TRUE, sep=",")
#data_2016_q2 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q2.csv", header=TRUE, sep=",")
#data_2016_q3 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q3.csv", header=TRUE, sep=",")
#data_2016_q4 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q4.csv", header=TRUE, sep=",")
#data_2017_q1 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q1.csv", header=TRUE, sep=",")
#data_2017_q2 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q2.csv", header=TRUE, sep=",")

full_data <- rbind(data_2007_2011, data_2014)

#select meaningful variables
all_data <- full_data %>% select(loan_amnt, funded_amnt, term, int_rate, installment, grade, sub_grade, emp_length, annual_inc, loan_status)
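If the 2015-2017 files are added later, stacking them one rbind() call at a time gets repetitive. The sketch below shows one possible way to read a list of files and combine them in a single step. The helper name load_loan_files and the two small temporary files are made up for illustration and are not part of the Lending Club data; the sketch also assumes every file has the same columns.

```r
# Hypothetical helper: read several CSV files that share the same columns
# and stack them into one data frame.
load_loan_files <- function(paths) {
  frames <- lapply(paths, read.csv, header = TRUE, stringsAsFactors = FALSE)
  do.call(rbind, frames)
}

# Toy demonstration with two temporary files (made-up values).
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write.csv(data.frame(loan_amnt = c(5000, 12000), grade = c("A", "C")),
          f1, row.names = FALSE)
write.csv(data.frame(loan_amnt = 8000, grade = "B"),
          f2, row.names = FALSE)

toy_data <- load_loan_files(c(f1, f2))
dim(toy_data)
```

Text columns are read as character here so that files with different sets of values stack cleanly; they can be converted to factors after combining.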

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for. Are creditworthiness parameters such as grade, sub grade, annual income, and length of employment predictive of whether a borrower pays off the debt?

Cases

What are the cases, and how many are there? Each case represents a loan that was issued in 2007-2011 or in 2014.

dim(full_data)

## [1] 278167    137

Currently there are 278167 observations and 137 variables in the combined data set. I might add the data sets for 2015-2017 later.

Data collection

Describe the method of data collection. Multiple CSV files were downloaded from the Lending Club Statistics page (https://www.lendingclub.com/info/download-data.action) and saved on a local machine. The CSV files were then uploaded to an Amazon server. After that, the data were read into R data frames and cleaned. More data cleaning will be performed at the next project stage.

Type of study

What type of study is this (observational/experiment)? This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link. Multiple CSV files were downloaded from the Lending Club Statistics page. The page URL is https://www.lendingclub.com/info/download-data.action

Response

What is the response variable, and what type is it (numerical/categorical)? The response variable “loan status” is categorical (its categories include “Fully Paid”, “Charged Off”, etc.).

levels(all_data$loan_status)

##  [1] ""                                                   
##  [2] "Charged Off"                                        
##  [3] "Does not meet the credit policy. Status:Charged Off"
##  [4] "Does not meet the credit policy. Status:Fully Paid" 
##  [5] "Fully Paid"                                         
##  [6] "Current"                                            
##  [7] "Default"                                            
##  [8] "In Grace Period"                                    
##  [9] "Late (16-30 days)"                                  
## [10] "Late (31-120 days)"
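The levels above mix finished loans with loans that are still in progress, so a question about whether the debt was paid off needs a two-valued outcome. The sketch below shows one possible recoding on a toy vector of status strings. The grouping is an assumption made for illustration: any "Fully Paid" status counts as paid off, "Charged Off" and "Default" count as not paid off, and everything else is left as NA.

```r
# Toy status values copied from the levels of loan_status (not the real data).
status <- c("Fully Paid", "Charged Off", "Current", "Default",
            "Does not meet the credit policy. Status:Fully Paid")

# Assumption: only finished loans get an outcome; loans in progress become NA.
is_paid    <- grepl("Fully Paid", status, fixed = TRUE)
is_default <- grepl("Charged Off", status, fixed = TRUE) | status == "Default"

paid_off <- ifelse(is_paid, TRUE, ifelse(is_default, FALSE, NA))
table(paid_off, useNA = "ifany")
```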

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)? The explanatory variable “annual income” is numerical, while the explanatory variables “grade”, “sub grade”, and “length of employment” (10+ years, < 1 year, etc.) are categorical.

levels(all_data$grade)

## [1] ""  "A" "B" "C" "D" "E" "F" "G"

levels(all_data$sub_grade)

##  [1] ""   "A1" "A2" "A3" "A4" "A5" "B1" "B2" "B3" "B4" "B5" "C1" "C2" "C3"
## [15] "C4" "C5" "D1" "D2" "D3" "D4" "D5" "E1" "E2" "E3" "E4" "E5" "F1" "F2"
## [29] "F3" "F4" "F5" "G1" "G2" "G3" "G4" "G5"

levels(all_data$emp_length)

##  [1] ""          "< 1 year"  "1 year"    "10+ years" "2 years"  
##  [6] "3 years"   "4 years"   "5 years"   "6 years"   "7 years"  
## [11] "8 years"   "9 years"   "n/a"

levels(all_data$annual_inc)

## NULL
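levels() returns NULL for annual_inc because it is a numeric column rather than a factor. The emp_length levels, on the other hand, are sorted alphabetically, which puts "10+ years" between "1 year" and "2 years". The sketch below shows one possible way to give them their natural order, using a toy factor with made-up rows. Treating "n/a" as missing is an assumption of this sketch.

```r
# Toy factor with a few of the emp_length values (not the real data).
emp_length <- factor(c("10+ years", "< 1 year", "2 years", "1 year", "n/a"))
levels(emp_length)  # alphabetical order

# Natural order from shortest to longest employment.
ordered_levels <- c("< 1 year", "1 year", paste(2:9, "years"), "10+ years")

# Values not listed in ordered_levels (here "n/a") become NA.
emp_length_ord <- factor(as.character(emp_length),
                         levels = ordered_levels, ordered = TRUE)
levels(emp_length_ord)
```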

Relevant summary statistics

At this point it’s hard to decide which graphs and plots are most relevant to the research question, so I decided to display basic summary statistics for the Lending Club data set.

summary(all_data)

##    loan_amnt      funded_amnt            term           int_rate     
##  Min.   :  500   Min.   :  500             :     3   12.99% : 13086  
##  1st Qu.: 8000   1st Qu.: 8000    36 months:194104   10.99% : 11654  
##  Median :12000   Median :12000    60 months: 84060   15.61% : 10309  
##  Mean   :14292   Mean   :14251                       12.49% :  9726  
##  3rd Qu.:20000   3rd Qu.:20000                       13.98% :  9168  
##  Max.   :35000   Max.   :35000                       14.99% :  8102  
##  NA's   :3       NA's   :3                           (Other):216122  
##   installment          grade         sub_grade          emp_length   
##  Min.   :  15.67   C      :75305   C2     : 16122   10+ years:88874  
##  1st Qu.: 248.48   B      :74324   B5     : 16116   2 years  :25230  
##  Median : 368.17   D      :49008   B3     : 16072   < 1 year :23044  
##  Mean   : 424.14   A      :46291   B4     : 16065   3 years  :22631  
##  3rd Qu.: 558.19   E      :23515   C1     : 15762   1 year   :18188  
##  Max.   :1409.99   F      : 7524   C3     : 15452   4 years  :17177  
##  NA's   :3         (Other): 2200   (Other):182578   (Other)  :83023  
##    annual_inc     
##  Min.   :   1896  
##  1st Qu.:  45000  
##  Median :  63000  
##  Mean   :  73980  
##  3rd Qu.:  90000  
##  Max.   :7500000  
##  NA's   :7        
##                                              loan_status    
##  Fully Paid                                        :186440  
##  Current                                           : 43129  
##  Charged Off                                       : 42324  
##  Does not meet the credit policy. Status:Fully Paid:  1988  
##  Late (31-120 days)                                :  1728  
##  In Grace Period                                   :  1362  
##  (Other)                                           :  1196
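In the summary above, int_rate is tabulated like a categorical variable because its values carry a % sign. As part of the later data cleaning, it could be converted to a number. This is a minimal sketch on a toy factor with made-up rows; the name int_rate_num is chosen only for illustration.

```r
# Toy values in the same format as int_rate (not the real data).
int_rate_raw <- factor(c("12.99%", " 10.99%", "15.61%", ""))

# Convert factor -> character, trim spaces, drop the % sign, then parse.
int_rate_chr <- trimws(as.character(int_rate_raw))
int_rate_num <- as.numeric(sub("%", "", int_rate_chr, fixed = TRUE))

# Empty strings become NA.
summary(int_rate_num)
```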

boxplot(all_data$annual_inc~all_data$loan_status)

boxplot(all_data$loan_amnt~all_data$loan_status)

boxplot(all_data$loan_amnt/all_data$annual_inc~all_data$loan_status)
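The boxplots compare numerical variables across loan statuses. For a categorical explanatory variable such as grade, a table of row proportions is one possible counterpart. The sketch below uses a small made-up data frame, so the numbers say nothing about the Lending Club loans.

```r
# Made-up toy data: grade and final status for eight loans.
toy <- data.frame(
  grade = c("A", "A", "A", "B", "B", "C", "C", "C"),
  loan_status = c("Fully Paid", "Fully Paid", "Charged Off",
                  "Fully Paid", "Charged Off",
                  "Charged Off", "Charged Off", "Fully Paid")
)

# Counts of each status within each grade.
counts <- table(toy$grade, toy$loan_status)

# Share of each status within a grade (each row sums to 1).
row_props <- prop.table(counts, margin = 1)
round(row_props, 2)
```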