#load data from Amazon
data_2007_2011 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3a.csv",header=TRUE, sep=",")
data_2014 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3c%2520%25281%2529.csv", header=TRUE, sep=",")
#load more data from Amazon later
#data_2015 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats3d.csv", header=FALSE, sep=",")
#data_2016_q1 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q1.csv", header=FALSE, sep=",")
#data_2016_q2 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q2.csv", header=FALSE, sep=",")
#data_2016_q3 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q3.csv", header=FALSE, sep=",")
#data_2016_q4 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2016Q4.csv", header=FALSE, sep=",")
#data_2017_q1 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q1.csv", header=FALSE, sep=",")
#data_2017_q2 <- read.csv("https://cdn-stage.fedweb.org/fed-2/13/LoanStats_2017Q2.csv", header=FALSE, sep=",")
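#combine the two periods into one data frame
full_data <- rbind(data_2007_2011, data_2014)
#select meaningful variables (select() and %>% come from dplyr)
library(dplyr)
all_data <- full_data %>%
  select(loan_amnt, funded_amnt, term, int_rate, installment, grade, sub_grade, emp_length, annual_inc, loan_status)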
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for. Are creditworthiness parameters such as grade, sub grade, annual income, and length of employment predictive of whether a borrower pays off the debt?
What are the cases, and how many are there? Each case represents a loan issued between 2007 and 2011 or in 2014.
dim(full_data)
## [1] 278167 137
The data set currently contains 278167 observations of 137 variables. I may add the data sets for 2015-2017 later.
Describe the method of data collection. Multiple CSV files were downloaded from the Lending Club Statistics page (https://www.lendingclub.com/info/download-data.action) and saved on a local machine. The CSV files were then uploaded to an Amazon server. After that, the data were read into R data frames and cleaned. More data cleaning will be performed at the next project stage.
What type of study is this (observational/experiment)? This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link. Multiple CSV files were downloaded from the Lending Club Statistics page (https://www.lendingclub.com/info/download-data.action).
What is the response variable, and what type is it (numerical/categorical)? The response variable, “loan status”, is categorical (its levels include “Fully Paid”, “Charged Off”, etc.).
levels(all_data$loan_status)
## [1] ""
## [2] "Charged Off"
## [3] "Does not meet the credit policy. Status:Charged Off"
## [4] "Does not meet the credit policy. Status:Fully Paid"
## [5] "Fully Paid"
## [6] "Current"
## [7] "Default"
## [8] "In Grace Period"
## [9] "Late (16-30 days)"
## [10] "Late (31-120 days)"
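The research question is binary (paid off or not), while loan_status has ten levels, so the response may need to be collapsed before modelling. The following is only a sketch under stated assumptions: the `status` vector is a toy stand-in for `all_data$loan_status`, and the decision to treat every loan without a final outcome (current, late, in grace period, default, blank) as missing is my assumption, not something settled in this proposal.

```r
# Toy stand-in for all_data$loan_status (hypothetical values, not the real data)
status <- factor(c("Fully Paid", "Charged Off", "Current", "",
                   "Does not meet the credit policy. Status:Fully Paid",
                   "Does not meet the credit policy. Status:Charged Off",
                   "Default", "Late (31-120 days)"))

# Assumption: only loans with a final outcome are classified
paid_levels    <- c("Fully Paid",
                    "Does not meet the credit policy. Status:Fully Paid")
charged_levels <- c("Charged Off",
                    "Does not meet the credit policy. Status:Charged Off")

# Everything that is neither paid nor charged off becomes NA
outcome <- ifelse(status %in% paid_levels, "Paid off",
           ifelse(status %in% charged_levels, "Charged off", NA_character_))
outcome <- factor(outcome, levels = c("Paid off", "Charged off"))

# Count each outcome, including the unclassified loans
table(outcome, useNA = "ifany")
```

Whether “Default” should count as “Charged off” instead of missing is a judgment call that would change the counts.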
What is the explanatory variable, and what type is it (numerical/categorical)? The explanatory variable “annual income” is numerical, while the explanatory variables “grade”, “sub grade”, and “length of employment” (10+ years, < 1 year, etc.) are categorical.
levels(all_data$grade)
## [1] "" "A" "B" "C" "D" "E" "F" "G"
levels(all_data$sub_grade )
## [1] "" "A1" "A2" "A3" "A4" "A5" "B1" "B2" "B3" "B4" "B5" "C1" "C2" "C3"
## [15] "C4" "C5" "D1" "D2" "D3" "D4" "D5" "E1" "E2" "E3" "E4" "E5" "F1" "F2"
## [29] "F3" "F4" "F5" "G1" "G2" "G3" "G4" "G5"
levels(all_data$emp_length)
## [1] "" "< 1 year" "1 year" "10+ years" "2 years"
## [6] "3 years" "4 years" "5 years" "6 years" "7 years"
## [11] "8 years" "9 years" "n/a"
levels(all_data$annual_inc)
## NULL
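The NULL result for annual_inc is expected: levels() only returns values for factors, and a numeric column has none. A small sketch of how the variable types could be checked directly, using a hypothetical toy data frame (`toy`) in place of `all_data`:

```r
# Toy data frame (hypothetical values) with one numeric and one factor column
toy <- data.frame(
  annual_inc = c(45000, 63000, 90000),
  grade      = factor(c("A", "B", "C"))
)

# A numeric column has no levels, so levels() returns NULL
levels(toy$annual_inc)

# A factor column returns its categories
levels(toy$grade)

# class() reports the type of each column in one call
sapply(toy, class)
```

Running `sapply(all_data, class)` on the real data would show in one step which variables were read as factors and which as numbers.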
At this point it’s hard to decide which graphs and plots are most relevant to the research question, so I decided to display basic summary statistics for the Lending Club data set.
summary(all_data)
##    loan_amnt      funded_amnt            term           int_rate
##  Min.   :  500   Min.   :  500            :     3   12.99% : 13086
##  1st Qu.: 8000   1st Qu.: 8000   36 months:194104   10.99% : 11654
##  Median :12000   Median :12000   60 months: 84060   15.61% : 10309
##  Mean   :14292   Mean   :14251                      12.49% :  9726
##  3rd Qu.:20000   3rd Qu.:20000                      13.98% :  9168
##  Max.   :35000   Max.   :35000                      14.99% :  8102
##  NA's   :3       NA's   :3                          (Other):216122
##   installment          grade         sub_grade          emp_length
##  Min.   :  15.67   C      :75305   C2     : 16122   10+ years:88874
##  1st Qu.: 248.48   B      :74324   B5     : 16116   2 years  :25230
##  Median : 368.17   D      :49008   B3     : 16072   < 1 year :23044
##  Mean   : 424.14   A      :46291   B4     : 16065   3 years  :22631
##  3rd Qu.: 558.19   E      :23515   C1     : 15762   1 year   :18188
##  Max.   :1409.99   F      : 7524   C3     : 15452   4 years  :17177
##  NA's   :3         (Other): 2200   (Other):182578   (Other)  :83023
##    annual_inc
##  Min.   :   1896
##  1st Qu.:  45000
##  Median :  63000
##  Mean   :  73980
##  3rd Qu.:  90000
##  Max.   :7500000
##  NA's   :7
##                                              loan_status
##  Fully Paid                                        :186440
##  Current                                           : 43129
##  Charged Off                                       : 42324
##  Does not meet the credit policy. Status:Fully Paid:  1988
##  Late (31-120 days)                                :  1728
##  In Grace Period                                   :  1362
##  (Other)                                           :  1196
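The summary shows int_rate being treated as a factor (its values carry a percent sign) and emp_length containing both blank and “n/a” entries, so both are candidates for the later cleaning stage. The sketch below shows one possible way to handle them on a hypothetical toy data frame (`toy`); the new column name `int_rate_num` and the choice to map blank and “n/a” to NA are assumptions for illustration only.

```r
# Toy data frame (hypothetical values) mimicking the two problem columns
toy <- data.frame(
  int_rate   = factor(c("12.99%", "10.99%", "")),
  emp_length = factor(c("10+ years", "n/a", ""))
)

# Strip the percent sign and convert to numeric; blank strings become NA
rate_chr <- trimws(as.character(toy$int_rate))
toy$int_rate_num <- as.numeric(sub("%", "", rate_chr, fixed = TRUE))

# Assumption: blank and "n/a" employment lengths both mean "unknown"
emp <- as.character(toy$emp_length)
emp[emp %in% c("", "n/a")] <- NA
toy$emp_length <- factor(emp)

# int_rate_num is now numeric and the unknown categories are NA
summary(toy)
```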
#annual income by loan status
boxplot(annual_inc ~ loan_status, data = all_data)
#loan amount by loan status
boxplot(loan_amnt ~ loan_status, data = all_data)
#loan-to-income ratio by loan status
boxplot(loan_amnt / annual_inc ~ loan_status, data = all_data)
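The boxplots cover the numerical variables. For a categorical explanatory variable such as grade, a table of row proportions is one possible counterpart. This is a sketch on hypothetical toy data: the `toy` data frame and its two-level `outcome` column are assumptions, since the proposal has not yet defined a binary response.

```r
# Toy data (hypothetical values): loan grade and a two-level outcome
toy <- data.frame(
  grade   = factor(c("A", "A", "A", "B", "B", "C", "C", "C")),
  outcome = factor(c("Paid off", "Paid off", "Paid off",
                     "Paid off", "Charged off",
                     "Charged off", "Charged off", "Paid off"))
)

# Counts of each outcome within each grade
tab <- table(toy$grade, toy$outcome)

# Row proportions: share of each outcome within a grade (rows sum to 1)
prop <- prop.table(tab, margin = 1)
round(prop, 2)
```

With the real data, differing row proportions across grades would be a first informal hint that grade is related to the outcome.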