Lending Club Case Study - Data Import, Data Tidy and EDA


Data Import and Data Tidy

data_2012 <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\data2012.csv")
## New names:
## Rows: 53368 Columns: 32
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (18): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne... dbl
## (14): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, mths_since_last_del...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
data_2013 <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\data2013.csv")
## New names:
## Rows: 134814 Columns: 32
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (16): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne... dbl
## (16): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, inq_last_6mths, mth...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
data_2014 <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\data2014.csv")
## New names:
## Rows: 235629 Columns: 32
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (16): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne... dbl
## (16): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, inq_last_6mths, mth...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
data_2015 <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\data2015.csv")
## New names:
## • `` -> `...1`
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 421095 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne...
## dbl (16): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, inq_last_6mths, mth...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_2016 <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\data2016.csv")
## New names:
## Rows: 434407 Columns: 32
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (16): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne... dbl
## (16): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, inq_last_6mths, mth...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
data_2017 <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\data2017.csv")
## New names:
## • `` -> `...1`
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 443579 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne...
## dbl (16): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, inq_last_6mths, mth...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_states <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\States.csv")
## Rows: 52 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Geography
## dbl (7): Num_Households, Median_income_Households, Unemployment rate; Estima...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_regions <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\states_regions.csv")
## New names:
## Rows: 51 Columns: 5
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): State, State Code, Region, Division lgl (1): ...5
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...5`
#Concatenate all 6 tables into 1
LC_data <- rbind(data_2012, data_2013, data_2014, data_2015, data_2016, data_2017)

#dropped the previous index column because it doesn't apply to the new table, dropped the issue_d column because we already have that divided into issue_Month and issue_Year column. 
LC_data_drop_columns <- subset(LC_data, select= -c(...1, issue_d))


Next I’ll plot each variable to identify unusual values, missing values and its distribution.

#Loan Amount
ggplot(data = LC_data_drop_columns) +
  geom_histogram(mapping = aes(x = loan_amnt), binwidth = 1000)
## Warning: Removed 12 rows containing non-finite values (stat_bin).

summary(LC_data_drop_columns$loan_amnt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    8000   12800   14863   20000   40000      12
  • From the above graph and summary, we know that the loan amounts range from $1,000 to $40,000, with a mean of $14,863. Most of the loans that Lending Club does are between $5,000-$20,000, and the peaks normally appear at loan amounts that are mutiples of $5,000, such as $10,000, $15,000, $20,000, $25,000 and so on.


#Loan Amount
ggplot(data = LC_data_drop_columns) +
  geom_bar(mapping = aes(x = emp_length))

  • From the above graph, we can see that most of the employment years of the clients are above 10 years, with a even distribution of other employment length. There seems to be wrong values and missing values in this column, some of the home ownership data was mistakenly input into the emp_length column.


#Annual Income
ggplot(data = LC_data_drop_columns) +
  geom_histogram(mapping = aes(x = annual_inc), binwidth = 10000)
## Warning: Removed 16 rows containing non-finite values (stat_bin).

summary(LC_data_drop_columns$annual_inc)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##         0     46500     65000     77607     92000 110000000        16
#I don't know if the extremely large number is wrong value or not, but for the plotting purpose, I'll zoom to a smaller values of the x-axis
ggplot(data = LC_data_drop_columns) +
  geom_histogram(mapping = aes(x = annual_inc), binwidth = 10000)+
  coord_cartesian(xlim = c(0, 1000000))
## Warning: Removed 16 rows containing non-finite values (stat_bin).

ggplot(data = LC_data_drop_columns) +
  geom_histogram(mapping = aes(x = annual_inc), binwidth = 5000)+
  coord_cartesian(xlim = c(0, 300000))
## Warning: Removed 16 rows containing non-finite values (stat_bin).

  • From the above graph and summary, we know that most of the clients who get loans with Lending Club have an annual income between $40,000 - $100,000, with a mean of $77607. However there are also small number of clients whose annual income are very high.


ggplot(data = LC_data_drop_columns) +
  geom_bar(mapping = aes(x = inq_last_6mths))

  • From the above graph we can see that most of the clients didn’t have any inquires in the past 6 months. And there exist some usual values, such as 0.0.1 which must be a mistaken input. 17 is suspicious, might also be a mistaken input. And there are missing values as well.


Bivariate Analysis

ggplot(data = LC_data_drop_columns, aes(loan_amnt, color = grade)) + geom_histogram(binwidth = 1000) + 
    facet_grid(grade ~ .)
## Warning: Removed 12 rows containing non-finite values (stat_bin).

  • From the above graph we can see that most of the loans have a grade of A,B,C. And for loans that have a grade lower than F, loan amount doesn’t seem to be a decisive factor for distribution anymore.


ggplot(data = LC_data_drop_columns) +
  geom_point(mapping = aes(x = annual_inc, y = loan_amnt))+ 
  coord_cartesian(xlim = c(0, 500000))
## Warning: Removed 16 rows containing missing values (geom_point).

  • Scatter plot isn’t very helpful to learn about the correlation between loan amount and annual income.
ggplot(data = LC_data_drop_columns) +
  geom_bin2d(mapping = aes(x = annual_inc,y=loan_amnt))+ 
  coord_cartesian(xlim = c(0, 12000000))
## Warning: Removed 16 rows containing non-finite values (stat_bin2d).

  • The above graph shows a bit more information than the scatter plot. We tell that generally people with an annual income of over $3,500,000 have less loans.