Week11_Assignment

Lending Club Case Study - Data Import, Data Tidy and EDA

Data Import and Data Tidy

data_2012 <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\data2012.csv")

## New names:
## Rows: 53368 Columns: 32
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (18): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne... dbl
## (14): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, mths_since_last_del...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

data_2013 <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\data2013.csv")

## New names:
## Rows: 134814 Columns: 32
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (16): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne... dbl
## (16): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, inq_last_6mths, mth...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

data_2014 <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\data2014.csv")

## New names:
## Rows: 235629 Columns: 32
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (16): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne... dbl
## (16): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, inq_last_6mths, mth...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

data_2015 <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\data2015.csv")

## New names:
## • `` -> `...1`

## Warning: One or more parsing issues, see `problems()` for details

## Rows: 421095 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne...
## dbl (16): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, inq_last_6mths, mth...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data_2016 <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\data2016.csv")

## New names:
## Rows: 434407 Columns: 32
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (16): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne... dbl
## (16): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, inq_last_6mths, mth...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

data_2017 <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\data2017.csv")

## New names:
## • `` -> `...1`

## Warning: One or more parsing issues, see `problems()` for details

## Rows: 443579 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne...
## dbl (16): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, inq_last_6mths, mth...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data_states <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\States.csv")

## Rows: 52 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Geography
## dbl (7): Num_Households, Median_income_Households, Unemployment rate; Estima...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data_regions <- read_csv("F:\\2022_NEC\\R\\Week11\\Final Case Analysis\\states_regions.csv")

## New names:
## Rows: 51 Columns: 5
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): State, State Code, Region, Division lgl (1): ...5
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...5`

#Concatenate all 6 tables into 1
LC_data <- rbind(data_2012, data_2013, data_2014, data_2015, data_2016, data_2017)

#dropped the previous index column because it doesn't apply to the new table, dropped the issue_d column because we already have that divided into issue_Month and issue_Year column. 
LC_data_drop_columns <- subset(LC_data, select= -c(...1, issue_d))

Next I’ll plot each variable to identify unusual values, missing values and its distribution.

#Loan Amount
ggplot(data = LC_data_drop_columns) +
  geom_histogram(mapping = aes(x = loan_amnt), binwidth = 1000)

## Warning: Removed 12 rows containing non-finite values (stat_bin).

summary(LC_data_drop_columns$loan_amnt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    8000   12800   14863   20000   40000      12

From the above graph and summary, we know that the loan amounts range from $1,000 to $40,000, with a mean of $14,863. Most of the loans that Lending Club does are between $5,000-$20,000, and the peaks normally appear at loan amounts that are mutiples of $5,000, such as $10,000, $15,000, $20,000, $25,000 and so on.

#Loan Amount
ggplot(data = LC_data_drop_columns) +
  geom_bar(mapping = aes(x = emp_length))

From the above graph, we can see that most of the employment years of the clients are above 10 years, with a even distribution of other employment length. There seems to be wrong values and missing values in this column, some of the home ownership data was mistakenly input into the emp_length column.

#Annual Income
ggplot(data = LC_data_drop_columns) +
  geom_histogram(mapping = aes(x = annual_inc), binwidth = 10000)

## Warning: Removed 16 rows containing non-finite values (stat_bin).

summary(LC_data_drop_columns$annual_inc)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##         0     46500     65000     77607     92000 110000000        16

#I don't know if the extremely large number is wrong value or not, but for the plotting purpose, I'll zoom to a smaller values of the x-axis
ggplot(data = LC_data_drop_columns) +
  geom_histogram(mapping = aes(x = annual_inc), binwidth = 10000)+
  coord_cartesian(xlim = c(0, 1000000))

## Warning: Removed 16 rows containing non-finite values (stat_bin).

ggplot(data = LC_data_drop_columns) +
  geom_histogram(mapping = aes(x = annual_inc), binwidth = 5000)+
  coord_cartesian(xlim = c(0, 300000))

## Warning: Removed 16 rows containing non-finite values (stat_bin).

From the above graph and summary, we know that most of the clients who get loans with Lending Club have an annual income between $40,000 - $100,000, with a mean of $77607. However there are also small number of clients whose annual income are very high.

ggplot(data = LC_data_drop_columns) +
  geom_bar(mapping = aes(x = inq_last_6mths))

From the above graph we can see that most of the clients didn’t have any inquires in the past 6 months. And there exist some usual values, such as 0.0.1 which must be a mistaken input. 17 is suspicious, might also be a mistaken input. And there are missing values as well.

Bivariate Analysis

ggplot(data = LC_data_drop_columns, aes(loan_amnt, color = grade)) + geom_histogram(binwidth = 1000) + 
    facet_grid(grade ~ .)

## Warning: Removed 12 rows containing non-finite values (stat_bin).

From the above graph we can see that most of the loans have a grade of A,B,C. And for loans that have a grade lower than F, loan amount doesn’t seem to be a decisive factor for distribution anymore.

ggplot(data = LC_data_drop_columns) +
  geom_point(mapping = aes(x = annual_inc, y = loan_amnt))+ 
  coord_cartesian(xlim = c(0, 500000))

## Warning: Removed 16 rows containing missing values (geom_point).

Scatter plot isn’t very helpful to learn about the correlation between loan amount and annual income.

ggplot(data = LC_data_drop_columns) +
  geom_bin2d(mapping = aes(x = annual_inc,y=loan_amnt))+ 
  coord_cartesian(xlim = c(0, 12000000))

## Warning: Removed 16 rows containing non-finite values (stat_bin2d).

The above graph shows a bit more information than the scatter plot. We tell that generally people with an annual income of over $3,500,000 have less loans.

Week11_Assignment

2022-07-31

Lending Club Case Study - Data Import, Data Tidy and EDA

Data Import and Data Tidy

Next I’ll plot each variable to identify unusual values, missing values and its distribution.

From the above graph and summary, we know that most of the clients who get loans with Lending Club have an annual income between $40,000 - $100,000, with a mean of $77607. However there are also small number of clients whose annual income are very high.

From the above graph we can see that most of the clients didn’t have any inquires in the past 6 months. And there exist some usual values, such as 0.0.1 which must be a mistaken input. 17 is suspicious, might also be a mistaken input. And there are missing values as well.

Bivariate Analysis

From the above graph we can see that most of the loans have a grade of A,B,C. And for loans that have a grade lower than F, loan amount doesn’t seem to be a decisive factor for distribution anymore.

Scatter plot isn’t very helpful to learn about the correlation between loan amount and annual income.

The above graph shows a bit more information than the scatter plot. We tell that generally people with an annual income of over $3,500,000 have less loans.