Lending club is a leading online marketplace that connects lenders with borrowers for short-term personal loans up to $40,000. The company has existed since 2007 and has provided millions of personal loans to individuals; in December 2014 they announced IPO (Initial Public Offering), prompting them to change the business model slightly by making some risky landings. Thie data analysis report evaluates the Lending Club data from 2012 to 2017, using R for data manipulation, cleaning, analysis, visualization and interpretation.
This analysis report dives deep into the borrower and loan characteristics across geography, credit grade, borrower annual income and employment length and time. It explores the analysis of loans granted by Lending Club at a national, state, regional and division level to highlight trends and insights like the difference in loan amounts by state/region, number of loans granted by year or the influence of borrower income and employment length on the loan amounts and loan grade assigned by Lending Club.
In order to do the analysis, this report uses the following data files provided in csv format from the Lending Club:
Annual Lending Club Loans Files: All loans borrowed data files from 2012-2017 which contains the information on all the loans granted by Lending Club between 2012 and 2017 with 32 variables (columns) like loan amount, annual income, loan grade, address, interest rates and more.
DataDictionary.xls: A supplementary file that contains the description of each column in the 2012-2017 data files.
States.csv: A states characteristics csv that contains demographic information like population size, median income, unemployment rate etc.
States_regions.csv: Lastly, we have a states_regions,csv which contains the information on which state gets categorized as which region and divisions.
5) LC_Sample.csv: Contains sample data to compare with all years data to better understand the different fields and data types available for analysis.
Before the data analysis process, the data needs to be available in
an efficient manner in a single table. To achieve this, the first
step was to create an R Studio project “CRN196_Final_Case_Analysis”
which has a built-in path to the folder that contains all the
above-mentioned CSV files. This can be verified with the
getwd() function:
## [1] "/Users/smit/Downloads/Final Case Analysis/CRN196_Final_Case_Analysis"
All six year’s annual Lending Club datasets from 2012-2017 were read into R Studio and to standardize the data columns in all datasets except 2013, LC_sample.csv was used as a reference. Each file had a different column header structure with some being unnamed or the header of columns didn’t exist. Once all the column names in each dataset were consistent, they were all stacked vertically using rbind(). To make sure that the stacking process worked, the new table “LC_Data_All” was inspected using functions like str(), head(), names(), and nrow().
# Read all years files
LCData2012 <- read.csv("data2012.csv", header = FALSE)
LCData2013 <- read.csv("data2013.csv", header = TRUE)
LCData2014 <- read.csv("data2014.csv", header = FALSE)
LCData2015 <- read.csv("data2015.csv", header = FALSE)## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
## : embedded nul(s) found in input
LCData2016 <- read.csv("data2016.csv", header = FALSE)
LCData2017 <- read.csv("data2017.csv", header = FALSE)
# Read sample data file
LCSample <- read.csv("LCSample(in).csv")
# Check number of columns in each to make sure they all match sample data
ncol(LCSample)## [1] 32
## [1] 32
## [1] 32
## [1] 32
## [1] 32
## [1] 32
## [1] 32
# Change column names using names()
names(LCData2012) <- names(LCSample)
names(LCData2014) <- names(LCSample)
names(LCData2015) <- names(LCSample)
names(LCData2016) <- names(LCSample)
names(LCData2017) <- names(LCSample)
# Stack them all together
LC_Data_All <- rbind(LCData2012, LCData2013, LCData2014,
LCData2015, LCData2016, LCData2017)
# Preview first 10 rows of LC_Data_All
library(DT)
datatable(head(LC_Data_All, 5),
options = list(scrollX = TRUE, pageLength = 5),
caption = 'Preview of LC_Data_All')The states.csv file contains valuable geographic information like the state population size, median income, males/female ration and unemployment rates. This data, mapped with Lending Club’s loan data from 2012-2017 can provide valuable insights like loans per capita, average amount of loans by state or correlation between unemployment rate and the demand for borrowing in the state.
To achieve this, the states.csv was read into R as LC_States, then a lookup table “states_table” was created to convert the abbreviated state names to full names in order to match with the “Geography” column of the states.csv. Finally, a new combined file LC_Data_All_States was created. The results can be seen in the next screenshot using the head() function. It can be observed that a new column called “state” created on left side of the table.
# Join states file with the stacked file using the state name as the primary key
LC_States <- read.csv("states.csv", header = TRUE)
head(LC_States)# Create table of state codes and full names to match both the tables
state_table <- data.frame(
addr_state = state.abb,
state = state.name,
stringsAsFactors = FALSE
)
# Merge the LC Data All years with states_table for full state names
LC_Data_States <- merge(
LC_Data_All,
state_table,
by = "addr_state",
all.x = TRUE
)
# Now merge states.csv with LC_Data_States
LC_Data_All_States <- merge(
LC_Data_States,
LC_States,
by.x = "state", # Using the new state column from new merged file
by.y = "Geography",
all.x = TRUE # To have rows without addr_state as well
)
# Inspect
datatable(head(LC_Data_All_States, 5),
options = list(scrollX = TRUE, pageLength = 10),
caption = 'Preview of LC_Data_All_States')Similarly to the state file merge, the states_region.csv was read into R and then joined with the LC_Data_All_States dataset to assign region to every loan row in the combined dataset based on the state of the borrower. This will help us get regional insights and trends that inform business investment decisions on a regional level.
# Join states_regions file with the combined file
LC_State_Regions <- read.csv("states_regions.csv", header = TRUE)
datatable(head(LC_State_Regions, 10),
options = list(scrollX = TRUE, pageLength = 5),
caption = 'Preview of LC_State_Regions')# Merge the states_regions file with the combined states file
LC_Final_Combined <- merge(
LC_Data_All_States,
LC_State_Regions,
by.x = "state",
by.y = "State",
all.x = TRUE
)
# Inspect
nrow(LC_Final_Combined)## [1] 1444064
A clean dataset can help reduce data inaccuracies, makes the insights reliable for better decision making, enhances data quality and improves data analysis process efficiency.
To clean the final Lending Club dataset, records with missing loan amount, interest rate, annual income, grade, purpose, issue year, or region were removed. The cleanup process removed about 3500 records out of a little over 1.4 million records.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Remove missing values rows for relevant columns
LC_Final_Combined_Clean <- LC_Final_Combined |>
filter(
!is.na(loan_amnt),
!is.na(int_rate),
!is.na(annual_inc),
!is.na(grade),
!is.na(purpose),
!is.na(issue_Year),
!is.na(Region)
) |>
select(-X.x, -X.y) # removing columns created from merge
datatable(head(LC_Final_Combined_Clean, 10),
options = list(scrollX = TRUE, pageLength = 5),
caption = 'Preview of LC_Final_Combined_Clean')To analyze the Lending Club data to extract valuable insights, 6 important questions were answered ranging from the number of loans by geographic differences to the average loan amount by state, region and division.
Find the distribution of number of loans by state, regions and divisions. Describe in your own words the geographic differences in the number of loans. Also, analyze your results by comparing the number of loans per capita. Did you notice any missing states in the Lending Club data? If yes, then find out why.
The analysis began by computing the following variables grouped by state
A new table to count the number of loans granted by the state was created while joining the population column from our cleaned final dataset which then was used to calculate the variables.
# Distribution of number of loans by state #
number_loans_by_state <- LC_Final_Combined_Clean |>
count(state, sort = TRUE) |> # number of loans per state
left_join(LC_Final_Combined_Clean |> # to add the population to the table
select(state, population) |>
distinct(),
by = "state") |>
mutate(
percent_share = round(n / sum(n) * 100, 2), # to get percent of total
rank = dense_rank(desc(n)), # to get rank of the state
loans_per_capita = (n / population), # to get loans per capita by state
loans_per_100K = round(n / population * 100000, 0) # to get loans per 100K
)
datatable(head(number_loans_by_state, 50),
options = list(scrollX = TRUE, pageLength = 5),
caption = 'Preview of number_loans_by_state')The analysis revealed the top 5 states by number of loans are California, New York, Texas, Florida, and Illinois - in that order. These 5 states combined account for just over 40% of total number loans between 2012 and 2017, which shows a heavy concentration of Lending Club customers in some of the most populous and urban states in the country. This pattern is expected as these states are not only the 5 most populous states in the country, but also have a higher internet presence, more financial assets and aligns well with the Lending Clubs target demographic.
Some states, such as New York, do not have the highest number of loans but still rank higher than states with more loans when it comes to loans per capita with. This can mean that the Lending Club has a stronger brand presence in those states.
Nevada, which ranks number 1 for loans per 100K, issued 710 loans per 100,000 people between 2012 and 2017 indicating that Lending Club has the strongest brand presence in that state. Yet, it accounts for less than 2% of total loans issued, likely due to lower population of the state compared to some of the most populous states.
While the 5 states with the least number of loans - Iowa, North Dakota, Idaho, South Dakota, and Wyoming – all together account for less than 1% of all loans over 6 years. These states are mostly in low-density rural areas in the Midwest and the west region.However, Wyoming stands out as the number of loans issued per 100,000 people is among the top 10 states.
Similarly, to state, we computed the same variables but grouped by region instead.
# Distribution of number of loans by regions
number_loans_by_regions <- LC_Final_Combined_Clean |>
select(state, Region, population) |> # get distinct state-region population
distinct() |>
group_by(Region) |> # get region population
summarize(population = sum(population, na.rm = TRUE)) |>
left_join( # get number of loans by region
LC_Final_Combined_Clean |>
count(Region, name = "n"),
by = "Region"
) |>
mutate(
percent_share = round(n / sum(n) * 100, 2), # percent share of total
loans_per_capita = (n / population), # loans per capita by region
loans_per_100K = round(n / population * 100000, 0 ) #loans per 100K by region
) |>
arrange(desc(n))
datatable(head(number_loans_by_regions, 10),
options = list(scrollX = TRUE, pageLength = 5),
caption = 'Preview of number_loans_by_regions')When looking at the number of loans by region, the South ranks number 1, with 35% of total loans granted between 2012-2017, followed by the West (26%), Northeast (21%) and Midwest (18%). This analysis reveals that Lending Club activity is more prominent in the South and West accounting for more than 65% of total loans from 2012 to 2017.
The distribution of loans among regions also aligns with the top 5 and bottom 5 states by number of loans. Two of the top 5 states are from the South region – Texas and Florida, while the other 2 – California and New York, in the West and Northeast respectively.
Like the state level, regions with the highest number of loans don’t always have a high level of loans per 100,000 people. For example, Northeast ranks 1st when it comes to loans issued per 100,000 people but ranks 3rd in terms of number of loans. The high loans per 100,000 is highly evident in most of northeast states as half of the top 10 states by loans per 100,000 are in the northeast.
This suggests that the Northeast region has a higher adoption rate of Lending Club loans among the residents of the region.
A similar data analysis method was applied here to gather insights at a division level to further breakdown the regional insights into smaller divisions.
# Distribution of number of loans by divisions
number_loans_by_divisions <- LC_Final_Combined_Clean |>
select(state, Division, population) |> # get distinct state-division population
distinct() |>
group_by(Division) |> # get division population
summarize(population = sum(population, na.rm = TRUE)) |>
left_join( # get number of loans by division
LC_Final_Combined_Clean |>
count(Division, sort = TRUE),
by = "Division"
) |>
mutate(
percent_share = round(n / sum(n) * 100, 2), # percent share of total by division
loans_per_capita = (n / population), # loans per capita by division
loans_per_100K = round(n / population * 100000, 0) # loans per 100K by division
) |>
arrange(desc(n))
datatable(head(number_loans_by_divisions, 10),
options = list(scrollX = TRUE, pageLength = 10),
caption = 'Preview of number_loans_by_divisions')
At the division level, the ranking follows a similar trend as states and regions with South Atlantic ranking number 1 with for 20% of total loan share, followed by Pacific (18%), Middle Atlantic (15%) and East North Central (13%). While the bottom 3 divisions – Each South Central (4%), West North Central (5%), and New England (5%) - also follow a similar trend to the bottom regions.
When looking at loans granted per 100K, we can see a similar trend at the division level as well, where divisions that don’t have the highest number of loans granted but have a higher loan granted per 100,000 people. For example, New England is one of the bottom 3 divisions in terms of number of loans granted but ranks second in terms of loans per 100K. This aligns well with the previous insight of the Northeast region having a stronger brand presence compared to other regions.
All 50 U.S states were present in the 2012-2017 Lending Club dataset. While some states such as Iowa, North Dakota, and Wyoming had very low loan volumes contributing to less than 0.2% of total loan volume. Although they are not missing, their representation in the data is insignificant.
Compare the average amount of loans granted by all states and divisions. Which states and divisions have the highest and lowest average loan amounts?
Before deep diving into the average amount of loans by states and divisions it’s important to have the national average to benchmark against the averag loan amounts of states and divisions. This was computed using a simple mean() function on the loan_amnt column:
# National average loan amount
national_avg_loan <- LC_Final_Combined_Clean |>
mutate(loan_amnt = as.numeric(loan_amnt)) |>
summarize(national_avg = round(mean(loan_amnt, na.rm = TRUE), 0))
print(national_avg_loan)## national_avg
## 1 14782
The national average amount of loan granted by the Lending Club between 2012 and 2017 was $14,782.
To further analyze the Lending Clubs loans from 2012 to 2017, the average amount of loans was calculated using the mean() and group_by() function:
# Average amount of loans granted by state
avg_loanamnt_by_state <- LC_Final_Combined_Clean |>
mutate(loan_amnt = as.numeric(loan_amnt)) |>
group_by(state) |>
summarize(avg_loan_amnt = round(mean(loan_amnt, na.rm = TRUE),0)) |>
arrange(desc(avg_loan_amnt))
datatable(head(avg_loanamnt_by_state, 50),
options = list(scrollX = TRUE, pageLength = 5),
caption = 'Preview of avg_loanamnt_by_state')top_5_states <- avg_loanamnt_by_state |> # Top 5 States
slice_max(avg_loan_amnt, n = 5)
top_5_statesbottom_5_states <- avg_loanamnt_by_state |> # Bottom 5 States
slice_min(avg_loan_amnt, n = 5)
bottom_5_states
The average amount of loans in each state ranges from $13,000 to $17,000. Alaska ranks number 1 with the highest loan amount of $17,044 which suggests the high level of individual loans in that state. Iowa is a clear outlier, with the lowest average loan amount of a little less than $7,000; significantly lower than the national average of $14,782. Some factors that can have an impact on the average loan amounts include state economy or cost of living.
Using a similar methodology, the average amount of loans granted by divisions was computed.
# Average amount of loans granted by divisions
avg_loanamnt_by_divisions <- LC_Final_Combined_Clean |>
mutate(loan_amnt = as.numeric(loan_amnt)) |>
group_by(Division) |>
summarize(avg_loan_amnt = round(mean(loan_amnt, na.rm = TRUE), 0)) |>
arrange(desc(avg_loan_amnt))
avg_loanamnt_by_divisionsThe average loan amount by division ranges from $14,330 to $15,199. The national average loan amount of $14,782, sits right in the middle of this range.
The West South-Central division has the highest average loan amount of $15,199 which includes states like Texas, Oklahoma and Louisiana. This average loan amount is slightly higher than the national average which means the demand for loans is higher in those states. This trend also aligns with the high number of loans in the area.
New England and Pacific, which includes some higher-income urban states, rank close with the average loan amounts of $15,083 and $14,997 respectively.
One the other hand of the spectrum, the East South-Central and the West North Central division have the lowest average loan amounts of around $14,330, about $450 lower than the national average. This may be due to the differences in income levels and lower cost of living in those areas. While the difference is not extreme, these slight variations of loan amounts can tell a lot about the borrowing power of the residents in those states.
Compare the average interest rate charged and average loan amount by the loan Grade. Do you notice any patterns?
After extracting valuable insights on the loan distribution geographically, it’s time to take a look at Loan Grade. This is a grade assigned to each record by the Lending Club. The grade in the name indicates there may be some correlation between Loan Grade and Interest rates charged. Does a lower loan grade mean a higher interest rate?
To answer that question, the loan amount and interest rate fields are first converted to numeric, the final dataset is grouped by grade while computing average loan amount and average interest rate using mean().
# Compute national average interest rate
national_avg_int_rate <- LC_Final_Combined_Clean |>
mutate(int_rate = as.numeric(gsub("%", "", int_rate))) |>
summarize(avg_int_rate = round(mean(int_rate, na.rm = TRUE), 2))National Average Interest
Rate
13.31%
The national average interest rate at the Lending Club is 13.31%, the average interest rate by grade ranges from 7.07% for grade A to 28.5% for grade G. The analysis reveals a strong positive correlation between loan grade and interest rate, as predicted. As the grade moves from A to G, the average interest rate increases significantly, indicating a higher credit risk. This pattern suggests that the credit risk is an important factor in the interest rate pricing model of the Lending Club.
# Average interest rate and average loan amount by loan Grade
avgs_by_grade <- LC_Final_Combined_Clean |>
mutate(
loan_amnt = as.numeric(loan_amnt), # Make sure all loan amount are numeric
int_rate = as.numeric(gsub("%", "", int_rate)) # Remove the % signs for data manipulations
) |>
group_by(grade) |> # Group the table by grade
summarize(
avg_loan_amnt = round(mean(loan_amnt, na.rm = TRUE), 0), # Get the average loan amount
avg_int_rate = round(mean(int_rate, na.rm = TRUE), 2) # Get average interest rate
) |>
arrange(grade)
avgs_by_grade# Compute growth rate
avgs_by_grade <- avgs_by_grade |>
mutate(
growth_rate = round((avg_int_rate - lag(avg_int_rate)) / lag(avg_int_rate) * 100, 2)
)
avgs_by_gradeThe average interest rate increases at a steady rate from Grade A to Grade G with the most significant jump from Grade A to Grade B with over 50% growth in average interest rate. Although this growth rate declines as we go up to Grade G indicating narrowing of risk-based interest rates pricing as the loan grades worsen.
The average loan amount also increases with a declining grade letter, ranging from $14,186 in Grade A to $20,517 in Grade G. This can mean that people with lower grades tend to borrow larger amount of loans, potentially due to more financial need. It can also reflect on Lending Club’s willingness to give high interest loans with higher credit risk to offset the balances.
These findings demonstrate the direct relationship between interest rate, loan size, and credit risk and reinforce the importance of grade-based loan system for brands like Lending Club.
Run a frequency distribution of number of loans, average loan amount and average interest rate for each state by year (2012 through 2017). Describe the changing patterns in those numbers.
Before looking at yearly trends of each state, it’s important to understand the yearly trends at a national level to benchmark the state trends against it.
# National yearly trends
loans_by_year <- LC_Final_Combined_Clean |>
mutate(
loan_amnt = as.numeric(loan_amnt), # Make sure all loan amount are numeric
int_rate = as.numeric(gsub("%", "", int_rate)) # Remove the % signs for data manipulations
) |>
group_by(issue_Year) |> # Get the loans data by year
summarize(
n_loans = n(),
avg_loan_amnt = round(mean(loan_amnt, na.rm = TRUE), 0), # Get the average loan amount
avg_int_rate = round(mean(int_rate, na.rm = TRUE), 2) # Get average interest rate
) |>
arrange(issue_Year)
loans_by_yearFrom 2012 to 2017, there was a general upward trend in average loan amounts at Lending Club, rising from $13,463 in 2012 to a peak of $15,169 in 2015, before plateauing just under $15,000 in the following years. The increase of average loan amount in 2015 reflects the growth followed by Lending Club’s IPO in 2014 which most likely brough in new base of consumers.
On the other hand, the average interest rate shows a little more variability. It initially rose from 13.6% in 2012 to a high of 14.5% in 2013, but then declined sharply to 12.3% in 2015, possibly indicating improved borrower quality or more competitive pricing after the Lending Club’s IPO but they began to rise again slightly after 2015, stabilizing around 13% by 2017.
These trends suggest a growth in the amount of loans issued by Lending Club while there’s more shifts in interest rates, which may reflect changes in borrower credit mix or adjustments to the platform’s risk assessment models over the years.
Now by computing average loan amounts and average interest rates over the years of each state, the national yearly trends can be used as a benchmark for comparsion
# Frequency distribution of loans, average amount and average interest for each state by year
loans_by_state_year <- LC_Final_Combined_Clean |>
mutate(
loan_amnt = as.numeric(loan_amnt), # Make sure loan amount is numeric
int_rate = as.numeric(gsub("%", "", int_rate)) # Interest rate cleanup
) |>
group_by(state, issue_Year) |> # Group by state and then for each state by year
summarize(
n_loans = n(),
avg_loan_amnt = round(mean(loan_amnt, na.rm = TRUE), 0), # Get the average loan amount
avg_int_rate = round(mean(int_rate, na.rm = TRUE), 2) # Get average interest rate
) |>
arrange(state, issue_Year)## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
# Top 3 states by year to view trends
top_3_state <- loans_by_state_year |>
filter(state %in% c("California", "New York", "Texas"))
top_3_stateThe three states with the highest total loan volumes - California, New York, and Texas show consistent trend to the national trends. The Loan volume grew significantly year over year with California issuing over 60,000 loans in 2017. For all three states, the growth was significant in 2014 and 2015 indicating the impact of the Lending Club IPO.
There’s also a steady increase in the average loan amounts with Texas’s increasing by 15% going from $13,940 in 2012 to $15,443 in 2017. Interest rates remain relatively stable across all three states, ranging from 12.2% to 14.8%, with minor fluctuations. This indicates the steady quality of borrowers in these urban states with a steady economy and strong fintech adoption.
bottom_3_state <- loans_by_state_year |>
filter(state %in% c("Wyoming", "North Dakota", "South Dakota"))
bottom_3_stateOn the lower end of the spectrum, Wyoming, North Dakota, and South Dakota showed a much lower loan volumes, each issuing under 1,000 loans per year, even in peak years for Lending Club like 2014 and 2015.
Surprisingly, these states showed higher average loan amounts, often comparable to or exceeding national levels. Although the interest rates have different trends that don’t align with the high loan amounts. For instance, Wyoming’s average loan amount exceeded $15,000 consistently, while the interest rates remained between 12.4% and 14.3%. This shows little difference from higher-volume states like Texas and California. This suggests that while less people borrowed from Lending Club, those who did often borrowed larger sums in these states.
These findings and fluctuations in averages across different states can tell us that the national expansion expected from the IPO played out a bit unevenly across states with urban states with high population scaling faster than the rural states with low populations. The rural states still had a steady growth with good quality borrowers.
Is there a relationship with the population size of a state and the average loan amount given? Is there a relationship between Grade of loans and median income level in a state?
# Population size vs average loan amount by state
pop_vs_loanamnt_state <- LC_Final_Combined_Clean |>
mutate(loan_amnt = as.numeric(loan_amnt)) |> # Make loan amount numeric
group_by(state, population) |> # Group the numbers by state and then population
summarize(
avg_loan_amt = round(mean(loan_amnt, na.rm = TRUE), 0),
.groups = "drop"
) |> # Add the average loan amount by state
arrange(desc(population))
print(pop_vs_loanamnt_state, n = 50)## # A tibble: 50 × 3
## state population avg_loan_amt
## <chr> <int> <dbl>
## 1 California 38982847 15029
## 2 Texas 27419612 15466
## 3 Florida 20278447 14065
## 4 New York 19798228 14617
## 5 Illinois 12854526 15229
## 6 Pennsylvania 12790505 14466
## 7 Ohio 11609756 14070
## 8 Georgia 10201635 15049
## 9 North Carolina 10052564 14483
## 10 Michigan 9925568 14057
## 11 New Jersey 8960161 15611
## 12 Virginia 8365952 15898
## 13 Washington 7169967 15055
## 14 Arizona 6809946 14244
## 15 Massachusetts 6789319 15616
## 16 Indiana 6614418 14390
## 17 Tennessee 6597381 14391
## 18 Missouri 6075300 14303
## 19 Maryland 5996079 15634
## 20 Wisconsin 5763217 14187
## 21 Minnesota 5490726 14334
## 22 Colorado 5436519 14773
## 23 South Carolina 4893444 14687
## 24 Alabama 4850771 14391
## 25 Louisiana 4663461 14611
## 26 Kentucky 4424376 14080
## 27 Oregon 4025127 13886
## 28 Oklahoma 3896251 14709
## 29 Connecticut 3594478 15029
## 30 Iowa 3118102 6962
## 31 Utah 2993941 14889
## 32 Mississippi 2986220 14465
## 33 Arkansas 2977944 13762
## 34 Kansas 2903820 14552
## 35 Nevada 2887725 14098
## 36 New Mexico 2084828 14739
## 37 Nebraska 1893921 13795
## 38 West Virginia 1836843 14632
## 39 Idaho 1657375 14064
## 40 Hawaii 1421658 15530
## 41 New Hampshire 1331848 14750
## 42 Maine 1330158 14464
## 43 Rhode Island 1056138 13995
## 44 Montana 1029862 13605
## 45 Delaware 943732 14680
## 46 South Dakota 855444 13972
## 47 North Dakota 745475 14983
## 48 Alaska 738565 17044
## 49 Vermont 624636 13482
## 50 Wyoming 583200 15420
To identify if the population of the state has an impact on the average loan amount, the loan amount was grouped by state and then sorted by population.
The most populous states like California, Texas and New York show high average loan amounts that are comparable to or a little higher than the national average of $14,782. However, several smaller states also show similar average loan amounts. For example, Alaska with a population of under 800,000 ranks number 1 for average loan amount, Hawaii and Wyoming also rank high with average loan amounts of over $15,000 each.
On the other hand, Iowa, a mid-sized state in terms of population, has the lowest average loan amount at only $6,962, standing out as a significant outlier once again with only 2 loans granted from 2012 to 2017.
These results indicate that there is no clear relationship between the population size of the state and average loan amount of the state. Loan sizes are most likely influenced by state factors such as regional financial behavior, state median income levels, or state population of high credit risk borrowers.
From previous analysis, it is clear that the interest rates have a clear positive relationship with the Loan Grade. Does the same relationship apply to the state median income?
# Grade vs median income level in a state
grade_percent_by_state <- LC_Final_Combined_Clean |>
count(state, grade) |> # Get the count of loans by state and by grade
group_by(state) |> # Group by state
mutate(
loans_n = sum(n),
grade_percent = round(n / loans_n * 100, 2)
) |>
ungroup() |> # Stop the grouping here
left_join(
LC_Final_Combined_Clean |> # Get median income household by state
select(state, Median_income_Households) |>
distinct(),
by = "state"
) |>
arrange(desc(Median_income_Households))
grade_percent_by_stateThe comparison of Loan grade with Median income of the state uncovers a clear pattern. States with high median income tend to have more Grade A and Grade B loans. For instance, Massachusetts with one of the highest median incomes (~$74K) shows about 80% of the total loans of the state in A, B, or C grade. While the lower income states like Mississippi with median income of less than $42K show only about 40% of total loans in Grade A and B and a high concentration of loans in Grade C and higher.
This strong correlation aligns with Lending Club’s credit model that assigns higher interest rates to higher grade loans. It can be inferred that the borrowers from wealthier states are more likely to receive higher credit grades due to better financial profiles. This supports the hypothesis that income level is a driver of loan grade distribution across states.
This is an open-ended question where you are asked to share an interesting fact that you found through data analysis.
While we know that Grades are influenced by median income, and are strongly correlated with interest rates, it might be interesting to understand the average annual income for each loan grade to see if there’s a correlation.
# Grade vs Annual Income of Borrowers
avg_income_by_grade <- LC_Final_Combined_Clean |>
mutate(annual_inc = as.numeric(annual_inc)) |>
group_by(grade) |>
summarize(
avg_annual_income = round(mean(annual_inc, na.rm = TRUE), 0)
) |>
arrange(grade)
avg_income_by_gradeWith no surprise, there’s a clear downward relationship between income and loan grade as the Grade A borrowers have the highest average income of $91,086, Grades B to D show a steady decline to $71,190 for grade D loans. Interestingly, the average income shows a slight increase again with Grade G’s average income being $75,269. This indicates that the higher-grade loans are given to borrowers who have a high income and high credit risk, highlighting the importance of credit history for computing the loan grade.
Create a plot of interest rates and Grade of a loan and describe the pattern.
Using the ggplot() functions, a boxplot representing the interest rates frequency distribution for each loan grade.
library(ggplot2)
# Interest rate vs grade for the loan
LC_Final_Combined_Clean |>
mutate(int_rate = as.numeric(gsub("%", "", int_rate))) |>
ggplot(aes(x = grade, y = int_rate)) +
geom_boxplot(fill = "blue") +
labs(
title = "Interest Rates by Loan Grade",
x = "LC Loan Grade",
y = "Interest Rates (%)") +
theme_classic()The boxplot indicates a clear pattern of a positive relationship between loan grades and interest rate. As the loan grades go from A to G, the interest rates increase steadily, indicating Grade A borrowers are given low-interest rate between 5-10% and Grade G borrowers are given a high interest rate between 25-30%.
Create a map of US states and color code the map with the average amount of loans given.
# Map of US states by average amount of loans
library(maps)
# Get US map data
us_states_data <- map_data("state")
# Get the average loan amount by state data table ready
map_data_avg_loan_amnt <- us_states_data |>
left_join(
avg_loanamnt_by_state |>
mutate(region = tolower(state)),
by = "region"
)
# Plot the US map
ggplot(map_data_avg_loan_amnt,
aes(long, lat, group = group,
fill = avg_loan_amnt)) +
geom_polygon(color = "black") +
coord_fixed(1.0) +
scale_fill_viridis_c(option = "C", name = "Avg Loan Amount") +
labs(title = "Average Loan Amount by US State") +
theme_void()To create a map of the US, first the US states longitude and latitude data from the built-in map_data table is extracted. Next, the US data table is joined with the average loan amount by state table.
The above US map highlights state patterns in average Lending Club loan amounts. States like Florida, Texas, and New Jersey exhibit the highest average loans which was clear from the previous analysis. These states seem to have a strong borrowing activity or larger financial needs. Once again, Iowa emerges an outlier with the lowest average loan amount of less than $8,000 potentially due to demographic or economic factors, or a lower loan volume (2) skewing the average.
Show visually the relationship between the annual income of the recipient and the loan amount obtained from Lending Club.
While there’s some relation between the median income and the loan grade, it’s not yet clear if the annual income of borrowers has anything to do with the loan amount.
But first, the borrowers are grouped into income brackets like <25K, 25K-50K and so on using the cut() function. The ggplot() geom_col() is used to create a bar chart with a geom_line() showcasing a linear trend.
# Average loan amount by income brackets
avg_loan_income_brackets <- LC_Final_Combined_Clean |>
mutate(
annual_inc = as.numeric(annual_inc),
loan_amnt = as.numeric(loan_amnt),
income_bracket = cut(
annual_inc,
breaks = c(0, 25000, 50000, 75000, 100000, 150000, 250000, 300000, Inf),
labels = c("<25K", "25–50K", "50–75K", "75–100K", "100–150K", "150–250K", "250-300K", "300K+" )
)
) |>
filter(!is.na(income_bracket)) |>
group_by(income_bracket) |>
summarize(
avg_loan_amnt = round(mean(loan_amnt, na.rm = TRUE), 0),
count = n()
) |>
arrange(income_bracket)
avg_loan_income_brackets# Plot the average loan amount by income brackets
ggplot(avg_loan_income_brackets,
aes(x = income_bracket, y = avg_loan_amnt)) +
geom_col(fill = "blue", width = 0.5) +
geom_line(aes(group = 1), color = "black", linewidth = 1.0) +
scale_y_continuous(labels = scales::dollar) +
labs(
x = "Income Bracket",
y = "Average Loan Amoount",
title = "Relation between income and loan amounts"
) +
theme_classic()The bar chart reveals a strong positive relationship between the borrower’s annual income and the loan amounts granted. As the income increases, the average loan amount also increases steadily. The average loan amount ranges from under $10,000 for people with less than $25,000 annual income to $25,000 for people earning more than $300,000. However, the upward trends start to flatten out at the highest income levels of $300K and above suggesting less need for higher loan size for high income individuals.
The trend line is smooth and consistent which means that income is an important factor for Lending Club’s loan approval process. This aligns with our findings for interest rates and income distributions by loan grade. Higher income individuals qualify for larger loans as they are more likely to afford the payments.
Create a plot that shows the relationship between the length of employment and amount of loan obtained.
To create a plot of length of employment and amount of loan obtained, a new table with employment length organized from <1 year to 10+ years using factor() and average loan amount grouped by employment length is computed using mean().
# Average loan amounts by employment length
avg_loan_emp_length <- LC_Final_Combined_Clean |>
mutate(
emp_length = as.character(emp_length),
loan_amnt = as.numeric(loan_amnt)
) |>
filter(
emp_length !="",
!is.na(emp_length),
!is.na(loan_amnt)
) |>
group_by(emp_length) |>
summarize(
avg_loan_amnt = round(mean(loan_amnt), 0),
loan_count = n()
) |>
mutate(
emp_length = factor(emp_length, levels = c(
"< 1 year", "1 year", "2 years", "3 years", "4 years", "5 years",
"6 years", "7 years", "8 years", "9 years", "10+ years"
)))
avg_loan_emp_length # Plot the average loan amounts by employment length
ggplot(avg_loan_emp_length,
aes(x = emp_length, y = avg_loan_amnt)) +
geom_col(fill = "orange") +
geom_line(aes(group = 1), color = "black", linewidth = 1.0) +
scale_y_continuous(labels = scales::dollar) +
labs(
title = "Relation between employment length and loan amounts",
x = "Employment Length",
y = "Average Loan Amount"
) +
theme_minimal()This chart shows somewhat of a steady and consistent relationship between employment length and average loan amount. People with 10+ years of work experiences tend to receive higher loans. The trend suggests Lending Club’s loan approval process also value employment stability, though its influence on loan size is more gradual compared to income as the growth is not that significant.
Create a “regional” map and show an interesting relationship of your liking.
To create a regional map, first the longitude and latitude data from the map_data is joined with the US states and region date along with the calculation of loans per 100K people and mean() of longitude and latitude. Then a labels table for the regional map is created to use that for the the labels on the map. Using the ggplot() geom_polygon() function, the map is created.
# Regional map of loans per 100K by region
# Table with regional long and lat data
state_region_lookup <- LC_State_Regions |> # Create a look up table for regions to match with state
mutate(region = tolower(State))
us_regional_data <- map_data_avg_loan_amnt |> # Now join with the US states map long and lat data with avg loan amounts
left_join(state_region_lookup) |>
select(long, lat, group, order,region, Region, by = "region")## Joining with `by = join_by(region)`
us_regional_data <- us_regional_data |> # Now join the above two tables
left_join(
number_loans_by_regions |>
select(Region, loans_per_100K),
by = "Region"
)
us_regional_data # Create labels table for each regions with loans per 100K by region
region_loans_per_100K <- us_regional_data |>
group_by(Region) |>
summarize(
loans_per_100K = first(loans_per_100K),
long = mean(range(long)),
lat = mean(range(lat))
)
region_loans_per_100K# Plot the regional map with loans per 100K
ggplot(us_regional_data, aes(x = long, y = lat, group = group, fill = loans_per_100K)) +
geom_polygon(color = "lightgray", linewidth = 0.2) +
geom_text(
data = region_loans_per_100K,
mapping = aes(x = long, y = lat, label = paste0(Region, "\n", loans_per_100K, " per 100K")),
color = "black",
size = 3,
inherit.aes = FALSE
) +
scale_fill_viridis_b(option = "C", name = "Loans per 100K") +
labs(
title = "Lending Club Loans per 100K by U.S. Region",
subtitle = "Each state is colored by its region’s average loans per capita"
) +
coord_fixed(1.3) +
theme_void()The regional map of the Lending Club loans per 100,000 between 2012 and 2017 reveals clear geographic disparities by each region. The Northeast region leads with the highest loans per capita of 530 loans per 100K, suggesting a higher engagement with online lending platforms in that region. The region is also amounting to the wealthiest in the nation, which can have an impact on the borrowing capacity of the residents in the region. The Western region follows with 488 loans per 100K, likely driven by high population and higher adoption of fintech in the region. In contrast, the Midwest records the lowest usage at 382 loans per 100K, which may reflect either lower demand or greater access to traditional credit sources. Although the southern region ranks 1 in terms of loan volumes between 2012 and 2017, it falls in the middle at 426 loans per 100K, highlighting a broad mix of lending behaviors in the region. These findings suggest that regional differences in economy, financial literacy, digital adoption, population and income levels have a big impact of Lending Club’s performance in the region.
The Lending Club data from 2012 to 2017 revealed several trends and correlations. These trends and insights can help to inform future targeting and risk assessment strategies. Some of the key trends extracted from the analysis: