1 Introduction

Lending club is a leading online marketplace that connects lenders with borrowers for short-term personal loans up to $40,000. The company has existed since 2007 and has provided millions of personal loans to individuals; in December 2014 they announced IPO (Initial Public Offering), prompting them to change the business model slightly by making some risky landings.  Thie data analysis report evaluates the Lending Club data from 2012 to 2017, using R for data manipulation, cleaning, analysis, visualization and interpretation. 

This analysis report dives deep into the borrower and loan characteristics across geography, credit grade, borrower annual income and employment length and time. It explores the analysis of loans granted by Lending Club at a national, state, regional and division level to highlight trends and insights like the difference in loan amounts by state/region, number of loans granted by year or the influence of borrower income and employment length on the loan amounts and loan grade assigned by Lending Club.  

2 Data Details

In order to do the analysis, this report uses the following data files provided in csv format from the Lending Club:

  1. Annual Lending Club Loans Files: All loans borrowed data files from 2012-2017 which contains the information on all the loans granted by Lending Club between 2012 and 2017 with 32 variables (columns) like loan amount, annual income, loan grade, address, interest rates and more.

  2. DataDictionary.xls: A supplementary file that contains the description of each column in the 2012-2017 data files.

  3. States.csv: A states characteristics csv that contains demographic information like population size, median income, unemployment rate etc.

  4. States_regions.csv: Lastly, we have a states_regions,csv which contains the information on which state gets categorized as which region and divisions.

5) LC_Sample.csv: Contains sample data to compare with all years data to better understand the different fields and data types available for analysis.

3 Merging and Cleaning

Before the data analysis process, the data needs to be available in an efficient manner in a single table. To achieve this, the first step was to create an R Studio project “CRN196_Final_Case_Analysis” which has a built-in path to the folder that contains all the above-mentioned CSV files. This can be verified with the getwd() function:

getwd()
## [1] "/Users/smit/Downloads/Final Case Analysis/CRN196_Final_Case_Analysis"

3.1 Stack all Lending Club Datasets

All six year’s annual Lending Club datasets from 2012-2017 were read into R Studio and to standardize the data columns in all datasets except 2013, LC_sample.csv was used as a reference. Each file had a different column header structure with some being unnamed or the header of columns didn’t exist. Once all the column names in each dataset were consistent, they were all stacked vertically using rbind(). To make sure that the stacking process worked, the new table “LC_Data_All” was inspected using functions like str(), head(), names(), and nrow().

# Read all years files 
LCData2012 <- read.csv("data2012.csv", header = FALSE)
LCData2013 <- read.csv("data2013.csv", header = TRUE)
LCData2014 <- read.csv("data2014.csv", header = FALSE)
LCData2015 <- read.csv("data2015.csv", header = FALSE)
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
## : embedded nul(s) found in input
LCData2016 <- read.csv("data2016.csv", header = FALSE)
LCData2017 <- read.csv("data2017.csv", header = FALSE)

# Read sample data file 
LCSample <- read.csv("LCSample(in).csv")

# Check number of columns in each to make sure they all match sample data
ncol(LCSample)
## [1] 32
ncol(LCData2012)
## [1] 32
ncol(LCData2013)
## [1] 32
ncol(LCData2014)
## [1] 32
ncol(LCData2015)
## [1] 32
ncol(LCData2016)
## [1] 32
ncol(LCData2017)
## [1] 32
# Change column names using names() 
names(LCData2012) <- names(LCSample)
names(LCData2014) <- names(LCSample)
names(LCData2015) <- names(LCSample)
names(LCData2016) <- names(LCSample)
names(LCData2017) <- names(LCSample)

# Stack them all together 
LC_Data_All <- rbind(LCData2012, LCData2013, LCData2014, 
                     LCData2015, LCData2016, LCData2017)

# Preview first 10 rows of LC_Data_All
library(DT)
datatable(head(LC_Data_All, 5),
          options = list(scrollX = TRUE, pageLength = 5),
          caption = 'Preview of LC_Data_All')

3.2 Merging with States Data 

The states.csv file contains valuable geographic information like the state population size, median income, males/female ration and unemployment rates. This data, mapped with Lending Club’s loan data from 2012-2017 can provide valuable insights like loans per capita, average amount of loans by state or correlation between unemployment rate and the demand for borrowing in the state.

To achieve this, the states.csv was read into R as LC_States, then a lookup table “states_table” was created to convert the abbreviated state names to full names in order to match with the “Geography” column of the states.csv. Finally, a new combined file LC_Data_All_States was created. The results can be seen in the next screenshot using the head() function. It can be observed that a new column called “state” created on left side of the table.

# Join states file with the stacked file using the state name as the primary key
LC_States <- read.csv("states.csv", header = TRUE)
head(LC_States)
# Create table of state codes and full names to match both the tables
state_table <- data.frame(
  addr_state = state.abb,
  state = state.name,
  stringsAsFactors = FALSE
)

# Merge the LC Data All years with states_table for full state names
LC_Data_States <- merge(
  LC_Data_All,
  state_table,
  by = "addr_state",
  all.x = TRUE
)

# Now merge states.csv with LC_Data_States
LC_Data_All_States <- merge(
  LC_Data_States,
  LC_States,
  by.x = "state", # Using the new state column from new merged file
  by.y = "Geography",
  all.x = TRUE # To have rows without addr_state as well
)

# Inspect
datatable(head(LC_Data_All_States, 5),
          options = list(scrollX = TRUE, pageLength = 10),
          caption = 'Preview of LC_Data_All_States')

3.3 Merging with Regions Data 

Similarly to the state file merge, the states_region.csv was read into R and then joined with the LC_Data_All_States dataset to assign region to every loan row in the combined dataset based on the state of the borrower. This will help us get regional insights and trends that inform business investment decisions on a regional level.  

# Join states_regions file with the combined file 
LC_State_Regions <- read.csv("states_regions.csv", header = TRUE)

datatable(head(LC_State_Regions, 10),
          options = list(scrollX = TRUE, pageLength = 5),
          caption = 'Preview of LC_State_Regions')
# Merge the states_regions file with the combined states file 
LC_Final_Combined <- merge(
  LC_Data_All_States,
  LC_State_Regions,
  by.x = "state",
  by.y = "State",
  all.x = TRUE
)

# Inspect 
nrow(LC_Final_Combined)
## [1] 1444064
datatable(head(LC_Final_Combined, 50),
          options = list(scrollX = TRUE, pageLength = 5),
          caption = 'Preview of LC_Final_Combined')

3.4 Cleaning the Final Dataset 

A clean dataset can help reduce data inaccuracies, makes the insights reliable for better decision making, enhances data quality and improves data analysis process efficiency.  

To clean the final Lending Club dataset, records with missing loan amount, interest rate, annual income, grade, purpose, issue year, or region were removed. The cleanup process removed about 3500 records out of a little over 1.4 million records. 

# Check for missing values in relevant columns 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
LC_Final_Combined |> 
  summarize(across(everything(), ~ sum(is.na(.x))))
# Remove missing values rows for relevant columns 
LC_Final_Combined_Clean <- LC_Final_Combined |>
  filter(
    !is.na(loan_amnt),
    !is.na(int_rate),
    !is.na(annual_inc),
    !is.na(grade),
    !is.na(purpose),
    !is.na(issue_Year),
    !is.na(Region)  
  ) |>
  select(-X.x, -X.y) # removing columns created from merge

datatable(head(LC_Final_Combined_Clean, 10),
          options = list(scrollX = TRUE, pageLength = 5),
          caption = 'Preview of LC_Final_Combined_Clean')

4 Data Analysis

To analyze the Lending Club data to extract valuable insights, 6 important questions were answered ranging from the number of loans by geographic differences to the average loan amount by state, region and division.

4.1 Question 1: Geographic Distribution Analysis 

Find the distribution of number of loans by state, regions and divisions. Describe in your own words the geographic differences in the number of loans. Also, analyze your results by comparing the number of loans per capita. Did you notice any missing states in the Lending Club data? If yes, then find out why. 

4.1.1 Distribution of loans by state, regions, and division 

4.1.1.1 Distribution of loans by state 

The analysis began by computing the following variables grouped by state  

  1. Number of loans by state(n): the total number of loans granted in each state between 2012 and 2017. 
  1. Percent share of total loans(percent_share): the percent of total loans granted by each state 
  1. Ranking of the states by number of loans granted 
  1. Loans per capita: Using the formula loans per state/population per state can inform us of the market share of each state.  
  1. Loans per 100K people: To normalize the loans per capita that were granted in the period by using the formula loans per state/population per state * 100,000 

A new table to count the number of loans granted by the state was created while joining the population column from our cleaned final dataset which then was used to calculate the variables.

# Distribution of number of loans by state # 
number_loans_by_state <- LC_Final_Combined_Clean |>
  count(state, sort = TRUE) |>                         # number of loans per state
  left_join(LC_Final_Combined_Clean |>                 # to add the population to the table
              select(state, population) |>
              distinct(),
            by = "state") |>
  mutate(
    percent_share = round(n / sum(n) * 100, 2),        # to get percent of total
    rank = dense_rank(desc(n)),                        # to get rank of the state 
    loans_per_capita = (n / population),               # to get loans per capita by state
    loans_per_100K = round(n / population * 100000, 0) # to get loans per 100K
    )

datatable(head(number_loans_by_state, 50),
          options = list(scrollX = TRUE, pageLength = 5),
          caption = 'Preview of number_loans_by_state')

The analysis revealed the top 5 states by number of loans are California, New York, Texas, Florida, and Illinois - in that order. These 5 states combined account for just over 40% of total number loans between 2012 and 2017, which shows a heavy concentration of Lending Club customers in some of the most populous and urban states in the country. This pattern is expected as these states are not only the 5 most populous states in the country, but also have a higher internet presence, more financial assets and aligns well with the Lending Clubs target demographic.

Some states, such as New York, do not have the highest number of loans but still rank higher than states with more loans when it comes to loans per capita with. This can mean that the Lending Club has a stronger brand presence in those states.

Nevada, which ranks number 1 for loans per 100K, issued 710 loans per 100,000 people between 2012 and 2017 indicating that Lending Club has the strongest brand presence in that state. Yet, it accounts for less than 2% of total loans issued, likely due to lower population of the state compared to some of the most populous states.  

While the 5 states with the least number of loans - Iowa, North Dakota, Idaho, South Dakota, and Wyoming – all together account for less than 1% of all loans over 6 years. These states are mostly in low-density rural areas in the Midwest and the west region.However, Wyoming stands out as the number of loans issued per 100,000 people is among the top 10 states. 

4.1.1.2 Distribution of loans by regions 

Similarly, to state, we computed the same variables but grouped by region instead.

# Distribution of number of loans by regions 
number_loans_by_regions <- LC_Final_Combined_Clean |>
  select(state, Region, population) |>                  # get distinct state-region population
  distinct() |>
  group_by(Region) |>                                   # get region population
  summarize(population = sum(population, na.rm = TRUE)) |>
  left_join(                                            # get number of loans by region
    LC_Final_Combined_Clean |>
      count(Region, name = "n"),
    by = "Region"
  ) |>
  mutate(
    percent_share = round(n / sum(n) * 100, 2),         # percent share of total 
    loans_per_capita = (n / population),                # loans per capita by region
    loans_per_100K = round(n / population * 100000, 0 ) #loans per 100K by region
  ) |>
  arrange(desc(n))

datatable(head(number_loans_by_regions, 10),
          options = list(scrollX = TRUE, pageLength = 5),
          caption = 'Preview of number_loans_by_regions')

When looking at the number of loans by region, the South ranks number 1, with 35% of total loans granted between 2012-2017, followed by the West (26%), Northeast (21%) and Midwest (18%). This analysis reveals that Lending Club activity is more prominent in the South and West accounting for more than 65% of total loans from 2012 to 2017.

The distribution of loans among regions also aligns with the top 5 and bottom 5 states by number of loans. Two of the top 5 states are from the South region – Texas and Florida, while the other 2 – California and New York, in the West and Northeast respectively. 

Like the state level, regions with the highest number of loans don’t always have a high level of loans per 100,000 people. For example, Northeast ranks 1st when it comes to loans issued per 100,000 people but ranks 3rd in terms of number of loans. The high loans per 100,000 is highly evident in most of northeast states as half of the top 10 states by loans per 100,000 are in the northeast.   

This suggests that the Northeast region has a higher adoption rate of Lending Club loans among the residents of the region.  

4.1.1.3 Distribution of loans by divisions 

A similar data analysis method was applied here to gather insights at a division level to further breakdown the regional insights into smaller divisions.  

# Distribution of number of loans by divisions 
number_loans_by_divisions <- LC_Final_Combined_Clean |>
  select(state, Division, population) |>                    # get distinct state-division population
  distinct() |>
  group_by(Division) |>                                     # get division population
  summarize(population = sum(population, na.rm = TRUE)) |>
  left_join(                                                # get number of loans by division
    LC_Final_Combined_Clean |>
      count(Division, sort = TRUE),
    by = "Division"
  ) |>
  mutate(
    percent_share = round(n / sum(n) * 100, 2),             # percent share of total by division
    loans_per_capita = (n / population),                    # loans per capita by division
    loans_per_100K = round(n / population * 100000, 0)      # loans per 100K by division
  ) |>
  arrange(desc(n))

datatable(head(number_loans_by_divisions, 10),
          options = list(scrollX = TRUE, pageLength = 10),
          caption = 'Preview of number_loans_by_divisions')

 

At the division level, the ranking follows a similar trend as states and regions with South Atlantic ranking number 1 with for 20% of total loan share, followed by Pacific (18%), Middle Atlantic (15%) and East North Central (13%). While the bottom 3 divisions – Each South Central (4%), West North Central (5%), and New England (5%) - also follow a similar trend to the bottom regions. 

When looking at loans granted per 100K, we can see a similar trend at the division level as well, where divisions that don’t have the highest number of loans granted but have a higher loan granted per 100,000 people. For example, New England is one of the bottom 3 divisions in terms of number of loans granted but ranks second in terms of loans per 100K. This aligns well with the previous insight of the Northeast region having a stronger brand presence compared to other regions.  

4.1.2 Key Insights 

  1. States with highest loan volumes: CA, TX, FL, NY, IL 
  1. States with Highest loans per capita: NV, NY, UT 
  1. States with lowest per capita: IA, WY, ND 
  1. Top region by volume: South 
  1. Top region by per capita: Northeast  

All 50 U.S states were present in the 2012-2017 Lending Club dataset. While some states such as Iowa, North Dakota, and Wyoming had very low loan volumes contributing to less than 0.2% of total loan volume. Although they are not missing, their representation in the data is insignificant. 

4.2 Question 2: Average Loan Amounts States and Divisions 

Compare the average amount of loans granted by all states and divisions. Which states and divisions have the highest and lowest average loan amounts? 

4.2.1 National average amount of loans granted by Lending Club 

Before deep diving into the average amount of loans by states and divisions it’s important to have the national average to benchmark against the averag loan amounts of states and divisions. This was computed using a simple mean() function on the loan_amnt column:  

# National average loan amount 
national_avg_loan <- LC_Final_Combined_Clean |>
  mutate(loan_amnt = as.numeric(loan_amnt)) |>
  summarize(national_avg = round(mean(loan_amnt, na.rm = TRUE), 0))

print(national_avg_loan)
##   national_avg
## 1        14782

The national average amount of loan granted by the Lending Club between 2012 and 2017 was $14,782. 

4.2.2 Average amount of loans granted by all states 

To further analyze the Lending Clubs loans from 2012 to 2017, the average amount of loans was calculated using the mean() and group_by() function: 

# Average amount of loans granted by state
avg_loanamnt_by_state <- LC_Final_Combined_Clean |>
  mutate(loan_amnt = as.numeric(loan_amnt)) |>
  group_by(state) |>
  summarize(avg_loan_amnt = round(mean(loan_amnt, na.rm = TRUE),0)) |>
  arrange(desc(avg_loan_amnt))

datatable(head(avg_loanamnt_by_state, 50),
          options = list(scrollX = TRUE, pageLength = 5),
          caption = 'Preview of avg_loanamnt_by_state')
top_5_states <- avg_loanamnt_by_state |> # Top 5 States
  slice_max(avg_loan_amnt, n = 5)

top_5_states
bottom_5_states <- avg_loanamnt_by_state |> # Bottom 5 States
  slice_min(avg_loan_amnt, n = 5)

bottom_5_states

 

The average amount of loans in each state ranges from $13,000 to $17,000. Alaska ranks number 1 with the highest loan amount of $17,044 which suggests the high level of individual loans in that state. Iowa is a clear outlier, with the lowest average loan amount of a little less than $7,000; significantly lower than the national average of $14,782. Some factors that can have an impact on the average loan amounts include state economy or cost of living. 

4.2.3 Average amount of loans granted by Divisions 

Using a similar methodology, the average amount of loans granted by divisions was computed.  

# Average amount of loans granted by divisions
avg_loanamnt_by_divisions <- LC_Final_Combined_Clean |>
  mutate(loan_amnt = as.numeric(loan_amnt)) |>
  group_by(Division) |>
  summarize(avg_loan_amnt = round(mean(loan_amnt, na.rm = TRUE), 0)) |>
  arrange(desc(avg_loan_amnt))

avg_loanamnt_by_divisions

The average loan amount by division ranges from $14,330 to $15,199. The national average loan amount of $14,782, sits right in the middle of this range.  

The West South-Central division has the highest average loan amount of $15,199 which includes states like Texas, Oklahoma and Louisiana. This average loan amount is slightly higher than the national average which means the demand for loans is higher in those states. This trend also aligns with the high number of loans in the area.  

New England and Pacific, which includes some higher-income urban states, rank close with the average loan amounts of $15,083 and $14,997 respectively.  

One the other hand of the spectrum, the East South-Central and the West North Central division have the lowest average loan amounts of around $14,330, about $450 lower than the national average. This may be due to the differences in income levels and lower cost of living in those areas. While the difference is not extreme, these slight variations of loan amounts can tell a lot about the borrowing power of the residents in those states.  

4.2.4 Key Insights 

  1. 5 States with the highest average loan amounts: Alaska, Virginia, Maryland, Massachusetts and New Jersey. These states are also among the top 10 states with highest median incomes suggesting a strong correlation between median income and average loan amounts.  
  2. States with the lowest average loan amount: Iowa, Vermont, Montana, Arkansas, and Nebraska. These states also have a lower median income compared to the top 5 states.
  3. Divisions with highest average loan amount: West South-Central and New England 
  4. Divisions with lowest average loan amount: East South-Central and West North Central 

4.3 Question 3: Analysis by Loan Grade

Compare the average interest rate charged and average loan amount by the loan Grade. Do you notice any patterns?

4.3.1 Computing the averages by Loan Grade

After extracting valuable insights on the loan distribution geographically, it’s time to take a look at Loan Grade. This is a grade assigned to each record by the Lending Club. The grade in the name indicates there may be some correlation between Loan Grade and Interest rates charged. Does a lower loan grade mean a higher interest rate?

To answer that question, the loan amount and interest rate fields are first converted to numeric, the final dataset is grouped by grade while computing average loan amount and average interest rate using mean(). 

# Compute national average interest rate
national_avg_int_rate <- LC_Final_Combined_Clean |>
  mutate(int_rate = as.numeric(gsub("%", "", int_rate))) |>
  summarize(avg_int_rate = round(mean(int_rate, na.rm = TRUE), 2))

National Average Interest Rate
13.31%

The national average interest rate at the Lending Club is 13.31%, the average interest rate by grade ranges from 7.07% for grade A to 28.5% for grade G. The analysis reveals a strong positive correlation between loan grade and interest rate, as predicted. As the grade moves from A to G, the average interest rate increases significantly, indicating a higher credit risk. This pattern suggests that the credit risk is an important factor in the interest rate pricing model of the Lending Club.

# Average interest rate and average loan amount by loan Grade 
avgs_by_grade <- LC_Final_Combined_Clean |>
  mutate(
    loan_amnt = as.numeric(loan_amnt),             # Make sure all loan amount are numeric
    int_rate = as.numeric(gsub("%", "", int_rate)) # Remove the % signs for data manipulations
  ) |>
  group_by(grade) |>   # Group the table by grade
  summarize(
    avg_loan_amnt = round(mean(loan_amnt, na.rm = TRUE), 0), # Get the average loan amount
    avg_int_rate = round(mean(int_rate, na.rm = TRUE), 2)    # Get average interest rate
  ) |>
  arrange(grade)

avgs_by_grade
# Compute growth rate
avgs_by_grade <- avgs_by_grade |>
  mutate(
    growth_rate = round((avg_int_rate - lag(avg_int_rate)) / lag(avg_int_rate) * 100, 2)
  )
  
avgs_by_grade

The average interest rate increases at a steady rate from Grade A to Grade G with the most significant jump from Grade A to Grade B with over 50% growth in average interest rate. Although this growth rate declines as we go up to Grade G indicating narrowing of risk-based interest rates pricing as the loan grades worsen.  

The average loan amount also increases with a declining grade letter, ranging from $14,186 in Grade A to $20,517 in Grade G. This can mean that people with lower grades tend to borrow larger amount of loans, potentially due to more financial need. It can also reflect on Lending Club’s willingness to give high interest loans with higher credit risk to offset the balances.  

These findings demonstrate the direct relationship between interest rate, loan size, and credit risk and reinforce the importance of grade-based loan system for brands like Lending Club.  

4.3.2 Key Insights 

  1. Grade A loans have the lowest average interest rate and the lowest average loan amount.  
  1. Grade G loans have the highest average interest rate and the highest average loan amount.  
  1. A positive linear relationship between Loan Grade and Interest Rate. 

4.5 Question 5: Relationship between Population and Average Loan Amount 

Is there a relationship with the population size of a state and the average loan amount given? Is there a relationship between Grade of loans and median income level in a state?

4.5.1 Population size vs average loan amount by state

# Population size vs average loan amount by state 
pop_vs_loanamnt_state <- LC_Final_Combined_Clean |>
  mutate(loan_amnt = as.numeric(loan_amnt)) |> # Make loan amount numeric
  group_by(state, population) |> # Group the numbers by state and then population
  summarize(   
    avg_loan_amt = round(mean(loan_amnt, na.rm = TRUE), 0),
    .groups = "drop"
  ) |>                 # Add the average loan amount by state 
  arrange(desc(population))


print(pop_vs_loanamnt_state, n = 50)
## # A tibble: 50 × 3
##    state          population avg_loan_amt
##    <chr>               <int>        <dbl>
##  1 California       38982847        15029
##  2 Texas            27419612        15466
##  3 Florida          20278447        14065
##  4 New York         19798228        14617
##  5 Illinois         12854526        15229
##  6 Pennsylvania     12790505        14466
##  7 Ohio             11609756        14070
##  8 Georgia          10201635        15049
##  9 North Carolina   10052564        14483
## 10 Michigan          9925568        14057
## 11 New Jersey        8960161        15611
## 12 Virginia          8365952        15898
## 13 Washington        7169967        15055
## 14 Arizona           6809946        14244
## 15 Massachusetts     6789319        15616
## 16 Indiana           6614418        14390
## 17 Tennessee         6597381        14391
## 18 Missouri          6075300        14303
## 19 Maryland          5996079        15634
## 20 Wisconsin         5763217        14187
## 21 Minnesota         5490726        14334
## 22 Colorado          5436519        14773
## 23 South Carolina    4893444        14687
## 24 Alabama           4850771        14391
## 25 Louisiana         4663461        14611
## 26 Kentucky          4424376        14080
## 27 Oregon            4025127        13886
## 28 Oklahoma          3896251        14709
## 29 Connecticut       3594478        15029
## 30 Iowa              3118102         6962
## 31 Utah              2993941        14889
## 32 Mississippi       2986220        14465
## 33 Arkansas          2977944        13762
## 34 Kansas            2903820        14552
## 35 Nevada            2887725        14098
## 36 New Mexico        2084828        14739
## 37 Nebraska          1893921        13795
## 38 West Virginia     1836843        14632
## 39 Idaho             1657375        14064
## 40 Hawaii            1421658        15530
## 41 New Hampshire     1331848        14750
## 42 Maine             1330158        14464
## 43 Rhode Island      1056138        13995
## 44 Montana           1029862        13605
## 45 Delaware           943732        14680
## 46 South Dakota       855444        13972
## 47 North Dakota       745475        14983
## 48 Alaska             738565        17044
## 49 Vermont            624636        13482
## 50 Wyoming            583200        15420

To identify if the population of the state has an impact on the average loan amount, the loan amount was grouped by state and then sorted by population.  

The most populous states like California, Texas and New York show high average loan amounts that are comparable to or a little higher than the national average of $14,782. However, several smaller states also show similar average loan amounts. For example, Alaska with a population of under 800,000 ranks number 1 for average loan amount, Hawaii and Wyoming also rank high with average loan amounts of over $15,000 each.  

On the other hand, Iowa, a mid-sized state in terms of population, has the lowest average loan amount at only $6,962, standing out as a significant outlier once again with only 2 loans granted from 2012 to 2017. 

These results indicate that there is no clear relationship between the population size of the state and average loan amount of the state. Loan sizes are most likely influenced by state factors such as regional financial behavior, state median income levels, or state population of high credit risk borrowers.  

4.5.2 Grade vs Median Income

From previous analysis, it is clear that the interest rates have a clear positive relationship with the Loan Grade. Does the same relationship apply to the state median income?

# Grade vs median income level in a state
grade_percent_by_state <- LC_Final_Combined_Clean |> 
  count(state, grade) |> # Get the count of loans by state and by grade
  group_by(state) |> # Group by state
  mutate(
    loans_n = sum(n),
    grade_percent = round(n / loans_n * 100, 2)
  ) |>
  ungroup() |> # Stop the grouping here
  left_join(
    LC_Final_Combined_Clean |> # Get median income household by state 
      select(state, Median_income_Households) |>
      distinct(),
    by = "state"
  ) |>
  arrange(desc(Median_income_Households))

grade_percent_by_state

The comparison of Loan grade with Median income of the state uncovers a clear pattern. States with high median income tend to have more Grade A and Grade B loans. For instance, Massachusetts with one of the highest median incomes (~$74K) shows about 80% of the total loans of the state in A, B, or C grade. While the lower income states like Mississippi with median income of less than $42K show only about 40% of total loans in Grade A and B and a high concentration of loans in Grade C and higher.  

This strong correlation aligns with Lending Club’s credit model that assigns higher interest rates to higher grade loans. It can be inferred that the borrowers from wealthier states are more likely to receive higher credit grades due to better financial profiles. This supports the hypothesis that income level is a driver of loan grade distribution across states. 

4.5.3 Key Insights

  1. Populous states have a higher average loan amount but so some lower populated states so there’s not direct correlation.
  2. States with high median income tend to have most of the loan grade A.B, or C for instance 80% of total loans in Massachusetts are in those grades.
  3. Strong correlation with grade and median income of the state.

4.6 Question 6

This is an open-ended question where you are asked to share an interesting fact that you found through data analysis. 

4.6.1 Average annual income by Grade 

While we know that Grades are influenced by median income, and are strongly correlated with interest rates, it might be interesting to understand the average annual income for each loan grade to see if there’s a correlation.

# Grade vs Annual Income of Borrowers 
avg_income_by_grade <- LC_Final_Combined_Clean |>
  mutate(annual_inc = as.numeric(annual_inc)) |>
  group_by(grade) |>
  summarize(
    avg_annual_income = round(mean(annual_inc, na.rm = TRUE), 0)
  ) |>
  arrange(grade)

avg_income_by_grade

With no surprise, there’s a clear downward relationship between income and loan grade as the Grade A borrowers have the highest average income of $91,086, Grades B to D show a steady decline to $71,190 for grade D loans. Interestingly, the average income shows a slight increase again with Grade G’s average income being $75,269. This indicates that the higher-grade loans are given to borrowers who have a high income and high credit risk, highlighting the importance of credit history for computing the loan grade.  

5 Visualization

5.1 Question 1: Interest Rates vs Loan Grade

Create a plot of interest rates and Grade of a loan and describe the pattern.

5.1.1 Boxplot of Interest Rates and Loan Grade

Using the ggplot() functions, a boxplot representing the interest rates frequency distribution for each loan grade.

library(ggplot2)

# Interest rate vs grade for the loan
LC_Final_Combined_Clean |>
  mutate(int_rate = as.numeric(gsub("%", "", int_rate))) |>
  ggplot(aes(x = grade, y = int_rate)) +
  geom_boxplot(fill = "blue") +
  labs(
    title = "Interest Rates by Loan Grade",
    x = "LC Loan Grade",
    y = "Interest Rates (%)") +
  theme_classic()

The boxplot indicates a clear pattern of a positive relationship between loan grades and interest rate. As the loan grades go from A to G, the interest rates increase steadily, indicating Grade A borrowers are given low-interest rate between 5-10% and Grade G borrowers are given a high interest rate between 25-30%.  

5.1.2 Key Insights

  1.  Lowest risk, Lowest rate: Grade A loans  
  1. Highest risk, Highest rate: Grade G loans  
  1. Some outliers but lower grades show more variance 

5.2 Question 2: US Map by Average Loan Amount 

Create a map of US states and color code the map with the average amount of loans given.  

5.2.1 Heatmap of the US States with Average Loan Amounts

# Map of US states by average amount of loans
library(maps)

# Get US map data 
us_states_data <- map_data("state")

# Get the average loan amount by state data table ready 
map_data_avg_loan_amnt <- us_states_data |>
  left_join(
    avg_loanamnt_by_state |>
      mutate(region = tolower(state)), 
    by = "region"
  )

# Plot the US map 
ggplot(map_data_avg_loan_amnt, 
       aes(long, lat, group = group,
           fill = avg_loan_amnt)) +
  geom_polygon(color = "black") + 
  coord_fixed(1.0) +
  scale_fill_viridis_c(option = "C", name = "Avg Loan Amount") +
  labs(title = "Average Loan Amount by US State") +
  theme_void()

To create a map of the US, first the US states longitude and latitude data from the built-in map_data table is extracted. Next, the US data table is joined with the average loan amount by state table.

The above US map highlights state patterns in average Lending Club loan amounts. States like Florida, Texas, and New Jersey exhibit the highest average loans which was clear from the previous analysis. These states seem to have a strong borrowing activity or larger financial needs. Once again, Iowa emerges an outlier with the lowest average loan amount of less than $8,000 potentially due to demographic or economic factors, or a lower loan volume (2) skewing the average. 

5.3 Question 3: Annual Income and Loan Amount

Show visually the relationship between the annual income of the recipient and the loan amount obtained from Lending Club.

5.3.1 Average loan amount by income brackets

While there’s some relation between the median income and the loan grade, it’s not yet clear if the annual income of borrowers has anything to do with the loan amount.

But first, the borrowers are grouped into income brackets like <25K, 25K-50K and so on using the cut() function. The ggplot() geom_col() is used to create a bar chart with a geom_line() showcasing a linear trend.

# Average loan amount by income brackets 
avg_loan_income_brackets <- LC_Final_Combined_Clean |>
  mutate(
    annual_inc = as.numeric(annual_inc),
    loan_amnt = as.numeric(loan_amnt),
    income_bracket = cut(
      annual_inc,
      breaks = c(0, 25000, 50000, 75000, 100000, 150000, 250000, 300000, Inf),
      labels = c("<25K", "25–50K", "50–75K", "75–100K", "100–150K", "150–250K", "250-300K", "300K+" )
    )
  ) |>
  filter(!is.na(income_bracket)) |>
  group_by(income_bracket) |>
  summarize(
    avg_loan_amnt = round(mean(loan_amnt, na.rm = TRUE), 0),
    count = n()
  ) |>
  arrange(income_bracket)

avg_loan_income_brackets
# Plot the average loan amount by income brackets 
ggplot(avg_loan_income_brackets, 
       aes(x = income_bracket, y = avg_loan_amnt)) +
  geom_col(fill = "blue", width = 0.5) +
  geom_line(aes(group = 1), color = "black", linewidth = 1.0) +
  scale_y_continuous(labels = scales::dollar) +
  labs(
    x = "Income Bracket",
    y = "Average Loan Amoount",
    title = "Relation between income and loan amounts"
  ) +
  theme_classic()

The bar chart reveals a strong positive relationship between the borrower’s annual income and the loan amounts granted. As the income increases, the average loan amount also increases steadily. The average loan amount ranges from under $10,000 for people with less than $25,000 annual income to $25,000 for people earning more than $300,000. However, the upward trends start to flatten out at the highest income levels of $300K and above suggesting less need for higher loan size for high income individuals.

The trend line is smooth and consistent which means that income is an important factor for Lending Club’s loan approval process. This aligns with our findings for interest rates and income distributions by loan grade. Higher income individuals qualify for larger loans as they are more likely to afford the payments.

5.4 Question 4

Create a plot that shows the relationship between the length of employment and amount of loan obtained.

5.4.1 Average loan amounts by employment length

To create a plot of length of employment and amount of loan obtained, a new table with employment length organized from <1 year to 10+ years using factor() and average loan amount grouped by employment length is computed using mean().

# Average loan amounts by employment length 
avg_loan_emp_length <- LC_Final_Combined_Clean |>
  mutate(
    emp_length = as.character(emp_length),
    loan_amnt = as.numeric(loan_amnt)
  ) |>
  filter(
    emp_length !="",
    !is.na(emp_length),
    !is.na(loan_amnt)
  ) |>
  group_by(emp_length) |>
  summarize(
    avg_loan_amnt = round(mean(loan_amnt), 0),
    loan_count = n()
  ) |>
  mutate(
    emp_length = factor(emp_length, levels = c(
    "< 1 year", "1 year", "2 years", "3 years", "4 years", "5 years",
    "6 years", "7 years", "8 years", "9 years", "10+ years"
  )))

avg_loan_emp_length 
# Plot the average loan amounts by employment length 
ggplot(avg_loan_emp_length,
       aes(x = emp_length, y = avg_loan_amnt)) +
  geom_col(fill = "orange") +
  geom_line(aes(group = 1), color = "black", linewidth = 1.0) +
  scale_y_continuous(labels = scales::dollar) +
  labs(
    title = "Relation between employment length and loan amounts",
    x = "Employment Length",
    y = "Average Loan Amount"
  ) + 
  theme_minimal()

This chart shows somewhat of a steady and consistent relationship between employment length and average loan amount. People with 10+ years of work experiences tend to receive higher loans. The trend suggests Lending Club’s loan approval process also value employment stability, though its influence on loan size is more gradual compared to income as the growth is not that significant.  

5.5 Question 5: Regional Map with Loans per 100K

Create a “regional” map and show an interesting relationship of your liking. 

5.5.1 Creating a Regional Map with Loan per 100K by Region 

To create a regional map, first the longitude and latitude data from the map_data is joined with the US states and region date along with the calculation of loans per 100K people and mean() of longitude and latitude. Then a labels table for the regional map is created to use that for the the labels on the map. Using the ggplot() geom_polygon() function, the map is created.  

# Regional map of loans per 100K by region

# Table with regional long and lat data 
state_region_lookup <- LC_State_Regions |> # Create a look up table for regions to match with state
  mutate(region = tolower(State))

us_regional_data <- map_data_avg_loan_amnt |> # Now join with the US states map long and lat data with avg loan amounts
  left_join(state_region_lookup) |>
              select(long, lat, group, order,region, Region, by = "region")
## Joining with `by = join_by(region)`
us_regional_data <- us_regional_data |> # Now join the above two tables
  left_join(
    number_loans_by_regions |>
      select(Region, loans_per_100K),
    by = "Region"
  )

us_regional_data 
# Create labels table for each regions with loans per 100K by region
region_loans_per_100K <- us_regional_data |>
  group_by(Region) |>
  summarize(
    loans_per_100K = first(loans_per_100K),
    long = mean(range(long)),
    lat = mean(range(lat))
  )

region_loans_per_100K
# Plot the regional map with loans per 100K 
ggplot(us_regional_data, aes(x = long, y = lat, group = group, fill = loans_per_100K)) +
  geom_polygon(color = "lightgray", linewidth = 0.2) +
  geom_text(
    data = region_loans_per_100K,
    mapping = aes(x = long, y = lat, label = paste0(Region, "\n", loans_per_100K, " per 100K")),
    color = "black",
    size = 3,
    inherit.aes = FALSE
  ) +
  scale_fill_viridis_b(option = "C", name = "Loans per 100K") +
  labs(
    title = "Lending Club Loans per 100K by U.S. Region",
    subtitle = "Each state is colored by its region’s average loans per capita"
  ) +
  coord_fixed(1.3) +
  theme_void()

The regional map of the Lending Club loans per 100,000 between 2012 and 2017 reveals clear geographic disparities by each region. The Northeast region leads with the highest loans per capita of 530 loans per 100K, suggesting a higher engagement with online lending platforms in that region. The region is also amounting to the wealthiest in the nation, which can have an impact on the borrowing capacity of the residents in the region. The Western region follows with 488 loans per 100K, likely driven by high population and higher adoption of fintech in the region. In contrast, the Midwest records the lowest usage at 382 loans per 100K, which may reflect either lower demand or greater access to traditional credit sources. Although the southern region ranks 1 in terms of loan volumes between 2012 and 2017, it falls in the middle at 426 loans per 100K, highlighting a broad mix of lending behaviors in the region. These findings suggest that regional differences in economy, financial literacy, digital adoption, population and income levels have a big impact of Lending Club’s performance in the region.

6 Conclusion

The Lending Club data from 2012 to 2017 revealed several trends and correlations. These trends and insights can help to inform future targeting and risk assessment strategies. Some of the key trends extracted from the analysis:

  1. The states with large populations like Texas, California, and New York dominate the loan volume, likely due to the high population. However, there’s no correlation between the average loan amounts and the population of the state.
  2. The lending Club uses several factors for credit risk assessment, some of the most important ones being income and employment length. Higher income and longer work experience means larger loan amounts granted.
  3. The loan grade assigned by Lending Club has a strong correlation to the interest rate. The lower the grade, the lower the interest rate, with lower being A.  
  4. There’s a significant regional variation in lending activity from 2012 to 2017, with Northeast and West having a higher loan volume per capita.
  5. The Lending Club’s IPO had a significant impact on the company’s overall performance, growing the loan volume by 75% year over year in 2014.