R Markdown

Lending Club Analysis

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.3
## Warning: package 'dplyr' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(scales)
## Warning: package 'scales' was built under R version 4.5.3
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(maps)
## 
## Attaching package: 'maps'
## 
## The following object is masked from 'package:purrr':
## 
##     map
lc_sample <- read_csv("LCsample.csv")
## New names:
## Rows: 191367 Columns: 32
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (16): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne... dbl
## (16): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, inq_last_6mths, mth...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
col_names <- names(lc_sample)

The codes below stacks all six annual files and merges them with the states demographic and region files.

This step imports each annual Lending Club CSV file from 2012 through 2017. col_names = col_names applies the correct headers from the sample file. col_types = cols(.default = “c”) reads every column as character text first. This helps avoid errors when files have inconsistent formatting across years.

data2012 <- read_csv("data2012.csv", col_names = col_names, col_types = cols(.default = "c"))
data2013 <- read_csv("data2013.csv", col_names = col_names, col_types = cols(.default = "c"))
data2014 <- read_csv("data2014.csv", col_names = col_names, col_types = cols(.default = "c"))
data2015 <- read_csv("data2015.csv", col_names = col_names, col_types = cols(.default = "c"))
data2016 <- read_csv("data2016.csv", col_names = col_names, col_types = cols(.default = "c"))
data2017 <- read_csv("data2017.csv", col_names = col_names, col_types = cols(.default = "c"))

This step stacks all yearly datasets into one combined dataset called loans. bind_rows() adds the rows from each year together. The output is one large dataset containing Lending Club loan records from 2012 through 2017.

loans <- bind_rows(
  data2012,
  data2013,
  data2014,
  data2015,
  data2016,
  data2017
)

to clean, this step prepares the combined loan data for analysis. clean_names() standardizes column names into a cleaner format. mutate() creates or updates variables. loan_amnt, annual_inc, and issue_year are converted into numeric values so they can be summarized or graphed. int_rate removes the percent sign and converts the interest rate to a number. The state line converts state abbreviations like “CA” or “NY” into full state names so they can be joined with the state files.

loans_clean <- loans |>
  clean_names() |>
  mutate(
    loan_amnt = as.numeric(loan_amnt),
    int_rate = as.numeric(str_remove(int_rate, "%")),
    annual_inc = as.numeric(annual_inc),
    issue_year = as.numeric(issue_year),
    state = state.name[match(addr_state, state.abb)]
  )
## Warning: There were 4 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `loan_amnt = as.numeric(loan_amnt)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.

This step we imports the two supporting files. States.csv includes state-level information such as population, income, and unemployment. states_regions.csv includes region and division classifications. clean_names() makes the column names easier to reference in R.

states <- read_csv("States.csv") |>
  clean_names()
## Rows: 52 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Geography
## dbl (7): Num_Households, Median_income_Households, Unemployment rate; Estima...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
regions <- read_csv("states_regions.csv") |>
  clean_names()
## New names:
## Rows: 51 Columns: 5
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): State, State Code, Region, Division lgl (1): ...5
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...5`

this code combines the loan data with state demographic and regional information. The first left_join() matches the loan dataset’s state column to the geography column in the state file. The second left_join() adds region and division data using the state column. The output is final_data, which contains loan details plus state, population, income, region, and division information.

final_data <- loans_clean |>
  left_join(states, by = c("state" = "geography")) |>
  left_join(regions, by = "state")

The results returns a final dataset after all the joins, merge,cleanining has happened to begin our analysis

##Exploratory Analysis

####1.Find the distribution of number of loans by state, regions and divisions. Describe in your own words the geographic differences in the number of loans. Also, analyze your results by comparing number of loans per capita. Did you notice any missing states in the Lending Club data? If yes, then find out why.

This step counts how many loans appear in each state. group_by(state) groups the data by state. summarize(number_of_loans = n()) counts the number of records in each state. arrange(desc(number_of_loans)) sorts the results from highest to lowest. The output shows which states have the most and fewest loans.

loans_by_state <- final_data |>
  group_by(state) |>
  summarize(number_of_loans = n()) |>
  arrange(desc(number_of_loans))

loans_by_state
## # A tibble: 51 × 2
##    state        number_of_loans
##    <chr>                  <int>
##  1 California            239836
##  2 New York              142729
##  3 Texas                 142517
##  4 Florida               121499
##  5 Illinois               69826
##  6 New Jersey             63336
##  7 Pennsylvania           59289
##  8 Ohio                   57766
##  9 Georgia                56297
## 10 Virginia               48505
## # ℹ 41 more rows

here, it counts the number of loans by U.S. region. group_by(region) groups loans into regions such as South, West, Northeast, and Midwest. summarize() counts the loans in each region. The output helps show whether loan activity is concentrated in certain parts of the country

loans_by_region <- final_data |>
  group_by(region) |>
  summarize(number_of_loans = n()) |>
  arrange(desc(number_of_loans))

loans_by_region
## # A tibble: 5 × 2
##   region    number_of_loans
##   <chr>               <int>
## 1 South              612804
## 2 West               441840
## 3 Northeast          354637
## 4 Midwest            309450
## 5 <NA>                 4166

the code adjusts loan counts for population size. group_by(state, population) groups the data by each state and its population. summarize() counts the loans per state. mutate() creates a new variable called loans_per_capita by dividing loans by population. This gives a fairer comparison because large states naturally tend to have more loans

loans_per_capita <- final_data |>
  group_by(state, population) |>
  summarize(number_of_loans = n(), .groups = "drop") |>
  mutate(loans_per_capita = number_of_loans / population) |>
  arrange(desc(loans_per_capita))

loans_per_capita
## # A tibble: 51 × 4
##    state         population number_of_loans loans_per_capita
##    <chr>              <dbl>           <int>            <dbl>
##  1 Nevada           2887725           24515          0.00849
##  2 Connecticut      3594478           27015          0.00752
##  3 New York        19798228          142729          0.00721
##  4 Rhode Island     1056138            7532          0.00713
##  5 New Jersey       8960161           63336          0.00707
##  6 Maryland         5996079           40828          0.00681
##  7 Colorado         5436519           36214          0.00666
##  8 New Hampshire    1331848            8484          0.00637
##  9 Wyoming           583200            3699          0.00634
## 10 California      38982847          239836          0.00615
## # ℹ 41 more rows

####2. Compare the average amount of loans granted by all states and divisions. Which states and divisions have the highest and lowest average loan amounts?

This code calculates the average loan amount for each state. group_by(state) separates the data by state. mean(loan_amnt, na.rm = TRUE) calculates the average loan amount while ignoring missing values. arrange(desc()) ranks the states from highest average loan amount to lowest.

avg_loan_state <- final_data |>
  group_by(state) |>
  summarize(avg_loan_amount = mean(loan_amnt, na.rm = TRUE)) |>
  arrange(desc(avg_loan_amount))

avg_loan_state
## # A tibble: 51 × 2
##    state         avg_loan_amount
##    <chr>                   <dbl>
##  1 Alaska                 17058.
##  2 Virginia               15952.
##  3 <NA>                   15730.
##  4 Maryland               15689.
##  5 New Jersey             15684.
##  6 Massachusetts          15673.
##  7 Hawaii                 15586.
##  8 Texas                  15569.
##  9 Wyoming                15466.
## 10 Illinois               15315.
## # ℹ 41 more rows

This step calculates the average loan amount by U.S. division. Divisions are smaller geographic groupings within regions. The output shows which divisions have the highest and lowest average loan amounts.

avg_loan_division <- final_data |>
  group_by(division) |>
  summarize(avg_loan_amount = mean(loan_amnt, na.rm = TRUE)) |>
  arrange(desc(avg_loan_amount))

avg_loan_division
## # A tibble: 10 × 2
##    division           avg_loan_amount
##    <chr>                        <dbl>
##  1 <NA>                        15730.
##  2 West South Central          15309.
##  3 New England                 15169.
##  4 Pacific                     15040.
##  5 Middle Atlantic             14903.
##  6 South Atlantic              14856.
##  7 East North Central          14567.
##  8 Mountain                    14534.
##  9 East South Central          14459.
## 10 West North Central          14418.

####3.Compare the average interest rate charged and average loan amount by the loan Grade. Do you notice any patterns?

it summarizes loans by grade. group_by(grade) groups the loans into credit grades such as A, B, C, and so on. The code calculates average interest rate, average loan amount, and total number of loans for each grade. The output helps compare risk levels across loan grades.

grade_summary <- final_data |>
  group_by(grade) |>
  summarize(
    avg_interest_rate = mean(int_rate, na.rm = TRUE),
    avg_loan_amount = mean(loan_amnt, na.rm = TRUE),
    number_of_loans = n()
  ) |>
  arrange(grade)

grade_summary
## # A tibble: 10 × 4
##    grade      avg_interest_rate avg_loan_amount number_of_loans
##    <chr>                  <dbl>           <dbl>           <int>
##  1 A                       7.08          14265.          287667
##  2 B                      10.6           13688.          509800
##  3 C                      14.0           14828.          514460
##  4 D                      17.8           15752.          249360
##  5 E                      21.4           17895.          113286
##  6 F                      25.3           19202.           37323
##  7 G                      28.3           20577.           10984
##  8 Unnamed: 8            NaN               NaN                4
##  9 grade                 NaN               NaN                1
## 10 <NA>                  NaN               NaN               12

####4) Run a frequency distribution of number of loans, average loan amount and average interest rate for each state by year (2012 through 2017). Describe the changing patterns in those numbers. The code calculates the number of loans, average loan amount, and average interest rate for every state-year combination. This output can be used to see how lending changed from 2012 through 2017.Result summarizes loan activity by state and year. group_by(state, issue_year) creates groups for each state in each year.

state_year_summary <- final_data |>
  group_by(state, issue_year) |>
  summarize(
    number_of_loans = n(),
    average_loan_amount = mean(loan_amnt, na.rm = TRUE),
    average_interest_rate = mean(int_rate, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(state, issue_year)

state_year_summary
## # A tibble: 294 × 5
##    state   issue_year number_of_loans average_loan_amount average_interest_rate
##    <chr>        <dbl>           <int>               <dbl>                 <dbl>
##  1 Alabama       2012             695              14002.                  13.9
##  2 Alabama       2013            1644              14469.                  15.1
##  3 Alabama       2014            3034              14887.                  14.3
##  4 Alabama       2015            5347              15072.                  13.2
##  5 Alabama       2016            5329              14053.                  13.5
##  6 Alabama       2017            5147              14364.                  13.6
##  7 Alaska        2012             178              15610.                  14.8
##  8 Alaska        2013             389              16562.                  14.5
##  9 Alaska        2014             598              17199.                  14.2
## 10 Alaska        2015             954              17150.                  12.8
## # ℹ 284 more rows

####5) Is there a relationship with the population size of a state and the average loan amount given? Is there a relationship between Grade of loans and median income level in a state?

the code calculates the average loan amount for each state. Then cor() calculates the correlation between population and average loan amount. A correlation close to 0 means there is little relationship. A positive value means both variables tend to increase together. A negative value means one tends to decrease as the other increases.Here we explor whether states with larger populations also have higher average loan amounts.

population_loan_relationship <- final_data |>
  group_by(state, population) |>
  summarize(avg_loan_amount = mean(loan_amnt, na.rm = TRUE), .groups = "drop")

cor(
  population_loan_relationship$population,
  population_loan_relationship$avg_loan_amount,
  use = "complete.obs"
)
## [1] 0.1332012
population_loan <- final_data |>
  group_by(state, population) |>
  summarize(
    average_loan_amount = mean(loan_amnt, na.rm = TRUE),
    .groups = "drop"
  )

population_loan
## # A tibble: 51 × 3
##    state       population average_loan_amount
##    <chr>            <dbl>               <dbl>
##  1 Alabama        4850771              14535.
##  2 Alaska          738565              17058.
##  3 Arizona        6809946              14326.
##  4 Arkansas       2977944              13901.
##  5 California    38982847              15068.
##  6 Colorado       5436519              14906.
##  7 Connecticut    3594478              15143.
##  8 Delaware        943732              14736.
##  9 Florida       20278447              14139.
## 10 Georgia       10201635              15111.
## # ℹ 41 more rows
cor(
  population_loan$population,
  population_loan$average_loan_amount,
  use = "complete.obs"
)
## [1] 0.1332012
ggplot(population_loan, aes(x = population, y = average_loan_amount)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Population Size vs Average Loan Amount",
    x = "State Population",
    y = "Average Loan Amount"
  )
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

grade_income <- final_data |>
  mutate(
    grade_num = case_when(
      grade == "A" ~ 1,
      grade == "B" ~ 2,
      grade == "C" ~ 3,
      grade == "D" ~ 4,
      grade == "E" ~ 5,
      grade == "F" ~ 6,
      grade == "G" ~ 7
    )
  ) |>
  group_by(state, median_income_households) |>
  summarize(
    average_grade = mean(grade_num, na.rm = TRUE),
    .groups = "drop"
  )

grade_income
## # A tibble: 51 × 3
##    state       median_income_households average_grade
##    <chr>                          <dbl>         <dbl>
##  1 Alabama                        46472          2.87
##  2 Alaska                         76114          2.80
##  3 Arizona                        53510          2.70
##  4 Arkansas                       43813          2.80
##  5 California                     67169          2.70
##  6 Colorado                       65458          2.69
##  7 Connecticut                    73781          2.71
##  8 Delaware                       63036          2.77
##  9 Florida                        50883          2.76
## 10 Georgia                        52977          2.76
## # ℹ 41 more rows
cor(
  grade_income$median_income_households,
  grade_income$average_grade,
  use = "complete.obs"
)
## [1] -0.2678967
ggplot(grade_income, aes(x = median_income_households, y = average_grade)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Median Income vs Average Loan Grade",
    x = "Median Household Income",
    y = "Average Loan Grade (1=A, 7=G)"
  )
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

###Visualization

####1) Create a plot of interest rates and Grade or a loan and describe the pattern.

grade_interest <- final_data |>
  group_by(grade) |>
  summarize(
    average_interest_rate = mean(int_rate, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(grade)

grade_interest
## # A tibble: 10 × 2
##    grade      average_interest_rate
##    <chr>                      <dbl>
##  1 A                           7.08
##  2 B                          10.6 
##  3 C                          14.0 
##  4 D                          17.8 
##  5 E                          21.4 
##  6 F                          25.3 
##  7 G                          28.3 
##  8 Unnamed: 8                NaN   
##  9 grade                     NaN   
## 10 <NA>                      NaN
ggplot(grade_interest, aes(x = grade, y = average_interest_rate, fill = grade)) +
  geom_col() +
  labs(
    title = "Average Interest Rate by Loan Grade",
    x = "Loan Grade",
    y = "Average Interest Rate (%)"
  ) +
  theme_minimal()
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_col()`).

####2) Create a map of US states and color code the map with the average amount of loans given. This code first calculates the average loan amount for each state. Then it converts state names to lowercase so they can match the map data. The map_data(“state”) function provides the shape data for U.S. states. The left_join() connects the loan summary data to the map. The final ggplot() creates a U.S. map where each state is shaded based on its average loan amount.

# Create average loan amount by state
avg_loan_map <- final_data |>
  group_by(state) |>
  summarize(
    average_loan_amount = mean(loan_amnt, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(region = tolower(state))
# Load U.S. state map data
us_states <- map_data("state")
# Join loan data to map data
map_data_loans <- us_states |>
  left_join(avg_loan_map, by = "region")
# Create map
ggplot(map_data_loans, aes(x = long, y = lat, group = group, fill = average_loan_amount)) +
  geom_polygon(color = "white") +
  coord_fixed(1.3) +
  labs(
    title = "Average Loan Amount by State",
    x = "",
    y = "",
    fill = "Average Loan Amount"
  ) +
  scale_fill_continuous(labels = dollar) +
  theme_minimal()

# Create scatterplot of annual income vs loan amount
ggplot(final_data, aes(x = annual_inc, y = loan_amnt)) +
  geom_point(alpha = 0.3, color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  scale_x_continuous(labels = dollar) +
  scale_y_continuous(labels = dollar) +
  labs(
    title = "Annual Income vs Loan Amount",
    x = "Annual Income of Borrower",
    y = "Loan Amount Received"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_point()`).