library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.3
## Warning: package 'dplyr' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(scales)
## Warning: package 'scales' was built under R version 4.5.3
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(maps)
##
## Attaching package: 'maps'
##
## The following object is masked from 'package:purrr':
##
## map
lc_sample <- read_csv("LCsample.csv")
## New names:
## Rows: 191367 Columns: 32
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (16): term, int_rate, grade, sub_grade, emp_title, emp_length, home_owne... dbl
## (16): ...1, loan_amnt, annual_inc, dti, delinq_2yrs, inq_last_6mths, mth...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
col_names <- names(lc_sample)
The codes below stacks all six annual files and merges them with the states demographic and region files.
This step imports each annual Lending Club CSV file from 2012 through 2017. col_names = col_names applies the correct headers from the sample file. col_types = cols(.default = “c”) reads every column as character text first. This helps avoid errors when files have inconsistent formatting across years.
data2012 <- read_csv("data2012.csv", col_names = col_names, col_types = cols(.default = "c"))
data2013 <- read_csv("data2013.csv", col_names = col_names, col_types = cols(.default = "c"))
data2014 <- read_csv("data2014.csv", col_names = col_names, col_types = cols(.default = "c"))
data2015 <- read_csv("data2015.csv", col_names = col_names, col_types = cols(.default = "c"))
data2016 <- read_csv("data2016.csv", col_names = col_names, col_types = cols(.default = "c"))
data2017 <- read_csv("data2017.csv", col_names = col_names, col_types = cols(.default = "c"))
This step stacks all yearly datasets into one combined dataset called loans. bind_rows() adds the rows from each year together. The output is one large dataset containing Lending Club loan records from 2012 through 2017.
loans <- bind_rows(
data2012,
data2013,
data2014,
data2015,
data2016,
data2017
)
to clean, this step prepares the combined loan data for analysis. clean_names() standardizes column names into a cleaner format. mutate() creates or updates variables. loan_amnt, annual_inc, and issue_year are converted into numeric values so they can be summarized or graphed. int_rate removes the percent sign and converts the interest rate to a number. The state line converts state abbreviations like “CA” or “NY” into full state names so they can be joined with the state files.
loans_clean <- loans |>
clean_names() |>
mutate(
loan_amnt = as.numeric(loan_amnt),
int_rate = as.numeric(str_remove(int_rate, "%")),
annual_inc = as.numeric(annual_inc),
issue_year = as.numeric(issue_year),
state = state.name[match(addr_state, state.abb)]
)
## Warning: There were 4 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `loan_amnt = as.numeric(loan_amnt)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.
This step we imports the two supporting files. States.csv includes state-level information such as population, income, and unemployment. states_regions.csv includes region and division classifications. clean_names() makes the column names easier to reference in R.
states <- read_csv("States.csv") |>
clean_names()
## Rows: 52 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Geography
## dbl (7): Num_Households, Median_income_Households, Unemployment rate; Estima...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
regions <- read_csv("states_regions.csv") |>
clean_names()
## New names:
## Rows: 51 Columns: 5
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (4): State, State Code, Region, Division lgl (1): ...5
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...5`
this code combines the loan data with state demographic and regional information. The first left_join() matches the loan dataset’s state column to the geography column in the state file. The second left_join() adds region and division data using the state column. The output is final_data, which contains loan details plus state, population, income, region, and division information.
final_data <- loans_clean |>
left_join(states, by = c("state" = "geography")) |>
left_join(regions, by = "state")
##Exploratory Analysis
####1.Find the distribution of number of loans by state, regions and divisions. Describe in your own words the geographic differences in the number of loans. Also, analyze your results by comparing number of loans per capita. Did you notice any missing states in the Lending Club data? If yes, then find out why.
This step counts how many loans appear in each state. group_by(state) groups the data by state. summarize(number_of_loans = n()) counts the number of records in each state. arrange(desc(number_of_loans)) sorts the results from highest to lowest. The output shows which states have the most and fewest loans.
loans_by_state <- final_data |>
group_by(state) |>
summarize(number_of_loans = n()) |>
arrange(desc(number_of_loans))
loans_by_state
## # A tibble: 51 × 2
## state number_of_loans
## <chr> <int>
## 1 California 239836
## 2 New York 142729
## 3 Texas 142517
## 4 Florida 121499
## 5 Illinois 69826
## 6 New Jersey 63336
## 7 Pennsylvania 59289
## 8 Ohio 57766
## 9 Georgia 56297
## 10 Virginia 48505
## # ℹ 41 more rows
here, it counts the number of loans by U.S. region. group_by(region) groups loans into regions such as South, West, Northeast, and Midwest. summarize() counts the loans in each region. The output helps show whether loan activity is concentrated in certain parts of the country
loans_by_region <- final_data |>
group_by(region) |>
summarize(number_of_loans = n()) |>
arrange(desc(number_of_loans))
loans_by_region
## # A tibble: 5 × 2
## region number_of_loans
## <chr> <int>
## 1 South 612804
## 2 West 441840
## 3 Northeast 354637
## 4 Midwest 309450
## 5 <NA> 4166
the code adjusts loan counts for population size. group_by(state, population) groups the data by each state and its population. summarize() counts the loans per state. mutate() creates a new variable called loans_per_capita by dividing loans by population. This gives a fairer comparison because large states naturally tend to have more loans
loans_per_capita <- final_data |>
group_by(state, population) |>
summarize(number_of_loans = n(), .groups = "drop") |>
mutate(loans_per_capita = number_of_loans / population) |>
arrange(desc(loans_per_capita))
loans_per_capita
## # A tibble: 51 × 4
## state population number_of_loans loans_per_capita
## <chr> <dbl> <int> <dbl>
## 1 Nevada 2887725 24515 0.00849
## 2 Connecticut 3594478 27015 0.00752
## 3 New York 19798228 142729 0.00721
## 4 Rhode Island 1056138 7532 0.00713
## 5 New Jersey 8960161 63336 0.00707
## 6 Maryland 5996079 40828 0.00681
## 7 Colorado 5436519 36214 0.00666
## 8 New Hampshire 1331848 8484 0.00637
## 9 Wyoming 583200 3699 0.00634
## 10 California 38982847 239836 0.00615
## # ℹ 41 more rows
####2. Compare the average amount of loans granted by all states and divisions. Which states and divisions have the highest and lowest average loan amounts?
This code calculates the average loan amount for each state. group_by(state) separates the data by state. mean(loan_amnt, na.rm = TRUE) calculates the average loan amount while ignoring missing values. arrange(desc()) ranks the states from highest average loan amount to lowest.
avg_loan_state <- final_data |>
group_by(state) |>
summarize(avg_loan_amount = mean(loan_amnt, na.rm = TRUE)) |>
arrange(desc(avg_loan_amount))
avg_loan_state
## # A tibble: 51 × 2
## state avg_loan_amount
## <chr> <dbl>
## 1 Alaska 17058.
## 2 Virginia 15952.
## 3 <NA> 15730.
## 4 Maryland 15689.
## 5 New Jersey 15684.
## 6 Massachusetts 15673.
## 7 Hawaii 15586.
## 8 Texas 15569.
## 9 Wyoming 15466.
## 10 Illinois 15315.
## # ℹ 41 more rows
This step calculates the average loan amount by U.S. division. Divisions are smaller geographic groupings within regions. The output shows which divisions have the highest and lowest average loan amounts.
avg_loan_division <- final_data |>
group_by(division) |>
summarize(avg_loan_amount = mean(loan_amnt, na.rm = TRUE)) |>
arrange(desc(avg_loan_amount))
avg_loan_division
## # A tibble: 10 × 2
## division avg_loan_amount
## <chr> <dbl>
## 1 <NA> 15730.
## 2 West South Central 15309.
## 3 New England 15169.
## 4 Pacific 15040.
## 5 Middle Atlantic 14903.
## 6 South Atlantic 14856.
## 7 East North Central 14567.
## 8 Mountain 14534.
## 9 East South Central 14459.
## 10 West North Central 14418.
####3.Compare the average interest rate charged and average loan amount by the loan Grade. Do you notice any patterns?
it summarizes loans by grade. group_by(grade) groups the loans into credit grades such as A, B, C, and so on. The code calculates average interest rate, average loan amount, and total number of loans for each grade. The output helps compare risk levels across loan grades.
grade_summary <- final_data |>
group_by(grade) |>
summarize(
avg_interest_rate = mean(int_rate, na.rm = TRUE),
avg_loan_amount = mean(loan_amnt, na.rm = TRUE),
number_of_loans = n()
) |>
arrange(grade)
grade_summary
## # A tibble: 10 × 4
## grade avg_interest_rate avg_loan_amount number_of_loans
## <chr> <dbl> <dbl> <int>
## 1 A 7.08 14265. 287667
## 2 B 10.6 13688. 509800
## 3 C 14.0 14828. 514460
## 4 D 17.8 15752. 249360
## 5 E 21.4 17895. 113286
## 6 F 25.3 19202. 37323
## 7 G 28.3 20577. 10984
## 8 Unnamed: 8 NaN NaN 4
## 9 grade NaN NaN 1
## 10 <NA> NaN NaN 12
####4) Run a frequency distribution of number of loans, average loan amount and average interest rate for each state by year (2012 through 2017). Describe the changing patterns in those numbers. The code calculates the number of loans, average loan amount, and average interest rate for every state-year combination. This output can be used to see how lending changed from 2012 through 2017.Result summarizes loan activity by state and year. group_by(state, issue_year) creates groups for each state in each year.
state_year_summary <- final_data |>
group_by(state, issue_year) |>
summarize(
number_of_loans = n(),
average_loan_amount = mean(loan_amnt, na.rm = TRUE),
average_interest_rate = mean(int_rate, na.rm = TRUE),
.groups = "drop"
) |>
arrange(state, issue_year)
state_year_summary
## # A tibble: 294 × 5
## state issue_year number_of_loans average_loan_amount average_interest_rate
## <chr> <dbl> <int> <dbl> <dbl>
## 1 Alabama 2012 695 14002. 13.9
## 2 Alabama 2013 1644 14469. 15.1
## 3 Alabama 2014 3034 14887. 14.3
## 4 Alabama 2015 5347 15072. 13.2
## 5 Alabama 2016 5329 14053. 13.5
## 6 Alabama 2017 5147 14364. 13.6
## 7 Alaska 2012 178 15610. 14.8
## 8 Alaska 2013 389 16562. 14.5
## 9 Alaska 2014 598 17199. 14.2
## 10 Alaska 2015 954 17150. 12.8
## # ℹ 284 more rows
####5) Is there a relationship with the population size of a state and the average loan amount given? Is there a relationship between Grade of loans and median income level in a state?
the code calculates the average loan amount for each state. Then cor() calculates the correlation between population and average loan amount. A correlation close to 0 means there is little relationship. A positive value means both variables tend to increase together. A negative value means one tends to decrease as the other increases.Here we explor whether states with larger populations also have higher average loan amounts.
population_loan_relationship <- final_data |>
group_by(state, population) |>
summarize(avg_loan_amount = mean(loan_amnt, na.rm = TRUE), .groups = "drop")
cor(
population_loan_relationship$population,
population_loan_relationship$avg_loan_amount,
use = "complete.obs"
)
## [1] 0.1332012
population_loan <- final_data |>
group_by(state, population) |>
summarize(
average_loan_amount = mean(loan_amnt, na.rm = TRUE),
.groups = "drop"
)
population_loan
## # A tibble: 51 × 3
## state population average_loan_amount
## <chr> <dbl> <dbl>
## 1 Alabama 4850771 14535.
## 2 Alaska 738565 17058.
## 3 Arizona 6809946 14326.
## 4 Arkansas 2977944 13901.
## 5 California 38982847 15068.
## 6 Colorado 5436519 14906.
## 7 Connecticut 3594478 15143.
## 8 Delaware 943732 14736.
## 9 Florida 20278447 14139.
## 10 Georgia 10201635 15111.
## # ℹ 41 more rows
cor(
population_loan$population,
population_loan$average_loan_amount,
use = "complete.obs"
)
## [1] 0.1332012
ggplot(population_loan, aes(x = population, y = average_loan_amount)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Population Size vs Average Loan Amount",
x = "State Population",
y = "Average Loan Amount"
)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
grade_income <- final_data |>
mutate(
grade_num = case_when(
grade == "A" ~ 1,
grade == "B" ~ 2,
grade == "C" ~ 3,
grade == "D" ~ 4,
grade == "E" ~ 5,
grade == "F" ~ 6,
grade == "G" ~ 7
)
) |>
group_by(state, median_income_households) |>
summarize(
average_grade = mean(grade_num, na.rm = TRUE),
.groups = "drop"
)
grade_income
## # A tibble: 51 × 3
## state median_income_households average_grade
## <chr> <dbl> <dbl>
## 1 Alabama 46472 2.87
## 2 Alaska 76114 2.80
## 3 Arizona 53510 2.70
## 4 Arkansas 43813 2.80
## 5 California 67169 2.70
## 6 Colorado 65458 2.69
## 7 Connecticut 73781 2.71
## 8 Delaware 63036 2.77
## 9 Florida 50883 2.76
## 10 Georgia 52977 2.76
## # ℹ 41 more rows
cor(
grade_income$median_income_households,
grade_income$average_grade,
use = "complete.obs"
)
## [1] -0.2678967
ggplot(grade_income, aes(x = median_income_households, y = average_grade)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Median Income vs Average Loan Grade",
x = "Median Household Income",
y = "Average Loan Grade (1=A, 7=G)"
)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
###Visualization
####1) Create a plot of interest rates and Grade or a loan and describe the pattern.
grade_interest <- final_data |>
group_by(grade) |>
summarize(
average_interest_rate = mean(int_rate, na.rm = TRUE),
.groups = "drop"
) |>
arrange(grade)
grade_interest
## # A tibble: 10 × 2
## grade average_interest_rate
## <chr> <dbl>
## 1 A 7.08
## 2 B 10.6
## 3 C 14.0
## 4 D 17.8
## 5 E 21.4
## 6 F 25.3
## 7 G 28.3
## 8 Unnamed: 8 NaN
## 9 grade NaN
## 10 <NA> NaN
ggplot(grade_interest, aes(x = grade, y = average_interest_rate, fill = grade)) +
geom_col() +
labs(
title = "Average Interest Rate by Loan Grade",
x = "Loan Grade",
y = "Average Interest Rate (%)"
) +
theme_minimal()
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_col()`).
####2) Create a map of US states and color code the map with the average amount of loans given. This code first calculates the average loan amount for each state. Then it converts state names to lowercase so they can match the map data. The map_data(“state”) function provides the shape data for U.S. states. The left_join() connects the loan summary data to the map. The final ggplot() creates a U.S. map where each state is shaded based on its average loan amount.
# Create average loan amount by state
avg_loan_map <- final_data |>
group_by(state) |>
summarize(
average_loan_amount = mean(loan_amnt, na.rm = TRUE),
.groups = "drop"
) |>
mutate(region = tolower(state))
# Load U.S. state map data
us_states <- map_data("state")
# Join loan data to map data
map_data_loans <- us_states |>
left_join(avg_loan_map, by = "region")
# Create map
ggplot(map_data_loans, aes(x = long, y = lat, group = group, fill = average_loan_amount)) +
geom_polygon(color = "white") +
coord_fixed(1.3) +
labs(
title = "Average Loan Amount by State",
x = "",
y = "",
fill = "Average Loan Amount"
) +
scale_fill_continuous(labels = dollar) +
theme_minimal()
# Create scatterplot of annual income vs loan amount
ggplot(final_data, aes(x = annual_inc, y = loan_amnt)) +
geom_point(alpha = 0.3, color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
scale_x_continuous(labels = dollar) +
scale_y_continuous(labels = dollar) +
labs(
title = "Annual Income vs Loan Amount",
x = "Annual Income of Borrower",
y = "Loan Amount Received"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 17 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_point()`).