Explore the loans_full_schema data set in the openintro package by performing exploratory data analysis (EDA)
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
1. Ask a question
Do people with higher annual income tend to get lower interest rates?
2. Visualize data to answer
ggplot(loans_full_schema, aes(x = annual_income, y = interest_rate)) +
  geom_point(alpha = 0.2) +
  scale_x_log10() +
  labs(
    title = "Interest Rate vs Annual Income",
    x = "Annual Income (log scale)",
    y = "Interest Rate (%)"
  ) +
  theme(plot.title = element_text(hjust = 0.5))
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.

Loans to higher-income borrowers tend to have slightly lower interest rates, but the points are widely scattered, so income alone explains little of the variation in rate.
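To put a number on "slightly lower", a rank correlation is one option. The sketch below is an addition, not part of the original analysis: `income_rate_check()` is a hypothetical helper that counts non-positive incomes (a zero income becomes -Inf on a log scale, which is the usual cause of the warning above) and computes Spearman's correlation between income and rate on the remaining rows. The toy data frame is made up purely to show the call.

```r
# Hypothetical helper (not from the original analysis): summarises the
# income/rate relationship shown in the scatterplot.
income_rate_check <- function(data) {
  income <- data$annual_income
  rate <- data$interest_rate

  # log10(0) is -Inf, so zero incomes cannot be placed on a log axis.
  n_nonpositive <- sum(income <= 0, na.rm = TRUE)

  # Keep complete rows with positive income, matching what the
  # log-scale plot is able to display.
  keep <- !is.na(income) & !is.na(rate) & income > 0

  # Spearman's correlation is rank-based, so it does not depend on
  # whether income is log-transformed.
  rho <- cor(income[keep], rate[keep], method = "spearman")

  list(n_nonpositive = n_nonpositive, spearman_rho = rho)
}

# Made-up toy data, only to demonstrate the call.
toy <- data.frame(
  annual_income = c(0, 10, 20, 30, 40),
  interest_rate = c(5, 20, 15, 10, 5)
)
income_rate_check(toy)
```

On the real data the call would be `income_rate_check(loans_full_schema)`. A `spearman_rho` that is negative but close to zero would be consistent with the reading that income alone says little about the rate.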
3. Try to raise new questions
For similar income levels, does verified income status relate to interest rate?
4. Visualize data to answer
ggplot(loans_full_schema, aes(x = verified_income, y = interest_rate)) +
  geom_boxplot() +
  labs(
    title = "Interest Rate by Verification Status",
    x = "Verified Income",
    y = "Interest Rate (%)"
  ) +
  theme(plot.title = element_text(hjust = 0.5))

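The verification groups show different typical interest rates, but their distributions overlap heavily, so verification status is related to rate without being the only factor. This plot pools all income levels, so it does not yet compare borrowers with similar incomes.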
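One way to compare borrowers with similar incomes, as the question asks, is to split income into quantile bins and look at the median rate per verification status inside each bin. The sketch below is an addition under that assumption: `median_rate_within_income_bins()` and the default of four bins are hypothetical choices, and the toy data frame is made up to show the call.

```r
# Hypothetical helper (not from the original analysis).
median_rate_within_income_bins <- function(data, n_bins = 4) {
  # Drop rows with missing income or rate.
  d <- data[!is.na(data$annual_income) & !is.na(data$interest_rate), ]

  # Quantile breaks give bins with roughly equal numbers of loans;
  # unique() guards against repeated break points when incomes are tied.
  probs <- seq(0, 1, length.out = n_bins + 1)
  breaks <- unique(quantile(d$annual_income, probs = probs))
  d$income_bin <- cut(d$annual_income, breaks = breaks, include.lowest = TRUE)

  # Median interest rate for each (income bin, verification status) pair.
  out <- aggregate(interest_rate ~ income_bin + verified_income,
                   data = d, FUN = median)
  names(out)[names(out) == "interest_rate"] <- "median_rate"
  out
}

# Made-up toy data, only to demonstrate the call.
toy_verified <- data.frame(
  annual_income = 1:8,
  verified_income = rep(c("Not Verified", "Verified"), 4),
  interest_rate = c(10, 12, 10, 12, 6, 8, 6, 8)
)
median_rate_within_income_bins(toy_verified, n_bins = 2)
```

On the real data the call would be `median_rate_within_income_bins(loans_full_schema)`. If the gap between verification groups persists inside every income bin, income is unlikely to be what drives it.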
1(2). Ask a question
Do different loan purposes have different interest rates?
2(2). Visualize data to answer
ggplot(loans_full_schema, aes(x = loan_purpose, y = interest_rate)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Interest Rate by Loan Purpose",
    x = "Loan Purpose",
    y = "Interest Rate (%)"
  ) +
  theme(plot.title = element_text(hjust = 0.5))

Some loan purposes have higher typical interest rates than others, but the distributions overlap substantially.
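A small table of group sizes, medians, and interquartile ranges can make "higher typical rates" and "overlap" easier to judge than the boxplot alone. The sketch below is an addition, not part of the original analysis; `summarise_rate_by()` is a hypothetical helper and the toy data frame is made up to show the call.

```r
# Hypothetical helper (not from the original analysis): per-group summary
# of interest_rate, sorted from highest to lowest median.
summarise_rate_by <- function(data, group_col) {
  # One numeric vector of rates per group; drop = TRUE skips empty levels.
  groups <- split(data$interest_rate, data[[group_col]], drop = TRUE)

  out <- data.frame(
    group = names(groups),
    n = vapply(groups, function(x) sum(!is.na(x)), integer(1)),
    median_rate = vapply(groups, median, numeric(1), na.rm = TRUE),
    iqr_rate = vapply(groups, IQR, numeric(1), na.rm = TRUE),
    row.names = NULL
  )

  # Highest typical rate first.
  out <- out[order(out$median_rate, decreasing = TRUE), ]
  rownames(out) <- NULL
  out
}

# Made-up toy data, only to demonstrate the call.
toy_purpose <- data.frame(
  loan_purpose = c("car", "car", "house", "house", "house"),
  interest_rate = c(5, 7, 10, 20, 30)
)
summarise_rate_by(toy_purpose, "loan_purpose")
```

On the real data the call would be `summarise_rate_by(loans_full_schema, "loan_purpose")`. Purposes with a small `n` deserve caution, since their medians are less stable.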
3(2). Try to raise new questions
Is the difference by loan purpose still there if we compare loans with the same grade?
4(2). Visualize data to answer
ggplot(loans_full_schema, aes(x = loan_purpose, y = interest_rate)) +
  geom_boxplot() +
  coord_flip() +
  facet_wrap(~ grade) +
  labs(
    title = "Interest Rate by Loan Purpose (within each Grade)",
    x = "Loan Purpose",
    y = "Interest Rate (%)"
  ) +
  theme(plot.title = element_text(hjust = 0.5))

Within a given grade, interest rates look much more similar across purposes, which suggests that grade accounts for much of the difference seen in the previous plot.
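A rough way to check how much grade accounts for is to compare the share of variance in interest rate explained by each variable on its own. The sketch below is an addition that uses R-squared from a one-predictor linear model, which is a technique chosen here for illustration rather than something the original analysis does; `variance_explained()` is a hypothetical helper and the toy data frame is made up.

```r
# Hypothetical helper (not from the original analysis): R-squared from
# regressing interest_rate on a single categorical predictor.
variance_explained <- function(data, predictor) {
  # Builds the formula interest_rate ~ <predictor>.
  model_formula <- reformulate(predictor, response = "interest_rate")
  fit <- lm(model_formula, data = data)
  summary(fit)$r.squared
}

# Made-up toy data: rates differ strongly by grade, barely by purpose.
toy_grade <- data.frame(
  grade = c("A", "A", "B", "B"),
  loan_purpose = c("car", "house", "car", "house"),
  interest_rate = c(5, 7, 15, 17)
)
variance_explained(toy_grade, "grade")
variance_explained(toy_grade, "loan_purpose")
```

On the real data the calls would be `variance_explained(loans_full_schema, "grade")` and `variance_explained(loans_full_schema, "loan_purpose")`. A much larger value for grade would support the conclusion drawn from the faceted plot.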