library(tidyverse)
library(openintro)
library(ggplot2)
library(ggridges)
loans_full_schema
glimpse(loans_full_schema)

Question: How many samples are there? How many variables are there?

Answer: There are 10,000 samples (rows) and 55 variables (columns).
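The row and column counts can also be read off directly instead of from the glimpse() header. A minimal sketch, using a hypothetical toy data frame (`toy` is not part of the lab data); the same calls would apply to loans_full_schema:

```r
# Hypothetical toy data frame standing in for loans_full_schema
toy <- data.frame(
  loan_amount = c(5000, 12000, 30000),
  term        = c(36, 60, 36)
)

n_rows <- nrow(toy)  # number of samples (rows)
n_cols <- ncol(toy)  # number of variables (columns)
dim(toy)             # both at once: rows first, then columns
```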

Glimpse of the data

loans <- select(loans_full_schema, loan_amount, interest_rate, term, 
         grade, state, annual_income, homeownership, debt_to_income)
glimpse(loans)
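Question: What is the data set about?
Answer: The data is about loans that customers have taken out. For each loan it records the amount borrowed, the interest rate, the term, and the grade, along with the borrower's state, annual income, homeownership status, and debt-to-income ratio.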

Lab Exercise

unique(loans$homeownership)
## [1] MORTGAGE RENT     OWN     
## Levels:  ANY MORTGAGE OWN RENT
table(loans$homeownership)
## 
##               ANY MORTGAGE      OWN     RENT 
##        0        0     4789     1353     3858
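Question 1 - How many distinct values are there for the homeownership variable? Which value is the most common one?

Answer: There are 3 distinct values for the homeownership variable (MORTGAGE, OWN, and RENT). MORTGAGE is the most common one, with 4789 loans. The factor has two more levels, an empty level and ANY, but neither occurs in the data.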
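The output above lists five levels but only three observed values, which is easy to misread. The sketch below reproduces that situation on a hypothetical toy factor (`home` and its values are made up for illustration) and shows one way to count only the values that actually occur:

```r
# Hypothetical toy factor with two unused levels ("" and "ANY"),
# mimicking the structure of loans$homeownership
home <- factor(
  c("MORTGAGE", "RENT", "OWN", "MORTGAGE", "RENT", "MORTGAGE"),
  levels = c("", "ANY", "MORTGAGE", "OWN", "RENT")
)

n_levels   <- nlevels(home)         # declared levels, used or not
n_observed <- length(unique(home))  # values that actually occur

# droplevels() removes unused levels so they do not show up as zero counts
counts      <- table(droplevels(home))
most_common <- names(which.max(counts))
```

Here n_levels is 5 while n_observed is 3, and most_common is "MORTGAGE".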

Question 2
unique(loans$interest_rate)
##  [1] 14.07 12.61 17.09  6.72 13.59 11.99  6.71 15.04  9.92  9.43 19.03 28.72
## [13] 26.77 15.05  6.08 11.98  7.96  7.34  5.32  6.07 12.62  9.44 20.39  9.93
## [25] 21.45 10.42 18.06 22.91 30.79 17.47  5.31  7.97 14.08 19.42 10.91 16.02
## [37] 13.58 16.01 20.00 21.85 10.90 23.87  7.35 23.88 25.82 10.41 18.45 30.17
## [49] 24.85 25.81 24.84 30.75 29.69 26.30 22.90  6.00 30.65 30.94
table1 <- table(loans$interest_rate)
sort(table1)
## 
## 30.94     6 30.75 30.65 29.69 30.17 30.79  22.9 23.87 25.81 22.91 24.84 28.72 
##     1     3     4     5     9     9    11    13    20    26    28    31    31 
## 23.88 26.77 24.85 25.82  26.3 21.85 20.39 19.42 17.47    20 18.45 21.45 18.06 
##    37    38    42    47    53    90    93   114   124   137   146   172   176 
## 14.07  5.31  6.71 10.41 17.09 16.01 19.03 15.04  6.07  7.96 13.58  5.32  7.34 
##   183   188   192   194   195   196   197   199   202   211   225   234   243 
##  9.92 11.98 12.61  7.97  10.9  6.08  9.43 16.02 15.05 10.91  6.72 14.08  7.35 
##   248   255   264   274   275   277   280   284   304   306   312   318   325 
## 12.62 10.42 13.59  9.44 11.99  9.93 
##   333   346   347   367   376   390
How many distinct interest rates are there?
Which value is the most common one?

Answer: There are 58 distinct interest rates. The most common one is 9.93, which appears 390 times.
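Rather than counting the printed values by hand, both answers can be computed. A minimal sketch on a hypothetical vector of rates (`rates` is made up; in the lab it would be loans$interest_rate):

```r
# Hypothetical interest rates
rates <- c(9.93, 9.44, 9.93, 11.99, 9.44, 9.93)

n_distinct_rates <- length(unique(rates))

# Sorting in decreasing order puts the most common value first
rate_counts      <- sort(table(rates), decreasing = TRUE)
most_common_rate <- names(rate_counts)[1]
```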

Question 3

table(loans$annual_income)
Apply the table function to the annual_income variable. Do you think the result is helpful or not?

Answer: I do not think the result is helpful because annual income takes many different values and few of them repeat, so the table is almost as long as the data itself.
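One way to make a table of a nearly continuous variable readable is to group the values into ranges first. This is a sketch under assumptions: the incomes, the break points, and the labels below are all hypothetical choices, not part of the lab.

```r
# Hypothetical annual incomes
incomes <- c(32000, 48000, 51000, 75000, 99000, 120000, 250000)

# cut() assigns each value to an interval; right = FALSE makes the
# intervals closed on the left, e.g. [50000, 100000)
income_bins <- cut(
  incomes,
  breaks = c(0, 50000, 100000, Inf),
  labels = c("<50k", "50k-100k", ">=100k"),
  right  = FALSE
)

bin_counts <- table(income_bins)
bin_counts
```

With these values the three bins hold 2, 3, and 2 incomes.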

Questions on data

Why is the lowest interest rate 5%-ish and there was no lower interest rate? Can you explain?

Answer: The Federal Open Market Committee sets the target rate at which commercial banks borrow and lend their excess reserves to each other overnight. This rate affects both the rate banks charge their customers for loans and the rate banks pay customers on savings. The lowest rate here is about 5 percent because lending money to individuals carries risk, so the loan rate needs to be higher than the rate the bank pays.

Why are there some peak interest rates around 7%, 10%, 14%? Can you explain?

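Answer: There are peaks around 7, 10, and 14 percent because the data is not continuous: only a limited set of distinct interest rates occurs. The peaks also depend on how the histogram is drawn. Since the binwidth and center can be adjusted, a binwidth of 1 and a center of 10 result in the plot having these specific peaks.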
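The effect of the bin width can be checked without drawing anything. The sketch below uses hypothetical rates (`rates` is made up) and base R's hist() with plot = FALSE to compare the counts under narrow and wide bins:

```r
# Hypothetical interest rates clustered near 7, 10, and 14 percent
rates <- c(6.72, 7.35, 7.97, 9.44, 9.93, 10.42, 10.91, 13.59, 14.08, 14.08)

# Bins of width 1: the clusters and the gaps between them are visible
narrow <- hist(rates, breaks = seq(5, 15, by = 1), plot = FALSE)

# Bins of width 5: the same data looks flat
wide <- hist(rates, breaks = seq(5, 15, by = 5), plot = FALSE)

narrow$counts
wide$counts
```

With width 1 several bins are empty and the clusters stand out. With width 5 each bin simply holds half of the values.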

Lab Exercise

Create a histogram of loan_amount. Customize your plot to give a graph that looks most reasonable to you.
ggplot(data = loans) + geom_histogram(mapping = aes(x = loan_amount), binwidth = 5000, boundary = 0)

Create a histogram of annual_income. What is the issue with your graph?
ggplot(data = loans) + geom_histogram(mapping = aes(x = annual_income), binwidth = 100000, boundary = 0)

Answer: The issue with the graph is that a few customers earn more than 2 million per year, while most customers earn less than 500,000. As a result, almost all of the data falls in the bins below 500,000, and the bins above 500,000 contain few to almost no data points.
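Two common ways to deal with a long right tail are to put the x axis on a log scale or to zoom in on the crowded region. The sketch below shows both on simulated data. It is an illustration, not the lab's expected answer: `toy`, the log-normal parameters, and the 500,000 cutoff are assumptions. On a log scale, incomes of 0 (if the real data has any) would be dropped with a warning.

```r
library(ggplot2)

# Simulated right-skewed incomes (hypothetical, all strictly positive)
set.seed(1)
toy <- data.frame(annual_income = rlnorm(1000, meanlog = 11, sdlog = 0.7))

# Option 1: log10 x axis spreads out the bulk of the data
p_log <- ggplot(toy, aes(x = annual_income)) +
  geom_histogram(bins = 30) +
  scale_x_log10()

# Option 2: zoom in without discarding data; coord_cartesian() only
# changes the visible window, the bins are still computed on all rows
p_zoom <- ggplot(toy, aes(x = annual_income)) +
  geom_histogram(binwidth = 10000, boundary = 0) +
  coord_cartesian(xlim = c(0, 500000))

p_log
p_zoom
```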

Lab Exercise

Create a histogram of variable debt_to_income in loans with the following requirements:
1) The plotting range of x is between 0 and 100
2) The binwidth is 2
3) Create a density plot on top of the histogram
ggplot(loans, aes(x = debt_to_income)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, boundary = 0,
                 colour = "black", fill = "white") +
  xlim(0, 100) +
  geom_density(linewidth = 1.2)

Question: Can you explain the distribution of debt_to_income?

Answer: From the distribution, the number of customers increases as debt to income goes from 0 percent to about 16 percent. The most common debt to income is between 16 and 18 percent, after which the number of customers decreases as debt to income increases.

Example: How would we explain the red dots, which seem to deviate from the overall trend (larger displ leading to lower hwy)?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Answer: From the plot we see that the cars with a large engine displacement but good fuel economy belong to the 2seater class. An example is sports cars, which are lighter.

Example: group by color and size (for an ordinal variable)
Question: What can we learn from the graph?
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = cyl, size = cyl))

Answer: From the graph, we learn that as the displacement of the engine increases, the number of cylinders in the car tends to increase.
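Because cyl is stored as a number, ggplot2 colors it on a continuous gradient. Wrapping it in factor() gives one distinct color per cylinder count, and the relationship in the answer can be checked with a correlation. A minimal sketch (the names `p_cyl` and `displ_cyl_cor` are introduced here for illustration):

```r
library(ggplot2)

# factor(cyl) turns the numeric variable into discrete groups,
# so the legend shows separate colors instead of a gradient
p_cyl <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = factor(cyl)))
p_cyl

# A positive correlation supports "larger engines tend to have more cylinders"
displ_cyl_cor <- cor(mpg$displ, mpg$cyl)
displ_cyl_cor
```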

Lab Exercise

Question: For loans data, create a scatter plot of interest_rate vs debt_to_income with mapping color to grade. What can you learn from the graph?
ggplot(data = loans) + geom_point(mapping = aes(x = interest_rate, y = debt_to_income, color = grade))

Answer: From the graph we learn that the higher a customer's grade, the lower the interest rate the customer receives. The data shows that customers with a grade of A have an interest rate under 10 percent, while customers with a grade of G have an interest rate greater than 30 percent.
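The ranges read off the scatter plot can be confirmed numerically by summarizing the interest rate within each grade. The sketch below uses a hypothetical toy data frame (`toy` and its values are made up for illustration); on the lab data the same calls would use loans$interest_rate and loans$grade.

```r
# Hypothetical loans: two per grade
toy <- data.frame(
  grade         = c("A", "A", "B", "B", "G", "G"),
  interest_rate = c(5.31, 7.97, 9.43, 11.99, 30.65, 30.94)
)

# tapply() applies a function to interest_rate separately for each grade
min_rate <- tapply(toy$interest_rate, toy$grade, min)
max_rate <- tapply(toy$interest_rate, toy$grade, max)

data.frame(grade = names(min_rate), min_rate, max_rate)
```

If the ranges for different grades do not overlap, as in this toy data, the grade alone separates the interest rates.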

Ridge plots for density curves

library(ggridges)
ggplot(loans, aes(x = loan_amount, y = grade, fill = grade, color = grade)) + geom_density_ridges(alpha = 0.5)
## Picking joint bandwidth of 2360

Question: What can we learn from the graph?

Answer: From the graph we learn that customers with a lower grade tend to have higher loan amounts than customers with a higher grade.

Final Question 1

Create a scatter plot of loan_amount vs interest_rate with a color grouping using term variable (please use factor(term) to convert it into a categorical variable). Save your plot to your local folder.

ggplot(data = loans) + geom_point(mapping = aes(x = loan_amount, y = interest_rate, color = factor(term)))

ggsave("loan_amount-interest_rate-term_my-plot.pdf")
## Saving 7 x 5 in image
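By default ggsave() saves the last plot that was displayed, at the size of the current graphics device. Passing the plot and the size explicitly makes the result less dependent on the session state. A minimal sketch with hypothetical toy data, writing to a temporary folder (`toy`, `p`, and `out_file` are illustrative names):

```r
library(ggplot2)

# Hypothetical loans
toy <- data.frame(
  loan_amount   = c(5000, 12000, 30000, 20000),
  interest_rate = c(6.72, 9.93, 14.07, 17.09),
  term          = c(36, 36, 60, 60)
)

# Store the plot in a variable so it can be passed to ggsave()
p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = loan_amount, y = interest_rate, color = factor(term)))

# Save to a temporary folder; width and height are in inches by default
out_file <- file.path(tempdir(), "loan_amount-interest_rate-term.pdf")
ggsave(out_file, plot = p, width = 7, height = 5)

file.exists(out_file)
```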