library(tidyverse)
library(openintro)
library(ggplot2)
library(ggridges)
loans_full_schema
glimpse(loans_full_schema)
There are 10,000 observations (one per loan) and 55 variables.
loans <- select(loans_full_schema, loan_amount, interest_rate, term,
grade, state, annual_income, homeownership, debt_to_income)
glimpse(loans)
unique(loans$homeownership)
## [1] MORTGAGE RENT OWN
## Levels: ANY MORTGAGE OWN RENT
table(loans$homeownership)
##
## ANY MORTGAGE OWN RENT
## 0 0 4789 1353 3858
Answer: The homeownership variable has 3 distinct values (MORTGAGE, RENT, and OWN). MORTGAGE is the most common, with 4,789 loans.
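The same tally can be sketched with dplyr's count(), which sorts the categories and leaves out factor levels that never occur (such as ANY). This is only an illustration: it reads from loans_full_schema directly so that it runs on its own, and the name homeownership_counts is made up for the example.

```r
library(dplyr)
library(openintro)

# Count loans per homeownership category, most frequent first.
# count() drops unused factor levels by default, so only observed values appear.
homeownership_counts <- loans_full_schema %>%
  count(homeownership, sort = TRUE)

homeownership_counts
```

The first row of the result is the most common category, which should match the table() output above.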
unique(loans$interest_rate)
## [1] 14.07 12.61 17.09 6.72 13.59 11.99 6.71 15.04 9.92 9.43 19.03 28.72
## [13] 26.77 15.05 6.08 11.98 7.96 7.34 5.32 6.07 12.62 9.44 20.39 9.93
## [25] 21.45 10.42 18.06 22.91 30.79 17.47 5.31 7.97 14.08 19.42 10.91 16.02
## [37] 13.58 16.01 20.00 21.85 10.90 23.87 7.35 23.88 25.82 10.41 18.45 30.17
## [49] 24.85 25.81 24.84 30.75 29.69 26.30 22.90 6.00 30.65 30.94
table1 <- table(loans$interest_rate)
sort(table1)
##
## 30.94 6 30.75 30.65 29.69 30.17 30.79 22.9 23.87 25.81 22.91 24.84 28.72
## 1 3 4 5 9 9 11 13 20 26 28 31 31
## 23.88 26.77 24.85 25.82 26.3 21.85 20.39 19.42 17.47 20 18.45 21.45 18.06
## 37 38 42 47 53 90 93 114 124 137 146 172 176
## 14.07 5.31 6.71 10.41 17.09 16.01 19.03 15.04 6.07 7.96 13.58 5.32 7.34
## 183 188 192 194 195 196 197 199 202 211 225 234 243
## 9.92 11.98 12.61 7.97 10.9 6.08 9.43 16.02 15.05 10.91 6.72 14.08 7.35
## 248 255 264 274 275 277 280 284 304 306 312 318 325
## 12.62 10.42 13.59 9.44 11.99 9.93
## 333 346 347 367 376 390
Answer: There are 58 distinct interest rates. The most common one is 9.93 percent, which appears 390 times.
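Rather than reading both numbers off the sorted table by eye, they can be computed directly. A minimal sketch, assuming dplyr is loaded; n_rates and rate_counts are names chosen for this example only.

```r
library(dplyr)
library(openintro)

# Number of distinct interest rates in the data
n_rates <- n_distinct(loans_full_schema$interest_rate)

# Frequency of each rate, most common first
rate_counts <- loans_full_schema %>%
  count(interest_rate, sort = TRUE)

n_rates
head(rate_counts, 1)
```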
table(loans$annual_income)
Answer: I do not think the result is helpful because annual income takes many different values and few of them repeat, so the table is very long and most of its counts are small.
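For a numeric variable with this many distinct values, a numeric summary is usually easier to read than a frequency table. The sketch below is one possible alternative, not part of the original exercise; the chosen quantiles (50%, 90%, 99%) are arbitrary.

```r
library(openintro)

# Five-number summary plus the mean
income_summary <- summary(loans_full_schema$annual_income)

# A few selected quantiles; na.rm = TRUE guards against missing values
income_quantiles <- quantile(
  loans_full_schema$annual_income,
  probs = c(0.5, 0.9, 0.99),
  na.rm = TRUE
)

income_summary
income_quantiles
```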
Answer: The Federal Open Market Committee sets the target rate at which commercial banks borrow and lend their excess reserves to each other overnight. This rate affects the rates banks charge their customers for loans and the rates banks pay customers on savings. The lowest rate here is about 5 percent because lending money to individuals carries risk, so the loan rate needs to be higher than the rates the banks pay.
Answer: There are peaks around 7, 10, and 14 because the interest rate data is not continuous: only a limited set of distinct rates occurs. The binwidth and center can also be adjusted, and a binwidth of 1 with a center of 10 results in the plot having these specific peaks.
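The code for the plot discussed in this answer is not shown here. As a sketch, assuming it was a histogram of interest_rate with binwidth = 1 and center = 10, the effect of the bin settings can be seen by drawing the same data with narrower bins. The object names p_wide and p_narrow and the comparison width of 0.25 are choices made for this example.

```r
library(ggplot2)
library(openintro)

# Assumed reconstruction: bins one unit wide, with one bin centered on 10
# (that bin covers 9.5 to 10.5)
p_wide <- ggplot(loans_full_schema, aes(x = interest_rate)) +
  geom_histogram(binwidth = 1, center = 10)

# Same data with narrower bins, for comparison
p_narrow <- ggplot(loans_full_schema, aes(x = interest_rate)) +
  geom_histogram(binwidth = 0.25, center = 10)

p_wide
p_narrow
```

With the narrower bins, the gaps between the distinct rates should be easier to see.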
ggplot(data = loans) + geom_histogram(mapping = aes(x = loan_amount), binwidth = 5000, boundary = 0)
ggplot(data = loans) + geom_histogram(mapping = aes(x = annual_income), binwidth = 100000, boundary = 0)
Answer: The issue with the graph is that a few customers earn more than 2 million per year, while most customers earn less than 500,000. The x-axis has to stretch to cover these few large incomes, so almost all of the data falls in the bins below 500,000 and there are few to almost no data points above 500,000.
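One common way to handle a long right tail like this is a logarithmic x-axis. The sketch below is a suggestion rather than part of the exercise. It drops non-positive incomes first because they have no finite logarithm, and the choice of 40 bins is arbitrary.

```r
library(dplyr)
library(ggplot2)
library(openintro)

# Histogram of annual income on a log10 axis.
# Non-positive incomes are filtered out because log10() is undefined for them.
p_income <- loans_full_schema %>%
  filter(annual_income > 0) %>%
  ggplot(aes(x = annual_income)) +
  geom_histogram(bins = 40) +
  scale_x_log10(labels = scales::comma)

p_income
```

An alternative would be to keep the linear axis and restrict the plot to incomes below 500,000, at the cost of hiding the largest values.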
ggplot(loans, aes(x = debt_to_income)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 2, boundary = 0, colour = "black", fill = "white") + xlim(0,100) +
geom_density(linewidth = 1.2)
Answer: From the distribution, the number of customers increases as debt to income goes from 0 percent to 16 percent. The most common debt to income is between 16 and 18 percent, after which the number of customers decreases as debt to income increases.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
Answer: From the plot we see that the cars with a large engine displacement but good fuel economy belong to the 2seater class. Sports cars are an example, since they are lighter.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = cyl, size = cyl))
Answer: From the graph, we learn that as the displacement of the engine increases, the number of cylinders in the car tends to increase.
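Because cyl only takes a few whole-number values, it can also be mapped as a discrete variable, and the trend can be checked numerically. A small sketch; wrapping cyl in factor() and the name mean_displ_by_cyl are choices made here, not part of the original plot.

```r
library(ggplot2)

# Treat cyl as a category so each cylinder count gets its own color
p_cyl <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = factor(cyl)))

p_cyl

# Mean engine displacement for each cylinder count
mean_displ_by_cyl <- tapply(mpg$displ, mpg$cyl, mean)
mean_displ_by_cyl
```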
ggplot(data = loans) + geom_point(mapping = aes(x = interest_rate, y = debt_to_income, color = grade))
Answer: From the graph we learn that the higher the grade a customer has (A being the highest), the lower the interest rate the customer will receive. The data show that customers with a grade of A have an interest rate under 10 percent, while customers with a grade of G have an interest rate greater than 30 percent.
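The ranges read off the scatterplot can be checked with a per-grade summary. This is a sketch under the assumption that dplyr is available; rate_by_grade and its column names are invented for the example.

```r
library(dplyr)
library(openintro)

# Smallest, median, and largest interest rate within each grade
rate_by_grade <- loans_full_schema %>%
  group_by(grade) %>%
  summarise(
    n = n(),
    min_rate = min(interest_rate),
    median_rate = median(interest_rate),
    max_rate = max(interest_rate)
  )

rate_by_grade
```

Comparing max_rate for grade A with min_rate for grade G shows whether the "under 10 percent" and "greater than 30 percent" statements hold for every loan in those grades.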
library(ggridges)
ggplot(loans, aes (x = loan_amount, y = grade, fill = grade, color = grade)) + geom_density_ridges(alpha = 0.5)
## Picking joint bandwidth of 2360
Answer: From the graph we learn that customers with a lower grade tend to have a higher loan amount than customers with a higher grade.
ggplot(data = loans) + geom_point(mapping = aes(x = loan_amount, y = interest_rate, color = factor(term)))
ggsave("loan_amount-interest_rate-term_my-plot.pdf")
## Saving 7 x 5 in image