Q: How many distinct values are there for homeownership variable? Which value is the most common one?
unique(loans$homeownership)
## [1] MORTGAGE RENT OWN
## Levels: ANY MORTGAGE OWN RENT
table(loans$homeownership)
##
## ANY MORTGAGE OWN RENT
## 0 0 4789 1353 3858
A: There are 3 distinct values observed for the homeownership variable: MORTGAGE, OWN, and RENT. ANY is defined as a factor level but has zero observations. The most common value is MORTGAGE, with 4789 of the 10000 samples.
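Because homeownership is a factor, levels with no observations still appear in the table output. The sketch below separates declared levels from observed values. It uses a made-up toy factor (`home`) in place of loans$homeownership, so the counts are illustrative only.

```r
# Toy factor standing in for loans$homeownership (values are made up).
home <- factor(c("MORTGAGE", "RENT", "OWN", "MORTGAGE", "RENT", "MORTGAGE"),
               levels = c("ANY", "MORTGAGE", "OWN", "RENT"))

n_levels    <- nlevels(home)              # declared levels, including unused ones
n_observed  <- length(unique(home))       # values that actually occur in the data
counts      <- table(droplevels(home))    # counts without the empty "ANY" level
most_common <- names(which.max(counts))   # label with the highest count

print(c(levels = n_levels, observed = n_observed))
print(counts)
print(most_common)
```

Dropping unused levels before tabulating keeps zero-count categories from being mistaken for observed values.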
Q: How many distinct interest rates are there? Which value is the most common one?
length(unique(loans$interest_rate))
## [1] 58
sort(table(loans$interest_rate))
##
## 30.94 6 30.75 30.65 29.69 30.17 30.79 22.9 23.87 25.81 22.91 24.84 28.72
## 1 3 4 5 9 9 11 13 20 26 28 31 31
## 23.88 26.77 24.85 25.82 26.3 21.85 20.39 19.42 17.47 20 18.45 21.45 18.06
## 37 38 42 47 53 90 93 114 124 137 146 172 176
## 14.07 5.31 6.71 10.41 17.09 16.01 19.03 15.04 6.07 7.96 13.58 5.32 7.34
## 183 188 192 194 195 196 197 199 202 211 225 234 243
## 9.92 11.98 12.61 7.97 10.9 6.08 9.43 16.02 15.05 10.91 6.72 14.08 7.35
## 248 255 264 274 275 277 280 284 304 306 312 318 325
## 12.62 10.42 13.59 9.44 11.99 9.93
## 333 346 347 367 376 390
A: There are 58 unique interest rates. The most common one is 9.93, which appears 390 times.
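Reading the most common value off the end of a long sorted table is easy to get wrong. As a sketch, the same answer can be extracted programmatically. The vector `rates` below is made up and stands in for loans$interest_rate.

```r
# Made-up interest rates standing in for loans$interest_rate.
rates <- c(9.93, 11.99, 9.93, 9.44, 11.99, 9.93, 5.31)

freq       <- sort(table(rates), decreasing = TRUE)  # most frequent value first
n_distinct <- length(freq)                           # number of distinct rates
top_rate   <- as.numeric(names(freq)[1])             # the most common rate
top_count  <- freq[[1]]                              # how many times it occurs

print(c(distinct = n_distinct, top_rate = top_rate, top_count = top_count))
```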
Q: Apply table function to the annual_income variable. Do you think the result is helpful or not?
table(loans$annual_income)
A: The annual_income variable is effectively continuous: as the table output above shows, most values occur only once or a handful of times. Applying the table function to it is therefore not very helpful, because the counts are scattered across many values and reveal no clear pattern in the variable.
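For a continuous variable, a numeric summary or a table of binned values is usually more informative than exact-value counts. The sketch below uses a made-up `income` vector in place of loans$annual_income, and the bin edges are arbitrary assumptions.

```r
# Made-up incomes standing in for loans$annual_income.
income <- c(32000, 45500, 51250, 60000, 74800, 88000, 120000, 310000)

exact_counts <- table(income)     # one count per distinct value: not informative
five_num     <- summary(income)   # min, quartiles, mean, max

# Group incomes into three arbitrary ranges, then count each range.
binned <- table(cut(income,
                    breaks = c(0, 50000, 100000, Inf),
                    labels = c("<=50k", "50k-100k", ">100k")))

print(five_num)
print(binned)
```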
Q: Create a histogram of loan_amount. Customize your plot to give a graph that looks most reasonable to you.
A:
ggplot(data = loans) +
  geom_histogram(mapping = aes(x = loan_amount),
                 binwidth = 5000, boundary = 1000)
Q: Create a histogram of annual_income. What is the issue with your graph?
A:
ggplot(data = loans) +
  geom_histogram(mapping = aes(x = annual_income),
                 binwidth = 5000, boundary = 1000)
The histogram spans a very wide range of annual incomes and is strongly right-skewed: a few very high incomes stretch the x-axis, so most observations are compressed into the left part of the plot.
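One quick way to confirm right skew numerically is to compare the mean with the median. A log transform is a common way to compress a long right tail. The sketch below uses a made-up `income` vector in place of loans$annual_income.

```r
# Made-up incomes with one extreme value, standing in for loans$annual_income.
income <- c(30000, 42000, 55000, 61000, 70000, 85000, 95000, 2300000)

mean_income   <- mean(income)     # pulled upward by the extreme value
median_income <- median(income)   # barely affected by the extreme value
log_income    <- log10(income)    # compresses the long right tail

print(c(mean = mean_income, median = median_income))
print(range(log_income))          # the range shrinks to under 2 log10 units
```

In ggplot2, a comparable effect on the plot itself can be obtained by adding scale_x_log10() to the histogram, in which case binwidth is interpreted on the log10 scale.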
Q: Create a histogram of the variable debt_to_income in loans with the following requirements: (1) the plotting range of x is between 0 and 100; (2) the binwidth is 2; (3) a density plot is drawn on top of the histogram. Can you explain the distribution of debt_to_income?
ggplot(loans, aes(x = debt_to_income)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2,
                 boundary = 0, colour = "black", fill = "pink") +
  xlim(0, 100) +
  geom_density(linewidth = 1.2)
A: The distribution of debt_to_income is also right-skewed, with a long tail from a few borrowers who have very large debts relative to their income. Most borrowers have a relatively low debt-to-income ratio, concentrated at the left of the plot, which suggests they have room to pay off their debts.
Q: For loans data, create a scatter plot of interest_rate vs debt_to_income with mapping color to grade. What can you learn from the graph?
ggplot(loans) +
  geom_point(mapping = aes(x = interest_rate, y = debt_to_income, color = grade))
A: The graph indicates that the better a borrower's grade, the lower the interest rate, and vice versa: grade A has the lowest interest rates and grade G the highest. In addition, the few cases with abnormally high debt-to-income ratios occur more often among better grades than among worse ones. A possible explanation is that borrowers with worse grades pay much higher interest rates, so their debt-to-income ratio has to stay manageable for the debt to remain payable.
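The visual impression that interest rate rises as grade worsens can be checked with a per-grade summary. The sketch below uses a made-up data frame (`toy`) with only three grades in place of the loans data, so the numbers are illustrative only.

```r
# Made-up loans standing in for the grade and interest_rate columns of loans.
toy <- data.frame(
  grade = factor(c("A", "A", "B", "B", "C", "C"), levels = c("A", "B", "C")),
  interest_rate = c(5.31, 6.72, 9.93, 10.9, 13.59, 15.05)
)

# Median interest rate within each grade.
median_by_grade <- tapply(toy$interest_rate, toy$grade, median)

# TRUE when the median rate rises from each grade to the next worse one.
is_increasing <- all(diff(as.vector(median_by_grade)) > 0)

print(median_by_grade)
print(is_increasing)
```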
ggplot(loans) +
  geom_point(mapping = aes(x = interest_rate, y = loan_amount, color = factor(term)))
ggsave("scatter_plot.pdf")