Question #0: Finish all lab exercises

Lab Exercise 1: Answer the following questions by using unique or table function:

Q: How many distinct values are there for homeownership variable? Which value is the most common one?

unique(loans$homeownership)
## [1] MORTGAGE RENT     OWN     
## Levels:  ANY MORTGAGE OWN RENT
table(loans$homeownership)
## 
##               ANY MORTGAGE      OWN     RENT 
##        0        0     4789     1353     3858

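A: There are 3 distinct values that actually occur for the homeownership variable: MORTGAGE, RENT and OWN. The factor also declares two further levels, an empty string and ANY, but both have a count of 0 in the table. The most common value is MORTGAGE, with 4789 of the 10000 samples.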
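
The difference between the unique and table output comes from how R factors work: unique lists only values that occur, while table on a factor also reports declared levels that have no observations. The following is a minimal sketch on a made-up factor (toy_home is hypothetical and is not the loans data) showing how droplevels separates the two counts.

```r
# Hypothetical toy factor, not the loans data: two of its five levels never occur.
toy_home <- factor(
  c("MORTGAGE", "RENT", "OWN", "MORTGAGE", "RENT", "MORTGAGE"),
  levels = c("", "ANY", "MORTGAGE", "OWN", "RENT")
)

# nlevels() counts every declared level, including the unused ones.
n_declared <- nlevels(toy_home)

# droplevels() discards levels with zero observations.
observed <- droplevels(toy_home)
n_observed <- nlevels(observed)

# The most frequent value is the name of the largest table entry.
most_common <- names(which.max(table(observed)))
```

Applied to loans$homeownership, the same calls should separate the declared levels from the values that are actually present.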

Q: How many distinct interest rates are there? Which value is the most common one?

length(unique(loans$interest_rate))
## [1] 58
sort(table(loans$interest_rate))
## 
## 30.94     6 30.75 30.65 29.69 30.17 30.79  22.9 23.87 25.81 22.91 24.84 28.72 
##     1     3     4     5     9     9    11    13    20    26    28    31    31 
## 23.88 26.77 24.85 25.82  26.3 21.85 20.39 19.42 17.47    20 18.45 21.45 18.06 
##    37    38    42    47    53    90    93   114   124   137   146   172   176 
## 14.07  5.31  6.71 10.41 17.09 16.01 19.03 15.04  6.07  7.96 13.58  5.32  7.34 
##   183   188   192   194   195   196   197   199   202   211   225   234   243 
##  9.92 11.98 12.61  7.97  10.9  6.08  9.43 16.02 15.05 10.91  6.72 14.08  7.35 
##   248   255   264   274   275   277   280   284   304   306   312   318   325 
## 12.62 10.42 13.59  9.44 11.99  9.93 
##   333   346   347   367   376   390

A: There are 58 unique interest rates, and the most common one is 9.93, which appears 390 times.
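
Reading the answer off the end of a long sorted table is easy to get wrong, so it can help to extract it directly. This is a small sketch on made-up rates (toy_rates is hypothetical, not the loans data) that sorts the counts in decreasing order and pulls out the top entry.

```r
# Hypothetical toy rates, not the loans data.
toy_rates <- c(9.93, 9.44, 9.93, 11.99, 9.44, 9.93, 6.00)

rate_counts <- table(toy_rates)

# Number of distinct rates.
n_distinct <- length(rate_counts)

# Sort so the most frequent rate comes first, then keep the top two.
top_rates <- head(sort(rate_counts, decreasing = TRUE), 2)

# names() returns the rate as text, so convert it back to a number.
most_common_rate <- as.numeric(names(top_rates)[1])
```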

Q: Apply table function to the annual_income variable. Do you think the result is helpful or not?

table(loans$annual_income)

A: The annual_income variable is continuous, so most values occur only once or a few times and the frequency of almost every entry in the table is close to 1. Applying the table function to it is therefore not really helpful: the output is long and scattered and does not reveal any clear pattern in the variable.
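
One way to make table useful for a continuous variable is to group the values into brackets first. The sketch below uses cut on made-up incomes; toy_income, the bracket edges and the labels are all assumptions chosen for illustration, not values taken from the loans data.

```r
# Hypothetical toy incomes in dollars, not the loans data.
toy_income <- c(32000, 48000, 51000, 75000, 76000, 120000, 410000)

# Assumed bracket edges; right = FALSE makes each interval [lower, upper).
breaks <- c(0, 50000, 100000, 200000, Inf)
labels <- c("<50k", "50k-100k", "100k-200k", "200k+")

# cut() turns the numeric vector into a factor of brackets.
income_bracket <- cut(toy_income, breaks = breaks, labels = labels, right = FALSE)

# table() on the brackets gives a short, readable summary.
bracket_counts <- table(income_bracket)
```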

Lab Exercise 2:

Q: Create a histogram of loan_amount. Customize your plot to give a graph that looks most reasonable to you.

A:

library(ggplot2)  # provides ggplot(), geom_histogram() and the other plotting functions used below

ggplot(data = loans) + 
  geom_histogram(mapping = aes(x = loan_amount), 
                 binwidth = 5000, boundary = 1000)

Q: Create a histogram of annual_income. What is the issue with your graph?

A:

ggplot(data = loans) + 
  geom_histogram(mapping = aes(x = annual_income), 
                 binwidth = 5000, boundary = 1000)

The issue is that annual income spans a very large range and is strongly right-skewed: a few very high incomes stretch the x-axis, so most observations are squeezed into a narrow band on the left of the histogram.
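
A common remedy for this kind of skew is a logarithmic x-axis. The sketch below is one possible approach, not part of the exercise: it uses simulated incomes (toy_loans is hypothetical, and the rlnorm parameters are arbitrary assumptions) together with scale_x_log10 from ggplot2.

```r
library(ggplot2)

# Hypothetical right-skewed toy incomes, not the loans data.
set.seed(1)
toy_loans <- data.frame(annual_income = rlnorm(500, meanlog = 11, sdlog = 0.6))

# A log10 x-axis spreads out the crowded low end and pulls in the long tail.
# Rows with zero income would be dropped, because log10(0) is -Inf.
p_income <- ggplot(data = toy_loans) +
  geom_histogram(mapping = aes(x = annual_income), bins = 30) +
  scale_x_log10()
```

Printing p_income draws the plot. On the real data, any zero incomes would need to be handled separately before using a log scale.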

Lab Exercise 3:

Q: Create a histogram of variable debt_to_income in loans with the following requirements: the plotting range of x is between 0 and 100; the binwidth is 2; a density plot is drawn on top of the histogram. Question: Can you explain the distribution of debt_to_income?

ggplot(loans,aes(x=debt_to_income)) +
  geom_histogram(aes (y = after_stat(density)), binwidth = 2,
                 boundary = 0, colour = "black", fill = "pink") +
  xlim(0,100) +
  geom_density(linewidth = 1.2)

A: The distribution of debt_to_income is also right-skewed. Most borrowers have a relatively low debt-to-income ratio, so the bulk of the histogram sits toward the left, while a small number of borrowers with very large debts or low incomes form a long right tail. Note that xlim(0, 100) removes any values above 100 from the plot.
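
Right skew can also be checked numerically: the mean is pulled toward the tail and ends up above the median. The sketch below uses made-up ratios (toy_dti is hypothetical, not the loans data) and also computes the share of values that fall inside the 0 to 100 plotting range, which indicates how many rows the axis limits would drop.

```r
# Hypothetical toy debt-to-income ratios, not the loans data.
toy_dti <- c(5, 8, 10, 12, 15, 18, 20, 25, 60, 150)

# In a right-skewed sample the mean exceeds the median.
mean_dti <- mean(toy_dti)
median_dti <- median(toy_dti)
is_right_skewed <- mean_dti > median_dti

# Share of observations that survive axis limits of 0 to 100.
share_in_range <- mean(toy_dti >= 0 & toy_dti <= 100)
```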

Lab Exercise 4:

Q: For loans data, create a scatter plot of interest_rate vs debt_to_income with mapping color to grade. What can you learn from the graph?

ggplot(loans) +
  geom_point(mapping = aes(x = interest_rate, y = debt_to_income, color = grade))

A: The graph shows that the interest rate is closely tied to the grade: the better the grade, the lower the interest rate, and vice versa. Grade A has the lowest interest rates and grade G the highest. The few cases with an abnormally high debt-to-income ratio also tend to occur among the better grades rather than the worse ones. A possible reason is that the worse grades already carry much higher interest rates, so the debt-to-income ratio has to stay manageable for the debt to remain payable.
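
The visual pattern can be backed up with a per-grade summary. This is a minimal sketch on made-up data (the toy data frame and its values are hypothetical, not the loans data) that computes the median interest rate for each grade with tapply and checks that it rises from one grade to the next.

```r
# Hypothetical toy data, not the loans data.
toy <- data.frame(
  grade = factor(c("A", "A", "B", "B", "C", "C")),
  interest_rate = c(6.0, 7.0, 10.0, 11.0, 14.0, 15.0)
)

# Median interest rate within each grade, in level order A, B, C.
median_by_grade <- tapply(toy$interest_rate, toy$grade, median)

# TRUE when every grade has a higher median rate than the one before it.
is_increasing <- all(diff(median_by_grade) > 0)
```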

Question #1: Create a scatter plot of loan_amount vs interest_rate with a color grouping using term variable (please use factor(term) to convert it into a categorical variable). Save your plot to your local folder.

ggplot(loans) +
  geom_point(mapping = aes(x = interest_rate, y = loan_amount, color = factor(term)))

ggsave("scatter_plot.pdf")
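
When called without a plot argument, ggsave saves the last plot that was displayed. A slightly more explicit variant stores the plot in a variable and passes it in, along with a size. The sketch below assumes made-up data and an output path in tempdir(); toy_loans, p_scatter, out_file and the 7 by 5 inch size are all hypothetical choices.

```r
library(ggplot2)

# Hypothetical toy data, not the loans data.
toy_loans <- data.frame(
  interest_rate = c(6.0, 9.9, 12.6, 15.0),
  loan_amount = c(5000, 12000, 20000, 35000),
  term = c(36, 36, 60, 60)
)

# Keep the plot in a variable so it can be passed to ggsave() explicitly.
p_scatter <- ggplot(toy_loans) +
  geom_point(mapping = aes(x = interest_rate, y = loan_amount,
                           color = factor(term)))

# Assumed output location and size; adjust the path to your local folder.
out_file <- file.path(tempdir(), "scatter_plot.pdf")
ggsave(out_file, plot = p_scatter, width = 7, height = 5, units = "in")
```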