0. Finish all lab exercises

Lab exercise 1

  1. How many distinct values does the homeownership variable have? Which value is the most common?
table(loans$homeownership)
## 
##               ANY MORTGAGE      OWN     RENT 
##        0        0     4789     1353     3858

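The factor has five levels, but only three values actually occur: MORTGAGE, OWN, and RENT. The empty level and ANY both have a count of 0. MORTGAGE is the most common value, with 4789 loans.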
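
The output above lists five levels, two of which have a count of 0. One way to separate declared levels from values that actually occur is sketched below. The factor `home` is made-up toy data standing in for loans$homeownership, not the real column.

```r
# Toy factor with two unused levels (hypothetical data, not the loans set)
home <- factor(c("MORTGAGE", "RENT", "MORTGAGE", "OWN"),
               levels = c("", "ANY", "MORTGAGE", "OWN", "RENT"))

n_levels <- nlevels(home)            # declared levels, including unused ones
counts   <- table(droplevels(home))  # counts after dropping unused levels
n_used   <- length(counts)           # values that actually occur
top      <- names(which.max(counts)) # most common value
```

On the real data the same calls would be applied to loans$homeownership.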

  2. How many distinct interest rates are there? Which value is the most common one?
sort(table(loans$interest_rate))
## 
## 30.94     6 30.75 30.65 29.69 30.17 30.79  22.9 23.87 25.81 22.91 24.84 28.72 
##     1     3     4     5     9     9    11    13    20    26    28    31    31 
## 23.88 26.77 24.85 25.82  26.3 21.85 20.39 19.42 17.47    20 18.45 21.45 18.06 
##    37    38    42    47    53    90    93   114   124   137   146   172   176 
## 14.07  5.31  6.71 10.41 17.09 16.01 19.03 15.04  6.07  7.96 13.58  5.32  7.34 
##   183   188   192   194   195   196   197   199   202   211   225   234   243 
##  9.92 11.98 12.61  7.97  10.9  6.08  9.43 16.02 15.05 10.91  6.72 14.08  7.35 
##   248   255   264   274   275   277   280   284   304   306   312   318   325 
## 12.62 10.42 13.59  9.44 11.99  9.93 
##   333   346   347   367   376   390

There are 58 distinct interest rates. The most common one is 9.93, which occurs 390 times.
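
Counting columns in the printed table by hand is error-prone, so the two answers can also be computed directly. This is a minimal sketch on a made-up vector `rate` that stands in for loans$interest_rate.

```r
# Hypothetical interest rates, for illustration only
rate <- c(9.93, 5.31, 9.93, 12.62, 5.31, 9.93)

n_distinct  <- length(unique(rate))            # number of distinct values
counts      <- table(rate)                     # frequency of each value
most_common <- names(counts)[which.max(counts)] # value with the highest count
times       <- max(counts)                     # how often it occurs
```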

  3. Apply the table function to the annual_income variable. Do you think the result is helpful or not?
table(table(loans$annual_income))
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1111  102   38   29   17   12    9    5    7    2    7    2    1    5    2    1 
##   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33 
##    3    5    1    2    4    5    4    1    4    3    1    5    2    4    2    1 
##   34   35   36   37   39   40   41   43   44   45   46   47   48   50   53   60 
##    2    1    2    4    1    1    2    1    1    2    6    1    2    1    1    2 
##   62   66   69   70   73   75   76   77   80   89   90   92   96   99  110  116 
##    1    2    2    1    1    1    2    1    1    1    1    1    1    1    1    1 
##  121  124  141  168  178  182  204  221  236  247  248  260  273  314  350  383 
##    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1    1

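The result is not helpful. annual_income is effectively a continuous variable, so a frequency table of its raw values is too long to read. The table of counts above shows that 1111 of the distinct values occur exactly once.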
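
For a continuous variable, a numeric summary or a table of binned values is usually easier to read than a table of raw values. A minimal sketch, using a made-up vector `income` in place of loans$annual_income and arbitrary bin edges chosen only for illustration:

```r
# Hypothetical incomes, for illustration only
income <- c(42000, 58000, 61000, 75000, 90000, 120000, 2300000)

five_num  <- summary(income)                             # min, quartiles, mean, max
quartiles <- quantile(income, probs = c(0.25, 0.5, 0.75))

# Bin the values first, then tabulate the bins (bin edges are an assumption)
binned <- table(cut(income, breaks = c(0, 50000, 100000, Inf)))
```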

Lab exercise 2

  1. Create a histogram of loan_amount. Customize your plot to give a graph that looks most reasonable to you.
ggplot(loans) +
  geom_histogram(aes(x = loan_amount), binwidth = 2500, boundary = 0)

  2. Create a histogram of annual_income. What is the issue with your graph?
ggplot(loans) +
  geom_histogram(aes(x = annual_income))

A few extreme incomes stretch the x-axis, so nearly all observations are squeezed into the leftmost bins.
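
Two common ways to make such a histogram readable are a log-scaled x-axis or zooming into the bulk of the data. The sketch below uses a made-up data frame `toy` in place of loans; the bin settings and the zoom range are assumptions, not values tuned to the real data.

```r
library(ggplot2)

# Hypothetical right-skewed incomes standing in for loans$annual_income
toy <- data.frame(annual_income = c(30000, 45000, 52000, 60000, 75000,
                                    90000, 120000, 250000, 2300000))

# Option 1: log-scaled x-axis, so equal distances mean equal income ratios
p_log <- ggplot(toy) +
  geom_histogram(aes(x = annual_income), bins = 10) +
  scale_x_log10()

# Option 2: keep the linear scale and zoom in without dropping any rows
p_zoom <- ggplot(toy) +
  geom_histogram(aes(x = annual_income), binwidth = 25000, boundary = 0) +
  coord_cartesian(xlim = c(0, 300000))
```

With scale_x_log10(), any income of 0 becomes infinite on the log scale and is dropped with a warning, so the zoomed version may be the safer choice if zeros are present.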

Lab exercise 3

Create a histogram of variable debt_to_income in loans with the following requirements:

ggplot(loans, aes(x = debt_to_income)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2) +
  geom_density(color = "red") +
  xlim(0, 100)

The distribution is right-skewed and looks roughly like a gamma distribution.

Lab exercise 4

For loans data, create a scatter plot of interest_rate vs debt_to_income with mapping color to grade. What can you learn from the graph?

ggplot(loans) +
  geom_point(aes(x = interest_rate, y = debt_to_income, color = grade))

Loans with a later grade letter (further from A) have a higher interest rate.

1. Create a scatter plot of loan_amount vs interest_rate with a color grouping using term variable (please use factor(term) to convert it into a categorical variable). Save your plot to your local folder.

ggplot(loans) +
  geom_point(aes(x = loan_amount, y = interest_rate, color = factor(term)))

ggsave("test_plot.pdf")
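
By default ggsave() saves the last plot that was displayed. Passing the plot object and the size explicitly makes the result reproducible. In this sketch the data frame `toy`, the temporary output path, and the 6 by 4 inch size are all assumptions for illustration.

```r
library(ggplot2)

# Hypothetical rows standing in for the loans data
toy <- data.frame(
  loan_amount   = c(5000, 12000, 20000, 35000),
  interest_rate = c(6.08, 9.93, 12.62, 17.09),
  term          = c(36, 36, 60, 60)
)

# Store the plot in a variable so it can be passed to ggsave()
p <- ggplot(toy) +
  geom_point(aes(x = loan_amount, y = interest_rate, color = factor(term)))

# Save to a temporary folder; width and height are in inches by default
out_file <- file.path(tempdir(), "test_plot.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```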