Lab Exercise 1:

  1. How many distinct values are there for homeownership variable? Which value is the most common one?

    sort(table(loans$homeownership))
    ## 
    ##               ANY      OWN     RENT MORTGAGE 
    ##        0        0     1353     3858     4789

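    table() lists five factor levels, but two of them (the blank level and ANY) have a count of 0, so only 3 distinct values actually occur: OWN, RENT, and MORTGAGE. MORTGAGE is the most common one (4789 loans).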

  2. How many distinct interest rates are there? Which value is the most common one?

    length(unique(loans$interest_rate))
    ## [1] 58
    sort(table(loans$interest_rate))
    ## 
    ## 30.94     6 30.75 30.65 29.69 30.17 30.79  22.9 23.87 25.81 22.91 24.84 28.72 
    ##     1     3     4     5     9     9    11    13    20    26    28    31    31 
    ## 23.88 26.77 24.85 25.82  26.3 21.85 20.39 19.42 17.47    20 18.45 21.45 18.06 
    ##    37    38    42    47    53    90    93   114   124   137   146   172   176 
    ## 14.07  5.31  6.71 10.41 17.09 16.01 19.03 15.04  6.07  7.96 13.58  5.32  7.34 
    ##   183   188   192   194   195   196   197   199   202   211   225   234   243 
    ##  9.92 11.98 12.61  7.97  10.9  6.08  9.43 16.02 15.05 10.91  6.72 14.08  7.35 
    ##   248   255   264   274   275   277   280   284   304   306   312   318   325 
    ## 12.62 10.42 13.59  9.44 11.99  9.93 
    ##   333   346   347   367   376   390

    There are 58 distinct interest rates. The most common one is 9.93%, which appears 390 times.

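  3. Apply table function to the annual_income variable. Do you think the result is helpful or not?

    Not really helpful. annual_income is a continuous variable with many distinct values, so table() produces a long list of counts that is hard to read. A numeric summary or a histogram describes it better.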
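
The three checks in this exercise can be sketched on made-up data. This is only an illustration under stated assumptions: the vectors `homeownership_toy` and `income_toy` are hypothetical stand-ins, not the loans data, and the bin edges passed to cut() are arbitrary.

```r
# Hypothetical stand-in for loans$homeownership: a factor with two unused levels
homeownership_toy <- factor(
  c("RENT", "MORTGAGE", "OWN", "MORTGAGE", "RENT", "MORTGAGE"),
  levels = c("", "ANY", "OWN", "RENT", "MORTGAGE")
)

counts <- table(homeownership_toy)        # includes zero-count levels
n_levels <- nlevels(homeownership_toy)    # all declared levels
n_observed <- sum(counts > 0)             # levels that actually occur
most_common <- names(which.max(counts))   # label with the largest count

# Hypothetical stand-in for loans$annual_income (made-up values)
income_toy <- c(32000, 45500, 51000, 58250, 64000, 72000, 90000, 125000, 310000)

# table() on a continuous variable gives one cell per distinct value
n_cells <- length(table(income_toy))

# A numeric summary or binned counts are usually easier to read
income_summary <- summary(income_toy)
income_bins <- table(cut(income_toy,
                         breaks = c(0, 50000, 100000, 200000, Inf),
                         dig.lab = 6))
```

On the real data, the same calls would take loans$homeownership and loans$annual_income in place of the toy vectors.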

Lab Exercise 2:

  1. Create a histogram of loan_amount. Customize your plot to give a graph that looks most reasonable to you.

    ggplot(data = loans) +
      geom_histogram(mapping = aes(x = loan_amount), binwidth = 2500, boundary = 1000)

  2. Create a histogram of annual_income. What is the issue with your graph?

    ggplot(data = loans) +
      geom_histogram(mapping = aes(x = annual_income))

    The bars are squeezed against the left edge because a few very large incomes stretch the x-axis far to the right. Limiting the x-axis gives a more readable plot:

    ggplot(data = loans) +
      geom_histogram(mapping = aes(x = annual_income)) + 
      xlim(0, 500000)
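
One caveat about the x-axis limit used above: xlim() sets scale limits, so rows outside the range are dropped before binning (ggplot2 reports this in a warning), whereas coord_cartesian() only zooms the view. A minimal sketch of the difference, assuming ggplot2 is installed; the data frame `toy` and its values are made up for illustration and are not the loans data.

```r
library(ggplot2)

# Hypothetical stand-in for loans: made-up incomes with one extreme value
toy <- data.frame(
  annual_income = c(30000, 42000, 55000, 61000, 75000, 88000, 120000, 2300000)
)

# xlim() sets scale limits: the 2,300,000 row is removed (with a warning)
p_xlim <- ggplot(toy, aes(x = annual_income)) +
  geom_histogram(binwidth = 10000, boundary = 0) +
  xlim(0, 500000)

# coord_cartesian() only zooms the view: every row is still binned
p_zoom <- ggplot(toy, aes(x = annual_income)) +
  geom_histogram(binwidth = 10000, boundary = 0) +
  coord_cartesian(xlim = c(0, 500000))

# Number of rows that fall outside the xlim() range
n_dropped <- sum(toy$annual_income > 500000)
```

For a histogram the two plots look much the same inside the visible range; the distinction matters more when a statistic such as a density or a smoother is computed from the data.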

Lab Exercise 3:

Create a histogram of variable debt_to_income in loans with the following requirements:
  1. The plotting range of x is between 0 and 100

    ggplot(loans, aes(x = debt_to_income)) +
      geom_histogram(aes(y = after_stat(density)),
                     boundary = 0, colour = "black", fill = "white") +
      xlim(0, 100)

  2. The binwidth is 2

    # binwidth is a fixed setting, not an aesthetic, so it goes outside aes()
    ggplot(loans, aes(x = debt_to_income)) +
      geom_histogram(aes(y = after_stat(density)), binwidth = 2,
                     boundary = 0, colour = "black", fill = "white") +
      xlim(0, 100)

  3. Create a density plot on top of the histogram

    ggplot(loans, aes(x = debt_to_income)) +
      geom_histogram(aes(y = after_stat(density)), binwidth = 2,
                     boundary = 0, colour = "black", fill = "white") +
      xlim(0, 100) +
      geom_density(adjust = 30/8, linewidth = 1.2)

Question: Can you explain the distribution of debt_to_income?

debt_to_income is unimodal and roughly bell-shaped around its peak, but it is right-skewed rather than Normal: most values sit toward the low end of the range, with a long tail to the right. This means that most people have a low to manageable debt-to-income ratio, while a small number of people have a very high ratio, which produces the skew.
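
A quick numeric check can back up a visual read of skewness: in a right-skewed sample the mean is usually pulled above the median. A minimal sketch on made-up values (the vector `dti_toy` is hypothetical, not the loans data):

```r
# Hypothetical debt-to-income values with one large value in the right tail
dti_toy <- c(5, 10, 12, 15, 18, 20, 22, 25, 40, 95)

dti_mean <- mean(dti_toy)      # 26.2
dti_median <- median(dti_toy)  # 19

# A mean above the median is a hint (not proof) of right skew
right_skew_hint <- dti_mean > dti_median
```

On the real column the same comparison would be mean(loans$debt_to_income, na.rm = TRUE) against median(loans$debt_to_income, na.rm = TRUE); na.rm = TRUE is included on the assumption that the column may contain missing values.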

Lab Exercise 4:

For loans data, create a scatter plot of interest_rate vs debt_to_income with mapping color to grade. What can you learn from the graph?

    ggplot(data = loans) +
      geom_point(mapping = aes(x = interest_rate, y = debt_to_income, color = grade))

From this scatter plot, I can tell that the better the grade (closer to A), the lower the interest_rate. Also, most of the debt_to_income values are below 100, and the extremely high debt_to_income values mostly belong to mid-tier grades with mid-tier interest rates.

Lab Homework #1:

  1. Finish all Lab Exercises

  2. Create a scatter plot of loan_amount vs interest_rate with a color grouping using term variable (please use factor(term) to convert it into a categorical variable). Save your plot to your local folder.

    ggplot(data = loans) +
      geom_point(mapping = aes(x = loan_amount, y = interest_rate, color = factor(term)))

    ggsave("my-plot.pdf")

  3. Submit your homework using R Markdown in pdf format.
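
For the plot-saving step in item 2, ggsave() saves the most recently displayed plot by default. Passing the plot object and the size explicitly makes the saved file easier to reproduce. A minimal sketch, assuming ggplot2 is installed; the data frame `toy`, its values, and the temporary output path are hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in for loans with the three columns used in the plot
toy <- data.frame(
  loan_amount = c(5000, 12000, 20000, 35000),
  interest_rate = c(6.08, 9.93, 14.07, 21.45),
  term = c(36, 36, 60, 60)
)

# Store the plot in an object so it can be passed to ggsave() explicitly
p <- ggplot(toy) +
  geom_point(aes(x = loan_amount, y = interest_rate, color = factor(term)))

# Write to a temporary folder; width and height are in inches by default
out_file <- file.path(tempdir(), "my-plot.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```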