Introduction

Lending Club is a peer-to-peer loan company. It runs an online marketplace to match borrowers with lenders. Borrowers apply for loans ranging from $1,000 to $35,000. If the application is approved, Lending Club will offer a loan whose interest rate is based on submitted information and credit history. The process is largely automated, reducing overhead costs that banks typically incur. Peer-to-peer loan companies claim that they can therefore offer better rates for borrowers.

Individuals interested in lending to these borrowers can fund these loans in $25 increments. When browsing loans on the online marketplace, investors are shown a summary page for each loan.


There are approximately 35 different variables for each borrower that you can examine through Lending Club's API. For a more detailed examination, you can download a CSV file that contains over 100 features for each loan.


Prospective Loans

What makes Lending Club interesting for a data analyst is that they provide data on the performance of all loans issued. This is a boon for anyone interested in using past data to inform present investment decisions.

We begin our analysis with a general overview of the loans. There are two options for loan terms: 36 months and 60 months.

library(plyr)
library(dplyr)
library(readr)
library(plotly)
library(lubridate)
library(DT)
library(parallel)
library(doParallel)
setwd("~/Dropbox/DataScience/Projects/P2P/LC_Data")

# 2007 - 2011
lc1.df <- read_csv("LoanStats3a_20160901.csv.zip", skip=1, n_max=42538)

# 2012 - 2013
lc2.df <- read_csv("LoanStats3b_20160901.csv.zip", skip=1, n_max=188123)

# 2014
lc3.df <- read_csv("LoanStats3c_20160901.csv.zip", skip=1)

# 2015
lc4.df <- read_csv("LoanStats3d_20160901.csv.zip", skip=1)

# Combine data sets
df <- rbind(lc1.df, lc2.df, lc3.df, lc4.df)

# Remove component data frames to free memory
rm(lc1.df, lc2.df, lc3.df, lc4.df)

original.names <- names(df)  # keep record of column names

# Clean and format data
setwd("~/Dropbox/DataScience/Projects/P2P")
source("LClub_CleanFn.R") # script to clean data

# Classifying loans by the year that they were issued
df <- df %>% mutate(issue_yr = lubridate::year(issue_d))

df %>% count(issue_yr, term) %>% 
  plot_ly(x = issue_yr, y = n, group = term, type = "bar") %>% 
  layout(title = "Number of Loans Issued by Lending Club", 
         xaxis = list(title = "Year"),
         yaxis = list(title = "Number of Loans"))


Loans are classified by grade (A through G) and subgrade (1 through 5). Each subgrade corresponds to an interest rate, which is determined by the perceived risk of the borrower. Included in the data are the expected default rates for each subgrade and loan term:

setwd("~/Dropbox/DataScience/Projects/P2P/LC_Data")

# Prospective investments
browse.df <- read_csv("BrowseNotesAllCurrent.csv")

# Plotting int_rate vs. exp_default_rate.
# First, collapse rows.  There are only about 50 unique values over the 
#   hundreds of rows.  No need to plot redundant points.
# Selecting relevant columns:
exp.default.df <- browse.df %>% 
  select(term, sub_grade, int_rate, exp_default_rate) %>% 
  as.data.frame

# Identify unique rows
unique.id.idx <- which(!duplicated(exp.default.df))

# Extract unique rows
exp.default.df <- slice(exp.default.df, unique.id.idx) %>% 
  as.data.frame %>% 
  # Converting subgrades from factor to character puts them 
  #   in proper order for plot legend
  mutate(sub_grade = as.character(sub_grade)) %>% 
  arrange(sub_grade)

color.code <- c(A1 = "royalblue", A2 = "royalblue", A3 = "royalblue",
                A4 = "royalblue", A5 = "royalblue",
                B1 = "turquoise", B2 = "turquoise", B3 = "turquoise",
                B4 = "turquoise", B5 = "turquoise", C1 = "olivedrab",
                C2 = "olivedrab", C3 = "olivedrab", C4 = "olivedrab",
                C5 = "olivedrab", D1 = "chartreuse", D2 = "chartreuse",
                D3 = "chartreuse", D4 = "chartreuse", D5 = "chartreuse",
                E1 = "yellow", E2 = "yellow", E3 = "yellow", E4 = "yellow",
                E5 = "yellow", F1 = "goldenrod", F2 = "goldenrod",
                F3 = "goldenrod", F4 = "goldenrod", F5 = "goldenrod",
                G1 = "orange", G2 = "orange", G3 = "orange", G4 = "orange",
                G5 = "orange")

exp.default.df <- exp.default.df %>% 
  mutate(hover_complex = paste0("Term: ", term, "<br>", 
                                "Subgrade: ", sub_grade, "<br>",  
                                "Interest Rate: ", int_rate, "%", "<br>",
                                "Exp. Default Rate: ", exp_default_rate, "%"))

# plot by subgroup
plot_ly(data = exp.default.df, 
        x = int_rate, 
        y = exp_default_rate, 
        text = hover_complex,
        hoverinfo = "text",
        group = sub_grade,
        marker = list(color = color.code[sub_grade]),
        mode = "markers") %>% 
  layout(title = "Interest Rate vs. Expected Default Rate",
         xaxis = list(title = "Interest Rate", ticksuffix = "%"),
         yaxis = list(title = "Expected Default Rate", ticksuffix = "%"))


Past Performance

Our analysis of loan performance is limited to loans that have come to term, so we will only consider 36-month loans that were issued at least three years ago. The following table summarizes the outcomes of 36-month loans issued between January 2007 and September 2013.

# We want to examine data whose loans have come to term.
# This means 36-month loans that were issued at least 36 months ago.
# Some loans are still being paid off, but will be treated as paid off.
cutoff.date <- lubridate::ymd(Sys.Date()) - months(37)  # make cutoff 37 months
matured36.df <- df %>% filter(term == "36 months" & issue_d < cutoff.date) %>% as.data.frame

# All possible entries for loan status
DT::datatable(data = count(matured36.df, loan_status) %>% 
                mutate("Percent of Total" = paste0(100*(n/sum(n)) %>% round(4), "%"),
                       n = format(n, big.mark=",", scientific = FALSE, trim = TRUE)) %>% 
                rename("Loan Status" = loan_status, Total = n),
              options=list(dom="t"))


A few notes on how these status categories will be treated:

Going by the more conventional definition of default, we will treat Charged Off, Default, and Late (31-120 days) as defaulted loans. Current and Fully Paid will be treated as non-defaulted loans; loans still listed as Current at this point are nearly paid off.

In Grace Period and Late (16-30 days) will be removed from the data set, as it is unclear whether these loans will be paid off. Fortunately, they are few in number, so they will not have much influence on the results.

# For the time being, consider the following categories.
# Ignore "In Grace Period," "Late (16-30 days)," and "NA" - a grey area,
#   not sure where to categorize them.
# 2007-11 data has multiple labels for "Charged Off" and "Fully Paid"
matured36.df <- matured36.df %>% 
  filter(grepl("Charged Off", loan_status) | 
           grepl("Fully Paid", loan_status) |
           loan_status == "Current" |
           loan_status == "Default" |
           loan_status == "Late (31-120 days)")

matured36.df <- matured36.df %>% 
  mutate(defaulted = ifelse(loan_status == "Current" | 
                              loan_status == "Fully Paid",
                            yes = 0,
                            no = 1)
         )


Dividing loans by the year that they were issued, we can see how Lending Club’s ability to screen out risky loans has changed over time:

# 1st data frame in `rbind()`: stats for each year 
# 2nd data frame: stats for overall average
default.rates.df <- rbind(
  matured36.df %>% 
  count(issue_yr, defaulted) %>% 
  mutate(pct = 100*(n/sum(n)) %>% round(3),
         defaulted = replace(defaulted, defaulted == 0, "Paid Off"),
         defaulted = replace(defaulted, defaulted==1, "Defaulted"),
         hover_info = paste0(format(n, big.mark=",", scientific = FALSE, trim = TRUE), 
                             " loans", "<br>", pct, "%")
         ) %>% 
  as.data.frame %>% 
  mutate(issue_yr = as.character(issue_yr)),
  
  matured36.df %>% 
    count(defaulted) %>% 
    mutate(issue_yr = "2007 - 2013",
           pct = 100*(n/sum(n)) %>% round(3),
           defaulted = replace(defaulted, defaulted == 0, "Paid Off"),
           defaulted = replace(defaulted, defaulted==1, "Defaulted"),
           hover_info = paste0(format(n, big.mark=",", 
                                      scientific = FALSE, 
                                      trim = TRUE), 
                               " loans", "<br>", pct, "%")) %>% 
    select(issue_yr, defaulted, n, pct, hover_info) %>% as.data.frame
  ) %>% 
  mutate(issue_yr = factor(issue_yr,
                           levels = c("2007", "2008", "2009", "2010", 
                                      "2011", "2012", "2013",
                                      "2007 - 2013"))
         )

# Barplot for default rates by year
plot_ly(data = filter(default.rates.df , issue_yr == "2007 - 2013"),
        x = issue_yr, 
        y = pct, 
        type = "bar", 
        group = defaulted,
        text = hover_info,
        hoverinfo = "text",
        marker = list(color = c("Defaulted" = "red", 
                                "Paid Off" = "green")[defaulted]), 
        mode = "markers") %>% 
  add_trace(data = filter(default.rates.df , issue_yr != "2007 - 2013"),
            x = issue_yr, 
            y = pct, 
            type = "bar", 
            group = defaulted,
            text = hover_info,
            hoverinfo = "text",
            marker = list(color = c("Defaulted" = "red", 
                                    "Paid Off" = "green")[defaulted]), 
            mode = "markers",
            showlegend = FALSE)  %>% 
  layout(title = "Default Rates", xaxis = list(title = "Year Issued"),
         yaxis = list(title = ""),
         barmode = "stack")


Their screening process improved, but it is worth remembering that relatively few loans were issued in the first few years.

We can get a sense of the improvements in risk modeling by plotting interest rates against default rates by year. Click on the labels in the legend to examine individual years. (Default rates for F- and G-grade loans are omitted due to small sample sizes.)

# Create data frame that compares default rates in the past against
#   projected default rates for loans in August 2016.
# Start with Aug 2016 data
exp.v.actual.df <- exp.default.df %>% # "Browse loan" data
      filter(term == 36) %>% # filter out 60-month loans
      select(sub_grade, int_rate, exp_default_rate) %>% 
      rename(default_rate = exp_default_rate) %>% # To match columns correctly
      mutate(grouping = "Aug 2016 (Expected)")

# Next iterate by year and calculate default rates 
#   for each subgrade.
for (i in 2009:2013) {
  temp.df <- merge(
    x = matured36.df %>% 
      filter(issue_yr == i & grade %in% c("A", "B", "C", "D", "E")) %>% 
      select(sub_grade, int_rate) %>% 
      unique %>% 
      arrange(int_rate) %>% 
      filter(int_rate != 6), # Appears to be data entry error.
                             # 27 of over 100,000 loans have int_rate of 6
                             #   but with varying subgrades. Remove for now.
  
    # Compiling calculated default rates
    y = matured36.df %>% 
      filter(issue_yr == i & grade %in% c("A", "B", "C", "D", "E")) %>% 
      count(sub_grade, int_rate, defaulted) %>% 
      mutate(default_rate = (100*n/sum(n)) %>% round(2),
             grouping = as.character(i) ) %>% 
      filter(defaulted == 1) %>% as.data.frame,
    by = c("int_rate", "sub_grade")
  ) %>% 
    select(sub_grade, int_rate, default_rate, grouping) %>% 
    arrange(int_rate)
  
  exp.v.actual.df <- dplyr::bind_rows(
    exp.v.actual.df,
    temp.df
    ) %>% 
    mutate(hover_info = paste0("Subgrade: ", sub_grade, "<br>",
                               "Interest Rate: ", int_rate, "% <br>",
                               "Default Rate: ", default_rate, "%"))
}

plot_ly(data = filter(exp.v.actual.df, grouping != "Aug 2016 (Expected)"),
        x = int_rate, 
        y = default_rate, 
        group = grouping,
        mode = "markers",
        text = hover_info,
        hoverinfo = "text",
        legendgroup = "") %>% 
  layout(title = "Interest Rate vs. Default Rate",
         xaxis = list(title = "Interest Rate", ticksuffix = "%"),
         yaxis = list(title = "Default Rate", ticksuffix = "%"),
         annotations = list(
            list(x = 1.13, y = 0.75, xref = "paper", yref = "paper", 
                 showarrow = F, 
                 text = "Click on <br> legend <br> to remove <br> groups"))
  )


There is a clear positive correlation between the two variables in each year, but it is only in 2013 that we see a strong linear relationship.
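
As a rough check (a sketch reusing `exp.v.actual.df` from the chunk above), we can compute the correlation for the 2013 points:

# correlation between interest rate and realized default rate for 2013 loans
with(filter(exp.v.actual.df, grouping == "2013"),
     cor(int_rate, default_rate))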


Challenges in Developing Strategies for Investment

In using past data to devise an investment strategy, we should be wary of using all available data. The previous plot showed significant fluctuations in default rates; consistency is only evident in 2013. Therefore, for the purposes of this analysis, we will only consider 36-month loans issued in that year.

Unfortunately, strategies for maximizing returns are limited, in spite of all the data available (over 100 variables per loan). This is a testament to Lending Club’s models for assessing risk.

Indeed, initial attempts at using standard machine-learning algorithms to predict defaults have been fruitless thus far. Decision trees, bagging, boosting, and random forests all failed to pick up useful patterns. Similarly, a logistic regression model was not helpful.

An attempt to group loans by simple categories has not shown much promise either, despite claims by active investors. Consider, for example, a seemingly popular strategy: lending to borrowers who have no inquiries on their accounts in the previous six months. Typically, an inquiry is made when one applies for a loan, and it is thought that someone who has recently applied for multiple loans is not a reliable candidate. Lending Memo notes that filtering for this variable “is used by almost every investor who filters his or her loans.” Other sources that recommend this strategy can be found here, here, here, and here.

Sweet Jesus, here’s another one.

The flaw in this approach is that Lending Club already accounts for this variable in determining an appropriate interest rate for a loan. Not surprisingly, borrowers with at least one inquiry are more likely to be given loans of a lower grade:

# Advice claims that borrowers with recent inquiries (`inq_last_6mths` > 0)
#   have higher default rates.
# Create boolean to differentiate 0 inquiries and 1 or more inquiries
matured36.df <- matured36.df %>% 
  mutate(inq_6mths_boolean = ifelse(inq_last_6mths == 0, 0, 1))

# Relation between inquiries and subgrade
matured36.df %>% 
  filter(!grade %in% c("F", "G") & issue_yr == 2013) %>% 
  count(sub_grade, inq_6mths_boolean) %>% na.omit %>%  
  mutate(pct = 100*(n/sum(n)) %>% round(3),
         inq_6mths_boolean = ifelse(inq_6mths_boolean == 0,
                                    "0 Inquiries",
                                    "1+ Inquiries"),
         hover_info = paste0(inq_6mths_boolean, ": ", pct, "%", "<br>")) %>% 
  mutate(inq_6mths_boolean = ifelse(inq_6mths_boolean == "0 Inquiries",
                                    "<br> 0 Inquiries <br>",
                                    "At Least <br> 1 Inquiry")) %>% 
  plot_ly(x = sub_grade, 
          y = pct, 
          type = "bar", 
          group = inq_6mths_boolean,
          hoverinfo = "text",
          text = hover_info) %>% 
  layout(title = "Borrowers Classified by Inquiries <br> in the Previous Six Months, 2013", 
         xaxis = list(title = "Subgrade"),
    yaxis = list(title = "Percent"),
    legend = list(traceorder = "reversed"),
    barmode = "stack")


To measure the performance of a loan, we will calculate the return on every $100 invested. This calculation takes into account fees, defaults, etc.

# new column specifies length of loan (in months)
matured36.df <- mutate(matured36.df, 
                       loan_length = lubridate::interval(issue_d, last_pymnt_d) %/% months(1)) 

# For rare cases when loan is paid back in same month as the issue date.
# Treat loan as one that lasts for 1 month.
matured36.df <- matured36.df %>% 
  mutate(loan_length = ifelse(loan_length == 0, 
                              1, 
                              loan_length)
         )

# new column reports total fees charged by Lending Club
# LC charges investors 1% of whatever a borrower pays back - unless the loan
#   is fully paid within 12 months. In that case, LC charges 1% of the
#   scheduled installments over the months the loan was active.

                     # test argument: if loan_status contains phrase 
                     #                "Fully Paid" AND if length of loan
                     #                was 12 months or fewer . . .
fees100.vec <- ifelse(test = grepl(pattern = "Fully Paid", x = matured36.df$loan_status) &
                       matured36.df$loan_length <= 12,
                     # if yes:
                     # 1% (fee) multiplied by amount subjected to fee 
                     #   (loan length multiplied by 
                     #   each installment) multiplied by
                     #   pctg that a $100 investment represents of 
                     #   the entire loan amount.
                     yes = 0.01 * matured36.df$loan_length * matured36.df$installment * 
                       100/matured36.df$funded_amnt,
                     # if no:
                     # The quantity (1% multiplied by total payments by 
                     #   borrower) multiplied 
                     #   by pctg that a $100 investment represents of
                     #   the entire loan amount.
                     no = (0.01 * matured36.df$total_pymnt) * 100/matured36.df$funded_amnt
                     ) 
                     
matured36.df <- matured36.df %>% 
  mutate(fees100 = fees100.vec) # add vector to data frame

# new column total return (principal + interest + collection fees etc.)
matured36.df <- matured36.df %>% 
  mutate(tot_return100 = total_pymnt * 100/funded_amnt - fees100) 


In measuring the average return on investment, we break up the inquiry groupings into three categories: 0 inquiries, 1 inquiry, and greater than 1 inquiry:

matured36.df <- matured36.df %>% 
  mutate(inq_6mths_3bins = case_when(matured36.df$inq_last_6mths == 0 ~ "0 Inquiries", 
                                     matured36.df$inq_last_6mths == 1 ~ "1 Inquiry", 
                                     matured36.df$inq_last_6mths > 1 ~ "2+ Inquiries")
         )

matured36.df %>% 
  filter(!grade %in% c("F", "G") & issue_yr == 2013) %>% 
  group_by(sub_grade, inq_6mths_3bins) %>% 
  summarise(n = n(),
            avg_ret100 = mean(tot_return100, na.rm = TRUE) %>% round(1) - 100) %>% 
  as.data.frame %>% 
  na.omit %>% 
  plot_ly(x = sub_grade, y = avg_ret100, group = inq_6mths_3bins, type = "bar") %>% 
  layout(title = "Grouped by 0, 1, or 2+ Inquiries, 2013 <br> (previous six months)", 
         xaxis = list(title = "Subgrade"),
         yaxis = list(title = "Percent Return", ticksuffix = "%"))

(Note: these calculated returns are not annual returns. They cover the length of the loan.)
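
If annualized figures are wanted, a rough conversion is possible (a sketch assuming monthly compounding over the life of the loan; `annualize` is a hypothetical helper, not a column in the data):

# hypothetical helper: convert a total return over `months` into an annual rate
annualize <- function(total_ret_pct, months) {
  ((1 + total_ret_pct/100)^(12/months) - 1) * 100
}
annualize(12, 36)  # a 12% total return over 36 months is roughly 3.9% per year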


Filtering by inquiries appears to be only marginally beneficial, and it is likely to be even less useful a predictor for current loans, given the improvements in risk modeling.

As one analyst astutely observes, peer-to-peer lending companies are tuning their models to the point that simple filtering is now an ineffective strategy for increasing returns.


Diversifying Investments

The most frequently given advice for any type of investing is to diversify one’s investments. This obvious strategy shields investors from the possibility of large losses - a very real possibility with peer-to-peer lending. (There are a handful of cases where borrowers never made a single payment after receiving a loan.)

In this context, the recommendation of diversification typically refers to investing in many loans without regard to risk levels. The focus is on the number of loans rather than a diverse mix of grades. But there is no consensus on what counts as a diverse portfolio. Online advice varies, citing numbers such as 150, or 200, or 500, while this page collected the advice of various experts whose recommendations range from 100 to 500.

Suppose we invested by grade. How many loans should we fund to ensure a positive yield? We can group returns by grade and year:

pct.ret.df <- matured36.df %>% 
  filter(!grade %in% c("F", "G") & issue_yr > 2010) %>% 
  group_by(grade, issue_yr) %>% 
  summarise(count = n(),
            pct_ret = mean(tot_return100) %>% round(1) - 100,
            loan_length = mean(loan_length, na.rm = TRUE) %>% round(1)) %>% 
  as.data.frame %>% 
  filter(count >= 50) %>% 
  mutate(hover_info = paste0("Year: ", issue_yr, "<br>",
                             "Return: ", round(pct_ret, 2), "%", "<br>", 
                             "Loans Issued: ", format(count, 
                                                      big.mark=",", 
                                                      scientific = FALSE, 
                                                      trim = TRUE))) %>% 
  arrange(issue_yr) # necessary for plotly to organize groups in chronological order


plot_ly(data = pct.ret.df, 
        x = grade, 
        y = pct_ret, 
        group = issue_yr,
        type = "bar", 
        name = issue_yr,
        text = hover_info,
        hoverinfo = "text") %>% 
  layout(title = "Percent Return on Investment", 
         xaxis = list(title = "Grade", range = unique(pct.ret.df$Grade)),
         yaxis = list(title = "Percent Return", ticksuffix = "%"))


As an aside, it is notable that returns have increased significantly for the safer loan grades (A, B, and C), while the riskier grades (D and E) have seen decreasing returns. The disparity may be further diminished for loans issued today.

Diversification protects investments from volatility. But volatility is not constant across grades:

matured36.df %>% 
  filter(!grade %in% c("F", "G") & issue_yr > 2010) %>% 
  group_by(grade, issue_yr) %>% 
  summarise(return_variance = var(tot_return100, na.rm = TRUE) %>% round(0)) %>% 
  plot_ly(x = grade, y = return_variance, group = issue_yr, type = "bar") %>% 
  layout(title = "Variance of Return on Investment",
         xaxis = list(title = "Grade"),
         yaxis = list(title = "Variance"))


Riskier loans are, not surprisingly, accompanied by greater volatility.

To determine the minimum investment needed to avoid a net loss, we will begin with the riskier E-grade loans:

matured36.df %>% 
  filter(grade == "E" & issue_yr == 2013) %>% 
  plot_ly(x = (tot_return100 - 100), type = "histogram", nbinsx = 90, showlegend = FALSE) %>% 
  layout(title = "Distribution of Returns for E-Grade Loans, 2013",
         xaxis = list(title = "Percent Return", ticksuffix = "%"),
         yaxis = list(title = "Count"))


A large majority of investments in E-grade loans yield a positive return. The challenge is to avoid the minority of defaulted loans, where a single one can wipe out all positive returns.

We could, through trial and error, randomly sample different numbers of loans to arrive at the minimum number necessary to ensure a positive net yield. But a more elegant solution is to rely on the central limit theorem and examine the distribution of sample means.

The central limit theorem states that even if a distribution is not normal, the mean of a sample (of roughly 30 or more observations) will be approximately normally distributed, centered on the population mean with a standard deviation scaled down by the square root of the sample size. The above distribution clearly does not resemble a bell curve (i.e., a normal distribution), but we can still estimate the normal distribution of repeated sample means from the pool of E-grade loans:
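
Before applying this to the loan data, here is a quick illustration of the theorem (a toy sketch on synthetic data, unrelated to the loans): means of samples drawn from a plainly skewed population still pile up in a bell shape.

set.seed(1)
skewed.pop <- rexp(10000)  # strongly right-skewed population
sample.means <- replicate(5000, mean(sample(skewed.pop, size = 30)))
hist(sample.means, breaks = 50)  # approximately normal despite the skewed source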

# 2-column data frame: grade and pct_return.
# Data from 2013
pct.returns.df <- matured36.df %>% 
  filter(issue_yr == 2013) %>% 
  select(grade, tot_return100) %>% 
  mutate(tot_return100 = tot_return100 - 100)

mean_sd_fn <- function(loan.grade) {
  # Derives mean and standard deviation for each loan grade.
  # Output is a vector.
  pct.return <- pct.returns.df %>% 
    filter(grade == loan.grade) %>% 
    .$tot_return100
  
  grade.mean <- mean(pct.return)
  
  grade.sd <- sd(pct.return)
  
  mean.sd <- c(mean = grade.mean, sd = grade.sd)
  
  return(mean.sd)
}

plot_coord_fn <- function(loan.grade, mean.sd) {
  # Generates points for plotting normal curves
  x.coord <- seq(from = - 2*mean.sd[["mean"]], to = 3*mean.sd[["mean"]], by = 0.05)
  y.coord <- dnorm(x.coord, mean = mean.sd[["mean"]], sd = mean.sd[["sd"]]/sqrt(30))
  return(data.frame(grade = loan.grade, 
                    x_coord = x.coord, 
                    y_coord = y.coord,
                    stringsAsFactors = FALSE))
}

# Data frame with all the points necessary for plotting normal curves 
#   for each grade.
norm.curve.df <- rbind(
  plot_coord_fn("A", mean_sd_fn("A")),
  plot_coord_fn("B", mean_sd_fn("B")),
  plot_coord_fn("C", mean_sd_fn("C")),
  plot_coord_fn("D", mean_sd_fn("D")),
  plot_coord_fn("E", mean_sd_fn("E"))
  )

# For color coding the curves
grade.color.code <- c(A = "royalblue", B = "turquoise", C = "olivedrab",
                      D = "chartreuse", E = "yellow")

plot_ly(data = filter(norm.curve.df, grade == "E"), 
        x = x_coord, 
        y = y_coord, 
        marker = list(color = "gold"), 
        type = "line",
        hoverinfo = "none",
        showlegend = FALSE) %>% 
  layout(title = "Distribution of Sample Returns for E-Grade Loans, 2013 (n = 30)",
         xaxis = list(title = "Percent Return", ticksuffix = "%"),
         yaxis = list(title = "", showticklabels = FALSE))


This is an estimated distribution of sample means from the pool of E-grade loans, each sample of size 30. That is to say, across many samples of 30 loans, the most frequent sample average will be where the curve is at its maximum. Here, the maximum is at 12.03%, which is the overall average.
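
That peak can be read directly from the helper defined above:

mean_sd_fn("E")[["mean"]]  # the mean return for E-grade loans, the curve's peak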

We can see that part of the curve extends to the left of 0%, indicating that investing in 30 loans will occasionally result in a net loss. We can calculate that area under the curve:

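# P(net loss) = P(sample mean < 0) = pnorm(-z),
#   where z = mean/(sd/sqrt(30))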
z.score <- mean_sd_fn("E")[["mean"]]*sqrt(30)/mean_sd_fn("E")[["sd"]]
area.below.curve <- pnorm(- z.score) %>% round(3)

# For identifying the shaded area under curve
marker1 <- list(x=-3, 
                y=.0045,
                text=paste0("Shaded Area: <br>", area.below.curve),
                arrowsize=0.8,
                ay= -30,
                ax= -10,
                font=list(size=10),
                showarrow=TRUE)  

plot_ly(data = filter(norm.curve.df, grade == "E"), 
        x = x_coord, 
        y = y_coord, 
        marker = list(color = "gold"), 
        type = "line",
        hoverinfo = "none",
        showlegend = FALSE) %>% 
  add_trace(data = filter(norm.curve.df, grade == "E" & x_coord <= 0), 
            x = x_coord, 
            y = y_coord, 
            marker = list(color = "gold"), 
            type = "line", 
            fill = "tozeroy",
            hoverinfo = "none",
            showlegend = FALSE) %>% 
  layout(title = "Distribution of Sample Returns for E-Grade Loans, 2013 (n = 30)",
         xaxis = list(title = "Percent Return", ticksuffix = "%"),
         yaxis = list(title = "", showticklabels = FALSE),
         annotations = list(marker1))


So we expect that if we randomly select 30 E-grade loans, the probability that they will result in a net negative yield is 2.4%.

We can test this by repeated sampling. The following code chunk repeatedly takes the sum of the returns of 30 randomly sampled loans and notes when the sum is negative. The output reports the results:

# Sampling will be done using parallel processing.

# Record how many cores computer contains.
core.count <- parallel::detectCores()

# Needed for `foreach()`
registerDoParallel(cores = core.count)

sampling_sim <- function(loan.grade, trials = 10000, sample.size = 30) {
  # Repeatedly sampling returns to simulate investing in loans.
  
  # For calculating the expected number of trials yielding net losses
  # z.score = (xbar - mu)/(sigma/sqrt(n))
  z.score <- mean_sd_fn(loan.grade)[["mean"]]*sqrt(sample.size)/mean_sd_fn(loan.grade)[["sd"]]
  area.below.curve <- pnorm(- z.score) %>% round(4)
  
  # Vector containing all returns of particular grade
  grade.returns <- filter(pct.returns.df, grade %in% loan.grade) %>% 
    .$tot_return100
  
  # Process is parallelized using `foreach` library
  trials.per.core <- round(trials/core.count) # `round()` in case of non-whole num
  net.yield <- foreach(i = 1:core.count, .combine = 'c') %dopar% 
    # `replicate()` repeats each trial and records the draws in a matrix.
    # `apply()` sums each column (one column per trial); the
    #   result is a vector of sums
    apply(X = replicate(n = trials.per.core, 
                        expr = sample(grade.returns, size = sample.size)), 
          MARGIN = 2, 
          FUN = sum)
  
  # total.trials will equal `trials` parameter unless 
  #   trials/core.count is not a whole number.
  total.trials <- length(net.yield) 
  
  # How many trials resulted in a negative yield.
  neg.count <- sum(net.yield < 0)
  
  # Output.
  cat("Grade: ", loan.grade, "\n")
  cat("Sample size per trial: ", sample.size, "\n")
  cat("Number of trials: ", 
      format(total.trials, big.mark=",", scientific = FALSE, trim = TRUE), 
      "\n")
  cat("Expected number of trials with negative yields: ", 
      format(trials*area.below.curve, big.mark=",", scientific = FALSE, trim = TRUE),
      "\n")  
  cat("Simulated number of trials with negative yields: ", 
      format(neg.count, big.mark=",", scientific = FALSE, trim = TRUE),
      "\n")  
}

sampling_sim("E")
## Grade:  E 
## Sample size per trial:  30 
## Number of trials:  10,000 
## Expected number of trials with negative yields:  239 
## Simulated number of trials with negative yields:  318


Out of 10,000 trials, the simulated outcome is close to the expected value.

Next, we compare the distributions of sample means for other grades:

plot_ly(data = norm.curve.df, 
        x = x_coord, 
        y = y_coord, 
        group = grade, 
        marker = list(color = grade.color.code[grade]), # color lookup may not map per group
        hoverinfo = "none",
        type = "line") %>% 
  layout(title = "Distribution of Sample Means by Grade, 2013 (n = 30)",
         xaxis = list(title = "Percent Return", ticksuffix = "%"),
         yaxis = list(title = ""))


We can see that the center of each curve shifts to the right as we move from grade A to grade E. Recall from an earlier plot that riskier loans had greater variances; we see that reflected in the width of the distributions. Random samples of E-grade loans will vary more than those of A-grade loans.

We can repeat the sampling procedure that we did for E-grade loans for the other grades:

sampling_sim("A")
## Grade:  A 
## Sample size per trial:  30 
## Number of trials:  10,000 
## Expected number of trials with negative yields:  1 
## Simulated number of trials with negative yields:  26
sampling_sim("B")
## Grade:  B 
## Sample size per trial:  30 
## Number of trials:  10,000 
## Expected number of trials with negative yields:  14 
## Simulated number of trials with negative yields:  51
sampling_sim("C")
## Grade:  C 
## Sample size per trial:  30 
## Number of trials:  10,000 
## Expected number of trials with negative yields:  52 
## Simulated number of trials with negative yields:  112
sampling_sim("D")
## Grade:  D 
## Sample size per trial:  30 
## Number of trials:  10,000 
## Expected number of trials with negative yields:  175 
## Simulated number of trials with negative yields:  257
sampling_sim("E")
## Grade:  E 
## Sample size per trial:  30 
## Number of trials:  10,000 
## Expected number of trials with negative yields:  239 
## Simulated number of trials with negative yields:  295


It appears that the projections consistently underestimate the number of losses. However, the discrepancy is small relative to the total number of trials.
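
One plausible culprit is the long left tail visible in the earlier histogram: defaults produce outsized losses, so the true sampling distribution at n = 30 still has a fatter lower tail than the normal approximation. A rough skewness check supports this (a sketch reusing `pct.returns.df`; a negative value indicates a left skew):

ret.E <- filter(pct.returns.df, grade == "E")$tot_return100
# rough moment-based skewness estimate
mean((ret.E - mean(ret.E))^3) / sd(ret.E)^3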

If we were to increase the sample size, the distributions would become narrower, making it even less likely that a larger sample would result in a net loss.
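
Under the same normality assumption, we can sketch how quickly the tail probability shrinks with sample size by reusing `mean_sd_fn()` (the sample sizes below are illustrative):

ms <- mean_sd_fn("E")
sapply(c(30, 60, 100, 200), function(n) {
  z <- ms[["mean"]] * sqrt(n) / ms[["sd"]]
  pnorm(-z) %>% round(4)  # P(net loss) for a random sample of n loans
})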

In any case, the evidence suggests that we can invest in just a few dozen loans (not hundreds) and be relatively confident that we will at least break even. Of course, nobody invests with hopes of merely breaking even; more investments improve the probability of a solid positive yield. In addition, this conclusion assumes that present-day loans behave like loans issued in 2013. That is most certainly not the case.