Class Project

Kevin Mitchell, Spencer Morgan, & Ian Lonsdale

## Executive Summary

For our big data project, we chose to focus on predicting loan default and analyzing the factors that impact interest rates in Lending Club's datasets. Lending Club is a peer-to-peer lending site where an individual can apply for a personal loan that may be funded by a single person or a group of funders. Lending Club requires relevant financial data from the applicant, such as loan amount, income, and employment status. It then grades the loan request and sets the interest rate that will be charged by assessing risk from the provided information. Finally, Lending Club puts the loan "up for auction" to see whether any funders would like to fulfill it at the specified amount and percentage rate. The first step in our group process was to identify an appropriate dataset that we could use to run a regression model and understand the variables that lead to loan default.

We searched various internet forums, including Kaggle, for datasets relevant to our goal of analyzing the possibility of loan default. We ultimately found a dataset on Kaggle, originally from Lending Club, that provided the relevant variables to test in a predictive model. Most importantly, this dataset had a field that indicated the status of the loan; this variable is what we used for our binomial classification problem. For our regression problem (where the output is a numerical scalar value), we examined what factors impact the interest rate of the loan. We ultimately decided on a set of 8 variables that would most likely have the largest impact on both of the studied outcomes: loan amount, interest rate, loan grade, employment length, home ownership, annual income, loan term, and loan status.

Our findings matched logical assumptions for both models.

For loan outcome, the following variables showed positive coefficients:

1. Loan amount
2. Interest rate
3. Employment status of N/A
4. Term length of 60 months instead of 36 months
5. Grades A-D (positive impact, decreasing with each drop in grade)

The probability of a loan defaulting increases with each of the listed factors.

For loan outcome, the following variables showed negative coefficients:

1. Annual income
2. Being employed (significantly better than N/A)
3. Grades E-G

The probability of a loan defaulting decreases with each of the listed factors. We found these coefficients to account for about 80% of the variation in loan outcome.
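To make these magnitudes concrete, a logistic coefficient converts to an odds ratio by exponentiation. A minimal sketch using the interest-rate coefficient of about 0.137 from the fitted model reported later in this document:

# each one-point increase in interest rate multiplies the odds of default by exp(beta)
exp(0.137)      # ~1.147: about 14.7% higher odds of default per point of interest
exp(0.137 * 5)  # ~1.98: a five-point rate increase roughly doubles the odds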

For interest rate, we ran a model using the same variables, with interest rate as the Y variable. Below is a list of the most notable findings.

1. The higher the loan amount, the higher the interest rate.
2. The lower the grade of the loan, the higher the interest rate.
3. The longer one has been employed, the lower the interest rate.
4. Home ownership type was not found to have any statistical significance in determining the interest rate.
5. The higher one's annual income, the lower the interest rate.
6. A longer term (60 months) results in a higher interest rate.
7. Loans that defaulted typically had higher interest rates.

Although we could not get a confusion matrix to run for this regression (it applies to classification rather than a continuous outcome), these results match what one would assume.
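One workaround we did not get working, sketched here as an assumption-laden illustration: bin the continuous predictions into rate bands so that a confusion matrix becomes meaningful. The cut points below are hypothetical, and the sketch assumes the interest-rate model (M_LOG3) and Testing split defined later in this document:

# cut() turns continuous rates into factor bands (cut points, in percent, are arbitrary)
rate_bands <- c(0, 10, 15, 20, 30)
pred_band <- cut(predict(M_LOG3, Testing), breaks = rate_bands)
true_band <- cut(Testing$int_rate, breaks = rate_bands)
table(pred_band, true_band)  # rows: predicted band, columns: actual band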

In general, investing money in a loan on Lending Club is similar to investing in the stock market: higher interest comes with higher risk. But you can use positive features, like employment and annual income, to select high-interest loans that have a lower risk of defaulting.

## First, we will load the required libraries for our analysis:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggthemes)
library(corrplot)
## corrplot 0.84 loaded
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(DT)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
##   dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
##   Referenced from: /Library/Frameworks/R.framework/Versions/3.6/Resources/modules/R_X11.so
##   Reason: image not found
## Could not load tcltk.  Will use slower R code instead.
## Loading required package: RSQLite
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(DescTools)
## 
## Attaching package: 'DescTools'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## Ian - Pathname for storage "/Users/ian/Downloads/loan.csv"
## Kevin - Pathname for storage "/Users/Spartatoe/Documents/MBA/Fall 2019/GSBAS 576/Group Project/loan.csv"
## Spencer - Pathname for storage: "/Users/spencermorgan/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - Spencer’s MacBook Air/Big Data/loan.csv"
## Load in the dataset, set the blank spaces to NA, and examine the dimensions and names of all 74 columns
loan = read_csv("/Users/ian/Downloads/loan.csv" , na = "")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   term = col_character(),
##   grade = col_character(),
##   sub_grade = col_character(),
##   emp_title = col_character(),
##   emp_length = col_character(),
##   home_ownership = col_character(),
##   verification_status = col_character(),
##   issue_d = col_character(),
##   loan_status = col_character(),
##   pymnt_plan = col_character(),
##   url = col_character(),
##   desc = col_character(),
##   purpose = col_character(),
##   title = col_character(),
##   zip_code = col_character(),
##   addr_state = col_character(),
##   earliest_cr_line = col_character(),
##   initial_list_status = col_logical(),
##   last_pymnt_d = col_character(),
##   next_pymnt_d = col_character()
##   # ... with 23 more columns
## )
## See spec(...) for full column specifications.
## Warning: 3400965 parsing failures.
##   row                 col           expected   actual                            file
## 42536 initial_list_status 1/0/T/F/TRUE/FALSE w        '/Users/ian/Downloads/loan.csv'
## 42536 tot_coll_amt        1/0/T/F/TRUE/FALSE 0.0      '/Users/ian/Downloads/loan.csv'
## 42536 tot_cur_bal         1/0/T/F/TRUE/FALSE 114834.0 '/Users/ian/Downloads/loan.csv'
## 42536 total_rev_hi_lim    1/0/T/F/TRUE/FALSE 59900.0  '/Users/ian/Downloads/loan.csv'
## 42537 tot_coll_amt        1/0/T/F/TRUE/FALSE 0.0      '/Users/ian/Downloads/loan.csv'
## ..... ................... .................. ........ ...............................
## See problems(...) for more details.
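The parsing failures above come from readr guessing column types from only the first 1,000 rows: initial_list_status is guessed as logical but actually holds codes like 'w', and several numeric columns are blank until row 42,536. A hedged fix (not run here, assuming the same file path) is to raise guess_max so readr sees past that row before choosing types:

#loan = read_csv("/Users/ian/Downloads/loan.csv", na = "", guess_max = 100000)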
dim(loan)
## [1] 887379     74
colnames(loan)
##  [1] "id"                          "member_id"                  
##  [3] "loan_amnt"                   "funded_amnt"                
##  [5] "funded_amnt_inv"             "term"                       
##  [7] "int_rate"                    "installment"                
##  [9] "grade"                       "sub_grade"                  
## [11] "emp_title"                   "emp_length"                 
## [13] "home_ownership"              "annual_inc"                 
## [15] "verification_status"         "issue_d"                    
## [17] "loan_status"                 "pymnt_plan"                 
## [19] "url"                         "desc"                       
## [21] "purpose"                     "title"                      
## [23] "zip_code"                    "addr_state"                 
## [25] "dti"                         "delinq_2yrs"                
## [27] "earliest_cr_line"            "inq_last_6mths"             
## [29] "mths_since_last_delinq"      "mths_since_last_record"     
## [31] "open_acc"                    "pub_rec"                    
## [33] "revol_bal"                   "revol_util"                 
## [35] "total_acc"                   "initial_list_status"        
## [37] "out_prncp"                   "out_prncp_inv"              
## [39] "total_pymnt"                 "total_pymnt_inv"            
## [41] "total_rec_prncp"             "total_rec_int"              
## [43] "total_rec_late_fee"          "recoveries"                 
## [45] "collection_recovery_fee"     "last_pymnt_d"               
## [47] "last_pymnt_amnt"             "next_pymnt_d"               
## [49] "last_credit_pull_d"          "collections_12_mths_ex_med" 
## [51] "mths_since_last_major_derog" "policy_code"                
## [53] "application_type"            "annual_inc_joint"           
## [55] "dti_joint"                   "verification_status_joint"  
## [57] "acc_now_delinq"              "tot_coll_amt"               
## [59] "tot_cur_bal"                 "open_acc_6m"                
## [61] "open_il_6m"                  "open_il_12m"                
## [63] "open_il_24m"                 "mths_since_rcnt_il"         
## [65] "total_bal_il"                "il_util"                    
## [67] "open_rv_12m"                 "open_rv_24m"                
## [69] "max_bal_bc"                  "all_util"                   
## [71] "total_rev_hi_lim"            "inq_fi"                     
## [73] "total_cu_tl"                 "inq_last_12m"

EDA of the loan data

### This graph shows the distribution of loan amounts. 
### Looking at the graph below, it appears that a majority of the loans are between $5,000 and $20,000

ggplot(data=loan, aes(loan_amnt))+geom_histogram(bins = 40,color="blue",fill="purple")

#See this graph for a more detailed breakdown. It is interesting to note the spike at the maximum loan amount

Desc(loan$loan_amnt, main = "Loan amount distribution", plotit = TRUE)
## ------------------------------------------------------------------------------ 
## Loan amount distribution
## 
##      length         n       NAs     unique         0s       mean     meanCI
##     887'379   887'379         0      1'372          0  14'755.26  14'737.71
##                100.0%      0.0%                  0.0%             14'772.82
##                                                                            
##         .05       .10       .25     median        .75        .90        .95
##    3'600.00  5'000.00  8'000.00  13'000.00  20'000.00  28'000.00  32'000.00
##                                                                            
##       range        sd     vcoef        mad        IQR       skew       kurt
##   34'500.00  8'435.46      0.57   8'599.08  12'000.00       0.68      -0.26
##                                                                            
## lowest : 500.0 (11), 550.0, 600.0 (6), 700.0 (3), 725.0
## highest: 34'900.0 (14), 34'925.0 (9), 34'950.0 (18), 34'975.0 (31), 35'000.0 (36'368)
## 
## heap(?): remarkable frequency (7.0%) for the mode(s) (= 10000)

# Take a look at the reasons why people take out loans on Lending Club. Debt consolidation is by far the most applied-for and funded reason 
Desc(loan$purpose, main = "Loan purposes", plotit = TRUE)
## ------------------------------------------------------------------------------ 
## Loan purposes
## 
##    length       n     NAs  unique  levels   dupes
##   887'379 887'379       0      14      14       y
##            100.0%    0.0%                        
## 
##                  level     freq   perc  cumfreq  cumperc
## 1   debt_consolidation  524'215  59.1%  524'215    59.1%
## 2          credit_card  206'182  23.2%  730'397    82.3%
## 3     home_improvement   51'829   5.8%  782'226    88.2%
## 4                other   42'894   4.8%  825'120    93.0%
## 5       major_purchase   17'277   1.9%  842'397    94.9%
## 6       small_business   10'377   1.2%  852'774    96.1%
## 7                  car    8'863   1.0%  861'637    97.1%
## 8              medical    8'540   1.0%  870'177    98.1%
## 9               moving    5'414   0.6%  875'591    98.7%
## 10            vacation    4'736   0.5%  880'327    99.2%
## 11               house    3'707   0.4%  884'034    99.6%
## 12             wedding    2'347   0.3%  886'381    99.9%
## ... etc.
##  [list output truncated]

### This graph indicates that grades B and C tend to have a higher total loan amount.
### This could be indicative of riskier loans

ggplot(loan, aes(x=grade, y=loan_amnt, fill=grade)) +
  stat_summary(fun.y="sum", geom="bar") +
  labs(y ="Total Loan Amount",title="Total loan amount based on loan grade")

### This graph indicates that as the grade decreases, the interest rate increases, which makes sense as these loans are most likely riskier. 
ggplot(data=loan, aes(grade,int_rate,fill=grade))+geom_boxplot(outlier.color = "blue")+labs(title="Box plot of Interest rate")

### This graph shows the distribution of loan amount by grade.
### For the most part, loan amounts are consistent across grades.
### However, some of the lower grades do appear to have more loans of $30,000+


ggplot(data=loan,aes(loan_amnt, fill=grade))+
  geom_density(alpha=0.25) + 
  facet_grid(grade ~ .)

### Graph showing distribution of interest rates depending on grade of the loan
### Hoping to find correlation between the two

hist1=sqldf("select count(*) as freq,grade,int_rate from loan group by grade,int_rate")

p <- ggplot(data = hist1, aes(x = int_rate, y = freq)) +
  geom_smooth(aes(colour = grade, fill = grade)) + facet_wrap(~ grade) + labs(x = "Interest Rate", y="Frequency")
 

ggplotly(p) 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
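The sqldf query above can also be written with the tidyverse already loaded; a minimal dplyr equivalent:

# count rows per (grade, int_rate) pair, matching the sqldf result
hist1 <- loan %>% count(grade, int_rate) %>% rename(freq = n)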
#Confirming state IDs made it in on import
unique(loan$addr_state)
##  [1] "AZ" "GA" "IL" "CA" "OR" "NC" "TX" "VA" "MO" "CT" "UT" "FL" "NY" "PA" "MN"
## [16] "NJ" "KY" "OH" "SC" "RI" "LA" "MA" "WA" "WI" "AL" "CO" "KS" "NV" "AK" "MD"
## [31] "WV" "VT" "MI" "DC" "SD" "NH" "AR" "NM" "MT" "HI" "WY" "OK" "DE" "MS" "TN"
## [46] "IA" "NE" "ID" "IN" "ME" "ND"
### This is a heat map showing the states within the US that have the highest and lowest total loan amounts

suppressPackageStartupMessages(library(maps))
loan$region <- loan$addr_state
loan$region <- as.factor(loan$region)
levels(loan$region)<- c("alaska", "alabama","arkansas", "arizona", "california","colorado","connecticut","district of columbia","delaware","florida","georgia","hawaii","iowa","idaho","illinois","indiana","kansas","kentucky","louisiana","massachusetts","maryland","maine","michigan","minnesota","missouri","mississippi","montana","north carolina","north dakota","nebraska","new hampshire","new jersey","new mexico","nevada","new york","ohio","oklahoma","oregon","pennsylvania","rhode island","south carolina","south dakota","tennessee","texas","utah","virginia","vermont","washington","wisconsin","west virginia","wyoming")
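Assigning 51 state names by the alphabetical order of the factor levels works, but it silently breaks if any state is missing from the data. A hedged alternative (not run here) uses R's built-in state vectors; note state.abb has no entry for DC, which needs separate handling:

#loan$region <- tolower(state.name[match(loan$addr_state, state.abb)])
#loan$region[loan$addr_state == "DC"] <- "district of columbia"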

all_states <- map_data("state")
state_by_loan <-loan %>% group_by(region) %>%
                summarise(value = sum(loan_amnt, na.rm=TRUE))
state_by_loan$region <- as.character(state_by_loan$region)

Total <- merge(all_states, state_by_loan, by="region")

p <- ggplot()
p <- p + geom_polygon(data=Total, aes(x=long, y=lat, group = group, fill=value), colour="white"
      ) + scale_fill_continuous(low = "skyblue", high = "darkblue", guide="colorbar")
P1 <- p + theme_bw()  + labs(fill = "Gradient of loan amount" 
                            ,title = "Heat Map of loan amount in all states", x="", y="")
P1 + scale_y_continuous(breaks=c()) + scale_x_continuous(breaks=c()) + theme(panel.border =  element_blank())

### This graph shows a distribution of the different loan status types by state

ggplot(data = loan, aes(x=addr_state, fill=loan_status)) +
  labs(x="State", y="Total Loan issued") +
  geom_bar() +
  coord_flip()

##Bar graph showing a count of the loan status variable, giving an idea of the distribution of the different loan statuses
## In this case, the large majority of the loans are current. 
## A much smaller percentage are defaulted or late

loan %>%
        count(loan_status) %>%
        ggplot(aes(x = reorder(loan_status , desc(n)) , y = n , fill = n)) + 
        geom_col() + 
        coord_flip() + 
        labs(x = 'Loan Status' , y = 'Count')

# Start to clean data up a little more to make additional graphs
loan = loan %>%
        select(loan_status , loan_amnt , int_rate , grade , emp_length , home_ownership , 
               annual_inc , term)
loan
## # A tibble: 887,379 x 8
##    loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc
##    <chr>           <dbl>    <dbl> <chr> <chr>      <chr>               <dbl>
##  1 Fully Paid       5000     10.6 B     10+ years  RENT                24000
##  2 Charged Off      2500     15.3 C     < 1 year   RENT                30000
##  3 Fully Paid       2400     16.0 C     10+ years  RENT                12252
##  4 Fully Paid      10000     13.5 C     10+ years  RENT                49200
##  5 Current          3000     12.7 B     1 year     RENT                80000
##  6 Fully Paid       5000      7.9 A     3 years    RENT                36000
##  7 Current          7000     16.0 C     8 years    RENT                47004
##  8 Fully Paid       3000     18.6 E     9 years    RENT                48000
##  9 Charged Off      5600     21.3 F     4 years    OWN                 40000
## 10 Charged Off      5375     12.7 B     < 1 year   RENT                15000
## # … with 887,369 more rows, and 1 more variable: term <chr>
#Find the loan status options to choose what counts as an acceptable loan outcome

unique(loan$loan_status)
##  [1] "Fully Paid"                                         
##  [2] "Charged Off"                                        
##  [3] "Current"                                            
##  [4] "Default"                                            
##  [5] "Late (31-120 days)"                                 
##  [6] "In Grace Period"                                    
##  [7] "Late (16-30 days)"                                  
##  [8] "Does not meet the credit policy. Status:Fully Paid" 
##  [9] "Does not meet the credit policy. Status:Charged Off"
## [10] "Issued"
## We want to convert this variable to binary (1 for default, 0 for non-default), but we have 10 different levels. Loans with status Current, Late, or In Grace Period need to be removed. Therefore, we create a new variable called loan_outcome where:

## loan_outcome -> 1 if loan_status = ‘Charged Off’ or ‘Default’
## loan_outcome -> 0 if loan_status = ‘Fully Paid’

loan = loan %>%
        mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off' , 'Default') , 
                                     1, 
                                     ifelse(loan_status == 'Fully Paid' , 0 , 'No info')
                                     ))

barplot(table(loan$loan_outcome) , col = 'lightblue')

head(loan)
## # A tibble: 6 x 9
##   loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc
##   <chr>           <dbl>    <dbl> <chr> <chr>      <chr>               <dbl>
## 1 Fully Paid       5000     10.6 B     10+ years  RENT                24000
## 2 Charged Off      2500     15.3 C     < 1 year   RENT                30000
## 3 Fully Paid       2400     16.0 C     10+ years  RENT                12252
## 4 Fully Paid      10000     13.5 C     10+ years  RENT                49200
## 5 Current          3000     12.7 B     1 year     RENT                80000
## 6 Fully Paid       5000      7.9 A     3 years    RENT                36000
## # … with 2 more variables: term <chr>, loan_outcome <chr>
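Note that the nested ifelse() above coerces loan_outcome to character (it mixes 1, 0, and 'No info'), which is why head() reports it as <chr>. A hedged alternative that keeps the column numeric, using dplyr's case_when (not run here):

#loan = loan %>%
#        mutate(loan_outcome = case_when(
#                loan_status %in% c('Charged Off' , 'Default') ~ 1L,
#                loan_status == 'Fully Paid' ~ 0L,
#                TRUE ~ NA_integer_))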
# We will create a new dataset which contains only rows with 0 or 1 in loan_outcome feature for better modelling.

# Create the new dataset by filtering 0's and 1's in the loan_outcome column and remove loan_status column for the modelling
loan2 = loan %>%
        select(-loan_status) %>%
        filter(loan_outcome %in% c(0 , 1))
dim(loan2) # Our new dataset contains 254,190 rows & 8 columns
## [1] 254190      8
#Testing the Y column (loan_outcome) to confirm that the 'No info' rows are gone
unique(loan2$loan_outcome)
## [1] "0" "1"
# We assume that grade is a strong predictor of the volume of non-performing loans. But how many loans did not perform, grouped by grade?

table(loan2$grade , factor(loan2$loan_outcome , c(0 , 1) , c('Fully Paid' , 'Default')))
##    
##     Fully Paid Default
##   A      39679    2664
##   B      66546    9717
##   C      52678   13002
##   D      30020   10798
##   E      12928    6459
##   F       4726    3013
##   G       1146     814
#let's graph this so we can see volume by grade and default
ggplot(loan2 , aes(x = grade , y = ..count.. , fill = factor(loan_outcome , c(1 , 0) , c('Default' , 'Fully Paid')))) + 
        geom_bar() + 
        theme(legend.title = element_blank())
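Because grades differ so much in volume, the counts above can hide the default rate itself; a one-line variant (a sketch, not run for the report) rescales each bar to proportions:

# position = 'fill' shows the share of defaults within each grade
ggplot(loan2 , aes(x = grade , fill = factor(loan_outcome , c(1 , 0) , c('Default' , 'Fully Paid')))) + 
        geom_bar(position = 'fill') + 
        labs(y = 'Proportion') + 
        theme(legend.title = element_blank())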

Time for Data Modelling

#Modelling Process:

## Let's make factors in a new dataset of all non-numeric variables

loan3<-loan2

loan3$home_ownership<-factor(loan3$home_ownership)
loan3$grade<-factor(loan3$grade)
loan3$term<-factor(loan3$term)
loan3$emp_length<-factor(loan3$emp_length)
loan3$loan_outcome<-factor(loan3$loan_outcome)
set.seed(123) #locks seed for random partitioning

#creates a vector of rows to randomly sample from the raw data
inTrain <- createDataPartition(y=loan3$loan_outcome, p=.70, list = FALSE) 
#stores these rows in the training set
Training<-loan3[inTrain,]  
#stores all rows not in the training set in the test/validation set
Testing<-loan3[-inTrain,]  
##LOGISTIC REGRESSION## (categorical predictors need to be factors)
M_LOG<-glm(loan_outcome ~ ., data = Training, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(M_LOG)
## 
## Call:
## glm(formula = loan_outcome ~ ., family = "binomial", data = Training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4548  -0.6764  -0.5186  -0.3373   4.6614  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.009e+01  4.395e+01  -0.230 0.818477    
## loan_amnt               1.499e-05  9.881e-07  15.167  < 2e-16 ***
## int_rate                1.373e-01  4.765e-03  28.817  < 2e-16 ***
## gradeB                  1.315e-01  3.357e-02   3.918 8.91e-05 ***
## gradeC                  1.652e-01  4.332e-02   3.814 0.000136 ***
## gradeD                  8.502e-02  5.540e-02   1.535 0.124853    
## gradeE                 -4.992e-02  6.870e-02  -0.727 0.467468    
## gradeF                 -2.886e-01  8.504e-02  -3.394 0.000688 ***
## gradeG                 -4.290e-01  1.039e-01  -4.128 3.67e-05 ***
## emp_length1 year       -1.456e-02  3.303e-02  -0.441 0.659277    
## emp_length10+ years    -6.388e-02  2.532e-02  -2.523 0.011649 *  
## emp_length2 years      -6.764e-02  3.052e-02  -2.216 0.026665 *  
## emp_length3 years      -4.748e-02  3.157e-02  -1.504 0.132607    
## emp_length4 years      -4.054e-02  3.358e-02  -1.207 0.227327    
## emp_length5 years      -3.975e-02  3.263e-02  -1.218 0.223059    
## emp_length6 years       1.545e-02  3.427e-02   0.451 0.652034    
## emp_length7 years       5.799e-03  3.459e-02   0.168 0.866869    
## emp_length8 years       1.327e-02  3.666e-02   0.362 0.717327    
## emp_length9 years       3.873e-04  3.933e-02   0.010 0.992143    
## emp_lengthn/a           4.010e-01  3.630e-02  11.048  < 2e-16 ***
## home_ownershipMORTGAGE  6.597e+00  4.395e+01   0.150 0.880688    
## home_ownershipNONE      6.573e+00  4.396e+01   0.150 0.881128    
## home_ownershipOTHER     7.148e+00  4.395e+01   0.163 0.870815    
## home_ownershipOWN       6.666e+00  4.395e+01   0.152 0.879457    
## home_ownershipRENT      6.802e+00  4.395e+01   0.155 0.877014    
## annual_inc             -6.931e-06  2.212e-07 -31.342  < 2e-16 ***
## term60 months           3.316e-01  1.710e-02  19.391  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 169257  on 177933  degrees of freedom
## Residual deviance: 156390  on 177907  degrees of freedom
## AIC: 156444
## 
## Number of Fisher Scoring iterations: 7
##The table above summarizes the effect each variable has on determining whether someone will default on their loan.
##The larger the loan amount and interest rate, the more likely one is to default on the loan.
##The lower the grade on the loan, the more likely the loan will default.
##Only employment lengths of 10+ years and N/A are statistically significant. Someone employed for 10+ years is less likely to default, and someone with N/A employment is more likely to default.
##Home ownership does not have any statistical significance in determining loan default.
##The higher someone's annual income, the less likely they are to default.
##A loan term of 60 months makes default more likely.
exp(cbind(M_LOG$coefficients, confint.default(M_LOG)))
##                                            2.5 %       97.5 %
## (Intercept)            4.159048e-05 1.604327e-42 1.078189e+33
## loan_amnt              1.000015e+00 1.000013e+00 1.000017e+00
## int_rate               1.147191e+00 1.136527e+00 1.157955e+00
## gradeB                 1.140564e+00 1.067945e+00 1.218121e+00
## gradeC                 1.179675e+00 1.083651e+00 1.284208e+00
## gradeD                 1.088741e+00 9.767174e-01 1.213612e+00
## gradeE                 9.513072e-01 8.314626e-01 1.088426e+00
## gradeF                 7.492747e-01 6.342481e-01 8.851626e-01
## gradeG                 6.511515e-01 5.311411e-01 7.982781e-01
## emp_length1 year       9.855442e-01 9.237722e-01 1.051447e+00
## emp_length10+ years    9.381208e-01 8.926990e-01 9.858538e-01
## emp_length2 years      9.345927e-01 8.803263e-01 9.922043e-01
## emp_length3 years      9.536263e-01 8.964020e-01 1.014504e+00
## emp_length4 years      9.602671e-01 8.990963e-01 1.025600e+00
## emp_length5 years      9.610273e-01 9.014976e-01 1.024488e+00
## emp_length6 years      1.015572e+00 9.496063e-01 1.086119e+00
## emp_length7 years      1.005816e+00 9.398822e-01 1.076375e+00
## emp_length8 years      1.013360e+00 9.431068e-01 1.088846e+00
## emp_length9 years      1.000387e+00 9.261759e-01 1.080545e+00
## emp_lengthn/a          1.493370e+00 1.390812e+00 1.603491e+00
## home_ownershipMORTGAGE 7.331453e+02 2.828207e-35 1.900505e+40
## home_ownershipNONE     7.156649e+02 2.747405e-35 1.864218e+40
## home_ownershipOTHER    1.271638e+03 4.898626e-35 3.301053e+40
## home_ownershipOWN      7.851949e+02 3.028973e-35 2.035446e+40
## home_ownershipRENT     8.997597e+02 3.470948e-35 2.332410e+40
## annual_inc             9.999931e-01 9.999926e-01 9.999935e-01
## term60 months          1.393258e+00 1.347328e+00 1.440754e+00
##The table above shows the exponentiated coefficients (odds ratios) for each variable with their 95% confidence intervals.
confusionMatrix(table(predict(M_LOG, Testing, type="response") >= 0.5,
                      Testing$loan_outcome == 1))
## Confusion Matrix and Statistics
## 
##        
##         FALSE  TRUE
##   FALSE 62129 13765
##   TRUE    187   175
##                                           
##                Accuracy : 0.817           
##                  95% CI : (0.8143, 0.8198)
##     No Information Rate : 0.8172          
##     P-Value [Acc > NIR] : 0.547           
##                                           
##                   Kappa : 0.0154          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.99700         
##             Specificity : 0.01255         
##          Pos Pred Value : 0.81863         
##          Neg Pred Value : 0.48343         
##              Prevalence : 0.81719         
##          Detection Rate : 0.81474         
##    Detection Prevalence : 0.99525         
##       Balanced Accuracy : 0.50478         
##                                           
##        'Positive' Class : FALSE           
## 
##The confusion matrix above shows how well our model predicted loan default. Overall accuracy was 81.7%, although this essentially matches the no-information rate (81.7%), and the specificity of 1.3% shows that at the 0.5 threshold the model flags almost no actual defaults.
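Since accuracy here is dominated by the majority class, a threshold-free measure is more informative. A minimal sketch using the pROC package loaded earlier (not run for this report):

# ROC curve and AUC for the logistic model's predicted default probabilities
probs <- predict(M_LOG, Testing, type = "response")
roc_obj <- roc(Testing$loan_outcome, probs)
auc(roc_obj)             # area under the ROC curve
coords(roc_obj, "best")  # threshold that balances sensitivity and specificity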

##CART## #rpart package implementation

train_control <- trainControl(method="cv", number=10, savePredictions = TRUE)  
#sets the cross validation parameters to be fed to train()
M_CART <- train(loan_outcome ~., data = Training, trControl=train_control, tuneLength=10, method = "rpart") 
#increasing tunelength increases regularization penalty
##the "cv", number = 10 refers to 10-fold cross validation on the training data
print(M_CART)
## CART 
## 
## 177934 samples
##      7 predictor
##      2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 160140, 160141, 160140, 160140, 160141, 160142, ... 
## Resampling results across tuning parameters:
## 
##   cp            Accuracy   Kappa      
##   0.0001076029  0.8170333  0.049708345
##   0.0001229748  0.8171850  0.047762354
##   0.0001414210  0.8173536  0.045588162
##   0.0001537184  0.8173986  0.042588010
##   0.0001767762  0.8176121  0.038570760
##   0.0002152058  0.8177302  0.031770406
##   0.0002459495  0.8178313  0.031324347
##   0.0002561974  0.8177864  0.031568566
##   0.0005841301  0.8171457  0.009225183
##   0.0006264027  0.8172468  0.006993798
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0002459495.
##Find the optimal complexity parameter. The optimal cp is 0.0002459495
plot(M_CART) #produces plot of cross-validation results

M_CART$bestTune #returns optimal complexity parameter
##             cp
## 7 0.0002459495
confusionMatrix(predict(M_CART, Testing), Testing$loan_outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 61994 13643
##          1   322   297
##                                           
##                Accuracy : 0.8169          
##                  95% CI : (0.8141, 0.8196)
##     No Information Rate : 0.8172          
##     P-Value [Acc > NIR] : 0.5948          
##                                           
##                   Kappa : 0.0257          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.99483         
##             Specificity : 0.02131         
##          Pos Pred Value : 0.81963         
##          Neg Pred Value : 0.47981         
##              Prevalence : 0.81719         
##          Detection Rate : 0.81297         
##    Detection Prevalence : 0.99188         
##       Balanced Accuracy : 0.50807         
##                                           
##        'Positive' Class : 0               
## 
#Not as accurate as the glm, so we keep the glm model

##RANDOM FOREST## # The code below was not run because it took over 24 hours and did not improve accuracy

#caret package implementation with 10-fold cross validation
#train_control <- trainControl(method="cv", number=10, savePredictions = TRUE)
#RF1 <- train(loan_outcome ~ ., method="rf", trControl=train_control, preProcess=c("center", "scale"), tuneLength=2, data=Training)
#print(RF1)
#confusionMatrix(predict(RF1, Testing), Testing$loan_outcome) #run model predictions on test data
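A hedged sketch of how the forest could be made tractable, kept commented like the original: fit on a random subsample first (the 20,000-row size is an arbitrary assumption) before committing to the full data.

#set.seed(123)
#sub <- Training[sample(nrow(Training), 20000), ]  # hypothetical subsample size
#RF_small <- train(loan_outcome ~ ., method="rf", trControl=train_control, tuneLength=2, data=sub)
#confusionMatrix(predict(RF_small, Testing), Testing$loan_outcome)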

Test for improved accuracy by removing items that are not statistically significant

#Removing the home ownership column and the employment length rows that were not statistically significant

# Select only the columns mentioned above.
loan4 = loan3 %>%
        select(loan_amnt , int_rate , grade , emp_length ,annual_inc , term , loan_outcome)
loan4
## # A tibble: 254,190 x 7
##    loan_amnt int_rate grade emp_length annual_inc term      loan_outcome
##        <dbl>    <dbl> <fct> <fct>           <dbl> <fct>     <fct>       
##  1      5000     10.6 B     10+ years       24000 36 months 0           
##  2      2500     15.3 C     < 1 year        30000 60 months 1           
##  3      2400     16.0 C     10+ years       12252 36 months 0           
##  4     10000     13.5 C     10+ years       49200 36 months 0           
##  5      5000      7.9 A     3 years         36000 36 months 0           
##  6      3000     18.6 E     9 years         48000 36 months 0           
##  7      5600     21.3 F     4 years         40000 60 months 1           
##  8      5375     12.7 B     < 1 year        15000 60 months 1           
##  9      6500     14.6 C     5 years         72000 60 months 0           
## 10     12000     12.7 B     10+ years       75000 36 months 0           
## # … with 254,180 more rows
## Removed the employment length rows (1-9 years) since they were not statistically significant in determining whether one would default on the loan.
loan4<-subset(loan4, emp_length!='1 year' & emp_length!='2 years' & emp_length!='3 years' & emp_length!='4 years' & emp_length!='5 years' &  emp_length!='6 years' & emp_length!='7 years' &emp_length!='8 years' & emp_length!='9 years'  )
set.seed(123) #locks seed for random partitioning

#creates a vector of rows to randomly sample from the raw data
inTrain <- createDataPartition(y=loan4$loan_outcome, p=.70, list = FALSE) 
#stores these rows in the training set
Training<-loan4[inTrain,]  
#stores all rows not in the training set in the test/validation set
Testing<-loan4[-inTrain,]  
##LOGISTIC REGRESSION## (categorical predictors need to be factors)
M_LOG2<-glm(loan_outcome ~ ., data = Training, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##The table below summarizes the effect the variables have on determining whether someone will default on their loan, using the cleaned-up data (employment lengths 1-9 years and home ownership removed).
##The larger the loan amount and interest rate, the more likely one is to default on the loan.
##The lower the grade on the loan, the more likely the loan will default. Only grades B, C, and F are statistically significant (A is the baseline level).
##Someone employed for 10+ years is less likely to default, and someone with N/A employment is more likely to default.
##The higher someone's annual income, the less likely they are to default.
##A loan term of 60 months makes default more likely.

summary(M_LOG2)
## 
## Call:
## glm(formula = loan_outcome ~ ., family = "binomial", data = Training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4624  -0.6767  -0.5226  -0.3443   8.4904  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.218e+00  7.121e-02 -45.184  < 2e-16 ***
## loan_amnt            1.451e-05  1.459e-06   9.947  < 2e-16 ***
## int_rate             1.304e-01  7.338e-03  17.768  < 2e-16 ***
## gradeB               1.605e-01  5.095e-02   3.150  0.00163 ** 
## gradeC               1.469e-01  6.625e-02   2.217  0.02660 *  
## gradeD               1.234e-01  8.506e-02   1.451  0.14673    
## gradeE               7.898e-03  1.053e-01   0.075  0.94019    
## gradeF              -2.693e-01  1.308e-01  -2.059  0.03948 *  
## gradeG              -2.747e-01  1.588e-01  -1.730  0.08370 .  
## emp_length10+ years -1.551e-01  2.511e-02  -6.176 6.56e-10 ***
## emp_lengthn/a        3.230e-01  3.587e-02   9.007  < 2e-16 ***
## annual_inc          -7.146e-06  3.359e-07 -21.276  < 2e-16 ***
## term60 months        3.234e-01  2.560e-02  12.630  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 72605  on 75740  degrees of freedom
## Residual deviance: 67153  on 75728  degrees of freedom
## AIC: 67179
## 
## Number of Fisher Scoring iterations: 6
##The table below shows the exponentiated coefficients (odds ratios) for M_LOG2 with their 95% confidence intervals.
exp(cbind(M_LOG2$coefficients, confint.default(M_LOG2)))  # intervals must come from M_LOG2, not M_LOG, or the rows recycle and misalign
confusionMatrix(table(predict(M_LOG2, Testing, type="response") >= 0.5,
                      Testing$loan_outcome == 1))
## Confusion Matrix and Statistics
## 
##        
##         FALSE  TRUE
##   FALSE 26371  5954
##   TRUE     74    59
##                                         
##                Accuracy : 0.8143        
##                  95% CI : (0.81, 0.8185)
##     No Information Rate : 0.8147        
##     P-Value [Acc > NIR] : 0.5882        
##                                         
##                   Kappa : 0.0113        
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.997202      
##             Specificity : 0.009812      
##          Pos Pred Value : 0.815808      
##          Neg Pred Value : 0.443609      
##              Prevalence : 0.814745      
##          Detection Rate : 0.812465      
##    Detection Prevalence : 0.995902      
##       Balanced Accuracy : 0.503507      
##                                         
##        'Positive' Class : FALSE         
## 
#Found to be not as accurate - use the previous regression and keep the data for further testing

##Testing against Interest Rate

#rpart package implementation

set.seed(123) #locks seed for random partitioning

#creates a vector of rows to randomly sample from the raw data
inTrain <- createDataPartition(y=loan3$int_rate, p=.70, list = FALSE) 
#stores these rows in the training set
Training<-loan3[inTrain,]  
#stores all rows not in the training set in the test/validation set
Testing<-loan3[-inTrain,]  
##LINEAR REGRESSION## (interest rate is continuous, so we fit lm rather than glm)

M_LOG3<-lm(int_rate ~ ., data = Training)

##We fit a model with interest rate as the Y variable. This model looks at which variables help predict the interest rate on the loan.
##The table below shows a summary of the model fit above.
##The higher the loan amount, the higher the interest rate.
##The lower the grade of the loan, the higher the interest rate.
##The longer one has been employed, the lower the interest rate.
##Home ownership type does not have any statistical significance in determining the interest rate.
##The higher one's annual income, the lower the interest rate.
##A longer term (60 months) results in a higher interest rate.
##Loans that defaulted typically had higher interest rates

summary(M_LOG3)
## 
## Call:
## lm(formula = int_rate ~ ., data = Training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4829  -0.8972   0.0793   0.9284   7.4380 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             7.490e+00  1.315e+00   5.696 1.23e-08 ***
## loan_amnt               1.282e-05  4.514e-07  28.403  < 2e-16 ***
## gradeB                  4.008e+00  9.580e-03 418.313  < 2e-16 ***
## gradeC                  7.042e+00  1.007e-02 699.046  < 2e-16 ***
## gradeD                  9.960e+00  1.139e-02 874.231  < 2e-16 ***
## gradeE                  1.265e+01  1.478e-02 856.133  < 2e-16 ***
## gradeF                  1.570e+01  2.076e-02 756.237  < 2e-16 ***
## gradeG                  1.731e+01  3.781e-02 457.920  < 2e-16 ***
## emp_length1 year        3.928e-02  1.628e-02   2.414 0.015801 *  
## emp_length10+ years     1.309e-01  1.244e-02  10.524  < 2e-16 ***
## emp_length2 years       5.731e-02  1.491e-02   3.842 0.000122 ***
## emp_length3 years       4.028e-02  1.545e-02   2.608 0.009118 ** 
## emp_length4 years       2.893e-02  1.643e-02   1.761 0.078200 .  
## emp_length5 years       1.375e-01  1.597e-02   8.606  < 2e-16 ***
## emp_length6 years       1.699e-01  1.695e-02  10.023  < 2e-16 ***
## emp_length7 years       1.874e-01  1.718e-02  10.905  < 2e-16 ***
## emp_length8 years       1.303e-01  1.812e-02   7.192 6.41e-13 ***
## emp_length9 years       1.107e-01  1.941e-02   5.703 1.18e-08 ***
## emp_lengthn/a           1.274e-01  1.929e-02   6.600 4.12e-11 ***
## home_ownershipMORTGAGE -2.124e-01  1.315e+00  -0.162 0.871675    
## home_ownershipNONE      4.318e-01  1.334e+00   0.324 0.746215    
## home_ownershipOTHER    -3.326e-01  1.322e+00  -0.252 0.801385    
## home_ownershipOWN      -1.883e-01  1.315e+00  -0.143 0.886140    
## home_ownershipRENT     -1.376e-01  1.315e+00  -0.105 0.916673    
## annual_inc             -8.310e-07  5.956e-08 -13.952  < 2e-16 ***
## term60 months           3.427e-02  9.143e-03   3.748 0.000178 ***
## loan_outcome1           2.570e-01  8.325e-03  30.872  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.315 on 177907 degrees of freedom
## Multiple R-squared:  0.9107, Adjusted R-squared:  0.9107 
## F-statistic: 6.978e+04 on 26 and 177907 DF,  p-value: < 2.2e-16
##The table below shows the exponentiated coefficients with their 95% confidence intervals.
exp(cbind(M_LOG3$coefficients, confint.default(M_LOG3)))
##                                            2.5 %       97.5 %
## (Intercept)            1.790862e+03 1.360638e+02 2.357120e+04
## loan_amnt              1.000013e+00 1.000012e+00 1.000014e+00
## gradeB                 5.501614e+01 5.399272e+01 5.605895e+01
## gradeC                 1.143491e+03 1.121136e+03 1.166292e+03
## gradeD                 2.116657e+04 2.069916e+04 2.164454e+04
## gradeE                 3.121587e+05 3.032474e+05 3.213319e+05
## gradeF                 6.577595e+06 6.315337e+06 6.850744e+06
## gradeG                 3.306317e+07 3.070157e+07 3.560643e+07
## emp_length1 year       1.040062e+00 1.007409e+00 1.073773e+00
## emp_length10+ years    1.139854e+00 1.112402e+00 1.167982e+00
## emp_length2 years      1.058980e+00 1.028473e+00 1.090391e+00
## emp_length3 years      1.041107e+00 1.010056e+00 1.073112e+00
## emp_length4 years      1.029357e+00 9.967406e-01 1.063041e+00
## emp_length5 years      1.147358e+00 1.111996e+00 1.183845e+00
## emp_length6 years      1.185177e+00 1.146449e+00 1.225214e+00
## emp_length7 years      1.206089e+00 1.166146e+00 1.247399e+00
## emp_length8 years      1.139213e+00 1.099459e+00 1.180405e+00
## emp_length9 years      1.117031e+00 1.075342e+00 1.160337e+00
## emp_lengthn/a          1.135814e+00 1.093663e+00 1.179590e+00
## home_ownershipMORTGAGE 8.086393e-01 6.144543e-02 1.064192e+01
## home_ownershipNONE     1.539974e+00 1.126978e-01 2.104318e+01
## home_ownershipOTHER    7.170663e-01 5.372401e-02 9.570844e+00
## home_ownershipOWN      8.283755e-01 6.294102e-02 1.090237e+01
## home_ownershipRENT     8.714695e-01 6.621998e-02 1.146873e+01
## annual_inc             9.999992e-01 9.999991e-01 9.999993e-01
## term60 months          1.034864e+00 1.016484e+00 1.053577e+00
## loan_outcome1          1.293052e+00 1.272125e+00 1.314322e+00
#Calculate in-sample and out-of-sample error

RMSE_IN<-sqrt(mean(M_LOG3$residuals^2))
RMSE_OUT<-sqrt(mean((predict(M_LOG3, Testing)-Testing$int_rate)^2))  # mean() averages over rows; dividing by length(Testing) counts the 8 columns instead
##In-sample error
RMSE_IN
## [1] 1.314754
##Out-of-sample error. The 128.4 shown below came from dividing by length(Testing), which counts the data frame's 8 columns rather than its ~76,000 rows; rescaled to the row count, the out-of-sample RMSE is about 1.32, essentially the same as the in-sample error.
RMSE_OUT
## [1] 128.4008

##CART## #rpart package implementation

set.seed(123) #locks seed for random partitioning

#creates a vector of rows to randomly sample from the raw data
inTrain <- createDataPartition(y=loan3$int_rate, p=.70, list = FALSE) 
#stores these rows in the training set
Training<-loan3[inTrain,]  
#stores all rows not in the training set in the test/validation set
Testing<-loan3[-inTrain,]  
train_control <- trainControl(method="cv", number=10, savePredictions = TRUE)  
#sets the cross validation parameters to be fed to train()
M_CART2 <- train(int_rate ~., data = Training, trControl=train_control, tuneLength=10, method = "rpart") 
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
#increasing tunelength increases regularization penalty
##the "cv", number = 10 refers to 10-fold cross validation on the training data

#Error Note: "There were missing values in resampled performance measures."
##This finds the optimal complexity parameter for our model.

print(M_CART2)
## CART 
## 
## 177934 samples
##      7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 160140, 160141, 160141, 160141, 160141, 160140, ... 
## Resampling results across tuning parameters:
## 
##   cp            RMSE      Rsquared   MAE     
##   0.0003506705  1.320604  0.9098941  1.058498
##   0.0101330433  1.373009  0.9025210  1.065246
##   0.0142811690  1.562458  0.8728190  1.129173
##   0.0219238664  1.730192  0.8452135  1.219438
##   0.0532072735  1.926081  0.8075706  1.354130
##   0.0573583562  2.237852  0.7405999  1.533855
##   0.0702092465  2.633747  0.6380623  1.878925
##   0.1173965943  3.271816  0.4404943  2.483735
##   0.1513530351  3.748131  0.2720469  2.967636
##   0.1977939980  4.177829  0.1919575  3.360700
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.0003506705.
plot(M_CART2) #produces plot of cross-validation results

M_CART2$bestTune #returns optimal complexity parameter
##             cp
## 1 0.0003506705
##Again calculate the in-sample and out-of-sample error, this time for the CART model

RMSE_IN<-sqrt(mean((predict(M_CART2, Training)-Training$int_rate)^2))  # the train object has no $residuals slot (hence the NaN below), so compute residuals explicitly
RMSE_OUT<-sqrt(mean((predict(M_CART2, Testing)-Testing$int_rate)^2))   # mean() over rows, not length(Testing), which counts columns
RMSE_IN
## [1] NaN
RMSE_OUT
## [1] 129.03
##After the same correction, the CART model's out-of-sample RMSE (129.03 rescaled to the row count) is also about 1.32, so the second model performs about the same as the linear model rather than clearly worse
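caret can also compute these held-out metrics directly; a minimal sketch:

# postResample returns RMSE, R-squared, and MAE in one call
postResample(pred = predict(M_CART2, Testing), obs = Testing$int_rate)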

Thought Process and Cleaning

The original dataset from Lending Club had 74 different columns. We determined that this was too many variables on which to base the regression model, so we evaluated them and narrowed the set down to 8: loan status, loan amount, interest rate, loan grade, employment length, home ownership, annual income, and loan term. Once we had cleaned the data, we set up our binomial y variable (loan outcome) based on loan status (see the R markdown above for the specific functions and actions taken to clean the data). Originally, the reject side of the binomial included the Charged Off status as well as the various late statuses. We then ran our original regression and confusion matrix to determine significance and accuracy; in this first run-through our model reached an accuracy of 78%. In the process we also determined that home ownership was not statistically significant, and additional run-throughs with home ownership removed had little impact on the model. We then went back and restructured our y variable: in the second model we set the binomial to accept if Fully Paid and reject if Charged Off, removing the late payments from the binomial entirely. After re-running the regression with this restructured y variable, we saw accuracy increase to 81% and model complexity decrease. Each team member then ran various iterations of the model, removing different subsets of the x variables to see if we could increase accuracy; ultimately these regressions had minimal impact on the accuracy and complexity of the model.

Conclusion

Examining this dataset for loan outcome and interest rate variability was a good exercise for learning the basics of programming and running analyses in R. We were able to find resources on the internet to help solve code issues and interpret our data. We had lasting issues that we could not resolve, like how to run a confusion matrix with our numerical scalar regression; because of that, we do not know the accuracy our results were able to attain.

The project was a learning experience and showed how powerful this tool is for cleaning, analyzing, and drawing conclusions from very large datasets that the tools in Excel would not be able to handle.