DATA622 Assignment 1 (Large and Small Dataset Machine Learning methods and analysis)

Author

Banu & Lucas

Published

October 10, 2024

Code
#Load Packages
library(arrow)
library(dplyr)
library(ggplot2)
library(reshape2)
library(tidymodels)
library(tidyr)
library(broom)
library(patchwork)
library(lubridate)
library(rpart.plot)
library(kableExtra)
library(DataExplorer)
library(skimr)
#install.packages("ranger")
#install.packages("tidymodels")
library(DescTools)
library(corrplot)
library(ggcorrplot)
library(caret)
library(rpart)
library(car)
library(rattle)
library(ROSE)
library(mice)
library(MASS)
library(fpp3)
library(pROC)

Project

Deliverable

  1. Explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results.

    1. Are the columns of your data correlated?

    2. Are there labels in your data? Did that impact your choice of algorithm?

    3. What are the pros and cons of each algorithm you selected?

    4. How does your choice of algorithm relate to the datasets (was your choice of algorithm impacted by the datasets you chose)?

    5. Which result will you trust if you need to make a business decision?

    6. Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?

    7. How does the analysis between data sets compare?

  2. Essay (minimum 500-word document)
    Write a short essay explaining your selection of algorithms, how they relate to the data, and what you are trying to do.

Essay

Our goal was to understand and work with financial datasets. In line with the goal of this project we selected two large datasets (Credit Card Fraud and LendingClub Loan). Both classification and regression methods (supervised learning) are good approaches for our datasets since labeled samples are available.

LendingClub Loan (Fully Paid or Charged Off) Prediction (Large Volume - 300,000+):

The LendingClub loan dataset is provided by LendingClub, a peer-to-peer lending firm in San Francisco. Lending to risky customers can lead to higher credit losses. This dataset is a good fit because it has labeled data on loans that were “Fully Paid” or “Charged Off”. Additionally, it is rich in categorical predictor variables and lends itself well to exploratory data analysis as well as machine learning algorithms such as logistic regression, random forest, and decision tree models.

https://www.kaggle.com/datasets/jeandedieunyandwi/lending-club-dataset

Credit Card Fraud (Fraud/Not-Fraud) Prediction (Medium Volume - 200,000+): Fraud activity and customer behavior change rapidly, causing non-stationarity in the transaction data. Fraud represents a small fraction of daily transactions, so the data are heavily skewed towards genuine transactions, yielding a highly imbalanced dataset. Statistical methods such as logistic regression can be applied to fraud detection as a classification task; however, they are affected by the imbalance and can be biased towards predicting the majority class. We also explored decision trees, from which we can extract rules for predicting the class of a transaction, along with each rule’s coverage (the percentage of instances to which its condition applies) and its accuracy or confidence (the fraction of those instances it classifies correctly). We highlight our findings below through exploratory analysis, selection of models to train, and testing on a subset of the data.

https://data.world/vlad/credit-card-fraud-detection

Reference thesis with the above dataset: https://di.ulb.ac.be/map/adalpozz/pdf/Dalpozzolo2015PhD.pdf

Comparison between the two datasets:

While both datasets have a large number of records, we selected the LendingClub loan dataset as our large dataset for EDA and ML analysis and the Credit Card Fraud dataset as our medium dataset. The two datasets differ in that the credit card fraud dataset is time series data, whereas the LendingClub loan dataset is not. We therefore found the fraud dataset to be higher in complexity. For the fraud data we were able to train on a large sample (227,000+ records) and then apply the model to the test dataset.

The LendingClub dataset, on the other hand, has far more categorical variables than the credit card fraud dataset; we dive into this further in the exploratory data analysis. Additionally, due to the size of the dataset (300,000+ records), we subset the training data to loans with an issue date after 2015-01-01. Training on the larger number of samples proved challenging, so we decided to reduce the dataset to roughly 100,000 records and then split it into train and test sets. On this dataset we dealt with missing values, decided which variables to drop, reviewed correlated variables, and, during model training, ran into train/test split issues that required ensuring all factor levels are present in both the train and test datasets (see the sketch below).
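
As a hedged illustration (not the exact code used in this report; the column choice is only an example), one way to keep factor levels consistent is to set the levels of each test-set factor from the training set:

Code
# Sketch (hypothetical column choice): align the factor levels of a
# character/factor column in the test set with those seen in the training set.
train_levels <- levels(factor(loans.train$sub_grade))
loans.test$sub_grade <- factor(loans.test$sub_grade, levels = train_levels)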

Large Data Set (End to End ML Analysis)

Data set Introduction

The LendingClub loan data set consists of about 396,030 rows and 27 columns. Even though this dataset is large, we have carved out a subset of records with time since first credit line > 0 (a new engineered variable, calculated by subtracting the earliest credit line date from the issue date of the loan) and issue date > 2015-01-01, so that we model a smaller population.

Data Exploration & Plots

Several variables are character types, and we will convert those to factors prior to the train and test split of the data. The dataset contains missing values for 3 of the variables. We will also consider dropping certain variables such as address. Imputation of the missing values is handled during the data pre-processing step below. Additionally, because of the large volume, we converted the data to a Parquet file and stored it on GitHub.
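
As a hedged sketch (the CSV file name below is assumed, not the one actually used), the conversion to Parquet can be done with the arrow package loaded above:

Code
# Sketch: convert the raw CSV (hypothetical file name) to Parquet with arrow,
# which keeps the file compact enough to store on and read from GitHub.
loans_raw <- read.csv("lending_club_loan.csv")
write_parquet(loans_raw, "loandata.parquet")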

Code
path1 = "https://github.com/BanuB/Card_Transaction_Fraud/raw/refs/heads/master/loandata.parquet" 

inputf1 = read_parquet(path1)

#Glimpse variables
introduce(inputf1)
# A tibble: 1 × 9
    rows columns discrete_columns continuous_columns all_missing_columns
   <int>   <int>            <int>              <int>               <int>
1 396030      27               15                 12                   0
# ℹ 4 more variables: total_missing_values <int>, complete_rows <int>,
#   total_observations <int>, memory_usage <dbl>
Code
plot_intro(inputf1)

Code
plot_missing(inputf1)

Code
glimpse(inputf1)
Rows: 396,030
Columns: 27
$ loan_amnt            <dbl> 10000, 8000, 15600, 7200, 24375, 20000, 18000, 13…
$ term                 <chr> " 36 months", " 36 months", " 36 months", " 36 mo…
$ int_rate             <dbl> 11.44, 11.99, 10.49, 6.49, 17.27, 13.33, 5.32, 11…
$ installment          <dbl> 329.48, 265.68, 506.97, 220.65, 609.33, 677.07, 5…
$ grade                <chr> "B", "B", "B", "A", "C", "C", "A", "B", "B", "C",…
$ sub_grade            <chr> "B4", "B5", "B3", "A2", "C5", "C3", "A1", "B2", "…
$ emp_title            <chr> "Marketing", "Credit analyst ", "Statistician", "…
$ emp_length           <chr> "10+ years", "4 years", "< 1 year", "6 years", "9…
$ home_ownership       <chr> "RENT", "MORTGAGE", "RENT", "RENT", "MORTGAGE", "…
$ annual_inc           <dbl> 117000, 65000, 43057, 54000, 55000, 86788, 125000…
$ verification_status  <chr> "Not Verified", "Not Verified", "Source Verified"…
$ issue_d              <chr> "Jan-2015", "Jan-2015", "Jan-2015", "Nov-2014", "…
$ loan_status          <chr> "Fully Paid", "Fully Paid", "Fully Paid", "Fully …
$ purpose              <chr> "vacation", "debt_consolidation", "credit_card", …
$ title                <chr> "Vacation", "Debt consolidation", "Credit card re…
$ dti                  <dbl> 26.24, 22.05, 12.79, 2.60, 33.95, 16.31, 1.36, 26…
$ earliest_cr_line     <chr> "Jun-1990", "Jul-2004", "Aug-2007", "Sep-2006", "…
$ open_acc             <dbl> 16, 17, 13, 6, 13, 8, 8, 11, 13, 13, 5, 30, 13, 1…
$ pub_rec              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ revol_bal            <dbl> 36369, 20131, 11987, 5472, 24584, 25757, 4178, 13…
$ revol_util           <dbl> 41.8, 53.3, 92.2, 21.5, 69.8, 100.6, 4.9, 64.5, 3…
$ total_acc            <dbl> 25, 27, 26, 13, 43, 23, 25, 15, 40, 37, 26, 61, 3…
$ initial_list_status  <chr> "w", "f", "f", "f", "f", "f", "f", "f", "w", "f",…
$ application_type     <chr> "INDIVIDUAL", "INDIVIDUAL", "INDIVIDUAL", "INDIVI…
$ mort_acc             <dbl> 0, 3, 0, 0, 1, 4, 3, 0, 3, 1, 4, 4, 4, 2, 6, 4, 1…
$ pub_rec_bankruptcies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ address              <chr> "0174 Michelle Gateway\nMendozaberg, OK 22690", "…
Code
unique(inputf1$loan_status)
[1] "Fully Paid"  "Charged Off"
Code
skim(inputf1) %>% kable()
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character term 0 1.0000000 10 10 0 2 0 NA NA NA NA NA NA NA NA
character grade 0 1.0000000 1 1 0 7 0 NA NA NA NA NA NA NA NA
character sub_grade 0 1.0000000 2 2 0 35 0 NA NA NA NA NA NA NA NA
character emp_title 0 1.0000000 0 78 22927 173106 0 NA NA NA NA NA NA NA NA
character emp_length 0 1.0000000 0 9 18301 12 0 NA NA NA NA NA NA NA NA
character home_ownership 0 1.0000000 3 8 0 6 0 NA NA NA NA NA NA NA NA
character verification_status 0 1.0000000 8 15 0 3 0 NA NA NA NA NA NA NA NA
character issue_d 0 1.0000000 8 8 0 115 0 NA NA NA NA NA NA NA NA
character loan_status 0 1.0000000 10 11 0 2 0 NA NA NA NA NA NA NA NA
character purpose 0 1.0000000 3 18 0 14 0 NA NA NA NA NA NA NA NA
character title 0 1.0000000 0 80 1755 48818 0 NA NA NA NA NA NA NA NA
character earliest_cr_line 0 1.0000000 8 8 0 684 0 NA NA NA NA NA NA NA NA
character initial_list_status 0 1.0000000 1 1 0 2 0 NA NA NA NA NA NA NA NA
character application_type 0 1.0000000 5 10 0 3 0 NA NA NA NA NA NA NA NA
character address 0 1.0000000 19 68 0 393700 0 NA NA NA NA NA NA NA NA
numeric loan_amnt 0 1.0000000 NA NA NA NA NA 1.411389e+04 8.357441e+03 500.00 8000.00 12000.00 20000.00 40000.00 ▆▇▅▂▁
numeric int_rate 0 1.0000000 NA NA NA NA NA 1.363940e+01 4.472157e+00 5.32 10.49 13.33 16.49 30.99 ▅▇▅▁▁
numeric installment 0 1.0000000 NA NA NA NA NA 4.318497e+02 2.507278e+02 16.08 250.33 375.43 567.30 1533.81 ▇▇▃▁▁
numeric annual_inc 0 1.0000000 NA NA NA NA NA 7.420318e+04 6.163762e+04 0.00 45000.00 64000.00 90000.00 8706582.00 ▇▁▁▁▁
numeric dti 0 1.0000000 NA NA NA NA NA 1.737951e+01 1.801909e+01 0.00 11.28 16.91 22.98 9999.00 ▇▁▁▁▁
numeric open_acc 0 1.0000000 NA NA NA NA NA 1.131115e+01 5.137649e+00 0.00 8.00 10.00 14.00 90.00 ▇▁▁▁▁
numeric pub_rec 0 1.0000000 NA NA NA NA NA 1.781910e-01 5.306706e-01 0.00 0.00 0.00 0.00 86.00 ▇▁▁▁▁
numeric revol_bal 0 1.0000000 NA NA NA NA NA 1.584454e+04 2.059184e+04 0.00 6025.00 11181.00 19620.00 1743266.00 ▇▁▁▁▁
numeric revol_util 276 0.9993031 NA NA NA NA NA 5.379175e+01 2.445219e+01 0.00 35.80 54.80 72.90 892.30 ▇▁▁▁▁
numeric total_acc 0 1.0000000 NA NA NA NA NA 2.541474e+01 1.188699e+01 2.00 17.00 24.00 32.00 151.00 ▇▃▁▁▁
numeric mort_acc 37795 0.9045653 NA NA NA NA NA 1.813991e+00 2.147931e+00 0.00 0.00 1.00 3.00 34.00 ▇▁▁▁▁
numeric pub_rec_bankruptcies 535 0.9986491 NA NA NA NA NA 1.216476e-01 3.561743e-01 0.00 0.00 0.00 0.00 8.00 ▇▁▁▁▁
Code
sapply(inputf1, function(x) sum(is.na(x))) %>% kable()
x
loan_amnt 0
term 0
int_rate 0
installment 0
grade 0
sub_grade 0
emp_title 0
emp_length 0
home_ownership 0
annual_inc 0
verification_status 0
issue_d 0
loan_status 0
purpose 0
title 0
dti 0
earliest_cr_line 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 276
total_acc 0
initial_list_status 0
application_type 0
mort_acc 37795
pub_rec_bankruptcies 535
address 0

Correlation Analysis

The correlation plot reveals that some variables are intercorrelated, which is not ideal and can cause unreliable regression estimates. Steps we can take include combining or dropping certain variables; we will consider these later on. For instance, total_acc and mort_acc are positively correlated below. Similarly, loan_amnt and installment have a positive correlation.

Code
# Select numeric columns only
 numeric_data<- inputf1[sapply(inputf1, is.numeric)]
M<- cor(numeric_data,use="complete.obs")
 # M %>% kable() %>%
 #  kable_styling()

ggcorrplot(M, type = "upper", outline.color = "white",
           ggtheme = theme_classic,
           #colors = c("#6D9EC1", "white", "#E46726"),
           lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)

Code
# Calculate the correlation matrix
correlation_matrix <- cor(numeric_data, use="complete.obs")
kable(correlation_matrix)
loan_amnt int_rate installment annual_inc dti open_acc pub_rec revol_bal revol_util total_acc mort_acc pub_rec_bankruptcies
loan_amnt 1.0000000 0.1467352 0.9552040 0.3424394 0.0080215 0.1898158 -0.0887822 0.3273355 0.0984388 0.2137557 0.2223815 -0.1193732
int_rate 0.1467352 1.0000000 0.1401521 -0.0714182 0.0716342 -0.0037783 0.0516314 -0.0229408 0.2733073 -0.0485737 -0.0826559 0.0485709
installment 0.9552040 0.1401521 1.0000000 0.3355009 0.0059081 0.1770997 -0.0801857 0.3148570 0.1206656 0.1909353 0.1937519 -0.1126934
annual_inc 0.3424394 -0.0714182 0.3355009 1.0000000 -0.0839820 0.1327355 -0.0161510 0.3016988 0.0278276 0.1877229 0.2362765 -0.0550641
dti 0.0080215 0.0716342 0.0059081 -0.0839820 1.0000000 0.1253677 -0.0234581 0.0571616 0.0795856 0.0934558 -0.0254013 -0.0201805
open_acc 0.1898158 -0.0037783 0.1770997 0.1327355 0.1253677 1.0000000 -0.0297351 0.2144474 -0.1446031 0.6777667 0.1094396 -0.0393651
pub_rec -0.0887822 0.0516314 -0.0801857 -0.0161510 -0.0234581 -0.0297351 1.0000000 -0.1069308 -0.0893496 0.0143639 0.0115758 0.6946356
revol_bal 0.3273355 -0.0229408 0.3148570 0.3016988 0.0571616 0.2144474 -0.1069308 1.0000000 0.2204292 0.1808765 0.1950629 -0.1311621
revol_util 0.0984388 0.2733073 0.1206656 0.0278276 0.0795856 -0.1446031 -0.0893496 0.2204292 1.0000000 -0.1141447 0.0075141 -0.1020037
total_acc 0.2137557 -0.0485737 0.1909353 0.1877229 0.0934558 0.6777667 0.0143639 0.1808765 -0.1141447 1.0000000 0.3812052 0.0384642
mort_acc 0.2223815 -0.0826559 0.1937519 0.2362765 -0.0254013 0.1094396 0.0115758 0.1950629 0.0075141 0.3812052 1.0000000 0.0272727
pub_rec_bankruptcies -0.1193732 0.0485709 -0.1126934 -0.0550641 -0.0201805 -0.0393651 0.6946356 -0.1311621 -0.1020037 0.0384642 0.0272727 1.0000000
Code
corrplot(correlation_matrix, method="circle")
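
As a follow-up we did not run here, caret’s findCorrelation() can flag predictors whose pairwise correlation exceeds a chosen cutoff (0.75 below is an assumed value) as candidates to drop or combine:

Code
# Sketch (not run in this report): identify numeric predictors with pairwise
# correlation above 0.75, e.g. the loan_amnt/installment pair seen above.
high_corr <- caret::findCorrelation(correlation_matrix, cutoff = 0.75, names = TRUE)
high_corr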

Review Distributions

We review distributions of the factor variables across loan statuses, starting with loan grade. Fully paid loans tend to be at lower interest rates, while charged-off loans have a more even distribution, tending towards mid-tier interest rates.

Code
loan_stat_df <- subset(inputf1, !is.na(inputf1$loan_status)) %>% group_by(loan_status) %>% 
    summarise(Number = n())
loan_stat_df
# A tibble: 2 × 2
  loan_status Number
  <chr>        <int>
1 Charged Off  77673
2 Fully Paid  318357
Code
table(inputf1$loan_status, inputf1$grade)
             
                   A      B      C      D      E      F      G
  Charged Off   4036  14587  22449  18338  11765   5037   1461
  Fully Paid   60151 101431  83538  45186  19723   6735   1593
Code
ggplot(inputf1, aes(x = int_rate))+ geom_histogram(aes(fill = grade)) + facet_wrap(~loan_status, ncol = 1)

Some additional EDA plots below show broader patterns within the data. Loan grades E, F, and G have higher interest rates, and D, E, F, and G grade loans also show a high number of outliers. The density plots show that few of the distributions follow a normal bell curve, and several are right-skewed. Loan amount has a positive skew of 0.78, and the median loan amount is about $12,000. There are a good number of outliers with higher loan amounts that do not appear to be verified. Additionally, up to 59.2% of the records have purpose debt consolidation.
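
The skew, median, and purpose-share figures quoted above can be checked directly; a quick sketch using DescTools (loaded above):

Code
# Quick check of the figures quoted above: skewness and median of loan amount,
# and the share of debt-consolidation loans.
Skew(inputf1$loan_amnt)                         # ~0.78
median(inputf1$loan_amnt)                       # ~12000
mean(inputf1$purpose == "debt_consolidation")   # ~0.592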

Code
#Loan amt distribution
ggplot(data=inputf1, aes(loan_amnt, fill=loan_status))+geom_histogram(bins = 40,color="blue")

Code
str(inputf1)
tibble [396,030 × 27] (S3: tbl_df/tbl/data.frame)
 $ loan_amnt           : num [1:396030] 10000 8000 15600 7200 24375 ...
 $ term                : chr [1:396030] " 36 months" " 36 months" " 36 months" " 36 months" ...
 $ int_rate            : num [1:396030] 11.44 11.99 10.49 6.49 17.27 ...
 $ installment         : num [1:396030] 329 266 507 221 609 ...
 $ grade               : chr [1:396030] "B" "B" "B" "A" ...
 $ sub_grade           : chr [1:396030] "B4" "B5" "B3" "A2" ...
 $ emp_title           : chr [1:396030] "Marketing" "Credit analyst " "Statistician" "Client Advocate" ...
 $ emp_length          : chr [1:396030] "10+ years" "4 years" "< 1 year" "6 years" ...
 $ home_ownership      : chr [1:396030] "RENT" "MORTGAGE" "RENT" "RENT" ...
 $ annual_inc          : num [1:396030] 117000 65000 43057 54000 55000 ...
 $ verification_status : chr [1:396030] "Not Verified" "Not Verified" "Source Verified" "Not Verified" ...
 $ issue_d             : chr [1:396030] "Jan-2015" "Jan-2015" "Jan-2015" "Nov-2014" ...
 $ loan_status         : chr [1:396030] "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
 $ purpose             : chr [1:396030] "vacation" "debt_consolidation" "credit_card" "credit_card" ...
 $ title               : chr [1:396030] "Vacation" "Debt consolidation" "Credit card refinancing" "Credit card refinancing" ...
 $ dti                 : num [1:396030] 26.2 22.1 12.8 2.6 34 ...
 $ earliest_cr_line    : chr [1:396030] "Jun-1990" "Jul-2004" "Aug-2007" "Sep-2006" ...
 $ open_acc            : num [1:396030] 16 17 13 6 13 8 8 11 13 13 ...
 $ pub_rec             : num [1:396030] 0 0 0 0 0 0 0 0 0 0 ...
 $ revol_bal           : num [1:396030] 36369 20131 11987 5472 24584 ...
 $ revol_util          : num [1:396030] 41.8 53.3 92.2 21.5 69.8 ...
 $ total_acc           : num [1:396030] 25 27 26 13 43 23 25 15 40 37 ...
 $ initial_list_status : chr [1:396030] "w" "f" "f" "f" ...
 $ application_type    : chr [1:396030] "INDIVIDUAL" "INDIVIDUAL" "INDIVIDUAL" "INDIVIDUAL" ...
 $ mort_acc            : num [1:396030] 0 3 0 0 1 4 3 0 3 1 ...
 $ pub_rec_bankruptcies: num [1:396030] 0 0 0 0 0 0 0 0 0 0 ...
 $ address             : chr [1:396030] "0174 Michelle Gateway\nMendozaberg, OK 22690" "1076 Carney Fort Apt. 347\nLoganmouth, SD 05113" "87025 Mark Dale Apt. 269\nNew Sabrina, WV 05113" "823 Reid Ford\nDelacruzside, MA 00813" ...
Code
ggplot(inputf1, aes(x=grade, y=loan_amnt, fill=grade)) +
  stat_summary(fun="sum", geom="bar") +
  labs(y ="Total Loan Amount",title="Total loan amount based on loan grade")

Code
ggplot(data=inputf1, aes(grade,int_rate,fill=grade))+geom_boxplot(outlier.color = "blue")+labs(title="Box plot of Interest rate")

Code
plot_density(inputf1)

Code
#Distribution of loan amount and purpose
Desc(inputf1$loan_amnt, main = "Loan amount distribution", plotit = TRUE)
────────────────────────────────────────────────────────────────────────────── 
Loan amount distribution

     length         n       NAs     unique         0s       mean     meanCI'
    396'030   396'030         0      1'397          0  14'113.89  14'087.86
               100.0%      0.0%                  0.0%             14'139.92
                                                                           
        .05       .10       .25     median        .75        .90        .95
   3'250.00  5'000.00  8'000.00  12'000.00  20'000.00  26'000.00  30'975.00
                                                                           
      range        sd     vcoef        mad        IQR       skew       kurt
  39'500.00  8'357.44      0.59   8'154.30  12'000.00       0.78      -0.06
                                                                           
lowest : 500.0 (4), 700.0, 725.0, 750.0, 800.0
highest: 39'475.0, 39'500.0, 39'600.0, 39'700.0, 40'000.0 (180)

heap(?): remarkable frequency (7.0%) for the mode(s) (= 10000)

' 95%-CI (classic)

Code
Desc(inputf1$purpose, main = "Loan purposes", plotit = TRUE)
────────────────────────────────────────────────────────────────────────────── 
Loan purposes

   length       n     NAs  unique  levels   dupes
  396'030 396'030       0      14      14       y
           100.0%    0.0%                        

                 level     freq   perc  cumfreq  cumperc
1   debt_consolidation  234'507  59.2%  234'507    59.2%
2          credit_card   83'019  21.0%  317'526    80.2%
3     home_improvement   24'030   6.1%  341'556    86.2%
4                other   21'185   5.3%  362'741    91.6%
5       major_purchase    8'790   2.2%  371'531    93.8%
6       small_business    5'701   1.4%  377'232    95.3%
7                  car    4'697   1.2%  381'929    96.4%
8              medical    4'196   1.1%  386'125    97.5%
9               moving    2'854   0.7%  388'979    98.2%
10            vacation    2'452   0.6%  391'431    98.8%
11               house    2'201   0.6%  393'632    99.4%
12             wedding    1'812   0.5%  395'444    99.9%
... etc.
 [list output truncated]

Code
#Distribution by verification status
inputf1 %>% group_by(verification_status) %>% summarise(mean(loan_amnt), var(loan_amnt), mean(int_rate),mean(annual_inc))
# A tibble: 3 × 5
  verification_status `mean(loan_amnt)` `var(loan_amnt)` `mean(int_rate)`
  <chr>                           <dbl>            <dbl>            <dbl>
1 Not Verified                   10415.        32524348.             12.2
2 Source Verified                14765.        67211278.             13.7
3 Verified                       16816.        85819043.             14.8
# ℹ 1 more variable: `mean(annual_inc)` <dbl>
Code
ggplot(data = inputf1,aes(x = verification_status, y = loan_amnt))+geom_boxplot()

Code
ggplot(data=inputf1,aes(loan_amnt, fill=grade))+
  geom_density(alpha=0.25) + 
  facet_grid(grade ~ .)

Code
#Distribution of loan status and grade
table(inputf1$loan_status, inputf1$grade)
             
                   A      B      C      D      E      F      G
  Charged Off   4036  14587  22449  18338  11765   5037   1461
  Fully Paid   60151 101431  83538  45186  19723   6735   1593
Code
ggplot(inputf1, aes(x = int_rate))+ geom_histogram(aes(fill = grade)) + facet_wrap(~loan_status, ncol = 1)

Data Preparation (Imputation and Feature Engineering)

mort_acc has the most missing values (about 10% of records) and is positively correlated with total_acc, so we considered dropping this variable instead of imputing it. Additionally, address, title, and emp_title have many levels, and during our train/test process these variables caused issues with factor levels missing in the test dataset. We imputed the two remaining variables with missing values: pub_rec_bankruptcies (535 missing, imputed with the median) and revol_util (276 missing, imputed with the mean).

Therefore, we have chosen to drop these variables. Additionally, our random forest did not run successfully on the large training dataset when we tried an 80/20 split, so we chose to subset to loans issued after 2015 (the subset of records we use contains 110,647 rows).

We also created one additional feature variable, “time since first credit line”, to see whether it has any impact on the response variable.
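
Although we drop mort_acc below, an alternative (not run in this report) would have been model-based imputation; a minimal sketch using the mice package loaded above, with an assumed small set of predictor columns:

Code
# Sketch (not run): impute mort_acc with predictive mean matching instead of
# dropping it, using a few related numeric columns as predictors.
imp_cols <- c("mort_acc", "total_acc", "open_acc", "annual_inc")
imp <- mice(inputf1[, imp_cols], m = 1, method = "pmm", seed = 2024, printFlag = FALSE)
mort_acc_imputed <- mice::complete(imp)$mort_acc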

Code
# drop address, mort_acc
library(dplyr)
library(gridExtra)
library(grid)
inputf1_new<- dplyr::select(inputf1,-c(mort_acc,address, title, emp_title))
#glimpse(inputf1_new)


inputf1_new %>% 
  gather(variable, value) %>%
  filter(is.na(value)) %>%
  group_by(variable) %>%
  tally() %>%
  # dplyr::mutate(percent = (n / nrow(df)) * 100) %>%
  # dplyr::mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
  # arrange(desc(n)) %>%
  rename(`Variable Missing Data` = variable,
         `Number of Records` = n) %>%
      #   `Share of Total` = percent) %>%
  kable() %>%
  kable_styling()
Variable Missing Data Number of Records
pub_rec_bankruptcies 535
revol_util 276
Code
#impute with median
#unique(inputf1_new$pub_rec_bankruptcies)
med1 <- round(median(inputf1_new$pub_rec_bankruptcies, na.rm = TRUE),0)
inputf1_new[is.na(inputf1_new[,"pub_rec_bankruptcies"]), "pub_rec_bankruptcies"] <- med1

#impute with mean
mean2 <- round(mean(inputf1_new$revol_util, na.rm = TRUE),0)
inputf1_new[is.na(inputf1_new[,"revol_util"]), "revol_util"] <- mean2
#unique(inputf1_new$revol_util)

#create new variables
inputf1_new$issue_d <- as.character(inputf1_new$issue_d)
inputf1_new$issue_d <- paste(inputf1_new$issue_d, "-01", sep = "")
inputf1_new$issue_d <- parse_date_time(inputf1_new$issue_d, "myd")

inputf1_new$earliest_cr_line <- as.character(inputf1_new$earliest_cr_line)
inputf1_new$earliest_cr_line <- paste(inputf1_new$earliest_cr_line, "-01", sep = "")
inputf1_new$earliest_cr_line <- parse_date_time(inputf1_new$earliest_cr_line, "myd")

inputf1_new$time_since_fcline <- inputf1_new$issue_d - inputf1_new$earliest_cr_line
inputf1_new$time_since_fcline <- as.numeric(inputf1_new$time_since_fcline)

inputf1_new2 <- inputf1_new %>% filter(time_since_fcline > 0 & issue_d > c("2015-01-01 UTC") )
head(inputf1_new2$time_since_fcline)
[1] 3895 3683 7549 6664 6969 5844
Code
loan_stat_df1 <- subset(inputf1_new2, !is.na(inputf1_new2$loan_status)) %>% group_by(loan_status) %>% 
    summarise(Number = n())
loan_stat_df1
# A tibble: 2 × 2
  loan_status Number
  <chr>        <int>
1 Charged Off  24211
2 Fully Paid   86436
Code
ggplot(data = inputf1_new2 , aes(loan_status)) + geom_bar(position = "dodge") + 
    labs(x = "Loan Status", title = "Distribution of Loan Status on our sample population of issue date after 2015-01-01") + theme(axis.text.x = element_text(angle = 90, 
    hjust = 1))

Code
plot_missing(inputf1_new2)

Code
p1 <- ggplot(data = inputf1_new2, aes(loan_amnt, color = grade)) + geom_histogram(binwidth = 1000) +     facet_grid(grade ~ .)
p2 <- ggplot(data = inputf1_new2, aes(loan_amnt, color = grade, fill = grade)) + geom_density() + 
    facet_grid(grade ~ .)

grid.arrange(p1, p2, ncol = 2)

Code
ggplot(data = subset(inputf1_new2, !home_ownership %in% c("ANY", "NONE", "OTHER")), aes(y = home_ownership, 
    purpose)) + geom_count(color = "Navy") + theme(axis.text.x = element_text(angle = 90, 
    hjust = 1))

Code
ggplot(data = subset(inputf1_new2, (!is.na(home_ownership) & (!home_ownership %in% c("ANY", 
    "NONE", "OTHER")))), aes(home_ownership, fill = grade, color = grade)) + 
    geom_bar() + labs(title = "Distribution of Home Ownership by Loan Grade")

Code
inputf1_new2 %>% filter(!is.na(purpose)) %>% group_by(purpose) %>% summarise(mean_annual_inc = mean(annual_inc), 
    mean_amnt_loan = mean(loan_amnt), n = n()) %>% ungroup() %>% arrange(desc(n))
# A tibble: 14 × 4
   purpose            mean_annual_inc mean_amnt_loan     n
   <chr>                        <dbl>          <dbl> <int>
 1 debt_consolidation          76042.         15477. 66782
 2 credit_card                 78940.         15553. 21673
 3 home_improvement            95932.         15217.  7839
 4 other                       74409.         10332.  6211
 5 major_purchase              80800.         13112.  2400
 6 medical                     71286.          9173.  1243
 7 car                         68790.          9460.  1104
 8 small_business              96188.         16688.  1036
 9 moving                      71603.          8825.   868
10 vacation                    69324.          6738.   766
11 house                       81944.         15620.   638
12 renewable_energy            69735.          8773.    84
13 wedding                     79000          13350      2
14 educational                 64500           2200      1
Code
ggplot(data = subset(inputf1_new2, !is.na(purpose)), aes(purpose, fill = loan_status, 
    color = loan_status)) + geom_bar() + labs(title = "Distribution of Loan purpose") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Code
ggplot(data = subset(inputf1_new2, !is.na(verification_status)), aes(verification_status, 
    fill = loan_status, color = loan_status)) + geom_bar(position = "fill") + 
    labs(title = "Distribution of verification status by Loan Status") + scale_y_continuous(labels = percent_format())

Below are scatter plots in an attempt to identify trends between seemingly related numeric variables.

This plot suggests that the 60-month loans tend to have higher interest rates and larger loan amounts (the top right corner is dominated by blue points).

Code
set.seed(2024)

#dim(inputf1_new2)
#p<-table(inputf1_new2$loan_status, inputf1_new2$int_rate)
#ggplot(inputf1_new2, aes(x = int_rate))+ geom_histogram() + facet_wrap(~loan_status, ncol = 1)

#Distribution of loan status and grade
#table(inputf1_new2$loan_status, inputf1_new2$grade)
#ggplot(inputf1_new2, aes(x = int_rate))+ geom_histogram(aes(fill = grade)) + facet_wrap(~loan_status, ncol = 1)


#Distribution of loan status and term
table(inputf1_new2$loan_status, inputf1_new2$term)
             
               36 months  60 months
  Charged Off      14414       9797
  Fully Paid       64470      21966
Code
index = createDataPartition(y = inputf1_new2$loan_status, p = 0.90)[[1]]
loans.sample <- inputf1_new2[-index,]
ggplot(loans.sample, aes(x = loan_amnt, y = int_rate)) + geom_point(aes(color = term))

Algorithm Selection/Build Models

Decision Tree Model and metrics review

For model selection, we partitioned the 110,647 records into a 70/30 train/test split. The accuracy of our basic decision tree model is 0.7835.

We show an example of a pruned tree below. Overly complex trees have high variance. The complexity parameter (cp) sets the minimum improvement in fit required before a split is justified; we grew the initial tree with cp = 0.0006 and then pruned it back using the cp value associated with the minimum cross-validated error (0.001435764). Additionally, we print the decision tree rules generated by the model for the pruned instance of the tree.

The sensitivity (true positive rate) and specificity (true negative rate) for the baseline model are given below. Sensitivity evaluates a model’s ability to identify true positives of each available category; specificity evaluates its ability to identify true negatives. The sensitivity of this model is very low, which implies that the model did not classify the “Charged Off” loans accurately. Sensitivity: 0.10973, Specificity: 0.97219.

Since our training dataset is highly imbalanced, we cannot rely on the accuracy measure alone and instead use the ROC curve; the area under the curve (AUC) is 0.690. We can try over-sampling, under-sampling, or a combination of both to balance the classes before running the model and avoid class imbalance issues.

For reference, the classification categories of the confusion matrix are given below.

True Positive (TP): an instance that is positive and is classified correctly as positive. True Negative (TN): an instance that is negative and is classified correctly as negative. False Positive (FP): an instance that is negative but is classified incorrectly as positive. False Negative (FN): an instance that is positive but is classified incorrectly as negative.
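
To make these definitions concrete, the reported sensitivity and specificity can be recomputed by hand from the baseline confusion matrix shown further below:

Code
# Hand check of the baseline metrics ("Charged Off" is the positive class);
# counts are taken from the confusion matrix printed below.
TP <- 797; FN <- 6466; TN <- 25209; FP <- 721
TP / (TP + FN)                    # sensitivity ~ 0.10973
TN / (TN + FP)                    # specificity ~ 0.97219
(TP + TN) / (TP + TN + FP + FN)   # accuracy ~ 0.7835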

Code
#install.packages("RGtk2")

library("rattle")
library(rpart.plot)
library(rpart)
#install.packages("vip")
library(vip)


set.seed(2024)
index = createDataPartition(y = inputf1_new2$loan_status, p = 0.7)[[1]]
loans.test <- inputf1_new2[-index,]
loans.train <- inputf1_new2[index,]


loans.rpart.1 <- rpart(loan_status ~ . , data = loans.train, 
                      control=rpart.control(minsplit=10, minbucket = 3, cp=0.0006))

fancyRpartPlot(loans.rpart.1)

Code
predictions.1 <- (predict(loans.rpart.1, loans.test , type = "class")) 
p1<-confusionMatrix(predictions.1,as.factor(loans.test$loan_status))
roc.curve(loans.test$loan_status, predict(loans.rpart.1, loans.test, type = "prob")[,1], plot = TRUE)

Area under the curve (AUC): 0.690
Code
p1
Confusion Matrix and Statistics

             Reference
Prediction    Charged Off Fully Paid
  Charged Off         797        721
  Fully Paid         6466      25209
                                         
               Accuracy : 0.7835         
                 95% CI : (0.779, 0.7879)
    No Information Rate : 0.7812         
    P-Value [Acc > NIR] : 0.1581         
                                         
                  Kappa : 0.1145         
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.10973        
            Specificity : 0.97219        
         Pos Pred Value : 0.52503        
         Neg Pred Value : 0.79586        
             Prevalence : 0.21881        
         Detection Rate : 0.02401        
   Detection Prevalence : 0.04573        
      Balanced Accuracy : 0.54096        
                                         
       'Positive' Class : Charged Off    
                                         
Code
#rpart.plot(loans.rpart.1, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto")
rules<- rpart.rules(loans.rpart.1)
head(rules, 4) %>% kable() 
loan_status
17466 0.00 when sub_grade is E2 or E3 or E4 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G5 & issue_d < 1.4e+09 & dti is 20 to 30 & home_ownership is MORTGAGE or OWN & emp_length is or 10+ years or 2 years or 4 years or 5 years or 6 years or 7 years or 9 years & annual_inc < 142500 & revol_bal is 21644 to 22462 & purpose is debt_consolidation or home_improvement or house or major_purchase or medical or small_business & total_acc < 35
1090 0.14 when sub_grade is E2 or E3 or E4 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G5 & issue_d < 1.4e+09 & dti >= 20 & home_ownership is MORTGAGE or OWN & emp_length is or 10+ years or 2 years or 4 years or 5 years or 6 years or 7 years or 9 years & annual_inc >= 142500 & revol_bal < 22462
1060 0.20 when sub_grade is E3 or E5 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G4 or G5 & issue_d < 1.5e+09 & dti < 26 & home_ownership is RENT & emp_length is or < 1 year or 1 year or 2 years or 3 years or 4 years & annual_inc < 96472 & purpose is car or credit_card or home_improvement or moving & loan_amnt < 14413
2154 0.24 when sub_grade is D3 or D4 or D5 or E1 or E2 or E4 or F1 & issue_d < 1.4e+09 & dti < 26 & home_ownership is RENT & emp_length is < 1 year or 1 year or 10+ years or 5 years or 7 years or 9 years & installment >= 145 & revol_util < 91 & time_since_fcline < 1705
Code
# Create a variable importance plot
var_importance <- vip::vip(loans.rpart.1, num_features = 30)
print(var_importance)

Code
plotcp(loans.rpart.1)

Code
costdt <- data.frame(printcp(loans.rpart.1))

Classification tree:
rpart(formula = loan_status ~ ., data = loans.train, control = rpart.control(minsplit = 10, 
    minbucket = 3, cp = 6e-04))

Variables actually used in tree construction:
 [1] annual_inc        dti               emp_length        home_ownership   
 [5] installment       issue_d           loan_amnt         open_acc         
 [9] purpose           revol_bal         revol_util        sub_grade        
[13] time_since_fcline total_acc        

Root node error: 16948/77454 = 0.21881

n= 77454 

           CP nsplit rel error  xerror      xstd
1  0.00308001      0   1.00000 1.00000 0.0067892
2  0.00271418      5   0.98460 0.98708 0.0067574
3  0.00171112      7   0.97917 0.98749 0.0067584
4  0.00143576      8   0.97746 0.98637 0.0067556
5  0.00135709     11   0.97315 0.98737 0.0067581
6  0.00118008     13   0.97044 0.98678 0.0067566
7  0.00106207     14   0.96926 0.98779 0.0067591
8  0.00082606     16   0.96713 0.98743 0.0067583
9  0.00073755     18   0.96548 0.98838 0.0067606
10 0.00070805     23   0.96177 0.98655 0.0067561
11 0.00067855     31   0.95545 0.98672 0.0067565
12 0.00064904     36   0.95191 0.98761 0.0067587
13 0.00060000     40   0.94932 0.98649 0.0067559
Code
min_err <- (costdt  %>% filter(nsplit > 1) %>% slice(which.min(xerror)))$CP
cat("Minimum Error: ", min_err)
Minimum Error:  0.001435764
Code
loan_prune <- rpart::prune(loans.rpart.1,min_err)
rpart.plot(loan_prune, box.col = c("pink", "palegreen3")[loans.rpart.1$frame$yval])

Code
loan_prune_pred <- predict(loan_prune, loans.test, type = "class")
#prunedcm <- confusionMatrix(loan_prune_pred, as.factor(loans.test$loan_Status))
# dt_pruned_cm$table

roc.curve(loans.test$loan_status, predict(loan_prune, loans.test, type = "prob")[,1], plot = TRUE)

Area under the curve (AUC): 0.689
Code
p1
Confusion Matrix and Statistics

             Reference
Prediction    Charged Off Fully Paid
  Charged Off         797        721
  Fully Paid         6466      25209
                                         
               Accuracy : 0.7835         
                 95% CI : (0.779, 0.7879)
    No Information Rate : 0.7812         
    P-Value [Acc > NIR] : 0.1581         
                                         
                  Kappa : 0.1145         
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.10973        
            Specificity : 0.97219        
         Pos Pred Value : 0.52503        
         Neg Pred Value : 0.79586        
             Prevalence : 0.21881        
         Detection Rate : 0.02401        
   Detection Prevalence : 0.04573        
      Balanced Accuracy : 0.54096        
                                         
       'Positive' Class : Charged Off    
                                         
Code
rules<- rpart.rules(loan_prune)
head(rules) %>% kable() 
loan_status
264 0.41 when sub_grade is E3 or E5 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G4 or G5 & issue_d < 1.5e+09 & home_ownership is RENT & dti < 26 & annual_inc < 96472 & loan_amnt >= 14413
32 0.44 when sub_grade is D3 or D4 or D5 or E1 or E2 or E3 or E4 or E5 or F1 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G4 or G5 & issue_d < 1.5e+09 & home_ownership is RENT & dti >= 26
265 0.53 when sub_grade is E3 or E5 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G4 or G5 & issue_d < 1.5e+09 & home_ownership is RENT & dti < 26 & annual_inc < 96472 & loan_amnt < 14413
67 0.60 when sub_grade is D3 or D4 or D5 or E1 or E2 or E4 or F1 & issue_d < 1.5e+09 & home_ownership is RENT & dti < 26
133 0.61 when sub_grade is E3 or E5 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G4 or G5 & issue_d < 1.5e+09 & home_ownership is RENT & dti < 26 & annual_inc >= 96472
17 0.64 when sub_grade is D3 or D4 or D5 or E1 or E2 or E3 or E4 or E5 or F1 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G4 or G5 & issue_d < 1.5e+09 & home_ownership is MORTGAGE or OWN

Oversampling to fix class imbalance and rerunning the decision tree model

We applied combined over- and under-sampling (the “both” method of ovun.sample) to the training dataset to see if we can fix the class imbalance. The resulting model shows a large jump in sensitivity; however, it misclassifies a large number of records as charged off when they were fully paid. Understanding why this happens is a follow-up item for this project.

Code
set.seed(2024)

glimpse(loans.train)
Rows: 77,454
Columns: 24
$ loan_amnt            <dbl> 20000, 18000, 35000, 20000, 5000, 4600, 21000, 67…
$ term                 <chr> " 36 months", " 36 months", " 60 months", " 36 mo…
$ int_rate             <dbl> 13.33, 5.32, 12.29, 6.97, 15.61, 12.29, 17.86, 21…
$ installment          <dbl> 677.07, 542.07, 783.70, 617.27, 174.83, 153.43, 5…
$ grade                <chr> "C", "A", "C", "A", "D", "C", "D", "D", "A", "D",…
$ sub_grade            <chr> "C3", "A1", "C1", "A3", "D1", "C1", "D5", "D5", "…
$ emp_length           <chr> "10+ years", "2 years", "10+ years", "7 years", "…
$ home_ownership       <chr> "MORTGAGE", "MORTGAGE", "MORTGAGE", "MORTGAGE", "…
$ annual_inc           <dbl> 86788, 125000, 157000, 85000, 75000, 29400, 80000…
$ verification_status  <chr> "Verified", "Source Verified", "Verified", "Not V…
$ issue_d              <dttm> 2015-09-01, 2015-09-01, 2015-04-01, 2016-03-01, …
$ loan_status          <chr> "Fully Paid", "Fully Paid", "Fully Paid", "Fully …
$ purpose              <chr> "debt_consolidation", "home_improvement", "debt_c…
$ dti                  <dbl> 16.31, 1.36, 29.39, 18.80, 13.58, 29.63, 15.67, 1…
$ earliest_cr_line     <dttm> 2005-01-01, 2005-08-01, 1997-01-01, 2000-03-01, …
$ open_acc             <dbl> 8, 8, 17, 15, 9, 12, 15, 15, 7, 18, 11, 10, 14, 2…
$ pub_rec              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ revol_bal            <dbl> 25757, 4178, 113091, 24195, 3120, 10112, 4189, 58…
$ revol_util           <dbl> 100.6, 4.9, 94.9, 55.7, 19.1, 65.7, 13.0, 14.7, 9…
$ total_acc            <dbl> 23, 25, 27, 38, 26, 17, 35, 21, 18, 23, 29, 23, 2…
$ initial_list_status  <chr> "f", "f", "w", "w", "f", "w", "w", "w", "w", "w",…
$ application_type     <chr> "INDIVIDUAL", "INDIVIDUAL", "INDIVIDUAL", "INDIVI…
$ pub_rec_bankruptcies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ time_since_fcline    <dbl> 3895, 3683, 6664, 5844, 2679, 2772, 5051, 2009, 7…
Code
glimpse(loans.test)
Rows: 33,193
Columns: 24
$ loan_amnt            <dbl> 7500, 11200, 20000, 12000, 15000, 8050, 35000, 84…
$ term                 <chr> " 36 months", " 60 months", " 36 months", " 60 mo…
$ int_rate             <dbl> 9.17, 12.29, 12.29, 8.18, 18.55, 16.55, 14.65, 6.…
$ installment          <dbl> 239.10, 250.79, 667.06, 244.36, 385.41, 285.21, 1…
$ grade                <chr> "B", "C", "C", "B", "E", "D", "C", "A", "C", "B",…
$ sub_grade            <chr> "B2", "C1", "C1", "B1", "E2", "D2", "C5", "A4", "…
$ emp_length           <chr> "7 years", "10+ years", "3 years", "7 years", "3 …
$ home_ownership       <chr> "OWN", "MORTGAGE", "RENT", "MORTGAGE", "RENT", "R…
$ annual_inc           <dbl> 55000, 81000, 95000, 60000, 50000, 29500, 95000, …
$ verification_status  <chr> "Not Verified", "Not Verified", "Source Verified"…
$ issue_d              <dttm> 2015-12-01, 2015-10-01, 2015-06-01, 2015-04-01, …
$ loan_status          <chr> "Fully Paid", "Fully Paid", "Fully Paid", "Charge…
$ purpose              <chr> "debt_consolidation", "debt_consolidation", "debt…
$ dti                  <dbl> 28.21, 12.87, 11.82, 20.34, 4.22, 28.23, 13.20, 6…
$ earliest_cr_line     <dttm> 1995-04-01, 1996-09-01, 2007-10-01, 2002-02-01, …
$ open_acc             <dbl> 13, 6, 11, 7, 6, 9, 10, 5, 14, 5, 13, 13, 4, 8, 1…
$ pub_rec              <dbl> 0, 1, 0, 0, 0, 0, 0, 2, 1, 0, 1, 1, 0, 0, 0, 1, 1…
$ revol_bal            <dbl> 17838, 5874, 16498, 21750, 8912, 3327, 15437, 602…
$ revol_util           <dbl> 54.9, 54.9, 54.1, 59.9, 47.9, 39.6, 42.1, 76.3, 7…
$ total_acc            <dbl> 35, 20, 17, 20, 23, 12, 25, 12, 19, 16, 18, 46, 2…
$ initial_list_status  <chr> "w", "w", "f", "w", "f", "w", "w", "f", "w", "f",…
$ application_type     <chr> "JOINT", "INDIVIDUAL", "INDIVIDUAL", "INDIVIDUAL"…
$ pub_rec_bankruptcies <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1…
$ time_since_fcline    <dbl> 7549, 6969, 2800, 4807, 4230, 4807, 6604, 4171, 8…
Code
loans.oversample <- ovun.sample(loan_status ~ ., data = loans.train, method = "both",N = 77454, seed = 13)$data
table(loans.oversample$loan_status)

Charged Off  Fully Paid 
      38861       38593 
Code
table(loans.train$loan_status)

Charged Off  Fully Paid 
      16948       60506 
Code
barplot(table(loans.train$loan_status) , col = 'lightblue')

Code
barplot(table(loans.oversample$loan_status) , col = 'lightblue')

Code
tune <- data.frame(0.001)
colnames(tune) <- "cp"
tr_control <- trainControl(method = "cv",number = 10, verboseIter = TRUE)
loans.over <- train(loan_status ~., data = loans.oversample, method = "rpart", trControl = tr_control, tuneGrid = tune, control=rpart.control(minsplit=10, minbucket = 3))
+ Fold01: cp=0.001 
- Fold01: cp=0.001 
+ Fold02: cp=0.001 
- Fold02: cp=0.001 
+ Fold03: cp=0.001 
- Fold03: cp=0.001 
+ Fold04: cp=0.001 
- Fold04: cp=0.001 
+ Fold05: cp=0.001 
- Fold05: cp=0.001 
+ Fold06: cp=0.001 
- Fold06: cp=0.001 
+ Fold07: cp=0.001 
- Fold07: cp=0.001 
+ Fold08: cp=0.001 
- Fold08: cp=0.001 
+ Fold09: cp=0.001 
- Fold09: cp=0.001 
+ Fold10: cp=0.001 
- Fold10: cp=0.001 
Aggregating results
Fitting final model on full training set
Code
fancyRpartPlot(loans.over$finalModel)

Code
confusionMatrix(predict(loans.over, loans.test), as.factor(loans.test$loan_status))
Confusion Matrix and Statistics

             Reference
Prediction    Charged Off Fully Paid
  Charged Off        5458      10640
  Fully Paid         1805      15290
                                          
               Accuracy : 0.6251          
                 95% CI : (0.6198, 0.6303)
    No Information Rate : 0.7812          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2373          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.7515          
            Specificity : 0.5897          
         Pos Pred Value : 0.3390          
         Neg Pred Value : 0.8944          
             Prevalence : 0.2188          
         Detection Rate : 0.1644          
   Detection Prevalence : 0.4850          
      Balanced Accuracy : 0.6706          
                                          
       'Positive' Class : Charged Off     
                                          

Logistic Regression (Train Data set)

Here we run the second full baseline logistic regression model, and then additionally run stepAIC on the over-sampled training data from the prior section and compare the metrics of both models. We hit several issues during model training because character variables had not been converted to factors; once that was fixed, we ran the model and stored the accuracy, sensitivity, and specificity values. The logistic regression model improved considerably after applying stepAIC, and all three metrics were the best so far: accuracy 0.673135, sensitivity 0.6847482, specificity 0.6614412 (see the sketch below for how these can be computed on a held-out test set).
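
After fitting the stepAIC model (loans.reg) in the code that follows, the reported metrics can be computed on a held-out test frame; a minimal sketch, assuming a test frame loans.test3 prepared exactly like loans.oversample3 (factorized columns, loan_outcome coded 0/1):

Code
# Sketch (assumed object name loans.test3): score the test frame and compute
# accuracy, sensitivity and specificity at a 0.5 probability cutoff.
pred_prob  <- predict(loans.reg, newdata = loans.test3, type = "response")
pred_class <- factor(ifelse(pred_prob > 0.5, 1, 0), levels = c(0, 1))
cm <- confusionMatrix(pred_class,
                      factor(loans.test3$loan_outcome, levels = c(0, 1)),
                      positive = "1")
cm$overall["Accuracy"]
cm$byClass[c("Sensitivity", "Specificity")]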

Code
loans.oversample1 <- loans.oversample %>% mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off' , 'Default') , 1, ifelse(loan_status == 'Fully Paid' , 0 , 'No info')))

loans.oversample1 <- loans.oversample1[, colnames(loans.oversample1)[colnames(loans.oversample1) != 'loan_status']]


loans.oversample1 <- loans.oversample1[, colnames(loans.oversample1)[colnames(loans.oversample1) != 'issue_d']]

loans.oversample1 <- loans.oversample1[, colnames(loans.oversample1)[colnames(loans.oversample1) != 'earliest_cr_line']]

factorize = function(column, df){
  #' Check if column is character and  turn to factor

  if(class(df[1,column]) == "character"){
    out = as.factor(df[,column])
  } else { # if it's numeric
    out = df[,column]
  }
  return(out)
}

# str(loans.oversample)
# str(loans.oversample1)
# class(loans.oversample1[1,"term"])
store.colnames = colnames(loans.oversample1)
loans.oversample3  = lapply(store.colnames, function(column) factorize(column, loans.oversample1))
loans.oversample3= as.data.frame(loans.oversample3 )
colnames(loans.oversample3)=store.colnames


full.reg <- glm(loan_outcome ~ ., data =loans.oversample3, family = "binomial")
loans.reg  <- stepAIC(full.reg, direction = "both")
Start:  AIC=93592.09
loan_outcome ~ loan_amnt + term + int_rate + installment + grade + 
    sub_grade + emp_length + home_ownership + annual_inc + verification_status + 
    purpose + dti + open_acc + pub_rec + revol_bal + revol_util + 
    total_acc + initial_list_status + application_type + pub_rec_bankruptcies + 
    time_since_fcline


Step:  AIC=93592.09
loan_outcome ~ loan_amnt + term + int_rate + installment + sub_grade + 
    emp_length + home_ownership + annual_inc + verification_status + 
    purpose + dti + open_acc + pub_rec + revol_bal + revol_util + 
    total_acc + initial_list_status + application_type + pub_rec_bankruptcies + 
    time_since_fcline

                       Df Deviance   AIC
- annual_inc            1    93435 93591
<none>                       93434 93592
- time_since_fcline     1    93437 93593
- pub_rec_bankruptcies  1    93439 93595
- verification_status   2    93448 93602
- application_type      2    93459 93613
- pub_rec               1    93457 93613
- loan_amnt             1    93461 93617
- revol_bal             1    93464 93620
- installment           1    93476 93632
- initial_list_status   1    93481 93637
- purpose              12    93531 93665
- term                  1    93510 93666
- open_acc              1    93573 93729
- total_acc             1    93619 93775
- revol_util            1    93710 93866
- emp_length           11    93763 93899
- dti                   1    94192 94348
- home_ownership        3    94312 94464
- int_rate              1    94724 94880
- sub_grade            34    95667 95757

Step:  AIC=93591.21
loan_outcome ~ loan_amnt + term + int_rate + installment + sub_grade + 
    emp_length + home_ownership + verification_status + purpose + 
    dti + open_acc + pub_rec + revol_bal + revol_util + total_acc + 
    initial_list_status + application_type + pub_rec_bankruptcies + 
    time_since_fcline

                       Df Deviance   AIC
<none>                       93435 93591
+ annual_inc            1    93434 93592
- time_since_fcline     1    93438 93592
- pub_rec_bankruptcies  1    93440 93594
- verification_status   2    93449 93601
- pub_rec               1    93458 93612
- application_type      2    93460 93612
- loan_amnt             1    93463 93617
- revol_bal             1    93468 93622
- installment           1    93477 93631
- initial_list_status   1    93482 93636
- purpose              12    93532 93664
- term                  1    93512 93666
- open_acc              1    93573 93727
- total_acc             1    93624 93778
- revol_util            1    93710 93864
- emp_length           11    93769 93903
- dti                   1    94265 94419
- home_ownership        3    94316 94466
- int_rate              1    94725 94879
- sub_grade            34    95673 95761
Code
summary(full.reg)

Call:
glm(formula = loan_outcome ~ ., family = "binomial", data = loans.oversample3)

Coefficients: (6 not defined because of singularities)
                                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)                        -8.595e+00  8.448e+01  -0.102 0.918962    
loan_amnt                          -3.637e-05  7.012e-06  -5.187 2.13e-07 ***
term 60 months                      4.150e-01  4.756e-02   8.725  < 2e-16 ***
int_rate                           -5.608e-01  1.668e-02 -33.619  < 2e-16 ***
installment                         1.414e-03  2.192e-04   6.448 1.13e-10 ***
gradeB                              4.985e+00  1.441e-01  34.587  < 2e-16 ***
gradeC                              7.255e+00  1.869e-01  38.822  < 2e-16 ***
gradeD                              9.403e+00  2.390e-01  39.338  < 2e-16 ***
gradeE                              1.164e+01  2.950e-01  39.458  < 2e-16 ***
gradeF                              1.464e+01  3.771e-01  38.817  < 2e-16 ***
gradeG                              1.548e+01  4.642e-01  33.340  < 2e-16 ***
sub_gradeA2                         1.158e+00  1.245e-01   9.304  < 2e-16 ***
sub_gradeA3                         1.377e+00  1.265e-01  10.887  < 2e-16 ***
sub_gradeA4                         1.990e+00  1.154e-01  17.248  < 2e-16 ***
sub_gradeA5                         2.453e+00  1.145e-01  21.429  < 2e-16 ***
sub_gradeB1                        -2.333e+00  7.798e-02 -29.922  < 2e-16 ***
sub_gradeB2                        -1.620e+00  6.529e-02 -24.814  < 2e-16 ***
sub_gradeB3                        -1.014e+00  5.513e-02 -18.393  < 2e-16 ***
sub_gradeB4                        -4.353e-01  4.886e-02  -8.909  < 2e-16 ***
sub_gradeB5                                NA         NA      NA       NA    
sub_gradeC1                        -1.713e+00  6.171e-02 -27.764  < 2e-16 ***
sub_gradeC2                        -1.319e+00  5.555e-02 -23.740  < 2e-16 ***
sub_gradeC3                        -8.432e-01  5.001e-02 -16.859  < 2e-16 ***
sub_gradeC4                        -4.279e-01  4.561e-02  -9.383  < 2e-16 ***
sub_gradeC5                                NA         NA      NA       NA    
sub_gradeD1                        -1.542e+00  6.975e-02 -22.105  < 2e-16 ***
sub_gradeD2                        -9.083e-01  6.366e-02 -14.268  < 2e-16 ***
sub_gradeD3                        -5.574e-01  6.187e-02  -9.010  < 2e-16 ***
sub_gradeD4                        -2.222e-01  5.979e-02  -3.717 0.000202 ***
sub_gradeD5                                NA         NA      NA       NA    
sub_gradeE1                        -1.945e+00  9.192e-02 -21.165  < 2e-16 ***
sub_gradeE2                        -1.570e+00  9.010e-02 -17.426  < 2e-16 ***
sub_gradeE3                        -1.098e+00  8.693e-02 -12.632  < 2e-16 ***
sub_gradeE4                        -6.191e-01  8.604e-02  -7.195 6.25e-13 ***
sub_gradeE5                                NA         NA      NA       NA    
sub_gradeF1                        -2.651e+00  1.517e-01 -17.480  < 2e-16 ***
sub_gradeF2                        -1.661e+00  1.507e-01 -11.023  < 2e-16 ***
sub_gradeF3                        -1.436e+00  1.561e-01  -9.201  < 2e-16 ***
sub_gradeF4                        -8.469e-01  1.574e-01  -5.379 7.48e-08 ***
sub_gradeF5                                NA         NA      NA       NA    
sub_gradeG1                        -9.360e-01  2.662e-01  -3.517 0.000437 ***
sub_gradeG2                        -1.881e-01  2.843e-01  -0.662 0.508182    
sub_gradeG3                         3.359e-04  2.911e-01   0.001 0.999080    
sub_gradeG4                         3.193e-01  3.363e-01   0.949 0.342388    
sub_gradeG5                                NA         NA      NA       NA    
emp_length< 1 year                 -4.532e-01  4.372e-02 -10.367  < 2e-16 ***
emp_length1 year                   -4.712e-01  4.566e-02 -10.319  < 2e-16 ***
emp_length10+ years                -6.238e-01  3.634e-02 -17.168  < 2e-16 ***
emp_length2 years                  -5.229e-01  4.308e-02 -12.136  < 2e-16 ***
emp_length3 years                  -4.913e-01  4.397e-02 -11.174  < 2e-16 ***
emp_length4 years                  -5.476e-01  4.699e-02 -11.655  < 2e-16 ***
emp_length5 years                  -4.487e-01  4.689e-02  -9.568  < 2e-16 ***
emp_length6 years                  -6.089e-01  5.220e-02 -11.666  < 2e-16 ***
emp_length7 years                  -6.175e-01  5.143e-02 -12.007  < 2e-16 ***
emp_length8 years                  -4.776e-01  4.875e-02  -9.797  < 2e-16 ***
emp_length9 years                  -5.397e-01  5.163e-02 -10.452  < 2e-16 ***
home_ownershipMORTGAGE              8.919e+00  8.448e+01   0.106 0.915921    
home_ownershipOWN                   9.166e+00  8.448e+01   0.109 0.913594    
home_ownershipRENT                  9.451e+00  8.448e+01   0.112 0.910923    
annual_inc                         -1.541e-07  1.539e-07  -1.002 0.316482    
verification_statusSource Verified  6.897e-02  2.058e-02   3.351 0.000806 ***
verification_statusVerified         7.652e-02  2.244e-02   3.411 0.000648 ***
purposecredit_card                  1.076e-01  8.798e-02   1.223 0.221428    
purposedebt_consolidation           4.181e-02  8.656e-02   0.483 0.629045    
purposehome_improvement             1.581e-01  9.136e-02   1.731 0.083473 .  
purposehouse                       -2.516e-01  1.343e-01  -1.874 0.060882 .  
purposemajor_purchase               2.137e-01  1.026e-01   2.083 0.037274 *  
purposemedical                      2.768e-01  1.115e-01   2.482 0.013078 *  
purposemoving                      -3.080e-02  1.231e-01  -0.250 0.802412    
purposeother                        1.438e-02  9.212e-02   0.156 0.875977    
purposerenewable_energy             5.902e-01  2.791e-01   2.115 0.034447 *  
purposesmall_business               5.022e-01  1.167e-01   4.304 1.67e-05 ***
purposevacation                     2.338e-01  1.280e-01   1.827 0.067661 .  
purposewedding                     -1.201e+01  5.326e+01  -0.226 0.821525    
dti                                 2.827e-02  1.030e-03  27.435  < 2e-16 ***
open_acc                            2.398e-02  2.042e-03  11.743  < 2e-16 ***
pub_rec                             8.309e-02  1.747e-02   4.758 1.96e-06 ***
revol_bal                          -2.334e-06  4.359e-07  -5.355 8.57e-08 ***
revol_util                          6.084e-03  3.671e-04  16.571  < 2e-16 ***
total_acc                          -1.260e-02  9.297e-04 -13.557  < 2e-16 ***
initial_list_statusw               -1.166e-01  1.698e-02  -6.867 6.56e-12 ***
application_typeINDIVIDUAL          4.198e-02  1.559e-01   0.269 0.787709    
application_typeJOINT              -6.672e-01  2.114e-01  -3.156 0.001598 ** 
pub_rec_bankruptcies               -6.001e-02  2.633e-02  -2.279 0.022674 *  
time_since_fcline                   5.717e-06  3.265e-06   1.751 0.079952 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 107373  on 77453  degrees of freedom
Residual deviance:  93434  on 77375  degrees of freedom
AIC: 93592

Number of Fisher Scoring iterations: 9
Code
summary(loans.reg)

Call:
glm(formula = loan_outcome ~ loan_amnt + term + int_rate + installment + 
    sub_grade + emp_length + home_ownership + verification_status + 
    purpose + dti + open_acc + pub_rec + revol_bal + revol_util + 
    total_acc + initial_list_status + application_type + pub_rec_bankruptcies + 
    time_since_fcline, family = "binomial", data = loans.oversample3)

Coefficients:
                                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)                        -8.611e+00  8.448e+01  -0.102 0.918813    
loan_amnt                          -3.670e-05  7.004e-06  -5.240 1.61e-07 ***
term 60 months                      4.159e-01  4.755e-02   8.746  < 2e-16 ***
int_rate                           -5.609e-01  1.668e-02 -33.626  < 2e-16 ***
installment                         1.415e-03  2.192e-04   6.453 1.09e-10 ***
sub_gradeA2                         1.159e+00  1.245e-01   9.311  < 2e-16 ***
sub_gradeA3                         1.377e+00  1.265e-01  10.890  < 2e-16 ***
sub_gradeA4                         1.991e+00  1.154e-01  17.261  < 2e-16 ***
sub_gradeA5                         2.455e+00  1.145e-01  21.448  < 2e-16 ***
sub_gradeB1                         2.654e+00  1.168e-01  22.716  < 2e-16 ***
sub_gradeB2                         3.368e+00  1.223e-01  27.545  < 2e-16 ***
sub_gradeB3                         3.974e+00  1.280e-01  31.049  < 2e-16 ***
sub_gradeB4                         4.553e+00  1.381e-01  32.972  < 2e-16 ***
sub_gradeB5                         4.989e+00  1.441e-01  34.618  < 2e-16 ***
sub_gradeC1                         5.545e+00  1.524e-01  36.381  < 2e-16 ***
sub_gradeC2                         5.940e+00  1.592e-01  37.305  < 2e-16 ***
sub_gradeC3                         6.415e+00  1.674e-01  38.313  < 2e-16 ***
sub_gradeC4                         6.831e+00  1.762e-01  38.761  < 2e-16 ***
sub_gradeC5                         7.259e+00  1.868e-01  38.849  < 2e-16 ***
sub_gradeD1                         7.865e+00  2.007e-01  39.183  < 2e-16 ***
sub_gradeD2                         8.499e+00  2.152e-01  39.498  < 2e-16 ***
sub_gradeD3                         8.850e+00  2.235e-01  39.589  < 2e-16 ***
sub_gradeD4                         9.185e+00  2.321e-01  39.581  < 2e-16 ***
sub_gradeD5                         9.408e+00  2.390e-01  39.365  < 2e-16 ***
sub_gradeE1                         9.702e+00  2.453e-01  39.558  < 2e-16 ***
sub_gradeE2                         1.008e+01  2.524e-01  39.920  < 2e-16 ***
sub_gradeE3                         1.055e+01  2.627e-01  40.157  < 2e-16 ***
sub_gradeE4                         1.103e+01  2.770e-01  39.814  < 2e-16 ***
sub_gradeE5                         1.165e+01  2.950e-01  39.483  < 2e-16 ***
sub_gradeF1                         1.199e+01  3.109e-01  38.570  < 2e-16 ***
sub_gradeF2                         1.298e+01  3.278e-01  39.609  < 2e-16 ***
sub_gradeF3                         1.321e+01  3.443e-01  38.363  < 2e-16 ***
sub_gradeF4                         1.380e+01  3.595e-01  38.383  < 2e-16 ***
sub_gradeF5                         1.464e+01  3.771e-01  38.840  < 2e-16 ***
sub_gradeG1                         1.455e+01  3.908e-01  37.224  < 2e-16 ***
sub_gradeG2                         1.530e+01  4.128e-01  37.055  < 2e-16 ***
sub_gradeG3                         1.548e+01  4.235e-01  36.566  < 2e-16 ***
sub_gradeG4                         1.580e+01  4.635e-01  34.095  < 2e-16 ***
sub_gradeG5                         1.548e+01  4.642e-01  33.351  < 2e-16 ***
emp_length< 1 year                 -4.555e-01  4.366e-02 -10.433  < 2e-16 ***
emp_length1 year                   -4.735e-01  4.560e-02 -10.384  < 2e-16 ***
emp_length10+ years                -6.266e-01  3.624e-02 -17.290  < 2e-16 ***
emp_length2 years                  -5.255e-01  4.301e-02 -12.219  < 2e-16 ***
emp_length3 years                  -4.944e-01  4.387e-02 -11.269  < 2e-16 ***
emp_length4 years                  -5.503e-01  4.691e-02 -11.730  < 2e-16 ***
emp_length5 years                  -4.515e-01  4.681e-02  -9.644  < 2e-16 ***
emp_length6 years                  -6.118e-01  5.213e-02 -11.736  < 2e-16 ***
emp_length7 years                  -6.200e-01  5.137e-02 -12.069  < 2e-16 ***
emp_length8 years                  -4.805e-01  4.867e-02  -9.874  < 2e-16 ***
emp_length9 years                  -5.424e-01  5.157e-02 -10.518  < 2e-16 ***
home_ownershipMORTGAGE              8.924e+00  8.448e+01   0.106 0.915869    
home_ownershipOWN                   9.173e+00  8.448e+01   0.109 0.913534    
home_ownershipRENT                  9.457e+00  8.448e+01   0.112 0.910866    
verification_statusSource Verified  6.842e-02  2.058e-02   3.325 0.000883 ***
verification_statusVerified         7.626e-02  2.244e-02   3.399 0.000677 ***
purposecredit_card                  1.095e-01  8.795e-02   1.245 0.213173    
purposedebt_consolidation           4.309e-02  8.654e-02   0.498 0.618527    
purposehome_improvement             1.574e-01  9.135e-02   1.723 0.084926 .  
purposehouse                       -2.518e-01  1.343e-01  -1.875 0.060742 .  
purposemajor_purchase               2.140e-01  1.026e-01   2.086 0.036967 *  
purposemedical                      2.758e-01  1.115e-01   2.473 0.013391 *  
purposemoving                      -3.112e-02  1.231e-01  -0.253 0.800395    
purposeother                        1.438e-02  9.211e-02   0.156 0.875939    
purposerenewable_energy             5.924e-01  2.791e-01   2.123 0.033789 *  
purposesmall_business               5.001e-01  1.166e-01   4.287 1.81e-05 ***
purposevacation                     2.335e-01  1.280e-01   1.825 0.068002 .  
purposewedding                     -1.201e+01  5.311e+01  -0.226 0.821060    
dti                                 2.851e-02  1.001e-03  28.473  < 2e-16 ***
open_acc                            2.390e-02  2.041e-03  11.714  < 2e-16 ***
pub_rec                             8.246e-02  1.745e-02   4.724 2.31e-06 ***
revol_bal                          -2.414e-06  4.287e-07  -5.631 1.79e-08 ***
revol_util                          6.061e-03  3.664e-04  16.541  < 2e-16 ***
total_acc                          -1.268e-02  9.264e-04 -13.688  < 2e-16 ***
initial_list_statusw               -1.168e-01  1.698e-02  -6.876 6.14e-12 ***
application_typeINDIVIDUAL          4.589e-02  1.559e-01   0.294 0.768438    
application_typeJOINT              -6.619e-01  2.114e-01  -3.131 0.001739 ** 
pub_rec_bankruptcies               -5.933e-02  2.633e-02  -2.254 0.024208 *  
time_since_fcline                   5.597e-06  3.263e-06   1.715 0.086288 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 107373  on 77453  degrees of freedom
Residual deviance:  93435  on 77376  degrees of freedom
AIC: 93591

Number of Fisher Scoring iterations: 9
Code
coef(loans.reg)
                       (Intercept)                          loan_amnt 
                     -8.610653e+00                      -3.670292e-05 
                    term 60 months                           int_rate 
                      4.158783e-01                      -5.608958e-01 
                       installment                        sub_gradeA2 
                      1.414770e-03                       1.158932e+00 
                       sub_gradeA3                        sub_gradeA4 
                      1.377215e+00                       1.991339e+00 
                       sub_gradeA5                        sub_gradeB1 
                      2.455280e+00                       2.653581e+00 
                       sub_gradeB2                        sub_gradeB3 
                      3.367519e+00                       3.974049e+00 
                       sub_gradeB4                        sub_gradeB5 
                      4.553151e+00                       4.988786e+00 
                       sub_gradeC1                        sub_gradeC2 
                      5.544592e+00                       5.939652e+00 
                       sub_gradeC3                        sub_gradeC4 
                      6.415262e+00                       6.830668e+00 
                       sub_gradeC5                        sub_gradeD1 
                      7.258552e+00                       7.865233e+00 
                       sub_gradeD2                        sub_gradeD3 
                      8.499313e+00                       8.850077e+00 
                       sub_gradeD4                        sub_gradeD5 
                      9.185072e+00                       9.407693e+00 
                       sub_gradeE1                        sub_gradeE2 
                      9.701780e+00                       1.007716e+01 
                       sub_gradeE3                        sub_gradeE4 
                      1.054932e+01                       1.102729e+01 
                       sub_gradeE5                        sub_gradeF1 
                      1.164750e+01                       1.199289e+01 
                       sub_gradeF2                        sub_gradeF3 
                      1.298323e+01                       1.320867e+01 
                       sub_gradeF4                        sub_gradeF5 
                      1.379781e+01                       1.464496e+01 
                       sub_gradeG1                        sub_gradeG2 
                      1.454567e+01                       1.529582e+01 
                       sub_gradeG3                        sub_gradeG4 
                      1.548493e+01                       1.580221e+01 
                       sub_gradeG5                 emp_length< 1 year 
                      1.548274e+01                      -4.555064e-01 
                  emp_length1 year                emp_length10+ years 
                     -4.735422e-01                      -6.265897e-01 
                 emp_length2 years                  emp_length3 years 
                     -5.255042e-01                      -4.943896e-01 
                 emp_length4 years                  emp_length5 years 
                     -5.503227e-01                      -4.514590e-01 
                 emp_length6 years                  emp_length7 years 
                     -6.117681e-01                      -6.200040e-01 
                 emp_length8 years                  emp_length9 years 
                     -4.805120e-01                      -5.423702e-01 
            home_ownershipMORTGAGE                  home_ownershipOWN 
                      8.924060e+00                       9.172686e+00 
                home_ownershipRENT verification_statusSource Verified 
                      9.456910e+00                       6.842355e-02 
       verification_statusVerified                 purposecredit_card 
                      7.625872e-02                       1.094934e-01 
         purposedebt_consolidation            purposehome_improvement 
                      4.309290e-02                       1.573838e-01 
                      purposehouse              purposemajor_purchase 
                     -2.517727e-01                       2.139869e-01 
                    purposemedical                      purposemoving 
                      2.758128e-01                      -3.111637e-02 
                      purposeother            purposerenewable_energy 
                      1.438062e-02                       5.924352e-01 
             purposesmall_business                    purposevacation 
                      5.000602e-01                       2.335156e-01 
                    purposewedding                                dti 
                     -1.201161e+01                       2.851141e-02 
                          open_acc                            pub_rec 
                      2.390477e-02                       8.245500e-02 
                         revol_bal                         revol_util 
                     -2.414280e-06                       6.060981e-03 
                         total_acc               initial_list_statusw 
                     -1.268082e-02                      -1.167521e-01 
        application_typeINDIVIDUAL              application_typeJOINT 
                      4.588832e-02                      -6.618791e-01 
              pub_rec_bankruptcies                  time_since_fcline 
                     -5.933475e-02                       5.597281e-06 
Code
head(predict(loans.reg, type="response"))
        1         2         3         4         5         6 
0.1474916 0.4761187 0.3528356 0.2671589 0.6078489 0.3421974 
Code
#model_glm_pred = predict(loans.reg, type="response")
model_glm_pred = ifelse(predict(loans.reg, type = "response") > 0.5, 1, 0)

calc_class_err = function(actual, predicted) {
  mean(actual != predicted)
}

calc_class_err(actual = loans.oversample3$loan_outcome, predicted = model_glm_pred)
[1] 0.326865
Code
train_tab = table(predicted = model_glm_pred, actual = loans.oversample3$loan_outcome)
library(caret)
train_con_mat = confusionMatrix(train_tab, positive = "1")
c(train_con_mat$overall["Accuracy"], 
  train_con_mat$byClass["Sensitivity"], 
  train_con_mat$byClass["Specificity"])
   Accuracy Sensitivity Specificity 
  0.6731350   0.6847482   0.6614412 
Code
#Create function
get_logistic_error = function(mod, data, res = "y", pos = 1, neg = 0, cut = 0.5) {
  probs = predict(mod, newdata = data, type = "response")
  preds = ifelse(probs > cut, pos, neg)
  calc_class_err(actual = data[, res], predicted = preds)
}

get_logistic_error(loans.reg, data = loans.oversample3, 
                   res = "loan_outcome", pos = 1, neg = 0, cut = 0.5)
[1] 0.326865
Code
performance_df <- data.frame(Model = NULL, Accuracy = NULL, Sensitivity = NULL, Specificity = NULL)
perf_dt <- data.frame(Model = "loans.reg Logistic with stepAIC", Accuracy = train_con_mat$overall[1], Sensitivity = train_con_mat$byClass[1], Specificity = train_con_mat$byClass[2])
performance_df <- rbind(performance_df, perf_dt)
perf_dt
                                   Model Accuracy Sensitivity Specificity
Accuracy loans.reg Logistic with stepAIC 0.673135   0.6847482   0.6614412
Code
perf_dt3 <- data.frame(Model = "Decision Tree Baseline", Accuracy = p1$overall[1], Sensitivity = p1$byClass[1], Specificity = p1$byClass[2])
performance_df <- rbind(performance_df, perf_dt3)
perf_dt3
                          Model  Accuracy Sensitivity Specificity
Accuracy Decision Tree Baseline 0.7834784   0.1097343   0.9721944

Logistic Regression (Test Data Set)

After evaluating the fitted model on the held-out test set, the metrics were similar to those on the training data: Accuracy 0.6645, Sensitivity 0.6576, Specificity 0.6891, and AUC 0.7358.

Code
library(ROCR)
set.seed(2024)

loans.test4 <- as.data.frame(loans.test)
barplot(table(loans.test4$loan_status) , col = 'lightblue')

Code
table(loans.test4$loan_status)

Charged Off  Fully Paid 
       7263       25930 
Code
#Use under and oversampling
# loans.oversample.test1 <- ovun.sample(loan_status ~ ., data = loans.test4, method = "both",N = 33193 , seed = 13)$data
# barplot(table(loans.oversample.test1 $loan_status) , col = 'lightblue')


loans.test1 <- loans.test4  %>% mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off' ) , 1, ifelse(loan_status == 'Fully Paid' , 0,'none' )))

barplot(table(loans.test1$loan_outcome) , col = 'lightblue')

Code
table(loans.test1$loan_outcome)

    0     1 
25930  7263 
Code
# Drop columns that are not used for modeling
loans.test1 <- loans.test1[, !(colnames(loans.test1) %in% c('loan_status', 'issue_d', 'earliest_cr_line'))]

# Re-apply the same factor conversion used on the training data
store.colnames1 = colnames(loans.test1)
loans.test3 = lapply(store.colnames1, function(column) factorize(column, loans.test1))
loans.test3 = as.data.frame(loans.test3)
colnames(loans.test3) = store.colnames1


loans.over.2 <- train(loan_outcome ~ loan_amnt + term + int_rate + installment + sub_grade + 
    emp_length + home_ownership + verification_status + purpose + 
    dti + open_acc + pub_rec + revol_bal + revol_util + total_acc + 
    initial_list_status + application_type + pub_rec_bankruptcies + 
    time_since_fcline , data = loans.oversample3, method = "glm")

confusionMatrix(predict(loans.over.2, loans.test3), as.factor(loans.test3$loan_outcome))
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 17051  2258
         1  8879  5005
                                          
               Accuracy : 0.6645          
                 95% CI : (0.6594, 0.6696)
    No Information Rate : 0.7812          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.261           
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.6576          
            Specificity : 0.6891          
         Pos Pred Value : 0.8831          
         Neg Pred Value : 0.3605          
             Prevalence : 0.7812          
         Detection Rate : 0.5137          
   Detection Prevalence : 0.5817          
      Balanced Accuracy : 0.6733          
                                          
       'Positive' Class : 0               
                                          
Code
loans.2.prediction <- prediction(predict(loans.over.2, newdata = loans.test3, type = "prob")[,"1"], loans.test3$loan_outcome)
performance(loans.2.prediction , measure = "auc")@y.values
[[1]]
[1] 0.7358227
Code
# Make predictions and compute accuracy for the full model
probabilities <- predict(full.reg, loans.test3, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
# Prediction accuracy
observed.classes <- loans.test3$loan_outcome
mean(predicted.classes == observed.classes)
[1] 0.6645678
Code
# Make predictions and compute accuracy for the stepwise model
probabilities <- predict(loans.reg, loans.test3, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
# Prediction accuracy
observed.classes <- loans.test3$loan_outcome
mean(predicted.classes == observed.classes)
[1] 0.6644775
Code
get_logistic_error(full.reg, data = loans.test3, 
                   res = "loan_outcome", pos = 1, neg = 0, cut = 0.5)
[1] 0.3354322
Code
get_logistic_error(loans.reg, data = loans.test3, 
                   res = "loan_outcome", pos = 1, neg = 0, cut = 0.5)
[1] 0.3355225
Code
#A good model will have a high AUC, that is as often as possible a high sensitivity and specificity.
test_prob = predict(loans.reg, newdata = loans.test3, type = "response")
test_roc = roc( loans.test3$loan_outcome ~ test_prob, plot = TRUE, print.auc = TRUE)

Code
as.numeric(test_roc$auc)
[1] 0.7358227

Random Forest Model

We implemented a random forest model; however, we were not able to use some of the functionality of the randomForestExplainer package, so we display only the standard importance measures provided by randomForest.

The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case). The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
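
To make the permutation measure concrete, the sketch below illustrates the idea for a single predictor on a held-out set rather than on the OOB samples randomForest uses internally; the helper name and the example call are our own, for illustration only.

Code
# Illustrative only: permutation importance for one predictor.
# Shuffle the predictor to break its association with the outcome and
# record how much the classification error increases.
perm_importance_one <- function(fit, data, outcome, predictor) {
  baseline_err <- mean(predict(fit, data) != data[[outcome]])
  permuted <- data
  permuted[[predictor]] <- sample(permuted[[predictor]])
  permuted_err <- mean(predict(fit, permuted) != permuted[[outcome]])
  permuted_err - baseline_err   # larger increase = more important
}

# e.g. perm_importance_one(seat_forest, loans.test3, "loan_outcome", "int_rate")
# (once seat_forest has been fit below)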

Code
#install.packages("randomForestExplainer")
library(randomForest)
library(ipred)
library(randomForestExplainer)

set.seed(2024)

# Note: randomForest's argument is 'ntree' (singular); 'ntrees' is silently
# ignored, so the default of 500 trees is used, as shown in the output below.
seat_forest = randomForest(loan_outcome ~ ., data = loans.oversample3, mtry = 3, importance = TRUE, ntrees = 500)
seat_forest

Call:
 randomForest(formula = loan_outcome ~ ., data = loans.oversample3,      mtry = 3, importance = TRUE, ntrees = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 8.36%
Confusion matrix:
      0     1 class.error
0 33982  4611  0.11947763
1  1866 36995  0.04801729
Code
plot(seat_forest)

Code
# colnames(loans.oversample3)
# str(loans.oversample3)
# colnames(loans.test3)
# str(loans.test3)

levels( loans.oversample3$home_ownership)
[1] "ANY"      "MORTGAGE" "OWN"      "RENT"    
Code
allvalues <- unique(levels(loans.oversample3$home_ownership))
loans.test3$home_ownership <- factor(loans.test3$home_ownership, levels = allvalues)
levels( loans.test3$home_ownership)
[1] "ANY"      "MORTGAGE" "OWN"      "RENT"    
Code
common <- intersect(names(loans.oversample3), names(loans.test3)) 
for (p in common) { 
  if (class(loans.oversample3[[p]]) == "factor") { 
    levels(loans.test3[[p]]) <- levels(loans.oversample3[[p]]) 
  } 
}

seat_forest_tst_perd = predict(seat_forest, loans.test3)
table(predicted = seat_forest_tst_perd, actual = loans.test3$loan_outcome)
         actual
predicted     0     1
        0 20156  3341
        1  5774  3922
Code
rf_cm <- confusionMatrix(seat_forest_tst_perd, loans.test3$loan_outcome)


randomForest::importance(seat_forest, type=1)
                     MeanDecreaseAccuracy
loan_amnt                        97.71957
term                             34.36486
int_rate                         70.05303
installment                     114.92811
grade                            43.68179
sub_grade                        75.03905
emp_length                      169.14120
home_ownership                  102.98117
annual_inc                       94.32410
verification_status              52.81359
purpose                          89.98132
dti                             115.42503
open_acc                        126.29566
pub_rec                          85.39829
revol_bal                       156.45117
revol_util                       96.26056
total_acc                       117.82003
initial_list_status              31.85327
application_type                 34.42225
pub_rec_bankruptcies             83.40827
time_since_fcline               108.80780
Code
randomForest::importance(seat_forest, type=2)
                     MeanDecreaseGini
loan_amnt                  2156.30125
term                        419.29550
int_rate                   2898.03363
installment                2557.74510
grade                      1305.78409
sub_grade                  3283.18077
emp_length                 2893.53889
home_ownership              823.51926
annual_inc                 2611.73969
verification_status         774.22029
purpose                    1339.35019
dti                        3294.59734
open_acc                   1971.88136
pub_rec                     450.61513
revol_bal                  2741.31150
revol_util                 2905.64451
total_acc                  2291.79049
initial_list_status         380.49145
application_type             37.99733
pub_rec_bankruptcies        304.65887
time_since_fcline          2716.11220
Code
par(mfrow = c(1, 2))
varImpPlot(seat_forest, type=1, main = "Importance: permutation")
varImpPlot(seat_forest, type=2, main = "Importance: node impurity")

Code
#var_imp <- measure_importance(seat_forest)

perf_dt4 <- data.frame(Model = "Random forest", Accuracy = rf_cm$overall[1], Sensitivity = rf_cm$byClass[1], Specificity = rf_cm$byClass[2])
performance_df <- rbind(performance_df, perf_dt4)

performance_df 
                                    Model  Accuracy Sensitivity Specificity
Accuracy  loans.reg Logistic with stepAIC 0.6731350   0.6847482   0.6614412
Accuracy1          Decision Tree Baseline 0.7834784   0.1097343   0.9721944
Accuracy2                   Random forest 0.7253939   0.7773236   0.5399972

Medium Data Set (End to End ML Analysis)

Data set Introduction

This data set consists of about 284,000 card transactions labelled as non-fraud or fraud. It is a real data set from a European financial institution, which is why the features are masked: they are the output of a PCA transformation. It is also a highly imbalanced data set, with orders of magnitude more non-fraud than fraud transactions.

Data Exploration & Plots

Code
set.seed(2024)
path = "https://github.com/BanuB/Card_Transaction_Fraud/raw/refs/heads/master/creditcard.parquet" 

tx_raw = read_parquet(path)
Code
summary(tx_raw)
      Time              V1                  V2                  V3          
 Min.   :     0   Min.   :-56.40751   Min.   :-72.71573   Min.   :-48.3256  
 1st Qu.: 54202   1st Qu.: -0.92037   1st Qu.: -0.59855   1st Qu.: -0.8904  
 Median : 84692   Median :  0.01811   Median :  0.06549   Median :  0.1799  
 Mean   : 94814   Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.0000  
 3rd Qu.:139321   3rd Qu.:  1.31564   3rd Qu.:  0.80372   3rd Qu.:  1.0272  
 Max.   :172792   Max.   :  2.45493   Max.   : 22.05773   Max.   :  9.3826  
       V4                 V5                   V6                 V7          
 Min.   :-5.68317   Min.   :-113.74331   Min.   :-26.1605   Min.   :-43.5572  
 1st Qu.:-0.84864   1st Qu.:  -0.69160   1st Qu.: -0.7683   1st Qu.: -0.5541  
 Median :-0.01985   Median :  -0.05434   Median : -0.2742   Median :  0.0401  
 Mean   : 0.00000   Mean   :   0.00000   Mean   :  0.0000   Mean   :  0.0000  
 3rd Qu.: 0.74334   3rd Qu.:   0.61193   3rd Qu.:  0.3986   3rd Qu.:  0.5704  
 Max.   :16.87534   Max.   :  34.80167   Max.   : 73.3016   Max.   :120.5895  
       V8                  V9                 V10                 V11          
 Min.   :-73.21672   Min.   :-13.43407   Min.   :-24.58826   Min.   :-4.79747  
 1st Qu.: -0.20863   1st Qu.: -0.64310   1st Qu.: -0.53543   1st Qu.:-0.76249  
 Median :  0.02236   Median : -0.05143   Median : -0.09292   Median :-0.03276  
 Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.00000  
 3rd Qu.:  0.32735   3rd Qu.:  0.59714   3rd Qu.:  0.45392   3rd Qu.: 0.73959  
 Max.   : 20.00721   Max.   : 15.59500   Max.   : 23.74514   Max.   :12.01891  
      V12                V13                V14                V15          
 Min.   :-18.6837   Min.   :-5.79188   Min.   :-19.2143   Min.   :-4.49894  
 1st Qu.: -0.4056   1st Qu.:-0.64854   1st Qu.: -0.4256   1st Qu.:-0.58288  
 Median :  0.1400   Median :-0.01357   Median :  0.0506   Median : 0.04807  
 Mean   :  0.0000   Mean   : 0.00000   Mean   :  0.0000   Mean   : 0.00000  
 3rd Qu.:  0.6182   3rd Qu.: 0.66251   3rd Qu.:  0.4931   3rd Qu.: 0.64882  
 Max.   :  7.8484   Max.   : 7.12688   Max.   : 10.5268   Max.   : 8.87774  
      V16                 V17                 V18           
 Min.   :-14.12985   Min.   :-25.16280   Min.   :-9.498746  
 1st Qu.: -0.46804   1st Qu.: -0.48375   1st Qu.:-0.498850  
 Median :  0.06641   Median : -0.06568   Median :-0.003636  
 Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.000000  
 3rd Qu.:  0.52330   3rd Qu.:  0.39968   3rd Qu.: 0.500807  
 Max.   : 17.31511   Max.   :  9.25353   Max.   : 5.041069  
      V19                 V20                 V21           
 Min.   :-7.213527   Min.   :-54.49772   Min.   :-34.83038  
 1st Qu.:-0.456299   1st Qu.: -0.21172   1st Qu.: -0.22839  
 Median : 0.003735   Median : -0.06248   Median : -0.02945  
 Mean   : 0.000000   Mean   :  0.00000   Mean   :  0.00000  
 3rd Qu.: 0.458949   3rd Qu.:  0.13304   3rd Qu.:  0.18638  
 Max.   : 5.591971   Max.   : 39.42090   Max.   : 27.20284  
      V22                  V23                 V24          
 Min.   :-10.933144   Min.   :-44.80774   Min.   :-2.83663  
 1st Qu.: -0.542350   1st Qu.: -0.16185   1st Qu.:-0.35459  
 Median :  0.006782   Median : -0.01119   Median : 0.04098  
 Mean   :  0.000000   Mean   :  0.00000   Mean   : 0.00000  
 3rd Qu.:  0.528554   3rd Qu.:  0.14764   3rd Qu.: 0.43953  
 Max.   : 10.503090   Max.   : 22.52841   Max.   : 4.58455  
      V25                 V26                V27            
 Min.   :-10.29540   Min.   :-2.60455   Min.   :-22.565679  
 1st Qu.: -0.31715   1st Qu.:-0.32698   1st Qu.: -0.070840  
 Median :  0.01659   Median :-0.05214   Median :  0.001342  
 Mean   :  0.00000   Mean   : 0.00000   Mean   :  0.000000  
 3rd Qu.:  0.35072   3rd Qu.: 0.24095   3rd Qu.:  0.091045  
 Max.   :  7.51959   Max.   : 3.51735   Max.   : 31.612198  
      V28                Amount             Class         
 Min.   :-15.43008   Min.   :    0.00   Min.   :0.000000  
 1st Qu.: -0.05296   1st Qu.:    5.60   1st Qu.:0.000000  
 Median :  0.01124   Median :   22.00   Median :0.000000  
 Mean   :  0.00000   Mean   :   88.35   Mean   :0.001728  
 3rd Qu.:  0.07828   3rd Qu.:   77.17   3rd Qu.:0.000000  
 Max.   : 33.84781   Max.   :25691.16   Max.   :1.000000  
Code
tx_raw$Class = as.factor(tx_raw$Class) #Convert Class column to factor

tx_raw = tx_raw %>%
  mutate(datetime = as.POSIXct("2024-01-01 00:00:00", tz = "UTC") + seconds(Time)) #Make new column that shows datetime

ggplot(tx_raw, aes(x = Amount, fill = Class)) +
  geom_histogram(position = "dodge", bins = 60) +
  labs(title = "Histogram of Amounts by Class (< 500 USD)", x = "Amount (USD)", y = "Frequency") +
  theme_minimal() +
  scale_fill_manual(values = c('grey', 'green')) +
  xlim(0, 500)

Code
tx_1 = tx_raw %>%
  filter(Class == 1)

ggplot(tx_1, aes(x = Amount)) +
  geom_histogram(position = "dodge", bins = 60) +
  labs(title = "Histogram of Amounts for Class Fraud", x = "Amount (USD)", y = "Frequency") +
  theme_minimal()

Code
#Outlier plot
ggplot(tx_1, aes(x = Amount)) +
  geom_boxplot() +
  labs(title = "Boxplot of Amounts for Fraudulent Transactions", x = "Amount (USD)") +
  theme_minimal()

Code
# Scatterplot

tx_1 %>% ggplot(aes(x=Time, y=Amount)) +
  geom_point() +
  labs(
  y = "Amount ($)", 
  x = "Time (s)",
  title= "Fraudulent Transactions Across Time"
 )

Code
#Correlation Heatmap
tx_raw_numeric = tx_raw %>%
  dplyr::select(!c(Class, datetime))
cor_matrix = cor(tx_raw_numeric)
cor_matrix = melt(cor_matrix)

ggplot(data = cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), space = "Lab", 
                       name = "Correlation") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                   size = 10, hjust = 1)) +
  coord_fixed() +
  labs(title = "Correlation Heatmap", x = "Variable", y = "Variable")

Code
#----Time-Series for Transactions----
tx_transactions <- tx_raw %>%
  mutate(datetime_hour = floor_date(datetime, "hour")) %>%
  group_by(datetime_hour, Class) %>%
  summarise(transaction_count = n())

tx_trans_1 <- ggplot(tx_transactions, aes(x = datetime_hour, y = transaction_count, color = as.factor(Class))) +
  geom_line() +
  theme_minimal() +
  labs(title = 'Fraud Txs', y = "Number of Transactions", x = "Time (Hourly)") +
  scale_y_continuous(limits = c(0, 50)) +
  theme(legend.position = "none") +
  annotate("text", x = max(tx_transactions$datetime_hour), y = 45, label = expression(rho[1] == -0.226), hjust = 1)

tx_trans_0 <- ggplot(tx_transactions, aes(x = datetime_hour, y = transaction_count, color = as.factor(Class))) +
  geom_line() +
  theme_minimal() +
  labs(title = 'Non-Fraud Txs', x = NULL, y = NULL, color = "Class") +
  scale_y_continuous(limits = c(1000, max(tx_transactions$transaction_count))) +
  theme(legend.position = "none") +
  annotate("text", x = max(tx_transactions$datetime_hour), y = max(tx_transactions$transaction_count) - 50, label = expression(rho[1] == 0.918), hjust = 1)

# Combine the two plots
tx_transactions_plot <- (tx_trans_0 / tx_trans_1) + plot_layout(heights = c(2, 1))
print(tx_transactions_plot)

Code
#----Auto- and Cross-correlations----
tx_nofraud = tx_transactions %>%
  filter(Class == 0) %>%
  dplyr::select(transaction_count)
tx_nofraud = tx_nofraud$transaction_count

tx_nofraud_autocor = acf(tx_nofraud, lag.max = 3, plot = T)

Code
tx_nofraud_autocor

Autocorrelations of series 'tx_nofraud', by lag

    0     1     2     3 
1.000 0.918 0.747 0.536 

When investigating the plots from the EDA above, one thing becomes clear: the data set is heavily imbalanced. As discussed in the introduction, this is unsurprising given the nature of fraud versus non-fraud transactions; however, it is an important consideration when selecting the models to run. Weak learners will likely not perform as strongly as ensemble methods.

Additionally, there are a few more interesting observations. For example, the correlation heatmap shows no strong correlations among the features. This matters for several machine learning algorithms, and given that this data set has already undergone feature engineering and PCA, it is unsurprising that this is the case. Nevertheless, such a plot should be part of any machine learning workflow.
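
As a numeric complement to the heatmap, a quick check along the lines below (a sketch using caret::findCorrelation on the correlation matrix of the numeric columns built above) would list any predictors involved in pairs with |r| above a chosen cutoff; with PCA components we would expect it to return nothing.

Code
# Flag predictors involved in pairs with |correlation| > 0.7 (expected: none)
high_cor <- caret::findCorrelation(cor(tx_raw_numeric), cutoff = 0.7, names = TRUE)
high_cor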

Looking at the time-series graph of transaction counts per hour over the span of the data set, the cyclic nature of the non-fraud transactions is very apparent. No such pattern exists in the fraud transactions, which occur largely at random. This is also visible in the autocorrelations: the lag-1 autocorrelation ρ1 for the non-fraud series is 0.92, which points to strong persistence from one hour to the next (an hour with a high count tends to be followed by another high count). The negative ρ1 of -0.23 for the fraudulent transactions points to more random behavior across the roughly two-day span of the data set. This temporal behavior is likely to be an important signal for the algorithm during training.

Lastly, these autocorrelations can be seen in the ACF plot across different lags: the correlation is strongest at lag 1 and decreases as the lag grows.
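
For completeness, the same check can be run on the hourly fraud series (a small sketch assuming tx_transactions from the chunk above); this is the value annotated as ρ1 = -0.226 in the fraud panel.

Code
# Lag-1 to lag-3 autocorrelations of the hourly fraud counts
tx_fraud = tx_transactions %>%
  filter(Class == 1)

acf(tx_fraud$transaction_count, lag.max = 3, plot = FALSE)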

Next, we split the data to prepare for the machine learning implementation, using an 80/20 split for the training and test sets. We did not set aside a separate validation set, as we do not perform hyperparameter tuning in this exercise.

Data Preparation

Code
set.seed(2024)
library(caret)
library(e1071)
library(randomForest)
library(rpart)
library(pROC)
library(ranger)

tx_raw$Class = as.factor(tx_raw$Class)

# Ensure datetime is of the correct type
tx_raw$datetime = as.POSIXct(tx_raw$datetime)

# Split the data into training and testing sets
trainIndex = createDataPartition(tx_raw$Class, p = 0.8, list = FALSE)
dataTrain = tx_raw[trainIndex, ]
dataTest = tx_raw[-trainIndex, ]

#Define CV
train_control = trainControl(method = "cv", number = 10)
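
As a quick sanity check (a minimal sketch), createDataPartition stratifies on the outcome, so the fraud rate should be essentially identical in both splits:

Code
# Fraud proportion in each split should match the ~0.17% seen in the full data
prop.table(table(dataTrain$Class))
prop.table(table(dataTest$Class))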

Algorithm Selection

Given that this is a highly imbalanced data set, a weak learner such as a single decision tree or a logistic regression will likely not be very successful, and an ensemble would likely be the better choice. To test this, we run a logistic regression and a single decision tree. We also wanted to include a random forest, but training one on the full data set proved computationally too expensive for this exercise. In a production setting, we would certainly use some type of ensemble method, such as a random forest or XGBoost; a sketch of one way to make that feasible is shown below.
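
The sketch below shows one way an ensemble could still be made tractable here: downsample the majority class before fitting a fast forest with ranger. The 20:1 sampling ratio, the 200 trees, and the object names are illustrative assumptions, not tuned choices.

Code
set.seed(2024)

# Downsample non-fraud transactions to roughly 20x the number of fraud cases
fraud    <- dataTrain %>% filter(Class == 1)
nonfraud <- dataTrain %>% filter(Class == 0) %>% slice_sample(n = 20 * nrow(fraud))
train_ds <- bind_rows(fraud, nonfraud) %>% dplyr::select(-datetime)

# Fast random forest on the reduced training set
rf_ds <- ranger(Class ~ ., data = train_ds, num.trees = 200, probability = TRUE)

# Predicted fraud probabilities on the untouched test set
rf_probs <- predict(rf_ds, data = dplyr::select(dataTest, -datetime))$predictions[, "1"]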

Train Models

Code
set.seed(2024)
#Logistic Regression
time_logistic_train = system.time({
  logistic_model = train(Class ~ ., data = dataTrain, method = "glm", family = "binomial", trControl = train_control)})

#Decision Tree
time_tree_train = system.time({
  tree_model = train(Class ~ ., data = dataTrain, method = "rpart", trControl = train_control)})

Predictions and Evaluation

Code
#Logistic Regression Prediction
time_logistic_pred = system.time({
  logistic_pred = predict(logistic_model, dataTest)
  logistic_probs = predict(logistic_model, dataTest, type = "prob")[, 2]
})

#Decision Tree Prediction
time_tree_pred = system.time({
  tree_pred = predict(tree_model, dataTest)
  tree_probs = predict(tree_model, dataTest, type = "prob")[, 2]
})

# Logistic Regression Confusion Matrix
confusionMatrix(logistic_pred, dataTest$Class)
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 56852    39
         1    11    59
                                          
               Accuracy : 0.9991          
                 95% CI : (0.9988, 0.9993)
    No Information Rate : 0.9983          
    P-Value [Acc > NIR] : 6.453e-08       
                                          
                  Kappa : 0.702           
                                          
 Mcnemar's Test P-Value : 0.0001343       
                                          
            Sensitivity : 0.9998          
            Specificity : 0.6020          
         Pos Pred Value : 0.9993          
         Neg Pred Value : 0.8429          
             Prevalence : 0.9983          
         Detection Rate : 0.9981          
   Detection Prevalence : 0.9988          
      Balanced Accuracy : 0.8009          
                                          
       'Positive' Class : 0               
                                          
Code
# Decision Tree Confusion Matrix
confusionMatrix(tree_pred, dataTest$Class)
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 56847    32
         1    16    66
                                          
               Accuracy : 0.9992          
                 95% CI : (0.9989, 0.9994)
    No Information Rate : 0.9983          
    P-Value [Acc > NIR] : 1.581e-08       
                                          
                  Kappa : 0.7329          
                                          
 Mcnemar's Test P-Value : 0.03038         
                                          
            Sensitivity : 0.9997          
            Specificity : 0.6735          
         Pos Pred Value : 0.9994          
         Neg Pred Value : 0.8049          
             Prevalence : 0.9983          
         Detection Rate : 0.9980          
   Detection Prevalence : 0.9986          
      Balanced Accuracy : 0.8366          
                                          
       'Positive' Class : 0               
                                          
Code
#Benchmarking Training and Prediction Time
benchmark_results = data.frame(
  Model = c("Logistic Regression", "Decision Tree"),
  Training_Time = c(time_logistic_train[3], time_tree_train[3]),
  Prediction_Time = c(time_logistic_pred[3], time_tree_pred[3])
)

print(benchmark_results)
                Model Training_Time Prediction_Time
1 Logistic Regression         59.20            1.89
2       Decision Tree         76.94            1.55

The performance of both the logistic regression and the decision tree looks good, with accuracy above 99%; note, however, that the no-information rate is already 99.83%, so accuracy alone is not very informative on a data set this imbalanced. Looking at the timing benchmarks, both models trained within about a minute and took only a couple of seconds to predict the test set of roughly 57,000 rows. As mentioned above, the random forest trained much longer, on the order of hours, so we chose not to continue with it at this time.

Code
#ROC and AUC Curves
#ROC

roc_logistic = roc(dataTest$Class, logistic_probs)
roc_tree = roc(dataTest$Class, tree_probs)

plot(roc_logistic, col = "red", main = "ROC Curves", lwd = 2)
lines(roc_tree, col = "blue", lwd = 2)
legend("bottomright", legend = c("Logistic Regression", "Decision Tree"),
       col = c("red", "blue"), lwd = 2)

Code
#AUC
auc_logistic = auc(roc_logistic)
auc_tree = auc(roc_tree)

print(paste("AUC for Logistic Regression:", auc_logistic))
[1] "AUC for Logistic Regression: 0.974355477379035"
Code
print(paste("AUC for Decision Tree:", auc_tree))
[1] "AUC for Decision Tree: 0.836573906420983"
Code
coefficients <- tidy(logistic_model$finalModel)
coefficients <- coefficients[coefficients$term != "(Intercept)", ]  # Remove intercept for better visualization
ggplot(coefficients, aes(x = reorder(term, estimate), y = estimate)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Logistic Regression Coefficients", x = "Features", y = "Coefficient") +
  theme_minimal()

Code
rpart.plot(tree_model$finalModel)

Code
rules1<- rpart.rules(tree_model$finalModel)
rules1 %>% kable() 
node  .outcome  rule
   2      0.00  when V17 >= -2.7
   6      0.07  when V17 <  -2.7 & V12 >= -2.2
   7      0.82  when V17 <  -2.7 & V12 <  -2.2

To further understand the predictive quality of both models, we computed ROC curves, which plot sensitivity against specificity, and found, interestingly, that the logistic regression performed much better across the board than the decision tree (AUC of 0.97 versus 0.84). It is likely that the tree overfit, which diminishes its predictive quality: while its accuracy is still high on this test set, that does not mean it would stay high on other unseen data.

Therefore, in the current case, we would choose the logistic regression over the single decision tree, although, as noted, an ensemble of trees would likely outperform the logistic regression, even on a smaller data set.
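
Because a 0.5 cutoff is rarely appropriate for data this imbalanced, a natural next step would be to read an operating threshold off the ROC curve; the snippet below is a minimal sketch using the roc_logistic object computed above and Youden's J.

Code
# Threshold maximizing sensitivity + specificity - 1 on the logistic ROC curve
best_cut = coords(roc_logistic, x = "best", best.method = "youden",
                  ret = c("threshold", "sensitivity", "specificity"))
best_cut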

Conclusion and Summary Essay

Note: Both datasets responded well to decision tree and logistic regression modeling, as expected.

On the lending loan dataset, we report results for three models along with their performance metrics: accuracy, sensitivity, and specificity.

Sensitivity (True Positive Rate): the proportion of loans that were actually charged off and were correctly predicted as charged off in the test dataset. Specificity (True Negative Rate): the proportion of loans that were actually fully paid and were correctly predicted as fully paid. (Note that caret reports these relative to whichever level is designated the positive class, which differs between the outputs above.)
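
As a minimal illustration of the two definitions (the counts below are placeholders, not values from our models):

Code
# Rows = actual (0 = fully paid, 1 = charged off), columns = predicted (0, 1)
cm <- matrix(c(170, 30,
                25, 75), nrow = 2, byrow = TRUE)
sens <- cm[2, 2] / sum(cm[2, ])  # TP / (TP + FN) = 75 / 100 = 0.75
spec <- cm[1, 1] / sum(cm[1, ])  # TN / (TN + FP) = 170 / 200 = 0.85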

The ensemble (random forest) model performed best when all three metrics are considered together. We would like to look further into the decision tree baseline model for this dataset, as it had a very low sensitivity even though we had rebalanced the training data with over/undersampling.

On the fraud dataset, we decided to use the logistic regression over the decision tree. We did not run ensemble methods here, which may well yield better results in the future.

Takeaways: The random forest took a significant amount of time to run, and we were unable to compute the importance measures through the randomForestExplainer package or to save the .RDA file with those measures for the test model output; we would like to revisit this in the future. We would also like to expand the time-series work on the credit card dataset with the fpp3 package, applying forecasting methods to questions such as: can we forecast credit card fraud from this dataset, and are there seasonal patterns in when frauds occur? A sketch of that follow-up is shown below.
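
A minimal sketch of that fpp3 follow-up, assuming the hourly tx_transactions counts from the EDA section; the automatic ARIMA specification and the 12-hour horizon are illustrative choices, not tuned ones.

Code
library(fpp3)

# Hourly non-fraud transaction counts as a tsibble
tx_ts <- tx_transactions %>%
  ungroup() %>%
  filter(Class == 0) %>%
  as_tsibble(index = datetime_hour)

# Fit an automatically selected ARIMA model and forecast the next 12 hours
fit <- tx_ts %>% model(arima = ARIMA(transaction_count))
fit %>% forecast(h = "12 hours") %>% autoplot(tx_ts)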

References

https://www.kaggle.com/code/krishnaraj30/xgboost-loan-defaulters-prediction
https://www.kaggle.com/code/heidarmirhajisadati/advancedtechniques-for-detecting-credit-card-fraud/notebook
https://rpubs.com/DeclanStockdale/799284
https://htmlpreview.github.io/?https://github.com/geneticsMiNIng/BlackBoxOpener/blob/master/randomForestExplainer/inst/doc/randomForestExplainer.html