Explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that raise during the analysis. Test the code with both datasets selected and compare the results.
Are the columns of your data correlated?
Are there labels in your data? Did that impact your choice of algorithm?
What are the pros and cons of each algorithm you selected?
How your choice of algorithm relates to the datasets (was your choice of algorithm impacted by the datasets you chose)?
Which result will you trust if you need to make a business decision?
Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?
How does the analysis between data sets compare?
Essay (minimum 500 word document)
Write a short essay explaining your selection of algorithms and how they relate to the data and what you are trying to do
Essay
Our goal was to understand and work with financial datasets. In line with this goal, we selected two large datasets (Credit Card Fraud and LendingClub Loan). Both classification and regression methods (supervised learning) are good approaches for our datasets since labeled samples are available.
LendingClub Loan (Fully Paid or Charged Off) Prediction (Large Volume - 300,000+): The LendingClub loan dataset is provided by LendingClub, a peer-to-peer lending firm in San Francisco. Lending to risky customers can have a higher impact on credit loss. The dataset is a good one because it has labeled data on loans that were "Fully Paid" or "Charged Off". Additionally, it is a rich source of categorical predictor variables and lends itself well to exploratory data analysis as well as machine learning algorithms such as logistic regression, random forest, and decision tree models. (Dataset: https://www.kaggle.com/datasets/jeandedieunyandwi/lending-club-dataset)
Credit Card Fraud (Fraud/Not-Fraud) Prediction (Medium Volume - 200,000+): Fraud activity and customer behavior change rapidly, causing non-stationarity in the transaction data. Fraud represents a small fraction of daily transactions, so the dataset is skewed towards genuine transactions, which yields a highly imbalanced dataset. Statistical methods such as logistic regression can be applied to fraud detection as a classification task; however, they are affected by the imbalance and can be biased toward predicting the majority class. We have also explored decision trees, where we can identify rules for predicting the correct class of transactions: for each rule we can measure the percentage of instances in which its condition applies, and its accuracy or confidence, i.e. how often it predicts the correct class among those instances. We highlight our findings below through exploratory analysis, selection of models to train, and testing on a subset of the data. (Dataset: https://data.world/vlad/credit-card-fraud-detection; reference thesis: https://di.ulb.ac.be/map/adalpozz/pdf/Dalpozzolo2015PhD.pdf)
Comparison between the two datasets: While both datasets have a large number of records, we selected the LendingClub loan dataset as our large dataset for EDA and ML analysis and the Credit Card Fraud data as our medium dataset. The two datasets differ in that the credit card fraud data is time-series data while the LendingClub loan data is not; we therefore found the fraud dataset to be higher in complexity. We were able to train on a large sample (227,000+ records) and then apply the model to the test dataset.
The LendingClub dataset, on the other hand, has far more categorical variables than the credit card fraud dataset; we dive into this further in the exploratory data analysis. Additionally, due to the size of the dataset (300,000+ records), we subset the training data to loans with an issue date after 2015-01-01. Training on the larger number of samples was challenging, so we decided to subset the data to roughly 100,000 records and then split it into train/test sets. On this dataset we dealt with missing values, decisions to drop certain variables, and correlated variables; during model training we also hit train/test split issues and had to ensure all factor levels were present in both the train and test datasets.
Large Data Set (End to End ML Analysis)
Data set Introduction
The LendingClub loan data set consists of about 396,030 rows and 27 columns. Even though this dataset is large, we carved out a subset of records with Time since first credit line > 0 (a new engineered variable calculated by subtracting the date of the earliest credit line from the issue date of the loan) and Issue date > 2015-01-01, to model a smaller population.
Data Exploration & Plots
Several variables are character types, and we will convert them to factors before the train/test split. The dataset contains missing values for 3 of the variables. We will also consider dropping certain variables such as address. Imputation of the missing values is handled in the data pre-processing step below. Because of the data volume, we converted the raw file to a parquet file and stored it on GitHub.
The correlation plot reveals that some variables are inter-correlated, which is not ideal and can cause unreliable regression estimates. Possible remedies are combining variables or dropping some of them; we consider these later on. For instance, total_acc and mort_acc are positively correlated below, and loan_amnt and installment also have a strong positive correlation.
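As a possible follow-up to the correlation findings above, highly correlated predictors can be flagged automatically; a minimal sketch with caret::findCorrelation, assuming the numeric-only frame numeric_data built in the next chunk and an illustrative 0.75 cutoff.
Code
# Sketch only (not part of the original analysis): flag predictors whose
# pairwise correlation exceeds the cutoff and drop them from a reduced frame.
library(caret)

M <- cor(numeric_data, use = "complete.obs")
high_cor <- findCorrelation(M, cutoff = 0.75, names = TRUE)  # e.g. installment (r = 0.96 with loan_amnt)
high_cor
reduced_data <- numeric_data[, setdiff(names(numeric_data), high_cor)]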
Code
# Select numeric columns only
numeric_data <- inputf1[sapply(inputf1, is.numeric)]
M <- cor(numeric_data, use = "complete.obs")
# M %>% kable() %>%
#   kable_styling()
ggcorrplot(M, type = "upper", outline.color = "white",
           ggtheme = theme_classic,
           # colors = c("#6D9EC1", "white", "#E46726"),
           lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)
Code
# Calculate the correlation matrix
correlation_matrix <- cor(numeric_data, use = "complete.obs")
kable(correlation_matrix)
|                      | loan_amnt | int_rate | installment | annual_inc | dti | open_acc | pub_rec | revol_bal | revol_util | total_acc | mort_acc | pub_rec_bankruptcies |
|----------------------|-----------|----------|-------------|------------|-----|----------|---------|-----------|------------|-----------|----------|----------------------|
| loan_amnt            | 1.0000000 | 0.1467352 | 0.9552040 | 0.3424394 | 0.0080215 | 0.1898158 | -0.0887822 | 0.3273355 | 0.0984388 | 0.2137557 | 0.2223815 | -0.1193732 |
| int_rate             | 0.1467352 | 1.0000000 | 0.1401521 | -0.0714182 | 0.0716342 | -0.0037783 | 0.0516314 | -0.0229408 | 0.2733073 | -0.0485737 | -0.0826559 | 0.0485709 |
| installment          | 0.9552040 | 0.1401521 | 1.0000000 | 0.3355009 | 0.0059081 | 0.1770997 | -0.0801857 | 0.3148570 | 0.1206656 | 0.1909353 | 0.1937519 | -0.1126934 |
| annual_inc           | 0.3424394 | -0.0714182 | 0.3355009 | 1.0000000 | -0.0839820 | 0.1327355 | -0.0161510 | 0.3016988 | 0.0278276 | 0.1877229 | 0.2362765 | -0.0550641 |
| dti                  | 0.0080215 | 0.0716342 | 0.0059081 | -0.0839820 | 1.0000000 | 0.1253677 | -0.0234581 | 0.0571616 | 0.0795856 | 0.0934558 | -0.0254013 | -0.0201805 |
| open_acc             | 0.1898158 | -0.0037783 | 0.1770997 | 0.1327355 | 0.1253677 | 1.0000000 | -0.0297351 | 0.2144474 | -0.1446031 | 0.6777667 | 0.1094396 | -0.0393651 |
| pub_rec              | -0.0887822 | 0.0516314 | -0.0801857 | -0.0161510 | -0.0234581 | -0.0297351 | 1.0000000 | -0.1069308 | -0.0893496 | 0.0143639 | 0.0115758 | 0.6946356 |
| revol_bal            | 0.3273355 | -0.0229408 | 0.3148570 | 0.3016988 | 0.0571616 | 0.2144474 | -0.1069308 | 1.0000000 | 0.2204292 | 0.1808765 | 0.1950629 | -0.1311621 |
| revol_util           | 0.0984388 | 0.2733073 | 0.1206656 | 0.0278276 | 0.0795856 | -0.1446031 | -0.0893496 | 0.2204292 | 1.0000000 | -0.1141447 | 0.0075141 | -0.1020037 |
| total_acc            | 0.2137557 | -0.0485737 | 0.1909353 | 0.1877229 | 0.0934558 | 0.6777667 | 0.0143639 | 0.1808765 | -0.1141447 | 1.0000000 | 0.3812052 | 0.0384642 |
| mort_acc             | 0.2223815 | -0.0826559 | 0.1937519 | 0.2362765 | -0.0254013 | 0.1094396 | 0.0115758 | 0.1950629 | 0.0075141 | 0.3812052 | 1.0000000 | 0.0272727 |
| pub_rec_bankruptcies | -0.1193732 | 0.0485709 | -0.1126934 | -0.0550641 | -0.0201805 | -0.0393651 | 0.6946356 | -0.1311621 | -0.1020037 | 0.0384642 | 0.0272727 | 1.0000000 |
Code
corrplot(correlation_matrix, method="circle")
Review Distributions
We review distributions of factor variables across loan statuses, starting with loan grades. Fully paid loans sit at lower interest rates, while charged off loans have a more even distribution, tending towards mid-tier interest rates.
Some additional EDA plots below show broader patterns within the data. Loan grades E, F, and G carry higher interest rates, and grades D through G also show a high number of outliers. The density plots show that few variables follow a normal bell curve, and several distributions are right-skewed. Loan amount has a positive skew of 0.78, and the median loan amount is about $12,000. There are a good number of outliers at higher loan amounts that do not seem to be verified. Additionally, up to 59.2% of records are of type debt consolidation.
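The skewness figure quoted above can be checked directly; a small sketch with e1071::skewness (the exact value depends on the estimator type used), assuming inputf1 is the loaded loan data.
Code
# Sketch only: reproduce the right-skew check on loan_amnt mentioned above.
library(e1071)

skewness(inputf1$loan_amnt, na.rm = TRUE)  # positive value indicates right skew
median(inputf1$loan_amnt, na.rm = TRUE)    # reported above as about $12,000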
Code
# Loan amount distribution
ggplot(data = inputf1, aes(loan_amnt, fill = loan_status)) +
  geom_histogram(bins = 40, color = "blue")
Data Preparation (Imputation and Feature Engineering)
Since mort_acc has the majority of its values missing and is positively correlated with total_acc, we dropped this variable rather than imputing it. The address, title, and emp_title variables had many levels and caused missing-factor-level issues between the train and test splits, so we dropped those as well. Two other variables were imputed with their median and mean, respectively, since each had only about 500 missing values.
Additionally, our randomForest did not run successfully on the large training set with an 80/20 split, so we subset the data to loans issued after 2015, which leaves 110,647 records.
We also engineered one additional feature, "time since first credit line", to see whether it has any impact on the response variable.
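A condensed sketch of the preparation steps described above (dropping columns, median/mean imputation, and deriving time_since_fcline); this is a simplified outline rather than the exact project pipeline.
Code
# Sketch of the data-preparation steps described above (not the full pipeline).
library(dplyr)
library(lubridate)

inputf1_new <- inputf1 %>%
  dplyr::select(-c(mort_acc, address, title, emp_title)) %>%        # drop mostly-missing / high-cardinality columns
  mutate(
    pub_rec_bankruptcies = ifelse(is.na(pub_rec_bankruptcies),
                                  median(pub_rec_bankruptcies, na.rm = TRUE),
                                  pub_rec_bankruptcies),            # median imputation
    revol_util = ifelse(is.na(revol_util),
                        mean(revol_util, na.rm = TRUE), revol_util),# mean imputation
    issue_d           = parse_date_time(paste0(issue_d, "-01"), "myd"),
    earliest_cr_line  = parse_date_time(paste0(earliest_cr_line, "-01"), "myd"),
    time_since_fcline = as.numeric(issue_d - earliest_cr_line)      # engineered feature
  ) %>%
  filter(time_since_fcline > 0, issue_d > as.POSIXct("2015-01-01", tz = "UTC"))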
# A tibble: 2 × 2
loan_status Number
<chr> <int>
1 Charged Off 24211
2 Fully Paid 86436
Code
ggplot(data = inputf1_new2, aes(loan_status)) +
  geom_bar(position = "dodge") +
  labs(x = "Loan Status",
       title = "Distribution of Loan Status on our sample population of issue date after 2015-01-01") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(data = subset(inputf1_new2, (!is.na(home_ownership) & (!home_ownership %in% c("ANY", "NONE", "OTHER")))),
       aes(home_ownership, fill = grade, color = grade)) +
  geom_bar() +
  labs(title = "Distribution of Home Ownership by Loan Grade")

ggplot(data = subset(inputf1_new2, !is.na(purpose)), aes(purpose, fill = loan_status, color = loan_status)) +
  geom_bar() +
  labs(title = "Distribution of Loan purpose") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Code
ggplot(data = subset(inputf1_new2, !is.na(verification_status)),
       aes(verification_status, fill = loan_status, color = loan_status)) +
  geom_bar(position = "fill") +
  labs(title = "Distribution of verification status by Loan Status") +
  scale_y_continuous(labels = percent_format())
We use scatter plots to look for trends between seemingly related numeric variables.
The plot below suggests that 60-month loans tend to carry higher interest rates and larger loan amounts (the top-right corner is dominated by blue points).
Code
set.seed(2024)
# dim(inputf1_new2)
# p <- table(inputf1_new2$loan_status, inputf1_new2$int_rate)
# ggplot(inputf1_new2, aes(x = int_rate)) + geom_histogram() + facet_wrap(~loan_status, ncol = 1)
# Distribution of loan status and grade
# table(inputf1_new2$loan_status, inputf1_new2$grade)
# ggplot(inputf1_new2, aes(x = int_rate)) + geom_histogram(aes(fill = grade)) + facet_wrap(~loan_status, ncol = 1)
# Distribution of loan status and term
table(inputf1_new2$loan_status, inputf1_new2$term)

index = createDataPartition(y = inputf1_new2$loan_status, p = 0.90)[[1]]
loans.sample <- inputf1_new2[-index,]
ggplot(loans.sample, aes(x = loan_amnt, y = int_rate)) +
  geom_point(aes(color = term))
Algorithm Selection/Build Models
Decision Tree Model and metrics review
For model selection, we partitioned the 110,647 records into a 70/30 train/test split. The accuracy of our basic decision tree model is 0.7835.
We show an example of a pruned tree below. Overly complex trees have high variance; the complexity parameter (cp) sets the minimum improvement in fit required to justify a split, so we grew the tree with a small cp and then pruned it at the cp associated with the minimum cross-validated error (0.001435764). We have also printed the decision tree rules generated by the model for the pruned instance of the tree.
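A minimal sketch of how that pruning cp can be chosen from the cross-validated error table of the tree fit in the chunk below (loans.rpart.1); the exact selection logic used in the project may differ slightly.
Code
# Sketch: pick the cp at the minimum cross-validated error and prune with it.
cp_table <- as.data.frame(loans.rpart.1$cptable)
best_cp  <- cp_table$CP[which.min(cp_table$xerror)]   # cp at minimum xerror
loan_prune <- rpart::prune(loans.rpart.1, cp = best_cp)
rpart.plot::rpart.plot(loan_prune)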
The sensitivity (true positive rate) and specificity (true negative rate) of the baseline model are given below. Sensitivity evaluates a model's ability to predict true positives of each available category; specificity evaluates its ability to predict true negatives. The sensitivity of this model is very low, which implies that it did not classify the "Charged Off" loans accurately. Sensitivity: 0.10973, Specificity: 0.97219.
Since our training data is highly imbalanced, we cannot rely on the accuracy measure alone and use the ROC curve instead. Area under the curve (AUC): 0.690. We can try over sampling, under sampling, or a combination of both to balance the classes before training and avoid class imbalance issues.
For reference, the classification categories of the confusion matrix are:
True Positive (TP) – an instance that is positive and is correctly classified as positive
True Negative (TN) – an instance that is negative and is correctly classified as negative
False Positive (FP) – an instance that is negative but is wrongly classified as positive
False Negative (FN) – an instance that is positive but is wrongly classified as negative
Code
#install.packages("RGtk2")library("rattle")library(rpart.plot)library(rpart)#install.packages("vip")library(vip)set.seed(2024)index =createDataPartition(y = inputf1_new2$loan_status, p =0.7)[[1]]loans.test <- inputf1_new2[-index,]loans.train <- inputf1_new2[index,]loans.rpart.1<-rpart(loan_status ~ . , data = loans.train, control=rpart.control(minsplit=10, minbucket =3, cp=0.0006))fancyRpartPlot(loans.rpart.1)
Code
predictions.1 <- predict(loans.rpart.1, loans.test, type = "class")
p1 <- confusionMatrix(predictions.1, as.factor(loans.test$loan_status))
roc.curve(loans.test$loan_status, predict(loans.rpart.1, loans.test, type = "prob")[,1], plot = TRUE)
Area under the curve (AUC): 0.690
Code
p1
Confusion Matrix and Statistics
Reference
Prediction Charged Off Fully Paid
Charged Off 797 721
Fully Paid 6466 25209
Accuracy : 0.7835
95% CI : (0.779, 0.7879)
No Information Rate : 0.7812
P-Value [Acc > NIR] : 0.1581
Kappa : 0.1145
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.10973
Specificity : 0.97219
Pos Pred Value : 0.52503
Neg Pred Value : 0.79586
Prevalence : 0.21881
Detection Rate : 0.02401
Detection Prevalence : 0.04573
Balanced Accuracy : 0.54096
'Positive' Class : Charged Off
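The reported metrics can be verified by hand from the four cells of the confusion matrix above (positive class = Charged Off).
Code
# Verify the reported metrics from the confusion-matrix cells above.
TP <- 797    # Charged Off predicted as Charged Off
FN <- 6466   # Charged Off predicted as Fully Paid
FP <- 721    # Fully Paid predicted as Charged Off
TN <- 25209  # Fully Paid predicted as Fully Paid

TP / (TP + FN)                    # sensitivity ~ 0.1097
TN / (TN + FP)                    # specificity ~ 0.9722
(TP + TN) / (TP + TN + FP + FN)   # accuracy    ~ 0.7835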
Code
# rpart.plot(loans.rpart.1, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto")
rules <- rpart.rules(loans.rpart.1)
head(rules, 4) %>% kable()
Decision tree rules (first four rows of the rpart.rules output, flattened in rendering; each shows the node, the fitted loan_status value, and the rule's conditions):
Node 17466 (0.00): when sub_grade is E2 or E3 or E4 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G5 & issue_d < 1.4e+09 & dti is 20 to 30 & home_ownership is MORTGAGE or OWN & emp_length is or 10+ years or 2 years or 4 years or 5 years or 6 years or 7 years or 9 years & annual_inc < 142500 & revol_bal is 21644 to 22462 & purpose is debt_consolidation or home_improvement or house or major_purchase or medical or small_business & total_acc < 35
Node 1090 (0.14): when sub_grade is E2 or E3 or E4 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G5 & issue_d < 1.4e+09 & dti >= 20 & home_ownership is MORTGAGE or OWN & emp_length is or 10+ years or 2 years or 4 years or 5 years or 6 years or 7 years or 9 years & annual_inc >= 142500 & revol_bal < 22462
Node 1060 (0.20): when sub_grade is E3 or E5 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G4 or G5 & issue_d < 1.5e+09 & dti < 26 & home_ownership is RENT & emp_length is or < 1 year or 1 year or 2 years or 3 years or 4 years & annual_inc < 96472 & purpose is car or credit_card or home_improvement or moving & loan_amnt < 14413
Node 2154 (0.24): when sub_grade is D3 or D4 or D5 or E1 or E2 or E4 or F1 & issue_d < 1.4e+09 & dti < 26 & home_ownership is RENT & emp_length is < 1 year or 1 year or 10+ years or 5 years or 7 years or 9 years & installment >= 145 & revol_util < 91 & time_since_fcline < 1705
Code
# Create a variable importance plot
var_importance <- vip::vip(loans.rpart.1, num_features = 30)
print(var_importance)
Pruned tree rules (rpart.rules output, flattened in rendering; the node and fitted value of the first rule were lost in extraction):
(node/value truncated): sub_grade is E3 or E5 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G4 or G5 & issue_d < 1.5e+09 & home_ownership is RENT & dti < 26 & annual_inc < 96472 & loan_amnt >= 14413
Node 32 (0.44): when sub_grade is D3 or D4 or D5 or E1 or E2 or E3 or E4 or E5 or F1 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G4 or G5 & issue_d < 1.5e+09 & home_ownership is RENT & dti >= 26
Node 265 (0.53): when sub_grade is E3 or E5 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G4 or G5 & issue_d < 1.5e+09 & home_ownership is RENT & dti < 26 & annual_inc < 96472 & loan_amnt < 14413
Node 67 (0.60): when sub_grade is D3 or D4 or D5 or E1 or E2 or E4 or F1 & issue_d < 1.5e+09 & home_ownership is RENT & dti < 26
Node 133 (0.61): when sub_grade is E3 or E5 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G4 or G5 & issue_d < 1.5e+09 & home_ownership is RENT & dti < 26 & annual_inc >= 96472
Node 17 (0.64): when sub_grade is D3 or D4 or D5 or E1 or E2 or E3 or E4 or E5 or F1 or F2 or F3 or F4 or F5 or G1 or G2 or G3 or G4 or G5 & issue_d < 1.5e+09 & home_ownership is MORTGAGE or OWN
Oversampling and fixing class imbalance and rerunning decision tree model.
We tried over sampling, under sampling, and the combined "Both" option on the training data to address the class imbalance. The resulting model jumped in sensitivity, but it misclassified a large number of records as charged off when they were fully paid. Understanding why this happens is a follow-up to this project.
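A minimal sketch of the rebalancing step described above, using ROSE::ovun.sample with method = "both" and refitting the tree through caret; the sample size and cp shown here are illustrative rather than the tuned values. The resulting confusion matrix follows.
Code
# Sketch: rebalance the training split with combined over/under sampling
# (ROSE), then refit the decision tree via caret cross-validation.
library(ROSE)
library(caret)
set.seed(2024)

loans.oversample <- ovun.sample(loan_status ~ ., data = loans.train,
                                method = "both", N = nrow(loans.train), seed = 13)$data
table(loans.oversample$loan_status)   # classes now roughly balanced

tr_control <- trainControl(method = "cv", number = 10)
loans.over <- train(loan_status ~ ., data = loans.oversample,
                    method = "rpart", trControl = tr_control,
                    tuneGrid = data.frame(cp = 0.001))
confusionMatrix(predict(loans.over, loans.test), as.factor(loans.test$loan_status))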
Confusion Matrix and Statistics
Reference
Prediction Charged Off Fully Paid
Charged Off 5458 10640
Fully Paid 1805 15290
Accuracy : 0.6251
95% CI : (0.6198, 0.6303)
No Information Rate : 0.7812
P-Value [Acc > NIR] : 1
Kappa : 0.2373
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.7515
Specificity : 0.5897
Pos Pred Value : 0.3390
Neg Pred Value : 0.8944
Prevalence : 0.2188
Detection Rate : 0.1644
Detection Prevalence : 0.4850
Balanced Accuracy : 0.6706
'Positive' Class : Charged Off
Logistic Regression (Train Data set)
Here we ran a second, full baseline logistic regression model, then ran the model again with stepAIC on the oversampled training data from the prior section, and compared the metrics of the two models. We hit several issues during model training because character variables had not been converted to factor type; once that was fixed, we ran the model and stored the accuracy, sensitivity, and specificity. The logistic regression with stepAIC gave the best values of all three metrics so far: accuracy 0.673135, sensitivity 0.6847482, specificity 0.6614412.
Code
loans.oversample1 <- loans.oversample %>%
  mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off', 'Default'), 1,
                               ifelse(loan_status == 'Fully Paid', 0, 'No info')))
loans.oversample1 <- loans.oversample1[, colnames(loans.oversample1)[colnames(loans.oversample1) != 'loan_status']]
loans.oversample1 <- loans.oversample1[, colnames(loans.oversample1)[colnames(loans.oversample1) != 'issue_d']]
loans.oversample1 <- loans.oversample1[, colnames(loans.oversample1)[colnames(loans.oversample1) != 'earliest_cr_line']]

factorize = function(column, df){
  #' Check if column is character and turn to factor
  if(class(df[1, column]) == "character"){
    out = as.factor(df[, column])
  } else { # if it's numeric
    out = df[, column]
  }
  return(out)
}

# str(loans.oversample)
# str(loans.oversample1)
# class(loans.oversample1[1,"term"])
store.colnames = colnames(loans.oversample1)
loans.oversample3 = lapply(store.colnames, function(column) factorize(column, loans.oversample1))
loans.oversample3 = as.data.frame(loans.oversample3)
colnames(loans.oversample3) = store.colnames

full.reg <- glm(loan_outcome ~ ., data = loans.oversample3, family = "binomial")
loans.reg <- stepAIC(full.reg, direction = "both")
Model Accuracy Sensitivity Specificity
Accuracy Decision Tree Baseline 0.7834784 0.1097343 0.9721944
Logistic Regression (Test Data set)
After evaluating on the test set, our metrics were similar to those on the training data. Accuracy: 0.6645, Sensitivity: 0.6576,
Specificity: 0.6891, AUC: 0.7358227.
Code
library(ROCR)
set.seed(2024)
loans.test4 <- as.data.frame(loans.test)
barplot(table(loans.test4$loan_status), col = 'lightblue')
Code
table(loans.test4$loan_status)
Charged Off Fully Paid
7263 25930
Code
# Use under and oversampling
# loans.oversample.test1 <- ovun.sample(loan_status ~ ., data = loans.test4, method = "both", N = 33193, seed = 13)$data
# barplot(table(loans.oversample.test1$loan_status), col = 'lightblue')
loans.test1 <- loans.test4 %>%
  mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off'), 1,
                               ifelse(loan_status == 'Fully Paid', 0, 'none')))
barplot(table(loans.test1$loan_outcome), col = 'lightblue')

# Make predictions and compute prediction accuracy for the full model
probabilities <- predict(full.reg, loans.test3, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
# Prediction accuracy
observed.classes <- loans.test3$loan_outcome
mean(predicted.classes == observed.classes)
[1] 0.6645678
Code
# Make predictions and compute prediction accuracy for the stepwise model
probabilities <- predict(loans.reg, loans.test3, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
# Prediction accuracy
observed.classes <- loans.test3$loan_outcome
mean(predicted.classes == observed.classes)
[1] 0.6644775
Code
get_logistic_error(full.reg, data = loans.test3, res ="loan_outcome", pos =1, neg =0, cut =0.5)
[1] 0.3354322
Code
get_logistic_error(loans.reg, data = loans.test3, res ="loan_outcome", pos =1, neg =0, cut =0.5)
[1] 0.3355225
Code
# A good model will have a high AUC, that is, as often as possible a high sensitivity and specificity.
test_prob = predict(loans.reg, newdata = loans.test3, type = "response")
test_roc = roc(loans.test3$loan_outcome ~ test_prob, plot = TRUE, print.auc = TRUE)
Code
as.numeric(test_roc$auc)
[1] 0.7358227
Random Forest Model
We implemented the random forest model; however, we were not able to use some of the functionality of the randomForestExplainer package, so we display the standard importance measures instead.
The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case). The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
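These two measures correspond to fitting randomForest with importance = TRUE; a minimal sketch, reusing the model name from the prediction code below but not the exact fitting call used in the project (ntree is illustrative).
Code
# Sketch: fit a random forest with both importance measures enabled and inspect them;
# assumes the factorized, rebalanced training frame `loans.oversample3`
# with the binary response `loan_outcome`.
library(randomForest)
set.seed(2024)

seat_forest <- randomForest(as.factor(loan_outcome) ~ ., data = loans.oversample3,
                            ntree = 200, importance = TRUE)
importance(seat_forest)   # MeanDecreaseAccuracy (permutation) and MeanDecreaseGini
varImpPlot(seat_forest)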
common <- intersect(names(loans.oversample3), names(loans.test3))
for (p in common) {
  if (class(loans.oversample3[[p]]) == "factor") {
    levels(loans.test3[[p]]) <- levels(loans.oversample3[[p]])
  }
}
seat_forest_tst_perd = predict(seat_forest, loans.test3)
table(predicted = seat_forest_tst_perd, actual = loans.test3$loan_outcome)
Model Accuracy Sensitivity Specificity
Accuracy loans.reg Logistic with stepAIC 0.6731350 0.6847482 0.6614412
Accuracy1 Decision Tree Baseline 0.7834784 0.1097343 0.9721944
Accuracy2 Random forest 0.7253939 0.7773236 0.5399972
Medium Data Set (End to End ML Analysis)
Data set Introduction
The medium data set consists of about 284,000 card transactions labelled as non-fraud or fraud. It is a real data set from a European financial institution, which is why the features are masked: they are the result of extensive PCA. It is also a highly imbalanced data set, as there are several orders of magnitude more non-fraud than fraud transactions.
Time V1 V2 V3
Min. : 0 Min. :-56.40751 Min. :-72.71573 Min. :-48.3256
1st Qu.: 54202 1st Qu.: -0.92037 1st Qu.: -0.59855 1st Qu.: -0.8904
Median : 84692 Median : 0.01811 Median : 0.06549 Median : 0.1799
Mean : 94814 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
3rd Qu.:139321 3rd Qu.: 1.31564 3rd Qu.: 0.80372 3rd Qu.: 1.0272
Max. :172792 Max. : 2.45493 Max. : 22.05773 Max. : 9.3826
V4 V5 V6 V7
Min. :-5.68317 Min. :-113.74331 Min. :-26.1605 Min. :-43.5572
1st Qu.:-0.84864 1st Qu.: -0.69160 1st Qu.: -0.7683 1st Qu.: -0.5541
Median :-0.01985 Median : -0.05434 Median : -0.2742 Median : 0.0401
Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.74334 3rd Qu.: 0.61193 3rd Qu.: 0.3986 3rd Qu.: 0.5704
Max. :16.87534 Max. : 34.80167 Max. : 73.3016 Max. :120.5895
V8 V9 V10 V11
Min. :-73.21672 Min. :-13.43407 Min. :-24.58826 Min. :-4.79747
1st Qu.: -0.20863 1st Qu.: -0.64310 1st Qu.: -0.53543 1st Qu.:-0.76249
Median : 0.02236 Median : -0.05143 Median : -0.09292 Median :-0.03276
Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
3rd Qu.: 0.32735 3rd Qu.: 0.59714 3rd Qu.: 0.45392 3rd Qu.: 0.73959
Max. : 20.00721 Max. : 15.59500 Max. : 23.74514 Max. :12.01891
V12 V13 V14 V15
Min. :-18.6837 Min. :-5.79188 Min. :-19.2143 Min. :-4.49894
1st Qu.: -0.4056 1st Qu.:-0.64854 1st Qu.: -0.4256 1st Qu.:-0.58288
Median : 0.1400 Median :-0.01357 Median : 0.0506 Median : 0.04807
Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
3rd Qu.: 0.6182 3rd Qu.: 0.66251 3rd Qu.: 0.4931 3rd Qu.: 0.64882
Max. : 7.8484 Max. : 7.12688 Max. : 10.5268 Max. : 8.87774
V16 V17 V18
Min. :-14.12985 Min. :-25.16280 Min. :-9.498746
1st Qu.: -0.46804 1st Qu.: -0.48375 1st Qu.:-0.498850
Median : 0.06641 Median : -0.06568 Median :-0.003636
Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
3rd Qu.: 0.52330 3rd Qu.: 0.39968 3rd Qu.: 0.500807
Max. : 17.31511 Max. : 9.25353 Max. : 5.041069
V19 V20 V21
Min. :-7.213527 Min. :-54.49772 Min. :-34.83038
1st Qu.:-0.456299 1st Qu.: -0.21172 1st Qu.: -0.22839
Median : 0.003735 Median : -0.06248 Median : -0.02945
Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
3rd Qu.: 0.458949 3rd Qu.: 0.13304 3rd Qu.: 0.18638
Max. : 5.591971 Max. : 39.42090 Max. : 27.20284
V22 V23 V24
Min. :-10.933144 Min. :-44.80774 Min. :-2.83663
1st Qu.: -0.542350 1st Qu.: -0.16185 1st Qu.:-0.35459
Median : 0.006782 Median : -0.01119 Median : 0.04098
Mean : 0.000000 Mean : 0.00000 Mean : 0.00000
3rd Qu.: 0.528554 3rd Qu.: 0.14764 3rd Qu.: 0.43953
Max. : 10.503090 Max. : 22.52841 Max. : 4.58455
V25 V26 V27
Min. :-10.29540 Min. :-2.60455 Min. :-22.565679
1st Qu.: -0.31715 1st Qu.:-0.32698 1st Qu.: -0.070840
Median : 0.01659 Median :-0.05214 Median : 0.001342
Mean : 0.00000 Mean : 0.00000 Mean : 0.000000
3rd Qu.: 0.35072 3rd Qu.: 0.24095 3rd Qu.: 0.091045
Max. : 7.51959 Max. : 3.51735 Max. : 31.612198
V28 Amount Class
Min. :-15.43008 Min. : 0.00 Min. :0.000000
1st Qu.: -0.05296 1st Qu.: 5.60 1st Qu.:0.000000
Median : 0.01124 Median : 22.00 Median :0.000000
Mean : 0.00000 Mean : 88.35 Mean :0.001728
3rd Qu.: 0.07828 3rd Qu.: 77.17 3rd Qu.:0.000000
Max. : 33.84781 Max. :25691.16 Max. :1.000000
Code
tx_raw$Class = as.factor(tx_raw$Class)  # Convert Class column to factor
tx_raw = tx_raw %>%
  mutate(datetime = as.POSIXct("2024-01-01 00:00:00", tz = "UTC") + seconds(Time))  # Make new column that shows datetime
ggplot(tx_raw, aes(x = Amount, fill = Class)) +
  geom_histogram(position = "dodge", bins = 60) +
  labs(title = "Histogram of Amounts by Class (< 500 USD)", x = "Amount (USD)", y = "Frequency") +
  theme_minimal() +
  scale_fill_manual(values = c('grey', 'green')) +
  xlim(0, 500)
Code
tx_1 = tx_raw %>% filter(Class == 1)
ggplot(tx_1, aes(x = Amount)) +
  geom_histogram(position = "dodge", bins = 60) +
  labs(title = "Histogram of Amounts for Class Fraud", x = "Amount (USD)", y = "Frequency") +
  theme_minimal()
Code
# Outlier plot
ggplot(tx_1, aes(x = Amount)) +
  geom_boxplot() +
  labs(title = "Box plot of Amounts for Class Fraud", x = "Amount (USD)") +
  theme_minimal()
Code
# Scatterplot
tx_1 %>%
  ggplot(aes(x = Time, y = Amount)) +
  geom_point() +
  labs(y = "Amount ($)", x = "Time (s)", title = "Fraudulent Transactions Across Time")
Code
# Correlation Heatmap
tx_raw_numeric = tx_raw %>% dplyr::select(!c(Class, datetime))
cor_matrix = cor(tx_raw_numeric)
cor_matrix = melt(cor_matrix)
ggplot(data = cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limit = c(-1, 1), space = "Lab", name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1)) +
  coord_fixed() +
  labs(title = "Correlation Heatmap", x = "Variable", y = "Variable")
Code
#----Time-Series for Transactions----
tx_transactions <- tx_raw %>%
  mutate(datetime_hour = floor_date(datetime, "hour")) %>%
  group_by(datetime_hour, Class) %>%
  summarise(transaction_count = n())

tx_trans_1 <- ggplot(tx_transactions, aes(x = datetime_hour, y = transaction_count, color = as.factor(Class))) +
  geom_line() +
  theme_minimal() +
  labs(title = 'Fraud Txs', y = "Number of Transactions", x = "Time (Hourly)") +
  scale_y_continuous(limits = c(0, 50)) +
  theme(legend.position = "none") +
  annotate("text", x = max(tx_transactions$datetime_hour), y = 45,
           label = expression(rho[1] == -0.226), hjust = 1)

tx_trans_0 <- ggplot(tx_transactions, aes(x = datetime_hour, y = transaction_count, color = as.factor(Class))) +
  geom_line() +
  theme_minimal() +
  labs(title = 'Non-Fraud Txs', x = NULL, y = NULL, color = "Class") +
  scale_y_continuous(limits = c(1000, max(tx_transactions$transaction_count))) +
  theme(legend.position = "none") +
  annotate("text", x = max(tx_transactions$datetime_hour),
           y = max(tx_transactions$transaction_count) - 50,
           label = expression(rho[1] == 0.918), hjust = 1)

# Combine the two plots
tx_transactions_plot <- (tx_trans_0 / tx_trans_1) + plot_layout(heights = c(2, 1))
print(tx_transactions_plot)
Autocorrelations of series 'tx_nofraud', by lag
0 1 2 3
1.000 0.918 0.747 0.536
When investigating the EDA plots above, one thing becomes clear: the data set is heavily imbalanced. As discussed in the introduction, this is unsurprising given the nature of non-fraud versus fraud transactions; however, it is an important consideration when selecting the models to run. Weak learners will likely not perform as well as ensemble methods.
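The degree of imbalance can be quantified directly from the Class column; a quick sketch:
Code
# Quantify the class imbalance noted above (Class: 0 = non-fraud, 1 = fraud).
table(tx_raw$Class)
prop.table(table(tx_raw$Class))   # fraud is roughly 0.17% of all transactions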
Additionally, there are a few more interesting observations. The correlation matrix shows no strong correlations among the features. This matters for several machine learning algorithms, and given that this data set has already undergone feature engineering and PCA, it is unsurprising that this is the case. Nevertheless, this plot should be part of any machine learning implementation.
Looking at the time-series graph of hourly transaction counts over the span of the data set, the cyclic nature of the non-fraud transactions is very apparent. It is absent from the fraud transactions, which occur mostly at random. This can also be observed in the autocorrelations: the lag-1 autocorrelation of the non-fraud series is 0.92, which points to strong predictability of the next data point (after an increase in count, another increase follows), while the lag-1 autocorrelation of -0.23 for the fraudulent transactions points to more random behavior across the roughly two and a half days covered by the data set. This feature will surely be quite important for the algorithms during training.
Lastly, these autocorrelations can be seen in the ACF plot, which shows different lags: the strongest is lag 1, with the autocorrelation decreasing at larger lags.
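A short sketch of how these autocorrelations can be computed from the hourly counts built earlier (tx_transactions); the lag-1 values quoted above come from series like these.
Code
# Sketch: autocorrelation of hourly counts per class.
tx_nofraud <- tx_transactions %>% filter(Class == 0) %>% pull(transaction_count)
tx_fraud   <- tx_transactions %>% filter(Class == 1) %>% pull(transaction_count)

acf(tx_nofraud, lag.max = 3, plot = FALSE)   # lag-1 ~ 0.92
acf(tx_fraud,   lag.max = 3, plot = FALSE)   # lag-1 ~ -0.23
acf(tx_nofraud)                              # ACF plot with the decaying lags discussed above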
Next, we split the data 80/20 into training and test sets to prepare for the machine learning implementation. We did not hold out a separate validation set since we are not going to do hyperparameter tuning in this exercise.
Data Preparation
Code
set.seed(2024)
library(caret)
library(e1071)
library(randomForest)
library(rpart)
library(pROC)
library(ranger)

tx_raw$Class = as.factor(tx_raw$Class)
# Ensure datetime is of the correct type
tx_raw$datetime = as.POSIXct(tx_raw$datetime)

# Split the data into training and testing sets
trainIndex = createDataPartition(tx_raw$Class, p = 0.8, list = FALSE)
dataTrain = tx_raw[trainIndex, ]
dataTest = tx_raw[-trainIndex, ]

# Define CV
train_control = trainControl(method = "cv", number = 10)
Algorithm Selection
Given that this is a highly imbalanced data set, a weak learner such as a decision tree or logistic regression will likely not be very successful, so an ensemble would likely be the better choice. To test this, we run a logistic regression and a single decision tree. We also wanted to include a random forest, but on this large data set it was computationally too expensive. In a real-world setting we would certainly use some type of ensemble method, such as random forest or XGBoost.
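The chunk that fits the two models is not shown above; a hedged sketch of how logistic_model and tree_model, referenced in the prediction code below, could be trained with caret and the 10-fold CV control defined earlier (the exact formula and settings are assumptions).
Code
# Sketch (assumed setup): train the two models used below with caret,
# recording training time for the benchmark table; the POSIXct `datetime`
# column is excluded from the formula.
time_logistic_train = system.time({
  logistic_model = train(Class ~ . - datetime, data = dataTrain,
                         method = "glm", family = "binomial",
                         trControl = train_control)
})
time_tree_train = system.time({
  tree_model = train(Class ~ . - datetime, data = dataTrain,
                     method = "rpart", trControl = train_control)
})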
# Logistic Regression Prediction
time_logistic_pred = system.time({
  logistic_pred = predict(logistic_model, dataTest)
  logistic_probs = predict(logistic_model, dataTest, type = "prob")[, 2]
})

# Decision Tree Prediction
time_tree_pred = system.time({
  tree_pred = predict(tree_model, dataTest)
  tree_probs = predict(tree_model, dataTest, type = "prob")[, 2]
})

# Logistic Regression Confusion Matrix
confusionMatrix(logistic_pred, dataTest$Class)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 56852 39
1 11 59
Accuracy : 0.9991
95% CI : (0.9988, 0.9993)
No Information Rate : 0.9983
P-Value [Acc > NIR] : 6.453e-08
Kappa : 0.702
Mcnemar's Test P-Value : 0.0001343
Sensitivity : 0.9998
Specificity : 0.6020
Pos Pred Value : 0.9993
Neg Pred Value : 0.8429
Prevalence : 0.9983
Detection Rate : 0.9981
Detection Prevalence : 0.9988
Balanced Accuracy : 0.8009
'Positive' Class : 0
Code
# Decision Tree Confusion Matrix
confusionMatrix(tree_pred, dataTest$Class)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 56847 32
1 16 66
Accuracy : 0.9992
95% CI : (0.9989, 0.9994)
No Information Rate : 0.9983
P-Value [Acc > NIR] : 1.581e-08
Kappa : 0.7329
Mcnemar's Test P-Value : 0.03038
Sensitivity : 0.9997
Specificity : 0.6735
Pos Pred Value : 0.9994
Neg Pred Value : 0.8049
Prevalence : 0.9983
Detection Rate : 0.9980
Detection Prevalence : 0.9986
Balanced Accuracy : 0.8366
'Positive' Class : 0
Code
# Benchmarking Training and Prediction Time
benchmark_results = data.frame(
  Model = c("Logistic Regression", "Decision Tree"),
  Training_Time = c(time_logistic_train[3], time_tree_train[3]),
  Prediction_Time = c(time_logistic_pred[3], time_tree_pred[3])
)
print(benchmark_results)
Model Training_Time Prediction_Time
1 Logistic Regression 59.20 1.89
2 Decision Tree 76.94 1.55
The performance of both the logistic regression and the decision tree is good, with accuracy above 99%. Looking at the timing benchmarks, both models trained in roughly a minute and took only seconds to predict the test set of roughly 57,000 rows. As mentioned above, the random forest trained much longer, on the order of hours, so we chose not to continue with it at this time.
Code
# ROC and AUC Curves
# ROC
roc_logistic = roc(dataTest$Class, logistic_probs)
roc_tree = roc(dataTest$Class, tree_probs)
plot(roc_logistic, col = "red", main = "ROC Curves", lwd = 2)
lines(roc_tree, col = "blue", lwd = 2)
legend("bottomright", legend = c("Logistic Regression", "Decision Tree"),
       col = c("red", "blue"), lwd = 2)
Code
# AUC
auc_logistic = auc(roc_logistic)
auc_tree = auc(roc_tree)
print(paste("AUC for Logistic Regression:", auc_logistic))
[1] "AUC for Logistic Regression: 0.974355477379035"
Code
print(paste("AUC for Decision Tree:", auc_tree))
[1] "AUC for Decision Tree: 0.836573906420983"
Code
coefficients <- tidy(logistic_model$finalModel)
coefficients <- coefficients[coefficients$term != "(Intercept)", ]  # Remove intercept for better visualization
ggplot(coefficients, aes(x = reorder(term, estimate), y = estimate)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Logistic Regression Coefficients", x = "Features", y = "Coefficient") +
  theme_minimal()
To further understand the predictive quality of both models, we computed ROC curves, which plot sensitivity against specificity, and found, interestingly, that the logistic regression was much better across the board than the decision tree. It is likely that the tree overfit, which diminishes its predictive quality: while its accuracy is still high on this test set, that does not mean it would stay high on other unseen data.
Therefore, in the current case, we would choose the logistic regression over the decision tree, although an ensemble of trees would likely outperform the logistic regression, even on a smaller data set.
Conclusion and Summary Essay
Note: Both datasets responded well to decision tree and logistic regression modeling, as expected.
On the lending loan dataset, we report results for three models and their performance metrics: accuracy, sensitivity, and specificity.
Sensitivity (true positive rate) measures the proportion of applicants in the test dataset who were actually charged off and were predicted as charged off. Specificity (true negative rate) measures the proportion of applicants who actually paid their loans in full and were predicted as fully paid.
The ensemble model performed the best when we consider all three metrics, since it had better values across the board. We would like to look further into the baseline decision tree for this dataset, since it had very low sensitivity before we rebalanced the training data with over/under sampling.
On the fraud dataset, we decided to use the logistic regression over the decision tree, although we did not run ensemble methods here, which might yield better results in the future.
Takeaways: The randomForest took a significant amount of time to run, and we were unable to load the importance measures through the randomForestExplainer package or to save the .RDA file with the importance measures for the test model output; we would like to review this further in the future. We would also like to expand on the time-series side with the "fpp3" package on the credit card dataset, looking at further extrapolations of the time series and forecasting methods to answer questions such as: can we forecast credit card fraud given the dataset, and are there seasonal patterns in when fraud occurs?
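As a pointer for that follow-up, a minimal and purely illustrative sketch of turning the hourly non-fraud counts into a tsibble and fitting a simple ARIMA benchmark with fpp3 (it reuses tx_transactions from the EDA; this is not part of the analysis above).
Code
# Sketch only: a simple forecasting starting point for the questions above.
library(fpp3)

tx_ts <- tx_transactions %>%
  ungroup() %>%
  filter(Class == 0) %>%
  dplyr::select(datetime_hour, transaction_count) %>%
  as_tsibble(index = datetime_hour) %>%
  fill_gaps(transaction_count = 0L)   # make the hourly series regular

fit <- tx_ts %>% model(ARIMA(transaction_count))
fit %>% forecast(h = 24) %>% autoplot(tx_ts)   # 24-hour-ahead forecast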
---title: "DATA622 Assignment 1 (Large and Small Dataset Machine Learning methods and analysis)"author: "Banu & Lucas"date: "10/10/2024"toc: trueformat: html: html-math-method: katex code-fold: true code-tools: true self-contained: true toc_depth: 2execute: warning: false---```{r}#Load Packageslibrary(arrow)library(dplyr)library(ggplot2)library(reshape2)library(tidymodels)library(tidyr)library(broom)library(patchwork)library(lubridate)library(rpart.plot)library(kableExtra)library(DataExplorer)library(skimr)#install.packages("ranger")#install.packages("tidymodels")library(DescTools)library(corrplot)library(ggcorrplot)library(caret)library(rpart)library(car)library(rattle)library(ROSE)library(mice)library(MASS)library(fpp3)library(pROC)```## Project### **Deliverable**1. Explore how to analyze and predict an outcome based on the data available. This will be an exploratory exercise, so feel free to show errors and warnings that raise during the analysis. Test the code with both datasets selected and compare the results. 1. Are the columns of your data correlated? 2. Are there labels in your data? Did that impact your choice of algorithm? 3. What are the pros and cons of each algorithm you selected? 4. How your choice of algorithm relates to the datasets (was your choice of algorithm impacted by the datasets you chose)? 5. Which result will you trust if you need to make a business decision? 6. Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible? 7. How does the analysis between data sets compare?2. Essay (minimum 500 word document)\ Write a short essay explaining your selection of algorithms and how they relate to the data and what you are trying to do## EssayOur goal was to understand and work with financial datasets. In line with the goal of this project we selected two large datasets (Credit Card Fraud and Lendingclub Loan). Both classification and regression methods (supervised learning) are good approaches for our datasets since we have availability of labeled samples.**LendingClub Loan (Fully Paid or Charged Off) Prediction (Large Volume - 300,000+):**LendingClub loan dataset is provided by Lending Club a peer-to-peer lending firm in San Francisco. Lending loans to risky customers can have a higher impact on credit loss. The dataset above is a good one as it has labeled data on loans that were "Fully Paid" or "Charged Off". Additionally, this dataset is a rich resource of other categorical potential predictor variables and lends itself well to exploratory data analysis as well as machine learning algorithms such as Logistic regression, RandomForest and Decision Tree models.<https://www.kaggle.com/datasets/jeandedieunyandwi/lending-club-dataset>**Credit Card Fraud (Fraud/Not-Fraud) Prediction (Medium Volume - 200,000+ ):** Fraud activity and customer behavior changes rapidly, causing non-stationary in the transaction data. Fraud represents a small fraction of the daily transactions. We then have a skewed dataset towards the genuine transactions. This yields to a highly imbalanced dataset. Statistical methods such as logistic regression can be applied for fraud detection for classification tasks, however, these are impacted by imbalance of the dataset and can be biased to predicting the majority class. We have also explored decision trees, wherein we can identify rules for predicting the correct class of transactions. 
We can identify the percentage of instances the condition of rule applies and the accuracy or confidence of a rule, which is predicting correct class of instances in which the condition of the rule applies. We would like to highlight the findings below through exploratory analysis, selection of models to train and then testing on subset of data.<https://data.world/vlad/credit-card-fraud-detection>**Reference thesis with the above dataset:** <https://di.ulb.ac.be/map/adalpozz/pdf/Dalpozzolo2015PhD.pdf>**Comparison between the two datasets:**While both datasets have a large number of records, we have selected the LendingClub loan dataset for our Large dataset for EDA and ML analysis and the Credit Card Fraud as a medium dataset for EDA and ML analysis. Our two datasets above are different in that the credit card fraud dataset is time series data, however the lendingclub loan dataset is not time series related. Therefore, we found the fraud dataset to be higher in complexity since it is time series data. We were able to train on large sample size here (227000+) and then apply the model on the test dataset.The Lendingclub dataset on the other hand has far more categorical variables than the credit card fraud dataset. We have dived into it further on the exploratory data analysis. Additionally, due to the size of the dataset (300000+ records), we subset the training set from a modeling perspective with data records from issue date of the loan after 2015-01-01. There were challenges in training the data on the higher number of samples, therefore we made a decision to subset the dataset to 100,000 records range and then split it up to train/test set. On this dataset, we were dealing with missing values, decision to drop certain values, review correlated values, additionally during model training, we found train/test split issues and had to ensure all factor levels are present in both train and test datasets.## Large Data Set (End to End ML Analysis)### Data set IntroductionThe Lendingclub loan data set consists of about 396,030 rows and 27 columns. Even though this dataset is large, we have carved out a subset of records with Time since first credit line \> 0 (Time since first credit line is a new engineered variable which is calculated by subtracting the time since first credit line from Issue date of the loan) and Issue date \> 2015-01-01 to look at a smaller population to model our dataset.### Data Exploration & PlotsSeveral variables are character types and we will be converting those to factors prior to train and test split of the data. The dataset contains missing values recorded for 3 of the variables. We will also consider dropping certain variables such as address. Imputation of the missing values will be handled during data pre-processing step below. Additionally we created a parquet file since it was large volume and stored it on github.```{r}path1 ="https://github.com/BanuB/Card_Transaction_Fraud/raw/refs/heads/master/loandata.parquet"inputf1 =read_parquet(path1)#Glimpse variablesintroduce(inputf1)plot_intro(inputf1)plot_missing(inputf1)glimpse(inputf1)unique(inputf1$loan_status)skim(inputf1) %>%kable()sapply(inputf1, function(x) sum(is.na(x))) %>%kable()```### Correlation AnalysisCorrelation plot reveals that some variables are inter correlated which may not be ideal and can cause unreliable regression estimates.Some steps we can take is combining variables or dropping certain variables. We will consider these later on. 
For instance total_acc and mort_acc are positively correlated below. Similarly loan_amt and installment have a positive correlation.```{r}# Select numeric columns only numeric_data<- inputf1[sapply(inputf1, is.numeric)]M<-cor(numeric_data,use="complete.obs")# M %>% kable() %>%# kable_styling()ggcorrplot(M, type ="upper", outline.color ="white",ggtheme = theme_classic,#colors = c("#6D9EC1", "white", "#E46726"),lab =TRUE, show.legend =FALSE, tl.cex =8, lab_size =3)# Calculate the correlation matrixcorrelation_matrix <-cor(numeric_data, use="complete.obs")kable(correlation_matrix)corrplot(correlation_matrix, method="circle")```#### Review DistributionsDistributions of factor variables across loan statuses, starting with loan grades. Fully paid loans are at lower interest rates, and charged off loans have a more even distribution, tending towards mid tier interest rates.```{r}loan_stat_df <-subset(inputf1, !is.na(inputf1$loan_status)) %>%group_by(loan_status) %>%summarise(Number =n())loan_stat_dftable(inputf1$loan_status, inputf1$grade)ggplot(inputf1, aes(x = int_rate))+geom_histogram(aes(fill = grade)) +facet_wrap(~loan_status, ncol =1)```Some additional EDA plots below to show a broader patterns within the data.Loan grades of type E, F, G have higher interest rates.D, E, F, G grade loans also show the presence of high number of outliers. Some of the density plots additionally show not many distributions have a normal bell curve. Additionally several have right skew in the distributions. Loan amount has a 0.78 positive skew value. Also the median loan amount seems to be \$12000.00. There are a good number of outliers on higher loan amounts that don't seem to be verified. Additionally, dataset has upto 59.2% of records of type debt consolidation.```{r}#Loan amt distributionggplot(data=inputf1, aes(loan_amnt, fill=loan_status))+geom_histogram(bins =40,color="blue")str(inputf1)ggplot(inputf1, aes(x=grade, y=loan_amnt, fill=grade)) +stat_summary(fun.y="sum", geom="bar") +labs(y ="Total Loan Amount",title="Total loan amount based on loan grade")ggplot(data=inputf1, aes(grade,int_rate,fill=grade))+geom_boxplot(outlier.color ="blue")+labs(title="Box plot of Interest rate")plot_density(inputf1)#Distribution of loan amount and purposeDesc(inputf1$loan_amnt, main ="Loan amount distribution", plotit =TRUE)Desc(inputf1$purpose, main ="Loan purposes", plotit =TRUE)#Distribution by verification statusinputf1 %>%group_by(verification_status) %>%summarise(mean(loan_amnt), var(loan_amnt), mean(int_rate),mean(annual_inc))ggplot(data = inputf1,aes(x = verification_status, y = loan_amnt))+geom_boxplot()ggplot(data=inputf1,aes(loan_amnt, fill=grade))+geom_density(alpha=0.25) +facet_grid(grade ~ .)#Distribution of loan status and gradetable(inputf1$loan_status, inputf1$grade)ggplot(inputf1, aes(x = int_rate))+geom_histogram(aes(fill = grade)) +facet_wrap(~loan_status, ncol =1)```### Data Preparation (Imputation and Feature Engineering)Since mort_acc seems to have majority of values missing, since its positively correlated with total_acc, we have considered dropping this variable instead of imputing. Additionally, the address, title and emp_title had many levels, during our train/test process, these variables caused issues with factor levels missing in test dataset. We have also imputed 2 variables with their median and mean (since missing values were in the range of about 500 for these 2 variables)Therefore, we have chosen to drop these variables. 
Additionally, our randomForest did not run successfully on the large train dataset when we tried with a 80/20 split, therefore we have chosen to subset a smaller number of records to look at loans issued after 2015 (our subset of records we will use is 110,647).We have created one additional feature variable called "time since first credit line" was issued to see if there is any impact on the response variable.```{r}# drop address, mort_acclibrary(dplyr)library(gridExtra)library(grid)inputf1_new<- dplyr::select(inputf1,-c(mort_acc,address, title, emp_title))#glimpse(inputf1_new)inputf1_new %>%gather(variable, value) %>%filter(is.na(value)) %>%group_by(variable) %>%tally() %>%# dplyr::mutate(percent = (n / nrow(df)) * 100) %>%# dplyr::mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%# arrange(desc(n)) %>%rename(`Variable Missing Data`= variable,`Number of Records`= n) %>%# `Share of Total` = percent) %>%kable() %>%kable_styling()#impute with median#unique(inputf1_new$pub_rec_bankruptcies)mean1 <-round(median(inputf1_new$pub_rec_bankruptcies, na.rm =TRUE),0)inputf1_new[is.na(inputf1_new[,"pub_rec_bankruptcies"]), "pub_rec_bankruptcies"] <- mean1#impute with meanmean2 <-round(mean(inputf1_new$revol_util, na.rm =TRUE),0)inputf1_new[is.na(inputf1_new[,"revol_util"]), "revol_util"] <- mean2#unique(inputf1_new$revol_util)#create new variablesinputf1_new$issue_d <-as.character(inputf1_new$issue_d)inputf1_new$issue_d <-paste(inputf1_new$issue_d, "-01", sep ="")inputf1_new$issue_d <-parse_date_time(inputf1_new$issue_d, "myd")inputf1_new$earliest_cr_line <-as.character(inputf1_new$earliest_cr_line)inputf1_new$earliest_cr_line <-paste(inputf1_new$earliest_cr_line, "-01", sep ="")inputf1_new$earliest_cr_line <-parse_date_time(inputf1_new$earliest_cr_line, "myd")inputf1_new$time_since_fcline <- inputf1_new$issue_d - inputf1_new$earliest_cr_lineinputf1_new$time_since_fcline <-as.numeric(inputf1_new$time_since_fcline)inputf1_new2 <- inputf1_new %>%filter(time_since_fcline >0& issue_d >c("2015-01-01 UTC") )head(inputf1_new2$time_since_fcline)loan_stat_df1 <-subset(inputf1_new2, !is.na(inputf1_new2$loan_status)) %>%group_by(loan_status) %>%summarise(Number =n())loan_stat_df1ggplot(data = inputf1_new2 , aes(loan_status)) +geom_bar(position ="dodge") +labs(x ="Loan Status", title ="Distribution of Loan Status on our sample population of issue date after 2015-01-01") +theme(axis.text.x =element_text(angle =90, hjust =1))plot_missing(inputf1_new2)p1 <-ggplot(data = inputf1_new2, aes(loan_amnt, color = grade)) +geom_histogram(binwidth =1000) +facet_grid(grade ~ .)p2 <-ggplot(data = inputf1_new2, aes(loan_amnt, color = grade, fill = grade)) +geom_density(binwidth =1000) +facet_grid(grade ~ .)grid.arrange(p1, p2, ncol =2)ggplot(data =subset(inputf1_new2, !home_ownership %in%c("ANY", "NONE", "OTHER")), aes(y = home_ownership, purpose)) +geom_count(color ="Navy") +theme(axis.text.x =element_text(angle =90, hjust =1))ggplot(data =subset(inputf1_new2, (!is.na(home_ownership) & (!home_ownership %in%c("ANY", "NONE", "OTHER")))), aes(home_ownership, fill = grade, color = grade)) +geom_bar() +labs(title ="Distribution of Home Ownership by Loan Grade")inputf1_new2 %>%filter(!is.na(purpose)) %>%group_by(purpose) %>%summarise(mean_annual_inc =mean(annual_inc), mean_amnt_loan =mean(loan_amnt), n =n()) %>%ungroup() %>%arrange(desc(n))ggplot(data =subset(inputf1_new2, !is.na(purpose)), aes(purpose, fill = loan_status, color = loan_status)) +geom_bar() +labs(title ="Distribution of Loan purpose") 
+theme(axis.text.x =element_text(angle =90, hjust =1))ggplot(data =subset(inputf1_new2, !is.na(verification_status)), aes(verification_status, fill = loan_status, color = loan_status)) +geom_bar(position ="fill") +labs(title ="Distribution of verification status by Loan Status") +scale_y_continuous(labels =percent_format())```Scatter plots in an attempt to identify trends between seemingly related numeric variables.This one plot suggests that the 60-month loans tend to have larger interest rates and be for larger loan amounts (the top right corner is dominated by blue points).```{r}set.seed(2024)#dim(inputf1_new2)#p<-table(inputf1_new2$loan_status, inputf1_new2$int_rate)#ggplot(inputf1_new2, aes(x = int_rate))+ geom_histogram() + facet_wrap(~loan_status, ncol = 1)#Distribution of loan status and grade#table(inputf1_new2$loan_status, inputf1_new2$grade)#ggplot(inputf1_new2, aes(x = int_rate))+ geom_histogram(aes(fill = grade)) + facet_wrap(~loan_status, ncol = 1)#Distribution of loan status and termtable(inputf1_new2$loan_status, inputf1_new2$term)index =createDataPartition(y = inputf1_new2$loan_status, p =0.90)[[1]]loans.sample <- inputf1_new2[-index,]ggplot(loans.sample, aes(x = loan_amnt, y = int_rate)) +geom_point(aes(color = term))```### Algorithm Selection/Build Models### Decision Tree Model and metrics reviewFor model selection,we have created a partition of the 110,647 records and created 70/30 split of the dataset. The accuracy of our Basic decision tree model is 0.7835.We have shown an example of pruned tree below. Overly complex trees have high variance. We set complexity Parameter of 0 as a measure of the required split improvement. The parameter modulates the amount by which splitting a given node improved the minimum error of 0.001435764 so that a spit can be justified.Additionally, we have printed the decision tree rules that were generated by the model on the pruned instance of the tree.The Sensitivity (true positive rate) and Specificity (true negative rate) are below for the baseline model.Sensitivity is the metric that evaluates a model's ability to predict true positives of each available category. Specificity is the metric that evaluates a model's ability to predict true negatives of each available category.The sensitivity of the model is very low, this implies that the model did not successfully classify the "charged off" loans accurately.Sensitivity : 0.10973 Specificity : 0.97219.Since our train dataset is highly imbalanced, we can use the ROC curve since we can't simply use the accuracy measure.Area under the curve (AUC): 0.690. We can try to over, under or combine both sampling method to balance the class prior to running the model to avoid class imbalance issues.For reference of classification categories of the confusion matrix is given below.True Positive (TP) -- An instance that is positive and is classified correctly as positive True Negative (TN) -- An instance that is negative and is classified correctly as negative False Positive (FP) -- An instance that is negative but is classified wrongly as positive False Negative (FN) -- An instance that is positive but is classified incorrectly as negative```{r}#install.packages("RGtk2")library("rattle")library(rpart.plot)library(rpart)#install.packages("vip")library(vip)set.seed(2024)index =createDataPartition(y = inputf1_new2$loan_status, p =0.7)[[1]]loans.test <- inputf1_new2[-index,]loans.train <- inputf1_new2[index,]loans.rpart.1<-rpart(loan_status ~ . 
, data = loans.train, control=rpart.control(minsplit=10, minbucket =3, cp=0.0006))fancyRpartPlot(loans.rpart.1)predictions.1<- (predict(loans.rpart.1, loans.test , type ="class")) p1<-confusionMatrix(predictions.1,as.factor(loans.test$loan_status))roc.curve(loans.test$loan_status, predict(loans.rpart.1, loans.test, type ="prob")[,1], plot =TRUE)p1#rpart.plot(loans.rpart.1, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto")rules<-rpart.rules(loans.rpart.1)head(rules, 4) %>%kable() # Create a variable importance plotvar_importance <- vip::vip(loans.rpart.1, num_features =30)print(var_importance)plotcp(loans.rpart.1)costdt <-data.frame(printcp(loans.rpart.1))min_err <- (costdt %>%filter(nsplit >1) %>%slice(which.min(xerror)))$CPcat("Minimum Error: ", min_err)loan_prune <- rpart::prune(loans.rpart.1,min_err)rpart.plot(loan_prune, box.col =c("pink", "palegreen3")[loans.rpart.1$frame$yval])loan_prune_pred <-predict(loan_prune, loans.test, type ="class")#prunedcm <- confusionMatrix(loan_prune_pred, as.factor(loans.test$loan_Status))# dt_pruned_cm$tableroc.curve(loans.test$loan_status, predict(loan_prune, loans.test, type ="prob")[,1], plot =TRUE)p1rules<-rpart.rules(loan_prune)head(rules) %>%kable() ```### Oversampling and fixing class imbalance and rerunning decision tree model.We have tried over sample/under sample and combined with "Both" option on the train dataset to see if we can fix the class imbalance. The resulting model has jumped in sensitivity, however, it has misclassified large \# of records as charged off when they were full paid. We will need to try to understand why this maybe potentially as a follow-up to this project.```{r}set.seed(2024)glimpse(loans.train)glimpse(loans.test)loans.oversample <-ovun.sample(loan_status ~ ., data = loans.train, method ="both",N =77454, seed =13)$datatable(loans.oversample$loan_status)table(loans.train$loan_status)barplot(table(loans.train$loan_status) , col ='lightblue')barplot(table(loans.oversample$loan_status) , col ='lightblue')tune <-data.frame(0.001)colnames(tune) <-"cp"tr_control <-trainControl(method ="cv",number =10, verboseIter =TRUE)loans.over <-train(loan_status ~., data = loans.oversample, method ="rpart", trControl = tr_control, tuneGrid = tune, control=rpart.control(minsplit=10, minbucket =3))fancyRpartPlot(loans.over$finalModel)confusionMatrix(predict(loans.over, loans.test), as.factor(loans.test$loan_status))```### Logistic Regression (Train Data set)Here we have run the 2nd full baseline Logistic regression model and then additionally run the model with stepAIC on the over sampled train data from prior section and compared the metrics on both models.We hit several issues during model train, due to non-conversion of variables to factor type rather than character type. Once that was fixed, we proceeded to run the model and then store the values of accuracy, sensitivity and specificity.The logistic regression model accuracy with the stepAIC improved a lot after using stepAIC method. All 3 parameters were the best so far. 
### Logistic Regression (Train Data set)

Here we ran a second, full baseline logistic regression model, then ran it again with stepAIC on the over-sampled training data from the prior section, and compared the metrics of both models. We hit several issues during training because some variables were left as character type rather than converted to factors. Once that was fixed, we ran the model and stored the accuracy, sensitivity, and specificity. The stepAIC model improved on the baseline considerably; all three metrics were the best so far: accuracy 0.673135, sensitivity 0.6847482, specificity 0.6614412.

```{r}
# Recode the response: 1 = Charged Off / Default, 0 = Fully Paid
loans.oversample1 <- loans.oversample %>%
  mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off', 'Default'), 1,
                        ifelse(loan_status == 'Fully Paid', 0, 'No info')))
loans.oversample1 <- loans.oversample1[, colnames(loans.oversample1)[colnames(loans.oversample1) != 'loan_status']]
loans.oversample1 <- loans.oversample1[, colnames(loans.oversample1)[colnames(loans.oversample1) != 'issue_d']]
loans.oversample1 <- loans.oversample1[, colnames(loans.oversample1)[colnames(loans.oversample1) != 'earliest_cr_line']]

factorize = function(column, df){
  #' Check if column is character and turn to factor
  if(class(df[1, column]) == "character"){
    out = as.factor(df[, column])
  } else { # if it's numeric
    out = df[, column]
  }
  return(out)
}
# str(loans.oversample)
# str(loans.oversample1)
# class(loans.oversample1[1,"term"])
store.colnames = colnames(loans.oversample1)
loans.oversample3 = lapply(store.colnames, function(column) factorize(column, loans.oversample1))
loans.oversample3 = as.data.frame(loans.oversample3)
colnames(loans.oversample3) = store.colnames

# Full model, then stepwise selection by AIC
full.reg <- glm(loan_outcome ~ ., data = loans.oversample3, family = "binomial")
loans.reg <- stepAIC(full.reg, direction = "both")
summary(full.reg)
summary(loans.reg)
coef(loans.reg)
head(predict(loans.reg, type = "response"))

#model_glm_pred = predict(loans.reg, type="response")
model_glm_pred = ifelse(predict(loans.reg, type = "response") > 0.5, 1, 0)

calc_class_err = function(actual, predicted) {
  mean(actual != predicted)
}
calc_class_err(actual = loans.oversample3$loan_outcome, predicted = model_glm_pred)

train_tab = table(predicted = model_glm_pred, actual = loans.oversample3$loan_outcome)
library(caret)
train_con_mat = confusionMatrix(train_tab, positive = "1")
c(train_con_mat$overall["Accuracy"],
  train_con_mat$byClass["Sensitivity"],
  train_con_mat$byClass["Specificity"])

# Helper: misclassification error for a logistic model at a given cutoff
get_logistic_error = function(mod, data, res = "y", pos = 1, neg = 0, cut = 0.5) {
  probs = predict(mod, newdata = data, type = "response")
  preds = ifelse(probs > cut, pos, neg)
  calc_class_err(actual = data[, res], predicted = preds)
}
get_logistic_error(loans.reg, data = loans.oversample3,
                   res = "loan_outcome", pos = 1, neg = 0, cut = 0.5)

# Collect model performance for later comparison
performance_df <- data.frame(Model = NULL, Accuracy = NULL, Sensitivity = NULL, Specificity = NULL)
perf_dt <- data.frame(Model = "loans.reg Logistic with stepAIC",
                      Accuracy = train_con_mat$overall[1],
                      Sensitivity = train_con_mat$byClass[1],
                      Specificity = train_con_mat$byClass[2])
performance_df <- rbind(performance_df, perf_dt)
perf_dt
perf_dt3 <- data.frame(Model = "Decision Tree Baseline",
                       Accuracy = p1$overall[1],
                       Sensitivity = p1$byClass[1],
                       Specificity = p1$byClass[2])
performance_df <- rbind(performance_df, perf_dt3)
perf_dt3
```
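As an aside, the character-to-factor conversion handled by the `factorize()` helper above can also be written more compactly with `dplyr::across()`. A minimal sketch, using `loans.oversample1` from the chunk above:

```{r}
# Sketch: the same character-to-factor conversion done with dplyr::across(),
# as an alternative to the factorize() helper above.
loans.oversample3_alt <- loans.oversample1 %>%
  mutate(across(where(is.character), as.factor))
str(loans.oversample3_alt)
```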
### Logistic Regression (Test Data set)

Evaluating the stepAIC model on the held-out test set gave metrics similar to the training data: Accuracy 0.6645, Sensitivity 0.6576, Specificity 0.6891, AUC 0.7358227.

```{r}
library(ROCR)
set.seed(2024)
loans.test4 <- as.data.frame(loans.test)
barplot(table(loans.test4$loan_status), col = 'lightblue')
table(loans.test4$loan_status)

#Use under and oversampling
# loans.oversample.test1 <- ovun.sample(loan_status ~ ., data = loans.test4, method = "both", N = 33193, seed = 13)$data
# barplot(table(loans.oversample.test1$loan_status), col = 'lightblue')

# Recode the test response the same way as the training data
loans.test1 <- loans.test4 %>%
  mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off'), 1,
                        ifelse(loan_status == 'Fully Paid', 0, 'none')))
barplot(table(loans.test1$loan_outcome), col = 'lightblue')
table(loans.test1$loan_outcome)
loans.test1 <- loans.test1[, colnames(loans.test1)[colnames(loans.test1) != 'loan_status']]
loans.test1 <- loans.test1[, colnames(loans.test1)[colnames(loans.test1) != 'issue_d']]
loans.test1 <- loans.test1[, colnames(loans.test1)[colnames(loans.test1) != 'earliest_cr_line']]

# str(loans.test4)
# Candidate factor columns: term, grade, sub_grade, emp_length, home_ownership,
# verification_status, purpose, initial_list_status, application_type, loan_outcome
# class(loans.test1[1,"grade"])

# Convert character columns to factors (same helper as for the training data)
store.colnames1 = colnames(loans.test1)
loans.test3 = lapply(store.colnames1, function(column) factorize(column, loans.test1))
loans.test3 = as.data.frame(loans.test3)
colnames(loans.test3) = store.colnames1

# Refit the logistic model via caret on the balanced training data
loans.over.2 <- train(loan_outcome ~ loan_amnt + term + int_rate + installment + sub_grade +
                        emp_length + home_ownership + verification_status + purpose + dti +
                        open_acc + pub_rec + revol_bal + revol_util + total_acc +
                        initial_list_status + application_type + pub_rec_bankruptcies +
                        time_since_fcline,
                      data = loans.oversample3, method = "glm")
confusionMatrix(predict(loans.over.2, loans.test3), as.factor(loans.test3$loan_outcome))
loans.2.prediction <- prediction(predict(loans.over.2, newdata = loans.test3, type = "prob")[, "1"],
                                 loans.test3$loan_outcome)
performance(loans.2.prediction, measure = "auc")@y.values

# Make predictions and report accuracy for the full model
probabilities <- predict(full.reg, loans.test3, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
observed.classes <- loans.test3$loan_outcome
mean(predicted.classes == observed.classes)

# Make predictions and report accuracy for the stepwise model
probabilities <- predict(loans.reg, loans.test3, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
observed.classes <- loans.test3$loan_outcome
mean(predicted.classes == observed.classes)

get_logistic_error(full.reg, data = loans.test3, res = "loan_outcome", pos = 1, neg = 0, cut = 0.5)
get_logistic_error(loans.reg, data = loans.test3, res = "loan_outcome", pos = 1, neg = 0, cut = 0.5)

# A good model has a high AUC, i.e. high sensitivity and specificity as often as possible
test_prob = predict(loans.reg, newdata = loans.test3, type = "response")
test_roc = roc(loans.test3$loan_outcome ~ test_prob, plot = TRUE, print.auc = TRUE)
as.numeric(test_roc$auc)
```

### Random Forest Model

We implemented the random forest model; however, we were not able to use some of the functionality of the randomForestExplainer package, so we display the built-in importance measures instead.

The first measure is computed by permuting the out-of-bag (OOB) data: for each tree, the prediction error on the OOB portion of the data is recorded (error rate for classification, MSE for regression), the same is done after permuting each predictor variable, and the differences are averaged over all trees and normalized by their standard deviation (if the standard deviation of the differences is 0 for a variable, the division is skipped, though the average is almost always 0 in that case). The second measure is the total decrease in node impurity from splitting on the variable, averaged over all trees; for classification the node impurity is measured by the Gini index, and for regression by the residual sum of squares.
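To make the first (permutation) measure concrete, here is a simplified sketch on the built-in `iris` data, independent of the loan model: compare in-sample accuracy before and after shuffling one predictor. randomForest itself does this on the out-of-bag data per tree, but the idea is the same.

```{r}
# Concept sketch of permutation importance on a small built-in data set.
library(randomForest)
set.seed(2024)
toy_rf <- randomForest(Species ~ ., data = iris, ntree = 100)
baseline_acc <- mean(predict(toy_rf, iris) == iris$Species)

iris_perm <- iris
iris_perm$Petal.Length <- sample(iris_perm$Petal.Length)  # break this predictor's signal
permuted_acc <- mean(predict(toy_rf, iris_perm) == iris$Species)

# The drop in accuracy is the (unnormalized) permutation importance of Petal.Length
baseline_acc - permuted_acc
```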
```{r}
#install.packages("randomForestExplainer")
library(randomForest)
library(ipred)
library(randomForestExplainer)
set.seed(2024)

seat_forest = randomForest(loan_outcome ~ ., data = loans.oversample3,
                           mtry = 3, importance = TRUE, ntree = 500)
seat_forest
print(seat_forest)
plot(seat_forest)

# colnames(loans.oversample3)
# str(loans.oversample3)
# colnames(loans.test3)
# str(loans.test3)

# Align factor levels between the train and test sets before predicting
levels(loans.oversample3$home_ownership)
allvalues <- unique(levels(loans.oversample3$home_ownership))
loans.test3$home_ownership <- factor(loans.test3$home_ownership, levels = allvalues)
levels(loans.test3$home_ownership)
common <- intersect(names(loans.oversample3), names(loans.test3))
for (p in common) {
  if (class(loans.oversample3[[p]]) == "factor") {
    # Re-level via factor() (rather than overwriting levels) so category labels are preserved
    loans.test3[[p]] <- factor(loans.test3[[p]], levels = levels(loans.oversample3[[p]]))
  }
}

seat_forest_tst_perd = predict(seat_forest, loans.test3)
table(predicted = seat_forest_tst_perd, actual = loans.test3$loan_outcome)
rf_cm <- confusionMatrix(seat_forest_tst_perd, loans.test3$loan_outcome)

randomForest::importance(seat_forest, type = 1)
randomForest::importance(seat_forest, type = 2)
par(mfrow = c(1, 2))
varImpPlot(seat_forest, type = 1, main = "Importance: permutation")
varImpPlot(seat_forest, type = 2, main = "Importance: node impurity")
#var_imp <- measure_importance(seat_forest)

perf_dt4 <- data.frame(Model = "Random forest",
                       Accuracy = rf_cm$overall[1],
                       Sensitivity = rf_cm$byClass[1],
                       Specificity = rf_cm$byClass[2])
performance_df <- rbind(performance_df, perf_dt4)
performance_df
```
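For completeness, the randomForestExplainer calls we had intended to use would look roughly like the sketch below. It is not evaluated here: it assumes the fitted `seat_forest` object above and was too slow, or failed, in our environment.

```{r, eval=FALSE}
# Not evaluated: typical randomForestExplainer workflow on the fitted forest.
library(randomForestExplainer)
min_depth_frame <- min_depth_distribution(seat_forest)
plot_min_depth_distribution(min_depth_frame)
importance_frame <- measure_importance(seat_forest)
plot_multi_way_importance(importance_frame)
```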
## Medium Data Set (End to End ML Analysis)

### Data set Introduction

This medium-volume data set consists of about 284,000 card transactions labelled as non-fraud or fraud. It is a real data set from a European financial institution, which is why the features are masked: they are the result of extensive PCA. It is also a highly imbalanced data set, as there are several orders of magnitude more non-fraud than fraud transactions.

### Data Exploration & Plots

```{r}
set.seed(2024)
path = "https://github.com/BanuB/Card_Transaction_Fraud/raw/refs/heads/master/creditcard.parquet"
tx_raw = read_parquet(path)
```

```{r}
summary(tx_raw)
tx_raw$Class = as.factor(tx_raw$Class) # Convert Class column to factor
tx_raw = tx_raw %>%
  mutate(datetime = as.POSIXct("2024-01-01 00:00:00", tz = "UTC") + seconds(Time)) # Add a datetime column

ggplot(tx_raw, aes(x = Amount, fill = Class)) +
  geom_histogram(position = "dodge", bins = 60) +
  labs(title = "Histogram of Amounts by Class (< 500 USD)", x = "Amount (USD)", y = "Frequency") +
  theme_minimal() +
  scale_fill_manual(values = c('grey', 'green')) +
  xlim(0, 500)

tx_1 = tx_raw %>% filter(Class == 1)
ggplot(tx_1, aes(x = Amount)) +
  geom_histogram(position = "dodge", bins = 60) +
  labs(title = "Histogram of Amounts for Class Fraud", x = "Amount (USD)", y = "Frequency") +
  theme_minimal()

# Outlier plot
ggplot(tx_1, aes(x = Amount)) +
  geom_boxplot() +
  labs(title = "Boxplot of Amounts for Class Fraud", x = "Amount (USD)") +
  theme_minimal()

# Scatterplot
tx_1 %>%
  ggplot(aes(x = Time, y = Amount)) +
  geom_point() +
  labs(y = "Amount ($)", x = "Time (s)", title = "Fraudulent Transactions Across Time")

# Correlation Heatmap
tx_raw_numeric = tx_raw %>% dplyr::select(!c(Class, datetime))
cor_matrix = cor(tx_raw_numeric)
cor_matrix = melt(cor_matrix)
ggplot(data = cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limit = c(-1, 1), space = "Lab", name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1)) +
  coord_fixed() +
  labs(title = "Correlation Heatmap", x = "Variable", y = "Variable")

#----Time-Series for Transactions----
tx_transactions <- tx_raw %>%
  mutate(datetime_hour = floor_date(datetime, "hour")) %>%
  group_by(datetime_hour, Class) %>%
  summarise(transaction_count = n())
tx_trans_1 <- ggplot(tx_transactions, aes(x = datetime_hour, y = transaction_count, color = as.factor(Class))) +
  geom_line() +
  theme_minimal() +
  labs(title = 'Fraud Txs', y = "Number of Transactions", x = "Time (Hourly)") +
  scale_y_continuous(limits = c(0, 50)) +
  theme(legend.position = "none") +
  annotate("text", x = max(tx_transactions$datetime_hour), y = 45,
           label = expression(rho[1] == -0.226), hjust = 1)
tx_trans_0 <- ggplot(tx_transactions, aes(x = datetime_hour, y = transaction_count, color = as.factor(Class))) +
  geom_line() +
  theme_minimal() +
  labs(title = 'Non-Fraud Txs', x = NULL, y = NULL, color = "Class") +
  scale_y_continuous(limits = c(1000, max(tx_transactions$transaction_count))) +
  theme(legend.position = "none") +
  annotate("text", x = max(tx_transactions$datetime_hour),
           y = max(tx_transactions$transaction_count) - 50,
           label = expression(rho[1] == 0.918), hjust = 1)

# Combine the two plots
tx_transactions_plot <- (tx_trans_0 / tx_trans_1) + plot_layout(heights = c(2, 1))
print(tx_transactions_plot)

#----Auto- and Cross-correlations----
tx_nofraud = tx_transactions %>% filter(Class == 0) %>% dplyr::select(transaction_count)
tx_nofraud = tx_nofraud$transaction_count
tx_nofraud_autocor = acf(tx_nofraud, lag.max = 3, plot = T)
tx_nofraud_autocor
```
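Before discussing the plots, a quick numeric check of the class balance (using the `tx_raw` data loaded above) puts the imbalance in concrete terms:

```{r}
# Counts and proportions of non-fraud (0) vs fraud (1) transactions
table(tx_raw$Class)
round(prop.table(table(tx_raw$Class)), 5)
```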
When investigating the plots from the EDA above, one thing becomes clear: the data set is heavily imbalanced. As discussed in the introduction, this is unsurprising given the nature of non-fraud versus fraud transactions; however, it is an important consideration when selecting the models to run. Weak learners will likely not be as strong in performance as ensemble methods.

There are a few more interesting observations. The correlation matrix shows no strong correlation between any of the features. This matters for several machine learning algorithms, and considering that this data set has undergone feature engineering and PCA, it is unsurprising; nevertheless, this plot should be part of any machine learning implementation.

Looking at the time-series graph of hourly transaction counts over the span of the data set, the cyclic nature of the non-fraud transactions is very apparent. This pattern is absent in the fraud transactions, which occur mostly at random. The same can be seen in the autocorrelations: ρ for the non-fraud transactions is 0.92, which points to strong predictability for the next data point (after an increase in count, another increase tends to follow). The negative ρ of -0.23 for the fraudulent transactions points to more random behavior across the roughly two and a half days covered by the data set. This feature will surely be quite important for the algorithm during training.

Lastly, these autocorrelations can be seen in the ACF plot, which shows different lags; the strongest is lag 1, with the autocorrelation decreasing at larger lags.

Next, we split the data to prepare for the machine learning implementation. We chose an 80/20 train/test split and skipped a separate validation set, since we are not going to do hyperparameter tuning in this exercise.

### Data Preparation

```{r}
set.seed(2024)
library(caret)
library(e1071)
library(randomForest)
library(rpart)
library(pROC)
library(ranger)

tx_raw$Class = as.factor(tx_raw$Class)
# Ensure datetime is of the correct type
tx_raw$datetime = as.POSIXct(tx_raw$datetime)

# Split the data into training and testing sets (80/20)
trainIndex = createDataPartition(tx_raw$Class, p = 0.8, list = FALSE)
dataTrain = tx_raw[trainIndex, ]
dataTest = tx_raw[-trainIndex, ]

# Define 10-fold cross-validation
train_control = trainControl(method = "cv", number = 10)
```

### Algorithm Selection

Given that this is a highly imbalanced data set, a weak learner such as a decision tree or logistic regression will likely not be very successful; the better choice will likely be an ensemble. To test this, we run a logistic regression and a single decision tree. We also wanted to include a random forest, but on a data set of this size it was computationally too expensive. In the real world, we would certainly use some type of ensemble method, such as a random forest or XGBoost; a rough sketch of how that could be set up follows below.
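For reference, the sketch below shows how a faster random forest implementation (ranger, via caret) could be slotted into the same setup. It is illustrative only and is not evaluated here, since even this would take considerable time on the full training set; the grid values are arbitrary placeholders, not tuned choices.

```{r, eval=FALSE}
# Not evaluated: an ensemble alternative using the ranger random forest
# implementation through caret, reusing dataTrain and train_control from above.
rf_grid <- expand.grid(mtry = 5, splitrule = "gini", min.node.size = 1)
rf_model <- train(Class ~ ., data = dataTrain,
                  method = "ranger",
                  trControl = train_control,
                  tuneGrid = rf_grid,
                  num.trees = 200)   # passed through to ranger()
rf_pred <- predict(rf_model, dataTest)
confusionMatrix(rf_pred, dataTest$Class)
```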
#### Train Models

```{r}
set.seed(2024)

# Logistic Regression
time_logistic_train = system.time({
  logistic_model = train(Class ~ ., data = dataTrain, method = "glm",
                         family = "binomial", trControl = train_control)
})

# Decision Tree
time_tree_train = system.time({
  tree_model = train(Class ~ ., data = dataTrain, method = "rpart",
                     trControl = train_control)
})
```

#### Predictions and Evaluation

```{r}
# Logistic Regression Prediction
time_logistic_pred = system.time({
  logistic_pred = predict(logistic_model, dataTest)
  logistic_probs = predict(logistic_model, dataTest, type = "prob")[, 2]
})

# Decision Tree Prediction
time_tree_pred = system.time({
  tree_pred = predict(tree_model, dataTest)
  tree_probs = predict(tree_model, dataTest, type = "prob")[, 2]
})

# Logistic Regression Confusion Matrix
confusionMatrix(logistic_pred, dataTest$Class)

# Decision Tree Confusion Matrix
confusionMatrix(tree_pred, dataTest$Class)
```

```{r}
# Benchmarking Training and Prediction Time
benchmark_results = data.frame(
  Model = c("Logistic Regression", "Decision Tree"),
  Training_Time = c(time_logistic_train[3], time_tree_train[3]),
  Prediction_Time = c(time_logistic_pred[3], time_tree_pred[3])
)
print(benchmark_results)
```

The performance of both the logistic regression and the decision tree is good, with above 90% accuracy. Looking at the timing benchmarks, both models trained within about one minute and took only seconds to predict the test set of more than 50,000 rows. As mentioned above, the random forest took much longer to train, on the order of hours, so we chose not to continue with it at this time.

```{r}
# ROC and AUC Curves
roc_logistic = roc(dataTest$Class, logistic_probs)
roc_tree = roc(dataTest$Class, tree_probs)
plot(roc_logistic, col = "red", main = "ROC Curves", lwd = 2)
lines(roc_tree, col = "blue", lwd = 2)
legend("bottomright", legend = c("Logistic Regression", "Decision Tree"),
       col = c("red", "blue"), lwd = 2)

# AUC
auc_logistic = auc(roc_logistic)
auc_tree = auc(roc_tree)
print(paste("AUC for Logistic Regression:", auc_logistic))
print(paste("AUC for Decision Tree:", auc_tree))

# Logistic regression coefficients
coefficients <- tidy(logistic_model$finalModel)
coefficients <- coefficients[coefficients$term != "(Intercept)", ] # Remove intercept for better visualization
ggplot(coefficients, aes(x = reorder(term, estimate), y = estimate)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Logistic Regression Coefficients", x = "Features", y = "Coefficient") +
  theme_minimal()

# Final decision tree and its rules
rpart.plot(tree_model$finalModel)
rules1 <- rpart.rules(tree_model$finalModel)
rules1 %>% kable()
```

To further understand the predictive quality of both models, we computed ROC curves, which plot sensitivity against specificity, and found, interestingly, that the logistic regression was much better across the board than the decision tree. The tree likely overfit, which diminishes its predictive quality; a high accuracy on this test set does not mean the accuracy stays high on other unseen data. Therefore, in the current case, we would choose the logistic regression over the decision tree, although, again, an ensemble of trees would likely outperform the logistic regression, even with a smaller data set.
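As a complement to the ROC comparison above, and because fraud is the rare class, precision and recall on the fraud label are also worth reporting. The short sketch below reuses the test-set predictions from above and assumes the fraud class is labelled "1", as in the raw data.

```{r}
# Precision/recall view of the same test-set predictions, with the fraud
# class ("1") treated as the positive class.
confusionMatrix(logistic_pred, dataTest$Class, positive = "1", mode = "prec_recall")
confusionMatrix(tree_pred, dataTest$Class, positive = "1", mode = "prec_recall")
```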
## Conclusion and Summary Essay

**Note:** Both data sets responded well to decision tree and logistic regression modeling, as expected.

On the lending loan dataset, we reported results for three models and their performance metrics: accuracy, sensitivity, and specificity.

- Sensitivity (true positive rate): the proportion of applicants in the test dataset who actually charged off and were predicted as charged off.
- Specificity (true negative rate): the proportion of applicants in the test dataset who actually paid in full and were predicted as fully paid.

The ensemble (random forest) model performed the best when we review all three metrics together. We would like to look further into the decision tree baseline model for this dataset, since its sensitivity was very low until we rebalanced the training data with over-/under-sampling.

On the fraud dataset, we decided to use the logistic regression over the decision tree, although we did not run ensemble methods here, which may yield better results in the future.

**Takeaways:** The random forest took a significant amount of time to run, and we were unable to load the importance measures through the randomForestExplainer package or to save the .RDA file with the importance measures from the test model output; we would like to review this further in the future. We would also like to expand the time-series work on the credit card dataset with the "fpp3" package, looking at further extrapolations of the time series and forecasting methods to answer questions such as: can we forecast credit card fraud given this dataset, and are there seasonal patterns in when fraud occurs?

## References

- https://www.kaggle.com/code/krishnaraj30/xgboost-loan-defaulters-prediction
- https://www.kaggle.com/code/heidarmirhajisadati/advancedtechniques-for-detecting-credit-card-fraud/notebook
- https://rpubs.com/DeclanStockdale/799284
- https://htmlpreview.github.io/?https://github.com/geneticsMiNIng/BlackBoxOpener/blob/master/randomForestExplainer/inst/doc/randomForestExplainer.html