Introduction

This document presents analyses of two datasets, the Palmer Penguins dataset and a Loan Approvals dataset, prepared by Group 6. The analysis is divided into five parts, followed by a concluding discussion and appendices, and we adopted two key principles in undertaking it:

First, the group developed a system of checks and balances in preparing each model’s output. A primary and a secondary author cross-validate each model (pun intended) by independently coding and reconciling outputs. Two co-authors also bring a wider perspective to discussion and diagnostics. Afterwards, the primary author drafts the text for inclusion in this document.

Second, because four of the five analysis parts deal with the Loan Approvals data, we adopted a common dataset for all model analyses. Data wrangling and preparation are done only once, and consistency of the data across the last four parts is essential for assessing model performance.

Section 1 analyzes the Palmer Penguin dataset using the KNN model to predict species. The authors are Alexander Ng (primary) and Randy Thompson (secondary).

Section 2 conducts an exploratory data analysis and defines the common dataset for modeling. The author is Randy Thompson (primary), with contributions from all other members.

Section 3 analyzes the Loan Approvals data with a Decision Tree model. The authors are Randy Thompson (primary) and Philip Tanofsky (secondary).

Section 4 analyzes the same Loan Approvals with a Random Forest model. The authors are Scott Reed (primary) and Alexander Ng (secondary).

Section 5 analyzes the same Loan Approvals with a Gradient Boosting approach. The authors are Philip Tanofsky (primary) and Scott Reed (secondary).

Section 6 concludes with a discussion of model performance and an appraisal of the merits of each result. We also consider the role of model-driven prediction in a loan approvals business context.

Section 7 presents our R code and technical appendices and references.

1 Penguins and the KNN algorithm

Following the data cleaning approach taken in prior assignments, we exclude observations where the feature sex is undefined. We also exclude the two variables island and year, which do not seem helpful for prediction. We follow the variable selection process described in a prior assignment.

This still leaves a substantial subset: \(333\) of the \(344\) original observations. The table below illustrates sample rows of the cleansed dataset used in the subsequent KNN analysis.

Scrubbed penguin data set
species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
Adelie 39.1 18.7 181 3750 male
Adelie 39.5 17.4 186 3800 female
Adelie 40.3 18.0 195 3250 female
Adelie 36.7 19.3 193 3450 female
Adelie 39.3 20.6 190 3650 male
Adelie 38.9 17.8 181 3625 female

We obtain 333 rows of observations with 6 columns of explanatory and response variables.

1.1 Data Transformation for KNN

The next step is to normalize the data so that each quantitative variable has mean zero and standard deviation one. Normalization ensures that each variable has equal influence on the nearest-neighbor calculation; otherwise, a quantitative variable whose scale is much larger than the others would dominate the Euclidean distance calculation used in the KNN model.

However, we also need to include the sex variable as a feature in the KNN model. Categorical variables pose a challenge in the KNN context because they need to be mapped to a numerical equivalent. In the case of this data set, two facts are key:

  1. the original data set gathered penguins by breeding pairs, so the numbers of males and females are nearly balanced.
  2. the variable is binary, so two equal and opposite values can accomplish the goal of normalization.

For these reasons, we do not normalize sex but use the mapping function \(f\) defined as:

\[\begin{equation} f(\text{sex}) = \begin{cases} 1 & \text{sex} = \text{male} \\ -1 & \text{sex} = \text{female} \end{cases} \end{equation}\]

Because the data set is gender balanced, \(E[f(\text{sex})] \approx 0\) and \(\mathrm{Stdev}(f(\text{sex})) \approx 1\). Hence, \(f(\text{sex})\) is nearly normalized. With these transformations, the KNN dataset is prepared for model evaluation. We illustrate the first few rows of the training data set below; a condensed code sketch of the transformation follows the table. Note that the sampled indices are displayed in the leftmost column as row values in the resulting dataframe.

Normalized Training Data - 1st Few Rows
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
1 -0.89 0.78 -1.42 -0.57 1
3 -0.68 0.42 -0.43 -1.19 -1
4 -1.33 1.08 -0.57 -0.94 -1
7 -0.88 1.24 -0.43 0.58 1
8 -0.53 0.22 -1.35 -1.25 -1
9 -0.99 2.05 -0.71 -0.51 1
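
A condensed sketch of this transformation, mirroring the full code in the appendix:

# Scale the four quantitative variables; map sex to +1/-1 as derived above.
standardized = cbind( pc[, 1], scale( pc[, 2:5] ),
                      sex = data.table::fifelse( pc$sex == "male", 1, -1 ) )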

1.2 Cross Validation and Parameter Tuning with KNN

We will use caret to test a range of possible values of \(k\) and select the optimal \(k\) by minimizing the cross-validation error. Since we already pre-processed the data with scale, there is no need to invoke further transformations through caret’s pre-processing arguments.

trainControl allows us to define the training parameters.

tuneLength sets the number of candidate \(k\) values for the model to evaluate.

We choose a 5-fold cross validation method to ensure the test set is sufficiently large.
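
The resulting caret call is compact; this mirrors the training code in the appendix:

# 5-fold CV over 10 candidate values of k; caret selects k by accuracy.
model = train( species ~ . , data = standardized , method = "knn" ,
               trControl = trainControl("cv", number = 5), tuneLength = 10 )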

Using the model plot above, which shows the average cross-validation accuracy at each parameter \(k\) used in the KNN model, we see that the best kappa and accuracy occur at \(k=5\) nearest neighbors. This is evident when we display the (cross-validation) model output below.

## k-Nearest Neighbors 
## 
## 333 samples
##   5 predictor
##   3 classes: 'Adelie', 'Chinstrap', 'Gentoo' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 267, 266, 268, 265, 266 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9940737  0.9906842
##    7  0.9880556  0.9811323
##    9  0.9881914  0.9812667
##   11  0.9881914  0.9812667
##   13  0.9851145  0.9763978
##   15  0.9851145  0.9763978
##   17  0.9851145  0.9763978
##   19  0.9851145  0.9763978
##   21  0.9851145  0.9763978
##   23  0.9851145  0.9763978
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

1.3 KNN Model Output and Performance

To assess model performance and investigate the suitability of our choice of \(k=5\), we consider two approaches. First, we examine the statistical output of the confusion matrix: the accuracy and kappa should be high and consistent with the cross-validation results. Second, we examine the smoothness of the decision boundary by plotting it for \(k=1\) and \(k=5\). If the optimal \(k\) shows a relatively smooth boundary, this is supporting evidence that the model parameters are reasonable.

First, we construct the confusion matrix for one specific KNN fit, using a train-test split with the same proportions as the cross-validation study. We specifically use the caret interface rather than the simpler class interface to ensure consistency across the models in later sections.
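
The fit-and-evaluate step, condensed from the appendix code:

# Fit KNN at the chosen k = 5 on the training split, then score the test split.
x2 = train( species ~ . , data = standardized[training.individuals, ] ,
            method = "knn", trControl = trainControl(method = "none"),
            tuneGrid = data.frame(k = 5) )
pred2 = predict(x2, newdata = test.X)
confusionMatrix(pred2, test.Y)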

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        29         1      0
##   Chinstrap      0        12      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9172, 0.9996)
##     No Information Rate : 0.4462          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9757          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           0.9231        1.0000
## Specificity                 0.9722           1.0000        1.0000
## Pos Pred Value              0.9667           1.0000        1.0000
## Neg Pred Value              1.0000           0.9811        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4462           0.1846        0.3538
## Detection Prevalence        0.4615           0.1846        0.3538
## Balanced Accuracy           0.9861           0.9615        1.0000

We see in the confusion matrix output above that accuracy is 98.46% and kappa is 97.6%. These are consistent with the cross-validation results. We conclude that KNN is a very effective classifier for penguin species. Only 1 penguin in 65 was misclassified: a Chinstrap was predicted to be an Adelie penguin.

Next, we visualize the decision boundary of our KNN model to check model suitability. Custom code needs to be written, as we are not aware of an R package that currently provides this plot type.

Adapting the R recipe for KNN decision boundary plots found on Stack Overflow, we see below that the decision boundary for the \(k=5\) nearest-neighbor model is reasonably smooth and not overfitted to the data.

However, the KNN decision boundary for \(k=1\) shows a much more twisted curve. The border between the Chinstrap and Adelie regions in the upper middle shows a zig-zag pattern characteristic of overfitting. So we reject \(k=1\) even though the \(k=5\) and \(k=1\) KNN models have identical confusion matrices (the \(k=1\) confusion matrix is omitted for brevity). This means model selection requires considering factors not captured by the confusion matrix.
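
The boundary plot itself reduces to scoring a dense grid of points in the (bill length, bill depth) plane, holding the remaining standardized features at representative values, and contouring the predicted classes. A condensed sketch of the appendix code:

# Grid over the two displayed variables; hold the others at representative
# values (0 for the continuous features, male (= 1) for sex).
pl = seq(min(standardized$bill_length_mm), max(standardized$bill_length_mm), by = 0.05)
pw = seq(min(standardized$bill_depth_mm),  max(standardized$bill_depth_mm),  by = 0.05)
lgrid = expand.grid( bill_length_mm = pl, bill_depth_mm = pw,
                     flipper_length_mm = 0, body_mass_g = 0, sex = 1 )

# Predict a class at every grid point and trace the boundary between classes.
knnPredGrid = predict(x2, newdata = lgrid)
ggplot(data = lgrid) +
  stat_contour( aes(x = bill_length_mm, y = bill_depth_mm,
                    z = as.numeric(knnPredGrid)), bins = 2 )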

2 Loan Approvals: Exploratory Data Analysis

2.1 Assessing the Raw Data

Our EDA shows the raw loan approvals dataset has varying levels of quality.

  • Complete variables: Loan_ID, Education, Property_Area, Loan_Status, ApplicantIncome, CoapplicantIncome

  • Mostly complete variables: Married, Dependents, Loan_Amount_Term, Gender

  • Variables with significant gaps: Self_Employed, LoanAmount, Credit_History

Of these, Credit_History is the most problematic, and it strongly influences the loan approval decision.


2.2 Defining a Common Dataset for Loan Approvals

In this section, we construct a common loan approval dataset to be used for all of the model studies that follow. We take several steps:

  1. We retain only the records with no missing values in any relevant column. This reduces the number of observations from 614 to 480, which is 78.2% of the original data set. While this is a significant reduction, the remaining data is still sufficient for fitting models.


  2. Next, we omit one irrelevant column, Loan_ID: an identifier that is exogenous to the loan applicant and assumed to relate solely to computer system processes.

  3. Then we add a synthetic variable Total_Income, the sum of ApplicantIncome and CoapplicantIncome.

  4. We scale all quantitative variables to mean 0 and variance 1 with the scale function and retain both the nominal and scaled versions. The scaled variables are named like their nominal counterparts but prefixed with a lower-case s. For example, sApplicantIncome has mean 0 and variance 1, whereas ApplicantIncome holds the nominal monthly income.

  5. Further, we convert the character-type columns from text to factor. This is important for Loan_Status but also relevant for others: Credit_History, Property_Area, Gender, Married, Education, and Self_Employed. In addition, we convert Dependents to an ordered factor because the number of children is ordered.

Lastly, the common data set is exported to a flat file. Each group member performs their Loan Approval model study by loading the common data set and making subsequent selections or transformations while retaining the same observations.

##     Education ApplicantIncome CoapplicantIncome Property_Area Loan_Status
## 480         1               1                 1             1           1
## 43          1               1                 1             1           1
## 25          1               1                 1             1           1
## 5           1               1                 1             1           1
## 19          1               1                 1             1           1
## 1           1               1                 1             1           1
## 10          1               1                 1             1           1
## 1           1               1                 1             1           1
## 12          1               1                 1             1           1
## 1           1               1                 1             1           1
## 1           1               1                 1             1           1
## 12          1               1                 1             1           1
## 1           1               1                 1             1           1
## 2           1               1                 1             1           1
## 1           1               1                 1             1           1
##             0               0                 0             0           0
##     Total_Income Married Gender Loan_Amount_Term Dependents LoanAmount
## 480            1       1      1                1          1          1
## 43             1       1      1                1          1          1
## 25             1       1      1                1          1          1
## 5              1       1      1                1          1          1
## 19             1       1      1                1          1          0
## 1              1       1      1                1          1          0
## 10             1       1      1                1          0          1
## 1              1       1      1                1          0          0
## 12             1       1      1                0          1          1
## 1              1       1      1                0          1          1
## 1              1       1      1                0          0          1
## 12             1       1      0                1          1          1
## 1              1       1      0                1          1          1
## 2              1       0      1                1          0          1
## 1              1       0      1                1          0          0
##                0       3     13               14         15         22
##     Self_Employed Credit_History    
## 480             1              1   0
## 43              1              0   1
## 25              0              1   1
## 5               0              0   2
## 19              1              1   1
## 1               0              0   3
## 10              1              1   1
## 1               1              1   2
## 12              1              1   1
## 1               0              1   2
## 1               1              1   2
## 12              1              1   1
## 1               1              0   2
## 2               1              1   2
## 1               1              1   3
##                32             50 149

We see that we have some missing values, including a substantial run in Credit_History. As this is likely to be a very important variable, takes only two values, and is not obviously derivable from the other features, it is not a good candidate for imputation.

After exporting the common loan approval data, we load it into the R session and transform character string data to factors. This allows some R models to better handle response variables.

When modeling the common data set, each member may omit some of the income variables to avoid multicollinearity among ApplicantIncome, CoapplicantIncome, and Total_Income (or their scaled equivalents).
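
The pipeline reduces to a few lines, condensed from the full code in the appendix:

# Drop the row identifier, add Total_Income, keep complete cases only,
# add scaled copies of the quantitative variables, then export.
cla <- read_csv("Loan_approval.csv") %>% dplyr::select(-Loan_ID)
cla$Total_Income <- cla$ApplicantIncome + cla$CoapplicantIncome
cla <- na.omit(cla)
cla <- cla %>% mutate( sApplicantIncome   = scale(ApplicantIncome),
                       sCoapplicantIncome = scale(CoapplicantIncome),
                       sTotal_Income      = scale(Total_Income),
                       sLoanAmount        = scale(LoanAmount) )
write_csv(cla, "cla.csv")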

2.3 Exploring data completeness

Rows with missing data have been removed. Five of our independent variables have only two levels, as displayed in n_unique. For our dependent variable, Loan_Status, we have twice as many yeses as noes. We could downsample to get an even yes/no split, but with only 480 rows left, we decided we didn’t want to lose any more data. The four numeric variables have been duplicated, then centered to a mean of 0 and scaled to a standard deviation of 1. Since the split points our CART algorithm selects are unaffected by centering and scaling, and scaling shouldn’t affect the relationships between variables, we stick with the original currency values.

Next we explore the pairwise correlations of numeric and factor variables separately.

The pairwise correlation plot marks statistical significance with three asterisks next to the correlation value. As expected, Total_Income is highly correlated with the applicant and coapplicant incomes, since it is their sum. More importantly, LoanAmount is moderately positively correlated with ApplicantIncome and weakly positively correlated with CoapplicantIncome. Overall, there doesn’t appear to be a large difference in the distributions of the numeric variables between approved and unapproved loans.

For pairwise factor bar charts, we look for large mismatches in box size within the upper triangle of the matrix. For example, consider number of dependents vs. married: the top row of boxes looks fairly even, meaning married people have a fairly even distribution of number of dependents. However, non-married people mostly have 0 dependents and not many 1s, 2s, or 3+s. Our variable of interest, Loan_Status, has a large mismatch on Credit_History: roughly equal numbers of applicants with and without credit history are denied, but almost no one is approved without a credit history of 1. This is quantified in our earlier analysis. This large difference should make credit history a good predictor variable.
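
The pairwise views come from GGally; a condensed sketch of the appendix calls (with the colour mapping written as an explicit aesthetic):

# Numeric pairs (incomes, loan amount, total income) coloured by loan outcome.
GGally::ggpairs( ncla, mapping = ggplot2::aes(colour = Loan_Status),
                 columns = c(6:8, 13, 12) )
# Factor pairs: the demographic and credit variables against Loan_Status.
GGally::ggpairs( ncla, mapping = ggplot2::aes(colour = Loan_Status),
                 columns = c(1:5, 10:12) )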

2.4 Conditional Approval Rates by Categorical Variable


One simple type of exploratory data analysis is to examine the conditional approval rates by values of each categorical variable \(C\).
For example, there are three subcategories (Urban, Semiurban, Rural) of the variable Property_Area.
If we find significant differences in the conditional loan approval rates between Urban and Rural applications, this suggests that Property_Area may help to predict approvals.
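
The computation reduces to a two-way contingency table; a minimal sketch of the idea (the display_approval_rate helper in the appendix produces the formatted tables shown below):

# Approval rate per category: P(Loan_Status = "Y" | category), in percent.
tab = table(cla$Loan_Status, cla$Property_Area)
prop.table(tab, margin = 2)["Y", ] * 100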

The conditional approval rates for each categorical variable are displayed below.

Loan Approved vs. Property_Area by Percent
Rural Semiurban Urban
N (%) 11.2 8.8 10.8
Y (%) 17.7 31.0 20.4
Property_Area Column Total %: 29.0 39.8 31.2
Loan Approval Rate by Column% 61.2 78.0 65.3
Total Applications: 139.0 191.0 150.0
Loan Approved vs. Credit_History by Percent
0 1
N (%) 13.1 17.7
Y (%) 1.5 67.7
Credit_History Column Total %: 14.6 85.4
Loan Approval Rate by Column% 10.0 79.3
Total Applications: 70.0 410.0
Loan Approved vs. Loan_Amount_Term by Percent
36 60 84 120 180 240 300 360 480
N (%) 0.4 0.0 0.2 0.0 2.5 0.2 1.0 24.8 1.7
Y (%) 0.0 0.4 0.4 0.6 5.0 0.2 0.8 60.8 0.8
Loan_Amount_Term Column Total %: 0.4 0.4 0.6 0.6 7.5 0.4 1.9 85.6 2.5
Loan Approval Rate by Column% 0.0 100.0 66.7 100.0 66.7 50.0 44.4 71.0 33.3
Total Applications: 2.0 2.0 3.0 3.0 36.0 2.0 9.0 411.0 12.0
Loan Approved vs. Self_Employed by Percent
No Yes
N (%) 26.0 4.8
Y (%) 60.2 9.0
Self_Employed Column Total %: 86.2 13.8
Loan Approval Rate by Column% 69.8 65.2
Total Applications: 414.0 66.0
Loan Approved vs. Education by Percent
Graduate Not Graduate
N (%) 23.3 7.5
Y (%) 56.5 12.7
Education Column Total %: 79.8 20.2
Loan Approval Rate by Column% 70.8 62.9
Total Applications: 383.0 97.0
Loan Approved vs. Dependents by Percent
0 1 2 3+
N (%) 18.1 5.8 4.2 2.7
Y (%) 39.0 10.8 13.5 5.8
Dependents Column Total %: 57.1 16.7 17.7 8.5
Loan Approval Rate by Column% 68.2 65.0 76.5 68.3
Total Applications: 274.0 80.0 85.0 41.0
Loan Approved vs. Married by Percent
No Yes
N (%) 13.3 17.5
Y (%) 21.9 47.3
Married Column Total %: 35.2 64.8
Loan Approval Rate by Column% 62.1 73.0
Total Applications: 169.0 311.0
Loan Approved vs. Gender by Percent
Female Male
N (%) 6.7 24.2
Y (%) 11.2 57.9
Gender Column Total %: 17.9 82.1
Loan Approval Rate by Column% 62.8 70.6
Total Applications: 86.0 394.0

We observe large differences in approval rates in the following categorical variables:

  1. Credit_History: Those with credit history have a 79.3% approval rate vs. 10.0% for those without.

  2. Property_Area: Rural approval rates (61.2%) are lower than Semiurban (78.0%).

  3. Married: The unmarried applicant approval rate (62.1%) is lower than the married rate (73.0%).

  4. Education: Not Graduate approval rates (62.9%) are lower than Graduate (70.8%).

  5. Gender: The female applicant approval rate (62.8%) is lower than the male rate (70.6%).

These differences suggest the categorical variables most significant for prediction.

3 Loan Approvals: Decision Tree Models

3.1 Building and Training a Decision Tree


First we create our model recipe. The recipe object describes the dependent and independent variables we wish to use and the data source.

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         12
## 
## Training data contained 480 data points and no missing data.

Next we define the statistical model we wish to run on our recipe object. In this case, we run a decision tree from the rpart package, with the mode set to classification. For classification, rpart selects splits that maximize the purity of the child nodes (by default via the Gini index); the sum-of-squares criterion applies to regression trees. The parsnip specification is sketched below.
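
# Declare a classification tree backed by the rpart engine (from the appendix).
tree <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")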

## Decision Tree Model Specification (classification)
## 
## Computational engine: rpart

Now we create our training and test splits using an 80/20 split. The training set will be used to train our model while the test split will be used to evaluate our final model.

## <Analysis/Assess/Total>
## <385/95/480>

Now we use our model engine and our training data to create a classification model.

## parsnip model object
## 
## Fit time:  13ms 
## n= 385 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 385 117 Y (0.3038961 0.6961039)  
##    2) Credit_History=0 58   6 N (0.8965517 0.1034483) *
##    3) Credit_History=1 327  65 Y (0.1987768 0.8012232)  
##      6) Loan_Amount_Term=36,240,300,480 16   7 N (0.5625000 0.4375000) *
##      7) Loan_Amount_Term=60,84,120,180,360 311  56 Y (0.1800643 0.8199357)  
##       14) Total_Income>=18249 9   3 N (0.6666667 0.3333333) *
##       15) Total_Income< 18249 302  50 Y (0.1655629 0.8344371)  
##         30) Total_Income< 2381.5 7   2 N (0.7142857 0.2857143) *
##         31) Total_Income>=2381.5 295  45 Y (0.1525424 0.8474576) *

Our top-level splits are credit history, loan amount term, then total income. Our first branch is also a terminal node: when credit history = 0, an applicant is very unlikely to be approved for a loan.

Now we’re going to resample our training data with 10-fold cross-validation, repeatedly fitting on a random 90% and checking the variability of our estimates.

Here we apply our resampled data to our model and define the metrics of interest.
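
Condensed from the appendix, the resampling and metric collection look like this:

# 10-fold CV stratified on the outcome, scored on several classification metrics.
loan_xval <- vfold_cv(data = loan_train, strata = Loan_Status)
tree_res  <- fit_resamples( tree, loan_rec, resamples = loan_xval,
                            metrics = metric_set(accuracy, kap, roc_auc, sens, spec) )
tree_res %>% collect_metrics(summarize = TRUE)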

Our cross-validated accuracy is 73%. We select the model that optimizes accuracy; this is the model we use in our final workflow. We fit the full training data one last time and view our splits.

## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## n= 385 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 385 117 Y (0.3038961 0.6961039)  
##    2) Credit_History=0 58   6 N (0.8965517 0.1034483) *
##    3) Credit_History=1 327  65 Y (0.1987768 0.8012232)  
##      6) Loan_Amount_Term=36,240,300,480 16   7 N (0.5625000 0.4375000) *
##      7) Loan_Amount_Term=60,84,120,180,360 311  56 Y (0.1800643 0.8199357)  
##       14) Total_Income>=18249 9   3 N (0.6666667 0.3333333) *
##       15) Total_Income< 18249 302  50 Y (0.1655629 0.8344371)  
##         30) Total_Income< 2381.5 7   2 N (0.7142857 0.2857143) *
##         31) Total_Income>=2381.5 295  45 Y (0.1525424 0.8474576) *

Our branches and nodes do not appear to have changed. Let’s look at variable importance. Importance in a decision tree combines how often a variable is used in a split, how high in the tree those splits occur, and how much each split improves node purity.

Credit history is by far our most important variable for predicting loan status.

Lastly, we test our model and collect our metrics.

Our test-set accuracy is roughly 75% (71 of 95 predictions correct). Below is our confusion matrix.

##           Truth
## Prediction  N  Y
##          N 13  6
##          Y 18 58

4 Loan Approvals: Random Forest Models

4.1 Feature Selection

We first run all variables (save the index variable Loan_ID and the constituent parts of scaled or combined variables) through a random forest to use as a variable selection sieve. We use VSURF, as it was the best-performing method in Speiser et al. (2019).

VSURF, for the uninitiated, is a multi-step variable selection process that first ranks variables by importance and excludes features whose importance is null. It then generates two selections: one for interpretation, built by adding variables sequentially and growing random forests while minimizing OOB error, and one for prediction, which similarly adds variables sequentially but, instead of blindly minimizing OOB error, penalizes variables that add only noise.

VSURF selects a single variable “model” for predictive use. This suggests that a very parsimonious model indeed might be sufficient.
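
The call itself is short; this mirrors the appendix code (use_data is the training data minus the excluded income components):

# Run VSURF on predictors vs. outcome; varselect.pred indexes the
# variables retained for prediction.
surfedModel <- VSURF::VSURF( use_data %>% dplyr::select(-Loan_Status),
                             use_data$Loan_Status, parallel = FALSE )
colnames( use_data %>% dplyr::select(-Loan_Status) )[surfedModel$varselect.pred]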

Using this as a “model” we get an accuracy of 83.1% and a Kappa of 0.486. Cohen (1960) would interpret that as agreement unlikely to be caused by chance, which matters: a model that simply flagged every loan as a success would have an accuracy of 0.737 but, of course, a Kappa of 0.

4.2 Model building

If we ignore VSURF’s suggestion, we can sweep through the ranked recommended features (reported in order from most promising to least).

Perhaps unsurprisingly (as this is broadly similar to what VSURF does internally), we don’t seem to gain any benefit from adding additional variables; indeed, the model actually performs somewhat worse for many choices.

Still, if we use the interpretation features VSURF selected and a wide parameter grid for the tunable parameters mtry and number of trees, we can attempt to tune a model, as sketched below.
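
A condensed view of the tuning setup from the appendix (loans_rf_rec and our_data hold the recipe and the VSURF-selected training columns):

# Regular grid over tree count and mtry; tune by cross-validated kappa/accuracy.
loans_grid  <- grid_regular( trees(), finalize(mtry(), use_data), levels = 5 )
loan_ranger <- workflow() %>%
  add_recipe(loans_rf_rec) %>%
  add_model( rand_forest(mtry = tune(), trees = tune(), mode = "classification") %>%
               set_engine("ranger", importance = "impurity") )
tree_res <- loan_ranger %>%
  tune_grid( resamples = vfold_cv(our_data), grid = loans_grid,
             metrics = metric_set(kap, accuracy) )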

mtry  trees  .config
6     500    Preprocessor1_Model12

.metric   .estimator  .estimate
accuracy  binary      0.8526316
kap       binary      0.6006006

Looking at breakdowns of performance, we see the best Kappa at an mtry of 3 and 1500 trees. This does not appear to be a particularly effective tuning, and many other settings perform similarly. Unfortunately, the tuning doesn’t yield much actual model improvement.

Our full model has the accuracy and kappa presented above. Unfortunately, this is not clearly better than our single-variable credit-history “model”.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  Y
##          N 16 10
##          Y  4 65
##                                          
##                Accuracy : 0.8526         
##                  95% CI : (0.7651, 0.917)
##     No Information Rate : 0.7895         
##     P-Value [Acc > NIR] : 0.0791         
##                                          
##                   Kappa : 0.6006         
##                                          
##  Mcnemar's Test P-Value : 0.1814         
##                                          
##             Sensitivity : 0.8000         
##             Specificity : 0.8667         
##          Pos Pred Value : 0.6154         
##          Neg Pred Value : 0.9420         
##              Prevalence : 0.2105         
##          Detection Rate : 0.1684         
##    Detection Prevalence : 0.2737         
##       Balanced Accuracy : 0.8333         
##                                          
##        'Positive' Class : N              
## 

We examine the confusion matrix and see the raw accuracy of 0.8526 reported above.

5 Loan Approvals: Gradient Boosting

Gradient boosting is an ensemble method that applies repeated models to resampled data and pools their estimates. A boosted tree fits the next tree to the residuals of the prior tree; the residual error is reduced with each successive tree, typically using shallow trees. Our gradient boosting approach uses the XGBoost algorithm, which stands for extreme gradient boosting.

5.1 Feature selection

An initial attempt at the XGBoost model with all of the feature variables proved quite underwhelming, with an accuracy of 0.75 and a kappa of just 0.272. The importance chart showed \(Credit\_History\) as clearly the most important (greater than 75%), then \(ApplicantIncome\), followed by \(Total\_Income\) and \(LoanAmount\). The dummy variable \(Property\_Area\_Semiurban\) also made the list of important variables. In an effort to simplify the model and to construct a definite relationship between income and loan amount, a derived variable \(Income\_To\_LoanAmt\) is calculated as the ratio of total income to the requested loan amount. The derived variable proved beneficial in increasing model accuracy.

Based on the importance factor of the features from the initial baseline model along with the success of the derived variable, the following variables are selected as input into the gradient boosting model.

  • Credit_History: the most important feature

  • Income_To_LoanAmt: derived feature based on the combined income of the applicant and co-applicant to the requested loan amount

  • Property_Area: valuable feature for semiurban and urban

The data is split 80% for training and 20% for test. Initial model attempts with the 80/20 split underfit: the model predicted \(Loan\_Status\) as Yes for every instance. By decreasing the tree count and lowering the number of cross-validation folds, the gradient boosting model began to perform better. The derived feature and the split are sketched below.
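
Condensed from the appendix:

# Ratio of combined income to requested loan amount, then a stratified
# 80/20 split and 5 cross-validation folds.
lap_data <- cla
lap_data$Income_To_LoanAmt <- lap_data$Total_Income / lap_data$LoanAmount
lap_data <- dplyr::select( lap_data, Credit_History, Income_To_LoanAmt,
                           Property_Area, Loan_Status )
lap_data_split <- initial_split(lap_data, prop = 0.8, strata = Loan_Status)
lap_training   <- training(lap_data_split)
lap_folds      <- vfold_cv(lap_training, v = 5)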

##  Credit_History Income_To_LoanAmt   Property_Area Loan_Status
##  0: 70          Min.   : 12.09    Rural    :139   N:148      
##  1:410          1st Qu.: 35.53    Semiurban:191   Y:332      
##                 Median : 41.67    Urban    :150              
##                 Mean   : 52.11                               
##                 3rd Qu.: 52.11                               
##                 Max.   :396.37

5.2 Model definition

The XGBoost engine creates the gradient boosting model. The tree parameter is set to 25 based on trial and error; sample runs above 30 did not improve accuracy. With five cross-validation folds, each evaluated over 10 parameter combinations, 50 unique XGBoost models are generated, each building 25 trees. With so few variables, a high number of cross-validation folds, parameter combinations, or trees is not needed to produce high accuracy.

To generate the parameter combinations, the tune() placeholder allows each parameter to be filled in from a tuning grid, which identifies the best-performing parameter settings.

The parameters and their sampling ranges are declared first, and the grid is then built with grid_max_entropy(). A grid size of 10 was selected by trial and error; values above 10 did not improve model accuracy. As mentioned above, 10 parameter combinations for each of the 5 folds produce 50 unique models. With so few variables, a large number of model variations is not needed to produce high accuracy. The specification and grid are sketched below.
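
Condensed from the appendix (engine arguments omitted):

# XGBoost spec: 25 trees per model, five parameters left to the tuner.
mod_boost <- boost_tree( trees = 25, mtry = tune(), min_n = tune(),
                         tree_depth = tune(), learn_rate = tune(),
                         sample_size = tune() ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# Max-entropy grid of 10 parameter combinations over the declared ranges.
boost_params <- parameters( min_n(), tree_depth(), learn_rate(),
                            finalize(mtry(), dplyr::select(lap_training, -Loan_Status)),
                            sample_size = sample_prop(c(0.4, 0.9)) )
boost_grid <- grid_max_entropy(boost_params, size = 10)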

In order to properly run the tidymodels approach to model creation, a recipe is defined for the training data. The numeric input variables are transformed with the Yeo-Johnson transformation, which is similar to the Box-Cox transformation, and then normalized. The nominal variables are converted to dummy variables within the recipe. The Yeo-Johnson transformation gives \(Income\_To\_LoanAmt\) a near-normal shape; without it, model accuracy suffered greatly.
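
The recipe, as defined in the appendix:

# Yeo-Johnson then normalization for numeric predictors; dummy-code nominals.
lap_recipe <- recipe(Loan_Status ~ ., data = lap_training) %>%
  step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes())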

## Rows: 385
## Columns: 5
## $ Income_To_LoanAmt       <dbl> 0.13615446, -0.12381804, 1.02971637, 0.6166922…
## $ Loan_Status             <fct> Y, Y, N, Y, Y, N, Y, Y, N, N, N, Y, Y, Y, N, N…
## $ Credit_History_X1       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1…
## $ Property_Area_Semiurban <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0…
## $ Property_Area_Urban     <dbl> 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1…

The tune_grid() method computes the evaluation metrics for each candidate model; it produces the 50 unique XGBoost models under consideration. The results with the highest Kappa value are displayed. Log-loss is the default evaluation metric of the XGBoost algorithm.

The five parameters - mtry, min_n, tree_depth, learn_rate, and sample_size - are plotted against the mean log-loss value, the default XGBoost evaluation metric. The mtry plot shows a range of 1 to 3, with most results on 2. The min_n plot shows a range just above zero to over 30. The tree depth plot shows a range just above zero to over 10, as 14 appears as the most common value. The learn rate plot shows the lower learn rates to be most common. The sample size plot follows the defined range from 0.4 to 0.9 with slightly higher volume at the range’s higher end.

Based on the highest Kappa value, the best gradient boosting model is selected. The best model as shown below contains the following parameter values.

  • mtry 2: Number of predictors randomly sampled at each split when creating the tree models. With only three variables, two is sufficient.

  • min_n 3: Minimum number of data points in a node before the node is split further. The value of three does seem low considering the number of instances.

  • tree_depth 14: Maximum depth of the tree. With three variables, this tree depth is likely higher than needed.

  • learn_rate 1.484305e-07: Rate at which the algorithm adapts for each iteration. The lower the learn rate the better.

  • sample_size 0.8813854: Amount of data used in the fitting step. With a defined range of 0.4 to 0.9, the result indicates a larger percentage of the data used in the fitting routine.

5.3 Model performance

With the best model selected, the workflow is finalized, the model is fit to the training data, and predictions are then generated on the test data set.
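
Condensed from the appendix:

# Lock in the best parameters, refit on the training data, score the test split.
final_boost_workflow <- boost_workflow %>% finalize_workflow(best_boost)
boost_last_fit <- final_boost_workflow %>% last_fit(lap_data_split)
boost_last_fit %>% collect_metrics()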

## [08:39:24] WARNING: amalgamation/../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

The variable importance plot confirms the primary importance of the \(Credit\_History\) variable. Given the small number of input variables, the chart also shows \(Income\_To\_LoanAmt\) as the second most important, followed by the dummy variables for the property area.

The model produces an accuracy of 86.3% and an AUC of 91.5% based on the fit metrics. The ROC-AUC is plotted to confirm the calculated percentage.

The predictions from the gradient boosting model produce a final accuracy of 86.3% and a Kappa of 63.9%; the final confusion matrix of the test data predictions is shown below.

##           Truth
## Prediction  N  Y
##          N 13  2
##          Y 16 64

6 Loan Approvals: Model Performance and Assessment

In this section, we compare the performance of the decision tree, random forest and gradient boosting models on the Loan Approvals dataset.

6.1 Model Performance

6.2 Using Prediction in Loans Approvals

We can use Decision Trees, Random Forests, or Gradient Boosting to build a model. The models’ final performance is summarized below:

Model               Accuracy  Kappa
Decision Trees      TBD       TBD
Random Forest       0.811     0.455
Gradient Boosting   TBD       TBD

Based on this, we select the TBD model as having the best balance of performance and ease of understanding.

6.2.1 Recommendations

The loan data is strongly controlled by credit history and combined income. It seems likely that we could extract more information from the loan amount and loan term if we knew the interest rates offered or requested. This could lead to a new composite variable, yearly payment, which would likely relate to the applicant’s income.

It would be very interesting to break the Credit_History feature into components, or at least to replace it with a more continuous variable (say, points above or below the FICO cut-off). Our models showed it to have an almost overwhelmingly strong impact on the output, and presumably it is itself based on more continuous data that could be further analyzed. It seems likely, for example, that applicants who fail the credit history requirement narrowly are closer to approval than otherwise similar applicants who are denied; however, the data provided does not allow us to state that.

7 Appendices

7.1 References


Akin, J. (2020, March 5). Does income affect credit scores? Experian. https://www.experian.com/blogs/ask-experian/does-income-affect-credit-scores/

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.

Genuer, R., Poggi, J. M., & Tuleau-Malot, C. (2015). VSURF: An R package for variable selection using random forests. The R Journal, 7(2), 19-33.

Speiser, J. L., Miller, M. E., Tooze, J., & Ip, E. (2019). A comparison of random forest variable selection methods for classification prediction modeling. Expert Systems with Applications, 134, 93-101.

UC Business Analytics R Programming Guide: Random Forests.

7.2 Code

We summarize all the R code used in this project in this appendix for ease of reading.

# Your libraries go here

library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)

library(caret)    # Model Framework
library(skimr)    # Used for EDA
library(klaR)     # Implemented KNN and Naive Bayes models, etc
library(class)    # used for KNN classifier

# PLEASE ADD YOUR R LIBRARIES BELOW
# ------------------------------
library(tidymodels)
library(VSURF) 
library(dplyr)
library(mice)

library(vip)
library(reshape2) # PT
library(data.table) # PT
library(xgboost)


# ---------------------------------
knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE)

library(palmerpenguins)

# The final dataset excludes observations with missing sex and drops the island and year variables.
pc = penguins %>% filter( is.na(sex) == FALSE) %>% dplyr::select( -one_of("island", "year") )

head(pc) %>% kable(caption="Scrubbed penguin data set") %>% kable_styling(bootstrap_options = c("striped", "hover"))


set.seed(10)

#  Construct the standardized dataset with the response variable, the 4 quantitative variables normalized 
#  with the scale function and the sex variable normalized with the 1,-1 assignment explained previously.
#
standardized = cbind( pc[,1], scale( pc[, 2:5] ) , sex = data.table::fifelse(pc$sex == 'male' , 1 , -1) )

# Define an 80-20 split of the training and test data.
# -------------------------------------------------------------------------------------
training.individuals = createDataPartition(standardized$species, p= 0.8 , list = FALSE)

# X variables include bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,  sex (converted to -1,1)
# Y variable is just the species class
# train data is the standardized subset of the full data set
# test data is the complement of the training data set.
# -----------------------------------------------------------------
train.X  =  standardized[ training.individuals,  2:6]
test.X   =  standardized[-training.individuals,  2:6]
train.Y  =  standardized[ training.individuals,    1]
test.Y   =  standardized[-training.individuals,    1]

head(train.X ) %>% kable(caption="Normalized Training Data - 1st Few Rows", digits = 2 ) %>% 
  kable_styling(bootstrap_options = c("striped", "hover") )
model = train( species ~. , data = standardized , method = "knn" ,
         trControl = trainControl("cv", number = 5 ), tuneLength = 10 )

# Plot the model output vs. varying k values.
plot(model)
model

# The best tuning parameter k that 
# optimized model accuracy is:
#model$bestTune

# Simpler interface using the knn function within class.
knn.pred5 = knn(train.X, test.X, train.Y , k = 5, prob = FALSE) 
cm5 = confusionMatrix(knn.pred5, test.Y) 


# This model approach uses tuneGrid with a single value to calculate kNN 
# only at a single number of nearest neighbors.
# ------------------------------------------------------------------------------
x2 = train( species ~ . , data = standardized[training.individuals,] , 
            method = "knn", 
            trControl = trainControl(method="none"), 
            tuneGrid=data.frame(k=5) )

pred2 = predict(x2, newdata=test.X)

(cmknn2=confusionMatrix(pred2, test.Y))

# We visualize a projection of the decision boundary flattened to 2 dimensions to illustrate
# the smoothness of the KNN boundary at different values of k.
# First, choose the variables, then build a sampling grid.
# ------------------------------------------------------------------------
pl = seq(min(standardized$bill_length_mm), max(standardized$bill_length_mm), by = 0.05)
pw = seq(min(standardized$bill_depth_mm), max(standardized$bill_depth_mm), by = 0.05 )

# The sampling grid needs explicit values to be plugged in for the variables
# which are not the two projected variables.  Since variables are standardized to
# mean 0 and variance 1, we choose zero for flipper_length and body_mass because they are continuous 
# and the mean and mode are close.
# However, for sex, we choose male because the average sex = 0 is not the mode.
# In practice, sex = 0 gives weird decision boundaries in the projected plot.
# ------------------------------------------------------------------------------
lgrid = expand.grid(bill_length_mm=pl , bill_depth_mm = pw , flipper_length_mm = 0, body_mass_g = 0 , sex = 1 )

knnPredGrid = predict(x2, newdata=lgrid )

num_knnPredGrid = as.numeric(knnPredGrid)

num_pred2 = as.numeric(pred2)

test = cbind(test.X, test.Y)
test$Pred = num_pred2
ggplot(data=lgrid) + stat_contour(aes(x=bill_length_mm, y= bill_depth_mm, z = num_knnPredGrid), bins = 2 ) +
  geom_point( aes(x=bill_length_mm, y= bill_depth_mm, color = knnPredGrid),  size = 1, shape = 19, alpha = 0.2) +
  geom_point( data= test , aes(x=bill_length_mm, y=bill_depth_mm, color=pred2 ) , size = 4, alpha = 0.8, shape =24 ) +
  ggtitle("KNN Decision Boundary for Penguins Data with k=5 neighbor")


# This model approach uses tuneGrid with a single value to calculate kNN 
# only at a single number of nearest neighbors.
# ------------------------------------------------------------------------------
x3 = train( species ~ . , data = standardized[training.individuals,] , 
            method = "knn", 
            trControl = trainControl(method="none"), 
            tuneGrid=data.frame(k=1) )  # Now we use a k=1 nearest neighbor algorithm which is overfitted.

pred3 = predict(x3, newdata=test.X)


knnPredGrid3 = predict(x3, newdata=lgrid )

num_knnPredGrid3 = as.numeric(knnPredGrid3)

num_pred3 = as.numeric(pred3)

test = cbind(test.X, test.Y)
test$Pred = num_pred3

ggplot(data=lgrid) + stat_contour(aes(x=bill_length_mm, y= bill_depth_mm, z = num_knnPredGrid3), bins = 2 ) +
  geom_point( aes(x=bill_length_mm, y= bill_depth_mm, color = knnPredGrid3),  size = 1, shape = 19, alpha = 0.2) +
  geom_point( data= test , aes(x=bill_length_mm, y=bill_depth_mm, color=pred3 ) , size = 4, alpha = 0.8, shape =24 ) +
  ggtitle("KNN Decision Boundary for Penguins Data with k=1 neighbor")
# This code chunk should only be run to build the common data set.  
# Any model builder should rely on importing the common data set only.
# Thus, set eval=TRUE when generating a new version of the common data set.
# Otherwise, set eval=FALSE to skip this step.
# ------------------------------------------------------------------------

# This code block assumes the raw Loan Approval data file in csv form has been placed in the same folder as this script.
# ----------------------------------------------------------------------------------------
cla <- read_csv("Loan_approval.csv") %>% dplyr::select(-Loan_ID)  # drop the row identifier.

# Add a column for Total Income of Applicant and Co-Applicant
# ---------------------------------------------------------------------------
cla$Total_Income = cla$ApplicantIncome + cla$CoapplicantIncome

mice::md.pattern(cla, rotate.names = T)


# We build a dataset in which all observations have fully populated values.
# ---------------------------------------------------------------------------
cla = na.omit(cla)

# Add mean zero and variance 1 versions of quantitative variables.
# ----------------------------------------------------------
cla %>% mutate( sApplicantIncome = scale(ApplicantIncome), 
                sCoapplicantIncome = scale(CoapplicantIncome),
                sTotal_Income = scale(Total_Income) ,
                sLoanAmount   = scale( LoanAmount)
                ) -> cla

head(cla)

write_csv(cla, "cla.csv") # Write the scaled common loan approvals data set to local disk.  

cla = read_csv("cla.csv")

# Transform the character data columns into factors
# --------------------------------------------------------
cla$Loan_Amount_Term = factor(cla$Loan_Amount_Term)
cla$Loan_Status = factor(cla$Loan_Status)  # Convert the response to factor
cla$Credit_History = factor(cla$Credit_History)
cla$Property_Area = factor( cla$Property_Area)
cla$Gender = factor( cla$Gender)
cla$Married = factor(cla$Married)
cla$Dependents = ordered(cla$Dependents, levels = c("0" , "1", "2" , "3+") )
cla$Education = factor(cla$Education)
cla$Self_Employed = factor( cla$Self_Employed)

ncla <- cla[,1:13]
GGally::ggpairs(ncla, mapping = ggplot2::aes(colour = Loan_Status), columns = c(6:8, 13, 12)) %>% print()
GGally::ggpairs(ncla, mapping = ggplot2::aes(colour = Loan_Status), columns = c(1:5, 10:12)) %>% print()

#  Display the approval rate statistics for each qualitative field
#  in absolute count and percentage terms by category.
# 
display_approval_rate <- function(la , field )
{
    a = la[, "Loan_Status"]
    b = la[, field]
    
    c = table(cbind(a,b))  # Count table
    
    pcts = as.data.frame.matrix(c) /nrow(la) * 100   # Percentage equivalent
    col_sums = colSums(pcts)  # Shows the results by the field

    app_rate = pcts[2,]/col_sums   * 100   # Approval rate assumes row 2 is the "Yes" row.

    yy = rbind( pcts, col_sums, app_rate, colSums(c)) 
    rownames(yy)[1]= paste0( rownames(yy)[1], " (%)")
    rownames(yy)[2]= paste0( rownames(yy)[2], " (%)")
    rownames(yy)[3]= paste0( field, " Column Total %:")
    rownames(yy)[4]="Loan Approval Rate by Column%"
    rownames(yy)[5]="Total Applications:"
    print(yy %>% kable(digits = 1, caption = paste0('Loan Approved vs. ', field , ' by Percent') ) %>%
             kable_styling(bootstrap_options = c("striped", "hover") )  %>%
             row_spec(4, background = "skyblue", bold = TRUE) )
}

cat_vars = c("Property_Area", "Credit_History", "Loan_Amount_Term", "Self_Employed", "Education", "Dependents", "Married", "Gender")

for( field in cat_vars)
{
   display_approval_rate( cla , field)  
   cat("\n")
}
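
# ------------------------------------------------------------------
# Section 3: decision tree with tidymodels (recipe, split, fit, CV)
# ------------------------------------------------------------------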

loan_rec <- recipe(Loan_Status~., data=ncla) %>% 
  prep()

loan_rec
tree <- decision_tree() %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

tree
set.seed(1234)
loan_split <- initial_split(ncla, prop = .8)

loan_train <- training(loan_split)
loan_test <- testing(loan_split)

loan_split
tree_fit <- tree %>% fit(Loan_Status~., data = loan_train)

tree_fit 
set.seed(1234)
loan_xval <- vfold_cv(data = loan_train, strata = Loan_Status) 
tree_res <- fit_resamples(
  tree,
  loan_rec,
  resamples = loan_xval, 
  metrics = metric_set(accuracy, kap, roc_auc, sens, spec),
  control = control_resamples(save_pred = TRUE)) 

tree_res %>%  collect_metrics(summarize = TRUE)
best_tree <- tree_res %>%
  select_best("accuracy")

final_wf <- 
  workflow() %>% 
  add_model(tree) %>% 
  add_recipe(loan_rec) %>%
  finalize_workflow(best_tree)

final_tree <- 
  final_wf %>%
  fit(data = loan_train)

final_tree
final_tree %>% 
  pull_workflow_fit() %>% 
  vip::vip()
final_fit <- 
  final_wf %>%
  last_fit(split = loan_split, metrics = metric_set( accuracy, kap, roc_auc, sens, spec))

final_fit %>%
  collect_metrics()
final_fit %>%
  collect_predictions() %>% 
  conf_mat(Loan_Status, .pred_class)
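
# ------------------------------------------------------------------
# Section 4: random forest with VSURF feature selection and tuning
# ------------------------------------------------------------------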
set.seed(10)
data_split <- initial_split(cla, prop = .8)
train_data <- training(data_split)
test_data  <- testing(data_split)
use_data <- train_data %>%dplyr::select( -ApplicantIncome, -CoapplicantIncome, -Total_Income, -sApplicantIncome, -sCoapplicantIncome, -LoanAmount)
surfedModel <- invisible(VSURF::VSURF(use_data%>%dplyr::select(-Loan_Status), use_data$Loan_Status, parallel=F))
vsurfStages <- tibble('Stage' = "Prediction" , 'Fields' =colnames(use_data%>%dplyr::select(-Loan_Status))[surfedModel$varselect.pred])
vsurfStages <-vsurfStages %>% add_row('Stage' = 'interpretation' , 'Fields'=paste(colnames(use_data%>%dplyr::select(-Loan_Status))[surfedModel$varselect.interp], collapse=", "))
vsurfStages <-vsurfStages %>% add_row('Stage' = 'Ranked Importance' , 'Fields'=paste(colnames(use_data%>%dplyr::select(-Loan_Status))[surfedModel$varselect.thres], collapse = ", "))
vsurfStages

loans_testing <- na.omit(testing(data_split))
loans_testing["factorCreditHistory"] <- as.factor(as.logical((loans_testing)$Credit_History==1))
loans_testing["factorLoanStatus"] <- as.factor((loans_testing)$Loan_Status=="Y")
loans_testing %>% metrics(truth = factorLoanStatus, estimate = factorCreditHistory)
loan_testing <- na.omit(testing(data_split))
loan_testing["factorCreditHistory"] <- as.factor(as.logical((loan_testing)$Credit_History==1))
loan_testing["factorLoanStatus"] <- as.factor((loan_testing)$Loan_Status=="Y")
actual_result <- loan_testing %>% metrics(truth = factorLoanStatus, estimate = factorCreditHistory)
ourAccuracy <- tibble('Variables' = 1 , 'Kappa' =actual_result$.estimate[2], 'Accuracy' = actual_result$.estimate[1])
ourAccuracy <- ourAccuracy[-c(1),]
#loan_testing <- testing(data_split) #reset with nas added back in
for(var in 1:length(surfedModel$varselect.thres)){
  loans_rf_rec <-  recipe(Loan_Status~., data = train_data)  %>%
    step_rm(everything(),-all_of(colnames(use_data[(surfedModel$varselect.thres)[1:var]])), -Loan_Status)  %>%
    #step_knnimpute(all_predictors()) %>%
    step_dummy(all_nominal(), -all_outcomes()) %>% 
    prep()
  loan_training<-juice(loans_rf_rec)
  loan_ranger <- rand_forest( mode = "classification") %>%   set_engine("ranger") %>%     fit(Loan_Status ~ ., data = loan_training)
  loan_testing <- loans_rf_rec  %>% bake(testing(data_split)) 
  actual_result <- loan_ranger %>% predict(loan_testing) %>%     bind_cols(loan_testing) %>%     metrics(truth = Loan_Status, estimate = .pred_class)
  ourAccuracy <- ourAccuracy %>% add_row('Variables' = var , 'Kappa' =actual_result$.estimate[2], 'Accuracy' = actual_result$.estimate[1])
}
ourAccuracy %>% ggplot(aes(x=Variables,y=Accuracy)) + ggtitle("Accuracy as variables are added") +geom_point(color="red") + ylim(0,1) + scale_color_viridis_d(option = "plasma", begin = .9, end=0)
ourAccuracy %>% ggplot(aes(x=Variables,y=Kappa)) +ggtitle("Kappa as Variables are added") +geom_point(color="blue") + ylim(0,.6) + scale_color_viridis_d(option = "plasma", begin = .9, end=0)
ourAccuracy
our_data<-use_data[,colnames(use_data[(surfedModel$varselect.thres)])]
our_data$Loan_Status<-use_data$Loan_Status
loans_rf_rec <-  recipe(Loan_Status~., data = our_data)  %>%
  prep()
loans_grid<-grid_regular(trees(), finalize(mtry(), use_data), levels=5)
loan_training<-juice(loans_rf_rec)
loan_ranger <- workflow()%>% add_recipe(loans_rf_rec) %>% add_model(rand_forest(mtry=tune::tune(), trees = tune::tune(), mode = "classification") %>% set_engine("ranger",importance = "impurity"))
folds <- vfold_cv(our_data)
class_metrics <- metric_set( kap, accuracy)
tree_res  <- loan_ranger %>% tune_grid(resamples = folds, grid=loans_grid, metrics=class_metrics)
loans_best <-  tree_res %>% select_best("kap")
final_loans_wf <-loan_ranger %>% finalize_workflow(loans_best)
final_loans_fit <-final_loans_wf %>% fit(data = our_data) 
testing_data<- (testing(data_split)) 
our_data<-testing_data[,colnames(use_data[(surfedModel$varselect.thres)])]
our_data$Loan_Status<-testing_data$Loan_Status
loan_testing <- loans_rf_rec %>% bake(our_data)
actual_result<-final_loans_fit %>% predict(loan_testing) %>%     bind_cols(loan_testing) %>%     metrics(truth = Loan_Status, estimate = .pred_class)
tree_res %>%
    collect_metrics() %>% mutate(trees = factor(trees)) %>% ggplot(aes(mtry, mean, color=trees)) +
    geom_line(size = 1.5, alpha = 0.6) +
    geom_point(size = 2) +
    facet_wrap(~ .metric, scales = "free", nrow = 2) +
    scale_color_viridis_d(option = "plasma", begin = .9, end=0)
final_loans_fit %>%      pull_workflow_fit() %>%      vip()
kable(loans_best)
kable(actual_result)
rfbound<-final_loans_fit %>% predict(loan_testing) %>%     bind_cols(loan_testing)
caret::confusionMatrix(rfbound$Loan_Status, rfbound$.pred_class )

# https://bcullen.rbind.io/post/2020-06-02-tidymodels-decision-tree-learning-in-r/
# Section Boosted Trees

# Tune usage from: https://www.r-bloggers.com/2020/05/tidymodels-and-xgbooost-a-few-learnings/

# Data splitting
set.seed(1234)

lap_data <- cla
lap_data$Income_To_LoanAmt <- lap_data$Total_Income / lap_data$LoanAmount

lap_data <- dplyr::select(lap_data, Credit_History, Income_To_LoanAmt, Property_Area, Loan_Status)

summary(lap_data)

lap_data_split <- initial_split(lap_data, prop=0.8, strata=Loan_Status)

lap_training <- lap_data_split %>% training()

lap_test <- lap_data_split %>% testing()

lap_folds <- vfold_cv(lap_training, v=5, repeats=1)
# Specify the model
mod_boost <- boost_tree(trees = 25,
                        learn_rate = tune(),
                        sample_size = tune(),
                        mtry = tune(),
                        min_n = tune(),
                        tree_depth = tune()) %>%
  set_engine("xgboost", lambda=0, alpha=1, num_class=3, verbose=0) %>%
  set_mode("classification")

boost_params <- parameters(
  min_n(),
  tree_depth(),
  learn_rate(),
  finalize(mtry(), dplyr::select(lap_training,-Loan_Status)),
  sample_size = sample_prop(c(0.4,0.9))
)

boost_grid <- grid_max_entropy(boost_params, size=10)
# Feature Engineering

lap_recipe <- recipe(Loan_Status ~ ., data = lap_training) %>%
  step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes())

lap_recipe %>%
  prep() %>%
  bake(new_data = lap_training) %>% glimpse()
# Create workflow
boost_workflow <- workflow() %>%
  add_recipe(lap_recipe) %>%
  add_model(mod_boost)

class_metrics <- metric_set(accuracy, kap, recall, precision, f_meas, roc_auc, sens, spec)

boost_tuning <- tune_grid(
  object = boost_workflow,
  resamples = lap_folds,
  grid = boost_grid,
  metrics=class_metrics,
  control = control_grid(verbose = TRUE)
)

boost_tuning %>% show_best(metric = 'kap')
boost_tuning %>% collect_metrics() %>% 
  dplyr::select(mean,mtry:sample_size) %>% data.table %>% 
  melt(id="mean") %>% 
  ggplot(aes(y=mean,x=value,colour=variable)) + 
  geom_point(show.legend = FALSE) + 
  facet_wrap(variable~. , scales="free") + theme_bw() +
  labs(y="Mean log-loss", x = "Parameter")
# Select best model based on kap
best_boost <- boost_tuning %>%
  select_best(metric = 'kap')

# view the best tree parameters
best_boost

# finalize workflow
final_boost_workflow <- boost_workflow %>%
  finalize_workflow(best_boost)
# fit the model
boost_wf_fit <- final_boost_workflow %>%
  fit(data = lap_training)
boost_fit <- boost_wf_fit %>%
  pull_workflow_fit()

vip(boost_fit)
# train and evaluate
boost_last_fit <- final_boost_workflow %>%
  last_fit(lap_data_split)

boost_last_fit %>% collect_metrics()

boost_last_fit %>% collect_predictions() %>%
  roc_curve(truth = Loan_Status, estimate = .pred_N) %>%
  autoplot()
boost_predictions <- boost_last_fit %>% collect_predictions()

boost_predictions %>% yardstick::metrics(truth=Loan_Status,estimate=.pred_class)

conf_mat(boost_predictions, truth = Loan_Status, estimate = .pred_class)