Introduction

This document discusses analyses of two datasets, the Palmer Penguin dataset and a Loan Approvals dataset, prepared by Group 6. We divide the document into seven sections (five analysis sections, a conclusion, and appendices) and adopt two key principles in undertaking this analysis:

First, the group has developed a system of checks and balances in preparing each model’s output. A primary and a secondary author cross-validate each model (pun intended) by independently coding and reconciling outputs. Two co-authors also bring a wider perspective to the discussion and diagnostics. Afterwards, the primary author drafts the text for inclusion in this document.

Second, because four of the five analysis sections deal with the Loan Approvals data, we decided to adopt a common dataset for all model analyses. Data wrangling and preparation are done once only. Consistency of the data across the last four analysis sections is essential for assessing model performance.

Section 1 analyzes the Palmer Penguin dataset using the KNN model to predict species. The authors are Alexander Ng (primary) and Randy Thompson (secondary).

Section 2 conducts an exploratory data analysis and defines the common data set for modeling. The author is Randy Thompson (primary), with contributions from all other members.

Section 3 analyzes the Loan Approvals data with a Decision Tree model. The authors are Randy Thompson (primary) and Philip Tanofsky (secondary).

Section 4 analyzes the same Loan Approvals with a Random Forest model. The authors are Scott Reed (primary) and Alexander Ng (secondary).

Section 5 analyzes the same Loan Approvals with a Gradient Boosting approach. The authors are Philip Tanofsky (primary) and Scott Reed (secondary).

Section 6 concludes with a discussion of model performance and an appraisal of the merits of each result. We also consider the role of model-driven prediction in a loan approvals business context.

Section 7 presents our R code and technical appendices and references.

1 Penguins and the KNN algorithm

Following the data cleaning approach taken in prior assignments, we exclude observations where the feature sex is undefined. We also exclude the two variables island and year, which do not seem helpful for prediction. We follow the variable selection process described in a prior assignment here.

This still gives a substantial subset: \(333\) of the \(344\) original observations. The table below illustrates sample rows of the cleansed dataset used in the subsequent KNN analysis.

Scrubbed penguin data set
species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
Adelie 39.1 18.7 181 3750 male
Adelie 39.5 17.4 186 3800 female
Adelie 40.3 18.0 195 3250 female
Adelie 36.7 19.3 193 3450 female
Adelie 39.3 20.6 190 3650 male
Adelie 38.9 17.8 181 3625 female

We obtain 333 rows of observations with 6 columns of explanatory and response variables.

1.1 Data Transformation for KNN

The next step is to normalize the data so that each quantitative variable has mean zero and standard deviation equal to 1. Normalization ensures that each variable has equal influence on the nearest neighbor calculation; otherwise, a quantitative variable whose scale is much larger than the others would dominate the Euclidean distance calculation used in the KNN model.

However, we also need to include the sex variable as a feature in the KNN model. Categorical variables pose a challenge in the KNN context because they need to be mapped to a numerical equivalent. In the case of this data set, two facts are key:

  1. the original data set gathered penguins by breeding pairs, so the numbers of males and females are nearly balanced.
  2. the variable is binary, so two equal and opposite values can accomplish the goal of normalization.

For these reasons, we do not normalize sex but use the mapping function \(f\) defined as:

\[\begin{equation} f(\text{sex}) = \begin{cases} 1 & \text{sex} = \text{male} \\ -1 & \text{sex} = \text{female} \end{cases} \end{equation} \] Because the data set is gender balanced, \(E[f(\text{sex})] \approx 0\) and \(\text{Stdev}(f(\text{sex})) \approx 1\). Hence, \(f(\text{sex})\) is nearly normalized. With these transformations, the KNN dataset is prepared for model evaluation. We illustrate the first few rows of the training data set below. Note that the sampled row indices appear in the leftmost column of the resulting dataframe.
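As a minimal sketch of this preparation step (mirroring the appendix code, with base ifelse in place of data.table::fifelse), the four quantitative columns are standardized with scale and sex is mapped to \(\pm 1\):

# Standardize the quantitative features; map sex to +1 (male) / -1 (female).
# 'pc' is the cleansed penguin data frame built earlier.
standardized = cbind( pc[, 1],                                   # species (response)
                      scale( pc[, 2:5] ),                        # four quantitative features
                      sex = ifelse( pc$sex == "male", 1, -1) )   # +1 / -1 encoding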

Normalized Training Data - 1st Few Rows
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
1 -0.89 0.78 -1.42 -0.57 1
3 -0.68 0.42 -0.43 -1.19 -1
4 -1.33 1.08 -0.57 -0.94 -1
7 -0.88 1.24 -0.43 0.58 1
8 -0.53 0.22 -1.35 -1.25 -1
9 -0.99 2.05 -0.71 -0.51 1

1.2 Cross Validation and Parameter Tuning with KNN

We will use caret to test a range of possible values of \(k\) and select the optimal \(k\) by minimizing the cross-validation error. We performed the pre-processing previously using scale, so there is no need to invoke further data transformations inside caret's pre-processing options.

trainControl allows us to define the training parameters.

tuneLength sets how many candidate values of \(k\) are evaluated for the model.

We choose a 5-fold cross validation method to ensure the test set is sufficiently large.
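A condensed sketch of the caret call follows (essentially the code in the appendix; the normalized data frame is assumed to be named standardized):

set.seed(10)
model = train( species ~ . , data = standardized, method = "knn",
               trControl = trainControl(method = "cv", number = 5),   # 5-fold CV
               tuneLength = 10 )                                      # 10 candidate k values

plot(model)       # average cross-validation accuracy at each k
model$bestTune    # the k that maximizes cross-validated accuracy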

Using the model plot above, which shows the average cross-validation accuracy at each value of \(k\) used in the KNN model, we see that the best kappa and accuracy occur at \(k=5\) nearest neighbors. This is evident in the cross-validation model output displayed below.

## k-Nearest Neighbors 
## 
## 333 samples
##   5 predictor
##   3 classes: 'Adelie', 'Chinstrap', 'Gentoo' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 267, 266, 268, 265, 266 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9940737  0.9906842
##    7  0.9880556  0.9811323
##    9  0.9881914  0.9812667
##   11  0.9881914  0.9812667
##   13  0.9851145  0.9763978
##   15  0.9851145  0.9763978
##   17  0.9851145  0.9763978
##   19  0.9851145  0.9763978
##   21  0.9851145  0.9763978
##   23  0.9851145  0.9763978
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

1.3 KNN Model Output and Performance

To assess model performance and investigate the suitability of our choice of \(k=5\), we consider two approaches. First, we examine the statistical output of the confusion matrix; the accuracy and kappa on held-out data should remain consistent with the cross-validation results. Second, we examine the smoothness of the decision boundary by plotting it for \(k=1\) and \(k=5\). If the optimal \(k\) shows a relatively smooth boundary, this is supporting evidence that the model parameters are reasonable.

First, we construct the confusion matrix for one specific train-test split of the KNN model, using the same proportions as the cross-validation study. We specifically use the caret interface rather than the simpler class interface to ensure consistency across the models in later sections.
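A condensed sketch of this step (taken from the code appendix): the model is refit at \(k=5\) on the training partition and scored on the held-out test set.

x2 = train( species ~ . , data = standardized[training.individuals, ],
            method = "knn",
            trControl = trainControl(method = "none"),   # no resampling; fixed k
            tuneGrid = data.frame(k = 5) )

pred2 = predict(x2, newdata = test.X)
confusionMatrix(pred2, test.Y)    # confusion matrix and summary statistics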

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        29         1      0
##   Chinstrap      0        12      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9172, 0.9996)
##     No Information Rate : 0.4462          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9757          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           0.9231        1.0000
## Specificity                 0.9722           1.0000        1.0000
## Pos Pred Value              0.9667           1.0000        1.0000
## Neg Pred Value              1.0000           0.9811        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4462           0.1846        0.3538
## Detection Prevalence        0.4615           0.1846        0.3538
## Balanced Accuracy           0.9861           0.9615        1.0000

We see in the confusion matrix output above that accuracy is 98.46% and kappa is 97.6%. These are consistent with the cross-validation results. We conclude that KNN is a very effective classifier for penguin species. Only 1 penguin in 65 was misclassified: a Chinstrap was predicted to be an Adelie penguin.

Next, we visualize the decision boundary of our KNN model to check model suitability. Custom code needs to be written, as we are not aware of an R package that currently provides this plot type.

Adapting the R recipe for KNN decision boundary plots found in the stackoverflow page here, we see below that the decision boundary for the \(k=5\) nearest neighbor algorithm is reasonably smooth and not overfitted to the data.

However, the same KNN decision boundary for \(k=1\) shows a much more twisted curve. The border between the Chinstrap and Adelie regions in the upper middle portion of the plot shows a zig-zag pattern characteristic of overfitting. So we reject the choice of \(k=1\) even though both the \(k=5\) and \(k=1\) KNN models have identical confusion matrices. (The \(k=1\) confusion matrix is omitted for brevity.) This means model selection requires considering factors not captured by the confusion matrix.
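For reference, a condensed sketch of the boundary-plot construction used for both figures (the full version is in the code appendix): the fitted model predicts a class at every point of a fine grid in the two bill dimensions, with the remaining standardized features held at fixed values, and the contour between predicted classes traces the decision boundary.

pl = seq(min(standardized$bill_length_mm), max(standardized$bill_length_mm), by = 0.05)
pw = seq(min(standardized$bill_depth_mm),  max(standardized$bill_depth_mm),  by = 0.05)
lgrid = expand.grid(bill_length_mm = pl, bill_depth_mm = pw,
                    flipper_length_mm = 0, body_mass_g = 0, sex = 1)

lgrid$pred = predict(x2, newdata = lgrid)    # predicted class at every grid point

ggplot(lgrid, aes(x = bill_length_mm, y = bill_depth_mm)) +
  stat_contour(aes(z = as.numeric(pred)), bins = 2) +               # decision boundary
  geom_point(aes(color = pred), size = 1, shape = 19, alpha = 0.2)  # shaded class regions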

2 Loan Approvals: Exploratory Data Analysis

2.1 Assessing the Raw Data

Our EDA shows the raw loan approvals dataset has varying levels of quality.

Complete variables include: Loan_ID, Education, Property_Area, Loan_Status, ApplicantIncome, CoapplicantIncome.

Mostly complete variables include: Married, Dependents, Loan_Amount_Term, Gender.

Variables with significant gaps include: Self_Employed, LoanAmount, Credit_History. Of these, Credit_History is the most problematic, and it influences the Loan Approval decision.

\(\color{red}{\text{@Randy: Add your detailed discussion here...}}\)

2.2 Defining a Common Dataset for Loan Approvals

In this section, we construct a common loan approval data set which will be used for all of the model studies that follow. There are several steps that we take:

  1. We retain only the records with no missing values in any relevant column and then exclude unhelpful variables. This reduces the number of observations from 614 to 480, which is 78.2% of the original data set. While this is a significant reduction in the number of observations, the remaining data are still sufficient for fitting models.

Further, Credit_History in particular is relatively commonly missing, but it is a challenging feature to impute: it is binary, and subject-area knowledge suggests it is not directly determinable from the other data. Income is not directly linked to credit history (Akin, 2020), and one can have high income and poor credit.

  2. Next, we omit one irrelevant column, Loan_ID, an identifier which is exogenous to the loan applicant and is assumed to relate solely to computer system processes.

  3. Then we add a synthetic variable Total_Income, which is the sum of ApplicantIncome and CoapplicantIncome.

  4. We scale all quantitative variables to mean 0 and variance 1 with the scale function and retain both the nominal and scaled versions. The scaled variables are named like their nominal counterparts but prefixed with a lowercase s. For example, sApplicantIncome has mean 0 and variance 1, whereas ApplicantIncome holds the nominal monthly income.

  5. Further, we convert the character columns from text to factor. This is important for Loan_Status but also relevant for others such as Credit_History, Property_Area, Gender, Married, Education, and Self_Employed. In addition, we convert Dependents from text to an ordered factor because the number of dependents is ordered.

Lastly, the common data set is exported to a flat file; a sketch of the preparation pipeline appears below. Each group member performs their Loan Approval model study by loading the common data set and making subsequent selections or transformations while retaining the same observations.
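A minimal sketch of the pipeline (condensed from the code appendix; the raw file is assumed to be Loan_approval.csv in the working directory):

cla <- read_csv("Loan_approval.csv") %>%
  dplyr::select(-Loan_ID) %>%                                      # step 2: drop the identifier
  mutate(Total_Income = ApplicantIncome + CoapplicantIncome) %>%   # step 3: synthetic income
  na.omit() %>%                                                    # step 1: complete records only
  mutate(sApplicantIncome   = scale(ApplicantIncome),              # step 4: scaled copies
         sCoapplicantIncome = scale(CoapplicantIncome),
         sTotal_Income      = scale(Total_Income),
         sLoanAmount        = scale(LoanAmount))

write_csv(cla, "cla.csv")   # export; factor conversions (step 5) are applied after re-import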

We see that some variables have missing values, including a substantial number (43) missing for Credit_History. There is a smattering of other missing features, but no observation is missing more than a few features.

After exporting the common loan approval data, we load it back into the R session and transform the character string columns to factors. This allows some R models to better handle the response and other categorical variables.

When modeling the common data set, each member may omit one of the income variables to avoid multicollinearity problems involving ApplicantIncome, CoapplicantIncome, and Total_Income (or their scaled equivalents).

\(\color{red}{\text{Please remember to drop one or more of the 3 Income variables prior to running your models.}}\)

2.3 Conditional Approval Rates by Categorical Variable

\(\color{red}{\text{Let's add this section (or its equivalent) to the discussion since it tells a useful story.}}\)

\(\color{red}{\text{I wrote some code to do the conditional approval rates.}}\)

One simple type of exploratory data analysis is to examine the conditional approval rates by values of each categorical variable \(C\).
For example, there are three subcategories (Urban, Semiurban, Rural) of the variable Property_Area.
If we find significant differences in the conditional loan approval rates between Urban and Rural applications, this suggests that Property_Area may help to predict approvals.
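For illustration, a minimal dplyr sketch of one such table (the report itself uses the display_approval_rate helper shown in the code appendix):

# Counts, share of all applications, and within-category approval rate for Property_Area.
cla %>%
  count(Property_Area, Loan_Status) %>%
  group_by(Property_Area) %>%
  mutate(pct_of_all    = 100 * n / nrow(cla),
         approval_rate = 100 * n[Loan_Status == "Y"] / sum(n)) %>%
  ungroup()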

The conditional approval rates for each categorical variable are displayed below.

Loan Approved vs. Property_Area by Percent
Rural Semiurban Urban
N (%) 11.2 8.8 10.8
Y (%) 17.7 31.0 20.4
Property_Area Column Total %: 29.0 39.8 31.2
Loan Approval Rate by Column% 61.2 78.0 65.3
Total Applications: 139.0 191.0 150.0

Loan Approved vs. Credit_History by Percent
0 1
N (%) 13.1 17.7
Y (%) 1.5 67.7
Credit_History Column Total %: 14.6 85.4
Loan Approval Rate by Column% 10.0 79.3
Total Applications: 70.0 410.0

Loan Approved vs. Loan_Amount_Term by Percent
36 60 84 120 180 240 300 360 480
N (%) 0.4 0.0 0.2 0.0 2.5 0.2 1.0 24.8 1.7
Y (%) 0.0 0.4 0.4 0.6 5.0 0.2 0.8 60.8 0.8
Loan_Amount_Term Column Total %: 0.4 0.4 0.6 0.6 7.5 0.4 1.9 85.6 2.5
Loan Approval Rate by Column% 0.0 100.0 66.7 100.0 66.7 50.0 44.4 71.0 33.3
Total Applications: 2.0 2.0 3.0 3.0 36.0 2.0 9.0 411.0 12.0

Loan Approved vs. Self_Employed by Percent
No Yes
N (%) 26.0 4.8
Y (%) 60.2 9.0
Self_Employed Column Total %: 86.2 13.8
Loan Approval Rate by Column% 69.8 65.2
Total Applications: 414.0 66.0

Loan Approved vs. Education by Percent
Graduate Not Graduate
N (%) 23.3 7.5
Y (%) 56.5 12.7
Education Column Total %: 79.8 20.2
Loan Approval Rate by Column% 70.8 62.9
Total Applications: 383.0 97.0

Loan Approved vs. Dependents by Percent
0 1 2 3+
N (%) 18.1 5.8 4.2 2.7
Y (%) 39.0 10.8 13.5 5.8
Dependents Column Total %: 57.1 16.7 17.7 8.5
Loan Approval Rate by Column% 68.2 65.0 76.5 68.3
Total Applications: 274.0 80.0 85.0 41.0

Loan Approved vs. Married by Percent
No Yes
N (%) 13.3 17.5
Y (%) 21.9 47.3
Married Column Total %: 35.2 64.8
Loan Approval Rate by Column% 62.1 73.0
Total Applications: 169.0 311.0

Loan Approved vs. Gender by Percent
Female Male
N (%) 6.7 24.2
Y (%) 11.2 57.9
Gender Column Total %: 17.9 82.1
Loan Approval Rate by Column% 62.8 70.6
Total Applications: 86.0 394.0

We observe large differences in approval rates in the following categorical variables:

  1. Credit_History: Those with credit history have a 79.3% approval rate vs. 10.0% for those without.

  2. Property_Area: The Rural approval rate (61.2%) is lower than Semiurban (78.0%).

  3. Married: The unmarried applicant approval rate (62.1%) is lower than the married rate (73.0%).

  4. Education: The Not Graduate approval rate (62.9%) is lower than Graduate (70.8%).

  5. Gender: The female applicant approval rate (62.8%) is lower than the male rate (70.6%).

These differences suggest which categorical variables are likely to be most useful for prediction.

3 Loan Approvals: Decision Tree Models

\(\color{red}{\text{Please set.seed() in your section to ensure your R code works as expected. }}\)

\(\color{red}{\text{Suggest a 80-20 split for training and test data to allow cross model comparison. }}\)

3.1 Subsection A

3.2 Subsection B

3.3 Subsection C

4 Loan Approvals: Random Forest Models

4.1 Feature Selection

We first run a random forest on all variables (excluding the index variable Loan_ID and the constituent parts of scaled or combined variables) to use as a variable-selection sieve. We use VSURF because it was the best-performing method in Speiser et al. (2019).

VSURF, for the uninitiated, is a multi-step variable selection procedure that first ranks variables by importance and excludes features with null importance. It then generates a selection of features for interpretation, by adding variables sequentially and growing random forests while minimizing OOB error, and a selection for prediction, which also adds variables sequentially but, instead of simply minimizing OOB error, penalizes variables that only add noise.
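A minimal sketch of the VSURF call (mirroring the appendix; use_data is the common data set with the redundant income columns removed):

set.seed(10)
x <- use_data %>% dplyr::select(-Loan_Status)    # candidate predictors
y <- use_data$Loan_Status                        # response factor

surfedModel <- VSURF::VSURF(x, y, parallel = FALSE)

colnames(x)[surfedModel$varselect.thres]     # ranked variables surviving the threshold step
colnames(x)[surfedModel$varselect.interp]    # selection for interpretation
colnames(x)[surfedModel$varselect.pred]      # selection for prediction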

VSURF selects a single-variable “model” for predictive use: Credit_History alone. This suggests that a very parsimonious model might indeed be sufficient.

Using this single variable as a “model,” we obtain an accuracy of 83.1% and a kappa of 0.486. Cohen (1960) would regard that level of agreement as unlikely to be caused by chance, which matters because a model that simply flagged every loan as approved would have an accuracy of 0.737 but, of course, a kappa of 0.
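A minimal sketch of how this single-variable “model” is scored (mirroring the appendix code): predict approval exactly when Credit_History equals 1 and compare against the held-out outcomes with yardstick.

loan_testing <- na.omit(testing(data_split))
loan_testing$factorCreditHistory <- factor(loan_testing$Credit_History == 1)   # the "prediction"
loan_testing$factorLoanStatus    <- factor(loan_testing$Loan_Status == "Y")    # the truth

loan_testing %>% metrics(truth = factorLoanStatus, estimate = factorCreditHistory)   # accuracy and kappa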

4.1.1 Model building

If we ignore VSURF’s suggestion, we can sweep through the ranked recommended features (which are reported in order from most promising to least).
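A simplified sketch of that sweep follows (the full recipe-based loop is in the code appendix); here ranger is fit directly on the factor columns, adding one ranked variable at a time and recording test-set accuracy and kappa.

preds <- use_data %>% dplyr::select(-Loan_Status)
sweep <- tibble(Variables = integer(), Accuracy = numeric(), Kappa = numeric())

for (nvar in seq_along(surfedModel$varselect.thres)) {
  vars <- colnames(preds)[surfedModel$varselect.thres[1:nvar]]     # top nvar ranked features
  rf   <- rand_forest(mode = "classification") %>%
    set_engine("ranger") %>%
    fit(Loan_Status ~ ., data = train_data[, c(vars, "Loan_Status")])
  est  <- predict(rf, new_data = test_data) %>%
    bind_cols(test_data) %>%
    metrics(truth = Loan_Status, estimate = .pred_class)
  sweep <- sweep %>%
    add_row(Variables = nvar,
            Accuracy  = est$.estimate[est$.metric == "accuracy"],
            Kappa     = est$.estimate[est$.metric == "kap"])
}
sweep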

Perhaps unsurprisingly (as this is broadly similar to what VSURF does internally), we do not seem to receive any benefit from adding additional variable candidates; indeed, the model actually performs somewhat worse for many options.

Still, if we use the features for interpretation that VSURF has selected, together with a wide parameter grid for the tunable parameters mtry and the number of trees, we can attempt to tune a model.
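A condensed sketch of that tuning setup (the full version is in the code appendix); loans_rf_rec and our_data are the recipe and data restricted to the VSURF-selected features.

rf_spec <- rand_forest(mtry = tune(), trees = tune(), mode = "classification") %>%
  set_engine("ranger", importance = "impurity")

rf_wf <- workflow() %>% add_recipe(loans_rf_rec) %>% add_model(rf_spec)

folds      <- vfold_cv(our_data)                                            # 10-fold CV by default
loans_grid <- grid_regular(trees(), finalize(mtry(), use_data), levels = 5)

tree_res <- rf_wf %>%
  tune_grid(resamples = folds, grid = loans_grid, metrics = metric_set(kap, accuracy))

tree_res %>% select_best("kap")    # best mtry / trees combination by kappa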

mtry trees .config
3 1500 Preprocessor1_Model09

.metric .estimator .estimate
accuracy binary 0.8105263
kap binary 0.4554140

Looking at breakdowns of performance, we see the best performance by kappa at mtry = 3 and trees = 1500. This does not appear to be a particularly effective tuning, and many other settings perform similarly. Unfortunately, this does not reflect much in the way of actual model improvement.

Our full model has the accuracy and kappa presented above. Unfortunately, both are lower than for our single-variable “model” based on Credit_History alone.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  N  Y
##          N 12 13
##          Y  5 65
##                                           
##                Accuracy : 0.8105          
##                  95% CI : (0.7172, 0.8837)
##     No Information Rate : 0.8211          
##     P-Value [Acc > NIR] : 0.66467         
##                                           
##                   Kappa : 0.4554          
##                                           
##  Mcnemar's Test P-Value : 0.09896         
##                                           
##             Sensitivity : 0.7059          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.4800          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.1789          
##          Detection Rate : 0.1263          
##    Detection Prevalence : 0.2632          
##       Balanced Accuracy : 0.7696          
##                                           
##        'Positive' Class : N               
## 

We examine the confusion matrix and see the raw accuracy is 0.8105, in line with the tuning results above.

5 Loan Approvals: Gradient Boosting

5.1 Subsection A

\(\color{red}{\text{Please set.seed() in your section to ensure your R code works as expected. }}\)

\(\color{red}{\text{Maybe use a 80-20 split for training and test data to allow cross model comparison. }}\)

5.2 Subsection B

5.3 Subsection C

6 Loan Approvals: Model Performance and Assessment

In this section, we compare the performance of the decision tree, random forest and gradient boosting models on the Loan Approvals dataset.

6.1 Model Performance

6.2 Using Prediction in Loans Approvals

We can use Decision Trees, Random Forest, or Gradient Boosting to build a model. The models' final performance is shown below:

Model Accuracy Kappa
Decision Trees TBD TBD
Random Forest 0.811 0.455
Gradient boosting TBD TBD

Based on this we select the TBD Model as having the best balance of performance and ease of understanding.

6.2.1 Recommendations

The loan decision is strongly driven by Credit_History and combined income. It seems likely that we could extract more information from the loan amount and term if we knew the interest rates offered or requested. This could lead to a new composite value, yearly payment, which would likely have a relationship with the applicant's income.

It would be very interesting to break down the Credit_History feature into components, or at least to have a more continuous variable (say, points above or below the FICO score cutoff). Our models showed it to have an almost overwhelmingly strong impact on the output model, and presumably it is based on its own more continuous data that could be further analyzed. It seems likely, for example, that some applicants who fall short of the credit history requirement are closer to approval than other, mostly similar applicants who are also denied. However, the data provided does not allow us to state that.

7 Appendices

7.1 References

\(\color{red}{\text{Put your external references here.}}\)

Akin, J. (2020, March 5). Does income affect credit scores? Experian. https://www.experian.com/blogs/ask-experian/does-income-affect-credit-scores/.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.

Genuer, R., Poggi, J. M., & Tuleau-Malot, C. (2015). VSURF: An R package for variable selection using random forests. The R Journal, 7(2), 19-33.

Speiser, J. L., Miller, M. E., Tooze, J., & Ip, E. (2019). A comparison of random forest variable selection methods for classification prediction modeling. Expert Systems with Applications, 134, 93-101.

UC Business Analytics R Programming Guide - Random Forests.

7.2 Code

We summarize all the R code used in this project in this appendix for ease of reading.

# Your libraries go here

library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)

library(caret)    # Model Framework
library(skimr)    # Used for EDA
library(klaR)     # Implemented KNN and Naive Bayes models, etc
library(class)    # used for KNN classifier

# PLEASE ADD YOUR R LIBRARIES BELOW
# ------------------------------
library(tidymodels)
library(VSURF) 
library(dplyr)
library(mice)

library(vip)


# ---------------------------------
knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE)

library(palmerpenguins)

# The final dataset excludes observations with missing sex and drops the island and year variables.
pc = penguins %>% filter( is.na(sex) == FALSE) %>% dplyr::select( -one_of("island", "year") )

head(pc) %>% kable(caption="Scrubbed penguin data set") %>% kable_styling(bootstrap_options = c("striped", "hover"))


set.seed(10)

#  Construct the standardized dataset with the response variable, the 4 quantitative variables normalized 
#  with the scale function and the sex variable normalized with the 1,-1 assignment explained previously.
#
standardized = cbind( pc[,1], scale( pc[, 2:5] ) , sex = data.table::fifelse(pc$sex == 'male' , 1 , -1) )

# Define an 80-20 split of the training and test data.
# -------------------------------------------------------------------------------------
training.individuals = createDataPartition(standardized$species, p= 0.8 , list = FALSE)

# X variables include bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,  sex (converted to -1,1)
# Y variable is just the species class
# train data is the standardized subset of the full data set
# test data is the complement of the training data set.
# -----------------------------------------------------------------
train.X  =  standardized[ training.individuals,  2:6]
test.X   =  standardized[-training.individuals,  2:6]
train.Y  =  standardized[ training.individuals,    1]
test.Y   =  standardized[-training.individuals,    1]

head(train.X ) %>% kable(caption="Normalized Training Data - 1st Few Rows", digits = 2 ) %>% 
  kable_styling(bootstrap_options = c("striped", "hover") )
model = train( species ~. , data = standardized , method = "knn" ,
         trControl = trainControl("cv", number = 5 ), tuneLength = 10 )

# Plot the model output vs. varying k values.
plot(model)
model

# The best tuning parameter k that 
# optimized model accuracy is:
#model$bestTune

# Simpler interface using the knn function within class.
knn.pred5 = knn(train.X, test.X, train.Y , k = 5, prob = FALSE) 
cm5 = confusionMatrix(knn.pred5, test.Y) 


# This model approach uses tuneGrid with a single value to calculate kNN 
# only at a single number of nearest neighbors.
# ------------------------------------------------------------------------------
x2 = train( species ~ . , data = standardized[training.individuals,] , 
            method = "knn", 
            trControl = trainControl(method="none"), 
            tuneGrid=data.frame(k=5) )

pred2 = predict(x2, newdata=test.X)

(cmknn2=confusionMatrix(pred2, test.Y))

# We visualize a projection of the decision boundary flattened to 2 dimensions to illustrate
# the smoothness of the KNN boundary at different values of k.
# First, choose the variables, then build a sampling grid.
# ------------------------------------------------------------------------
pl = seq(min(standardized$bill_length_mm), max(standardized$bill_length_mm), by = 0.05)
pw = seq(min(standardized$bill_depth_mm), max(standardized$bill_depth_mm), by = 0.05 )

# The sampling grid needs explicit values to be plugged in for the variables
# which are not the two projected variables.  Since variables are standardized to
# mean 0 and variance 1, we choose zero for flipper_length and body_mass because they are continuous 
# and the mean and mode are close.
# However, for sex, we choose male because the average sex = 0 is not the mode.
# In practice, sex = 0 gives weird decision boundaries in the projected plot.
# ------------------------------------------------------------------------------
lgrid = expand.grid(bill_length_mm=pl , bill_depth_mm = pw , flipper_length_mm = 0, body_mass_g = 0 , sex = 1 )

knnPredGrid = predict(x2, newdata=lgrid )

num_knnPredGrid = as.numeric(knnPredGrid)

num_pred2 = as.numeric(pred2)

test = cbind(test.X, test.Y)
test$Pred = num_pred2
ggplot(data=lgrid) + stat_contour(aes(x=bill_length_mm, y= bill_depth_mm, z = num_knnPredGrid), bins = 2 ) +
  geom_point( aes(x=bill_length_mm, y= bill_depth_mm, color = knnPredGrid),  size = 1, shape = 19, alpha = 0.2) +
  geom_point( data= test , aes(x=bill_length_mm, y=bill_depth_mm, color=pred2 ) , size = 4, alpha = 0.8, shape =24 ) +
  ggtitle("KNN Decision Boundary for Penguins Data with k=5 neighbor")


# This model approach uses tuneGrid with a single value to calculate kNN 
# only at a single number of nearest neighbors.
# ------------------------------------------------------------------------------
x3 = train( species ~ . , data = standardized[training.individuals,] , 
            method = "knn", 
            trControl = trainControl(method="none"), 
            tuneGrid=data.frame(k=1) )  # Now we use a k=1 nearest neighbor algorithm which is overfitted.

pred3 = predict(x3, newdata=test.X)


knnPredGrid3 = predict(x3, newdata=lgrid )

num_knnPredGrid3 = as.numeric(knnPredGrid3)

num_pred3 = as.numeric(pred3)

test = cbind(test.X, test.Y)
test$Pred = num_pred3

ggplot(data=lgrid) + stat_contour(aes(x=bill_length_mm, y= bill_depth_mm, z = num_knnPredGrid3), bins = 2 ) +
  geom_point( aes(x=bill_length_mm, y= bill_depth_mm, color = knnPredGrid3),  size = 1, shape = 19, alpha = 0.2) +
  geom_point( data= test , aes(x=bill_length_mm, y=bill_depth_mm, color=pred3 ) , size = 4, alpha = 0.8, shape =24 ) +
  ggtitle("KNN Decision Boundary for Penguins Data with k=1 neighbor")
# This code chunk should only be run to build the common data set.  
# Any model builder should rely on importing the common data set only.
# Thus, set eval=TRUE when generating a new version of the common data set.
# Otherwise, set eval=FALSE to skip this step.
# ------------------------------------------------------------------------

# This code block assumes the raw Loan Approval data file in csv form has been placed in the same folder as this script.
# ----------------------------------------------------------------------------------------
cla <- read_csv("Loan_approval.csv") %>% dplyr::select(-Loan_ID)  # drop the row identifier.

# Add a column for Total Income of Applicant and Co-Applicant
# ---------------------------------------------------------------------------
cla$Total_Income = cla$ApplicantIncome + cla$CoapplicantIncome


invisible(mice::md.pattern(cla %>% dplyr::select(-Total_Income), rotate.names = T))


# We build a dataset in which all observations have fully populated values.
# ---------------------------------------------------------------------------
cla = na.omit(cla)

# Add mean zero and variance 1 versions of quantitative variables.
# ----------------------------------------------------------
cla %>% mutate( sApplicantIncome = scale(ApplicantIncome), 
                sCoapplicantIncome = scale(CoapplicantIncome),
                sTotal_Income = scale(Total_Income) ,
                sLoanAmount   = scale( LoanAmount)
                ) -> cla

head(cla)

write_csv(cla, "cla.csv") # Write the scaled common loan approvals data set to local disk.  

cla = read_csv("cla.csv")

# Transform the character data columns into factors
# --------------------------------------------------------
cla$Loan_Amount_Term = factor(cla$Loan_Amount_Term)
cla$Loan_Status = factor(cla$Loan_Status)  # Convert the response to factor
cla$Credit_History = factor(cla$Credit_History)
cla$Property_Area = factor( cla$Property_Area)
cla$Gender = factor( cla$Gender)
cla$Married = factor(cla$Married)
cla$Dependents = ordered(cla$Dependents, levels = c("0" , "1", "2" , "3+") )
cla$Education = factor(cla$Education)
cla$Self_Employed = factor( cla$Self_Employed)

head(cla)

#  Display the approval rate statistics for each qualitative field
#  in absolute count and percentage terms by category.
# 
display_approval_rate <- function(la , field )
{
    a = la[, "Loan_Status"]
    b = la[, field]
    
    c = table(cbind(a,b))  # Count table
    
    pcts = as.data.frame.matrix(c) /nrow(la) * 100   # Percentage equivalent
    col_sums = colSums(pcts)  # Shows the results by the field

    app_rate = pcts[2,]/col_sums   * 100   # Approval rate assumes row 2 is the "Yes" row.

    yy = rbind( pcts, col_sums, app_rate, colSums(c)) 
    rownames(yy)[1]= paste0( rownames(yy)[1], " (%)")
    rownames(yy)[2]= paste0( rownames(yy)[2], " (%)")
    rownames(yy)[3]= paste0( field, " Column Total %:")
    rownames(yy)[4]="Loan Approval Rate by Column%"
    rownames(yy)[5]="Total Applications:"
    print(yy %>% kable(digits = 1, caption = paste0('Loan Approved vs. ', field , ' by Percent') ) %>%
             kable_styling(bootstrap_options = c("striped", "hover") )  %>%
             row_spec(4, background = "skyblue", bold = TRUE) )
}

cat_vars = c("Property_Area", "Credit_History", "Loan_Amount_Term", "Self_Employed", "Education", "Dependents", "Married", "Gender")

for( field in cat_vars)
{
   display_approval_rate( cla , field)  
   cat("\n")
}

# -------------------------------------------------------------------------
# Random Forest section: 80-20 train/test split, then VSURF variable selection
# -------------------------------------------------------------------------
set.seed(10)
data_split <- initial_split(cla, prop = .8)
train_data <- training(data_split)
test_data  <- testing(data_split)
use_data <- train_data %>%dplyr::select( -ApplicantIncome, -CoapplicantIncome, -Total_Income, -sApplicantIncome, -sCoapplicantIncome, -LoanAmount)
surfedModel <- invisible(VSURF::VSURF(use_data%>%dplyr::select(-Loan_Status), use_data$Loan_Status, parallel=F))
vsurfStages <- tibble('Stage' = "Prediction" , 'Fields' =colnames(use_data%>%dplyr::select(-Loan_Status))[surfedModel$varselect.pred])
vsurfStages <-vsurfStages %>% add_row('Stage' = 'interpretation' , 'Fields'=paste(colnames(use_data%>%dplyr::select(-Loan_Status))[surfedModel$varselect.interp], collapse=", "))
vsurfStages <-vsurfStages %>% add_row('Stage' = 'Ranked Importance' , 'Fields'=paste(colnames(use_data%>%dplyr::select(-Loan_Status))[surfedModel$varselect.thres], collapse = ", "))
vsurfStages

# Baseline: score Credit_History alone as the "prediction" on the held-out test set.
loans_testing <- na.omit(testing(data_split))
loans_testing["factorCreditHistory"] <- as.factor(as.logical((loans_testing)$Credit_History==1))
loans_testing["factorLoanStatus"] <- as.factor((loans_testing)$Loan_Status=="Y")
loans_testing %>% metrics(truth = factorLoanStatus, estimate = factorCreditHistory)
loan_testing <- na.omit(testing(data_split))
loan_testing["factorCreditHistory"] <- as.factor(as.logical((loan_testing)$Credit_History==1))
loan_testing["factorLoanStatus"] <- as.factor((loan_testing)$Loan_Status=="Y")
actual_result <- loan_testing %>% metrics(truth = factorLoanStatus, estimate = factorCreditHistory)
ourAccuracy <- tibble('Variables' = 1 , 'Kappa' =actual_result$.estimate[2], 'Accuracy' = actual_result$.estimate[1])
ourAccuracy <- ourAccuracy[-c(1),]
#loan_testing <- testing(data_split) #reset with nas added back in
# Sweep through VSURF's ranked features, refitting a random forest as each
# additional variable is included and recording test-set accuracy and kappa.
for(var in 1:length(surfedModel$varselect.thres)){
  loans_rf_rec <-  recipe(Loan_Status~., data = train_data)  %>%
    step_rm(everything(),-all_of(colnames(use_data[(surfedModel$varselect.thres)[1:var]])), -Loan_Status)  %>%
    #step_knnimpute(all_predictors()) %>%
    step_dummy(all_nominal(), -all_outcomes()) %>% 
    prep()
  loan_training<-juice(loans_rf_rec)
  loan_ranger <- rand_forest( mode = "classification") %>%   set_engine("ranger") %>%     fit(Loan_Status ~ ., data = loan_training)
  loan_testing <- loans_rf_rec  %>% bake(testing(data_split)) 
  actual_result <- loan_ranger %>% predict(loan_testing) %>%     bind_cols(loan_testing) %>%     metrics(truth = Loan_Status, estimate = .pred_class)
  ourAccuracy <- ourAccuracy %>% add_row('Variables' = var , 'Kappa' =actual_result$.estimate[2], 'Accuracy' = actual_result$.estimate[1])
}
ourAccuracy %>% ggplot(aes(x=Variables,y=Accuracy)) + ggtitle("Accuracy as variables are added") +geom_point(color="red") + ylim(0,1) + scale_color_viridis_d(option = "plasma", begin = .9, end=0)
ourAccuracy %>% ggplot(aes(x=Variables,y=Kappa)) +ggtitle("Kappa as Variables are added") +geom_point(color="blue") + ylim(0,.6) + scale_color_viridis_d(option = "plasma", begin = .9, end=0)
ourAccuracy
# Tune mtry and trees over a regular grid with cross-validation, selecting the best by kappa.
our_data<-use_data[,colnames(use_data[(surfedModel$varselect.thres)])]
our_data$Loan_Status<-use_data$Loan_Status
loans_rf_rec <-  recipe(Loan_Status~., data = our_data)  %>%
  prep()
loans_grid<-grid_regular(trees(), finalize(mtry(), use_data), levels=5)
loan_training<-juice(loans_rf_rec)
loan_ranger <- workflow()%>% add_recipe(loans_rf_rec) %>% add_model(rand_forest(mtry=tune::tune(), trees = tune::tune(), mode = "classification") %>% set_engine("ranger",importance = "impurity"))
folds <- vfold_cv(our_data)
class_metrics <- metric_set( kap, accuracy)
tree_res  <- loan_ranger %>% tune_grid(resamples = folds, grid=loans_grid, metrics=class_metrics)
loans_best <-  tree_res %>% select_best("kap")
final_loans_wf <-loan_ranger %>% finalize_workflow(loans_best)
final_loans_fit <-final_loans_wf %>% fit(data = our_data) 
# Score the finalized workflow on the held-out test data.
testing_data <- testing(data_split)
our_data<-testing_data[,colnames(use_data[(surfedModel$varselect.thres)])]
our_data$Loan_Status<-testing_data$Loan_Status
loan_testing <- loans_rf_rec %>% bake(our_data)
actual_result<-final_loans_fit %>% predict(loan_testing) %>%     bind_cols(loan_testing) %>%     metrics(truth = Loan_Status, estimate = .pred_class)
tree_res %>%
    collect_metrics() %>% mutate(trees = factor(trees)) %>% ggplot(aes(mtry, mean, color=trees)) +
    geom_line(size = 1.5, alpha = 0.6) +
    geom_point(size = 2) +
    facet_wrap(~ .metric, scales = "free", nrow = 2) +
    scale_color_viridis_d(option = "plasma", begin = .9, end=0)
final_loans_fit %>%      pull_workflow_fit() %>%      vip()
kable(loans_best)
kable(actual_result)
rfbound<-final_loans_fit %>% predict(loan_testing) %>%     bind_cols(loan_testing)
caret::confusionMatrix(rfbound$Loan_Status, rfbound$.pred_class )