Introduction

This assignment analyzes two problems. First, we apply the K-nearest neighbors (KNN) model to the Palmer Penguins dataset to predict species. Second, we apply Random Forest and Gradient Boosting methods to a Loan Approval dataset.

1 Penguins and the KNN algorithm

Following the data cleaning approach taken in prior assignments, we exclude observations where the feature sex is undefined, and we drop the two variables island and year, which do not appear helpful for prediction. The variable selection process follows the one described in a prior assignment.

This still leaves a substantial subset: \(333\) of the \(344\) original observations. The table below shows sample rows of the cleaned dataset used in the subsequent KNN analysis.

Scrubbed penguin data set
species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
Adelie 39.1 18.7 181 3750 male
Adelie 39.5 17.4 186 3800 female
Adelie 40.3 18.0 195 3250 female
Adelie 36.7 19.3 193 3450 female
Adelie 39.3 20.6 190 3650 male
Adelie 38.9 17.8 181 3625 female

The cleaned data comprise 333 observations and 6 columns: five explanatory variables plus the species response.

1.1 Data Transformation for KNN

The next step is to normalize the data so that each quantitative variable has mean zero and standard deviation one. Normalization ensures that each variable has equal influence on the nearest-neighbor calculation; otherwise, a quantitative variable whose scale is much larger than the others would dominate the Euclidean distance calculation used in the KNN model.

We also need to include the sex variable as a feature in the KNN model. Categorical variables pose a challenge in the KNN context because they must be mapped to numerical equivalents. In the case of this data set, two facts are key:

  1. the original study gathered penguins by breeding pairs, so the numbers of males and females are nearly balanced (168 males and 165 females after cleaning).
  2. the variable is binary, so two equal and opposite values can accomplish the goal of normalization.

For these reasons, we do not normalize sex but use the mapping function \(f\) defined as:

\[ f(\text{sex}) = \begin{cases} \phantom{-}1 & \text{if sex} = \text{male} \\ -1 & \text{if sex} = \text{female} \end{cases} \] Because the dataset is balanced by sex, \(E[f(\text{sex})] \approx 0\) and \(\operatorname{SD}(f(\text{sex})) \approx 1\); hence \(f(\text{sex})\) is nearly normalized. With these transformations, the KNN dataset is prepared for model evaluation. We illustrate the first few rows of the training dataset below. Note that the sampled indices are displayed in the leftmost column as row values in the resulting dataframe.

Normalized Training Data - 1st Few Rows
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
1 -0.89 0.78 -1.42 -0.57 1
3 -0.68 0.42 -0.43 -1.19 -1
4 -1.33 1.08 -0.57 -0.94 -1
7 -0.88 1.24 -0.43 0.58 1
8 -0.53 0.22 -1.35 -1.25 -1
9 -0.99 2.05 -0.71 -0.51 1
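
A minimal sketch of the transformation that produces the table above (the full code appears in the appendix): the species column is carried through, scale() standardizes the four measurements, and fifelse from data.table applies the \(\pm 1\) mapping to sex.

library(palmerpenguins)
library(dplyr)

# Drop rows with missing sex; drop the island and year columns.
pc = penguins %>% filter( !is.na(sex) ) %>% dplyr::select( -island, -year )

# Standardize the 4 quantitative variables; map sex to +1 (male) / -1 (female).
standardized = cbind( pc[, 1],               # species (response)
                      scale( pc[, 2:5] ),    # mean 0, sd 1
                      sex = data.table::fifelse(pc$sex == "male", 1, -1) )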

1.2 Cross Validation and Parameter Tuning with KNN

We use caret to evaluate a range of candidate values of \(k\) and select the optimal \(k\) by maximizing the cross-validation accuracy. Because the pre-processing was already done with scale, there is no need to invoke further data transformations through caret's pre-processing arguments.

trainControl allows us to define the resampling parameters.

tuneLength sets how many candidate values of \(k\) to evaluate against the model.

We choose 5-fold cross-validation so that each held-out fold is sufficiently large.
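
A minimal sketch of the tuning call, mirroring the appendix code:

set.seed(10)
# 5-fold CV over 10 candidate values of k; the data were standardized beforehand.
model = train( species ~ . , data = standardized , method = "knn" ,
               trControl = trainControl("cv", number = 5), tuneLength = 10 )
plot(model)   # cross-validation accuracy at each k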

The model plot above shows the average cross-validation accuracy at each value of \(k\): the best accuracy and kappa occur at \(k=5\) nearest neighbors. This is confirmed by the cross-validation model output below.

## k-Nearest Neighbors 
## 
## 333 samples
##   5 predictor
##   3 classes: 'Adelie', 'Chinstrap', 'Gentoo' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 267, 266, 268, 265, 266 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9940737  0.9906842
##    7  0.9880556  0.9811323
##    9  0.9881914  0.9812667
##   11  0.9881914  0.9812667
##   13  0.9851145  0.9763978
##   15  0.9851145  0.9763978
##   17  0.9851145  0.9763978
##   19  0.9851145  0.9763978
##   21  0.9851145  0.9763978
##   23  0.9851145  0.9763978
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

1.3 KNN Model Output and Performance

To assess model performance and the suitability of our choice of \(k=5\), we take two approaches. First, we examine the statistical output of the confusion matrix: the accuracy and kappa should remain as strong as they were during model selection. Second, we examine the smoothness of the decision boundary by plotting it for \(k=1\) and \(k=5\). If the optimal \(k\) shows a relatively smooth boundary, this is supporting evidence that the model parameters are reasonable.

First, we construct the confusion matrix for one specific KNN fit, using a train-test split that holds out the same proportion as the cross-validation study. We use the caret interface rather than the simpler class interface to ensure consistency across the models we build later.
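
A minimal sketch of this evaluation, assuming the 80-20 split created earlier (full code in the appendix):

# Fit KNN at the chosen k = 5 on the training rows only.
x2 = train( species ~ . , data = standardized[training.individuals, ] ,
            method = "knn",
            trControl = trainControl(method = "none"),
            tuneGrid = data.frame(k = 5) )

pred2 = predict(x2, newdata = test.X)   # score the held-out test set
confusionMatrix(pred2, test.Y)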

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        29         1      0
##   Chinstrap      0        12      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9172, 0.9996)
##     No Information Rate : 0.4462          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9757          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           0.9231        1.0000
## Specificity                 0.9722           1.0000        1.0000
## Pos Pred Value              0.9667           1.0000        1.0000
## Neg Pred Value              1.0000           0.9811        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4462           0.1846        0.3538
## Detection Prevalence        0.4615           0.1846        0.3538
## Balanced Accuracy           0.9861           0.9615        1.0000

We see in the confusion matrix output above that accuracy is 98.46% and kappa is 0.976. These are consistent with the cross-validation results. We conclude that KNN is a very effective classifier for penguin species. Only 1 penguin in 65 was misclassified: a Chinstrap was predicted to be an Adelie penguin.

Next, we visualize the decision boundary of our KNN model to check its suitability. Custom code is needed here, as we are not aware of an R package that currently provides this plot type.

Adapting an R recipe for KNN decision-boundary plots found on Stack Overflow, we see below that the decision boundary for the \(k=5\) nearest-neighbor model is reasonably smooth and not overfitted to the data.

However, the KNN decision boundary for \(k=1\) is far more contorted. The border between the Chinstrap and Adelie regions in the upper middle shows a zig-zag pattern characteristic of overfitting. We therefore reject \(k=1\), even though the \(k=5\) and \(k=1\) models produce identical confusion matrices (the \(k=1\) confusion matrix is omitted for brevity). This means model selection requires considering factors not captured by the confusion matrix.
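
The boundary plots are produced by predicting the class over a fine grid in the two displayed variables while holding the remaining features at representative values. A condensed sketch of the grid construction (the full recipe is in the appendix):

# Grid over bill length and depth; fix the other standardized features at
# representative values (0 for the continuous ones, +1 = male for sex).
pl = seq(min(standardized$bill_length_mm), max(standardized$bill_length_mm), by = 0.05)
pw = seq(min(standardized$bill_depth_mm),  max(standardized$bill_depth_mm),  by = 0.05)
lgrid = expand.grid(bill_length_mm = pl, bill_depth_mm = pw,
                    flipper_length_mm = 0, body_mass_g = 0, sex = 1)

# The contour between predicted classes traces the decision boundary.
grid_class = predict(x2, newdata = lgrid)
ggplot(lgrid) +
  stat_contour(aes(bill_length_mm, bill_depth_mm, z = as.numeric(grid_class)), bins = 2)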

2 Exploratory Data Analysis

2.1 Data Completeness

After an initial load of the loan approval data, we explore each variable's completeness.
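
We use the skimr package for this; a minimal sketch, assuming the CSV file name used in the appendix:

library(tidyverse)
library(skimr)

original_loan_approval = read_csv("Loan_approval.csv")
skim(original_loan_approval)   # produces the completeness summary below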

Data summary
Name original_loan_approval
Number of rows 614
Number of columns 13
_______________________
Column type frequency:
character 8
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Loan_ID 0 1.00 8 8 0 614 0
Gender 13 0.98 4 6 0 2 0
Married 3 1.00 2 3 0 2 0
Dependents 15 0.98 1 2 0 4 0
Education 0 1.00 8 12 0 2 0
Self_Employed 32 0.95 2 3 0 2 0
Property_Area 0 1.00 5 9 0 3 0
Loan_Status 0 1.00 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ApplicantIncome 0 1.00 5403.46 6109.04 150 2877.5 3812.5 5795.00 81000 ▇▁▁▁▁
CoapplicantIncome 0 1.00 1621.25 2926.25 0 0.0 1188.5 2297.25 41667 ▇▁▁▁▁
LoanAmount 22 0.96 146.41 85.59 9 100.0 128.0 168.00 700 ▇▃▁▁▁
Loan_Amount_Term 14 0.98 342.00 65.12 12 360.0 360.0 360.00 480 ▁▁▁▇▁
Credit_History 50 0.92 0.84 0.36 0 1.0 1.0 1.00 1 ▂▁▁▁▇

We conclude from skim that many variables are complete while others are incomplete:

Complete variables include: Loan_ID, Education, Property_Area, Loan_Status, ApplicantIncome, CoapplicantIncome

Mostly complete variables include: Married, Dependents, Loan_Amount_Term, Gender

Variables with significant gaps include: Self_Employed, LoanAmount, Credit_History

Of these variables, Credit_History is the most problematic: it has the most missing values (50 of 614) and, as the analysis below shows, it strongly influences the loan approval decision.

There are 614 observations and 13 columns including the response variable Loan_Status.

2.2 FeaturePlot Analysis
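
The tables below cross-tabulate Loan_Status against each categorical feature, first in percentage form and then in count form, with column totals and rejection rates appended. A minimal sketch of the underlying computation for one feature (the full display_rejection_rate helper appears in the appendix):

# Counts of Loan_Status by Gender, plus the rejection rate per gender.
counts   = table(la$Loan_Status, la$Gender)
rej_rate = counts["N", ] / colSums(counts) * 100
round(rej_rate)   # Female: 33, Male: 31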

Loan Approved vs. Gender by Percent
Female Male
N 6.0 24.4
Y 12.2 55.2
Gender Column Total %: 18.2 79.6
Rejection Rate % 33.0 30.7
Loan Approval vs. Gender by Count
Female Male
N 37 150
Y 75 339
Gender Total Count: 112 489
Rejection Rate % 33 31
Loan Approved vs. Married by Percent
No Yes
N 12.9 18.4
Y 21.8 46.4
Married Column Total %: 34.7 64.8
Rejection Rate % 37.1 28.4
Loan Approval vs. Married by Count
No Yes
N 79 113
Y 134 285
Married Total Count: 213 398
Rejection Rate % 37 28
Loan Approved vs. Dependents by Percent
0 1 2 3+
N 17.4 5.9 4.1 2.9
Y 38.8 10.7 12.4 5.4
Dependents Column Total %: 56.2 16.6 16.4 8.3
Rejection Rate % 31.0 35.3 24.8 35.3
Loan Approval vs. Dependents by Count
0 1 2 3+
N 107 36 25 18
Y 238 66 76 33
Dependents Total Count: 345 102 101 51
Rejection Rate % 31 35 25 35
Loan Approved vs. Education by Percent
Graduate Not Graduate
N 22.8 8.5
Y 55.4 13.4
Education Column Total %: 78.2 21.8
Rejection Rate % 29.2 38.8
Loan Approval vs. Education by Count
Graduate Not Graduate
N 140 52
Y 340 82
Education Total Count: 480 134
Rejection Rate % 29 39
Loan Approved vs. Self_Employed by Percent
No Yes
N 25.6 4.2
Y 55.9 9.1
Self_Employed Column Total %: 81.4 13.4
Rejection Rate % 31.4 31.7
Loan Approval vs. Self_Employed by Count
No Yes
N 157 26
Y 343 56
Self_Employed Total Count: 500 82
Rejection Rate % 31 32
Loan Approved vs. Property_Area by Percent
Rural Semiurban Urban
N 11.2 8.8 11.2
Y 17.9 29.2 21.7
Property_Area Column Total %: 29.2 37.9 32.9
Rejection Rate % 38.5 23.2 34.2
Loan Approval vs. Property_Area by Count
Rural Semiurban Urban
N 69 54 69
Y 110 179 133
Property_Area Total Count: 179 233 202
Rejection Rate % 39 23 34
Loan Approved vs. Loan_Amount_Term by Percent
12 36 60 84 120 180 240 300 360 480
N 0.0 0.3 0.0 0.2 0.0 2.4 0.2 0.8 24.9 1.5
Y 0.2 0.0 0.3 0.5 0.5 4.7 0.5 1.3 58.5 1.0
Loan_Amount_Term Column Total %: 0.2 0.3 0.3 0.7 0.5 7.2 0.7 2.1 83.4 2.4
Rejection Rate % 0.0 100.0 0.0 25.0 0.0 34.1 25.0 38.5 29.9 60.0
Loan Approval vs. Loan_Amount_Term by Count
12 36 60 84 120 180 240 300 360 480
N 0 2 0 1 0 15 1 5 153 9
Y 1 0 2 3 3 29 3 8 359 6
Loan_Amount_Term Total Count: 1 2 2 4 3 44 4 13 512 15
Rejection Rate % 0 100 0 25 0 34 25 38 30 60
Loan Approved vs. Credit_History by Percent
0 1
N 13.4 15.8
Y 1.1 61.6
Credit_History Column Total %: 14.5 77.4
Rejection Rate % 92.1 20.4
Loan Approval vs. Credit_History by Count
0 1
N 82 97
Y 7 378
Credit_History Total Count: 89 475
Rejection Rate % 92 20

3 Building A Complete Dataset

In this section, we define the subset of the loan approval data consisting solely of records with no missing values in any relevant column; we call this subset the complete dataset. We also omit one irrelevant column, Loan_ID: an identifier that is exogenous to the loan applicant and assumed to relate solely to computer-system processes.

For experimental purposes, the subsequent analysis uses only the complete records. This reduces the number of observations from 614 to 480, which is 78.2% of the original dataset.

Further, we convert the character-type data columns from text to factor. This is important for the response Loan_Status, but also relevant for Credit_History, Property_Area, Gender, Married, Education, and Self_Employed. In addition, we convert Dependents from text to an ordered factor, because the number of dependents is naturally ordered.
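
A minimal sketch of the cleanup, condensed from the appendix code:

cla = la %>% dplyr::select( -Loan_ID )    # drop the identifier column

cla$Loan_Status    = factor(cla$Loan_Status)      # response as factor
cla$Credit_History = factor(cla$Credit_History)   # 0/1 history as factor
cla$Dependents     = ordered(cla$Dependents, levels = c("0", "1", "2", "3+"))
# ...Gender, Married, Education, Self_Employed, Property_Area likewise via factor()...

cla = na.omit(cla)    # keep only complete records: 614 -> 480 rows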

Data summary
Name cla
Number of rows 480
Number of columns 12
_______________________
Column type frequency:
factor 9
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Gender 0 1 FALSE 2 Mal: 394, Fem: 86
Married 0 1 FALSE 2 Yes: 311, No: 169
Dependents 0 1 TRUE 4 0: 274, 2: 85, 1: 80, 3+: 41
Education 0 1 FALSE 2 Gra: 383, Not: 97
Self_Employed 0 1 FALSE 2 No: 414, Yes: 66
Loan_Amount_Term 0 1 FALSE 9 360: 411, 180: 36, 480: 12, 300: 9
Credit_History 0 1 FALSE 2 1: 410, 0: 70
Property_Area 0 1 FALSE 3 Sem: 191, Urb: 150, Rur: 139
Loan_Status 0 1 FALSE 2 Y: 332, N: 148

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ApplicantIncome 0 1 5364.23 5668.25 150 2898.75 3859.0 5852.50 81000 ▇▁▁▁▁
CoapplicantIncome 0 1 1581.09 2617.69 0 0.00 1084.5 2253.25 33837 ▇▁▁▁▁
LoanAmount 0 1 144.74 80.51 9 100.00 128.0 170.00 600 ▇▇▁▁▁

3.1 Scaling

Next, we scale the quantitative variables to mean zero and variance one so that each has equal influence.
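
A minimal sketch, mirroring the appendix: the response and factor columns pass through unchanged while scale() standardizes the three quantitative columns.

# Reorder columns and standardize ApplicantIncome, CoapplicantIncome, LoanAmount.
scla = cbind( cla[, 12], cla[, 1:5], cla[, 9:11], scale( cla[, 6:8] ) )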

Data summary
Name scla
Number of rows 480
Number of columns 12
_______________________
Column type frequency:
factor 9
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Loan_Status 0 1 FALSE 2 Y: 332, N: 148
Gender 0 1 FALSE 2 Mal: 394, Fem: 86
Married 0 1 FALSE 2 Yes: 311, No: 169
Dependents 0 1 TRUE 4 0: 274, 2: 85, 1: 80, 3+: 41
Education 0 1 FALSE 2 Gra: 383, Not: 97
Self_Employed 0 1 FALSE 2 No: 414, Yes: 66
Loan_Amount_Term 0 1 FALSE 9 360: 411, 180: 36, 480: 12, 300: 9
Credit_History 0 1 FALSE 2 1: 410, 0: 70
Property_Area 0 1 FALSE 3 Sem: 191, Urb: 150, Rur: 139

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ApplicantIncome 0 1 0 1 -0.92 -0.43 -0.27 0.09 13.34 ▇▁▁▁▁
CoapplicantIncome 0 1 0 1 -0.60 -0.60 -0.19 0.26 12.32 ▇▁▁▁▁
LoanAmount 0 1 0 1 -1.69 -0.56 -0.21 0.31 5.65 ▇▇▁▁▁

Next, we split the data into training and test subsets. Note that random forests provide out-of-bag (OOB) error estimates, which measure accuracy on the observations left out of each bootstrap sample; in this sense, the OOB error can proxy for the test error.

However, to be conservative in our assessment, we examine both the OOB error and the test error.
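
A minimal sketch of the stratified 80-20 split (full code in the appendix):

# Stratify on the response so both subsets preserve the Y/N proportions.
training.indices = createDataPartition(scla$Loan_Status, p = 0.8, list = FALSE)
train_scla = scla[ training.indices, ]   # 385 observations
test_scla  = scla[-training.indices, ]   #  95 observations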

The training set has the following size and attributes:

## 'data.frame':    385 obs. of  12 variables:
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 1 2 2 1 ...
##  $ Gender           : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Married          : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...
##  $ Dependents       : Ord.factor w/ 4 levels "0"<"1"<"2"<"3+": 1 1 1 3 4 3 2 3 3 1 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 2 1 1 1 1 1 1 1 1 ...
##  $ Self_Employed    : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 1 1 1 1 ...
##  $ Loan_Amount_Term : Factor w/ 10 levels "12","36","60",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ Credit_History   : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 2 2 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 2 3 2 3 3 1 ...
##  $ ApplicantIncome  : num  -0.4171 -0.49067 0.11216 0.00931 -0.41075 ...
##  $ CoapplicantIncome: num  -0.604 0.297 -0.604 0.999 0.353 ...
##  $ LoanAmount       : num  -0.978 -0.3072 -0.0464 1.5187 0.1648 ...

The test set has the following size and attributes:

## 'data.frame':    95 obs. of  12 variables:
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 1 2 2 1 2 1 1 2 1 1 ...
##  $ Gender           : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 2 2 1 ...
##  $ Married          : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 2 1 2 ...
##  $ Dependents       : Ord.factor w/ 4 levels "0"<"1"<"2"<"3+": 2 1 1 1 3 1 2 1 1 1 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 2 1 1 2 1 1 1 1 1 ...
##  $ Self_Employed    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Loan_Amount_Term : Factor w/ 10 levels "12","36","60",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ Credit_History   : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 2 2 1 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 1 3 3 3 3 3 1 2 3 3 ...
##  $ ApplicantIncome  : num  -0.1378 -0.5348 -0.0731 -0.3271 -0.2008 ...
##  $ CoapplicantIncome: num  -0.0279 -0.0249 -0.604 -0.604 -0.2067 ...
##  $ LoanAmount       : num  -0.208 -0.618 -0.245 -0.854 -0.431 ...

4 Random Forest Implementation

This implementation uses the complete loan approval dataset with the randomForest library.

We will also evaluate the ranger library, which UC Business Analytics reports to be roughly six times faster than randomForest because it is written in C++.
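
A minimal sketch of an equivalent ranger fit is below; the training data and parameter values are our assumptions, and we have not benchmarked the speed claim ourselves.

library(ranger)

# ranger mirrors the randomForest interface; num.trees and mtry are the analogues.
rf_ranger = ranger(Loan_Status ~ . , data = train_scla, num.trees = 500, mtry = 3)
rf_ranger$prediction.error   # OOB misclassification estimate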

We will break the analysis of random forests into several phases:

  1. Selecting the number of trees required to achieve stability in the OOB error rate.
  2. Tuning mtry, the number of variables considered at each split.
  3. Studying the sampling distribution of OOB error rates.
  4. Assessing variable importance.
  5. Assessing the accuracy of the random forest with tuned parameters.

4.1 Selecting the Number of Trees

After running two initial sanity checks at different \(\text{mtry}\) values, we observe that the random forest OOB error rate stabilizes well before \(\text{ntree}=500\) trees; in fact, the error appears stable after roughly 150 trees. Keeping 500 trees is therefore a conservative choice, paid for with extra computing time.

First we look at the default value of mtry, which for classification is \(\lfloor \sqrt{p} \rfloor\); with \(p = 11\) predictors, this gives \(\text{mtry} = 3\) for this dataset.

## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = test_scla) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 17.89%
## Confusion matrix:
##    N  Y class.error
## N 16 13  0.44827586
## Y  4 62  0.06060606

The following plot suggests that the OOB error stabilizes after roughly 150 trees. Certainly, at the randomForest default of \(\text{ntree} = 500\), we obtain stable error rates when \(\text{mtry} = 3\).

Let’s verify this result at the other extreme, with mtry close to the number of independent variables. We consider \(\text{mtry}=10\) and observe similar error stability, as shown below.

## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = test_scla, mtry = 10) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 17.89%
## Confusion matrix:
##    N  Y class.error
## N 16 13  0.44827586
## Y  4 62  0.06060606

4.2 Parameter Tuning of mtry

In this step, we examine the random forest error rate at different levels of the mtry parameter: the number of predictors considered when defining the split at each node of a tree. In our case, mtry ranges over \(1, \ldots, 11\).
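
A minimal sketch of such a sweep, recording the final OOB error at each mtry (illustrative code under our assumptions, not the tuned output itself):

# OOB error as a function of mtry, holding ntree at 500.
oob_by_mtry = sapply(1:11, function(m) {
  rf = randomForest(Loan_Status ~ . , data = train_scla, mtry = m, ntree = 500)
  tail(rf$err.rate[, "OOB"], 1)   # OOB error rate after the final tree
})
plot(1:11, oob_by_mtry, type = "b", xlab = "mtry", ylab = "OOB error rate")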

5 Appendices

5.2 Code

We collect all the R code used in this project in this appendix for ease of reference.

library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
library(GGally)
library(ellipse)
library(caret)
library(skimr)
library(klaR)
library(class)  # used for KNN classifier

knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE)

library(palmerpenguins)

# The final dataset excludes observations with missing sex and drops the island and year variables.
pc = penguins %>% filter( !is.na(sex) ) %>% dplyr::select( -island, -year )

head(pc) %>% kable(caption="Scrubbed penguin data set") %>% kable_styling(bootstrap_options = c("striped", "hover"))


set.seed(10)

#  Construct the standardized dataset with the response variable, the 4 quantitative variables normalized 
#  with the scale function and the sex variable normalized with the 1,-1 assignment explained previously.
#
standardized = cbind( pc[,1], scale( pc[, 2:5] ) , sex = data.table::fifelse(pc$sex == 'male' , 1 , -1) )

# Define an 80-20 split of the training and test data.
# -------------------------------------------------------------------------------------
training.individuals = createDataPartition(standardized$species, p= 0.8 , list = FALSE)

# X variables include bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,  sex (converted to -1,1)
# Y variable is just the species class
# train data is the standardized subset of the full data set
# test data is the complement of the training data set.
# -----------------------------------------------------------------
train.X  =  standardized[ training.individuals,  2:6]
test.X   =  standardized[-training.individuals,  2:6]
train.Y  =  standardized[ training.individuals,    1]
test.Y   =  standardized[-training.individuals,    1]

head(train.X ) %>% kable(caption="Normalized Training Data - 1st Few Rows", digits = 2 ) %>% 
  kable_styling(bootstrap_options = c("striped", "hover") )
model = train( species ~. , data = standardized , method = "knn" ,
         trControl = trainControl("cv", number = 5 ), tuneLength = 10 )

# Plot the model output vs. varying k values.
plot(model)
model

# The best tuning parameter k that 
# optimized model accuracy is:
#model$bestTune

# Simpler interface using the knn function within class.
knn.pred5 = knn(train.X, test.X, train.Y , k = 5, prob = FALSE) 
cm5 = confusionMatrix(knn.pred5, test.Y) 


# This model approach uses tuneGrid with a single value to calculate kNN 
# only at a single number of nearest neighbors.
# ------------------------------------------------------------------------------
x2 = train( species ~ . , data = standardized[training.individuals,] , 
            method = "knn", 
            trControl = trainControl(method="none"), 
            tuneGrid=data.frame(k=5) )

pred2 = predict(x2, newdata=test.X)

confusionMatrix(pred2, test.Y)

# We visualize a projection of the decision boundary flattened to 2 dimensions to illustrate
# the smoothness of the KNN boundary at different values of k.
# First, choose the variables, then build a sampling grid.
# ------------------------------------------------------------------------
pl = seq(min(standardized$bill_length_mm), max(standardized$bill_length_mm), by = 0.05)
pw = seq(min(standardized$bill_depth_mm), max(standardized$bill_depth_mm), by = 0.05 )

# The sampling grid needs explicit values to be plugged in for the variables
# which are not the two projected variables.  Since variables are standardized to
# mean 0 and variance 1, we choose zero for flipper_length and body_mass because they are continuous 
# and the mean and mode are close.
# However, for sex, we choose male because the average sex = 0 is not the mode.
# In practice, sex = 0 gives weird decision boundaries in the projected plot.
# ------------------------------------------------------------------------------
lgrid = expand.grid(bill_length_mm=pl , bill_depth_mm = pw , flipper_length_mm = 0, body_mass_g = 0 , sex = 1 )

knnPredGrid = predict(x2, newdata=lgrid )

num_knnPredGrid = as.numeric(knnPredGrid)

num_pred2 = as.numeric(pred2)

test = cbind(test.X, test.Y)
test$Pred = num_pred2
ggplot(data=lgrid) + stat_contour(aes(x=bill_length_mm, y= bill_depth_mm, z = num_knnPredGrid), bins = 2 ) +
  geom_point( aes(x=bill_length_mm, y= bill_depth_mm, color = knnPredGrid),  size = 1, shape = 19, alpha = 0.2) +
  geom_point( data = test, aes(x = bill_length_mm, y = bill_depth_mm, color = pred2), size = 4, alpha = 0.8, shape = 24 ) +
  ggtitle("KNN Decision Boundary for Penguins Data with k=5 neighbor")


# This model approach uses tuneGrid with a single value to calculate kNN 
# only at a single number of nearest neighbors.
# ------------------------------------------------------------------------------
x3 = train( species ~ . , data = standardized[training.individuals,] , 
            method = "knn", 
            trControl = trainControl(method="none"), 
            tuneGrid=data.frame(k=1) )  # Now we use a k=1 nearest neighbor algorithm which is overfitted.

pred3 = predict(x3, newdata=test.X)


knnPredGrid3 = predict(x3, newdata=lgrid )

num_knnPredGrid3 = as.numeric(knnPredGrid3)

num_pred3 = as.numeric(pred3)

test = cbind(test.X, test.Y)
test$Pred = num_pred3

ggplot(data=lgrid) + stat_contour(aes(x=bill_length_mm, y= bill_depth_mm, z = num_knnPredGrid3), bins = 2 ) +
  geom_point( aes(x=bill_length_mm, y= bill_depth_mm, color = knnPredGrid3),  size = 1, shape = 19, alpha = 0.2) +
  geom_point( data = test, aes(x = bill_length_mm, y = bill_depth_mm, color = pred3), size = 4, alpha = 0.8, shape = 24 ) +
  ggtitle("KNN Decision Boundary for Penguins Data with k=1 neighbor")

original_loan_approval = read_csv("Loan_approval.csv")

skim(original_loan_approval)
#View(original_loan_approval)
la = original_loan_approval

la$Loan_Amount_Term = factor(la$Loan_Amount_Term)


# Feature Plot of box plots

categorical_vars = c("Gender", "Married", "Dependents", "Education", "Self_Employed", "Property_Area", "Credit_History" , "Loan_Amount_Term" )

quantitative_vars = c("ApplicantIncome", "CoapplicantIncome", "LoanAmount")


library(kableExtra)


#  Display the rejection rate statistics for each qualitative field
#  in absolute count and percentage terms.
# 
display_rejection_rate <- function(la, field, display_pct)
{
    a = la[, "Loan_Status"]
    b = la[, field]

    counts = table(cbind(a, b))   # Two-way count table: Loan_Status x field

    if (display_pct == 1)
    {
        pcts = as.data.frame.matrix(counts) / nrow(la) * 100   # Percentage equivalent
        col_sums = colSums(pcts)                               # Totals by field level

        rej_rate = pcts[1, ] / col_sums * 100   # Rejection rate; assumes row 1 is the "N" row.

        yy = rbind(pcts, col_sums, rej_rate)
        rownames(yy)[3] = paste0(field, " Column Total %:")
        rownames(yy)[4] = "Rejection Rate %"
        yy %>% kable(digits = 1, caption = paste0('Loan Approved vs. ', field, ' by Percent')) %>%
            kable_styling(bootstrap_options = c("striped", "hover"))
    }
    else
    {
        col_sums_count = colSums(counts)
        rej_rate = counts[1, ] / col_sums_count * 100

        d = rbind(counts, col_sums_count, rej_rate)
        rownames(d)[3] = paste0(field, " Total Count: ")
        rownames(d)[4] = "Rejection Rate %"

        d %>% kable(digits = 0, caption = paste0('Loan Approval vs. ', field, ' by Count')) %>%
            kable_styling(bootstrap_options = c("striped", "hover"))
    }
}

display_rejection_rate( la , "Gender", 1)  # Display in percentage form
display_rejection_rate( la,  "Gender", 0)  # Display in count form

display_rejection_rate( la , "Married", 1)  # Display in percentage form
display_rejection_rate( la,  "Married", 0)  # Display in count form

display_rejection_rate( la , "Dependents", 1)  # Display in percentage form
display_rejection_rate( la,  "Dependents", 0)  # Display in count form


display_rejection_rate( la , "Education", 1)  # Display in percentage form
display_rejection_rate( la,  "Education", 0)  # Display in count form

display_rejection_rate( la , "Self_Employed", 1)  # Display in percentage form
display_rejection_rate( la,  "Self_Employed", 0)  # Display in count form

display_rejection_rate( la , "Property_Area", 1)  # Display in percentage form
display_rejection_rate( la,  "Property_Area", 0)  # Display in count form


display_rejection_rate( la , "Loan_Amount_Term", 1)  # Display in percentage form
display_rejection_rate( la,  "Loan_Amount_Term", 0)  # Display in count form


display_rejection_rate( la , "Credit_History", 1)  # Display in percentage form
display_rejection_rate( la,  "Credit_History", 0)  # Display in count form




caret::featurePlot(
  x = la[, quantitative_vars],
  y = as.factor(la$Loan_Status),  
  plot = "pairs",
  auto.key = list(columns = 2)   )


caret::featurePlot(
  x = la[, quantitative_vars],
  y = as.factor(la$Loan_Status),  
  plot = "box",
  auto.key = list(columns = 2) ,
  scales = list(x=list(relation="free"), y=list(relation="free")))

cla <- la %>% dplyr::select(-Loan_ID ) 

cla$Loan_Status = factor(cla$Loan_Status)  # Convert the response to factor
cla$Credit_History = factor(cla$Credit_History)
cla$Property_Area = factor( cla$Property_Area)
cla$Gender = factor( cla$Gender)
cla$Married = factor(cla$Married)
cla$Dependents = ordered(cla$Dependents, levels = c("0" , "1", "2" , "3+") )
cla$Education = factor(cla$Education)
cla$Self_Employed = factor( cla$Self_Employed)
cla = na.omit(cla)


skim(cla)

# Reorder the columns for ease of use.
# Standardize the 3 quantitative variables:  ApplicantIncome, CoapplicantIncome, LoanAmount
scla = cbind( cla[,12], cla[,1:5], cla[, 9:11], scale( cla[, 6:8] ) )

skim(scla)
head(scla)


# Define an 80-20 split of the training and test data.
# -------------------------------------------------------------------------------------
training.indices = createDataPartition(scla$Loan_Status, p= 0.8 , list = FALSE)

train_scla = scla[training.indices,]

test_scla = scla[-training.indices,]
# Training
str(train_scla)
str(test_scla)
# Libraries used for random forest
# ------------------------------------
library(randomForest)
library(ranger)
set.seed(1234)

rf_numtrees = randomForest( Loan_Status ~ . , data = test_scla )

rf_numtrees

plot(rf_numtrees)
rf_numtrees_2 = randomForest( Loan_Status ~ . , data = test_scla , mtry = 10)

rf_numtrees_2

plot(rf_numtrees_2)