Introduction

This assignment analyzes two problems. First, the Palmer Penguin dataset using the KNN model to predict species. Second, a Loan Approval dataset using Random Forest and Gradient Boosting methods.

1 Penguins and the KNN algorithm

Following the data cleaning approach taken in prior assignments, we exclude observations where the feature sex is undefined. We also exclude the two variables island and year which do not seem helpful in prediction analysis. We follow the variable selection process described in a prior assignment here.

This still gives a substantial subset \(333\) out of \(344\) original observations. The below table illustrates sample rows of the cleansed dataset used in the subsequent KNN analysis.

Scrubbed penguin data set
species	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
Adelie	39.1	18.7	181	3750	male
Adelie	39.5	17.4	186	3800	female
Adelie	40.3	18.0	195	3250	female
Adelie	36.7	19.3	193	3450	female
Adelie	39.3	20.6	190	3650	male
Adelie	38.9	17.8	181	3625	female

We obtain 333 rows of observations with 6 columns of explanatory and response variables.

1.1 Data Transformation for KNN

The next step is to normalize the data so that each quantitative variables have a mean zero and standard deviation equal to 1. However, we need to also include the sex variable in the K-nearest neighbor model. Normalization ensures that each variable has equal influence on the nearest neighbor calculation. Otherwise, a quantitative variable whose scale is much larger than the other variables would dominate the Euclidean distance calculation used in the KNN model.

However, we also need to include the sex variable as a feature in the KNN model. Categorical variables pose a challenge in the KNN context because they need to be mapped to a numerical equivalent. In the case of this data set, two facts are key:

the original data set gathered penguins by breeding pairs - thus the number of males and females is very balanced.
the variable is binary. Thus, two equal and opposite values can accomplish the goal of normalization.

For these reasons, we do not normalize sex but use the mapping function \(f\) defined as:

\[\begin{equation} f(\text{sex}) = \begin{cases} 1 & \text{sex} = \text{male} \\ -1 & \text{sex} = \text{female} \\ \end{cases} \end{equation} \] Because the data set is gender balanced, \(E[f(\text{sex})] \approx 0\) and \(Stdev(f(\text{sex})) \approx 1\). Hence, \(f(\text{sex})\) is nearly normalized. With these transformations, the KNN dataset is prepared for model evaluation. We illustrate the top 5 rows of the training data set below. Note that the sampled indices are displayed in the left most column as row values in the resulting dataframe.

Normalized Training Data - 1st Few Rows
	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
1	-0.89	0.78	-1.42	-0.57	1
3	-0.68	0.42	-0.43	-1.19	-1
4	-1.33	1.08	-0.57	-0.94	-1
7	-0.88	1.24	-0.43	0.58	1
8	-0.53	0.22	-1.35	-1.25	-1
9	-0.99	2.05	-0.71	-0.51	1

1.2 Cross Validation and Parameter Tuning with KNN

We will use caret to test a range of possible values of \(k\) and select the optimal \(k\) based on minimizing the cross validation error. We did the pre-processing previously using scale so there is no need to invoke further data transformation inside caret pre-processing function calls.

trainControl allows us to define the training parameters.

tuneLength decides the range of allowed \(k\) value to compute against the model.

We choose a 5-fold cross validation method to ensure the test set is sufficiently large.

Using the above model plot which shows the average cross validation accuracy at each parameter \(k\) used in the KNN model, we see that the best kappa and accuracy occur at \(k=5\) nearest neighbors. This is evident when we display the (cross validation) model output below.

## k-Nearest Neighbors 
## 
## 333 samples
##   5 predictor
##   3 classes: 'Adelie', 'Chinstrap', 'Gentoo' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 267, 266, 268, 265, 266 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9940737  0.9906842
##    7  0.9880556  0.9811323
##    9  0.9881914  0.9812667
##   11  0.9881914  0.9812667
##   13  0.9851145  0.9763978
##   15  0.9851145  0.9763978
##   17  0.9851145  0.9763978
##   19  0.9851145  0.9763978
##   21  0.9851145  0.9763978
##   23  0.9851145  0.9763978
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

1.3 KNN Model Output and Performance

To assess model performance and investigate the suitability of our choice of \(k=5\), we consider two approaches. First we examine the statistical output of the confusion matrix. The accuracy and kappa ought to look attractive in the model selection. Second, we examine the smoothness of the decision boundary by plotting the decision boundary for \(k=1\) and \(k=5\). If the optimal \(k\) shows a relatively smooth boundary, this is supporting evidence that the model parameters are reasonable.

First, we construct the confusion matrix of one specific implementation of the KNN model for the train-test split following the same proportions as the cross-validation study. We specifically use the caret interface rather than the simpler class interface to ensure consistency across multiple models in future.

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        29         1      0
##   Chinstrap      0        12      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9172, 0.9996)
##     No Information Rate : 0.4462          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9757          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           0.9231        1.0000
## Specificity                 0.9722           1.0000        1.0000
## Pos Pred Value              0.9667           1.0000        1.0000
## Neg Pred Value              1.0000           0.9811        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4462           0.1846        0.3538
## Detection Prevalence        0.4615           0.1846        0.3538
## Balanced Accuracy           0.9861           0.9615        1.0000

We see in the confusion matrix output above that accuracy is 98.46% and kappa is 97.6%. These are consistent with the cross validation results. We conclude that KNN is a very effective classifier for penguin gender. Only 1 penguin in 65 was misclassified: a Chinstrap was predicted to be an Adelie penguin.

Next, we visualize the decision boundary of our KNN model to check for model suitability. Custom code needs to be written as we are not aware if this plot type is currently provided by any R package.

Adapting the R recipe for KNN decision boundary plots found in the stackoverflow page here we see below that the decision boundary for \(k=5\) nearest neighbor algorithm is reasonably smooth and not overfitted to the data.

However, the same KNN decision boundary for \(k=1\) shows a much more twisted curve. The border between the Chinstrap and Adelie regions in the upper middle half shows a zig-zag pattern characteristic of overfitting. So we reject the choice of \(k=1\) as the model even though both \(k=5\) and \(k=1\) KNN models have identical confusion matrices. ( The \(k=1\) confusion matrix is omitted for brevity. ) This means model selection requires considering factors not captured by the confusion matrix.

2 Exploratory Data Analysis

2.1 Data Completeness

After an initial load of the data, we explore each variable’s completeness.

Data summary

Name	original_loan_approval
Number of rows	614
Number of columns	13
_______________________
Column type frequency:
character	8
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Loan_ID	0	1.00	8	8	614
Gender	13	0.98	4	6	2
Married	3	1.00	2	3	2
Dependents	15	0.98	1	2	4
Education	0	1.00	8	12	2
Self_Employed	32	0.95	2	3	2
Property_Area	0	1.00	5	9	3
Loan_Status	0	1.00	1	1	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ApplicantIncome	0	1.00	5403.46	6109.04	150	2877.5	3812.5	5795.00	81000	▇▁▁▁▁
CoapplicantIncome	0	1.00	1621.25	2926.25	0	0.0	1188.5	2297.25	41667	▇▁▁▁▁
LoanAmount	22	0.96	146.41	85.59	9	100.0	128.0	168.00	700	▇▃▁▁▁
Loan_Amount_Term	14	0.98	342.00	65.12	12	360.0	360.0	360.00	480	▁▁▁▇▁
Credit_History	50	0.92	0.84	0.36	0	1.0	1.0	1.00	1	▂▁▁▁▇

We conclude from skim that many variables are complete while others are incomplete:

Complete variables include: Loan_ID, Education, Property_Area, Loan_Status, ApplicantIncome, CoapplicantIncome

Mostly complete variables include: Married, Dependents, Loan_Amount_Term, Gender

Variables with significant gaps include: Self_Employed, LoanAmount, Credit_History

Of these variables, Credit_History is the most problematic and it influences the Loan Approval decision.

There are 614 observations and 13 columns including the response variable Loan_Status.

2.2 FeaturePlot Analysis

Loan Approved vs. Gender by Percent
	Female	Male
N	6.0	24.4
Y	12.2	55.2
Gender Column Total %:	18.2	79.6
Rejection Rate %	33.0	30.7

Loan Approval vs. Gender by Count
	Female	Male
N	37	150
Y	75	339
Gender Total Count:	112	489
Rejection Rate %	33	31

Loan Approved vs. Married by Percent
	No	Yes
N	12.9	18.4
Y	21.8	46.4
Married Column Total %:	34.7	64.8
Rejection Rate %	37.1	28.4

Loan Approval vs. Married by Count
	No	Yes
N	79	113
Y	134	285
Married Total Count:	213	398
Rejection Rate %	37	28

Loan Approved vs. Dependents by Percent
	0	1	2	3+
N	17.4	5.9	4.1	2.9
Y	38.8	10.7	12.4	5.4
Dependents Column Total %:	56.2	16.6	16.4	8.3
Rejection Rate %	31.0	35.3	24.8	35.3

Loan Approval vs. Dependents by Count
	0	1	2	3+
N	107	36	25	18
Y	238	66	76	33
Dependents Total Count:	345	102	101	51
Rejection Rate %	31	35	25	35

Loan Approved vs. Education by Percent
	Graduate	Not Graduate
N	22.8	8.5
Y	55.4	13.4
Education Column Total %:	78.2	21.8
Rejection Rate %	29.2	38.8

Loan Approval vs. Education by Count
	Graduate	Not Graduate
N	140	52
Y	340	82
Education Total Count:	480	134
Rejection Rate %	29	39

Loan Approved vs. Self_Employed by Percent
	No	Yes
N	25.6	4.2
Y	55.9	9.1
Self_Employed Column Total %:	81.4	13.4
Rejection Rate %	31.4	31.7

Loan Approval vs. Self_Employed by Count
	No	Yes
N	157	26
Y	343	56
Self_Employed Total Count:	500	82
Rejection Rate %	31	32

Loan Approved vs. Property_Area by Percent
	Rural	Semiurban	Urban
N	11.2	8.8	11.2
Y	17.9	29.2	21.7
Property_Area Column Total %:	29.2	37.9	32.9
Rejection Rate %	38.5	23.2	34.2

Loan Approval vs. Property_Area by Count
	Rural	Semiurban	Urban
N	69	54	69
Y	110	179	133
Property_Area Total Count:	179	233	202
Rejection Rate %	39	23	34

Loan Approved vs. Loan_Amount_Term by Percent
	12	36	60	84	120	180	240	300	360	480
N	0.0	0.3	0.0	0.2	0.0	2.4	0.2	0.8	24.9	1.5
Y	0.2	0.0	0.3	0.5	0.5	4.7	0.5	1.3	58.5	1.0
Loan_Amount_Term Column Total %:	0.2	0.3	0.3	0.7	0.5	7.2	0.7	2.1	83.4	2.4
Rejection Rate %	0.0	100.0	0.0	25.0	0.0	34.1	25.0	38.5	29.9	60.0

Loan Approval vs. Loan_Amount_Term by Count
	12	36	60	84	120	180	240	300	360	480
N	0	2	0	1	0	15	1	5	153	9
Y	1	0	2	3	3	29	3	8	359	6
Loan_Amount_Term Total Count:	1	2	2	4	3	44	4	13	512	15
Rejection Rate %	0	100	0	25	0	34	25	38	30	60

Loan Approved vs. Credit_History by Percent
	0	1
N	13.4	15.8
Y	1.1	61.6
Credit_History Column Total %:	14.5	77.4
Rejection Rate %	92.1	20.4

Loan Approval vs. Credit_History by Count
	0	1
N	82	97
Y	7	378
Credit_History Total Count:	89	475
Rejection Rate %	92	20

3 Building A Complete Dataset

In this section, we define a subset of the loan approval data which consists solely of the records with no missing values for any relevant column and then exclude useless variables. We call this subset of observations the complete dataset. Next, we omit one irrelevant column: the Loan_ID. An identifier which is exogenous to the loan applicant and is assumed to be related solely to computer system processes.

For experimental purposes, we will deal only in the complete records in the subsequent analysis. This reduces the number of observations from 614 to 480 which is a 78.2% of the original data set.

Further, we convert the character type data columns from text to factor. This is importance for Loan_Status but also relevant for others like: Credit_History, Property_Area, Gender, Married, Education, Self_Employed. In addition, we convert Dependents from text to an ordered factor because the number of children is ordered.

Data summary

Name	cla
Number of rows	480
Number of columns	12
_______________________
Column type frequency:
factor	9
numeric	3
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
Gender	1	FALSE	2	Mal: 394, Fem: 86
Married	1	FALSE	2	Yes: 311, No: 169
Dependents	1	TRUE	4	0: 274, 2: 85, 1: 80, 3+: 41
Education	1	FALSE	2	Gra: 383, Not: 97
Self_Employed	1	FALSE	2	No: 414, Yes: 66
Loan_Amount_Term	1	FALSE	9	360: 411, 180: 36, 480: 12, 300: 9
Credit_History	1	FALSE	2	1: 410, 0: 70
Property_Area	1	FALSE	3	Sem: 191, Urb: 150, Rur: 139
Loan_Status	1	FALSE	2	Y: 332, N: 148

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ApplicantIncome	1	5364.23	5668.25	150	2898.75	3859.0	5852.50	81000	▇▁▁▁▁
CoapplicantIncome	1	1581.09	2617.69	0	0.00	1084.5	2253.25	33837	▇▁▁▁▁
LoanAmount	1	144.74	80.51	9	100.00	128.0	170.00	600	▇▇▁▁▁

3.1 Scaling

Further we scale the quantitative variables to mean zero and variance one to ensure equal influence.

Data summary

Name	scla
Number of rows	480
Number of columns	12
_______________________
Column type frequency:
factor	9
numeric	3
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
Loan_Status	1	FALSE	2	Y: 332, N: 148
Gender	1	FALSE	2	Mal: 394, Fem: 86
Married	1	FALSE	2	Yes: 311, No: 169
Dependents	1	TRUE	4	0: 274, 2: 85, 1: 80, 3+: 41
Education	1	FALSE	2	Gra: 383, Not: 97
Self_Employed	1	FALSE	2	No: 414, Yes: 66
Loan_Amount_Term	1	FALSE	9	360: 411, 180: 36, 480: 12, 300: 9
Credit_History	1	FALSE	2	1: 410, 0: 70
Property_Area	1	FALSE	3	Sem: 191, Urb: 150, Rur: 139

Variable type: numeric

skim_variable	complete_rate	sd	p0	p25	p50	p75	p100	hist
ApplicantIncome	1	1	-0.92	-0.43	-0.27	0.09	13.34	▇▁▁▁▁
CoapplicantIncome	1	1	-0.60	-0.60	-0.19	0.26	12.32	▇▁▁▁▁
LoanAmount	1	1	-1.69	-0.56	-0.21	0.31	5.65	▇▇▁▁▁

Next, we split the data into a training and test subset. Note that Random forest allow the use of out-of-bag errors which can be used to assess model accuracy out of sample. In this sense, OOB errors can proxy for test errors for random forests.

However, to be conservative in our assessment, we assess both OOB error and test errors.

The training set has the following size and attributes:

## 'data.frame':    385 obs. of  12 variables:
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 1 2 2 1 ...
##  $ Gender           : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Married          : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...
##  $ Dependents       : Ord.factor w/ 4 levels "0"<"1"<"2"<"3+": 1 1 1 3 4 3 2 3 3 1 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 2 1 1 1 1 1 1 1 1 ...
##  $ Self_Employed    : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 1 1 1 1 ...
##  $ Loan_Amount_Term : Factor w/ 10 levels "12","36","60",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ Credit_History   : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 2 2 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 2 3 2 3 3 1 ...
##  $ ApplicantIncome  : num  -0.4171 -0.49067 0.11216 0.00931 -0.41075 ...
##  $ CoapplicantIncome: num  -0.604 0.297 -0.604 0.999 0.353 ...
##  $ LoanAmount       : num  -0.978 -0.3072 -0.0464 1.5187 0.1648 ...

The test set has the following size and attributes:

## 'data.frame':    95 obs. of  12 variables:
##  $ Loan_Status      : Factor w/ 2 levels "N","Y": 1 2 2 1 2 1 1 2 1 1 ...
##  $ Gender           : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 2 2 1 ...
##  $ Married          : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 2 1 2 ...
##  $ Dependents       : Ord.factor w/ 4 levels "0"<"1"<"2"<"3+": 2 1 1 1 3 1 2 1 1 1 ...
##  $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 2 1 1 2 1 1 1 1 1 ...
##  $ Self_Employed    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Loan_Amount_Term : Factor w/ 10 levels "12","36","60",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ Credit_History   : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 2 2 1 ...
##  $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 1 3 3 3 3 3 1 2 3 3 ...
##  $ ApplicantIncome  : num  -0.1378 -0.5348 -0.0731 -0.3271 -0.2008 ...
##  $ CoapplicantIncome: num  -0.0279 -0.0249 -0.604 -0.604 -0.2067 ...
##  $ LoanAmount       : num  -0.208 -0.618 -0.245 -0.854 -0.431 ...

4 Random Forest Implementation

This implementation approaches uses the complete loan approval dataset with the randomForest library.

We will also evaluate using the ranger library which is stated by UC Business Analytics to be 6 times faster than randomForest because it is written in C++.

We will break the analysis of random forests into several phases:

Selecting the number of trees require to achieve stability in the OOB error rate.
Parameter tuning for the number of variables mtry to be used as each split.
Studying sampling distribution of OOB error rates
Assessing variable importance
Assessing model accuracy of the random forest using tuned parameters

4.1 Select the number of trees

After running two initial sanity checks at different \(\text{mtry}\) parameters, we observe that the random forest OOB error rates stabilize after \(\text{ntree}=500\) trees. In fact, this is conservative as the actual error appears to stabilize at 150 trees. Therefore, we pay the cost of greater computing time.

First we look at the default algorithm value of mtry which happens to be 3 for this data set.

## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = test_scla) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 17.89%
## Confusion matrix:
##    N  Y class.error
## N 16 13  0.44827586
## Y  4 62  0.06060606

The following plot suggests that the OOB error stabilizes after 150 trees. Certainly if we use 500 trees, which is the default number of trees ntree for randomForest we obtain stable error rates when mtry equals 3.

Let’s verify the sample result when we go the other extreme: mtry close to the number f independent variables. We consider \(\text{mtry}=10\) and get similar error stability as shown below.

## 
## Call:
##  randomForest(formula = Loan_Status ~ ., data = test_scla, mtry = 10) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 17.89%
## Confusion matrix:
##    N  Y class.error
## N 16 13  0.44827586
## Y  4 62  0.06060606

4.2 Parameter Tuning of `mtry`

In this step, we examine the random forest error rates at different levels of the mtry parameter – the number of parameters to use at each node of the decision tree to use in defining the split. In our case, the range of mtry goes between \(\text{mtry}=1 \cdots 11\).

5 Appendices

5.1 References

UC Business Analytics R Programming Guide - Random Forests

5.2 Code

We summarize all the R code used in this project in this appendix for ease of reading.

library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
library(GGally)
library(ellipse)
library(caret)
library(skimr)
library(klaR)
library(class)  # used for KNN classifier

knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE)

library(palmerpenguins)

# The final dataset excludes observations with missing sex and drops the island and year variables.
pc = penguins %>% filter( is.na(sex) == FALSE) %>% dplyr::select( -one_of("island", "year") )

head(pc) %>% kable(caption="Scrubbed penguin data set") %>% kable_styling(bootstrap_options = c("striped", "hover"))


set.seed(10)

#  Construct the standardized dataset with the response variable, the 4 quantitative variables normalized 
#  with the scale function and the sex variable normalized with the 1,-1 assignment explained previously.
#
standardized = cbind( pc[,1], scale( pc[, 2:5] ) , sex = data.table::fifelse(pc$sex == 'male' , 1 , -1) )

# Define an 80-20 split of the training and test data.
# -------------------------------------------------------------------------------------
training.individuals = createDataPartition(standardized$species, p= 0.8 , list = FALSE)

# X variables include bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,  sex (converted to -1,1)
# Y variable is just the species class
# train data is the standardized subset of the full data set
# test data is the complement of the training data set.
# -----------------------------------------------------------------
train.X  =  standardized[ training.individuals,  2:6]
test.X   =  standardized[-training.individuals,  2:6]
train.Y  =  standardized[ training.individuals,    1]
test.Y   =  standardized[-training.individuals,    1]

head(train.X ) %>% kable(caption="Normalized Training Data - 1st Few Rows", digits = 2 ) %>% 
  kable_styling(bootstrap_options = c("striped", "hover") )
model = train( species ~. , data = standardized , method = "knn" ,
         trControl = trainControl("cv", number = 5 ), tuneLength = 10 )

# Plot the model output vs. varying k values.
plot(model)
model

# The best tuning parameter k that 
# optimized model accuracy is:
#model$bestTune

# Simpler interface using the knn function within class.
knn.pred5 = knn(train.X, test.X, train.Y , k = 5, prob = FALSE) 
cm5 = confusionMatrix(knn.pred5, test.Y) 


# This model approach uses tuneGrid with a single value to calculate kNN 
# only at a single number of nearest neighbors.
# ------------------------------------------------------------------------------
x2 = train( species ~ . , data = standardized[training.individuals,] , 
            method = "knn", 
            trControl = trainControl(method="none"), 
            tuneGrid=data.frame(k=5) )

pred2 = predict(x2, newdata=test.X)

confusionMatrix(pred2, test.Y)

# We visualize a projection of the decision boundary flattened to 2 dimensions to illustrate
# the smoothness of the KNN boundary at different values of k.
# First, choose the variables, then build a sampling grid.
# ------------------------------------------------------------------------
pl = seq(min(standardized$bill_length_mm), max(standardized$bill_length_mm), by = 0.05)
pw = seq(min(standardized$bill_depth_mm), max(standardized$bill_depth_mm), by = 0.05 )

# The sampling grid needs explicit values to be plugged in for the variables
# which are not the two projected variables.  Since variables are standardized to
# mean 0 and variance 1, we choose zero for flipper_length and body_mass because they are continuous 
# and the mean and mode are close.
# However, for sex, we choose male because the average sex = 0 is not the mode.
# In practice, sex = 0 gives weird decision boundaries in the projected plot.
# ------------------------------------------------------------------------------
lgrid = expand.grid(bill_length_mm=pl , bill_depth_mm = pw , flipper_length_mm = 0, body_mass_g = 0 , sex = 1 )

knnPredGrid = predict(x2, newdata=lgrid )

num_knnPredGrid = as.numeric(knnPredGrid)

num_pred2 = as.numeric(pred2)

test = cbind(test.X, test.Y)
test$Pred = num_pred2
ggplot(data=lgrid) + stat_contour(aes(x=bill_length_mm, y= bill_depth_mm, z = num_knnPredGrid), bins = 2 ) +
  geom_point( aes(x=bill_length_mm, y= bill_depth_mm, color = knnPredGrid),  size = 1, shape = 19, alpha = 0.2) +
  geom_point( data= test , aes(x=test$bill_length_mm, y=test$bill_depth_mm, color=pred2 ) , size = 4, alpha = 0.8, shape =24 ) +
  ggtitle("KNN Decision Boundary for Penguins Data with k=5 neighbor")


# This model approach uses tuneGrid with a single value to calculate kNN 
# only at a single number of nearest neighbors.
# ------------------------------------------------------------------------------
x3 = train( species ~ . , data = standardized[training.individuals,] , 
            method = "knn", 
            trControl = trainControl(method="none"), 
            tuneGrid=data.frame(k=1) )  # Now we use a k=1 nearest neighbor algorithm which is overfitted.

pred3 = predict(x3, newdata=test.X)


knnPredGrid3 = predict(x3, newdata=lgrid )

num_knnPredGrid3 = as.numeric(knnPredGrid3)

num_pred3 = as.numeric(pred3)

test = cbind(test.X, test.Y)
test$Pred = num_pred3

ggplot(data=lgrid) + stat_contour(aes(x=bill_length_mm, y= bill_depth_mm, z = num_knnPredGrid3), bins = 2 ) +
  geom_point( aes(x=bill_length_mm, y= bill_depth_mm, color = knnPredGrid3),  size = 1, shape = 19, alpha = 0.2) +
  geom_point( data= test , aes(x=test$bill_length_mm, y=test$bill_depth_mm, color=pred3 ) , size = 4, alpha = 0.8, shape =24 ) +
  ggtitle("KNN Decision Boundary for Penguins Data with k=1 neighbor")

original_loan_approval = read_csv("Loan_approval.csv")

skim(original_loan_approval)
#View(original_loan_approval)
la = original_loan_approval

la$Loan_Amount_Term = factor(la$Loan_Amount_Term)


# Feature Plot of box plots

categorical_vars = c("Gender", "Married", "Dependents", "Education", "Self_Employed", "Property_Area", "Credit_History" , "Loan_Amount_Term" )

quantitative_vars = c("ApplicantIncome", "CoapplicantIncome", "LoanAmount")


library(kableExtra)


#  Display the rejection rate statistics for each qualitative field
#  in absolute count and percentage terms.
# 
display_rejection_rate <- function(la , field , display_pct)
{
    a = la[, "Loan_Status"]
    b = la[, field]
    
    c = table(cbind(a,b))  # Count table
    

    if(display_pct == 1 )
    {
          pcts = as.data.frame.matrix(c) /nrow(la) * 100   # Percentage equivalent
          col_sums = colSums(pcts)  # Shows the results by the field

          rej_rate = pcts[1,]/col_sums   * 100   # Rejection rate assumes row 1 is the "No" row.

          yy = rbind( pcts, col_sums, rej_rate) 
          rownames(yy)[3]= paste0( field, " Column Total %:")
          rownames(yy)[4]="Rejection Rate %"
          yy %>% kable(digits = 1, caption = paste0('Loan Approved vs. ', field , ' by Percent') ) %>%
             kable_styling(bootstrap_options = c("striped", "hover") ) 
    }
    else
    {
        col_sums_count = colSums(c)
        rej_rate = c[1, ] / col_sums_count * 100
        
        d = rbind( c, col_sums_count, rej_rate)
        rownames(d)[3] = paste0( field, " Total Count: ")
        rownames(d)[4] = "Rejection Rate %"
        
        d %>% kable( digits = 0, caption = paste0('Loan Approval vs. ', field, ' by Count') ) %>%                             kable_styling(bootstrap_options = c("striped", "hover") ) 
        
    }
}

display_rejection_rate( la , "Gender", 1)  # Display in percentage form
display_rejection_rate( la,  "Gender", 0)  # Display in count form

display_rejection_rate( la , "Married", 1)  # Display in percentage form
display_rejection_rate( la,  "Married", 0)  # Display in count form

display_rejection_rate( la , "Dependents", 1)  # Display in percentage form
display_rejection_rate( la,  "Dependents", 0)  # Display in count form


display_rejection_rate( la , "Education", 1)  # Display in percentage form
display_rejection_rate( la,  "Education", 0)  # Display in count form

display_rejection_rate( la , "Self_Employed", 1)  # Display in percentage form
display_rejection_rate( la,  "Self_Employed", 0)  # Display in count form

display_rejection_rate( la , "Property_Area", 1)  # Display in percentage form
display_rejection_rate( la,  "Property_Area", 0)  # Display in count form


display_rejection_rate( la , "Loan_Amount_Term", 1)  # Display in percentage form
display_rejection_rate( la,  "Loan_Amount_Term", 0)  # Display in count form


display_rejection_rate( la , "Credit_History", 1)  # Display in percentage form
display_rejection_rate( la,  "Credit_History", 0)  # Display in count form




caret::featurePlot(
  x = la[, quantitative_vars],
  y = as.factor(la$Loan_Status),  
  plot = "pairs",
  auto.key = list(columns = 2)   )


caret::featurePlot(
  x = la[, quantitative_vars],
  y = as.factor(la$Loan_Status),  
  plot = "box",
  auto.key = list(columns = 2) ,
  scales = list(x=list(relation="free"), y=list(relation="free")))

cla <- la %>% dplyr::select(-Loan_ID ) 

cla$Loan_Status = factor(cla$Loan_Status)  # Convert the response to factor
cla$Credit_History = factor(cla$Credit_History)
cla$Property_Area = factor( cla$Property_Area)
cla$Gender = factor( cla$Gender)
cla$Married = factor(cla$Married)
cla$Dependents = ordered(cla$Dependents, levels = c("0" , "1", "2" , "3+") )
cla$Education = factor(cla$Education)
cla$Self_Employed = factor( cla$Self_Employed)
cla = na.omit(cla)


skim(cla)

# Reorder the columns for ease of use.
# Standardize the 3 quantitative variables:  ApplicantIncome, CoapplicantIncome, LoanAmount
scla = cbind( cla[,12], cla[,1:5], cla[, 9:11], scale( cla[, 6:8] ) )

skim(scla)
head(scla)


# Define an 80-20 split of the training and test data.
# -------------------------------------------------------------------------------------
training.indices = createDataPartition(scla$Loan_Status, p= 0.8 , list = FALSE)

train_scla = scla[training.indices,]

test_scla = scla[-training.indices,]
# Training
str(train_scla)
str(test_scla)
# Libraries used for random forest
# ------------------------------------
library(randomForest)
library(ranger)
set.seed(1234)

rf_numtrees = randomForest( Loan_Status ~ . , data = test_scla )

rf_numtrees

plot(rf_numtrees)
rf_numtrees_2 = randomForest( Loan_Status ~ . , data = test_scla , mtry = 10)

rf_numtrees_2

plot(rf_numtrees_2)

Data 622 Homework 3: Palmer Penguins and Loan Approval Data EDA

Alexander Ng

03/27/2021

Introduction

1 Penguins and the KNN algorithm

1.1 Data Transformation for KNN

1.2 Cross Validation and Parameter Tuning with KNN

1.3 KNN Model Output and Performance

2 Exploratory Data Analysis

2.1 Data Completeness

2.2 FeaturePlot Analysis

3 Building A Complete Dataset

3.1 Scaling

4 Random Forest Implementation

4.1 Select the number of trees

4.2 Parameter Tuning of `mtry`

5 Appendices

5.1 References

5.2 Code

Data 622 Homework 3: Palmer Penguins and Loan Approval Data EDA

Alexander Ng

03/27/2021

Introduction

1 Penguins and the KNN algorithm

1.1 Data Transformation for KNN

1.2 Cross Validation and Parameter Tuning with KNN

1.3 KNN Model Output and Performance

2 Exploratory Data Analysis

2.1 Data Completeness

2.2 FeaturePlot Analysis

3 Building A Complete Dataset

3.1 Scaling

4 Random Forest Implementation

4.1 Select the number of trees

4.2 Parameter Tuning of mtry

5 Appendices

5.1 References

5.2 Code

4.2 Parameter Tuning of `mtry`