This assignment analyzes two problems. First, the Palmer Penguin dataset using the KNN model to predict species. Second, a Loan Approval dataset using Random Forest and Gradient Boosting methods.
Following the data cleaning approach taken in prior assignments, we exclude observations where the feature sex is undefined. We also exclude the two variables island and year which do not seem helpful in prediction analysis. We follow the variable selection process described in a prior assignment here.
This still gives a substantial subset \(333\) out of \(344\) original observations. The below table illustrates sample rows of the cleansed dataset used in the subsequent KNN analysis.
| species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|
| Adelie | 39.1 | 18.7 | 181 | 3750 | male |
| Adelie | 39.5 | 17.4 | 186 | 3800 | female |
| Adelie | 40.3 | 18.0 | 195 | 3250 | female |
| Adelie | 36.7 | 19.3 | 193 | 3450 | female |
| Adelie | 39.3 | 20.6 | 190 | 3650 | male |
| Adelie | 38.9 | 17.8 | 181 | 3625 | female |
We obtain 333 rows of observations with 6 columns of explanatory and response variables.
The next step is to normalize the data so that each quantitative variables have a mean zero and standard deviation equal to 1. However, we need to also include the sex variable in the K-nearest neighbor model. Normalization ensures that each variable has equal influence on the nearest neighbor calculation. Otherwise, a quantitative variable whose scale is much larger than the other variables would dominate the Euclidean distance calculation used in the KNN model.
However, we also need to include the sex variable as a feature in the KNN model. Categorical variables pose a challenge in the KNN context because they need to be mapped to a numerical equivalent. In the case of this data set, two facts are key:
For these reasons, we do not normalize sex but use the mapping function \(f\) defined as:
\[\begin{equation} f(\text{sex}) = \begin{cases} 1 & \text{sex} = \text{male} \\ -1 & \text{sex} = \text{female} \\ \end{cases} \end{equation} \] Because the data set is gender balanced, \(E[f(\text{sex})] \approx 0\) and \(Stdev(f(\text{sex})) \approx 1\). Hence, \(f(\text{sex})\) is nearly normalized. With these transformations, the KNN dataset is prepared for model evaluation. We illustrate the top 5 rows of the training data set below. Note that the sampled indices are displayed in the left most column as row values in the resulting dataframe.
| bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|
| 1 | -0.89 | 0.78 | -1.42 | -0.57 | 1 |
| 3 | -0.68 | 0.42 | -0.43 | -1.19 | -1 |
| 4 | -1.33 | 1.08 | -0.57 | -0.94 | -1 |
| 7 | -0.88 | 1.24 | -0.43 | 0.58 | 1 |
| 8 | -0.53 | 0.22 | -1.35 | -1.25 | -1 |
| 9 | -0.99 | 2.05 | -0.71 | -0.51 | 1 |
We will use caret to test a range of possible values of \(k\) and select the optimal \(k\) based on minimizing the cross validation error. We did the pre-processing previously using scale so there is no need to invoke further data transformation inside caret pre-processing function calls.
trainControl allows us to define the training parameters.
tuneLength decides the range of allowed \(k\) value to compute against the model.
We choose a 5-fold cross validation method to ensure the test set is sufficiently large.
Using the above model plot which shows the average cross validation accuracy at each parameter \(k\) used in the KNN model, we see that the best kappa and accuracy occur at \(k=5\) nearest neighbors. This is evident when we display the (cross validation) model output below.
## k-Nearest Neighbors
##
## 333 samples
## 5 predictor
## 3 classes: 'Adelie', 'Chinstrap', 'Gentoo'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 267, 266, 268, 265, 266
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9940737 0.9906842
## 7 0.9880556 0.9811323
## 9 0.9881914 0.9812667
## 11 0.9881914 0.9812667
## 13 0.9851145 0.9763978
## 15 0.9851145 0.9763978
## 17 0.9851145 0.9763978
## 19 0.9851145 0.9763978
## 21 0.9851145 0.9763978
## 23 0.9851145 0.9763978
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
To assess model performance and investigate the suitability of our choice of \(k=5\), we consider two approaches. First we examine the statistical output of the confusion matrix. The accuracy and kappa ought to look attractive in the model selection. Second, we examine the smoothness of the decision boundary by plotting the decision boundary for \(k=1\) and \(k=5\). If the optimal \(k\) shows a relatively smooth boundary, this is supporting evidence that the model parameters are reasonable.
First, we construct the confusion matrix of one specific implementation of the KNN model for the train-test split following the same proportions as the cross-validation study. We specifically use the caret interface rather than the simpler class interface to ensure consistency across multiple models in future.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 29 1 0
## Chinstrap 0 12 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9846
## 95% CI : (0.9172, 0.9996)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9757
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 0.9231 1.0000
## Specificity 0.9722 1.0000 1.0000
## Pos Pred Value 0.9667 1.0000 1.0000
## Neg Pred Value 1.0000 0.9811 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4462 0.1846 0.3538
## Detection Prevalence 0.4615 0.1846 0.3538
## Balanced Accuracy 0.9861 0.9615 1.0000
We see in the confusion matrix output above that accuracy is 98.46% and kappa is 97.6%. These are consistent with the cross validation results. We conclude that KNN is a very effective classifier for penguin gender. Only 1 penguin in 65 was misclassified: a Chinstrap was predicted to be an Adelie penguin.
Next, we visualize the decision boundary of our KNN model to check for model suitability. Custom code needs to be written as we are not aware if this plot type is currently provided by any R package.
Adapting the R recipe for KNN decision boundary plots found in the stackoverflow page here we see below that the decision boundary for \(k=5\) nearest neighbor algorithm is reasonably smooth and not overfitted to the data.
However, the same KNN decision boundary for \(k=1\) shows a much more twisted curve. The border between the Chinstrap and Adelie regions in the upper middle half shows a zig-zag pattern characteristic of overfitting. So we reject the choice of \(k=1\) as the model even though both \(k=5\) and \(k=1\) KNN models have identical confusion matrices. ( The \(k=1\) confusion matrix is omitted for brevity. ) This means model selection requires considering factors not captured by the confusion matrix.
After an initial load of the data, we explore each variable’s completeness.
| Name | original_loan_approval |
| Number of rows | 614 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Loan_ID | 0 | 1.00 | 8 | 8 | 0 | 614 | 0 |
| Gender | 13 | 0.98 | 4 | 6 | 0 | 2 | 0 |
| Married | 3 | 1.00 | 2 | 3 | 0 | 2 | 0 |
| Dependents | 15 | 0.98 | 1 | 2 | 0 | 4 | 0 |
| Education | 0 | 1.00 | 8 | 12 | 0 | 2 | 0 |
| Self_Employed | 32 | 0.95 | 2 | 3 | 0 | 2 | 0 |
| Property_Area | 0 | 1.00 | 5 | 9 | 0 | 3 | 0 |
| Loan_Status | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ApplicantIncome | 0 | 1.00 | 5403.46 | 6109.04 | 150 | 2877.5 | 3812.5 | 5795.00 | 81000 | ▇▁▁▁▁ |
| CoapplicantIncome | 0 | 1.00 | 1621.25 | 2926.25 | 0 | 0.0 | 1188.5 | 2297.25 | 41667 | ▇▁▁▁▁ |
| LoanAmount | 22 | 0.96 | 146.41 | 85.59 | 9 | 100.0 | 128.0 | 168.00 | 700 | ▇▃▁▁▁ |
| Loan_Amount_Term | 14 | 0.98 | 342.00 | 65.12 | 12 | 360.0 | 360.0 | 360.00 | 480 | ▁▁▁▇▁ |
| Credit_History | 50 | 0.92 | 0.84 | 0.36 | 0 | 1.0 | 1.0 | 1.00 | 1 | ▂▁▁▁▇ |
We conclude from skim that many variables are complete while others are incomplete:
Complete variables include: Loan_ID, Education, Property_Area, Loan_Status, ApplicantIncome, CoapplicantIncome
Mostly complete variables include: Married, Dependents, Loan_Amount_Term, Gender
Variables with significant gaps include: Self_Employed, LoanAmount, Credit_History
Of these variables, Credit_History is the most problematic and it influences the Loan Approval decision.
There are 614 observations and 13 columns including the response variable Loan_Status.
| Female | Male | |
|---|---|---|
| N | 6.0 | 24.4 |
| Y | 12.2 | 55.2 |
| Gender Column Total %: | 18.2 | 79.6 |
| Rejection Rate % | 33.0 | 30.7 |
| Female | Male | |
|---|---|---|
| N | 37 | 150 |
| Y | 75 | 339 |
| Gender Total Count: | 112 | 489 |
| Rejection Rate % | 33 | 31 |
| No | Yes | |
|---|---|---|
| N | 12.9 | 18.4 |
| Y | 21.8 | 46.4 |
| Married Column Total %: | 34.7 | 64.8 |
| Rejection Rate % | 37.1 | 28.4 |
| No | Yes | |
|---|---|---|
| N | 79 | 113 |
| Y | 134 | 285 |
| Married Total Count: | 213 | 398 |
| Rejection Rate % | 37 | 28 |
| 0 | 1 | 2 | 3+ | |
|---|---|---|---|---|
| N | 17.4 | 5.9 | 4.1 | 2.9 |
| Y | 38.8 | 10.7 | 12.4 | 5.4 |
| Dependents Column Total %: | 56.2 | 16.6 | 16.4 | 8.3 |
| Rejection Rate % | 31.0 | 35.3 | 24.8 | 35.3 |
| 0 | 1 | 2 | 3+ | |
|---|---|---|---|---|
| N | 107 | 36 | 25 | 18 |
| Y | 238 | 66 | 76 | 33 |
| Dependents Total Count: | 345 | 102 | 101 | 51 |
| Rejection Rate % | 31 | 35 | 25 | 35 |
| Graduate | Not Graduate | |
|---|---|---|
| N | 22.8 | 8.5 |
| Y | 55.4 | 13.4 |
| Education Column Total %: | 78.2 | 21.8 |
| Rejection Rate % | 29.2 | 38.8 |
| Graduate | Not Graduate | |
|---|---|---|
| N | 140 | 52 |
| Y | 340 | 82 |
| Education Total Count: | 480 | 134 |
| Rejection Rate % | 29 | 39 |
| No | Yes | |
|---|---|---|
| N | 25.6 | 4.2 |
| Y | 55.9 | 9.1 |
| Self_Employed Column Total %: | 81.4 | 13.4 |
| Rejection Rate % | 31.4 | 31.7 |
| No | Yes | |
|---|---|---|
| N | 157 | 26 |
| Y | 343 | 56 |
| Self_Employed Total Count: | 500 | 82 |
| Rejection Rate % | 31 | 32 |
| Rural | Semiurban | Urban | |
|---|---|---|---|
| N | 11.2 | 8.8 | 11.2 |
| Y | 17.9 | 29.2 | 21.7 |
| Property_Area Column Total %: | 29.2 | 37.9 | 32.9 |
| Rejection Rate % | 38.5 | 23.2 | 34.2 |
| Rural | Semiurban | Urban | |
|---|---|---|---|
| N | 69 | 54 | 69 |
| Y | 110 | 179 | 133 |
| Property_Area Total Count: | 179 | 233 | 202 |
| Rejection Rate % | 39 | 23 | 34 |
| 12 | 36 | 60 | 84 | 120 | 180 | 240 | 300 | 360 | 480 | |
|---|---|---|---|---|---|---|---|---|---|---|
| N | 0.0 | 0.3 | 0.0 | 0.2 | 0.0 | 2.4 | 0.2 | 0.8 | 24.9 | 1.5 |
| Y | 0.2 | 0.0 | 0.3 | 0.5 | 0.5 | 4.7 | 0.5 | 1.3 | 58.5 | 1.0 |
| Loan_Amount_Term Column Total %: | 0.2 | 0.3 | 0.3 | 0.7 | 0.5 | 7.2 | 0.7 | 2.1 | 83.4 | 2.4 |
| Rejection Rate % | 0.0 | 100.0 | 0.0 | 25.0 | 0.0 | 34.1 | 25.0 | 38.5 | 29.9 | 60.0 |
| 12 | 36 | 60 | 84 | 120 | 180 | 240 | 300 | 360 | 480 | |
|---|---|---|---|---|---|---|---|---|---|---|
| N | 0 | 2 | 0 | 1 | 0 | 15 | 1 | 5 | 153 | 9 |
| Y | 1 | 0 | 2 | 3 | 3 | 29 | 3 | 8 | 359 | 6 |
| Loan_Amount_Term Total Count: | 1 | 2 | 2 | 4 | 3 | 44 | 4 | 13 | 512 | 15 |
| Rejection Rate % | 0 | 100 | 0 | 25 | 0 | 34 | 25 | 38 | 30 | 60 |
| 0 | 1 | |
|---|---|---|
| N | 13.4 | 15.8 |
| Y | 1.1 | 61.6 |
| Credit_History Column Total %: | 14.5 | 77.4 |
| Rejection Rate % | 92.1 | 20.4 |
| 0 | 1 | |
|---|---|---|
| N | 82 | 97 |
| Y | 7 | 378 |
| Credit_History Total Count: | 89 | 475 |
| Rejection Rate % | 92 | 20 |
In this section, we define a subset of the loan approval data which consists solely of the records with no missing values for any relevant column and then exclude useless variables. We call this subset of observations the complete dataset. Next, we omit one irrelevant column: the Loan_ID. An identifier which is exogenous to the loan applicant and is assumed to be related solely to computer system processes.
For experimental purposes, we will deal only in the complete records in the subsequent analysis. This reduces the number of observations from 614 to 480 which is a 78.2% of the original data set.
Further, we convert the character type data columns from text to factor. This is importance for Loan_Status but also relevant for others like: Credit_History, Property_Area, Gender, Married, Education, Self_Employed. In addition, we convert Dependents from text to an ordered factor because the number of children is ordered.
| Name | cla |
| Number of rows | 480 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| factor | 9 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Gender | 0 | 1 | FALSE | 2 | Mal: 394, Fem: 86 |
| Married | 0 | 1 | FALSE | 2 | Yes: 311, No: 169 |
| Dependents | 0 | 1 | TRUE | 4 | 0: 274, 2: 85, 1: 80, 3+: 41 |
| Education | 0 | 1 | FALSE | 2 | Gra: 383, Not: 97 |
| Self_Employed | 0 | 1 | FALSE | 2 | No: 414, Yes: 66 |
| Loan_Amount_Term | 0 | 1 | FALSE | 9 | 360: 411, 180: 36, 480: 12, 300: 9 |
| Credit_History | 0 | 1 | FALSE | 2 | 1: 410, 0: 70 |
| Property_Area | 0 | 1 | FALSE | 3 | Sem: 191, Urb: 150, Rur: 139 |
| Loan_Status | 0 | 1 | FALSE | 2 | Y: 332, N: 148 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ApplicantIncome | 0 | 1 | 5364.23 | 5668.25 | 150 | 2898.75 | 3859.0 | 5852.50 | 81000 | ▇▁▁▁▁ |
| CoapplicantIncome | 0 | 1 | 1581.09 | 2617.69 | 0 | 0.00 | 1084.5 | 2253.25 | 33837 | ▇▁▁▁▁ |
| LoanAmount | 0 | 1 | 144.74 | 80.51 | 9 | 100.00 | 128.0 | 170.00 | 600 | ▇▇▁▁▁ |
Further we scale the quantitative variables to mean zero and variance one to ensure equal influence.
| Name | scla |
| Number of rows | 480 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| factor | 9 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Loan_Status | 0 | 1 | FALSE | 2 | Y: 332, N: 148 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 394, Fem: 86 |
| Married | 0 | 1 | FALSE | 2 | Yes: 311, No: 169 |
| Dependents | 0 | 1 | TRUE | 4 | 0: 274, 2: 85, 1: 80, 3+: 41 |
| Education | 0 | 1 | FALSE | 2 | Gra: 383, Not: 97 |
| Self_Employed | 0 | 1 | FALSE | 2 | No: 414, Yes: 66 |
| Loan_Amount_Term | 0 | 1 | FALSE | 9 | 360: 411, 180: 36, 480: 12, 300: 9 |
| Credit_History | 0 | 1 | FALSE | 2 | 1: 410, 0: 70 |
| Property_Area | 0 | 1 | FALSE | 3 | Sem: 191, Urb: 150, Rur: 139 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ApplicantIncome | 0 | 1 | 0 | 1 | -0.92 | -0.43 | -0.27 | 0.09 | 13.34 | ▇▁▁▁▁ |
| CoapplicantIncome | 0 | 1 | 0 | 1 | -0.60 | -0.60 | -0.19 | 0.26 | 12.32 | ▇▁▁▁▁ |
| LoanAmount | 0 | 1 | 0 | 1 | -1.69 | -0.56 | -0.21 | 0.31 | 5.65 | ▇▇▁▁▁ |
Next, we split the data into a training and test subset. Note that Random forest allow the use of out-of-bag errors which can be used to assess model accuracy out of sample. In this sense, OOB errors can proxy for test errors for random forests.
However, to be conservative in our assessment, we assess both OOB error and test errors.
The training set has the following size and attributes:
## 'data.frame': 385 obs. of 12 variables:
## $ Loan_Status : Factor w/ 2 levels "N","Y": 2 2 2 2 1 2 1 2 2 1 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Married : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...
## $ Dependents : Ord.factor w/ 4 levels "0"<"1"<"2"<"3+": 1 1 1 3 4 3 2 3 3 1 ...
## $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 2 1 1 1 1 1 1 1 1 ...
## $ Self_Employed : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 1 1 1 1 ...
## $ Loan_Amount_Term : Factor w/ 10 levels "12","36","60",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ Credit_History : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 2 2 ...
## $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 3 3 3 3 2 3 2 3 3 1 ...
## $ ApplicantIncome : num -0.4171 -0.49067 0.11216 0.00931 -0.41075 ...
## $ CoapplicantIncome: num -0.604 0.297 -0.604 0.999 0.353 ...
## $ LoanAmount : num -0.978 -0.3072 -0.0464 1.5187 0.1648 ...
The test set has the following size and attributes:
## 'data.frame': 95 obs. of 12 variables:
## $ Loan_Status : Factor w/ 2 levels "N","Y": 1 2 2 1 2 1 1 2 1 1 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 2 2 1 ...
## $ Married : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 2 1 2 ...
## $ Dependents : Ord.factor w/ 4 levels "0"<"1"<"2"<"3+": 2 1 1 1 3 1 2 1 1 1 ...
## $ Education : Factor w/ 2 levels "Graduate","Not Graduate": 1 2 1 1 2 1 1 1 1 1 ...
## $ Self_Employed : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ Loan_Amount_Term : Factor w/ 10 levels "12","36","60",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ Credit_History : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 2 2 1 ...
## $ Property_Area : Factor w/ 3 levels "Rural","Semiurban",..: 1 3 3 3 3 3 1 2 3 3 ...
## $ ApplicantIncome : num -0.1378 -0.5348 -0.0731 -0.3271 -0.2008 ...
## $ CoapplicantIncome: num -0.0279 -0.0249 -0.604 -0.604 -0.2067 ...
## $ LoanAmount : num -0.208 -0.618 -0.245 -0.854 -0.431 ...
This implementation approaches uses the complete loan approval dataset with the randomForest library.
We will also evaluate using the ranger library which is stated by UC Business Analytics to be 6 times faster than randomForest because it is written in C++.
We will break the analysis of random forests into several phases:
mtry to be used as each split.After running two initial sanity checks at different \(\text{mtry}\) parameters, we observe that the random forest OOB error rates stabilize after \(\text{ntree}=500\) trees. In fact, this is conservative as the actual error appears to stabilize at 150 trees. Therefore, we pay the cost of greater computing time.
First we look at the default algorithm value of mtry which happens to be 3 for this data set.
##
## Call:
## randomForest(formula = Loan_Status ~ ., data = test_scla)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 17.89%
## Confusion matrix:
## N Y class.error
## N 16 13 0.44827586
## Y 4 62 0.06060606
The following plot suggests that the OOB error stabilizes after 150 trees. Certainly if we use 500 trees, which is the default number of trees ntree for randomForest we obtain stable error rates when mtry equals 3.
Let’s verify the sample result when we go the other extreme: mtry close to the number f independent variables. We consider \(\text{mtry}=10\) and get similar error stability as shown below.
##
## Call:
## randomForest(formula = Loan_Status ~ ., data = test_scla, mtry = 10)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 17.89%
## Confusion matrix:
## N Y class.error
## N 16 13 0.44827586
## Y 4 62 0.06060606
mtryIn this step, we examine the random forest error rates at different levels of the mtry parameter – the number of parameters to use at each node of the decision tree to use in defining the split. In our case, the range of mtry goes between \(\text{mtry}=1 \cdots 11\).
We summarize all the R code used in this project in this appendix for ease of reading.
library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
library(GGally)
library(ellipse)
library(caret)
library(skimr)
library(klaR)
library(class) # used for KNN classifier
knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE)
library(palmerpenguins)
# The final dataset excludes observations with missing sex and drops the island and year variables.
pc = penguins %>% filter( is.na(sex) == FALSE) %>% dplyr::select( -one_of("island", "year") )
head(pc) %>% kable(caption="Scrubbed penguin data set") %>% kable_styling(bootstrap_options = c("striped", "hover"))
set.seed(10)
# Construct the standardized dataset with the response variable, the 4 quantitative variables normalized
# with the scale function and the sex variable normalized with the 1,-1 assignment explained previously.
#
standardized = cbind( pc[,1], scale( pc[, 2:5] ) , sex = data.table::fifelse(pc$sex == 'male' , 1 , -1) )
# Define an 80-20 split of the training and test data.
# -------------------------------------------------------------------------------------
training.individuals = createDataPartition(standardized$species, p= 0.8 , list = FALSE)
# X variables include bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex (converted to -1,1)
# Y variable is just the species class
# train data is the standardized subset of the full data set
# test data is the complement of the training data set.
# -----------------------------------------------------------------
train.X = standardized[ training.individuals, 2:6]
test.X = standardized[-training.individuals, 2:6]
train.Y = standardized[ training.individuals, 1]
test.Y = standardized[-training.individuals, 1]
head(train.X ) %>% kable(caption="Normalized Training Data - 1st Few Rows", digits = 2 ) %>%
kable_styling(bootstrap_options = c("striped", "hover") )
model = train( species ~. , data = standardized , method = "knn" ,
trControl = trainControl("cv", number = 5 ), tuneLength = 10 )
# Plot the model output vs. varying k values.
plot(model)
model
# The best tuning parameter k that
# optimized model accuracy is:
#model$bestTune
# Simpler interface using the knn function within class.
knn.pred5 = knn(train.X, test.X, train.Y , k = 5, prob = FALSE)
cm5 = confusionMatrix(knn.pred5, test.Y)
# This model approach uses tuneGrid with a single value to calculate kNN
# only at a single number of nearest neighbors.
# ------------------------------------------------------------------------------
x2 = train( species ~ . , data = standardized[training.individuals,] ,
method = "knn",
trControl = trainControl(method="none"),
tuneGrid=data.frame(k=5) )
pred2 = predict(x2, newdata=test.X)
confusionMatrix(pred2, test.Y)
# We visualize a projection of the decision boundary flattened to 2 dimensions to illustrate
# the smoothness of the KNN boundary at different values of k.
# First, choose the variables, then build a sampling grid.
# ------------------------------------------------------------------------
pl = seq(min(standardized$bill_length_mm), max(standardized$bill_length_mm), by = 0.05)
pw = seq(min(standardized$bill_depth_mm), max(standardized$bill_depth_mm), by = 0.05 )
# The sampling grid needs explicit values to be plugged in for the variables
# which are not the two projected variables. Since variables are standardized to
# mean 0 and variance 1, we choose zero for flipper_length and body_mass because they are continuous
# and the mean and mode are close.
# However, for sex, we choose male because the average sex = 0 is not the mode.
# In practice, sex = 0 gives weird decision boundaries in the projected plot.
# ------------------------------------------------------------------------------
lgrid = expand.grid(bill_length_mm=pl , bill_depth_mm = pw , flipper_length_mm = 0, body_mass_g = 0 , sex = 1 )
knnPredGrid = predict(x2, newdata=lgrid )
num_knnPredGrid = as.numeric(knnPredGrid)
num_pred2 = as.numeric(pred2)
test = cbind(test.X, test.Y)
test$Pred = num_pred2
ggplot(data=lgrid) + stat_contour(aes(x=bill_length_mm, y= bill_depth_mm, z = num_knnPredGrid), bins = 2 ) +
geom_point( aes(x=bill_length_mm, y= bill_depth_mm, color = knnPredGrid), size = 1, shape = 19, alpha = 0.2) +
geom_point( data= test , aes(x=test$bill_length_mm, y=test$bill_depth_mm, color=pred2 ) , size = 4, alpha = 0.8, shape =24 ) +
ggtitle("KNN Decision Boundary for Penguins Data with k=5 neighbor")
# This model approach uses tuneGrid with a single value to calculate kNN
# only at a single number of nearest neighbors.
# ------------------------------------------------------------------------------
x3 = train( species ~ . , data = standardized[training.individuals,] ,
method = "knn",
trControl = trainControl(method="none"),
tuneGrid=data.frame(k=1) ) # Now we use a k=1 nearest neighbor algorithm which is overfitted.
pred3 = predict(x3, newdata=test.X)
knnPredGrid3 = predict(x3, newdata=lgrid )
num_knnPredGrid3 = as.numeric(knnPredGrid3)
num_pred3 = as.numeric(pred3)
test = cbind(test.X, test.Y)
test$Pred = num_pred3
ggplot(data=lgrid) + stat_contour(aes(x=bill_length_mm, y= bill_depth_mm, z = num_knnPredGrid3), bins = 2 ) +
geom_point( aes(x=bill_length_mm, y= bill_depth_mm, color = knnPredGrid3), size = 1, shape = 19, alpha = 0.2) +
geom_point( data= test , aes(x=test$bill_length_mm, y=test$bill_depth_mm, color=pred3 ) , size = 4, alpha = 0.8, shape =24 ) +
ggtitle("KNN Decision Boundary for Penguins Data with k=1 neighbor")
original_loan_approval = read_csv("Loan_approval.csv")
skim(original_loan_approval)
#View(original_loan_approval)
la = original_loan_approval
la$Loan_Amount_Term = factor(la$Loan_Amount_Term)
# Feature Plot of box plots
categorical_vars = c("Gender", "Married", "Dependents", "Education", "Self_Employed", "Property_Area", "Credit_History" , "Loan_Amount_Term" )
quantitative_vars = c("ApplicantIncome", "CoapplicantIncome", "LoanAmount")
library(kableExtra)
# Display the rejection rate statistics for each qualitative field
# in absolute count and percentage terms.
#
display_rejection_rate <- function(la , field , display_pct)
{
a = la[, "Loan_Status"]
b = la[, field]
c = table(cbind(a,b)) # Count table
if(display_pct == 1 )
{
pcts = as.data.frame.matrix(c) /nrow(la) * 100 # Percentage equivalent
col_sums = colSums(pcts) # Shows the results by the field
rej_rate = pcts[1,]/col_sums * 100 # Rejection rate assumes row 1 is the "No" row.
yy = rbind( pcts, col_sums, rej_rate)
rownames(yy)[3]= paste0( field, " Column Total %:")
rownames(yy)[4]="Rejection Rate %"
yy %>% kable(digits = 1, caption = paste0('Loan Approved vs. ', field , ' by Percent') ) %>%
kable_styling(bootstrap_options = c("striped", "hover") )
}
else
{
col_sums_count = colSums(c)
rej_rate = c[1, ] / col_sums_count * 100
d = rbind( c, col_sums_count, rej_rate)
rownames(d)[3] = paste0( field, " Total Count: ")
rownames(d)[4] = "Rejection Rate %"
d %>% kable( digits = 0, caption = paste0('Loan Approval vs. ', field, ' by Count') ) %>% kable_styling(bootstrap_options = c("striped", "hover") )
}
}
display_rejection_rate( la , "Gender", 1) # Display in percentage form
display_rejection_rate( la, "Gender", 0) # Display in count form
display_rejection_rate( la , "Married", 1) # Display in percentage form
display_rejection_rate( la, "Married", 0) # Display in count form
display_rejection_rate( la , "Dependents", 1) # Display in percentage form
display_rejection_rate( la, "Dependents", 0) # Display in count form
display_rejection_rate( la , "Education", 1) # Display in percentage form
display_rejection_rate( la, "Education", 0) # Display in count form
display_rejection_rate( la , "Self_Employed", 1) # Display in percentage form
display_rejection_rate( la, "Self_Employed", 0) # Display in count form
display_rejection_rate( la , "Property_Area", 1) # Display in percentage form
display_rejection_rate( la, "Property_Area", 0) # Display in count form
display_rejection_rate( la , "Loan_Amount_Term", 1) # Display in percentage form
display_rejection_rate( la, "Loan_Amount_Term", 0) # Display in count form
display_rejection_rate( la , "Credit_History", 1) # Display in percentage form
display_rejection_rate( la, "Credit_History", 0) # Display in count form
caret::featurePlot(
x = la[, quantitative_vars],
y = as.factor(la$Loan_Status),
plot = "pairs",
auto.key = list(columns = 2) )
caret::featurePlot(
x = la[, quantitative_vars],
y = as.factor(la$Loan_Status),
plot = "box",
auto.key = list(columns = 2) ,
scales = list(x=list(relation="free"), y=list(relation="free")))
cla <- la %>% dplyr::select(-Loan_ID )
cla$Loan_Status = factor(cla$Loan_Status) # Convert the response to factor
cla$Credit_History = factor(cla$Credit_History)
cla$Property_Area = factor( cla$Property_Area)
cla$Gender = factor( cla$Gender)
cla$Married = factor(cla$Married)
cla$Dependents = ordered(cla$Dependents, levels = c("0" , "1", "2" , "3+") )
cla$Education = factor(cla$Education)
cla$Self_Employed = factor( cla$Self_Employed)
cla = na.omit(cla)
skim(cla)
# Reorder the columns for ease of use.
# Standardize the 3 quantitative variables: ApplicantIncome, CoapplicantIncome, LoanAmount
scla = cbind( cla[,12], cla[,1:5], cla[, 9:11], scale( cla[, 6:8] ) )
skim(scla)
head(scla)
# Define an 80-20 split of the training and test data.
# -------------------------------------------------------------------------------------
training.indices = createDataPartition(scla$Loan_Status, p= 0.8 , list = FALSE)
train_scla = scla[training.indices,]
test_scla = scla[-training.indices,]
# Training
str(train_scla)
str(test_scla)
# Libraries used for random forest
# ------------------------------------
library(randomForest)
library(ranger)
set.seed(1234)
rf_numtrees = randomForest( Loan_Status ~ . , data = test_scla )
rf_numtrees
plot(rf_numtrees)
rf_numtrees_2 = randomForest( Loan_Status ~ . , data = test_scla , mtry = 10)
rf_numtrees_2
plot(rf_numtrees_2)