R Packages

The analysis is carried out in R. The main packages used for data wrangling, modeling, visualization, and graphics are listed below.

# Required R packages
library(palmerpenguins)
library(tidyverse)
library(kableExtra)
library(summarytools)
library(GGally)
library(caret)
library(mice)
library(dummies)
library(Boruta)
library(pROC)
library(rsample)
library(gbm)
library(party)
library(partykit)
library(rpart)
library(rpart.plot)

Overview

In this project, there are two analyses carried out using two different datasets: 1) the Palmer Penguins dataset and 2) a Loan Approvals dataset. Thus, this project is divided into the following sections:

Section 1, Palmer Penguins, investigates the K-Nearest Neighbor (KNN) algorithm for predicting penguin species and compares it to four previously fitted multi-class models: Multinomial Logistic Regression, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Naive Bayes.

Section 2, Loan Approval, uses a dataset on loan approval status to predict loan approval with Decision Trees, Random Forests, and Gradient Boosting Machines (GBM). Each analysis is conducted with a thorough explanation of the reasoning behind every step. In the end, model performance is compared and the optimal model is selected.

The last two sections present the references used during this analysis and the R code in a technical appendix.


PROJECT SECTIONS

Palmer Penguins

Data Exploration

The palmerpenguins data contains size measurements collected from 2007 to 2009 for three penguin species observed on three islands in the Palmer Archipelago, Antarctica. For more information about this data collection, refer to the palmerpenguins website.

Penguins Data Column Definition

Variable            Description
species             penguin species (Adélie, Chinstrap, Gentoo)
island              island in Palmer Archipelago, Antarctica (Biscoe, Dream, or Torgersen)
bill_length_mm      bill length (millimeters)
bill_depth_mm       bill depth (millimeters)
flipper_length_mm   flipper length (millimeters)
body_mass_g         body mass (grams)
sex                 penguin sex (female, male)
year                year the data was collected

The previous data exploration found that the response variable species denotes one of three penguin species; the majority of the penguins are Adelie (n = 152), followed by Gentoo (n = 124) and Chinstrap (n = 68). The sexes are nearly equally balanced within each species, but the species are not evenly distributed across the island habitats.

There are 344 observations, with four numeric predictor variables and two factor predictor variables, island and sex. There is also a year variable that is ignored in this analysis. The data set contains incomplete cases, and several variables show bi- and tri-modal distributions, which suggests that there are differences among the penguin species.

Lastly, it is noted that Adelie and Chinstrap measurements overlap for all variables except bill length. Bill length is a definitive variable that produces complete separation of the penguin species into groups. This perfectly discriminating variable will be removed to obtain reasonable estimates for the remaining predictors of the outcome variable.

## Data Frame Summary  
## penguins  
## Dimensions: 344 x 8  
## Duplicates: 0  
## 
## +----+--------------------+---------------------------+---------------------+----------+---------+
## | No | Variable           | Stats / Values            | Freqs (% of Valid)  | Valid    | Missing |
## +====+====================+===========================+=====================+==========+=========+
## | 1  | species            | 1. Adelie                 | 152 (44.2%)         | 344      | 0       |
## |    | [factor]           | 2. Chinstrap              |  68 (19.8%)         | (100%)   | (0%)    |
## |    |                    | 3. Gentoo                 | 124 (36.0%)         |          |         |
## +----+--------------------+---------------------------+---------------------+----------+---------+
## | 2  | island             | 1. Biscoe                 | 168 (48.8%)         | 344      | 0       |
## |    | [factor]           | 2. Dream                  | 124 (36.0%)         | (100%)   | (0%)    |
## |    |                    | 3. Torgersen              |  52 (15.1%)         |          |         |
## +----+--------------------+---------------------------+---------------------+----------+---------+
## | 3  | bill_length_mm     | Mean (sd) : 43.9 (5.5)    | 164 distinct values | 342      | 2       |
## |    | [numeric]          | min < med < max:          |                     | (99.42%) | (0.58%) |
## |    |                    | 32.1 < 44.5 < 59.6        |                     |          |         |
## |    |                    | IQR (CV) : 9.3 (0.1)      |                     |          |         |
## +----+--------------------+---------------------------+---------------------+----------+---------+
## | 4  | bill_depth_mm      | Mean (sd) : 17.2 (2)      | 80 distinct values  | 342      | 2       |
## |    | [numeric]          | min < med < max:          |                     | (99.42%) | (0.58%) |
## |    |                    | 13.1 < 17.3 < 21.5        |                     |          |         |
## |    |                    | IQR (CV) : 3.1 (0.1)      |                     |          |         |
## +----+--------------------+---------------------------+---------------------+----------+---------+
## | 5  | flipper_length_mm  | Mean (sd) : 200.9 (14.1)  | 55 distinct values  | 342      | 2       |
## |    | [integer]          | min < med < max:          |                     | (99.42%) | (0.58%) |
## |    |                    | 172 < 197 < 231           |                     |          |         |
## |    |                    | IQR (CV) : 23 (0.1)       |                     |          |         |
## +----+--------------------+---------------------------+---------------------+----------+---------+
## | 6  | body_mass_g        | Mean (sd) : 4201.8 (802)  | 94 distinct values  | 342      | 2       |
## |    | [integer]          | min < med < max:          |                     | (99.42%) | (0.58%) |
## |    |                    | 2700 < 4050 < 6300        |                     |          |         |
## |    |                    | IQR (CV) : 1200 (0.2)     |                     |          |         |
## +----+--------------------+---------------------------+---------------------+----------+---------+
## | 7  | sex                | 1. female                 | 165 (49.5%)         | 333      | 11      |
## |    | [factor]           | 2. male                   | 168 (50.4%)         | (96.8%)  | (3.2%)  |
## +----+--------------------+---------------------------+---------------------+----------+---------+
## | 8  | year               | 1. 2007                   | 110 (32.0%)         | 344      | 0       |
## |    | [factor]           | 2. 2008                   | 114 (33.1%)         | (100%)   | (0%)    |
## |    |                    | 3. 2009                   | 120 (34.9%)         |          |         |
## +----+--------------------+---------------------------+---------------------+----------+---------+

Data Preparation

The summary above indicates the amount of missing data in the penguin dataset. The sex variable has the most missing values, at just over 3%, leaving roughly 97% of cases complete. There are no systematic missingness patterns, and the overall proportions are not extreme, so missingness can be corrected by imputation.

Further exploration reveals that no variable is strongly influenced by outliers. An outlier is an observation that lies an abnormal distance from the other values in a random sample; outliers can distort predictions and reduce accuracy, so they would need to be corrected if present.

To build a smaller model without predictors with extremely high correlations, it is best to reduce the number of predictors such that no absolute pairwise correlation exceeds 0.90. The correlogram below graphically represents the correlations between the numeric predictor variables, ignoring missing values. Most of the numeric variables are uncorrelated with one another, but there are a few highly correlated pairs. The strongest is the positive correlation between body_mass_g and flipper_length_mm, which is within reason, as larger flippers accompany greater body mass. A few variables show moderate correlations, but their relationships are also intuitive, and none is extreme. The correlogram also confirms that Adelie and Chinstrap overlap for all variable measurements except bill length, the definitive variable that produces complete separation among the penguin species.

Training & Testing Split

The models are trained on 70% of the data set, with the remaining 30% reserved for validation. The training set is then used with a cross-validation scheme to tune parameters for optimal performance.

Pre-Processing of Predictors

Missing data are treated by imputation using the classification and regression trees (CART) missing data algorithm. CART can handle mixed types of missing data and is adaptable to interactions and non-linearity.
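For reference, a condensed version of the imputation call from the Code Appendix:

# Impute missing predictor values with CART via mice (mirrors the appendix code)
set.seed(525)
temp = mice(train_peng.p, method = 'cart', print = FALSE, m = 3, maxit = 3)
train_peng.p = complete(temp)  # extract a completed copy of the training predictors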

Dummy Variables

The categorical variables are then converted into sets of dummy variables. For the sex variable, female is used as the reference level, and for the island variable, Biscoe is the reference.
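As an illustration, the same encoding can be obtained with caret's dummyVars (the appendix uses dummies::dummy.data.frame); with fullRank = TRUE the reference levels (Biscoe, female) are dropped automatically:

# Sketch: full-rank dummy encoding of the factor predictors
dv = dummyVars(~ island + sex, data = train_peng.p, fullRank = TRUE)
head(predict(dv, newdata = train_peng.p))  # e.g. island.Dream, island.Torgersen, sex.male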

Feature Selection

Feature selection is done using a random forest-based algorithm (Boruta) to identify the most important features for prediction. This method performs a top-down search for relevant features, comparing each original attribute's importance with the importance achievable at random. It shows that bill_length_mm is indeed the most contributing variable, followed by flipper_length_mm, and so on.

Feature Importance of Penguin Data
Variable            meanImp    decision
bill_length_mm      37.82263   Confirmed
flipper_length_mm   21.30305   Confirmed
bill_depth_mm       19.16537   Confirmed
body_mass_g         16.04483   Confirmed
island.Dream        15.95043   Confirmed
island.Biscoe       14.99178   Confirmed

All in all, the following decisions are made based on the feature selection investigation:

  • The flipper_length_mm and bill_depth_mm variables are the most likely contributors and will be included in the model.
  • The bill_length_mm variable is removed due to it being a perfectly discriminating variable.
  • Due to the high correlation with flipper_length_mm, body_mass_g is removed to avoid collinearity.
  • The sex variable is removed as it does not contribute based on the feature selection investigation.
  • The island variable is kept and evaluated per model, as in the previous sections, for how much of a contribution it makes given each algorithm and its assumptions. For the KNN model, it is ignored.
  • The year variable is ignored.

Normality & Linearity

The data is then pre-processed by centering and scaling so that each quantitative variable is standardized, i.e., \(X \sim N(\mu = 0, \sigma = 1)\). Normalization ensures that each variable has equal influence when calculating the Euclidean distance; otherwise, a variable with a larger scale would bias the calculation.
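A minimal sketch of the standardization step with caret; note that, unlike the appendix (which pre-processes the test set separately), this sketch reuses the training-set transform on the test set to avoid leakage:

# Sketch: center and scale the numeric predictors (factors pass through untouched)
processed = preProcess(train_peng.p, method = c("center", "scale"))
train_peng.p = predict(processed, train_peng.p)
test_peng.p  = predict(processed, test_peng.p)  # same centers and scales as training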

Building the KNN Model

KNN classification works such that for each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random.

With the feature set decided, there is no need to perform stepwise elimination to find the best predictors. To optimize each model, using accuracy as the criterion, 10 repeats of 10-fold cross-validation (CV) are performed. In this scheme, the training set is divided randomly into 10 parts, and each part in turn is used as a validation set for the model trained on the other 9; the average of the 10 error estimates is then taken, and the whole 10-fold CV is repeated ten times. Repeating the procedure gives a more stable error estimate than a single k-fold run.
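A condensed version of the tuning call (the full call appears in the Code Appendix), which produces the output below:

# Sketch: tune k for KNN with 10 x 10-fold repeated CV on the two retained predictors
set.seed(525)
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 10)
knnModel = train(x = train_peng.p[, c("bill_depth_mm", "flipper_length_mm")],
                 y = train_peng.r,
                 method = "knn",
                 trControl = ctrl)  # accuracy is caret's default metric for classification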

## k-Nearest Neighbors 
## 
## 242 samples
##   2 predictor
##   3 classes: 'Adelie', 'Chinstrap', 'Gentoo' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 218, 218, 218, 217, 219, 219, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.8048398  0.6868103
##   7  0.8026957  0.6829265
##   9  0.8006852  0.6797395
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

Model Discussion

The best tune for the KNN model, measured by accuracy, is k = 5, with accuracy = 80.5% and \(\kappa\) = 0.69 on the training resamples. This tune yields the highest cross-validated accuracy. Moreover, the variable that contributes the most to identifying Adelie and Gentoo is flipper_length_mm, while bill_depth_mm is the most important variable for identifying Chinstrap. On the test data, the KNN model performed comparably well, with an accuracy of 80.4% and \(\kappa\) = 0.68.

In terms of the confusion matrix, 80.4% of the predictions are correctly classified. Precision, the share of penguins predicted to be a given species that actually belong to it, is high for Adelie (74%) and Gentoo (100%) but modest for Chinstrap (54%); every penguin predicted to be a Gentoo actually is one. The recall shows that 87% of Adelie, 35% of Chinstrap, and 97% of Gentoo penguins are correctly classified. In all, this model classifies penguins into the three species with good accuracy, particularly the Gentoo species, which was expected, as their measurements are quite distinct. Lastly, the Kappa statistic of 0.68 indicates that the model's overall accuracy is better than that expected of a random chance classifier.

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        39        13      1
##   Chinstrap      6         7      0
##   Gentoo         0         0     36
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8039          
##                  95% CI : (0.7135, 0.8759)
##     No Information Rate : 0.4412          
##     P-Value [Acc > NIR] : 6.512e-14       
##                                           
##                   Kappa : 0.6826          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.8667          0.35000        0.9730
## Specificity                 0.7544          0.92683        1.0000
## Pos Pred Value              0.7358          0.53846        1.0000
## Neg Pred Value              0.8776          0.85393        0.9848
## Precision                   0.7358          0.53846        1.0000
## Recall                      0.8667          0.35000        0.9730
## F1                          0.7959          0.42424        0.9863
## Prevalence                  0.4412          0.19608        0.3627
## Detection Rate              0.3824          0.06863        0.3529
## Detection Prevalence        0.5196          0.12745        0.3529
## Balanced Accuracy           0.8105          0.63841        0.9865

Next, a receiver operating characteristic (ROC) analysis is shown in Figure 1. The area under the curve (AUC) for each class is estimated from the observed penguin species and the values predicted by the fitted KNN model. The multi-class AUC is the mean of the pairwise AUCs and is computed to be 0.854; that is, roughly speaking, there is an 85.4% chance that the model will correctly distinguish among the three penguin species.

Fig 1: ROC Curves of the KNN Model
## Multi-class area under the curve: 0.8545

Conclusion

Given the palmerpenguins dataset, five multi-classification models (multinomial logistic regression, LDA, QDA, Naive Bayes, and KNN) were fitted. From previous analyses, each model had its strengths and weaknesses on the same cleaned training dataset, and in the end, the results are as follows:

Model                             Accuracy
Multinomial Logistic Regression   0.87
Linear Discriminant Analysis      0.82
Quadratic Discriminant Analysis   0.83
Naive Bayes                       0.92
K-Nearest Neighbor                0.80

Of all the classification models used to classify the penguin species, the Naive Bayes model proved to be near-optimal. Adelie and Gentoo were classified easily based on flipper length, the most important variable in the classification, whereas for Chinstrap it was bill depth. In conclusion, the Naive Bayes classifier produced a model that is 92.1% accurate in classifying penguins into Adelie, Chinstrap, and Gentoo, and its error rate of 0.161 was the smallest among the models considered.


Loan Approval

Data Exploration

The loan approval dataset will be used for the remaining models.

Loan Approvals Data Column Definition

Variable              Description
Loan_ID               Unique loan ID
Gender                Male/Female
Married               Applicant married (Y/N)
Dependents            Number of dependents
Education             Applicant education (Graduate/Undergraduate)
Self_Employed         Self-employed (Y/N)
ApplicantIncome       Applicant income
CoapplicantIncome     Co-applicant income
LoanAmount            Loan amount in thousands
Loan_Amount_Term      Term of loan in months
Credit_History        Credit history meets guidelines
Property_Area         Urban/Semiurban/Rural
Loan_Status (target)  Loan approved (Y/N)

This dataset includes 614 records and 13 columns. The target variable is Loan_Status. Since Loan_ID is unique to each record, it is removed from the dataset. The Loan_Status classes are highly imbalanced, with more than twice as many approvals (Y) as rejections (N).

Understanding the Data

Below highlights the summary statistics for the dataset:

## Data Frame Summary  
## loan_df  
## Dimensions: 614 x 12  
## Duplicates: 0  
## 
## +----+--------------------+------------------------------+---------------------+----------+---------+
## | No | Variable           | Stats / Values               | Freqs (% of Valid)  | Valid    | Missing |
## +====+====================+==============================+=====================+==========+=========+
## | 1  | Gender             | 1. Female                    | 112 (18.6%)         | 601      | 13      |
## |    | [factor]           | 2. Male                      | 489 (81.4%)         | (97.88%) | (2.12%) |
## +----+--------------------+------------------------------+---------------------+----------+---------+
## | 2  | Married            | 1. No                        | 213 (34.9%)         | 611      | 3       |
## |    | [factor]           | 2. Yes                       | 398 (65.1%)         | (99.51%) | (0.49%) |
## +----+--------------------+------------------------------+---------------------+----------+---------+
## | 3  | Dependents         | 1. 0                         | 345 (57.6%)         | 599      | 15      |
## |    | [factor]           | 2. 1                         | 102 (17.0%)         | (97.56%) | (2.44%) |
## |    |                    | 3. 2                         | 101 (16.9%)         |          |         |
## |    |                    | 4. 3+                        |  51 ( 8.5%)         |          |         |
## +----+--------------------+------------------------------+---------------------+----------+---------+
## | 4  | Education          | 1. Graduate                  | 480 (78.2%)         | 614      | 0       |
## |    | [factor]           | 2. Not Graduate              | 134 (21.8%)         | (100%)   | (0%)    |
## +----+--------------------+------------------------------+---------------------+----------+---------+
## | 5  | Self_Employed      | 1. No                        | 500 (85.9%)         | 582      | 32      |
## |    | [factor]           | 2. Yes                       |  82 (14.1%)         | (94.79%) | (5.21%) |
## +----+--------------------+------------------------------+---------------------+----------+---------+
## | 6  | ApplicantIncome    | Mean (sd) : 5403.5 (6109)    | 505 distinct values | 614      | 0       |
## |    | [integer]          | min < med < max:             |                     | (100%)   | (0%)    |
## |    |                    | 150 < 3812.5 < 81000         |                     |          |         |
## |    |                    | IQR (CV) : 2917.5 (1.1)      |                     |          |         |
## +----+--------------------+------------------------------+---------------------+----------+---------+
## | 7  | CoapplicantIncome  | Mean (sd) : 1621.2 (2926.2)  | 287 distinct values | 614      | 0       |
## |    | [numeric]          | min < med < max:             |                     | (100%)   | (0%)    |
## |    |                    | 0 < 1188.5 < 41667           |                     |          |         |
## |    |                    | IQR (CV) : 2297.2 (1.8)      |                     |          |         |
## +----+--------------------+------------------------------+---------------------+----------+---------+
## | 8  | LoanAmount         | Mean (sd) : 146.4 (85.6)     | 203 distinct values | 592      | 22      |
## |    | [integer]          | min < med < max:             |                     | (96.42%) | (3.58%) |
## |    |                    | 9 < 128 < 700                |                     |          |         |
## |    |                    | IQR (CV) : 68 (0.6)          |                     |          |         |
## +----+--------------------+------------------------------+---------------------+----------+---------+
## | 9  | Loan_Amount_Term   | Mean (sd) : 342 (65.1)       | 12  :   1 ( 0.2%)   | 600      | 14      |
## |    | [integer]          | min < med < max:             | 36  :   2 ( 0.3%)   | (97.72%) | (2.28%) |
## |    |                    | 12 < 360 < 480               | 60  :   2 ( 0.3%)   |          |         |
## |    |                    | IQR (CV) : 0 (0.2)           | 84  :   4 ( 0.7%)   |          |         |
## |    |                    |                              | 120 :   3 ( 0.5%)   |          |         |
## |    |                    |                              | 180 :  44 ( 7.3%)   |          |         |
## |    |                    |                              | 240 :   4 ( 0.7%)   |          |         |
## |    |                    |                              | 300 :  13 ( 2.2%)   |          |         |
## |    |                    |                              | 360 : 512 (85.3%)   |          |         |
## |    |                    |                              | 480 :  15 ( 2.5%)   |          |         |
## +----+--------------------+------------------------------+---------------------+----------+---------+
## | 10 | Credit_History     | 1. 0                         |  89 (15.8%)         | 564      | 50      |
## |    | [factor]           | 2. 1                         | 475 (84.2%)         | (91.86%) | (8.14%) |
## +----+--------------------+------------------------------+---------------------+----------+---------+
## | 11 | Property_Area      | 1. Rural                     | 179 (29.1%)         | 614      | 0       |
## |    | [factor]           | 2. Semiurban                 | 233 (38.0%)         | (100%)   | (0%)    |
## |    |                    | 3. Urban                     | 202 (32.9%)         |          |         |
## +----+--------------------+------------------------------+---------------------+----------+---------+
## | 12 | Loan_Status        | 1. N                         | 192 (31.3%)         | 614      | 0       |
## |    | [factor]           | 2. Y                         | 422 (68.7%)         | (100%)   | (0%)    |
## +----+--------------------+------------------------------+---------------------+----------+---------+

It should be noted that:

  • Seven of the twelve variables have missing values, which will be corrected in the later sections.
  • The frequencies of almost all of the categorical variables are unequally proportioned: Gender (more males than females), Married (more married loan applicants than single), Education (more graduates than non-graduates), Self_Employed (fewer self-employed individuals), and Credit_History (more individuals with a credit history than without).

Categorical Features

An in-depth investigation into each of the categorical features with respect to the final classification is conducted. Since most of the categorical features are disproportionately distributed, the data are discussed in terms of percentages rather than counts.

Gender: Regardless of the sex, around 70% of individuals are approved for a loan.

Married: Married individuals tend to be approved more often than non-married individuals.

Dependents: The number of dependents an individual has does not appear to be indicative of loan approval.

Education: Graduates tend to be approved more often than non-graduates.

Self-Employed: Employment Status alone does not appear to have a significant impact on approval status.

Credit History: Credit history appears to be the most impactful on the final approval! Nearly 80% of individuals with a credit history are approved versus only 8% for those with no credit history.

Property Area: Individuals living in semi-urban areas tend to be approved more often than those in rural or urban areas.

Numeric Features

Next, an in-depth look into the numeric features is carried out to understand their influence on the loan approval status.

Applicant Income: There are a few outliers in both approvals and non-approvals. The distribution for each classification is right-skewed.

Co-applicant Income: Once again, there are a few very high outliers. Data is right-skewed for both classes.

Loan Amount: A few outliers in each class, but a normal distribution otherwise.

Loan Term: Most applicants have a loan term between 300 - 400 months.

Data Preparation

From the summary statistics, five categorical and two numeric variables have missing values, and their overall proportions are somewhat extreme. For example, in the graph below, Credit_History is the worst affected, with about 8% of its values missing; note that these percentages describe each variable separately and do not indicate joint missingness patterns.

As a result,

  • All of the records that have missing categorical features are discarded. These are not as easy to impute and some variables (like Credit_History) are impactful in the final classification.
  • Imputation for the numeric features is done using the Multivariate Imputation by Chained Equations (MICE) method. Multiple imputation involves creating multiple predictions for each missing value, helping to account for the uncertainty in the individual imputations (see the sketch below).
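A minimal sketch of this cleaning step, reusing the loan_df object from the appendix; by default mice applies predictive mean matching to the numeric columns:

# Sketch: drop rows with missing categoricals, then impute the numeric features
set.seed(525)
cat_vars = c("Gender", "Married", "Dependents", "Self_Employed", "Credit_History")
loan_clean = loan_df %>% drop_na(all_of(cat_vars))       # discard incomplete categoricals
temp = mice(loan_clean, m = 5, maxit = 5, print = FALSE) # imputes LoanAmount, Loan_Amount_Term
loan_clean = complete(temp)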

The final cleaned dataset contains 511 records with 12 columns.

Training and Testing Split

All the models are trained on the same split: approximately 70% of the data for training, reserving 30% for validating which model to select for loan approval classification on the test set.

The training set contains 353 records and the test set contains 158 records.

Feature Selection

The features of the training data set are investigated by calculating their mean importance. This selection is itself done via an implementation of random forests. While the results of this process do not inform the subsequent Model #2: Random Forest and its features, they offer a high-level check of importance and can be useful for the other models.

Feature Importance of Loan Data Variables
Variable            meanImp     decision
Credit_History      52.255485   Confirmed
ApplicantIncome      4.324072   Confirmed
CoapplicantIncome    4.298802   Confirmed
Property_Area        4.206467   Confirmed
Loan_Amount_Term     2.737219   Confirmed

Five features perform at least as well as the best performing randomized feature. By the table, the most contributing feature in the data is Credit_History, with a mean importance above 50. Its importance was already suggested by the variable exploration: credit history is what lets lenders judge whether individuals pay their bills on time or are generally late with payments. The next most important features are ApplicantIncome and CoapplicantIncome, which is understandable, as lenders want reassurance that applicants have enough income to repay the loan.

Normality & Linearity

The EDA revealed that the numeric features, which all describe dollar amounts, are right-skewed. To address that skewness, these features undergo a Box-Cox transformation and subsequent normalization prior to modeling.
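A sketch of the transformation with caret, assuming loan_train and loan_test hold the 70/30 split; caret skips Box-Cox for columns containing non-positive values (CoapplicantIncome includes zeros):

# Sketch: Box-Cox plus centering/scaling for the skewed dollar-amount features
num_vars = c("ApplicantIncome", "CoapplicantIncome", "LoanAmount")
pp = preProcess(loan_train[, num_vars], method = c("BoxCox", "center", "scale"))
loan_train[, num_vars] = predict(pp, loan_train[, num_vars])
loan_test[, num_vars]  = predict(pp, loan_test[, num_vars])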

Building the Models

Model #1: Decision Trees

The decision tree is a versatile machine learning algorithm that can perform both classification and regression tasks. When working with a decision tree, there are important terms to know:

  • root nodes - represent the whole population (or sample) and get further divided
  • splitting - when nodes get divided into two or more
  • decision node - when a sub node splits
  • terminal or leaf - when a node no longer splits
  • branch - sub-section of the entire tree
  • parent-node - when a node is divided into sub-nodes

The decision tree is fitted on the training set, and visualized below.
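A sketch of the fit, again assuming loan_train as the training split:

# Sketch: fit and visualize the base classification tree
set.seed(525)
base_tree = rpart(Loan_Status ~ ., data = loan_train, method = "class")
rpart.plot(base_tree)   # visualize the fitted tree
printcp(base_tree)      # complexity parameter table, reproduced below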

The best first split is the one that provides the most information gain, and it shows that credit history plays an important role in the loan approval decision.

In addition, the plot below shows the cross-validated error of the model as a function of the complexity parameter (cp). The best complexity parameter is approximately 0.011, where the cross-validated error (xerror) reaches its minimum of about 0.61. In terms of the confusion matrix, the decision tree model achieves an accuracy of 80.4% and \(\kappa\) = 0.48 on the test set. Looking at the predictions, 8 loans are predicted as rejected that were actually approved and 23 are predicted as approved that were actually rejected, while 103 are correctly predicted as approved and 24 as rejected.

##           CP nsplit rel error    xerror       xstd
## 1 0.39316239      0 1.0000000 1.0000000 0.07559198
## 2 0.01068376      1 0.6068376 0.6068376 0.06436956
## 3 0.00000000      9 0.5213675 0.7863248 0.07049218
## Confusion Matrix and Statistics
## 
##                  
## unseen_prediction   N   Y
##                 N  24   8
##                 Y  23 103
##                                           
##                Accuracy : 0.8038          
##                  95% CI : (0.7332, 0.8626)
##     No Information Rate : 0.7025          
##     P-Value [Acc > NIR] : 0.002638        
##                                           
##                   Kappa : 0.483           
##                                           
##  Mcnemar's Test P-Value : 0.011921        
##                                           
##             Sensitivity : 0.5106          
##             Specificity : 0.9279          
##          Pos Pred Value : 0.7500          
##          Neg Pred Value : 0.8175          
##              Prevalence : 0.2975          
##          Detection Rate : 0.1519          
##    Detection Prevalence : 0.2025          
##       Balanced Accuracy : 0.7193          
##                                           
##        'Positive' Class : N               
## 

The complexity parameter allows one to choose the optimal tree size. Using this information from the base decision tree model, an improved model is attempted next and its updated confusion matrix compared.

Improved Decision Tree (with Hyperparameters)

For this decision tree model, the complexity parameter cp = 0.011, obtained from the first model, is included as a hyperparameter. One might wonder how using a complexity parameter changes the tree; looking at the updated plot, the model has changed drastically.
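A sketch of the refit with the chosen cp:

# Sketch: refit the tree with the complexity parameter found above
set.seed(525)
tuned_tree = rpart(Loan_Status ~ ., data = loan_train, method = "class",
                   control = rpart.control(cp = 0.011))
rpart.plot(tuned_tree)  # the updated, much shallower tree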

This only showcases Credit_History as the most important feature, which is not surprising since the mean importance was so high that no other feature was even close.

The error rate of this model is now a straight line, and the accuracy and kappa statistic have improved with the complexity parameter: accuracy = 82.9% and \(\kappa\) = 0.52. Only 2 approved loans are predicted as rejected, though 25 rejected loans are predicted as approved. The improved model does better at predicting loan approval status than the base decision tree model.

## Confusion Matrix and Statistics
## 
##                   
## unseen_prediction1   N   Y
##                  N  22   2
##                  Y  25 109
##                                           
##                Accuracy : 0.8291          
##                  95% CI : (0.7612, 0.8843)
##     No Information Rate : 0.7025          
##     P-Value [Acc > NIR] : 0.0001868       
##                                           
##                   Kappa : 0.524           
##                                           
##  Mcnemar's Test P-Value : 2.297e-05       
##                                           
##             Sensitivity : 0.4681          
##             Specificity : 0.9820          
##          Pos Pred Value : 0.9167          
##          Neg Pred Value : 0.8134          
##              Prevalence : 0.2975          
##          Detection Rate : 0.1392          
##    Detection Prevalence : 0.1519          
##       Balanced Accuracy : 0.7250          
##                                           
##        'Positive' Class : N               
## 
Conditional Inference Tree Plot

Lastly, the plot below shows a conditional inference tree, estimated by binary recursive partitioning in a conditional inference framework. It again shows that credit history is the most important feature: n = 56 applicants are without a credit history, and n = 297 have one.

Model #2: Random Forest

As a classification ensemble method, the random forest algorithm seeks to address the tree correlation inherent in bagging decision trees. By linearly combining many individual trees it reduces prediction variance, and by randomly selecting predictor features at each split of each tree it mitigates possible correlation between trees. This combination makes it a strong performer in general.

As noted, random forests modeling starts with feature selection to assess feature importance in predicting Loan_Status. It incorporates all possible features, including dummies associated with the levels of factors like Gender, Education, and Property_Area. The sole tuning parameter for this implementation is \(m_{try}\), which refers to the number of randomly selected features to be chosen at each tree split, and has been shown to affect model accuracy. A vector of \(m_{try}\) values from 2 through 10 is provided for grid search. The number of trees in the forest ensemble is another general parameter, though one considered less important by Kuhn and Johnson (2013), who suggest 1000 trees as a reasonable number. Lastly, a 10-fold cross-validation scheme is repeated three times to create the final model.
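A sketch of this tuning setup with caret (method = "rf" wraps randomForest; the object names are assumptions):

# Sketch: grid over m_try = 2:10 with 1000 trees and 10-fold CV repeated 3 times
set.seed(525)
rf_model = train(Loan_Status ~ ., data = loan_train,
                 method = "rf",
                 ntree = 1000,                         # Kuhn & Johnson's suggested forest size
                 tuneGrid = expand.grid(mtry = 2:10),  # candidate features per split
                 trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3))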

The resulting model optimized the accuracy to approximately 80%, and the final model used three features at each split (\(m_{try}\) = 3). The plot depicts differences in accuracy across \(m_{try}\) values. In addition, considering feature importance, Credit_History = 1, representing rows with a credit history, is important in all trees, just as it was the most important in the initial feature selection. ApplicantIncome, LoanAmount, CoapplicantIncome, and Loan_Amount_Term are also relatively important, though none of them were notably so initially. In summary, given these data, an individual’s having a credit history is clearly the most important predictor of loan receipt, whereas other predictors are a mixed bag.

Predicting Loan_Status for the test set returns an accuracy of approximately 84.2% with \(\kappa\) of approximately 0.57. These values both exceed their training set counterparts, which is surprising. It could simply result from sampling, though further investigation – beyond this exercise – seems warranted.

Unfortunately, Loan_Status is not well balanced across classes. Specifically, the data set contains roughly twice as many “Y” values as there are “N” values. This imbalance suggests that accuracy may not be the best measure of the random forest model’s predictive performance. In response, a confusion matrix and other evaluative measures are run on the predictions.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  24   2
##          Y  23 109
##                                           
##                Accuracy : 0.8418          
##                  95% CI : (0.7753, 0.8949)
##     No Information Rate : 0.7025          
##     P-Value [Acc > NIR] : 3.879e-05       
##                                           
##                   Kappa : 0.5655          
##                                           
##  Mcnemar's Test P-Value : 6.334e-05       
##                                           
##             Sensitivity : 0.9820          
##             Specificity : 0.5106          
##          Pos Pred Value : 0.8258          
##          Neg Pred Value : 0.9231          
##               Precision : 0.8258          
##                  Recall : 0.9820          
##                      F1 : 0.8971          
##              Prevalence : 0.7025          
##          Detection Rate : 0.6899          
##    Detection Prevalence : 0.8354          
##       Balanced Accuracy : 0.7463          
##                                           
##        'Positive' Class : Y               
## 

The confusion matrix shows poor performance in predicting loan application rejections (“N”), with a specificity, or true negative rate, of approximately 0.51. By comparison, the sensitivity, or true positive rate, is quite high, at approximately 0.98. Combining the two measures, the balanced accuracy of approximately 0.75 is middling. Below, an ROC curve carries an associated AUC value of approximately 0.75, which, again, is just okay.

## Area under the curve: 0.7463

Model #3: Gradient Boosting

Gradient Boosting Machines (GBM) build an ensemble of shallow, weak, successive trees, with each tree learning from and improving on the previous one. When combined, these many weak trees produce a powerful "committee" that is often hard to beat with other algorithms. The advantages of GBM are predictive accuracy, great flexibility through hyperparameter tuning, good handling of both categorical and numeric values, and the ability to handle missing data. The disadvantages are a tendency to overfit due to continued error minimization and the need for many trees, which can be time consuming and memory intensive; results can also be less interpretable unless paired with tools such as variable importance, partial dependence plots, or LIME.

Boosting works by adding new models to the ensemble sequentially. At each iteration, a new weak base-learner model is trained with respect to the error of the whole ensemble learned so far. In short, GBM starts with a base learner, trains weak models, and then trains sequentially with respect to the errors. The process is explained in more detail below.

  • Base-learning models: Boosted algorithms almost always use decision trees as the base-learner (here, boosting is used in the context of regression trees).

  • Training weak models: A weak model is one whose error rate is only slightly better than random guessing. The idea behind boosting is that each sequential model makes a small improvement on the remaining errors. Commonly, trees with only 1-6 splits are used. Combining many such weak models is fast, steadily improves accuracy, and helps avoid overfitting.

  • Sequential training with respect to errors: Boosted trees are grown sequentially; each tree is grown using information from previously grown trees.

The basic algorithm for boosted regression trees can be generalized to the following where x represents the predictor variables and y represents response variables:

  1. Fit a decision tree to the data: \(F_1(x) = y\),

  2. We then fit the next decision tree to the residuals of the previous: \(h_1(x) = y−F_1(x)\),

  3. Add this new tree to the algorithm: \(F_2(x)=F_1(x)+h_1(x)\),

  4. Fit the next decision tree to the residuals of \(F_2\): \(h_2(x) = y - F_2(x)\),

  5. Add this new tree to the algorithm: \(F_3(x) = F_2(x) + h_2(x)\),

  6. Continue this process until some mechanism (e.g., cross-validation) tells it to stop.

To start, a base GBM is fitted with typical parameters: a small shrinkage, such as between 0.001 and 0.01; a number of trees, n.trees, between 3000 and 10000; and an interaction depth of 1.
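A sketch of the base fit with gbm, which requires a 0/1 response for the "bernoulli" distribution (object names are assumptions):

# Sketch: derive a 0/1 response in a copy of the data, then fit the base GBM
set.seed(525)
gbm_df = transform(loan_train, approved = as.integer(Loan_Status == "Y"))
gbm_base = gbm(approved ~ . - Loan_Status, data = gbm_df,
               distribution = "bernoulli",
               n.trees = 5000, shrinkage = 0.01, interaction.depth = 1,
               cv.folds = 10)
gbm.perf(gbm_base, method = "cv")  # plots CV error and returns the optimal tree count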

The plot above shows that the CV error starts leveling off at 3196 trees. So, the GBM is tuned to try to improve the model performance. The hyper-tuning parameters are:

  • Number of trees: The total number of trees to fit. GBMs often require many trees.
  • Depth of trees: The number d of splits in each tree, which controls the complexity of the boosted ensemble. With d = 1, each tree is a stump consisting of a single split.
  • Learning rate: Controls how quickly the algorithm proceeds down the gradient descent. Smaller values reduce the chance of overfitting but also increase the time needed to find the optimal fit. This is also called shrinkage.
  • Subsampling: Controls whether a fraction of the available training observations is used. This can help to minimize overfitting.

Improved Gradient Boosting Model

Manually tweaking each hyperparameter is time consuming and less efficient than performing a grid search, which iterates over every combination of hyperparameter values and allows one to analyze which combination performs best. Therefore, a grid of hyperparameter settings is constructed, and the algorithm searches across models with different learning rates and tree depths to determine the optimal fit. First, the number of trees is lowered from 5000 to 30, because the optimal tree counts are all below 30. Shrinkage values of 0.3 and 0.1 are tried, along with interaction.depth values of 1, 3, and 5, and n.minobsinnode values of 5, 10, and 15. These define the new hyperparameter grid, and a new gradient boosting model is trained on the training set, as sketched below.
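A sketch of the grid search via caret using the values named above (the resampling scheme here is an assumption):

# Sketch: grid search over the GBM hyperparameters described in the text
gbm_grid = expand.grid(n.trees = 30,
                       interaction.depth = c(1, 3, 5),
                       shrinkage = c(0.3, 0.1),
                       n.minobsinnode = c(5, 10, 15))
set.seed(525)
gbm_tuned = train(Loan_Status ~ ., data = loan_train, method = "gbm",
                  tuneGrid = gbm_grid, verbose = FALSE,
                  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3))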

Accuracy is one minus the error rate and is thus the percentage of correctly classified observations. The tuning parameters giving the largest accuracy for the gradient boosting model are n.trees = 30, interaction.depth = 3, shrinkage = 0.1, and n.minobsinnode = 15, yielding a model accuracy of 79.6% and \(\kappa\) = 0.47. This combination produces the smallest cross-validated error, which makes it the optimal set of fitting parameters.

In addition, considering feature importance, once again, as with the other models, Credit_History = 1, representing rows with a credit history, is the most contributing variable in the model. Following it are LoanAmount, ApplicantIncome, CoapplicantIncome, and, interestingly, Property_Area.

Predicting Loan_Status for the test set returns an accuracy of approximately 82.3% with \(\kappa\) of approximately 0.51. These values both exceed their training set counterparts.

The confusion matrix results suggest that 82.3% of the predictions are correctly classified. Precision indicates how many of the predicted positive cases actually turned out to be positive; with the gradient boosting model, the precision for loan approval is fairly good at 81.2%. Recall, or how many of the actual positive cases the model predicts correctly, shows that 97.3% are correctly classified. In all, this model is capable of classifying loan approval status with good accuracy. Lastly, the Kappa statistic of 0.51 suggests that the overall accuracy of this model is better than the expected accuracy of a random chance classifier.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   N   Y
##          N  22   3
##          Y  25 108
##                                           
##                Accuracy : 0.8228          
##                  95% CI : (0.7542, 0.8789)
##     No Information Rate : 0.7025          
##     P-Value [Acc > NIR] : 0.0003845       
##                                           
##                   Kappa : 0.5099          
##                                           
##  Mcnemar's Test P-Value : 7.229e-05       
##                                           
##             Sensitivity : 0.9730          
##             Specificity : 0.4681          
##          Pos Pred Value : 0.8120          
##          Neg Pred Value : 0.8800          
##               Precision : 0.8120          
##                  Recall : 0.9730          
##                      F1 : 0.8852          
##              Prevalence : 0.7025          
##          Detection Rate : 0.6835          
##    Detection Prevalence : 0.8418          
##       Balanced Accuracy : 0.7205          
##                                           
##        'Positive' Class : Y               
## 

Model Performance

In binary classification, the measures of focus are often precision and recall, as opposed to accuracy. A measure combining the two is the F1 score, defined as twice the product of precision and recall divided by their sum: \(F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}\). Another oft-used metric is the area under the receiver operating characteristic curve, AUC. Here, the three metrics, i.e. Accuracy, F1, and AUC, are compared to select the model that performs best in at least two of the three.

Once again, below shows the confusion matrices for the three models, followed by a table with the selected performance criteria for comparison.

Model 1: Decision Tree
##           Reference
## Prediction   N   Y
##          N  22   2
##          Y  25 109
Model 2: Random Forest
##           Reference
## Prediction   N   Y
##          N  24   2
##          Y  23 109
Model 3: Gradient Boosting
##           Reference
## Prediction   N   Y
##          N  22   3
##          Y  25 108
Selected Performance Statistics

Model Test Results
Model                         ACC     F1      AUC
Model 1: Decision Tree        0.829   0.620   0.725
Model 2: Random Forest        0.842   0.658   0.783
Model 3: Gradient Boosting    0.823   0.611   0.772

Conclusion

All the models performed well and classified the majority of the test set properly; the differences between the three models were slight. False negatives were lowest with the Random Forest model, while the Gradient Boosting model produced the most false positives. Given the above statistics and confusion matrices, Model 2: Random Forest barely edged out the other two models with the best accuracy and F1 score, and it is selected as the optimal model to predict loan approval.

This analysis highlights that loan approval is strongly influenced by the credit history and income of the applicants; each model identified credit history as having the most impact. However, these models were not trained on a well-balanced dataset, and this imbalance is felt in spite of cross-validation. More specifically, even the optimal model, the random forest, predicts approved loan applications well but rejected applications poorly. Having additional observations for the negative class would be ideal, but otherwise, it is recommended that any extension of this analysis consider resampling the training set (perhaps over-sampling the negative class, as sketched below, or testing additional resamples) to promote better balance across the Loan_Status classes.
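As a hedged sketch of that suggestion, caret's upSample over-samples the minority class until the classes are balanced (object names are assumptions):

# Sketch: over-sample the "N" class before retraining
set.seed(525)
balanced = upSample(x = loan_train[, setdiff(names(loan_train), "Loan_Status")],
                    y = loan_train$Loan_Status, yname = "Loan_Status")
table(balanced$Loan_Status)  # both classes now equally represented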

Works Cited

  1. Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi:10.5281/zenodo.3960218.
  2. Kuhn M, Johnson K (2013). Applied Predictive Modeling. Springer Science+Business Media.
  3. Wright MN, Wager S, Probst P (2020). Package ‘ranger’.
  4. UC Business Analytics R Programming Guide. Gradient Boosting Machines.
  5. DataCamp. What are the hyperparameters for a decision tree?.
  6. datascievo. Decision Tree in R | A Real Guide on Titanic Dataset with Code.
  7. Guru99. Decision Tree in R | Classification Tree & Code in R with Example.
  8. DataFlair. R Decision Trees – The Best Tutorial on Tree Based Modeling in R!.

Code Appendix

The code chunks below represent the R code called in order during the analysis. They are reproduced in the appendix for review and comment.

# Load dataset
penguins = penguins

# Number of observations
ntrobs = dim(penguins)[[1]]

# Converting Year to factor
penguins$year = as.factor(penguins$year)
reorder <- function(x){
  # Order factor levels by decreasing frequency (masks stats::reorder)
  factor(x, levels = names(sort(table(x), decreasing = TRUE)))
}

ggplot(drop_na(penguins), aes(x = reorder(species), fill = species)) + 
  geom_bar() +
  geom_text(stat = "count", aes(label =..count..), vjust=-0.5, size = 3) +
  facet_wrap(~sex) +
  scale_fill_brewer(palette = "Accent") +
  theme_minimal() +
  theme(legend.position = "none")+
  labs(title = "Distibution of Species by Gender", y = "Frequency", x = "Species")
  
 ggplot(drop_na(penguins), aes(x = reorder(species), fill = species)) + 
  geom_bar() +
  geom_text(stat = "count", aes(label =..count..), vjust=-0.5, size = 3) +
  facet_wrap(~island) +
  scale_fill_brewer(palette = "Accent") +
  theme_minimal() +
  theme(legend.position = "none")+
  labs(title = "Distibution of Species by Island Habitat", y = "Frequency", x = "Species")
dfSummary(penguins, plain.ascii = TRUE, style = "grid", graph.col = FALSE, footnote = NA)
ggpairs(penguins, columns = 3:6, title = "Correlogram of Variables", 
        ggplot2::aes(color = species),
        progress = FALSE, 
        lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.1))) 
# Create training and testing split
set.seed(525)
intrain = createDataPartition(penguins$species, p = 0.70, list = FALSE)

# Train & Test predictor variables
train_peng.p = penguins[intrain, -c(1,8)] # remove species, and year
test_peng.p = penguins[-intrain, -c(1,8)] 

# Train & Test response variable (species)
train_peng.r = penguins$species[intrain]
test_peng.r = penguins$species[-intrain]
set.seed(525)
temp = mice(train_peng.p, method = 'cart', print = FALSE, m = 3, maxit = 3)
train_peng.p = complete(temp)

temp = mice(test_peng.p, method = 'cart', print = FALSE, m = 3, maxit = 3)
test_peng.p = complete(temp)
set.seed(525)
# Train set
processed_train_peng = preProcess(train_peng.p)
train_peng.p = predict(processed_train_peng, train_peng.p)

# Test set
processed_test_peng = preProcess(test_peng.p)
test_peng.p = predict(processed_test_peng, test_peng.p)
set.seed(525)
# Train set
train_peng.pd = dummy.data.frame(train_peng.p, names = c("island","sex") , sep = ".")
train_peng.p = cbind(train_peng.p, train_peng.pd[,c(1:3,8:9)])
train_peng.p[sapply(train_peng.p, is.factor)] = data.matrix(train_peng.p[sapply(train_peng.p, is.factor)])
train_peng.p[,c(6:11)] = lapply(train_peng.p[,c(6:11)], factor) 
train_peng.p$island = factor(train_peng.p$island)
  
# Test set 
test_peng.pd = dummy.data.frame(test_peng.p, names = c("island","sex") , sep = ".")
test_peng.p = cbind(test_peng.p, test_peng.pd[,c(1:3,8:9)])
test_peng.p[sapply(test_peng.p, is.factor)] = data.matrix(test_peng.p[sapply(test_peng.p, is.factor)])
test_peng.p[,c(6:11)] = lapply(test_peng.p[,c(6:11)], factor) 
test_peng.p$island = factor(test_peng.p$island)
output = Boruta(train_peng.r ~ ., data = train_peng.p, doTrace = 0)  
roughFixMod = TentativeRoughFix(output)  # resolve any tentative attributes
importance = attStats(roughFixMod)
importance = importance[importance$decision != 'Rejected', c('meanImp', 'decision')]
kable(head(importance[order(-importance$meanImp), ]), 
      caption = "Feature Importance of Penguin Data") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)
set.seed(525)
knnModel = train(x = train_peng.p[, c(3:4)], 
                 y = train_peng.r,
                 method = "knn",
                 trControl = trainControl(method = "repeatedcv", 
                                          number = 10, 
                                          repeats = 10))
knnModel
plot(knnModel, main = "Accuracy of KNN Model")
plot(varImp(knnModel), main = "Rank of Most Important Variable")
set.seed(525)
# Confusion Matrix
pred.R = predict(knnModel, newdata = test_peng.p, type = "raw")
confusion = confusionMatrix(pred.R, test_peng.r, mode = "everything")
confusion
predictions = as.numeric(predict(knnModel, test_peng.p, type = 'raw'))
roc.multi = multiclass.roc(test_peng.r, predictions)
auc(roc.multi)
plot.roc(roc.multi[['rocs']][[1]], main = "Multi-class ROC, Macro-Average AUC = 0.854")
sapply(2:length(roc.multi[['rocs']]), function(i) lines.roc(roc.multi[['rocs']][[i]], col=i))

legend("bottomright", 
       legend = c("ROC curve of Chinstrap",
                  "ROC curve of Gentoo",
                  "ROC curve of Adelie"), 
       col = c("black", "red", "green"), lwd = 2)
loan_df_link <- 'https://raw.githubusercontent.com/greeneyefirefly/DATA622-Group3/main/Project_3/Loan_approval.csv'
loan_df <- read.csv(loan_df_link) %>%
  mutate_all(na_if,"") %>%
  mutate(Gender = as.factor(Gender),
         Married = as.factor(Married), 
         Dependents = as.factor(Dependents),
         Education = as.factor(Education),
         Self_Employed = as.factor(Self_Employed),
         Credit_History = as.factor(Credit_History),
         Property_Area = as.factor(Property_Area),
         Loan_Status = as.factor(Loan_Status)) %>%
  select(-Loan_ID)
loan_df %>%
  drop_na() %>%
  count(Loan_Status) %>%
  ggplot() + geom_col(aes(x = Loan_Status, y = n, fill = Loan_Status), show.legend = FALSE) +
  geom_label(aes(x = Loan_Status, y = n, label = n)) +
  theme_minimal() +
  labs(title = 'Distribution of Loan Status', y = "Frequency", x = "Loan Status")
dfSummary(loan_df, plain.ascii = TRUE, style = "grid", graph.col = FALSE, footnote = NA)
tab_gender <- with(loan_df, table(Gender, Loan_Status))
tab_gender <- as.data.frame(prop.table(tab_gender, margin = 1)) %>%
  filter(Loan_Status == 'Y')

tab_gender %>% 
  ggplot() +  
  geom_col(aes(x=Gender, y=Freq, fill=Gender), show.legend = FALSE) +
  geom_label(aes(x=Gender, y=Freq, label = paste0(round((Freq*100),1),"%"))) +
  scale_y_continuous(labels = function(x) paste0(x*100, "%")) +
  theme_minimal() +
  labs(title = 'Approved Loans by Gender', y = "Percentage")
tab_married <- with(loan_df, table(Married, Loan_Status))
tab_married <- as.data.frame(prop.table(tab_married, margin = 1)) %>%
  filter(Loan_Status == 'Y')

tab_married %>% 
  ggplot() +  
  geom_col(aes(x=Married, y=Freq, fill=Married), show.legend = FALSE) +
  geom_label(aes(x=Married, y=Freq, label = paste0(round((Freq*100),1),"%"))) +
  scale_y_continuous(labels = function(x) paste0(x*100, "%")) +
  theme_minimal() +
  labs(title = 'Approved Loans by Marital Status', y = "Percentage")
tab_kids <- with(loan_df, table(Dependents, Loan_Status))
tab_kids <- as.data.frame(prop.table(tab_kids, margin = 1)) %>%
  filter(Loan_Status == 'Y')

tab_kids %>% 
  ggplot() +  
  geom_col(aes(x=Dependents, y=Freq, fill=Dependents), show.legend = FALSE) +
  geom_label(aes(x=Dependents, y=Freq, label = paste0(round((Freq*100),1),"%"))) +
  scale_y_continuous(labels = function(x) paste0(x*100, "%")) +
  theme_minimal() +
  labs(title = 'Approved Loans by Number of Dependents', y = "Percentage", x = "# of Dependents")
tab_edu <- with(loan_df, table(Education, Loan_Status))
tab_edu <- as.data.frame(prop.table(tab_edu, margin = 1)) %>%
  filter(Loan_Status == 'Y')

tab_edu %>% 
  ggplot() +  
  geom_col(aes(x=Education, y=Freq, fill=Education), show.legend = FALSE) +
  geom_label(aes(x=Education, y=Freq, label = paste0(round((Freq*100),1),"%"))) +
  scale_y_continuous(labels = function(x) paste0(x*100, "%")) +
  theme_minimal() +
  labs(title = 'Approved Loans by Education', y = "Percentage")
tab_emp <- with(loan_df, table(Self_Employed, Loan_Status))
tab_emp <- as.data.frame(prop.table(tab_emp, margin = 1)) %>%
  filter(Loan_Status == 'Y')

tab_emp %>% 
  ggplot() +  
  geom_col(aes(x=Self_Employed, y=Freq, fill=Self_Employed), show.legend = FALSE) +
  geom_label(aes(x=Self_Employed, y=Freq, label = paste0(round((Freq*100),1),"%"))) +
  scale_y_continuous(labels = function(x) paste0(x*100, "%")) +
  theme_minimal() +
  labs(title = 'Approved Loans by Employment Status', y = "Percentage", x = "Self-Employed")
tab_cred <- with(loan_df, table(Credit_History, Loan_Status))
tab_cred <- as.data.frame(prop.table(tab_cred, margin = 1)) %>%
  filter(Loan_Status == 'Y')

tab_cred %>% 
  ggplot() +  
  geom_col(aes(x=Credit_History, y=Freq, fill=Credit_History), show.legend = FALSE) +
  geom_label(aes(x=Credit_History, y=Freq, label = paste0(round((Freq*100),1),"%"))) +
  scale_y_continuous(labels = function(x) paste0(x*100, "%")) +
  theme_minimal() +
  labs(title = 'Approved Loans by Credit History', y = "Percentage", x = "Credit History")
tab_prop <- with(loan_df, table(Property_Area, Loan_Status))
tab_prop <- as.data.frame(prop.table(tab_prop, margin = 1)) %>%
  filter(Loan_Status == 'Y')

tab_prop %>% 
  ggplot() +  
  geom_col(aes(x=Property_Area, y=Freq, fill=Property_Area), show.legend = FALSE) +
  geom_label(aes(x=Property_Area, y=Freq, label = paste0(round((Freq*100),1),"%"))) +
  scale_y_continuous(labels = function(x) paste0(x*100, "%")) +
  theme_minimal() +
  labs(title = 'Approved Loans by Property Area', y = "Percentage", x = "Property Area")
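# Distribution of the numeric predictors by loan status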
loan_df %>%
  drop_na() %>%
  ggplot( aes(x=ApplicantIncome, fill=Loan_Status)) +
  geom_histogram(alpha=0.6, position = 'identity') +
  labs(title = 'Loan Approval by Applicant Income', x = "Applicant Income")
loan_df %>%
  drop_na() %>%
  ggplot( aes(x=CoapplicantIncome, fill=Loan_Status)) +
  geom_histogram( alpha=0.6, position = 'identity') +
  labs(title = 'Loan Approval by Coapplicant Income', x = "Co-applicant Income")
loan_df %>%
  drop_na() %>%
  ggplot( aes(x=LoanAmount, fill=Loan_Status)) +
  geom_histogram( alpha=0.6, position = 'identity') +
  labs(title = 'Loan Approval by Loan Amount', x = "Loan Amount")
loan_df %>%
  drop_na() %>%
  ggplot( aes(x=Loan_Amount_Term, fill=Loan_Status)) +
  geom_histogram( alpha=0.6, position = 'identity') +
  labs(title = 'Loan Approval by Loan Amount Term', x = "Loan Amount Term")
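# Quantify the percentage of missing values per variable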
loan_missing <- loan_df %>%
  gather(key = "key", value = "val") %>%
  mutate(isna = is.na(val)) %>%
  group_by(key) %>%
  mutate(total = n()) %>%
  group_by(key, total, isna) %>%
  summarise(num_null = n()) %>%
  mutate(pct_missing = round(num_null / total * 100,1)) %>%
  filter(isna == TRUE) 

loan_missing %>%
  ggplot() + 
  geom_col(aes(x = key, y = pct_missing), fill='steelblue') +
  geom_label(aes(x = key, y = pct_missing, label = paste0(pct_missing,"%"))) +
    scale_y_continuous(labels = function(x) paste0(x, "%")) +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  labs(title = 'Percent Missing by Variable', x = 'Variable', y = 'Percent Missing')
# remove records with null categorical values
loan_df <- loan_df %>% filter(!is.na(Married) & 
                                !is.na(Gender) & 
                                !is.na(Dependents) &
                                !is.na(Self_Employed) &
                                !is.na(Credit_History))

# impute numeric null values using MICE method
loan_df <- complete(mice(data = loan_df,
                         method = "pmm", print = FALSE), 3)
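# sanity check (illustrative, not in the original code): confirm that
# no missing values remain after filtering and imputation
stopifnot(!anyNA(loan_df))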
set.seed(525)
which_train <- sample(x = c(TRUE, FALSE), 
                      size = nrow(loan_df), 
                      replace = TRUE, 
                      prob = c(0.7, 0.3))

train_set <- loan_df[which_train, ]
test_set <- loan_df[!which_train, ]
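# optional check (illustrative): realized proportions of the 70/30 split
prop.table(table(which_train))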
set.seed(525)
output = Boruta(Loan_Status ~ ., data = train_set, doTrace = 0)
roughFixMod = TentativeRoughFix(output)
importance = attStats(roughFixMod)
importance = importance[importance$decision != 'Rejected', c('meanImp', 'decision')]
kable(head(importance[order(-importance$meanImp), ]), 
      caption = "Feature Importance of Loan Data Variables") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)
set.seed(525)
# estimate Box-Cox, centering, and scaling parameters on the training
# set only, then apply the same transform to both sets to avoid leakage
pre_proc <- train_set %>%
  select(c("ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term")) %>%
  preProcess(method = c("BoxCox", "center", "scale"))
train_set <- predict(pre_proc, train_set)
test_set <- predict(pre_proc, test_set)
set.seed(525)
dt_loan_model1 <- rpart(Loan_Status ~., 
                        data = train_set, 
                        control=rpart.control(cp = 0), 
                        method = 'class')
rpart.plot(dt_loan_model1)
plotcp(dt_loan_model1)
print(dt_loan_model1$cptable)
# evaluate the unpruned tree on the held-out test set
unseen_prediction <- predict(dt_loan_model1, test_set, type = 'class')
confusionMatrix(table(unseen_prediction, test_set$Loan_Status))
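# A standard pruning heuristic (illustrative, not used below) is to keep
# the cp value that minimizes the cross-validated error in the cptable
best_cp <- dt_loan_model1$cptable[which.min(dt_loan_model1$cptable[, "xerror"]), "CP"]
best_cp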
set.seed(525)
dt_improve <- rpart(Loan_Status ~.,
                    method="class",
                    data=train_set,
                    control=rpart.control(cp = 0.01),
                    parms=list(split='information'))
rpart.plot(dt_improve)

# train improved decision tree model
dt.model = train(Loan_Status ~.,
                 data=train_set,
                 method = "rpart",
                 metric = "Accuracy",
                 control = rpart.control(cp = 0.01),
                 trControl = trainControl(method = "repeatedcv",
                                          number = 10, 
                                          repeats = 3, 
                                          allowParallel = TRUE,
                                          classProbs = TRUE))
# evaluate the cross-validated tree on the held-out test set
unseen_prediction1 <- predict(dt.model, test_set, type = 'raw')
confusionMatrix(table(unseen_prediction1, test_set$Loan_Status))
set.seed(525)
ctree_model <- ctree(Loan_Status ~ ., data = train_set)
plot(ctree_model)
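# Random forest via the ranger backend (the ranger package must be
# installed for caret's method = "ranger"), tuning mtry from 2 to 10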
set.seed(525)
rf.model <- train(Loan_Status ~ .,
                  data = train_set,
                  method = "ranger",
                  metric = "Accuracy",
                  num.trees = 1000,
                  importance = "impurity",
                  tuneGrid = expand.grid(.mtry=c(2:10), .min.node.size=1, .splitrule="gini"),
                  trControl = trainControl(method = "repeatedcv", 
                                           number = 10, 
                                           repeats = 3, 
                                           allowParallel = TRUE, 
                                           classProbs = TRUE))
p1 = plot(rf.model, main = "Optimal mtry = 3, Accuracy ~ 79.9%")
p2 = plot(varImp(rf.model, scale = TRUE), main = "Feature Importance: RF Model")
gridExtra::grid.arrange(p1,p2, ncol = 2)
set.seed(525)
loan_rf_pred <- predict(rf.model, newdata = test_set)
stats = postResample(pred = loan_rf_pred, obs = test_set$Loan_Status)
(confusionMatrix(loan_rf_pred, test_set$Loan_Status, mode = "everything", positive = "Y"))
loan_rf_predictions <- cbind(test_set, loan_rf_pred = as.numeric(loan_rf_pred))
loan_rf_roc <- roc(loan_rf_predictions, Loan_Status, loan_rf_pred)
auc(loan_rf_roc)
plot.roc(loan_rf_roc, main = "AUC ~ 0.75")
set.seed(525)

# Casting loan status as {0,1} for caret purposes
train_set_gbm <- train_set %>% mutate(Loan_Status = ifelse(Loan_Status == "N", 0, 1))

# fit base GBM model
gbm.fit <- gbm(formula = Loan_Status ~.,
               distribution = "bernoulli",
               data = train_set_gbm,
               n.trees = 10000,
               interaction.depth = 1,
               shrinkage = 0.001,
               cv.folds = 5,
               n.cores = NULL, # will use all cores by default
               verbose = FALSE)  
# plot the CV error curve and capture the best iteration number
best_iter = gbm.perf(gbm.fit, method = "cv")
best_iter
# for reproducibility
set.seed(525)

# train GBM model
gbm.fit2 <- gbm(formula = Loan_Status ~ .,
                distribution = "bernoulli",
                data = train_set_gbm,
                n.trees = 3000,
                interaction.depth = 2,
                shrinkage = 0.01,
                cv.folds = 5,
                n.cores = NULL, # will use all cores by default
                verbose = FALSE)  
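# (illustrative) the tuned fit's CV-optimal iteration count can be
# inspected the same way as the base fit
gbm.perf(gbm.fit2, method = "cv")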
# for reproducibility
set.seed(525)

# create hyperparameter grid
hyper_grid <- expand.grid(n.trees = c(1:30),
                          shrinkage = c(.01, .1, .3),
                          interaction.depth = c(1, 3, 5),
                          n.minobsinnode = c(5, 10, 15))
# create training control
control <- trainControl(method = "repeatedcv", 
                        number = 10, 
                        repeats = 3, 
                        allowParallel = TRUE, 
                        classProbs = TRUE)

# train final GBM model
gbm.model <- train(Loan_Status~.,
                   data = train_set,
                   method = "gbm", 
                   metric = "Accuracy",
                   trControl = control, 
                   tuneGrid = hyper_grid,
                   verbose = FALSE)
stats = data.frame(Model = "GBM",
                   Accuracy = gbm.model$results$Accuracy[as.numeric(rownames(gbm.model$bestTune))],
                   Kappa = gbm.model$results$Kappa[as.numeric(rownames(gbm.model$bestTune))])

plot(varImp(gbm.model, scale = TRUE), main = "Feature Importance: GBM Model")
set.seed(525)
loan_gbm_pred <- predict(gbm.model, newdata = test_set)
stats = postResample(pred = loan_gbm_pred, obs = test_set$Loan_Status)
(confusionMatrix(loan_gbm_pred, test_set$Loan_Status, mode = "everything", positive = "Y"))
# Create table to hold model comparison statistics
compTable = data.frame(Models = c('Model 1: Decision Tree', 
                                  'Model 2: Random Forest', 
                                  'Model 3: Gradient Boosting'),
                       ACC = double(3),
                       F1 = double(3),
                       AUC = double(3))
# Model 1 Performance: Calculate accuracy statistics
pred.R = predict(dt.model, newdata = test_set, type = "raw")
pred.P = predict(dt.model, newdata = test_set, type = "prob")
m1CM = confusionMatrix(pred.R, test_set$Loan_Status, mode = "everything")
m1CM$table
compTable[1, 2] = sum(diag(m1CM$table)) / sum(m1CM$table)
compTable[1, 3] = m1CM$byClass[7]
compTable[1, 4] = auc(response = test_set$Loan_Status, predictor = pred.P[, 2])
# Model 2 Performance: Calculate accuracy statistics
pred.R = predict(rf.model, newdata = test_set, type = "raw")
pred.P = predict(rf.model, newdata = test_set, type = "prob")
m2CM = confusionMatrix(pred.R, test_set$Loan_Status, mode = "everything")
m2CM$table
compTable[2, 2] = sum(diag(m2CM$table)) / sum(m2CM$table)
compTable[2, 3] = m2CM$byClass[7]
compTable[2, 4] = auc(response = test_set$Loan_Status, predictor = pred.P[, 2])
# Model 3 Performance: Calculate accuracy statistics
pred.R = predict(gbm.model, newdata = test_set, type = "raw")
pred.P = predict(gbm.model, newdata = test_set, type = "prob")
m3CM = confusionMatrix(pred.R, test_set$Loan_Status, mode = "everything")
m3CM$table
compTable[3, 2] = sum(diag(m3CM$table)) / sum(m3CM$table)
compTable[3, 3] = m3CM$byClass[7]
compTable[3, 4] = auc(response = test_set$Loan_Status, predictor = pred.P[, 2])
# Display Model Comparison
kable(compTable, digits = 3L, caption = "Model Test Results") %>%
  kable_styling(bootstrap_options = "striped", full_width = TRUE)