Source Code: https://github.com/djlofland/DATA622_S2021_Group2/tree/master/Homework3
Please use the K-nearest neighbor (KNN) algorithm to predict the species variable. Please be sure to walk through the steps you took. (40 points)
Similar to past assignments when using the Palmer Penguins dataset, we’ll first do some quick exploratory analysis to examine the different features available.
We can see below that there are four continuous variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. Additionally, there are a few categorical variables: island, sex, and year.
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
For the continuous variables, we can examine the distributions, broken out by the target variable, species.
By separating our distributions by our target variable, species, we can see that many feature interactions show clustering between the Adelie and Chinstrap penguins (red and green), while Gentoo penguins separate clearly from the other two species for most feature pairs. This is also confirmed by most of the single-variable distributions split out by species in the plots below: except for bill_length_mm, the distributions of body_mass_g, bill_depth_mm, and flipper_length_mm overlap substantially between the Adelie and Chinstrap species.
Next, we’ll do some data tidying to get our data set ready for our KNN model. year reflects the date/time of recording and requires some additional thought. If we expected variability in penguin features to have a time-dependent component (for example, if food supplies vary year by year, if yearly temperature affects development, or if global warming or natural disasters impacted growth in specific years), then year could be important. In this situation, because we don’t have any reason to expect a time-dependent effect, year probably doesn’t add any explanatory power to our models, and we can remove it from our data set.
# remove year column
penguins <- penguins %>%
dplyr::select(-year)
Additionally, when exploring missing values, we can see the following:
# Plot missing values per feature
penguins %>%
summarise_all(~ sum(is.na(.))) %>%  # count NAs per column (funs() is deprecated)
pivot_longer(cols = 1:7, names_to = 'columns', values_to = 'NA_count') %>%
arrange(desc(NA_count)) %>%
ggplot(aes(y = columns, x = NA_count)) + geom_col(fill = 'deepskyblue4') +
geom_label(aes(label = NA_count)) +
theme_minimal() +
labs(title = 'Count of missing values in penguins dataset') +
theme(plot.title = element_text(hjust = 0.5))
# Use the naniar package to plot missing value patterns
gg_miss_upset(penguins)
We can see that 11 individuals have at least one missing data point. The missing data points are clustered - two penguin records are missing bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and sex. Nine penguin records are missing just the sex data point. Since we cannot do much with the two records missing most of their data, we will drop these records. For the records missing just sex, while we could try kNN imputation, we don’t know enough about gender differences to know if that’s a safe operation or whether it would introduce bias. Therefore, we’ll drop these from our data set as well.
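For reference only (not executed here), a minimal kNN-imputation sketch using the VIM package (the same package we use later for the loan data) might look like the following; we proceed by dropping the rows instead.
# sketch only: impute the missing sex values from the 5 nearest neighbors rather than dropping rows
penguins_imputed <- VIM::kNN(penguins, variable = "sex", k = 5, imp_var = FALSE)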
# Remove rows with missing values
penguins <- na.omit(penguins)
Looking at our dataset, we see that missing values are no longer an issue; however, we still have some additional cleanup to perform on our dataset below before we proceed with kNN modeling.
# Display data from summary
skim(penguins)
| Name | penguins |
|---|---|
| Number of rows | 333 |
| Number of columns | 7 |
| Column type frequency: | |
| factor | 3 |
| numeric | 4 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1 | FALSE | 3 | Ade: 146, Gen: 119, Chi: 68 |
| island | 0 | 1 | FALSE | 3 | Bis: 163, Dre: 123, Tor: 47 |
| sex | 0 | 1 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 0 | 1 | 43.99 | 5.47 | 32.1 | 39.5 | 44.5 | 48.6 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 0 | 1 | 17.16 | 1.97 | 13.1 | 15.6 | 17.3 | 18.7 | 21.5 | ▅▆▇▇▂ |
| flipper_length_mm | 0 | 1 | 200.97 | 14.02 | 172.0 | 190.0 | 197.0 | 213.0 | 231.0 | ▂▇▃▅▃ |
| body_mass_g | 0 | 1 | 4207.06 | 805.22 | 2700.0 | 3550.0 | 4050.0 | 4775.0 | 6300.0 | ▃▇▅▃▂ |
When setting up for modeling, we typically separate our features from our target to have an x and y. With this in mind, we remove species from the main dataframe and create a new dataframe, species_actual, to hold species. Note that the dataframes must maintain their same order, and we cannot do any operations that might alter dataframe length.
# New dataframe just holding the target
species_actual <- penguins %>%
dplyr::select(species)
# Remove the target column so we just have features in the dataframe
penguins <- penguins %>%
dplyr::select(-species)
Additionally, we can see below in each of our continuous variables’ distributions that the scales are inconsistent across features. For better model performance, we’ll want to standardize each of our features.
We normalize our continuous features using a simple z-score standardization.
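For reference, the z-score standardization computed by scale() with its default centering and scaling is:
\[ z = \frac{x - \bar{x}}{s} \]
where \(\bar{x}\) is the feature’s sample mean and \(s\) its sample standard deviation.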
# z-score scale continuous features
penguins[, c("bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g")] <- scale(penguins[, c("bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g")])As you can see below, after the z-score standardization, the scaling on the x and y-axis is now more consistent across our features.
# pair plot to inspect features after scaling
penguins %>%
dplyr::select(body_mass_g, bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
ggpairs()
Although this may help our KNN model, we’ll have to be careful about interpretability later on!
Finally, we’ll want to dummy code our categorical island and sex variables. This process creates new columns for each value and assigns a 0 or 1. Dummy encoding typically drops one value, which becomes the baseline: a categorical feature with five unique values yields four columns, and all zeros represent the reference value, which helps reduce the number of necessary columns. Note, however, that the dummy.code() function used below keeps a column for every island level (one-hot encoding), as seen in the output further down. With the dummy columns in place, we remove our original island and sex variables from this dataset.
# Dummy code the island feature - this creates a new column for each island (0 || 1)
island_dcode <- as.data.frame(dummy.code(penguins$island))
penguins <- cbind(penguins, island_dcode)
# remove the original island factor columns - we will rely on the new dummy columns
penguins <- penguins %>%
dplyr::select(-island)
# dummy code the sex feature - since there are only 2 states, we will only be adding a single column (0 || 1)
penguins$sex <- dummy.code(penguins$sex)
Our data set is ready for our KNN model.
# quick inspect of dataframe
kable(head(penguins)) %>%
kable_styling(bootstrap_options = "basic")
| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | Biscoe | Dream | Torgersen |
|---|---|---|---|---|---|---|---|---|
| 1 | -0.8946955 | 0.7795590 | -1.4246077 | -0.5676206 | 1 | 0 | 0 | 0 |
| 2 | -0.8215515 | 0.1194043 | -1.0678666 | -0.5055254 | 0 | 1 | 0 | 0 |
| 3 | -0.6752636 | 0.4240910 | -0.4257325 | -1.1885721 | 0 | 1 | 0 | 0 |
| 5 | -1.3335592 | 1.0842457 | -0.5684290 | -0.9401915 | 0 | 1 | 0 | 0 |
| 6 | -0.8581235 | 1.7444004 | -0.7824736 | -0.6918109 | 1 | 0 | 0 | 0 |
| 7 | -0.9312674 | 0.3225288 | -1.4246077 | -0.7228585 | 0 | 1 | 0 | 0 |
One problem during the modeling process is overfitting. In this situation, a model appears to have good predictive power, but it has “memorized” the training data and doesn’t generalize with new data. So, we train with one dataset and evaluate with a separate dataset not seen during training. It is common to randomly separate the original dataset into 80% / 20%, with 80% used for training. Since our target variable is in a separate dataframe, we apply the same separation to it as well.
# we use the fixed random seed for reproducibility
set.seed(123)
# use caret to split out training vs test dataframes
trainingRows <- createDataPartition(as.matrix(species_actual$species), p=0.8, list=FALSE)
train_penguins <- penguins[trainingRows,]
test_penguins <- penguins[-trainingRows,]
# also split out the target dataframes
species_actual_train <- species_actual$species[trainingRows]
species_actual_test <- species_actual$species[-trainingRows]
With our data split accordingly, we’ll need to identify the appropriate number for k. k represents the number of nearby neighbors used to determine the sample data point class. A rule of thumb has us start with the square root of the number of observations (i.e., rows or records). This varies depending on the number of records, and typically this rule of thumb leads to slightly higher than optimal values, but it’s a good starting point. Later we will use the caret package to help find the optimal value for k.
# initial k
sqrt(nrow(train_penguins))
## [1] 16.37071
We can see above that it is approximately 16. Therefore, we’ll perform our initial kNN model with \(k=16\).
set.seed(123)
# kNN with k = 16 neighbors
k16 <- knn(train_penguins,
test_penguins,
cl=species_actual_train,
k=16)
confusionMatrix(k16, species_actual_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 1 0
## Chinstrap 1 12 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9692
## 95% CI : (0.8932, 0.9963)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9516
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 0.9231 1.0000
## Specificity 0.9722 0.9808 1.0000
## Pos Pred Value 0.9655 0.9231 1.0000
## Neg Pred Value 0.9722 0.9808 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.1846 0.3538
## Detection Prevalence 0.4462 0.2000 0.3538
## Balanced Accuracy 0.9689 0.9519 1.0000
#misClassError <- mean(k16 != species_actual_test)
#paste0('The number of misclassified penguins with 16 neighbors is: ', misClassError)
#table(k16, species_actual_test)
Our resulting confusion matrix shows the kNN-predicted species compared to the actual species designations. We can see that \(k=16\) performed well on our test set, with an overall \(Accuracy=0.9692\) and very good Sensitivity and Specificity for each class, ranging between 0.9231 and 1.00. This tells us that there were very few false positives and negatives. While this model performed exceptionally well, our choice of \(k=16\) might be sub-optimal. Next, we will leverage caret::train() to help us find the optimal value for k through testing.
The built-in caret::train() function will run the KNN classification multiple times with different values for k, evaluate model performance, and experimentally determine the optimal value of k for our data.
# kNN with caret - autofit k
knn_caret <- train(train_penguins,
species_actual_train,
method = "knn",
preProcess = c("center", "scale"))
knn_caret
## k-Nearest Neighbors
##
## 268 samples
## 8 predictor
## 3 classes: 'Adelie', 'Chinstrap', 'Gentoo'
##
## Pre-processing: centered (7), scaled (7), ignore (1)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 268, 268, 268, 268, 268, 268, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9975302 0.9961114
## 7 0.9975104 0.9960614
## 9 0.9983326 0.9973580
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
From our output, we can see that a \(k=9\) was found to be the optimal value for our model based on the calculated Accuracy and Kappa values. The inter-rater reliability metric, Kappa, is quite important when working with unbalanced data sets. Unbalanced datasets have different counts for each class. As the count discrepancy between classes increases, our model can become biased by the larger represented class, leading to poorer predictions of the smaller classes. Given this, a \(k=9\) may ultimately perform better than our initial KNN model with \(k=16\), especially on new data.
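As a quick illustration (not part of the original output), Cohen’s Kappa can be computed by hand from the \(k=16\) confusion matrix above by comparing the observed accuracy with the accuracy expected by chance:
# illustration: Cohen's Kappa from the k = 16 confusion matrix
cm <- table(Predicted = k16, Actual = species_actual_test)
p_obs <- sum(diag(cm)) / sum(cm)                     # observed accuracy
p_exp <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2  # accuracy expected by chance
(p_obs - p_exp) / (1 - p_exp)                        # Kappa, ~0.9516 as reported above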
# plot accuracy vs # neighbors
plot(knn_caret)
We can see that a neighbors value of 9 clearly showed the highest accuracy value.
# apply model to holdout group
knn_caret_predictions <- predict(knn_caret, newdata = test_penguins)
# show confusion matrix and model performance KPIs
confusionMatrix(knn_caret_predictions, as.factor(species_actual_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 1 0
## Chinstrap 1 12 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9692
## 95% CI : (0.8932, 0.9963)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9516
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 0.9231 1.0000
## Specificity 0.9722 0.9808 1.0000
## Pos Pred Value 0.9655 0.9231 1.0000
## Neg Pred Value 0.9722 0.9808 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.1846 0.3538
## Detection Prevalence 0.4462 0.2000 0.3538
## Balanced Accuracy 0.9689 0.9519 1.0000
In the end, with the caret package identifying the “optimal” k-value, we can see that both our KNN models (\(k=16\) and \(k=9\)) performed very well on our test data set. Again, k indicates how many nearby data points are used to assign a sample data point to a class. Whether our neighboring data points are spread out or tightly clustered influences the optimal k. Large k values may lead to more misclassifications near borders between classes, while smaller k values may better identify pockets of data points when classes overlap or have non-linear boundaries.
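As an optional sanity check (not part of the original workflow), we could sweep over a range of odd k values and plot the resulting test-set error to see this trade-off directly:
# optional: test-set error across a range of odd k values
k_values <- seq(1, 25, by = 2)
test_error <- sapply(k_values, function(k) {
  pred <- knn(train_penguins, test_penguins, cl = species_actual_train, k = k)
  mean(pred != species_actual_test)
})
plot(k_values, test_error, type = "b", xlab = "k (neighbors)", ylab = "Test error rate")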
In this case, it appears that either of our models could be chosen to provide accurate predictions of the three penguin species.
Please use the attached dataset on loan approval status to predict loan approval using Decision Trees. Please be sure to conduct a thorough exploratory analysis to start the task and walk us through your reasoning behind all the steps you are taking.
First, we decided to read in the loan approval data set and take a look at its features:
# load CSV data set
loan <- read.csv("https://raw.githubusercontent.com/DaisyCai2019/NewData/master/Loan_approval.csv")
# display first few rows for inspection
kable(head(loan)) %>%
kable_styling(bootstrap_options = "basic")
| Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LP001002 | Male | No | 0 | Graduate | No | 5849 | 0 | NA | 360 | 1 | Urban | Y |
| LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508 | 128 | 360 | 1 | Rural | N |
| LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0 | 66 | 360 | 1 | Urban | Y |
| LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358 | 120 | 360 | 1 | Urban | Y |
| LP001008 | Male | No | 0 | Graduate | No | 6000 | 0 | 141 | 360 | 1 | Urban | Y |
| LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196 | 267 | 360 | 1 | Urban | Y |
As we can see from a glimpse of the data set above, the available features include a Loan_ID identifier, applicant demographics (Gender, Married, Dependents, Education, Self_Employed), financial details (ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History), Property_Area, and the target variable Loan_Status.
Now, we can run some exploratory data analysis to get a better sense of how to tidy and interpret these features. Here’s an initial summary of the dataset:
# print dataframe summary
skim(loan)
| Name | loan |
|---|---|
| Number of rows | 614 |
| Number of columns | 13 |
| Column type frequency: | |
| character | 8 |
| numeric | 5 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Loan_ID | 0 | 1 | 8 | 8 | 0 | 614 | 0 |
| Gender | 0 | 1 | 0 | 6 | 13 | 3 | 0 |
| Married | 0 | 1 | 0 | 3 | 3 | 3 | 0 |
| Dependents | 0 | 1 | 0 | 2 | 15 | 5 | 0 |
| Education | 0 | 1 | 8 | 12 | 0 | 2 | 0 |
| Self_Employed | 0 | 1 | 0 | 3 | 32 | 3 | 0 |
| Property_Area | 0 | 1 | 5 | 9 | 0 | 3 | 0 |
| Loan_Status | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ApplicantIncome | 0 | 1.00 | 5403.46 | 6109.04 | 150 | 2877.5 | 3812.5 | 5795.00 | 81000 | ▇▁▁▁▁ |
| CoapplicantIncome | 0 | 1.00 | 1621.25 | 2926.25 | 0 | 0.0 | 1188.5 | 2297.25 | 41667 | ▇▁▁▁▁ |
| LoanAmount | 22 | 0.96 | 146.41 | 85.59 | 9 | 100.0 | 128.0 | 168.00 | 700 | ▇▃▁▁▁ |
| Loan_Amount_Term | 14 | 0.98 | 342.00 | 65.12 | 12 | 360.0 | 360.0 | 360.00 | 480 | ▁▁▁▇▁ |
| Credit_History | 50 | 0.92 | 0.84 | 0.36 | 0 | 1.0 | 1.0 | 1.00 | 1 | ▂▁▁▁▇ |
There are several cleanup steps we will need to perform before we can start modeling with decision trees. We noticed a lot of blank values, which we will need to change to NAs. By coding these to NA, we can then impute them when cleaning up the other NAs. We used the naniar package to help with cleaning up blanks and NAs.
# replace blank values with NA
loan <- loan %>%
replace_with_na_all(condition = ~. == "")
Next, we explore missing values to understand any patterns. This will help us determine the best way to handle missing values. Options might include dropping records with missing data or imputing missing data.
loan %>%
summarise_all(~ sum(is.na(.))) %>%  # count NAs per column (funs() is deprecated)
pivot_longer(cols = 1:13, names_to = 'columns', values_to = 'NA_count') %>%
arrange(desc(NA_count)) %>%
ggplot(aes(y = columns, x = NA_count)) + geom_col(fill = 'deepskyblue4') +
geom_label(aes(label = NA_count)) +
theme_minimal() +
labs(title = 'Count of missing values in loan dataset') +
theme(plot.title = element_text(hjust = 0.5))
gg_miss_upset(loan)
This can be further visualized by the “Missingness Map”.
missmap(loan)
We can see a number of missing values (NAs) in our LoanAmount, Loan_Amount_Term, and Credit_History features. We will replace these in a later step.
Before progressing, we need to convert categorical features into factors. The factor data type is far more memory efficient and provides information about the unique values, or levels, within the feature. Most libraries will detect the factor data type and understand that the feature is a categorical type.
# convert categorical features to factor
loan <- loan %>%
mutate(Gender = as.factor(Gender),
Married = as.factor(Married),
Dependents = as.factor(Dependents),
Education = as.factor(Education),
Self_Employed = as.factor(Self_Employed),
Property_Area = as.factor(Property_Area),
Loan_Status = as.factor(Loan_Status))
summary(loan)
## Loan_ID Gender Married Dependents
## Length:614 Female:112 No :213 0 :345 Graduate :480
## Class :character Male :489 Yes :398 1 :102 Not Graduate:134
## Mode :character NA's : 13 NA's: 3 2 :101
## 3+ : 51
## NA's: 15
##
##
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## No :500 Min. : 150 Min. : 0 Min. : 9.0
## Yes : 82 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0
## NA's: 32 Median : 3812 Median : 1188 Median :128.0
## Mean : 5403 Mean : 1621 Mean :146.4
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0
## Max. :81000 Max. :41667 Max. :700.0
## NA's :22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 12 Min. :0.0000 Rural :179 N:192
## 1st Qu.:360 1st Qu.:1.0000 Semiurban:233 Y:422
## Median :360 Median :1.0000 Urban :202
## Mean :342 Mean :0.8422
## 3rd Qu.:360 3rd Qu.:1.0000
## Max. :480 Max. :1.0000
## NA's :14 NA's :50
Next, we explore the distributions of our continuous features: ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term. We color-coded Loan_Status, with Y in blue and N in red, to help visualize patterns between the target and each feature’s distribution.
We can see that ApplicantIncome, CoapplicantIncome and LoanAmount are highly right-skewed, with long right tails. Conversely, it looks like Loan_Amount_Term is highly left-skewed, with a long left tail.
Additionally, we can look at the categorical variables, broken out by our target variable of whether a loan was issued to the applicant. The green bars below show the loan approvals, and the red bars show the loan denials.
Based on the categorical variables split on loan approval status, we can see a few interesting things:
The number of loans applied for by males was significantly higher than loans applied for by females in this dataset.
The same pattern holds for education, where more loan applications came from graduates than from non-graduates.
We see a large class imbalance with the Self_Employed and Married features. In both cases, there are more applicants who are not self-employed and who are married.
There was a fairly even split in the number of applicants based on Property_Area, with applicants living in rural, semiurban, and urban settings roughly evenly distributed.
We can see that applicants who did not pass the credit history criteria were almost always denied a loan.
These factors will be interesting to examine and discuss as we build our decision tree model and random forest models. These class imbalances could affect model performance and interpretation.
We can use kNN imputation to help replace many of the missing values we identified above. With the kNN method, a categorical missing value is imputed by looking at other records with similar features. Once the k most similar records are found, they are used to infer the missing value. For missing numeric values, the values from the k most similar records are averaged (mean) to “predict” the missing value. Note this is only one way to handle numeric missing values; we could also do median imputation or mean imputation. It comes down to the nature of the data as to which technique makes the most sense.
# impute missing values with kNN, then keep only the original columns (dropping the indicator columns kNN() appends)
loan <- kNN(loan) %>%
subset(select = Loan_ID:Loan_Status)
Next, we double-check that this worked correctly by checking for missing values.
sapply(loan, function(x) sum(is.na(x)))
## Loan_ID Gender Married Dependents
## 0 0 0 0
## Education Self_Employed ApplicantIncome CoapplicantIncome
## 0 0 0 0
## LoanAmount Loan_Amount_Term Credit_History Property_Area
## 0 0 0 0
## Loan_Status
## 0
Fortunately, we can see that there are now no missing data points!
Next, we’ll have to work through a few transformations for our highly skewed continuous data. We use BoxCox to identify the optimal \(\lambda\) which informs the ideal transformation to try.
# calculate optimal lambda for BoxCox
BoxCox.lambda(loan$LoanAmount)
## [1] -0.1839035
BoxCox.lambda(loan$Loan_Amount_Term)
## [1] 1.999924
For LoanAmount, \(\lambda=-0.184\), and for Loan_Amount_Term, \(\lambda=2.00\). We round \(\lambda\) to the nearest integer. So, for LoanAmount, \(\lambda=0\), which tells us to do a log() transformation, while for Loan_Amount_Term, \(\lambda=2\), which tells us to do a \(Y^2\) transformation.
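For reference, these rounding rules come from the standard one-parameter Box-Cox family (a textbook definition, not part of the original write-up):
\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log(y), & \lambda = 0 \end{cases} \]
so a rounded \(\lambda = 0\) corresponds to a log transform, \(\lambda = 2\) to a shifted and scaled \(Y^2\), and \(\lambda = -0.5\) to a transform equivalent to \(\frac{1}{\sqrt{Y}}\).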
# apply a log transform
loan$LoanAmount_Xform <- log(loan$LoanAmount)
# apply a square transformation
loan$Loan_Amount_Term_Xform <- loan$Loan_Amount_Term^2
Banks generally don’t want loan payments to exceed 25% of an applicant’s income. Most banks will add the co-applicant income to the applicant’s income and treat the sum in their decision criteria. For this reason, we combine ApplicantIncome and CoapplicantIncome into a single Income feature and drop the separate ApplicantIncome and CoapplicantIncome.
# combine applicant and co-applicant incomes and drop the original separate features
loan$Income <- loan$ApplicantIncome + loan$CoapplicantIncome
loan$ApplicantIncome <- NULL
loan$CoapplicantIncome <- NULL
As above, Income is skewed, so we will want to transform it. Let’s calculate the Box-Cox lambda to inform how to transform it.
# find the optimal lBoxCox lambda for Income
BoxCox.lambda(loan$Income)
## [1] -0.5592006
For Income, \(\lambda \approx -0.5\), which tells us to try a \(\frac{1}{\sqrt{Y}}\) transform.
# transform Income
loan$Income_Xform <- sqrt(loan$Income) ^ (-1)
When checking the new distribution of our Income variable, we can see that it is much more normally distributed:
With the exploratory data analysis behind us, and our data cleaned up, we are now ready to run our decision tree!
Similar to our KNN model process above, we’ll determine an 80/20 split for our testing and training datasets:
set.seed(123)
trainingRows <- createDataPartition(as.matrix(loan$Loan_Status), p=0.8, list=FALSE)
train_loan <- loan[trainingRows,]
test_loan <- loan[-trainingRows,]
With the data split, we can now run our first decision tree:
tree1 <- rpart(Loan_Status ~ Gender + Married + Dependents + Education +
Self_Employed + Income_Xform + LoanAmount_Xform + Loan_Amount_Term_Xform +
Credit_History + Property_Area,
data=train_loan)
rpart.plot(tree1, nn=TRUE)
summary(tree1)
## Call:
## rpart(formula = Loan_Status ~ Gender + Married + Dependents +
## Education + Self_Employed + Income_Xform + LoanAmount_Xform +
## Loan_Amount_Term_Xform + Credit_History + Property_Area,
## data = train_loan)
## n= 492
##
## CP nsplit rel error xerror xstd
## 1 0.37662338 0 1.0000000 1.0000000 0.06679061
## 2 0.03246753 1 0.6233766 0.6233766 0.05707947
## 3 0.01948052 2 0.5909091 0.6233766 0.05707947
## 4 0.01000000 3 0.5714286 0.6233766 0.05707947
##
## Variable importance
## Credit_History Income_Xform LoanAmount_Xform
## 84 16 1
##
## Node number 1: 492 observations, complexity param=0.3766234
## predicted class=Y expected loss=0.3130081 P(node) =1
## class counts: 154 338
## probabilities: 0.313 0.687
## left son=2 (70 obs) right son=3 (422 obs)
## Primary splits:
## Credit_History < 0.5 to the left, improve=59.010690, (0 missing)
## Income_Xform < 0.02049156 to the right, improve= 7.590926, (0 missing)
## Property_Area splits as LRL, improve= 7.068090, (0 missing)
## Loan_Amount_Term_Xform < 180000 to the right, improve= 2.441737, (0 missing)
## Education splits as RL, improve= 1.729095, (0 missing)
## Surrogate splits:
## Income_Xform < 0.005074514 to the left, agree=0.86, adj=0.014, (0 split)
##
## Node number 2: 70 observations
## predicted class=N expected loss=0.08571429 P(node) =0.1422764
## class counts: 64 6
## probabilities: 0.914 0.086
##
## Node number 3: 422 observations, complexity param=0.03246753
## predicted class=Y expected loss=0.2132701 P(node) =0.8577236
## class counts: 90 332
## probabilities: 0.213 0.787
## left son=6 (9 obs) right son=7 (413 obs)
## Primary splits:
## Income_Xform < 0.02049156 to the right, improve=5.861038, (0 missing)
## Property_Area splits as LRL, improve=3.255450, (0 missing)
## LoanAmount_Xform < 5.093731 to the right, improve=2.353409, (0 missing)
## Married splits as LR, improve=1.943941, (0 missing)
## Loan_Amount_Term_Xform < 180000 to the right, improve=1.684190, (0 missing)
##
## Node number 6: 9 observations
## predicted class=N expected loss=0.2222222 P(node) =0.01829268
## class counts: 7 2
## probabilities: 0.778 0.222
##
## Node number 7: 413 observations, complexity param=0.01948052
## predicted class=Y expected loss=0.2009685 P(node) =0.8394309
## class counts: 83 330
## probabilities: 0.201 0.799
## left son=14 (11 obs) right son=15 (402 obs)
## Primary splits:
## Income_Xform < 0.007097162 to the left, improve=4.284634, (0 missing)
## LoanAmount_Xform < 5.093731 to the right, improve=3.068245, (0 missing)
## Property_Area splits as LRL, improve=2.162517, (0 missing)
## Loan_Amount_Term_Xform < 180000 to the right, improve=1.832774, (0 missing)
## Married splits as LR, improve=1.616488, (0 missing)
## Surrogate splits:
## LoanAmount_Xform < 6.280122 to the right, agree=0.976, adj=0.091, (0 split)
##
## Node number 14: 11 observations
## predicted class=N expected loss=0.3636364 P(node) =0.02235772
## class counts: 7 4
## probabilities: 0.636 0.364
##
## Node number 15: 402 observations
## predicted class=Y expected loss=0.1890547 P(node) =0.8170732
## class counts: 76 326
## probabilities: 0.189 0.811
From our decision tree summary and plot, we can see that a few variables, namely Credit_History, Income_Xform, and LoanAmount_Xform, were the most important features in this classifier. We can use this decision tree to make predictions on our holdout test set.
# predict on the holdout test set (rpart.control() settings apply when fitting, not when predicting)
loanPre <- predict(tree1,
test_loan,
type="class")
confusionMatrix(loanPre, test_loan$Loan_Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 22 6
## Y 16 78
##
## Accuracy : 0.8197
## 95% CI : (0.7398, 0.8834)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 0.000775
##
## Kappa : 0.5469
##
## Mcnemar's Test P-Value : 0.055009
##
## Sensitivity : 0.5789
## Specificity : 0.9286
## Pos Pred Value : 0.7857
## Neg Pred Value : 0.8298
## Prevalence : 0.3115
## Detection Rate : 0.1803
## Detection Prevalence : 0.2295
## Balanced Accuracy : 0.7538
##
## 'Positive' Class : N
##
As we can see from our predictions and the confusion matrix, we were able to obtain about 82% accuracy using this decision tree. To achieve higher accuracy, we’ll use the prune() function in the rpart package to examine a predicted optimal tree size.
plotcp(tree1)
By plotting the cross-validated error against the complexity parameter, we can see that the cross-validated error reaches its minimum at a tree size of 2, with no improvement from additional splits.
tree1$cptable
## CP nsplit rel error xerror xstd
## 1 0.37662338 0 1.0000000 1.0000000 0.06679061
## 2 0.03246753 1 0.6233766 0.6233766 0.05707947
## 3 0.01948052 2 0.5909091 0.6233766 0.05707947
## 4 0.01000000 3 0.5714286 0.6233766 0.05707947
We want to minimize xerror; in row 2 (a tree of size 2), we see \(xerror=0.6234\), and additional splits do not reduce it further.
Pruning at the corresponding cp value of 0.03246753 yields a tree of size 2. When we prune our initial tree with this value, we can see the resulting decision tree below.
# prune tree
tree1.pruned <- prune(tree1, cp=0.03246753)
rpart.plot(tree1.pruned,
extra = 104,
main = "Decision Tree")Interestingly, from this pruning process, the Credit_History feature alone seems to do a reasonably good job of classifying applicants into the approval vs. disapproval status. The accuracy is about 86% on our test set, which is similar to our initial tree.
However, it’s important to think critically about whether this pruned tree would perform better on another dataset and is applicable for real-life events.
# predict holdout data
tree1.pruned.pred <- predict(tree1.pruned,
test_loan,
type = "class" )
# show model performance KPIs
confusionMatrix(tree1.pruned.pred, test_loan$Loan_Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 22 2
## Y 16 82
##
## Accuracy : 0.8525
## 95% CI : (0.7769, 0.9102)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 2.502e-05
##
## Kappa : 0.6174
##
## Mcnemar's Test P-Value : 0.002183
##
## Sensitivity : 0.5789
## Specificity : 0.9762
## Pos Pred Value : 0.9167
## Neg Pred Value : 0.8367
## Prevalence : 0.3115
## Detection Rate : 0.1803
## Detection Prevalence : 0.1967
## Balanced Accuracy : 0.7776
##
## 'Positive' Class : N
##
While a single feature offers reasonably good accuracy, other features may carry mitigating information that would be lost, and could change a loan decision, if we relied on the first feature alone. While decision trees are simple to construct and understand, their weakness is that they follow a single greedy sequence of splits, so once an early split is made, downstream features have limited opportunity to alter the prediction.
Using the same dataset on Loan Approval Status, please use Random Forests to predict loan approval status. Again, please be sure to walk us through the steps you took to get to your final model. (50 points)
To expand on our analysis of the loan approval dataset, we’ll run a few random forest models to see if they are a more effective way to classify whether someone will be accepted or rejected for a loan.
In order to run a random forest model, we’ll use the randomForest() function in the randomForest package. We’ll include the same initial features that we did for our first decision tree:
set.seed(123)
# fit a Random Forest model with training data
fit.forest <- randomForest(Loan_Status ~ Gender + Married + Dependents +
Education + Self_Employed + Income_Xform +
LoanAmount_Xform + Loan_Amount_Term_Xform +
Credit_History + Property_Area,
data=train_loan)
# display model details
fit.forest
##
## Call:
## randomForest(formula = Loan_Status ~ Gender + Married + Dependents + Education + Self_Employed + Income_Xform + LoanAmount_Xform + Loan_Amount_Term_Xform + Credit_History + Property_Area, data = train_loan)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 20.33%
## Confusion matrix:
## N Y class.error
## N 73 81 0.52597403
## Y 19 319 0.05621302
# show which features were most important
importance(fit.forest)
## MeanDecreaseGini
## Gender 4.295290
## Married 4.815481
## Dependents 10.784519
## Education 4.104093
## Self_Employed 3.517598
## Income_Xform 46.900023
## LoanAmount_Xform 38.372816
## Loan_Amount_Term_Xform 9.644244
## Credit_History 52.770213
## Property_Area 11.332043
After fitting our model, we can now use it to create predictions on our holdout test set to evaluate its performance.
# run predictions with holdout test data
forest.pred <- predict(fit.forest,
newdata = test_loan)
# show model performance KPIs
(dtree.cm_train <- confusionMatrix(forest.pred, test_loan$Loan_Status))
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 21 5
## Y 17 79
##
## Accuracy : 0.8197
## 95% CI : (0.7398, 0.8834)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 0.000775
##
## Kappa : 0.5398
##
## Mcnemar's Test P-Value : 0.019016
##
## Sensitivity : 0.5526
## Specificity : 0.9405
## Pos Pred Value : 0.8077
## Neg Pred Value : 0.8229
## Prevalence : 0.3115
## Detection Rate : 0.1721
## Detection Prevalence : 0.2131
## Balanced Accuracy : 0.7466
##
## 'Positive' Class : N
##
Here, we notice a decrease in Accuracy (\(Accuracy=0.8197\)) compared with our pruned Decision Tree (\(Accuracy=0.8525\)), matching the accuracy of our initial, unpruned tree. Notice the disparity between Specificity and Sensitivity: this model has a high false-positive rate and tends to over-predict Y when it should have chosen N.
Next, we want to see whether we have generated enough trees so that the out-of-bag (OOB) error rate is at a minimum. From the plot below, we see that the OOB error rate decreases rapidly over the first 1-20 trees and stabilizes around 100 trees.
plot(fit.forest,
col = c("black", "red", "dark green"),
main = "Predicted Loan Error Rates")
legend("topright", colnames(fit.forest$err.rate), col = 1:6, cex = 0.8, fill = 1:6)Tto see whether the Random forest model can be improved, we will run the same model, but this time we will use the tuneRF function. We provide the features (x), target (y), and number of trees used at each tuning step (ntreeTry).
tuneRF() starts from the default mtry value and steps it up and down, keeping any change that sufficiently improves the OOB error, until no further improvement is found.
set.seed(123)
# run RF tuning
tree_var <- tuneRF(x = subset(train_loan, select = -Loan_Status),
y = train_loan$Loan_Status,
ntreeTry = 1000)
## mtry = 3 OOB error = 20.33%
## Searching left ...
## mtry = 2 OOB error = 19.92%
## 0.02 0.05
## Searching right ...
## mtry = 6 OOB error = 19.72%
## 0.03 0.05
We rerun the Random Forest model with \(ntree = 100\) (the OOB error stabilized by roughly 100 trees in the plot above) and \(mtry = 6\) (the value with the lowest OOB error found by tuneRF()).
set.seed(123)
# pull out optimal value from previous tuning step
val_opt <- tree_var [,"mtry"][which.min(tree_var [,"OOBError"])]
# fit a new RF model with better parameters
fit.forest2 <- randomForest(Loan_Status ~ Gender + Married + Dependents +
Education + Self_Employed + Income_Xform +
LoanAmount_Xform + Loan_Amount_Term_Xform +
Credit_History + Property_Area,
data = train_loan,
importance = TRUE,
ntree = 100,
mtry = 6)
# display model details
fit.forest2
##
## Call:
## randomForest(formula = Loan_Status ~ Gender + Married + Dependents + Education + Self_Employed + Income_Xform + LoanAmount_Xform + Loan_Amount_Term_Xform + Credit_History + Property_Area, data = train_loan, importance = TRUE, ntree = 100, mtry = 6)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 21.14%
## Confusion matrix:
## N Y class.error
## N 81 73 0.47402597
## Y 31 307 0.09171598
The results of the Random Forest model on our training set yielded an out of bag error rate of 21.14%, which is higher than the original model’s OOB error of 20.33%, but still very close. Tuning in this case didn’t seem to make much difference.
# predict using holdout data
forest.pred2 <- predict(fit.forest2,
newdata = test_loan)
# show model performance KPIs
(forest.cm_train <- confusionMatrix(forest.pred2, test_loan$Loan_Status))
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 22 9
## Y 16 75
##
## Accuracy : 0.7951
## 95% CI : (0.7125, 0.8628)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 0.00585
##
## Kappa : 0.4969
##
## Mcnemar's Test P-Value : 0.23014
##
## Sensitivity : 0.5789
## Specificity : 0.8929
## Pos Pred Value : 0.7097
## Neg Pred Value : 0.8242
## Prevalence : 0.3115
## Detection Rate : 0.1803
## Detection Prevalence : 0.2541
## Balanced Accuracy : 0.7359
##
## 'Positive' Class : N
##
When we use our second Random Forest model to make predictions on our test set, we notice slightly lower accuracy (\(Accuracy=0.7951\)) compared to our untuned RF model (\(Accuracy=0.8197\)).
Next, we again check whether we have generated enough trees so that the OOB error rate is at a minimum. From the plot below, we see that the OOB error rate decreases over the first 1-20 trees, increases a little between 20 and 50 trees, and stabilizes at around 60 trees.
plot(fit.forest2,
col = c("black", "red", "dark green"),
main = "Predicted Loan Error Rates")
legend("topright", colnames(fit.forest2$err.rate), col = 1:6, cex = 0.8, fill = 1:6)Using the Loan Approval Status data, please use Gradient Boosting to predict the loan approval status. Please use whatever boosting approach you deem appropriate; but please be sure to walk us through your steps. (50 points)
gbm uses a default of 100 trees, which is rarely sufficient, so we increase it to 1000 trees (n.trees). The default depth of each tree (interaction.depth) is 1, which means we are ensembling many stumps. We use distribution = "gaussian", which treats the two-level Loan_Status outcome as a numeric 1/2 coding; we will threshold the numeric predictions later to recover class labels. Lastly, we include cv.folds to perform a 3-fold cross-validation.
set.seed(123)
# build gradient boosted model
boost <- gbm(Loan_Status ~ Gender + Married + Dependents +
Education + Self_Employed + Income_Xform +
LoanAmount_Xform + Loan_Amount_Term_Xform +
Credit_History + Property_Area,
data=train_loan[,-1],
distribution = "gaussian",
n.trees = 1000,
cv.folds = 3)
# display model details
boost
## gbm(formula = Loan_Status ~ Gender + Married + Dependents + Education +
## Self_Employed + Income_Xform + LoanAmount_Xform + Loan_Amount_Term_Xform +
## Credit_History + Property_Area, distribution = "gaussian",
## data = train_loan[, -1], n.trees = 1000, cv.folds = 3)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## The best cross-validation iteration was 62.
## There were 10 predictors of which 8 had non-zero influence.
sqrt(min(boost$cv.error))
## [1] 0.3913306
summary(boost)
## var rel.inf
## Income_Xform Income_Xform 35.6412932
## LoanAmount_Xform LoanAmount_Xform 26.9510994
## Credit_History Credit_History 14.5858857
## Dependents Dependents 7.4811920
## Loan_Amount_Term_Xform Loan_Amount_Term_Xform 5.2503899
## Property_Area Property_Area 4.5494508
## Self_Employed Self_Employed 2.0068665
## Married Married 1.3407906
## Education Education 1.2198756
## Gender Gender 0.9731563
The best cross-validation iteration was 62, and the cross-validated RMSE is 0.3913306.
The summary() output creates a new data set with var, a factor listing the variables in our model, and rel.inf, the relative influence each variable had on our model predictions. From the table, we can see Income_Xform, LoanAmount_Xform, and Credit_History are the top three most important variables for our model. Notice that the feature importance differs from what we saw in our decision tree, which illustrates how downstream features can still be important. Random forests and gradient boosting both build ensembles of many trees, giving less dominant features repeated chances to contribute.
We can tune parameters to see if we can improve the performance. We increase the number of trees (n.trees) and perform a 5-fold cross-validation (cv.folds).
set.seed(123)
# train gb model with different parameters
boost2 <- gbm(Loan_Status ~ Gender + Married + Dependents +
Education + Self_Employed + Income_Xform +
LoanAmount_Xform + Loan_Amount_Term_Xform +
Credit_History + Property_Area,
data = train_loan[,-1],
distribution = "gaussian",
n.trees = 2000,
cv.folds = 5)
# display performance KPIs
sqrt(min(boost2$cv.error))
## [1] 0.3886695
summary(boost2)
## var rel.inf
## Income_Xform Income_Xform 36.755530
## LoanAmount_Xform LoanAmount_Xform 30.036793
## Credit_History Credit_History 9.148464
## Dependents Dependents 7.747582
## Loan_Amount_Term_Xform Loan_Amount_Term_Xform 5.246914
## Property_Area Property_Area 5.235610
## Married Married 1.703130
## Education Education 1.631745
## Self_Employed Self_Employed 1.340918
## Gender Gender 1.153315
We can see that the relative influence of some of the lower-ranked features changed, and the cross-validated RMSE dropped to 0.3886695, slightly lower than our previous model.
# plot loss function as a result of n trees added to the ensemble
gbm.perf(boost2, method = "cv")## [1] 50
gbm.perf(boost2, method = "OOB")## [1] 40
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4,
## length(x)/10), 50))
##
## Number of Observations: 2000
## Equivalent Number of Parameters: 39.99
## Residual Standard Error: 0.0003087
The plots indicate the optimum number of trees for each estimation method. The green line shows the error on the held-out data, and the blue dotted line marks the optimum number of iterations. We can observe that beyond a certain point (50 iterations for the cv method and 40 for the OOB method), the held-out error begins to increase because of overfitting.
We use the second model to predict on our holdout data and, following the cv method, use 50 trees.
# predict based on holdout test data with tuned parameters
boostPre <- predict.gbm(boost2,
test_loan,
n.trees = 50)
boostPre <- ifelse(boostPre < 1.5, 'N', 'Y')
boostPre <- as.factor(boostPre)
(boost.cm <- confusionMatrix(boostPre, test_loan$Loan_Status))
## Confusion Matrix and Statistics
##
## Reference
## Prediction N Y
## N 22 2
## Y 16 82
##
## Accuracy : 0.8525
## 95% CI : (0.7769, 0.9102)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 2.502e-05
##
## Kappa : 0.6174
##
## Mcnemar's Test P-Value : 0.002183
##
## Sensitivity : 0.5789
## Specificity : 0.9762
## Pos Pred Value : 0.9167
## Neg Pred Value : 0.8367
## Prevalence : 0.3115
## Detection Rate : 0.1803
## Detection Prevalence : 0.1967
## Balanced Accuracy : 0.7776
##
## 'Positive' Class : N
##
According to the confusion matrix, our model Accuracy is 0.8525. Looking at Sensitivity and Specificity, this model has a far lower false-negative rate and is less likely to pick ‘N’ when it shouldn’t. However, its false-positive rate remains higher: it still selects ‘Y’ in a number of cases where it should have predicted ‘N’.
Model performance: please compare the models you settled on for problems # 2–4. Comment on their relative performance. Which one would you prefer the most? Why? (20 points)
After running various decision trees, random forest models, and gradient boosting methods, we can take a look at the overall evaluation metrics for these three techniques on the loan approval dataset. By creating a dataframe to store all of our metrics, we can visualize the outcomes below:
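The construction of the eval dataframe is not shown above; a minimal sketch, assuming the three confusionMatrix objects were saved (dtree.cm is a hypothetical name, since the decision tree's confusionMatrix() above was not assigned, while forest.cm_train and boost.cm were created earlier), might look like:
# sketch: collect caret::confusionMatrix metrics for each model into one dataframe
collect_metrics <- function(cm) {
  data.frame(Accuracy = unname(cm$overall["Accuracy"]),
             `Classification Error Rate` = 1 - unname(cm$overall["Accuracy"]),
             Sensitivity = unname(cm$byClass["Sensitivity"]),
             Specificity = unname(cm$byClass["Specificity"]),
             Precision = unname(cm$byClass["Pos Pred Value"]),
             Recall = unname(cm$byClass["Sensitivity"]),
             F1 = unname(cm$byClass["F1"]),
             check.names = FALSE)
}
eval <- rbind(collect_metrics(dtree.cm),        # hypothetical: saved decision tree confusion matrix
              collect_metrics(forest.cm_train), # random forest (created above)
              collect_metrics(boost.cm))        # gradient boosting (created above)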
eval <- dplyr::select(eval, Accuracy, `Classification Error Rate`, Sensitivity, Specificity, Precision, Recall, F1)
rownames(eval) = c("Decision Tree", "Random Forest", "Gradient Boosting")
t_eval <- t(eval)
colnames(t_eval) <- rownames(eval)
rownames(t_eval) <- colnames(eval)
knitr::kable(t_eval)
| | Decision Tree | Random Forest | Gradient Boosting |
|---|---|---|---|
| Accuracy | 0.8196721 | 0.7950820 | 0.8524590 |
| Classification Error Rate | 0.1803279 | 0.2049180 | 0.1475410 |
| Sensitivity | 0.5526316 | 0.5789474 | 0.5789474 |
| Specificity | 0.9404762 | 0.8928571 | 0.9761905 |
| Precision | 0.8076923 | 0.7096774 | 0.9166667 |
| Recall | 0.5526316 | 0.5789474 | 0.5789474 |
| F1 | 0.6562500 | 0.6376812 | 0.7096774 |
When comparing our three different techniques, we can see that our Gradient Boosting method performed the best and Random Forest performed the worst. That said, Decision Tree and Random Forest had mixed results when we consider false positives and false negatives: RF performed worst overall but had better Sensitivity than the Decision Tree. With lower overall accuracy, a higher classification error rate, and worse performance on measures such as F1, Precision, and Specificity, we would rule out Random Forest. Additionally, the pruned decision tree effectively relies on a single feature (Credit_History); such a model isn't as flexible if applied to a different dataset and doesn't allow for cases where a loan decision should hinge on other criteria.
When considering model performance, accuracy is a good starting point, but false positives versus false negatives can be quite significant from a business perspective. If the bank incorrectly predicts it shouldn’t give a loan, but the applicant would have paid it back (false negative), the bank loses out on potential revenue. If the bank incorrectly gives a loan and the applicant defaults (false positive), the bank loses money on the default. In the end, the bank will need to assign a cost function to both of these situations and determine which has a more negative impact on business. We see high Specificity, between 89% and 98%, across all three models, which indicates low false-negative rates. These models do a pretty good job of giving loans to applicants who will repay, so there should be little opportunity cost on that side. However, we see low Sensitivity, between 55% and 58%, i.e., high numbers of false positives. These models would suggest giving loans to applicants who might default. Of the two errors, the second is probably far more expensive. As a bank, we would probably care more about maximizing Sensitivity along with Accuracy. With that in mind, from a business perspective, the Gradient Boosting model ends up being a better compromise of both Accuracy and Sensitivity.
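To make that trade-off concrete, a small illustration with assumed (hypothetical) unit costs, using the t_eval metrics computed above, might look like:
# hypothetical illustration: expected cost per applicant under assumed unit costs
cost_default <- 5    # assumed cost of approving an applicant who defaults (false positive)
cost_lost_loan <- 1  # assumed cost of denying an applicant who would have repaid (false negative)
prev <- 0.3115       # P(actual = 'N'), the prevalence reported in the confusion matrices above
expected_cost <- prev * (1 - t_eval["Sensitivity", ]) * cost_default +
  (1 - prev) * (1 - t_eval["Specificity", ]) * cost_lost_loan
round(expected_cost, 3)  # lower is better; Gradient Boosting wins under these assumed costs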