This document, prepared by Group 6, discusses analyses of two datasets: the Palmer Penguin dataset and a Loan Approvals dataset. We divide the document into seven sections and adopt two key principles in undertaking this analysis:
First, the group has developed a system of checks and balances in preparing each model's output. A primary and a secondary author cross-validate each model (pun intended) by independently coding and reconciling outputs. The two remaining co-authors bring a wider perspective to the discussion and diagnostics. Afterwards, the primary author drafts the text for inclusion in this document.
Second, because four of the five modeling sections deal with the Loan Approvals data, we adopt a common dataset for all model analyses. Data wrangling and preparation are done only once. Keeping the data consistent across those four sections is essential for comparing model performance.
Section 1 analyzes the Palmer Penguin dataset using the KNN model to predict species. The authors are Alexander Ng (primary) and Randy Thompson (secondary).
Section 2 conducts an exploratory data analysis and defines the common dataset for modeling. The author is Randy Thompson (primary), with contributions from all other members.
Section 3 analyzes the Loan Approvals data with a decision tree model. The authors are Randy Thompson (primary) and Philip Tanofsky (secondary).
Section 4 analyzes the same Loan Approvals with a Random Forest model. The authors are Scott Reed (primary) and Alexander Ng (secondary).
Section 5 analyzes the same Loan Approvals with a Gradient Boosting approach. The authors are Philip Tanofsky (primary) and Scott Reed (secondary).
Section 6 concludes with a discussion of model performance and an appraisal of the merits of each result. We also consider the role of model-driven prediction in a loan approvals business context.
Section 7 presents our R code, technical appendices, and references.
Following the data cleaning approach taken in prior assignments, we exclude observations where the feature sex is undefined. We also exclude the two variables island and year, which do not appear helpful for prediction. We follow the variable selection process described in a prior assignment here.
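A minimal sketch of this cleaning step (the full code appears in the appendix):

```r
library(palmerpenguins)
library(dplyr)

# Drop rows with missing sex and remove the island and year columns.
pc <- penguins %>%
  filter(!is.na(sex)) %>%
  dplyr::select(-island, -year)
```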
This still leaves a substantial subset: \(333\) of the \(344\) original observations. The table below shows sample rows of the cleansed dataset used in the subsequent KNN analysis.
| species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|
| Adelie | 39.1 | 18.7 | 181 | 3750 | male |
| Adelie | 39.5 | 17.4 | 186 | 3800 | female |
| Adelie | 40.3 | 18.0 | 195 | 3250 | female |
| Adelie | 36.7 | 19.3 | 193 | 3450 | female |
| Adelie | 39.3 | 20.6 | 190 | 3650 | male |
| Adelie | 38.9 | 17.8 | 181 | 3625 | female |
We obtain 333 observations with 6 columns of explanatory and response variables.
The next step is to normalize the data so that each quantitative variable has mean zero and standard deviation one. Normalization ensures that each variable has equal influence on the nearest-neighbor calculation; otherwise, a quantitative variable with a much larger scale than the others would dominate the Euclidean distance used by the KNN model.
However, we also need to include the sex variable as a feature in the KNN model. Categorical variables pose a challenge in the KNN context because they must be mapped to numerical equivalents. In the case of this dataset, two facts are key: sex is a binary variable, and the sample is nearly gender balanced.
For these reasons, we do not normalize sex but use the mapping function \(f\) defined as:
\[\begin{equation} f(\text{sex}) = \begin{cases} 1 & \text{sex} = \text{male} \\ -1 & \text{sex} = \text{female} \end{cases} \end{equation}\] Because the dataset is gender balanced, \(E[f(\text{sex})] \approx 0\) and \(\mathrm{Stdev}(f(\text{sex})) \approx 1\); hence \(f(\text{sex})\) is nearly normalized. With these transformations, the KNN dataset is ready for model evaluation. We illustrate the first few rows of the training dataset below. Note that the sampled indices appear in the leftmost column as the row values of the resulting dataframe.
| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|
| 1 | -0.89 | 0.78 | -1.42 | -0.57 | 1 |
| 3 | -0.68 | 0.42 | -0.43 | -1.19 | -1 |
| 4 | -1.33 | 1.08 | -0.57 | -0.94 | -1 |
| 7 | -0.88 | 1.24 | -0.43 | 0.58 | 1 |
| 8 | -0.53 | 0.22 | -1.35 | -1.25 | -1 |
| 9 | -0.99 | 2.05 | -0.71 | -0.51 | 1 |
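A condensed sketch of this preparation, mirroring the code in the appendix:

```r
# Standardize the four quantitative predictors and map sex to {1, -1}.
# data.table::fifelse() requires the data.table package to be installed.
standardized <- cbind(pc[, 1],                 # species (response)
                      scale(pc[, 2:5]),        # quantitative predictors
                      sex = data.table::fifelse(pc$sex == "male", 1, -1))
```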
We use caret to test a range of possible values of \(k\) and select the optimal \(k\) by minimizing the cross-validation error. Since the pre-processing was already done with scale, there is no need to invoke further data transformations through caret's pre-processing arguments.
- trainControl defines the training parameters.
- tuneLength sets the number of candidate \(k\) values to evaluate against the model.
- We choose 5-fold cross-validation so that each held-out fold is sufficiently large.
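The tuning call, sketched from the appendix code:

```r
library(caret)

# 5-fold cross-validation over 10 candidate values of k;
# a seed should be set beforehand for reproducible folds.
model <- train(species ~ ., data = standardized, method = "knn",
               trControl = trainControl("cv", number = 5),
               tuneLength = 10)
plot(model)  # cross-validation accuracy vs. k
```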
The model plot above shows the average cross-validation accuracy at each value of \(k\) used in the KNN model; the best accuracy and kappa occur at \(k=5\) nearest neighbors. This is also evident in the cross-validation model output below.
## k-Nearest Neighbors
##
## 333 samples
## 5 predictor
## 3 classes: 'Adelie', 'Chinstrap', 'Gentoo'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 267, 266, 268, 265, 266
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9940737 0.9906842
## 7 0.9880556 0.9811323
## 9 0.9881914 0.9812667
## 11 0.9881914 0.9812667
## 13 0.9851145 0.9763978
## 15 0.9851145 0.9763978
## 17 0.9851145 0.9763978
## 19 0.9851145 0.9763978
## 21 0.9851145 0.9763978
## 23 0.9851145 0.9763978
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
To assess model performance and the suitability of our choice of \(k=5\), we take two approaches. First, we examine the statistical output of the confusion matrix; its accuracy and kappa should be consistent with the cross-validation results. Second, we examine the smoothness of the decision boundary by plotting it for \(k=1\) and \(k=5\). If the optimal \(k\) yields a relatively smooth boundary, that is supporting evidence that the model parameters are reasonable.
First, we construct the confusion matrix for one specific KNN fit, using a train-test split with the same proportions as the cross-validation study. We deliberately use the caret interface rather than the simpler class interface to ensure consistency across the models that follow.
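A sketch of the fit that produces the matrix below, fixing \(k=5\) through tuneGrid (the full code is in the appendix):

```r
# Fit KNN at k = 5 only (trainControl "none" skips resampling), then
# score the 20% test split; training.individuals is the 80-20 partition
# created earlier with createDataPartition().
x2 <- train(species ~ ., data = standardized[training.individuals, ],
            method = "knn",
            trControl = trainControl(method = "none"),
            tuneGrid = data.frame(k = 5))
pred2 <- predict(x2, newdata = test.X)
confusionMatrix(pred2, test.Y)
```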
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 29 1 0
## Chinstrap 0 12 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9846
## 95% CI : (0.9172, 0.9996)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9757
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 0.9231 1.0000
## Specificity 0.9722 1.0000 1.0000
## Pos Pred Value 0.9667 1.0000 1.0000
## Neg Pred Value 1.0000 0.9811 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4462 0.1846 0.3538
## Detection Prevalence 0.4615 0.1846 0.3538
## Balanced Accuracy 0.9861 0.9615 1.0000
The confusion matrix output above shows an accuracy of 98.46% and a kappa of 97.6%, consistent with the cross-validation results. We conclude that KNN is a very effective classifier for penguin species. Only 1 penguin in 65 was misclassified: a Chinstrap was predicted to be an Adelie.
Next, we visualize the decision boundary of our KNN model to check model suitability. Custom code is needed, as we are not aware of any R package that currently provides this plot type.
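The essential idea, as a minimal sketch: predict the class over a dense grid of the two plotted predictors while holding the remaining predictors fixed, then contour the predicted classes (the full plotting code is in the appendix):

```r
# Dense grid over the two plotted predictors; the remaining predictors
# are held at fixed values (0 for the standardized continuous variables,
# 1, i.e. male, for sex).
pl <- seq(min(standardized$bill_length_mm), max(standardized$bill_length_mm), by = 0.05)
pw <- seq(min(standardized$bill_depth_mm),  max(standardized$bill_depth_mm),  by = 0.05)
lgrid <- expand.grid(bill_length_mm = pl, bill_depth_mm = pw,
                     flipper_length_mm = 0, body_mass_g = 0, sex = 1)
knnPredGrid <- predict(x2, newdata = lgrid)  # predicted class at each grid point
```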
Adapting the R recipe for KNN decision-boundary plots found on the Stack Overflow page here, we see below that the decision boundary for the \(k=5\) nearest-neighbor model is reasonably smooth and not overfitted to the data.
However, the KNN decision boundary for \(k=1\) is far more contorted. The border between the Chinstrap and Adelie regions in the upper middle of the plot shows a zig-zag pattern characteristic of overfitting. We therefore reject \(k=1\), even though the \(k=5\) and \(k=1\) models have identical confusion matrices (the \(k=1\) confusion matrix is omitted for brevity). This shows that model selection requires considering factors not captured by the confusion matrix.
Our EDA shows the raw Loan Approvals dataset has varying levels of quality:

- Complete variables: Loan_ID, Education, Property_Area, Loan_Status, ApplicantIncome, CoapplicantIncome.
- Mostly complete variables: Married, Dependents, Loan_Amount_Term, Gender.
- Variables with significant gaps: Self_Employed, LoanAmount, Credit_History.

Of these, Credit_History is the most problematic because it strongly influences the loan approval decision.
\(\color{red}{\text{@Randy: Add your detailed discussion here...}}\)
In this section, we construct a common Loan Approvals dataset that will be used for all of the model studies that follow. There are several steps that we take:

First, we keep only observations with fully populated values (via na.omit) rather than impute missing data, since imputing Credit_History in particular appears unreliable.
\(\color{red}{\text{@Scott - can you share the evidence that imputing missing values for Credit History is dicey?}}\) \(\color{red}{\text{this argument needs to be supported and justifies our choice of a common data set.}}\)
Next, we omit one irrelevant column, Loan_ID: an identifier that is exogenous to the loan applicant and is assumed to relate solely to computer system processes.
Then we add a synthetic variable, Total_Income, which is the sum of ApplicantIncome and CoapplicantIncome.
We scale all quantitative variables to mean 0 and variance 1 with the scale function and retain both the nominal and scaled versions. The scaled variables are named like their nominal counterparts but prefixed with a lowercase s. For example, sApplicantIncome has mean 0 and variance 1, whereas ApplicantIncome holds the nominal monthly income.
Further, we convert the character-type data columns from text to factor. This is important for Loan_Status, but also relevant for others: Credit_History, Property_Area, Gender, Married, Education, Self_Employed. In addition, we convert Dependents from text to an ordered factor because the number of dependents is naturally ordered.
Lastly, the common dataset is exported to a flat file. Each group member performs their Loan Approvals model study by loading the common dataset and making any subsequent selections or transformations while retaining the same observations.
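A condensed sketch of the preparation pipeline described above (the full version appears in the appendix):

```r
library(tidyverse)

cla <- read_csv("Loan_approval.csv") %>%
  dplyr::select(-Loan_ID) %>%                            # drop the row identifier
  mutate(Total_Income = ApplicantIncome + CoapplicantIncome) %>%
  na.omit() %>%                                          # keep fully populated rows only
  mutate(sApplicantIncome   = scale(ApplicantIncome),    # scaled copies, prefixed with s
         sCoapplicantIncome = scale(CoapplicantIncome),
         sTotal_Income      = scale(Total_Income),
         sLoanAmount        = scale(LoanAmount))
write_csv(cla, "cla.csv")                                # export the common dataset
```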
After exporting the common Loan Approvals data, we load it back into the R session and convert the character-string columns to factors. This allows R modeling functions to handle the response and categorical predictors properly.
When modeling the common dataset, each member may omit one or more of the income variables to avoid multicollinearity among ApplicantIncome, CoapplicantIncome, and Total_Income (or their scaled equivalents).
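For illustration only, one such selection might keep the combined income and drop its components; the object name cla_model is hypothetical:

```r
# Hypothetical selection: keep Total_Income and drop the components
# (and their scaled versions) it was built from.
cla_model <- cla %>%
  dplyr::select(-ApplicantIncome, -CoapplicantIncome,
                -sApplicantIncome, -sCoapplicantIncome)
```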
\(\color{red}{\text{Please remember to drop one or more of the 3 Income variables prior to running your models.}}\)
\(\color{red}{\text{Let's add this section (or its equivalent) to the discussion since it tells a useful story.}}\)
\(\color{red}{\text{I wrote some code to do the conditional approval rates.}}\)
One simple type of exploratory data analysis is to examine the conditional approval rates by values of each categorical variable \(C\).
For example, the variable Property_Area has three subcategories: Urban, Semiurban, and Rural.
If we find significant differences in the conditional loan approval rates between Urban and Rural applications, this suggests that Property_Area may help to predict approvals.
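As a quick sketch, such conditional rates can be computed with prop.table; our appendix function display_approval_rate produces the fuller tables shown below:

```r
# Column-wise proportions: the "Y" row gives the approval rate
# within each level of Property_Area.
with(cla, prop.table(table(Loan_Status, Property_Area), margin = 2))
```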
The conditional approval rates for each categorical variable are displayed below.
| | Rural | Semiurban | Urban |
|---|---|---|---|
| N (%) | 11.2 | 8.8 | 10.8 |
| Y (%) | 17.7 | 31.0 | 20.4 |
| Property_Area Column Total %: | 29.0 | 39.8 | 31.2 |
| Loan Approval Rate by Column% | 61.2 | 78.0 | 65.3 |
| Total Applications: | 139.0 | 191.0 | 150.0 |
| | 0 | 1 |
|---|---|---|
| N (%) | 13.1 | 17.7 |
| Y (%) | 1.5 | 67.7 |
| Credit_History Column Total %: | 14.6 | 85.4 |
| Loan Approval Rate by Column% | 10.0 | 79.3 |
| Total Applications: | 70.0 | 410.0 |
| | 36 | 60 | 84 | 120 | 180 | 240 | 300 | 360 | 480 |
|---|---|---|---|---|---|---|---|---|---|
| N (%) | 0.4 | 0.0 | 0.2 | 0.0 | 2.5 | 0.2 | 1.0 | 24.8 | 1.7 |
| Y (%) | 0.0 | 0.4 | 0.4 | 0.6 | 5.0 | 0.2 | 0.8 | 60.8 | 0.8 |
| Loan_Amount_Term Column Total %: | 0.4 | 0.4 | 0.6 | 0.6 | 7.5 | 0.4 | 1.9 | 85.6 | 2.5 |
| Loan Approval Rate by Column% | 0.0 | 100.0 | 66.7 | 100.0 | 66.7 | 50.0 | 44.4 | 71.0 | 33.3 |
| Total Applications: | 2.0 | 2.0 | 3.0 | 3.0 | 36.0 | 2.0 | 9.0 | 411.0 | 12.0 |
| | No | Yes |
|---|---|---|
| N (%) | 26.0 | 4.8 |
| Y (%) | 60.2 | 9.0 |
| Self_Employed Column Total %: | 86.2 | 13.8 |
| Loan Approval Rate by Column% | 69.8 | 65.2 |
| Total Applications: | 414.0 | 66.0 |
| | Graduate | Not Graduate |
|---|---|---|
| N (%) | 23.3 | 7.5 |
| Y (%) | 56.5 | 12.7 |
| Education Column Total %: | 79.8 | 20.2 |
| Loan Approval Rate by Column% | 70.8 | 62.9 |
| Total Applications: | 383.0 | 97.0 |
| | 0 | 1 | 2 | 3+ |
|---|---|---|---|---|
| N (%) | 18.1 | 5.8 | 4.2 | 2.7 |
| Y (%) | 39.0 | 10.8 | 13.5 | 5.8 |
| Dependents Column Total %: | 57.1 | 16.7 | 17.7 | 8.5 |
| Loan Approval Rate by Column% | 68.2 | 65.0 | 76.5 | 68.3 |
| Total Applications: | 274.0 | 80.0 | 85.0 | 41.0 |
| | No | Yes |
|---|---|---|
| N (%) | 13.3 | 17.5 |
| Y (%) | 21.9 | 47.3 |
| Married Column Total %: | 35.2 | 64.8 |
| Loan Approval Rate by Column% | 62.1 | 73.0 |
| Total Applications: | 169.0 | 311.0 |
| | Female | Male |
|---|---|---|
| N (%) | 6.7 | 24.2 |
| Y (%) | 11.2 | 57.9 |
| Gender Column Total %: | 17.9 | 82.1 |
| Loan Approval Rate by Column% | 62.8 | 70.6 |
| Total Applications: | 86.0 | 394.0 |
We observe large differences in approval rates for the following categorical variables:

- Credit_History: Applicants with a credit history have a 79.3% approval rate vs. 10.0% for those without.
- Property_Area: The Rural approval rate of 61.2% is lower than the Semiurban rate of 78.0%.
- Married: The unmarried applicant approval rate of 62.1% is lower than the married rate of 73.0%.
- Education: The Not Graduate approval rate of 62.9% is lower than the Graduate rate of 70.8%.
- Gender: The female applicant approval rate of 62.8% is lower than the male rate of 70.6%.

These differences suggest which categorical variables are likely to matter most for prediction.
\(\color{red}{\text{Please set.seed() in your section to ensure your R code works as expected. }}\)
\(\color{red}{\text{Suggest a 80-20 split for training and test data to allow cross model comparison. }}\)
\(\color{red}{\text{@Scott - adapt this rubric of steps as appropriate since you push the analysis further than me.}}\)
\(\color{red}{\text{I can send randomForest code to you separately}}\)
We will break the analysis of random forests into several phases:
- Tuning the number of predictors, mtry, to be considered at each split.

\(\color{red}{\text{Please set.seed() in your section to ensure your R code works as expected. }}\)
\(\color{red}{\text{Maybe use a 80-20 split for training and test data to allow cross model comparison. }}\)
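A minimal sketch of such a tuning run, assuming the common dataset cla, an 80-20 split, and caret's rf method; this is illustrative, not the final implementation:

```r
# A possible tuning run for mtry; requires the randomForest package.
# Assumes cla has its factor columns set and redundant income columns dropped.
set.seed(10)                                   # assumed seed for reproducibility
idx <- createDataPartition(cla$Loan_Status, p = 0.8, list = FALSE)
rf_fit <- train(Loan_Status ~ ., data = cla[idx, ],
                method = "rf",
                trControl = trainControl("cv", number = 5),
                tuneGrid = expand.grid(mtry = 2:6))  # candidate predictors per split
rf_fit$bestTune                                # selected mtry
```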
\(\color{red}{\text{Please set.seed() in your section to ensure your R code works as expected. }}\)
\(\color{red}{\text{Maybe use a 80-20 split for training and test data to allow cross model comparison. }}\)
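Similarly, a minimal sketch of a boosting run with caret's gbm method, assuming the same cla data and idx split as in the random forest sketch; the parameter grid is illustrative only:

```r
# Illustrative tuning grid for gradient boosting; requires the gbm package.
gbm_grid <- expand.grid(n.trees = c(100, 300),
                        interaction.depth = c(1, 3),
                        shrinkage = 0.1,
                        n.minobsinnode = 10)
gbm_fit <- train(Loan_Status ~ ., data = cla[idx, ],
                 method = "gbm",
                 trControl = trainControl("cv", number = 5),
                 tuneGrid = gbm_grid,
                 verbose = FALSE)
gbm_fit$bestTune
```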
In this section, we compare the performance of the decision tree, random forest, and gradient boosting models on the Loan Approvals dataset.
\(\color{red}{\text{Put your external references here.}}\)
We collect all the R code used in this project in this appendix for ease of reference.
# Your libraries go here
library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
library(caret) # Model Framework
library(skimr) # Used for EDA
library(klaR) # Implemented KNN and Naive Bayes models, etc
library(class) # used for KNN classifier
# PLEASE ADD YOUR R LIBRARIES BELOW
# ------------------------------
# ---------------------------------
knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE)
library(palmerpenguins)
# The final dataset excludes observations with missing sex and drops the island and year variables.
pc = penguins %>% filter( !is.na(sex) ) %>% dplyr::select( -island, -year )
head(pc) %>% kable(caption="Scrubbed penguin data set") %>% kable_styling(bootstrap_options = c("striped", "hover"))
set.seed(10)
# Construct the standardized dataset with the response variable, the 4 quantitative variables normalized
# with the scale function and the sex variable normalized with the 1,-1 assignment explained previously.
#
standardized = cbind( pc[,1], scale( pc[, 2:5] ) , sex = data.table::fifelse(pc$sex == 'male' , 1 , -1) ) # fifelse() requires the data.table package to be installed
# Define an 80-20 split of the training and test data.
# -------------------------------------------------------------------------------------
training.individuals = createDataPartition(standardized$species, p= 0.8 , list = FALSE)
# X variables include bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex (converted to -1,1)
# Y variable is just the species class
# train data is the standardized subset of the full data set
# test data is the complement of the training data set.
# -----------------------------------------------------------------
train.X = standardized[ training.individuals, 2:6]
test.X = standardized[-training.individuals, 2:6]
train.Y = standardized[ training.individuals, 1]
test.Y = standardized[-training.individuals, 1]
head(train.X ) %>% kable(caption="Normalized Training Data - 1st Few Rows", digits = 2 ) %>%
kable_styling(bootstrap_options = c("striped", "hover") )
model = train( species ~. , data = standardized , method = "knn" ,
trControl = trainControl("cv", number = 5 ), tuneLength = 10 )
# Plot the model output vs. varying k values.
plot(model)
model
# The best tuning parameter k that
# optimized model accuracy is:
#model$bestTune
# Simpler interface using the knn function within class.
knn.pred5 = knn(train.X, test.X, train.Y , k = 5, prob = FALSE)
cm5 = confusionMatrix(knn.pred5, test.Y)
# This model approach uses tuneGrid with a single value to calculate kNN
# only at a single number of nearest neighbors.
# ------------------------------------------------------------------------------
x2 = train( species ~ . , data = standardized[training.individuals,] ,
method = "knn",
trControl = trainControl(method="none"),
tuneGrid=data.frame(k=5) )
pred2 = predict(x2, newdata=test.X)
(cmknn2=confusionMatrix(pred2, test.Y))
# We visualize a projection of the decision boundary flattened to 2 dimensions to illustrate
# the smoothness of the KNN boundary at different values of k.
# First, choose the variables, then build a sampling grid.
# ------------------------------------------------------------------------
pl = seq(min(standardized$bill_length_mm), max(standardized$bill_length_mm), by = 0.05)
pw = seq(min(standardized$bill_depth_mm), max(standardized$bill_depth_mm), by = 0.05 )
# The sampling grid needs explicit values to be plugged in for the variables
# which are not the two projected variables. Since variables are standardized to
# mean 0 and variance 1, we choose zero for flipper_length and body_mass because they are continuous
# and the mean and mode are close.
# However, for sex, we choose male because the average sex = 0 is not the mode.
# In practice, sex = 0 gives weird decision boundaries in the projected plot.
# ------------------------------------------------------------------------------
lgrid = expand.grid(bill_length_mm=pl , bill_depth_mm = pw , flipper_length_mm = 0, body_mass_g = 0 , sex = 1 )
knnPredGrid = predict(x2, newdata=lgrid )
num_knnPredGrid = as.numeric(knnPredGrid)
num_pred2 = as.numeric(pred2)
test = cbind(test.X, test.Y)
test$Pred = num_pred2
ggplot(data=lgrid) + stat_contour(aes(x=bill_length_mm, y= bill_depth_mm, z = num_knnPredGrid), bins = 2 ) +
geom_point( aes(x=bill_length_mm, y= bill_depth_mm, color = knnPredGrid), size = 1, shape = 19, alpha = 0.2) +
geom_point( data= test , aes(x=bill_length_mm, y=bill_depth_mm, color=pred2 ) , size = 4, alpha = 0.8, shape =24 ) +
ggtitle("KNN Decision Boundary for Penguins Data with k=5 neighbor")
# This model approach uses tuneGrid with a single value to calculate kNN
# only at a single number of nearest neighbors.
# ------------------------------------------------------------------------------
x3 = train( species ~ . , data = standardized[training.individuals,] ,
method = "knn",
trControl = trainControl(method="none"),
tuneGrid=data.frame(k=1) ) # Now we use a k=1 nearest neighbor algorithm which is overfitted.
pred3 = predict(x3, newdata=test.X)
knnPredGrid3 = predict(x3, newdata=lgrid )
num_knnPredGrid3 = as.numeric(knnPredGrid3)
num_pred3 = as.numeric(pred3)
test = cbind(test.X, test.Y)
test$Pred = num_pred3
ggplot(data=lgrid) + stat_contour(aes(x=bill_length_mm, y= bill_depth_mm, z = num_knnPredGrid3), bins = 2 ) +
geom_point( aes(x=bill_length_mm, y= bill_depth_mm, color = knnPredGrid3), size = 1, shape = 19, alpha = 0.2) +
geom_point( data= test , aes(x=bill_length_mm, y=bill_depth_mm, color=pred3 ) , size = 4, alpha = 0.8, shape =24 ) +
ggtitle("KNN Decision Boundary for Penguins Data with k=1 neighbor")
# This code chunk should only be run to build the common data set.
# Any model builder should rely on importing the common data set only.
# Thus, set eval=TRUE when generating a new version of the common data set.
# Otherwise, set eval=FALSE to skip this step.
# ------------------------------------------------------------------------
# This code block assumes the raw Loan Approval data file in csv form has been placed in the same folder as this script.
# ----------------------------------------------------------------------------------------
cla <- read_csv("Loan_approval.csv") %>% dplyr::select(-Loan_ID) # drop the row identifier.
# Add a column for Total Income of Applicant and Co-Applicant
# ---------------------------------------------------------------------------
cla$Total_Income = cla$ApplicantIncome + cla$CoapplicantIncome
# We build a dataset in which all observations have fully populated values.
# ---------------------------------------------------------------------------
cla = na.omit(cla)
# Add mean zero and variance 1 versions of quantitative variables.
# ----------------------------------------------------------
cla %>% mutate( sApplicantIncome = scale(ApplicantIncome),
sCoapplicantIncome = scale(CoapplicantIncome),
sTotal_Income = scale(Total_Income) ,
sLoanAmount = scale( LoanAmount)
) -> cla
head(cla)
write_csv(cla, "cla.csv") # Write the scaled common loan approvals data set to local disk.
cla = read_csv("cla.csv")
# Transform the character data columns into factors
# --------------------------------------------------------
cla$Loan_Amount_Term = factor(cla$Loan_Amount_Term)
cla$Loan_Status = factor(cla$Loan_Status) # Convert the response to factor
cla$Credit_History = factor(cla$Credit_History)
cla$Property_Area = factor( cla$Property_Area)
cla$Gender = factor( cla$Gender)
cla$Married = factor(cla$Married)
cla$Dependents = ordered(cla$Dependents, levels = c("0" , "1", "2" , "3+") )
cla$Education = factor(cla$Education)
cla$Self_Employed = factor( cla$Self_Employed)
head(cla)
# Display the approval rate statistics for each qualitative field
# in absolute count and percentage terms by category.
#
display_approval_rate <- function(la , field )
{
a = la[, "Loan_Status"]
b = la[, field]
counts = table(cbind(a,b)) # Count table; renamed to avoid shadowing base::c
pcts = as.data.frame.matrix(counts) /nrow(la) * 100 # Percentage equivalent
col_sums = colSums(pcts) # Column totals by the field's levels
app_rate = pcts[2,]/col_sums * 100 # Approval rate; assumes row 2 is the "Y" row.
yy = rbind( pcts, col_sums, app_rate, colSums(counts))
rownames(yy)[1]= paste0( rownames(yy)[1], " (%)")
rownames(yy)[2]= paste0( rownames(yy)[2], " (%)")
rownames(yy)[3]= paste0( field, " Column Total %:")
rownames(yy)[4]="Loan Approval Rate by Column%"
rownames(yy)[5]="Total Applications:"
print(yy %>% kable(digits = 1, caption = paste0('Loan Approved vs. ', field , ' by Percent') ) %>%
kable_styling(bootstrap_options = c("striped", "hover") ) %>%
row_spec(4, background = "skyblue", bold = TRUE) )
}
cat_vars = c("Property_Area", "Credit_History", "Loan_Amount_Term", "Self_Employed", "Education", "Dependents", "Married", "Gender")
for( field in cat_vars)
{
display_approval_rate( cla , field)
cat("\n")
}