The purpose of this assignment was to explore classification via K-nearest neighbors, Decision Trees, Random Forests, and Gradient Boosting.
Classification is a supervised machine learning technique whose main purpose is to identify the category/class of provided input data. The model that is generated is trained using labeled data (hence the label “supervised”) and then the trained model is used to predict our discrete output.
First, we’re going to predict species
of penguin using the KNN algorithm.
Then we’re going to compare and contrast loan approval status
prediction accuracy for Decision Trees, Random Forests and Gradient Boosting.
……………………………………………………………………..
Since we've worked with the penguins dataset twice before, we perform light EDA to re-familiarize ourselves with the data prior to applying the KNN algorithm to it.
We load in the data, pre-process it, verify the first 6 observations, and utilize the glimpse() function (from dplyr) to gain insight into the dimensions, variable characteristics, and value ranges:
#Load and tidy data
penguin_measurements <- penguins %>% drop_na() %>%
  dplyr::select(species, bill_length_mm, bill_depth_mm,
                flipper_length_mm, body_mass_g)
head(penguin_measurements %>% as.data.frame())
## species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie 39.1 18.7 181 3750
## 2 Adelie 39.5 17.4 186 3800
## 3 Adelie 40.3 18.0 195 3250
## 4 Adelie 36.7 19.3 193 3450
## 5 Adelie 39.3 20.6 190 3650
## 6 Adelie 38.9 17.8 181 3625
glimpse(penguin_measurements)
## Rows: 333
## Columns: 5
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6~
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2~
## $ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18~
## $ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800~
Once we've dropped NA values and selected pertinent variables, we end up with a 333 observation x 5 variable dataframe with:

- species, a categorical variable of type factor, as our dependent variable, and
- bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g, all quantitative variables of type dbl or int, as our independent variables.

The KNN algorithm hinges on the idea that similar data will be near one another. When applying KNN, the distance between data points is used to classify data: points nearer one another are "batched" together. The number of batches we see and the distance between points are determined by the k value we select.
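To make the distance idea concrete, here is a minimal hand-rolled sketch (illustrative only and not part of the original analysis; the class package's knn() used below handles this for us) that classifies one new point by majority vote among its k nearest training points:

#toy illustration of the nearest-neighbour idea
knn_one <- function(train_x, train_y, new_x, k = 3) {
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))  #Euclidean distance to every training point
  votes <- train_y[order(d)[1:k]]                 #labels of the k closest points
  names(which.max(table(votes)))                  #majority vote
}
#e.g. classify a hypothetical penguin from bill length and depth alone
knn_one(as.matrix(penguin_measurements[, c("bill_length_mm", "bill_depth_mm")]),
        penguin_measurements$species, c(45, 15), k = 5)

In practice the features should be scaled first (as we do below), since raw distances are dominated by the variables with the largest ranges.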
Due to the simplicity of calculations involved and ease of interpretability of the result, KNN is a popular classifier. For our purposes, we’re going to apply the K-nearest neighbor algorithm to the Palmer penguins dataset to predict the species
variable.
We perform an 80-20 split, center and scale our independent variables via built-in scale() function, and fit the KNN model to our training dataset:
# Splitting data into train and test data
set.seed(123)
training.individuals <- penguin_measurements$species %>%
  createDataPartition(p = 0.8, list = FALSE)
train_cl <- penguin_measurements[training.individuals, ]
test_cl <- penguin_measurements[-training.individuals, ]
# Feature Scaling
train_scale <- scale(train_cl[, 2:5])
test_scale <- scale(test_cl[, 2:5])
# Fitting KNN Model to training dataset
classifier_knn <- knn(train = train_scale,
                      test = test_scale,
                      cl = train_cl$species,
                      k = 1)
#classifier_knn #verify output
Performing an 80-20 split allocates 80% of our data for training the model and 20% for testing it, while applying the scale() function centers and scales our independent variables to reduce bias and improve predictive accuracy. We start with k = 1 (neighbor) for our KNN fit here and will vary this value later to interpret its impact on predictive accuracy.
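One small caveat: scale() above standardizes the training and testing sets independently. A common refinement (a minimal sketch, not applied in the results that follow) is to reuse the training set's centers and standard deviations when scaling the test set, so the test observations are on exactly the scale the model was trained on:

# Sketch: scale the test set with the training set's centering/scaling parameters
train_scale <- scale(train_cl[, 2:5])
test_scale  <- scale(test_cl[, 2:5],
                     center = attr(train_scale, "scaled:center"),
                     scale  = attr(train_scale, "scaled:scale"))

With an 80-20 split on this dataset the difference is typically small, so we proceed with the results as reported.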
Once our model has been fit, we assess its accuracy via confusion matrix :
# Confusion Matrix
cm <- table(test_cl$species, classifier_knn)
cm
## classifier_knn
## Adelie Chinstrap Gentoo
## Adelie 28 1 0
## Chinstrap 3 10 0
## Gentoo 0 0 23
# Model Evaluation - Choosing K
# Calculate out of Sample error
misClassError <- mean(classifier_knn != test_cl$species)
print(paste('Accuracy =', 1-misClassError))
## [1] "Accuracy = 0.938461538461538"
In the confusion matrix above, rows represent actual values while columns represent predicted values. Although our KNN classifier mis-predicted one Adelie as a Chinstrap and 3 Chinstraps as Adelies, it predicted Gentoo with 100% accuracy and produced a relatively favorable 93.8% overall accuracy.
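Because the rows of cm hold the actual species, the per-class accuracy quoted above can be read straight off the table. A small optional check using the cm object already computed:

#per-class recall: correct predictions on the diagonal divided by row (actual) totals
round(diag(cm) / rowSums(cm), 3)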
Let’s explore the impact of increasing our k value to 3:
# K = 3
classifier_knn <- knn(train = train_scale,
                      test = test_scale,
                      cl = train_cl$species,
                      k = 3)
misClassError <- mean(classifier_knn != test_cl$species)
print(paste('Accuracy (k=3):', 1-misClassError))
## [1] "Accuracy (k=3): 0.953846153846154"
This time we skip the confusion matrix and instead calculate the out-of-sample error and subtract it from 1, producing a predictive accuracy of 95.4%, a 1.6 percentage-point improvement over our KNN classifier with k = 1.
We, once again, explore the impact of increasing our k value. This time we increase it to 15:
# K = 15
classifier_knn <- knn(train = train_scale,
                      test = test_scale,
                      cl = train_cl$species,
                      k = 15)
misClassError <- mean(classifier_knn != test_cl$species)
print(paste('Accuracy (k=15):', 1-misClassError))
## [1] "Accuracy (k=15): 0.969230769230769"
With k = 15 our predictive accuracy climbs to 96.9%, a further 1.5 percentage-point improvement over k = 3. In our case, each increase in k improved predictive accuracy, though this is not always the case.
When choosing k values, smaller values are generally less computationally expensive yet they’re also noisier and less accurate. Larger values, on the other hand, can result in smoother decision boundaries and a lower variance yet they increase the bias and processing demands.
Thus, more often than not, we seek a "sweet spot": a k value that's neither too large nor too small. Our choice of value is therefore influenced by the number of observations as well as the characteristics of the data we're classifying.
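A simple way to search for that sweet spot is to scan a range of k values and compare accuracies; a minimal sketch using the objects above (ideally the scan would use a separate validation set or cross-validation rather than the test set):

#scan odd k values and record test-set accuracy for each
k_values <- seq(1, 25, by = 2)
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = train_scale, test = test_scale,
              cl = train_cl$species, k = k)
  mean(pred == test_cl$species)
})
data.frame(k = k_values, accuracy)  #inspect the accuracy curve
k_values[which.max(accuracy)]       #k with the highest test accuracy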
……………………………………………………………………..
Being that we haven’t worked with the loan approval dataset before, our exploratory data analysis (EDA) will be more in-depth. The depth will allow for greater understanding of the data at hand prior to applying a Decision Tree, Random Forest, and Gradient Boosting model to it.
We load in the data, replace empty strings with NAs, and observe the first 6 observations of our dataset:
#Load in data
loan <- read.csv("https://raw.githubusercontent.com/SubhalaxmiRout002/Data-622-Group-5/main/Loan_approval.csv", stringsAsFactors = TRUE)
loan[loan==""] <- NA #replace empty strings with NAs
#head(loan) #verify 1st 6 observations
The head() function provides some context regarding the format of our data. The glimpse() function (from dplyr) and R's built-in summary() function provide further insight:
#Light EDA
glimpse(loan)
## Rows: 614
## Columns: 13
## $ Loan_ID <fct> LP001002, LP001003, LP001005, LP001006, LP001008, LP~
## $ Gender <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male~
## $ Married <fct> No, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes,~
## $ Dependents <fct> 0, 1, 0, 0, 0, 2, 0, 3+, 2, 1, 2, 2, 2, 0, 2, 0, 1, ~
## $ Education <fct> Graduate, Graduate, Graduate, Not Graduate, Graduate~
## $ Self_Employed <fct> No, No, Yes, No, No, Yes, No, No, No, No, No, NA, No~
## $ ApplicantIncome <int> 5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006~
## $ CoapplicantIncome <dbl> 0, 1508, 0, 2358, 0, 4196, 1516, 2504, 1526, 10968, ~
## $ LoanAmount <int> NA, 128, 66, 120, 141, 267, 95, 158, 168, 349, 70, 1~
## $ Loan_Amount_Term <int> 360, 360, 360, 360, 360, 360, 360, 360, 360, 360, 36~
## $ Credit_History <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, NA, ~
## $ Property_Area <fct> Urban, Rural, Urban, Urban, Urban, Urban, Urban, Sem~
## $ Loan_Status <fct> Y, N, Y, Y, Y, Y, Y, N, Y, N, Y, Y, Y, N, Y, Y, Y, N~
summary(loan)
## Loan_ID Gender Married Dependents Education
## LP001002: 1 : 0 : 0 : 0 Graduate :480
## LP001003: 1 Female:112 No :213 0 :345 Not Graduate:134
## LP001005: 1 Male :489 Yes :398 1 :102
## LP001006: 1 NA's : 13 NA's: 3 2 :101
## LP001008: 1 3+ : 51
## LP001011: 1 NA's: 15
## (Other) :608
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## : 0 Min. : 150 Min. : 0 Min. : 9.0
## No :500 1st Qu.: 2878 1st Qu.: 0 1st Qu.:100.0
## Yes : 82 Median : 3812 Median : 1188 Median :128.0
## NA's: 32 Mean : 5403 Mean : 1621 Mean :146.4
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.:168.0
## Max. :81000 Max. :41667 Max. :700.0
## NA's :22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## Min. : 12 Min. :0.0000 Rural :179 N:192
## 1st Qu.:360 1st Qu.:1.0000 Semiurban:233 Y:422
## Median :360 Median :1.0000 Urban :202
## Mean :342 Mean :0.8422
## 3rd Qu.:360 3rd Qu.:1.0000
## Max. :480 Max. :1.0000
## NA's :14 NA's :50
We're dealing with a 614 observation x 13 variable dataframe with:

- Loan_Status, a categorical variable of type factor, as our dependent variable,
- ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, and Credit_History, all quantitative variables of type dbl or int, as independent variables, and
- Loan_ID, Gender, Married, Dependents, Education, Self_Employed, and Property_Area, all categorical variables of type factor (read in with stringsAsFactors = TRUE), as independent variables.

From the above output, we also get an idea of the proportions for our factor variables (ie. Gender: 489 male, 112 female applicants; Loan_Status: 422 Y, 192 N), which we can explore in greater depth later.

Of the above variables, we can see that Loan_ID does not appear to provide much insight. We'll drop this variable, explore a clearer visualization of NA counts, and then deal with our NA values:
loan <- subset(loan, select = -c(1)) #drop Loan_ID from consideration
colSums(is.na(loan)) #visualize NA counts
## Gender Married Dependents Education
## 13 3 15 0
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## 32 0 0 22
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## 14 50 0 0
7 / 12 variables have NA values and 3 of these variables have more than 20 NA values (a relatively significant margin).
We can drop these values but dropping values means losing valuable observations and thus we elect to impute instead. From the mice library, we impute using the pmm method (predictive mean matching):
#relabel Dependents "3+" value as "3" so that we can impute values
loan$Dependents <- revalue(loan$Dependents, c("3+"="3"))
#apply predictive mean matching to loan data
loan <- mice(loan, m = 1, method = "pmm", seed = 500)
##
## iter imp variable
## 1 1 Gender* Married* Dependents* Self_Employed LoanAmount Loan_Amount_Term* Credit_History*
## 2 1 Gender* Married Dependents Self_Employed LoanAmount Loan_Amount_Term* Credit_History*
## 3 1 Gender* Married* Dependents* Self_Employed LoanAmount* Loan_Amount_Term* Credit_History*
## 4 1 Gender Married Dependents Self_Employed LoanAmount Loan_Amount_Term* Credit_History*
## 5 1 Gender* Married Dependents Self_Employed* LoanAmount* Loan_Amount_Term* Credit_History
loan <- mice::complete(loan, 1)
We re-assign the “3+” value of the Dependents
variable to provide consistent leveling and enable pmm and then we actually apply pmm.
Predictive mean matching calculates the predicted value for our target variable, and, for missing values, forms a small set of “candidate donors” from the complete cases that are closest to the predicted value for our missing entry. Donors are then randomly chosen from candidates and imputed where values were once missing. To apply pmm we assume that the distribution is the same for missing cells as it is for observed data, and thus, the approach may be more limited when the % of missing values is higher.
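As an optional sanity check on pmm's output, one could keep the intermediate mice object under its own name (loan_mids below is a hypothetical name introduced only for this sketch) and compare observed vs. imputed distributions before calling complete():

#hypothetical sketch: inspect the imputations before completing the data
loan_mids <- mice(loan, m = 1, method = "pmm", seed = 500)
densityplot(loan_mids, ~ LoanAmount) #observed (blue) vs. imputed (red) values, via mice/lattice
loan <- mice::complete(loan_mids, 1)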
Once we’ve imputed missing values into our loan dataset and returned the data in proper form, we verify whether our operation was successful:
#verify absence of NA values in the dataset
colSums(is.na(loan))
## Gender Married Dependents Education
## 0 0 0 0
## Self_Employed ApplicantIncome CoapplicantIncome LoanAmount
## 0 0 0 0
## Loan_Amount_Term Credit_History Property_Area Loan_Status
## 0 0 0 0
Imputation was a success and thus our data pre-processing has been completed. We can proceed with our exploratory data analysis (EDA).
To identify features that carry promise vs. those that may not, we consider a correlation matrix for our numeric variables:
#Correlation matrix for numeric variables
#Loan Approved correlation
loan_corr_y <- loan %>%
  filter(Loan_Status == "Y") %>%
  select_if(is.numeric) %>%
  cor()
corrplot(loan_corr_y, title="Loan Approved", type = "lower", mar=c(0,0,1,0))
#Loan Rejected correlation
loan_corr_n <- loan %>%
  filter(Loan_Status == "N") %>%
  select_if(is.numeric) %>%
  cor()
corrplot(loan_corr_n, title="Loan Rejected", type = "lower", diag = T, mar=c(0,0,1,0))
When we consider loan approval and rejection, LoanAmount
and ApplicantIncome
have a relatively strong correlation. For loan rejection, LoanAmount
also appears to have moderate correlation with CoapplicantIncome
. Remaining features do not have strong correlation and thus we’ll only note the aforementioned relationships for consideration for possible feature removal later.
We move on to explore histograms for our numeric variables:
#Histograms for all variables
loan %>%
  keep(is.numeric) %>%
  subset(select = -c(4,5)) %>% #drop CreditHistory, Loan_Amount_Term
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free", ncol=1) +
geom_histogram(bins=90,color="darkblue", fill="lightblue")
From the above figures we observe that:

- ApplicantIncome and LoanAmount appear to be right-skewed with a number of noteworthy outliers, and
- CoapplicantIncome has a peak at 0 joined with an otherwise right-skewed distribution.

Scaling is not necessary for Decision Trees or Random Forests but may prove useful for Gradient Boosting. Thus, we won't act on a transformation at this point; we'll just note it for future consideration.
Next, we explore our categorical variables:
#convert CreditHistory to type factor
loan$Credit_History <- factor(loan$Credit_History)
#levels(loan$Credit_History) #verify
#MS: if there's a way to facet wrap this / automate, would be preferred:
#Histograms for all categorical variables
##Gender
p1 <- loan %>%
  dplyr::select(1,12) %>% #dplyr::select(1:5,11:12)
  group_by(,Loan_Status) %>%
  #gather() %>%
  count() %>%
  ggplot(aes(x=Gender, y=freq, fill=Loan_Status)) +
  #facet_wrap(~ key) +
  geom_bar(stat='identity', position="stack")
##Married
p2 <- loan %>% dplyr::select(2,12) %>% group_by(,Loan_Status) %>% count() %>%
  ggplot(aes(x=Married, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
##Dependents
p3 <- loan %>% dplyr::select(3,12) %>% group_by(,Loan_Status) %>% count() %>%
  ggplot(aes(x=Dependents, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
##Education
p4 <- loan %>% dplyr::select(4,12) %>% group_by(,Loan_Status) %>% count() %>%
  ggplot(aes(x=Education, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
##Self_Employed
p5 <- loan %>% dplyr::select(5,12) %>% group_by(,Loan_Status) %>% count() %>%
  ggplot(aes(x=Self_Employed, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
##Property_Area
p6 <- loan %>% dplyr::select(11,12) %>% group_by(,Loan_Status) %>% count() %>%
  ggplot(aes(x=Property_Area, y=freq, fill=Loan_Status)) +
  geom_bar(stat='identity', position="stack")
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 3, ncol = 2)
While the visualization of our six categorical variables' distributions with regard to our dependent variable is enlightening, it doesn't provide a clear indication of any one variable's predictive capability over another's. There's the possibility that many of these variables will prove impertinent, and we may thus be justified in excluding them from our models later in the process.
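One way to put rough numbers on that impression (a minimal sketch, not part of the original analysis) is a chi-squared test of independence between each categorical predictor and Loan_Status; small p-values suggest an association worth keeping:

#approximate association check between each categorical predictor and Loan_Status
cat_vars <- c("Gender", "Married", "Dependents", "Education", "Self_Employed", "Property_Area")
sapply(cat_vars, function(v) chisq.test(table(loan[[v]], loan$Loan_Status))$p.value)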
With a relatively thorough exploratory analysis under our belt, we move on to building and assessing our Decision Tree, Random Forest, and Gradient Boosting models.
Decision trees build classification or regression models in the form of a tree structure. Datasets are broken into smaller and smaller subsets along chosen parameters, as the associated decision tree is developed.
The result is a tree with nodes and branches. Nodes denote a split point based on the attribute in question while branches denote the corresponding outcome. We start at a "root node", terminate at "leaf nodes", and use the corresponding leaf nodes to provide proportions regarding the resulting class labels.
Due to the ease of interpretability, decision trees are a popular early classifier. For our purposes, we’re dealing with a categorical dependent variable, and will build a Decision Tree model from the loan approval dataset to provide Loan_Status
predictions (ie. Approve or Reject).
To prepare our data for Decision Tree modeling, we convert variables of type factor to numeric, remove the Gender
variable from consideration because loan approval does not depend upon gender (as exemplified by our plots during EDA), and change variables Education
and Property_Area
from categorical to numerical type:
# Remove Gender
loan_new <- subset(loan, select = -c(Gender))
# Convert Dependents, Credit_History to numeric type
loan_new$Dependents <- as.numeric(loan_new$Dependents)
loan_new$Credit_History <- as.numeric(loan_new$Credit_History)
# Change Variables values
loan_new$Education <- ifelse(loan_new$Education=="Graduate", 1, 0)
loan_new$Married <- ifelse(loan_new$Married=="Yes", 1, 0)
loan_new$Self_Employed <- ifelse(loan_new$Self_Employed == "Yes", 1, 0)
# recode Property_Area per row: Semiurban = 2, Urban = 1, Rural = 0
loan_new$Property_Area <- ifelse(loan_new$Property_Area=="Semiurban", 2,
                          ifelse(loan_new$Property_Area=="Urban", 1, 0))
Using the sample.split() function from the caTools package we perform a train-test split such that 75% of our data will be training data and 25% will be testing data:
#Split data into training and testing sets
set.seed(123)
sample_data = sample.split(loan_new, SplitRatio = 0.75)
train_data <- subset(loan_new, sample_data == TRUE)
test_data <- subset(loan_new, sample_data == FALSE)
With our data properly pre-processed and a train-test split performed, we recursively split our dataset using the rpart() function (from the rpart package, short for Recursive Partitioning and Regression Trees), plot the resulting model, and train our model using the ctree() (conditional inference tree) function:
set.seed(144)
# Plot tree
binary.model <- rpart(Loan_Status ~ ., data = train_data, cp = .02)
rpart.plot(binary.model)
# Fit model
tree <- ctree(Loan_Status ~ ., data = train_data)
From the Decision tree above we observe that Credit_History
is the most important factor when deciding if someone will be approved for a loan or not.
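For a text-based view of the same splits (an optional check using the binary.model object above, assuming rpart.plot version 2.0 or later), rpart.plot also provides rpart.rules():

#print the fitted splits as plain-text rules
rpart.plot::rpart.rules(binary.model)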
With our model fit to our training data and Loan_Status
as the response variable, we can explore the model’s accuracy via confusion matrix and mis-classification error rate for both our training and testing data:
# Confusion matrix and misclassification error rate for training data
tab <- table(Predicted = predict(tree), Actual = train_data$Loan_Status)
print(tab)
## Actual
## Predicted N Y
## N 62 5
## Y 75 305
print(paste('Misclassification error for training data', round(1 - sum(diag(tab))/sum(tab),3)))
## [1] "Misclassification error for training data 0.179"
# Confusion matrix and misclassification error rate for testing data
testPred <- predict(tree, test_data)
testtab <- table(Predicted = testPred, Actual = test_data$Loan_Status)
print(testtab)
## Actual
## Predicted N Y
## N 25 2
## Y 30 110
print(paste('Misclassification error for testing data', round(1 - sum(diag(testtab))/sum(testtab),3)))
## [1] "Misclassification error for testing data 0.192"
From the output statistics above, our Decision Tree misclassifies roughly 17.9% of the training data and 19.2% of the testing data.
With these statistics in mind, we utilize the confusionMatrix function from the caret library and present our statistics as a kable table to glean more insight regarding our model’s performance:
#confusionMatrix(testPred, test_data$Loan_Status) #verify
DT_Model <- confusionMatrix(testPred, test_data$Loan_Status)$byClass
AccuracyDT <- confusionMatrix(testPred, test_data$Loan_Status)$overall['Accuracy']
DT_Model <- data.frame(DT_Model)
DT_Model <- rbind("Accuracy" = AccuracyDT, DT_Model)
tabview <- data.frame(DT_Model)
#present statistics as kable table
tabview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
| | DT_Model |
|---|---|
| Accuracy | 0.8083832 |
| Sensitivity | 0.4545455 |
| Specificity | 0.9821429 |
| Pos Pred Value | 0.9259259 |
| Neg Pred Value | 0.7857143 |
| Precision | 0.9259259 |
| Recall | 0.4545455 |
| F1 | 0.6097561 |
| Prevalence | 0.3293413 |
| Detection Rate | 0.1497006 |
| Detection Prevalence | 0.1616766 |
| Balanced Accuracy | 0.7183442 |
For now, we note the Decision Tree model’s performance based on these classification metrics (ie. stronger for Pos Pred Value
, weaker for Neg Pred Value
). Later, we’ll provide an in-depth interpretation of these statistics vs. those of our Random Forest and Gradient Boosting models.
Random forests are one of the most popular machine learning algorithms. They can be used for classification or regression, handle a large number of features, and are generally successful because they provide good predictive performance, a low incidence of over-fitting, and easy interpretability.
More features are generally better for this model, but we’ll exclude Gender
and Dependents
from consideration because of their weak correlation with Loan_Status
.
To prepare our data for Random Forest modeling, we convert variables of type factor to numeric, remove the Gender
and Dependents
variables from consideration, and change variables Education
and Property_Area
from categorical to numerical type:
# Remove Gender, Dependents
loan_RF <- subset(loan, select = -c(Gender, Dependents))
# Convert Credit_History to numeric type
loan_RF$Credit_History <- as.numeric(loan_RF$Credit_History)
# Change Variables values
loan_RF$Education <- ifelse(loan_RF$Education=="Graduate", 1, 0)
loan_RF$Married <- ifelse(loan_RF$Married=="Yes", 1, 0)
loan_RF$Self_Employed <- ifelse(loan_RF$Self_Employed=="Yes", 1, 0)
# recode Property_Area per row: Semiurban = 2, Urban = 1, Rural = 0
loan_RF$Property_Area <- ifelse(loan_RF$Property_Area=="Semiurban", 2,
                         ifelse(loan_RF$Property_Area=="Urban", 1, 0))
#head(loan_RF) #verify output
Using the built-in sample function we perform a train test split with replacement such that 75% of our data will be training data and 25% will be testing data:
set.seed(1247)
ind <- sample(2, nrow(loan_RF), replace = TRUE, prob = c(0.75, 0.25))
train_RF <- loan_RF[ind == 1, ]
test_RF <- loan_RF[ind == 2, ]
#Verify dimensions of each set
#dim(train_RF)
#dim(test_RF)
The Random Forest model helps with feature selection based on importance and avoids over-fitting. We’ll lean on this functionality in adapting our 2nd model based on the feature importance specified during the building of our first model.
With our data properly pre-processed and a train-test split performed, we utilize the randomForest() function (from the randomForest package) to first help determine the optimal value for mtry, and then train our model using this value.
We determine the optimal mtry
value (2) based on the lowest corresponding OOB rate, create the first model with all variables except Gender
and Dependents
, and then visit the corresponding Confusion Matrix and feature importance scores:
set.seed(222)
for (i in 1:10) {
  rf = randomForest(Loan_Status ~ . , data = train_RF, mtry = i)
  err <- rf$err.rate
  oob_err <- err[nrow(err), "OOB"]
  print(paste("For mtry : ", i , "OOB Error Rate : ", round(oob_err, 4)))
}
## [1] "For mtry : 1 OOB Error Rate : 0.3096"
## [1] "For mtry : 2 OOB Error Rate : 0.1782"
## [1] "For mtry : 3 OOB Error Rate : 0.1826"
## [1] "For mtry : 4 OOB Error Rate : 0.1804"
## [1] "For mtry : 5 OOB Error Rate : 0.1849"
## [1] "For mtry : 6 OOB Error Rate : 0.1982"
## [1] "For mtry : 7 OOB Error Rate : 0.2004"
## [1] "For mtry : 8 OOB Error Rate : 0.2027"
## [1] "For mtry : 9 OOB Error Rate : 0.2094"
## [1] "For mtry : 10 OOB Error Rate : 0.2027"
#set.seed(35)
rf_1 = randomForest(Loan_Status ~ . , data = train_RF, mtry = 2)
print(rf_1)
##
## Call:
## randomForest(formula = Loan_Status ~ ., data = train_RF, mtry = 2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 17.82%
## Confusion matrix:
## N Y class.error
## N 68 75 0.52447552
## Y 5 301 0.01633987
#Importance of each predictor
print(importance(rf_1, type=2))
## MeanDecreaseGini
## Married 2.703211
## Education 2.751339
## Self_Employed 1.822436
## ApplicantIncome 19.810001
## CoapplicantIncome 12.943733
## LoanAmount 18.484104
## Loan_Amount_Term 7.195066
## Credit_History 54.653124
## Property_Area 0.000000
We note an OOB (out-of-bag) error rate of 17.82% and will compare the output statistics of the confusion matrix with those of the second model to determine which has a greater predictive accuracy.
From the above feature importance scores, we observe that Credit_History, ApplicantIncome, LoanAmount, and CoapplicantIncome have the highest MeanDecreaseGini scores. They provide the largest average gain in node purity and are thus the most important features. We will include these features and omit all others in our construction of model 2.
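For a quick visual check of the same scores (an optional sketch using the fitted rf_1 object), the randomForest package can plot them directly:

#dot chart of variable importance for the fitted forest
randomForest::varImpPlot(rf_1, main = "rf_1 variable importance")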
We create the second model in the same manner as the first. This time we only consider the important factors: Credit_History, ApplicantIncome, CoapplicantIncome, and LoanAmount:
set.seed(102)
for (i in 1:10) {
  rf = randomForest(Loan_Status ~ Credit_History + LoanAmount + CoapplicantIncome + ApplicantIncome, data = train_RF, mtry = i)
  err <- rf$err.rate
  oob_err <- err[nrow(err), "OOB"]
  print(paste("For mtry : ", i , "OOB Error Rate : ", round(oob_err, 4)))
}
## [1] "For mtry : 1 OOB Error Rate : 0.1804"
## [1] "For mtry : 2 OOB Error Rate : 0.2004"
## [1] "For mtry : 3 OOB Error Rate : 0.2094"
## [1] "For mtry : 4 OOB Error Rate : 0.216"
## [1] "For mtry : 5 OOB Error Rate : 0.216"
## [1] "For mtry : 6 OOB Error Rate : 0.2205"
## [1] "For mtry : 7 OOB Error Rate : 0.2138"
## [1] "For mtry : 8 OOB Error Rate : 0.2138"
## [1] "For mtry : 9 OOB Error Rate : 0.2183"
## [1] "For mtry : 10 OOB Error Rate : 0.2205"
rf_2 = randomForest(Loan_Status ~ Credit_History + LoanAmount + CoapplicantIncome + ApplicantIncome, data = train_RF, mtry = 1)
print(rf_2)
##
## Call:
## randomForest(formula = Loan_Status ~ Credit_History + LoanAmount + CoapplicantIncome + ApplicantIncome, data = train_RF, mtry = 1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 17.82%
## Confusion matrix:
## N Y class.error
## N 65 78 0.545454545
## Y 2 304 0.006535948
We note an OOB error rate of 17.82% and will compare the output statistics of the confusion matrix with those of the first model to determine which has a greater predictive accuracy.
We evaluate our models by applying RF Model 1 and RF Model 2 to the test data set. We utilize the confusionMatrix function from the caret library and present our statistics as a kable table to glean more insight regarding the comparative statistics between each model’s performance:
#Tabular view of RF model 1 and Model 2 matrix comparison
rf_predict_1 <- predict(rf_1, newdata = test_RF)
#confusionMatrix(rf_predict_1, test_RF$Loan_Status)
RF_Model_1 <- confusionMatrix(rf_predict_1, test_RF$Loan_Status)$byClass
AccuracyRF1 <- confusionMatrix(rf_predict_1, test_RF$Loan_Status)$overall['Accuracy']
RF_Model_1 <- data.frame(RF_Model_1)
RF_Model_1 <- rbind("Accuracy" = AccuracyRF1, RF_Model_1)
rf_predict_2 <- predict(rf_2, newdata = test_RF)
#confusionMatrix(rf_predict_2, test_RF$Loan_Status)
RF_Model_2 <- confusionMatrix(rf_predict_2, test_RF$Loan_Status)$byClass
AccuracyRF2 <- confusionMatrix(rf_predict_2, test_RF$Loan_Status)$overall['Accuracy']
RF_Model_2 <- data.frame(RF_Model_2)
RF_Model_2 <- rbind("Accuracy" = AccuracyRF2, RF_Model_2)
tabularview <- data.frame(RF_Model_1, RF_Model_2)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
| | RF_Model_1 | RF_Model_2 |
|---|---|---|
| Accuracy | 0.8000000 | 0.8121212 |
| Sensitivity | 0.4693878 | 0.4693878 |
| Specificity | 0.9396552 | 0.9568966 |
| Pos Pred Value | 0.7666667 | 0.8214286 |
| Neg Pred Value | 0.8074074 | 0.8102190 |
| Precision | 0.7666667 | 0.8214286 |
| Recall | 0.4693878 | 0.4693878 |
| F1 | 0.5822785 | 0.5974026 |
| Prevalence | 0.2969697 | 0.2969697 |
| Detection Rate | 0.1393939 | 0.1393939 |
| Detection Prevalence | 0.1818182 | 0.1696970 |
| Balanced Accuracy | 0.7045215 | 0.7131422 |
The models share the same Sensitivity, Recall, Prevalence, and Detection Rate, but for every other metric RF Model 2 outperforms RF Model 1. Based on the output statistics above, RF Model 2 is superior to RF Model 1 and will serve as our chosen Random Forest model. This is a noteworthy result given that more features are typically better for random forests, yet the model with fewer features came out ahead.
For now, we note the Random Forest model’s performance based on these classification metrics. Later, we’ll provide an in-depth interpretation of these statistics vs. those of our Decision Tree and Gradient Boosting models.
Boosting is a method for creating an ensemble, or a combination, of machine learning models. Gradient boosting builds this ensemble sequentially: each new model is fit to the errors of the models that came before it, so that adding it reduces how often we misclassify our data.
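To illustrate the stage-wise idea on a toy problem, here is a minimal sketch using squared-error regression and depth-1 rpart trees; this is illustrative only and is not the multi-class objective XGBoost will use below:

#toy gradient boosting: repeatedly fit a stump to the current residuals
library(rpart)
set.seed(1)
df <- data.frame(x = runif(200, 0, 10))
df$y <- sin(df$x) + rnorm(200, sd = 0.2)
pred <- rep(mean(df$y), nrow(df))  #start from a constant model
eta  <- 0.1                        #learning rate shrinks each correction
for (m in 1:100) {
  df$resid <- df$y - pred                           #residuals = negative gradient of squared loss
  stump <- rpart(resid ~ x, data = df, maxdepth = 1)
  pred  <- pred + eta * predict(stump, df)          #add a small correction
}
mean((df$y - pred)^2)  #training error shrinks as stages accumulate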
It’s heralded as one of the most powerful techniques for building predictive models and offers numerous options for implementation. We’ll make use of the XGBoost (eXtreme Gradient Boosting) distributed gradient library due to its promise of flexibility, portability, and efficiency.
For our purposes, we’re dealing with a categorical dependent variable, and will build a Gradient Boosting model from the loan approval dataset to provide Loan_Status
predictions (ie. Approve or Reject).
To prepare our data for Gradient Boosting modeling, we verify our numeric variables' skewness, perform a log10() transform on each, and then re-verify the skewness to observe whether there was improvement:
#Pre-process data: normalize independent numeric vars
#store sum of ApplicantIncome and CoapplicantIncome in ApplicantIncome
loan$ApplicantIncome <- loan$ApplicantIncome + loan$CoapplicantIncome
#calculate skewness prior to transformation
skewness(loan$ApplicantIncome, na.rm = TRUE)
## [1] 5.605953
#skewness(loan$CoapplicantIncome, na.rm = TRUE)
skewness(loan$LoanAmount, na.rm = TRUE)
## [1] 2.345278
#transformation: account for outliers with log transform
loan$ApplicantIncome <- log10(loan$ApplicantIncome)
#loan$CoapplicantIncome <- log10(loan$CoapplicantIncome)
loan$LoanAmount <- log10(loan$LoanAmount)
#calculate skewness after transformation
skewness(loan$ApplicantIncome, na.rm = TRUE)
## [1] 1.071447
#skewness(loan$CoapplicantIncome, na.rm = TRUE) #produced NaN - dealt with via summing
skewness(loan$LoanAmount, na.rm = TRUE)
## [1] -0.1260652
After encountering issues transforming CoapplicantIncome (its many zero values produce non-finite results under the log transform), we opted for a simplification: the sum of ApplicantIncome and CoapplicantIncome is now stored in ApplicantIncome.
Based on the improved skewness values, we see that our transformation was a success and these (2) independent, numeric variables are now prepared to be included in our Gradient Boosting model.
As we continue processing our data, we drop impertinent variables, convert categorical variables to a binary scale, and convert our dependent variable Loan_Status to a 0/1 indicator:
#Data pre-processing: drop impertinent variables, convert to numeric, and change variable values
loan_GB <- subset(loan, select = -c(Gender, Property_Area, CoapplicantIncome)) #since CoapplicantIncome is incorporated in ApplicantIncome
# Convert Dependents, Credit_History to numeric type
loan_GB$Dependents <- as.numeric(loan_GB$Dependents)
loan_GB$Credit_History <- as.numeric(loan_GB$Credit_History)
# Change Variables values
#loan_GB$Gender <- ifelse(loan_GB$Gender=="Male", 1, 0)
loan_GB$Education <- ifelse(loan_GB$Education=="Graduate", 1, 0)
loan_GB$Married <- ifelse(loan_GB$Married=="Yes", 1, 0)
loan_GB$Self_Employed <- ifelse(loan_GB$Self_Employed=="Yes", 1, 0)
#convert Loan_Status to a 0/1 indicator
loan_GB$Loan_Status <- ifelse(loan_GB$Loan_Status=="Y", 1, 0)
#loan$Loan_Status <- as.factor(loan$Loan_Status)
#head(loan_GB) #verify all numeric inputs
At this point pre-processing of our data is complete and we’ve verified that we will indeed be feeding our model all numeric inputs.
We use the built-in sample function to perform a train-test split with replacement (75% training, 25% testing) and then convert our dataframes to matrices (with corresponding labels) as required to be modeled via XGBoost:
#Partition data
set.seed(1234)
#loan_GB
ind <- sample(2, nrow(loan_GB), replace = TRUE, prob = c(0.75, 0.25))
train_GB <- loan_GB[ind == 1, ]
test_GB <- loan_GB[ind == 2, ]
#dim(train_GB) #457 x 10
#dim(test_GB) #157 x 10
#Create train, test matrices - one hot encoding for factor variables
trainm <- sparse.model.matrix(Loan_Status ~ ., data = train_GB)
#head(trainm)
train_label <- train_GB[,"Loan_Status"]
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
testm <- sparse.model.matrix(Loan_Status ~ ., data = test_GB)
test_label <- test_GB[,"Loan_Status"]
test_matrix <- xgb.DMatrix(data = as.matrix(testm), label = test_label)
We specify our parameters, create our first model with all remaining variables (recall that Gender, Property_Area, and CoapplicantIncome were dropped during pre-processing), tune nrounds (a model parameter) toward the lowest corresponding test mlogloss, adjust eta (the learning rate) for greater accuracy, visualize the corresponding confusion matrices and misclassification error rates, and observe feature importance scores for our first model:
#Parameters
nc <- length(unique(train_label)) #number of classes
xgb_params <- list("objective" = "multi:softprob",
                   "eval_metric" = "mlogloss",
                   "num_class" = nc)
watchlist <- list(train = train_matrix, test = test_matrix)
#extreme Gradient Boosting Model
GB_model <- xgb.train(params = xgb_params,
                      data = train_matrix,
                      nrounds = 5, #run 100 iterations 1st then update based on test error value
                      watchlist = watchlist,
                      eta = 0.1, seed = 186) #inc eta value increased accuracy by 1
## [1] train-mlogloss:0.646876 test-mlogloss:0.650519
## [2] train-mlogloss:0.610260 test-mlogloss:0.615244
## [3] train-mlogloss:0.580004 test-mlogloss:0.586152
## [4] train-mlogloss:0.548391 test-mlogloss:0.563035
## [5] train-mlogloss:0.522107 test-mlogloss:0.545515
#error plot
#e <- data.frame(GB_model$evaluation_log)
#plot(e$iter, e$train_mlogloss)
#lines(e$iter, e$test_mlogloss, col = 'red')
#determine when test error was lowest
#min(e$test_mlogloss) #0.456353 lowest error
#e[e$test_mlogloss == 0.456353,] #5th iteration
#prediction and confusion matrix from TRAIN data
p_train <- predict(GB_model, newdata = train_matrix)
pred_train <- matrix(p_train, nrow = nc, ncol = length(p_train)/nc) %>%
  t() %>% #matrix transpose
  data.frame() %>%
  mutate(label = train_label, max_prob = max.col(.,"last")-1)
tab_train <- table(Prediction = pred_train$max_prob, Actual = pred_train$label)
print(tab_train)
## Actual
## Prediction 0 1
## 0 95 5
## 1 62 295
print(paste('Misclassification Error with Train data', round(1 - sum(diag(tab_train))/sum(tab_train),3)))
## [1] "Misclassification Error with Train data 0.147"
#prediction and confusion matrix from TEST data
p_test <- predict(GB_model, newdata = test_matrix)
pred_test <- matrix(p_test, nrow = nc, ncol = length(p_test)/nc) %>%
  t() %>% #matrix transpose
  data.frame() %>%
  mutate(label = test_label, max_prob = max.col(.,"last")-1)
tab_test <- table(Prediction = pred_test$max_prob, Actual = pred_test$label)
print(tab_test)
## Actual
## Prediction 0 1
## 0 18 7
## 1 17 115
print(paste('Misclassification Error with Test data', round(1 - sum(diag(tab_test))/sum(tab_test),3)))
## [1] "Misclassification Error with Test data 0.153"
#feature importance
imp <- xgb.importance(colnames(train_matrix), model = GB_model)
print(imp) #higher Gain means higher feature importance
## Feature Gain Cover Frequency
## 1: Credit_History 0.682195530 0.191348259 0.06172840
## 2: ApplicantIncome 0.164518950 0.442968096 0.46913580
## 3: LoanAmount 0.100009505 0.235536151 0.29629630
## 4: Loan_Amount_Term 0.018195437 0.087624362 0.03703704
## 5: Dependents 0.012403557 0.016641873 0.03703704
## 6: Married 0.009862281 0.011915016 0.06172840
## 7: Self_Employed 0.009724444 0.011631068 0.02469136
## 8: Education 0.003090295 0.002335176 0.01234568
We note a train-mlogloss (multiclass log loss) value of 0.522107 on five iterations, training data misclassification error rate of 14.7%, and testing misclassification error rate of 15.3%. Later, we’ll compare the output statistics of the confusion matrix with those of the second model to determine which has a greater predictive accuracy.
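As an aside on choosing nrounds: rather than reading the watchlist log manually, one could let xgboost stop on its own. A minimal sketch reusing the xgb_params, train_matrix, and watchlist objects above:

#stop adding rounds once test mlogloss has not improved for 10 rounds
GB_model_es <- xgb.train(params = xgb_params,
                         data = train_matrix,
                         nrounds = 100,
                         watchlist = watchlist,
                         eta = 0.1,
                         early_stopping_rounds = 10)
GB_model_es$best_iteration #round with the best test mlogloss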
From the above feature importance scores, we observe that Credit_History
, ApplicantIncome
, and LoanAmount
have the highest Gain
scores. We’ll omit all other features and utilize just these features to make up our second model.
Recall: ApplicantIncome
is now representative of ApplicantIncome
and CoapplicantIncome
.
We create the second model in the same manner as the first. This time we only consider the important factors: Credit_History, ApplicantIncome, and LoanAmount:
#Create train, test matrices - one hot encoding for factor variables
trainm2 <- sparse.model.matrix(Loan_Status ~ Credit_History + LoanAmount + ApplicantIncome, data = train_GB)
#head(trainm2)
train_label2 <- train_GB[,"Loan_Status"]
train_matrix2 <- xgb.DMatrix(data = as.matrix(trainm2), label = train_label2)
testm2 <- sparse.model.matrix(Loan_Status ~ Credit_History + LoanAmount + ApplicantIncome, data = test_GB)
test_label2 <- test_GB[,"Loan_Status"]
test_matrix2 <- xgb.DMatrix(data = as.matrix(testm2), label = test_label2)
#Parameters
nc2 <- length(unique(train_label2)) #number of classes
xgb_params2 <- list("objective" = "multi:softprob",
                    "eval_metric" = "mlogloss",
                    "num_class" = nc2)
watchlist2 <- list(train = train_matrix2, test = test_matrix2)
#extreme Gradient Boosting Model
GB_model2 <- xgb.train(params = xgb_params2,
                       data = train_matrix2,
                       nrounds = 20, #run 100 iterations 1st then update based on test error value
                       watchlist = watchlist2,
                       eta = 0.1, seed = 606) #inc eta value increased accuracy by 1
## [1] train-mlogloss:0.646308 test-mlogloss:0.655313
## [2] train-mlogloss:0.607838 test-mlogloss:0.621005
## [3] train-mlogloss:0.575808 test-mlogloss:0.591478
## [4] train-mlogloss:0.546353 test-mlogloss:0.569505
## [5] train-mlogloss:0.519880 test-mlogloss:0.553319
## [6] train-mlogloss:0.497542 test-mlogloss:0.538174
## [7] train-mlogloss:0.478225 test-mlogloss:0.526048
## [8] train-mlogloss:0.461115 test-mlogloss:0.517006
## [9] train-mlogloss:0.444932 test-mlogloss:0.506839
## [10] train-mlogloss:0.431035 test-mlogloss:0.498805
## [11] train-mlogloss:0.419173 test-mlogloss:0.493698
## [12] train-mlogloss:0.407526 test-mlogloss:0.489009
## [13] train-mlogloss:0.397120 test-mlogloss:0.484104
## [14] train-mlogloss:0.385230 test-mlogloss:0.482407
## [15] train-mlogloss:0.377459 test-mlogloss:0.482483
## [16] train-mlogloss:0.371262 test-mlogloss:0.481562
## [17] train-mlogloss:0.361928 test-mlogloss:0.481546
## [18] train-mlogloss:0.354974 test-mlogloss:0.481108
## [19] train-mlogloss:0.348857 test-mlogloss:0.479523
## [20] train-mlogloss:0.343374 test-mlogloss:0.480314
#error plot
#e2 <- data.frame(GB_model2$evaluation_log)
#plot(e2$iter, e2$train_mlogloss)
#lines(e2$iter, e2$test_mlogloss, col = 'red')
#determine when test error was lowest
#min(e2$test_mlogloss) #0.478216 lowest error
#e2[e2$test_mlogloss == 0.478216,] #20th iteration
#prediction and confusion matrix from train data
p_train2 <- predict(GB_model2, newdata = train_matrix2)
pred_train2 <- matrix(p_train2, nrow = nc2, ncol = length(p_train2)/nc2) %>%
  t() %>% #matrix transpose
  data.frame() %>%
  mutate(label = train_label2, max_prob = max.col(.,"last")-1)
tab_train2 <- table(Prediction = pred_train2$max_prob, Actual = pred_train2$label)
print(tab_train2)
## Actual
## Prediction 0 1
## 0 96 4
## 1 61 296
print(paste('Misclassification Error with Train data', round(1 - sum(diag(tab_train2))/sum(tab_train2),3)))
## [1] "Misclassification Error with Train data 0.142"
#prediction and confusion matrix from test data
p_test2 <- predict(GB_model2, newdata = test_matrix2)
pred_test2 <- matrix(p_test2, nrow = nc2, ncol = length(p_test2)/nc2) %>%
  t() %>% #matrix transpose
  data.frame() %>%
  mutate(label = test_label2, max_prob = max.col(.,"last")-1)
tab_test2 <- table(Prediction = pred_test2$max_prob, Actual = pred_test2$label)
print(tab_test2)
## Actual
## Prediction 0 1
## 0 18 9
## 1 17 113
print(paste('Misclassification Error with Test data', round(1 - sum(diag(tab_test2))/sum(tab_test2),3)))
## [1] "Misclassification Error with Test data 0.166"
We note a train-mlogloss (multiclass log loss) value of 0.343374 on twenty iterations, training data misclassification error rate of 14.2%, and testing misclassification error rate of 16.6%.
Next, we’ll put our models side-by-side to determine which has a greater predictive accuracy.
We evaluate our models by applying GB Model 1 and GB Model 2 to the test data set. We utilize the confusionMatrix function from the caret library and present our statistics as a kable table to glean more insight regarding the comparative statistics between each model’s performance:
AccuracyGB1 <- confusionMatrix(factor(pred_test$max_prob), factor(pred_test$label))$overall['Accuracy']
AccuracyGB2 <- confusionMatrix(factor(pred_test2$max_prob), factor(pred_test2$label))$overall['Accuracy']
GB_Model_1 <- confusionMatrix(factor(pred_test$max_prob), factor(pred_test$label))$byClass
GB_Model_1 <- data.frame(GB_Model_1)
GB_Model_1 <- rbind("Accuracy" = AccuracyGB1, GB_Model_1)
GB_Model_2 <- confusionMatrix(factor(pred_test2$max_prob), factor(pred_test2$label))$byClass
GB_Model_2 <- data.frame(GB_Model_2)
GB_Model_2 <- rbind("Accuracy" = AccuracyGB2, GB_Model_2)
tabularview <- data.frame(GB_Model_1, GB_Model_2)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
| | GB_Model_1 | GB_Model_2 |
|---|---|---|
| Accuracy | 0.8471338 | 0.8343949 |
| Sensitivity | 0.5142857 | 0.5142857 |
| Specificity | 0.9426230 | 0.9262295 |
| Pos Pred Value | 0.7200000 | 0.6666667 |
| Neg Pred Value | 0.8712121 | 0.8692308 |
| Precision | 0.7200000 | 0.6666667 |
| Recall | 0.5142857 | 0.5142857 |
| F1 | 0.6000000 | 0.5806452 |
| Prevalence | 0.2229299 | 0.2229299 |
| Detection Rate | 0.1146497 | 0.1146497 |
| Detection Prevalence | 0.1592357 | 0.1719745 |
| Balanced Accuracy | 0.7284543 | 0.7202576 |
The models share the same Sensitivity, Recall, Prevalence, and Detection Rates, but for every other metric GB Model 1 outperforms GB Model 2. Based on the output statistics above GB Model 1 is superior to GB Model 2 and will serve as our chosen Gradient Boosting model.
We put our strongest Decision Tree, Random Forest, and Gradient Boosting models side-by-side-by-side to interpret their common classification metrics and determine which has the greatest predictive accuracy.
We consider the common classification metrics reported by each model's confusion matrix: Accuracy, Sensitivity, Specificity, Pos/Neg Pred Value, Precision, Recall, F1, Prevalence, Detection Rate, Detection Prevalence, and Balanced Accuracy (the core quantities are sketched below).
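For reference, these core metrics can be reproduced from the raw 2x2 counts; a minimal sketch (note that caret treats "N", the first factor level, as the positive class here):

#metrics from a 2x2 confusion matrix, with "N" as the positive class
metrics_from_counts <- function(TP, FP, FN, TN) {
  sens <- TP / (TP + FN)  #Sensitivity / Recall
  spec <- TN / (TN + FP)  #Specificity
  prec <- TP / (TP + FP)  #Precision / Pos Pred Value
  c(Accuracy = (TP + TN) / (TP + FP + FN + TN),
    Sensitivity = sens, Specificity = spec, Precision = prec,
    F1 = 2 * prec * sens / (prec + sens),
    BalancedAccuracy = (sens + spec) / 2)
}
#e.g. the Decision Tree's test confusion matrix: TP = 25, FP = 2, FN = 30, TN = 110
metrics_from_counts(25, 2, 30, 110)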
These models are all applied to the test data set and make use of the confusionMatrix function from the caret library. We present the corresponding common classification metrics as a kable table to glean more insight regarding the relative strength of each model’s performance:
#Present all model statistics on the same graphic
tabularview <- data.frame(DT_Model, RF_Model_2, GB_Model_1)
tabularview %>% kableExtra::kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options="scale_down")
| | DT_Model | RF_Model_2 | GB_Model_1 |
|---|---|---|---|
| Accuracy | 0.8083832 | 0.8121212 | 0.8471338 |
| Sensitivity | 0.4545455 | 0.4693878 | 0.5142857 |
| Specificity | 0.9821429 | 0.9568966 | 0.9426230 |
| Pos Pred Value | 0.9259259 | 0.8214286 | 0.7200000 |
| Neg Pred Value | 0.7857143 | 0.8102190 | 0.8712121 |
| Precision | 0.9259259 | 0.8214286 | 0.7200000 |
| Recall | 0.4545455 | 0.4693878 | 0.5142857 |
| F1 | 0.6097561 | 0.5974026 | 0.6000000 |
| Prevalence | 0.3293413 | 0.2969697 | 0.2229299 |
| Detection Rate | 0.1497006 | 0.1393939 | 0.1146497 |
| Detection Prevalence | 0.1616766 | 0.1696970 | 0.1592357 |
| Balanced Accuracy | 0.7183442 | 0.7131422 | 0.7284543 |
Consulting the above output table, we observe that DT Model has the strongest performance for Specificity
, Pos Pred Value
, Precision
, and F1
whereas GB Model 1 has the strongest performance for Sensitivity
, Neg Pred Value
and Balanced Accuracy
. Thus, DT Model performs better when predicting loan approvals while GB Model 1 performs better when predicting loan rejections.
Decision trees often raise concerns regarding over-fitting, bias, and variance error because of their simplicity, and random forests are meant to address these concerns by aggregating a collection of decision trees into a single result. We found it surprising that the decision tree model outperformed the random forest model. This may have been because of how we implemented the model, or it may simply have been a poor situation for random forests.
Where decision trees are lauded for ease of interpretability and speed and random forests are lauded for how they handle noise, gradient boosting is known to handle unbalanced data the best. We would recommend gradient boosting for the simple fact that we’re dealing with imbalanced classes (recall 192 N’s, 422 Y’s) and GB Model 1 is the best performing model across the board when this is taken into account.
……………………………………………………………………..
In completing this assignment, reference was made to the following: