Business Intelligence Lab Submission Markdown

<> <23/10/2023>

Student Details
Setup Chunk
- Milestone 6 out of 8
Loading the Loan Status Train Imputed Dataset

Student Details

Student ID Numbers and Names of Group Members	<list one Student name, class group (just the letter; A, B, or C), and ID per line, e.g., 123456 - A - John Leposo; you should be between 2 and 5 members per group> 128998 - B - Crispus Nzano \|
GitHub Classroom Group Name	BI-Loan-Appraisal-Project \|
Course Code	BBT4206
Course Name	Business Intelligence II
Program	Bachelor of Business Information Technology
Semester Duration	21^st August 2023 to 28^th November 2023

Setup Chunk

We start by installing all the required packages, each Issue and Milestone will have its own packages.

## formatR - Required to format R code in the markdown ----

if (require("languageserver")) {
  require("languageserver")
} else {
  install.packages("languageserver", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}

# Introduction ----
# Resampling methods are techniques that can be used to improve the performance
# and reliability of machine learning algorithms. They work by creating
# multiple training sets from the original training set. The model is then
# trained on each training set, and the results are averaged. This helps to
# reduce overfitting and improve the model's generalization performance.

# Resampling methods include:
## Splitting the dataset into train and test sets ----
## Bootstrapping (sampling with replacement) ----
## Basic k-fold cross validation ----
## Repeated cross validation ----
## Leave One Out Cross-Validation (LOOCV) ----

# STEP 1. Install and Load the Required Packages ----
## mlbench ----
if (require("mlbench")) {
  require("mlbench")
} else {
  install.packages("mlbench", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}

## caret ----
if (require("caret")) {
  require("caret")
} else {
  install.packages("caret", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}

## kernlab ----
if (require("kernlab")) {
  require("kernlab")
} else {
  install.packages("kernlab", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}

## randomForest ----
if (require("randomForest")) {
  require("randomForest")
} else {
  install.packages("randomForest", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}

install.packages("formatR")

## The following package(s) will be installed:
## - formatR [1.14]
## These packages will be installed into "C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/markdown/renv/library/R-4.3/x86_64-w64-mingw32".
## 
## # Installing packages --------------------------------------------------------
## - Installing formatR ...                        OK [linked from cache]
## Successfully installed 1 package in 16 milliseconds.

Note: the following “KnitR” options have been set as the defaults in this markdown:
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, eval = TRUE, collapse = FALSE, tidy.opts = list(width.cutoff = 80), tidy = TRUE).

More KnitR options are documented here https://bookdown.org/yihui/rmarkdown-cookbook/chunk-options.html and here https://yihui.org/knitr/options/.

knitr::opts_chunk$set(
    eval = TRUE,
    echo = TRUE,
    warning = FALSE,
    collapse = FALSE,
    tidy = TRUE
)

Note: the following “R Markdown” options have been set as the defaults in this markdown:

output:

github_document:
toc: yes
toc_depth: 4
fig_width: 6
fig_height: 4
df_print: default

editor_options:
chunk_output_type: console

Milestone 6 out of 8

Loading the Loan Status Train Imputed Dataset

Issue 6 Training the Model.

## 6. Training the Model ----

if (!is.element("languageserver", installed.packages()[, 1])) {
    install.packages("languageserver", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("languageserver")

# Introduction ---- The performance of the trained models can be compared
# visually. This is done to help you to identify and choose the top performing
# models.

# STEP 1. Install and Load the Required Packages ---- mlbench ----
if (require("mlbench")) {
    require("mlbench")
} else {
    install.packages("mlbench", dependencies = TRUE, repos = "https://cloud.r-project.org")
}

## caret ----
if (require("caret")) {
    require("caret")
} else {
    install.packages("caret", dependencies = TRUE, repos = "https://cloud.r-project.org")
}

## kernlab ----
if (require("kernlab")) {
    require("kernlab")
} else {
    install.packages("kernlab", dependencies = TRUE, repos = "https://cloud.r-project.org")
}

## randomForest ----
if (require("randomForest")) {
    require("randomForest")
} else {
    install.packages("randomForest", dependencies = TRUE, repos = "https://cloud.r-project.org")
}



## STEP 2. Load the Dataset ----
library(readr)
loans <- read_csv("C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/data/loans_imputed.csv")

## Rows: 614 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Gender, Married, Dependents, Education, SelfEmployed, PropertyArea,...
## dbl (5): ApplicantIncome, CoapplicantIncome, LoanAmount, LoanAmountTerm, Cre...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(loans)

## STEP 3. The Resamples Function ----

# Analogy: We cannot compare apples with oranges; we compare apples with
# apples.

# The 'resamples()' function checks that the models are comparable and that
# they used the same training scheme ('train_control' configuration).  To do
# this, after the models are trained, they are added to a list and we pass this
# list of models as an argument to the resamples() function in R.

## 3.a. Train the Models ---- We train the following models, all of which are
## using 10-fold repeated cross validation with 3 repeats: LDA CART KNN SVM
## Random Fores

train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

### LDA ----
set.seed(7)
loans_imputed_model_lda <- train(Status ~ ., data = loans, method = "lda", trControl = train_control)

### CART ----
set.seed(7)
loans_imputed_model_cart <- train(Status ~ ., data = loans, method = "rpart", trControl = train_control)

### KNN ----
set.seed(7)
loans_imputed_model_knn <- train(Status ~ ., data = loans, method = "knn", trControl = train_control)

### SVM ----
set.seed(7)
loans_imputed_model_svm <- train(Status ~ ., data = loans, method = "svmRadial",
    trControl = train_control)

### Random Forest ----
set.seed(7)
loans_imputed_model_rf <- train(Status ~ ., data = loans, method = "rf", trControl = train_control)

## 3.b. Call the `resamples` Function ---- We then create a list of the model
## results and pass the list as an argument to the `resamples` function.

results <- resamples(list(LDA = loans_imputed_model_lda, CART = loans_imputed_model_cart,
    KNN = loans_imputed_model_knn, SVM = loans_imputed_model_svm, RF = loans_imputed_model_rf))

# STEP 4. Display the Results ---- 1. Table Summary ---- This is the simplest
# comparison. It creates a table with one model per row and its corresponding
# evaluation metrics displayed per column.

summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: LDA, CART, KNN, SVM, RF 
## Number of resamples: 30 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LDA  0.7377049 0.7960578 0.8130619 0.8137810 0.8360656 0.8688525    0
## CART 0.7049180 0.7714172 0.8064516 0.7947343 0.8196721 0.8688525    0
## KNN  0.5573770 0.6169355 0.6557377 0.6530434 0.6774194 0.7213115    0
## SVM  0.7377049 0.7911546 0.8064516 0.8094182 0.8326943 0.8524590    0
## RF   0.7377049 0.7877446 0.8064516 0.8050728 0.8225806 0.8524590    0
## 
## Kappa 
##            Min.     1st Qu.     Median      Mean    3rd Qu.      Max. NA's
## LDA   0.2339089  0.44725111 0.49539163 0.4930891 0.56480068 0.6543909    0
## CART  0.2855051  0.38377792 0.46883104 0.4562377 0.51904090 0.6543909    0
## KNN  -0.2263589 -0.04734239 0.01493286 0.0146983 0.09109683 0.2004626    0
## SVM   0.2339089  0.44265198 0.49041096 0.4767094 0.54651307 0.6174216    0
## RF    0.2339089  0.43317789 0.47285762 0.4699364 0.52405786 0.6047516    0

## 2. Box and Whisker Plot ---- This is useful for visually observing the
## spread of the estimated accuracies for different algorithms and how they
## relate.

scales <- list(x = list(relation = "free"), y = list(relation = "free"))
bwplot(results, scales = scales)

## 3. Dot Plots ---- They show both the mean estimated accuracy as well as the
## 95% confidence interval (e.g. the range in which 95% of observed scores
## fell).

scales <- list(x = list(relation = "free"), y = list(relation = "free"))
dotplot(results, scales = scales)

## 4. Scatter Plot Matrix ---- This is useful when considering whether the
## predictions from two different algorithms are correlated. If weakly
## correlated, then they are good candidates for being combined in an ensemble
## prediction.

splom(results)

## 5. Pairwise xyPlots ---- You can zoom in on one pairwise comparison of the
## accuracy of trial-folds for two models using an xyplot.

# xyplot plots to compare models
xyplot(results, models = c("LDA", "SVM"))

# or xyplot plots to compare models
xyplot(results, models = c("SVM", "CART"))

## 6. Statistical Significance Tests ---- This is used to calculate the
## significance of the differences between the metric distributions of the
## various models.

### Upper Diagonal ---- The upper diagonal of the table shows the estimated
### difference between the distributions. If we think that LDA is the most
### accurate model from looking at the previous graphs, we can get an estimate
### of how much better it is than specific other models in terms of absolute
### accuracy.

### Lower Diagonal ---- The lower diagonal contains p-values of the null
### hypothesis.  The null hypothesis is a claim that 'the distributions are the
### same'.  A lower p-value is better (more significant).

diffs <- diff(results)

summary(diffs)

## 
## Call:
## summary.diff.resamples(object = diffs)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##      LDA       CART      KNN       SVM       RF       
## LDA             0.019047  0.160738  0.004363  0.008708
## CART 0.003626             0.141691 -0.014684 -0.010338
## KNN  < 2.2e-16 2.636e-13           -0.156375 -0.152029
## SVM  0.300813  0.047484  < 2.2e-16            0.004345
## RF   0.013136  0.768128  < 2.2e-16 1.000000           
## 
## Kappa 
##      LDA       CART      KNN       SVM       RF       
## LDA             0.036851  0.478391  0.016380  0.023153
## CART 0.005434             0.441539 -0.020472 -0.013699
## KNN  < 2.2e-16 1.044e-15           -0.462011 -0.455238
## SVM  0.088726  0.454348  < 2.2e-16            0.006773
## RF   0.006104  1.000000  < 2.2e-16 1.000000

# The model of choice will be 'LDA'.This is as a result of the model givong the
# highest accuracy of 0.8137810 as compared to the other models (CART 0.7947),
# (KNN 0.6530), (SVM 0.8094) AND (RF 0.8050)


# Upload *the link* to 'Lab-Submission-Markdown.md' (not .Rmd) markdown file
# hosted on Github (do not upload the .Rmd or .md markdown files) through the
# submission link provided on eLearning. install the same package version in
# their local machine during their initialization step.  renv::snapshot()

etc. as per the lab submission requirements. Be neat and communicate in a clear and logical manner.