| Date | 23/10/2023 |
| Student ID Numbers and Names of Group Members | <list one Student name, class group (just the letter; A, B, or C), and ID per line, e.g., 123456 - A - John Leposo; you should have between 2 and 5 members per group> |
| GitHub Classroom Group Name | BI-Loan-Appraisal-Project |
| Course Code | BBT4206 |
| Course Name | Business Intelligence II |
| Program | Bachelor of Business Information Technology |
| Semester Duration | 21st August 2023 to 28th November 2023 |
We start by installing all the required packages; each Issue and Milestone will have its own packages.
## languageserver - Required to use R in VS Code ----
if (require("languageserver")) {
  require("languageserver")
} else {
  install.packages("languageserver", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}
# Introduction ----
# Resampling methods are techniques that can be used to improve the performance
# and reliability of machine learning algorithms. They work by creating
# multiple training sets from the original training set. The model is then
# trained on each training set, and the results are averaged. This helps to
# reduce overfitting and improve the model's generalization performance.
# Resampling methods include:
## Splitting the dataset into train and test sets ----
## Bootstrapping (sampling with replacement) ----
## Basic k-fold cross validation ----
## Repeated cross validation ----
## Leave One Out Cross-Validation (LOOCV) ----
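# A minimal sketch (our own illustration, not part of the lab template) of how
# each resampling method can be declared using caret::trainControl(); the
# "caret" package is installed and loaded in STEP 1 below:
# train_control <- trainControl(method = "boot", number = 100)  # bootstrapping
# train_control <- trainControl(method = "cv", number = 10)  # basic k-fold CV
# train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# train_control <- trainControl(method = "LOOCV")  # leave one out cross-validation
# Splitting into train and test sets is done directly, e.g., using
# caret::createDataPartition() on the target variable.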
# STEP 1. Install and Load the Required Packages ----
## mlbench ----
if (require("mlbench")) {
require("mlbench")
} else {
install.packages("mlbench", dependencies = TRUE,
repos = "https://cloud.r-project.org")
}
## caret ----
if (require("caret")) {
require("caret")
} else {
install.packages("caret", dependencies = TRUE,
repos = "https://cloud.r-project.org")
}
## kernlab ----
if (require("kernlab")) {
require("kernlab")
} else {
install.packages("kernlab", dependencies = TRUE,
repos = "https://cloud.r-project.org")
}
## randomForest ----
if (require("randomForest")) {
require("randomForest")
} else {
install.packages("randomForest", dependencies = TRUE,
repos = "https://cloud.r-project.org")
}Note: the following “KnitR” options have
been set as the defaults in this markdown:
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, eval = TRUE, collapse = FALSE, tidy.opts = list(width.cutoff = 80), tidy = TRUE).
More KnitR options are documented here https://bookdown.org/yihui/rmarkdown-cookbook/chunk-options.html and here https://yihui.org/knitr/options/.
Note: the following "R Markdown" options have been set as the defaults in this
markdown:
output:
  github_document:
    toc: yes
    toc_depth: 4
    fig_width: 6
    fig_height: 4
    df_print: default
editor_options:
  chunk_output_type: console
Issue 1 Descriptive Statistics.
# 1 Descriptive Statistics ----
# Install renv:
if (!is.element("renv", installed.packages()[, 1])) {
install.packages("renv", dependencies = TRUE)
}
require("renv")## Loading required package: renv
##
## Attaching package: 'renv'
## The following object is masked from 'package:languageserver':
##
## run
## The following objects are masked from 'package:stats':
##
## embed, update
## The following objects are masked from 'package:utils':
##
## history, upgrade
## The following objects are masked from 'package:base':
##
## autoload, load, remove
# Use renv::init() to initialize renv in a new or existing project.
# The prompt received after executing renv::init() is as shown below: This
# project already has a lockfile. What would you like to do?
# 1: Restore the project from the lockfile.
# 2: Discard the lockfile and re-initialize the project.
# 3: Activate the project without snapshotting or installing any packages.
# 4: Abort project initialization.
# Select option 1 to restore the project from the lockfile
# This will set up a project library, containing all the packages you are
# currently using. The packages (and all the metadata needed to reinstall them)
# are recorded into a lockfile, renv.lock, and a .Rprofile ensures that the
# library is used every time you open that project.
# This can also be configured using the RStudio GUI when you click the project
# file, e.g., 'BBT4206-R.Rproj' in the case of this project. Then navigate to
# the 'Environments' tab and select 'Use renv with this project'.
# As you continue to work on your project, you can install and upgrade
# packages, using either: install.packages() and update.packages or
# renv::install() and renv::update()
# You can also clean up a project by removing unused packages using the
# following command: renv::clean()
# After you have confirmed that your code works as expected, use
# renv::snapshot() to record the packages and their sources in the lockfile.
# Later, if you need to share your code with someone else or run your code on a
# new machine, your collaborator (or you) can call renv::restore() to reinstall
# the specific package versions recorded in the lockfile.
# Execute the following code to reinstall the specific package versions
# recorded in the lockfile:
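# renv::restore()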
# One of the packages required to use R in VS Code is the 'languageserver'
# package. It can be installed manually as follows if you are not using the
# renv::restore() command.
if (!is.element("languageserver", installed.packages()[, 1])) {
install.packages("languageserver", dependencies = TRUE)
}
require("languageserver")
# Loading Datasets ---- STEP 2: Load datasets ----
library(readr)
loans <- read_csv("C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/data/loans_imputed.csv")
## Rows: 614 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Gender, Married, Dependents, Education, SelfEmployed, PropertyArea,...
## dbl (5): ApplicantIncome, CoapplicantIncome, LoanAmount, LoanAmountTerm, Cre...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(loans)
# Dimensions ---- STEP 3. Preview the Loaded Datasets ---- Dimensions refer to
# the number of observations (rows) and the number of
# attributes/variables/features (columns). Execute the following commands to
# display the dimensions of your datasets:
dim(loans)
## [1] 614  12
# Data Types ---- STEP 4. Identify the Data Types ---- Knowing the data types
# will help you to identify the most appropriate visualization types and
# algorithms that can be applied. It can also help you to identify the need to
# convert from categorical data (factors) to integers or vice versa where
# necessary. Execute the following command to identify the data types:
sapply(loans, class)
##            Gender           Married        Dependents         Education
## "character" "character" "character" "character"
## SelfEmployed ApplicantIncome CoapplicantIncome LoanAmount
## "character" "numeric" "numeric" "numeric"
## LoanAmountTerm CreditHistory PropertyArea Status
## "numeric" "numeric" "character" "character"
# Descriptive Statistics ----
# We will first understand the data before using it to design prediction models
# and to make generalizable inferences.
# 1. Measures of frequency (e.g., count, percent)
# 2. Measures of central tendency (e.g., mean, median, mode)
#    Further reading: https://www.scribbr.com/statistics/central-tendency/
# 3. Measures of distribution/dispersion/spread/scatter/variability (e.g.,
#    range, quartiles, interquartile range, standard deviation, variance,
#    kurtosis, skewness)
#    Further reading: https://www.scribbr.com/statistics/variability/
#    Further reading: https://digitaschools.com/descriptive-statistics-skewness-and-kurtosis/
#    Further reading: https://www.scribbr.com/statistics/skewness/
# 4. Measures of relationship (e.g., covariance, correlation, ANOVA)
#    Further reading: https://www.k2analytics.co.in/covariance-and-correlation/
#    Further reading: https://www.scribbr.com/statistics/one-way-anova/
#    Further reading: https://www.scribbr.com/statistics/two-way-anova/
# Understanding your data can lead to:
# (i)   Data cleaning: removing bad data or imputing missing data.
# (ii)  Data transformation: reducing the skewness by applying the same
#       function to all the observations.
# (iii) Data modelling: you may notice properties of the data, such as
#       distributions or data types, that suggest the use (or not) of
#       specific algorithms.
## Measures of Frequency ----
### STEP 5. Identify the number of instances that belong to each class. ---- It
### is more sensible to count categorical variables (factors or dimensions)
### than numeric variables, e.g., counting the number of male and female
### participants instead of counting the frequency of each participant’s
### height.
loans_freq <- loans$Education
cbind(frequency = table(loans_freq),
      percentage = prop.table(table(loans_freq)) * 100)
##              frequency percentage
## Graduate           480    78.1759
## Not Graduate       134    21.8241
## Measures of Central Tendency ---- STEP 6. Calculate the mode ----
## Unfortunately, R does not have an in-built function for calculating the
## mode. We, therefore, must manually create a function that can calculate the
## mode.
loans_Education_mode <- names(table(loans$Education))[
  which(table(loans$Education) == max(table(loans$Education)))
]
print(loans_Education_mode)
## [1] "Graduate"
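# The same logic can be wrapped in a reusable helper (a sketch; the name
# "get_mode" is our own and not part of base R):
get_mode <- function(x) {
  counts <- table(x)  # frequency of each unique value
  names(counts)[counts == max(counts)]  # the value(s) with the highest count
}
get_mode(loans$Education)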
## Measures of Distribution/Dispersion/Spread/Scatter/Variability ----
### STEP 7. Measure the distribution of the data for each variable ----
summary(loans)
##    Gender            Married           Dependents         Education
## Length:614 Length:614 Length:614 Length:614
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## SelfEmployed ApplicantIncome CoapplicantIncome LoanAmount
## Length:614 Min. : 150 Min. : 0 Min. : 150
## Class :character 1st Qu.: 2878 1st Qu.: 0 1st Qu.: 2875
## Mode :character Median : 3812 Median : 1188 Median : 3768
## Mean : 5403 Mean : 1621 Mean : 5371
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.: 5746
## Max. :81000 Max. :41667 Max. :81000
## LoanAmountTerm CreditHistory PropertyArea Status
## Min. : 12.0 Min. :0.000 Length:614 Length:614
## 1st Qu.:360.0 1st Qu.:1.000 Class :character Class :character
## Median :360.0 Median :1.000 Mode :character Mode :character
## Mean :342.3 Mean :0.855
## 3rd Qu.:360.0 3rd Qu.:1.000
## Max. :480.0 Max. :1.000
### STEP 8. Measure the standard deviation of each variable ---- Measuring the
### variability in the dataset is important because the amount of variability
### determines how well you can generalize results from the sample dataset to a
### new observation in the population.
# Low variability is ideal because it means that you can better predict
# information about the population based on sample data. High variability means
# that the values are less consistent, thus making it harder to make
# predictions.
# The format "dataset[rows, columns]" can be used to specify the exact rows
# and columns to be considered. "dataset[, columns]" implies all rows will be
# considered. Specifying "loans[, -4]" implies all the columns except column
# number 4. This can also be stated as
# "loans[, c(1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12)]". This allows us to
# calculate the standard deviation of only the columns that are numeric, thus
# leaving out the columns termed as "factors" (categorical) or those that have
# a string data type.
# check data types
str(loans)
## spc_tbl_ [614 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Gender : chr [1:614] "Male" "Male" "Male" "Male" ...
## $ Married : chr [1:614] "No" "Yes" "Yes" "Yes" ...
## $ Dependents : chr [1:614] "0" "1" "0" "0" ...
## $ Education : chr [1:614] "Graduate" "Graduate" "Graduate" "Not Graduate" ...
## $ SelfEmployed : chr [1:614] "No" "No" "Yes" "No" ...
## $ ApplicantIncome : num [1:614] 5849 4583 3000 2583 6000 ...
## $ CoapplicantIncome: num [1:614] 0 1508 0 2358 0 ...
## $ LoanAmount : num [1:614] 2600 4583 3000 2583 6000 ...
## $ LoanAmountTerm : num [1:614] 360 360 360 360 360 360 360 360 360 360 ...
## $ CreditHistory : num [1:614] 1 1 1 1 1 1 1 0 1 1 ...
## $ PropertyArea : chr [1:614] "Urban" "Rural" "Urban" "Urban" ...
## $ Status : chr [1:614] "Y" "N" "Y" "Y" ...
## - attr(*, "spec")=
## .. cols(
## .. Gender = col_character(),
## .. Married = col_character(),
## .. Dependents = col_character(),
## .. Education = col_character(),
## .. SelfEmployed = col_character(),
## .. ApplicantIncome = col_double(),
## .. CoapplicantIncome = col_double(),
## .. LoanAmount = col_double(),
## .. LoanAmountTerm = col_double(),
## .. CreditHistory = col_double(),
## .. PropertyArea = col_character(),
## .. Status = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
## LoanAmountTerm
## 64.44741
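# Then the standard deviation of all the numeric columns (6 to 10):
sapply(loans[, 6:10], sd)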
## ApplicantIncome CoapplicantIncome LoanAmount LoanAmountTerm
## 6109.0416734 2926.2483692 6088.9864510 64.4474059
## CreditHistory
## 0.3523386
# The data type must be numeric (double) for the standard deviation to be
# computed; character (string) and factor columns are therefore excluded.
### STEP 9. Measure the kurtosis of each variable ---- The kurtosis informs
### you of how often outliers occur in the results.
# There are different formulas for calculating kurtosis. Specifying "type = 2"
# allows us to use the 2nd formula, which is the same kurtosis formula used in
# SPSS and SAS. More details about any function can be obtained by searching
# the R help knowledge base. The knowledge base says that in "type = 2" (used
# in SPSS and SAS):
# 1. Kurtosis < 3 implies a low number of outliers
# 2. Kurtosis = 3 implies a medium number of outliers
# 3. Kurtosis > 3 implies a high number of outliers
if (!is.element("e1071", installed.packages()[, 1])) {
install.packages("e1071", dependencies = TRUE)
}
require("e1071")## Loading required package: e1071
## CreditHistory
## 2.095179
### STEP 10. Measure the skewness of each variable ----
# The skewness informs you of the asymmetry of the distribution of results.
# Similar to kurtosis, there are several ways of computing the skewness. Using
# “type = 2” can be interpreted as:
# 1. Skewness between -0.4 and 0.4 (inclusive) implies that there is no skew
#    in the distribution of results; the distribution of results is
#    symmetrical; it is a normal distribution.
# 2. Skewness above 0.4 implies a positive skew; a right-skewed distribution.
# 3. Skewness below -0.4 implies a negative skew; a left-skewed distribution.
sapply(loans[, 10], skewness, type = 2)
## CreditHistory
## -2.021971
# Note: executing skewness(loans$Dependents, type = 2) would compute the
# skewness for one variable called "Dependents" in the loans dataset, whereas
# the command above computes the skewness for the numeric "CreditHistory"
# variable (column 10) in the "loans" dataset.
## Measures of Relationship ----
## STEP 11. Measure the covariance between variables ---- Note that the
## covariance and the correlation are computed for numeric values only, not
## categorical values.
loans_cov <- cov(loans[, 6:10])
View(loans_cov)
## STEP 12. Measure the correlation between variables ----
loans_cor <- cor(loans[, 6:10])
View(loans_cor)
# Inferential Statistics ---- Read the following article:
# https://www.scribbr.com/statistics/inferential-statistics/ Statistical tests
# (either for comparison, correlation, or regression) can be used to conduct
# *hypothesis testing*.
## Parametric versus Non-Parametric Statistical Tests ---- If all the 3 points
## below are true, then use parametric tests, else use non-parametric tests:
## (i)   the population that the sample comes from follows a normal
##       distribution of scores
## (ii)  the sample size is large enough to represent the population
## (iii) the variances of each group being compared are similar
## Statistical tests for comparison ----
## (i)   t-test: parametric; compares means; uses 2 samples.
## (ii)  ANOVA: parametric; compares means; can use more than 3 samples.
## (iii) Mood's median: non-parametric; compares medians; can use more than
##       2 samples.
## (iv)  Wilcoxon signed-rank: non-parametric; compares distributions; uses
##       2 samples.
## (v)   Wilcoxon rank-sum (Mann-Whitney U): non-parametric; compares sums of
##       rankings; uses 2 samples.
## (vi)  Kruskal-Wallis H: non-parametric; compares mean rankings; can use
##       more than 3 samples.
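# A minimal sketch (our own illustration, not part of the lab template) of a
# comparison test on the loans dataset: a Wilcoxon rank-sum test comparing the
# applicant incomes of approved ("Y") and rejected ("N") loans:
wilcox.test(ApplicantIncome ~ as.factor(Status), data = loans)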
## Statistical tests for correlation ----
## (i)   Pearson's r: parametric; expects interval/ratio variables.
## (ii)  Spearman's r: non-parametric; expects ordinal/interval/ratio
##       variables.
## (iii) Chi-square test of independence: non-parametric; expects
##       nominal/ordinal variables.
## Statistical tests for regression ----
## (i)   Simple linear regression: predictor is 1 interval/ratio variable;
##       outcome is 1 interval/ratio variable.
## (ii)  Multiple linear regression: predictors are 2 or more interval/ratio
##       variables; outcome is 1 interval/ratio variable.
## (iii) Logistic regression: predictor is 1 variable (any type); outcome is
##       1 binary variable.
## (iv)  Nominal regression: predictors can be 1 or more variables; outcome is
##       1 nominal variable.
## (v)   Ordinal regression: predictors can be 1 or more variables; outcome is
##       1 ordinal variable.
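# A minimal sketch (our own illustration; the choice of predictors is
# arbitrary) of a logistic regression on the loans dataset, where the binary
# outcome is the loan Status:
loans_logit <- glm(as.factor(Status) ~ ApplicantIncome + CreditHistory,
                   data = loans, family = binomial)
summary(loans_logit)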
# Qualitative Data Analysis ---- This can be done through either thematic
# analysis: https://www.scribbr.com/methodology/thematic-analysis/ or critical
# discourse analysis: https://www.scribbr.com/methodology/discourse-analysis/
# Basic Visualization for Understanding the Dataset ----
# Note: If you are using R Studio, ensure that the 'Plots' window on the bottom
# right of R Studio has enough space to display the chart.
# The fastest way to improve your understanding of the dataset is to visualize
# it. Visualization can help you to spot outliers and give you an idea of
# possible data transformations you can apply. The basic visualizations to
# understand your dataset can be univariate visualizations (helps you to
# understand a single attribute) or multivariate visualizations (helps you to
# understand the interaction between attributes). Packages used to create
# visualizations include:
# (i)   graphics package: used to quickly create basic plots of the data; the
#       most appropriate package for quickly understanding the dataset before
#       conducting further analysis.
# (ii)  lattice package: used to create more visually appealing plots of the
#       data.
# (iii) ggplot2 package: used to create even more visually appealing plots of
#       the data that can then be used to present the analysis results to the
#       intended users. Given its complexity, it is not necessary to use
#       ggplot2 to gain a basic understanding of the dataset prior to further
#       analysis.
# Note that the goal at this point is to understand your data, not to create
# visually appealing plots that are publicly shared. The visually appealing
# plots will be created much later after the best prediction model has been
# chosen.
## Univariate Plots ---- STEP 13. Create Histograms for Each Numeric Attribute
## ---- Histograms help in determining whether an attribute has a Gaussian
## distribution. They can also be used to identify the presence of outliers.
# Execute the following code to create histograms for the “loans” dataset:
# Assuming your dataset is named 'loans' (replace with the actual name of your
# dataset)
data_types <- sapply(loans, class)
print(data_types)
##            Gender           Married        Dependents         Education
## "character" "character" "character" "character"
## SelfEmployed ApplicantIncome CoapplicantIncome LoanAmount
## "character" "numeric" "numeric" "numeric"
## LoanAmountTerm CreditHistory PropertyArea Status
## "numeric" "numeric" "character" "character"
# Execute the following code to create one histogram for attribute 9
# (LoanAmountTerm) in the "loans" dataset. The code below converts column
# number 9 into unlisted and numeric data first so that a histogram can be
# plotted. Further reading:
# https://www.programmingr.com/r-error-messages/x-must-be-numeric-error-in-r-histogram/
loans_loan_amount_term <- as.numeric(unlist(loans[, 9]))
hist(loans_loan_amount_term, main = names(loans)[9])
### STEP 14. Create Box and Whisker Plots for Each Numeric Attribute ---- Box
### and whisker plots are useful in understanding the distribution of data.
### Further reading: https://www.scribbr.com/statistics/interquartile-range/
# Execute the following code to create box and whisker plots for the "loans"
# dataset. This considers attributes 6 and 7 (ApplicantIncome and
# CoapplicantIncome), which are numeric.
par(mar = c(3, 3, 2, 1))
par(mfrow = c(1, 2))  # one row, two columns: one panel per plotted attribute
for (i in 6:7) {
  boxplot(loans[, i], main = names(loans)[i])
}
# This considers the 6th to the 10th attributes, which are numeric; the
# remaining attributes are categorical (character) and are excluded.
boxplot(loans[, 6], main = names(loans)[6])
boxplot(loans[, 7], main = names(loans)[7])
boxplot(loans[, 8], main = names(loans)[8])
boxplot(loans[, 9], main = names(loans)[9])
boxplot(loans[, 10], main = names(loans)[10])
### STEP 15. Create Bar Plots for Each Categorical Attribute ---- Categorical
### attributes (factors) can also be visualized. This is done using a bar chart
### to give an idea of the proportion of instances that belong to each
### category.
barplot(table(loans[, 12]), main = names(loans)[12])
# Execute the following to create bar plots for the categorical attributes in
# the "loans" dataset (a sketch; columns 1 to 5, 11, and 12 hold the
# character/categorical attributes):
par(mfrow = c(2, 4))
for (i in c(1:5, 11, 12)) {
  barplot(table(loans[[i]]), main = names(loans)[i])
}
par(mfrow = c(1, 1))  # reset the plotting layout
### STEP 16. Create a Missingness Map to Identify Missing Data ---- Some
### machine learning algorithms cannot handle missing data. A missingness map
### (also known as a missing plot) can be used to get an idea of the amount
### missing data in the dataset. The x-axis of the missingness map shows the
### attributes of the dataset whereas the y-axis shows the instances in the
### dataset. Horizontal lines indicate missing data for an instance whereas
### vertical lines indicate missing data for an attribute. The missingness map
### requires the “Amelia” package.
# Execute the following to create a map to identify the missing data in each
# dataset:
if (!is.element("Amelia", installed.packages()[, 1])) {
install.packages("Amelia", dependencies = TRUE)
}
require("Amelia")## Loading required package: Amelia
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
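# The missingness map itself (reconstructed to match the narrative below; the
# producing chunk was lost during rendering):
missmap(loans, col = c("red", "grey"), legend = TRUE)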
# As shown in the results, the dataset loaded in this lab has no missing
# data.
## Multivariate Plots ----
### STEP 17. Create a Correlation Plot ---- Correlation plots can be used to
### get an idea of which attributes change together. The function “corrplot()”
### found in the package “corrplot” is required to perform this. The larger the
### dot in the correlation plot, the larger the correlation. Blue represents a
### positive correlation whereas red represents a negative correlation.
# Execute the following to create a correlation plot for the numeric
# attributes (columns 6 to 10) of the "loans" dataset loaded in STEP 2:
if (!is.element("corrplot", installed.packages()[, 1])) {
install.packages("corrplot", dependencies = TRUE)
}
require("corrplot")## Loading required package: corrplot
## corrplot 0.92 loaded
corrplot(cor(loans[, 6:10]), method = "circle")
# Alternatively, the 'ggcorrplot::ggcorrplot()' function can be used to plot a
# more visually appealing plot. The code below shows how to install a package
# in R:
if (!is.element("ggcorrplot", installed.packages()[, 1])) {
install.packages("ggcorrplot", dependencies = TRUE)
}
require("ggcorrplot")## Loading required package: ggcorrplot
ggcorrplot(cor(loans[, 6:10]))
# A ggplot2 scatter plot can also be used to explore the relationship between
# two attributes, broken down by the target class:
ggplot(loans, aes(x = Dependents, y = Education, shape = Status,
                  color = Status)) +
  geom_point() +
  geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'
### STEP 18. Create Multivariate Box and Whisker Plots by Class ---- This
### applies to datasets where the target (dependent) variable is categorical.
### Execute the following code:
if (!is.element("caret", installed.packages()[, 1])) {
install.packages("caret", dependencies = TRUE)
}
require("caret")
# Note: featurePlot() expects numeric predictors and a factor target; passing
# character columns returns NULL instead of a plot, hence the corrected call
# (a sketch, assuming columns 6 to 10 as the numeric predictors):
featurePlot(x = loans[, 6:10], y = as.factor(loans$Status), plot = "box")
Issue 2 Inferential Statistics.
# 2: Inferential Statistics ----
# STEP 1. Install and use renv ----
# The renv package helps you create reproducible environments for your R
# projects.
# As you continue to work on your project, you can install and upgrade
# packages, using either: install.packages() and update.packages or
# renv::install() and renv::update()
# You can also clean up a project by removing unused packages using the
# following command: renv::clean()
# After you have confirmed that your code works as expected, use
# renv::snapshot() to record the packages and their sources in the lockfile.
# One of the packages required to use R in VS Code is the 'languageserver'
# package. It can be installed manually as follows if you are not using the
# renv::restore() command.
if (!is.element("languageserver", installed.packages()[, 1])) {
install.packages("languageserver", dependencies = TRUE)
}
require("languageserver")
# Loading Datasets ---- STEP 2: Load datasets ----
library(readr)
loans <- read_csv("C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/data/loans_imputed.csv")
## Rows: 614 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Gender, Married, Dependents, Education, SelfEmployed, PropertyArea,...
## dbl (5): ApplicantIncome, CoapplicantIncome, LoanAmount, LoanAmountTerm, Cre...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(loans)
## STEP 3. Perform ANOVA on the "loans" dataset ---- ANOVA
## (Analysis of Variance) is a statistical test used to estimate how a
## quantitative dependent variable changes according to the levels of one or
## more categorical independent variables.
# The null hypothesis (H0) of the ANOVA is that “there is no difference in
# means”, and the alternative hypothesis (Ha) is that “the means are different
# from one another”.
# We can use the “aov()” function in R to calculate the test statistic for
# ANOVA. The test statistic is in turn used to calculate the p-value of your
# results. A p-value is a number that describes how likely you are to have
# found a particular set of observations if the null hypothesis were true. The
# smaller the p-value, the more likely you are to reject the null-hypothesis.
# The "loans" dataset loaded in STEP 2 contains observations about loan
# applications. In other words:
# Dependent variable: CreditHistory
# Independent variables: PropertyArea (and Dependents for the two-way ANOVA)
# One-Way ANOVA can be used to test the effect of the 3 types of Property
# Areas on Credit History, whereas Two-Way ANOVA can be used to test the
# combined effect of the Property Area and the number of Dependents on Credit
# History.
summary(loans$Status)
##    Length     Class      Mode
##       614 character character
loans_dataset_one_way_anova <- aov(CreditHistory ~ PropertyArea, data = loans)
summary(loans_dataset_one_way_anova)
##               Df Sum Sq Mean Sq F value Pr(>F)
## PropertyArea 2 0.1 0.04966 0.399 0.671
## Residuals 611 76.0 0.12439
# This shows the result of each variable and the residual. The residual refers
# to all the variation that is not explained by the independent variable. The
# list below is a description of each column in the result:
# 1. Df column: Displays the degrees of freedom for the independent variable
# (the number of levels (categories) in the variable minus 1), and the degrees
# of freedom for the residuals (the total number of observations minus the
# number of variables being estimated + 1, i.e., (df(Residuals)=n-(k+1)).
# 2. Sum Sq column: Displays the sum of squares (a.k.a. the total variation
#    between the group means and the overall mean). It is better to have a
#    lower Sum Sq value for residuals.
# 3. Mean Sq column: The mean of the sum of squares, calculated by dividing
#    the sum of squares by the degrees of freedom for each parameter.
# 4. F value column: The test statistic from the F test. This is the mean
#    square of each independent variable divided by the mean square of the
#    residuals. The larger the F value, the more likely it is that the
#    variation caused by the independent variable is real and not due to
#    chance.
# 5. Pr(>F) column: The p-value of the F statistic. This shows how likely it
#    is that the F value calculated from the test would have occurred if the
#    null hypothesis of "no difference among group means" were true. Three
#    asterisk symbols (***) would imply that the p-value is less than 0.001.
# In this case, Pr(>F) = 0.671, which is far greater than 0.05. We therefore
# fail to reject the null hypothesis: the type of Property Area does not have
# a statistically significant effect on the Credit History.
# We can also have a situation where the Credit History depends not only on the
# type of Property Area used but also on the Dependents. A two-way ANOVA can
# then be used to confirm this. Execute the following for a two-way ANOVA (two
# independent variables):
loans_dataset_additive_two_way_anova <- aov(CreditHistory ~ PropertyArea + Dependents,
data = loans)
summary(loans_dataset_additive_two_way_anova)
##               Df Sum Sq Mean Sq F value Pr(>F)
## PropertyArea 2 0.10 0.04966 0.399 0.671
## Dependents 3 0.28 0.09362 0.752 0.522
## Residuals 608 75.72 0.12454
# Specifying an asterisk (*) instead of a plus (+) between the two independent
# variables (PropertyArea * Dependents) implies that they have an interaction
# effect rather than an additive effect.
# For example, an interaction effect would be that the effect of the Property
# Area on the Credit History changes depending on the number of Dependents. An
# additive effect would be that the effect of the Property Area on the Credit
# History does NOT depend on the number of Dependents.
# Execute the following to perform a two-way ANOVA with the assumption that
# Property Area and Dependents have an interaction effect:
loans_dataset_interactive_two_way_anova <- aov(CreditHistory ~ PropertyArea * Dependents,
data = loans)
summary(loans_dataset_interactive_two_way_anova)
##                          Df Sum Sq Mean Sq F value Pr(>F)
## PropertyArea 2 0.10 0.04966 0.400 0.670
## Dependents 3 0.28 0.09362 0.754 0.520
## PropertyArea:Dependents 6 1.01 0.16885 1.361 0.228
## Residuals 602 74.71 0.12410
# This can be interpreted as follows: the additive two-way ANOVA shows that
# neither the Property Area (p = 0.671) nor the Dependents (p = 0.522) has a
# statistically significant effect on the Credit History. The interactive
# two-way ANOVA agrees (p = 0.670 and p = 0.520, respectively), and the
# interaction term itself is also not significant (p = 0.228).
Issue 3 Basic Visualization.
# *****************************************************************************
# 3: Basic Visualization ----
# One of the packages required to use R in VS Code is the 'languageserver'
# package. It can be installed manually as follows if you are not using the
# renv::restore() command.
if (!is.element("languageserver", installed.packages()[, 1])) {
install.packages("languageserver", dependencies = TRUE)
}
require("languageserver")
# Loading Datasets ---- STEP 2: Load datasets ----
library(readr)
loans <- read_csv("C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/data/loans_imputed.csv")
## Rows: 614 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Gender, Married, Dependents, Education, SelfEmployed, PropertyArea,...
## dbl (5): ApplicantIncome, CoapplicantIncome, LoanAmount, LoanAmountTerm, Cre...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(loans)
## Univariate Plots ---- STEP 1. Create Histograms for Each Numeric Attribute
## ---- Histograms help in determining whether an attribute has a Gaussian
## distribution. They can also be used to identify the presence of outliers.
# Execute the following code to create histograms for the “loans” dataset:
# Assuming your dataset is named 'loans' (replace with the actual name of your
# dataset)
data_types <- sapply(loans, class)
print(data_types)
##            Gender           Married        Dependents         Education
## "character" "character" "character" "character"
## SelfEmployed ApplicantIncome CoapplicantIncome LoanAmount
## "character" "numeric" "numeric" "numeric"
## LoanAmountTerm CreditHistory PropertyArea Status
## "numeric" "numeric" "character" "character"
# Execute the following code to create histograms for the numeric attributes
# (columns 6 to 10) in the "loans" dataset. The code below converts each
# column into unlisted and numeric data first so that a histogram can be
# plotted. Further reading:
# https://www.programmingr.com/r-error-messages/x-must-be-numeric-error-in-r-histogram/
par(mfrow = c(2, 3))
for (i in 6:10) {
  hist(as.numeric(unlist(loans[, i])), main = names(loans)[i])
}
par(mfrow = c(1, 1))  # reset the plotting layout
### STEP 2. Create Box and Whisker Plots for Each Numeric Attribute ---- Box
### and whisker plots are useful in understanding the distribution of data.
### Further reading: https://www.scribbr.com/statistics/interquartile-range/
class(loans)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
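# The summary shown below was produced by the following command
# (reconstructed; the producing chunk was lost during rendering):
summary(loans)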
## Gender Married Dependents Education
## Length:614 Length:614 Length:614 Length:614
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## SelfEmployed ApplicantIncome CoapplicantIncome LoanAmount
## Length:614 Min. : 150 Min. : 0 Min. : 150
## Class :character 1st Qu.: 2878 1st Qu.: 0 1st Qu.: 2875
## Mode :character Median : 3812 Median : 1188 Median : 3768
## Mean : 5403 Mean : 1621 Mean : 5371
## 3rd Qu.: 5795 3rd Qu.: 2297 3rd Qu.: 5746
## Max. :81000 Max. :41667 Max. :81000
## LoanAmountTerm CreditHistory PropertyArea Status
## Min. : 12.0 Min. :0.000 Length:614 Length:614
## 1st Qu.:360.0 1st Qu.:1.000 Class :character Class :character
## Median :360.0 Median :1.000 Mode :character Mode :character
## Mean :342.3 Mean :0.855
## 3rd Qu.:360.0 3rd Qu.:1.000
## Max. :480.0 Max. :1.000
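# Likewise, the attribute names shown below were produced by:
names(loans)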
## [1] "Gender" "Married" "Dependents"
## [4] "Education" "SelfEmployed" "ApplicantIncome"
## [7] "CoapplicantIncome" "LoanAmount" "LoanAmountTerm"
## [10] "CreditHistory" "PropertyArea" "Status"
par(mar = c(3, 3, 2, 1))
par(mfrow = c(1, 2))  # one row, two columns: one panel per plotted attribute
for (i in 6:7) {
  boxplot(loans[, i], main = names(loans)[i])
}
# This considers the 6th to the 10th attributes, which are numeric; the
# remaining attributes are categorical (character) and are excluded.
boxplot(loans[, 6], main = names(loans)[6])
boxplot(loans[, 7], main = names(loans)[7])
boxplot(loans[, 8], main = names(loans)[8])
boxplot(loans[, 9], main = names(loans)[9])
boxplot(loans[, 10], main = names(loans)[10])
### STEP 3. Create Bar Plots for Each Categorical Attribute ---- Categorical
### attributes (factors) can also be visualized. This is done using a bar chart
### to give an idea of the proportion of instances that belong to each
### category.
barplot(table(loans[, 10]), main = names(loans)[10])
# Execute the following to create bar plots for the categorical attributes in
# the "loans" dataset (a sketch; columns 1 to 5, 11, and 12 hold the
# character/categorical attributes):
par(mar = c(3, 3, 2, 1))
par(mfrow = c(2, 4))
for (i in c(1:5, 11, 12)) {
  barplot(table(loans[[i]]), main = names(loans)[i])
}
par(mfrow = c(1, 1))  # reset the plotting layout
### STEP 4. Create a Missingness Map to Identify Missing Data ---- Some machine
### learning algorithms cannot handle missing data. A missingness map (also
### known as a missing plot) can be used to get an idea of the amount missing
### data in the dataset. The x-axis of the missingness map shows the attributes
### of the dataset whereas the y-axis shows the instances in the dataset.
### Horizontal lines indicate missing data for an instance whereas vertical
### lines indicate missing data for an attribute. The missingness map requires
### the “Amelia” package.
# Execute the following to create a map to identify the missing data in each
# dataset:
if (!is.element("Amelia", installed.packages()[, 1])) {
install.packages("Amelia", dependencies = TRUE)
}
require("Amelia")
missmap(loans, col = c("red", "grey"), legend = TRUE)
# As shown in the results, the dataset loaded in this lab has no missing
# data.
## Multivariate Plots ----
### STEP 5. Create a Correlation Plot ---- Correlation plots can be used to get
### an idea of which attributes change together. The function “corrplot()”
### found in the package “corrplot” is required to perform this. The larger the
### dot in the correlation plot, the larger the correlation. Blue represents a
### positive correlation whereas red represents a negative correlation.
# Execute the following to create a correlation plot for the numeric
# attributes (columns 6 to 10) of the "loans" dataset loaded in STEP 2:
if (!is.element("corrplot", installed.packages()[, 1])) {
install.packages("corrplot", dependencies = TRUE)
}
require("corrplot")
corrplot(cor(loans[, 6:10]), method = "circle")
# Alternatively, the 'ggcorrplot::ggcorrplot()' function can be used to plot a
# more visually appealing plot. The code below shows how to install a package
# in R:
if (!is.element("ggcorrplot", installed.packages()[, 1])) {
install.packages("ggcorrplot", dependencies = TRUE)
}
require("ggcorrplot")
ggcorrplot(cor(loans[, 6:10]))
# A ggplot2 scatter plot can also be used to explore the relationship between
# two attributes, broken down by the target class:
ggplot(loans, aes(x = Dependents, y = Education, shape = Status,
                  color = Status)) +
  geom_point() +
  geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'
### STEP 6. Create Multivariate Box and Whisker Plots by Class ---- This
### applies to datasets where the target (dependent) variable is categorical.
### Execute the following code:
if (!is.element("caret", installed.packages()[, 1])) {
install.packages("caret", dependencies = TRUE)
}
require("caret")
# Note: featurePlot() expects numeric predictors and a factor target; passing
# character columns returns NULL instead of a plot, hence the corrected call
# (a sketch, assuming columns 6 to 10 as the numeric predictors):
featurePlot(x = loans[, 6:10], y = as.factor(loans$Status), plot = "box")
# References ----
## Bevans, R. (2023a). ANOVA in R | A Complete Step-by-Step Guide with
## Examples. Scribbr. Retrieved August 24, 2023, from
## https://www.scribbr.com/statistics/anova-in-r/ ----
## Bevans, R. (2023b). Sample Crop Data Dataset for ANOVA (Version 1)
## [Dataset]. Scribbr.
## https://www.scribbr.com/wp-content/uploads//2020/03/crop.data_.anova_.zip ----
## Fisher, R. A. (1988). Iris [Dataset]. UCI Machine Learning Repository.
## https://archive.ics.uci.edu/dataset/53/iris ----
## National Institute of Diabetes and Digestive and Kidney Diseases. (1999).
## Pima Indians Diabetes Dataset [Dataset]. UCI Machine Learning Repository.
## https://www.kaggle.com/datasets/uciml/ ----
## StatLib CMU. (1997). Boston Housing [Dataset]. StatLib Carnegie Mellon
## University. http://lib.stat.cmu.edu/datasets/boston_corrected.txt ----
# Upload *the link* to the 'Lab-Submission-Markdown.md' (not .Rmd) markdown
# file hosted on GitHub (do not upload the .Rmd or .md markdown files) through
# the submission link provided on eLearning.
Issue 5 Processing and Data Transformation.
# 5: Data Imputation ----
# This can also be configured using the RStudio GUI when you click the project
# file, e.g., 'BBT4206-R.Rproj' in the case of this project. Then navigate to
# the 'Environments' tab and select 'Use renv with this project'.
# As you continue to work on your project, you can install and upgrade
# packages, using either: install.packages() and update.packages or
# renv::install() and renv::update()
# You can also clean up a project by removing unused packages using the
# following command: renv::clean()
# After you have confirmed that your code works as expected, use
# renv::snapshot(), AT THE END, to record the packages and their sources in the
# lockfile.
# Later, if you need to share your code with someone else or run your code on a
# new machine, your collaborator (or you) can call renv::restore() to reinstall
# the specific package versions recorded in the lockfile.
# [OPTIONAL] Execute the following code to reinstall the specific package
# versions recorded in the lockfile (restart R after executing the command):
# renv::restore()
# [OPTIONAL] If you get several errors setting up renv and you prefer not to
# use it, then you can deactivate it using the following command (restart R
# after executing the command): renv::deactivate()
# If renv::restore() did not install the 'languageserver' package (required to
# use R for VS Code), then it can be installed manually as follows (restart R
# after executing the command):
if (!is.element("languageserver", installed.packages()[, 1])) {
install.packages("languageserver", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("languageserver")
# Introduction ---- Data imputation, also known as missing data imputation, is
# a technique used in data analysis and statistics to fill in missing values in
# a dataset. Missing data can occur due to various reasons, such as equipment
# malfunction, human error, or non-response in surveys.
# Imputing missing data is important because many statistical analysis methods
# and Machine Learning algorithms require complete datasets to produce accurate
# and reliable results. By filling in the missing values, data imputation helps
# to preserve the integrity and usefulness of the dataset.
## Data Imputation Methods ----
### 1. Mean/Median Imputation ----
# This method involves replacing missing values with the mean or median value
# of the available data for that variable. It is a simple and quick approach
# but does not consider any relationships between variables.
# Unlike the recorded values, mean-imputed values do not include natural
# variance. Therefore, they are less “scattered” and would technically minimize
# the standard error in a linear regression. We would perceive our estimates to
# be more accurate than they actually are in real-life.
### 2. Regression Imputation ---- In this approach, missing values are
### estimated by regressing the variable with missing values on other variables
### that are known. The estimated values are then used to fill in the missing
### values.
### 3. Multiple Imputation ---- Multiple imputation involves creating several
### plausible imputations for each missing value based on statistical models
### that capture the relationships between variables. This technique recognizes
### the uncertainty associated with imputing missing values.
### 4. Machine Learning-Based Imputation ---- Machine learning algorithms can
### be used to predict missing values based on the patterns and relationships
### present in the available data. Techniques such as K-Nearest Neighbours
### (KNN) imputation or decision tree-based imputation can be employed.
### 5. Hot Deck Imputation ---- This method involves finding similar cases
### (referred to as donors) that have complete data and using their values to
### impute missing values in other cases (referred to as recipients).
### 6. Multiple Imputation by Chained Equations (MICE) ---- MICE is flexible
### and can handle different variable types at once (e.g., continuous, binary,
### ordinal etc.). For each variable containing missing values, we can use the
### remaining information in the data to train a model that predicts what could
### have been recorded to fill in the blanks. To account for the statistical
### uncertainty in the imputations, the MICE procedure goes through several
### rounds and computes replacements for missing values in each round. As the
### name suggests, we thus fill in the missing values multiple times and create
### several complete datasets before we pool the results to arrive at more
### realistic results.
## Types of Missing Data ---- 1. Missing Not At Random (MNAR) ---- Locations of
## missing values in the dataset depend on the missing values themselves. For
## example, students submitting a course evaluation tend to report positive or
## neutral responses and skip questions that will result in a negative
## response. Such students may systematically leave the following question
## blank because they are uncomfortable giving a bad rating for their lecturer:
## “Classes started and ended on time”.
### 2. Missing At Random (MAR) ---- Locations of missing values in the dataset
### depend on some other observed data. In the case of course evaluations,
### students who are not certain about a response may feel unable to give
### accurate responses on a numeric scale, for example, the question 'I
### developed my oral and writing skills ' may be difficult to measure on a
### scale of 1-5. Subsequently, if such questions are optional, they rarely get
### a response because it depends on another unobserved mechanism: in this
### case, the individual need for more precise self-assessments.
### 3. Missing Completely At Random (MCAR) ---- In this case, the locations of
### missing values in the dataset are purely random and they do not depend on
### any other data.
# In all the above cases, removing the entire response because one question has
# missing data may distort the results.
# If the data are MAR or MNAR, imputing missing values is advisable.
# STEP 1. Install and Load the Required Packages ---- The following packages
# should be installed and loaded before proceeding to the subsequent steps.
## NHANES ---- The NHANES package provides the US National Health and
## Nutrition Examination Study (NHANES) dataset created from 1999 to 2004. It
## is installed here for educational purposes, although this lab performs the
## imputation on the "loans" dataset.
# Documentation of NHANES: https://cran.r-project.org/package=NHANES or
# https://cran.r-project.org/web/packages/NHANES/NHANES.pdf or
# http://www.cdc.gov/nchs/nhanes.htm
# This requires the 'NHANES' package available in R
if (!is.element("NHANES", installed.packages()[, 1])) {
install.packages("NHANES", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("NHANES")## Loading required package: NHANES
## dplyr ----
if (!is.element("dplyr", installed.packages()[, 1])) {
install.packages("dplyr", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("dplyr")## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## naniar ---- Documentation: https://cran.r-project.org/package=naniar or
## https://www.rdocumentation.org/packages/naniar/versions/1.0.0
if (!is.element("naniar", installed.packages()[, 1])) {
install.packages("naniar", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("naniar")## Loading required package: naniar
## ggplot2 ---- We require the 'ggplot2' package to create more appealing
## visualizations
if (!is.element("ggplot2", installed.packages()[, 1])) {
install.packages("ggplot2", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("ggplot2")
## MICE ---- We use the MICE package to perform data imputation
if (!is.element("mice", installed.packages()[, 1])) {
install.packages("mice", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("mice")## Loading required package: mice
##
## Attaching package: 'mice'
## The following object is masked from 'package:kernlab':
##
## convergence
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
## Amelia ----
if (!is.element("Amelia", installed.packages()[, 1])) {
install.packages("Amelia", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("Amelia")
# STEP 2. We Load the dataset ----
library(readr)
loans <- read_csv("C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/data/Loans.csv")
## Rows: 614 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Gender, Married, Dependents, Education, SelfEmployed, PropertyArea,...
## dbl (5): ApplicantIncome, CoapplicantIncome, LoanAmount, LoanAmountTerm, Cre...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(loans)
# STEP 3. Confirm the 'missingness' in the Dataset before Imputation ---- Are
# there missing values in the dataset?
any_na(loans)
## [1] TRUE
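# How many missing values are there in total? (reconstructed command; the
# producing chunk was lost during rendering)
n_miss(loans)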
## [1] 110
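# What is the proportion of missing values in the dataset?
prop_miss(loans)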
## [1] 0.01492942
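# How many missing values does each variable have? (a base-R sketch that
# matches the output shown below)
colSums(is.na(loans))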
## Gender Married Dependents Education
## 10 2 13 0
## SelfEmployed ApplicantIncome CoapplicantIncome LoanAmount
## 27 0 0 5
## LoanAmountTerm CreditHistory PropertyArea Status
## 8 45 0 0
# What is the number and percentage of missing values grouped by each variable?
miss_var_summary(loans)
## # A tibble: 12 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 CreditHistory 45 7.33
## 2 SelfEmployed 27 4.40
## 3 Dependents 13 2.12
## 4 Gender 10 1.63
## 5 LoanAmountTerm 8 1.30
## 6 LoanAmount 5 0.814
## 7 Married 2 0.326
## 8 Education 0 0
## 9 ApplicantIncome 0 0
## 10 CoapplicantIncome 0 0
## 11 PropertyArea 0 0
## 12 Status 0 0
# What is the number and percentage of missing values grouped by each
# observation?
miss_case_summary(loans)
## # A tibble: 614 × 3
## case n_miss pct_miss
## <int> <int> <dbl>
## 1 229 2 16.7
## 2 237 2 16.7
## 3 336 2 16.7
## 4 412 2 16.7
## 5 436 2 16.7
## 6 461 2 16.7
## 7 601 2 16.7
## 8 1 1 8.33
## 9 17 1 8.33
## 10 25 1 8.33
## # ℹ 604 more rows
# Where are missing values located (the shaded regions in the plot)?
vis_miss(loans) + theme(axis.text.x = element_text(angle = 80))
# Create a heatmap of 'missingness' broken down by 'SelfEmployed'. First,
# confirm that the 'SelfEmployed' variable is a categorical variable:
data_types <- sapply(loans, class)
print(data_types)
##            Gender           Married        Dependents         Education
## "character" "character" "character" "character"
## SelfEmployed ApplicantIncome CoapplicantIncome LoanAmount
## "character" "numeric" "numeric" "numeric"
## LoanAmountTerm CreditHistory PropertyArea Status
## "numeric" "numeric" "character" "character"
## [1] FALSE
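# A sketch of the missingness heatmap itself (gg_miss_fct() from naniar
# expects a factor, so we convert the character column first):
loans$SelfEmployed <- as.factor(loans$SelfEmployed)
gg_miss_fct(x = loans, fct = SelfEmployed)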
# We can also create a heatmap of 'missingness' broken down by 'Dependents'
# First, confirm that the 'Dependents' variable is a categorical variable
is.factor(loans$Dependents)
## [1] FALSE
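# A similar sketch for a heatmap of 'missingness' broken down by 'Dependents':
loans$Dependents <- as.factor(loans$Dependents)
gg_miss_fct(x = loans, fct = Dependents)
# Finally, a minimal MICE imputation sketch (our own illustration; m = 5 and
# the default per-column methods are illustrative choices). Character columns
# are converted to factors first because mice expects factors or numerics:
loans_for_mice <- as.data.frame(lapply(loans, function(x) {
  if (is.character(x)) as.factor(x) else x
}))
loans_mice <- mice(loans_for_mice, m = 5, seed = 7)
loans_imputed_mice <- complete(loans_mice, 1)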
Issue 5 Processing and Data Transformation.
# 5b: Data Imputation ----
# This can also be configured using the RStudio GUI when you click the project
# file, e.g., 'BBT4206-R.Rproj' in the case of this project. Then navigate to
# the 'Environments' tab and select 'Use renv with this project'.
# As you continue to work on your project, you can install and upgrade
# packages, using either: install.packages() and update.packages or
# renv::install() and renv::update()
# You can also clean up a project by removing unused packages using the
# following command: renv::clean()
# After you have confirmed that your code works as expected, use
# renv::snapshot(), AT THE END, to record the packages and their sources in the
# lockfile.
# Later, if you need to share your code with someone else or run your code on a
# new machine, your collaborator (or you) can call renv::restore() to reinstall
# the specific package versions recorded in the lockfile.
# [OPTIONAL] Execute the following code to reinstall the specific package
# versions recorded in the lockfile (restart R after executing the command):
# renv::restore()
# [OPTIONAL] If you get several errors setting up renv and you prefer not to
# use it, then you can deactivate it using the following command (restart R
# after executing the command): renv::deactivate()
# If renv::restore() did not install the 'languageserver' package (required to
# use R for VS Code), then it can be installed manually as follows (restart R
# after executing the command):
if (!is.element("languageserver", installed.packages()[, 1])) {
install.packages("languageserver", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("languageserver")
# Introduction ---- Data imputation, also known as missing data imputation, is
# a technique used in data analysis and statistics to fill in missing values in
# a dataset. Missing data can occur due to various reasons, such as equipment
# malfunction, human error, or non-response in surveys.
# Imputing missing data is important because many statistical analysis methods
# and Machine Learning algorithms require complete datasets to produce accurate
# and reliable results. By filling in the missing values, data imputation helps
# to preserve the integrity and usefulness of the dataset.
## Data Imputation Methods ----
### 1. Mean/Median Imputation ----
# This method involves replacing missing values with the mean or median value
# of the available data for that variable. It is a simple and quick approach
# but does not consider any relationships between variables.
# Unlike the recorded values, mean-imputed values do not include natural
# variance. Therefore, they are less “scattered” and would technically minimize
# the standard error in a linear regression. We would perceive our estimates to
# be more accurate than they actually are in real-life.
### 2. Regression Imputation ---- In this approach, missing values are
### estimated by regressing the variable with missing values on other variables
### that are known. The estimated values are then used to fill in the missing
### values.
### 3. Multiple Imputation ---- Multiple imputation involves creating several
### plausible imputations for each missing value based on statistical models
### that capture the relationships between variables. This technique recognizes
### the uncertainty associated with imputing missing values.
### 4. Machine Learning-Based Imputation ---- Machine learning algorithms can
### be used to predict missing values based on the patterns and relationships
### present in the available data. Techniques such as K-Nearest Neighbours
### (KNN) imputation or decision tree-based imputation can be employed.
### 5. Hot Deck Imputation ---- This method involves finding similar cases
### (referred to as donors) that have complete data and using their values to
### impute missing values in other cases (referred to as recipients).
### 6. Multiple Imputation by Chained Equations (MICE) ----
### MICE is flexible and can handle different variable types at once (e.g.,
### continuous, binary, ordinal, etc.). For each variable containing missing
### values, we can use the remaining information in the data to train a model
### that predicts what could have been recorded to fill in the blanks. To
### account for the statistical uncertainty in the imputations, the MICE
### procedure goes through several rounds and computes replacements for
### missing values in each round. As the name suggests, we thus fill in the
### missing values multiple times and create several complete datasets before
### we pool the results to arrive at more realistic estimates.
## Types of Missing Data ----
### 1. Missing Not At Random (MNAR) ----
### Locations of missing values in the dataset depend on the missing values
### themselves. For example, students submitting a course evaluation tend to
### report positive or neutral responses and skip questions that will result
### in a negative response. Such students may systematically leave the
### following question blank because they are uncomfortable giving a bad
### rating for their lecturer: “Classes started and ended on time”.
### 2. Missing At Random (MAR) ----
### Locations of missing values in the dataset depend on some other observed
### data. In the case of course evaluations, students who are not certain
### about a response may feel unable to give accurate responses on a numeric
### scale; for example, the question 'I developed my oral and writing skills'
### may be difficult to measure on a scale of 1-5. Subsequently, if such
### questions are optional, they rarely get a response because it depends on
### another mechanism: in this case, the individual need for more precise
### self-assessments.
### 3. Missing Completely At Random (MCAR) ----
### In this case, the locations of missing values in the dataset are purely
### random and they do not depend on any other data.
# In all the above cases, removing the entire response because one question has
# missing data may distort the results.
# If the data are MAR or MNAR, imputing missing values is advisable.
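# [OPTIONAL] Recent versions of the 'naniar' package provide Little's test of
# the MCAR assumption; a small p-value suggests the data are not MCAR. A
# sketch (it can only be run after the dataset is loaded in STEP 2 below):
# naniar::mcar_test(loans)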
# STEP 1. Install and Load the Required Packages ----
# The following packages should be installed and loaded before proceeding to
# the subsequent steps.
## NHANES ----
## The 'NHANES' package provides the US National Health and Nutrition
## Examination Study (NHANES) dataset collected from 1999 to 2004. It is used
## in the lab notes for educational purposes; this project applies the same
## imputation workflow to the Loans dataset loaded in STEP 2.
# Documentation of NHANES: https://cran.r-project.org/package=NHANES or
# https://cran.r-project.org/web/packages/NHANES/NHANES.pdf or
# http://www.cdc.gov/nchs/nhanes.htm
# This requires the 'NHANES' package available in R
if (!is.element("NHANES", installed.packages()[, 1])) {
install.packages("NHANES", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("NHANES")
## dplyr ----
if (!is.element("dplyr", installed.packages()[, 1])) {
install.packages("dplyr", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("dplyr")
## naniar ----
## Documentation: https://cran.r-project.org/package=naniar or
## https://www.rdocumentation.org/packages/naniar/versions/1.0.0
if (!is.element("naniar", installed.packages()[, 1])) {
install.packages("naniar", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("naniar")
## ggplot2 ----
## We require the 'ggplot2' package to create more appealing visualizations
if (!is.element("ggplot2", installed.packages()[, 1])) {
install.packages("ggplot2", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("ggplot2")
## mice ----
## We use the 'mice' package to perform data imputation
if (!is.element("mice", installed.packages()[, 1])) {
install.packages("mice", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("mice")
## Amelia ----
if (!is.element("Amelia", installed.packages()[, 1])) {
install.packages("Amelia", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
require("Amelia")
# STEP 2. Load the Dataset ----
library(readr)
loans <- read_csv("C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/data/Loans.csv")
## Rows: 614 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Gender, Married, Dependents, Education, SelfEmployed, PropertyArea,...
## dbl (5): ApplicantIncome, CoapplicantIncome, LoanAmount, LoanAmountTerm, Cre...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(loans)
# We can use the dplyr::mutate() function inside the dplyr package to add new
# variables that are functions of existing variables
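# For example (illustrative only; the derived column is not used in the
# imputation below):
loans_with_total_income <- dplyr::mutate(loans,
    TotalIncome = ApplicantIncome + CoapplicantIncome)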
# We finally begin to make use of Multivariate Imputation by Chained Equations
# (MICE). We use 20 multiple imputations.
# To arrive at good predictions for each variable containing missing values, we
# save the variables that are at least 'somewhat correlated' (r > 0.3).
somewhat_correlated_variables <- quickpred(loans, mincor = 0.3)
# m = 20: specifies that the imputation (filling in the missing data) will be
# performed 20 times (multiple times) to create several complete datasets
# before we pool the results to arrive at a more realistic final result. The
# larger the value of 'm' and the larger the dataset, the longer the data
# imputation will take.
# seed = 7: specifies that the number 7 will be used to seed the random
# number generator used by mice. This is so that we get the same results each
# time we run MICE.
# method = 'pmm': specifies the imputation method. 'pmm' stands for
# 'Predictive Mean Matching' and it can be used for numeric data. Other
# methods include:
# 1. 'logreg': logistic regression imputation; used for binary categorical
#    data,
# 2. 'polyreg': polytomous regression imputation; used for unordered
#    categorical data with more than 2 categories, and
# 3. 'polr': proportional odds model; used for ordered categorical data with
#    more than 2 categories.
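# [OPTIONAL] A sketch of assigning a different method per variable. This
# assumes we first convert the binary CreditHistory column to a factor so
# that 'logreg' applies; the variant is illustrative and not used below:
# loans_factored <- loans
# loans_factored$CreditHistory <- as.factor(loans_factored$CreditHistory)
# mixed_methods <- mice::make.method(loans_factored)
# mixed_methods["CreditHistory"] <- "logreg"
# loans_mice_mixed <- mice(loans_factored, m = 5, method = mixed_methods,
#                          seed = 7)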
loans_mice <- mice(loans, m = 20, method = "pmm", seed = 7,
    predictorMatrix = somewhat_correlated_variables)
##
## iter imp variable
## 1 1 LoanAmount LoanAmountTerm CreditHistory
## 1 2 LoanAmount LoanAmountTerm CreditHistory
## 1 3 LoanAmount LoanAmountTerm CreditHistory
## 1 4 LoanAmount LoanAmountTerm CreditHistory
## 1 5 LoanAmount LoanAmountTerm CreditHistory
## 1 6 LoanAmount LoanAmountTerm CreditHistory
## 1 7 LoanAmount LoanAmountTerm CreditHistory
## 1 8 LoanAmount LoanAmountTerm CreditHistory
## 1 9 LoanAmount LoanAmountTerm CreditHistory
## 1 10 LoanAmount LoanAmountTerm CreditHistory
## 1 11 LoanAmount LoanAmountTerm CreditHistory
## 1 12 LoanAmount LoanAmountTerm CreditHistory
## 1 13 LoanAmount LoanAmountTerm CreditHistory
## 1 14 LoanAmount LoanAmountTerm CreditHistory
## 1 15 LoanAmount LoanAmountTerm CreditHistory
## 1 16 LoanAmount LoanAmountTerm CreditHistory
## 1 17 LoanAmount LoanAmountTerm CreditHistory
## 1 18 LoanAmount LoanAmountTerm CreditHistory
## 1 19 LoanAmount LoanAmountTerm CreditHistory
## 1 20 LoanAmount LoanAmountTerm CreditHistory
## 2 1 LoanAmount LoanAmountTerm CreditHistory
## 2 2 LoanAmount LoanAmountTerm CreditHistory
## 2 3 LoanAmount LoanAmountTerm CreditHistory
## 2 4 LoanAmount LoanAmountTerm CreditHistory
## 2 5 LoanAmount LoanAmountTerm CreditHistory
## 2 6 LoanAmount LoanAmountTerm CreditHistory
## 2 7 LoanAmount LoanAmountTerm CreditHistory
## 2 8 LoanAmount LoanAmountTerm CreditHistory
## 2 9 LoanAmount LoanAmountTerm CreditHistory
## 2 10 LoanAmount LoanAmountTerm CreditHistory
## 2 11 LoanAmount LoanAmountTerm CreditHistory
## 2 12 LoanAmount LoanAmountTerm CreditHistory
## 2 13 LoanAmount LoanAmountTerm CreditHistory
## 2 14 LoanAmount LoanAmountTerm CreditHistory
## 2 15 LoanAmount LoanAmountTerm CreditHistory
## 2 16 LoanAmount LoanAmountTerm CreditHistory
## 2 17 LoanAmount LoanAmountTerm CreditHistory
## 2 18 LoanAmount LoanAmountTerm CreditHistory
## 2 19 LoanAmount LoanAmountTerm CreditHistory
## 2 20 LoanAmount LoanAmountTerm CreditHistory
## 3 1 LoanAmount LoanAmountTerm CreditHistory
## 3 2 LoanAmount LoanAmountTerm CreditHistory
## 3 3 LoanAmount LoanAmountTerm CreditHistory
## 3 4 LoanAmount LoanAmountTerm CreditHistory
## 3 5 LoanAmount LoanAmountTerm CreditHistory
## 3 6 LoanAmount LoanAmountTerm CreditHistory
## 3 7 LoanAmount LoanAmountTerm CreditHistory
## 3 8 LoanAmount LoanAmountTerm CreditHistory
## 3 9 LoanAmount LoanAmountTerm CreditHistory
## 3 10 LoanAmount LoanAmountTerm CreditHistory
## 3 11 LoanAmount LoanAmountTerm CreditHistory
## 3 12 LoanAmount LoanAmountTerm CreditHistory
## 3 13 LoanAmount LoanAmountTerm CreditHistory
## 3 14 LoanAmount LoanAmountTerm CreditHistory
## 3 15 LoanAmount LoanAmountTerm CreditHistory
## 3 16 LoanAmount LoanAmountTerm CreditHistory
## 3 17 LoanAmount LoanAmountTerm CreditHistory
## 3 18 LoanAmount LoanAmountTerm CreditHistory
## 3 19 LoanAmount LoanAmountTerm CreditHistory
## 3 20 LoanAmount LoanAmountTerm CreditHistory
## 4 1 LoanAmount LoanAmountTerm CreditHistory
## 4 2 LoanAmount LoanAmountTerm CreditHistory
## 4 3 LoanAmount LoanAmountTerm CreditHistory
## 4 4 LoanAmount LoanAmountTerm CreditHistory
## 4 5 LoanAmount LoanAmountTerm CreditHistory
## 4 6 LoanAmount LoanAmountTerm CreditHistory
## 4 7 LoanAmount LoanAmountTerm CreditHistory
## 4 8 LoanAmount LoanAmountTerm CreditHistory
## 4 9 LoanAmount LoanAmountTerm CreditHistory
## 4 10 LoanAmount LoanAmountTerm CreditHistory
## 4 11 LoanAmount LoanAmountTerm CreditHistory
## 4 12 LoanAmount LoanAmountTerm CreditHistory
## 4 13 LoanAmount LoanAmountTerm CreditHistory
## 4 14 LoanAmount LoanAmountTerm CreditHistory
## 4 15 LoanAmount LoanAmountTerm CreditHistory
## 4 16 LoanAmount LoanAmountTerm CreditHistory
## 4 17 LoanAmount LoanAmountTerm CreditHistory
## 4 18 LoanAmount LoanAmountTerm CreditHistory
## 4 19 LoanAmount LoanAmountTerm CreditHistory
## 4 20 LoanAmount LoanAmountTerm CreditHistory
## 5 1 LoanAmount LoanAmountTerm CreditHistory
## 5 2 LoanAmount LoanAmountTerm CreditHistory
## 5 3 LoanAmount LoanAmountTerm CreditHistory
## 5 4 LoanAmount LoanAmountTerm CreditHistory
## 5 5 LoanAmount LoanAmountTerm CreditHistory
## 5 6 LoanAmount LoanAmountTerm CreditHistory
## 5 7 LoanAmount LoanAmountTerm CreditHistory
## 5 8 LoanAmount LoanAmountTerm CreditHistory
## 5 9 LoanAmount LoanAmountTerm CreditHistory
## 5 10 LoanAmount LoanAmountTerm CreditHistory
## 5 11 LoanAmount LoanAmountTerm CreditHistory
## 5 12 LoanAmount LoanAmountTerm CreditHistory
## 5 13 LoanAmount LoanAmountTerm CreditHistory
## 5 14 LoanAmount LoanAmountTerm CreditHistory
## 5 15 LoanAmount LoanAmountTerm CreditHistory
## 5 16 LoanAmount LoanAmountTerm CreditHistory
## 5 17 LoanAmount LoanAmountTerm CreditHistory
## 5 18 LoanAmount LoanAmountTerm CreditHistory
## 5 19 LoanAmount LoanAmountTerm CreditHistory
## 5 20 LoanAmount LoanAmountTerm CreditHistory
# One can then train a model to predict the loan Status using the imputed
# variables (e.g., LoanAmount, LoanAmountTerm, and CreditHistory) or to
# identify the p-values and confidence intervals of their coefficients.
## Impute the missing data ----
## We then create the final imputed dataset using the mice::complete()
## function in the 'mice' package to fill in the missing data. The second
## argument selects which of the 20 completed datasets to extract (here, the
## first).
loans_imputed <- mice::complete(loans_mice, 1)
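# [OPTIONAL] The canonical MICE workflow fits a model to each of the 20
# completed datasets and pools the estimates, rather than analysing a single
# completed dataset. A sketch (the linear model chosen here is illustrative
# only):
# fits <- with(loans_mice, lm(LoanAmount ~ ApplicantIncome +
#     CoapplicantIncome))
# summary(mice::pool(fits))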
# STEP 5. Confirm the 'missingness' in the Imputed Dataset ----
# A textual check of the missingness that remains in the imputed dataset.
# Only the numeric variables were imputed with 'pmm'; as shown below, the
# character variables (SelfEmployed, Dependents, Gender, Married) still
# contain missing values:
miss_var_summary(loans_imputed)
## # A tibble: 12 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 SelfEmployed 27 4.40
## 2 Dependents 13 2.12
## 3 Gender 10 1.63
## 4 Married 2 0.326
## 5 Education 0 0
## 6 ApplicantIncome 0 0
## 7 CoapplicantIncome 0 0
## 8 LoanAmount 0 0
## 9 LoanAmountTerm 0 0
## 10 CreditHistory 0 0
## 11 PropertyArea 0 0
## 12 Status 0 0
# A visual check of the missing values that remain in the imputed dataset:
Amelia::missmap(loans_imputed)
# Are there any missing values in the dataset?
any_na(loans_imputed)
## [1] TRUE
# What is the total number of missing values in the dataset?
n_miss(loans_imputed)
## [1] 52
# What is the overall proportion of missing values in the dataset?
prop_miss(loans_imputed)
## [1] 0.007057546
# How many missing values does each variable have?
colSums(is.na(loans_imputed))
## Gender Married Dependents Education
## 10 2 13 0
## SelfEmployed ApplicantIncome CoapplicantIncome LoanAmount
## 27 0 0 0
## LoanAmountTerm CreditHistory PropertyArea Status
## 0 0 0 0
# What is the number and percentage of missing values grouped by each variable?
miss_var_summary(loans_imputed)
## # A tibble: 12 × 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 SelfEmployed 27 4.40
## 2 Dependents 13 2.12
## 3 Gender 10 1.63
## 4 Married 2 0.326
## 5 Education 0 0
## 6 ApplicantIncome 0 0
## 7 CoapplicantIncome 0 0
## 8 LoanAmount 0 0
## 9 LoanAmountTerm 0 0
## 10 CreditHistory 0 0
## 11 PropertyArea 0 0
## 12 Status 0 0
# What is the number and percentage of missing values grouped by each
# observation?
miss_case_summary(loans_imputed)
## # A tibble: 614 × 3
## case n_miss pct_miss
## <int> <int> <dbl>
## 1 229 2 16.7
## 2 436 2 16.7
## 3 96 1 8.33
## 4 108 1 8.33
## 5 112 1 8.33
## 6 115 1 8.33
## 7 121 1 8.33
## 8 159 1 8.33
## 9 171 1 8.33
## 10 189 1 8.33
## # ℹ 604 more rows
# We require the 'ggplot2' package to create more appealing visualizations
# Where are missing values located (the shaded regions in the plot)?
vis_miss(loans_imputed) + theme(axis.text.x = element_text(angle = 80))
# Which combinations of variables are missing together?
# Note: gg_miss_upset() requires at least 2 variables with missing data for
# the plot to be created; since four variables above still contain missing
# values, it can be used here to view which combinations occur together.
# gg_miss_upset(loans_imputed)

Issue 7 Hyper Parameter Tuning and Ensembles.
# 7. Hyper Parameter Tuning and Ensemble Methods ----
# After you have confirmed that your code works as expected, use
# renv::snapshot(), AT THE END, to record the packages and their sources in the
# lockfile.
# Later, if you need to share your code with someone else or run your code on a
# new machine, your collaborator (or you) can call renv::restore() to reinstall
# the specific package versions recorded in the lockfile.
# [OPTIONAL] Execute the following code to reinstall the specific package
# versions recorded in the lockfile (restart R after executing the command):
# renv::restore()
# [OPTIONAL] If you get several errors setting up renv and you prefer not to
# use it, then you can deactivate it using the following command (restart R
# after executing the command): renv::deactivate()
# If renv::restore() did not install the 'languageserver' package (required to
# use R for VS Code), then it can be installed manually as follows (restart R
# after executing the command):
if (require("languageserver")) {
require("languageserver")
} else {
install.packages("languageserver", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
# Introduction ----
# In addition to hyperparameter tuning, you can also combine the predictions
# of multiple different models together. This is called an 'ensemble
# prediction'.
## Ensemble Methods ----
### (1) Bagging (Bootstrap Aggregation) ----
### Building multiple models (typically models of the same type) from
### different subsamples of the training dataset.
### (2) Boosting ----
### Building multiple models (typically models of the same type), each of
### which learns to fix the prediction errors of a prior model in the chain.
### (3) Stacking ----
### Building multiple models (typically models of differing types) and a
### supervised model that learns how to best combine the predictions of the
### primary models.
# STEP 1. Install and Load the Required Packages ----
## mlbench ----
if (require("mlbench")) {
require("mlbench")
} else {
install.packages("mlbench", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
## caret ----
if (require("caret")) {
require("caret")
} else {
install.packages("caret", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
## caretEnsemble ----
if (require("caretEnsemble")) {
require("caretEnsemble")
} else {
install.packages("caretEnsemble", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
## Loading required package: caretEnsemble
##
## Attaching package: 'caretEnsemble'
## The following object is masked from 'package:ggplot2':
##
## autoplot
## C50 ----
if (require("C50")) {
require("C50")
} else {
install.packages("C50", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
## Loading required package: C50
## adabag ----
if (require("adabag")) {
require("adabag")
} else {
install.packages("adabag", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
## Loading required package: adabag
## Loading required package: rpart
## Loading required package: foreach
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
## randomForest ----
if (require("randomForest")) {
require("randomForest")
} else {
install.packages("randomForest", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
## RRF ----
if (require("RRF")) {
require("RRF")
} else {
install.packages("RRF", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
## Loading required package: RRF
## Registered S3 method overwritten by 'RRF':
## method from
## plot.margin randomForest
## RRF 1.9.4
## Type rrfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'RRF'
## The following object is masked from 'package:dplyr':
##
## combine
## The following objects are masked from 'package:randomForest':
##
## classCenter, combine, getTree, grow, importance, margin, MDSplot,
## na.roughfix, outlier, partialPlot, treesize, varImpPlot, varUsed
## The following object is masked from 'package:ggplot2':
##
## margin
if (!requireNamespace("gbm", quietly = TRUE)) {
install.packages("gbm")
}
# STEP 2. Load the Dataset ----
library(readr)
loans <- read_csv("C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/data/loans_imputed.csv")## Rows: 614 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Gender, Married, Dependents, Education, SelfEmployed, PropertyArea,...
## dbl (5): ApplicantIncome, CoapplicantIncome, LoanAmount, LoanAmountTerm, Cre...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(loans)
# 1. Bagging ----
# Two popular bagging algorithms are:
# 1. Bagged CART
# 2. Random Forest
# Example of Bagging algorithms
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
seed <- 7
metric <- "Accuracy"
## 1.a. Bagged CART ----
set.seed(seed)
loans_model_bagged_cart <- train(Status ~ ., data = loans, method = "treebag", metric = metric,
trControl = train_control)
## 1.b. Random Forest ----
set.seed(seed)
loans_model_rf <- train(Status ~ ., data = loans, method = "rf", metric = metric,
trControl = train_control)
# Summarize results
bagging_results <- resamples(list(`Bagged Decision Tree` = loans_model_bagged_cart,
`Random Forest` = loans_model_rf))
summary(bagging_results)
##
## Call:
## summary.resamples(object = bagging_results)
##
## Models: Bagged Decision Tree, Random Forest
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu.
## Bagged Decision Tree 0.6557377 0.7377049 0.7704918 0.7626969 0.7928187
## Random Forest 0.7377049 0.7877446 0.8064516 0.8050728 0.8225806
## Max. NA's
## Bagged Decision Tree 0.8387097 0
## Random Forest 0.8524590 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu.
## Bagged Decision Tree 0.07775378 0.3332556 0.4249484 0.4041174 0.4958678
## Random Forest 0.23390895 0.4331779 0.4728576 0.4699364 0.5240579
## Max. NA's
## Bagged Decision Tree 0.5656051 0
## Random Forest 0.6047516 0
# 2. Boosting ----
# Three popular boosting algorithms are:
# 1. AdaBoost.M1
# 2. C5.0
# 3. Stochastic Gradient Boosting
# Example of Boosting Algorithms
train_control <- trainControl(method = "cv", number = 5)
seed <- 7
metric <- "Accuracy"
## 2.a. Boosting with C5.0 ----
set.seed(seed)
loans_model_c50 <- train(Status ~ ., data = loans, method = "C5.0", metric = metric,
trControl = train_control)
## 2.b. Boosting with Stochastic Gradient Boosting ----
set.seed(seed)
loans_model_gbm <- train(Status ~ ., data = loans, method = "gbm", metric = metric,
trControl = train_control, verbose = FALSE)
# 3. Stacking ----
# The 'caretEnsemble' package allows you to combine the predictions of
# multiple caret models.
## caretEnsemble::caretStack() ----
## Given a list of caret models, the 'caretStack()' function (in the
## 'caretEnsemble' package) can be used to specify a higher-order model to
## learn how to best combine together the predictions of sub-models.
## caretEnsemble::caretList() ----
## The 'caretList()' function provided by the 'caretEnsemble' package can be
## used to create a list of standard caret models.
# Example of Stacking algorithms
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = TRUE,
classProbs = TRUE)
set.seed(seed)
algorithm_list <- c("glm", "lda", "rpart", "knn", "svmRadial")
models <- caretList(Status ~ ., data = loans, trControl = train_control, methodList = algorithm_list)
# Summarize results before stacking
results <- resamples(models)
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: glm, lda, rpart, knn, svmRadial
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glm 0.7377049 0.7868852 0.8196721 0.8120402 0.8380487 0.8571429 0
## lda 0.7419355 0.7909836 0.8196721 0.8131331 0.8380487 0.8571429 0
## rpart 0.7419355 0.7704918 0.7950820 0.7979722 0.8319672 0.8524590 0
## knn 0.5806452 0.6154151 0.6371324 0.6449577 0.6760973 0.7377049 0
## svmRadial 0.7419355 0.7868852 0.8196721 0.8077383 0.8225806 0.8524590 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## glm 0.2606061 0.4185118 0.50037230 0.488780836 0.5767215 0.6467290
## lda 0.2883788 0.4185118 0.50037230 0.490406183 0.5767215 0.6467290
## rpart 0.2945258 0.3977580 0.46117990 0.463905092 0.5627234 0.6047516
## knn -0.1757903 -0.0797002 -0.04902565 -0.006883191 0.0666972 0.2052117
## svmRadial 0.2883788 0.4170230 0.50037230 0.473356452 0.5312050 0.6047516
## NA's
## glm 0
## lda 0
## rpart 0
## knn 0
## svmRadial 0
# The predictions made by the sub-models that are combined using stacking
# should have low correlation (for diversity amongst the sub-models, i.e.,
# different sub-models are accurate in different ways). If the predictions of
# the sub-models were highly correlated (> 0.75), then they would be making
# the same or very similar predictions most of the time, reducing the benefit
# of combining them. In the output below, glm, lda, and svmRadial are highly
# correlated (> 0.9), so the stack gains little from keeping all three.
# correlation between results
modelCor(results)
## glm lda rpart knn svmRadial
## glm 1.0000000 0.9827723 0.7868395 0.1169668 0.9355483
## lda 0.9827723 1.0000000 0.7953293 0.1590473 0.9280996
## rpart 0.7868395 0.7953293 1.0000000 0.1801689 0.7424085
## knn 0.1169668 0.1590473 0.1801689 1.0000000 0.2280942
## svmRadial 0.9355483 0.9280996 0.7424085 0.2280942 1.0000000
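# [OPTIONAL] Given the high correlations above, a leaner and more diverse set
# of sub-models could be re-fitted before stacking (a sketch; the subset of
# algorithms chosen here is illustrative):
# models_diverse <- caretList(Status ~ ., data = loans,
#     trControl = train_control, methodList = c("glm", "rpart", "knn"))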
## 3.a. Stack using glm ----
stack_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, savePredictions = TRUE,
classProbs = TRUE)
set.seed(seed)
loans_model_stack_glm <- caretStack(models, method = "glm", metric = "Accuracy",
trControl = stack_control)
print(loans_model_stack_glm)
## A glm ensemble of 5 base models: glm, lda, rpart, knn, svmRadial
##
## Ensemble results:
## Generalized Linear Model
##
## 1842 samples
## 5 predictor
## 2 classes: 'N', 'Y'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1657, 1658, 1657, 1658, 1658, 1658, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8127081 0.4919081
## 3.b. Stack using random forest ----
set.seed(seed)
loans_model_stack_rf <- caretStack(models, method = "rf", metric = "Accuracy", trControl = stack_control)
print(loans_model_stack_rf)
## A rf ensemble of 5 base models: glm, lda, rpart, knn, svmRadial
##
## Ensemble results:
## Random Forest
##
## 1842 samples
## 5 predictor
## 2 classes: 'N', 'Y'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1657, 1658, 1657, 1658, 1658, 1658, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8083523 0.4979652
## 3 0.8083523 0.4992015
## 5 0.8081712 0.5001011
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# 4. Hyperparameter Tuning ----
# Introduction ----
# Hyperparameter tuning involves identifying and applying the best
# combination of algorithm parameters. Only the algorithm parameters that
# have a significant effect on the model's performance are available for
# tuning.
# STEP 3. Train the Model ----
# The default random forest algorithm exposes the 'mtry' parameter to be
# tuned.
## The 'mtry' parameter ----
## Number of variables randomly sampled as candidates at each split.
# This can be confirmed from here:
# https://topepo.github.io/caret/available-models.html or by executing the
# following command: names(getModelInfo())
# We start by identifying the accuracy obtained using the recommended
# default, i.e., mtry = sqrt(ncol(loans)), which is approximately 3.46 for
# this dataset.
seed <- 7
metric <- "Accuracy"
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(seed)
mtry <- sqrt(ncol(loans))
tunegrid <- expand.grid(.mtry = mtry)
loans_model_default_rf <- train(Status ~ ., data = loans, method = "rf", metric = metric,
tuneGrid = tunegrid, trControl = train_control)
print(loans_model_default_rf)
## Random Forest
##
## 614 samples
## 11 predictor
## 2 classes: 'N', 'Y'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 552, 553, 552, 553, 552, 553, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8018208 0.4712284
##
## Tuning parameter 'mtry' was held constant at a value of 3.464102
# STEP 4. Apply a 'Random Search' to identify the best parameter value ----
# A random search is good if we are unsure of what the value might be and we
# want to overcome any biases we may have for setting the parameter value
# (like the suggested 'mtry' equation above).
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = "random")
set.seed(seed)
# (With search = "random", caret samples the 'mtry' values itself;
# tuneLength = 12 below sets the maximum number of values tried.)
loans_model_random_search_rf <- train(Status ~ ., data = loans, method = "rf", metric = metric,
tuneLength = 12, trControl = train_control)
print(loans_model_random_search_rf)
## Random Forest
##
## 614 samples
## 11 predictor
## 2 classes: 'N', 'Y'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 552, 553, 552, 553, 552, 553, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8045263 0.4679203
## 3 0.8039713 0.4774923
## 6 0.7795310 0.4346134
## 7 0.7795222 0.4345172
## 8 0.7708231 0.4153193
## 10 0.7697214 0.4125276
## 12 0.7654027 0.4044407
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# STEP 5. Apply a 'Grid Search' to identify the best parameter value ----
# Each axis of the grid is an algorithm parameter, and points on the grid are
# specific combinations of parameters.
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = "grid")
set.seed(seed)
getModelInfo("RRFglobal")## $RRFglobal
## $RRFglobal$label
## [1] "Regularized Random Forest"
##
## $RRFglobal$library
## [1] "RRF"
##
## $RRFglobal$loop
## NULL
##
## $RRFglobal$type
## [1] "Regression" "Classification"
##
## $RRFglobal$parameters
## parameter class label
## 1 mtry numeric #Randomly Selected Predictors
## 2 coefReg numeric Regularization Value
##
## $RRFglobal$grid
## function (x, y, len = NULL, search = "grid")
## {
## if (search == "grid") {
## out <- expand.grid(mtry = caret::var_seq(p = ncol(x),
## classification = is.factor(y), len = len), coefReg = seq(0.01,
## 1, length = len))
## }
## else {
## out <- data.frame(mtry = sample(1:ncol(x), size = len,
## replace = TRUE), coefReg = runif(len, min = 0, max = 1))
## }
## out
## }
##
## $RRFglobal$fit
## function (x, y, wts, param, lev, last, classProbs, ...)
## {
## RRF::RRF(x, y, mtry = param$mtry, coefReg = param$coefReg,
## ...)
## }
##
## $RRFglobal$predict
## function (modelFit, newdata, submodels = NULL)
## predict(modelFit, newdata)
##
## $RRFglobal$prob
## function (modelFit, newdata, submodels = NULL)
## predict(modelFit, newdata, type = "prob")
##
## $RRFglobal$varImp
## function (object, ...)
## {
## varImp <- RRF::importance(object, ...)
## if (object$type == "regression")
## varImp <- data.frame(Overall = varImp[, "%IncMSE"])
## else {
## retainNames <- levels(object$y)
## if (all(retainNames %in% colnames(varImp))) {
## varImp <- varImp[, retainNames]
## }
## else {
## varImp <- data.frame(Overall = varImp[, 1])
## }
## }
## out <- as.data.frame(varImp, stringsAsFactors = TRUE)
## if (dim(out)[2] == 2) {
## tmp <- apply(out, 1, mean)
## out[, 1] <- out[, 2] <- tmp
## }
## out
## }
##
## $RRFglobal$levels
## function (x)
## x$obsLevels
##
## $RRFglobal$tags
## [1] "Random Forest" "Ensemble Model"
## [3] "Bagging" "Implicit Feature Selection"
## [5] "Regularization"
##
## $RRFglobal$sort
## function (x)
## x[order(x$coefReg), ]
# The Regularized Random Forest algorithm exposes the 'coefReg' parameter in
# addition to the 'mtry' parameter for tuning.
## The 'mtry' parameter ----
## Number of variables randomly sampled as candidates at each split.
## The 'coefReg' parameter ----
## It stands for coefficient(s) of regularization. A smaller coefficient may
## lead to a smaller feature subset, i.e., there are fewer variables with
## non-zero importance scores. coefReg must be either a single value (all
## variables have the same coefficient) or a numeric vector of length equal
## to the number of predictor variables. Default: 0.8.
tunegrid <- expand.grid(.mtry = c(1:10), .coefReg = seq(from = 0.1, to = 1, by = 0.1))
loans_model_grid_search_rrf_global <- train(Status ~ ., data = loans, method = "RRFglobal",
metric = metric, tuneGrid = tunegrid, trControl = train_control)
print(loans_model_grid_search_rrf_global)
## Regularized Random Forest
##
## 614 samples
## 11 predictor
## 2 classes: 'N', 'Y'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 552, 553, 552, 553, 552, 553, ...
## Resampling results across tuning parameters:
##
## mtry coefReg Accuracy Kappa
## 1 0.1 0.7490877 0.3779218
## 1 0.2 0.7583074 0.3989046
## 1 0.3 0.7582633 0.3945359
## 1 0.4 0.7565986 0.3986669
## 1 0.5 0.7523945 0.3783282
## 1 0.6 0.7669539 0.4152272
## 1 0.7 0.7686461 0.4139478
## 1 0.8 0.7664779 0.4090405
## 1 0.9 0.7675444 0.4080530
## 1 1.0 0.7697478 0.4156015
## 2 0.1 0.7405038 0.3556486
## 2 0.2 0.7523672 0.3913850
## 2 0.3 0.7572497 0.3988650
## 2 0.4 0.7545266 0.3918324
## 2 0.5 0.7572318 0.3970357
## 2 0.6 0.7587839 0.4052189
## 2 0.7 0.7637633 0.4036724
## 2 0.8 0.7681085 0.4107795
## 2 0.9 0.7659053 0.4058152
## 2 1.0 0.7681173 0.4132699
## 3 0.1 0.7615864 0.4067717
## 3 0.2 0.7544643 0.3947414
## 3 0.3 0.7577879 0.3996656
## 3 0.4 0.7588274 0.4017437
## 3 0.5 0.7507711 0.3840043
## 3 0.6 0.7713340 0.4311461
## 3 0.7 0.7697390 0.4155437
## 3 0.8 0.7626528 0.3983443
## 3 0.9 0.7675973 0.4100462
## 3 1.0 0.7675532 0.4110335
## 4 0.1 0.7572583 0.3962856
## 4 0.2 0.7485074 0.3782360
## 4 0.3 0.7593736 0.4044182
## 4 0.4 0.7475018 0.3732134
## 4 0.5 0.7638247 0.4162105
## 4 0.6 0.7620802 0.4102128
## 4 0.7 0.7675882 0.4120125
## 4 0.8 0.7659403 0.4085487
## 4 0.9 0.7691840 0.4164565
## 4 1.0 0.7653762 0.4036270
## 5 0.1 0.7599555 0.4072146
## 5 0.2 0.7555319 0.3925749
## 5 0.3 0.7571525 0.4023121
## 5 0.4 0.7512030 0.3845527
## 5 0.5 0.7615340 0.4093256
## 5 0.6 0.7642484 0.4144601
## 5 0.7 0.7664603 0.4086835
## 5 0.8 0.7653850 0.4061506
## 5 0.9 0.7680908 0.4127933
## 5 1.0 0.7653762 0.4059993
## 6 0.1 0.7632436 0.4112904
## 6 0.2 0.7604841 0.4091690
## 6 0.3 0.7653669 0.4186704
## 6 0.4 0.7571795 0.4003302
## 6 0.5 0.7560951 0.3992575
## 6 0.6 0.7750983 0.4365077
## 6 0.7 0.7653762 0.4041517
## 6 0.8 0.7680732 0.4123879
## 6 0.9 0.7670332 0.4085133
## 6 1.0 0.7653850 0.4060358
## 7 0.1 0.7523320 0.3913853
## 7 0.2 0.7621411 0.4107272
## 7 0.3 0.7626437 0.4156168
## 7 0.4 0.7669894 0.4211991
## 7 0.5 0.7638245 0.4184038
## 7 0.6 0.7653853 0.4128226
## 7 0.7 0.7670332 0.4107846
## 7 0.8 0.7686288 0.4138641
## 7 0.9 0.7664956 0.4083607
## 7 1.0 0.7680997 0.4125427
## 8 0.1 0.7626696 0.4165443
## 8 0.2 0.7587481 0.4082678
## 8 0.3 0.7626972 0.4149211
## 8 0.4 0.7626437 0.4123091
## 8 0.5 0.7605100 0.4113312
## 8 0.6 0.7577874 0.3992001
## 8 0.7 0.7692102 0.4146067
## 8 0.8 0.7637545 0.4005511
## 8 0.9 0.7653850 0.4044895
## 8 1.0 0.7670070 0.4083417
## 9 0.1 0.7599203 0.4079152
## 9 0.2 0.7604659 0.4085950
## 9 0.3 0.7626264 0.4116051
## 9 0.4 0.7626960 0.4111481
## 9 0.5 0.7638245 0.4164444
## 9 0.6 0.7587927 0.4002655
## 9 0.7 0.7675620 0.4108690
## 9 0.8 0.7675708 0.4092840
## 9 0.9 0.7653850 0.4056802
## 9 1.0 0.7659227 0.4075146
## 10 0.1 0.7577262 0.4071059
## 10 0.2 0.7566331 0.3985721
## 10 0.3 0.7571616 0.4045030
## 10 0.4 0.7577086 0.4059484
## 10 0.5 0.7604576 0.4100483
## 10 0.6 0.7642916 0.4112054
## 10 0.7 0.7643010 0.4043159
## 10 0.8 0.7659227 0.4054709
## 10 0.9 0.7659665 0.4066739
## 10 1.0 0.7680908 0.4149045
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 6 and coefReg = 0.6.
# STEP 6. Apply a 'Manual Search' to identify the best parameter value ----
## The 'mtry' parameter ----
## Number of variables randomly sampled as candidates at each split.
## The 'ntree' parameter ----
## Number of trees to grow. It is limited by the amount of compute time
## available.
# We search a small grid of values (1 to 5) for the mtry parameter while
# manually looping over candidate values for the ntree parameter.
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = "random")
tunegrid <- expand.grid(.mtry = c(1:5))
modellist <- list()
for (ntree in c(500, 800, 1000)) {
set.seed(seed)
loans_model_manual_search_rf <- train(Status ~ ., data = loans, method = "rf",
metric = metric, tuneGrid = tunegrid, trControl = train_control, ntree = ntree)
key <- toString(ntree)
modellist[[key]] <- loans_model_manual_search_rf
}
# Lastly, we compare results to find which parameters gave the highest accuracy
print(modellist)
## $`500`
## Random Forest
##
## 614 samples
## 11 predictor
## 2 classes: 'N', 'Y'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 552, 553, 552, 553, 552, 553, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.6894885 0.009491513
## 2 0.8045263 0.467304813
## 3 0.8007279 0.470197930
## 4 0.7903718 0.452882949
## 5 0.7876660 0.452909505
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
##
## $`800`
## Random Forest
##
## 614 samples
## 11 predictor
## 2 classes: 'N', 'Y'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 552, 553, 552, 553, 552, 553, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.6889421 0.007122647
## 2 0.8056192 0.470373772
## 3 0.8023496 0.473479065
## 4 0.7925576 0.457014412
## 5 0.7876660 0.453165553
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
##
## $`1000`
## Random Forest
##
## 614 samples
## 11 predictor
## 2 classes: 'N', 'Y'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 552, 553, 552, 553, 552, 553, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.6894885 0.009491513
## 2 0.8045263 0.466775858
## 3 0.8034337 0.476495853
## 4 0.7909183 0.454307793
## 5 0.7871196 0.451884901
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
results <- resamples(modellist)
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: 500, 800, 1000
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 500 0.7377049 0.7868852 0.8064516 0.8045263 0.8225806 0.852459 0
## 800 0.7377049 0.7868852 0.8064516 0.8056192 0.8225806 0.852459 0
## 1000 0.7377049 0.7868852 0.8064516 0.8045263 0.8225806 0.852459 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 500 0.2339089 0.4133117 0.4772699 0.4673048 0.5240579 0.6174216 0
## 800 0.2339089 0.4133117 0.4772699 0.4703738 0.5240579 0.6174216 0
## 1000 0.2339089 0.4133117 0.4772699 0.4667759 0.5240579 0.6047516 0
# [OPTIONAL] **Deinitialization: Create a snapshot of the R environment ----
# Lastly, as a follow-up to the initialization step, record the packages
# installed and their sources in the lockfile so that other team-members can
# use renv::restore() to re-install the same package version in their local
# machine during their initialization step.
# renv::snapshot()

Issue 8a Consolidation.
# *****************************************************************************
# 8a. Consolidation ----
# This will set up a project library, containing all the packages you are
# currently using. The packages (and all the metadata needed to reinstall them)
# are recorded into a lockfile, renv.lock, and a .Rprofile ensures that the
# library is used every time you open the project.
# Consider a library as the location where packages are stored. Execute the
# following command to list all the libraries available in your computer:
.libPaths()
## [1] "C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/markdown/renv/library/R-4.3/x86_64-w64-mingw32"
## [2] "C:/Users/Cris/AppData/Local/R/cache/R/renv/sandbox/R-4.3/x86_64-w64-mingw32/bd3f13aa"
# One of the libraries should be a folder inside the project if you are using
# renv
# Then execute the following command to see which packages are available in
# each library:
lapply(.libPaths(), list.files)
## [[1]]
## [1] "abind" "adabag" "Amelia" "askpass"
## [5] "backports" "base64enc" "BH" "bit"
## [9] "bit64" "brew" "brio" "broom"
## [13] "broom.mixed" "bslib" "C50" "cachem"
## [17] "callr" "car" "carData" "caret"
## [21] "caretEnsemble" "checkmate" "chron" "class"
## [25] "cli" "clipr" "clock" "coda"
## [29] "codetools" "collections" "colorspace" "commonmark"
## [33] "ConsRank" "corrplot" "covr" "cpp11"
## [37] "crayon" "Cubist" "curl" "cyclocomp"
## [41] "data.table" "DBI" "DEoptimR" "desc"
## [45] "diagram" "diffobj" "digest" "doParallel"
## [49] "doRNG" "dplyr" "e1071" "ellipsis"
## [53] "evaluate" "fansi" "farver" "fastmap"
## [57] "fontawesome" "forcats" "foreach" "forecast"
## [61] "foreign" "formatR" "Formula" "fracdiff"
## [65] "fs" "furrr" "future" "future.apply"
## [69] "gbm" "generics" "ggcorrplot" "ggformula"
## [73] "ggplot2" "ggridges" "ggtext" "glmnet"
## [77] "globals" "glue" "gower" "gridExtra"
## [81] "gridtext" "gtable" "gtools" "hardhat"
## [85] "haven" "here" "highr" "Hmisc"
## [89] "hms" "htmlTable" "htmltools" "htmlwidgets"
## [93] "httpuv" "httr" "hunspell" "imputeTS"
## [97] "inline" "inum" "ipred" "isoband"
## [101] "iterators" "itertools" "jomo" "jpeg"
## [105] "jquerylib" "jsonlite" "kernlab" "KernSmooth"
## [109] "knitr" "labeling" "labelled" "laeken"
## [113] "languageserver" "later" "lattice" "lava"
## [117] "lazyeval" "libcoin" "lifecycle" "lintr"
## [121] "listenv" "lme4" "lmtest" "loo"
## [125] "lubridate" "magrittr" "markdown" "MASS"
## [129] "Matrix" "MatrixModels" "matrixStats" "memoise"
## [133] "mgcv" "mice" "miceadds" "mime"
## [137] "minqa" "missForest" "mitml" "mitools"
## [141] "mlbench" "ModelMetrics" "mosaic" "mosaicCore"
## [145] "mosaicData" "munsell" "mvtnorm" "naniar"
## [149] "NHANES" "nlme" "nloptr" "nnet"
## [153] "norm" "numDeriv" "openssl" "ordinal"
## [157] "pan" "parallelly" "partykit" "pbapply"
## [161] "pbkrtest" "pillar" "pkgbuild" "pkgconfig"
## [165] "pkgload" "plumber" "plyr" "png"
## [169] "praise" "prettyunits" "pROC" "processx"
## [173] "prodlim" "progress" "progressr" "promises"
## [177] "proxy" "ps" "purrr" "quadprog"
## [181] "quantmod" "quantreg" "QuickJSR" "R.cache"
## [185] "R.methodsS3" "R.oo" "R.utils" "R6"
## [189] "randomForest" "ranger" "rappdirs" "RColorBrewer"
## [193] "Rcpp" "RcppArmadillo" "RcppEigen" "RcppParallel"
## [197] "readr" "recipes" "rematch2" "remotes"
## [201] "renv" "reshape2" "rex" "rgl"
## [205] "rlang" "rlist" "rmarkdown" "rngtools"
## [209] "robustbase" "roxygen2" "rpart" "rpart.plot"
## [213] "rprojroot" "RRF" "rstan" "rstudioapi"
## [217] "sass" "scales" "shape" "simputation"
## [221] "sodium" "sp" "SparseM" "spelling"
## [225] "SQUAREM" "StanHeaders" "stinepack" "stringi"
## [229] "stringr" "styler" "survival" "swagger"
## [233] "sys" "testthat" "tibble" "tidyr"
## [237] "tidyselect" "timechange" "timeDate" "tinytex"
## [241] "tseries" "TTR" "tzdb" "ucminf"
## [245] "UpSetR" "urca" "utf8" "vcd"
## [249] "vctrs" "vdiffr" "VIM" "viridis"
## [253] "viridisLite" "visdat" "vroom" "wakefield"
## [257] "waldo" "webutils" "withr" "xfun"
## [261] "XML" "xml2" "xmlparsedata" "xts"
## [265] "yaml" "zoo"
##
## [[2]]
## [1] "base" "boot" "class" "cluster" "codetools"
## [6] "compiler" "datasets" "foreign" "graphics" "grDevices"
## [11] "grid" "KernSmooth" "lattice" "MASS" "Matrix"
## [16] "methods" "mgcv" "nlme" "nnet" "parallel"
## [21] "rpart" "spatial" "splines" "stats" "stats4"
## [26] "survival" "tcltk" "tools" "utils"
# This can also be configured using the RStudio GUI when you click the project
# file, e.g., 'BBT4206-R.Rproj' in the case of this project. Then navigate to
# the 'Environments' tab and select 'Use renv with this project'.
# As you continue to work on your project, you can install and upgrade
# packages using either install.packages() and update.packages(), or
# renv::install() and renv::update()
# You can also clean up a project by removing unused packages using the
# following command: renv::clean()
# After you have confirmed that your code works as expected, use
# renv::snapshot(), AT THE END, to record the packages and their sources in the
# lockfile.
# Later, if you need to share your code with someone else or run your code on a
# new machine, your collaborator (or you) can call renv::restore() to reinstall
# the specific package versions recorded in the lockfile.
# [OPTIONAL] Execute the following code to reinstall the specific package
# versions recorded in the lockfile (restart R after executing the command):
# renv::restore()
# [OPTIONAL] If you get several errors setting up renv and you prefer not to
# use it, then you can deactivate it using the following command (restart R
# after executing the command): renv::deactivate()
# If renv::restore() did not install the 'languageserver' package (required to
# use R for VS Code), then it can be installed manually as follows (restart R
# after executing the command):
if (require("languageserver")) {
require("languageserver")
} else {
install.packages("languageserver", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
# Introduction ----
# What do you do after you have designed a model that is accurate enough to
# use? This is a critical question whose answer enables the gap between
# research and practice to be addressed.
# It is possible to discover the key internal representation of a model found
# by an algorithm (e.g., the coefficients in a linear model) and use them in a
# new implementation of the prediction algorithm in another program developed
# using a programming language other than R.
# This is easier to do for simpler algorithms that use a simple representation,
# e.g., a linear model, than for algorithms that use more complex
# representations.
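# For example, the learned coefficients of a simple linear model can be
# extracted and re-implemented in any language (the built-in 'cars' dataset
# is used here purely for illustration):
coef(lm(dist ~ speed, data = cars))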
# 'caret' provides access to 'the best' model from a training run in the
# 'finalModel' variable. The 'predict()' function in the 'caret' package
# automatically uses the 'finalModel' to make predictions on a new dataset. The
# data provided as the 'new dataset' can be stored in a separate file and
# loaded as a data frame.
# STEP 1. Install and Load the Required Packages ----
## caret ----
if (require("caret")) {
require("caret")
} else {
install.packages("caret", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
## mlbench ----
if (require("mlbench")) {
require("mlbench")
} else {
install.packages("mlbench", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
## plumber ----
if (require("plumber")) {
require("plumber")
} else {
install.packages("plumber", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
## Loading required package: plumber
# STEP 2. Load the Dataset ----
library(readr)
loans_imputed <- read_csv("C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/data/loans_imputed.csv")
## Rows: 614 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): Gender, Married, Dependents, Education, SelfEmployed, PropertyArea,...
## dbl (5): ApplicantIncome, CoapplicantIncome, LoanAmount, LoanAmountTerm, Cre...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(loans_imputed)
# STEP 3. Train the Model ----
# Create an 80%/20% data split for the training and testing datasets
# respectively
set.seed(9)
train_index <- createDataPartition(loans_imputed$Status, p = 0.8, list = FALSE)
loans_imputed_training <- loans_imputed[train_index, ]
loans_imputed_testing <- loans_imputed[-train_index, ]
set.seed(9)
train_control <- trainControl(method = "cv", number = 10)
loans_imputed_model_lda <- train(Status ~ ., data = loans_imputed_training, method = "lda",
metric = "Accuracy", trControl = train_control)
# We print a summary of what caret has done
print(loans_imputed_model_lda)
## Linear Discriminant Analysis
##
## 492 samples
## 11 predictor
## 2 classes: 'N', 'Y'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 443, 443, 442, 442, 444, 442, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8128214 0.4867037
# We then print the details of the model that has been created
print(loans_imputed_model_lda$finalModel)
## Call:
## lda(x, grouping = y)
##
## Prior probabilities of groups:
## N Y
## 0.3130081 0.6869919
##
## Group means:
## GenderMale MarriedYes Dependents1 Dependents2 Dependents3+
## N 0.8181818 0.5974026 0.2012987 0.1298701 0.1103896
## Y 0.8343195 0.6982249 0.1656805 0.1893491 0.0887574
## EducationNot Graduate SelfEmployedYes ApplicantIncome CoapplicantIncome
## N 0.2922078 0.1363636 5503.662 1936.734
## Y 0.1656805 0.1390533 5400.500 1488.355
## LoanAmount LoanAmountTerm CreditHistory PropertyAreaSemiurban
## N 5394.370 340.7532 0.5844156 0.2792208
## Y 5390.893 340.4734 0.9852071 0.4230769
## PropertyAreaUrban
## N 0.3441558
## Y 0.3224852
##
## Coefficients of linear discriminants:
## LD1
## GenderMale -3.826180e-02
## MarriedYes 5.288644e-01
## Dependents1 -3.245753e-01
## Dependents2 1.695665e-01
## Dependents3+ 4.753936e-02
## EducationNot Graduate -5.096103e-01
## SelfEmployedYes -5.474336e-03
## ApplicantIncome -2.229830e-04
## CoapplicantIncome -5.319287e-05
## LoanAmount 2.222240e-04
## LoanAmountTerm -8.777288e-04
## CreditHistory 3.208886e+00
## PropertyAreaSemiurban 7.473690e-01
## PropertyAreaUrban 3.365292e-01
# STEP 4. Test the Model ----
# We can test the model
set.seed(9)
# Define the levels you expect in the Status variable
expected_levels <- c("Y", "N")
# Convert the Status variable to a factor with the defined levels
loans_imputed_testing$Status <- factor(loans_imputed_testing$Status,
    levels = expected_levels)
predictions <- predict(loans_imputed_model_lda, newdata = loans_imputed_testing)
confusionMatrix(predictions, loans_imputed_testing$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Y N
## Y 82 20
## N 2 18
##
## Accuracy : 0.8197
## 95% CI : (0.7398, 0.8834)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 0.0007750
##
## Kappa : 0.5169
##
## Mcnemar's Test P-Value : 0.0002896
##
## Sensitivity : 0.9762
## Specificity : 0.4737
## Pos Pred Value : 0.8039
## Neg Pred Value : 0.9000
## Prevalence : 0.6885
## Detection Rate : 0.6721
## Detection Prevalence : 0.8361
## Balanced Accuracy : 0.7249
##
## 'Positive' Class : Y
##
# A preview of the testing dataset used for the predictions:
loans_imputed_testing
## # A tibble: 122 × 12
## Gender Married Dependents Education SelfEmployed ApplicantIncome
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Male No 0 Graduate No 6000
## 2 Male Yes 0 Not Graduate No 2333
## 3 Female No 0 Graduate No 3510
## 4 Male Yes 0 Not Graduate No 7660
## 5 Male Yes 1 Graduate Yes 3717
## 6 Male No 0 Not Graduate No 1442
## 7 Male No 0 Graduate No 3167
## 8 Male No 0 Graduate No 1800
## 9 Male Yes 0 Graduate No 5821
## 10 Male Yes 0 Graduate No 3366
## # ℹ 112 more rows
## # ℹ 6 more variables: CoapplicantIncome <dbl>, LoanAmount <dbl>,
## # LoanAmountTerm <dbl>, CreditHistory <dbl>, PropertyArea <chr>, Status <fct>
# The raw predictions made on the testing dataset:
print(predictions)
## [1] Y Y N N Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y N N Y Y Y Y Y Y Y Y Y Y
## [38] N Y Y Y Y Y Y Y Y N N Y Y Y Y Y Y N Y Y N Y N Y Y Y Y N Y Y N Y Y N Y Y Y
## [75] Y Y N Y Y Y N Y Y Y Y Y Y N Y N Y Y Y Y Y Y Y N Y Y Y Y Y Y Y Y Y N N Y Y
## [112] Y Y Y Y Y Y Y Y Y Y Y
## Levels: N Y
# STEP 5. Save and Load your Model ----
# Saving a model into a file allows you to load it later and use it to make
# predictions. Saved models can be loaded by calling the `readRDS()` function
saveRDS(loans_imputed_model_lda, "C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/models/saved_loans_model_lda.rds")
# The saved model can then be loaded later as follows:
loaded_loans_imputed_model_lda <- readRDS("C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/models/saved_loans_model_lda.rds")
print(loaded_loans_imputed_model_lda)
## Linear Discriminant Analysis
##
## 492 samples
## 11 predictor
## 2 classes: 'N', 'Y'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 443, 443, 442, 442, 444, 442, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8128214 0.4867037
predictions_with_loaded_model <- predict(loaded_loans_imputed_model_lda, newdata = loans_imputed_testing)
confusionMatrix(predictions_with_loaded_model, loans_imputed_testing$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Y N
## Y 82 20
## N 2 18
##
## Accuracy : 0.8197
## 95% CI : (0.7398, 0.8834)
## No Information Rate : 0.6885
## P-Value [Acc > NIR] : 0.0007750
##
## Kappa : 0.5169
##
## Mcnemar's Test P-Value : 0.0002896
##
## Sensitivity : 0.9762
## Specificity : 0.4737
## Pos Pred Value : 0.8039
## Neg Pred Value : 0.9000
## Prevalence : 0.6885
## Detection Rate : 0.6721
## Detection Prevalence : 0.8361
## Balanced Accuracy : 0.7249
##
## 'Positive' Class : Y
##
# STEP 6. Creating Functions in R ----
# Plumber requires functions; an example of the syntax for creating a
# function in R is:
name_of_function <- function(arg) {
    # Do something with the argument called `arg` and return the result
    paste("Received:", arg)
}
# STEP 7. Make Predictions on New Data using the Saved Model ----
# We can also create and use our own data frame as follows:
# Create a data frame with appropriate values and types
to_be_predicted <- data.frame(Gender = "Male", Married = "No", Dependents = "0",
Education = "Graduate", SelfEmployed = "Yes", ApplicantIncome = 4583, CoapplicantIncome = 1508,
LoanAmount = 12841, LoanAmountTerm = 360, CreditHistory = 1, PropertyArea = "Urban")
# Use factor() to set factor levels where needed. Note: the caret model
# object does not store the factor levels of each column (e.g.,
# levels(loaded_loans_imputed_model_lda$Gender) returns NULL, which would
# produce all-NA factors and an empty prediction), so the levels are taken
# from the imputed dataset that the model was trained on:
to_be_predicted$Gender <- factor(to_be_predicted$Gender,
    levels = unique(loans_imputed$Gender))
to_be_predicted$Married <- factor(to_be_predicted$Married,
    levels = unique(loans_imputed$Married))
to_be_predicted$Education <- factor(to_be_predicted$Education,
    levels = unique(loans_imputed$Education))
to_be_predicted$SelfEmployed <- factor(to_be_predicted$SelfEmployed,
    levels = unique(loans_imputed$SelfEmployed))
to_be_predicted$PropertyArea <- factor(to_be_predicted$PropertyArea,
    levels = unique(loans_imputed$PropertyArea))
# to_be_predicted$Dependents <- factor(to_be_predicted$Dependents,
#     levels = unique(loans_imputed$Dependents))
# Make predictions
predictions <- predict(loaded_loans_imputed_model_lda, newdata = to_be_predicted)
print(predictions)
# STEP 8. Make Predictions using the Model through a Function ----
# An alternative is to create a function and then use the function to make
# predictions
predict_status <- function(arg_Gender, arg_Married, arg_Dependents, arg_Education,
arg_SelfEmployed, arg_ApplicantIncome, arg_CoapplicantIncome, arg_LoanAmount,
arg_LoanAmountTerm, arg_CreditHistory, arg_PropertyArea) {
# Create a data frame using the arguments
to_be_predicted <- data.frame(Gender = arg_Gender, Married = arg_Married, Dependents = arg_Dependents,
Education = arg_Education, SelfEmployed = arg_SelfEmployed, ApplicantIncome = arg_ApplicantIncome,
CoapplicantIncome = arg_CoapplicantIncome, LoanAmount = arg_LoanAmount, LoanAmountTerm = arg_LoanAmountTerm,
CreditHistory = arg_CreditHistory, PropertyArea = arg_PropertyArea)
# Make a prediction based on the data frame
predict(loaded_loans_imputed_model_lda, to_be_predicted)
}
# We can now call the function predict_status() instead of calling the
# predict() function directly
# Assuming Dependents should be a factor
predict_status("Male", "No", "0", "Graduate", "Yes", 4583, 1508, 12841, 360,
    1, "Urban")
## [1] Y
## Levels: N Y
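# [OPTIONAL] A sketch of exposing predict_status() as a REST API using the
# 'plumber' package loaded in STEP 1. Assumptions: the annotated endpoint is
# saved in a separate file named 'plumber.R' and port 8000 is free.
# Contents of the hypothetical plumber.R:
#   #* Predict the loan status for a given applicant income
#   #* @param arg_ApplicantIncome The applicant's income, e.g., 4583
#   #* @get /loan_status
#   function(arg_ApplicantIncome) {
#     predict_status("Male", "No", "0", "Graduate", "Yes",
#                    as.numeric(arg_ApplicantIncome), 1508, 12841, 360, 1,
#                    "Urban")
#   }
# The API is then started with:
#   plumber::plumb("plumber.R")$run(port = 8000)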
# [OPTIONAL] **Deinitialization: Create a snapshot of the R environment ----
# Lastly, as a follow-up to the initialization step, record the packages
# installed and their sources in the lockfile so that other team-members can
# use renv::restore() to re-install the same package version in their local
# machine during their initialization step.
# renv::snapshot()

Issue 8b Consolidation.
# 8b. API ----
# This will set up a project library, containing all the packages you are
# currently using. The packages (and all the metadata needed to reinstall them)
# are recorded into a lockfile, renv.lock, and a .Rprofile ensures that the
# library is used every time you open the project.
# Consider a library as the location where packages are stored. Execute the
# following command to list all the libraries available in your computer:
.libPaths()
## [1] "C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/markdown/renv/library/R-4.3/x86_64-w64-mingw32"
## [2] "C:/Users/Cris/AppData/Local/R/cache/R/renv/sandbox/R-4.3/x86_64-w64-mingw32/bd3f13aa"
# One of the libraries should be a folder inside the project if you are using
# renv
# This can also be configured using the RStudio GUI when you click the project
# file, e.g., 'BBT4206-R.Rproj' in the case of this project. Then navigate to
# the 'Environments' tab and select 'Use renv with this project'.
# As you continue to work on your project, you can install and upgrade
# packages using either install.packages() and update.packages(), or
# renv::install() and renv::update()
# You can also clean up a project by removing unused packages using the
# following command: renv::clean()
# After you have confirmed that your code works as expected, use
# renv::snapshot(), AT THE END, to record the packages and their sources in the
# lockfile.
# Later, if you need to share your code with someone else or run your code on a
# new machine, your collaborator (or you) can call renv::restore() to reinstall
# the specific package versions recorded in the lockfile.
# [OPTIONAL] Execute the following code to reinstall the specific package
# versions recorded in the lockfile (restart R after executing the command):
# renv::restore()
# [OPTIONAL] If you get several errors setting up renv and you prefer not to
# use it, then you can deactivate it using the following command (restart R
# after executing the command): renv::deactivate()
# If renv::restore() did not install the 'languageserver' package (required to
# use R for VS Code), then it can be installed manually as follows (restart R
# after executing the command):
if (require("languageserver")) {
require("languageserver")
} else {
install.packages("languageserver", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
# Introduction ----
# We can create an API to access the model from outside R using a package
# called Plumber.
# STEP 1. Install and Load the Required Packages ----
## plumber ----
if (require("plumber")) {
require("plumber")
} else {
install.packages("plumber", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
## caret ----
if (require("caret")) {
require("caret")
} else {
install.packages("caret", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
# Create a REST API using Plumber ----
# REST API stands for Representational State Transfer Application Programming
# Interface. It is an architectural style and a set of guidelines for building
# web services that provide interoperability between different systems on the
# internet. RESTful APIs are widely used for creating and consuming web
# services.
## Principles of REST API ----
### 1. Stateless ----
### The server does not store any client state between requests. Each request
### from the client contains all the necessary information for the server to
### understand and process the request.
### 2. Client-Server Architecture ----
### The client and server are separate entities that communicate over the
### Internet. The client sends requests to the server, and the server
### processes those requests and sends back responses.
### 3. Uniform Interface ----
### REST APIs use a uniform and consistent set of interfaces and protocols.
### The most common interfaces are based on the HTTP protocol, such as GET
### (retrieve a resource), POST (create a new resource), PUT (update a
### resource), DELETE (remove a resource), etc.
### 4. Resource-Oriented ----
### REST APIs are based on the concept of resources, which are identified by
### unique URIs (Uniform Resource Identifiers). Clients interact with these
### resources by sending requests to their corresponding URIs.
### 5. Representation of Resources ----
### Resources in a REST API can be represented in various formats, such as
### JSON (JavaScript Object Notation), XML (eXtensible Markup Language), YAML
### (YAML Ain't Markup Language) or plain text. The server sends the
### representation of a resource in the response to the client.
# REST APIs are widely used for building web services that can be consumed by
# various client applications, such as web browsers, mobile apps, or other
# servers. They provide a scalable and flexible approach to designing APIs that
# can evolve over time. Developers can use RESTful principles to create APIs
# that are easy to understand, use, and integrate into different systems.
# When working with a REST API, clients typically send HTTP requests to
# specific endpoints (URLs) provided by the server, and the server responds
# with the requested data or performs the requested actions. The communication
# between client and server is based on the HTTP protocol, making REST APIs
# widely supported and accessible across different platforms and programming
# languages.
# In summary, a REST API is a set of rules and conventions for building web
# services that follow the principles of REST. It provides a standardized and
# scalable way for systems to communicate and exchange data over the internet.
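# To make the verbs listed above concrete, below is a minimal, self-contained
# sketch of a plumber endpoint (a hypothetical echo API, separate from the
# loan model served later in this file; it would be saved in its own R file
# and served with plumber::plumb()):
#* Echo back the input message
#* @param msg The message to echo back
#* @get /echo
function(msg = "") {
  # Return a list; plumber serializes it to JSON by default
  list(msg = paste0("The message is: '", msg, "'"))
}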
# This requires the 'plumber' package that was installed and loaded earlier in
# STEP 1. The commenting below makes R recognize the code as the definition of
# an API, i.e., #* comments.
loaded_loans_imputed_model_lda <- readRDS("C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/models/saved_loans_model_lda.rds")
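# For context, a sketch of how the .rds file read above would have been
# produced earlier (assumption: the object name 'loans_imputed_model_lda' is
# illustrative):
# saveRDS(loans_imputed_model_lda, "./models/saved_loans_model_lda.rds")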
# Note: the endpoint is declared with @post (rather than @get) so that the
# PHP client in Issue 8d, which submits the form fields via an HTTP POST
# request, can reach it.
#* @apiTitle Loan Approval Status Prediction Model API
#* @apiDescription Used to predict whether a client will be given a loan or
#* not.
#* @param arg_Gender The client's gender (Male or Female)
#* @param arg_Married The client's marital status (Yes or No)
#* @param arg_Dependents The number of dependents the client has (0, 1, 2 or 3+)
#* @param arg_Education The client's education status (Graduate or NotGraduate)
#* @param arg_SelfEmployed The client's employment status (Yes or No)
#* @param arg_ApplicantIncome The applicant's monthly income
#* @param arg_CoapplicantIncome The co-applicant's monthly income, if applicable
#* @param arg_LoanAmount The loan amount (in Ksh)
#* @param arg_LoanAmountTerm The loan term (in days)
#* @param arg_CreditHistory The client's credit history (0 being less than 350, 1 being more than 500)
#* @param arg_PropertyArea The client's property area (Urban, SemiUrban or Rural)
#* @post /Status
predict_status <- function(arg_Gender, arg_Married, arg_Dependents, arg_Education,
arg_SelfEmployed, arg_ApplicantIncome, arg_CoapplicantIncome, arg_LoanAmount,
arg_LoanAmountTerm, arg_CreditHistory, arg_PropertyArea) {
# Create a data frame using the arguments
to_be_predicted <- data.frame(Gender = as.factor(arg_Gender), Married = as.factor(arg_Married),
Dependents = as.factor(arg_Dependents), Education = as.factor(arg_Education),
SelfEmployed = as.factor(arg_SelfEmployed), ApplicantIncome = as.numeric(arg_ApplicantIncome),
CoapplicantIncome = as.numeric(arg_CoapplicantIncome), LoanAmount = as.numeric(arg_LoanAmount),
LoanAmountTerm = as.numeric(arg_LoanAmountTerm), CreditHistory = as.numeric(arg_CreditHistory),
PropertyArea = as.factor(arg_PropertyArea))
# Make a prediction based on the data frame
predict(loaded_loans_imputed_model_lda, to_be_predicted)
}
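# Note on the function above: as.factor() applied to a single request value
# produces a factor with only one level. If the loaded LDA model requires the
# full set of training levels, they can be re-applied inside the function,
# mirroring the @param documentation, e.g. (sketch, shown for PropertyArea
# only):
# to_be_predicted$PropertyArea <- factor(to_be_predicted$PropertyArea,
#                                        levels = c("Rural", "SemiUrban", "Urban"))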
# [OPTIONAL] **Deinitialization: Create a snapshot of the R environment ----
# Lastly, as a follow-up to the initialization step, record the packages
# installed and their sources in the lockfile so that other team-members can
# use renv::restore() to re-install the same package version in their local
# machine during their initialization step. renv::snapshot()

Issue 8c Consolidation.
# *****************************************************************************
# Lab 11: Plumber API ----
# **[OPTIONAL] Initialization: Install and use renv ----
# The R Environment ('renv') package helps you create reproducible
# environments for your R projects. This is helpful when working in teams
# because it makes your R projects more isolated, portable and reproducible.
# Once installed, you can then use renv::init() to initialize renv in a new
# project.
# The prompt received after executing renv::init() is as shown below: This
# project already has a lockfile. What would you like to do?
# 1: Restore the project from the lockfile. 2: Discard the lockfile and
# re-initialize the project. 3: Activate the project without snapshotting or
# installing any packages. 4: Abort project initialization.
# This will set up a project library, containing all the packages you are
# currently using. The packages (and all the metadata needed to reinstall them)
# are recorded into a lockfile, renv.lock, and a .Rprofile ensures that the
# library is used every time you open the project.
# Consider a library as the location where packages are stored. Execute the
# following command to list all the libraries available in your computer:
.libPaths()

## [1] "C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/markdown/renv/library/R-4.3/x86_64-w64-mingw32"
## [2] "C:/Users/Cris/AppData/Local/R/cache/R/renv/sandbox/R-4.3/x86_64-w64-mingw32/bd3f13aa"
# One of the libraries should be a folder inside the project if you are using
# renv
# Then execute the following command to see which packages are available in
# each library:
lapply(.libPaths(), list.files)

## (The output is identical to the package listing shown under Issue 8b
## above, so it is not repeated here.)
# This can also be configured using the RStudio GUI when you click the project
# file, e.g., 'BBT4206-R.Rproj' in the case of this project. Then navigate to
# the 'Environments' tab and select 'Use renv with this project'.
# As you continue to work on your project, you can install and upgrade
# packages using either install.packages() and update.packages(), or
# renv::install() and renv::update()
# You can also clean up a project by removing unused packages using the
# following command: renv::clean()
# After you have confirmed that your code works as expected, use
# renv::snapshot(), AT THE END, to record the packages and their sources in the
# lockfile.
# Later, if you need to share your code with someone else or run your code on a
# new machine, your collaborator (or you) can call renv::restore() to reinstall
# the specific package versions recorded in the lockfile.
# [OPTIONAL] Execute the following code to reinstall the specific package
# versions recorded in the lockfile (restart R after executing the command):
# renv::restore()
# [OPTIONAL] If you get several errors setting up renv and you prefer not to
# use it, then you can deactivate it using the following command (restart R
# after executing the command): renv::deactivate()
# If renv::restore() did not install the 'languageserver' package (required to
# use R for VS Code), then it can be installed manually as follows (restart R
# after executing the command):
if (require("languageserver")) {
require("languageserver")
} else {
install.packages("languageserver", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
# STEP 1. Install and load the required packages ---- plumber ----
if (require("plumber")) {
require("plumber")
} else {
install.packages("plumber", dependencies = TRUE, repos = "https://cloud.r-project.org")
}
# STEP 2. Process a Plumber API ----
# This parses the file that contains the #* annotated predict_status()
# function into a Plumber router that can serve the API.
api <- plumber::plumb("C:/Users/Cris/github-classroom/BI-Loan-Appraisal-Project/8- Consolidation-b.R")
# STEP 3. Run the API on a specific port ----
# Specify a constant localhost port to use:
# api$run(host = "127.0.0.2", port = 5026)
# STEP 4. Test the API ----
# We test the API using the following values for the arguments: arg_Gender,
# arg_Married, arg_Dependents, arg_Education, arg_SelfEmployed,
# arg_ApplicantIncome, arg_CoapplicantIncome, arg_LoanAmount,
# arg_LoanAmountTerm, arg_CreditHistory, and arg_PropertyArea. For example,
# "Male", "No", "0", "Graduate", "Yes", 4583, 1508, 12841, 360, 1, and
# "Urban" respectively should return the approval status "Y".
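# As an illustration, the running API can also be called from a separate R
# session. The sketch below is an assumption (the 'httr' package is not used
# elsewhere in this project); it sends the same example values as STEP 4 to
# the host and port specified above:
# if (require("httr")) {
#   response <- httr::POST(
#     url = "http://127.0.0.2:5026/Status",
#     body = list(arg_Gender = "Male", arg_Married = "No",
#                 arg_Dependents = "0", arg_Education = "Graduate",
#                 arg_SelfEmployed = "Yes", arg_ApplicantIncome = 4583,
#                 arg_CoapplicantIncome = 1508, arg_LoanAmount = 12841,
#                 arg_LoanAmountTerm = 360, arg_CreditHistory = 1,
#                 arg_PropertyArea = "Urban"),
#     encode = "form"
#   )
#   # The decoded body should contain the predicted status, e.g. "Y"
#   print(httr::content(response))
# }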
# [OPTIONAL] **Deinitialization: Create a snapshot of the R environment ----
# Lastly, as a follow-up to the initialization step, record the packages
# installed and their sources in the lockfile so that other team-members can
# use renv::restore() to re-install the same package version in their local
# machine during their initialization step. renv::snapshot()

Issue 8d Consume Plumber API.
<?php
// Full documentation of the client URL (cURL) library: https://www.php.net/manual/en/book.curl.php
$apiUrl = 'http://127.0.0.2:5026/Status';
$curl = curl_init();

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
// Check if the form data is set
if (isset($_POST['Gender']) && isset($_POST['Married']) &&
    isset($_POST['Dependents']) && isset($_POST['Education']) &&
    isset($_POST['SelfEmployed']) && isset($_POST['ApplicantIncome']) &&
    isset($_POST['CoapplicantIncome']) && isset($_POST['LoanAmount']) &&
    isset($_POST['LoanAmountTerm']) && isset($_POST['CreditHistory']) &&
    isset($_POST['PropertyArea'])) {
// Check that the numeric form values are numeric. Dependents (which can be
// "3+") and PropertyArea (Urban, SemiUrban or Rural) are categorical, so
// they are not validated as numbers.
if (is_numeric($_POST['ApplicantIncome']) &&
    is_numeric($_POST['CoapplicantIncome']) &&
    is_numeric($_POST['LoanAmount']) &&
    is_numeric($_POST['LoanAmountTerm']) &&
    is_numeric($_POST['CreditHistory'])) {
// The keys must match the argument names of the plumber endpoint's
// function, hence the arg_ prefix.
$formData = array(
    'arg_Gender' => $_POST['Gender'],
    'arg_Married' => $_POST['Married'],
    'arg_Dependents' => $_POST['Dependents'],
    'arg_Education' => $_POST['Education'],
    'arg_SelfEmployed' => $_POST['SelfEmployed'],
    'arg_ApplicantIncome' => $_POST['ApplicantIncome'],
    'arg_CoapplicantIncome' => $_POST['CoapplicantIncome'],
    'arg_LoanAmount' => $_POST['LoanAmount'],
    'arg_LoanAmountTerm' => $_POST['LoanAmountTerm'],
    'arg_CreditHistory' => $_POST['CreditHistory'],
    'arg_PropertyArea' => $_POST['PropertyArea'],
);
// Set cURL options
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_POST, true);
curl_setopt($curl, CURLOPT_POSTFIELDS, $formData);
curl_setopt($curl, CURLOPT_URL, $apiUrl);
// Make a POST request
$response = curl_exec($curl);
// Check for cURL errors
if (curl_errno($curl)) {
$error = curl_error($curl);
// Handle the error appropriately
die("cURL Error: $error");
}
// Process the response
$data = json_decode($response, true);
// Check if the response was successful. Note: plumber's default JSON
// serializer returns the predicted factor as an array, e.g. ["Y"], so the
// prediction is read as the first element of the decoded response.
if (isset($data[0])) {
// API request was successful
// Access the predicted loan status
echo "The predicted loan status is: " . $data[0];
} else {
// API request failed or returned an error
// Handle the error appropriately, e.g., by showing the raw response
echo "API Error: " . $response;
}
} else {
echo "Form values must be numeric.";
}
} else {
echo "All form fields are required.";
}
}
// Close the cURL session/resource
curl_close($curl);
?>

<!DOCTYPE html>
etc. as per the lab submission requirements.