Business Intelligence Project

Acers Team 20th November 2023


Student Details

Student ID Numbers and Names of Group Members
  1. 122790 - C - Bwalley Nicholas

  2. 133834 - C - Mongare Sarah

  3. 133928 - C - Cheptoi Millicent

  4. 134879 - C - Tulienge Lesley

  5. 124461 - C - Kinya Angela

GitHub Classroom Group Name Acers Team
Course Code BBT4206
Course Name Business Intelligence II
Program Bachelor of Business Information Technology
Semester Duration 21st August 2023 to 28th November 2023

Setup Chunk

Note: the following KnitR options have been set as the global defaults:
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, eval = TRUE, collapse = FALSE, tidy = TRUE).

More KnitR options are documented here https://bookdown.org/yihui/rmarkdown-cookbook/chunk-options.html and here https://yihui.org/knitr/options/.
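
For reference, a minimal sketch of such a setup chunk (placed at the top of the R Markdown file) is shown below; the option values simply mirror the defaults listed above.

# Setup chunk: global knitr chunk options for the whole report
knitr::opts_chunk$set(
  echo = TRUE,      # show the code in the rendered output
  warning = FALSE,  # suppress warnings in the rendered output
  eval = TRUE,      # evaluate every chunk
  collapse = FALSE, # keep code and its output in separate blocks
  tidy = TRUE       # reformat the code for readability
)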

Downloading the Dataset

Reference:

F. (2018). IBM HR Analytics Employee Attrition & Performance. Kaggle. https://www.kaggle.com/code/faressayah/ibm-hr-analytics-employee-attrition-performance/input

Install and load all the packages

We installed and loaded all the packages required to execute this lab.

# Loading Packages ----

if (!is.element("languageserver", installed.packages()[, 1])) {
  install.packages("languageserver", dependencies = TRUE)
}
require("languageserver")
## Loading required package: languageserver
if (!is.element("e1071", installed.packages()[, 1])) {
  install.packages("e1071", dependencies = TRUE)
}
require("e1071")
## Loading required package: e1071
if (!is.element("Amelia", installed.packages()[, 1])) {
  install.packages("Amelia", dependencies = TRUE)
}
require("Amelia")
## Loading required package: Amelia

## Loading required package: Rcpp

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
if (!is.element("corrplot", installed.packages()[, 1])) {
  install.packages("corrplot", dependencies = TRUE)
}
require("corrplot")
## Loading required package: corrplot

## corrplot 0.92 loaded
if (!is.element("ggcorrplot", installed.packages()[, 1])) {
  install.packages("ggcorrplot", dependencies = TRUE)
}
require("ggcorrplot")
## Loading required package: ggcorrplot

## Loading required package: ggplot2
if (!is.element("readr", installed.packages()[, 1])) {
  install.packages("readr", dependencies = TRUE)
}
require("readr")
## Loading required package: readr
if (!is.element("caret", installed.packages()[, 1])) {
  install.packages("caret", dependencies = TRUE)
}
require("caret")
## Loading required package: caret

## Loading required package: lattice
if (!is.element("naniar", installed.packages()[, 1])) {
  install.packages("naniar", dependencies = TRUE)
}
require("naniar")
## Loading required package: naniar
if (!is.element("dplyr", installed.packages()[, 1])) {
  install.packages("dplyr", dependencies = TRUE)
}
require("dplyr")
## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
if (!is.element("ggplot2", installed.packages()[, 1])) {
  install.packages("ggplot2", dependencies = TRUE)
}
require("ggplot2")

if (!is.element("mice", installed.packages()[, 1])) {
  install.packages("mice", dependencies = TRUE)
}
require("mice")
## Loading required package: mice

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind
## e1071 ----
if (require("e1071")) {
  require("e1071")
} else {
  install.packages("e1071", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}

## factoextra ----
if (require("factoextra")) {
  require("factoextra")
} else {
  install.packages("factoextra", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}
## Loading required package: factoextra

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## FactoMineR ----
if (require("FactoMineR")) {
  require("FactoMineR")
} else {
  install.packages("FactoMineR", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}
## Loading required package: FactoMineR
## randomForest ----
if (require("randomForest")) {
  require("randomForest")
} else {
  install.packages("randomForest", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}
## Loading required package: randomForest

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin
## RRF ----
if (require("RRF")) {
  require("RRF")
} else {
  install.packages("RRF", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}
## Loading required package: RRF

## Registered S3 method overwritten by 'RRF':
##   method      from        
##   plot.margin randomForest

## RRF 1.9.4

## Type rrfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'RRF'

## The following objects are masked from 'package:randomForest':
## 
##     classCenter, combine, getTree, grow, importance, margin, MDSplot,
##     na.roughfix, outlier, partialPlot, treesize, varImpPlot, varUsed

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin
## caretEnsemble ----
if (require("caretEnsemble")) {
  require("caretEnsemble")
} else {
  install.packages("caretEnsemble", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}
## Loading required package: caretEnsemble

## 
## Attaching package: 'caretEnsemble'

## The following object is masked from 'package:ggplot2':
## 
##     autoplot
## C50 ----
if (require("C50")) {
  require("C50")
} else {
  install.packages("C50", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}
## Loading required package: C50
## adabag ----
if (require("adabag")) {
  require("adabag")
} else {
  install.packages("adabag", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}
## Loading required package: adabag

## Loading required package: rpart

## Loading required package: foreach

## Loading required package: doParallel

## Loading required package: iterators

## Loading required package: parallel
## plumber ----
if (require("plumber")) {
  require("plumber")
} else {
  install.packages("plumber", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}
## Loading required package: plumber
## httr ----
if (require("httr")) {
  require("httr")
} else {
  install.packages("httr", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}
## Loading required package: httr

## 
## Attaching package: 'httr'

## The following object is masked from 'package:caret':
## 
##     progress
## jsonlite ----
if (require("jsonlite")) {
  require("jsonlite")
} else {
  install.packages("jsonlite", dependencies = TRUE,
                   repos = "https://cloud.r-project.org")
}
## Loading required package: jsonlite

Loading the Dataset

We use readr to load the IBM HR Analytics Employee Attrition & Performance dataset, which has 1,470 observations and 35 variables. Employees are the backbone of an organization, and an organization's performance depends heavily on the quality of its employees.

library(readr)
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM <- read_csv("data/STREAMLINING PROCESSES AT KINYANJUI FARM.csv")
## Rows: 1470 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): Attrition, BusinessTravel, Department, EducationField, Gender, Job...
## dbl (26): Age, DailyRate, DistanceFromHome, Education, EmployeeCount, Employ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)

Exploratory Data Analysis

We used this approach to analyse the dataset and summarize its main characteristics, often with the help of statistical graphics and other data visualization methods. 'EmployeeCount', 'Over18', and 'StandardHours' each have only one unique value, while 'EmployeeNumber' has 1,470 unique values. These features are not useful to us, so we drop those columns from the dataset.

## Dimensions ----
###  Preview the Loaded Datasets ----
# Dimensions refer to the number of observations (rows) and the number of
# attributes/variables/features (columns). Execute the following commands to
# display the dimensions of your datasets:
dim(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
## [1] 1470   35
## Data Types ----
### Identify the Data Types ----
# Knowing the data types will help you to identify the most appropriate
# visualization types and algorithms that can be applied. It can also help you
# to identify the need to convert from categorical data (factors) to integers
# or vice versa where necessary. Execute the following command to identify the
# data types:
sapply(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM, class)
##                      Age                Attrition           BusinessTravel 
##                "numeric"              "character"              "character" 
##                DailyRate               Department         DistanceFromHome 
##                "numeric"              "character"                "numeric" 
##                Education           EducationField            EmployeeCount 
##                "numeric"              "character"                "numeric" 
##           EmployeeNumber  EnvironmentSatisfaction                   Gender 
##                "numeric"                "numeric"              "character" 
##               HourlyRate           JobInvolvement                 JobLevel 
##                "numeric"                "numeric"                "numeric" 
##                  JobRole          JobSatisfaction            MaritalStatus 
##              "character"                "numeric"              "character" 
##            MonthlyIncome              MonthlyRate       NumCompaniesWorked 
##                "numeric"                "numeric"                "numeric" 
##                   Over18                 OverTime        PercentSalaryHike 
##              "character"              "character"                "numeric" 
##        PerformanceRating RelationshipSatisfaction            StandardHours 
##                "numeric"                "numeric"                "numeric" 
##         StockOptionLevel        TotalWorkingYears    TrainingTimesLastYear 
##                "numeric"                "numeric"                "numeric" 
##          WorkLifeBalance           YearsAtCompany       YearsInCurrentRole 
##                "numeric"                "numeric"                "numeric" 
##  YearsSinceLastPromotion     YearsWithCurrManager 
##                "numeric"                "numeric"
# Drop specified columns
columns_to_drop <- c('EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours')
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, !(names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM) %in% columns_to_drop)]
View(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)

Descriptive Statistics

Descriptive statistics are a set of techniques and metrics used to summarize and describe the main features, characteristics, and patterns within a dataset. They provide a snapshot of the data's distribution, central tendency, variability, and other important properties. Before the data can be used effectively for tasks like building prediction models or drawing meaningful conclusions, it is crucial to develop a deep understanding of the dataset, and understanding its context is essential for making informed decisions based on the results.

Measures of Frequency

This code calculates and displays the frequency and percentage of each level of selected categorical variables in the dataset, helping us understand the data's distribution and identify the prevalent values. Some variables show relatively even distributions, while others have dominant categories; the frequencies therefore provide insight into the composition of each variable. We did this for the categorical variables Gender, JobRole and BusinessTravel.

#### Identify the number of instances that belong to Gender ----
# It is more sensible to count categorical variables (factors or dimensions)
# than numeric variables, e.g., counting the number of male and female

streamlining_processes_freq <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$Gender
cbind(frequency = table(streamlining_processes_freq),
      percentage = prop.table(table(streamlining_processes_freq)) * 100)
##        frequency percentage
## Female       588         40
## Male         882         60
streamlining_processes_freq <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$JobRole
cbind(frequency = table(streamlining_processes_freq),
      percentage = prop.table(table(streamlining_processes_freq)) * 100)
##                           frequency percentage
## Healthcare Representative       131   8.911565
## Human Resources                  52   3.537415
## Laboratory Technician           259  17.619048
## Manager                         102   6.938776
## Manufacturing Director          145   9.863946
## Research Director                80   5.442177
## Research Scientist              292  19.863946
## Sales Executive                 326  22.176871
## Sales Representative             83   5.646259
streamlining_processes_freq <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$BusinessTravel
cbind(frequency = table(streamlining_processes_freq),
      percentage = prop.table(table(streamlining_processes_freq)) * 100)
##                   frequency percentage
## Non-Travel              150   10.20408
## Travel_Frequently       277   18.84354
## Travel_Rarely          1043   70.95238

Measures of Central Tendency
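
The mean and median are not defined for categorical attributes, so we report the mode (the most frequent category) instead. As an illustrative sketch (the get_mode helper name is ours, not part of the original script), the repeated table() logic used below can be wrapped in a small function:

# Illustrative helper: returns the most frequent level(s) of a categorical variable
get_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}
# Example: get_mode(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$Gender) returns "Male"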

streamlining_processes_gender_mode <- names(table(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$Gender))[
  which(table(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$Gender) == max(table(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$Gender)))
]
print(streamlining_processes_gender_mode)
## [1] "Male"
streamlining_processes_jobRole_mode <- names(table(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$JobRole))[
  which(table(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$JobRole) == max(table(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$JobRole)))
]
print(streamlining_processes_jobRole_mode)
## [1] "Sales Executive"
streamlining_processes_businessRole_mode <- names(table(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$BusinessTravel))[
  which(table(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$BusinessTravel) == max(table(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$BusinessTravel)))
]
print(streamlining_processes_businessRole_mode)
## [1] "Travel_Rarely"

Measures of Distribution/Dispersion/Spread/Scatter/Variability

The code chunk provides various measures of distribution, dispersion, and variability for selected variables in the 'STREAMLINING_PROCESSES_AT_KINYANJUI_FARM' dataset. It includes a summary of the dataset and calculates the standard deviation, variance, kurtosis, and skewness for specific columns, offering insights into the spread and shape of the data distribution. Standard deviation measures the dispersion of data points around the mean, providing insight into data variability. Variance measures the spread of a set of values; a higher variance indicates greater variability from the mean, while a lower variance suggests that the values are closer to the mean. Kurtosis measures the shape of the probability distribution of a real-valued random variable: positive (excess) kurtosis indicates a relatively peaked distribution with heavy tails, while negative kurtosis indicates a flatter distribution with light tails. Skewness measures the asymmetry of the distribution: positive skewness indicates a distribution skewed to the right (tail on the right), while negative skewness indicates a distribution skewed to the left (tail on the left).

summary(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
##       Age         Attrition         BusinessTravel       DailyRate     
##  Min.   :18.00   Length:1470        Length:1470        Min.   : 102.0  
##  1st Qu.:30.00   Class :character   Class :character   1st Qu.: 465.0  
##  Median :36.00   Mode  :character   Mode  :character   Median : 802.0  
##  Mean   :36.92                                         Mean   : 802.5  
##  3rd Qu.:43.00                                         3rd Qu.:1157.0  
##  Max.   :60.00                                         Max.   :1499.0  
##   Department        DistanceFromHome   Education     EducationField    
##  Length:1470        Min.   : 1.000   Min.   :1.000   Length:1470       
##  Class :character   1st Qu.: 2.000   1st Qu.:2.000   Class :character  
##  Mode  :character   Median : 7.000   Median :3.000   Mode  :character  
##                     Mean   : 9.193   Mean   :2.913                     
##                     3rd Qu.:14.000   3rd Qu.:4.000                     
##                     Max.   :29.000   Max.   :5.000                     
##  EnvironmentSatisfaction    Gender            HourlyRate     JobInvolvement
##  Min.   :1.000           Length:1470        Min.   : 30.00   Min.   :1.00  
##  1st Qu.:2.000           Class :character   1st Qu.: 48.00   1st Qu.:2.00  
##  Median :3.000           Mode  :character   Median : 66.00   Median :3.00  
##  Mean   :2.722                              Mean   : 65.89   Mean   :2.73  
##  3rd Qu.:4.000                              3rd Qu.: 83.75   3rd Qu.:3.00  
##  Max.   :4.000                              Max.   :100.00   Max.   :4.00  
##     JobLevel       JobRole          JobSatisfaction MaritalStatus     
##  Min.   :1.000   Length:1470        Min.   :1.000   Length:1470       
##  1st Qu.:1.000   Class :character   1st Qu.:2.000   Class :character  
##  Median :2.000   Mode  :character   Median :3.000   Mode  :character  
##  Mean   :2.064                      Mean   :2.729                     
##  3rd Qu.:3.000                      3rd Qu.:4.000                     
##  Max.   :5.000                      Max.   :4.000                     
##  MonthlyIncome    MonthlyRate    NumCompaniesWorked   OverTime        
##  Min.   : 1009   Min.   : 2094   Min.   :0.000      Length:1470       
##  1st Qu.: 2911   1st Qu.: 8047   1st Qu.:1.000      Class :character  
##  Median : 4919   Median :14236   Median :2.000      Mode  :character  
##  Mean   : 6503   Mean   :14313   Mean   :2.693                        
##  3rd Qu.: 8379   3rd Qu.:20462   3rd Qu.:4.000                        
##  Max.   :19999   Max.   :26999   Max.   :9.000                        
##  PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel
##  Min.   :11.00     Min.   :3.000     Min.   :1.000            Min.   :0.0000  
##  1st Qu.:12.00     1st Qu.:3.000     1st Qu.:2.000            1st Qu.:0.0000  
##  Median :14.00     Median :3.000     Median :3.000            Median :1.0000  
##  Mean   :15.21     Mean   :3.154     Mean   :2.712            Mean   :0.7939  
##  3rd Qu.:18.00     3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:1.0000  
##  Max.   :25.00     Max.   :4.000     Max.   :4.000            Max.   :3.0000  
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Min.   :1.000   Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Median :3.000   Median : 5.000  
##  Mean   :11.28     Mean   :2.799         Mean   :2.761   Mean   : 7.008  
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000  
##  Max.   :40.00     Max.   :6.000         Max.   :4.000   Max.   :40.000  
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.000          Max.   :17.000
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM <- na.omit(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)

#### Measure the standard deviation ----

sapply(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, c(1, 4, 6, 7, 9, 11, 12, 13, 15, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)], sd)
##                      Age                DailyRate         DistanceFromHome 
##                9.1353735              403.5090999                8.1068644 
##                Education  EnvironmentSatisfaction               HourlyRate 
##                1.0241649                1.0930822               20.3294276 
##           JobInvolvement                 JobLevel          JobSatisfaction 
##                0.7115611                1.1069399                1.1028461 
##            MonthlyIncome              MonthlyRate       NumCompaniesWorked 
##             4707.9567831             7117.7860441                2.4980090 
##        PercentSalaryHike        PerformanceRating RelationshipSatisfaction 
##                3.6599377                0.3608235                1.0812089 
##         StockOptionLevel        TotalWorkingYears    TrainingTimesLastYear 
##                0.8520767                7.7807817                1.2892706 
##          WorkLifeBalance           YearsAtCompany       YearsInCurrentRole 
##                0.7064758                6.1265252                3.6231370 
##  YearsSinceLastPromotion     YearsWithCurrManager 
##                3.2224303                3.5681361
#### Measure the variance of each variable ----
  
sapply(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, c(1, 4, 6, 7, 9, 11, 12, 13, 15, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)], var)
##                      Age                DailyRate         DistanceFromHome 
##             8.345505e+01             1.628196e+05             6.572125e+01 
##                Education  EnvironmentSatisfaction               HourlyRate 
##             1.048914e+00             1.194829e+00             4.132856e+02 
##           JobInvolvement                 JobLevel          JobSatisfaction 
##             5.063193e-01             1.225316e+00             1.216270e+00 
##            MonthlyIncome              MonthlyRate       NumCompaniesWorked 
##             2.216486e+07             5.066288e+07             6.240049e+00 
##        PercentSalaryHike        PerformanceRating RelationshipSatisfaction 
##             1.339514e+01             1.301936e-01             1.169013e+00 
##         StockOptionLevel        TotalWorkingYears    TrainingTimesLastYear 
##             7.260346e-01             6.054056e+01             1.662219e+00 
##          WorkLifeBalance           YearsAtCompany       YearsInCurrentRole 
##             4.991081e-01             3.753431e+01             1.312712e+01 
##  YearsSinceLastPromotion     YearsWithCurrManager 
##             1.038406e+01             1.273160e+01
#### Measure the kurtosis of each variable ---- 

sapply(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, c(1, 4, 6, 7, 9, 11, 12, 13, 15, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)], kurtosis, type = 2)
##                      Age                DailyRate         DistanceFromHome 
##              -0.40414514              -1.20382281              -0.22483340 
##                Education  EnvironmentSatisfaction               HourlyRate 
##              -0.55911497              -1.20252052              -1.19639846 
##           JobInvolvement                 JobLevel          JobSatisfaction 
##               0.27099877               0.39915206              -1.22219257 
##            MonthlyIncome              MonthlyRate       NumCompaniesWorked 
##               1.00523269              -1.21495610               0.01021382 
##        PercentSalaryHike        PerformanceRating RelationshipSatisfaction 
##              -0.30059822               1.69593867              -1.18481398 
##         StockOptionLevel        TotalWorkingYears    TrainingTimesLastYear 
##               0.36463433               0.91826954               0.49499299 
##          WorkLifeBalance           YearsAtCompany       YearsInCurrentRole 
##               0.41946050               3.93550876               0.47742077 
##  YearsSinceLastPromotion     YearsWithCurrManager 
##               3.61267311               0.17105808
#### Measure the skewness of each variable ----

sapply(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, c(1, 4, 6, 7, 9, 11, 12, 13, 15, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)], skewness, type = 2)
##                      Age                DailyRate         DistanceFromHome 
##              0.413286302             -0.003518568              0.958117996 
##                Education  EnvironmentSatisfaction               HourlyRate 
##             -0.289681082             -0.321654448             -0.032310953 
##           JobInvolvement                 JobLevel          JobSatisfaction 
##             -0.498419364              1.025401283             -0.329671959 
##            MonthlyIncome              MonthlyRate       NumCompaniesWorked 
##              1.369816681              0.018577808              1.026471112 
##        PercentSalaryHike        PerformanceRating RelationshipSatisfaction 
##              0.821127976              1.921882702             -0.302827565 
##         StockOptionLevel        TotalWorkingYears    TrainingTimesLastYear 
##              0.968980317              1.117171853              0.553124171 
##          WorkLifeBalance           YearsAtCompany       YearsInCurrentRole 
##             -0.552480299              1.764529454              0.917363156 
##  YearsSinceLastPromotion     YearsWithCurrManager 
##              1.984289983              0.833450992

Measures of Relationship

Measures of Relationship quantify associations between variables. The Pearson correlation coefficient assesses linear relationships, covariance measures how two variables change together, and the slope of a regression line indicates the strength and direction of a linear relationship. R-squared describes the proportion of variance in the dependent variable that is explained, and rank correlation coefficients assess monotonic relationships. These measures aid in understanding the extent and nature of connections in the data.

Measure the covariance between variables
streamlining_processes_cov <- cov(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, c(1, 4, 6, 7, 9, 11, 12, 13, 15, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)])
View(streamlining_processes_cov)
Measure the correlation between variables
streamlining_processes_cor <- cor(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, c(1, 4, 6, 7, 9, 11, 12, 13, 15, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)])
View(streamlining_processes_cor)
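
The paragraph above also mentions rank correlation, the regression slope and R-squared, which are not computed elsewhere in this report. As a brief illustrative sketch (the choice of MonthlyIncome and TotalWorkingYears is ours), these can be obtained as follows:

# Pearson (linear) and Spearman (rank/monotonic) correlation between two variables
cor(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$TotalWorkingYears,
    STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$MonthlyIncome, method = "pearson")
cor(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$TotalWorkingYears,
    STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$MonthlyIncome, method = "spearman")

# Simple linear regression: slope of the fitted line and R-squared
income_model <- lm(MonthlyIncome ~ TotalWorkingYears,
                   data = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
coef(income_model)["TotalWorkingYears"]  # slope
summary(income_model)$r.squared          # proportion of variance explained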

Inferential Statistics

Inferential statistics involves drawing conclusions about a population based on a sample of data. It uses probability theory to make inferences, such as estimating population parameters or testing hypotheses. Techniques include hypothesis testing, confidence intervals, and regression analysis. Inferential statistics helps generalize findings from a sample to the larger population, providing insights beyond the observed data. The code below fits a one-way ANOVA model in which Age is the dependent variable and Attrition is the categorical independent variable, to determine whether there is a significant difference in mean Age across the levels of Attrition. It then fits a two-way ANOVA model, considering both Attrition and Gender as categorical independent variables affecting the dependent variable Age.

# Inferential Statistics ----

streamlining_processes_one_way_anova <- aov(Age ~ Attrition, data = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
summary(streamlining_processes_one_way_anova)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## Attrition      1   3107  3107.3   38.18 8.36e-10 ***
## Residuals   1468 119488    81.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
streamlining_processes_two_way_anova <- aov(Age ~ Attrition + Gender, data = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
summary(streamlining_processes_two_way_anova)
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## Attrition      1   3107  3107.3  38.189 8.3e-10 ***
## Gender         1    123   122.7   1.508    0.22    
## Residuals   1467 119365    81.4                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
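
Because Attrition has only two levels, the one-way ANOVA above amounts to comparing the mean Age of the two groups. As an illustrative complement (not part of the original analysis), a two-sample t-test also reports a 95% confidence interval for the difference in mean Age:

# Welch two-sample t-test of Age by Attrition, with a 95% confidence interval
t.test(Age ~ Attrition, data = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)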

Basic Visualization

Univariate Plots

Univariate plots are used to explore the distribution and characteristics of individual variables in a dataset. Common types of univariate plots include histograms, box plots, and density plots, which provide insights into the central tendency, spread, and shape of the distribution of a single variable. These visualizations are fundamental for understanding the nature of individual variables before proceeding to more complex analyses.
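
As a brief sketch of the density plots mentioned above (histograms and box plots follow below), the distribution of a single numeric attribute such as Age can be visualized as:

# Kernel density estimate of Age: a smooth alternative to a histogram
plot(density(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$Age),
     main = "Density plot of Age", xlab = "Age")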

Histograms
#### Histograms for Each Numeric Attribute ----
str(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
## tibble [1,470 × 31] (S3: tbl_df/tbl/data.frame)
##  $ Age                     : num [1:1470] 41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : chr [1:1470] "Yes" "No" "Yes" "No" ...
##  $ BusinessTravel          : chr [1:1470] "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
##  $ DailyRate               : num [1:1470] 1102 279 1373 1392 591 ...
##  $ Department              : chr [1:1470] "Sales" "Research & Development" "Research & Development" "Research & Development" ...
##  $ DistanceFromHome        : num [1:1470] 1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : num [1:1470] 2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : chr [1:1470] "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
##  $ EnvironmentSatisfaction : num [1:1470] 2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : chr [1:1470] "Female" "Male" "Male" "Female" ...
##  $ HourlyRate              : num [1:1470] 94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : num [1:1470] 3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : num [1:1470] 2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : chr [1:1470] "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
##  $ JobSatisfaction         : num [1:1470] 4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : chr [1:1470] "Single" "Married" "Single" "Married" ...
##  $ MonthlyIncome           : num [1:1470] 5993 5130 2090 2909 3468 ...
##  $ MonthlyRate             : num [1:1470] 19479 24907 2396 23159 16632 ...
##  $ NumCompaniesWorked      : num [1:1470] 8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : chr [1:1470] "Yes" "No" "Yes" "Yes" ...
##  $ PercentSalaryHike       : num [1:1470] 11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : num [1:1470] 3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: num [1:1470] 1 4 2 3 4 3 1 2 2 2 ...
##  $ StockOptionLevel        : num [1:1470] 0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : num [1:1470] 8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : num [1:1470] 0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : num [1:1470] 1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : num [1:1470] 6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : num [1:1470] 4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : num [1:1470] 0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : num [1:1470] 5 7 0 0 2 6 0 0 8 7 ...
# Identify numeric columns
numeric_columns <- sapply(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM, is.numeric)

# Extract only numeric columns
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, numeric_columns]

STREAMLINING_PROCESSES_AT_KINYANJUI_FARM<- sapply(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM, as.numeric)

# Set up the layout for multiple histograms
par(mfrow = c(1,1))

# Plot histograms for numeric columns


hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 1], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[1])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 4], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[4])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 6], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[6])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 7], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[7])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 9], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[9])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 10], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[10])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 11], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[11])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 13], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[13])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 14], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[14])

Box and Whisker Plots for Each Numeric Attribute
# Save a box plot of each numeric column to its own JPEG file
for (i in 1:ncol(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)) {
  jpeg(paste0("boxplot_", colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[i], ".jpg"),
       width = 800, height = 600)
  boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, i],
          main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[i])
  dev.off()
}

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 1], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[1])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 4], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[4])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 6], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[6])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 7], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[7])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 9], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[9])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 10], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[10])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 11], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[11])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 13], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[13])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 14], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[14])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 15], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[15])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 17], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[17])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 19], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[19])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 20], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[20])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 21], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[21])

Bar Plot for the Categorical Attribute: Gender
library(readr)
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM <- read_csv("data/STREAMLINING PROCESSES AT KINYANJUI FARM.csv")
## Rows: 1470 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): Attrition, BusinessTravel, Department, EducationField, Gender, Job...
## dbl (26): Age, DailyRate, DistanceFromHome, Education, EmployeeCount, Employ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Drop specified columns
columns_to_drop <- c('EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours')
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, !(names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM) %in% columns_to_drop)]

barplot(table(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 2]), main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[2])

Missingness Map to Identify Missing Data
missmap(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM, col = c("red", "grey"), legend = TRUE)

Multivariate Plots

Multivariate Plots involve visualizations that simultaneously represent and explore the relationships between two or more variables in a dataset. Unlike univariate plots that focus on a single variable, multivariate plots provide insights into the interactions and patterns between multiple variables. Examples of multivariate plots include scatter plots, bubble charts, heatmaps, and 3D plots. These visualizations are essential for identifying correlations, trends, and dependencies between variables, aiding in a more comprehensive understanding of the overall dataset structure. They are particularly useful for uncovering complex relationships and patterns in multidimensional data.

Correlation Plot
corrplot(cor(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, c(1, 4, 6, 7, 9, 11, 12, 13, 15, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)]), method = "circle")

ggcorrplot(cor(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, c(1, 4, 6, 7, 9, 11, 12, 13, 15, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)]))

Scatter Plot
# Select only numeric columns
numeric_columns <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[sapply(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM, is.numeric)]

# Create scatterplot matrix
pairs(numeric_columns, col = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$Age)
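
The scatterplot matrix above shows every pairwise relationship at once. As an illustrative ggplot2 sketch (the choice of Age, MonthlyIncome and Attrition is ours), a single bivariate relationship can also be examined with the attrition status mapped to colour:

# Scatter plot of MonthlyIncome against Age, coloured by Attrition
ggplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM,
       aes(x = Age, y = MonthlyIncome, colour = Attrition)) +
  geom_point(alpha = 0.6) +
  labs(title = "Monthly Income versus Age by Attrition")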

Multivariate Box and Whisker Plots by Class
library(readr)
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM <- read_csv("data/STREAMLINING PROCESSES AT KINYANJUI FARM.csv")
## Rows: 1470 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): Attrition, BusinessTravel, Department, EducationField, Gender, Job...
## dbl (26): Age, DailyRate, DistanceFromHome, Education, EmployeeCount, Employ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Drop specified columns
columns_to_drop <- c('EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours')
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, !(names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM) %in% columns_to_drop)]
featurePlot(x = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, c(1, 4, 6, 7, 9, 11, 12, 13, 15, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)],
            y = as.factor(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$Attrition),
            plot = "box")

Preprocessing and Data Transformation

Preprocessing and data transformation refer to the steps taken to prepare raw data for analysis. This involves cleaning, organizing, and modifying the data to improve its quality and make it suitable for modeling. Common preprocessing tasks include handling missing values, removing outliers, scaling or normalizing numerical features, encoding categorical variables, and splitting the data into training and testing sets. Data transformation may also involve creating new features or aggregating existing ones. These processes are crucial for enhancing the accuracy and effectiveness of machine learning models.
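
Splitting the data into training and testing sets, mentioned above, is not required for the exploratory steps that follow, but a typical sketch using caret would look like the following (the 75/25 split, the seed and the object names are illustrative assumptions, not part of the original workflow):

# Illustrative 75/25 stratified train/test split on the Attrition class (caret)
set.seed(7)
train_index <- as.vector(createDataPartition(as.factor(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$Attrition),
                                             p = 0.75, list = FALSE))
streamlining_processes_train <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[train_index, ]
streamlining_processes_test <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[-train_index, ]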

## Confirm the "missingness" in the Initial Dataset ----

# Are there missing values in the dataset?
any_na(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
## [1] FALSE
# How many?
n_miss(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
## [1] 0
# What is the percentage of missing data in the entire dataset?
prop_miss(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
## [1] 0
# How many missing values does each variable have?
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM %>% is.na() %>% colSums()
##                      Age                Attrition           BusinessTravel 
##                        0                        0                        0 
##                DailyRate               Department         DistanceFromHome 
##                        0                        0                        0 
##                Education           EducationField  EnvironmentSatisfaction 
##                        0                        0                        0 
##                   Gender               HourlyRate           JobInvolvement 
##                        0                        0                        0 
##                 JobLevel                  JobRole          JobSatisfaction 
##                        0                        0                        0 
##            MaritalStatus            MonthlyIncome              MonthlyRate 
##                        0                        0                        0 
##       NumCompaniesWorked                 OverTime        PercentSalaryHike 
##                        0                        0                        0 
##        PerformanceRating RelationshipSatisfaction         StockOptionLevel 
##                        0                        0                        0 
##        TotalWorkingYears    TrainingTimesLastYear          WorkLifeBalance 
##                        0                        0                        0 
##           YearsAtCompany       YearsInCurrentRole  YearsSinceLastPromotion 
##                        0                        0                        0 
##     YearsWithCurrManager 
##                        0
# What is the number and percentage of missing values grouped by
# each variable?
miss_var_summary(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
## # A tibble: 31 × 3
##    variable                n_miss pct_miss
##    <chr>                    <int>    <dbl>
##  1 Age                          0        0
##  2 Attrition                    0        0
##  3 BusinessTravel               0        0
##  4 DailyRate                    0        0
##  5 Department                   0        0
##  6 DistanceFromHome             0        0
##  7 Education                    0        0
##  8 EducationField               0        0
##  9 EnvironmentSatisfaction      0        0
## 10 Gender                       0        0
## # ℹ 21 more rows
# What is the number and percentage of missing values grouped by
# each observation?
miss_case_summary(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
## # A tibble: 1,470 × 3
##     case n_miss pct_miss
##    <int>  <int>    <dbl>
##  1     1      0        0
##  2     2      0        0
##  3     3      0        0
##  4     4      0        0
##  5     5      0        0
##  6     6      0        0
##  7     7      0        0
##  8     8      0        0
##  9     9      0        0
## 10    10      0        0
## # ℹ 1,460 more rows
# Which variables contain the most missing values?
#gg_miss_var(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)

# Where are missing values located (the shaded regions in the plot)?
vis_miss(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM) + theme(axis.text.x = element_text(angle = 80))

# Which combinations of variables are missing together?
#gg_miss_upset(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)

Heatmap of “missingness” broken down by “Age”

# First, confirm that the "AgeDecade" variable is a categorical variable
is.factor(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$Age)
## [1] FALSE
# Second, create the visualization
gg_miss_fct(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM, fct = Age)

# We can also create a heatmap of "missingness" broken down by "Attrition"
# First, confirm whether the "Attrition" variable is a categorical variable (factor)
is.factor(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$Attrition)
## [1] FALSE
# Second, create the visualization
gg_miss_fct(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM, fct = Attrition)

Data Transformation

Scale Data Transform
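
The caret "scale" method divides each numeric attribute by its standard deviation, so the transformed values are expressed in units of standard deviations (the mean is not shifted).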

# Scale Data Transform 
# BEFORE
# Identify numeric columns

library(readr)
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM <- read_csv("data/STREAMLINING PROCESSES AT KINYANJUI FARM.csv")
## Rows: 1470 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): Attrition, BusinessTravel, Department, EducationField, Gender, Job...
## dbl (26): Age, DailyRate, DistanceFromHome, Education, EmployeeCount, Employ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Drop specified columns
columns_to_drop <- c('EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours')
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, !(names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM) %in% columns_to_drop)]
numeric_columns <- sapply(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM, is.numeric)

# Extract only numeric columns
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, numeric_columns]

STREAMLINING_PROCESSES_AT_KINYANJUI_FARM<- sapply(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM, as.numeric)

# Set up the layout for multiple histograms
par(mfrow = c(1, 1))

summary(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
##       Age          DailyRate      DistanceFromHome   Education    
##  Min.   :18.00   Min.   : 102.0   Min.   : 1.000   Min.   :1.000  
##  1st Qu.:30.00   1st Qu.: 465.0   1st Qu.: 2.000   1st Qu.:2.000  
##  Median :36.00   Median : 802.0   Median : 7.000   Median :3.000  
##  Mean   :36.92   Mean   : 802.5   Mean   : 9.193   Mean   :2.913  
##  3rd Qu.:43.00   3rd Qu.:1157.0   3rd Qu.:14.000   3rd Qu.:4.000  
##  Max.   :60.00   Max.   :1499.0   Max.   :29.000   Max.   :5.000  
##  EnvironmentSatisfaction   HourlyRate     JobInvolvement    JobLevel    
##  Min.   :1.000           Min.   : 30.00   Min.   :1.00   Min.   :1.000  
##  1st Qu.:2.000           1st Qu.: 48.00   1st Qu.:2.00   1st Qu.:1.000  
##  Median :3.000           Median : 66.00   Median :3.00   Median :2.000  
##  Mean   :2.722           Mean   : 65.89   Mean   :2.73   Mean   :2.064  
##  3rd Qu.:4.000           3rd Qu.: 83.75   3rd Qu.:3.00   3rd Qu.:3.000  
##  Max.   :4.000           Max.   :100.00   Max.   :4.00   Max.   :5.000  
##  JobSatisfaction MonthlyIncome    MonthlyRate    NumCompaniesWorked
##  Min.   :1.000   Min.   : 1009   Min.   : 2094   Min.   :0.000     
##  1st Qu.:2.000   1st Qu.: 2911   1st Qu.: 8047   1st Qu.:1.000     
##  Median :3.000   Median : 4919   Median :14236   Median :2.000     
##  Mean   :2.729   Mean   : 6503   Mean   :14313   Mean   :2.693     
##  3rd Qu.:4.000   3rd Qu.: 8379   3rd Qu.:20462   3rd Qu.:4.000     
##  Max.   :4.000   Max.   :19999   Max.   :26999   Max.   :9.000     
##  PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel
##  Min.   :11.00     Min.   :3.000     Min.   :1.000            Min.   :0.0000  
##  1st Qu.:12.00     1st Qu.:3.000     1st Qu.:2.000            1st Qu.:0.0000  
##  Median :14.00     Median :3.000     Median :3.000            Median :1.0000  
##  Mean   :15.21     Mean   :3.154     Mean   :2.712            Mean   :0.7939  
##  3rd Qu.:18.00     3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:1.0000  
##  Max.   :25.00     Max.   :4.000     Max.   :4.000            Max.   :3.0000  
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Min.   :1.000   Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Median :3.000   Median : 5.000  
##  Mean   :11.28     Mean   :2.799         Mean   :2.761   Mean   : 7.008  
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000  
##  Max.   :40.00     Max.   :6.000         Max.   :4.000   Max.   :40.000  
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.000          Max.   :17.000
hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 1], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[1])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 4], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[4])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 6], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[6])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 7], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[7])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 9], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[9])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 10], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[10])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 11], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[11])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 13], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[13])

hist(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 14], main = colnames(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[14])

model_of_the_transform <- preProcess(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM, method = c("scale"))
print(model_of_the_transform)
## Created from 1470 samples and 23 variables
## 
## Pre-processing:
##   - ignored (0)
##   - scaled (23)
streamling_processes_scale_transform <- predict(model_of_the_transform,
                                          STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)

#After
summary(streamling_processes_scale_transform)
##       Age          DailyRate      DistanceFromHome   Education     
##  Min.   :1.970   Min.   :0.2528   Min.   :0.1234   Min.   :0.9764  
##  1st Qu.:3.284   1st Qu.:1.1524   1st Qu.:0.2467   1st Qu.:1.9528  
##  Median :3.941   Median :1.9876   Median :0.8635   Median :2.9292  
##  Mean   :4.042   Mean   :1.9888   Mean   :1.1339   Mean   :2.8442  
##  3rd Qu.:4.707   3rd Qu.:2.8673   3rd Qu.:1.7269   3rd Qu.:3.9056  
##  Max.   :6.568   Max.   :3.7149   Max.   :3.5772   Max.   :4.8820  
##  EnvironmentSatisfaction   HourlyRate    JobInvolvement     JobLevel     
##  Min.   :0.9148          Min.   :1.476   Min.   :1.405   Min.   :0.9034  
##  1st Qu.:1.8297          1st Qu.:2.361   1st Qu.:2.811   1st Qu.:0.9034  
##  Median :2.7445          Median :3.247   Median :4.216   Median :1.8068  
##  Mean   :2.4900          Mean   :3.241   Mean   :3.837   Mean   :1.8646  
##  3rd Qu.:3.6594          3rd Qu.:4.120   3rd Qu.:4.216   3rd Qu.:2.7102  
##  Max.   :3.6594          Max.   :4.919   Max.   :5.621   Max.   :4.5170  
##  JobSatisfaction  MonthlyIncome     MonthlyRate     NumCompaniesWorked
##  Min.   :0.9067   Min.   :0.2143   Min.   :0.2942   Min.   :0.0000    
##  1st Qu.:1.8135   1st Qu.:0.6183   1st Qu.:1.1305   1st Qu.:0.4003    
##  Median :2.7202   Median :1.0448   Median :2.0000   Median :0.8006    
##  Mean   :2.4741   Mean   :1.3813   Mean   :2.0109   Mean   :1.0781    
##  3rd Qu.:3.6270   3rd Qu.:1.7798   3rd Qu.:2.8747   3rd Qu.:1.6013    
##  Max.   :3.6270   Max.   :4.2479   Max.   :3.7932   Max.   :3.6029    
##  PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel
##  Min.   :3.006     Min.   : 8.314    Min.   :0.9249           Min.   :0.0000  
##  1st Qu.:3.279     1st Qu.: 8.314    1st Qu.:1.8498           1st Qu.:0.0000  
##  Median :3.825     Median : 8.314    Median :2.7747           Median :1.1736  
##  Mean   :4.156     Mean   : 8.740    Mean   :2.5085           Mean   :0.9317  
##  3rd Qu.:4.918     3rd Qu.: 8.314    3rd Qu.:3.6996           3rd Qu.:1.1736  
##  Max.   :6.831     Max.   :11.086    Max.   :3.6996           Max.   :3.5208  
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   :0.0000    Min.   :0.000         Min.   :1.415   Min.   :0.0000  
##  1st Qu.:0.7711    1st Qu.:1.551         1st Qu.:2.831   1st Qu.:0.4897  
##  Median :1.2852    Median :2.327         Median :4.246   Median :0.8161  
##  Mean   :1.4497    Mean   :2.171         Mean   :3.908   Mean   :1.1439  
##  3rd Qu.:1.9278    3rd Qu.:2.327         3rd Qu.:4.246   3rd Qu.:1.4690  
##  Max.   :5.1409    Max.   :4.654         Max.   :5.662   Max.   :6.5290  
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   :0.000      Min.   :0.0000          Min.   :0.0000      
##  1st Qu.:0.552      1st Qu.:0.0000          1st Qu.:0.5605      
##  Median :0.828      Median :0.3103          Median :0.8408      
##  Mean   :1.167      Mean   :0.6789          Mean   :1.1555      
##  3rd Qu.:1.932      3rd Qu.:0.9310          3rd Qu.:1.9618      
##  Max.   :4.968      Max.   :4.6549          Max.   :4.7644
hist(streamling_processes_scale_transform[, 1], main = colnames(streamling_processes_scale_transform)[1])

hist(streamling_processes_scale_transform[, 4], main = colnames(streamling_processes_scale_transform)[4])

hist(streamling_processes_scale_transform[, 6], main = colnames(streamling_processes_scale_transform)[6])

hist(streamling_processes_scale_transform[, 7], main = colnames(streamling_processes_scale_transform)[7])

hist(streamling_processes_scale_transform[, 9], main = colnames(streamling_processes_scale_transform)[9])

hist(streamling_processes_scale_transform[, 10], main = colnames(streamling_processes_scale_transform)[10])

hist(streamling_processes_scale_transform[, 11], main = colnames(streamling_processes_scale_transform)[11])

hist(streamling_processes_scale_transform[, 13], main = colnames(streamling_processes_scale_transform)[13])

hist(streamling_processes_scale_transform[, 14], main = colnames(streamling_processes_scale_transform)[14])

Center Data Transform
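
The caret "center" method subtracts the mean of each attribute from its values, so every transformed attribute has a mean of zero while its spread is unchanged.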

# BEFORE
summary(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
##       Age          DailyRate      DistanceFromHome   Education    
##  Min.   :18.00   Min.   : 102.0   Min.   : 1.000   Min.   :1.000  
##  1st Qu.:30.00   1st Qu.: 465.0   1st Qu.: 2.000   1st Qu.:2.000  
##  Median :36.00   Median : 802.0   Median : 7.000   Median :3.000  
##  Mean   :36.92   Mean   : 802.5   Mean   : 9.193   Mean   :2.913  
##  3rd Qu.:43.00   3rd Qu.:1157.0   3rd Qu.:14.000   3rd Qu.:4.000  
##  Max.   :60.00   Max.   :1499.0   Max.   :29.000   Max.   :5.000  
##  EnvironmentSatisfaction   HourlyRate     JobInvolvement    JobLevel    
##  Min.   :1.000           Min.   : 30.00   Min.   :1.00   Min.   :1.000  
##  1st Qu.:2.000           1st Qu.: 48.00   1st Qu.:2.00   1st Qu.:1.000  
##  Median :3.000           Median : 66.00   Median :3.00   Median :2.000  
##  Mean   :2.722           Mean   : 65.89   Mean   :2.73   Mean   :2.064  
##  3rd Qu.:4.000           3rd Qu.: 83.75   3rd Qu.:3.00   3rd Qu.:3.000  
##  Max.   :4.000           Max.   :100.00   Max.   :4.00   Max.   :5.000  
##  JobSatisfaction MonthlyIncome    MonthlyRate    NumCompaniesWorked
##  Min.   :1.000   Min.   : 1009   Min.   : 2094   Min.   :0.000     
##  1st Qu.:2.000   1st Qu.: 2911   1st Qu.: 8047   1st Qu.:1.000     
##  Median :3.000   Median : 4919   Median :14236   Median :2.000     
##  Mean   :2.729   Mean   : 6503   Mean   :14313   Mean   :2.693     
##  3rd Qu.:4.000   3rd Qu.: 8379   3rd Qu.:20462   3rd Qu.:4.000     
##  Max.   :4.000   Max.   :19999   Max.   :26999   Max.   :9.000     
##  PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel
##  Min.   :11.00     Min.   :3.000     Min.   :1.000            Min.   :0.0000  
##  1st Qu.:12.00     1st Qu.:3.000     1st Qu.:2.000            1st Qu.:0.0000  
##  Median :14.00     Median :3.000     Median :3.000            Median :1.0000  
##  Mean   :15.21     Mean   :3.154     Mean   :2.712            Mean   :0.7939  
##  3rd Qu.:18.00     3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:1.0000  
##  Max.   :25.00     Max.   :4.000     Max.   :4.000            Max.   :3.0000  
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Min.   :1.000   Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Median :3.000   Median : 5.000  
##  Mean   :11.28     Mean   :2.799         Mean   :2.761   Mean   : 7.008  
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000  
##  Max.   :40.00     Max.   :6.000         Max.   :4.000   Max.   :40.000  
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.000          Max.   :17.000
boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 1], main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[1])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 2], main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[2])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 3], main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[3])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 5], main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[5])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 6], main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[6])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 7], main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[7])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 8], main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[8])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 9], main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[9])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 10], main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[10])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 11], main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[11])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 12], main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[12])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 13], main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[13])

boxplot(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, 14], main = names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)[14])

model_of_the_transform <- preProcess(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM, method = c("center"))
print(model_of_the_transform)
## Created from 1470 samples and 23 variables
## 
## Pre-processing:
##   - centered (23)
##   - ignored (0)
streamling_processes_center_transform <- predict(model_of_the_transform, # nolint
                                           STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)

# AFTER

summary(streamling_processes_center_transform)
##       Age             DailyRate         DistanceFromHome   Education       
##  Min.   :-18.9238   Min.   :-700.4857   Min.   :-8.193   Min.   :-1.91293  
##  1st Qu.: -6.9238   1st Qu.:-337.4857   1st Qu.:-7.193   1st Qu.:-0.91293  
##  Median : -0.9238   Median :  -0.4857   Median :-2.193   Median : 0.08707  
##  Mean   :  0.0000   Mean   :   0.0000   Mean   : 0.000   Mean   : 0.00000  
##  3rd Qu.:  6.0762   3rd Qu.: 354.5143   3rd Qu.: 4.807   3rd Qu.: 1.08707  
##  Max.   : 23.0762   Max.   : 696.5143   Max.   :19.807   Max.   : 2.08707  
##  EnvironmentSatisfaction   HourlyRate       JobInvolvement   
##  Min.   :-1.7218         Min.   :-35.8912   Min.   :-1.7299  
##  1st Qu.:-0.7218         1st Qu.:-17.8912   1st Qu.:-0.7299  
##  Median : 0.2782         Median :  0.1088   Median : 0.2701  
##  Mean   : 0.0000         Mean   :  0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.2782         3rd Qu.: 17.8588   3rd Qu.: 0.2701  
##  Max.   : 1.2782         Max.   : 34.1088   Max.   : 1.2701  
##     JobLevel        JobSatisfaction   MonthlyIncome    MonthlyRate      
##  Min.   :-1.06395   Min.   :-1.7286   Min.   :-5494   Min.   :-12219.1  
##  1st Qu.:-1.06395   1st Qu.:-0.7286   1st Qu.:-3592   1st Qu.: -6266.1  
##  Median :-0.06395   Median : 0.2714   Median :-1584   Median :   -77.6  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   :    0   Mean   :     0.0  
##  3rd Qu.: 0.93605   3rd Qu.: 1.2714   3rd Qu.: 1876   3rd Qu.:  6148.4  
##  Max.   : 2.93605   Max.   : 1.2714   Max.   :13496   Max.   : 12685.9  
##  NumCompaniesWorked PercentSalaryHike PerformanceRating
##  Min.   :-2.6932    Min.   :-4.21     Min.   :-0.1537  
##  1st Qu.:-1.6932    1st Qu.:-3.21     1st Qu.:-0.1537  
##  Median :-0.6932    Median :-1.21     Median :-0.1537  
##  Mean   : 0.0000    Mean   : 0.00     Mean   : 0.0000  
##  3rd Qu.: 1.3068    3rd Qu.: 2.79     3rd Qu.:-0.1537  
##  Max.   : 6.3068    Max.   : 9.79     Max.   : 0.8463  
##  RelationshipSatisfaction StockOptionLevel  TotalWorkingYears
##  Min.   :-1.7122          Min.   :-0.7939   Min.   :-11.28   
##  1st Qu.:-0.7122          1st Qu.:-0.7939   1st Qu.: -5.28   
##  Median : 0.2878          Median : 0.2061   Median : -1.28   
##  Mean   : 0.0000          Mean   : 0.0000   Mean   :  0.00   
##  3rd Qu.: 1.2878          3rd Qu.: 0.2061   3rd Qu.:  3.72   
##  Max.   : 1.2878          Max.   : 2.2061   Max.   : 28.72   
##  TrainingTimesLastYear WorkLifeBalance   YearsAtCompany   YearsInCurrentRole
##  Min.   :-2.7993       Min.   :-1.7612   Min.   :-7.008   Min.   :-4.229    
##  1st Qu.:-0.7993       1st Qu.:-0.7612   1st Qu.:-4.008   1st Qu.:-2.229    
##  Median : 0.2007       Median : 0.2388   Median :-2.008   Median :-1.229    
##  Mean   : 0.0000       Mean   : 0.0000   Mean   : 0.000   Mean   : 0.000    
##  3rd Qu.: 0.2007       3rd Qu.: 0.2388   3rd Qu.: 1.992   3rd Qu.: 2.771    
##  Max.   : 3.2007       Max.   : 1.2388   Max.   :32.992   Max.   :13.771    
##  YearsSinceLastPromotion YearsWithCurrManager
##  Min.   :-2.1878         Min.   :-4.123      
##  1st Qu.:-2.1878         1st Qu.:-2.123      
##  Median :-1.1878         Median :-1.123      
##  Mean   : 0.0000         Mean   : 0.000      
##  3rd Qu.: 0.8122         3rd Qu.: 2.877      
##  Max.   :12.8122         Max.   :12.877
boxplot(streamling_processes_center_transform[, 1], main = names(streamling_processes_center_transform)[1])

boxplot(streamling_processes_center_transform[, 2], main = names(streamling_processes_center_transform)[2])

boxplot(streamling_processes_center_transform[, 3], main = names(streamling_processes_center_transform)[3])

boxplot(streamling_processes_center_transform[, 5], main = names(streamling_processes_center_transform)[5])

boxplot(streamling_processes_center_transform[, 6], main = names(streamling_processes_center_transform)[6])

boxplot(streamling_processes_center_transform[, 7], main = names(streamling_processes_center_transform)[7])

boxplot(streamling_processes_center_transform[, 8], main = names(streamling_processes_center_transform)[8])

boxplot(streamling_processes_center_transform[, 9], main = names(streamling_processes_center_transform)[9])

boxplot(streamling_processes_center_transform[, 10], main = names(streamling_processes_center_transform)[10])

boxplot(streamling_processes_center_transform[, 11], main = names(streamling_processes_center_transform)[11])

boxplot(streamling_processes_center_transform[, 12], main = names(streamling_processes_center_transform)[12])

boxplot(streamling_processes_center_transform[, 13], main = names(streamling_processes_center_transform)[13])

boxplot(streamling_processes_center_transform[, 14], main = names(streamling_processes_center_transform)[14])

Standardize Data Transform (center and scale each numeric attribute to a mean of 0 and a standard deviation of 1)

### The Standardize Basic Transform on the STREAMLINING_PROCESSES_AT_KINYANJUI_FARM Dataset ----
# BEFORE

library(readr)
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM <- read_csv("data/STREAMLINING PROCESSES AT KINYANJUI FARM.csv")
## Rows: 1470 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): Attrition, BusinessTravel, Department, EducationField, Gender, Job...
## dbl (26): Age, DailyRate, DistanceFromHome, Education, EmployeeCount, Employ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Drop specified columns
columns_to_drop <- c('EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours')
STREAMLINING_PROCESSES_AT_KINYANJUI_FARM <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, !(names(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM) %in% columns_to_drop)]
summary(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)
##       Age         Attrition         BusinessTravel       DailyRate     
##  Min.   :18.00   Length:1470        Length:1470        Min.   : 102.0  
##  1st Qu.:30.00   Class :character   Class :character   1st Qu.: 465.0  
##  Median :36.00   Mode  :character   Mode  :character   Median : 802.0  
##  Mean   :36.92                                         Mean   : 802.5  
##  3rd Qu.:43.00                                         3rd Qu.:1157.0  
##  Max.   :60.00                                         Max.   :1499.0  
##   Department        DistanceFromHome   Education     EducationField    
##  Length:1470        Min.   : 1.000   Min.   :1.000   Length:1470       
##  Class :character   1st Qu.: 2.000   1st Qu.:2.000   Class :character  
##  Mode  :character   Median : 7.000   Median :3.000   Mode  :character  
##                     Mean   : 9.193   Mean   :2.913                     
##                     3rd Qu.:14.000   3rd Qu.:4.000                     
##                     Max.   :29.000   Max.   :5.000                     
##  EnvironmentSatisfaction    Gender            HourlyRate     JobInvolvement
##  Min.   :1.000           Length:1470        Min.   : 30.00   Min.   :1.00  
##  1st Qu.:2.000           Class :character   1st Qu.: 48.00   1st Qu.:2.00  
##  Median :3.000           Mode  :character   Median : 66.00   Median :3.00  
##  Mean   :2.722                              Mean   : 65.89   Mean   :2.73  
##  3rd Qu.:4.000                              3rd Qu.: 83.75   3rd Qu.:3.00  
##  Max.   :4.000                              Max.   :100.00   Max.   :4.00  
##     JobLevel       JobRole          JobSatisfaction MaritalStatus     
##  Min.   :1.000   Length:1470        Min.   :1.000   Length:1470       
##  1st Qu.:1.000   Class :character   1st Qu.:2.000   Class :character  
##  Median :2.000   Mode  :character   Median :3.000   Mode  :character  
##  Mean   :2.064                      Mean   :2.729                     
##  3rd Qu.:3.000                      3rd Qu.:4.000                     
##  Max.   :5.000                      Max.   :4.000                     
##  MonthlyIncome    MonthlyRate    NumCompaniesWorked   OverTime        
##  Min.   : 1009   Min.   : 2094   Min.   :0.000      Length:1470       
##  1st Qu.: 2911   1st Qu.: 8047   1st Qu.:1.000      Class :character  
##  Median : 4919   Median :14236   Median :2.000      Mode  :character  
##  Mean   : 6503   Mean   :14313   Mean   :2.693                        
##  3rd Qu.: 8379   3rd Qu.:20462   3rd Qu.:4.000                        
##  Max.   :19999   Max.   :26999   Max.   :9.000                        
##  PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel
##  Min.   :11.00     Min.   :3.000     Min.   :1.000            Min.   :0.0000  
##  1st Qu.:12.00     1st Qu.:3.000     1st Qu.:2.000            1st Qu.:0.0000  
##  Median :14.00     Median :3.000     Median :3.000            Median :1.0000  
##  Mean   :15.21     Mean   :3.154     Mean   :2.712            Mean   :0.7939  
##  3rd Qu.:18.00     3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:1.0000  
##  Max.   :25.00     Max.   :4.000     Max.   :4.000            Max.   :3.0000  
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Min.   :1.000   Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Median :3.000   Median : 5.000  
##  Mean   :11.28     Mean   :2.799         Mean   :2.761   Mean   : 7.008  
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000  
##  Max.   :40.00     Max.   :6.000         Max.   :4.000   Max.   :40.000  
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.000          Max.   :17.000
sapply(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[, c(1, 4, 6, 7, 9, 11, 12, 13, 15, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)], sd)
##                      Age                DailyRate         DistanceFromHome 
##                9.1353735              403.5090999                8.1068644 
##                Education  EnvironmentSatisfaction               HourlyRate 
##                1.0241649                1.0930822               20.3294276 
##           JobInvolvement                 JobLevel          JobSatisfaction 
##                0.7115611                1.1069399                1.1028461 
##            MonthlyIncome              MonthlyRate       NumCompaniesWorked 
##             4707.9567831             7117.7860441                2.4980090 
##        PercentSalaryHike        PerformanceRating RelationshipSatisfaction 
##                3.6599377                0.3608235                1.0812089 
##         StockOptionLevel        TotalWorkingYears    TrainingTimesLastYear 
##                0.8520767                7.7807817                1.2892706 
##          WorkLifeBalance           YearsAtCompany       YearsInCurrentRole 
##                0.7064758                6.1265252                3.6231370 
##  YearsSinceLastPromotion     YearsWithCurrManager 
##                3.2224303                3.5681361
model_of_the_transform <- preProcess(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM,
                                     method = c("scale", "center"))
print(model_of_the_transform)
## Created from 1470 samples and 31 variables
## 
## Pre-processing:
##   - centered (23)
##   - ignored (8)
##   - scaled (23)
streamling_processes_standardize_transform <- predict(model_of_the_transform, # nolint
                                                STREAMLINING_PROCESSES_AT_KINYANJUI_FARM)

# AFTER
summary(streamling_processes_standardize_transform)
##       Age           Attrition         BusinessTravel       DailyRate        
##  Min.   :-2.0715   Length:1470        Length:1470        Min.   :-1.735985  
##  1st Qu.:-0.7579   Class :character   Class :character   1st Qu.:-0.836377  
##  Median :-0.1011   Mode  :character   Mode  :character   Median :-0.001204  
##  Mean   : 0.0000                                         Mean   : 0.000000  
##  3rd Qu.: 0.6651                                         3rd Qu.: 0.878578  
##  Max.   : 2.5260                                         Max.   : 1.726143  
##   Department        DistanceFromHome    Education        EducationField    
##  Length:1470        Min.   :-1.0106   Min.   :-1.86779   Length:1470       
##  Class :character   1st Qu.:-0.8872   1st Qu.:-0.89138   Class :character  
##  Mode  :character   Median :-0.2705   Median : 0.08502   Mode  :character  
##                     Mean   : 0.0000   Mean   : 0.00000                     
##                     3rd Qu.: 0.5930   3rd Qu.: 1.06143                     
##                     Max.   : 2.4433   Max.   : 2.03783                     
##  EnvironmentSatisfaction    Gender            HourlyRate       
##  Min.   :-1.5752         Length:1470        Min.   :-1.765478  
##  1st Qu.:-0.6603         Class :character   1st Qu.:-0.880062  
##  Median : 0.2545         Mode  :character   Median : 0.005354  
##  Mean   : 0.0000                            Mean   : 0.000000  
##  3rd Qu.: 1.1694                            3rd Qu.: 0.878472  
##  Max.   : 1.1694                            Max.   : 1.677806  
##  JobInvolvement       JobLevel          JobRole          JobSatisfaction  
##  Min.   :-2.4312   Min.   :-0.96116   Length:1470        Min.   :-1.5674  
##  1st Qu.:-1.0258   1st Qu.:-0.96116   Class :character   1st Qu.:-0.6606  
##  Median : 0.3795   Median :-0.05777   Mode  :character   Median : 0.2461  
##  Mean   : 0.0000   Mean   : 0.00000                      Mean   : 0.0000  
##  3rd Qu.: 0.3795   3rd Qu.: 0.84562                      3rd Qu.: 1.1529  
##  Max.   : 1.7849   Max.   : 2.65241                      Max.   : 1.1529  
##  MaritalStatus      MonthlyIncome      MonthlyRate      NumCompaniesWorked
##  Length:1470        Min.   :-1.1669   Min.   :-1.7167   Min.   :-1.0781   
##  Class :character   1st Qu.:-0.7629   1st Qu.:-0.8803   1st Qu.:-0.6778   
##  Mode  :character   Median :-0.3364   Median :-0.0109   Median :-0.2775   
##                     Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   
##                     3rd Qu.: 0.3985   3rd Qu.: 0.8638   3rd Qu.: 0.5231   
##                     Max.   : 2.8667   Max.   : 1.7823   Max.   : 2.5247   
##    OverTime         PercentSalaryHike PerformanceRating
##  Length:1470        Min.   :-1.1502   Min.   :-0.4261  
##  Class :character   1st Qu.:-0.8769   1st Qu.:-0.4261  
##  Mode  :character   Median :-0.3305   Median :-0.4261  
##                     Mean   : 0.0000   Mean   : 0.0000  
##                     3rd Qu.: 0.7624   3rd Qu.:-0.4261  
##                     Max.   : 2.6750   Max.   : 2.3454  
##  RelationshipSatisfaction StockOptionLevel  TotalWorkingYears
##  Min.   :-1.5836          Min.   :-0.9317   Min.   :-1.4497  
##  1st Qu.:-0.6587          1st Qu.:-0.9317   1st Qu.:-0.6785  
##  Median : 0.2661          Median : 0.2419   Median :-0.1645  
##  Mean   : 0.0000          Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.1910          3rd Qu.: 0.2419   3rd Qu.: 0.4782  
##  Max.   : 1.1910          Max.   : 2.5891   Max.   : 3.6912  
##  TrainingTimesLastYear WorkLifeBalance  YearsAtCompany    YearsInCurrentRole
##  Min.   :-2.1712       Min.   :-2.493   Min.   :-1.1439   Min.   :-1.1673   
##  1st Qu.:-0.6200       1st Qu.:-1.077   1st Qu.:-0.6542   1st Qu.:-0.6153   
##  Median : 0.1557       Median : 0.338   Median :-0.3278   Median :-0.3393   
##  Mean   : 0.0000       Mean   : 0.000   Mean   : 0.0000   Mean   : 0.0000   
##  3rd Qu.: 0.1557       3rd Qu.: 0.338   3rd Qu.: 0.3251   3rd Qu.: 0.7647   
##  Max.   : 2.4826       Max.   : 1.753   Max.   : 5.3851   Max.   : 3.8008   
##  YearsSinceLastPromotion YearsWithCurrManager
##  Min.   :-0.6789         Min.   :-1.1555     
##  1st Qu.:-0.6789         1st Qu.:-0.5950     
##  Median :-0.3686         Median :-0.3148     
##  Mean   : 0.0000         Mean   : 0.0000     
##  3rd Qu.: 0.2521         3rd Qu.: 0.8063     
##  Max.   : 3.9760         Max.   : 3.6089
sapply(streamling_processes_standardize_transform[, c(1, 4, 6, 7, 9, 11, 12, 13, 15, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)], sd)
##                      Age                DailyRate         DistanceFromHome 
##                        1                        1                        1 
##                Education  EnvironmentSatisfaction               HourlyRate 
##                        1                        1                        1 
##           JobInvolvement                 JobLevel          JobSatisfaction 
##                        1                        1                        1 
##            MonthlyIncome              MonthlyRate       NumCompaniesWorked 
##                        1                        1                        1 
##        PercentSalaryHike        PerformanceRating RelationshipSatisfaction 
##                        1                        1                        1 
##         StockOptionLevel        TotalWorkingYears    TrainingTimesLastYear 
##                        1                        1                        1 
##          WorkLifeBalance           YearsAtCompany       YearsInCurrentRole 
##                        1                        1                        1 
##  YearsSinceLastPromotion     YearsWithCurrManager 
##                        1                        1

Training Models

Training Models refers to the process of building and refining predictive or descriptive models using machine learning algorithms. In this context, “training” involves using a labeled dataset (input data paired with corresponding output labels) to teach the model to make predictions or classifications. The model learns patterns and relationships within the training data, adjusting its internal parameters to optimize its performance. The trained model can then be evaluated on new, unseen data to assess its generalization ability. The goal is to develop a model that accurately captures underlying patterns in the data and can make reliable predictions on new, similar data. Techniques for training models include regression, classification, clustering, and neural networks, among others.

Split the dataset

We use caret::createDataPartition() to generate random indices for the training set, then use these indices to extract the corresponding rows from the dataset, creating the training and testing datasets (train_streamling_processes_train and test_streamling_processes_test). This gives a 75:25 split of the STREAMLINING_PROCESSES_AT_KINYANJUI_FARM dataset into training and testing sets. We use train_streamling_processes_train for model training and test_streamling_processes_test for model evaluation.

## 1. Split the dataset
train_index <- createDataPartition(STREAMLINING_PROCESSES_AT_KINYANJUI_FARM$Attrition,
                                   p = 0.75,
                                   list = FALSE)
train_streamling_processes_train <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[train_index, ]
test_streamling_processes_test <- STREAMLINING_PROCESSES_AT_KINYANJUI_FARM[-train_index, ]

# Check the column names of train_streamling_processes_train
colnames(train_streamling_processes_train)
##  [1] "Age"                      "Attrition"               
##  [3] "BusinessTravel"           "DailyRate"               
##  [5] "Department"               "DistanceFromHome"        
##  [7] "Education"                "EducationField"          
##  [9] "EnvironmentSatisfaction"  "Gender"                  
## [11] "HourlyRate"               "JobInvolvement"          
## [13] "JobLevel"                 "JobRole"                 
## [15] "JobSatisfaction"          "MaritalStatus"           
## [17] "MonthlyIncome"            "MonthlyRate"             
## [19] "NumCompaniesWorked"       "OverTime"                
## [21] "PercentSalaryHike"        "PerformanceRating"       
## [23] "RelationshipSatisfaction" "StockOptionLevel"        
## [25] "TotalWorkingYears"        "TrainingTimesLastYear"   
## [27] "WorkLifeBalance"          "YearsAtCompany"          
## [29] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
## [31] "YearsWithCurrManager"

naiveBayes() function in the e1071 package

# Train the Naive Bayes model
e107_streamling_processes_dataset_model_nb_e1071 <-
  e1071::naiveBayes(Attrition ~ .,
                    data = train_streamling_processes_train)

# Testing 
predictions_nb_e1071 <-
  predict(e107_streamling_processes_dataset_model_nb_e1071,
          test_streamling_processes_test[, 1:31])

# Check unique levels
unique_levels_pred <- levels(predictions_nb_e1071)
unique_levels_actual <- levels(test_streamling_processes_test$Attrition)

# Compare levels
setdiff(unique_levels_pred, unique_levels_actual)
## [1] "No"  "Yes"
setdiff(unique_levels_actual, unique_levels_pred)
## NULL
# Convert predictions to factor and align levels
predictions_nb_e1071 <- as.factor(predictions_nb_e1071)
#levels(predictions_nb_e1071) <- levels(test_streamling_processes_test$Attrition)

# Update the actual levels to match the predicted levels
test_streamling_processes_test$Attrition <- as.factor(test_streamling_processes_test$Attrition)
levels(test_streamling_processes_test$Attrition) <- levels(predictions_nb_e1071)


# View the Results
print(predictions_nb_e1071)
##   [1] Yes No  No  No  Yes No  Yes Yes No  Yes No  Yes No  Yes No  No  No  No 
##  [19] No  No  No  Yes No  No  No  No  No  No  No  No  No  No  No  No  Yes No 
##  [37] No  No  No  Yes No  No  No  No  No  No  No  No  No  No  No  No  Yes No 
##  [55] No  No  No  No  No  No  Yes No  No  No  No  No  No  No  No  Yes No  No 
##  [73] No  No  No  No  No  Yes No  Yes No  Yes Yes No  No  No  No  No  No  No 
##  [91] Yes Yes No  Yes Yes No  No  No  No  No  No  No  No  No  Yes No  Yes No 
## [109] No  Yes No  No  No  No  No  Yes No  Yes Yes No  No  No  No  Yes No  No 
## [127] No  No  No  No  Yes Yes No  No  No  Yes No  No  No  No  No  No  No  No 
## [145] Yes Yes No  No  No  No  Yes Yes No  Yes Yes No  No  No  No  No  Yes No 
## [163] Yes No  No  No  Yes No  No  No  Yes No  Yes No  No  No  Yes No  No  No 
## [181] No  Yes Yes No  No  No  No  No  No  Yes No  Yes No  No  Yes No  No  No 
## [199] No  No  Yes No  Yes No  No  Yes No  Yes No  No  No  No  No  No  No  No 
## [217] No  No  No  No  No  No  No  Yes No  No  No  No  No  Yes No  No  No  No 
## [235] No  No  Yes No  No  No  No  No  No  No  Yes No  No  No  No  No  No  No 
## [253] No  Yes No  No  No  No  Yes No  No  Yes No  No  No  No  No  No  No  Yes
## [271] Yes Yes No  No  No  No  No  No  No  No  No  No  Yes No  No  No  Yes No 
## [289] Yes No  No  No  Yes No  Yes No  No  No  No  No  Yes No  No  Yes No  No 
## [307] No  No  Yes No  Yes No  No  No  No  No  No  No  No  No  No  No  Yes No 
## [325] No  No  No  No  Yes No  Yes No  Yes No  No  No  No  No  No  No  No  Yes
## [343] No  Yes No  No  No  No  No  No  No  No  No  Yes Yes No  Yes No  No  Yes
## [361] No  No  No  No  Yes No  No 
## Levels: No Yes
# Confusion Matrix
conf_matrix <- caret::confusionMatrix(predictions_nb_e1071,
                                      test_streamling_processes_test$Attrition)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  261  24
##        Yes  47  35
##                                           
##                Accuracy : 0.8065          
##                  95% CI : (0.7624, 0.8457)
##     No Information Rate : 0.8392          
##     P-Value [Acc > NIR] : 0.95954         
##                                           
##                   Kappa : 0.3806          
##                                           
##  Mcnemar's Test P-Value : 0.00903         
##                                           
##             Sensitivity : 0.8474          
##             Specificity : 0.5932          
##          Pos Pred Value : 0.9158          
##          Neg Pred Value : 0.4268          
##              Prevalence : 0.8392          
##          Detection Rate : 0.7112          
##    Detection Prevalence : 0.7766          
##       Balanced Accuracy : 0.7203          
##                                           
##        'Positive' Class : No              
## 
# The confusion matrix can also be viewed graphically,
# although with less information.
plot(table(predictions_nb_e1071,
           test_streamling_processes_test$`Attrition`))

### Bootstrapping train control 

# train_control <- trainControl(method = "boot", number = 10000)

# streamling_processes_dataset_model_lm <- caret::train(`MonthlyIncome` ~ `Attrition`,
#                                                       data = train_streamling_processes_train,
#                                                       trControl = train_control,
#                                                       method = "lm",
#                                                       metric = "RMSE")


### Test the trained linear regression model using the testing dataset 
# predictions_lm <- predict(streamling_processes_dataset_model_lm,
#                           test_streamling_processes_test[, 1:31])

### View the RMSE and the predicted values 
# print(streamling_processes_dataset_model_lm)
# print(predictions_lm)
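
The bootstrapping code above is left commented out, presumably because 10,000 bootstrap resamples are slow to run. A minimal runnable sketch of the same idea is shown below; the resample count of 25 and the object names ending in _boot are assumptions chosen purely to keep the example quick.

# A hedged sketch of bootstrap resampling with caret (25 resamples assumed
# instead of the 10,000 used in the commented code above).
train_control_boot <- trainControl(method = "boot", number = 25)

streamling_processes_model_lm_boot <-
  caret::train(MonthlyIncome ~ Attrition,
               data = train_streamling_processes_train,
               trControl = train_control_boot,
               method = "lm",
               metric = "RMSE")

# View the bootstrapped RMSE and predict on the held-out test set
print(streamling_processes_model_lm_boot)
predictions_lm_boot <- predict(streamling_processes_model_lm_boot,
                               test_streamling_processes_test)
head(predictions_lm_boot)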

Classification: LDA with k-fold Cross Validation

### LDA classifier based on a 5-fold cross validation 
train_control_lda <- trainControl(method = "cv", number = 5)

streamling_dateset_model_lda <-
  caret::train(`Attrition` ~ ., data = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM,
               trControl = train_control_lda, na.action = na.omit, method = "lda2",
               metric = "Accuracy")

### 3.b. Test the trained LDA model using the testing dataset 
predictions_lda <- predict(streamling_dateset_model_lda,
                           test_streamling_processes_test[, 1:31])

### 3.c. View the summary of the model and view the confusion matrix
print(streamling_dateset_model_lda)
## Linear Discriminant Analysis 
## 
## 1470 samples
##   30 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1176, 1175, 1177, 1175, 1177 
## Resampling results:
## 
##   Accuracy   Kappa   
##   0.8734775  0.436173
## 
## Tuning parameter 'dimen' was held constant at a value of 1
caret::confusionMatrix(predictions_lda, test_streamling_processes_test$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  298  31
##        Yes  10  28
##                                           
##                Accuracy : 0.8883          
##                  95% CI : (0.8515, 0.9186)
##     No Information Rate : 0.8392          
##     P-Value [Acc > NIR] : 0.004869        
##                                           
##                   Kappa : 0.5164          
##                                           
##  Mcnemar's Test P-Value : 0.001787        
##                                           
##             Sensitivity : 0.9675          
##             Specificity : 0.4746          
##          Pos Pred Value : 0.9058          
##          Neg Pred Value : 0.7368          
##              Prevalence : 0.8392          
##          Detection Rate : 0.8120          
##    Detection Prevalence : 0.8965          
##       Balanced Accuracy : 0.7211          
##                                           
##        'Positive' Class : No              
## 
plot(table(predictions_lda, test_streamling_processes_test$Attrition))

Classification: Naive Bayes with Repeated k-fold Cross Validation

### 4.a. Train an e1071::naiveBayes classifier based on the Attrition variable
streamling_dateset_model_nb <-
  e1071::naiveBayes(`Attrition` ~ ., data = train_streamling_processes_train)

### 4.b. Test the trained naive Bayes classifier using the testing dataset 
predictions_nb_e1071 <-
  predict(streamling_dateset_model_nb, test_streamling_processes_test[, 1:31])

### 4.c. View a summary of the naive Bayes model and the confusion matrix 
print(streamling_dateset_model_nb)
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##        No       Yes 
## 0.8386219 0.1613781 
## 
## Conditional probabilities:
##      Age
## Y         [,1]     [,2]
##   No  37.63892 9.000685
##   Yes 33.25843 9.892676
## 
##      BusinessTravel
## Y     Non-Travel Travel_Frequently Travel_Rarely
##   No  0.10594595        0.16324324    0.73081081
##   Yes 0.05617978        0.28089888    0.66292135
## 
##      DailyRate
## Y         [,1]     [,2]
##   No  814.0259 401.8619
##   Yes 757.4888 400.0808
## 
##      Department
## Y     Human Resources Research & Development      Sales
##   No       0.04108108             0.66918919 0.28972973
##   Yes      0.05617978             0.52808989 0.41573034
## 
##      DistanceFromHome
## Y          [,1]     [,2]
##   No   8.884324 8.035542
##   Yes 10.224719 8.267768
## 
##      Education
## Y         [,1]     [,2]
##   No  2.913514 1.045026
##   Yes 2.808989 1.029249
## 
##      EducationField
## Y     Human Resources Life Sciences  Marketing    Medical      Other
##   No       0.01513514    0.40864865 0.10702703 0.33405405 0.05405405
##   Yes      0.03370787    0.33146067 0.16292135 0.26404494 0.04494382
##      EducationField
## Y     Technical Degree
##   No        0.08108108
##   Yes       0.16292135
## 
##      EnvironmentSatisfaction
## Y        [,1]     [,2]
##   No  2.76000 1.073450
##   Yes 2.52809 1.179772
## 
##      Gender
## Y        Female      Male
##   No  0.4075676 0.5924324
##   Yes 0.3314607 0.6685393
## 
##      HourlyRate
## Y         [,1]     [,2]
##   No  66.70919 20.29076
##   Yes 65.51685 20.49259
## 
##      JobInvolvement
## Y         [,1]      [,2]
##   No  2.770811 0.6911718
##   Yes 2.522472 0.7753625
## 
##      JobLevel
## Y         [,1]      [,2]
##   No  2.138378 1.1244467
##   Yes 1.634831 0.9485762
## 
##      JobRole
## Y     Healthcare Representative Human Resources Laboratory Technician
##   No                0.096216216     0.033513514           0.162162162
##   Yes               0.022471910     0.056179775           0.264044944
##      JobRole
## Y         Manager Manufacturing Director Research Director Research Scientist
##   No  0.082162162            0.116756757       0.057297297        0.193513514
##   Yes 0.022471910            0.044943820       0.005617978        0.179775281
##      JobRole
## Y     Sales Executive Sales Representative
##   No      0.217297297          0.041081081
##   Yes     0.258426966          0.146067416
## 
##      JobSatisfaction
## Y         [,1]     [,2]
##   No  2.771892 1.096467
##   Yes 2.438202 1.109334
## 
##      MaritalStatus
## Y      Divorced   Married    Single
##   No  0.2313514 0.4821622 0.2864865
##   Yes 0.1573034 0.3314607 0.5112360
## 
##      MonthlyIncome
## Y         [,1]     [,2]
##   No  6775.414 4819.691
##   Yes 4750.534 3691.935
## 
##      MonthlyRate
## Y         [,1]     [,2]
##   No  14426.74 7030.436
##   Yes 14726.62 7128.048
## 
##      NumCompaniesWorked
## Y         [,1]     [,2]
##   No  2.725405 2.486817
##   Yes 2.887640 2.640143
## 
##      OverTime
## Y            No       Yes
##   No  0.7740541 0.2259459
##   Yes 0.4887640 0.5112360
## 
##      PercentSalaryHike
## Y         [,1]     [,2]
##   No  15.31243 3.643875
##   Yes 15.02809 3.749847
## 
##      PerformanceRating
## Y         [,1]      [,2]
##   No  3.156757 0.3637680
##   Yes 3.146067 0.3541701
## 
##      RelationshipSatisfaction
## Y         [,1]     [,2]
##   No  2.721081 1.065008
##   Yes 2.606742 1.126145
## 
##      StockOptionLevel
## Y          [,1]      [,2]
##   No  0.8421622 0.8432028
##   Yes 0.5056180 0.8250813
## 
##      TotalWorkingYears
## Y          [,1]     [,2]
##   No  11.871351 7.869037
##   Yes  8.348315 7.463056
## 
##      TrainingTimesLastYear
## Y         [,1]     [,2]
##   No  2.842162 1.298082
##   Yes 2.696629 1.261373
## 
##      WorkLifeBalance
## Y         [,1]      [,2]
##   No  2.787027 0.6702105
##   Yes 2.623596 0.8294743
## 
##      YearsAtCompany
## Y         [,1]     [,2]
##   No  7.172973 6.022354
##   Yes 5.101124 6.168161
## 
##      YearsInCurrentRole
## Y         [,1]     [,2]
##   No  4.411892 3.590522
##   Yes 2.853933 3.233129
## 
##      YearsSinceLastPromotion
## Y         [,1]     [,2]
##   No  2.090811 3.081219
##   Yes 1.955056 3.241364
## 
##      YearsWithCurrManager
## Y         [,1]     [,2]
##   No  4.280000 3.555753
##   Yes 2.724719 3.134924
caret::confusionMatrix(predictions_nb_e1071, test_streamling_processes_test$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  261  24
##        Yes  47  35
##                                           
##                Accuracy : 0.8065          
##                  95% CI : (0.7624, 0.8457)
##     No Information Rate : 0.8392          
##     P-Value [Acc > NIR] : 0.95954         
##                                           
##                   Kappa : 0.3806          
##                                           
##  Mcnemar's Test P-Value : 0.00903         
##                                           
##             Sensitivity : 0.8474          
##             Specificity : 0.5932          
##          Pos Pred Value : 0.9158          
##          Neg Pred Value : 0.4268          
##              Prevalence : 0.8392          
##          Detection Rate : 0.7112          
##    Detection Prevalence : 0.7766          
##       Balanced Accuracy : 0.7203          
##                                           
##        'Positive' Class : No              
## 
plot(table(predictions_nb_e1071, test_streamling_processes_test$Attrition))

Classification: SVM with Repeated k-fold Cross Validation

### 5.a. SVM classifier using repeated 5-fold cross validation with 10 reps 
# We train a Support Vector Machine (for classification) on the "Attrition"
# variable in the training dataset based on a repeated 5-fold cross validation
# train control with 10 reps.

# The repeated k-fold cross-validation method involves repeating the number of
# times the dataset is split into k-subsets. The final model accuracy/RMSE is
# taken as the mean from the number of repeats.

train_control_svm <- trainControl(method = "repeatedcv", number = 5, repeats = 10)

streamling_dateset_model_svm <-
  caret::train(`Attrition` ~ ., data = train_streamling_processes_train,
               trControl = train_control_svm, na.action = na.omit,
               method = "svmLinearWeights2", metric = "Accuracy")

### 5.b. Test the trained SVM model using the testing dataset 
predictions_svm <- predict(streamling_dateset_model_svm, test_streamling_processes_test[, 1:31])

### 5.c. View a summary of the model and view the confusion matrix 
print(streamling_dateset_model_svm)
## L2 Regularized Linear Support Vector Machines with Class Weights 
## 
## 1103 samples
##   30 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times) 
## Summary of sample sizes: 882, 882, 882, 883, 883, 882, ... 
## Resampling results across tuning parameters:
## 
##   cost  Loss  weight  Accuracy   Kappa       
##   0.25  L1    1       0.7316981  0.0008966698
##   0.25  L1    2       0.7498893  0.0091594654
##   0.25  L1    3       0.6940498  0.0187904658
##   0.25  L2    1       0.8386261  0.0000000000
##   0.25  L2    2       0.8386261  0.0000000000
##   0.25  L2    3       0.8348161  0.0272426179
##   0.50  L1    1       0.7314788  0.0094526012
##   0.50  L1    2       0.7113369  0.0091888134
##   0.50  L1    3       0.7560950  0.0108168897
##   0.50  L2    1       0.8386261  0.0000000000
##   0.50  L2    2       0.8386261  0.0000000000
##   0.50  L2    3       0.8348161  0.0272426179
##   1.00  L1    1       0.7801501  0.0024531820
##   1.00  L1    2       0.7121156  0.0012114600
##   1.00  L1    3       0.7351740  0.0113118554
##   1.00  L2    1       0.8386261  0.0000000000
##   1.00  L2    2       0.8386261  0.0000000000
##   1.00  L2    3       0.8348161  0.0272426179
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L2 and weight = 1.
caret::confusionMatrix(predictions_svm, test_streamling_processes_test$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  308  59
##        Yes   0   0
##                                           
##                Accuracy : 0.8392          
##                  95% CI : (0.7976, 0.8753)
##     No Information Rate : 0.8392          
##     P-Value [Acc > NIR] : 0.5347          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 4.321e-14       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8392          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8392          
##          Detection Rate : 0.8392          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : No              
## 
plot(table(predictions_svm, test_streamling_processes_test$Attrition))

Classification: Naive Bayes with Leave One Out Cross Validation

# In Leave One Out Cross-Validation (LOOCV), a data instance is left out and a
# model constructed on all other data instances in the training set. This is
# repeated for all data instances.


### 6.a. Train a Naive Bayes classifier based on LOOCV 
train_control_loocv <- trainControl(method = "LOOCV")

streamling_dateset_model_nb_loocv <-
  caret::train(`Attrition` ~ ., data = train_streamling_processes_train,
               trControl = train_control_loocv, na.action = na.omit,
               method = "naive_bayes", metric = "Accuracy")

### 6.b. Test the trained model using the testing dataset
predictions_nb_loocv <-
  predict(streamling_dateset_model_nb_loocv, test_streamling_processes_test[, 1:31])

### 6.c. View the confusion matrix 
print(streamling_dateset_model_nb_loocv)
## Naive Bayes 
## 
## 1103 samples
##   30 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 1102, 1102, 1102, 1102, 1102, 1102, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.6681777  0.2418589
##    TRUE      0.8386219  0.0000000
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
##  and adjust = 1.
caret::confusionMatrix(predictions_nb_loocv, test_streamling_processes_test$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  308  59
##        Yes   0   0
##                                           
##                Accuracy : 0.8392          
##                  95% CI : (0.7976, 0.8753)
##     No Information Rate : 0.8392          
##     P-Value [Acc > NIR] : 0.5347          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 4.321e-14       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8392          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8392          
##          Detection Rate : 0.8392          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : No              
## 
plot(table(predictions_nb_loocv, test_streamling_processes_test$Attrition))

Our Actual Model Algorithm

The decision to employ a Support Vector Machine (SVM) with a radial basis function (RBF) kernel is justified by its competitive accuracy relative to alternatives such as Linear Discriminant Analysis (LDA). The SVM-RBF combination excels at capturing complex, non-linear patterns, handles high-dimensional spaces adeptly, exhibits robustness to outliers, and offers flexible parameter tuning. These traits collectively contribute to its efficacy in achieving accurate predictions, especially on unseen data.

### Train the model
set.seed(7)
train_control <- trainControl(method = "cv", number = 5)
streamling_caret_model_svm_radial <- # nolint: object_length_linter.
  train(Attrition ~ ., data = train_streamling_processes_train, method = "svmRadial",
        metric = "Accuracy", trControl = train_control)

#### Display the model's details 
print(streamling_caret_model_svm_radial)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1103 samples
##   30 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 882, 882, 883, 882, 883 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa     
##   0.25  0.8386261  0.00000000
##   0.50  0.8404401  0.01855692
##   1.00  0.8604072  0.23391160
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01332559
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01332559 and C = 1.
#### Make predictions 
predictions <- predict(streamling_caret_model_svm_radial,
                       test_streamling_processes_test[, 1:31])

#### Display the model's evaluation metrics   
table(predictions, test_streamling_processes_test$Attrition)
##            
## predictions  No Yes
##         No  306  45
##         Yes   2  14
confusion_matrix <-
  caret::confusionMatrix(predictions,
                         test_streamling_processes_test[, 1:31]$Attrition)
print(confusion_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  306  45
##        Yes   2  14
##                                           
##                Accuracy : 0.8719          
##                  95% CI : (0.8334, 0.9044)
##     No Information Rate : 0.8392          
##     P-Value [Acc > NIR] : 0.04801         
##                                           
##                   Kappa : 0.3272          
##                                           
##  Mcnemar's Test P-Value : 8.993e-10       
##                                           
##             Sensitivity : 0.9935          
##             Specificity : 0.2373          
##          Pos Pred Value : 0.8718          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.8392          
##          Detection Rate : 0.8338          
##    Detection Prevalence : 0.9564          
##       Balanced Accuracy : 0.6154          
##                                           
##        'Positive' Class : No              
## 
# Load required libraries
library(ggplot2)

# Convert the confusion matrix to a data frame
conf_matrix_df <- as.data.frame(as.table(confusion_matrix$table))

# Rename columns for clarity
names(conf_matrix_df) <- c("Reference", "Prediction", "Frequency")

# Plot the confusion matrix as a heatmap
ggplot(data = conf_matrix_df, aes(x = Reference, y = Prediction, fill = Frequency)) +
  geom_tile() +
  geom_text(aes(label = Frequency)) +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Confusion Matrix") +
  theme_minimal()

Model Performance and Comparison

The code performs model training and comparison for predicting ‘Attrition’ using different algorithms. It sets up a repeated cross-validation control (train_control), then trains models using Linear Discriminant Analysis (LDA), Classification and Regression Trees (CART), k-Nearest Neighbors (KNN), Support Vector Machine with radial kernel (SVM), and Random Forest. The goal is to evaluate and compare the performance of these models using the specified training control settings.

train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

### LDA
set.seed(7)
streamling_model_lda <- train(Attrition ~ ., data = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM,
                            method = "lda", trControl = train_control)

### CART
set.seed(7)
streamling_model_cart <- train(Attrition ~ ., data = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM,
                             method = "rpart", trControl = train_control)

### KNN 
set.seed(7)
streamling_model_knn <- train(Attrition ~ ., data = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM,
                            method = "knn", trControl = train_control)

### SVM 
set.seed(7)
streamling_model_svm <- train(Attrition ~ ., data = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM,
                            method = "svmRadial", trControl = train_control)

### Random Forest 
set.seed(7)
streamling_model_rf <- train(Attrition ~ ., data = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM,
                           method = "rf", trControl = train_control)

## 3.b. Call the `resamples` Function
# We then create a list of the model results and pass the list as an argument
# to the `resamples` function.

results <- resamples(list(LDA = streamling_model_lda, CART = streamling_model_cart,
                          KNN = streamling_model_knn, SVM = streamling_model_svm,
                          RF = streamling_model_rf))

scales <- list(x = list(relation = "free"), y = list(relation = "free"))
bwplot(results, scales = scales)

scales <- list(x = list(relation = "free"), y = list(relation = "free"))
dotplot(results, scales = scales)

splom(results)

# xyplot plots to compare models
xyplot(results, models = c("LDA", "SVM"))

# or
# xyplot plots to compare models
xyplot(results, models = c("SVM", "CART"))

diffs <- diff(results)

summary(diffs)
## 
## Call:
## summary.diff.resamples(object = diffs)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##      LDA       CART      KNN       SVM       RF       
## LDA             0.033098  0.041479  0.010201  0.017220
## CART 2.309e-08            0.008381 -0.022897 -0.015878
## KNN  2.859e-08 0.836319            -0.031278 -0.024259
## SVM  0.011950  1.216e-05 7.210e-09            0.007019
## RF   0.000505  0.002636  1.744e-05 0.178204           
## 
## Kappa 
##      LDA       CART      KNN       SVM       RF       
## LDA             0.247718  0.380570  0.169228  0.163456
## CART 4.288e-13            0.132852 -0.078490 -0.084262
## KNN  2.745e-15 3.032e-05           -0.211342 -0.217114
## SVM  4.027e-11 0.010489  6.963e-10           -0.005772
## RF   4.040e-09 0.004762  2.889e-10 1.000000

Hyperparameter Tuning

Hyperparameter tuning involves optimizing the hyperparameters of a machine learning algorithm to enhance model performance. Hyperparameters are configuration settings external to the model that influence its learning process. caret's train() function explores candidate hyperparameter combinations, either over a default grid, a user-supplied tuneGrid, or a random search, for algorithms such as LDA, CART, KNN, SVM, and Random Forest, and selects the combination that yields the best resampled performance.
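
As a minimal, hedged sketch of grid-search tuning (the candidate values for sigma and C below are assumptions chosen purely for illustration), the radial SVM used earlier can be tuned with an explicit tuneGrid:

# A minimal sketch of grid-search hyperparameter tuning with caret.
# The candidate sigma and C values are illustrative assumptions.
set.seed(7)
tune_grid_svm <- expand.grid(sigma = c(0.005, 0.01, 0.02),
                             C = c(0.5, 1, 2))

train_control_tune <- trainControl(method = "cv", number = 5)

streamling_model_svm_tuned <-
  train(Attrition ~ ., data = train_streamling_processes_train,
        method = "svmRadial", metric = "Accuracy",
        tuneGrid = tune_grid_svm, trControl = train_control_tune)

# The sigma/C combination with the highest cross-validated accuracy
print(streamling_model_svm_tuned$bestTune)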

Ensemble methods

Ensemble methods combine multiple machine learning models to improve overall predictive performance and robustness. These methods leverage the strengths of different models to compensate for individual weaknesses. Common ensemble methods include bagging, boosting, and stacking. Bagging (e.g., Random Forest) builds multiple models in parallel and averages their predictions, reducing variance; a bagging sketch is shown below. Boosting (e.g., AdaBoost) focuses on sequentially improving model performance by giving more weight to misclassified instances. Stacking combines predictions from multiple models using another model as a meta-learner. Ensemble methods often outperform individual models, providing more reliable and accurate predictions.
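
The boosting code below covers the C5.0 and GBM algorithms. As a hedged sketch of the bagging idea, the example below trains a bagged CART ensemble with caret's "treebag" method; the method choice and the 5-fold cross-validation settings are assumptions made only for illustration, and the treebag method requires the ipred package.

# A minimal sketch of bagging: a bagged CART ensemble via caret's "treebag"
# method (requires the ipred package). Settings are illustrative assumptions.
set.seed(7)
train_control_bag <- trainControl(method = "cv", number = 5)

streamling_model_treebag <-
  train(Attrition ~ ., data = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM,
        method = "treebag", metric = "Accuracy",
        trControl = train_control_bag)

print(streamling_model_treebag)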

Boosting

C5.0 and Stochastic Gradient Boosting (GBM)

seed <- 7
metric <- "Accuracy"
train_control <- trainControl(method = "cv", number = 5)


#### Boosting with C5.0 ----
# C5.0
set.seed(seed)
ion_model_c50 <- train(Attrition ~ ., data = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM, method = "C5.0",
                       metric = metric,
                       trControl = train_control)

####  Boosting with Stochastic Gradient Boosting ----
set.seed(seed)
ion_model_gbm <- train(Attrition ~ ., data = STREAMLINING_PROCESSES_AT_KINYANJUI_FARM, method = "gbm",
                       metric = metric, trControl = train_control,
                       verbose = FALSE)

# Summarize results 
boosting_results <-
  resamples(list("C5.0 boosting ensemble" = ion_model_c50,
                 "Generalized Boosted Regression Modeling (GBM)" = ion_model_gbm))

summary(boosting_results)
## 
## Call:
## summary.resamples(object = boosting_results)
## 
## Models: C5.0 boosting ensemble, Generalized Boosted Regression Modeling (GBM) 
## Number of resamples: 5 
## 
## Accuracy 
##                                                    Min.   1st Qu.    Median
## C5.0 boosting ensemble                        0.8474576 0.8508475 0.8639456
## Generalized Boosted Regression Modeling (GBM) 0.8474576 0.8600683 0.8779661
##                                                    Mean   3rd Qu.     Max. NA's
## C5.0 boosting ensemble                        0.8639860 0.8737201 0.883959    0
## Generalized Boosted Regression Modeling (GBM) 0.8721261 0.8843537 0.890785    0
## 
## Kappa 
##                                                    Min.   1st Qu.   Median
## C5.0 boosting ensemble                        0.2432879 0.3606016 0.379719
## Generalized Boosted Regression Modeling (GBM) 0.2600747 0.3077278 0.461788
##                                                    Mean   3rd Qu.      Max.
## C5.0 boosting ensemble                        0.3633392 0.3880737 0.4450139
## Generalized Boosted Regression Modeling (GBM) 0.3950538 0.4658159 0.4798626
##                                               NA's
## C5.0 boosting ensemble                           0
## Generalized Boosted Regression Modeling (GBM)    0
dotplot(boosting_results)

Save and Load your Model

# Saving a model into a file allows you to load it later and use it to make
# predictions. A model is saved with the `saveRDS()` function and loaded back
# later by calling the `readRDS()` function.

saveRDS(streamling_caret_model_svm_radial, "./models/saved_streamling_caret_model_svm_radial.rds")
# The saved model can then be loaded later as follows:
loaded_streamling_model_svm <- readRDS("./models/saved_streamling_caret_model_svm_radial.rds")
print(loaded_streamling_model_svm)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1103 samples
##   30 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 882, 882, 883, 882, 883 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa     
##   0.25  0.8386261  0.00000000
##   0.50  0.8404401  0.01855692
##   1.00  0.8604072  0.23391160
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01332559
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01332559 and C = 1.
predictions_with_loaded_model <-
  predict(loaded_streamling_model_svm, newdata = test_streamling_processes_test)
confusionMatrix(predictions_with_loaded_model, test_streamling_processes_test$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  306  45
##        Yes   2  14
##                                           
##                Accuracy : 0.8719          
##                  95% CI : (0.8334, 0.9044)
##     No Information Rate : 0.8392          
##     P-Value [Acc > NIR] : 0.04801         
##                                           
##                   Kappa : 0.3272          
##                                           
##  Mcnemar's Test P-Value : 8.993e-10       
##                                           
##             Sensitivity : 0.9935          
##             Specificity : 0.2373          
##          Pos Pred Value : 0.8718          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.8392          
##          Detection Rate : 0.8338          
##    Detection Prevalence : 0.9564          
##       Balanced Accuracy : 0.6154          
##                                           
##        'Positive' Class : No              
## 

Making predictions with the model

to_be_predicted <-
  data.frame(
  Age = 41,
  BusinessTravel = "Travel_Rarely",
  DailyRate = 1102,
  Department = "Sales",
  DistanceFromHome = 1,
  Education = 2,
  EducationField = "Life Sciences",
  EnvironmentSatisfaction = 2,
  Gender = "Female",
  HourlyRate = 94,
  JobInvolvement = 3,
  JobLevel = 2,
  JobRole = "Sales Executive",
  JobSatisfaction = 4,
  MaritalStatus = "Single",
  MonthlyIncome = 5993,
  MonthlyRate = 19479,
  NumCompaniesWorked = 8,
  OverTime = "Yes",
  PercentSalaryHike = 11,
  PerformanceRating = 3,
  RelationshipSatisfaction = 1,
  StockOptionLevel = 0,
  TotalWorkingYears = 8,
  TrainingTimesLastYear = 0,
  WorkLifeBalance = 1,
  YearsAtCompany = 6,
  YearsInCurrentRole = 4,
  YearsSinceLastPromotion = 0,
  YearsWithCurrManager = 5
)

predict(loaded_streamling_model_svm, to_be_predicted)
## [1] No
## Levels: No Yes

Make predictions using the model through a function

predict_attrition <-
  function(arg_Age, arg_BusinessTravel, arg_DailyRate,
           arg_Department, arg_DistanceFromHome, arg_Education, arg_EducationField,
           arg_EnvironmentSatisfaction, arg_Gender,
           arg_HourlyRate, arg_JobInvolvement, arg_JobLevel, arg_JobRole,
           arg_JobSatisfaction, arg_MaritalStatus, arg_MonthlyIncome, arg_MonthlyRate,
           arg_NumCompaniesWorked, arg_OverTime, arg_PercentSalaryHike,
           arg_PerformanceRating, arg_RelationshipSatisfaction, arg_StockOptionLevel,
           arg_TotalWorkingYears, arg_TrainingTimesLastYear, arg_WorkLifeBalance, arg_YearsAtCompany,
           arg_YearsInCurrentRole, arg_YearsSinceLastPromotion, arg_YearsWithCurrManager) {
    
    # Create a data frame using the arguments
    to_be_predicted <- data.frame(Age = arg_Age, 
                                  BusinessTravel = arg_BusinessTravel, DailyRate = arg_DailyRate,
                                  Department = arg_Department, DistanceFromHome = arg_DistanceFromHome,
                                  Education = arg_Education, EducationField = arg_EducationField,
                                  EnvironmentSatisfaction = arg_EnvironmentSatisfaction, Gender = arg_Gender,
                                  HourlyRate = arg_HourlyRate, JobInvolvement = arg_JobInvolvement,
                                  JobLevel = arg_JobLevel, JobRole = arg_JobRole,
                                  JobSatisfaction = arg_JobSatisfaction, MaritalStatus = arg_MaritalStatus,
                                  MonthlyIncome = arg_MonthlyIncome, MonthlyRate = arg_MonthlyRate,
                                  NumCompaniesWorked = arg_NumCompaniesWorked, 
                                  OverTime = arg_OverTime, PercentSalaryHike = arg_PercentSalaryHike,
                                  PerformanceRating = arg_PerformanceRating,
                                  RelationshipSatisfaction = arg_RelationshipSatisfaction,
                                  StockOptionLevel = arg_StockOptionLevel,
                                  TotalWorkingYears = arg_TotalWorkingYears,
                                  TrainingTimesLastYear = arg_TrainingTimesLastYear,
                                  WorkLifeBalance = arg_WorkLifeBalance, YearsAtCompany = arg_YearsAtCompany,
                                  YearsInCurrentRole = arg_YearsInCurrentRole,
                                  YearsSinceLastPromotion = arg_YearsSinceLastPromotion,
                                  YearsWithCurrManager = arg_YearsWithCurrManager)
    
    # Make a prediction based on the data frame
    predict(loaded_streamling_model_svm, to_be_predicted)
  }
predict_attrition(41,
                 "Travel_Rarely",
                 1102,
                 "Sales",
                 1,
                 2,
                "Life Sciences",
                 2,
                 "Female",
                 94,
                 3,
                 2,
                 "Sales Executive",
                 4,
                 "Single",
                 5993,
                 19479,
                 8,
                 "Yes",
                 11,
                 3,
                 1,
                 0,
                 8,
                 0,
                 1,
                 6,
                 4,
                 0,
                 5)
## [1] No
## Levels: No Yes

Plumber API

The Plumber API in R facilitates the creation of web APIs, allowing R code for data processing and analysis to be exposed as web services. It turns annotated R scripts into interactive, accessible endpoints, which supports collaboration and data-driven decision-making.

NB: because we wanted to execute the Plumber API from within our R Markdown document, we run it in a separate R process using the future package, which allows the rest of the document to continue executing. An on.exit() block is then used to stop the Plumber API after all the code chunks have been executed.
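
The PlumberAPI.R file itself is not reproduced in this report. The following is a minimal sketch of what such an endpoint could look like, assuming a GET /Attrition endpoint whose arg_* query parameters mirror the calls further below and which reloads the saved SVM model; the handler body is illustrative only, not the project's exact code.

# PlumberAPI.R -- illustrative sketch only, not the exact file used in this project
# Load the saved caret SVM model once, when the API starts
loaded_streamling_model_svm <-
  readRDS("./models/saved_streamling_caret_model_svm_radial.rds")

#* Predict employee attrition from the arg_* query parameters
#* @get /Attrition
function(req, ...) {
  # Collect the query parameters (e.g. arg_Age = "41") and strip the "arg_"
  # prefix so the column names match the predictors the model was trained on
  args <- req$argsQuery
  names(args) <- sub("^arg_", "", names(args))
  to_be_predicted <- as.data.frame(args, stringsAsFactors = FALSE)

  # Values arrive as character strings over HTTP; convert the numeric columns back
  numeric_cols <- c("Age", "DailyRate", "DistanceFromHome", "Education",
                    "EnvironmentSatisfaction", "HourlyRate", "JobInvolvement",
                    "JobLevel", "JobSatisfaction", "MonthlyIncome", "MonthlyRate",
                    "NumCompaniesWorked", "PercentSalaryHike", "PerformanceRating",
                    "RelationshipSatisfaction", "StockOptionLevel",
                    "TotalWorkingYears", "TrainingTimesLastYear", "WorkLifeBalance",
                    "YearsAtCompany", "YearsInCurrentRole",
                    "YearsSinceLastPromotion", "YearsWithCurrManager")
  to_be_predicted[numeric_cols] <- lapply(to_be_predicted[numeric_cols], as.numeric)

  # Return the predicted class ("No" or "Yes") as a character value
  as.character(predict(loaded_streamling_model_svm, to_be_predicted))
}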

if (!requireNamespace("future", quietly = TRUE)) install.packages("future")

library(future)
## 
## Attaching package: 'future'

## The following object is masked from 'package:caret':
## 
##     cluster

## The following object is masked from 'package:languageserver':
## 
##     run
# Define a function to start the Plumber API in a separate process
start_api <- function() {
  api <- plumber::plumb("PlumberAPI.R")
  api$run(host = "127.0.0.1", port = 5022, swagger = TRUE)
}

# Start the API in a separate process
future::plan("multisession")  # Use multisession backend for parallel processing
future({
  start_api()
})
## MultisessionFuture:
## Label: '<none>'
## Expression:
## {
##     start_api()
## }
## Lazy evaluation: FALSE
## Asynchronous evaluation: TRUE
## Local evaluation: TRUE
## Environment: R_GlobalEnv
## Capture standard output: TRUE
## Capture condition classes: 'condition' (excluding 'nothing')
## Globals: 1 objects totaling 6.65 KiB (function 'start_api' of 6.65 KiB)
## Packages: <none>
## L'Ecuyer-CMRG RNG seed: <none> (seed = FALSE)
## Resolved: FALSE
## Value: <not collected>
## Conditions captured: <none>
## Early signaling: FALSE
## Owner process: 1f35b32c-3fd8-aa67-1a03-364646a4c6ac
## Class: 'MultisessionFuture', 'ClusterFuture', 'MultiprocessFuture', 'Future', 'environment'
### Generate the URL required to access the API ----

# The API runs on localhost at the constant port 5022
base_url <- "http://127.0.0.1:5022/Attrition"

params <- data.frame(
  arg_Age = 41,
  arg_BusinessTravel = "Travel_Rarely",
  arg_DailyRate = 1102,
  arg_Department = "Sales",
  arg_DistanceFromHome = 1,
  arg_Education = 2,
  arg_EducationField = "Life Sciences",
  arg_EnvironmentSatisfaction = 2,
  arg_Gender = "Female",
  arg_HourlyRate = 94,
  arg_JobInvolvement = 3,
  arg_JobLevel = 2,
  arg_JobRole = "Sales Executive",
  arg_JobSatisfaction = 4,
  arg_MaritalStatus = "Single",
  arg_MonthlyIncome = 5993,
  arg_MonthlyRate = 19479,
  arg_NumCompaniesWorked = 8,
  arg_OverTime = "Yes",
  arg_PercentSalaryHike = 11,
  arg_PerformanceRating = 3,
  arg_RelationshipSatisfaction = 1,
  arg_StockOptionLevel = 0,
  arg_TotalWorkingYears = 8,
  arg_TrainingTimesLastYear = 0,
  arg_WorkLifeBalance = 1,
  arg_YearsAtCompany = 6,
  arg_YearsInCurrentRole = 4,
  arg_YearsSinceLastPromotion = 0,
  arg_YearsWithCurrManager = 5
)



query_url <- httr::modify_url(url = base_url, query = params)
print(query_url)
## [1] "http://127.0.0.1:5022/Attrition?arg_Age=41&arg_BusinessTravel=Travel_Rarely&arg_DailyRate=1102&arg_Department=Sales&arg_DistanceFromHome=1&arg_Education=2&arg_EducationField=Life%20Sciences&arg_EnvironmentSatisfaction=2&arg_Gender=Female&arg_HourlyRate=94&arg_JobInvolvement=3&arg_JobLevel=2&arg_JobRole=Sales%20Executive&arg_JobSatisfaction=4&arg_MaritalStatus=Single&arg_MonthlyIncome=5993&arg_MonthlyRate=19479&arg_NumCompaniesWorked=8&arg_OverTime=Yes&arg_PercentSalaryHike=11&arg_PerformanceRating=3&arg_RelationshipSatisfaction=1&arg_StockOptionLevel=0&arg_TotalWorkingYears=8&arg_TrainingTimesLastYear=0&arg_WorkLifeBalance=1&arg_YearsAtCompany=6&arg_YearsInCurrentRole=4&arg_YearsSinceLastPromotion=0&arg_YearsWithCurrManager=5"
Sys.sleep(10)  
model_prediction <- GET(query_url)
content(model_prediction)
## [[1]]
## [1] "No"
content(model_prediction)[[1]]
## [1] "No"
# Parse the response into the right format 
model_prediction_raw <- content(model_prediction, as = "text",
                                encoding = "utf-8")
jsonlite::fromJSON(model_prediction_raw)
## [1] "No"
get_attrition_predictions <-
  function(arg_Age, arg_BusinessTravel, arg_DailyRate,
           arg_Department, arg_DistanceFromHome, arg_Education, arg_EducationField,
           arg_EnvironmentSatisfaction, arg_Gender,
           arg_HourlyRate, arg_JobInvolvement, arg_JobLevel, arg_JobRole,
           arg_JobSatisfaction, arg_MaritalStatus, arg_MonthlyIncome, arg_MonthlyRate,
           arg_NumCompaniesWorked, arg_OverTime, arg_PercentSalaryHike,
           arg_PerformanceRating, arg_RelationshipSatisfaction, arg_StockOptionLevel,
           arg_TotalWorkingYears, arg_TrainingTimesLastYear, arg_WorkLifeBalance, arg_YearsAtCompany,
           arg_YearsInCurrentRole, arg_YearsSinceLastPromotion, arg_YearsWithCurrManager) {
    
    base_url <- "http://127.0.0.1:5022/Attrition"
    
    params <- list(
      arg_Age = arg_Age,
      arg_BusinessTravel = arg_BusinessTravel,
      arg_DailyRate = arg_DailyRate,
      arg_Department = arg_Department,
      arg_DistanceFromHome = arg_DistanceFromHome,
      arg_Education = arg_Education,
      arg_EducationField = arg_EducationField,
      arg_EnvironmentSatisfaction = arg_EnvironmentSatisfaction,
      arg_Gender = arg_Gender,
      arg_HourlyRate = arg_HourlyRate,
      arg_JobInvolvement = arg_JobInvolvement,
      arg_JobLevel = arg_JobLevel,
      arg_JobRole = arg_JobRole,
      arg_JobSatisfaction = arg_JobSatisfaction,
      arg_MaritalStatus = arg_MaritalStatus,
      arg_MonthlyIncome = arg_MonthlyIncome,
      arg_MonthlyRate = arg_MonthlyRate,
      arg_NumCompaniesWorked = arg_NumCompaniesWorked,
      arg_OverTime = arg_OverTime,
      arg_PercentSalaryHike = arg_PercentSalaryHike,
      arg_PerformanceRating = arg_PerformanceRating,
      arg_RelationshipSatisfaction = arg_RelationshipSatisfaction,
      arg_StockOptionLevel = arg_StockOptionLevel,
      arg_TotalWorkingYears = arg_TotalWorkingYears,
      arg_TrainingTimesLastYear = arg_TrainingTimesLastYear,
      arg_WorkLifeBalance = arg_WorkLifeBalance,
      arg_YearsAtCompany = arg_YearsAtCompany,
      arg_YearsInCurrentRole = arg_YearsInCurrentRole,
      arg_YearsSinceLastPromotion = arg_YearsSinceLastPromotion,
      arg_YearsWithCurrManager = arg_YearsWithCurrManager
    )
    
    
    query_url <- modify_url(url = base_url, query = params)
    
    model_prediction <- GET(query_url)
    
    model_prediction_raw <- content(model_prediction, as = "text",
                                    encoding = "utf-8")
    
    jsonlite::fromJSON(model_prediction_raw)
  }

get_attrition_predictions(41,
                  "Travel_Rarely",
                  1102,
                  "Sales",
                  1,
                  2,
                  "Life Sciences",
                  2,
                  "Female",
                  94,
                  3,
                  2,
                  "Sales Executive",
                  4,
                  "Single",
                  5993,
                  19479,
                  8,
                  "Yes",
                  11,
                  3,
                  1,
                  0,
                  8,
                  0,
                  1,
                  6,
                  4,
                  0,
                  5)
## [1] "No"

Prediction Web Form and Customer-facing Dashboard

The customer-facing dashboard operates on a real-time basis: it visualizes an individual customer's data against that of other customers to show the target to be attained and to guide future decision-making. NB: since the rendered file is too large to embed here, it can be downloaded as an .html file and viewed in the browser.

cat('<iframe src="http://localhost/Good%20Architecture/Attrition%20Prediction.php" width="1000" height="800"></iframe>')