1 Introduction: Exploratory Data Analysis and Feature Engineering

In the realm of People Analytics, the significance of a data set created by IBM for attrition modeling cannot be overstated. This data set presents a valuable resource for addressing critical questions related to employee turnover and engagement. With 1470 rows and 35 columns, it offers a wealth of information that encompasses various aspects of employees’ professional and personal lives. The data set combines the typical Human Resources Information System (HRIS) data with a comprehensive engagement survey, providing a holistic view of the employees’ experiences and sentiments.

The main objective of this analysis is to understand and predict employee turnover within an organization. By examining the factors that contribute to attrition, we aim to uncover valuable insights that can guide HR strategies and improve employee retention rates. Moreover, this data set offers a unique opportunity to identify differences between the group of employees who chose to stay and those who decided to leave the organization.

2 Description of Data

The data set was created and made available by IBM, a renowned leader in the tech industry. As a reputable source, the data is expected to be reliable and well-structured, enabling meaningful analysis and inference. While the specific details of data collection methods are not provided, the comprehensive nature of the data set suggests that it was meticulously curated to capture a wide range of attributes relevant to employee turnover and engagement.

Sample Size and Feature Variables: The data set consists of 1470 rows, representing individual employee records. Each row contains 35 columns, making these columns the feature variables for analysis. The feature variables include demographic information such as age and gender, factors related to job satisfaction and environment satisfaction, education field, job role, income, overtime, percentage salary hike, tenure, training time, years in the current role, relationship status, and several other parameters that may impact attrition and engagement.

The feature variables encompass various data types, including numeric, categorical, and ordinal data, which allows for a diverse set of analyses and modeling approaches.

In conclusion, this data set offers a comprehensive collection of features that are crucial for understanding and predicting employee attrition within an organization. By delving into the relationships between different variables, we can gain valuable insights that have practical implications for HR policies and practices.

A detailed description of the variables is given below:

Education 1 ‘Below College’ 2 ‘College’ 3 ‘Bachelor’ 4 ‘Master’ 5 ‘Doctor’

EnvironmentSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

JobInvolvement 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

JobSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

PerformanceRating 1 ‘Low’ 2 ‘Good’ 3 ‘Excellent’ 4 ‘Outstanding’

RelationshipSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

WorkLifeBalance 1 ‘Bad’ 2 ‘Good’ 3 ‘Better’ 4 ‘Best’

A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv.

EmployeeAttrition = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv")

Applying binary coding.

# Convert our response variable from 'Yes' and 'No' to 1 and 0
EmployeeAttrition$Attrition_num <- ifelse(EmployeeAttrition$Attrition == "Yes", 1, 0)
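
A quick cross-tabulation (a sanity check, not part of the original workflow) confirms that the recoding worked as intended:

# Cross-tabulate the original labels against the new binary variable
table(EmployeeAttrition$Attrition, EmployeeAttrition$Attrition_num)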

The EmployeeNumber column will be removed to protect the identity of employees and their sensitive information.

Additional columns have been identified for removal:

DailyRate, HourlyRate, and MonthlyRate are undocumented and cannot be interpreted reliably, while EmployeeCount, Over18, and StandardHours take the same value for every employee and therefore carry no information.

# Using subset() function to drop specific columns
EmployeeAttrition_drop <- subset(EmployeeAttrition, select = -c(EmployeeNumber, EmployeeCount, Over18, StandardHours, DailyRate, HourlyRate, MonthlyRate))


3 EDA for Feature Engineering

We first scan the entire data set and determine the EDA tools to use for feature engineering.

summary(EmployeeAttrition_drop)
##       Age         Attrition         BusinessTravel      Department       
##  Min.   :18.00   Length:1470        Length:1470        Length:1470       
##  1st Qu.:30.00   Class :character   Class :character   Class :character  
##  Median :36.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :36.92                                                           
##  3rd Qu.:43.00                                                           
##  Max.   :60.00                                                           
##  DistanceFromHome   Education     EducationField     EnvironmentSatisfaction
##  Min.   : 1.000   Min.   :1.000   Length:1470        Min.   :1.000          
##  1st Qu.: 2.000   1st Qu.:2.000   Class :character   1st Qu.:2.000          
##  Median : 7.000   Median :3.000   Mode  :character   Median :3.000          
##  Mean   : 9.193   Mean   :2.913                      Mean   :2.722          
##  3rd Qu.:14.000   3rd Qu.:4.000                      3rd Qu.:4.000          
##  Max.   :29.000   Max.   :5.000                      Max.   :4.000          
##     Gender          JobInvolvement    JobLevel       JobRole         
##  Length:1470        Min.   :1.00   Min.   :1.000   Length:1470       
##  Class :character   1st Qu.:2.00   1st Qu.:1.000   Class :character  
##  Mode  :character   Median :3.00   Median :2.000   Mode  :character  
##                     Mean   :2.73   Mean   :2.064                     
##                     3rd Qu.:3.00   3rd Qu.:3.000                     
##                     Max.   :4.00   Max.   :5.000                     
##  JobSatisfaction MaritalStatus      MonthlyIncome   NumCompaniesWorked
##  Min.   :1.000   Length:1470        Min.   : 1009   Min.   :0.000     
##  1st Qu.:2.000   Class :character   1st Qu.: 2911   1st Qu.:1.000     
##  Median :3.000   Mode  :character   Median : 4919   Median :2.000     
##  Mean   :2.729                      Mean   : 6503   Mean   :2.693     
##  3rd Qu.:4.000                      3rd Qu.: 8379   3rd Qu.:4.000     
##  Max.   :4.000                      Max.   :19999   Max.   :9.000     
##    OverTime         PercentSalaryHike PerformanceRating
##  Length:1470        Min.   :11.00     Min.   :3.000    
##  Class :character   1st Qu.:12.00     1st Qu.:3.000    
##  Mode  :character   Median :14.00     Median :3.000    
##                     Mean   :15.21     Mean   :3.154    
##                     3rd Qu.:18.00     3rd Qu.:3.000    
##                     Max.   :25.00     Max.   :4.000    
##  RelationshipSatisfaction StockOptionLevel TotalWorkingYears
##  Min.   :1.000            Min.   :0.0000   Min.   : 0.00    
##  1st Qu.:2.000            1st Qu.:0.0000   1st Qu.: 6.00    
##  Median :3.000            Median :1.0000   Median :10.00    
##  Mean   :2.712            Mean   :0.7939   Mean   :11.28    
##  3rd Qu.:4.000            3rd Qu.:1.0000   3rd Qu.:15.00    
##  Max.   :4.000            Max.   :3.0000   Max.   :40.00    
##  TrainingTimesLastYear WorkLifeBalance YearsAtCompany   YearsInCurrentRole
##  Min.   :0.000         Min.   :1.000   Min.   : 0.000   Min.   : 0.000    
##  1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000    
##  Median :3.000         Median :3.000   Median : 5.000   Median : 3.000    
##  Mean   :2.799         Mean   :2.761   Mean   : 7.008   Mean   : 4.229    
##  3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000   3rd Qu.: 7.000    
##  Max.   :6.000         Max.   :4.000   Max.   :40.000   Max.   :18.000    
##  YearsSinceLastPromotion YearsWithCurrManager Attrition_num   
##  Min.   : 0.000          Min.   : 0.000       Min.   :0.0000  
##  1st Qu.: 0.000          1st Qu.: 2.000       1st Qu.:0.0000  
##  Median : 1.000          Median : 3.000       Median :0.0000  
##  Mean   : 2.188          Mean   : 4.123       Mean   :0.1612  
##  3rd Qu.: 3.000          3rd Qu.: 7.000       3rd Qu.:0.0000  
##  Max.   :15.000          Max.   :17.000       Max.   :1.0000

There appear to be no outliers; all of the numerical variables fall within reasonable ranges.

The average age is about 37 years, with a minimum of 18 and a maximum of 60; this range could span everyone from new interns to senior managers. Commute distances also vary widely: some employees live very close to work while others live far away. Salaries fluctuate considerably as well, with an average monthly income of about USD 6,500 and a range from roughly USD 1,000 to USD 20,000, consistent with new interns earning far less than senior managers. The average tenure at the company is about 7 years, while total working experience ranges from employees who have just started to those with 40 years of service.

# checking the unique character variables.
unique(EmployeeAttrition_drop$Gender)
## [1] "Female" "Male"
unique(EmployeeAttrition_drop$BusinessTravel)
## [1] "Travel_Rarely"     "Travel_Frequently" "Non-Travel"
unique(EmployeeAttrition_drop$Department)
## [1] "Sales"                  "Research & Development" "Human Resources"
unique(EmployeeAttrition_drop$JobRole)
## [1] "Sales Executive"           "Research Scientist"       
## [3] "Laboratory Technician"     "Manufacturing Director"   
## [5] "Healthcare Representative" "Manager"                  
## [7] "Sales Representative"      "Research Director"        
## [9] "Human Resources"
unique(EmployeeAttrition_drop$MaritalStatus)
## [1] "Single"   "Married"  "Divorced"
unique(EmployeeAttrition_drop$OverTime)
## [1] "Yes" "No"
unique(EmployeeAttrition_drop$Attrition)
## [1] "Yes" "No"
unique(EmployeeAttrition_drop$EducationField)
## [1] "Life Sciences"    "Other"            "Medical"          "Marketing"       
## [5] "Technical Degree" "Human Resources"

All of the categorical variables have consistent, well-formed category labels.

3.1 Missing values

Since the data set appears to have no missing values, it simplifies the Exploratory Data Analysis (EDA) process. With a complete data set, we can focus on exploring relationships, identifying patterns, and gaining insights more effectively.
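
A quick programmatic check (a small added sanity check, assuming EmployeeAttrition_drop is still in memory) supports this claim:

# Count missing values per column; all zeros indicate a complete data set
colSums(is.na(EmployeeAttrition_drop))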

The data cleaning process is complete.

# Identify the numeric columns and keep them for later correlation analysis
numeric_vars <- sapply(EmployeeAttrition_drop, is.numeric)
numeric_data <- EmployeeAttrition_drop[, numeric_vars]

3.2 Assess Distributions

This subsection focuses on the potential discretization of continuous variables and the grouping of sparse categories of categorical variables based on their distributions.

3.2.1 Discretizing Continuous Variable

We group Age into three subgroups covering the range 18 to 60, and DistanceFromHome into three subgroups. The grouped variables grp.age and grp.dist supplement Age and DistanceFromHome, and the binary outcome Attrition_num supplements Attrition, to simplify the graphical analysis.

# Discretize Age and DistanceFromHome into three groups each
EmployeeAttrition_drop$grp.age <- ifelse(EmployeeAttrition_drop$Age <= 30, '[18,30]',
               ifelse(EmployeeAttrition_drop$Age >= 50, '[50,60]', '(30,50)'))
EmployeeAttrition_drop$grp.dist <- ifelse(EmployeeAttrition_drop$DistanceFromHome <= 10, '[1,10]',
               ifelse(EmployeeAttrition_drop$DistanceFromHome >= 20, '[20,30]', '(10,20)'))
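
A quick tabulation of the new grouped variables (a sanity check, not part of the original workflow) shows how the observations fall into each bin:

# Frequencies of the new age and distance groups
table(EmployeeAttrition_drop$grp.age)
table(EmployeeAttrition_drop$grp.dist)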

3.3 Pairwise associations

Pairwise association refers to the examination of relationships between pairs of variables in a data set. It involves analyzing how the values of two variables co-occur or change together. There are three different types of pairwise associations.

3.3.1 Two numeric variables

The best visual tool for assessing pairwise linear association between two numeric variables is a pair-wise scatter plot.

# Load GGally, which provides ggpairs() (and loads ggplot2 for aes())
library(GGally)

# Selecting specific columns
selected_columns <- c(1, 5, 15, 16, 18, 22)

# Creating ggpairs plot with selected columns
ggpairs(EmployeeAttrition_drop,
        columns = selected_columns,
        aes(color = Attrition, alpha = 0.5))

# Selecting a second set of columns
selected_columns2 <- c(23, 25, 26, 27, 28)

# Creating ggpairs plot with selected columns
ggpairs(EmployeeAttrition_drop,
        columns = selected_columns2,
        aes(color = Attrition, alpha = 0.5))

The off-diagonal plots and numbers indicate the correlation between the pairwise numeric variables. As expected, the tenure-related variables are strongly correlated: YearsWithCurrManager with YearsAtCompany and with YearsInCurrentRole, TotalWorkingYears with YearsAtCompany, YearsInCurrentRole with YearsAtCompany, and YearsSinceLastPromotion with YearsAtCompany. The other pairs show weak correlations.

The stacked density curves on the main diagonal show the potential difference in the distribution of each numeric variable between the attrition and no-attrition groups; in other words, they show the relationship between a numeric variable and the categorical response. These curves do not completely overlap, indicating some association between each of these numeric variables and the binary response variable.

3.3.2 Two Categorical Variables

Mosaic plots are convenient for showing whether two categorical variables are dependent. In EDA, we are primarily interested in whether the response (binary in this case) is independent of the categorical variables. Categorical variables that are independent of the response variable should be excluded from any subsequent models and algorithms.

par(mfrow = c(2,2))
mosaicplot(BusinessTravel ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="BusinessTravel vs Attrition")
mosaicplot(Department ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Department vs Attrition")
mosaicplot(EducationField ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="EducationField vs Attrition")
mosaicplot(JobRole ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="JobRole vs Attrition")

The top two mosaic plots show that Attrition is not independent of BusinessTravel or Department, because the proportion of attrition cases is not identical across the individual categories: employees who travel frequently have the highest attrition rate, whereas employees who do not travel have the lowest. The bottom two mosaic plots likewise show that Attrition is not independent of EducationField or JobRole.

par(mfrow = c(2,3))
mosaicplot(OverTime ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="OverTime vs Attrition")
mosaicplot(MaritalStatus ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="MaritalStatus vs Attrition")
mosaicplot(Gender ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Gender vs Attrition")
mosaicplot(grp.age ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Agegroup vs Attrition")
mosaicplot(grp.dist ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="DistanceFromHome vs Attrition")

The top two mosaic plots show positive associations for OverTime and MaritalStatus: single employees and employees who work a lot of overtime have elevated attrition rates. The bottom plots show that Attrition is not independent of age group or distance from home, because the proportion of attrition cases differs across categories: the youngest age group has a higher attrition rate than the older groups, and attrition increases with distance from the workplace. Gender, by contrast, appears to have little influence.

# Calculate the correlation matrix
cor_matrix <- cor(numeric_data)

# You can also plot the correlation matrix for better visualization
library(corrplot)
corrplot(cor_matrix, method = "color", tl.col = "black")
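
To make the removal decision in the concluding section reproducible, the strongly correlated pairs can also be listed programmatically; the following is a small added sketch using the cor_matrix computed above:

# List variable pairs with absolute correlation above 0.75
high.cor <- which(abs(cor_matrix) > 0.75 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[high.cor[, 1]],
           var2 = colnames(cor_matrix)[high.cor[, 2]],
           r    = round(cor_matrix[high.cor], 2))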

4 Concluding Remarks

We will remove grp.age and grp.dist, along with the binary outcome Attrition_num. We will also remove YearsAtCompany and TotalWorkingYears, as they are highly correlated with other tenure variables (correlation coefficients > 0.75). The remaining feature variables show only weak-to-moderate correlations, so we keep them for further analysis.

# Using subset() function to drop specific columns
modified_data <- subset(EmployeeAttrition_drop, select = -c(grp.age, grp.dist, Attrition_num, YearsAtCompany, TotalWorkingYears))
write.csv(modified_data, file = "~/Desktop/cleanedattrition2.csv", row.names = FALSE)

5 Unsupervised Learning using Cluster Analysis and Principal Component Analysis

The goal of unsupervised learning is to find the underlying structure of the data set and group that data according to similarities. Common algorithms used in unsupervised learning include clustering, anomaly detection, neural networks, and approaches for learning latent variable models.

The following types of unsupervised machine learning algorithms are commonly used in practice.

  • K-means Clustering
  • Hierarchical Clustering
  • Anomaly Detection
  • Principal Component Analysis

K-means Clustering: K-means clustering is a popular unsupervised machine learning technique used for grouping similar data points into clusters. It assigns data points to clusters based on their similarity to the mean of each cluster. K-means is iterative and aims to minimize the variance within each cluster. It is widely used in customer segmentation, image compression, and data preprocessing.

Hierarchical Clustering: Hierarchical clustering is another technique for grouping data points into clusters. Unlike K-means, hierarchical clustering creates a tree-like structure of nested clusters, which can be visualized as a dendrogram. It does not require the user to specify the number of clusters beforehand. Hierarchical clustering is useful for understanding the relationships between data points and is commonly used in biology, social sciences, and data exploration.
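
Although this report only applies k-means below, a minimal hierarchical-clustering sketch on this data might look as follows (the sample size, chosen columns, and linkage method are illustrative assumptions, not part of the original analysis):

# Hierarchical clustering of a random sample of employees on three scaled features
set.seed(42)
idx  <- sample(nrow(EmployeeAttrition_drop), 50)
samp <- scale(EmployeeAttrition_drop[idx, c("Age", "MonthlyIncome", "YearsInCurrentRole")])
hc   <- hclust(dist(samp), method = "ward.D2")
plot(hc, labels = FALSE, main = "Dendrogram of 50 sampled employees")
head(cutree(hc, k = 3))   # cut the dendrogram into 3 clusters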

Anomaly Detection: Anomaly detection involves identifying data points that deviate significantly from the norm or expected behavior. It is widely used in fraud detection, network security, and quality control. Anomaly detection methods can be based on statistical techniques, machine learning, or a combination of both. Detecting anomalies helps in identifying rare events or outliers that might require special attention.
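
As a minimal univariate illustration on this data (an added sketch, not part of the original analysis), one could flag employees whose income is extreme relative to the rest:

# Flag employees whose standardized MonthlyIncome deviates more than 3 SDs from the mean
z <- as.vector(scale(EmployeeAttrition_drop$MonthlyIncome))
which(abs(z) > 3)   # row indices of potential income outliers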

Principal Component Analysis (PCA): PCA is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while retaining as much of the original variance as possible. It achieves this by identifying orthogonal axes (principal components) that capture the most significant variability in the data. PCA is commonly used to reduce noise, visualize high-dimensional data, and improve the efficiency of machine learning algorithms by reducing multicollinearity.

These techniques play crucial roles in exploratory data analysis, pattern recognition, and feature engineering, contributing to the development of effective machine learning models and insights from complex data sets.

5.1 Case Study: Multi-feature Clustering

The cleaned data set consists of 1470 rows, representing individual employee records, and 26 columns that serve as the feature variables for analysis. As before, the features include demographic information such as age and gender, satisfaction measures, education field, job role, income, overtime, tenure-related variables, and several other parameters that may impact attrition and engagement.

The feature variables encompass numeric, categorical, and ordinal data types, which allows for a diverse set of analyses and modeling approaches.

A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv. This data set has been pre-processed and feature engineered.

The first few records of the data set are displayed in the following table.

attrition = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv")
head(attrition)
##   Age Attrition    BusinessTravel             Department DistanceFromHome
## 1  41       Yes     Travel_Rarely                  Sales                1
## 2  49        No Travel_Frequently Research & Development                8
## 3  37       Yes     Travel_Rarely Research & Development                2
## 4  33        No Travel_Frequently Research & Development                3
## 5  27        No     Travel_Rarely Research & Development                2
## 6  32        No Travel_Frequently Research & Development                2
##   Education EducationField EnvironmentSatisfaction Gender JobInvolvement
## 1         2  Life Sciences                       2 Female              3
## 2         1  Life Sciences                       3   Male              2
## 3         2          Other                       4   Male              2
## 4         4  Life Sciences                       4 Female              3
## 5         1        Medical                       1   Male              3
## 6         2  Life Sciences                       4   Male              3
##   JobLevel               JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1        2       Sales Executive               4        Single          5993
## 2        2    Research Scientist               2       Married          5130
## 3        1 Laboratory Technician               3        Single          2090
## 4        1    Research Scientist               3       Married          2909
## 5        1 Laboratory Technician               2       Married          3468
## 6        1 Laboratory Technician               4        Single          3068
##   NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 1                  8      Yes                11                 3
## 2                  1       No                23                 4
## 3                  6      Yes                15                 3
## 4                  1      Yes                11                 3
## 5                  9       No                12                 3
## 6                  0       No                13                 3
##   RelationshipSatisfaction StockOptionLevel TrainingTimesLastYear
## 1                        1                0                     0
## 2                        4                1                     3
## 3                        2                0                     3
## 4                        3                0                     3
## 5                        4                1                     3
## 6                        3                0                     2
##   WorkLifeBalance YearsInCurrentRole YearsSinceLastPromotion
## 1               1                  4                       0
## 2               3                  7                       1
## 3               3                  0                       0
## 4               3                  7                       3
## 5               3                  2                       2
## 6               2                  7                       3
##   YearsWithCurrManager
## 1                    5
## 2                    7
## 3                    0
## 4                    0
## 5                    2
## 6                    6

We use the k-means method to perform cluster analysis on the attrition data with the nine numerical feature variables that remain after excluding the categorical and ordinal columns below.

myClusteredattrition = attrition
columns_to_exclude <- c(2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 17, 19, 20, 21, 22)

# Exclude specified columns before performing k-means clustering
data_for_clustering <- myClusteredattrition[, -columns_to_exclude]
# We start with 3 clusters since we know there are 3 departments in the data.
# use the elbow plot to determine the best number of clusters.
km.attrition <- kmeans( x = data_for_clustering , centers = 3)  
clust.ID <- km.attrition$cluster        # extracting cluster IDs

table(clust.ID)                    # frequency of clusters
## clust.ID
##   1   2   3 
## 189 365 916
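
The code comment above mentions an elbow plot; a sketch of how one could be produced for this data is given below (the range of k and the use of nstart are illustrative choices):

# Elbow plot: total within-cluster sum of squares for k = 1, ..., 10
set.seed(42)
wss <- sapply(1:10, function(k)
  kmeans(data_for_clustering, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS",
     main = "Elbow Plot for k-means on Attrition Features")

The "elbow", the value of k after which the curve flattens, suggests a reasonable number of clusters.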

Since this clustering task involves nine numerical feature variables, we cannot create a 2D plot that shows the clustering performance in the original feature space. However, we can use PCA to create two new feature variables and then plot the clusters in that plane to show the performance of the clustering algorithm.

# clusplot() is provided by the cluster package
library(cluster)

clusplot(data_for_clustering,
         clust.ID,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 1,
         plotchar = FALSE,
         span = TRUE,
         main = "Clusters of Attrition Features")

5.2 Memory Usage of Clustering

One of the potential issues in clustering analysis is memory use. If the data set is too large, algorithms that require a full pairwise distance matrix, such as hierarchical clustering, can exhaust the available memory, since the matrix grows quadratically with the number of observations. K-means scales better in this respect because it only stores the cluster centers and the cluster assignments.

6 Dimensionality Reduction Algorithms

Like clustering methods, dimension reduction seeks and explores the inherent structure in the data, but it does so in an unsupervised manner, summarizing or describing the data using less information.

This can be useful to visualize high-dimensional data or to simplify data which can then be used in a supervised learning method. Many of these methods can be adapted for use in classification and regression. The following are the frequently used algorithms.

  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)
  • Quadratic Discriminant Analysis (QDA)
  • Independent Component Analysis (ICA)

In this section, we introduce the most commonly used PCA.

6.1 Case Study - PCA test on Attrition data

We used this attrition data set in the clustering algorithms above. The cleaned data set contains nine numerical variables, several of which are correlated tenure measures (YearsWithCurrManager, YearsInCurrentRole, and YearsSinceLastPromotion). We use PCA to see whether the number of feature variables can be reduced for related modeling.

6.1.1 Fitting PCA model to Attrition data

We want the PCA method to reduce the dimension from nine numerical variables to a smaller number. We apply the R function prcomp() to obtain the factor loadings associated with the nine numerical variables.

# Add a small constant to the data before taking the logarithm
epsilon <- 1e-10  # Small constant to avoid zero or negative values
log.attrition = log(data_for_clustering + epsilon)

# Perform PCA on the transformed data (center and standardize each variable)
ar.pca <- prcomp(log.attrition, center = TRUE, scale. = TRUE)

Next, we find the factor loading of the above fitted PCA. We can write an explicit system of linear transformation by using the loadings.

kable(round(ar.pca$rotation, 2), caption="Factor loadings of the PCA")
Factor loadings of the PCA

                          PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8    PC9
Age                     -0.29   0.61  -0.03   0.12  -0.10   0.21  -0.01  -0.69  -0.01
DistanceFromHome         0.01  -0.01  -0.70  -0.42   0.48   0.32   0.01  -0.03  -0.02
MonthlyIncome           -0.37   0.51   0.02   0.02  -0.09   0.29  -0.07   0.71   0.01
NumCompaniesWorked       0.07   0.47  -0.07  -0.14   0.33  -0.80   0.03   0.08  -0.01
PercentSalaryHike        0.06   0.02  -0.55  -0.14  -0.78  -0.22   0.05   0.04  -0.02
WorkLifeBalance          0.00   0.05   0.44  -0.87  -0.17   0.06  -0.06  -0.08  -0.01
YearsInCurrentRole      -0.54  -0.24  -0.07  -0.05   0.02  -0.19  -0.31  -0.06   0.71
YearsSinceLastPromotion -0.44  -0.15   0.03  -0.07   0.03  -0.08   0.88   0.00  -0.02
YearsWithCurrManager    -0.54  -0.25  -0.04  -0.01   0.03  -0.19  -0.34  -0.05  -0.70
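
Using the first column of the (rounded) loadings above, the first principal component can be written out explicitly as PC1 = -0.29·Age + 0.01·DistanceFromHome - 0.37·MonthlyIncome + 0.07·NumCompaniesWorked + 0.06·PercentSalaryHike + 0.00·WorkLifeBalance - 0.54·YearsInCurrentRole - 0.44·YearsSinceLastPromotion - 0.54·YearsWithCurrManager, where each variable denotes its log-transformed, centered, and scaled version. This can be verified numerically (a small added check, not part of the original analysis):

# PC scores equal the standardized (log-transformed) data times the loadings
Z <- scale(log.attrition)          # center and scale each column
head(Z %*% ar.pca$rotation[, 1])   # manually computed PC1 scores
head(ar.pca$x[, 1])                # PC1 scores returned by prcomp()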
  • Factor loadings indicate how much each original variable contributes to each principal component. The absolute value of a factor loading represents the strength of the relationship between the original variable and the principal component. The sign of the factor loading (+/-) indicates the direction of the relationship (positive/negative) between the variable and the component.

For example, let’s consider the first row (Age):

The factor loading of -0.29 for PC1 suggests a moderately negative relationship between Age and PC1. A higher Age is associated with lower values of PC1. The factor loading of 0.61 for PC2 suggests a moderately positive relationship between Age and PC2. A higher Age is associated with higher values of PC2. The factor loadings close to 0 for other principal components indicate that Age has relatively weaker relationships with those components.

Similarly, let’s consider the second row (DistanceFromHome):

The factor loading of 0.01 for PC1 indicates a very weak positive relationship between DistanceFromHome and PC1. The factor loading of -0.01 for PC2 indicates a very weak negative relationship between DistanceFromHome and PC2. The factor loading of -0.70 for PC3 indicates a relatively strong negative relationship between DistanceFromHome and PC3.

Interpreting these factor loadings for all variables across principal components allows us to understand which original variables contribute most to each principal component and how they contribute to the overall variance captured by the PCs.

6.1.2 Optimal number of PCs to be retained

The objective of PCA is to reduce the dimension without losing a significant amount of information. In PCA, we look at how much of the total variation is captured by each principal component. Most libraries that perform PCA automatically rank the principal components by the variation each one captures.

The following summary table gives the importance of the principal components.

kable(summary(ar.pca)$importance, caption="The importance of each principal component")
The importance of each principal component

                             PC1       PC2       PC3        PC4        PC5        PC6        PC7        PC8        PC9
Standard deviation      1.505511  1.180874  1.018676  0.9981226  0.9937252  0.9272111  0.8250489  0.6922288  0.5458277
Proportion of Variance  0.251840  0.154940  0.115300  0.1106900  0.1097200  0.0955200  0.0756300  0.0532400  0.0331000
Cumulative Proportion   0.251840  0.406780  0.522080  0.6327800  0.7425000  0.8380200  0.9136500  0.9669000  1.0000000

The output is from the Principal Component Analysis (PCA) performed on the data. PCA is a technique used for dimensionality reduction, which transforms the original feature variables into a new set of uncorrelated variables called principal components (PCs). These PCs capture most of the variance in the original data, allowing us to represent the data with fewer variables while retaining as much information as possible.

Let’s break down the output and understand its components:

Standard Deviation: This row provides the standard deviation of each principal component. It indicates the spread of data along each principal component axis. Larger standard deviations suggest that the corresponding principal components capture more variation in the data.

Proportion of Variance: This row gives the proportion of total variance in the data explained by each principal component. It quantifies the amount of information retained by each PC. Larger proportions indicate that the corresponding PC contributes more to explaining the overall variance in the data.

Cumulative Proportion: This row shows the cumulative proportion of variance explained up to each principal component. It helps understand how much of the total variance is captured when including a certain number of PCs. This can guide in deciding how many principal components to retain.

Interpreting the Output:

PC1: The first principal component (PC1) has a relatively high standard deviation of 1.51, which means it captures a significant amount of variability in the data. It explains about 25.2% of the total variance. The cumulative proportion up to PC1 is 25.2%.

PC2: The second principal component (PC2) has a standard deviation of 1.18 and explains an additional 15.5% of the variance. The cumulative proportion up to PC2 is 40.7%.

PC3: PC3 captures 11.5% of the variance, with a standard deviation of 1.02. The cumulative proportion up to PC3 is 52.2%.

PC4 to PC9: These components contribute progressively smaller proportions of variance, with decreasing standard deviations. Collectively, they explain the remaining variance, resulting in a cumulative proportion of 100%.

Importance of Each Principal Component:

PC1 is the most important component, as it has the largest standard deviation and explains the highest proportion of variance. It represents the dominant pattern of variability in the data. Subsequent principal components capture decreasing amounts of variance and contribute less to explaining the overall data variability.

Deciding How Many Principal Components to Retain:

A common approach is to choose the number of principal components that collectively explain a large portion (e.g., 95% or 99%) of the total variance. In this case, we could decide to retain the first six principal components, as together they explain about 84% of the total variance. The choice of how many principal components to retain depends on specific goals and the trade-off between reducing dimensionality and retaining information. One might also consider other factors such as interpretability and the requirements of downstream analysis or modeling.

We can also make a scree plot as a visual tool to show the number of principal components to retain for future analysis.

screeplot(ar.pca, 
          type = "lines",
          main = "Scree Plot of PCA Employee Attrition Attributes")
Figure 18. Scree plot of PCA on the Employee Attrition data

Note that the vertical axis in the above scree plot uses the variances of PCs. The standard deviation was used in the above summary table.

6.1.3 Extracting PC Scores

The predicted principal component scores are the values of the newly transformed variables. We can choose the first few principal components for use in relevant modeling.

The command ar.pca$x extracts the PC scores from the PCA procedure. These scores are the values of the new transformed variables and can be used as response or predictor variables in statistical models. The following table shows the first 15 PC scores.

kable(ar.pca$x[1:15,], caption = "The first 15 PC scores transformed from the original variables. In the analysis, one would typically use the first PC or the first two PCs.")
The first 15 PC scores transformed from the original variables. In the analysis, one would typically use the first PC or the first two PCs.
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
-0.2503729 0.4891398 0.3747896 3.6543569 0.8579011 -0.7647964 -1.2736396 0.0088989 0.0902827
-1.0858270 0.5882272 -1.2022389 -0.7339714 -1.4161095 -0.5372264 0.3735791 -0.8711734 -0.0914351
3.3975430 0.9163099 0.9634498 0.1905301 -0.4626099 -0.1541245 0.4121812 -0.7822869 -0.0673956
0.8676736 -0.2101990 1.3669685 -0.0560005 0.8397869 -0.1966703 1.2784114 -0.3702883 1.8670205
-0.2267333 -1.0262492 1.3421464 -0.0967361 0.5739030 -1.1482797 0.4382696 0.3115776 0.0014997
-0.7096627 -2.2835630 0.7613344 1.4737681 -0.5304995 1.1950815 0.4098705 -0.4360873 0.0541166
2.8024440 2.1778603 -0.6098430 1.2164423 -1.2881520 0.1386260 0.5137954 -1.6709849 -0.0963894
3.5978736 0.5007554 -1.5597572 -1.0920457 -0.6760869 0.3328040 0.4814611 0.0482851 -0.1323815
-1.3752266 -0.9107851 -1.4354599 -0.8094795 -1.5216529 2.1815365 0.2268772 0.2213691 -0.0630726
-0.9478920 -0.1948945 -1.1670399 0.0991997 1.5156007 -0.1222437 0.5061640 -0.0132037 -0.0407171
0.2982608 -1.8579992 -0.1208496 -0.3095551 0.1211938 2.0607686 -1.4063438 -1.0786582 0.0395563
0.1159991 -1.9395323 0.1485927 -0.3062092 0.3747998 2.1806035 -1.5169336 0.0061617 0.0068637
-0.2463278 -1.0275712 -1.7665721 -0.1019653 0.6381276 -0.5562854 0.6517848 -0.1887846 -0.0949720
-0.5567213 -2.1362485 0.2432414 -0.4089012 0.8240233 2.1622582 0.3216970 -0.9266780 -0.0116863
0.9405464 -1.1005522 -0.7544006 -1.0315756 1.1448150 -0.5466799 -1.2598201 -0.4084133 -0.0554879

7 Concluding Remarks

Finally, we chose not to include a new variable derived from Principal Component Analysis (PCA) in the analysis due to the challenge of collapsing PC1 into a specific single feature variable. While PCA effectively reduces dimensionality by creating orthogonal components that capture the most variance in the data, translating these components back into meaningful original variables can be complex.

PC1 is a linear combination of multiple original variables, making it difficult to attribute its influence to a single feature. As a result, introducing a new variable based on PC1 could lead to difficulties in interpretation and may not provide clear insights into the underlying relationships present in the data. Retaining the original variables preserves the interpretability of the features and their direct associations with the outcome or patterns, allowing for a more straightforward and actionable analysis.

8 Introduction for Logistic Regression

Employee attrition is a critical challenge faced by many organizations, especially in the fast-paced tech industry. High turnover rates can impact productivity, morale, and overall company performance. In this analysis, we will use logistic regression to predict employee attrition in a tech company based on various employee characteristics, job-related factors and also to assess the association between the binary response variable and other predictor variables.

8.1 Research Questions

  1. Can we develop a logistic regression model that accurately predicts employee attrition?
  2. Which factors have a significant impact on the likelihood of employee attrition?
  3. How well does the model perform in terms of classification accuracy and predictive power?

9 Multiple Logistic Regression Model

In this study, we use the Employee Attrition data set prepared earlier. The practical question for this predictive modeling assignment is to determine the factors that impact employee attrition. We want to understand which variables are significant predictors of attrition and build a logistic regression model to predict whether an employee is likely to leave the company (attrition = 1) or not (attrition = 0). A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv. This data set has been pre-processed and feature engineered.

The response variable is Attrition: the status of whether an employee is likely to leave the company (attrition = 1) or not (attrition = 0).

There are 26 variables (columns) and below are the variables contained in the data set:

  • Age: Employee age
  • Attrition: if the employee leaves the job
  • BusinessTravel: The frequency of job travels
  • Department: Employee work department
  • DistanceFromHome: Distance traveled to work from home
  • Education: Employee education level (1 = Below College, 2 = College, 3 = Bachelor, 4 = Master, 5 = Doctor)
  • EducationField: Employee education field
  • EnvironmentSatisfaction: Numerical value for environment satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
  • Gender: Employee gender
  • JobInvolvement: Numerical value for job involvement (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
  • JobLevel: Numerical value for job level
  • JobRole: Employee job position
  • JobSatisfaction: Numerical value for job satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
  • MaritalStatus: Employee marital status
  • MonthlyIncome: The amount of money that employee earns in one month, before taxes or deductions
  • NumCompaniesWorked: Number of companies worked at
  • PercentSalaryHike: Percent increase in salary
  • PerformanceRating: Numerical value for performance rating (1 = Low, 2 = Good, 3 = Excellent, 4 = Outstanding)
  • RelationshipSatisfaction: Numerical value for relationship satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
  • StockOptionLevel: Numerical value for stock option level
  • TrainingTimesLastYear: Number of times the employee attended training last year
  • WorkLifeBalance: Numerical value for work life balance (1 = Bad, 2 = Good, 3 = Better, 4 = Best)
  • YearsInCurrentRole: Number of years the employee has worked in their current role
  • YearsSinceLastPromotion: Number of years since the employee's last promotion
  • YearsWithCurrManager: Number of years the employee has worked with their current manager

We next read the data from the given URL directly into R. Since there are no records with missing values, we do not need to drop any records.

EmployeeAttritionLog = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv")
employz = na.omit(EmployeeAttritionLog)

Applying binary coding.

# Convert our response variable from 'Yes' and 'No' to 1 and 0
employz$Attrition <- ifelse(employz$Attrition == "Yes", 1, 0)

  • Build an Initial Model

We first build a logistic regression model that contains all predictor variables in the data set. This model is usually called the full model. Note that the response variable is the attrition status (1 = yes, 0 = no).

initial.model = glm(Attrition ~ BusinessTravel + Department + Education + EducationField + EnvironmentSatisfaction + Gender  + JobInvolvement + JobLevel + JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager + Age + DistanceFromHome, family = binomial, data = employz)
coefficient.table = summary(initial.model)$coef
kable(coefficient.table, caption = "Significance tests of logistic regression model")
Significance tests of logistic regression model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.1487504 395.7492662 -0.0256444 0.9795410
BusinessTravelTravel_Frequently 1.8670294 0.4071329 4.5857985 0.0000045
BusinessTravelTravel_Rarely 0.9693814 0.3754112 2.5821858 0.0098177
DepartmentResearch & Development 12.6266608 395.7469457 0.0319059 0.9745471
DepartmentSales 12.4162910 395.7471849 0.0313743 0.9749710
Education 0.0019795 0.0872999 0.0226745 0.9819099
EducationFieldLife Sciences -0.7606727 0.7984502 -0.9526864 0.3407490
EducationFieldMarketing -0.3297477 0.8468630 -0.3893755 0.6969984
EducationFieldMedical -0.8515225 0.7981229 -1.0669064 0.2860141
EducationFieldOther -0.8770665 0.8562571 -1.0243028 0.3056923
EducationFieldTechnical Degree 0.1079078 0.8159464 0.1322486 0.8947877
EnvironmentSatisfaction -0.4199648 0.0821263 -5.1136430 0.0000003
GenderMale 0.3950340 0.1830513 2.1580505 0.0309239
JobInvolvement -0.5328265 0.1205934 -4.4183737 0.0000099
JobLevel -0.1389967 0.2977130 -0.4668814 0.6405847
JobRoleHuman Resources 14.0089374 395.7471564 0.0353987 0.9717618
JobRoleLaboratory Technician 1.5317617 0.4816347 3.1803394 0.0014710
JobRoleManager 0.5978435 0.8727760 0.6849907 0.4933498
JobRoleManufacturing Director 0.2597355 0.5314577 0.4887228 0.6250380
JobRoleResearch Director -0.8767143 0.9659989 -0.9075728 0.3641040
JobRoleResearch Scientist 0.6231180 0.4896045 1.2726965 0.2031258
JobRoleSales Executive 1.3223467 1.1132697 1.1878044 0.2349105
JobRoleSales Representative 2.2290891 1.1700504 1.9051223 0.0567642
JobSatisfaction -0.4031915 0.0803294 -5.0192247 0.0000005
MaritalStatusMarried 0.3027285 0.2642453 1.1456342 0.2519466
MaritalStatusSingle 1.1604409 0.3416766 3.3963130 0.0006830
MonthlyIncome -0.0000032 0.0000798 -0.0400431 0.9680587
NumCompaniesWorked 0.1630538 0.0370723 4.3982643 0.0000109
OverTimeYes 1.9343061 0.1909657 10.1290744 0.0000000
PercentSalaryHike -0.0210031 0.0387550 -0.5419455 0.5878561
PerformanceRating 0.0679746 0.3946248 0.1722513 0.8632400
RelationshipSatisfaction -0.2456994 0.0819679 -2.9975091 0.0027220
StockOptionLevel -0.1975184 0.1557253 -1.2683770 0.2046633
TrainingTimesLastYear -0.1862392 0.0725335 -2.5676316 0.0102396
WorkLifeBalance -0.3547433 0.1227487 -2.8899968 0.0038525
YearsInCurrentRole -0.1186218 0.0419441 -2.8280935 0.0046826
YearsSinceLastPromotion 0.2036475 0.0401511 5.0720267 0.0000004
YearsWithCurrManager -0.0953257 0.0418145 -2.2797268 0.0226239
Age -0.0422346 0.0122225 -3.4554899 0.0005493
DistanceFromHome 0.0446842 0.0106711 4.1873907 0.0000282

The above significance test table shows that some feature variables have p-values greater than 0.05. We next search for a better model by dropping some of the insignificant predictor variables. Since there are many different ways to drop variables, we use an automatic variable selection procedure to search for the final model.

  • Automatic Variable Selection

R has an automatic variable selection function step() for searching the final model. We will start from the initial model and drop insignificant variables using AIC as an inclusion/exclusion criterion.

In practice, there may be predictor variables that are practically important, and practitioners want to include them in the model regardless of their statistical significance. We can therefore fit the smallest model that includes only those practically important variables. The final model should lie between this smallest model, which we call the reduced model, and the initial model, which we call the full model. For illustration, we assume that YearsSinceLastPromotion, RelationshipSatisfaction, WorkLifeBalance, and NumCompaniesWorked are practically important, and we include these four variables in the final model regardless of their statistical significance.

In summary, we define two models: the full model and the reduced model. The final best model will lie between the full and reduced models. The summary table of significance tests is given below.

full.model = initial.model  # the *biggest model* that includes all predictor variables
reduced.model = glm(Attrition ~ YearsSinceLastPromotion + RelationshipSatisfaction + WorkLifeBalance + NumCompaniesWorked , family = binomial, data = employz)
final.model =  step(full.model, 
                    scope=list(lower=formula(reduced.model),upper=formula(full.model)),
                    data = employz, 
                    direction = "backward",
                    trace = 0)   # trace = 0: suppress the detailed selection process
final.model.coef = summary(final.model)$coef
kable(final.model.coef , caption = "Summary table of significance tests")
Summary table of significance tests
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.7644674 1.1479245 1.5370936 0.1242704
BusinessTravelTravel_Frequently 1.8817648 0.4041877 4.6556709 0.0000032
BusinessTravelTravel_Rarely 0.9782062 0.3731283 2.6216350 0.0087509
EducationFieldLife Sciences -0.5979343 0.7398454 -0.8081882 0.4189823
EducationFieldMarketing -0.1826602 0.7887458 -0.2315831 0.8168618
EducationFieldMedical -0.7024340 0.7397245 -0.9495886 0.3423213
EducationFieldOther -0.6971558 0.8066688 -0.8642405 0.3874558
EducationFieldTechnical Degree 0.2724424 0.7627622 0.3571787 0.7209580
EnvironmentSatisfaction -0.4207754 0.0818043 -5.1436854 0.0000003
GenderMale 0.3792481 0.1823123 2.0802117 0.0375061
JobInvolvement -0.5327656 0.1197715 -4.4481841 0.0000087
JobRoleHuman Resources 1.6361497 0.6420829 2.5481906 0.0108283
JobRoleLaboratory Technician 1.6782868 0.4279142 3.9220171 0.0000878
JobRoleManager 0.1509916 0.6348814 0.2378265 0.8120157
JobRoleManufacturing Director 0.2443439 0.5288542 0.4620251 0.6440633
JobRoleResearch Director -1.0222376 0.8690644 -1.1762507 0.2394947
JobRoleResearch Scientist 0.7856609 0.4318438 1.8193172 0.0688631
JobRoleSales Executive 1.1258424 0.4413488 2.5509130 0.0107441
JobRoleSales Representative 2.1559119 0.4959605 4.3469424 0.0000138
JobSatisfaction -0.4088564 0.0798811 -5.1183104 0.0000003
MaritalStatusMarried 0.3945195 0.2543611 1.5510215 0.1208965
MaritalStatusSingle 1.4321239 0.2609096 5.4889662 0.0000000
NumCompaniesWorked 0.1627058 0.0368348 4.4171807 0.0000100
OverTimeYes 1.9293093 0.1901496 10.1462735 0.0000000
RelationshipSatisfaction -0.2308355 0.0810321 -2.8486935 0.0043899
TrainingTimesLastYear -0.1844669 0.0724362 -2.5466125 0.0108774
WorkLifeBalance -0.3637707 0.1227142 -2.9643736 0.0030330
YearsInCurrentRole -0.1192218 0.0415277 -2.8708974 0.0040931
YearsSinceLastPromotion 0.2032144 0.0391851 5.1860088 0.0000002
YearsWithCurrManager -0.1001219 0.0413984 -2.4184959 0.0155848
Age -0.0459829 0.0116388 -3.9508431 0.0000779
DistanceFromHome 0.0431754 0.0105214 4.1035687 0.0000407
  • Interpretation - Association Analysis

The summary table contains the four practically important variables YearsSinceLastPromotion, RelationshipSatisfaction, WorkLifeBalance, and NumCompaniesWorked. YearsSinceLastPromotion achieves high statistical significance (p-value ≈ 0), RelationshipSatisfaction achieves high statistical significance (p-value ≈ 0.0044), WorkLifeBalance achieves high statistical significance (p-value ≈ 0.003), and NumCompaniesWorked also achieves high significance (p-value ≈ 0.00001). YearsSinceLastPromotion and NumCompaniesWorked are positively associated with the response variable, whereas RelationshipSatisfaction and WorkLifeBalance are negatively associated with it.

Here’s a brief interpretation of the significant tests:

BusinessTravel: Travel_Frequently has a positive coefficient (1.88) and is significant (p < 0.001), indicating that, relative to non-traveling employees, those who travel frequently are more likely to leave.

Travel_Rarely also has a positive coefficient (0.98) and is significant (p = 0.009), suggesting that even employees who travel rarely are more likely to leave than those who do not travel.

EducationField: None of the specific education fields are significant predictors of attrition but most show negative association.

EnvironmentSatisfaction: Negative coefficient (-0.42) and highly significant (p < 0.001) suggest that lower environment satisfaction is associated with higher attrition.

Gender: Male employees have a positive coefficient (0.38) and are marginally significant (p = 0.038), indicating that male employees may be slightly more likely to have attrition.

JobInvolvement: Negative coefficient (-0.53) and highly significant (p < 0.001) suggest that lower job involvement is associated with higher attrition.

JobRole: Several job roles have significant associations with attrition. For example, Laboratory Technicians, Sales Representatives, and Human Resources have positive coefficients and are significant predictors of attrition.

JobSatisfaction: Negative coefficient (-0.41) and highly significant (p < 0.001) suggest that lower job satisfaction is associated with higher attrition.

MaritalStatus: Single employees have a positive coefficient (1.43) and are highly significant (p < 0.001), indicating that single employees are more likely to have attrition.

NumCompaniesWorked: Positive coefficient (0.16) and highly significant (p < 0.001) suggest that having worked at more companies is associated with higher attrition.

OverTime: Employees who work overtime (OverTimeYes) have a positive coefficient (1.93) and are highly significant (p < 0.001), indicating that they are more likely to have attrition.

TrainingTimesLastYear: Negative coefficient (-0.18) and significant (p = 0.011) indicate that fewer training times last year are associated with higher attrition.

YearsInCurrentRole: Negative coefficient (-0.12) and significant (p = 0.004) suggest that fewer years in the current role are associated with higher attrition.

YearsSinceLastPromotion: Positive coefficient (0.20) and highly significant (p < 0.001) suggest that more years since the last promotion are associated with higher attrition.

YearsWithCurrManager: Negative coefficient (-0.10) and significant (p = 0.016) suggest that fewer years with the current manager are associated with higher attrition.

Age: Negative coefficient (-0.05) and highly significant (p < 0.001) suggest that older age is associated with lower attrition.

DistanceFromHome: Positive coefficient (0.04) and significant (p < 0.001) suggest that greater distance from home is associated with higher attrition.

These results provide insights into how different feature variables are associated with employee attrition. Variables such as BusinessTravel, EnvironmentSatisfaction, JobInvolvement, JobRole, JobSatisfaction, MaritalStatus, NumCompaniesWorked, OverTime, RelationshipSatisfaction, TrainingTimesLastYear, WorkLifeBalance, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager, Age, and DistanceFromHome appear to have significant associations with attrition.
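
Because logistic regression coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are often easier to communicate. The following one-liner (an added convenience, not part of the original analysis) illustrates this:

# Convert the log-odds coefficients of the final model to odds ratios
round(exp(coef(final.model)), 3)

For example, exp(1.93) ≈ 6.9 for OverTimeYes means that, holding the other predictors fixed, employees who work overtime have roughly 6.9 times the odds of attrition of those who do not.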

  • Predictive Analysis

As an illustration, we use the final model to predict the attrition status of two hypothetical employees based on new values of the predictor variables. See the feature values given in the code chunk.

mynewdata = data.frame(BusinessTravel=c('Travel_Rarely', 'Travel_Frequently'),
                       EnvironmentSatisfaction = c(3, 2),
                       Gender = c('Male','Female'),
                       JobInvolvement = c(1, 3),
                       JobRole = c('Healthcare Representative', 'Manager'),
                       JobSatisfaction = c(3,4),
                       MaritalStatus = c('Single', 'Married'),
                       NumCompaniesWorked = c(3, 4),
                       OverTime = c('Yes', 'No'),
                       RelationshipSatisfaction = c(1, 3),
                       StockOptionLevel = c(0, 1),
                       TrainingTimesLastYear = c(2, 1),
                       WorkLifeBalance = c(2, 2),
                       YearsInCurrentRole = c(3, 8),
                       YearsSinceLastPromotion = c(1, 3),
                       YearsWithCurrManager = c(2, 2),
                       EducationField = c('Human Resources', 'Technical Degree'),
                       Age = c(25, 35),
                       DistanceFromHome = c(14, 5))
pred.success.prob = predict(final.model, newdata = mynewdata, type="response")
##
## threshold probability
cut.off.prob = 0.5
pred.response = ifelse(pred.success.prob > cut.off.prob, 1, 0)  # This predicts the response
## Add the new predicted response to Mynewdata
mynewdata$Pred.Response = pred.response
##
kable(mynewdata, caption = "Predicted Value of response variable 
      with the given cut-off probability")
Predicted value of the response variable with the given cut-off probability (one column per new record)

                          Record 1                    Record 2
BusinessTravel            Travel_Rarely               Travel_Frequently
EnvironmentSatisfaction   3                           2
Gender                    Male                        Female
JobInvolvement            1                           3
JobRole                   Healthcare Representative   Manager
JobSatisfaction           3                           4
MaritalStatus             Single                      Married
NumCompaniesWorked        3                           4
OverTime                  Yes                         No
RelationshipSatisfaction  1                           3
StockOptionLevel          0                           1
TrainingTimesLastYear     2                           1
WorkLifeBalance           2                           2
YearsInCurrentRole        3                           8
YearsSinceLastPromotion   1                           3
YearsWithCurrManager      2                           2
EducationField            Human Resources             Technical Degree
Age                       25                          35
DistanceFromHome          14                          5
Pred.Response             1                           0

The predicted attrition status of the two employees is attached to the two new data records.

The “Predicted Response” column indicates the predicted outcome of employee attrition based on the input values for the predictor variables and the logistic regression model. A value of 1 indicates that the model predicts attrition (Yes), and a value of 0 indicates that the model predicts no attrition (No).

These predictions are based on the relationships and coefficients learned by the logistic regression model from the training data. It’s important to note that these predictions are based on the specific characteristics we have provided for each individual, and they may change if the input values are altered.
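
It can also be informative to inspect the predicted probabilities behind the 0/1 labels, since they show how close each case sits to the 0.5 cut-off (a small added check):

# Predicted attrition probabilities for the two new records
round(pred.success.prob, 3)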

10 Introduction to Cross Validation method

In the realm of predictive modeling, logistic regression stands as a foundational technique for tackling binary classification problems. It allows us to analyze the relationship between a set of predictor variables and a binary outcome, making it particularly well-suited for scenarios like employee attrition prediction. However, the journey doesn’t end with model construction; we need robust mechanisms to validate our model’s performance and ensure its generalizability.

This exploration delves into the crucial steps of using logistic regression, employing training, validating and test datasets, harnessing the power of cross-validation, and evaluating model performance through the precision, recall, F1-score, and Receiver Operating Characteristic (ROC) curve. This multi-faceted approach enables us to not only build a predictive model but also rigorously assess its accuracy, sensitivity, and overall effectiveness.

Training Dataset: The training dataset serves as the foundation for teaching the logistic regression model. This dataset contains labeled examples where the model learns the underlying patterns and relationships between predictor variables and the binary response variable. During training, the logistic regression algorithm adjusts its parameters, specifically the coefficients, to minimize the error between predicted probabilities and actual outcomes.
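Concretely, the "error" being minimized during training is the negative of the binomial log-likelihood (equivalently, the cross-entropy loss), which glm() maximizes over the coefficient vector \(\beta\):

\[ \ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right], \qquad p_i = \frac{1}{1 + e^{-x_i^\top \beta}}. \]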

Validation Dataset: Incorporating a validation dataset is crucial to assess the model’s performance and prevent overfitting. This separate dataset is not used during training but rather during the model development process. By evaluating the model on the validation dataset, one can fine-tune hyperparameters and make decisions about the model’s complexity. Validation aids in selecting the best version of the model that generalizes well to unseen data.

Test Dataset: The test dataset is entirely independent of both the training and validation datasets. It serves as a final benchmark to evaluate the model’s performance on new, unseen data. The test dataset provides an unbiased estimate of how well the trained model will perform when deployed in a real-world scenario.
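For concreteness, a minimal sketch of such a three-way split is shown below; the 60/20/20 proportions are illustrative assumptions only (the analysis below uses a 70/30 split, with cross-validation playing the role of the validation set).

n.all = nrow(EmployeeAttrition)
shuffled = sample(1:n.all)                                 # random permutation of row indices
train.set = EmployeeAttrition[shuffled[1:round(0.6*n.all)], ]
valid.set = EmployeeAttrition[shuffled[(round(0.6*n.all)+1):round(0.8*n.all)], ]
test.set  = EmployeeAttrition[shuffled[(round(0.8*n.all)+1):n.all], ]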

11 Data Splitting Methods

First, we split our dataset into two distinct subsets: one used for training and validating models, and a hold-out test dataset.

Since the sample size is large, we split the sample 70%:30%, with 70% of the data for training and validating models and 30% for testing. The labels (the attrition status) of the validation and testing data are set aside when calculating the accuracy measures.

attrition.data = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv")

# recode attrition variable: Yes = 1 and No = 0
yes.id = which(attrition.data$Attrition == "Yes") 
no.id = which(attrition.data$Attrition == "No")

# Add a new binary variable for attrition status
attrition.data$attrition.status = 0
attrition.data$attrition.status[yes.id] = 1

# Calculate the number of rows
nn = dim(attrition.data)[1]

# Generate indices for the training dataset
train.id = sample(1:nn, round(nn*0.7), replace = FALSE) 

# Create training and testing datasets
training = attrition.data[train.id,]
testing = attrition.data[-train.id,]

11.1 Finding Optimal Cut-off Probability

However, a single split of the dataset may not provide a comprehensive assessment of our model’s robustness. This is where cross-validation steps in. By employing techniques like k-fold cross-validation, we repeatedly divide the dataset into subsets, training the model on a combination of these subsets and validating it on the remaining data. This iterative process yields a more reliable estimate of the model’s performance and guards against potential overfitting.

Cross-Validation: Cross-validation is a methodology that enhances the validation process by mitigating potential biases in model evaluation. K-fold cross-validation, a popular technique, involves partitioning the dataset into K subsets. The model is trained K times, each time using K-1 subsets for training and one subset for validation. This process ensures a comprehensive assessment of the model’s performance and reduces the risk of over-optimistic evaluations.

We define a sequence of 20 candidate cut-off probabilities and then use 5-fold cross-validation to identify the optimal cut-off probability for the final detection model.

n0 = dim(training)[1]/5
cut.off.prob = seq(0, 1, length = 22)[-c(1, 22)]      # 20 candidate cut-off probabilities
pred.accuracy = matrix(0, ncol = 20, nrow = 5)        # matrix for storing prediction accuracy
## 5-fold CV
for (i in 1:5){
  valid.id = ((i-1)*n0 + 1):(i*n0)
  valid.data = training[valid.id,]
  train.data = training[-valid.id,]
  train.model = glm(attrition.status ~ Age + BusinessTravel + Department + DistanceFromHome +
                      Education + EducationField + EnvironmentSatisfaction + Gender +
                      JobInvolvement + JobLevel + JobRole + JobSatisfaction + MaritalStatus +
                      MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike +
                      PerformanceRating + RelationshipSatisfaction + StockOptionLevel +
                      TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole +
                      YearsSinceLastPromotion + YearsWithCurrManager,
                    family = binomial(link = logit), data = train.data)
  # predict() picks the required feature columns directly from valid.data
  pred.prob = predict.glm(train.model, valid.data, type = "response")
  # accuracy for each candidate cut-off probability
  for(j in 1:20){
    valid.data$pred.status = as.numeric(pred.prob > cut.off.prob[j])
    a11 = sum(valid.data$pred.status == valid.data$attrition.status)
    pred.accuracy[i,j] = a11/length(pred.prob)
  }
}
###  
avg.accuracy = apply(pred.accuracy, 2, mean)
max.id = which(avg.accuracy ==max(avg.accuracy ))
### visual representation
tick.label = as.character(round(cut.off.prob, 2))
plot(1:20, avg.accuracy, type = "b",
     xlim=c(1,20), 
     ylim=c(0.5,1), 
     axes = FALSE,
     xlab = "Cut-off Probability",
     ylab = "Accuracy",
     main = "5-fold CV performance"
     )
axis(1, at=1:20, labels = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "red")
text(max.id, avg.accuracy[max.id]+0.03, as.character(round(avg.accuracy[max.id],4)), col = "red", cex = 0.8)
Figure 7. 5-fold CV performance plot

The above figure indicates that the optimal cut-off probability that yields the best accuracy is 0.57.

11.2 Reporting Test Performance

This subsection reports the performance of the model on the test data set. Note that the model is refit to the full training data to obtain the regression coefficients, and the held-out testing sample is then used to compute the accuracy.

test.model = glm(attrition.status ~ Age + BusinessTravel + Department + DistanceFromHome +
                   Education + EducationField + EnvironmentSatisfaction + Gender +
                   JobInvolvement + JobLevel + JobRole + JobSatisfaction + MaritalStatus +
                   MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike +
                   PerformanceRating + RelationshipSatisfaction + StockOptionLevel +
                   TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole +
                   YearsSinceLastPromotion + YearsWithCurrManager,
                 family = binomial(link = logit), data = training)

# predict() selects the required feature columns directly from the testing data
newdata = testing

pred.prob.test = predict.glm(test.model, newdata, type = "response")
testing$test.status = as.numeric(pred.prob.test > 0.57)

a11 = sum(testing$test.status == testing$attrition.status)
test.accuracy = a11 / nrow(testing)

kable(as.data.frame(test.accuracy), align = 'c')
test.accuracy
0.8843537

The accuracy is about 88.4%: the predictive model correctly classified roughly 88% of the instances in the test dataset. In other words, out of all the data points in the test set, approximately 88% were correctly predicted by the model as either having attrition or not, based on the chosen threshold, which suggests the model fits the data without obvious underfitting.

12 Concluding Remarks

In conclusion, this study aimed to predict employee attrition using logistic regression and leveraged a robust 5-fold cross-validation approach to assess the model’s accuracy. The use of cross-validation allowed us to effectively evaluate the model’s performance on multiple subsets of the training data, enhancing its generalizability and reducing the risk of overfitting.

The logistic regression model demonstrated promising results, achieving an accuracy of about 88.4% in predicting employee attrition on the held-out test set. This suggests that the selected features and the logistic regression algorithm have the potential to be valuable tools in identifying employees who might be at risk of attrition. However, accuracy alone does not provide a complete picture of the model’s performance, and further evaluation using other metrics, such as precision, recall, F1-score, and the ROC curve, can provide additional insights into the model’s strengths and weaknesses.

13 Global and Local Performance Metrics

Now we calculate the local and global performance metrics for the logistic predictive model. We used the confusion matrix in the case study in the previous section; here we use the optimal cut-off probability as the decision threshold to define a confusion matrix, and then define the performance measures based on this matrix.

We reuse the training and testing data sets created for the previous model, along with the optimal cut-off probability obtained through cross-validation. The testing data set is used to report the local and global performance measures.

13.1 Local Performance Measures

We have identified the optimal cut-off probability to be 0.57; next, we use the testing data set to report the local measures.

test.model = glm(attrition.status ~ Age + BusinessTravel + Department + DistanceFromHome +
                   Education + EducationField + EnvironmentSatisfaction + Gender +
                   JobInvolvement + JobLevel + JobRole + JobSatisfaction + MaritalStatus +
                   MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike +
                   PerformanceRating + RelationshipSatisfaction + StockOptionLevel +
                   TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole +
                   YearsSinceLastPromotion + YearsWithCurrManager,
                 family = binomial(link = logit), data = training)
newdata = testing   # predict() selects the required feature columns from the testing data
pred.prob.test = predict.glm(test.model, newdata, type = "response")
testing$test.status = as.numeric(pred.prob.test > 0.57)
### components for defining various measures
TN = sum(testing$test.status ==0 & testing$attrition.status ==0)
FN = sum(testing$test.status ==0 & testing$attrition.status ==1)
FP = sum(testing$test.status ==1 & testing$attrition.status ==0)
TP = sum(testing$test.status ==1 & testing$attrition.status ==1)
###
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
###
precision = TP / (TP + FP)
recall = sensitivity
F1 = 2*precision*recall/(precision + recall)
metric.list = cbind(sensitivity = sensitivity, 
                    specificity = specificity, 
                    precision = precision,
                    recall = recall,
                    F1 = F1)
kable(as.data.frame(metric.list), align='c', caption = "Local performance metrics")
Local performance metrics
sensitivity specificity precision recall F1
0.35 0.9685039 0.6363636 0.35 0.4516129

Sensitivity (True Positive Rate): Sensitivity, also known as the True Positive Rate or Recall, measures the proportion of actual positive cases correctly identified by the model. In this case, it is 0.35, meaning that the model correctly identified 35% of the actual positive cases (attrition) in the test set.

Specificity (True Negative Rate): Specificity measures the proportion of actual negative cases correctly identified by the model. Here, it is 0.9685, indicating that the model correctly identified approximately 96.85% of the actual negative cases (non-attrition).

Precision: Precision is the ratio of true positive predictions to all predicted positive cases. In this context, it is 0.6364, meaning that of all the cases the model predicted as attrition, approximately 63.64% actually left.

Recall: Recall is another term for Sensitivity, as explained above. Here, it is 0.35, indicating that 35% of the actual positive cases were correctly identified.

F1 Score: The F1 Score is the harmonic mean of Precision and Recall. It provides a single balanced measure, which is especially useful when there is an imbalance between positive and negative cases. Here, it is 0.4516, reflecting the modest recall despite the reasonable precision.
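The F1 value can be verified directly from the precision and recall reported in the table:

\[ F_1 = \frac{2 \times 0.6364 \times 0.35}{0.6364 + 0.35} \approx 0.4516. \]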

These metrics collectively provide insights into how well the model is performing in terms of correctly classifying attrition and non-attrition cases. A higher sensitivity is desired when correctly identifying positive cases is crucial, whereas a higher specificity is important when correctly identifying negative cases is a priority. Precision and Recall help evaluate the trade-off between true positives and false positives, while the F1 Score provides a balanced measure between Precision and Recall.
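For convenience, the calculations above can be wrapped in a small helper function (a sketch added here for reuse, not part of the original pipeline), so the same metrics can be reported for any vector of 0/1 predictions:

## Helper: local metrics from 0/1 predictions and 0/1 true labels
local.metrics = function(pred, truth){
  TP = sum(pred == 1 & truth == 1); TN = sum(pred == 0 & truth == 0)
  FP = sum(pred == 1 & truth == 0); FN = sum(pred == 0 & truth == 1)
  sens = TP/(TP + FN); spec = TN/(TN + FP); prec = TP/(TP + FP)
  c(sensitivity = sens, specificity = spec, precision = prec,
    recall = sens, F1 = 2*prec*sens/(prec + sens))
}
## e.g., local.metrics(testing$test.status, testing$attrition.status)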

13.2 Global Measure: ROC and AUC

In order to create an ROC curve, we need to select a sequence of decision thresholds and calculate the corresponding sensitivity and specificity.

CAUTION: although ROC and AUC are commonly used for model selection (in which case they are constructed from the training data), here we construct the ROC and calculate the AUC from the hold-out testing data to report the performance of the final model.

Performance Metrics and ROC Curve:

To quantitatively measure the model’s performance, various metrics come into play. The ROC (Receiver Operating Characteristic) curve is a graphical representation of a model’s ability to distinguish between the classes, showing the trade-off between sensitivity and specificity. The area under the ROC curve (AUC-ROC) quantifies the model’s discriminatory power, where a higher AUC indicates better performance.

In this exploration, we delve into the synergy of logistic regression, training, validation, and test datasets, cross-validation techniques, and the visualization of model performance through ROC curves. By combining these elements, we aim to build robust and accurate logistic regression models that effectively predict binary outcomes in various domains.

With a well-trained model in hand and a thorough understanding of its cross-validated performance, we turn to the ROC curve itself. Tracing the true positive rate against the false positive rate over a sequence of thresholds lets us choose an operating threshold that balances sensitivity and specificity, while the AUC condenses the model’s overall discriminatory power into a single number.
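As a side note, once the ROCobj object is created below, the pROC package can also suggest such a threshold directly; the sketch below uses pROC's default "best" criterion (Youden's J statistic, sensitivity + specificity - 1).

coords(ROCobj, "best", ret = c("threshold", "sensitivity", "specificity"))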

# Install and load the pROC package
library(pROC)

cut.off.seq = seq(0, 1, length = 100)
sensitivity.vec = NULL
specificity.vec = NULL

training.model = glm(attrition.status ~ Age + BusinessTravel + Department + DistanceFromHome +
                       Education + EducationField + EnvironmentSatisfaction + Gender +
                       JobInvolvement + JobLevel + JobRole + JobSatisfaction + MaritalStatus +
                       MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike +
                       PerformanceRating + RelationshipSatisfaction + StockOptionLevel +
                       TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole +
                       YearsSinceLastPromotion + YearsWithCurrManager,
                     family = binomial(link = logit), data = training)
newdata = testing   # predict() selects the required feature columns from the testing data
pred.prob.test = predict.glm(training.model, newdata, type = "response")
for (i in 1:100){
  test.status = as.numeric(pred.prob.test > cut.off.seq[i])
  ### components for defining various measures
  TN = sum(test.status == 0 & testing$attrition.status == 0)
  FN = sum(test.status == 0 & testing$attrition.status == 1)
  FP = sum(test.status == 1 & testing$attrition.status == 0)
  TP = sum(test.status == 1 & testing$attrition.status == 1)
  ###
  sensitivity.vec[i] = TP / (TP + FN)
  specificity.vec[i] = TN / (TN + FP)
}
one.minus.spec = 1 - specificity.vec
sens.vec = sensitivity.vec

## A more accurate AUC via the {pROC} package
prediction = pred.prob.test
category = testing$attrition.status == 1
ROCobj <- roc(category, prediction)
AUC = round(auc(ROCobj), 4)

par(pty = "s")   # make a square figure
plot(one.minus.spec, sens.vec, type = "l", xlim = c(0,1), ylim = c(0,1),
     xlab ="1 - specificity",
     ylab = "sensitivity",
     main = "ROC curve of Logistic Employee Attrition Model",
     lwd = 2,
     col = "blue", )
segments(0,0,1,1, col = "red", lty = 2, lwd = 2)
text(0.8, 0.3, paste("AUC = ", AUC), col = "blue", cex = 0.8)

An AUC (Area Under the Curve) value of 0.8151 indicates that our logistic regression model performs reasonably well in distinguishing between the two classes (attrition and non-attrition). The AUC ranges from 0 to 1, with higher values indicating better predictive performance. Equivalently, an AUC of 0.8151 means that a randomly chosen employee who left receives a higher predicted attrition probability than a randomly chosen employee who stayed about 81.5% of the time.

When interpreting the AUC value:

  • AUC = 0.5 indicates random performance (no discrimination power).
  • AUC > 0.5 and < 0.7 suggests a poor to fair discriminative ability.
  • AUC > 0.7 and < 0.9 indicates a good to excellent discriminative ability.
  • AUC > 0.9 suggests outstanding discriminative ability.

Keep in mind that the AUC is just one measure of the model’s performance. It’s important to consider other metrics such as sensitivity, specificity, precision, and F1-score, as well as domain knowledge, when evaluating and interpreting the results of our logistic regression model.

Concluding Remarks

In conclusion, the combination of logistic regression, training and test datasets, cross-validation, and ROC analysis forms a comprehensive framework for predictive modeling. By embracing these techniques, we empower ourselves to not only build accurate models but also to validate their effectiveness and make informed decisions based on their predictive capabilities. This multifaceted approach is a cornerstone of modern data science, enabling us to unlock insights and drive informed actions from complex datasets.

14 Introduction to Neural Network Model

A neural network is a type of machine learning algorithm that is designed to recognize patterns in data. It’s inspired by the structure and function of the human brain, where interconnected neurons work together to process and transmit information. Neural networks consist of layers of interconnected nodes (neurons) that process input data and produce output. These layers are typically organized into three main types:

Input Layer: This layer receives the raw input data and passes it on to the next layer for processing.

Hidden Layers: These are one or more layers between the input and output layers. Each neuron in a hidden layer processes the information it receives from the previous layer and passes the output to the next layer. The hidden layers are responsible for learning complex patterns in the data.

Output Layer: This layer produces the final prediction or output based on the information processed by the hidden layers. The structure of the output layer depends on the type of problem being solved, such as classification or regression.

Neural networks “learn” from data through a process called training. During training, the network adjusts the connections between neurons (weights) based on the input data and the desired output. This adjustment is done iteratively using optimization algorithms that minimize the difference between the predicted output and the actual target.
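As a toy illustration of this computation (the numbers below are purely illustrative, not taken from our model), a single logistic neuron combines its inputs through a weighted sum plus a bias and passes the result through an activation function:

sigmoid = function(z) 1/(1 + exp(-z))   # logistic activation function
x = c(0.2, 0.7, 0.1)                    # scaled inputs
w = c(1.5, -2.0, 0.3)                   # weights, adjusted during training
b = 0.5                                 # bias term (the "intercept")
sigmoid(sum(w * x) + b)                 # neuron output, a value in (0, 1)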

Neural networks have gained significant popularity due to their ability to learn from complex and high-dimensional data, making them suitable for tasks such as image and speech recognition, language translation, playing games, autonomous driving, and more. They can capture intricate relationships in data that may be difficult for traditional algorithms to discover.

While neural networks can achieve remarkable accuracy, they also come with challenges. They require large amounts of data for training, and determining the right architecture (number of layers and neurons) and optimization techniques can be complex. Deep learning, a subset of neural networks, involves using multiple layers to create complex models and has been a driving force behind many recent breakthroughs in AI.

In summary, neural networks are a powerful tool in the field of machine learning, capable of learning and representing complex patterns in data. They have enabled significant advancements in various domains and continue to be a focus of research and development.

14.1 Feature Conversion for neuralnet

neuralnet() requires all features to be in numeric form (dummy variables for categorical features, normalized numerical features). The model formula in neuralnet() requires the dummy variables to be explicitly defined. It is also highly recommended to scale all numerical features before including them in the network model. The objective is to find all feature names (numeric and all dummy variables) and write them in the model formula as in glm: response ~ var_1 + var_2 + ... + var_k

To explain the modeling process in detail, we will outline major steps in the following subsections.

14.2 Numeric Feature Scaling

There are different types of scaling and standardization. The one we use in the following is

\[ scaled.var = \frac{orig.var - \min(orig.var)}{\max(orig.var)-\min(orig.var)} \] The scaled numeric feature is unitless and lies in \([0, 1]\), serving the same standardizing purpose as the well-known z-score transformation. For example, if Age ranges from 18 to 60, an employee aged 29 is scaled to \((29-18)/(60-18) \approx 0.262\).

attrition = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv")
# Min-max scale each numeric feature in place
minmax.vars = c("Age", "DistanceFromHome", "Education", "EnvironmentSatisfaction",
                "JobInvolvement", "JobLevel", "JobSatisfaction", "NumCompaniesWorked",
                "MonthlyIncome", "PercentSalaryHike", "PerformanceRating",
                "RelationshipSatisfaction", "StockOptionLevel", "TrainingTimesLastYear",
                "WorkLifeBalance", "YearsInCurrentRole", "YearsSinceLastPromotion",
                "YearsWithCurrManager")
for (v in minmax.vars) {
  attrition[[v]] = (attrition[[v]] - min(attrition[[v]])) /
                   (max(attrition[[v]]) - min(attrition[[v]]))
}

14.3 Extract All Feature Names

In practical applications, there may be many categorical features in the model, and each categorical feature could have many categories. It is practically infeasible to write all the resulting dummy features explicitly. We can use the R function model.matrix() to extract the variables from a model formula that will be used in a model. Make sure all categorical feature variables are defined in non-numeric form (i.e., not numerically encoded). We can also use the R function relevel() to change the baseline of an unordered categorical feature variable.

# Convert "Attrition" to a factor variable
attrition$Attrition <- factor(attrition$Attrition, levels = c("No", "Yes"))
# Change baseline for Attrition variable (assuming "No" is the baseline)
attrition$Attrition <- relevel(attrition$Attrition, ref = "No")

# Convert "BusinessTravel" to a factor variable
attrition$BusinessTravel <- factor(attrition$BusinessTravel, levels = c("Travel_Rarely", "Travel_Frequently", "Travel_Non"))
# Change baseline for BusinessTravel variable (assuming "Travel_Rarely" is the baseline)
attrition$BusinessTravel <- relevel(attrition$BusinessTravel, ref = "Travel_Rarely")

# Convert "Department" to a factor variable
attrition$Department <- factor(attrition$Department, levels = c("Human Resources", "Research & Development", "Sales"))
# Change baseline for Department variable (assuming "Sales" is the baseline)
attrition$Department <- relevel(attrition$Department, ref = "Sales")

# Convert "EducationField" to a factor variable
attrition$EducationField <- factor(attrition$EducationField, levels = c("Human Resources", "Life Sciences", "Marketing", "Medical", "Other", "Technical Degree"))
# Change baseline for EducationField variable (assuming "Technical Degree" is the baseline)
attrition$EducationField <- relevel(attrition$EducationField, ref = "Technical Degree")

# Convert "Gender" to a factor variable
attrition$Gender <- factor(attrition$Gender, levels = c("Female", "Male"))
# Change baseline for Gender variable (assuming "Male" is the baseline)
attrition$Gender <- relevel(attrition$Gender, ref = "Male")

# Convert "JobRole" to a factor variable
attrition$JobRole <- factor(attrition$JobRole, levels = c("Healthcare Representative", "Human Resources", "Laboratory Technician", "Manager", "Manufacturing Director", "Research Director", "Research Scientist", "Sales Executive", "Sales Representative"))
# Change baseline for JobRole variable (assuming "Research Director" is the baseline)
attrition$JobRole <- relevel(attrition$JobRole, ref = "Research Director")

# Convert "MaritalStatus" to a factor variable
attrition$MaritalStatus <- factor(attrition$MaritalStatus, levels = c("Divorced", "Married", "Single"))
# Change baseline for MaritalStatus variable (assuming "Single" is the baseline)
attrition$MaritalStatus <- relevel(attrition$MaritalStatus, ref = "Single")

# Convert "OverTime" to a factor variable
attrition$OverTime <- factor(attrition$OverTime, levels = c("No", "Yes"))
# Change baseline for OverTime variable (assuming "No" is the baseline)
attrition$OverTime <- relevel(attrition$OverTime, ref = "No")

Next, we use the R function model.matrix() to extract the names of all feature variables (including implicitly defined dummy feature variables from model.matrix()).

# Create a model matrix for modeling 
attritionMtx <- model.matrix(~ ., data = attrition)

# Get the column names of the model matrix
colnames(attritionMtx)
##  [1] "(Intercept)"                      "Age"                             
##  [3] "AttritionYes"                     "BusinessTravelTravel_Frequently" 
##  [5] "BusinessTravelTravel_Non"         "DepartmentHuman Resources"       
##  [7] "DepartmentResearch & Development" "DistanceFromHome"                
##  [9] "Education"                        "EducationFieldHuman Resources"   
## [11] "EducationFieldLife Sciences"      "EducationFieldMarketing"         
## [13] "EducationFieldMedical"            "EducationFieldOther"             
## [15] "EnvironmentSatisfaction"          "GenderFemale"                    
## [17] "JobInvolvement"                   "JobLevel"                        
## [19] "JobRoleHealthcare Representative" "JobRoleHuman Resources"          
## [21] "JobRoleLaboratory Technician"     "JobRoleManager"                  
## [23] "JobRoleManufacturing Director"    "JobRoleResearch Scientist"       
## [25] "JobRoleSales Executive"           "JobRoleSales Representative"     
## [27] "JobSatisfaction"                  "MaritalStatusDivorced"           
## [29] "MaritalStatusMarried"             "MonthlyIncome"                   
## [31] "NumCompaniesWorked"               "OverTimeYes"                     
## [33] "PercentSalaryHike"                "PerformanceRating"               
## [35] "RelationshipSatisfaction"         "StockOptionLevel"                
## [37] "TrainingTimesLastYear"            "WorkLifeBalance"                 
## [39] "YearsInCurrentRole"               "YearsSinceLastPromotion"         
## [41] "YearsWithCurrManager"

There are some naming issues in the above dummy feature variables for network modeling (although they are good for regular linear and generalized linear regression models). We need to rename them by excluding special characters in order to build neural network models. These issues can be avoided at the stage of feature engineering (if we initially planned to build neural network models). Next, we clean up the variables before defining the network model formula.

colnames(attritionMtx)[4] <- "BusinessTravelTravelFreq"
colnames(attritionMtx)[5] <- "BusinessTravelTravelNon"
colnames(attritionMtx)[6] <- "DepartmentHumanResources"
colnames(attritionMtx)[7] <- "DepartmentResearchDevelopment"
colnames(attritionMtx)[10] <- "EducationFieldHumanResources"
colnames(attritionMtx)[11] <- "EducationFieldLifeSciences"
colnames(attritionMtx)[19] <- "JobRoleHealthcareRepresentative"
colnames(attritionMtx)[20] <- "JobRoleHumanResources"
colnames(attritionMtx)[21] <- "JobRoleLaboratorytechnician"
colnames(attritionMtx)[23] <- "JobRoleManufacturingDirector"
colnames(attritionMtx)[24] <- "JobRoleResearchScientist"
colnames(attritionMtx)[25] <- "JobRoleSalesExecutive"
colnames(attritionMtx)[26] <- "JobRoleSalesRepresentative"

14.4 Define Model Formula

For convenience, we encourage you to use CamelCase notation (CamelCase is a way to separate the words in a phrase by making the first letter of each word capitalized and not using spaces) in naming feature variables.

# Get the column names of the model matrix (excluding intercept)
columnNames <- colnames(attritionMtx)[-1]

# Replace invalid characters and spaces in column names
columnNames <- gsub("[[:space:]]", "_", columnNames) # Replace spaces with underscores
columnNames <- gsub("[[:punct:]]", "_", columnNames) # Replace punctuations with underscores

# Create a formula string without AttritionYes on the right-hand side
modelFormula <- as.formula(paste("AttritionYes ~", 
                                 paste(setdiff(columnNames, "AttritionYes"), collapse = " + ")))

# Display the model formula
modelFormula
## AttritionYes ~ Age + BusinessTravelTravelFreq + BusinessTravelTravelNon + 
##     DepartmentHumanResources + DepartmentResearchDevelopment + 
##     DistanceFromHome + Education + EducationFieldHumanResources + 
##     EducationFieldLifeSciences + EducationFieldMarketing + EducationFieldMedical + 
##     EducationFieldOther + EnvironmentSatisfaction + GenderFemale + 
##     JobInvolvement + JobLevel + JobRoleHealthcareRepresentative + 
##     JobRoleHumanResources + JobRoleLaboratorytechnician + JobRoleManager + 
##     JobRoleManufacturingDirector + JobRoleResearchScientist + 
##     JobRoleSalesExecutive + JobRoleSalesRepresentative + JobSatisfaction + 
##     MaritalStatusDivorced + MaritalStatusMarried + MonthlyIncome + 
##     NumCompaniesWorked + OverTimeYes + PercentSalaryHike + PerformanceRating + 
##     RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager


14.5 Training and Testing Neural Network Model

We follow the routine steps for building a neural network model to predict Attrition.

14.5.1 Data Splitting

We split the data into 70% for training the neural network and 30% for testing.

n = dim(attritionMtx)[1]
trainID = sample(1:n, round(n*0.7), replace = FALSE)   # 70% of rows for training
trainDat = attritionMtx[trainID,]
testDat = attritionMtx[-trainID,]                      # remaining 30% for testing

14.5.2 Build Neural Network Model

library(neuralnet)                                 # provides neuralnet() and its predict() method
NetworkModel = neuralnet(modelFormula,
                         data = trainDat,
                         hidden = 1,               # one hidden layer with a single neuron
                         rep = 1,                  # number of repetitions in training the NN
                         threshold = 0.01,         # stopping criterion on the partial derivatives
                         learningrate = 0.1,       # used only by algorithm = "backprop"
                         algorithm = "rprop+"      # resilient backpropagation
                         )
kable(NetworkModel$result.matrix)
error 13.7333389
reached.threshold 0.0097240
steps 5551.0000000
Intercept.to.1layhid1 -1.0253265
Age.to.1layhid1 -55.7745389
BusinessTravelTravelFreq.to.1layhid1 6.8328217
BusinessTravelTravelNon.to.1layhid1 1.2674063
DepartmentHumanResources.to.1layhid1 -19.2933209
DepartmentResearchDevelopment.to.1layhid1 -2.1590475
DistanceFromHome.to.1layhid1 12.3487916
Education.to.1layhid1 0.4350657
EducationFieldHumanResources.to.1layhid1 50.4137190
EducationFieldLifeSciences.to.1layhid1 -0.5426801
EducationFieldMarketing.to.1layhid1 7.2140508
EducationFieldMedical.to.1layhid1 8.1464595
EducationFieldOther.to.1layhid1 -11.0193175
EnvironmentSatisfaction.to.1layhid1 -17.3014723
GenderFemale.to.1layhid1 -6.0876715
JobInvolvement.to.1layhid1 -7.0899894
JobLevel.to.1layhid1 -94.3725050
JobRoleHealthcareRepresentative.to.1layhid1 -553.1303698
JobRoleHumanResources.to.1layhid1 3.8552233
JobRoleLaboratorytechnician.to.1layhid1 1.0984026
JobRoleManager.to.1layhid1 13.3624398
JobRoleManufacturingDirector.to.1layhid1 -553.4203115
JobRoleResearchScientist.to.1layhid1 -22.8619270
JobRoleSalesExecutive.to.1layhid1 20.1729998
JobRoleSalesRepresentative.to.1layhid1 6.5006476
JobSatisfaction.to.1layhid1 -17.1013042
MaritalStatusDivorced.to.1layhid1 -2.0362965
MaritalStatusMarried.to.1layhid1 1.9593513
MonthlyIncome.to.1layhid1 -11.8972301
NumCompaniesWorked.to.1layhid1 33.6800638
OverTimeYes.to.1layhid1 46.1325805
PercentSalaryHike.to.1layhid1 14.5302432
PerformanceRating.to.1layhid1 -15.1939810
RelationshipSatisfaction.to.1layhid1 -9.9645347
StockOptionLevel.to.1layhid1 -33.5618234
TrainingTimesLastYear.to.1layhid1 7.7274389
WorkLifeBalance.to.1layhid1 12.0644706
YearsInCurrentRole.to.1layhid1 28.4707256
YearsSinceLastPromotion.to.1layhid1 -2.0861666
YearsWithCurrManager.to.1layhid1 -1.6482290
Intercept.to.AttritionYes 0.0792922
1layhid1.to.AttritionYes 0.9290556
plot(NetworkModel, rep="best")

Figure 12. Single-layer backpropagation neural network model for employee attrition.

For later comparison with the neural network, we also fit a logistic regression model with all features to the scaled analytic data and display its coefficient estimates.

logiModel = glm(factor(Attrition) ~., family = binomial, data = attrition)
pander(summary(logiModel)$coefficients)
                                     Estimate   Std. Error    z value    Pr(>|z|)
  (Intercept)                          0.3545        1.608     0.2204      0.8256
  Age                                  -1.502       0.5208     -2.884    0.003921
  BusinessTravelTravel_Frequently      0.8802       0.2047      4.299   1.713e-05
  DepartmentHuman Resources            -12.63          446   -0.02832      0.9774
  DepartmentResearch & Development     0.2472        1.113      0.222      0.8243
  DistanceFromHome                      1.196       0.3055      3.913   9.099e-05
  Education                           0.09759       0.3507     0.2783      0.7808
  EducationFieldHuman Resources      -0.02841       0.8175   -0.03475      0.9723
  EducationFieldLife Sciences         -0.8385        0.304     -2.758    0.005809
  EducationFieldMarketing             -0.4895       0.3912     -1.251      0.2109
  EducationFieldMedical                -0.903       0.3136     -2.879    0.003989
  EducationFieldOther                  -1.133       0.4889     -2.318     0.02043
  EnvironmentSatisfaction               -1.07       0.2452     -4.363   1.282e-05
  GenderFemale                        -0.3189       0.1842     -1.731     0.08339
  JobInvolvement                       -1.677       0.3706     -4.525   6.042e-06
  JobLevel                            -0.6883        1.183    -0.5818      0.5607
  JobRoleHealthcare Representative      1.016       0.9542      1.065       0.287
  JobRoleHuman Resources                15.17          446    0.03402      0.9729
  JobRoleLaboratory Technician          2.614       0.9983      2.619     0.00883
  JobRoleManager                        1.406         1.07      1.315      0.1886
  JobRoleManufacturing Director         1.203       0.9399       1.28      0.2006
  JobRoleResearch Scientist             1.699        1.001      1.698     0.08956
  JobRoleSales Executive                2.377        1.422      1.672      0.0946
  JobRoleSales Representative            3.43         1.51      2.272     0.02311
  JobSatisfaction                      -1.041       0.2406     -4.329   1.501e-05
  MaritalStatusDivorced                -1.014       0.3419     -2.967    0.003009
  MaritalStatusMarried                -0.7231        0.248     -2.916    0.003546
  MonthlyIncome                        0.7883        1.503     0.5246      0.5999
  NumCompaniesWorked                    1.306       0.3418      3.822   0.0001326
  OverTimeYes                            1.82       0.1904      9.558   1.197e-21
  PercentSalaryHike                   -0.3553       0.5425    -0.6549      0.5126
  PerformanceRating                   0.02365        0.396    0.05972      0.9524
  RelationshipSatisfaction            -0.6659       0.2483     -2.681     0.00733
  StockOptionLevel                    -0.6092       0.4656     -1.309      0.1907
  TrainingTimesLastYear               -0.9523       0.4452     -2.139     0.03244
  WorkLifeBalance                     -0.9719       0.3725     -2.609    0.009082
  YearsInCurrentRole                   -1.111       0.7599     -1.462      0.1438
  YearsSinceLastPromotion             -0.4548       0.7042    -0.6458      0.5184

14.5.3 About Cross-validation in Neural Network

Cross-validation is primarily used for tuning hyper-parameters. For example, in the sigmoid perceptron, the optimal cut-off score for the binary decision can be obtained through cross-validation. Another important hyperparameter in neural network models is the learning rate \(\alpha\) in the backpropagation algorithm, which controls how quickly the network learns during training.
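The sketch below shows how a grid of learning rates could be compared on a single training/validation split. Note that it assumes algorithm = "backprop" (the classical variant that actually uses learningrate; the "rprop+" algorithm used above adapts its step sizes internally), and the grid values are illustrative.

lr.grid = c(0.01, 0.05, 0.1)                     # candidate learning rates (illustrative)
val.acc = numeric(length(lr.grid))
vid = 1:round(dim(trainDat)[1]/5)                # hold out one fold for validation
for (k in 1:length(lr.grid)) {
  fit = neuralnet(modelFormula, data = trainDat[-vid, ],
                  hidden = 1, threshold = 0.01,
                  algorithm = "backprop",        # classical backpropagation
                  learningrate = lr.grid[k])
  p = predict(fit, trainDat[vid, ])
  val.acc[k] = mean(as.numeric(p > 0.5) == trainDat[vid, "AttritionYes"])
}
lr.grid[which.max(val.acc)]                      # best-performing learning rate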

n0 <- dim(trainDat)[1] / 5
cut.off.score <- seq(0, 1, length = 22)[-c(1, 22)]
pred.accuracy <- matrix(0, ncol = 20, nrow = 5, byrow = TRUE)

for (i in 1:5) {
  valid.id <- ((i - 1) * n0 + 1):(i * n0)
  valid.data <- trainDat[valid.id,]
  train.data <- trainDat[-valid.id,]

  train.model <- neuralnet(modelFormula,
                           data = train.data,
                           hidden = 1,
                           rep = 1,
                           threshold = 0.01,
                           learningrate = 0.1,
                           algorithm = "rprop+"
  )
  
  pred.nn.score <- predict(train.model, valid.data)  # Use train.model for prediction
  for (j in 1:20) {
    pred.status <- as.numeric(pred.nn.score > cut.off.score[j])
    a11 <- sum(pred.status == valid.data[, 3])   # column 3 of the matrix is AttritionYes
    pred.accuracy[i, j] <- a11 / length(pred.nn.score)
  }
}

avg.accuracy <- apply(pred.accuracy, 2, mean)
max.id <- which(avg.accuracy == max(avg.accuracy))

tick.label <- as.character(round(cut.off.score, 2))
plot(1:20, avg.accuracy, type = "b",
     xlim = c(1, 20), 
     ylim = c(0.5, 1), 
     axes = FALSE,
     xlab = "Cut-off Score",
     ylab = "Accuracy",
     main = "5-fold CV performance"
)
axis(1, at = 1:20, labels = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "red")
text(max.id, avg.accuracy[max.id] + 0.03, as.character(round(avg.accuracy[max.id], 4)), col = "red", cex = 0.8)


14.5.4 Testing Model Performance

# Test the fitted network on the held-out test data
nn.results <- predict(NetworkModel, testDat)
results <- data.frame(actual = testDat[,3],                   # column 3 = AttritionYes
                      prediction = nn.results > .57)          # classify with the 0.57 cut-off
confMatrix = table(results$prediction, results$actual)               # confusion matrix
accuracy=sum(results$actual == results$prediction)/length(results$prediction)
list(confusion.matrix = confMatrix, accuracy = accuracy)       
## $confusion.matrix
##        
##           0   1
##   FALSE 734  96
##   TRUE   45  49
## 
## $accuracy
## [1] 0.8474026

14.5.5 ROC Analysis

Recall that the ROC curve is the plot of sensitivity against (1 - specificity), calculated from confusion matrices based on a sequence of selected cut-off scores. Sensitivity and specificity are defined from the entries of the confusion matrix (TP, FN, FP, TN) exactly as in Section 13.1.

Next, we construct a ROC for the above NN model based on the training data set.

nn.results = predict(NetworkModel, trainDat)  # Keep in mind that trainDat is a matrix!
cut0 = seq(0,1, length = 20)
SenSpe = matrix(0, ncol = length(cut0), nrow = 2, byrow = FALSE)
for (i in 1:length(cut0)){
    a = sum(trainDat[,"AttritionYes"] == 1 & (nn.results > cut0[i]))
    d = sum(trainDat[,"AttritionYes"] == 0 & (nn.results < cut0[i]))
    b = sum(trainDat[,"AttritionYes"] == 0 & (nn.results > cut0[i]))    
    c = sum(trainDat[,"AttritionYes"] == 1 & (nn.results < cut0[i]))   
    sen = a/(a + c)
    spe = d/(b + d)
    SenSpe[,i] = c(sen, spe)
}
# plotting ROC
plot(1-SenSpe[2,], SenSpe[1,], type ="l", xlim=c(0,1), ylim=c(0,1),
     xlab = "1 - specificity", ylab = "Sensitivity", lty = 1,
     main = "ROC Curve", col = "blue")
abline(0,1, lty = 2, col = "red")

## A more accurate AUC via the {pROC} package
prediction = as.vector(nn.results)
category = trainDat[,"AttritionYes"] == 1
ROCobj <- roc(category, prediction)
AUC = auc(ROCobj)[1]
###
text(0.8, 0.3, paste("AUC = ", round(AUC,4)), col = "purple", cex = 0.9)
legend("bottomright", c("ROC of the model", "Random guessing"), lty=c(1,2),
       col = c("blue", "red"), bty = "n", cex = 0.8)
Figure 14: ROC curve of the neural network model.

The above ROC curve indicates that the underlying neural network is better than the random guess since the area under the curve is significantly greater than 0.5. In general, if the area under the ROC curve is greater than 0.65, we say the predictive power of the underlying model is acceptable.

14.5.6 Comparing the Two AUC Values

  • Logistic Model AUC: 0.8151
  • Neural Network Model AUC: 0.8337

Comparing AUC values: the AUC ranges between 0 and 1, and a higher value indicates better discrimination between the classes; anything above 0.5 is better than random chance. Both models clear that bar comfortably, and the neural network model has a slightly higher AUC (0.8337) than the logistic model (0.8151), suggesting it is somewhat better at distinguishing between the positive and negative classes. Note, however, that the logistic AUC above was computed on the hold-out testing data while the neural network AUC was computed on the training data, so this comparison should be read as indicative rather than strict.

15 Introduction to Decision Tree

Decision tree algorithms are a class of machine learning algorithms used for both classification and regression tasks. They work by recursively splitting the dataset into subsets based on the values of input features, ultimately creating a tree-like structure of decisions that leads to a predicted output.

In the context of our employee Attrition dataset, decision tree algorithms can be appropriate for the following reasons:

Interpretability: Decision trees provide transparent and easy-to-understand rules that map input features to predicted outcomes. This can be valuable in understanding the factors that contribute to employee attrition.

Feature Importance: Decision trees can rank features based on their importance in the decision-making process. This helps identify which factors have the most significant impact on employee attrition.

Non-linearity: Decision trees can capture complex relationships and interactions between features, which could be present in employee attrition scenarios.

Handling Missing Values: Decision trees can handle missing values in a robust manner by considering available data for each split.

No Assumption about Data Distribution: Decision trees do not assume any specific data distribution, making them suitable for various types of data.

The Decision Tree (DT) algorithm is based on conditional probabilities. Unlike the other classification algorithms, decision trees generate rules. A rule is a conditional statement that can easily be understood by humans and easily used within a database to identify a set of records. It is easy to interpret and implement in real-world applications. Among several basic tree-based algorithms, Classification and Regression Tree (CART) is most frequently used in practice.

For our analysis, we can follow these steps:

Data Preparation: Clean and preprocess the employee attrition dataset, handling missing values and encoding categorical variables.

Split Data: Divide the dataset into training and testing subsets.

Model Selection: Choose a decision tree algorithm, such as ID3, C4.5, or CART, or a tree-based ensemble such as Random Forest, based on your requirements and the characteristics of the dataset.

Model Training: Train the selected decision tree algorithm on the training data.

Model Evaluation: Evaluate the model’s performance using appropriate metrics such as accuracy, precision, recall, F1-score, and AUC. Utilize techniques like cross-validation to assess stability.

Visualize the Tree: If the decision tree is not too complex, visualize it to understand the hierarchy of decisions and feature importance.

Hyperparameter Tuning: Optimize hyperparameters, such as tree depth or the minimum number of observations per leaf, to avoid overfitting or underfitting (see the cp-tuning sketch after this list).

Interpretation: Analyze the tree’s rules and feature importance to gain insights into the factors contributing to employee attrition.

Prediction and Deployment: Use the trained decision tree model to make predictions on new, unseen data, and potentially deploy it in your HR or business process.
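For the hyperparameter tuning step, rpart exposes the complexity parameter cp, which controls tree size. A minimal sketch (assuming fit is an rpart model grown with a deliberately small cp, e.g. cp = 0.001, so there is something to prune back):

printcp(fit)                          # cross-validated error for each candidate cp
plotcp(fit)                           # visual guide for choosing cp
best.cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  = prune(fit, cp = best.cp)    # prune back to the best subtree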

16 Case Study - Predicting Attrition

This is a new model that is different from logistic and neural network models. We load the analytic data set.

attrition = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv")
# We use a random split approach
n = dim(attrition)[1]  # sample size
# caution: using without replacement
train.id = sample(1:n, round(0.7*n), replace = FALSE)  
train = attrition[train.id, ]    # training data
test = attrition[-train.id, ]    # testing data


rpart() offers a lot of flexibility in constructing decision trees through its user controls. It is particularly useful in applications where the costs of a false positive and a false negative differ.

Next, we write a wrapper so we can build different decision trees conveniently.

library(rpart)        # CART implementation
library(rpart.plot)   # tree plotting (and rule printing)
# arguments passed into rpart():
# 1. data set (training/testing);
# 2. penalty coefficients for false positives/negatives;
# 3. impurity measure ("gini" or "information")
##
tree.builder = function(attrition, fp, fn, purity){
   tree = rpart(Attrition ~ .,                # including all features
                data = attrition,
                na.action  = na.rpart,        # by default, rows are deleted if the outcome is missing,
                                              # kept if only predictors are missing
                method = "class",             # classification tree
                model  = FALSE,
                x = FALSE,
                y = TRUE,
                parms = list( # loss matrix: penalize false positives or negatives more heavily
                         loss = matrix(c(0, fp, fn, 0), ncol = 2, byrow = TRUE),
                         split = purity),     # "gini" or "information"
                ## rpart algorithm options (close to the defaults)
                control = rpart.control(
                         minsplit = 10,  # minimum number of observations required before a split
                         minbucket = 10, # minimum number of observations in any terminal node
                         cp = 0.01,      # complexity parameter for the stopping rule (larger -> smaller tree)
                         xval = 10       # number of cross-validations
                         )
                )
   tree                                       # return the fitted tree
}

Using the above function, we define six different decision tree models. The first four are fit explicitly below; Models 5 and 6 (with penalized false positives) are built inside the ROC analysis in the next subsection.

  • Model 1: gini.tree.1.1 is based on the Gini index, without penalizing false positives or false negatives.

  • Model 2: info.tree.1.1 is based on entropy, without penalizing false positives or false negatives.

  • Model 3: gini.tree.1.10 is based on the Gini index, with the cost of a false negative 10 times that of a false positive.

  • Model 4: info.tree.1.10 is based on entropy, with the cost of a false negative 10 times that of a false positive.

  • Model 5: gini.tree.10.1 is based on the Gini index, with the cost of a false positive 10 times that of a false negative.

  • Model 6: info.tree.10.1 is based on entropy, with the cost of a false positive 10 times that of a false negative.

The tree diagrams of the non-penalized and penalized models are given below.

## Call the tree model wrapper.
gini.tree.1.1 = tree.builder(attrition = train, fp = 1, fn = 1, purity = "gini")
info.tree.1.1 = tree.builder(attrition = train, fp = 1, fn = 1, purity = "information")
gini.tree.1.10 = tree.builder(attrition = train, fp = 1, fn = 10, purity = "gini")
info.tree.1.10 = tree.builder(attrition = train, fp = 1, fn = 10, purity = "information")
## tree plots
par(mfrow=c(1,2))
rpart.plot(gini.tree.1.1, main = "Tree with Gini index: non-penalization")
rpart.plot(info.tree.1.1, main = "Tree with entropy: non-penalization")
Figure 15. Non-penalized decision tree models using the Gini index (left) and entropy (right).
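Since decision trees generate human-readable rules, we can also print them directly; rpart.rules() from the rpart.plot package produces one rule per terminal node (the cover option is an optional extra showing the percentage of observations each rule covers).

rpart.rules(gini.tree.1.1, cover = TRUE)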

par(mfrow=c(1,2))
rpart.plot(gini.tree.1.10, main = "Tree with Gini index: penalization")
rpart.plot(info.tree.1.10, main = "Tree with entropy: penalization")
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
Figure 16. Penalized decision tree models using the Gini index (left) and entropy (right).

16.1 ROC for Model Selection

We defined six candidate decision tree models above and fit four of them explicitly. Next, we use ROC analysis to select the best among the candidate models.

# function returning a sensitivity and specificity matrix
SenSpe = function(attrition, fp, fn, purity){
  cutoff = seq(0,1, length = 20)   # 20 cut-offs including 0 and 1. 
  model = tree.builder(attrition, fp, fn, purity) 
  ## Caution: the decision tree returns probabilities for both classes.
  ## We need only the "success" probability to define sensitivity and specificity!
  pred = predict(model, newdata = attrition, type = "prob") # two-column matrix.
  senspe.mtx = matrix(0, ncol = length(cutoff), nrow= 2, byrow = FALSE)
  for (i in 1:length(cutoff)){
  # CAUTION: "Yes" and "No" are the values of the label in this data set!
  # The following line uses only the "Yes" probability: pred[, "Yes"]
  pred.out =  ifelse(pred[,"Yes"] >= cutoff[i], "Yes", "No")  
  TP = sum(pred.out =="Yes" & attrition$Attrition == "Yes")
  TN = sum(pred.out =="No" & attrition$Attrition == "No")
  FP = sum(pred.out =="Yes" & attrition$Attrition == "No")
  FN = sum(pred.out =="No" & attrition$Attrition == "Yes")
  senspe.mtx[1,i] = TP/(TP + FN)
  senspe.mtx[2,i] = TN/(TN + FP)
  }
  ## A better approx of ROC, need library {pROC}
  prediction = pred[, "Yes"]
  category = attrition$Attrition == "Yes"
  ROCobj <- roc(category, prediction)
  AUC = auc(ROCobj)
  ##
  list(senspe.mtx= senspe.mtx, AUC = round(AUC,3))
 }

The above function takes the training data plus three arguments (fp, fn, and purity) that let users build different types of decision trees, including the four trees discussed in the previous subsection. Next, we use this function to build six different trees and plot their corresponding ROC curves so we can compare the global performance of these tree algorithms.

giniROC11 = SenSpe(attrition = train, fp=1, fn=1, purity="gini")
infoROC11 = SenSpe(attrition = train, fp=1, fn=1, purity="information")
giniROC110 = SenSpe(attrition = train, fp=1, fn=10, purity="gini")
infoROC110 = SenSpe(attrition = train, fp=1, fn=10, purity="information")
giniROC101 = SenSpe(attrition = train, fp=10, fn=1, purity="gini")
infoROC101 = SenSpe(attrition = train, fp=10, fn=1, purity="information")

Next, we plot the ROC curves and calculate the areas under the ROC curves for Individual decision tree models.

par(pty="s")      # set up square plot through graphic parameter
plot(1-giniROC11$senspe.mtx[2,], giniROC11$senspe.mtx[1,], type = "l", xlim=c(0,1), ylim=c(0,1), 
     xlab="1 - specificity: FPR", ylab="Sensitivity: TPR", col = "blue", lwd = 2,
     main="ROC Curves of Decision Trees", cex.main = 0.9, col.main = "navy")
abline(0,1, lty = 2, col = "orchid4", lwd = 2)
lines(1-infoROC11$senspe.mtx[2,], infoROC11$senspe.mtx[1,], col = "firebrick2", lwd = 2, lty=2)
lines(1-giniROC110$senspe.mtx[2,], giniROC110$senspe.mtx[1,], col = "olivedrab", lwd = 2)
lines(1-infoROC110$senspe.mtx[2,], infoROC110$senspe.mtx[1,], col = "skyblue", lwd = 2)
lines(1-giniROC101$senspe.mtx[2,], giniROC101$senspe.mtx[1,], col = "red", lwd = 2, lty = 4)
lines(1-infoROC101$senspe.mtx[2,], infoROC101$senspe.mtx[1,], col = "sienna3", lwd = 2)
legend("bottomright", c(paste("gini.1.1,  AUC =", giniROC11$AUC), 
                        paste("info.1.1,   AUC =",infoROC11$AUC), 
                        paste("gini.1.10, AUC =",giniROC110$AUC), 
                        paste("info.1.10, AUC =",infoROC110$AUC),
                        paste("gini.10.1, AUC =",giniROC101$AUC), 
                        paste("info.10.1, AUC =",infoROC101$AUC)),
                        col=c("blue","firebrick2","olivedrab","skyblue","red","sienna3"), 
                        lty=rep(1,6), lwd=rep(2,6), cex = 0.5, bty = "n")
Figure 16. Comparison of ROC curves

The ROC curves above summarize the six decision trees together with their corresponding AUCs. The model with the largest AUC is selected as the best decision tree among the candidates.

An AUC (area under the curve) of 0.875, the largest of the six, indicates that the corresponding decision tree discriminates relatively well between the positive and negative classes. The AUC measures a model's ability to correctly rank instances of the two classes by their predicted probabilities: a value near 1.0 indicates strong discrimination, while a value near 0.5 suggests performance no better than random guessing. An AUC of 0.875 therefore indicates that this decision tree separates the two classes effectively.
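
As a quick check, the six AUCs returned by SenSpe() can be collected and compared programmatically; the short sketch below reuses only the objects created above.

## Collect the six AUCs computed above and report the largest one.
auc.values = c(gini.1.1  = giniROC11$AUC,  info.1.1  = infoROC11$AUC,
               gini.1.10 = giniROC110$AUC, info.1.10 = infoROC110$AUC,
               gini.10.1 = giniROC101$AUC, info.10.1 = infoROC101$AUC)
auc.values[which.max(auc.values)]   # the best tree by AUC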

16.2 Optimal Cut-off Score Determination

As usual, once the final model is determined, we need to find the optimal cut-off score before reporting the predictive performance of the final model on the test data. Keep in mind that the optimal cut-off must be determined through cross-validation on the training data only.

In practical applications, one may end up with two or more final models with similar AUCs. In this case, we report the performance of all final models on the test data and let clients choose one to deploy (possibly keeping the rest as challengers). For this reason, we write a function that determines the optimal cut-off for a given decision tree in this project, since each tree has its own optimal cut-off.

Optm.cutoff = function(attrition, fp, fn, purity){
  n0 = floor(dim(attrition)[1]/5)              # fold size for 5-fold CV
  cutoff = seq(0, 1, length = 20)              # candidate cut-off probabilities
  ## accuracy for each fold at each candidate cut-off
  accuracy.mtx = matrix(0, ncol = 20, nrow = 5)
  ##
  for (k in 1:5){
     valid.id = ((k-1)*n0 + 1):(k*n0)
     valid.dat = attrition[valid.id, ]
     train.dat = attrition[-valid.id, ]
     ## tree model: fit on the training folds only (not the full data),
     ## so the validation fold does not leak into the fitted tree
     tree.model = tree.builder(train.dat, fp, fn, purity)
     ## predicted "Yes" probabilities on the validation fold
     pred = predict(tree.model, newdata = valid.dat, type = "prob")[, "Yes"]
     for (i in 1:20){
        ## dichotomize the predicted probabilities at the candidate cut-off
        pc.1 = ifelse(pred > cutoff[i], "Yes", "No")
        ## validation accuracy
        accuracy.mtx[k, i] = mean(pc.1 == valid.dat$Attrition)
       }
      }
   avg.acc = apply(accuracy.mtx, 2, mean)
   ## plots
   n = length(avg.acc)
   idx = which(avg.acc == max(avg.acc))
   tick.label = as.character(round(cutoff, 2))
   ##
   plot(1:n, avg.acc, xlab = "cut-off score", ylab = "average accuracy",
        ylim = c(min(avg.acc), 1),
        axes = FALSE,
        main = paste("5-fold CV optimal cut-off \n ", purity, "(fp, fn) = (", fp, ",", fn, ")", collapse = ""),
        cex.main = 0.9,
        col.main = "navy")
   axis(1, at = 1:20, labels = tick.label, las = 2)
   axis(2)
   points(idx, avg.acc[idx], pch = 19, col = "red")
   segments(idx, min(avg.acc), idx, avg.acc[idx], col = "red")
   text(idx, avg.acc[idx] + 0.03, as.character(round(avg.acc[idx], 4)), col = "red", cex = 0.8)
   ## invisibly return the optimal cut-off(s); this is a vector when ties occur
   invisible(cutoff[idx])
}

For demonstration, we use the above function to calculate the optimal cut-off for each of the six decision trees constructed earlier.

par(mfrow=c(3,2))
Optm.cutoff(attrition = train, fp=1, fn=1, purity="gini")
Optm.cutoff(attrition = train, fp=1, fn=1, purity="information")
Optm.cutoff(attrition = train, fp=1, fn=10, purity="gini")
Optm.cutoff(attrition = train, fp=1, fn=10, purity="information")
Optm.cutoff(attrition = train, fp=10, fn=1, purity="gini")
Optm.cutoff(attrition = train, fp=10, fn=1, purity="information")
Figure 17. Plot of optimal cut-off determination

As anticipated, each tree has its own optimal cut-off. Keep in mind that the cut-off is random: it depends on the random split of the training data, so different runs may produce different cut-offs. The result also depends on the tree size, and we may sometimes end up with multiple cut-offs that tie for the optimum. Technically speaking, any one of them can be chosen for implementation; a better recommendation is to average the tied cut-offs and use that average as the final cut-off on the test data set, as sketched below.
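
A minimal sketch of this recommendation, assuming the version of Optm.cutoff() given above, which invisibly returns the tied optimal cut-off(s); the (fp, fn) and purity values are illustrative, not a final model choice.

## Average the tied optimal cut-offs (if any) into one final cut-off.
opt.cutoffs  = Optm.cutoff(attrition = train, fp = 1, fn = 10, purity = "gini")
final.cutoff = mean(opt.cutoffs)    # a single cut-off to carry to the test data
final.cutoff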

17 Concluding Remarks

Our final decision tree model achieved an area under the curve (AUC) of 0.875. A higher AUC indicates better discriminatory power in distinguishing between positive and negative cases. In the context of employee attrition analysis, this suggests that our decision tree model ranks and classifies employees by attrition risk well.

However, the choice of a cutoff score is important for practical deployment and decision-making. The cutoff score determines how the model’s predictions are converted into binary classifications (positive or negative). The selection of a cutoff involves a trade-off between sensitivity (true positive rate) and specificity (true negative rate).

Model Performance: An AUC of 0.875 is generally considered good, indicating that our decision tree model performs well in differentiating between employees who will leave and those who will stay.

Cutoff Selection: The optimal cutoff score depends on the specific business context and on the relative importance of false positives (predicting attrition that does not occur) versus false negatives (failing to predict attrition that does occur). We should choose a cutoff that aligns with the organization's goals and risk tolerance.

Trade-off Consideration: We need to balance sensitivity and specificity against business needs. Higher sensitivity is preferable when missing actual attrition cases is more costly, while higher specificity is preferable when avoiding false alarms is the higher priority.

Further Analysis: To make an informed decision about the cutoff, we should also examine precision, recall, the F1-score, and other relevant metrics, and perform a sensitivity analysis by evaluating the model's performance across a range of cutoff values; a sketch of such an evaluation follows.
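
A minimal sketch of such an analysis is given below. It assumes a held-out data frame named test (the companion split of train, not constructed in this section) and the final.cutoff computed in the sketch above; the (fp, fn) and purity values are again illustrative.

## Hypothetical evaluation of a selected tree on a held-out test set.
final.tree = tree.builder(attrition = train, fp = 1, fn = 10, purity = "gini")
test.prob  = predict(final.tree, newdata = test, type = "prob")[, "Yes"]  # "test" is assumed
test.pred  = ifelse(test.prob >= final.cutoff, "Yes", "No")
TP = sum(test.pred == "Yes" & test$Attrition == "Yes")
FP = sum(test.pred == "Yes" & test$Attrition == "No")
FN = sum(test.pred == "No"  & test$Attrition == "Yes")
precision = TP/(TP + FP)
recall    = TP/(TP + FN)                            # sensitivity
F1        = 2*precision*recall/(precision + recall)
c(precision = precision, recall = recall, F1 = F1)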

Deployment and Monitoring: Keep in mind that model performance can vary in real-world scenarios. It’s important to monitor the model’s performance after deployment and refine the cutoff if necessary.

In summary, our decision tree model demonstrates strong predictive performance, but the choice of cutoff score should be guided by the specific needs of the organization and by the consequences of false positives and false negatives in the context of employee attrition.