In the realm of People Analytics, the significance of a data set created by IBM for attrition modeling cannot be overstated. This data set presents a valuable resource for addressing critical questions related to employee turnover and engagement. With 1470 rows and 35 columns, it offers a wealth of information that encompasses various aspects of employees’ professional and personal lives. The data set combines the typical Human Resources Information System (HRIS) data with a comprehensive engagement survey, providing a holistic view of the employees’ experiences and sentiments.
The main objective of this analysis is to understand and predict employee turnover within an organization. By examining the factors that contribute to attrition, we aim to uncover valuable insights that can guide HR strategies and improve employee retention rates. Moreover, this data set offers a unique opportunity to identify differences between the group of employees who chose to stay and those who decided to leave the organization.
The data set was created and made available by IBM, a renowned leader in the tech industry. As a reputable source, the data is expected to be reliable and well-structured, enabling meaningful analysis and inference. While the specific details of data collection methods are not provided, the comprehensive nature of the data set suggests that it was meticulously curated to capture a wide range of attributes relevant to employee turnover and engagement.
Sample Size and Feature Variables: The data set consists of 1470 rows, representing individual employee records. Each row contains 35 columns, making these columns the feature variables for analysis. The feature variables include demographic information such as age and gender, factors related to job satisfaction and environment satisfaction, education field, job role, income, overtime, percentage salary hike, tenure, training time, years in the current role, relationship status, and several other parameters that may impact attrition and engagement.
The feature variables encompass various data types, including numeric, categorical, and ordinal data, which allows for a diverse set of analyses and modeling approaches.
In conclusion, this data set offers a comprehensive collection of features that are crucial for understanding and predicting employee attrition within an organization. By delving into the relationships between different variables, we can gain valuable insights that have practical implications for HR policies and practices.
A detailed description of the variables is given below:
* Education: 1 ‘Below College’, 2 ‘College’, 3 ‘Bachelor’, 4 ‘Master’, 5 ‘Doctor’
* EnvironmentSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’
* JobInvolvement: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’
* JobSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’
* PerformanceRating: 1 ‘Low’, 2 ‘Good’, 3 ‘Excellent’, 4 ‘Outstanding’
* RelationshipSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’
* WorkLifeBalance: 1 ‘Bad’, 2 ‘Good’, 3 ‘Better’, 4 ‘Best’
A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv.
We apply binary coding to the response variable, converting Attrition from Yes/No into the 0/1 indicator Attrition_num.
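As a minimal sketch (the data frame name `attrition` is our choice, not the report's), the recoding can be done with `ifelse()`:

```r
# Read the raw data from the URL above and binary-code the response.
attrition <- read.csv(
  "https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv")
attrition$Attrition_num <- ifelse(attrition$Attrition == "Yes", 1, 0)
table(attrition$Attrition_num)   # check the 0/1 class counts
```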
The EmployeeNumber column will be removed to protect the identity of employees and their sensitive information.
Additional columns have been identified for removal:
DailyRate, HourlyRate, and MonthlyRate have no documented meaning, while EmployeeCount, Over18, and StandardHours take the same value for every employee and therefore carry no information.
We first scan the entire data set and determine the EDA tools to use for feature engineering.
## Age Attrition BusinessTravel Department
## Min. :18.00 Length:1470 Length:1470 Length:1470
## 1st Qu.:30.00 Class :character Class :character Class :character
## Median :36.00 Mode :character Mode :character Mode :character
## Mean :36.92
## 3rd Qu.:43.00
## Max. :60.00
## DistanceFromHome Education EducationField EnvironmentSatisfaction
## Min. : 1.000 Min. :1.000 Length:1470 Min. :1.000
## 1st Qu.: 2.000 1st Qu.:2.000 Class :character 1st Qu.:2.000
## Median : 7.000 Median :3.000 Mode :character Median :3.000
## Mean : 9.193 Mean :2.913 Mean :2.722
## 3rd Qu.:14.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000 Max. :4.000
## Gender JobInvolvement JobLevel JobRole
## Length:1470 Min. :1.00 Min. :1.000 Length:1470
## Class :character 1st Qu.:2.00 1st Qu.:1.000 Class :character
## Mode :character Median :3.00 Median :2.000 Mode :character
## Mean :2.73 Mean :2.064
## 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :4.00 Max. :5.000
## JobSatisfaction MaritalStatus MonthlyIncome NumCompaniesWorked
## Min. :1.000 Length:1470 Min. : 1009 Min. :0.000
## 1st Qu.:2.000 Class :character 1st Qu.: 2911 1st Qu.:1.000
## Median :3.000 Mode :character Median : 4919 Median :2.000
## Mean :2.729 Mean : 6503 Mean :2.693
## 3rd Qu.:4.000 3rd Qu.: 8379 3rd Qu.:4.000
## Max. :4.000 Max. :19999 Max. :9.000
## OverTime PercentSalaryHike PerformanceRating
## Length:1470 Min. :11.00 Min. :3.000
## Class :character 1st Qu.:12.00 1st Qu.:3.000
## Mode :character Median :14.00 Median :3.000
## Mean :15.21 Mean :3.154
## 3rd Qu.:18.00 3rd Qu.:3.000
## Max. :25.00 Max. :4.000
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## Min. :1.000 Min. :0.0000 Min. : 0.00
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.: 6.00
## Median :3.000 Median :1.0000 Median :10.00
## Mean :2.712 Mean :0.7939 Mean :11.28
## 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:15.00
## Max. :4.000 Max. :3.0000 Max. :40.00
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 Median :3.000 Median : 5.000 Median : 3.000
## Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000
## YearsSinceLastPromotion YearsWithCurrManager Attrition_num
## Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.: 0.000 1st Qu.: 2.000 1st Qu.:0.0000
## Median : 1.000 Median : 3.000 Median :0.0000
## Mean : 2.188 Mean : 4.123 Mean :0.1612
## 3rd Qu.: 3.000 3rd Qu.: 7.000 3rd Qu.:0.0000
## Max. :15.000 Max. :17.000 Max. :1.0000
There appear to be no outliers; all of the numerical variables fall within reasonable ranges.
The average age is about 37 years, with a minimum of 18 and a maximum of 60. This dynamic age range could span new interns all the way to senior managers. Some employees live very close to the workplace while others live far away. Salaries also vary widely, with an average monthly income of about USD 6,500 and a range from roughly USD 1,000 to as high as USD 20,000, consistent with new interns earning far less than senior managers. The average tenure at the company is 7 years, with some employees having worked for 40 years and others having just started.
All of the categorical variables have consistent values.
Since the data set appears to have no missing values, it simplifies the Exploratory Data Analysis (EDA) process. With a complete data set, we can focus on exploring relationships, identifying patterns, and gaining insights more effectively.
The data cleaning process is complete.
This subsection focuses on the potential discretization of continuous variables and grouping sparse categories of category variables based on their distribution.
We will group age into three subgroups spanning 18 to 60, and likewise group distance from home into three subgroups. We then replace the Age and DistanceFromHome feature variables with the grouped variables grp.age and grp.dist, and replace Attrition with the binary outcome variable Attrition_num, to ease the graphical analysis. A sketch of this grouping follows.
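A minimal sketch of the grouping with base R's `cut()`; the break points below are illustrative choices, since the actual boundaries used in the report are not stated:

```r
# Discretize Age and DistanceFromHome into three groups each
# (break points are illustrative assumptions).
attrition$grp.age <- cut(attrition$Age, breaks = c(18, 30, 45, 60),
                         labels = c("18-30", "31-45", "46-60"),
                         include.lowest = TRUE)
attrition$grp.dist <- cut(attrition$DistanceFromHome, breaks = c(0, 10, 20, 30),
                          labels = c("short", "medium", "long"))
```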
Pairwise association refers to the examination of relationships between pairs of variables in a data set. It involves analyzing how the values of two variables co-occur or change together. There are three types of pairwise associations: between two numeric variables, between a numeric and a categorical variable, and between two categorical variables.
The best visual tool for assessing pairwise linear association between two numeric variables is a pair-wise scatter plot.
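One convenient way to build such a plot in R is GGally's `ggpairs()`, which draws scatter plots and correlation coefficients off the diagonal and, when a color mapping is supplied, per-group density curves on the diagonal, matching the description below. A minimal sketch, with an illustrative subset of the numeric columns:

```r
library(GGally)    # ggpairs(): matrix of pairwise plots
library(ggplot2)   # aes() for the color mapping

# Illustrative subset of numeric columns; the report's plot may use more.
num.vars <- c("Age", "MonthlyIncome", "TotalWorkingYears", "YearsAtCompany",
              "YearsInCurrentRole", "YearsSinceLastPromotion",
              "YearsWithCurrManager")
ggpairs(attrition, columns = num.vars,
        mapping = aes(color = Attrition, alpha = 0.5))
```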
The off-diagonal plots and numbers indicate the correlation between the pairwise numeric variables. As expected, YearsWithCurrManager and YearsAtCompany, YearsWithCurrManager and YearsInCurrentRole, TotalWorkingYears and YearsAtCompany, YearsInCurrentRole and YearsAtCompany, and YearsSinceLastPromotion and YearsAtCompany are all significantly correlated. The other variable pairs have weak correlations.
The stacked density curves on the main diagonal show the potential difference in the distribution of each numeric variable between the attrition and no-attrition groups; in other words, they display the relationship between a numeric variable and the categorical response. These density curves do not completely overlap, indicating some association between each of these numeric variables and the binary response variable.
Mosaic plots are convenient for showing whether two categorical variables are dependent. In EDA, we are primarily interested in whether the response (binary in this case) is independent of the categorical variables. Categorical variables that are independent of the response should be excluded from any of the subsequent models and algorithms.
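A minimal sketch of such displays with base R's `mosaicplot()`, assuming the `attrition` data frame (the report's actual plotting code may differ):

```r
# Mosaic plots of Attrition against two categorical features.
par(mfrow = c(1, 2))
mosaicplot(BusinessTravel ~ Attrition, data = attrition, color = TRUE,
           main = "Attrition vs. BusinessTravel")
mosaicplot(Department ~ Attrition, data = attrition, color = TRUE,
           main = "Attrition vs. Department")
```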
The top two mosaic plots show that Attrition is not independent of BusinessTravel or Department, because the proportion of attrition cases differs across categories: employees who travel frequently have the highest attrition rate, while employees who do not travel have the lowest. The bottom two mosaic plots likewise show that Attrition is not independent of EducationField or JobRole, since the proportion of attrition cases again differs across categories.
The top two mosaic plots demonstrate associations between attrition and both OverTime and MaritalStatus: employees who are single and work substantial overtime show elevated attrition rates. The bottom two mosaic plots show that Attrition is not independent of age group or distance from home, because the proportion of attrition cases differs across categories; younger employees have a higher attrition rate than older employees, and longer commutes are associated with higher attrition. Gender, by contrast, appears to have little influence.
# Concluding Remarks
We will remove grp.age, grp.dist, and the binary outcome variable Attrition_num, which were created for the graphical analysis. We will also remove YearsAtCompany and TotalWorkingYears, as they are highly correlated with the other tenure variables (correlation coefficients > 0.75). The remaining feature variables are not correlated strongly enough to warrant removal, so we keep them in the final cleaned data set for further analysis.
The goal of unsupervised learning is to find the underlying structure of the data set and group that data according to similarities. Common algorithms used in unsupervised learning include clustering, anomaly detection, neural networks, and approaches for learning latent variable models.
The following types of unsupervised machine learning algorithms are commonly used in practice.
K-means Clustering: K-means clustering is a popular unsupervised machine learning technique used for grouping similar data points into clusters. It assigns data points to clusters based on their similarity to the mean of each cluster. K-means is iterative and aims to minimize the variance within each cluster. It is widely used in customer segmentation, image compression, and data preprocessing.
Hierarchical Clustering: Hierarchical clustering is another technique for grouping data points into clusters. Unlike K-means, hierarchical clustering creates a tree-like structure of nested clusters, which can be visualized as a dendrogram. It does not require the user to specify the number of clusters beforehand. Hierarchical clustering is useful for understanding the relationships between data points and is commonly used in biology, social sciences, and data exploration.
Anomaly Detection: Anomaly detection involves identifying data points that deviate significantly from the norm or expected behavior. It is widely used in fraud detection, network security, and quality control. Anomaly detection methods can be based on statistical techniques, machine learning, or a combination of both. Detecting anomalies helps in identifying rare events or outliers that might require special attention.
Principal Component Analysis (PCA): PCA is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while retaining as much of the original variance as possible. It achieves this by identifying orthogonal axes (principal components) that capture the most significant variability in the data. PCA is commonly used to reduce noise, visualize high-dimensional data, and improve the efficiency of machine learning algorithms by reducing multicollinearity.
These techniques play crucial roles in exploratory data analysis, pattern recognition, and feature engineering, contributing to the development of effective machine learning models and insights from complex data sets.
A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv. This data set has been pre-processed and feature engineered.
We use the k-means method to perform cluster analysis on the attrition data with the nine numerical feature variables, as sketched below.
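A minimal sketch of this step; the data frame name `cleaned` is our choice, and the nine numeric variables are inferred from the PCA loading table later in this report:

```r
# Read the pre-processed data from the URL above.
cleaned <- read.csv(
  "https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv")

# The nine numeric features (inferred from the PCA loadings below).
num.vars <- c("Age", "DistanceFromHome", "MonthlyIncome", "NumCompaniesWorked",
              "PercentSalaryHike", "WorkLifeBalance", "YearsInCurrentRole",
              "YearsSinceLastPromotion", "YearsWithCurrManager")

set.seed(123)                  # k-means uses random starting centers
km <- kmeans(scale(cleaned[, num.vars]), centers = 3, nstart = 25)
table(clust.ID = km$cluster)   # cluster sizes, as reported below
```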
## clust.ID
## 1 2 3
## 365 916 189
Since this clustering task involves nine numerical feature variables, we cannot create a 2D plot that shows the clustering performance with all nine original features. However, we can use the so-called PCA to create two new feature variables and then plot these new features to show the performance of the clustering algorithm.
One of the potential issues in clustering analysis is memory use: if the data set is too large, the distance computations can become expensive, and subsampling or a scalable variant of the algorithm may be needed.
Like clustering methods, dimension reduction seeks and explores the inherent structure in the data in an unsupervised manner, summarizing or describing the data using less information.
This can be useful to visualize high-dimensional data or to simplify data which can then be used in a supervised learning method. Many of these methods can be adapted for use in classification and regression. The following are the frequently used algorithms.
In this section, we introduce the most commonly used such method, PCA.
We reuse the Attrition data set from the clustering algorithms. It has nine numerical feature variables and no categorical variables; several of the numeric features, notably the tenure-related variables YearsWithCurrManager, YearsInCurrentRole, and YearsSinceLastPromotion, are correlated. We use PCA to see whether the number of feature variables can be reduced for subsequent modeling.
We want PCA to reduce the dimension from the nine numerical variables to a smaller number. The R function prcomp() fits the PCA and produces the factor loadings associated with the nine numerical variables.
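A minimal sketch of the fit; the object name `ar.pca` matches the one used later to extract scores, and `num.vars` is the nine-variable set defined earlier:

```r
# Fit PCA on the centered and scaled numeric features.
ar.pca <- prcomp(cleaned[, num.vars], center = TRUE, scale. = TRUE)
ar.pca$rotation   # factor loadings: one column per principal component
summary(ar.pca)   # standard deviations and variance proportions
```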
Next, we find the factor loading of the above fitted PCA. We can write an explicit system of linear transformation by using the loadings.
|  | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 |
|---|---|---|---|---|---|---|---|---|---|
| Age | -0.29 | 0.61 | -0.03 | 0.12 | -0.10 | 0.21 | -0.01 | -0.69 | -0.01 |
| DistanceFromHome | 0.01 | -0.01 | -0.70 | -0.42 | 0.48 | 0.32 | 0.01 | -0.03 | -0.02 |
| MonthlyIncome | -0.37 | 0.51 | 0.02 | 0.02 | -0.09 | 0.29 | -0.07 | 0.71 | 0.01 |
| NumCompaniesWorked | 0.07 | 0.47 | -0.07 | -0.14 | 0.33 | -0.80 | 0.03 | 0.08 | -0.01 |
| PercentSalaryHike | 0.06 | 0.02 | -0.55 | -0.14 | -0.78 | -0.22 | 0.05 | 0.04 | -0.02 |
| WorkLifeBalance | 0.00 | 0.05 | 0.44 | -0.87 | -0.17 | 0.06 | -0.06 | -0.08 | -0.01 |
| YearsInCurrentRole | -0.54 | -0.24 | -0.07 | -0.05 | 0.02 | -0.19 | -0.31 | -0.06 | 0.71 |
| YearsSinceLastPromotion | -0.44 | -0.15 | 0.03 | -0.07 | 0.03 | -0.08 | 0.88 | 0.00 | -0.02 |
| YearsWithCurrManager | -0.54 | -0.25 | -0.04 | -0.01 | 0.03 | -0.19 | -0.34 | -0.05 | -0.70 |
For example, let’s consider the first row (Age): The factor loading of -0.29 for PC1 suggests a moderately negative relationship between Age and PC1. A higher Age is associated with lower values of PC1. The factor loading of 0.61 for PC2 suggests a moderately positive relationship between Age and PC2. A higher Age is associated with higher values of PC2. The factor loadings close to 0 for other principal components indicate that Age has relatively weaker relationships with those components.
Similarly, let’s consider the second row (DistanceFromHome): The factor loading of 0.01 for PC1 indicates a very weak positive relationship between DistanceFromHome and PC1. The factor loading of -0.01 for PC2 indicates a very weak negative relationship between DistanceFromHome and PC2. The factor loading of -0.70 for PC3 indicates a relatively strong negative relationship between DistanceFromHome and PC3. Interpreting these factor loadings for all variables across principal components allows us to understand which original variables contribute most to each principal component and how they contribute to the overall variance captured by the PCs.
The object of PCA is to reduce the dimension without losing a significant amount of information. In PCA, we look at how much total variation is captured by each principal component. Most libraries capable of performing PCA automatically rank the principal components by the variation each one captures.
The following summary table gives the importance of the principal components.
|  | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 |
|---|---|---|---|---|---|---|---|---|---|
| Standard deviation | 1.505511 | 1.180874 | 1.018676 | 0.9981226 | 0.9937252 | 0.9272111 | 0.8250489 | 0.6922288 | 0.5458277 |
| Proportion of Variance | 0.251840 | 0.154940 | 0.115300 | 0.1106900 | 0.1097200 | 0.0955200 | 0.0756300 | 0.0532400 | 0.0331000 |
| Cumulative Proportion | 0.251840 | 0.406780 | 0.522080 | 0.6327800 | 0.7425000 | 0.8380200 | 0.9136500 | 0.9669000 | 1.0000000 |
The output is from the Principal Component Analysis (PCA) performed on our data. PCA is a technique for dimensionality reduction that transforms the original feature variables into a new set of uncorrelated variables called principal components (PCs). These PCs capture most of the variance in the original data, allowing us to represent the data with fewer variables while retaining as much information as possible.
Let’s break down the output and understand its components:
* Standard deviation: the standard deviation of each principal component, indicating the spread of the data along that component's axis. Larger standard deviations mean the component captures more variation in the data.
* Proportion of variance: the proportion of the total variance in the data explained by each principal component, quantifying the amount of information retained by each PC.
* Cumulative proportion: the cumulative proportion of variance explained up to each principal component, which helps in deciding how many PCs to retain.

Interpreting the output: PC1 has a relatively high standard deviation of 1.51, meaning it captures a substantial amount of variability, and it explains about 25.2% of the total variance. PC2 has a standard deviation of 1.18 and explains an additional 15.5% of the variance, bringing the cumulative proportion to 40.7%. PC3 captures 11.5% of the variance, with a standard deviation of 1.02 and a cumulative proportion of 52.2%. PC4 through PC9 contribute progressively smaller proportions of variance, with decreasing standard deviations; collectively they explain the remaining variance, for a cumulative proportion of 100%.

PC1 is the most important component: it has the largest standard deviation and explains the highest proportion of variance, representing the dominant pattern of variability in the data. Subsequent principal components capture decreasing amounts of variance and contribute less to explaining the overall variability.

A common approach to deciding how many principal components to retain is to choose the number that collectively explains a large portion (e.g., 95% or 99%) of the total variance. In this case, we could retain the first six principal components, which together explain about 84% of the total variance. The choice depends on the trade-off between reducing dimensionality and retaining information, and may also consider interpretability and the requirements of downstream analysis or modeling.
We can also make a scree plot as a visual tool to show the number of principal components to retain for future analysis.
Figure 18. Scree plot of PCA on the Attrition data
Note that the vertical axis in the above scree plot uses the variances of PCs. The standard deviation was used in the above summary table.
The predicted principal component scores are the values of the newly transformed variables. We can choose the first few principal components to use as new feature variables in relevant modeling. The command ar.pca$x extracts the PC scores from the PCA procedure. These scores are the values of the new transformed variables and can be used as response or predictor variables in statistical models. The following table shows the PC scores of the first 15 records.
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 |
|---|---|---|---|---|---|---|---|---|
| -0.2503729 | 0.4891398 | 0.3747896 | 3.6543569 | 0.8579011 | -0.7647964 | -1.2736396 | 0.0088989 | 0.0902827 |
| -1.0858270 | 0.5882272 | -1.2022389 | -0.7339714 | -1.4161095 | -0.5372264 | 0.3735791 | -0.8711734 | -0.0914351 |
| 3.3975430 | 0.9163099 | 0.9634498 | 0.1905301 | -0.4626099 | -0.1541245 | 0.4121812 | -0.7822869 | -0.0673956 |
| 0.8676736 | -0.2101990 | 1.3669685 | -0.0560005 | 0.8397869 | -0.1966703 | 1.2784114 | -0.3702883 | 1.8670205 |
| -0.2267333 | -1.0262492 | 1.3421464 | -0.0967361 | 0.5739030 | -1.1482797 | 0.4382696 | 0.3115776 | 0.0014997 |
| -0.7096627 | -2.2835630 | 0.7613344 | 1.4737681 | -0.5304995 | 1.1950815 | 0.4098705 | -0.4360873 | 0.0541166 |
| 2.8024440 | 2.1778603 | -0.6098430 | 1.2164423 | -1.2881520 | 0.1386260 | 0.5137954 | -1.6709849 | -0.0963894 |
| 3.5978736 | 0.5007554 | -1.5597572 | -1.0920457 | -0.6760869 | 0.3328040 | 0.4814611 | 0.0482851 | -0.1323815 |
| -1.3752266 | -0.9107851 | -1.4354599 | -0.8094795 | -1.5216529 | 2.1815365 | 0.2268772 | 0.2213691 | -0.0630726 |
| -0.9478920 | -0.1948945 | -1.1670399 | 0.0991997 | 1.5156007 | -0.1222437 | 0.5061640 | -0.0132037 | -0.0407171 |
| 0.2982608 | -1.8579992 | -0.1208496 | -0.3095551 | 0.1211938 | 2.0607686 | -1.4063438 | -1.0786582 | 0.0395563 |
| 0.1159991 | -1.9395323 | 0.1485927 | -0.3062092 | 0.3747998 | 2.1806035 | -1.5169336 | 0.0061617 | 0.0068637 |
| -0.2463278 | -1.0275712 | -1.7665721 | -0.1019653 | 0.6381276 | -0.5562854 | 0.6517848 | -0.1887846 | -0.0949720 |
| -0.5567213 | -2.1362485 | 0.2432414 | -0.4089012 | 0.8240233 | 2.1622582 | 0.3216970 | -0.9266780 | -0.0116863 |
| 0.9405464 | -1.1005522 | -0.7544006 | -1.0315756 | 1.1448150 | -0.5466799 | -1.2598201 | -0.4084133 | -0.0554879 |
Finally, we chose not to include a new variable derived from Principal Component Analysis (PCA) in the analysis due to the challenge of collapsing PC1 into a specific single feature variable. While PCA effectively reduces dimensionality by creating orthogonal components that capture the most variance in the data, translating these components back into meaningful original variables can be complex. PC1 is a linear combination of multiple original variables, making it difficult to attribute its influence to a single feature. As a result, introducing a new variable based on PC1 could lead to difficulties in interpretation and may not provide clear insights into the underlying relationships present in the data. Retaining the original variables preserves the interpretability of the features and their direct associations with the outcome or patterns, allowing for a more straightforward and actionable analysis.
We build and evaluate logistic regression, neural network, and decision tree models to predict employee attrition, providing insights into each model’s performance.
Employee attrition is a critical challenge faced by many organizations, especially in the fast-paced tech industry. High turnover rates can impact productivity, morale, and overall company performance. In this analysis, we will use logistic regression to predict employee attrition in a tech company based on various employee characteristics, job-related factors and also to assess the association between the binary response variable and other predictor variables.
In this study, we use the publicly available Employee Attrition data set described earlier. The practical question for this predictive modeling assignment is to determine the factors that impact employee attrition. We want to understand which variables are significant predictors of attrition and build a logistic regression model to predict whether an employee is likely to leave the company (attrition = 1) or not (attrition = 0). A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv. This data set has been pre-processed and feature engineered.
The response variable is Attrition: the status of whether an employee is likely to leave the company (attrition = 1) or not (attrition = 0). There are 26 variables (columns) in the data set:

* Age: employee age
* Attrition: whether the employee leaves the job
* BusinessTravel: the frequency of job travel
* Department: employee work department
* DistanceFromHome: distance traveled to work from home
* Education: employee education level (1 = Below College, 2 = College, 3 = Bachelor, 4 = Master, 5 = Doctor)
* EducationField: employee education field
* EnvironmentSatisfaction: numerical value for environment satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
* Gender: employee gender
* JobInvolvement: numerical value for job involvement (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
* JobLevel: numerical value for job level
* JobRole: employee job position
* JobSatisfaction: numerical value for job satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
* MaritalStatus: employee marital status
* MonthlyIncome: the amount of money the employee earns in one month, before taxes or deductions
* NumCompaniesWorked: number of companies worked at
* OverTime: whether the employee works overtime
* PercentSalaryHike: percent increase in salary
* PerformanceRating: numerical value for performance rating (1 = Low, 2 = Good, 3 = Excellent, 4 = Outstanding)
* RelationshipSatisfaction: numerical value for relationship satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
* StockOptionLevel: numerical value for stock options
* TrainingTimesLastYear: number of times the employee attended training last year
* WorkLifeBalance: numerical value for work-life balance (1 = Bad, 2 = Good, 3 = Better, 4 = Best)
* YearsInCurrentRole: number of years the employee has worked in their current job role
* YearsSinceLastPromotion: number of years since the last promotion
* YearsWithCurrManager: number of years the employee has worked with their current manager
We next read the data from the given URL directly into R. Since there are no records with missing values, we do not need to drop any records.
We again apply binary coding to the response (Yes = 1, No = 0), as in the earlier sketch.
We first build a logistic regression model that contains all predictor variables in the data set. This model is usually called the full model. Note that the response variable is the attrition status (1 = yes, 0 = no).
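A minimal sketch of the full-model fit, assuming Attrition_num is the 0/1 response and all remaining columns serve as predictors:

```r
# Full logistic regression model; if the original character Attrition
# column is still present, it must be dropped before using "~ .".
full.model <- glm(Attrition_num ~ ., family = binomial, data = cleaned)
summary(full.model)   # coefficient table, as shown below
```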
|  | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -10.1487504 | 395.7492662 | -0.0256444 | 0.9795410 |
| BusinessTravelTravel_Frequently | 1.8670294 | 0.4071329 | 4.5857985 | 0.0000045 |
| BusinessTravelTravel_Rarely | 0.9693814 | 0.3754112 | 2.5821858 | 0.0098177 |
| DepartmentResearch & Development | 12.6266608 | 395.7469457 | 0.0319059 | 0.9745471 |
| DepartmentSales | 12.4162910 | 395.7471849 | 0.0313743 | 0.9749710 |
| Education | 0.0019795 | 0.0872999 | 0.0226745 | 0.9819099 |
| EducationFieldLife Sciences | -0.7606727 | 0.7984502 | -0.9526864 | 0.3407490 |
| EducationFieldMarketing | -0.3297477 | 0.8468630 | -0.3893755 | 0.6969984 |
| EducationFieldMedical | -0.8515225 | 0.7981229 | -1.0669064 | 0.2860141 |
| EducationFieldOther | -0.8770665 | 0.8562571 | -1.0243028 | 0.3056923 |
| EducationFieldTechnical Degree | 0.1079078 | 0.8159464 | 0.1322486 | 0.8947877 |
| EnvironmentSatisfaction | -0.4199648 | 0.0821263 | -5.1136430 | 0.0000003 |
| GenderMale | 0.3950340 | 0.1830513 | 2.1580505 | 0.0309239 |
| JobInvolvement | -0.5328265 | 0.1205934 | -4.4183737 | 0.0000099 |
| JobLevel | -0.1389967 | 0.2977130 | -0.4668814 | 0.6405847 |
| JobRoleHuman Resources | 14.0089374 | 395.7471564 | 0.0353987 | 0.9717618 |
| JobRoleLaboratory Technician | 1.5317617 | 0.4816347 | 3.1803394 | 0.0014710 |
| JobRoleManager | 0.5978435 | 0.8727760 | 0.6849907 | 0.4933498 |
| JobRoleManufacturing Director | 0.2597355 | 0.5314577 | 0.4887228 | 0.6250380 |
| JobRoleResearch Director | -0.8767143 | 0.9659989 | -0.9075728 | 0.3641040 |
| JobRoleResearch Scientist | 0.6231180 | 0.4896045 | 1.2726965 | 0.2031258 |
| JobRoleSales Executive | 1.3223467 | 1.1132697 | 1.1878044 | 0.2349105 |
| JobRoleSales Representative | 2.2290891 | 1.1700504 | 1.9051223 | 0.0567642 |
| JobSatisfaction | -0.4031915 | 0.0803294 | -5.0192247 | 0.0000005 |
| MaritalStatusMarried | 0.3027285 | 0.2642453 | 1.1456342 | 0.2519466 |
| MaritalStatusSingle | 1.1604409 | 0.3416766 | 3.3963130 | 0.0006830 |
| MonthlyIncome | -0.0000032 | 0.0000798 | -0.0400431 | 0.9680587 |
| NumCompaniesWorked | 0.1630538 | 0.0370723 | 4.3982643 | 0.0000109 |
| OverTimeYes | 1.9343061 | 0.1909657 | 10.1290744 | 0.0000000 |
| PercentSalaryHike | -0.0210031 | 0.0387550 | -0.5419455 | 0.5878561 |
| PerformanceRating | 0.0679746 | 0.3946248 | 0.1722513 | 0.8632400 |
| RelationshipSatisfaction | -0.2456994 | 0.0819679 | -2.9975091 | 0.0027220 |
| StockOptionLevel | -0.1975184 | 0.1557253 | -1.2683770 | 0.2046633 |
| TrainingTimesLastYear | -0.1862392 | 0.0725335 | -2.5676316 | 0.0102396 |
| WorkLifeBalance | -0.3547433 | 0.1227487 | -2.8899968 | 0.0038525 |
| YearsInCurrentRole | -0.1186218 | 0.0419441 | -2.8280935 | 0.0046826 |
| YearsSinceLastPromotion | 0.2036475 | 0.0401511 | 5.0720267 | 0.0000004 |
| YearsWithCurrManager | -0.0953257 | 0.0418145 | -2.2797268 | 0.0226239 |
| Age | -0.0422346 | 0.0122225 | -3.4554899 | 0.0005493 |
| DistanceFromHome | 0.0446842 | 0.0106711 | 4.1873907 | 0.0000282 |
The p-values in the above significance-test table show that some feature variables have p-values greater than 0.05. We next search for the best model by dropping some of the insignificant predictor variables. Since there are many possible ways to drop variables, we use an automatic variable selection procedure to search for the final model, as sketched below.
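A minimal sketch using base R's `step()` with the AIC criterion; the report's actual selection procedure may differ:

```r
# Stepwise selection in both directions, starting from the full model.
final.model <- step(full.model, direction = "both", trace = 0)
summary(final.model)   # reduced-model coefficient table, as shown below
```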
|  | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | 1.7644674 | 1.1479245 | 1.5370936 | 0.1242704 |
| BusinessTravelTravel_Frequently | 1.8817648 | 0.4041877 | 4.6556709 | 0.0000032 |
| BusinessTravelTravel_Rarely | 0.9782062 | 0.3731283 | 2.6216350 | 0.0087509 |
| EducationFieldLife Sciences | -0.5979343 | 0.7398454 | -0.8081882 | 0.4189823 |
| EducationFieldMarketing | -0.1826602 | 0.7887458 | -0.2315831 | 0.8168618 |
| EducationFieldMedical | -0.7024340 | 0.7397245 | -0.9495886 | 0.3423213 |
| EducationFieldOther | -0.6971558 | 0.8066688 | -0.8642405 | 0.3874558 |
| EducationFieldTechnical Degree | 0.2724424 | 0.7627622 | 0.3571787 | 0.7209580 |
| EnvironmentSatisfaction | -0.4207754 | 0.0818043 | -5.1436854 | 0.0000003 |
| GenderMale | 0.3792481 | 0.1823123 | 2.0802117 | 0.0375061 |
| JobInvolvement | -0.5327656 | 0.1197715 | -4.4481841 | 0.0000087 |
| JobRoleHuman Resources | 1.6361497 | 0.6420829 | 2.5481906 | 0.0108283 |
| JobRoleLaboratory Technician | 1.6782868 | 0.4279142 | 3.9220171 | 0.0000878 |
| JobRoleManager | 0.1509916 | 0.6348814 | 0.2378265 | 0.8120157 |
| JobRoleManufacturing Director | 0.2443439 | 0.5288542 | 0.4620251 | 0.6440633 |
| JobRoleResearch Director | -1.0222376 | 0.8690644 | -1.1762507 | 0.2394947 |
| JobRoleResearch Scientist | 0.7856609 | 0.4318438 | 1.8193172 | 0.0688631 |
| JobRoleSales Executive | 1.1258424 | 0.4413488 | 2.5509130 | 0.0107441 |
| JobRoleSales Representative | 2.1559119 | 0.4959605 | 4.3469424 | 0.0000138 |
| JobSatisfaction | -0.4088564 | 0.0798811 | -5.1183104 | 0.0000003 |
| MaritalStatusMarried | 0.3945195 | 0.2543611 | 1.5510215 | 0.1208965 |
| MaritalStatusSingle | 1.4321239 | 0.2609096 | 5.4889662 | 0.0000000 |
| NumCompaniesWorked | 0.1627058 | 0.0368348 | 4.4171807 | 0.0000100 |
| OverTimeYes | 1.9293093 | 0.1901496 | 10.1462735 | 0.0000000 |
| RelationshipSatisfaction | -0.2308355 | 0.0810321 | -2.8486935 | 0.0043899 |
| TrainingTimesLastYear | -0.1844669 | 0.0724362 | -2.5466125 | 0.0108774 |
| WorkLifeBalance | -0.3637707 | 0.1227142 | -2.9643736 | 0.0030330 |
| YearsInCurrentRole | -0.1192218 | 0.0415277 | -2.8708974 | 0.0040931 |
| YearsSinceLastPromotion | 0.2032144 | 0.0391851 | 5.1860088 | 0.0000002 |
| YearsWithCurrManager | -0.1001219 | 0.0413984 | -2.4184959 | 0.0155848 |
| Age | -0.0459829 | 0.0116388 | -3.9508431 | 0.0000779 |
| DistanceFromHome | 0.0431754 | 0.0105214 | 4.1035687 | 0.0000407 |
*Interpretation: Association Analysis* The summary table contains four practically important variables: YearsSinceLastPromotion, RelationshipSatisfaction, WorkLifeBalance, and NumCompaniesWorked. YearsSinceLastPromotion achieves high statistical significance (p-value \(\approx\) 0), RelationshipSatisfaction achieves high statistical significance (p-value \(\approx\) 0.0044), WorkLifeBalance achieves high statistical significance (p-value \(\approx\) 0.003), and NumCompaniesWorked also achieves high significance (p-value \(\approx\) 0.00001). YearsSinceLastPromotion and NumCompaniesWorked are positively associated with the response variable, whereas RelationshipSatisfaction and WorkLifeBalance are negatively associated with it.
Here’s a brief interpretation of the significant tests:
* BusinessTravel: Travel_Frequently has a positive coefficient (1.88) and is significant (p < 0.001), indicating that employees who travel frequently are more likely to leave. Travel_Rarely also has a positive coefficient (0.98) and is significant (p = 0.009), suggesting that employees who travel rarely are likewise more likely to leave than those who do not travel.
* EducationField: none of the specific education fields is a significant predictor of attrition, though most show a negative association.
* EnvironmentSatisfaction: a negative coefficient (-0.42), highly significant (p < 0.001), suggests that lower environment satisfaction is associated with higher attrition.
* Gender: male employees have a positive coefficient (0.38) that is marginally significant (p = 0.038), indicating that male employees may be slightly more likely to leave.
* JobInvolvement: a negative coefficient (-0.53), highly significant (p < 0.001), suggests that lower job involvement is associated with higher attrition.
* JobRole: several job roles have significant associations with attrition; for example, Laboratory Technicians, Sales Representatives, and Human Resources have positive coefficients and are significant predictors of attrition.
* JobSatisfaction: a negative coefficient (-0.41), highly significant (p < 0.001), suggests that lower job satisfaction is associated with higher attrition.
* MaritalStatus: single employees have a positive coefficient (1.43), highly significant (p < 0.001), indicating that single employees are more likely to leave.
* NumCompaniesWorked: a positive coefficient (0.16), highly significant (p < 0.001), suggests that having worked at more companies is associated with higher attrition.
* OverTime: employees who work overtime (OverTimeYes) have a positive coefficient (1.93), highly significant (p < 0.001), indicating that they are more likely to leave.
* TrainingTimesLastYear: a negative coefficient (-0.18), significant (p = 0.011), indicates that fewer training sessions last year are associated with higher attrition.
* YearsInCurrentRole: a negative coefficient (-0.12), significant (p = 0.004), suggests that fewer years in the current role are associated with higher attrition.
* YearsSinceLastPromotion: a positive coefficient (0.20), highly significant (p < 0.001), suggests that more years since the last promotion are associated with higher attrition.
* YearsWithCurrManager: a negative coefficient (-0.10), significant (p = 0.016), suggests that fewer years with the current manager are associated with higher attrition.
* Age: a negative coefficient (-0.05), highly significant (p < 0.001), suggests that older age is associated with lower attrition.
* DistanceFromHome: a positive coefficient (0.04), highly significant (p < 0.001), suggests that greater distance from home is associated with higher attrition.

These results provide insights into how the feature variables are associated with employee attrition. BusinessTravel, EnvironmentSatisfaction, JobInvolvement, JobRole, JobSatisfaction, MaritalStatus, NumCompaniesWorked, OverTime, RelationshipSatisfaction, TrainingTimesLastYear, WorkLifeBalance, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager, Age, and DistanceFromHome all appear to have significant associations with attrition.
| BusinessTravel | EnvironmentSatisfaction | Gender | JobInvolvement | JobRole | JobSatisfaction | MaritalStatus | NumCompaniesWorked | OverTime | RelationshipSatisfaction | StockOptionLevel | TrainingTimesLastYear | WorkLifeBalance | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | EducationField | Age | DistanceFromHome | Pred.Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Travel_Rarely | 3 | Male | 1 | Healthcare Representative | 3 | Single | 3 | Yes | 1 | 0 | 2 | 2 | 3 | 1 | 2 | Human Resources | 25 | 14 | 1 |
| Travel_Frequently | 2 | Female | 3 | Manager | 4 | Married | 4 | No | 3 | 1 | 1 | 2 | 8 | 3 | 2 | Technical Degree | 35 | 5 | 0 |
The predicted attrition status for the two hypothetical employees is attached to the two new data records above.
The “Predicted Response” column indicates the predicted outcome of employee attrition based on the input values for the predictor variables and the logistic regression model. A value of 1 indicates that the model predicts attrition (Yes), and a value of 0 indicates that the model predicts no attrition (No). These predictions are based on the relationships and coefficients learned by the logistic regression model from the training data. It’s important to note that these predictions are based on the specific characteristics we have provided for each individual, and they may change if the input values are altered.
In the realm of predictive modeling, logistic regression stands as a foundational technique for tackling binary classification problems. It allows us to analyze the relationship between a set of predictor variables and a binary outcome, making it particularly well-suited for scenarios like employee attrition prediction. However, the journey doesn’t end with model construction; we need robust mechanisms to validate our model’s performance and ensure its generalizability. This exploration delves into the crucial steps of using logistic regression, employing training, validating and test datasets, harnessing the power of cross-validation, and evaluating model performance through the precision, recall, F1-score, and Receiver Operating Characteristic (ROC) curve. This multi-faceted approach enables us to not only build a predictive model but also rigorously assess its accuracy, sensitivity, and overall effectiveness.
* Training dataset: the training dataset serves as the foundation for teaching the logistic regression model. It contains labeled examples from which the model learns the underlying patterns and relationships between the predictor variables and the binary response. During training, the logistic regression algorithm adjusts its parameters, specifically the coefficients, to minimize the error between predicted probabilities and actual outcomes.
* Validation dataset: incorporating a validation dataset is crucial for assessing the model's performance and preventing overfitting. This separate dataset is not used during training but during the model development process: by evaluating the model on it, one can fine-tune hyperparameters and make decisions about model complexity, selecting the version of the model that generalizes best to unseen data.
* Test dataset: the test dataset is entirely independent of both the training and validation datasets. It serves as a final benchmark, providing an unbiased estimate of how well the trained model will perform when deployed in a real-world scenario.
First, we split the dataset into two distinct subsets: a training/validation dataset and a test dataset. Since the sample size is large, we use a 70%:30% split, with 70% of the data for training and validating models and 30% for testing. The labels (values of the attrition status) of the test data are removed when calculating the accuracy measures. A sketch of this split follows.
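A minimal sketch of the 70%/30% split:

```r
# Random 70/30 split of the cleaned data.
set.seed(123)
n <- nrow(cleaned)
train.id <- sample(1:n, size = round(0.7 * n))
train <- cleaned[train.id, ]    # 70%: training and cross-validation
test  <- cleaned[-train.id, ]   # 30%: held-out testing
```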
However, a single split of the dataset may not provide a comprehensive assessment of our model’s robustness. This is where cross-validation steps in. By employing techniques like k-fold cross-validation, we repeatedly divide the dataset into subsets, training the model on a combination of these subsets and validating it on the remaining data. This iterative process yields a more reliable estimate of the model’s performance and guards against potential overfitting.
Cross-Validation: Cross-validation is a methodology that enhances the validation process by mitigating potential biases in model evaluation. K-fold cross-validation, a popular technique, involves partitioning the dataset into K subsets. The model is trained K times, each time using K-1 subsets for training and one subset for validation. This process ensures a comprehensive assessment of the model's performance and reduces the risk of over-optimistic evaluations. We define a sequence of 20 candidate cut-off probabilities and then use 5-fold cross-validation to identify the optimal cut-off probability for the final model, as sketched below.
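A minimal sketch of this grid search, assuming the `train` split and full-model formula from earlier; the candidate grid below is an illustrative choice:

```r
# 20 candidate cut-offs evaluated by 5-fold CV on the training data.
cut.offs <- seq(0.01, 0.99, length.out = 20)
k <- 5
set.seed(123)
fold.id <- sample(rep(1:k, length.out = nrow(train)))
cv.acc  <- matrix(NA, nrow = k, ncol = length(cut.offs))

for (i in 1:k) {
  cv.train <- train[fold.id != i, ]
  cv.valid <- train[fold.id == i, ]
  fit   <- glm(Attrition_num ~ ., family = binomial, data = cv.train)
  p.hat <- predict(fit, newdata = cv.valid, type = "response")
  for (j in seq_along(cut.offs)) {
    # Accuracy at this threshold on the held-out fold.
    cv.acc[i, j] <- mean((p.hat > cut.offs[j]) == cv.valid$Attrition_num)
  }
}
plot(cut.offs, colMeans(cv.acc), type = "b",
     xlab = "cut-off probability", ylab = "CV accuracy")
cut.offs[which.max(colMeans(cv.acc))]   # the optimal cut-off
```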
Figure 7. 5-fold CV performance plot
The above figure indicates that the optimal cut-off probability that yields the best accuracy is 0.57.
This subsection reports the performance of the model on the test data set. Note that the model is fit to the training data to obtain the regression coefficients, and the holdout testing sample is then used to compute the accuracy.
| test.accuracy |
|---|
| 0.8843537 |
The accuracy is about 88%. This accuracy indicates that there is no under-fitting: the predictive model correctly classified roughly 88% of the instances in the test dataset. In other words, of all the instances (data points) in the test dataset, approximately 88% were correctly predicted by the model as either having attrition or not having attrition, based on the chosen threshold.
In conclusion, this study aimed to predict employee attrition using logistic regression and leveraged a robust 5-fold cross-validation approach to assess the model’s accuracy. The use of cross-validation allowed us to effectively evaluate the model’s performance on multiple subsets of the training data, enhancing its generalizability and reducing the risk of overfitting. The logistic regression model demonstrated promising results, achieving an accuracy of 87.75% in predicting employee attrition. This suggests that the selected features and the logistic regression algorithm have the potential to be valuable tools in identifying employees who might be at risk of attrition. However, it’s important to note that accuracy alone might not provide a complete picture of the model’s performance, and further evaluation using other metrics, such as precision, recall, F1-score, and ROC curve, could provide additional insights into the model’s strengths and weaknesses.
Now we will calculate the local and global performance metrics for the logistic predictive model. We used the confusion matrix in the case study in the previous note. Here we use the optimal cut-off probability as the decision threshold to define a confusion matrix and then define the performance measures based on this matrix. We use the training and testing data sets created for the previous model, taking the optimal cut-off probability to be the one obtained through cross-validation. The testing data set is used to report the local and global performance measures.
Since we have identified the optimal cut-off probability to be 0.57, we next use the testing data set to report the local measures.
| sensitivity | specificity | precision | recall | F1 |
|---|---|---|---|---|
| 0.3134328 | 0.986631 | 0.8076923 | 0.3134328 | 0.4516129 |
* Sensitivity (true positive rate): sensitivity, also known as recall, measures the proportion of actual positive cases correctly identified by the model. Here it is 0.3134, meaning the model correctly identified approximately 31.34% of the actual positive (attrition) cases in the dataset.
* Specificity (true negative rate): specificity measures the proportion of actual negative cases correctly identified by the model. Here it is 0.9866, indicating the model correctly identified approximately 98.66% of the actual negative (non-attrition) cases.
* Precision: precision is the ratio of true positive predictions to all predicted positive cases. Here it is 0.8077, meaning that about 80.77% of the cases the model predicted as attrition were actual attrition cases.
* Recall: recall is another term for sensitivity, as explained above; it is likewise 0.3134.
* F1 score: the F1 score is the harmonic mean of precision and recall, providing a balance between the two, especially when there is an imbalance between positive and negative cases. Here it is 0.4516.
These metrics collectively provide insights into how well the model is performing in terms of correctly classifying attrition and non-attrition cases. A higher sensitivity is desired when correctly identifying positive cases is crucial, whereas a higher specificity is important when correctly identifying negative cases is a priority. Precision and Recall help evaluate the trade-off between true positives and false positives, while the F1 Score provides a balanced measure between Precision and Recall.
In order to create an ROC curve, we need to select a sequence of decision thresholds and calculate the corresponding sensitivity and specificity. CAUTION: since ROC and AUC are used for model selection, we use the training data to construct the ROC curve and calculate the AUC. A sketch follows.
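A minimal sketch using the pROC package, one of several R options for ROC/AUC:

```r
# ROC and AUC on the training data, per the caution above.
library(pROC)
fit   <- glm(Attrition_num ~ ., family = binomial, data = train)
p.hat <- predict(fit, type = "response")
roc.obj <- roc(response = train$Attrition_num, predictor = p.hat)
plot(roc.obj)    # ROC curve: sensitivity vs. specificity trade-off
auc(roc.obj)     # area under the curve
```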
Performance Metrics and ROC Curve: To quantitatively measure the model's performance, various metrics come into play. The ROC (Receiver Operating Characteristic) curve is a graphical representation of a model's ability to distinguish between the classes, showing the trade-off between sensitivity and specificity. The area under the ROC curve (AUC) quantifies the model's discriminatory power, where a higher AUC indicates better performance. In this exploration, we delve into the synergy of logistic regression, training, validation, and test datasets, cross-validation techniques, and the visualization of model performance through ROC curves. By combining these elements, we aim to build robust and accurate logistic regression models that effectively predict binary outcomes in various domains.
An AUC (Area Under the Curve) value of 0.8151 indicates that our logistic regression model performs reasonably well in distinguishing between the two classes (attrition and non-attrition). The AUC ranges from 0 to 1, where a higher value indicates better predictive performance; a value of 0.8151 suggests the model has a good ability to discriminate between employees who will leave and those who will not. When interpreting the AUC value:

* AUC = 0.5 indicates random performance (no discriminative power).
* AUC > 0.5 and < 0.7 suggests a poor to fair discriminative ability.
* AUC > 0.7 and < 0.9 indicates a good to excellent discriminative ability.
* AUC > 0.9 suggests outstanding discriminative ability.

Keep in mind that the AUC is just one measure of the model's performance. It is important to consider other metrics such as sensitivity, specificity, precision, and F1-score, as well as domain knowledge, when evaluating and interpreting the results of our logistic regression model.
*Concluding Remarks* In conclusion, the combination of logistic regression, training and test datasets, cross-validation, and ROC analysis forms a comprehensive framework for predictive modeling. By embracing these techniques, we empower ourselves not only to build accurate models but also to validate their effectiveness and make informed decisions based on their predictive capabilities. This multifaceted approach is a cornerstone of modern data science, enabling us to unlock insights and drive informed actions from complex datasets.
A neural network is a type of machine learning algorithm that is designed to recognize patterns in data. It’s inspired by the structure and function of the human brain, where interconnected neurons work together to process and transmit information. Neural networks consist of layers of interconnected nodes (neurons) that process input data and produce output. These layers are typically organized into three main types:
* Input layer: receives the raw input data and passes it on to the next layer for processing.
* Hidden layers: one or more layers between the input and output layers. Each neuron in a hidden layer processes the information it receives from the previous layer and passes its output to the next layer; the hidden layers are responsible for learning complex patterns in the data.
* Output layer: produces the final prediction or output based on the information processed by the hidden layers. The structure of the output layer depends on the type of problem being solved, such as classification or regression.
Neural networks “learn” from data through a process called training. During training, the network adjusts the connections between neurons (weights) based on the input data and the desired output. This adjustment is done iteratively using optimization algorithms that minimize the difference between the predicted output and the actual target. Neural networks have gained significant popularity due to their ability to learn from complex and high-dimensional data, making them suitable for tasks such as image and speech recognition, language translation, playing games, autonomous driving, and more. They can capture intricate relationships in data that may be difficult for traditional algorithms to discover. While neural networks can achieve remarkable accuracy, they also come with challenges. They require large amounts of data for training, and determining the right architecture (number of layers and neurons) and optimization techniques can be complex. Deep learning, a subset of neural networks, involves using multiple layers to create complex models and has been a driving force behind many recent breakthroughs in AI.
In summary, neural networks are a powerful tool in the field of machine learning, capable of learning and representing complex patterns in data. They have enabled significant advancements in various domains and continue to be a focus of research and development.
The neuralnet() function requires all features to be in numeric form (dummy variables for categorical features, normalization of numerical features). The model formula in neuralnet() requires dummy variables to be explicitly defined, and it is highly recommended to scale all numerical features before including them in the network model. The objective is to find all feature names (numeric and all dummy variables) and write them in the model formula like the one in glm: response ~ var_1 + var_2 + ... + var_k. To explain the modeling process in detail, we outline the major steps in the following subsections.
There are different types of scaling and standardization. The one we use in the following is min-max scaling:

\[ \text{scaled.var} = \frac{\text{orig.var} - \min(\text{orig.var})}{\max(\text{orig.var}) - \min(\text{orig.var})} \]

The scaled numeric feature is unitless (similar to the well-known z-score transformation). A small helper implementing this follows.
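A minimal helper implementing the transformation above:

```r
# Min-max scaling exactly as in the formula above.
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply to every numeric column (sketch; the report may scale a subset).
num.cols <- sapply(cleaned, is.numeric)
scaled.data <- cleaned
scaled.data[num.cols] <- lapply(cleaned[num.cols], minmax)
```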
In practical applications, there may be many categorical features in the model, and each could have many categories. It is practically infeasible to write all resulting dummy features explicitly. We can use an R function to extract from a model formula the variables that will be used in a model. All categorical feature variables must be defined in non-numerical form (i.e., they should not be numerically encoded). We can also use the R function relevel() to change the baseline of an unordered categorical feature variable.
Next, we use the R function model.matrix() to extract the names of all feature variables, including the implicitly defined dummy feature variables.
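A minimal sketch of this extraction, assuming the scaled data frame from the previous step:

```r
# Build the design matrix: one column per numeric feature and per dummy
# level, plus an intercept; then read off its column names.
X <- model.matrix(~ ., data = scaled.data)
colnames(X)
```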
## [1] "(Intercept)" "Age"
## [3] "AttritionYes" "BusinessTravelTravel_Frequently"
## [5] "BusinessTravelTravel_Non" "DepartmentHuman Resources"
## [7] "DepartmentResearch & Development" "DistanceFromHome"
## [9] "Education" "EducationFieldHuman Resources"
## [11] "EducationFieldLife Sciences" "EducationFieldMarketing"
## [13] "EducationFieldMedical" "EducationFieldOther"
## [15] "EnvironmentSatisfaction" "GenderFemale"
## [17] "JobInvolvement" "JobLevel"
## [19] "JobRoleHealthcare Representative" "JobRoleHuman Resources"
## [21] "JobRoleLaboratory Technician" "JobRoleManager"
## [23] "JobRoleManufacturing Director" "JobRoleResearch Scientist"
## [25] "JobRoleSales Executive" "JobRoleSales Representative"
## [27] "JobSatisfaction" "MaritalStatusDivorced"
## [29] "MaritalStatusMarried" "MonthlyIncome"
## [31] "NumCompaniesWorked" "OverTimeYes"
## [33] "PercentSalaryHike" "PerformanceRating"
## [35] "RelationshipSatisfaction" "StockOptionLevel"
## [37] "TrainingTimesLastYear" "WorkLifeBalance"
## [39] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [41] "YearsWithCurrManager"
The dummy feature variables above have naming issues for network modeling (although the names are fine for linear and generalized linear regression models): we need to rename them to exclude special characters such as spaces and “&” before building neural network models. These issues can be avoided at the feature-engineering stage if neural network models are planned from the start. Next, we clean up the variable names before defining the network model formula.
For convenience, we encourage you to use CamelCase notation (capitalizing the first letter of each word and omitting spaces) when naming feature variables.
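A minimal sketch of the clean-up and formula construction is given below; the exact renaming used for this document may differ (some names in the formula below were evidently also shortened, e.g. TravelFreq), so treat the pattern, not the exact output, as the point.

```r
# Strip special characters (spaces, '&', '_') so the names are legal
# in a neuralnet() model formula.
var.names <- gsub("[^[:alnum:]]", "", colnames(X)[-1])  # drop "(Intercept)"
colnames(X)[-1] <- var.names

# Assemble the formula programmatically, exactly as we would for glm().
rhs <- setdiff(var.names, "AttritionYes")
nn.formula <- as.formula(paste("AttritionYes ~", paste(rhs, collapse = " + ")))
nn.formula
```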
## AttritionYes ~ Age + BusinessTravelTravelFreq + BusinessTravelTravelNon +
## DepartmentHumanResources + DepartmentResearchDevelopment +
## DistanceFromHome + Education + EducationFieldHumanResources +
## EducationFieldLifeSciences + EducationFieldMarketing + EducationFieldMedical +
## EducationFieldOther + EnvironmentSatisfaction + GenderFemale +
## JobInvolvement + JobLevel + JobRoleHealthcareRepresentative +
## JobRoleHumanResources + JobRoleLaboratorytechnician + JobRoleManager +
## JobRoleManufacturingDirector + JobRoleResearchScientist +
## JobRoleSalesExecutive + JobRoleSalesRepresentative + JobSatisfaction +
## MaritalStatusDivorced + MaritalStatusMarried + MonthlyIncome +
## NumCompaniesWorked + OverTimeYes + PercentSalaryHike + PerformanceRating +
## RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear +
## WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion +
## YearsWithCurrManager
We follow the routine steps for building a neural network model to predict Attrition.
We split the data into 70% for training the neural network and 30% for testing.
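A minimal sketch of the split and the network fit (a single hidden neuron, matching Figure 12 below) follows; the seed and object names are assumptions.

```r
library(neuralnet)

dat <- as.data.frame(X[, -1])            # cleaned features, intercept dropped

set.seed(123)                            # assumed seed, for reproducibility
train.id <- sample(nrow(dat), round(0.7 * nrow(dat)))
train <- dat[train.id, ]
test  <- dat[-train.id, ]

# One hidden layer with a single neuron; logistic output for classification.
nn.model <- neuralnet(nn.formula, data = train,
                      hidden = 1, linear.output = FALSE)
```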
| Quantity | Value |
|---|---|
| error | 12.4130373 |
| reached.threshold | 0.0097831 |
| steps | 4121.0000000 |
| Intercept.to.1layhid1 | 4.7283796 |
| Age.to.1layhid1 | -25.2700187 |
| BusinessTravelTravelFreq.to.1layhid1 | 29.9256578 |
| BusinessTravelTravelNon.to.1layhid1 | -0.1415772 |
| DepartmentHumanResources.to.1layhid1 | -1.5439061 |
| DepartmentResearchDevelopment.to.1layhid1 | -0.7428586 |
| DistanceFromHome.to.1layhid1 | 13.9251898 |
| Education.to.1layhid1 | 0.6244195 |
| EducationFieldHumanResources.to.1layhid1 | 21.3871291 |
| EducationFieldLifeSciences.to.1layhid1 | 10.6746956 |
| EducationFieldMarketing.to.1layhid1 | 16.6426970 |
| EducationFieldMedical.to.1layhid1 | 3.2402965 |
| EducationFieldOther.to.1layhid1 | 10.8804176 |
| EnvironmentSatisfaction.to.1layhid1 | -33.3847948 |
| GenderFemale.to.1layhid1 | -9.9868604 |
| JobInvolvement.to.1layhid1 | -9.3427068 |
| JobLevel.to.1layhid1 | -3.6680898 |
| JobRoleHealthcareRepresentative.to.1layhid1 | -409.4723540 |
| JobRoleHumanResources.to.1layhid1 | -0.4357436 |
| JobRoleLaboratorytechnician.to.1layhid1 | 16.5525537 |
| JobRoleManager.to.1layhid1 | -409.4229911 |
| JobRoleManufacturingDirector.to.1layhid1 | -15.1456200 |
| JobRoleResearchScientist.to.1layhid1 | 6.8461683 |
| JobRoleSalesExecutive.to.1layhid1 | 6.4702694 |
| JobRoleSalesRepresentative.to.1layhid1 | 22.7881528 |
| JobSatisfaction.to.1layhid1 | -2.6865442 |
| MaritalStatusDivorced.to.1layhid1 | -20.3574510 |
| MaritalStatusMarried.to.1layhid1 | -21.5059449 |
| MonthlyIncome.to.1layhid1 | -12.7027070 |
| NumCompaniesWorked.to.1layhid1 | 7.1220929 |
| OverTimeYes.to.1layhid1 | 41.7518850 |
| PercentSalaryHike.to.1layhid1 | 22.2213500 |
| PerformanceRating.to.1layhid1 | -8.6784197 |
| RelationshipSatisfaction.to.1layhid1 | -9.0012367 |
| StockOptionLevel.to.1layhid1 | 3.9173498 |
| TrainingTimesLastYear.to.1layhid1 | -11.3360298 |
| WorkLifeBalance.to.1layhid1 | 9.3880611 |
| YearsInCurrentRole.to.1layhid1 | -30.2971577 |
| YearsSinceLastPromotion.to.1layhid1 | -2.3969326 |
| YearsWithCurrManager.to.1layhid1 | -2.7083013 |
| Intercept.to.AttritionYes | 0.0543155 |
| 1layhid1.to.AttritionYes | 0.8470791 |
Figure 12. Single-layer backpropagation neural network model for employee attrition.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.3545 | 1.608 | 0.2204 | 0.8256 |
| Age | -1.502 | 0.5208 | -2.884 | 0.003921 |
| BusinessTravelTravel_Frequently | 0.8802 | 0.2047 | 4.299 | 1.713e-05 |
| DepartmentHuman Resources | -12.63 | 446 | -0.02832 | 0.9774 |
| DepartmentResearch & Development | 0.2472 | 1.113 | 0.222 | 0.8243 |
| DistanceFromHome | 1.196 | 0.3055 | 3.913 | 9.099e-05 |
| Education | 0.09759 | 0.3507 | 0.2783 | 0.7808 |
| EducationFieldHuman Resources | -0.02841 | 0.8175 | -0.03475 | 0.9723 |
| EducationFieldLife Sciences | -0.8385 | 0.304 | -2.758 | 0.005809 |
| EducationFieldMarketing | -0.4895 | 0.3912 | -1.251 | 0.2109 |
| EducationFieldMedical | -0.903 | 0.3136 | -2.879 | 0.003989 |
| EducationFieldOther | -1.133 | 0.4889 | -2.318 | 0.02043 |
| EnvironmentSatisfaction | -1.07 | 0.2452 | -4.363 | 1.282e-05 |
| GenderFemale | -0.3189 | 0.1842 | -1.731 | 0.08339 |
| JobInvolvement | -1.677 | 0.3706 | -4.525 | 6.042e-06 |
| JobLevel | -0.6883 | 1.183 | -0.5818 | 0.5607 |
| JobRoleHealthcare Representative | 1.016 | 0.9542 | 1.065 | 0.287 |
| JobRoleHuman Resources | 15.17 | 446 | 0.03402 | 0.9729 |
| JobRoleLaboratory Technician | 2.614 | 0.9983 | 2.619 | 0.00883 |
| JobRoleManager | 1.406 | 1.07 | 1.315 | 0.1886 |
| JobRoleManufacturing Director | 1.203 | 0.9399 | 1.28 | 0.2006 |
| JobRoleResearch Scientist | 1.699 | 1.001 | 1.698 | 0.08956 |
| JobRoleSales Executive | 2.377 | 1.422 | 1.672 | 0.0946 |
| JobRoleSales Representative | 3.43 | 1.51 | 2.272 | 0.02311 |
| JobSatisfaction | -1.041 | 0.2406 | -4.329 | 1.501e-05 |
| MaritalStatusDivorced | -1.014 | 0.3419 | -2.967 | 0.003009 |
| MaritalStatusMarried | -0.7231 | 0.248 | -2.916 | 0.003546 |
| MonthlyIncome | 0.7883 | 1.503 | 0.5246 | 0.5999 |
| NumCompaniesWorked | 1.306 | 0.3418 | 3.822 | 0.0001326 |
| OverTimeYes | 1.82 | 0.1904 | 9.558 | 1.197e-21 |
| PercentSalaryHike | -0.3553 | 0.5425 | -0.6549 | 0.5126 |
| PerformanceRating | 0.02365 | 0.396 | 0.05972 | 0.9524 |
| RelationshipSatisfaction | -0.6659 | 0.2483 | -2.681 | 0.00733 |
| StockOptionLevel | -0.6092 | 0.4656 | -1.309 | 0.1907 |
| TrainingTimesLastYear | -0.9523 | 0.4452 | -2.139 | 0.03244 |
| WorkLifeBalance | -0.9719 | 0.3725 | -2.609 | 0.009082 |
| YearsInCurrentRole | -1.111 | 0.7599 | -1.462 | 0.1438 |
| YearsSinceLastPromotion | -0.4548 | 0.7042 | -0.6458 | 0.5184 |
Cross-validation is primarily used for tuning hyperparameters. For example, with the sigmoid perceptron, the optimal cut-off score for the binary decision can be obtained through cross-validation. Another important hyperparameter in neural network models is the learning rate \(\alpha\) in the backpropagation algorithm, which controls the learning speed when training the network.
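As a sketch of how the cut-off can be tuned by cross-validation (reusing `train`, `rhs`, and `nn.formula` from the sketches above; the fold count and grid are assumptions):

```r
# 5-fold cross-validation over a grid of candidate cut-off scores.
k <- 5
cuts <- seq(0.1, 0.9, by = 0.05)
fold <- sample(rep(1:k, length.out = nrow(train)))
acc <- matrix(0, k, length(cuts))

for (i in 1:k) {
  fit  <- neuralnet(nn.formula, data = train[fold != i, ],
                    hidden = 1, linear.output = FALSE)
  pred <- compute(fit, train[fold == i, rhs])$net.result
  truth <- train$AttritionYes[fold == i]
  for (j in seq_along(cuts)) acc[i, j] <- mean((pred > cuts[j]) == truth)
}
optimal.cut <- cuts[which.max(colMeans(acc))]  # cut-off with best CV accuracy
```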
## $confusion.matrix
##
## 0 1
## FALSE 699 71
## TRUE 76 78
##
## $accuracy
## [1] 0.8409091
Recall that the ROC curve plots sensitivity against (1 - specificity), calculated from confusion matrices over a sequence of selected cut-off scores. In terms of the confusion matrix above, sensitivity = TP / (TP + FN) is the true positive rate and specificity = TN / (TN + FP) is the true negative rate. Next, we construct the ROC curve for the above NN model based on the training data set.
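The curve below can be produced, for example, with the pROC package (our choice for this sketch, not necessarily the one used to render the figure):

```r
library(pROC)

# Predicted attrition probabilities on the training data.
pred.train <- compute(nn.model, train[, rhs])$net.result

roc.nn <- roc(train$AttritionYes, as.vector(pred.train))
plot(roc.nn)        # plot the ROC curve (cf. Figure 13)
auc(roc.nn)
```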
Figure 13. ROC curve of the neural network model.
The above ROC curve indicates that the underlying neural network is better than a random guess, since the area under the curve is significantly greater than 0.5. In general, if the area under the ROC curve is greater than 0.65, we say the predictive power of the underlying model is acceptable.

Here is a comparison of the two AUC values obtained so far:

* Logistic model AUC: 0.8151
* Neural network model AUC: 0.8337

AUC ranges between 0 and 1, and a higher value indicates better discrimination between the classes; any value above 0.5 means the model performs better than random chance, a bar both models clear comfortably. The neural network model has a slightly higher AUC (0.8337) than the logistic model (0.8151), suggesting that it is somewhat better at distinguishing between the positive and negative classes under this evaluation metric.
Decision tree algorithms are a class of machine learning algorithms used for both classification and regression tasks. They work by recursively splitting the dataset into subsets based on the values of input features, ultimately creating a tree-like structure of decisions that leads to a predicted output. In the context of our employee Attrition dataset, decision tree algorithms can be appropriate for the following reasons:
* Interpretability: Decision trees provide transparent, easy-to-understand rules that map input features to predicted outcomes. This is valuable for understanding the factors that contribute to employee attrition.
* Feature importance: Decision trees can rank features by their importance in the decision-making process, helping identify which factors have the most significant impact on attrition.
* Non-linearity: Decision trees can capture complex relationships and interactions between features, which may be present in attrition scenarios.
* Handling missing values: Decision trees handle missing values robustly by considering the available data at each split.
* No distributional assumptions: Decision trees do not assume any specific data distribution, making them suitable for various types of data.
The Decision Tree (DT) algorithm is based on conditional probabilities. Unlike the other classification algorithms, decision trees generate rules. A rule is a conditional statement that can easily be understood by humans and easily used within a database to identify a set of records. It is easy to interpret and implement in real-world applications. Among several basic tree-based algorithms, Classification and Regression Tree (CART) is most frequently used in practice.
This is a new class of models, different from the logistic and neural network models above. We first load the analytic data set.
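A minimal loading sketch, using the public copy of the data referenced at the beginning of this document and re-splitting the data for the tree models (object names and the seed are assumptions):

```r
url <- "https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv"
attrition <- read.csv(url, stringsAsFactors = TRUE)

# Drop the columns flagged for removal during feature engineering.
drop.cols <- c("EmployeeNumber", "DailyRate", "HourlyRate", "MonthlyRate",
               "EmployeeCount", "Over18", "StandardHours")
attrition <- attrition[, !(names(attrition) %in% drop.cols)]

set.seed(123)
id <- sample(nrow(attrition), round(0.7 * nrow(attrition)))
train.dt <- attrition[id, ]
test.dt  <- attrition[-id, ]
```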
rpart() offers a lot of flexibility in constructing decision trees through its user controls. It is particularly useful in applications where the costs of false positives and false negatives differ.
Next, we write a wrapper so we can build different decision trees conveniently.
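The wrapper code itself is not rendered here; a minimal sketch of what such a wrapper might look like is given below. The function and argument names are assumptions; the loss matrix follows rpart()'s convention (rows are true classes, columns are predicted classes, zeros on the diagonal), assuming the factor levels are ordered No, Yes.

```r
library(rpart)

# Build a classification tree with a chosen split criterion and
# user-specified misclassification costs.
build.tree <- function(split = c("gini", "information"),
                       fn.cost = 1, fp.cost = 1, data = train.dt) {
  split <- match.arg(split)
  loss <- matrix(c(0,       fp.cost,   # true No : predicted No, Yes
                   fn.cost, 0),        # true Yes: predicted No, Yes
                 nrow = 2, byrow = TRUE)
  rpart(Attrition ~ ., data = data, method = "class",
        parms = list(split = split, loss = loss))
}
```

For example, Model 3 below would be obtained with `build.tree("gini", fn.cost = 10)`.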
Using the above function, we define six different decision tree models:

* Model 1: gini.tree.11 is based on the Gini index, without penalizing false positives or false negatives.
* Model 2: info.tree.11 is based on entropy, without penalizing false positives or false negatives.
* Model 3: gini.tree.110 is based on the Gini index, with the cost of a false negative 10 times that of a false positive.
* Model 4: info.tree.110 is based on entropy, with the cost of a false negative 10 times that of a false positive.
* Model 5: gini.tree.101 is based on the Gini index, with the cost of a false positive 10 times that of a false negative.
* Model 6: info.tree.101 is based on entropy, with the cost of a false positive 10 times that of a false negative.
Figure 14. Non-penalized decision tree models using the Gini index (left) and entropy (right).
Figure 15. Penalized decision tree models using the Gini index (left) and entropy (right).
We built six different decision tree models above. Next, we use ROC analysis to select the best among these candidate models.
The above function has three arguments that let users choose different types of decision trees, including the six trees defined in the previous subsection. Next, we use this function to build the six trees and plot their corresponding ROC curves so we can compare the global performance of these tree algorithms.
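A sketch of the comparison, reusing the hypothetical `build.tree()` wrapper and the pROC functions from the earlier sketches:

```r
trees <- list(gini.11  = build.tree("gini"),
              info.11  = build.tree("information"),
              gini.110 = build.tree("gini", fn.cost = 10),
              info.110 = build.tree("information", fn.cost = 10),
              gini.101 = build.tree("gini", fp.cost = 10),
              info.101 = build.tree("information", fp.cost = 10))

# AUC of each tree on the training data (column 2 = P(Attrition = Yes)).
aucs <- sapply(trees, function(tr) {
  p <- predict(tr, newdata = train.dt, type = "prob")[, 2]
  as.numeric(auc(roc(train.dt$Attrition, p)))
})
round(aucs, 4)
```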
Next, we plot the ROC curves and calculate the areas under the ROC curves for the individual decision tree models.

Figure 16. Comparison of ROC curves.
The above ROC curves represent the various decision trees and their corresponding AUCs. The model with the largest AUC is considered the best decision tree among the candidates. An AUC of 0.875 (the largest of the six) indicates that the corresponding decision tree performs relatively well in distinguishing between the positive and negative classes: AUC measures the model's ability to correctly rank instances of different classes by their predicted probabilities, with values closer to 1.0 indicating better discrimination and values near 0.5 indicating random performance.
As usual, once the final model is determined, we need to find the optimal cut-off score before reporting the predictive performance of the final model on the test data. Keep in mind that the optimal cut-off determination through cross-validation must be based on the training data set.
In practical applications, one may end up with two or more final models with similar AUCs. In this case, we report the performance of all final models on the test data and let clients choose one to deploy (possibly keeping the rest as challengers). For this reason, we write a function to determine the optimal cut-off for a given decision tree, since different decision trees have their own optimal cut-offs.
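A minimal sketch of such a function, using training-set accuracy as the selection criterion (other objectives, e.g. Youden's J, could be substituted):

```r
# Grid-search the cut-off that maximizes accuracy for a given tree.
optimal.cutoff <- function(tree, data = train.dt,
                           cuts = seq(0.05, 0.95, by = 0.01)) {
  p <- predict(tree, newdata = data, type = "prob")[, 2]
  y <- as.numeric(data$Attrition == "Yes")
  acc <- sapply(cuts, function(s) mean((p > s) == y))
  list(cutoff = cuts[which.max(acc)], accuracy = max(acc))
}
```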
For demonstration, we use the above function to calculate the optimal cut-offs of the six decision trees constructed earlier.

Figure 17. Plot of optimal cut-off determination.
As anticipated, different trees have their own optimal cut-offs. Keep in mind that the cut-off is itself random (it depends on the randomly split training data), so different runs may yield different cut-offs. Depending on the tree size, we may even obtain multiple optimal cut-offs; technically, any one of them can be chosen for implementation. A better recommendation is to average the multiple cut-offs and use that average as the final cut-off on the test data set.
Our decision tree model achieved an area under the curve (AUC) of 0.875. A higher AUC indicates better discriminatory power in distinguishing between positive and negative cases. In the context of employee attrition analysis, this suggests the model ranks and classifies employees well with respect to attrition risk.
However, the choice of a cutoff score is important for practical deployment and decision-making. The cutoff score determines how the model’s predictions are converted into binary classifications (positive or negative). The selection of a cutoff involves a trade-off between sensitivity (true positive rate) and specificity (true negative rate).
* Model performance: An AUC of 0.875 is generally considered good, indicating that our decision tree model differentiates well between employees who will attrite and those who won't.
* Cut-off selection: The optimal cut-off score depends on the specific business context and the relative importance of false positives (predicting attrition when it won't occur) versus false negatives (failing to predict attrition when it will occur). We should choose a cut-off that aligns with the organization's goals and risk tolerance.
* Trade-off consideration: We need to balance sensitivity and specificity based on business needs. Higher sensitivity is preferred if missing actual attrition cases is more costly, while higher specificity is preferred if avoiding false alarms is the higher priority.
* Further analysis: To make an informed decision about the cut-off, we might also analyze precision, recall, the F1-score, and other relevant metrics, and perform sensitivity analysis by evaluating the model's performance across a range of cut-off values.
* Deployment and monitoring: Model performance can vary in real-world scenarios, so it is important to monitor the model after deployment and refine the cut-off if necessary.
In summary, our Decision Tree model is demonstrating strong predictive performance, but the choice of a cutoff score should be guided by the specific needs of the organization and the implications of false positives and false negatives in the context of employee attrition.
Based on the analysis and comparison of the three models used in this project, namely logistic regression, the neural network, and the decision tree, we conclude that the neural network model is the preferred choice. The neural network achieved an AUC of 0.8337, higher than that of logistic regression (0.8151); the decision tree reported a larger AUC (0.875), but on the training data.
While the Decision Tree model had the highest AUC value, it’s important to consider that Decision Trees are prone to overfitting, which may lead to overly optimistic results on the training data. The Neural Network model demonstrated a good balance between predictive accuracy and generalization to unseen data.
A natural next step is to understand attrition by department: run the predictive models separately on each of the three departments and examine which factors affect attrition within each. This would give a more in-depth view of the key driving forces behind employee attrition in each subgroup, and allow us to provide meaningful insights on attrition factors at the department level.
Feedback is welcome. Thank you.