In the realm of People Analytics, the significance of a data set created by IBM for attrition modeling cannot be overstated. This data set presents a valuable resource for addressing critical questions related to employee turnover and engagement. With 1470 rows and 35 columns, it offers a wealth of information that encompasses various aspects of employees’ professional and personal lives. The data set combines the typical Human Resources Information System (HRIS) data with a comprehensive engagement survey, providing a holistic view of the employees’ experiences and sentiments.
The main objective of this analysis is to understand and predict employee turnover within an organization. By examining the factors that contribute to attrition, we aim to uncover valuable insights that can guide HR strategies and improve employee retention rates. Moreover, this data set offers a unique opportunity to identify differences between the group of employees who chose to stay and those who decided to leave the organization.
The data set was created and made available by IBM, a renowned leader in the tech industry. As a reputable source, the data is expected to be reliable and well-structured, enabling meaningful analysis and inference. While the specific details of data collection methods are not provided, the comprehensive nature of the data set suggests that it was meticulously curated to capture a wide range of attributes relevant to employee turnover and engagement.
Sample Size and Feature Variables: The data set consists of 1470 rows, representing individual employee records. Each row contains 35 columns, making these columns the feature variables for analysis. The feature variables include demographic information such as age and gender, factors related to job satisfaction and environment satisfaction, education field, job role, income, overtime, percentage salary hike, tenure, training time, years in the current role, relationship status, and several other parameters that may impact attrition and engagement.
The feature variables encompass various data types, including numeric, categorical, and ordinal data, which allows for a diverse set of analyses and modeling approaches.
In conclusion, this data set offers a comprehensive collection of features that are crucial for understanding and predicting employee attrition within an organization. By delving into the relationships between different variables, we can gain valuable insights that have practical implications for HR policies and practices.
A detailed description of the variables is given below:
* Education: 1 ‘Below College’, 2 ‘College’, 3 ‘Bachelor’, 4 ‘Master’, 5 ‘Doctor’
* EnvironmentSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’
* JobInvolvement: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’
* JobSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’
* PerformanceRating: 1 ‘Low’, 2 ‘Good’, 3 ‘Excellent’, 4 ‘Outstanding’
* RelationshipSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’
* WorkLifeBalance: 1 ‘Bad’, 2 ‘Good’, 3 ‘Better’, 4 ‘Best’
A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv.
We apply binary coding to the response variable, converting Attrition from Yes/No into the 0/1 indicator Attrition_num.
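As a minimal sketch (the data frame name `attrition` is our choice, not the report's), the recoding can be done with `ifelse()`:

```r
# Read the raw data from the URL above and binary-code the response.
attrition <- read.csv(
  "https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv")
attrition$Attrition_num <- ifelse(attrition$Attrition == "Yes", 1, 0)
table(attrition$Attrition_num)   # check the 0/1 class counts
```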
The EmployeeNumber column will be removed to protect the identity of employees and their sensitive information.
Additional columns have been identified for removal:
DailyRate, HourlyRate, and MonthlyRate have no documented meaning, while EmployeeCount, Over18, and StandardHours take the same value for every employee and therefore carry no information.
We first scan the entire data set and determine the EDA tools to use for feature engineering.
## Age Attrition BusinessTravel Department
## Min. :18.00 Length:1470 Length:1470 Length:1470
## 1st Qu.:30.00 Class :character Class :character Class :character
## Median :36.00 Mode :character Mode :character Mode :character
## Mean :36.92
## 3rd Qu.:43.00
## Max. :60.00
## DistanceFromHome Education EducationField EnvironmentSatisfaction
## Min. : 1.000 Min. :1.000 Length:1470 Min. :1.000
## 1st Qu.: 2.000 1st Qu.:2.000 Class :character 1st Qu.:2.000
## Median : 7.000 Median :3.000 Mode :character Median :3.000
## Mean : 9.193 Mean :2.913 Mean :2.722
## 3rd Qu.:14.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000 Max. :4.000
## Gender JobInvolvement JobLevel JobRole
## Length:1470 Min. :1.00 Min. :1.000 Length:1470
## Class :character 1st Qu.:2.00 1st Qu.:1.000 Class :character
## Mode :character Median :3.00 Median :2.000 Mode :character
## Mean :2.73 Mean :2.064
## 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :4.00 Max. :5.000
## JobSatisfaction MaritalStatus MonthlyIncome NumCompaniesWorked
## Min. :1.000 Length:1470 Min. : 1009 Min. :0.000
## 1st Qu.:2.000 Class :character 1st Qu.: 2911 1st Qu.:1.000
## Median :3.000 Mode :character Median : 4919 Median :2.000
## Mean :2.729 Mean : 6503 Mean :2.693
## 3rd Qu.:4.000 3rd Qu.: 8379 3rd Qu.:4.000
## Max. :4.000 Max. :19999 Max. :9.000
## OverTime PercentSalaryHike PerformanceRating
## Length:1470 Min. :11.00 Min. :3.000
## Class :character 1st Qu.:12.00 1st Qu.:3.000
## Mode :character Median :14.00 Median :3.000
## Mean :15.21 Mean :3.154
## 3rd Qu.:18.00 3rd Qu.:3.000
## Max. :25.00 Max. :4.000
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## Min. :1.000 Min. :0.0000 Min. : 0.00
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.: 6.00
## Median :3.000 Median :1.0000 Median :10.00
## Mean :2.712 Mean :0.7939 Mean :11.28
## 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:15.00
## Max. :4.000 Max. :3.0000 Max. :40.00
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 Median :3.000 Median : 5.000 Median : 3.000
## Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000
## YearsSinceLastPromotion YearsWithCurrManager Attrition_num
## Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.: 0.000 1st Qu.: 2.000 1st Qu.:0.0000
## Median : 1.000 Median : 3.000 Median :0.0000
## Mean : 2.188 Mean : 4.123 Mean :0.1612
## 3rd Qu.: 3.000 3rd Qu.: 7.000 3rd Qu.:0.0000
## Max. :15.000 Max. :17.000 Max. :1.0000
There appear to be no outliers; all of the numerical variables fall within reasonable ranges.
The average age is about 37 years, with a minimum of 18 and a maximum of 60. This dynamic age range could span new interns all the way to senior managers. Some employees live very close to the workplace while others live far away. Salaries also vary widely, with an average monthly income of about USD 6,500 and a range from roughly USD 1,000 to as high as USD 20,000, consistent with new interns earning far less than senior managers. The average tenure at the company is 7 years, with some employees having worked for 40 years and others having just started.
All of the categorical variables have consistent values.
Since the data set appears to have no missing values, it simplifies the Exploratory Data Analysis (EDA) process. With a complete data set, we can focus on exploring relationships, identifying patterns, and gaining insights more effectively.
The data cleaning process is complete.
This subsection focuses on the potential discretization of continuous variables and grouping sparse categories of category variables based on their distribution.
We will group age into three subgroups spanning 18 to 60, and likewise group distance from home into three subgroups. We then replace the Age and DistanceFromHome feature variables with the grouped variables grp.age and grp.dist, and replace Attrition with the binary outcome variable Attrition_num, to ease the graphical analysis. A sketch of this grouping follows.
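A minimal sketch of the grouping with base R's `cut()`; the break points below are illustrative choices, since the actual boundaries used in the report are not stated:

```r
# Discretize Age and DistanceFromHome into three groups each
# (break points are illustrative assumptions).
attrition$grp.age <- cut(attrition$Age, breaks = c(18, 30, 45, 60),
                         labels = c("18-30", "31-45", "46-60"),
                         include.lowest = TRUE)
attrition$grp.dist <- cut(attrition$DistanceFromHome, breaks = c(0, 10, 20, 30),
                          labels = c("short", "medium", "long"))
```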
Pairwise association refers to the examination of relationships between pairs of variables in a data set. It involves analyzing how the values of two variables co-occur or change together. There are three types of pairwise associations: between two numeric variables, between a numeric and a categorical variable, and between two categorical variables.
The best visual tool for assessing pairwise linear association between two numeric variables is a pair-wise scatter plot.
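One convenient way to build such a plot in R is GGally's `ggpairs()`, which draws scatter plots and correlation coefficients off the diagonal and, when a color mapping is supplied, per-group density curves on the diagonal, matching the description below. A minimal sketch, with an illustrative subset of the numeric columns:

```r
library(GGally)    # ggpairs(): matrix of pairwise plots
library(ggplot2)   # aes() for the color mapping

# Illustrative subset of numeric columns; the report's plot may use more.
num.vars <- c("Age", "MonthlyIncome", "TotalWorkingYears", "YearsAtCompany",
              "YearsInCurrentRole", "YearsSinceLastPromotion",
              "YearsWithCurrManager")
ggpairs(attrition, columns = num.vars,
        mapping = aes(color = Attrition, alpha = 0.5))
```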
The off-diagonal plots and numbers indicate the correlation between the pairwise numeric variables. As expected, YearsWithCurrManager and YearsAtCompany, YearsWithCurrManager and YearsInCurrentRole, TotalWorkingYears and YearsAtCompany, YearsInCurrentRole and YearsAtCompany, and YearsSinceLastPromotion and YearsAtCompany are all significantly correlated. The other variable pairs have weak correlations.
The stacked density curves on the main diagonal show the potential difference in the distribution of each numeric variable between the attrition and no-attrition groups; in other words, they display the relationship between a numeric variable and the categorical response. These density curves do not completely overlap, indicating some association between each of these numeric variables and the binary response variable.
Mosaic plots are convenient for showing whether two categorical variables are dependent. In EDA, we are primarily interested in whether the response (binary in this case) is independent of the categorical variables. Categorical variables that are independent of the response should be excluded from any of the subsequent models and algorithms.
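A minimal sketch of such displays with base R's `mosaicplot()`, assuming the `attrition` data frame (the report's actual plotting code may differ):

```r
# Mosaic plots of Attrition against two categorical features.
par(mfrow = c(1, 2))
mosaicplot(BusinessTravel ~ Attrition, data = attrition, color = TRUE,
           main = "Attrition vs. BusinessTravel")
mosaicplot(Department ~ Attrition, data = attrition, color = TRUE,
           main = "Attrition vs. Department")
```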
The top two mosaic plots show that Attrition is not independent of BusinessTravel or Department, because the proportion of attrition cases differs across categories: employees who travel frequently have the highest attrition rate, while employees who do not travel have the lowest. The bottom two mosaic plots likewise show that Attrition is not independent of EducationField or JobRole, since the proportion of attrition cases again differs across categories.
The top two mosaic plots demonstrate associations between attrition and both OverTime and MaritalStatus: employees who are single and work substantial overtime show elevated attrition rates. The bottom two mosaic plots show that Attrition is not independent of age group or distance from home, because the proportion of attrition cases differs across categories; younger employees have a higher attrition rate than older employees, and longer commutes are associated with higher attrition. Gender, by contrast, appears to have little influence.
# Concluding Remarks
We will remove grp.age, grp.dist, and the binary outcome variable Attrition_num, which were created for the graphical analysis. We will also remove YearsAtCompany and TotalWorkingYears, as they are highly correlated with the other tenure variables (correlation coefficients > 0.75). The remaining feature variables are not correlated strongly enough to warrant removal, so we keep them in the final cleaned data set for further analysis.
The goal of unsupervised learning is to find the underlying structure of the data set and group that data according to similarities. Common algorithms used in unsupervised learning include clustering, anomaly detection, neural networks, and approaches for learning latent variable models.
The following types of unsupervised machine learning algorithms are commonly used in practice.
K-means Clustering: K-means clustering is a popular unsupervised machine learning technique used for grouping similar data points into clusters. It assigns data points to clusters based on their similarity to the mean of each cluster. K-means is iterative and aims to minimize the variance within each cluster. It is widely used in customer segmentation, image compression, and data preprocessing.
Hierarchical Clustering: Hierarchical clustering is another technique for grouping data points into clusters. Unlike K-means, hierarchical clustering creates a tree-like structure of nested clusters, which can be visualized as a dendrogram. It does not require the user to specify the number of clusters beforehand. Hierarchical clustering is useful for understanding the relationships between data points and is commonly used in biology, social sciences, and data exploration.
Anomaly Detection: Anomaly detection involves identifying data points that deviate significantly from the norm or expected behavior. It is widely used in fraud detection, network security, and quality control. Anomaly detection methods can be based on statistical techniques, machine learning, or a combination of both. Detecting anomalies helps in identifying rare events or outliers that might require special attention.
Principal Component Analysis (PCA): PCA is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while retaining as much of the original variance as possible. It achieves this by identifying orthogonal axes (principal components) that capture the most significant variability in the data. PCA is commonly used to reduce noise, visualize high-dimensional data, and improve the efficiency of machine learning algorithms by reducing multicollinearity.
These techniques play crucial roles in exploratory data analysis, pattern recognition, and feature engineering, contributing to the development of effective machine learning models and insights from complex data sets.
A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv. This data set has been pre-processed and feature engineered.
We use the k-means method to perform cluster analysis on the attrition data with the nine numerical feature variables, as sketched below.
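A minimal sketch of this step; the data frame name `cleaned` is our choice, and the nine numeric variables are inferred from the PCA loading table later in this report:

```r
# Read the pre-processed data from the URL above.
cleaned <- read.csv(
  "https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv")

# The nine numeric features (inferred from the PCA loadings below).
num.vars <- c("Age", "DistanceFromHome", "MonthlyIncome", "NumCompaniesWorked",
              "PercentSalaryHike", "WorkLifeBalance", "YearsInCurrentRole",
              "YearsSinceLastPromotion", "YearsWithCurrManager")

set.seed(123)                  # k-means uses random starting centers
km <- kmeans(scale(cleaned[, num.vars]), centers = 3, nstart = 25)
table(clust.ID = km$cluster)   # cluster sizes, as reported below
```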
## clust.ID
## 1 2 3
## 365 916 189
Since this clustering task involves nine numerical feature variables, we cannot create a 2D plot that shows the clustering performance with all nine original features. However, we can use the so-called PCA to create two new feature variables and then plot these new features to show the performance of the clustering algorithm.
One of the potential issues in clustering analysis is memory use: if the data set is too large, the distance computations can become expensive, and subsampling or a scalable variant of the algorithm may be needed.
Like clustering methods, dimension reduction seeks and explores the inherent structure in the data in an unsupervised manner, summarizing or describing the data using less information.
This can be useful to visualize high-dimensional data or to simplify data which can then be used in a supervised learning method. Many of these methods can be adapted for use in classification and regression. The following are the frequently used algorithms.
In this section, we introduce the most commonly used such method, PCA.
We reuse the Attrition data set from the clustering algorithms. It has nine numerical feature variables and no categorical variables; several of the numeric features, notably the tenure-related variables YearsWithCurrManager, YearsInCurrentRole, and YearsSinceLastPromotion, are correlated. We use PCA to see whether the number of feature variables can be reduced for subsequent modeling.
We want PCA to reduce the dimension from the nine numerical variables to a smaller number. The R function prcomp() fits the PCA and produces the factor loadings associated with the nine numerical variables.
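A minimal sketch of the fit; the object name `ar.pca` matches the one used later to extract scores, and `num.vars` is the nine-variable set defined earlier:

```r
# Fit PCA on the centered and scaled numeric features.
ar.pca <- prcomp(cleaned[, num.vars], center = TRUE, scale. = TRUE)
ar.pca$rotation   # factor loadings: one column per principal component
summary(ar.pca)   # standard deviations and variance proportions
```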
Next, we find the factor loading of the above fitted PCA. We can write an explicit system of linear transformation by using the loadings.
|  | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 |
|---|---|---|---|---|---|---|---|---|---|
| Age | -0.29 | 0.61 | -0.03 | 0.12 | -0.10 | 0.21 | -0.01 | -0.69 | -0.01 |
| DistanceFromHome | 0.01 | -0.01 | -0.70 | -0.42 | 0.48 | 0.32 | 0.01 | -0.03 | -0.02 |
| MonthlyIncome | -0.37 | 0.51 | 0.02 | 0.02 | -0.09 | 0.29 | -0.07 | 0.71 | 0.01 |
| NumCompaniesWorked | 0.07 | 0.47 | -0.07 | -0.14 | 0.33 | -0.80 | 0.03 | 0.08 | -0.01 |
| PercentSalaryHike | 0.06 | 0.02 | -0.55 | -0.14 | -0.78 | -0.22 | 0.05 | 0.04 | -0.02 |
| WorkLifeBalance | 0.00 | 0.05 | 0.44 | -0.87 | -0.17 | 0.06 | -0.06 | -0.08 | -0.01 |
| YearsInCurrentRole | -0.54 | -0.24 | -0.07 | -0.05 | 0.02 | -0.19 | -0.31 | -0.06 | 0.71 |
| YearsSinceLastPromotion | -0.44 | -0.15 | 0.03 | -0.07 | 0.03 | -0.08 | 0.88 | 0.00 | -0.02 |
| YearsWithCurrManager | -0.54 | -0.25 | -0.04 | -0.01 | 0.03 | -0.19 | -0.34 | -0.05 | -0.70 |
For example, let’s consider the first row (Age): The factor loading of -0.29 for PC1 suggests a moderately negative relationship between Age and PC1. A higher Age is associated with lower values of PC1. The factor loading of 0.61 for PC2 suggests a moderately positive relationship between Age and PC2. A higher Age is associated with higher values of PC2. The factor loadings close to 0 for other principal components indicate that Age has relatively weaker relationships with those components.
Similarly, let’s consider the second row (DistanceFromHome): The factor loading of 0.01 for PC1 indicates a very weak positive relationship between DistanceFromHome and PC1. The factor loading of -0.01 for PC2 indicates a very weak negative relationship between DistanceFromHome and PC2. The factor loading of -0.70 for PC3 indicates a relatively strong negative relationship between DistanceFromHome and PC3. Interpreting these factor loadings for all variables across principal components allows us to understand which original variables contribute most to each principal component and how they contribute to the overall variance captured by the PCs.
The object of PCA is to reduce the dimension without losing a significant amount of information. In PCA, we look at how much total variation is captured by each principal component. Most libraries capable of performing PCA automatically rank the principal components by the variation each one captures.
The following summary table gives the importance of the principal components.
|  | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 |
|---|---|---|---|---|---|---|---|---|---|
| Standard deviation | 1.505511 | 1.180874 | 1.018676 | 0.9981226 | 0.9937252 | 0.9272111 | 0.8250489 | 0.6922288 | 0.5458277 |
| Proportion of Variance | 0.251840 | 0.154940 | 0.115300 | 0.1106900 | 0.1097200 | 0.0955200 | 0.0756300 | 0.0532400 | 0.0331000 |
| Cumulative Proportion | 0.251840 | 0.406780 | 0.522080 | 0.6327800 | 0.7425000 | 0.8380200 | 0.9136500 | 0.9669000 | 1.0000000 |
The output is from the Principal Component Analysis (PCA) performed on our data. PCA is a technique for dimensionality reduction that transforms the original feature variables into a new set of uncorrelated variables called principal components (PCs). These PCs capture most of the variance in the original data, allowing us to represent the data with fewer variables while retaining as much information as possible.
Let’s break down the output and understand its components:
* Standard deviation: the standard deviation of each principal component, indicating the spread of the data along that component's axis. Larger standard deviations mean the component captures more variation in the data.
* Proportion of variance: the proportion of the total variance in the data explained by each principal component, quantifying the amount of information retained by each PC.
* Cumulative proportion: the cumulative proportion of variance explained up to each principal component, which helps in deciding how many PCs to retain.

Interpreting the output: PC1 has a relatively high standard deviation of 1.51, meaning it captures a substantial amount of variability, and it explains about 25.2% of the total variance. PC2 has a standard deviation of 1.18 and explains an additional 15.5% of the variance, bringing the cumulative proportion to 40.7%. PC3 captures 11.5% of the variance, with a standard deviation of 1.02 and a cumulative proportion of 52.2%. PC4 through PC9 contribute progressively smaller proportions of variance, with decreasing standard deviations; collectively they explain the remaining variance, for a cumulative proportion of 100%.

PC1 is the most important component: it has the largest standard deviation and explains the highest proportion of variance, representing the dominant pattern of variability in the data. Subsequent principal components capture decreasing amounts of variance and contribute less to explaining the overall variability.

A common approach to deciding how many principal components to retain is to choose the number that collectively explains a large portion (e.g., 95% or 99%) of the total variance. In this case, we could retain the first six principal components, which together explain about 84% of the total variance. The choice depends on the trade-off between reducing dimensionality and retaining information, and may also consider interpretability and the requirements of downstream analysis or modeling.
We can also make a scree plot as a visual tool to show the number of principal components to retain for future analysis.
Figure 18. Scree plot of PCA on the Attrition data
Note that the vertical axis in the above scree plot uses the variances of PCs. The standard deviation was used in the above summary table.
The predicted principal component scores are the values of the newly transformed variables. We can choose the first few principal components to use as new feature variables in relevant modeling. The command ar.pca$x extracts the PC scores from the PCA procedure. These scores are the values of the new transformed variables and can be used as response or predictor variables in statistical models. The following table shows the PC scores of the first 15 records.
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 |
|---|---|---|---|---|---|---|---|---|
| -0.2503729 | 0.4891398 | 0.3747896 | 3.6543569 | 0.8579011 | -0.7647964 | -1.2736396 | 0.0088989 | 0.0902827 |
| -1.0858270 | 0.5882272 | -1.2022389 | -0.7339714 | -1.4161095 | -0.5372264 | 0.3735791 | -0.8711734 | -0.0914351 |
| 3.3975430 | 0.9163099 | 0.9634498 | 0.1905301 | -0.4626099 | -0.1541245 | 0.4121812 | -0.7822869 | -0.0673956 |
| 0.8676736 | -0.2101990 | 1.3669685 | -0.0560005 | 0.8397869 | -0.1966703 | 1.2784114 | -0.3702883 | 1.8670205 |
| -0.2267333 | -1.0262492 | 1.3421464 | -0.0967361 | 0.5739030 | -1.1482797 | 0.4382696 | 0.3115776 | 0.0014997 |
| -0.7096627 | -2.2835630 | 0.7613344 | 1.4737681 | -0.5304995 | 1.1950815 | 0.4098705 | -0.4360873 | 0.0541166 |
| 2.8024440 | 2.1778603 | -0.6098430 | 1.2164423 | -1.2881520 | 0.1386260 | 0.5137954 | -1.6709849 | -0.0963894 |
| 3.5978736 | 0.5007554 | -1.5597572 | -1.0920457 | -0.6760869 | 0.3328040 | 0.4814611 | 0.0482851 | -0.1323815 |
| -1.3752266 | -0.9107851 | -1.4354599 | -0.8094795 | -1.5216529 | 2.1815365 | 0.2268772 | 0.2213691 | -0.0630726 |
| -0.9478920 | -0.1948945 | -1.1670399 | 0.0991997 | 1.5156007 | -0.1222437 | 0.5061640 | -0.0132037 | -0.0407171 |
| 0.2982608 | -1.8579992 | -0.1208496 | -0.3095551 | 0.1211938 | 2.0607686 | -1.4063438 | -1.0786582 | 0.0395563 |
| 0.1159991 | -1.9395323 | 0.1485927 | -0.3062092 | 0.3747998 | 2.1806035 | -1.5169336 | 0.0061617 | 0.0068637 |
| -0.2463278 | -1.0275712 | -1.7665721 | -0.1019653 | 0.6381276 | -0.5562854 | 0.6517848 | -0.1887846 | -0.0949720 |
| -0.5567213 | -2.1362485 | 0.2432414 | -0.4089012 | 0.8240233 | 2.1622582 | 0.3216970 | -0.9266780 | -0.0116863 |
| 0.9405464 | -1.1005522 | -0.7544006 | -1.0315756 | 1.1448150 | -0.5466799 | -1.2598201 | -0.4084133 | -0.0554879 |
Finally, we chose not to include a new variable derived from Principal Component Analysis (PCA) in the analysis due to the challenge of collapsing PC1 into a specific single feature variable. While PCA effectively reduces dimensionality by creating orthogonal components that capture the most variance in the data, translating these components back into meaningful original variables can be complex. PC1 is a linear combination of multiple original variables, making it difficult to attribute its influence to a single feature. As a result, introducing a new variable based on PC1 could lead to difficulties in interpretation and may not provide clear insights into the underlying relationships present in the data. Retaining the original variables preserves the interpretability of the features and their direct associations with the outcome or patterns, allowing for a more straightforward and actionable analysis.
We build and evaluate logistic regression, neural network, and decision tree models to predict employee attrition, providing insights into each model’s performance.
Employee attrition is a critical challenge faced by many organizations, especially in the fast-paced tech industry. High turnover rates can impact productivity, morale, and overall company performance. In this analysis, we will use logistic regression to predict employee attrition in a tech company based on various employee characteristics, job-related factors and also to assess the association between the binary response variable and other predictor variables.
In this study, we use the publicly available Employee Attrition data set described earlier. The practical question for this predictive modeling assignment is to determine the factors that impact employee attrition. We want to understand which variables are significant predictors of attrition and build a logistic regression model to predict whether an employee is likely to leave the company (attrition = 1) or not (attrition = 0). A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv. This data set has been pre-processed and feature engineered.
The response variable is Attrition: the status of whether an employee is likely to leave the company (attrition = 1) or not (attrition = 0). There are 26 variables (columns) in the data set:

* Age: employee age
* Attrition: whether the employee leaves the job
* BusinessTravel: the frequency of job travel
* Department: employee work department
* DistanceFromHome: distance traveled to work from home
* Education: employee education level (1 = Below College, 2 = College, 3 = Bachelor, 4 = Master, 5 = Doctor)
* EducationField: employee education field
* EnvironmentSatisfaction: numerical value for environment satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
* Gender: employee gender
* JobInvolvement: numerical value for job involvement (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
* JobLevel: numerical value for job level
* JobRole: employee job position
* JobSatisfaction: numerical value for job satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
* MaritalStatus: employee marital status
* MonthlyIncome: the amount of money the employee earns in one month, before taxes or deductions
* NumCompaniesWorked: number of companies worked at
* OverTime: whether the employee works overtime
* PercentSalaryHike: percent increase in salary
* PerformanceRating: numerical value for performance rating (1 = Low, 2 = Good, 3 = Excellent, 4 = Outstanding)
* RelationshipSatisfaction: numerical value for relationship satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
* StockOptionLevel: numerical value for stock options
* TrainingTimesLastYear: number of times the employee attended training last year
* WorkLifeBalance: numerical value for work-life balance (1 = Bad, 2 = Good, 3 = Better, 4 = Best)
* YearsInCurrentRole: number of years the employee has worked in their current job role
* YearsSinceLastPromotion: number of years since the last promotion
* YearsWithCurrManager: number of years the employee has worked with their current manager
We next read the data from the given URL directly into R. Since there are no records with missing values, we do not need to drop any records.
We again apply binary coding to the response (Yes = 1, No = 0), as in the earlier sketch.
We first build a logistic regression model that contains all predictor variables in the data set. This model is usually called the full model. Note that the response variable is the attrition status (1 = yes, 0 = no).
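A minimal sketch of the full-model fit, assuming Attrition_num is the 0/1 response and all remaining columns serve as predictors:

```r
# Full logistic regression model; if the original character Attrition
# column is still present, it must be dropped before using "~ .".
full.model <- glm(Attrition_num ~ ., family = binomial, data = cleaned)
summary(full.model)   # coefficient table, as shown below
```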
|  | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | -10.1487504 | 395.7492662 | -0.0256444 | 0.9795410 |
| BusinessTravelTravel_Frequently | 1.8670294 | 0.4071329 | 4.5857985 | 0.0000045 |
| BusinessTravelTravel_Rarely | 0.9693814 | 0.3754112 | 2.5821858 | 0.0098177 |
| DepartmentResearch & Development | 12.6266608 | 395.7469457 | 0.0319059 | 0.9745471 |
| DepartmentSales | 12.4162910 | 395.7471849 | 0.0313743 | 0.9749710 |
| Education | 0.0019795 | 0.0872999 | 0.0226745 | 0.9819099 |
| EducationFieldLife Sciences | -0.7606727 | 0.7984502 | -0.9526864 | 0.3407490 |
| EducationFieldMarketing | -0.3297477 | 0.8468630 | -0.3893755 | 0.6969984 |
| EducationFieldMedical | -0.8515225 | 0.7981229 | -1.0669064 | 0.2860141 |
| EducationFieldOther | -0.8770665 | 0.8562571 | -1.0243028 | 0.3056923 |
| EducationFieldTechnical Degree | 0.1079078 | 0.8159464 | 0.1322486 | 0.8947877 |
| EnvironmentSatisfaction | -0.4199648 | 0.0821263 | -5.1136430 | 0.0000003 |
| GenderMale | 0.3950340 | 0.1830513 | 2.1580505 | 0.0309239 |
| JobInvolvement | -0.5328265 | 0.1205934 | -4.4183737 | 0.0000099 |
| JobLevel | -0.1389967 | 0.2977130 | -0.4668814 | 0.6405847 |
| JobRoleHuman Resources | 14.0089374 | 395.7471564 | 0.0353987 | 0.9717618 |
| JobRoleLaboratory Technician | 1.5317617 | 0.4816347 | 3.1803394 | 0.0014710 |
| JobRoleManager | 0.5978435 | 0.8727760 | 0.6849907 | 0.4933498 |
| JobRoleManufacturing Director | 0.2597355 | 0.5314577 | 0.4887228 | 0.6250380 |
| JobRoleResearch Director | -0.8767143 | 0.9659989 | -0.9075728 | 0.3641040 |
| JobRoleResearch Scientist | 0.6231180 | 0.4896045 | 1.2726965 | 0.2031258 |
| JobRoleSales Executive | 1.3223467 | 1.1132697 | 1.1878044 | 0.2349105 |
| JobRoleSales Representative | 2.2290891 | 1.1700504 | 1.9051223 | 0.0567642 |
| JobSatisfaction | -0.4031915 | 0.0803294 | -5.0192247 | 0.0000005 |
| MaritalStatusMarried | 0.3027285 | 0.2642453 | 1.1456342 | 0.2519466 |
| MaritalStatusSingle | 1.1604409 | 0.3416766 | 3.3963130 | 0.0006830 |
| MonthlyIncome | -0.0000032 | 0.0000798 | -0.0400431 | 0.9680587 |
| NumCompaniesWorked | 0.1630538 | 0.0370723 | 4.3982643 | 0.0000109 |
| OverTimeYes | 1.9343061 | 0.1909657 | 10.1290744 | 0.0000000 |
| PercentSalaryHike | -0.0210031 | 0.0387550 | -0.5419455 | 0.5878561 |
| PerformanceRating | 0.0679746 | 0.3946248 | 0.1722513 | 0.8632400 |
| RelationshipSatisfaction | -0.2456994 | 0.0819679 | -2.9975091 | 0.0027220 |
| StockOptionLevel | -0.1975184 | 0.1557253 | -1.2683770 | 0.2046633 |
| TrainingTimesLastYear | -0.1862392 | 0.0725335 | -2.5676316 | 0.0102396 |
| WorkLifeBalance | -0.3547433 | 0.1227487 | -2.8899968 | 0.0038525 |
| YearsInCurrentRole | -0.1186218 | 0.0419441 | -2.8280935 | 0.0046826 |
| YearsSinceLastPromotion | 0.2036475 | 0.0401511 | 5.0720267 | 0.0000004 |
| YearsWithCurrManager | -0.0953257 | 0.0418145 | -2.2797268 | 0.0226239 |
| Age | -0.0422346 | 0.0122225 | -3.4554899 | 0.0005493 |
| DistanceFromHome | 0.0446842 | 0.0106711 | 4.1873907 | 0.0000282 |
The p-values in the above significance-test table show that some feature variables have p-values greater than 0.05. We next search for the best model by dropping some of the insignificant predictor variables. Since there are many possible ways to drop variables, we use an automatic variable selection procedure to search for the final model, as sketched below.
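A minimal sketch using base R's `step()` with the AIC criterion; the report's actual selection procedure may differ:

```r
# Stepwise selection in both directions, starting from the full model.
final.model <- step(full.model, direction = "both", trace = 0)
summary(final.model)   # reduced-model coefficient table, as shown below
```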
|  | Estimate | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|
| (Intercept) | 1.7644674 | 1.1479245 | 1.5370936 | 0.1242704 |
| BusinessTravelTravel_Frequently | 1.8817648 | 0.4041877 | 4.6556709 | 0.0000032 |
| BusinessTravelTravel_Rarely | 0.9782062 | 0.3731283 | 2.6216350 | 0.0087509 |
| EducationFieldLife Sciences | -0.5979343 | 0.7398454 | -0.8081882 | 0.4189823 |
| EducationFieldMarketing | -0.1826602 | 0.7887458 | -0.2315831 | 0.8168618 |
| EducationFieldMedical | -0.7024340 | 0.7397245 | -0.9495886 | 0.3423213 |
| EducationFieldOther | -0.6971558 | 0.8066688 | -0.8642405 | 0.3874558 |
| EducationFieldTechnical Degree | 0.2724424 | 0.7627622 | 0.3571787 | 0.7209580 |
| EnvironmentSatisfaction | -0.4207754 | 0.0818043 | -5.1436854 | 0.0000003 |
| GenderMale | 0.3792481 | 0.1823123 | 2.0802117 | 0.0375061 |
| JobInvolvement | -0.5327656 | 0.1197715 | -4.4481841 | 0.0000087 |
| JobRoleHuman Resources | 1.6361497 | 0.6420829 | 2.5481906 | 0.0108283 |
| JobRoleLaboratory Technician | 1.6782868 | 0.4279142 | 3.9220171 | 0.0000878 |
| JobRoleManager | 0.1509916 | 0.6348814 | 0.2378265 | 0.8120157 |
| JobRoleManufacturing Director | 0.2443439 | 0.5288542 | 0.4620251 | 0.6440633 |
| JobRoleResearch Director | -1.0222376 | 0.8690644 | -1.1762507 | 0.2394947 |
| JobRoleResearch Scientist | 0.7856609 | 0.4318438 | 1.8193172 | 0.0688631 |
| JobRoleSales Executive | 1.1258424 | 0.4413488 | 2.5509130 | 0.0107441 |
| JobRoleSales Representative | 2.1559119 | 0.4959605 | 4.3469424 | 0.0000138 |
| JobSatisfaction | -0.4088564 | 0.0798811 | -5.1183104 | 0.0000003 |
| MaritalStatusMarried | 0.3945195 | 0.2543611 | 1.5510215 | 0.1208965 |
| MaritalStatusSingle | 1.4321239 | 0.2609096 | 5.4889662 | 0.0000000 |
| NumCompaniesWorked | 0.1627058 | 0.0368348 | 4.4171807 | 0.0000100 |
| OverTimeYes | 1.9293093 | 0.1901496 | 10.1462735 | 0.0000000 |
| RelationshipSatisfaction | -0.2308355 | 0.0810321 | -2.8486935 | 0.0043899 |
| TrainingTimesLastYear | -0.1844669 | 0.0724362 | -2.5466125 | 0.0108774 |
| WorkLifeBalance | -0.3637707 | 0.1227142 | -2.9643736 | 0.0030330 |
| YearsInCurrentRole | -0.1192218 | 0.0415277 | -2.8708974 | 0.0040931 |
| YearsSinceLastPromotion | 0.2032144 | 0.0391851 | 5.1860088 | 0.0000002 |
| YearsWithCurrManager | -0.1001219 | 0.0413984 | -2.4184959 | 0.0155848 |
| Age | -0.0459829 | 0.0116388 | -3.9508431 | 0.0000779 |
| DistanceFromHome | 0.0431754 | 0.0105214 | 4.1035687 | 0.0000407 |
*Interpretation: Association Analysis* The summary table contains four practically important variables: YearsSinceLastPromotion, RelationshipSatisfaction, WorkLifeBalance, and NumCompaniesWorked. YearsSinceLastPromotion achieves high statistical significance (p-value \(\approx\) 0), RelationshipSatisfaction achieves high statistical significance (p-value \(\approx\) 0.0044), WorkLifeBalance achieves high statistical significance (p-value \(\approx\) 0.003), and NumCompaniesWorked also achieves high significance (p-value \(\approx\) 0.00001). YearsSinceLastPromotion and NumCompaniesWorked are positively associated with the response variable, whereas RelationshipSatisfaction and WorkLifeBalance are negatively associated with it.
Here’s a brief interpretation of the significant tests:
* BusinessTravel: Travel_Frequently has a positive coefficient (1.88) and is significant (p < 0.001), indicating that employees who travel frequently are more likely to leave. Travel_Rarely also has a positive coefficient (0.98) and is significant (p = 0.009), suggesting that employees who travel rarely are likewise more likely to leave than those who do not travel.
* EducationField: none of the specific education fields is a significant predictor of attrition, though most show a negative association.
* EnvironmentSatisfaction: a negative coefficient (-0.42), highly significant (p < 0.001), suggests that lower environment satisfaction is associated with higher attrition.
* Gender: male employees have a positive coefficient (0.38) that is marginally significant (p = 0.038), indicating that male employees may be slightly more likely to leave.
* JobInvolvement: a negative coefficient (-0.53), highly significant (p < 0.001), suggests that lower job involvement is associated with higher attrition.
* JobRole: several job roles have significant associations with attrition; for example, Laboratory Technicians, Sales Representatives, and Human Resources have positive coefficients and are significant predictors of attrition.
* JobSatisfaction: a negative coefficient (-0.41), highly significant (p < 0.001), suggests that lower job satisfaction is associated with higher attrition.
* MaritalStatus: single employees have a positive coefficient (1.43), highly significant (p < 0.001), indicating that single employees are more likely to leave.
* NumCompaniesWorked: a positive coefficient (0.16), highly significant (p < 0.001), suggests that having worked at more companies is associated with higher attrition.
* OverTime: employees who work overtime (OverTimeYes) have a positive coefficient (1.93), highly significant (p < 0.001), indicating that they are more likely to leave.
* TrainingTimesLastYear: a negative coefficient (-0.18), significant (p = 0.011), indicates that fewer training sessions last year are associated with higher attrition.
* YearsInCurrentRole: a negative coefficient (-0.12), significant (p = 0.004), suggests that fewer years in the current role are associated with higher attrition.
* YearsSinceLastPromotion: a positive coefficient (0.20), highly significant (p < 0.001), suggests that more years since the last promotion are associated with higher attrition.
* YearsWithCurrManager: a negative coefficient (-0.10), significant (p = 0.016), suggests that fewer years with the current manager are associated with higher attrition.
* Age: a negative coefficient (-0.05), highly significant (p < 0.001), suggests that older age is associated with lower attrition.
* DistanceFromHome: a positive coefficient (0.04), highly significant (p < 0.001), suggests that greater distance from home is associated with higher attrition.

These results provide insights into how the feature variables are associated with employee attrition. BusinessTravel, EnvironmentSatisfaction, JobInvolvement, JobRole, JobSatisfaction, MaritalStatus, NumCompaniesWorked, OverTime, RelationshipSatisfaction, TrainingTimesLastYear, WorkLifeBalance, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager, Age, and DistanceFromHome all appear to have significant associations with attrition.
| BusinessTravel | EnvironmentSatisfaction | Gender | JobInvolvement | JobRole | JobSatisfaction | MaritalStatus | NumCompaniesWorked | OverTime | RelationshipSatisfaction | StockOptionLevel | TrainingTimesLastYear | WorkLifeBalance | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | EducationField | Age | DistanceFromHome | Pred.Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Travel_Rarely | 3 | Male | 1 | Healthcare Representative | 3 | Single | 3 | Yes | 1 | 0 | 2 | 2 | 3 | 1 | 2 | Human Resources | 25 | 14 | 1 |
| Travel_Frequently | 2 | Female | 3 | Manager | 4 | Married | 4 | No | 3 | 1 | 1 | 2 | 8 | 3 | 2 | Technical Degree | 35 | 5 | 0 |
The predicted attrition status for the two hypothetical employees is attached to the two new data records above.
The “Predicted Response” column indicates the predicted outcome of employee attrition based on the input values for the predictor variables and the logistic regression model. A value of 1 indicates that the model predicts attrition (Yes), and a value of 0 indicates that the model predicts no attrition (No). These predictions are based on the relationships and coefficients learned by the logistic regression model from the training data. It’s important to note that these predictions are based on the specific characteristics we have provided for each individual, and they may change if the input values are altered.
In the realm of predictive modeling, logistic regression stands as a foundational technique for tackling binary classification problems. It allows us to analyze the relationship between a set of predictor variables and a binary outcome, making it particularly well-suited for scenarios like employee attrition prediction. However, the journey doesn’t end with model construction; we need robust mechanisms to validate our model’s performance and ensure its generalizability. This exploration delves into the crucial steps of using logistic regression, employing training, validating and test datasets, harnessing the power of cross-validation, and evaluating model performance through the precision, recall, F1-score, and Receiver Operating Characteristic (ROC) curve. This multi-faceted approach enables us to not only build a predictive model but also rigorously assess its accuracy, sensitivity, and overall effectiveness.
* Training dataset: the training dataset serves as the foundation for teaching the logistic regression model. It contains labeled examples from which the model learns the underlying patterns and relationships between the predictor variables and the binary response. During training, the logistic regression algorithm adjusts its parameters, specifically the coefficients, to minimize the error between predicted probabilities and actual outcomes.
* Validation dataset: incorporating a validation dataset is crucial for assessing the model's performance and preventing overfitting. This separate dataset is not used during training but during the model development process: by evaluating the model on it, one can fine-tune hyperparameters and make decisions about model complexity, selecting the version of the model that generalizes best to unseen data.
* Test dataset: the test dataset is entirely independent of both the training and validation datasets. It serves as a final benchmark, providing an unbiased estimate of how well the trained model will perform when deployed in a real-world scenario.
First, we split the dataset into two distinct subsets: a training/validation dataset and a test dataset. Since the sample size is large, we use a 70%:30% split, with 70% of the data for training and validating models and 30% for testing. The labels (values of the attrition status) of the test data are removed when calculating the accuracy measures. A sketch of this split follows.
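A minimal sketch of the 70%/30% split:

```r
# Random 70/30 split of the cleaned data.
set.seed(123)
n <- nrow(cleaned)
train.id <- sample(1:n, size = round(0.7 * n))
train <- cleaned[train.id, ]    # 70%: training and cross-validation
test  <- cleaned[-train.id, ]   # 30%: held-out testing
```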
However, a single split of the dataset may not provide a comprehensive assessment of our model’s robustness. This is where cross-validation steps in. By employing techniques like k-fold cross-validation, we repeatedly divide the dataset into subsets, training the model on a combination of these subsets and validating it on the remaining data. This iterative process yields a more reliable estimate of the model’s performance and guards against potential overfitting.
Cross-Validation: Cross-validation is a methodology that enhances the validation process by mitigating potential biases in model evaluation. K-fold cross-validation, a popular technique, involves partitioning the dataset into K subsets. The model is trained K times, each time using K-1 subsets for training and one subset for validation. This process ensures a comprehensive assessment of the model's performance and reduces the risk of over-optimistic evaluations. We define a sequence of 20 candidate cut-off probabilities and then use 5-fold cross-validation to identify the optimal cut-off probability for the final model, as sketched below.
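A minimal sketch of this grid search, assuming the `train` split and full-model formula from earlier; the candidate grid below is an illustrative choice:

```r
# 20 candidate cut-offs evaluated by 5-fold CV on the training data.
cut.offs <- seq(0.01, 0.99, length.out = 20)
k <- 5
set.seed(123)
fold.id <- sample(rep(1:k, length.out = nrow(train)))
cv.acc  <- matrix(NA, nrow = k, ncol = length(cut.offs))

for (i in 1:k) {
  cv.train <- train[fold.id != i, ]
  cv.valid <- train[fold.id == i, ]
  fit   <- glm(Attrition_num ~ ., family = binomial, data = cv.train)
  p.hat <- predict(fit, newdata = cv.valid, type = "response")
  for (j in seq_along(cut.offs)) {
    # Accuracy at this threshold on the held-out fold.
    cv.acc[i, j] <- mean((p.hat > cut.offs[j]) == cv.valid$Attrition_num)
  }
}
plot(cut.offs, colMeans(cv.acc), type = "b",
     xlab = "cut-off probability", ylab = "CV accuracy")
cut.offs[which.max(colMeans(cv.acc))]   # the optimal cut-off
```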
Figure 7. 5-fold CV performance plot
The above figure indicates that the optimal cut-off probability that yields the best accuracy is 0.57.
This subsection reports the performance of the model on the test data set. Note that the model is fit to the training data to obtain the regression coefficients, and the holdout testing sample is then used to compute the accuracy.
| test.accuracy |
|---|
| 0.8843537 |
The accuracy is about 88%. This accuracy indicates that there is no under-fitting: the predictive model correctly classified roughly 88% of the instances in the test dataset. In other words, of all the instances (data points) in the test dataset, approximately 88% were correctly predicted by the model as either having attrition or not having attrition, based on the chosen threshold.
In conclusion, this study aimed to predict employee attrition using logistic regression and leveraged a robust 5-fold cross-validation approach to assess the model’s accuracy. The use of cross-validation allowed us to effectively evaluate the model’s performance on multiple subsets of the training data, enhancing its generalizability and reducing the risk of overfitting. The logistic regression model demonstrated promising results, achieving an accuracy of 87.75% in predicting employee attrition. This suggests that the selected features and the logistic regression algorithm have the potential to be valuable tools in identifying employees who might be at risk of attrition. However, it’s important to note that accuracy alone might not provide a complete picture of the model’s performance, and further evaluation using other metrics, such as precision, recall, F1-score, and ROC curve, could provide additional insights into the model’s strengths and weaknesses.
Now we will calculate the local and global performance metrics for the logistic predictive model. We used the confusion matrix in the case study in the previous note. Here we use the optimal cut-off probability as the decision threshold to define a confusion matrix and then define the performance measures based on this matrix. We use the training and testing data sets created for the previous model, taking the optimal cut-off probability to be the one obtained through cross-validation. The testing data set is used to report the local and global performance measures.
Since we have identified the optimal cut-off probability to be 0.57, we next use the testing data set to report the local measures.
| sensitivity | specificity | precision | recall | F1 |
|---|---|---|---|---|
| 0.3134328 | 0.986631 | 0.8076923 | 0.3134328 | 0.4516129 |
* Sensitivity (true positive rate): sensitivity, also known as recall, measures the proportion of actual positive cases correctly identified by the model. Here it is 0.3134, meaning the model correctly identified approximately 31.34% of the actual positive (attrition) cases in the dataset.
* Specificity (true negative rate): specificity measures the proportion of actual negative cases correctly identified by the model. Here it is 0.9866, indicating the model correctly identified approximately 98.66% of the actual negative (non-attrition) cases.
* Precision: precision is the ratio of true positive predictions to all predicted positive cases. Here it is 0.8077, meaning that about 80.77% of the cases the model predicted as attrition were actual attrition cases.
* Recall: recall is another term for sensitivity, as explained above; it is likewise 0.3134.
* F1 score: the F1 score is the harmonic mean of precision and recall, providing a balance between the two, especially when there is an imbalance between positive and negative cases. Here it is 0.4516.
These metrics collectively provide insights into how well the model is performing in terms of correctly classifying attrition and non-attrition cases. A higher sensitivity is desired when correctly identifying positive cases is crucial, whereas a higher specificity is important when correctly identifying negative cases is a priority. Precision and Recall help evaluate the trade-off between true positives and false positives, while the F1 Score provides a balanced measure between Precision and Recall.
In order to create an ROC curve, we need to select a sequence of decision thresholds and calculate the corresponding sensitivity and specificity. CAUTION: since ROC and AUC are used for model selection, we use the training data to construct the ROC curve and calculate the AUC. A sketch follows.
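A minimal sketch using the pROC package, one of several R options for ROC/AUC:

```r
# ROC and AUC on the training data, per the caution above.
library(pROC)
fit   <- glm(Attrition_num ~ ., family = binomial, data = train)
p.hat <- predict(fit, type = "response")
roc.obj <- roc(response = train$Attrition_num, predictor = p.hat)
plot(roc.obj)    # ROC curve: sensitivity vs. specificity trade-off
auc(roc.obj)     # area under the curve
```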
Performance Metrics and ROC Curve: To quantitatively measure the model's performance, various metrics come into play. The ROC (Receiver Operating Characteristic) curve is a graphical representation of a model's ability to distinguish between the classes, showing the trade-off between sensitivity and specificity. The area under the ROC curve (AUC) quantifies the model's discriminatory power, where a higher AUC indicates better performance. In this exploration, we delve into the synergy of logistic regression, training, validation, and test datasets, cross-validation techniques, and the visualization of model performance through ROC curves. By combining these elements, we aim to build robust and accurate logistic regression models that effectively predict binary outcomes in various domains.
An AUC (Area Under the Curve) value of 0.8151 indicates that our logistic regression model performs reasonably well in distinguishing between the two classes (attrition and non-attrition). The AUC ranges from 0 to 1, where a higher value indicates better predictive performance; a value of 0.8151 suggests the model has a good ability to discriminate between employees who will leave and those who will not. When interpreting the AUC value:

* AUC = 0.5 indicates random performance (no discriminative power).
* AUC > 0.5 and < 0.7 suggests a poor to fair discriminative ability.
* AUC > 0.7 and < 0.9 indicates a good to excellent discriminative ability.
* AUC > 0.9 suggests outstanding discriminative ability.

Keep in mind that the AUC is just one measure of the model's performance. It is important to consider other metrics such as sensitivity, specificity, precision, and F1-score, as well as domain knowledge, when evaluating and interpreting the results of our logistic regression model.
*Concluding Remarks* In conclusion, the combination of logistic regression, training and test datasets, cross-validation, and ROC analysis forms a comprehensive framework for predictive modeling. By embracing these techniques, we empower ourselves not only to build accurate models but also to validate their effectiveness and make informed decisions based on their predictive capabilities. This multifaceted approach is a cornerstone of modern data science, enabling us to unlock insights and drive informed actions from complex datasets.
A neural network is a type of machine learning algorithm that is designed to recognize patterns in data. It’s inspired by the structure and function of the human brain, where interconnected neurons work together to process and transmit information. Neural networks consist of layers of interconnected nodes (neurons) that process input data and produce output. These layers are typically organized into three main types:
* Input layer: receives the raw input data and passes it on to the next layer for processing.
* Hidden layers: one or more layers between the input and output layers. Each neuron in a hidden layer processes the information it receives from the previous layer and passes its output to the next layer; the hidden layers are responsible for learning complex patterns in the data.
* Output layer: produces the final prediction or output based on the information processed by the hidden layers. The structure of the output layer depends on the type of problem being solved, such as classification or regression.
Neural networks “learn” from data through a process called training. During training, the network adjusts the connections between neurons (weights) based on the input data and the desired output. This adjustment is done iteratively using optimization algorithms that minimize the difference between the predicted output and the actual target. Neural networks have gained significant popularity due to their ability to learn from complex and high-dimensional data, making them suitable for tasks such as image and speech recognition, language translation, playing games, autonomous driving, and more. They can capture intricate relationships in data that may be difficult for traditional algorithms to discover. While neural networks can achieve remarkable accuracy, they also come with challenges. They require large amounts of data for training, and determining the right architecture (number of layers and neurons) and optimization techniques can be complex. Deep learning, a subset of neural networks, involves using multiple layers to create complex models and has been a driving force behind many recent breakthroughs in AI.
In summary, neural networks are a powerful tool in the field of machine learning, capable of learning and representing complex patterns in data. They have enabled significant advancements in various domains and continue to be a focus of research and development.
The neuralnet() function requires all features to be in numeric form (dummy variables for categorical features, normalization of numerical features). The model formula in neuralnet() requires dummy variables to be explicitly defined, and it is highly recommended to scale all numerical features before including them in the network model. The objective is to find all feature names (numeric and all dummy variables) and write them in the model formula like the one in glm: response ~ var_1 + var_2 + ... + var_k. To explain the modeling process in detail, we outline the major steps in the following subsections.
There are different types of scaling and standardization. The one we use in the following is min-max scaling:

\[ \text{scaled.var} = \frac{\text{orig.var} - \min(\text{orig.var})}{\max(\text{orig.var}) - \min(\text{orig.var})} \]

The scaled numeric feature is unitless (similar to the well-known z-score transformation). A small helper implementing this follows.
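A minimal helper implementing the transformation above:

```r
# Min-max scaling exactly as in the formula above.
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply to every numeric column (sketch; the report may scale a subset).
num.cols <- sapply(cleaned, is.numeric)
scaled.data <- cleaned
scaled.data[num.cols] <- lapply(cleaned[num.cols], minmax)
```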
In practical applications, there may be many categorical features in the model, and each could have many categories. It is practically infeasible to write all resulting dummy features explicitly. We can use an R function to extract from a model formula the variables that will be used in a model. All categorical feature variables must be defined in non-numerical form (i.e., they should not be numerically encoded). We can also use the R function relevel() to change the baseline of an unordered categorical feature variable.
Next, we use the R function model.matrix() to extract the names of all feature variables, including the implicitly defined dummy feature variables.
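A minimal sketch of this extraction, assuming the scaled data frame from the previous step:

```r
# Build the design matrix: one column per numeric feature and per dummy
# level, plus an intercept; then read off its column names.
X <- model.matrix(~ ., data = scaled.data)
colnames(X)
```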
## [1] "(Intercept)" "Age"
## [3] "AttritionYes" "BusinessTravelTravel_Frequently"
## [5] "BusinessTravelTravel_Non" "DepartmentHuman Resources"
## [7] "DepartmentResearch & Development" "DistanceFromHome"
## [9] "Education" "EducationFieldHuman Resources"
## [11] "EducationFieldLife Sciences" "EducationFieldMarketing"
## [13] "EducationFieldMedical" "EducationFieldOther"
## [15] "EnvironmentSatisfaction" "GenderFemale"
## [17] "JobInvolvement" "JobLevel"
## [19] "JobRoleHealthcare Representative" "JobRoleHuman Resources"
## [21] "JobRoleLaboratory Technician" "JobRoleManager"
## [23] "JobRoleManufacturing Director" "JobRoleResearch Scientist"
## [25] "JobRoleSales Executive" "JobRoleSales Representative"
## [27] "JobSatisfaction" "MaritalStatusDivorced"
## [29] "MaritalStatusMarried" "MonthlyIncome"
## [31] "NumCompaniesWorked" "OverTimeYes"
## [33] "PercentSalaryHike" "PerformanceRating"
## [35] "RelationshipSatisfaction" "StockOptionLevel"
## [37] "TrainingTimesLastYear" "WorkLifeBalance"
## [39] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [41] "YearsWithCurrManager"
The dummy feature variables above have naming issues for network modeling (although the names are fine for linear and generalized linear regression models): we need to rename them to exclude special characters such as spaces and “&” before building neural network models. These issues can be avoided at the feature-engineering stage if neural network models are planned from the start. Next, we clean up the variable names before defining the network model formula.
For convenience, we encourage you to use CamelCase notation (capitalizing the first letter of each word and omitting spaces) when naming feature variables.
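A minimal sketch of the clean-up and formula construction is given below; the exact renaming used for this document may differ (some names in the formula below were evidently also shortened, e.g. TravelFreq), so treat the pattern, not the exact output, as the point.

```r
# Strip special characters (spaces, '&', '_') so the names are legal
# in a neuralnet() model formula.
var.names <- gsub("[^[:alnum:]]", "", colnames(X)[-1])  # drop "(Intercept)"
colnames(X)[-1] <- var.names

# Assemble the formula programmatically, exactly as we would for glm().
rhs <- setdiff(var.names, "AttritionYes")
nn.formula <- as.formula(paste("AttritionYes ~", paste(rhs, collapse = " + ")))
nn.formula
```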
## AttritionYes ~ Age + BusinessTravelTravelFreq + BusinessTravelTravelNon +
## DepartmentHumanResources + DepartmentResearchDevelopment +
## DistanceFromHome + Education + EducationFieldHumanResources +
## EducationFieldLifeSciences + EducationFieldMarketing + EducationFieldMedical +
## EducationFieldOther + EnvironmentSatisfaction + GenderFemale +
## JobInvolvement + JobLevel + JobRoleHealthcareRepresentative +
## JobRoleHumanResources + JobRoleLaboratorytechnician + JobRoleManager +
## JobRoleManufacturingDirector + JobRoleResearchScientist +
## JobRoleSalesExecutive + JobRoleSalesRepresentative + JobSatisfaction +
## MaritalStatusDivorced + MaritalStatusMarried + MonthlyIncome +
## NumCompaniesWorked + OverTimeYes + PercentSalaryHike + PerformanceRating +
## RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear +
## WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion +
## YearsWithCurrManager
We follow the routine steps for building a neural network model to predict Attrition.
We split the data into 70% for training the neural network and 30% for testing.
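A minimal sketch of the split and the network fit (a single hidden neuron, matching Figure 12 below) follows; the seed and object names are assumptions.

```r
library(neuralnet)

dat <- as.data.frame(X[, -1])            # cleaned features, intercept dropped

set.seed(123)                            # assumed seed, for reproducibility
train.id <- sample(nrow(dat), round(0.7 * nrow(dat)))
train <- dat[train.id, ]
test  <- dat[-train.id, ]

# One hidden layer with a single neuron; logistic output for classification.
nn.model <- neuralnet(nn.formula, data = train,
                      hidden = 1, linear.output = FALSE)
```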
| Quantity | Value |
|---|---|
| error | 12.4130373 |
| reached.threshold | 0.0097831 |
| steps | 4121.0000000 |
| Intercept.to.1layhid1 | 4.7283796 |
| Age.to.1layhid1 | -25.2700187 |
| BusinessTravelTravelFreq.to.1layhid1 | 29.9256578 |
| BusinessTravelTravelNon.to.1layhid1 | -0.1415772 |
| DepartmentHumanResources.to.1layhid1 | -1.5439061 |
| DepartmentResearchDevelopment.to.1layhid1 | -0.7428586 |
| DistanceFromHome.to.1layhid1 | 13.9251898 |
| Education.to.1layhid1 | 0.6244195 |
| EducationFieldHumanResources.to.1layhid1 | 21.3871291 |
| EducationFieldLifeSciences.to.1layhid1 | 10.6746956 |
| EducationFieldMarketing.to.1layhid1 | 16.6426970 |
| EducationFieldMedical.to.1layhid1 | 3.2402965 |
| EducationFieldOther.to.1layhid1 | 10.8804176 |
| EnvironmentSatisfaction.to.1layhid1 | -33.3847948 |
| GenderFemale.to.1layhid1 | -9.9868604 |
| JobInvolvement.to.1layhid1 | -9.3427068 |
| JobLevel.to.1layhid1 | -3.6680898 |
| JobRoleHealthcareRepresentative.to.1layhid1 | -409.4723540 |
| JobRoleHumanResources.to.1layhid1 | -0.4357436 |
| JobRoleLaboratorytechnician.to.1layhid1 | 16.5525537 |
| JobRoleManager.to.1layhid1 | -409.4229911 |
| JobRoleManufacturingDirector.to.1layhid1 | -15.1456200 |
| JobRoleResearchScientist.to.1layhid1 | 6.8461683 |
| JobRoleSalesExecutive.to.1layhid1 | 6.4702694 |
| JobRoleSalesRepresentative.to.1layhid1 | 22.7881528 |
| JobSatisfaction.to.1layhid1 | -2.6865442 |
| MaritalStatusDivorced.to.1layhid1 | -20.3574510 |
| MaritalStatusMarried.to.1layhid1 | -21.5059449 |
| MonthlyIncome.to.1layhid1 | -12.7027070 |
| NumCompaniesWorked.to.1layhid1 | 7.1220929 |
| OverTimeYes.to.1layhid1 | 41.7518850 |
| PercentSalaryHike.to.1layhid1 | 22.2213500 |
| PerformanceRating.to.1layhid1 | -8.6784197 |
| RelationshipSatisfaction.to.1layhid1 | -9.0012367 |
| StockOptionLevel.to.1layhid1 | 3.9173498 |
| TrainingTimesLastYear.to.1layhid1 | -11.3360298 |
| WorkLifeBalance.to.1layhid1 | 9.3880611 |
| YearsInCurrentRole.to.1layhid1 | -30.2971577 |
| YearsSinceLastPromotion.to.1layhid1 | -2.3969326 |
| YearsWithCurrManager.to.1layhid1 | -2.7083013 |
| Intercept.to.AttritionYes | 0.0543155 |
| 1layhid1.to.AttritionYes | 0.8470791 |
Figure 12. Single-layer backpropagation neural network model for employee attrition.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.3545 | 1.608 | 0.2204 | 0.8256 |
| Age | -1.502 | 0.5208 | -2.884 | 0.003921 |
| BusinessTravelTravel_Frequently | 0.8802 | 0.2047 | 4.299 | 1.713e-05 |
| DepartmentHuman Resources | -12.63 | 446 | -0.02832 | 0.9774 |
| DepartmentResearch & Development | 0.2472 | 1.113 | 0.222 | 0.8243 |
| DistanceFromHome | 1.196 | 0.3055 | 3.913 | 9.099e-05 |
| Education | 0.09759 | 0.3507 | 0.2783 | 0.7808 |
| EducationFieldHuman Resources | -0.02841 | 0.8175 | -0.03475 | 0.9723 |
| EducationFieldLife Sciences | -0.8385 | 0.304 | -2.758 | 0.005809 |
| EducationFieldMarketing | -0.4895 | 0.3912 | -1.251 | 0.2109 |
| EducationFieldMedical | -0.903 | 0.3136 | -2.879 | 0.003989 |
| EducationFieldOther | -1.133 | 0.4889 | -2.318 | 0.02043 |
| EnvironmentSatisfaction | -1.07 | 0.2452 | -4.363 | 1.282e-05 |
| GenderFemale | -0.3189 | 0.1842 | -1.731 | 0.08339 |
| JobInvolvement | -1.677 | 0.3706 | -4.525 | 6.042e-06 |
| JobLevel | -0.6883 | 1.183 | -0.5818 | 0.5607 |
| JobRoleHealthcare Representative | 1.016 | 0.9542 | 1.065 | 0.287 |
| JobRoleHuman Resources | 15.17 | 446 | 0.03402 | 0.9729 |
| JobRoleLaboratory Technician | 2.614 | 0.9983 | 2.619 | 0.00883 |
| JobRoleManager | 1.406 | 1.07 | 1.315 | 0.1886 |
| JobRoleManufacturing Director | 1.203 | 0.9399 | 1.28 | 0.2006 |
| JobRoleResearch Scientist | 1.699 | 1.001 | 1.698 | 0.08956 |
| JobRoleSales Executive | 2.377 | 1.422 | 1.672 | 0.0946 |
| JobRoleSales Representative | 3.43 | 1.51 | 2.272 | 0.02311 |
| JobSatisfaction | -1.041 | 0.2406 | -4.329 | 1.501e-05 |
| MaritalStatusDivorced | -1.014 | 0.3419 | -2.967 | 0.003009 |
| MaritalStatusMarried | -0.7231 | 0.248 | -2.916 | 0.003546 |
| MonthlyIncome | 0.7883 | 1.503 | 0.5246 | 0.5999 |
| NumCompaniesWorked | 1.306 | 0.3418 | 3.822 | 0.0001326 |
| OverTimeYes | 1.82 | 0.1904 | 9.558 | 1.197e-21 |
| PercentSalaryHike | -0.3553 | 0.5425 | -0.6549 | 0.5126 |
| PerformanceRating | 0.02365 | 0.396 | 0.05972 | 0.9524 |
| RelationshipSatisfaction | -0.6659 | 0.2483 | -2.681 | 0.00733 |
| StockOptionLevel | -0.6092 | 0.4656 | -1.309 | 0.1907 |
| TrainingTimesLastYear | -0.9523 | 0.4452 | -2.139 | 0.03244 |
| WorkLifeBalance | -0.9719 | 0.3725 | -2.609 | 0.009082 |
| YearsInCurrentRole | -1.111 | 0.7599 | -1.462 | 0.1438 |
| YearsSinceLastPromotion | -0.4548 | 0.7042 | -0.6458 | 0.5184 |
Cross-validation is primarily used for tuning hyperparameters. For example, with the sigmoid perceptron, the optimal cut-off score for the binary decision can be obtained through cross-validation. Another important hyperparameter in neural network models is the learning rate \(\alpha\) in the backpropagation algorithm, which controls the learning speed when training the network.
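As a sketch of how the cut-off can be tuned by cross-validation (reusing `train`, `rhs`, and `nn.formula` from the sketches above; the fold count and grid are assumptions):

```r
# 5-fold cross-validation over a grid of candidate cut-off scores.
k <- 5
cuts <- seq(0.1, 0.9, by = 0.05)
fold <- sample(rep(1:k, length.out = nrow(train)))
acc <- matrix(0, k, length(cuts))

for (i in 1:k) {
  fit  <- neuralnet(nn.formula, data = train[fold != i, ],
                    hidden = 1, linear.output = FALSE)
  pred <- compute(fit, train[fold == i, rhs])$net.result
  truth <- train$AttritionYes[fold == i]
  for (j in seq_along(cuts)) acc[i, j] <- mean((pred > cuts[j]) == truth)
}
optimal.cut <- cuts[which.max(colMeans(acc))]  # cut-off with best CV accuracy
```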
## $confusion.matrix
##
## 0 1
## FALSE 699 71
## TRUE 76 78
##
## $accuracy
## [1] 0.8409091
Recall that the ROC curve plots sensitivity against (1 - specificity), calculated from confusion matrices over a sequence of selected cut-off scores. In terms of the confusion matrix above, sensitivity = TP / (TP + FN) is the true positive rate and specificity = TN / (TN + FP) is the true negative rate. Next, we construct the ROC curve for the above NN model based on the training data set.
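The curve below can be produced, for example, with the pROC package (our choice for this sketch, not necessarily the one used to render the figure):

```r
library(pROC)

# Predicted attrition probabilities on the training data.
pred.train <- compute(nn.model, train[, rhs])$net.result

roc.nn <- roc(train$AttritionYes, as.vector(pred.train))
plot(roc.nn)        # plot the ROC curve (cf. Figure 13)
auc(roc.nn)
```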
Figure 13. ROC curve of the neural network model.
The above ROC curve indicates that the underlying neural network is better than a random guess, since the area under the curve is significantly greater than 0.5. In general, if the area under the ROC curve is greater than 0.65, we say the predictive power of the underlying model is acceptable.

Here is a comparison of the two AUC values obtained so far:

* Logistic model AUC: 0.8151
* Neural network model AUC: 0.8337

AUC ranges between 0 and 1, and a higher value indicates better discrimination between the classes; any value above 0.5 means the model performs better than random chance, a bar both models clear comfortably. The neural network model has a slightly higher AUC (0.8337) than the logistic model (0.8151), suggesting that it is somewhat better at distinguishing between the positive and negative classes under this evaluation metric.
Decision tree algorithms are a class of machine learning algorithms used for both classification and regression tasks. They work by recursively splitting the dataset into subsets based on the values of input features, ultimately creating a tree-like structure of decisions that leads to a predicted output. In the context of our employee Attrition dataset, decision tree algorithms can be appropriate for the following reasons:
* Interpretability: Decision trees provide transparent, easy-to-understand rules that map input features to predicted outcomes. This is valuable for understanding the factors that contribute to employee attrition.
* Feature importance: Decision trees can rank features by their importance in the decision-making process, helping identify which factors have the most significant impact on attrition.
* Non-linearity: Decision trees can capture complex relationships and interactions between features, which may be present in attrition scenarios.
* Handling missing values: Decision trees handle missing values robustly by considering the available data at each split.
* No distributional assumptions: Decision trees do not assume any specific data distribution, making them suitable for various types of data.
The Decision Tree (DT) algorithm is based on conditional probabilities. Unlike the other classification algorithms, decision trees generate rules. A rule is a conditional statement that can easily be understood by humans and easily used within a database to identify a set of records. It is easy to interpret and implement in real-world applications. Among several basic tree-based algorithms, Classification and Regression Tree (CART) is most frequently used in practice.
This is a new class of models, different from the logistic and neural network models above. We first load the analytic data set.
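A minimal loading sketch, using the public copy of the data referenced at the beginning of this document and re-splitting the data for the tree models (object names and the seed are assumptions):

```r
url <- "https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv"
attrition <- read.csv(url, stringsAsFactors = TRUE)

# Drop the columns flagged for removal during feature engineering.
drop.cols <- c("EmployeeNumber", "DailyRate", "HourlyRate", "MonthlyRate",
               "EmployeeCount", "Over18", "StandardHours")
attrition <- attrition[, !(names(attrition) %in% drop.cols)]

set.seed(123)
id <- sample(nrow(attrition), round(0.7 * nrow(attrition)))
train.dt <- attrition[id, ]
test.dt  <- attrition[-id, ]
```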
rpart() offers a lot of flexibility in constructing decision trees through its user controls. It is particularly useful in applications where the costs of false positives and false negatives differ.
Next, we write a wrapper so we can build different decision trees conveniently.
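The wrapper code itself is not rendered here; a minimal sketch of what such a wrapper might look like is given below. The function and argument names are assumptions; the loss matrix follows rpart()'s convention (rows are true classes, columns are predicted classes, zeros on the diagonal), assuming the factor levels are ordered No, Yes.

```r
library(rpart)

# Build a classification tree with a chosen split criterion and
# user-specified misclassification costs.
build.tree <- function(split = c("gini", "information"),
                       fn.cost = 1, fp.cost = 1, data = train.dt) {
  split <- match.arg(split)
  loss <- matrix(c(0,       fp.cost,   # true No : predicted No, Yes
                   fn.cost, 0),        # true Yes: predicted No, Yes
                 nrow = 2, byrow = TRUE)
  rpart(Attrition ~ ., data = data, method = "class",
        parms = list(split = split, loss = loss))
}
```

For example, Model 3 below would be obtained with `build.tree("gini", fn.cost = 10)`.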
Using the above function, we define six different decision tree models:

* Model 1: gini.tree.11 is based on the Gini index, without penalizing false positives or false negatives.
* Model 2: info.tree.11 is based on entropy, without penalizing false positives or false negatives.
* Model 3: gini.tree.110 is based on the Gini index, with the cost of a false negative 10 times that of a false positive.
* Model 4: info.tree.110 is based on entropy, with the cost of a false negative 10 times that of a false positive.
* Model 5: gini.tree.101 is based on the Gini index, with the cost of a false positive 10 times that of a false negative.
* Model 6: info.tree.101 is based on entropy, with the cost of a false positive 10 times that of a false negative.
Figure 14. Non-penalized decision tree models using the Gini index (left) and entropy (right).
Figure 15. Penalized decision tree models using the Gini index (left) and entropy (right).
We built six different decision tree models above. Next, we use ROC analysis to select the best among these candidate models.
The above function has three arguments that let users choose different types of decision trees, including the six trees defined in the previous subsection. Next, we use this function to build the six trees and plot their corresponding ROC curves so we can compare the global performance of these tree algorithms.
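A sketch of the comparison, reusing the hypothetical `build.tree()` wrapper and the pROC functions from the earlier sketches:

```r
trees <- list(gini.11  = build.tree("gini"),
              info.11  = build.tree("information"),
              gini.110 = build.tree("gini", fn.cost = 10),
              info.110 = build.tree("information", fn.cost = 10),
              gini.101 = build.tree("gini", fp.cost = 10),
              info.101 = build.tree("information", fp.cost = 10))

# AUC of each tree on the training data (column 2 = P(Attrition = Yes)).
aucs <- sapply(trees, function(tr) {
  p <- predict(tr, newdata = train.dt, type = "prob")[, 2]
  as.numeric(auc(roc(train.dt$Attrition, p)))
})
round(aucs, 4)
```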
Next, we plot the ROC curves and calculate the areas under the ROC curves for the individual decision tree models.

Figure 16. Comparison of ROC curves.
The above ROC curves represent the various decision trees and their corresponding AUCs. The model with the largest AUC is considered the best decision tree among the candidates. An AUC of 0.875 (the largest of the six) indicates that the corresponding decision tree performs relatively well in distinguishing between the positive and negative classes: AUC measures the model's ability to correctly rank instances of different classes by their predicted probabilities, with values closer to 1.0 indicating better discrimination and values near 0.5 indicating random performance.
As usual, once the final model is determined, we need to find the optimal cut-off score before reporting the predictive performance of the final model on the test data. Keep in mind that the optimal cut-off determination through cross-validation must be based on the training data set.
In practical applications, one may end up with two or more final models with similar AUCs. In this case, we report the performance of all final models on the test data and let clients choose one to deploy (possibly keeping the rest as challengers). For this reason, we write a function to determine the optimal cut-off for a given decision tree, since different decision trees have their own optimal cut-offs.
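A minimal sketch of such a function, using training-set accuracy as the selection criterion (other objectives, e.g. Youden's J, could be substituted):

```r
# Grid-search the cut-off that maximizes accuracy for a given tree.
optimal.cutoff <- function(tree, data = train.dt,
                           cuts = seq(0.05, 0.95, by = 0.01)) {
  p <- predict(tree, newdata = data, type = "prob")[, 2]
  y <- as.numeric(data$Attrition == "Yes")
  acc <- sapply(cuts, function(s) mean((p > s) == y))
  list(cutoff = cuts[which.max(acc)], accuracy = max(acc))
}
```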
For demonstration, we use the above function to calculate the optimal cut-offs of the six decision trees constructed earlier.

Figure 17. Plot of optimal cut-off determination.
As anticipated, different trees have their own optimal cut-offs. Keep in mind that the cut-off is itself random (it depends on the randomly split training data), so different runs may yield different cut-offs. Depending on the tree size, we may even obtain multiple optimal cut-offs; technically, any one of them can be chosen for implementation. A better recommendation is to average the multiple cut-offs and use that average as the final cut-off on the test data set.
Our decision tree model achieved an area under the curve (AUC) of 0.875. A higher AUC indicates better discriminatory power in distinguishing between positive and negative cases. In the context of employee attrition analysis, this suggests the model ranks and classifies employees well with respect to attrition risk.
However, the choice of a cutoff score is important for practical deployment and decision-making. The cutoff score determines how the model’s predictions are converted into binary classifications (positive or negative). The selection of a cutoff involves a trade-off between sensitivity (true positive rate) and specificity (true negative rate).
* Model performance: An AUC of 0.875 is generally considered good, indicating that our decision tree model differentiates well between employees who will attrite and those who won't.
* Cut-off selection: The optimal cut-off score depends on the specific business context and the relative importance of false positives (predicting attrition when it won't occur) versus false negatives (failing to predict attrition when it will occur). We should choose a cut-off that aligns with the organization's goals and risk tolerance.
* Trade-off consideration: We need to balance sensitivity and specificity based on business needs. Higher sensitivity is preferred if missing actual attrition cases is more costly, while higher specificity is preferred if avoiding false alarms is the higher priority.
* Further analysis: To make an informed decision about the cut-off, we might also analyze precision, recall, the F1-score, and other relevant metrics, and perform sensitivity analysis by evaluating the model's performance across a range of cut-off values.
* Deployment and monitoring: Model performance can vary in real-world scenarios, so it is important to monitor the model after deployment and refine the cut-off if necessary.
In summary, our Decision Tree model is demonstrating strong predictive performance, but the choice of a cutoff score should be guided by the specific needs of the organization and the implications of false positives and false negatives in the context of employee attrition.
Based on the analysis and comparison of the three models used in this project, namely logistic regression, the neural network, and the decision tree, we conclude that the neural network model is the preferred choice. The neural network achieved an AUC of 0.8337, higher than that of logistic regression (0.8151); the decision tree reported a larger AUC (0.875), but on the training data.
While the Decision Tree model had the highest AUC value, it’s important to consider that Decision Trees are prone to overfitting, which may lead to overly optimistic results on the training data. The Neural Network model demonstrated a good balance between predictive accuracy and generalization to unseen data.
A natural next step is to understand attrition by department: run the predictive models separately on each of the three departments and examine which factors affect attrition within each. This would give a more in-depth view of the key driving forces behind employee attrition in each subgroup, and allow us to provide meaningful insights on attrition factors at the department level.
Feedback is welcome. Thank you.