In People Analytics, the attrition data set created by IBM is a widely used resource for addressing questions about employee turnover and engagement. With 1470 rows and 35 columns, it covers many aspects of employees’ professional and personal lives. The data set combines typical Human Resources Information System (HRIS) data with an engagement survey, giving a broad view of employees’ experiences and sentiments.
The main objective of this analysis is to understand and predict employee turnover within an organization. By examining the factors that contribute to attrition, we aim to uncover valuable insights that can guide HR strategies and improve employee retention rates. Moreover, this data set offers a unique opportunity to identify differences between the group of employees who chose to stay and those who decided to leave the organization.
The data set was created and made available by IBM, a renowned leader in the tech industry. It is a fictional data set constructed by IBM data scientists for demonstration purposes, so the details of data collection are not provided and the findings should be read as illustrative. Its structure is nonetheless well suited to meaningful analysis, covering a wide range of attributes relevant to employee turnover and engagement.
Sample Size and Feature Variables: The data set consists of 1470 rows, each representing an individual employee record, and 35 columns. One column (Attrition) is the response variable; the remaining columns are candidate feature variables. They include demographic information such as age and gender, factors related to job satisfaction and environment satisfaction, education field, job role, income, overtime, percentage salary hike, tenure, training time, years in the current role, relationship status, and several other parameters that may impact attrition and engagement.
The feature variables encompass various data types, including numeric, categorical, and ordinal data, which allows for a diverse set of analyses and modeling approaches.
In conclusion, this data set offers a comprehensive collection of features that are crucial for understanding and predicting employee attrition within an organization. By delving into the relationships between different variables, we can gain valuable insights that have practical implications for HR policies and practices.
A detailed description of the variables is given below:
| Variable | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Education | Below College | College | Bachelor | Master | Doctor |
| EnvironmentSatisfaction | Low | Medium | High | Very High | |
| JobInvolvement | Low | Medium | High | Very High | |
| JobSatisfaction | Low | Medium | High | Very High | |
| PerformanceRating | Low | Good | Excellent | Outstanding | |
| RelationshipSatisfaction | Low | Medium | High | Very High | |
| WorkLifeBalance | Bad | Good | Better | Best | |
A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv.
EmployeeAttrition = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv")
Applying binary coding.
# Convert our response variable from 'Yes' and 'No' to 1 and 0
EmployeeAttrition$Attrition_num <- ifelse(EmployeeAttrition$Attrition == "Yes", 1, 0)
The EmployeeNumber column will be removed to protect the identity of employees and their sensitive information.
Additional columns have been identified for removal:
DailyRate, HourlyRate, and MonthlyRate cannot be meaningfully interpreted. EmployeeCount, Over18, and StandardHours take the same value for every employee, so they carry no information.
# Using subset() function to drop specific columns
EmployeeAttrition_drop <- subset(EmployeeAttrition, select = -c(EmployeeNumber, EmployeeCount, Over18, StandardHours, DailyRate, HourlyRate, MonthlyRate))
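The claim that some columns are uniform can be checked programmatically rather than by eye. The following is a minimal sketch on a small toy data frame (the values and the object names `toy`, `n_distinct`, and `constant_cols` are hypothetical, not taken from the attrition data); the same idea could be applied to `EmployeeAttrition`.

```r
# Toy data frame standing in for the HR data (hypothetical values)
toy <- data.frame(
  EmployeeCount = c(1, 1, 1, 1),
  Over18        = c("Y", "Y", "Y", "Y"),
  Age           = c(25, 41, 33, 52),
  stringsAsFactors = FALSE
)

# Count the distinct values in each column
n_distinct <- sapply(toy, function(x) length(unique(x)))

# Columns with a single distinct value carry no information for modeling
constant_cols <- names(n_distinct)[n_distinct == 1]
print(constant_cols)
```

Columns flagged this way are natural candidates for removal before modeling.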
We first scan the entire data set and determine the EDA tools to use for feature engineering.
summary(EmployeeAttrition_drop)
## Age Attrition BusinessTravel Department
## Min. :18.00 Length:1470 Length:1470 Length:1470
## 1st Qu.:30.00 Class :character Class :character Class :character
## Median :36.00 Mode :character Mode :character Mode :character
## Mean :36.92
## 3rd Qu.:43.00
## Max. :60.00
## DistanceFromHome Education EducationField EnvironmentSatisfaction
## Min. : 1.000 Min. :1.000 Length:1470 Min. :1.000
## 1st Qu.: 2.000 1st Qu.:2.000 Class :character 1st Qu.:2.000
## Median : 7.000 Median :3.000 Mode :character Median :3.000
## Mean : 9.193 Mean :2.913 Mean :2.722
## 3rd Qu.:14.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000 Max. :4.000
## Gender JobInvolvement JobLevel JobRole
## Length:1470 Min. :1.00 Min. :1.000 Length:1470
## Class :character 1st Qu.:2.00 1st Qu.:1.000 Class :character
## Mode :character Median :3.00 Median :2.000 Mode :character
## Mean :2.73 Mean :2.064
## 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :4.00 Max. :5.000
## JobSatisfaction MaritalStatus MonthlyIncome NumCompaniesWorked
## Min. :1.000 Length:1470 Min. : 1009 Min. :0.000
## 1st Qu.:2.000 Class :character 1st Qu.: 2911 1st Qu.:1.000
## Median :3.000 Mode :character Median : 4919 Median :2.000
## Mean :2.729 Mean : 6503 Mean :2.693
## 3rd Qu.:4.000 3rd Qu.: 8379 3rd Qu.:4.000
## Max. :4.000 Max. :19999 Max. :9.000
## OverTime PercentSalaryHike PerformanceRating
## Length:1470 Min. :11.00 Min. :3.000
## Class :character 1st Qu.:12.00 1st Qu.:3.000
## Mode :character Median :14.00 Median :3.000
## Mean :15.21 Mean :3.154
## 3rd Qu.:18.00 3rd Qu.:3.000
## Max. :25.00 Max. :4.000
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## Min. :1.000 Min. :0.0000 Min. : 0.00
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.: 6.00
## Median :3.000 Median :1.0000 Median :10.00
## Mean :2.712 Mean :0.7939 Mean :11.28
## 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:15.00
## Max. :4.000 Max. :3.0000 Max. :40.00
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 Median :3.000 Median : 5.000 Median : 3.000
## Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000
## YearsSinceLastPromotion YearsWithCurrManager Attrition_num
## Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.: 0.000 1st Qu.: 2.000 1st Qu.:0.0000
## Median : 1.000 Median : 3.000 Median :0.0000
## Mean : 2.188 Mean : 4.123 Mean :0.1612
## 3rd Qu.: 3.000 3rd Qu.: 7.000 3rd Qu.:0.0000
## Max. :15.000 Max. :17.000 Max. :1.0000
There appear to be no apparent outliers. All the numerical variables are in a reasonable range.
The average age is around 37 years, with a maximum of 60 and a minimum of 18. This age range could span new interns through senior managers. Some employees live very close to work, while others live far away. Salaries also vary widely: the average monthly income is about USD 6,500, ranging from about USD 1,000 to about USD 20,000. This spread is plausible, as new interns are paid less whereas senior managers earn considerably more. The average tenure at the company is 7 years; some employees have been with the company for 40 years, while others have just started.
# checking the unique character variables.
unique(EmployeeAttrition_drop$Gender)
## [1] "Female" "Male"
unique(EmployeeAttrition_drop$BusinessTravel)
## [1] "Travel_Rarely" "Travel_Frequently" "Non-Travel"
unique(EmployeeAttrition_drop$Department)
## [1] "Sales" "Research & Development" "Human Resources"
unique(EmployeeAttrition_drop$JobRole)
## [1] "Sales Executive" "Research Scientist"
## [3] "Laboratory Technician" "Manufacturing Director"
## [5] "Healthcare Representative" "Manager"
## [7] "Sales Representative" "Research Director"
## [9] "Human Resources"
unique(EmployeeAttrition_drop$MaritalStatus)
## [1] "Single" "Married" "Divorced"
unique(EmployeeAttrition_drop$OverTime)
## [1] "Yes" "No"
unique(EmployeeAttrition_drop$Attrition)
## [1] "Yes" "No"
unique(EmployeeAttrition_drop$EducationField)
## [1] "Life Sciences" "Other" "Medical" "Marketing"
## [5] "Technical Degree" "Human Resources"
All of the categorical variables have consistent category labels.
Since the data set appears to have no missing values, it simplifies the Exploratory Data Analysis (EDA) process. With a complete data set, we can focus on exploring relationships, identifying patterns, and gaining insights more effectively.
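The absence of missing values can be confirmed with a per-column count of `NA`s. This is a minimal sketch on a toy data frame that deliberately contains missing values so the output is informative (the object names `toy`, `na_counts`, and `any_missing` are hypothetical); on the attrition data, every count would be expected to be zero.

```r
# Toy data frame with two deliberately missing entries (hypothetical values)
toy <- data.frame(
  Age      = c(25, NA, 33),
  OverTime = c("Yes", "No", NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Single TRUE/FALSE answer: does the data frame contain any NA at all?
any_missing <- anyNA(toy)
print(any_missing)
```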
The data cleaning process is complete.
numeric_vars <- sapply(EmployeeAttrition_drop, is.numeric)
numeric_data <- EmployeeAttrition_drop[, numeric_vars]
This subsection focuses on the potential discretization of continuous variables and grouping sparse categories of category variables based on their distribution.
We will group age, which ranges from 18 to 60, into three subgroups, and also group the distance from home into three subgroups. For the graphical analysis, we use the grouped variables grp.age and grp.dist in place of Age and DistanceFromHome, and the binary outcome Attrition_num in place of Attrition.
# Interval labels match the conditions: closed at a bound that is included
EmployeeAttrition_drop$grp.age <- ifelse(EmployeeAttrition_drop$Age <= 30, '[18, 30]',
                                  ifelse(EmployeeAttrition_drop$Age >= 50, '[50, 60]', '(30, 50)'))
EmployeeAttrition_drop$grp.dist <- ifelse(EmployeeAttrition_drop$DistanceFromHome <= 10, '[1, 10]',
                                   ifelse(EmployeeAttrition_drop$DistanceFromHome >= 20, '[20, 29]', '(10, 20)'))
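The same grouping can also be expressed with base R's `cut()`, which avoids nested `ifelse()` calls and returns an ordered set of factor levels. This is a sketch under the assumption that ages are whole numbers; the vector `age` and the labels are illustrative, not the ones used in this report.

```r
# Hypothetical ages covering the boundaries of each group
age <- c(18, 30, 31, 49, 50, 60)

# cut() uses right-closed intervals by default: (17,30], (30,49], (49,60]
# For integer ages this reproduces the rule: <= 30, 31-49, >= 50
grp_age <- cut(age,
               breaks = c(17, 30, 49, 60),
               labels = c("18-30", "31-49", "50-60"))

print(data.frame(age, grp_age))
```

Because `cut()` returns a factor, the groups keep their natural order in tables and plots.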
Pairwise association refers to the examination of relationships between pairs of variables in a data set. It involves analyzing how the values of two variables co-occur or change together. There are three different types of pairwise associations: between two numeric variables, between a numeric and a categorical variable, and between two categorical variables.
The best visual tool for assessing pairwise linear association between two numeric variables is a pair-wise scatter plot.
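library(ggplot2) # provides aes()
library(GGally)  # provides ggpairs()
# Selecting specific columns: Age, DistanceFromHome, MonthlyIncome,
# NumCompaniesWorked, PercentSalaryHike, TotalWorkingYears
selected_columns <- c(1, 5, 15, 16, 18, 22)
# Creating ggpairs plot with selected columns
ggpairs(EmployeeAttrition_drop,
        columns = selected_columns,
        aes(color = Attrition, alpha = 0.5))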
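# Selecting specific columns: TrainingTimesLastYear, YearsAtCompany,
# YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager
selected_columnss <- c(23, 25, 26, 27, 28)
# Creating ggpairs plot with selected columns
ggpairs(EmployeeAttrition_drop,
        columns = selected_columnss,
        aes(color = Attrition, alpha = 0.5))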
The off-diagonal plots and numbers indicate the correlation between pairs of numeric variables. As expected, the tenure-related variables are strongly correlated with one another: YearsWithCurrManager and YearsAtCompany, YearsWithCurrManager and YearsInCurrentRole, TotalWorkingYears and YearsAtCompany, YearsInCurrentRole and YearsAtCompany, and YearsSinceLastPromotion and YearsAtCompany. Other paired variables have weak correlations.
The stacked density curves on the main diagonal show the potential difference in the distribution of each numeric variable between the attrition and no-attrition groups. In other words, they show the relationship between a numeric variable and the categorical response. These density curves do not overlap completely, indicating some association between each of these numeric variables and the binary response variable.
Mosaic plots are convenient to show whether two categorical variables are dependent. In EDA, we are primarily interested in whether the response (binary in this case) is independent of categorical variables. Those categorical variables that are independent of the response variable should be excluded in any of the subsequent models and algorithms.
par(mfrow = c(2,2))
mosaicplot(BusinessTravel ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="BusinessTravel vs Attrition")
mosaicplot(Department ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Department vs Attrition")
mosaicplot(EducationField ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="EducationField vs Attrition")
mosaicplot(JobRole ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="JobRole vs Attrition")
The top two mosaic plots show that Attrition is not independent of BusinessTravel and Department, because the proportion of attrition cases is not identical across categories. Employees who travel frequently have the highest attrition rate, whereas non-travel employees have the lowest. The bottom two mosaic plots also show that Attrition is not independent of EducationField and JobRole, because the proportion of attrition cases is not identical across categories.
par(mfrow = c(2,3))
mosaicplot(OverTime ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="OverTime vs Attrition")
mosaicplot(MaritalStatus ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="MaritalStatus vs Attrition")
mosaicplot(Gender ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Gender vs Attrition")
mosaicplot(grp.age ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Agegroup vs Attrition")
mosaicplot(grp.dist ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="DistanceFromHome vs Attrition")
The first two mosaic plots show that Attrition is associated with both OverTime and MaritalStatus: employees who work overtime and employees who are single have higher attrition rates. The bottom two mosaic plots also show that Attrition is not independent of age group and distance from home, because the proportion of attrition cases is not identical across categories. The younger age group has a higher attrition rate than the older age groups, and attrition increases with distance from the workplace. Lastly, Gender appears to have little influence on attrition.
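The visual impression from a mosaic plot can be complemented with a chi-square test of independence, which is not part of the original analysis. The sketch below uses a toy data frame with made-up counts (the object names `toy`, `tab`, `prop`, and `test` are hypothetical) to show the row-wise attrition proportions that a mosaic plot displays, together with the test.

```r
# Toy data: 40 employees with overtime, 60 without (hypothetical counts)
toy <- data.frame(
  OverTime  = rep(c("Yes", "No"), times = c(40, 60)),
  Attrition = c(rep(c("Yes", "No"), times = c(20, 20)),  # overtime group
                rep(c("Yes", "No"), times = c(6, 54))),  # no-overtime group
  stringsAsFactors = FALSE
)

# Contingency table: rows = OverTime, columns = Attrition
tab <- table(toy$OverTime, toy$Attrition)

# Row-wise proportions: the attrition rate within each OverTime category
prop <- prop.table(tab, margin = 1)
print(round(prop, 2))

# Chi-square test of independence between the two categorical variables
test <- chisq.test(tab)
print(test$p.value)
```

A small p-value indicates that the attrition proportions differ across categories by more than chance alone would explain.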
# Calculate the correlation matrix
cor_matrix <- cor(numeric_data)
# You can also plot the correlation matrix for better visualization
library(corrplot)
corrplot(cor_matrix, method = "color", tl.col = "black")
Before saving the cleaned data, we remove the grouped variables grp.age and grp.dist and the binary outcome Attrition_num, which were created only for the graphical analysis. We also remove YearsAtCompany and TotalWorkingYears, as they are highly correlated with other feature variables (correlation coefficient > 0.75). The remaining feature variables are not correlated strongly enough to warrant removal, so we keep them for further analysis.
# Using subset() function to drop specific columns
modified_data <- subset(EmployeeAttrition_drop, select = -c(grp.age, grp.dist, Attrition_num, YearsAtCompany, TotalWorkingYears))
write.csv(modified_data, file = "~/Desktop/cleanedattrition2.csv", row.names = FALSE)
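The 0.75 correlation threshold used above can be applied programmatically instead of reading values off the plot. The following is a minimal sketch on a small toy data frame; the numbers are made up and only the column names are borrowed from the data set, and `high_pairs` is a hypothetical helper object.

```r
# Toy numeric data (hypothetical values)
toy <- data.frame(
  YearsAtCompany       = 1:10,
  YearsWithCurrManager = c(1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.5, 7.5, 9.5, 9.5),
  PercentSalaryHike    = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

cor_matrix <- cor(toy)

# Blank out the diagonal and lower triangle so each pair appears once
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA

# Positions of the pairs whose absolute correlation exceeds the threshold
high <- which(abs(cor_matrix) > 0.75, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r    = cor_matrix[high],
  stringsAsFactors = FALSE
)
print(high_pairs)
```

Which variable of a highly correlated pair to drop remains a judgment call that the code does not make.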
Employee attrition is a critical challenge faced by many organizations, especially in the fast-paced tech industry. High turnover rates can impact productivity, morale, and overall company performance. In this analysis, we will use logistic regression to predict employee attrition in a tech company based on various employee characteristics, job-related factors and also to assess the association between the binary response variable and other predictor variables.
In this study, we use the publicly available Employee Attrition data set. The practical question for this predictive modeling assignment is to determine the factors that impact employee attrition. We want to understand which variables are significant predictors of attrition and build a logistic regression model to predict whether an employee is likely to leave the company (attrition = 1) or not (attrition = 0). A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv. This data set has been pre-processed and feature engineered.
The response variable: Attrition - whether an employee left the company (attrition = 1) or not (attrition = 0). The remaining variables are predictor variables.
There are 26 variables (columns) in the data set:
Age: Employee age
Attrition: Whether the employee left the job
BusinessTravel: The frequency of job travel
Department: Employee work department
DistanceFromHome: Distance traveled to work from home
Education: Employee education level (1 = Below College, 2 = College, 3 = Bachelor, 4 = Master, 5 = Doctor)
EducationField: Employee education field
EnvironmentSatisfaction: Numerical value for environment satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
Gender: Employee gender
JobInvolvement: Numerical value for job involvement (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
JobLevel: Numerical value for job level
JobRole: Employee job position
JobSatisfaction: Numerical value for job satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
MaritalStatus: Employee marital status
MonthlyIncome: The amount of money that the employee earns in one month, before taxes or deductions
NumCompaniesWorked: Number of companies worked at
OverTime: Whether the employee works overtime (Yes/No)
PercentSalaryHike: Percent increase in salary
PerformanceRating: Numerical value for performance rating (1 = Low, 2 = Good, 3 = Excellent, 4 = Outstanding)
RelationshipSatisfaction: Numerical value for relationship satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
StockOptionLevel: Numerical value for stock options
TrainingTimesLastYear: Hours employee spent on training last year
WorkLifeBalance: Numerical value for work life balance (1 = Bad, 2 = Good, 3 = Better, 4 = Best)
YearsInCurrentRole: Number of years the employee has worked in their current job role
YearsSinceLastPromotion: Number of years since last promotion
YearsWithCurrManager: Number of years the employee has worked with their current manager
We next read the data from the given URL directly into R. Since there are no records with missing values, no records need to be dropped.
EmployeeAttritionLog = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv")
employz = na.omit(EmployeeAttritionLog)
Applying binary coding.
# Convert our response variable from 'Yes' and 'No' to 1 and 0
employz$Attrition <- ifelse(employz$Attrition == "Yes", 1, 0)
We first build a logistic regression model that contains all predictor variables in the data set. This model is usually called the full model. Note that the response variable is the attrition status (1 = yes, 0 = no).
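For reference, the logistic regression model described here can be written as follows. This is the standard textbook form; the symbols \(p_i\), \(\beta_j\), and \(x_{ij}\) are notation introduced for illustration and do not appear in the original analysis. \(p_i\) is the probability that employee \(i\) leaves, and \(x_{i1}, \dots, x_{ik}\) are that employee's predictor values, with categorical predictors entering as dummy variables.

```latex
\[
\log\!\left(\frac{p_i}{1-p_i}\right)
  = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik},
\qquad
p_i = P(\text{Attrition}_i = 1)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})}}
\]
```

Each coefficient \(\beta_j\) is therefore a change in the log-odds of attrition for a one-unit increase in \(x_{ij}\), holding the other predictors fixed.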
initial.model = glm(Attrition ~ BusinessTravel + Department + Education + EducationField + EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager + Age + DistanceFromHome, family = binomial, data = employz)
library(knitr) # provides kable()
coefficient.table = summary(initial.model)$coef
kable(coefficient.table, caption = "Significance tests of logistic regression model")
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -10.1487504 | 395.7492662 | -0.0256444 | 0.9795410 |
| BusinessTravelTravel_Frequently | 1.8670294 | 0.4071329 | 4.5857985 | 0.0000045 |
| BusinessTravelTravel_Rarely | 0.9693814 | 0.3754112 | 2.5821858 | 0.0098177 |
| DepartmentResearch & Development | 12.6266608 | 395.7469457 | 0.0319059 | 0.9745471 |
| DepartmentSales | 12.4162910 | 395.7471849 | 0.0313743 | 0.9749710 |
| Education | 0.0019795 | 0.0872999 | 0.0226745 | 0.9819099 |
| EducationFieldLife Sciences | -0.7606727 | 0.7984502 | -0.9526864 | 0.3407490 |
| EducationFieldMarketing | -0.3297477 | 0.8468630 | -0.3893755 | 0.6969984 |
| EducationFieldMedical | -0.8515225 | 0.7981229 | -1.0669064 | 0.2860141 |
| EducationFieldOther | -0.8770665 | 0.8562571 | -1.0243028 | 0.3056923 |
| EducationFieldTechnical Degree | 0.1079078 | 0.8159464 | 0.1322486 | 0.8947877 |
| EnvironmentSatisfaction | -0.4199648 | 0.0821263 | -5.1136430 | 0.0000003 |
| GenderMale | 0.3950340 | 0.1830513 | 2.1580505 | 0.0309239 |
| JobInvolvement | -0.5328265 | 0.1205934 | -4.4183737 | 0.0000099 |
| JobLevel | -0.1389967 | 0.2977130 | -0.4668814 | 0.6405847 |
| JobRoleHuman Resources | 14.0089374 | 395.7471564 | 0.0353987 | 0.9717618 |
| JobRoleLaboratory Technician | 1.5317617 | 0.4816347 | 3.1803394 | 0.0014710 |
| JobRoleManager | 0.5978435 | 0.8727760 | 0.6849907 | 0.4933498 |
| JobRoleManufacturing Director | 0.2597355 | 0.5314577 | 0.4887228 | 0.6250380 |
| JobRoleResearch Director | -0.8767143 | 0.9659989 | -0.9075728 | 0.3641040 |
| JobRoleResearch Scientist | 0.6231180 | 0.4896045 | 1.2726965 | 0.2031258 |
| JobRoleSales Executive | 1.3223467 | 1.1132697 | 1.1878044 | 0.2349105 |
| JobRoleSales Representative | 2.2290891 | 1.1700504 | 1.9051223 | 0.0567642 |
| JobSatisfaction | -0.4031915 | 0.0803294 | -5.0192247 | 0.0000005 |
| MaritalStatusMarried | 0.3027285 | 0.2642453 | 1.1456342 | 0.2519466 |
| MaritalStatusSingle | 1.1604409 | 0.3416766 | 3.3963130 | 0.0006830 |
| MonthlyIncome | -0.0000032 | 0.0000798 | -0.0400431 | 0.9680587 |
| NumCompaniesWorked | 0.1630538 | 0.0370723 | 4.3982643 | 0.0000109 |
| OverTimeYes | 1.9343061 | 0.1909657 | 10.1290744 | 0.0000000 |
| PercentSalaryHike | -0.0210031 | 0.0387550 | -0.5419455 | 0.5878561 |
| PerformanceRating | 0.0679746 | 0.3946248 | 0.1722513 | 0.8632400 |
| RelationshipSatisfaction | -0.2456994 | 0.0819679 | -2.9975091 | 0.0027220 |
| StockOptionLevel | -0.1975184 | 0.1557253 | -1.2683770 | 0.2046633 |
| TrainingTimesLastYear | -0.1862392 | 0.0725335 | -2.5676316 | 0.0102396 |
| WorkLifeBalance | -0.3547433 | 0.1227487 | -2.8899968 | 0.0038525 |
| YearsInCurrentRole | -0.1186218 | 0.0419441 | -2.8280935 | 0.0046826 |
| YearsSinceLastPromotion | 0.2036475 | 0.0401511 | 5.0720267 | 0.0000004 |
| YearsWithCurrManager | -0.0953257 | 0.0418145 | -2.2797268 | 0.0226239 |
| Age | -0.0422346 | 0.0122225 | -3.4554899 | 0.0005493 |
| DistanceFromHome | 0.0446842 | 0.0106711 | 4.1873907 | 0.0000282 |
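The p-values in the above significance test table show that some feature variables have p-values greater than 0.05. The very large standard errors (about 396) for the intercept, the Department levels, and JobRoleHuman Resources also suggest that Department and JobRole carry largely overlapping information. We next search for a better model by dropping some of the insignificant predictor variables. Since there are many different ways to drop variables, we use an automatic variable selection procedure to search for the final model.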
R has an automatic variable selection function step() for searching the final model. We will start from the initial model and drop insignificant variables using AIC as an inclusion/exclusion criterion.
In practice, some predictor variables may be practically important, and practitioners want to include them in the model regardless of their statistical significance. We can therefore fit the smallest model that includes only those practically important variables. The final model should lie between this smallest model, which we will call the reduced model, and the initial model, which we will call the full model. For illustration, we assume that YearsSinceLastPromotion, RelationshipSatisfaction, WorkLifeBalance, and NumCompaniesWorked are practically important, so we include these four variables in the final model regardless of their statistical significance.
In summary, we define two models: the full model and the reduced model. The final best model will be a model between the full and reduced models. The summary table of significance tests is given below.
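The idea of searching between a reduced and a full model can be illustrated on simulated data. This is a minimal sketch, not the analysis of this report: the data are simulated, the variable names `x1`, `x2`, `x3`, and `y` are hypothetical, and `x3` plays the role of a "practically important" variable that is forced to stay in the model even though it has no real effect.

```r
set.seed(123)
n <- 300

# Simulated predictors; only x1 truly drives the outcome
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x1))

# Full model: all predictors. Reduced model: only the forced variable x3
full.toy    <- glm(y ~ x1 + x2 + x3, family = binomial, data = toy)
reduced.toy <- glm(y ~ x3, family = binomial, data = toy)

# Backward elimination by AIC; the lower scope keeps x3 in every candidate
final.toy <- step(full.toy,
                  scope = list(lower = formula(reduced.toy),
                               upper = formula(full.toy)),
                  direction = "backward",
                  trace = 0)

# Terms retained in the selected model
kept <- attr(terms(final.toy), "term.labels")
print(kept)
```

Because backward elimination only removes a term when doing so lowers the AIC, the selected model never has a higher AIC than the full model.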
full.model = initial.model # the *biggest model* that includes all predictor variables
reduced.model = glm(Attrition ~ YearsSinceLastPromotion + RelationshipSatisfaction + WorkLifeBalance + NumCompaniesWorked , family = binomial, data = employz)
final.model = step(full.model,
scope=list(lower=formula(reduced.model),upper=formula(full.model)),
data = employz,
direction = "backward",
trace = 0) # trace = 0: suppress the detailed selection process
final.model.coef = summary(final.model)$coef
kable(final.model.coef , caption = "Summary table of significant tests")
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 1.7644674 | 1.1479245 | 1.5370936 | 0.1242704 |
| BusinessTravelTravel_Frequently | 1.8817648 | 0.4041877 | 4.6556709 | 0.0000032 |
| BusinessTravelTravel_Rarely | 0.9782062 | 0.3731283 | 2.6216350 | 0.0087509 |
| EducationFieldLife Sciences | -0.5979343 | 0.7398454 | -0.8081882 | 0.4189823 |
| EducationFieldMarketing | -0.1826602 | 0.7887458 | -0.2315831 | 0.8168618 |
| EducationFieldMedical | -0.7024340 | 0.7397245 | -0.9495886 | 0.3423213 |
| EducationFieldOther | -0.6971558 | 0.8066688 | -0.8642405 | 0.3874558 |
| EducationFieldTechnical Degree | 0.2724424 | 0.7627622 | 0.3571787 | 0.7209580 |
| EnvironmentSatisfaction | -0.4207754 | 0.0818043 | -5.1436854 | 0.0000003 |
| GenderMale | 0.3792481 | 0.1823123 | 2.0802117 | 0.0375061 |
| JobInvolvement | -0.5327656 | 0.1197715 | -4.4481841 | 0.0000087 |
| JobRoleHuman Resources | 1.6361497 | 0.6420829 | 2.5481906 | 0.0108283 |
| JobRoleLaboratory Technician | 1.6782868 | 0.4279142 | 3.9220171 | 0.0000878 |
| JobRoleManager | 0.1509916 | 0.6348814 | 0.2378265 | 0.8120157 |
| JobRoleManufacturing Director | 0.2443439 | 0.5288542 | 0.4620251 | 0.6440633 |
| JobRoleResearch Director | -1.0222376 | 0.8690644 | -1.1762507 | 0.2394947 |
| JobRoleResearch Scientist | 0.7856609 | 0.4318438 | 1.8193172 | 0.0688631 |
| JobRoleSales Executive | 1.1258424 | 0.4413488 | 2.5509130 | 0.0107441 |
| JobRoleSales Representative | 2.1559119 | 0.4959605 | 4.3469424 | 0.0000138 |
| JobSatisfaction | -0.4088564 | 0.0798811 | -5.1183104 | 0.0000003 |
| MaritalStatusMarried | 0.3945195 | 0.2543611 | 1.5510215 | 0.1208965 |
| MaritalStatusSingle | 1.4321239 | 0.2609096 | 5.4889662 | 0.0000000 |
| NumCompaniesWorked | 0.1627058 | 0.0368348 | 4.4171807 | 0.0000100 |
| OverTimeYes | 1.9293093 | 0.1901496 | 10.1462735 | 0.0000000 |
| RelationshipSatisfaction | -0.2308355 | 0.0810321 | -2.8486935 | 0.0043899 |
| TrainingTimesLastYear | -0.1844669 | 0.0724362 | -2.5466125 | 0.0108774 |
| WorkLifeBalance | -0.3637707 | 0.1227142 | -2.9643736 | 0.0030330 |
| YearsInCurrentRole | -0.1192218 | 0.0415277 | -2.8708974 | 0.0040931 |
| YearsSinceLastPromotion | 0.2032144 | 0.0391851 | 5.1860088 | 0.0000002 |
| YearsWithCurrManager | -0.1001219 | 0.0413984 | -2.4184959 | 0.0155848 |
| Age | -0.0459829 | 0.0116388 | -3.9508431 | 0.0000779 |
| DistanceFromHome | 0.0431754 | 0.0105214 | 4.1035687 | 0.0000407 |
The summary table contains the four practically important variables YearsSinceLastPromotion, RelationshipSatisfaction, WorkLifeBalance, and NumCompaniesWorked. All four are statistically significant: YearsSinceLastPromotion (p-value \(\approx\) 0), RelationshipSatisfaction (p-value \(\approx\) 0.0044), WorkLifeBalance (p-value \(\approx\) 0.003), and NumCompaniesWorked (p-value \(\approx\) 0.00001). YearsSinceLastPromotion and NumCompaniesWorked are positively associated with the response variable, whereas RelationshipSatisfaction and WorkLifeBalance are negatively associated with it.
Here’s a brief interpretation of the significance tests:
BusinessTravel: Travel_Frequently has a positive coefficient (1.88) and is significant (p < 0.001), indicating that employees who travel frequently are more likely to leave than employees who do not travel (the baseline category).
Travel_Rarely also has a positive coefficient (0.98) and is significant (p = 0.009), suggesting that employees who travel rarely are also more likely to leave than non-traveling employees.
EducationField: None of the specific education fields are significant predictors of attrition but most show negative association.
EnvironmentSatisfaction: Negative coefficient (-0.42) and highly significant (p < 0.001) suggest that lower environment satisfaction is associated with higher attrition.
Gender: Male employees have a positive coefficient (0.38) and are marginally significant (p = 0.038), indicating that male employees may be slightly more likely to have attrition.
JobInvolvement: Negative coefficient (-0.53) and highly significant (p < 0.001) suggest that lower job involvement is associated with higher attrition.
JobRole: Several job roles have significant associations with attrition. For example, Laboratory Technicians, Sales Representatives, Sales Executives, and Human Resources have positive coefficients and are significant predictors of attrition relative to the baseline role, Healthcare Representative.
JobSatisfaction: Negative coefficient (-0.41) and highly significant (p < 0.001) suggest that lower job satisfaction is associated with higher attrition.
MaritalStatus: Single employees have a positive coefficient (1.43) and are highly significant (p < 0.001), indicating that single employees are more likely to have attrition.
NumCompaniesWorked: Positive coefficient (0.16) and highly significant (p < 0.001) suggest that having worked at more companies is associated with higher attrition.
OverTime: Employees who work overtime (OverTimeYes) have a positive coefficient (1.93) and are highly significant (p < 0.001), indicating that they are more likely to have attrition.
TrainingTimesLastYear: Negative coefficient (-0.18) and significant (p = 0.011) indicate that fewer training times last year are associated with higher attrition.
YearsInCurrentRole: Negative coefficient (-0.12) and significant (p = 0.004) suggest that fewer years in the current role are associated with higher attrition.
YearsSinceLastPromotion: Positive coefficient (0.20) and highly significant (p < 0.001) suggest that more years since the last promotion are associated with higher attrition.
YearsWithCurrManager: Negative coefficient (-0.10) and significant (p = 0.016) suggest that fewer years with the current manager are associated with higher attrition.
Age: Negative coefficient (-0.05) and highly significant (p < 0.001) suggest that older age is associated with lower attrition.
DistanceFromHome: Positive coefficient (0.04) and significant (p < 0.001) suggest that greater distance from home is associated with higher attrition.
These results provide insights into how different feature variables are associated with employee attrition. Variables such as BusinessTravel, EnvironmentSatisfaction, JobInvolvement, JobRole, JobSatisfaction, MaritalStatus, NumCompaniesWorked, OverTime, RelationshipSatisfaction, TrainingTimesLastYear, WorkLifeBalance, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager, Age, and DistanceFromHome appear to have significant associations with attrition.
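Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to communicate. The sketch below copies three estimates from the final model's summary table into a named vector (the vector names `coefs` and `odds_ratios` are hypothetical); in a live session the same transformation could be applied to the fitted model's coefficients via `exp(coef(final.model))`.

```r
# Estimates copied from the final model's summary table
coefs <- c(OverTimeYes      = 1.9293093,
           Age              = -0.0459829,
           DistanceFromHome = 0.0431754)

# Odds ratio = exp(coefficient); > 1 raises the odds, < 1 lowers them
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

For example, holding the other predictors fixed, the odds of attrition for employees who work overtime are roughly 6.9 times the odds for those who do not, while each additional year of age multiplies the odds by a factor slightly below 1.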
As an illustration, we use the final model to predict the attrition status of two hypothetical employees based on new values of the predictor variables. See the feature values given in the code chunk.
mynewdata = data.frame(BusinessTravel=c('Travel_Rarely', 'Travel_Frequently'),
EnvironmentSatisfaction = c(3, 2),
Gender = c('Male','Female'),
JobInvolvement = c(1, 3),
JobRole = c('Healthcare Representative', 'Manager'),
JobSatisfaction = c(3,4),
MaritalStatus = c('Single', 'Married'),
NumCompaniesWorked = c(3, 4),
OverTime = c('Yes', 'No'),
RelationshipSatisfaction = c(1, 3),
StockOptionLevel = c(0, 1),
TrainingTimesLastYear = c(2, 1),
WorkLifeBalance = c(2, 2),
YearsInCurrentRole = c(3, 8),
YearsSinceLastPromotion = c(1, 3),
YearsWithCurrManager = c(2, 2),
EducationField = c('Human Resources', 'Technical Degree'),
Age = c(25, 35),
DistanceFromHome = c(14, 5))
pred.success.prob = predict(final.model, newdata = mynewdata, type="response")
##
## threshold probability
cut.off.prob = 0.5
pred.response = ifelse(pred.success.prob > cut.off.prob, 1, 0) # This predicts the response
## Add the new predicted response to Mynewdata
mynewdata$Pred.Response = pred.response
##
kable(mynewdata, caption = "Predicted Value of response variable
with the given cut-off probability")
| BusinessTravel | EnvironmentSatisfaction | Gender | JobInvolvement | JobRole | JobSatisfaction | MaritalStatus | NumCompaniesWorked | OverTime | RelationshipSatisfaction | StockOptionLevel | TrainingTimesLastYear | WorkLifeBalance | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | EducationField | Age | DistanceFromHome | Pred.Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Travel_Rarely | 3 | Male | 1 | Healthcare Representative | 3 | Single | 3 | Yes | 1 | 0 | 2 | 2 | 3 | 1 | 2 | Human Resources | 25 | 14 | 1 |
| Travel_Frequently | 2 | Female | 3 | Manager | 4 | Married | 4 | No | 3 | 1 | 1 | 2 | 8 | 3 | 2 | Technical Degree | 35 | 5 | 0 |
The predicted attrition status of the two employees is attached to the two new data records.
The Pred.Response column indicates the predicted outcome of employee attrition based on the input values for the predictor variables and the logistic regression model. A value of 1 indicates that the model predicts attrition (Yes), and a value of 0 indicates that the model predicts no attrition (No).
These predictions are based on the relationships and coefficients learned by the logistic regression model from the training data. It’s important to note that these predictions are based on the specific characteristics we have provided for each individual, and they may change if the input values are altered.
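The predicted class also depends on the chosen cut-off probability. The following minimal sketch uses made-up predicted probabilities (not output from the fitted model) and a hypothetical helper function `classify` to show how lowering the cut-off from 0.5 to 0.3 flags more employees as likely to leave.

```r
# Hypothetical predicted probabilities of attrition for four employees
pred_prob <- c(0.82, 0.47, 0.31, 0.55)

# Same rule as in the analysis: predict 1 when the probability exceeds the cut-off
classify <- function(p, cutoff) ifelse(p > cutoff, 1, 0)

at_50 <- classify(pred_prob, 0.5)
at_30 <- classify(pred_prob, 0.3)

print(data.frame(pred_prob, at_50, at_30))
```

A lower cut-off catches more potential leavers at the cost of more false alarms, so the choice should reflect the relative cost of the two kinds of error.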