1 Introduction

In People Analytics, the attrition data set created by IBM is a valuable resource for studying employee turnover and engagement. With 1470 rows and 35 columns, it covers many aspects of employees’ professional and personal lives. The data set combines typical Human Resources Information System (HRIS) data with an engagement survey, giving a broad view of employees’ experiences and sentiments.

The main objective of this analysis is to understand and predict employee turnover within an organization. By examining the factors that contribute to attrition, we aim to uncover valuable insights that can guide HR strategies and improve employee retention rates. Moreover, this data set offers a unique opportunity to identify differences between the group of employees who chose to stay and those who decided to leave the organization.

2 Description of Data

The data set was created and made available by IBM, a well-known company in the tech industry. Because it comes from a reputable source, the data is expected to be reliable and well-structured, enabling meaningful analysis and inference. The specific data collection methods are not documented, but the breadth of the data set suggests that it was curated to capture a wide range of attributes relevant to employee turnover and engagement.

Sample Size and Feature Variables: The data set consists of 1470 rows, each representing an individual employee record, and 35 columns. One column, Attrition, is the response; the remaining columns are candidate feature variables. They include demographic information such as age and gender, job satisfaction and environment satisfaction, education field, job role, income, overtime, percentage salary hike, tenure, training time, years in the current role, relationship status, and several other parameters that may affect attrition and engagement.

The feature variables encompass various data types, including numeric, categorical, and ordinal data, which allows for a diverse set of analyses and modeling approaches.

In conclusion, this data set offers a comprehensive collection of features that are crucial for understanding and predicting employee attrition within an organization. By delving into the relationships between different variables, we can gain valuable insights that have practical implications for HR policies and practices.

A detailed description of the variables is given below:

Education 1 ‘Below College’ 2 ‘College’ 3 ‘Bachelor’ 4 ‘Master’ 5 ‘Doctor’

EnvironmentSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

JobInvolvement 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

JobSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

PerformanceRating 1 ‘Low’ 2 ‘Good’ 3 ‘Excellent’ 4 ‘Outstanding’

RelationshipSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

WorkLifeBalance 1 ‘Bad’ 2 ‘Good’ 3 ‘Better’ 4 ‘Best’

A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv.

EmployeeAttrition = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv")

Applying binary coding.

# Convert our response variable from 'Yes' and 'No' to 1 and 0
EmployeeAttrition$Attrition_num <- ifelse(EmployeeAttrition$Attrition == "Yes", 1, 0)

The EmployeeNumber column will be removed to protect the identity of employees and their sensitive information.

Additional columns have been identified for removal:

DailyRate, HourlyRate, and MonthlyRate are not documented, so they cannot be interpreted. EmployeeCount, Over18, and StandardHours take the same value for all employees and therefore carry no information.

# Using subset() function to drop specific columns
EmployeeAttrition_drop <- subset(EmployeeAttrition, select = -c(EmployeeNumber, EmployeeCount, Over18, StandardHours, DailyRate, HourlyRate, MonthlyRate))
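
The claim that EmployeeCount, Over18, and StandardHours are uniform can be checked programmatically rather than by inspection. The following is a minimal sketch on a small made-up data frame (the `toy` object and its values are assumptions for illustration, not the real data); the same test could be applied to EmployeeAttrition.

```r
# Toy data frame standing in for EmployeeAttrition (values are made up)
toy <- data.frame(
  Age           = c(41, 49, 37, 33),
  EmployeeCount = c(1, 1, 1, 1),
  Over18        = c("Y", "Y", "Y", "Y"),
  StandardHours = c(80, 80, 80, 80),
  OverTime      = c("Yes", "No", "Yes", "Yes")
)

# A column is uninformative if it takes a single distinct value
is_constant   <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))
constant_cols <- names(toy)[is_constant]
constant_cols
```

Columns flagged this way have zero variance and cannot help separate the attrition groups, so dropping them loses nothing.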

3 EDA for Feature Engineering

We first scan the entire data set and determine the EDA tools to use for feature engineering.

summary(EmployeeAttrition_drop)

##       Age         Attrition         BusinessTravel      Department       
##  Min.   :18.00   Length:1470        Length:1470        Length:1470       
##  1st Qu.:30.00   Class :character   Class :character   Class :character  
##  Median :36.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :36.92                                                           
##  3rd Qu.:43.00                                                           
##  Max.   :60.00                                                           
##  DistanceFromHome   Education     EducationField     EnvironmentSatisfaction
##  Min.   : 1.000   Min.   :1.000   Length:1470        Min.   :1.000          
##  1st Qu.: 2.000   1st Qu.:2.000   Class :character   1st Qu.:2.000          
##  Median : 7.000   Median :3.000   Mode  :character   Median :3.000          
##  Mean   : 9.193   Mean   :2.913                      Mean   :2.722          
##  3rd Qu.:14.000   3rd Qu.:4.000                      3rd Qu.:4.000          
##  Max.   :29.000   Max.   :5.000                      Max.   :4.000          
##     Gender          JobInvolvement    JobLevel       JobRole         
##  Length:1470        Min.   :1.00   Min.   :1.000   Length:1470       
##  Class :character   1st Qu.:2.00   1st Qu.:1.000   Class :character  
##  Mode  :character   Median :3.00   Median :2.000   Mode  :character  
##                     Mean   :2.73   Mean   :2.064                     
##                     3rd Qu.:3.00   3rd Qu.:3.000                     
##                     Max.   :4.00   Max.   :5.000                     
##  JobSatisfaction MaritalStatus      MonthlyIncome   NumCompaniesWorked
##  Min.   :1.000   Length:1470        Min.   : 1009   Min.   :0.000     
##  1st Qu.:2.000   Class :character   1st Qu.: 2911   1st Qu.:1.000     
##  Median :3.000   Mode  :character   Median : 4919   Median :2.000     
##  Mean   :2.729                      Mean   : 6503   Mean   :2.693     
##  3rd Qu.:4.000                      3rd Qu.: 8379   3rd Qu.:4.000     
##  Max.   :4.000                      Max.   :19999   Max.   :9.000     
##    OverTime         PercentSalaryHike PerformanceRating
##  Length:1470        Min.   :11.00     Min.   :3.000    
##  Class :character   1st Qu.:12.00     1st Qu.:3.000    
##  Mode  :character   Median :14.00     Median :3.000    
##                     Mean   :15.21     Mean   :3.154    
##                     3rd Qu.:18.00     3rd Qu.:3.000    
##                     Max.   :25.00     Max.   :4.000    
##  RelationshipSatisfaction StockOptionLevel TotalWorkingYears
##  Min.   :1.000            Min.   :0.0000   Min.   : 0.00    
##  1st Qu.:2.000            1st Qu.:0.0000   1st Qu.: 6.00    
##  Median :3.000            Median :1.0000   Median :10.00    
##  Mean   :2.712            Mean   :0.7939   Mean   :11.28    
##  3rd Qu.:4.000            3rd Qu.:1.0000   3rd Qu.:15.00    
##  Max.   :4.000            Max.   :3.0000   Max.   :40.00    
##  TrainingTimesLastYear WorkLifeBalance YearsAtCompany   YearsInCurrentRole
##  Min.   :0.000         Min.   :1.000   Min.   : 0.000   Min.   : 0.000    
##  1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000    
##  Median :3.000         Median :3.000   Median : 5.000   Median : 3.000    
##  Mean   :2.799         Mean   :2.761   Mean   : 7.008   Mean   : 4.229    
##  3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000   3rd Qu.: 7.000    
##  Max.   :6.000         Max.   :4.000   Max.   :40.000   Max.   :18.000    
##  YearsSinceLastPromotion YearsWithCurrManager Attrition_num   
##  Min.   : 0.000          Min.   : 0.000       Min.   :0.0000  
##  1st Qu.: 0.000          1st Qu.: 2.000       1st Qu.:0.0000  
##  Median : 1.000          Median : 3.000       Median :0.0000  
##  Mean   : 2.188          Mean   : 4.123       Mean   :0.1612  
##  3rd Qu.: 3.000          3rd Qu.: 7.000       3rd Qu.:0.0000  
##  Max.   :15.000          Max.   :17.000       Max.   :1.0000

There appear to be no obvious outliers; all numerical variables fall within reasonable ranges.

The average age is about 37 years, with a maximum of 60 and a minimum of 18. Such a wide age range could cover everyone from new interns to senior managers. Some employees live very close to work and others live far away. Salaries also vary widely: the average monthly income is about USD 6,500, with a range from roughly USD 1,000 to USD 20,000. This is plausible, since new interns are typically paid less than senior managers. The average tenure at the company is about 7 years, ranging from employees who have just started to some who have worked there for 40 years.

# checking the unique character variables.
unique(EmployeeAttrition_drop$Gender)

## [1] "Female" "Male"

unique(EmployeeAttrition_drop$BusinessTravel)

## [1] "Travel_Rarely"     "Travel_Frequently" "Non-Travel"

unique(EmployeeAttrition_drop$Department)

## [1] "Sales"                  "Research & Development" "Human Resources"

unique(EmployeeAttrition_drop$JobRole)

## [1] "Sales Executive"           "Research Scientist"       
## [3] "Laboratory Technician"     "Manufacturing Director"   
## [5] "Healthcare Representative" "Manager"                  
## [7] "Sales Representative"      "Research Director"        
## [9] "Human Resources"

unique(EmployeeAttrition_drop$MaritalStatus)

## [1] "Single"   "Married"  "Divorced"

unique(EmployeeAttrition_drop$OverTime)

## [1] "Yes" "No"

unique(EmployeeAttrition_drop$Attrition)

## [1] "Yes" "No"

unique(EmployeeAttrition_drop$EducationField)

## [1] "Life Sciences"    "Other"            "Medical"          "Marketing"       
## [5] "Technical Degree" "Human Resources"

All of the categorical variables have consistent levels, with no misspelled or duplicated categories.

3.1 Missing values

Since the data set appears to have no missing values, it simplifies the Exploratory Data Analysis (EDA) process. With a complete data set, we can focus on exploring relationships, identifying patterns, and gaining insights more effectively.
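
The absence of missing values can be confirmed with a per-column count. The sketch below uses a small made-up data frame (`toy`, with two NA values planted deliberately) to show what the check reports; on EmployeeAttrition_drop the same calls would be expected to return zeros if the data are complete.

```r
# Toy data with two planted missing values (for illustration only)
toy <- data.frame(
  Age           = c(41, NA, 37),
  Gender        = c("Female", "Male", NA),
  MonthlyIncome = c(5993, 5130, 2090)
)

na_per_column <- colSums(is.na(toy))  # number of NA values in each column
total_na      <- sum(na_per_column)   # overall count
any_missing   <- anyNA(toy)           # single TRUE/FALSE answer

na_per_column
```

Note that is.na() does not flag empty strings in character columns, so those would need a separate check such as `colSums(toy == "", na.rm = TRUE)`.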

The data cleaning process is complete.

numeric_vars <- sapply(EmployeeAttrition_drop, is.numeric)
numeric_data <- EmployeeAttrition_drop[, numeric_vars]

3.2 Assess Distributions

This subsection focuses on the potential discretization of continuous variables and the grouping of sparse categories of categorical variables, based on their distributions.

3.2.1 Discretizing Continuous Variable

We group Age into three subgroups covering the observed range of 18 to 60, and DistanceFromHome into three subgroups. For the plots that follow, the grouped variables grp.age and grp.dist stand in for Age and DistanceFromHome, and the binary variable Attrition_num stands in for Attrition.

# Age <= 30 -> [18, 30]; Age >= 50 -> [50, 60]; otherwise (30, 50)
EmployeeAttrition_drop$grp.age <- ifelse(EmployeeAttrition_drop$Age <= 30, '[18, 30]',
               ifelse(EmployeeAttrition_drop$Age >= 50, '[50, 60]', '(30, 50)'))
# DistanceFromHome <= 10 -> [1, 10]; >= 20 -> [20, 30]; otherwise (10, 20)
EmployeeAttrition_drop$grp.dist <- ifelse(EmployeeAttrition_drop$DistanceFromHome <= 10, '[1, 10]',
               ifelse(EmployeeAttrition_drop$DistanceFromHome >= 20, '[20, 30]', '(10, 20)'))
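
As an alternative to nested ifelse() calls, base R's cut() can produce the same three age groups as an ordered set of factor levels. This is a sketch on a hypothetical vector of integer ages; the break points and the labels "18-30", "31-49", and "50-60" are assumptions chosen to match the grouping rule above.

```r
# Hypothetical integer ages at the group boundaries
age <- c(18, 30, 31, 49, 50, 60)

# Intervals are (17, 30], (30, 49], (49, 60]; for integer ages this matches
# Age <= 30, 31 to 49, and Age >= 50
grp_age <- cut(age,
               breaks = c(17, 30, 49, 60),
               labels = c("18-30", "31-49", "50-60"))

table(grp_age)
```

Because cut() returns a factor whose levels follow the break order, plots and tables list the groups from youngest to oldest rather than alphabetically.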

3.3 Pairwise associations

Pairwise association refers to the examination of relationships between pairs of variables in a data set. It involves analyzing how the values of two variables co-occur or change together. There are three different types of pairwise associations.

3.3.1 Two numeric variables

The best visual tool for assessing pairwise linear association between two numeric variables is a pair-wise scatter plot.

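library(GGally)  # provides ggpairs()

# Selecting specific columns by position: Age, DistanceFromHome, MonthlyIncome,
# NumCompaniesWorked, PercentSalaryHike, TotalWorkingYears
selected_columns <- c(1, 5, 15, 16, 18, 22)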

# Creating ggpairs plot with selected columns
ggpairs(EmployeeAttrition_drop,
        columns = selected_columns,
        aes(color = Attrition, alpha = 0.5))

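# Selecting specific columns by position: TrainingTimesLastYear, YearsAtCompany,
# YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager
selected_columnss <- c(23, 25, 26, 27, 28)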

# Creating ggpairs plot with selected columns
ggpairs(EmployeeAttrition_drop,
        columns = selected_columnss,
        aes(color = Attrition, alpha = 0.5))

The off-diagonal panels show the scatter plots and correlation coefficients for each pair of numeric variables. As expected, the tenure-related variables are substantially correlated: YearsWithCurrManager with YearsAtCompany, YearsWithCurrManager with YearsInCurrentRole, TotalWorkingYears with YearsAtCompany, YearsInCurrentRole with YearsAtCompany, and YearsSinceLastPromotion with YearsAtCompany. The other pairs are only weakly correlated.

The stacked density curves on the main diagonal show how the distribution of each numeric variable differs between the attrition and no-attrition groups, that is, the relationship between a numeric variable and the categorical response. The curves do not overlap completely, indicating some association between each of these numeric variables and the binary response variable.

3.3.2 Two Categorical variables

Mosaic plots are convenient for showing whether two categorical variables are dependent. In EDA, we are primarily interested in whether the response (binary in this case) is independent of the categorical variables. Categorical variables that are independent of the response variable should be excluded from the subsequent models and algorithms.

par(mfrow = c(2,2))
mosaicplot(BusinessTravel ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="BusinessTravel vs Attrition")
mosaicplot(Department ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Department vs Attrition")
mosaicplot(EducationField ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="EducationField vs Attrition")
mosaicplot(JobRole ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="JobRole vs Attrition")

The top two mosaic plots show that Attrition is not independent of BusinessTravel and Department, because the proportion of attrition cases differs across categories. Employees who travel frequently have the highest attrition rate, whereas non-travel employees have the lowest. The bottom two mosaic plots likewise show that Attrition is not independent of EducationField and JobRole, because the proportion of attrition cases differs across categories.

par(mfrow = c(2,3))
mosaicplot(OverTime ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="OverTime vs Attrition")
mosaicplot(MaritalStatus ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="MaritalStatus vs Attrition")
mosaicplot(Gender ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Gender vs Attrition")
mosaicplot(grp.age ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Agegroup vs Attrition")
mosaicplot(grp.dist ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="DistanceFromHome vs Attrition")

The two leftmost mosaic plots in the top row show that Attrition is associated with OverTime and MaritalStatus: employees who work overtime and employees who are single have higher attrition rates. The bottom two mosaic plots show that Attrition is not independent of age group and distance from home, because the proportion of attrition cases differs across categories. The younger age group has a higher attrition rate than the older age groups, and the attrition rate increases with distance from the workplace. Lastly, Gender appears to have little influence.
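
The visual impression from a mosaic plot can be backed by a formal test of independence. The sketch below applies Pearson's chi-squared test to a small made-up contingency table (the counts in `toy` are assumptions for illustration, not taken from the data); with the real data, the table could be built from EmployeeAttrition_drop$OverTime and EmployeeAttrition_drop$Attrition.

```r
# Made-up data: 40 employees with overtime (20 left), 60 without (6 left)
toy <- data.frame(
  OverTime  = rep(c("Yes", "No"), times = c(40, 60)),
  Attrition = c(rep(c("Yes", "No"), times = c(20, 20)),
                rep(c("Yes", "No"), times = c(6, 54)))
)

tab  <- table(toy$OverTime, toy$Attrition)  # 2 x 2 contingency table
test <- chisq.test(tab)                     # Pearson chi-squared test

tab
test$p.value  # a small p-value is evidence against independence
```

A small p-value supports keeping the variable as a candidate predictor, while a large one suggests the response may be independent of it.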

# Calculate the correlation matrix
cor_matrix <- cor(numeric_data)

# You can also plot the correlation matrix for better visualization
library(corrplot)
corrplot(cor_matrix, method = "color", tl.col = "black")
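
Besides the plot, the correlation matrix can be scanned for pairs whose absolute correlation exceeds a chosen threshold. The following is a minimal sketch on simulated data; the `toy` data frame, the seed, and the helper names `threshold` and `high_pairs` are assumptions for illustration, and 0.75 is the cut-off used in the concluding remarks.

```r
set.seed(1)
n <- 200
years_at_company <- rpois(n, 7)

# Simulated stand-ins: one pair built to be correlated, one unrelated column
toy <- data.frame(
  YearsAtCompany     = years_at_company,
  YearsInCurrentRole = 0.6 * years_at_company + rnorm(n, sd = 0.5),
  DistanceFromHome   = runif(n, 1, 29)
)

cor_matrix <- cor(toy)
threshold  <- 0.75

# Keep only the upper triangle so each pair is reported once
idx <- which(abs(cor_matrix) > threshold & upper.tri(cor_matrix), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)
high_pairs
```

Applied to numeric_data, a listing like this makes explicit which variables are candidates for removal because of redundancy.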

4 Concluding Remarks

The grouped variables grp.age and grp.dist and the binary outcome Attrition_num were created only for the graphical analysis, so we remove them. We also remove YearsAtCompany and TotalWorkingYears because they are highly correlated with other feature variables (correlation coefficient > 0.75). The remaining feature variables have pairwise correlations that are not high enough to warrant removal, so we keep them for further analysis.

# Using subset() function to drop specific columns
modified_data <- subset(EmployeeAttrition_drop, select = -c(grp.age, grp.dist, Attrition_num, YearsAtCompany, TotalWorkingYears))
write.csv(modified_data, file = "~/Desktop/cleanedattrition2.csv", row.names = FALSE)

5 Introduction for Logistic Regression

Employee attrition is a critical challenge faced by many organizations, especially in the fast-paced tech industry. High turnover rates can impact productivity, morale, and overall company performance. In this analysis, we use logistic regression to predict employee attrition in a tech company from employee characteristics and job-related factors, and to assess the association between the binary response variable and the predictor variables.

5.1 Research Questions

Can we develop a logistic regression model that accurately predicts employee attrition?
Which factors have a significant impact on the likelihood of employee attrition?
How well does the model perform in terms of classification accuracy and predictive power?

6 Multiple Logistic Regression Model

In this study, we use a publicly available Employee Attrition data set. The practical question for this predictive modeling assignment is to determine the factors that impact employee attrition. We want to understand which variables are significant predictors of attrition and build a logistic regression model to predict whether an employee is likely to leave the company (attrition = 1) or not (attrition = 0). A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv. This data set has been pre-processed and feature engineered.

The response variable: Attrition, the status of whether an employee leaves the company (attrition = 1) or not (attrition = 0).

There are 26 variables (columns) and below are the variables contained in the data set:

Age: Employee age
Attrition: if the employee leaves the job
BusinessTravel: The frequency of job travels
Department: Employee work department
DistanceFromHome: Distance traveled to work from home
Education: Employee education level (1 = Below College, 2 = College, 3 = Bachelor, 4 = Master, 5 = Doctor)
EducationField: Employee education field
EnvironmentSatisfaction: Numerical value for environment satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
Gender: Employee gender
JobInvolvement: Numerical value for job involvement (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
JobLevel: Numerical value for job level
JobRole: Employee job position
JobSatisfaction: Numerical value for job satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
MaritalStatus: Employee marital status
MonthlyIncome: The amount of money that employee earns in one month, before taxes or deductions
NumCompaniesWorked: Number of companies worked at
OverTime: Whether the employee works overtime (Yes or No)
PercentSalaryHike: Percent increase in salary
PerformanceRating: Numerical value for performance rating (1 = Low, 2 = Good, 3 = Excellent, 4 = Outstanding)
RelationshipSatisfaction: Numerical value for relationship satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
StockOptionLevel: Numerical value for stock options
TrainingTimesLastYear: Number of times the employee attended training last year
WorkLifeBalance: Numerical value for work life balance (1 = Bad, 2 = Good, 3 = Better, 4 = Best)
YearsInCurrentRole: Number of years employee worked in their current job role
YearsSinceLastPromotion: Number of years since last promotion
YearsWithCurrManager: Number of years employee worked with current manager

We next read the data from the given URL directly into R. Since there are no records with missing values, na.omit() does not drop any records; it is kept only as a safeguard.

EmployeeAttritionLog = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv")
employz = na.omit(EmployeeAttritionLog)

Applying binary coding.

# Convert our response variable from 'Yes' and 'No' to 1 and 0
employz$Attrition <- ifelse(employz$Attrition == "Yes", 1, 0)

Build an Initial Model

We first build a logistic regression model that contains all predictor variables in the data set. This model is usually called the full model. Note that the response variable is the attrition status (1 = yes, 0 = no).

library(knitr)  # provides kable()

initial.model = glm(Attrition ~ BusinessTravel + Department + Education + EducationField +
                      EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + JobRole +
                      JobSatisfaction + MaritalStatus + MonthlyIncome + NumCompaniesWorked +
                      OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction +
                      StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance +
                      YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager +
                      Age + DistanceFromHome,
                    family = binomial, data = employz)
coefficient.table = summary(initial.model)$coef
kable(coefficient.table, caption = "Significance tests of logistic regression model")

Significance tests of logistic regression model
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-10.1487504	395.7492662	-0.0256444	0.9795410
BusinessTravelTravel_Frequently	1.8670294	0.4071329	4.5857985	0.0000045
BusinessTravelTravel_Rarely	0.9693814	0.3754112	2.5821858	0.0098177
DepartmentResearch & Development	12.6266608	395.7469457	0.0319059	0.9745471
DepartmentSales	12.4162910	395.7471849	0.0313743	0.9749710
Education	0.0019795	0.0872999	0.0226745	0.9819099
EducationFieldLife Sciences	-0.7606727	0.7984502	-0.9526864	0.3407490
EducationFieldMarketing	-0.3297477	0.8468630	-0.3893755	0.6969984
EducationFieldMedical	-0.8515225	0.7981229	-1.0669064	0.2860141
EducationFieldOther	-0.8770665	0.8562571	-1.0243028	0.3056923
EducationFieldTechnical Degree	0.1079078	0.8159464	0.1322486	0.8947877
EnvironmentSatisfaction	-0.4199648	0.0821263	-5.1136430	0.0000003
GenderMale	0.3950340	0.1830513	2.1580505	0.0309239
JobInvolvement	-0.5328265	0.1205934	-4.4183737	0.0000099
JobLevel	-0.1389967	0.2977130	-0.4668814	0.6405847
JobRoleHuman Resources	14.0089374	395.7471564	0.0353987	0.9717618
JobRoleLaboratory Technician	1.5317617	0.4816347	3.1803394	0.0014710
JobRoleManager	0.5978435	0.8727760	0.6849907	0.4933498
JobRoleManufacturing Director	0.2597355	0.5314577	0.4887228	0.6250380
JobRoleResearch Director	-0.8767143	0.9659989	-0.9075728	0.3641040
JobRoleResearch Scientist	0.6231180	0.4896045	1.2726965	0.2031258
JobRoleSales Executive	1.3223467	1.1132697	1.1878044	0.2349105
JobRoleSales Representative	2.2290891	1.1700504	1.9051223	0.0567642
JobSatisfaction	-0.4031915	0.0803294	-5.0192247	0.0000005
MaritalStatusMarried	0.3027285	0.2642453	1.1456342	0.2519466
MaritalStatusSingle	1.1604409	0.3416766	3.3963130	0.0006830
MonthlyIncome	-0.0000032	0.0000798	-0.0400431	0.9680587
NumCompaniesWorked	0.1630538	0.0370723	4.3982643	0.0000109
OverTimeYes	1.9343061	0.1909657	10.1290744	0.0000000
PercentSalaryHike	-0.0210031	0.0387550	-0.5419455	0.5878561
PerformanceRating	0.0679746	0.3946248	0.1722513	0.8632400
RelationshipSatisfaction	-0.2456994	0.0819679	-2.9975091	0.0027220
StockOptionLevel	-0.1975184	0.1557253	-1.2683770	0.2046633
TrainingTimesLastYear	-0.1862392	0.0725335	-2.5676316	0.0102396
WorkLifeBalance	-0.3547433	0.1227487	-2.8899968	0.0038525
YearsInCurrentRole	-0.1186218	0.0419441	-2.8280935	0.0046826
YearsSinceLastPromotion	0.2036475	0.0401511	5.0720267	0.0000004
YearsWithCurrManager	-0.0953257	0.0418145	-2.2797268	0.0226239
Age	-0.0422346	0.0122225	-3.4554899	0.0005493
DistanceFromHome	0.0446842	0.0106711	4.1873907	0.0000282

The significance test table above shows that some feature variables have p-values greater than 0.05. We next search for the best model by dropping some of the insignificant predictor variables. Since there are many different ways to drop variables, we use an automatic variable selection procedure to search for the final model.

Automatic Variable Selection

R has an automatic variable selection function step() for searching the final model. We will start from the initial model and drop insignificant variables using AIC as an inclusion/exclusion criterion.

In practice, some predictor variables may be practically important, and practitioners want to include them in the model regardless of their statistical significance. We can therefore fit the smallest model that includes only those practically important variables. The final model should lie between this smallest model, which we call the reduced model, and the initial model, which we call the full model. For illustration, we assume YearsSinceLastPromotion, RelationshipSatisfaction, WorkLifeBalance, and NumCompaniesWorked are practically important and include these four variables in the final model regardless of their statistical significance.

In summary, we define two models: the full model and the reduced model. The final best model will lie between the full and reduced models. The summary table of significance tests is given below.

full.model = initial.model  # the *biggest model* that includes all predictor variables
reduced.model = glm(Attrition ~ YearsSinceLastPromotion + RelationshipSatisfaction + WorkLifeBalance + NumCompaniesWorked , family = binomial, data = employz)
final.model =  step(full.model, 
                    scope=list(lower=formula(reduced.model),upper=formula(full.model)),
                    data = employz, 
                    direction = "backward",
                    trace = 0)   # trace = 0: suppress the detailed selection process
final.model.coef = summary(final.model)$coef
kable(final.model.coef , caption = "Summary table of significance tests")

Summary table of significance tests
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	1.7644674	1.1479245	1.5370936	0.1242704
BusinessTravelTravel_Frequently	1.8817648	0.4041877	4.6556709	0.0000032
BusinessTravelTravel_Rarely	0.9782062	0.3731283	2.6216350	0.0087509
EducationFieldLife Sciences	-0.5979343	0.7398454	-0.8081882	0.4189823
EducationFieldMarketing	-0.1826602	0.7887458	-0.2315831	0.8168618
EducationFieldMedical	-0.7024340	0.7397245	-0.9495886	0.3423213
EducationFieldOther	-0.6971558	0.8066688	-0.8642405	0.3874558
EducationFieldTechnical Degree	0.2724424	0.7627622	0.3571787	0.7209580
EnvironmentSatisfaction	-0.4207754	0.0818043	-5.1436854	0.0000003
GenderMale	0.3792481	0.1823123	2.0802117	0.0375061
JobInvolvement	-0.5327656	0.1197715	-4.4481841	0.0000087
JobRoleHuman Resources	1.6361497	0.6420829	2.5481906	0.0108283
JobRoleLaboratory Technician	1.6782868	0.4279142	3.9220171	0.0000878
JobRoleManager	0.1509916	0.6348814	0.2378265	0.8120157
JobRoleManufacturing Director	0.2443439	0.5288542	0.4620251	0.6440633
JobRoleResearch Director	-1.0222376	0.8690644	-1.1762507	0.2394947
JobRoleResearch Scientist	0.7856609	0.4318438	1.8193172	0.0688631
JobRoleSales Executive	1.1258424	0.4413488	2.5509130	0.0107441
JobRoleSales Representative	2.1559119	0.4959605	4.3469424	0.0000138
JobSatisfaction	-0.4088564	0.0798811	-5.1183104	0.0000003
MaritalStatusMarried	0.3945195	0.2543611	1.5510215	0.1208965
MaritalStatusSingle	1.4321239	0.2609096	5.4889662	0.0000000
NumCompaniesWorked	0.1627058	0.0368348	4.4171807	0.0000100
OverTimeYes	1.9293093	0.1901496	10.1462735	0.0000000
RelationshipSatisfaction	-0.2308355	0.0810321	-2.8486935	0.0043899
TrainingTimesLastYear	-0.1844669	0.0724362	-2.5466125	0.0108774
WorkLifeBalance	-0.3637707	0.1227142	-2.9643736	0.0030330
YearsInCurrentRole	-0.1192218	0.0415277	-2.8708974	0.0040931
YearsSinceLastPromotion	0.2032144	0.0391851	5.1860088	0.0000002
YearsWithCurrManager	-0.1001219	0.0413984	-2.4184959	0.0155848
Age	-0.0459829	0.0116388	-3.9508431	0.0000779
DistanceFromHome	0.0431754	0.0105214	4.1035687	0.0000407

Interpretation - Association Analysis

The summary table contains the four practically important variables YearsSinceLastPromotion, RelationshipSatisfaction, WorkLifeBalance, and NumCompaniesWorked. All four achieve high statistical significance: YearsSinceLastPromotion (p-value \(\approx\) 0.0000002), RelationshipSatisfaction (p-value \(\approx\) 0.0044), WorkLifeBalance (p-value \(\approx\) 0.003), and NumCompaniesWorked (p-value \(\approx\) 0.00001). YearsSinceLastPromotion and NumCompaniesWorked are positively associated with the response variable, whereas RelationshipSatisfaction and WorkLifeBalance are negatively associated with it.

Here’s a brief interpretation of the significance tests:

BusinessTravel: Travel_Frequently has a positive coefficient (1.88) and is significant (p < 0.001), indicating that employees who travel frequently are more likely to leave than the baseline group of non-traveling employees.

Travel_Rarely also has a positive coefficient (0.98) and is significant (p = 0.009), suggesting that employees who travel rarely are also more likely to leave than non-traveling employees, though less so than frequent travelers.

EducationField: None of the education fields differs significantly from the baseline field (Human Resources), although most show a negative association.

EnvironmentSatisfaction: Negative coefficient (-0.42) and highly significant (p < 0.001) suggest that lower environment satisfaction is associated with higher attrition.

Gender: Male employees have a positive coefficient (0.38) and are marginally significant (p = 0.038), indicating that male employees may be slightly more likely to have attrition.

JobInvolvement: Negative coefficient (-0.53) and highly significant (p < 0.001) suggest that lower job involvement is associated with higher attrition.

JobRole: Several job roles have significant associations with attrition relative to the baseline role (Healthcare Representative). For example, Laboratory Technician, Sales Representative, Sales Executive, and Human Resources have positive coefficients and are significant predictors of attrition.

JobSatisfaction: Negative coefficient (-0.41) and highly significant (p < 0.001) suggest that lower job satisfaction is associated with higher attrition.

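MaritalStatus: Single employees have a positive coefficient (1.43) that is highly significant (p < 0.001), indicating that single employees are more likely to leave than the baseline group of divorced employees. The coefficient for married employees is not significant (p = 0.121).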

NumCompaniesWorked: Positive coefficient (0.16) and highly significant (p < 0.001) suggest that having worked at more companies is associated with higher attrition.

OverTime: Employees who work overtime (OverTimeYes) have a positive coefficient (1.93) and are highly significant (p < 0.001), indicating that they are more likely to have attrition.

TrainingTimesLastYear: Negative coefficient (-0.18) and significant (p = 0.011) indicate that fewer training sessions last year are associated with higher attrition.

YearsInCurrentRole: Negative coefficient (-0.12) and significant (p = 0.004) suggest that fewer years in the current role are associated with higher attrition.

YearsSinceLastPromotion: Positive coefficient (0.20) and highly significant (p < 0.001) suggest that more years since the last promotion are associated with higher attrition.

YearsWithCurrManager: Negative coefficient (-0.10) and significant (p = 0.016) suggest that fewer years with the current manager are associated with higher attrition.

Age: Negative coefficient (-0.05) and highly significant (p < 0.001) suggest that older age is associated with lower attrition.

DistanceFromHome: Positive coefficient (0.04) and significant (p < 0.001) suggest that greater distance from home is associated with higher attrition.

These results provide insights into how different feature variables are associated with employee attrition. Variables such as BusinessTravel, EnvironmentSatisfaction, JobInvolvement, JobRole, JobSatisfaction, MaritalStatus, NumCompaniesWorked, OverTime, RelationshipSatisfaction, TrainingTimesLastYear, WorkLifeBalance, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager, Age, and DistanceFromHome appear to have significant associations with attrition.
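
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to communicate. The sketch below does this for three coefficients copied from the final-model table above and adds approximate 95% Wald intervals; the vector names `coefs` and `ses` are introduced here for illustration. With the fitted object available, `exp(coef(final.model))` gives the odds ratios for all terms.

```r
# Estimates and standard errors copied from the final-model summary table
coefs <- c(OverTimeYes = 1.9293093, JobSatisfaction = -0.4088564, Age = -0.0459829)
ses   <- c(OverTimeYes = 0.1901496, JobSatisfaction = 0.0798811, Age = 0.0116388)

odds_ratio <- exp(coefs)                    # multiplicative change in odds
z          <- qnorm(0.975)                  # about 1.96 for a 95% interval
ci_lower   <- exp(coefs - z * ses)          # Wald interval, lower bound
ci_upper   <- exp(coefs + z * ses)          # Wald interval, upper bound

round(cbind(odds_ratio, ci_lower, ci_upper), 3)
```

For example, the odds ratio for OverTimeYes is about 6.9, meaning that the odds of attrition for employees who work overtime are roughly seven times those of otherwise comparable employees who do not.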

Predictive Analysis

As an illustration, we use the final model to predict the attrition status of two hypothetical employees from new values of the predictor variables. See the feature values given in the code chunk.

mynewdata = data.frame(BusinessTravel=c('Travel_Rarely', 'Travel_Frequently'),
                       EnvironmentSatisfaction = c(3, 2),
                       Gender = c('Male','Female'),
                       JobInvolvement = c(1, 3),
                       JobRole = c('Healthcare Representative', 'Manager'),
                       JobSatisfaction = c(3,4),
                       MaritalStatus = c('Single', 'Married'),
                       NumCompaniesWorked = c(3, 4),
                       OverTime = c('Yes', 'No'),
                       RelationshipSatisfaction = c(1, 3),
                       StockOptionLevel = c(0, 1),
                       TrainingTimesLastYear = c(2, 1),
                       WorkLifeBalance = c(2, 2),
                       YearsInCurrentRole = c(3, 8),
                       YearsSinceLastPromotion = c(1, 3),
                       YearsWithCurrManager = c(2, 2),
                       EducationField = c('Human Resources', 'Technical Degree'),
                       Age = c(25, 35),
                       DistanceFromHome = c(14, 5))
pred.success.prob = predict(final.model, newdata = mynewdata, type="response")
##
## threshold probability
cut.off.prob = 0.5
pred.response = ifelse(pred.success.prob > cut.off.prob, 1, 0)  # This predicts the response
## Add the new predicted response to Mynewdata
mynewdata$Pred.Response = pred.response
##
kable(mynewdata, caption = "Predicted Value of response variable 
      with the given cut-off probability")

Predicted Value of response variable with the given cut-off probability
BusinessTravel	EnvironmentSatisfaction	Gender	JobInvolvement	JobRole	JobSatisfaction	MaritalStatus	NumCompaniesWorked	OverTime	RelationshipSatisfaction	StockOptionLevel	TrainingTimesLastYear	WorkLifeBalance	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager	EducationField	Age	DistanceFromHome	Pred.Response
Travel_Rarely	3	Male	1	Healthcare Representative	3	Single	3	Yes	1	0	2	2	3	1	2	Human Resources	25	14	1
Travel_Frequently	2	Female	3	Manager	4	Married	4	No	3	1	1	2	8	3	2	Technical Degree	35	5	0

The predicted attrition status of the two employees is attached to the two new data records.

The Pred.Response column indicates the predicted outcome of employee attrition based on the input values for the predictor variables and the logistic regression model. A value of 1 indicates that the model predicts attrition (Yes), and a value of 0 indicates that the model predicts no attrition (No).

These predictions are based on the relationships and coefficients learned by the logistic regression model from the training data. It’s important to note that these predictions are based on the specific characteristics we have provided for each individual, and they may change if the input values are altered.
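
The research question on classification accuracy can be approached with a confusion matrix at the chosen cut-off probability. The following is a minimal sketch on simulated data; the data frame `sim`, its two predictors, the seed, and the coefficients used to generate it are all assumptions for illustration and are not results from the attrition data.

```r
set.seed(42)
n <- 500

# Simulated predictors and a binary outcome generated from a logistic model
overtime  <- rbinom(n, 1, 0.3)
job_sat   <- sample(1:4, n, replace = TRUE)
p_true    <- plogis(-1 + 1.9 * overtime - 0.4 * job_sat)
attrition <- rbinom(n, 1, p_true)
sim       <- data.frame(attrition, overtime, job_sat)

fit  <- glm(attrition ~ overtime + job_sat, family = binomial, data = sim)
prob <- predict(fit, type = "response")     # in-sample predicted probabilities

cut.off.prob <- 0.5
pred   <- factor(ifelse(prob > cut.off.prob, 1, 0), levels = c(0, 1))
actual <- factor(sim$attrition, levels = c(0, 1))

conf_mat    <- table(Predicted = pred, Actual = actual)
accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)
sensitivity <- conf_mat["1", "1"] / sum(conf_mat[, "1"])  # share of leavers caught

conf_mat
```

Accuracy computed on the same data used for fitting is optimistic, and with an imbalanced response a 0.5 cut-off may miss many leavers, so sensitivity on held-out data is worth examining alongside accuracy.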

Applying Exploratory Data Analysis and Feature Engineering on Employee Attrition Dataset

Tenam Lama