Strategic HR Analytics: Logistic Regression and EDA in Employee Attrition Analysis

by Syarif Yusuf Effendi

Introduction

In an era where data is emerging as one of the most valuable assets for organizations, data analysis plays an important role in supporting informed decision-making. In HR especially, professionals now have access to tools and techniques that let them extract valuable insights from employee data. One important aspect of HR analysis is understanding the factors that influence employee turnover, often called employee attrition. In this article, we will outline the steps to perform an employee attrition analysis using the R programming language.

Data Collection

In this article, we will use an open dataset available on Kaggle, distributed as the file WA_Fn-UseC_-HR-Employee-Attrition.csv.

Data Cleaning and Preparation

The first step, before any cleaning, is to import the dataset file downloaded earlier. The code below loads the readr library, although the import itself uses read.csv(), which is part of base R.

library(readr)
attrition.data <- read.csv("D:/Dataset/Employee Attrition/WA_Fn-UseC_-HR-Employee-Attrition.csv")

Then, after the data is successfully imported, use the dplyr library to start cleaning it so the data is ready for use.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Before cleaning the data, let’s inspect the dataset to see its columns, structure, and data types.

head(attrition.data)
##   Age Attrition    BusinessTravel DailyRate             Department
## 1  41       Yes     Travel_Rarely      1102                  Sales
## 2  49        No Travel_Frequently       279 Research & Development
## 3  37       Yes     Travel_Rarely      1373 Research & Development
## 4  33        No Travel_Frequently      1392 Research & Development
## 5  27        No     Travel_Rarely       591 Research & Development
## 6  32        No Travel_Frequently      1005 Research & Development
##   DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1                1         2  Life Sciences             1              1
## 2                8         1  Life Sciences             1              2
## 3                2         2          Other             1              4
## 4                3         4  Life Sciences             1              5
## 5                2         1        Medical             1              7
## 6                2         2  Life Sciences             1              8
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2 Female         94              3        2
## 2                       3   Male         61              2        2
## 3                       4   Male         92              2        1
## 4                       4 Female         56              3        1
## 5                       1   Male         40              3        1
## 6                       4   Male         79              3        1
##                 JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1       Sales Executive               4        Single          5993       19479
## 2    Research Scientist               2       Married          5130       24907
## 3 Laboratory Technician               3        Single          2090        2396
## 4    Research Scientist               3       Married          2909       23159
## 5 Laboratory Technician               2       Married          3468       16632
## 6 Laboratory Technician               4        Single          3068       11864
##   NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1                  8      Y      Yes                11                 3
## 2                  1      Y       No                23                 4
## 3                  6      Y      Yes                15                 3
## 4                  1      Y      Yes                11                 3
## 5                  9      Y       No                12                 3
## 6                  0      Y       No                13                 3
##   RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1                        1            80                0                 8
## 2                        4            80                1                10
## 3                        2            80                0                 7
## 4                        3            80                0                 8
## 5                        4            80                1                 6
## 6                        3            80                0                 8
##   TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1                     0               1              6                  4
## 2                     3               3             10                  7
## 3                     3               3              0                  0
## 4                     3               3              8                  7
## 5                     3               3              2                  2
## 6                     2               2              7                  7
##   YearsSinceLastPromotion YearsWithCurrManager
## 1                       0                    5
## 2                       1                    7
## 3                       0                    0
## 4                       3                    0
## 5                       2                    2
## 6                       3                    6

The head() function in R displays the first few rows (six by default) of a data object, such as a data frame, matrix, or vector. Its main benefit is that it gives a quick initial view of the data.
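As a small illustration of how the number of rows can be controlled, here is a minimal sketch using a made-up data frame (toy, id, and score are hypothetical names, not part of the attrition dataset). The n argument sets how many rows are returned, and tail() does the same from the end.

```r
# Hypothetical ten-row data frame used only for illustration
toy <- data.frame(id = 1:10, score = seq(10, 100, by = 10))

first_rows <- head(toy, n = 3)  # first three rows
last_rows  <- tail(toy, n = 2)  # last two rows

first_rows
last_rows
```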

str(attrition.data)
## 'data.frame':    1470 obs. of  35 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : chr  "Yes" "No" "Yes" "No" ...
##  $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : chr  "Sales" "Research & Development" "Research & Development" "Research & Development" ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : chr  "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  1 2 4 5 7 8 10 11 12 13 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : chr  "Female" "Male" "Male" "Female" ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : chr  "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : chr  "Single" "Married" "Single" "Married" ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ Over18                  : chr  "Y" "Y" "Y" "Y" ...
##  $ OverTime                : chr  "Yes" "No" "Yes" "Yes" ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

The str() function in R presents the structure of a data object. Here it reports 1470 observations of 35 variables, along with each column's data type (int or chr) and its first few values.

Next, we look at descriptive statistics for the integer columns. To do that, we exclude the character columns before calling summary():

# Names of the character columns to exclude
exclude_cols <- c("Attrition", "BusinessTravel", "Department", "EducationField",
                  "Gender", "JobRole", "MaritalStatus", "Over18", "OverTime")

# Run summary() on every column that is not in exclude_cols
summary(attrition.data[, !(names(attrition.data) %in% exclude_cols)])
##       Age          DailyRate      DistanceFromHome   Education    
##  Min.   :18.00   Min.   : 102.0   Min.   : 1.000   Min.   :1.000  
##  1st Qu.:30.00   1st Qu.: 465.0   1st Qu.: 2.000   1st Qu.:2.000  
##  Median :36.00   Median : 802.0   Median : 7.000   Median :3.000  
##  Mean   :36.92   Mean   : 802.5   Mean   : 9.193   Mean   :2.913  
##  3rd Qu.:43.00   3rd Qu.:1157.0   3rd Qu.:14.000   3rd Qu.:4.000  
##  Max.   :60.00   Max.   :1499.0   Max.   :29.000   Max.   :5.000  
##  EmployeeCount EmployeeNumber   EnvironmentSatisfaction   HourlyRate    
##  Min.   :1     Min.   :   1.0   Min.   :1.000           Min.   : 30.00  
##  1st Qu.:1     1st Qu.: 491.2   1st Qu.:2.000           1st Qu.: 48.00  
##  Median :1     Median :1020.5   Median :3.000           Median : 66.00  
##  Mean   :1     Mean   :1024.9   Mean   :2.722           Mean   : 65.89  
##  3rd Qu.:1     3rd Qu.:1555.8   3rd Qu.:4.000           3rd Qu.: 83.75  
##  Max.   :1     Max.   :2068.0   Max.   :4.000           Max.   :100.00  
##  JobInvolvement    JobLevel     JobSatisfaction MonthlyIncome    MonthlyRate   
##  Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   : 1009   Min.   : 2094  
##  1st Qu.:2.00   1st Qu.:1.000   1st Qu.:2.000   1st Qu.: 2911   1st Qu.: 8047  
##  Median :3.00   Median :2.000   Median :3.000   Median : 4919   Median :14236  
##  Mean   :2.73   Mean   :2.064   Mean   :2.729   Mean   : 6503   Mean   :14313  
##  3rd Qu.:3.00   3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.: 8379   3rd Qu.:20462  
##  Max.   :4.00   Max.   :5.000   Max.   :4.000   Max.   :19999   Max.   :26999  
##  NumCompaniesWorked PercentSalaryHike PerformanceRating
##  Min.   :0.000      Min.   :11.00     Min.   :3.000    
##  1st Qu.:1.000      1st Qu.:12.00     1st Qu.:3.000    
##  Median :2.000      Median :14.00     Median :3.000    
##  Mean   :2.693      Mean   :15.21     Mean   :3.154    
##  3rd Qu.:4.000      3rd Qu.:18.00     3rd Qu.:3.000    
##  Max.   :9.000      Max.   :25.00     Max.   :4.000    
##  RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
##  Min.   :1.000            Min.   :80    Min.   :0.0000   Min.   : 0.00    
##  1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000   1st Qu.: 6.00    
##  Median :3.000            Median :80    Median :1.0000   Median :10.00    
##  Mean   :2.712            Mean   :80    Mean   :0.7939   Mean   :11.28    
##  3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000   3rd Qu.:15.00    
##  Max.   :4.000            Max.   :80    Max.   :3.0000   Max.   :40.00    
##  TrainingTimesLastYear WorkLifeBalance YearsAtCompany   YearsInCurrentRole
##  Min.   :0.000         Min.   :1.000   Min.   : 0.000   Min.   : 0.000    
##  1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000    
##  Median :3.000         Median :3.000   Median : 5.000   Median : 3.000    
##  Mean   :2.799         Mean   :2.761   Mean   : 7.008   Mean   : 4.229    
##  3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000   3rd Qu.: 7.000    
##  Max.   :6.000         Max.   :4.000   Max.   :40.000   Max.   :18.000    
##  YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 1.000          Median : 3.000      
##  Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :15.000          Max.   :17.000

attrition.data[, !(names(attrition.data) %in% exclude_cols)] selects the columns of attrition.data whose names are not in exclude_cols. The %in% operator returns TRUE for each column name found in exclude_cols, and ! negates that result, so only the remaining columns are kept.
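An alternative to listing the excluded names by hand is to let R detect the numeric columns. The sketch below is only an illustration on a made-up data frame (toy and its values are hypothetical); the same pattern should apply to attrition.data, assuming its numeric columns are stored as int or num.

```r
# Hypothetical three-row stand-in for attrition.data
toy <- data.frame(
  Age = c(41L, 49L, 37L),
  Attrition = c("Yes", "No", "Yes"),
  MonthlyIncome = c(5993L, 5130L, 2090L),
  stringsAsFactors = FALSE
)

# TRUE for every column that holds numbers (integer or double)
is_num <- sapply(toy, is.numeric)

# Keep only the numeric columns, then summarise them
numeric_only <- toy[, is_num, drop = FALSE]
summary(numeric_only)
```

This avoids maintaining exclude_cols manually, at the cost of being less explicit about which columns are dropped.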

Next, we check for variables with character data types.

table(attrition.data$Attrition)
## 
##   No  Yes 
## 1233  237
table(attrition.data$BusinessTravel)
## 
##        Non-Travel Travel_Frequently     Travel_Rarely 
##               150               277              1043
table(attrition.data$Department)
## 
##        Human Resources Research & Development                  Sales 
##                     63                    961                    446
table(attrition.data$EducationField)
## 
##  Human Resources    Life Sciences        Marketing          Medical 
##               27              606              159              464 
##            Other Technical Degree 
##               82              132
table(attrition.data$Gender)
## 
## Female   Male 
##    588    882
table(attrition.data$JobRole)
## 
## Healthcare Representative           Human Resources     Laboratory Technician 
##                       131                        52                       259 
##                   Manager    Manufacturing Director         Research Director 
##                       102                       145                        80 
##        Research Scientist           Sales Executive      Sales Representative 
##                       292                       326                        83
table(attrition.data$MaritalStatus)
## 
## Divorced  Married   Single 
##      327      673      470
table(attrition.data$Over18)
## 
##    Y 
## 1470
table(attrition.data$OverTime)
## 
##   No  Yes 
## 1054  416

The table() function in R is used to create contingency tables, which calculate the frequency of observations among combinations of values of one or more variables.
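Counts are often easier to interpret as proportions. As a minimal sketch, the code below re-enters the Attrition counts printed above as a named vector (attrition_counts is a name introduced here for illustration) and converts them with prop.table().

```r
# Counts copied from the table(attrition.data$Attrition) output above
attrition_counts <- c(No = 1233, Yes = 237)

# Convert counts to proportions of the total
attrition_share <- prop.table(attrition_counts)
round(attrition_share, 3)
```

On these counts, about 16% of employees left, so the two classes are clearly imbalanced. With the full data loaded, prop.table(table(attrition.data$Attrition)) should give the same proportions directly.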

Finally, check whether the dataset contains any NA values.

# Count the total number of NA values across all columns
total_na <- sum(colSums(is.na(attrition.data)))

# Print total_na
total_na
## [1] 0

The result shows that there are no NA values in any column of the dataset.
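When a dataset does contain missing values, a single total does not say where they are. The sketch below, on a made-up data frame with deliberately missing entries (toy is hypothetical), shows how colSums(is.na()) reports the count per column.

```r
# Hypothetical data frame with two missing values
toy <- data.frame(
  Age = c(41, NA, 37),
  Department = c("Sales", "Sales", NA),
  JobLevel = c(2, 2, 1)
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# TRUE if at least one column contains an NA
has_missing <- any(na_per_column > 0)
has_missing
```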

Next, after these checks, we need to adjust the dataset. We remove the columns that cannot provide useful insights because they hold a single constant value: EmployeeCount (always 1), Over18 (always “Y”), and StandardHours (always 80). We also add a new column, EducationLevel, which maps the numeric Education code to a descriptive label.

# Data cleaning
df.attrition <- attrition.data %>%
  select(-EmployeeCount, -Over18, -StandardHours) %>% # Delete column
  mutate(EducationLevel = c("Below College", "College", "Bachelor", "Master", "Doctor")[Education]) # Added new column

# Review new table
head(df.attrition)
##   Age Attrition    BusinessTravel DailyRate             Department
## 1  41       Yes     Travel_Rarely      1102                  Sales
## 2  49        No Travel_Frequently       279 Research & Development
## 3  37       Yes     Travel_Rarely      1373 Research & Development
## 4  33        No Travel_Frequently      1392 Research & Development
## 5  27        No     Travel_Rarely       591 Research & Development
## 6  32        No Travel_Frequently      1005 Research & Development
##   DistanceFromHome Education EducationField EmployeeNumber
## 1                1         2  Life Sciences              1
## 2                8         1  Life Sciences              2
## 3                2         2          Other              4
## 4                3         4  Life Sciences              5
## 5                2         1        Medical              7
## 6                2         2  Life Sciences              8
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2 Female         94              3        2
## 2                       3   Male         61              2        2
## 3                       4   Male         92              2        1
## 4                       4 Female         56              3        1
## 5                       1   Male         40              3        1
## 6                       4   Male         79              3        1
##                 JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1       Sales Executive               4        Single          5993       19479
## 2    Research Scientist               2       Married          5130       24907
## 3 Laboratory Technician               3        Single          2090        2396
## 4    Research Scientist               3       Married          2909       23159
## 5 Laboratory Technician               2       Married          3468       16632
## 6 Laboratory Technician               4        Single          3068       11864
##   NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 1                  8      Yes                11                 3
## 2                  1       No                23                 4
## 3                  6      Yes                15                 3
## 4                  1      Yes                11                 3
## 5                  9       No                12                 3
## 6                  0       No                13                 3
##   RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## 1                        1                0                 8
## 2                        4                1                10
## 3                        2                0                 7
## 4                        3                0                 8
## 5                        4                1                 6
## 6                        3                0                 8
##   TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1                     0               1              6                  4
## 2                     3               3             10                  7
## 3                     3               3              0                  0
## 4                     3               3              8                  7
## 5                     3               3              2                  2
## 6                     2               2              7                  7
##   YearsSinceLastPromotion YearsWithCurrManager EducationLevel
## 1                       0                    5        College
## 2                       1                    7  Below College
## 3                       0                    0        College
## 4                       3                    0         Master
## 5                       2                    2  Below College
## 6                       3                    6        College

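The %>% operator, known as the pipe operator, comes from the magrittr package and is made available by dplyr and the other tidyverse packages. It passes the result of the expression on its left as the first argument of the function on its right, which lets a sequence of data-processing steps be written in a concise, readable, top-to-bottom form.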
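The EducationLevel column above relies on vector indexing: each Education code is used as a position in the vector of labels. Here is a minimal base R sketch of that lookup on three made-up rows (education_labels and toy are names introduced for illustration), together with a factor-based alternative that gives the same labels.

```r
# Labels in the order of the Education codes 1 to 5
education_labels <- c("Below College", "College", "Bachelor", "Master", "Doctor")

# Hypothetical codes, matching the first rows shown above (2, 1, 4)
toy <- data.frame(Education = c(2L, 1L, 4L))

# Vector indexing: code 2 picks the 2nd label, code 1 the 1st, and so on
toy$EducationLevel <- education_labels[toy$Education]

# Alternative: a factor keeps the natural ordering of the levels
toy$EducationFactor <- factor(toy$Education, levels = 1:5, labels = education_labels)

toy
```

The factor version can be convenient for plots and models, since the levels stay in educational order instead of alphabetical order.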

Data Exploration

In an increasingly data-driven world, data exploration is a critical initial step before conducting in-depth research. This is the process of exploring your dataset to learn about its characteristics, patterns, and potential. Data exploration assists in the identification of early trends, the identification of possible problems, and the formulation of deeper questions for research.

Before starting data exploration, let’s load the ggplot2 and reshape2 libraries, because at this stage we will create several visualizations such as histograms and box plots.

library(ggplot2)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.3

How is employee age distributed by attrition?

# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(df.attrition, aes(x = Age, y = after_stat(density), fill = Attrition)) +
  geom_histogram(binwidth = 5, color = "black", position = "stack") +
  geom_density(alpha = 0.5) +
  labs(title = "Employee Age Distribution by Attrition", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(~Attrition)

Based on the histograms above, the age distributions for both “Yes” and “No” attrition are right-skewed (the tail extends toward older ages). This means the mean, median, and mode of age do not coincide. The histograms also show that employees with Attrition “No” are predominantly aged 30–40 years, while employees with Attrition “Yes” are concentrated in their late 20s to early 30s (around 27–33 years), with a density that almost reaches 0.06.
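A quick numeric check for right skew is to compare the mean with the median: in a right-skewed sample, the long upper tail usually pulls the mean above the median. The sketch below uses a made-up vector of ages (hypothetical values, not taken from the dataset).

```r
# Hypothetical ages with one unusually old employee in the upper tail
ages <- c(22, 25, 27, 28, 30, 31, 33, 35, 41, 58)

age_mean   <- mean(ages)    # pulled upward by the value 58
age_median <- median(ages)  # middle of the sorted values

c(mean = age_mean, median = age_median)
```

The same pattern is visible in the summary() output earlier, where Age has a mean of 36.92 and a median of 36.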

How is monthly income distributed by attrition?

ggplot(attrition.data, aes(x = Attrition, y = MonthlyIncome, fill = Attrition)) +
  geom_boxplot() +
  labs(title = "Monthly Income Distribution by Attrition", y = "Monthly Income (USD)")+
  theme(plot.title = element_text(hjust = 0.5))

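Based on the box plot above, monthly income is higher for employees with “No” attrition than for those with “Yes” attrition: around 6800 USD versus around 4700 USD. These two figures appear closer to group means than to medians: weighted by the group sizes (1233 and 237), they average about 6460 USD, which is close to the overall mean of 6503 USD and well above the overall median of 4919 USD reported by summary(). Either way, the gap suggests that higher income may be a factor in an employee’s decision to stay with the company.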

The plot also shows outliers, which indicate extreme values in the monthly income data: some employees have incomes far from those of the majority. Both boxes are asymmetric as well, as shown by the unequal heights of the upper and lower halves of each box, which indicates that the distributions are skewed rather than normal.
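Because the line inside each box is the median, not the mean, it helps to compute both per group when reading values off a box plot. The sketch below uses a made-up data frame (toy and its incomes are hypothetical) and base R's tapply(); on the real data, the same calls with df.attrition in place of toy should give the group figures.

```r
# Hypothetical incomes, with one high earner in each group
toy <- data.frame(
  Attrition = c("No", "No", "No", "No", "Yes", "Yes", "Yes"),
  MonthlyIncome = c(3000, 5000, 6000, 19000, 2000, 3000, 10000)
)

# Median and mean of MonthlyIncome within each Attrition group
group_median <- tapply(toy$MonthlyIncome, toy$Attrition, median)
group_mean   <- tapply(toy$MonthlyIncome, toy$Attrition, mean)

rbind(median = group_median, mean = group_mean)
```

In skewed data like this, a few high incomes lift the mean well above the median, so the two should not be used interchangeably.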

How are the numerical variables correlated?

# Names of the numeric columns to include in the correlation matrix
numeric_columns <- c("Age", "DailyRate", "DistanceFromHome", "Education", "EnvironmentSatisfaction", 
                     "HourlyRate", "JobInvolvement", "JobLevel", "JobSatisfaction", "MonthlyIncome", 
                     "MonthlyRate", "NumCompaniesWorked", "PerformanceRating", "RelationshipSatisfaction", 
                     "PercentSalaryHike", "StockOptionLevel", "TotalWorkingYears", "TrainingTimesLastYear", 
                     "WorkLifeBalance", "YearsAtCompany", "YearsInCurrentRole",
                     "YearsSinceLastPromotion", "YearsWithCurrManager")

# Calculate the correlation matrix
correlation_matrix <- cor(attrition.data[, numeric_columns])

# Create a correlation heatmap
ggplot(melt(correlation_matrix), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "lightblue", high = "red") +
  labs(title = "Numeric Variables Correlation Heatmap", x = "Variable", y = "Variable") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        plot.title = element_text(hjust = 0.5))

There is a strong positive correlation between MonthlyIncome and JobLevel, MonthlyIncome and TotalWorkingYears, YearsWithCurrManager and YearsAtCompany, JobLevel and TotalWorkingYears, and Age and TotalWorkingYears. These pairs are marked in deep red, corresponding to an estimated correlation of roughly 0.75 to 1.0.
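Reading pairs off a heatmap by eye can be imprecise, so the strongest pairs can also be listed programmatically. The following is a sketch on a made-up data frame (toy, cm, and the 0.75 threshold are illustrative assumptions); the same steps should work on correlation_matrix.

```r
# Hypothetical data: b is a multiple of a, c is nearly unrelated to both
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),
  c = c(2, 5, 1, 4, 3)
)

cm <- cor(toy)

# Keep each pair once by blanking the diagonal and the lower triangle
cm[lower.tri(cm, diag = TRUE)] <- NA

# Reshape the matrix into one row per variable pair
pairs <- as.data.frame(as.table(cm))
pairs <- pairs[!is.na(pairs$Freq), ]
names(pairs) <- c("var1", "var2", "correlation")

# Keep pairs whose absolute correlation is at least 0.75, strongest first
strong <- pairs[abs(pairs$correlation) >= 0.75, ]
strong <- strong[order(-abs(strong$correlation)), ]
strong
```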

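Let’s verify this numerically using Spearman rank correlation coefficients. The heatmap above used the default Pearson method of cor(), so this is a second, rank-based view of the same relationships. Note that cor() returns the coefficients only; it does not perform a significance test.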

cor(df.attrition[, numeric_columns], method = "spearman")
##                                    Age     DailyRate DistanceFromHome
## Age                       1.000000e+00  0.0072897280     -0.019290911
## DailyRate                 7.289728e-03  1.0000000000     -0.002753667
## DistanceFromHome         -1.929091e-02 -0.0027536668      1.000000000
## Education                 2.049367e-01 -0.0136071321      0.015708005
## EnvironmentSatisfaction   9.820116e-03  0.0189611918     -0.010400913
## HourlyRate                2.885849e-02  0.0235114333      0.020445908
## JobInvolvement            3.445622e-02  0.0424687818      0.034430070
## JobLevel                  4.896178e-01  0.0038163552      0.022147607
## JobSatisfaction          -5.184852e-03  0.0278287617     -0.013078054
## MonthlyIncome             4.719021e-01  0.0162596518      0.002512448
## MonthlyRate               1.745100e-02 -0.0323595200      0.039617805
## NumCompaniesWorked        3.532126e-01  0.0365483454     -0.009591877
## PerformanceRating         9.338863e-05  0.0006244329      0.011319674
## RelationshipSatisfaction  4.606332e-02  0.0096845284      0.005851811
## PercentSalaryHike         7.708838e-03  0.0250704034      0.029666428
## StockOptionLevel          5.663306e-02  0.0385139335      0.030190294
## TotalWorkingYears         6.568958e-01  0.0209506994     -0.002912375
## TrainingTimesLastYear     3.158049e-04 -0.0113389745     -0.024848361
## WorkLifeBalance          -3.706791e-03 -0.0403523149     -0.020401887
## YearsAtCompany            2.516860e-01 -0.0097783439      0.010513095
## YearsInCurrentRole        1.979776e-01  0.0072075388      0.013708096
## YearsSinceLastPromotion   1.736472e-01 -0.0376312932     -0.004685211
## YearsWithCurrManager      1.948176e-01 -0.0047165189      0.004447868
##                             Education EnvironmentSatisfaction    HourlyRate
## Age                       0.204936684            0.0098201155  0.0288584852
## DailyRate                -0.013607132            0.0189611918  0.0235114333
## DistanceFromHome          0.015708005           -0.0104009131  0.0204459078
## Education                 1.000000000           -0.0276246896  0.0144319003
## EnvironmentSatisfaction  -0.027624690            1.0000000000 -0.0523803635
## HourlyRate                0.014431900           -0.0523803635  1.0000000000
## JobInvolvement            0.037230797           -0.0153011188  0.0438843132
## JobLevel                  0.107419164           -0.0001924317 -0.0338761113
## JobSatisfaction          -0.005175469           -0.0029927823 -0.0683401726
## MonthlyIncome             0.120028491           -0.0151630774 -0.0197617219
## MonthlyRate              -0.021213828            0.0374765094 -0.0148884998
## NumCompaniesWorked        0.135103375            0.0061514181  0.0192092504
## PerformanceRating        -0.025080669           -0.0291598169 -0.0021846442
## RelationshipSatisfaction -0.013172612            0.0053534581  0.0002585028
## PercentSalaryHike         0.004299936           -0.0304894261 -0.0098755483
## StockOptionLevel          0.013793504            0.0098261697  0.0505430780
## TotalWorkingYears         0.162176793           -0.0138819760 -0.0120716328
## TrainingTimesLastYear    -0.023748617           -0.0116589377  0.0002918715
## WorkLifeBalance           0.017350435            0.0271689716 -0.0100031257
## YearsAtCompany            0.064196129            0.0084245043 -0.0290323277
## YearsInCurrentRole        0.054567211            0.0201401658 -0.0340160475
## YearsSinceLastPromotion   0.032203024            0.0260816744 -0.0524124022
## YearsWithCurrManager      0.051291866           -0.0017318242 -0.0138114155
##                          JobInvolvement      JobLevel JobSatisfaction
## Age                         0.034456224  0.4896178109   -0.0051848521
## DailyRate                   0.042468782  0.0038163552    0.0278287617
## DistanceFromHome            0.034430070  0.0221476071   -0.0130780540
## Education                   0.037230797  0.1074191638   -0.0051754687
## EnvironmentSatisfaction    -0.015301119 -0.0001924317   -0.0029927823
## HourlyRate                  0.043884313 -0.0338761113   -0.0683401726
## JobInvolvement              1.000000000 -0.0184235130   -0.0121482118
## JobLevel                   -0.018423513  1.0000000000   -0.0008519730
## JobSatisfaction            -0.012148212 -0.0008519730    1.0000000000
## MonthlyIncome              -0.024552352  0.9204286748    0.0048807779
## MonthlyRate                -0.018117452  0.0527918878   -0.0027017288
## NumCompaniesWorked          0.015448159  0.1782701536   -0.0515158889
## PerformanceRating          -0.024732712 -0.0186083014    0.0069785019
## RelationshipSatisfaction    0.037857297  0.0113112331   -0.0146785538
## PercentSalaryHike          -0.016998737 -0.0324527523    0.0239695457
## StockOptionLevel            0.034464290  0.0477861699    0.0127854959
## TotalWorkingYears           0.006444104  0.7346775906   -0.0158747168
## TrainingTimesLastYear       0.002013915 -0.0197285879   -0.0116809933
## WorkLifeBalance            -0.019888634  0.0404657805   -0.0297808628
## YearsAtCompany              0.013836362  0.4722827149    0.0122804055
## YearsInCurrentRole          0.015547840  0.3910854019    0.0005310846
## YearsSinceLastPromotion    -0.008306725  0.2690960789    0.0074971306
## YearsWithCurrManager        0.037397014  0.3708892877   -0.0167721793
##                          MonthlyIncome   MonthlyRate NumCompaniesWorked
## Age                        0.471902130  0.0174510008       3.532126e-01
## DailyRate                  0.016259652 -0.0323595200       3.654835e-02
## DistanceFromHome           0.002512448  0.0396178052      -9.591877e-03
## Education                  0.120028491 -0.0212138278       1.351034e-01
## EnvironmentSatisfaction   -0.015163077  0.0374765094       6.151418e-03
## HourlyRate                -0.019761722 -0.0148884998       1.920925e-02
## JobInvolvement            -0.024552352 -0.0181174516       1.544816e-02
## JobLevel                   0.920428675  0.0527918878       1.782702e-01
## JobSatisfaction            0.004880778 -0.0027017288      -5.151589e-02
## MonthlyIncome              1.000000000  0.0542767660       1.903072e-01
## MonthlyRate                0.054276766  1.0000000000       1.955330e-02
## NumCompaniesWorked         0.190307217  0.0195532984       1.000000e+00
## PerformanceRating         -0.026999475 -0.0096975883      -8.298387e-03
## RelationshipSatisfaction   0.003885241 -0.0003728148       4.029637e-02
## PercentSalaryHike         -0.033767076 -0.0054705369       4.628802e-05
## StockOptionLevel           0.045851881 -0.0372742936       3.227691e-02
## TotalWorkingYears          0.710024314  0.0133598231       3.151956e-01
## TrainingTimesLastYear     -0.034846762 -0.0100179472      -4.733649e-02
## WorkLifeBalance            0.030759146  0.0063162635       9.102504e-03
## YearsAtCompany             0.464315235 -0.0298618681      -1.710698e-01
## YearsInCurrentRole         0.394711834 -0.0068654973      -1.276730e-01
## YearsSinceLastPromotion    0.264599332 -0.0162854734      -6.695018e-02
## YearsWithCurrManager       0.365385678 -0.0350591418      -1.441290e-01
##                          PerformanceRating RelationshipSatisfaction
## Age                           9.338863e-05             0.0460633199
## DailyRate                     6.244329e-04             0.0096845284
## DistanceFromHome              1.131967e-02             0.0058518112
## Education                    -2.508067e-02            -0.0131726125
## EnvironmentSatisfaction      -2.915982e-02             0.0053534581
## HourlyRate                   -2.184644e-03             0.0002585028
## JobInvolvement               -2.473271e-02             0.0378572969
## JobLevel                     -1.860830e-02             0.0113112331
## JobSatisfaction               6.978502e-03            -0.0146785538
## MonthlyIncome                -2.699948e-02             0.0038852411
## MonthlyRate                  -9.697588e-03            -0.0003728148
## NumCompaniesWorked           -8.298387e-03             0.0402963651
## PerformanceRating             1.000000e+00            -0.0329887789
## RelationshipSatisfaction     -3.298878e-02             1.0000000000
## PercentSalaryHike             6.285191e-01            -0.0349145727
## StockOptionLevel              1.102806e-02            -0.0562490052
## TotalWorkingYears             1.167810e-02             0.0039712744
## TrainingTimesLastYear        -1.667579e-02             0.0054241038
## WorkLifeBalance               6.808391e-03             0.0176841171
## YearsAtCompany                1.722425e-02            -0.0012671613
## YearsInCurrentRole            3.271927e-02            -0.0213997688
## YearsSinceLastPromotion      -6.578150e-03             0.0369629999
## YearsWithCurrManager          2.556002e-02             0.0002800476
##                          PercentSalaryHike StockOptionLevel TotalWorkingYears
## Age                           7.708838e-03      0.056633063       0.656895823
## DailyRate                     2.507040e-02      0.038513934       0.020950699
## DistanceFromHome              2.966643e-02      0.030190294      -0.002912375
## Education                     4.299936e-03      0.013793504       0.162176793
## EnvironmentSatisfaction      -3.048943e-02      0.009826170      -0.013881976
## HourlyRate                   -9.875548e-03      0.050543078      -0.012071633
## JobInvolvement               -1.699874e-02      0.034464290       0.006444104
## JobLevel                     -3.245275e-02      0.047786170       0.734677591
## JobSatisfaction               2.396955e-02      0.012785496      -0.015874717
## MonthlyIncome                -3.376708e-02      0.045851881       0.710024314
## MonthlyRate                  -5.470537e-03     -0.037274294       0.013359823
## NumCompaniesWorked            4.628802e-05      0.032276911       0.315195582
## PerformanceRating             6.285191e-01      0.011028055       0.011678101
## RelationshipSatisfaction     -3.491457e-02     -0.056249005       0.003971274
## PercentSalaryHike             1.000000e+00      0.023445876      -0.025527604
## StockOptionLevel              2.344588e-02      1.000000000       0.052618281
## TotalWorkingYears            -2.552760e-02      0.052618281       1.000000000
## TrainingTimesLastYear        -4.106182e-03      0.003388463      -0.014150578
## WorkLifeBalance               9.304377e-04     -0.016567956       0.003004074
## YearsAtCompany               -5.411676e-02      0.064974119       0.594193253
## YearsInCurrentRole           -2.552848e-02      0.071626914       0.492721324
## YearsSinceLastPromotion      -5.536242e-02      0.027502390       0.334995640
## YearsWithCurrManager         -2.604883e-02      0.053646188       0.495254103
##                          TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## Age                               0.0003158049   -0.0037067910    0.251685970
## DailyRate                        -0.0113389745   -0.0403523149   -0.009778344
## DistanceFromHome                 -0.0248483609   -0.0204018869    0.010513095
## Education                        -0.0237486170    0.0173504351    0.064196129
## EnvironmentSatisfaction          -0.0116589377    0.0271689716    0.008424504
## HourlyRate                        0.0002918715   -0.0100031257   -0.029032328
## JobInvolvement                    0.0020139150   -0.0198886343    0.013836362
## JobLevel                         -0.0197285879    0.0404657805    0.472282715
## JobSatisfaction                  -0.0116809933   -0.0297808628    0.012280406
## MonthlyIncome                    -0.0348467617    0.0307591463    0.464315235
## MonthlyRate                      -0.0100179472    0.0063162635   -0.029861868
## NumCompaniesWorked               -0.0473364941    0.0091025036   -0.171069831
## PerformanceRating                -0.0166757921    0.0068083913    0.017224252
## RelationshipSatisfaction          0.0054241038    0.0176841171   -0.001267161
## PercentSalaryHike                -0.0041061817    0.0009304377   -0.054116761
## StockOptionLevel                  0.0033884631   -0.0165679563    0.064974119
## TotalWorkingYears                -0.0141505775    0.0030040739    0.594193253
## TrainingTimesLastYear             1.0000000000    0.0236895575    0.001389345
## WorkLifeBalance                   0.0236895575    1.0000000000    0.004675134
## YearsAtCompany                    0.0013893447    0.0046751344    1.000000000
## YearsInCurrentRole                0.0045810569    0.0232142605    0.853999533
## YearsSinceLastPromotion           0.0102154346    0.0021511105    0.519966444
## YearsWithCurrManager             -0.0116275408   -0.0045905701    0.842803342
##                          YearsInCurrentRole YearsSinceLastPromotion
## Age                            0.1979775610             0.173647224
## DailyRate                      0.0072075388            -0.037631293
## DistanceFromHome               0.0137080964            -0.004685211
## Education                      0.0545672109             0.032203024
## EnvironmentSatisfaction        0.0201401658             0.026081674
## HourlyRate                    -0.0340160475            -0.052412402
## JobInvolvement                 0.0155478405            -0.008306725
## JobLevel                       0.3910854019             0.269096079
## JobSatisfaction                0.0005310846             0.007497131
## MonthlyIncome                  0.3947118335             0.264599332
## MonthlyRate                   -0.0068654973            -0.016285473
## NumCompaniesWorked            -0.1276729762            -0.066950179
## PerformanceRating              0.0327192666            -0.006578150
## RelationshipSatisfaction      -0.0213997688             0.036963000
## PercentSalaryHike             -0.0255284805            -0.055362419
## StockOptionLevel               0.0716269138             0.027502390
## TotalWorkingYears              0.4927213242             0.334995640
## TrainingTimesLastYear          0.0045810569             0.010215435
## WorkLifeBalance                0.0232142605             0.002151110
## YearsAtCompany                 0.8539995333             0.519966444
## YearsInCurrentRole             1.0000000000             0.505656503
## YearsSinceLastPromotion        0.5056565032             1.000000000
## YearsWithCurrManager           0.7247542193             0.466712745
##                          YearsWithCurrManager
## Age                              0.1948175841
## DailyRate                       -0.0047165189
## DistanceFromHome                 0.0044478680
## Education                        0.0512918662
## EnvironmentSatisfaction         -0.0017318242
## HourlyRate                      -0.0138114155
## JobInvolvement                   0.0373970145
## JobLevel                         0.3708892877
## JobSatisfaction                 -0.0167721793
## MonthlyIncome                    0.3653856782
## MonthlyRate                     -0.0350591418
## NumCompaniesWorked              -0.1441289757
## PerformanceRating                0.0255600151
## RelationshipSatisfaction         0.0002800476
## PercentSalaryHike               -0.0260488322
## StockOptionLevel                 0.0536461877
## TotalWorkingYears                0.4952541030
## TrainingTimesLastYear           -0.0116275408
## WorkLifeBalance                 -0.0045905701
## YearsAtCompany                   0.8428033422
## YearsInCurrentRole               0.7247542193
## YearsSinceLastPromotion          0.4667127452
## YearsWithCurrManager             1.0000000000

JobLevel and MonthlyIncome have a strong positive correlation of around 0.920: the higher the JobLevel, the higher the MonthlyIncome tends to be. YearsAtCompany and YearsWithCurrManager are also strongly positively correlated, at around 0.843: the longer someone has worked at the company, the longer they tend to have spent with their current manager. YearsAtCompany is similarly correlated with YearsInCurrentRole (around 0.854).

Strongly correlated predictors carry overlapping information and make coefficient estimates unstable (multicollinearity), so we will exclude JobLevel and YearsAtCompany from the model.
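
Reading a large correlation matrix by eye is error-prone, so the screening step above can also be done programmatically. The following is a minimal sketch, not the method used in this article: the helper name high_cor_pairs, the 0.8 cutoff, and the toy data are all assumptions made for illustration, and the correlation method defaults to Pearson, which may differ from the one used to produce the matrix above.

```r
# Toy data standing in for the numeric columns of the attrition data
# (hypothetical values, not taken from the Kaggle file).
set.seed(42)
n <- 200
job_level <- sample(1:5, n, replace = TRUE)
toy <- data.frame(
  JobLevel      = job_level,
  MonthlyIncome = 2000 * job_level + rnorm(n, sd = 500),
  DailyRate     = runif(n, 100, 1500)
)

# Illustrative helper: list variable pairs whose absolute correlation
# exceeds a cutoff, strongest first. Only the upper triangle is scanned
# so that each pair is reported once and the diagonal is skipped.
high_cor_pairs <- function(df, cutoff = 0.8, method = "pearson") {
  cm  <- cor(df, method = method)
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), , drop = FALSE]
}

pairs_found <- high_cor_pairs(toy, cutoff = 0.8)
print(pairs_found)
```

On the real data, the same helper could be called on the numeric columns of df.attrition; which member of each flagged pair to drop remains a judgment call.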

Employee Attrition Analysis

In this section, we will walk through the steps for analyzing the factors that contribute to employee turnover and show how they can provide valuable insights for decision-making in companies. I will use logistic regression to build the model.

The glm function comes from the stats library, which R attaches by default, so calling library(stats) is optional; we call it here to make the dependency explicit.

Before creating the first model, we also need to recode the Attrition variable as 1 and 0: 1 for “Yes” and 0 for “No”.

library(stats)
df.attrition$Attrition <- ifelse(df.attrition$Attrition == "Yes", 1, 0)
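
Because ifelse() maps every label other than "Yes" to 0, a typo or a missing value in the column would be recoded silently. The sketch below shows one way to guard against that and to confirm the mapping; the vector attrition_raw is a hypothetical stand-in for df.attrition$Attrition, not data from the article.

```r
# Toy labels standing in for df.attrition$Attrition (hypothetical values).
attrition_raw <- c("Yes", "No", "No", "Yes", "No")

# Fail early if a label other than "Yes"/"No" (or a missing value) is present,
# because ifelse() would otherwise turn it into 0 or NA without warning.
stopifnot(all(attrition_raw %in% c("Yes", "No")))

# Same recoding as in the article: 1 for "Yes", 0 for "No".
attrition_num <- ifelse(attrition_raw == "Yes", 1, 0)

# Cross-tabulate old against new values to confirm the mapping.
check <- table(raw = attrition_raw, recoded = attrition_num)
print(check)
```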

Logistic Regression

# Create first model as logit1
logit1 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + Education + EnvironmentSatisfaction +
                HourlyRate + JobInvolvement + JobSatisfaction + MonthlyIncome +
                MonthlyRate + NumCompaniesWorked + PerformanceRating + RelationshipSatisfaction +
                PercentSalaryHike + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
                WorkLifeBalance + YearsInCurrentRole +
                YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
              data = df.attrition) 
# Call logit1 using the summary function to see the estimation results
summary(logit1)
## 
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome + 
##     Education + EnvironmentSatisfaction + HourlyRate + JobInvolvement + 
##     JobSatisfaction + MonthlyIncome + MonthlyRate + NumCompaniesWorked + 
##     PerformanceRating + RelationshipSatisfaction + PercentSalaryHike + 
##     StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager, family = "binomial", data = df.attrition)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4921  -0.6029  -0.3914  -0.2006   3.3009  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.324e+00  1.089e+00   3.970 7.19e-05 ***
## Age                      -3.060e-02  1.221e-02  -2.507 0.012192 *  
## DailyRate                -2.457e-04  1.944e-04  -1.264 0.206215    
## DistanceFromHome          3.252e-02  9.285e-03   3.503 0.000460 ***
## Education                 1.065e-02  7.828e-02   0.136 0.891781    
## EnvironmentSatisfaction  -2.963e-01  7.101e-02  -4.173 3.00e-05 ***
## HourlyRate               -4.602e-04  3.857e-03  -0.119 0.905020    
## JobInvolvement           -4.919e-01  1.075e-01  -4.575 4.76e-06 ***
## JobSatisfaction          -2.965e-01  7.063e-02  -4.198 2.69e-05 ***
## MonthlyIncome            -7.521e-05  3.171e-05  -2.372 0.017701 *  
## MonthlyRate               4.904e-06  1.102e-05   0.445 0.656248    
## NumCompaniesWorked        1.244e-01  3.260e-02   3.815 0.000136 ***
## PerformanceRating         3.001e-01  3.462e-01   0.867 0.386035    
## RelationshipSatisfaction -1.460e-01  7.222e-02  -2.021 0.043268 *  
## PercentSalaryHike        -4.301e-02  3.440e-02  -1.250 0.211168    
## StockOptionLevel         -5.204e-01  1.047e-01  -4.969 6.73e-07 ***
## TotalWorkingYears        -3.154e-02  2.328e-02  -1.355 0.175469    
## TrainingTimesLastYear    -1.565e-01  6.196e-02  -2.526 0.011523 *  
## WorkLifeBalance          -2.468e-01  1.064e-01  -2.320 0.020327 *  
## YearsInCurrentRole       -9.613e-02  3.838e-02  -2.505 0.012250 *  
## YearsSinceLastPromotion   1.741e-01  3.551e-02   4.902 9.48e-07 ***
## YearsWithCurrManager     -8.765e-02  3.811e-02  -2.300 0.021438 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1072.8  on 1448  degrees of freedom
## AIC: 1116.8
## 
## Number of Fisher Scoring iterations: 6
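
The estimates in the summary are on the log-odds scale, which is hard to read directly. A common way to interpret them is to exponentiate them into odds ratios. The sketch below does this for three estimates copied from the logit1 output above; the variable names est, odds_ratio, and pct_change are illustrative.

```r
# Estimates copied from the logit1 summary above (log-odds scale).
est <- c(
  DistanceFromHome        =  3.252e-02,
  StockOptionLevel        = -5.204e-01,
  YearsSinceLastPromotion =  1.741e-01
)

# exp() turns a log-odds coefficient into an odds ratio:
# values above 1 raise the odds of attrition, values below 1 lower them.
odds_ratio <- exp(est)

# Percentage change in the odds for a one-unit increase in each predictor,
# holding the other predictors fixed.
pct_change <- 100 * (odds_ratio - 1)

print(round(cbind(estimate = est, odds_ratio, pct_change), 3))

# With the fitted model in the session, the same idea would be
# exp(coef(logit1)); it is not run here because the data are not loaded.
```

Read this way, each additional unit of DistanceFromHome raises the odds of attrition by roughly 3%, while each additional StockOptionLevel lowers them by roughly 40%, other predictors held constant.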

Several independent variables are still not significant at the 0.05 level, so we refit the model with one of them removed. We start with Education (p = 0.892) and call the new model logit2.

# Create second model as logit2
logit2 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + EnvironmentSatisfaction +
                HourlyRate + JobInvolvement + JobSatisfaction + MonthlyIncome +
                MonthlyRate + NumCompaniesWorked + PerformanceRating + RelationshipSatisfaction +
                PercentSalaryHike + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
                WorkLifeBalance + YearsInCurrentRole +
                YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
              data = df.attrition) 
# Call logit2 using the summary function to see the estimation results
summary(logit2)
## 
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome + 
##     EnvironmentSatisfaction + HourlyRate + JobInvolvement + JobSatisfaction + 
##     MonthlyIncome + MonthlyRate + NumCompaniesWorked + PerformanceRating + 
##     RelationshipSatisfaction + PercentSalaryHike + StockOptionLevel + 
##     TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance + 
##     YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, 
##     family = "binomial", data = df.attrition)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4995  -0.6009  -0.3913  -0.2012   3.3069  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.349e+00  1.073e+00   4.054 5.03e-05 ***
## Age                      -3.032e-02  1.203e-02  -2.521 0.011718 *  
## DailyRate                -2.466e-04  1.943e-04  -1.269 0.204300    
## DistanceFromHome          3.253e-02  9.283e-03   3.504 0.000458 ***
## EnvironmentSatisfaction  -2.964e-01  7.101e-02  -4.175 2.99e-05 ***
## HourlyRate               -4.700e-04  3.856e-03  -0.122 0.902996    
## JobInvolvement           -4.915e-01  1.075e-01  -4.572 4.83e-06 ***
## JobSatisfaction          -2.965e-01  7.063e-02  -4.198 2.69e-05 ***
## MonthlyIncome            -7.513e-05  3.171e-05  -2.370 0.017804 *  
## MonthlyRate               4.836e-06  1.101e-05   0.439 0.660382    
## NumCompaniesWorked        1.247e-01  3.250e-02   3.837 0.000125 ***
## PerformanceRating         2.985e-01  3.460e-01   0.863 0.388303    
## RelationshipSatisfaction -1.461e-01  7.220e-02  -2.023 0.043024 *  
## PercentSalaryHike        -4.291e-02  3.439e-02  -1.248 0.212066    
## StockOptionLevel         -5.202e-01  1.047e-01  -4.968 6.77e-07 ***
## TotalWorkingYears        -3.165e-02  2.327e-02  -1.360 0.173736    
## TrainingTimesLastYear    -1.565e-01  6.196e-02  -2.526 0.011522 *  
## WorkLifeBalance          -2.472e-01  1.063e-01  -2.324 0.020122 *  
## YearsInCurrentRole       -9.597e-02  3.836e-02  -2.502 0.012345 *  
## YearsSinceLastPromotion   1.742e-01  3.549e-02   4.909 9.15e-07 ***
## YearsWithCurrManager     -8.757e-02  3.809e-02  -2.299 0.021496 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1072.8  on 1449  degrees of freedom
## AIC: 1114.8
## 
## Number of Fisher Scoring iterations: 6
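
The AIC values reported in the two summaries so far can be reproduced by hand, which helps show why removing Education is harmless. For a 0/1 response, AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below uses only figures copied from the outputs above; the coefficient counts (21 predictors plus the intercept for logit1, 20 plus the intercept for logit2) are read off the coefficient tables.

```r
# Figures copied from the logit1 and logit2 summaries above.
resid_dev <- c(logit1 = 1072.8, logit2 = 1072.8)
n_coef    <- c(logit1 = 22,     logit2 = 21)   # predictors + intercept

# For a 0/1 response: AIC = residual deviance + 2 * number of coefficients.
aic <- resid_dev + 2 * n_coef
print(aic)

# Dropping Education leaves the deviance unchanged to one decimal place
# while saving one parameter, so AIC falls by about 2.
aic_drop <- aic[["logit1"]] - aic[["logit2"]]
print(aic_drop)
```

With both fitted models in the session, anova(logit2, logit1, test = "Chisq") would give a likelihood-ratio test of the same comparison.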

Several independent variables are still not significant, so we refit again with one more variable removed, this time HourlyRate (p = 0.903), and call the new model logit3. We repeat this process, removing one non-significant variable at a time, until all remaining variables are significant.

# Create third model as logit3
logit3 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + EnvironmentSatisfaction +
                JobInvolvement + JobSatisfaction + MonthlyIncome +
                MonthlyRate + NumCompaniesWorked + PerformanceRating + RelationshipSatisfaction +
                PercentSalaryHike + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
                WorkLifeBalance + YearsInCurrentRole +
                YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
              data = df.attrition) 
# Call logit3 using the summary function to see the estimation results
summary(logit3)
## 
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome + 
##     EnvironmentSatisfaction + JobInvolvement + JobSatisfaction + 
##     MonthlyIncome + MonthlyRate + NumCompaniesWorked + PerformanceRating + 
##     RelationshipSatisfaction + PercentSalaryHike + StockOptionLevel + 
##     TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance + 
##     YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, 
##     family = "binomial", data = df.attrition)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5058  -0.6019  -0.3914  -0.2008   3.3052  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.321e+00  1.047e+00   4.126 3.70e-05 ***
## Age                      -3.040e-02  1.202e-02  -2.530 0.011406 *  
## DailyRate                -2.474e-04  1.942e-04  -1.274 0.202594    
## DistanceFromHome          3.251e-02  9.281e-03   3.503 0.000460 ***
## EnvironmentSatisfaction  -2.963e-01  7.101e-02  -4.173 3.01e-05 ***
## JobInvolvement           -4.921e-01  1.074e-01  -4.584 4.57e-06 ***
## JobSatisfaction          -2.959e-01  7.046e-02  -4.200 2.67e-05 ***
## MonthlyIncome            -7.515e-05  3.171e-05  -2.370 0.017780 *  
## MonthlyRate               4.853e-06  1.101e-05   0.441 0.659226    
## NumCompaniesWorked        1.248e-01  3.250e-02   3.839 0.000124 ***
## PerformanceRating         2.981e-01  3.459e-01   0.862 0.388756    
## RelationshipSatisfaction -1.460e-01  7.220e-02  -2.023 0.043112 *  
## PercentSalaryHike        -4.285e-02  3.438e-02  -1.246 0.212638    
## StockOptionLevel         -5.206e-01  1.047e-01  -4.974 6.54e-07 ***
## TotalWorkingYears        -3.162e-02  2.326e-02  -1.359 0.174065    
## TrainingTimesLastYear    -1.566e-01  6.195e-02  -2.527 0.011501 *  
## WorkLifeBalance          -2.472e-01  1.064e-01  -2.324 0.020118 *  
## YearsInCurrentRole       -9.599e-02  3.836e-02  -2.502 0.012333 *  
## YearsSinceLastPromotion   1.744e-01  3.547e-02   4.918 8.75e-07 ***
## YearsWithCurrManager     -8.761e-02  3.810e-02  -2.300 0.021463 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1072.8  on 1450  degrees of freedom
## AIC: 1112.8
## 
## Number of Fisher Scoring iterations: 6
# Create fourth model as logit4 (drops MonthlyRate, p = 0.659 in logit3)
logit4 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + EnvironmentSatisfaction +
                JobInvolvement + JobSatisfaction + MonthlyIncome +
                NumCompaniesWorked + PerformanceRating + RelationshipSatisfaction +
                PercentSalaryHike + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
                WorkLifeBalance + YearsInCurrentRole +
                YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
              data = df.attrition) 
# Call logit4 using the summary function to see the estimation results
summary(logit4)
## 
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome + 
##     EnvironmentSatisfaction + JobInvolvement + JobSatisfaction + 
##     MonthlyIncome + NumCompaniesWorked + PerformanceRating + 
##     RelationshipSatisfaction + PercentSalaryHike + StockOptionLevel + 
##     TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance + 
##     YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, 
##     family = "binomial", data = df.attrition)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5220  -0.6020  -0.3918  -0.1983   3.3054  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.403e+00  1.032e+00   4.267 1.98e-05 ***
## Age                      -3.040e-02  1.202e-02  -2.529 0.011442 *  
## DailyRate                -2.507e-04  1.940e-04  -1.292 0.196375    
## DistanceFromHome          3.268e-02  9.270e-03   3.526 0.000422 ***
## EnvironmentSatisfaction  -2.946e-01  7.089e-02  -4.155 3.25e-05 ***
## JobInvolvement           -4.921e-01  1.074e-01  -4.583 4.59e-06 ***
## JobSatisfaction          -2.951e-01  7.040e-02  -4.191 2.77e-05 ***
## MonthlyIncome            -7.456e-05  3.167e-05  -2.355 0.018547 *  
## NumCompaniesWorked        1.247e-01  3.249e-02   3.838 0.000124 ***
## PerformanceRating         2.899e-01  3.455e-01   0.839 0.401385    
## RelationshipSatisfaction -1.465e-01  7.221e-02  -2.029 0.042460 *  
## PercentSalaryHike        -4.233e-02  3.436e-02  -1.232 0.217987    
## StockOptionLevel         -5.237e-01  1.045e-01  -5.012 5.38e-07 ***
## TotalWorkingYears        -3.164e-02  2.327e-02  -1.359 0.174050    
## TrainingTimesLastYear    -1.562e-01  6.200e-02  -2.520 0.011730 *  
## WorkLifeBalance          -2.467e-01  1.063e-01  -2.321 0.020283 *  
## YearsInCurrentRole       -9.590e-02  3.835e-02  -2.501 0.012390 *  
## YearsSinceLastPromotion   1.750e-01  3.545e-02   4.936 7.98e-07 ***
## YearsWithCurrManager     -8.842e-02  3.804e-02  -2.324 0.020115 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1073.0  on 1451  degrees of freedom
## AIC: 1111
## 
## Number of Fisher Scoring iterations: 6
# Create fifth model as logit5 (drops PerformanceRating, p = 0.401 in logit4)
logit5 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + EnvironmentSatisfaction +
                JobInvolvement + JobSatisfaction + MonthlyIncome +
                NumCompaniesWorked + RelationshipSatisfaction +
                PercentSalaryHike + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
                WorkLifeBalance + YearsInCurrentRole +
                YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
              data = df.attrition) 
# Call logit5 using the summary function to see the estimation results
summary(logit5)
## 
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome + 
##     EnvironmentSatisfaction + JobInvolvement + JobSatisfaction + 
##     MonthlyIncome + NumCompaniesWorked + RelationshipSatisfaction + 
##     PercentSalaryHike + StockOptionLevel + TotalWorkingYears + 
##     TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole + 
##     YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial", 
##     data = df.attrition)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5425  -0.6056  -0.3906  -0.1988   3.2800  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.992e+00  7.639e-01   6.536 6.33e-11 ***
## Age                      -3.059e-02  1.201e-02  -2.548 0.010847 *  
## DailyRate                -2.543e-04  1.940e-04  -1.311 0.189929    
## DistanceFromHome          3.249e-02  9.273e-03   3.504 0.000459 ***
## EnvironmentSatisfaction  -2.966e-01  7.086e-02  -4.185 2.85e-05 ***
## JobInvolvement           -4.936e-01  1.073e-01  -4.602 4.19e-06 ***
## JobSatisfaction          -2.954e-01  7.039e-02  -4.197 2.70e-05 ***
## MonthlyIncome            -7.559e-05  3.168e-05  -2.386 0.017036 *  
## NumCompaniesWorked        1.238e-01  3.244e-02   3.816 0.000136 ***
## RelationshipSatisfaction -1.469e-01  7.212e-02  -2.036 0.041703 *  
## PercentSalaryHike        -2.013e-02  2.177e-02  -0.925 0.355088    
## StockOptionLevel         -5.260e-01  1.045e-01  -5.036 4.76e-07 ***
## TotalWorkingYears        -3.072e-02  2.324e-02  -1.322 0.186124    
## TrainingTimesLastYear    -1.570e-01  6.199e-02  -2.532 0.011337 *  
## WorkLifeBalance          -2.432e-01  1.062e-01  -2.289 0.022055 *  
## YearsInCurrentRole       -9.532e-02  3.835e-02  -2.485 0.012940 *  
## YearsSinceLastPromotion   1.757e-01  3.539e-02   4.963 6.95e-07 ***
## YearsWithCurrManager     -8.814e-02  3.801e-02  -2.319 0.020396 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1073.7  on 1452  degrees of freedom
## AIC: 1109.7
## 
## Number of Fisher Scoring iterations: 5
# Create sixth model as logit6 (drops PercentSalaryHike, p = 0.355 in logit5)
logit6 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + EnvironmentSatisfaction +
                JobInvolvement + JobSatisfaction + MonthlyIncome +
                NumCompaniesWorked + RelationshipSatisfaction +
                StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
                WorkLifeBalance + YearsInCurrentRole +
                YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
              data = df.attrition) 
# Call logit6 using the summary function to see the estimation results
summary(logit6)
## 
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome + 
##     EnvironmentSatisfaction + JobInvolvement + JobSatisfaction + 
##     MonthlyIncome + NumCompaniesWorked + RelationshipSatisfaction + 
##     StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager, family = "binomial", data = df.attrition)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5380  -0.6020  -0.3903  -0.1987   3.2739  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.681e+00  6.835e-01   6.848 7.49e-12 ***
## Age                      -3.077e-02  1.199e-02  -2.566 0.010287 *  
## DailyRate                -2.622e-04  1.937e-04  -1.353 0.175905    
## DistanceFromHome          3.212e-02  9.261e-03   3.468 0.000523 ***
## EnvironmentSatisfaction  -2.961e-01  7.083e-02  -4.180 2.92e-05 ***
## JobInvolvement           -4.908e-01  1.071e-01  -4.585 4.55e-06 ***
## JobSatisfaction          -2.951e-01  7.033e-02  -4.196 2.71e-05 ***
## MonthlyIncome            -7.452e-05  3.165e-05  -2.355 0.018540 *  
## NumCompaniesWorked        1.240e-01  3.242e-02   3.824 0.000131 ***
## RelationshipSatisfaction -1.440e-01  7.197e-02  -2.000 0.045468 *  
## StockOptionLevel         -5.265e-01  1.045e-01  -5.037 4.74e-07 ***
## TotalWorkingYears        -3.080e-02  2.325e-02  -1.325 0.185241    
## TrainingTimesLastYear    -1.573e-01  6.203e-02  -2.536 0.011202 *  
## WorkLifeBalance          -2.436e-01  1.063e-01  -2.292 0.021900 *  
## YearsInCurrentRole       -9.516e-02  3.835e-02  -2.482 0.013078 *  
## YearsSinceLastPromotion   1.758e-01  3.534e-02   4.976 6.49e-07 ***
## YearsWithCurrManager     -8.769e-02  3.800e-02  -2.307 0.021029 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1074.6  on 1453  degrees of freedom
## AIC: 1108.6
## 
## Number of Fisher Scoring iterations: 5
# Create seventh model as logit7 (drops TotalWorkingYears, p = 0.185 in logit6)
logit7 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + EnvironmentSatisfaction +
                JobInvolvement + JobSatisfaction + MonthlyIncome +
                NumCompaniesWorked + RelationshipSatisfaction +
                StockOptionLevel + TrainingTimesLastYear +
                WorkLifeBalance + YearsInCurrentRole +
                YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
              data = df.attrition) 
# Call logit7 using the summary function to see the estimation results
summary(logit7)
## 
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome + 
##     EnvironmentSatisfaction + JobInvolvement + JobSatisfaction + 
##     MonthlyIncome + NumCompaniesWorked + RelationshipSatisfaction + 
##     StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance + 
##     YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, 
##     family = "binomial", data = df.attrition)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5518  -0.6004  -0.3873  -0.2018   3.1323  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.828e+00  6.764e-01   7.138 9.45e-13 ***
## Age                      -3.875e-02  1.058e-02  -3.664 0.000248 ***
## DailyRate                -2.666e-04  1.938e-04  -1.375 0.169013    
## DistanceFromHome          3.173e-02  9.237e-03   3.436 0.000591 ***
## EnvironmentSatisfaction  -2.913e-01  7.061e-02  -4.126 3.70e-05 ***
## JobInvolvement           -4.837e-01  1.068e-01  -4.530 5.90e-06 ***
## JobSatisfaction          -2.927e-01  7.022e-02  -4.168 3.07e-05 ***
## MonthlyIncome            -9.772e-05  2.643e-05  -3.697 0.000218 ***
## NumCompaniesWorked        1.160e-01  3.196e-02   3.628 0.000285 ***
## RelationshipSatisfaction -1.391e-01  7.193e-02  -1.934 0.053072 .  
## StockOptionLevel         -5.212e-01  1.044e-01  -4.992 5.99e-07 ***
## TrainingTimesLastYear    -1.559e-01  6.209e-02  -2.511 0.012033 *  
## WorkLifeBalance          -2.411e-01  1.063e-01  -2.268 0.023302 *  
## YearsInCurrentRole       -1.016e-01  3.773e-02  -2.692 0.007099 ** 
## YearsSinceLastPromotion   1.712e-01  3.497e-02   4.897 9.74e-07 ***
## YearsWithCurrManager     -9.801e-02  3.688e-02  -2.657 0.007878 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1076.4  on 1454  degrees of freedom
## AIC: 1108.4
## 
## Number of Fisher Scoring iterations: 5
# Create eighth model as logit8 (drops DailyRate, p = 0.169 in logit7)
logit8 <- glm(Attrition ~ Age + DistanceFromHome + EnvironmentSatisfaction +
                JobInvolvement + JobSatisfaction + MonthlyIncome +
                NumCompaniesWorked + RelationshipSatisfaction +
                StockOptionLevel + TrainingTimesLastYear +
                WorkLifeBalance + YearsInCurrentRole +
                YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
              data = df.attrition) 
# Call logit8 using the summary function to see the estimation results
summary(logit8)
## 
## Call:
## glm(formula = Attrition ~ Age + DistanceFromHome + EnvironmentSatisfaction + 
##     JobInvolvement + JobSatisfaction + MonthlyIncome + NumCompaniesWorked + 
##     RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager, family = "binomial", data = df.attrition)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4986  -0.6117  -0.3896  -0.2054   3.1520  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.649e+00  6.620e-01   7.023 2.17e-12 ***
## Age                      -3.906e-02  1.056e-02  -3.701 0.000215 ***
## DistanceFromHome          3.184e-02  9.225e-03   3.451 0.000558 ***
## EnvironmentSatisfaction  -2.922e-01  7.053e-02  -4.142 3.44e-05 ***
## JobInvolvement           -4.900e-01  1.066e-01  -4.596 4.31e-06 ***
## JobSatisfaction          -2.963e-01  7.016e-02  -4.223 2.41e-05 ***
## MonthlyIncome            -9.859e-05  2.646e-05  -3.726 0.000195 ***
## NumCompaniesWorked        1.144e-01  3.187e-02   3.590 0.000330 ***
## RelationshipSatisfaction -1.388e-01  7.185e-02  -1.932 0.053367 .  
## StockOptionLevel         -5.260e-01  1.042e-01  -5.051 4.40e-07 ***
## TrainingTimesLastYear    -1.577e-01  6.200e-02  -2.544 0.010953 *  
## WorkLifeBalance          -2.314e-01  1.061e-01  -2.181 0.029191 *  
## YearsInCurrentRole       -1.034e-01  3.765e-02  -2.746 0.006036 ** 
## YearsSinceLastPromotion   1.729e-01  3.501e-02   4.937 7.92e-07 ***
## YearsWithCurrManager     -9.769e-02  3.698e-02  -2.642 0.008240 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1078.3  on 1455  degrees of freedom
## AIC: 1108.3
## 
## Number of Fisher Scoring iterations: 5

RelationshipSatisfaction is still not significant at the 0.05 level (p ≈ 0.053), so let’s drop it and fit another model, logit9.

# Create ninth model as logit9
logit9 <- glm(Attrition ~ Age + DistanceFromHome + EnvironmentSatisfaction +
                JobInvolvement + JobSatisfaction + MonthlyIncome +
                NumCompaniesWorked +
                StockOptionLevel + TrainingTimesLastYear +
                WorkLifeBalance + YearsInCurrentRole +
                YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
              data = df.attrition) 
# Call logit9 using the summary function to see the estimation results
summary(logit9)
## 
## Call:
## glm(formula = Attrition ~ Age + DistanceFromHome + EnvironmentSatisfaction + 
##     JobInvolvement + JobSatisfaction + MonthlyIncome + NumCompaniesWorked + 
##     StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance + 
##     YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, 
##     family = "binomial", data = df.attrition)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4148  -0.6044  -0.3957  -0.2069   3.1739  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              4.334e+00  6.381e-01   6.792 1.10e-11 ***
## Age                     -4.021e-02  1.052e-02  -3.821 0.000133 ***
## DistanceFromHome         3.194e-02  9.212e-03   3.467 0.000526 ***
## EnvironmentSatisfaction -2.946e-01  7.035e-02  -4.188 2.81e-05 ***
## JobInvolvement          -4.989e-01  1.062e-01  -4.697 2.65e-06 ***
## JobSatisfaction         -2.947e-01  7.002e-02  -4.208 2.57e-05 ***
## MonthlyIncome           -9.766e-05  2.640e-05  -3.699 0.000217 ***
## NumCompaniesWorked       1.128e-01  3.180e-02   3.548 0.000388 ***
## StockOptionLevel        -5.136e-01  1.036e-01  -4.960 7.06e-07 ***
## TrainingTimesLastYear   -1.576e-01  6.189e-02  -2.547 0.010872 *  
## WorkLifeBalance         -2.342e-01  1.059e-01  -2.211 0.027049 *  
## YearsInCurrentRole      -9.996e-02  3.738e-02  -2.674 0.007488 ** 
## YearsSinceLastPromotion  1.679e-01  3.476e-02   4.831 1.36e-06 ***
## YearsWithCurrManager    -9.610e-02  3.675e-02  -2.615 0.008925 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1082.0  on 1456  degrees of freedom
## AIC: 1110
## 
## Number of Fisher Scoring iterations: 5

Finally, all independent variables are significant at the 0.05 level. Next, we choose the best logit model by comparing the AIC value of each fitted model.
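The manual elimination above (logit1 through logit9) drops the least significant variable one step at a time. As a rough sketch only, the same idea can be written as a loop. The data frame `toy`, its columns, and the helper `wald_p_values()` are invented for illustration and are not part of the original analysis.

```r
# Simulated stand-in data: x1 and x2 drive the outcome, the noise columns do not
set.seed(42)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-1 + 1.2 * toy$x1 - 0.8 * toy$x2))

# Hypothetical helper: named Wald p-values of all predictors (intercept excluded)
wald_p_values <- function(model) {
  coefs <- summary(model)$coefficients
  setNames(coefs[-1, 4], rownames(coefs)[-1])
}

# Start from the full model, then repeatedly drop the least significant predictor
fit <- glm(y ~ x1 + x2 + noise1 + noise2, family = "binomial", data = toy)
repeat {
  p_values <- wald_p_values(fit)
  if (length(p_values) == 0 || max(p_values) < 0.05) break
  worst <- names(which.max(p_values))
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}

# Predictors that survived the elimination
names(coef(fit))
```

This only works as written when every predictor is numeric or a two-level factor, because it removes terms by coefficient name.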

AIC Criteria

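AIC (Akaike Information Criterion) is a metric used to compare and select among candidate statistical models fitted to the same data. It rewards goodness of fit but penalizes each additional parameter, so a lower AIC indicates a better trade-off between fit and complexity. Below, we compare the AIC values of all nine models.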

# Create a vector that contains previously created models as logit_model
logit_model <- c("logit1", "logit2", "logit3", "logit4", "logit5", "logit6", "logit7",
                 "logit8", "logit9")

# Create a vector containing the AIC values of each model
AIC <- c(logit1$aic, logit2$aic, logit3$aic, logit4$aic, logit5$aic, logit6$aic, logit7$aic,
         logit8$aic, logit9$aic)

# Create a data frame containing the logit_model vector and AIC as criteria
criteria <- data.frame(logit_model, AIC)

# Calling the data frame `criteria`
criteria
##   logit_model      AIC
## 1      logit1 1116.786
## 2      logit2 1114.805
## 3      logit3 1112.819
## 4      logit4 1111.014
## 5      logit5 1109.719
## 6      logit6 1108.584
## 7      logit7 1108.376
## 8      logit8 1108.273
## 9      logit9 1109.996

From these results, logit8 has the lowest AIC value, about 1108.273, so we treat it as the most suitable of the models compared. Note that its margin over logit7 (1108.376) is small, and that logit8 still contains RelationshipSatisfaction, which is significant only at the 0.1 level. Next, we check the goodness of fit of logit8.
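For a 0/1 response such as Attrition, the residual deviance equals -2 times the log-likelihood, so AIC is simply the residual deviance plus twice the number of estimated coefficients. As a quick check, the sketch below recomputes the AIC of logit8 and logit9 from the figures printed in their summaries. The variable names are illustrative, not part of the original analysis.

```r
# Residual deviance and number of estimated coefficients (intercept included),
# read off the summary() output of each model
deviance_logit8 <- 1078.3
k_logit8 <- 15
deviance_logit9 <- 1082.0
k_logit9 <- 14

# AIC = residual deviance + 2 * k for a binary (0/1) response
aic_logit8 <- deviance_logit8 + 2 * k_logit8
aic_logit9 <- deviance_logit9 + 2 * k_logit9

c(logit8 = aic_logit8, logit9 = aic_logit9)
```

Dropping RelationshipSatisfaction saves a penalty of 2 but raises the deviance by about 3.7, which is why logit9 ends up with the slightly higher AIC.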

Goodness of Fit

Goodness of fit in logistic regression refers to how well the fitted model matches the observed data, that is, how much of the variation in the actual outcomes the model explains. Several tests are available for this; in this article, I will use a deviance-based likelihood ratio test against the intercept-only (null) model.

# Create a null model with only the intercept
null_model <- glm(Attrition ~ 1, data = df.attrition, family = "binomial")

# Perform a likelihood ratio test comparing the null model with logit8
lr_test <- anova(null_model, logit8, test = "Chisq")

# Print the results of the likelihood ratio test
lr_test
## Analysis of Deviance Table
## 
## Model 1: Attrition ~ 1
## Model 2: Attrition ~ Age + DistanceFromHome + EnvironmentSatisfaction + 
##     JobInvolvement + JobSatisfaction + MonthlyIncome + NumCompaniesWorked + 
##     RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      1469     1298.6                          
## 2      1455     1078.3 14   220.31 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The “Analysis of Deviance Table” shows that Model 2 (logit8, with 14 predictors) reduces the deviance by 220.31 on 14 degrees of freedom relative to Model 1 (the intercept-only model), with p < 2.2e-16. The predictors therefore jointly explain significantly more of the variation in Attrition than the null model does, so we use Model 2 for Attrition analysis and prediction.
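The test statistic in that table can be reproduced by hand: it is the drop in deviance between the two models, compared against a chi-squared distribution whose degrees of freedom equal the number of added predictors. A minimal sketch using the rounded figures printed above (variable names are illustrative):

```r
# Deviances and residual degrees of freedom from the Analysis of Deviance Table
null_deviance <- 1298.6
null_df <- 1469
resid_deviance <- 1078.3
resid_df <- 1455

# Likelihood ratio statistic and its degrees of freedom
lr_statistic <- null_deviance - resid_deviance
lr_df <- null_df - resid_df

# Upper-tail chi-squared probability
p_value <- pchisq(lr_statistic, df = lr_df, lower.tail = FALSE)

c(statistic = lr_statistic, df = lr_df, p_value = p_value)
```

The statistic comes out as 220.3 rather than 220.31 only because the printed deviances are rounded to one decimal place.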

Next, print the summary of logit8 again and interpret it.

summary(logit8)
## 
## Call:
## glm(formula = Attrition ~ Age + DistanceFromHome + EnvironmentSatisfaction + 
##     JobInvolvement + JobSatisfaction + MonthlyIncome + NumCompaniesWorked + 
##     RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager, family = "binomial", data = df.attrition)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4986  -0.6117  -0.3896  -0.2054   3.1520  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               4.649e+00  6.620e-01   7.023 2.17e-12 ***
## Age                      -3.906e-02  1.056e-02  -3.701 0.000215 ***
## DistanceFromHome          3.184e-02  9.225e-03   3.451 0.000558 ***
## EnvironmentSatisfaction  -2.922e-01  7.053e-02  -4.142 3.44e-05 ***
## JobInvolvement           -4.900e-01  1.066e-01  -4.596 4.31e-06 ***
## JobSatisfaction          -2.963e-01  7.016e-02  -4.223 2.41e-05 ***
## MonthlyIncome            -9.859e-05  2.646e-05  -3.726 0.000195 ***
## NumCompaniesWorked        1.144e-01  3.187e-02   3.590 0.000330 ***
## RelationshipSatisfaction -1.388e-01  7.185e-02  -1.932 0.053367 .  
## StockOptionLevel         -5.260e-01  1.042e-01  -5.051 4.40e-07 ***
## TrainingTimesLastYear    -1.577e-01  6.200e-02  -2.544 0.010953 *  
## WorkLifeBalance          -2.314e-01  1.061e-01  -2.181 0.029191 *  
## YearsInCurrentRole       -1.034e-01  3.765e-02  -2.746 0.006036 ** 
## YearsSinceLastPromotion   1.729e-01  3.501e-02   4.937 7.92e-07 ***
## YearsWithCurrManager     -9.769e-02  3.698e-02  -2.642 0.008240 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1078.3  on 1455  degrees of freedom
## AIC: 1108.3
## 
## Number of Fisher Scoring iterations: 5

The stars next to each coefficient indicate how statistically significant the independent variable is for the dependent variable: the more stars, the smaller the p-value. A negative estimate means the variable is associated with lower odds of attrition, and a positive estimate with higher odds. For an interpretation of the influence of each variable, click here.
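Because the estimates are on the log-odds scale, exponentiating them gives odds ratios, which are usually easier to read. The sketch below does this for four coefficients copied from the summary above; with the fitted model at hand, `exp(coef(logit8))` would give the full set. The vector `coefs` is typed in by hand for illustration.

```r
# Selected logit8 estimates (log-odds scale), copied from the summary output
coefs <- c(JobInvolvement          = -0.4900,
           StockOptionLevel        = -0.5260,
           YearsSinceLastPromotion =  0.1729,
           DistanceFromHome        =  0.03184)

# Odds ratio: multiplicative change in the odds of attrition per one-unit increase
odds_ratios <- exp(coefs)

# The same effect expressed as a percentage change in the odds
percent_change <- (odds_ratios - 1) * 100

round(cbind(odds_ratio = odds_ratios, percent_change = percent_change), 2)
```

For example, each additional level of JobInvolvement multiplies the odds of attrition by roughly 0.61, holding the other variables constant, while each extra year since the last promotion multiplies them by roughly 1.19.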

Next, we evaluate the model’s classification results using a confusion matrix.

Confusion Matrix

A confusion matrix is a table that cross-tabulates the actual classes against the classes a model predicts. It is a useful tool for evaluating how well predictive models, such as logistic regression or machine learning models, distinguish between classes. Here is how to create one for logit8.

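# Rows: actual attrition (TRUE = employee left); columns: class predicted by
# logit8, obtained by rounding the fitted probability at a 0.5 cutoff
table(TRUE == df.attrition$Attrition, pred = round(fitted(logit8)))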
##        pred
##            0    1
##   FALSE 1215   18
##   TRUE   190   47

Interpretation: click here.
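The four cells of the table can be condensed into the usual summary metrics. A small sketch with the counts copied from the output above (variable names are illustrative):

```r
# Cell counts from the confusion matrix (rows = actual, columns = predicted)
tn <- 1215  # stayed, predicted to stay
fp <- 18    # stayed, predicted to leave
fn <- 190   # left, predicted to stay
tp <- 47    # left, predicted to leave

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
sensitivity <- tp / (tp + fn)                   # share of actual leavers the model catches
specificity <- tn / (tn + fp)                   # share of actual stayers correctly identified
precision   <- tp / (tp + fp)                   # share of predicted leavers who really left

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 3)
```

At the 0.5 cutoff the model is right about 86% of the time overall, but it catches only about 20% of the employees who actually left, so accuracy alone overstates how useful it is for flagging attrition.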

With the modeling done, the next step is to visualize several of the variables we have reviewed, so that the patterns and trends in the data are easier to understand and to communicate to stakeholders and team members.

Data Visualization

Visualization has strong revealing power. Instead of just looking at a table of numbers, visualization allows us to see patterns, trends, and relationships among the data. This helps data users see the big picture quickly, gain immediate insights, and explore relationships that may not be visible in numbers. In the context of data analysis, visualization helps turn numbers into easy-to-understand images, thereby facilitating decision-making. In this stage, I will explain the importance of visualization in presenting analytical findings and how to create informative graphs.

Before starting, let’s add an AttritionStatus variable with "Yes"/"No" labels, because we previously recoded the Attrition variable to 1/0.

df.attrition <- df.attrition %>%
  mutate(AttritionStatus = ifelse(Attrition == 1, "Yes", "No"))

Now we are ready to visualize the data.

What percentage of employees left the company, and what percentage stayed?

# Count attrition value
attrition.count <- df.attrition %>%
  count(AttritionStatus)

# Create attrition distribution pie chart
ggplot(attrition.count, aes(x = "", y = n, fill = AttritionStatus)) +
  geom_bar(stat = "identity", width = 1) +
  geom_text(aes(label = scales::percent(n / sum(n))), position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  labs(title = "Attrition Distribution (Percentage)", x = "", y = "") +
  theme(plot.title = element_text(hjust = 0.5))

Insight: click here.

How is attrition affected by gender?

ggplot(df.attrition, aes(x = Gender, fill = AttritionStatus)) +
  geom_bar(position = "stack") +
  labs(title = "Attrition Status by Gender") +
  # Stack the count labels as well, so each label sits inside its own bar segment
  geom_text(aes(label = after_stat(count)), stat = "count",
            position = position_stack(vjust = 0.5)) +
  theme(plot.title = element_text(hjust = 0.5))

Insight: click here.

What job role has the highest turnover rate?

# Calculates the number of "Yes" Attrition and percentages within each Job Role
jobrole.data <- df.attrition %>%
  group_by(JobRole) %>%
  summarise(AttritionYesCount = sum(AttritionStatus == "Yes"),
            TotalCount = n(),
            AttritionPercentage = (AttritionYesCount / TotalCount) * 100)

# Create a bar chart with Attrition percentage "Yes"
ggplot(jobrole.data %>%
         arrange(desc(AttritionPercentage)),
         aes(y = reorder(JobRole, AttritionPercentage), 
                           x = AttritionPercentage, fill = JobRole)) +
  geom_bar(stat = "identity") +
  labs(title = "Turnover Rate by Job Role", subtitle = "From Highest to Lowest",
       y = "Job Role", x = "Turnover Rate") +
  geom_text(aes(label = scales::percent(AttritionPercentage / 100, accuracy = 0.01), 
                x = AttritionPercentage), 
            hjust = -0.2) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_discrete() +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

Insight: click here.
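The same "turnover rate per group" calculation is repeated for several variables in this section. As a sketch of how it could be factored out, here is a base R helper applied to a tiny made-up data frame. The function `turnover_rate()` and the `toy` data are hypothetical and only mirror the column names used in this article.

```r
# Toy data standing in for df.attrition (values invented for illustration)
toy <- data.frame(
  JobRole = c("Sales", "Sales", "Sales", "Sales",
              "Lab", "Lab", "Lab", "Lab", "Lab", "HR"),
  AttritionStatus = c("Yes", "Yes", "No", "No",
                      "Yes", "No", "No", "No", "No", "No")
)

# Hypothetical helper: turnover rate in percent for each level of one column,
# sorted from highest to lowest
turnover_rate <- function(data, group_col) {
  left <- data$AttritionStatus == "Yes"
  rates <- tapply(left, data[[group_col]], mean) * 100
  sort(rates, decreasing = TRUE)
}

rates <- turnover_rate(toy, "JobRole")
rates
```

Taking the mean of a logical vector gives the share of TRUE values, which is why no explicit count and total are needed here.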

How does distance from home affect attrition?

ggplot(df.attrition, aes(x = DistanceFromHome, fill = AttritionStatus)) +
  geom_density(alpha = 0.5) +
  labs(title = "Distance From Home Distribution",
       x = "Distance From Home", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(limits = c(0, 30))

Insight: click here.

What marital status has the highest turnover rate?

# Calculates the number of "Yes" Attrition and percentages within each Marital Status
marital.data <- df.attrition %>%
  group_by(MaritalStatus) %>%
  summarise(AttritionYesCount = sum(AttritionStatus == "Yes"),
            TotalCount = n(),
            AttritionPercentage = (AttritionYesCount / TotalCount) * 100) %>%
  arrange(desc(AttritionPercentage))

# Create a bar chart with Attrition percentage "Yes"
ggplot(marital.data, aes(y = reorder(MaritalStatus, AttritionPercentage), 
                           x = AttritionPercentage, fill = MaritalStatus)) +
  geom_bar(stat = "identity") +
  labs(title = "Turnover Rate by Marital Status", subtitle = "From Highest to Lowest",
       y = "Marital Status", x = "Turnover Rate") +
  geom_text(aes(label = scales::percent(AttritionPercentage / 100, accuracy = 0.01), 
                x = AttritionPercentage), 
            hjust = -0.2) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_discrete() +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

Insight: click here.

How does business travel affect attrition?

ggplot(df.attrition, aes(x = BusinessTravel, fill = AttritionStatus)) +
  geom_bar() +
  # Stack the count labels as well, so each label sits inside its own bar segment
  geom_text(aes(label = after_stat(count)), stat = "count",
            position = position_stack(vjust = 0.5)) +
  labs(title = "Attrition Status by Business Travel") +
  theme(plot.title = element_text(hjust = 0.5))

Insight: click here.

What education level has the highest turnover rate?

# Calculates the number of "Yes" Attrition and percentages within each Education Level
education.data <- df.attrition %>%
  group_by(EducationLevel) %>%
  summarise(AttritionYesCount = sum(AttritionStatus == "Yes"),
            TotalCount = n(),
            AttritionPercentage = (AttritionYesCount / TotalCount) * 100)

# Create a bar chart with Attrition percentage "Yes"
ggplot(education.data, aes(y = reorder(EducationLevel, AttritionPercentage), 
                           x = AttritionPercentage, fill = EducationLevel)) +
  geom_bar(stat = "identity") +
  labs(title = "Turnover Rate by Education Level", subtitle = "From Highest to Lowest",
       y = "Education Level", x = "Turnover Rate") +
  geom_text(aes(label = scales::percent(AttritionPercentage / 100, accuracy = 0.01), 
                x = AttritionPercentage), 
            hjust = -0.2) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_discrete() +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

Insight: click here.

Does overtime affect turnover rate?

# Calculates the number of "Yes" Attrition and percentages within each Overtime
overtime.data <- df.attrition %>%
  group_by(OverTime) %>%
  summarise(AttritionYesCount = sum(AttritionStatus == "Yes"),
            TotalCount = n(),
            AttritionPercentage = (AttritionYesCount / TotalCount) * 100)

# Create a bar chart with Attrition percentage "Yes"
ggplot(overtime.data, aes(y = reorder(OverTime, -AttritionPercentage), 
                           x = AttritionPercentage, fill = OverTime)) +
  geom_bar(stat = "identity") +
  labs(title = "Turnover Rate by Over Time",
       y = "Over Time", x = "Turnover Rate") +
  geom_text(aes(label = scales::percent(AttritionPercentage / 100, accuracy = 0.01), 
                x = AttritionPercentage), 
            vjust = -0.5) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_discrete() +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()

Insight: click here.

Does having worked at many companies affect attrition?

# Calculates the number of "Yes" Attrition and percentages within each number of companies worked
companiesworked.data <- df.attrition %>%
  group_by(NumCompaniesWorked) %>%
  summarise(AttritionYesCount = sum(AttritionStatus == "Yes"),
            TotalCount = n(),
            AttritionPercentage = (AttritionYesCount / TotalCount) * 100)

# Create a bar chart with Attrition percentage "Yes"
ggplot(companiesworked.data, aes(y = reorder(NumCompaniesWorked, AttritionPercentage), 
                           x = AttritionPercentage, fill = as.character(NumCompaniesWorked))) +
  geom_bar(stat = "identity") +
  labs(title = "Turnover Rate by Number of Companies Worked", fill = "Number of Companies Worked",
       y = "Number of Companies Worked", x = "Turnover Rate") +
  geom_text(aes(label = scales::percent(AttritionPercentage / 100, accuracy = 0.01), 
                x = AttritionPercentage), 
            hjust = -0.2) +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_discrete() +
  theme(plot.title = element_text(hjust = 0.5))

Insight: click here.

Conclusion

click here.