by Syarif Yusuf Effendi
In an era where data is becoming one of an organization's most valuable assets, data analysis plays an important role in supporting informed decision-making. In HR in particular, professionals now have access to tools and techniques that let them extract valuable insights from employee data. One important aspect of HR analysis is understanding the factors that influence employee turnover, also known as employee attrition. In this article, we outline the steps for analyzing employee turnover with the R programming language.
In this article, we will use an open dataset from Kaggle: the file WA_Fn-UseC_-HR-Employee-Attrition.csv, which is imported in the next step.
The first step, before cleaning the data, is to import the
previously downloaded dataset file. Note that the code below loads the
readr library but performs the import with read.csv() from base R;
readr's own reader is read_csv().
library(readr)
attrition.data <- read.csv("D:/Dataset/Employee Attrition/WA_Fn-UseC_-HR-Employee-Attrition.csv")
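An absolute path such as the one above only resolves on the author's machine. The following is a minimal sketch of a more defensive import, assuming the file location is held in a variable; the temporary file and its three columns are invented stand-ins for illustration, not the Kaggle file itself.

```r
# Hypothetical stand-in for the downloaded CSV: a tiny file in a temp folder
csv_path <- file.path(tempdir(), "attrition_sample.csv")
writeLines(c("Age,Attrition,Department",
             "41,Yes,Sales",
             "49,No,Research & Development"),
           csv_path)

# Stop early with a clear error if the file is not where we expect it
stopifnot(file.exists(csv_path))

# read.csv() is base R; readr::read_csv() would return a tibble instead
sample.data <- read.csv(csv_path, stringsAsFactors = FALSE)
str(sample.data)
```

Building the path with file.path() keeps the separator handling portable across operating systems.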
Then, after the data is successfully imported, use the
dplyr library to start cleaning it so the data is ready for
use.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Before cleaning the data, let’s inspect the dataset to see its columns, structure, and data types.
head(attrition.data)
## Age Attrition BusinessTravel DailyRate Department
## 1 41 Yes Travel_Rarely 1102 Sales
## 2 49 No Travel_Frequently 279 Research & Development
## 3 37 Yes Travel_Rarely 1373 Research & Development
## 4 33 No Travel_Frequently 1392 Research & Development
## 5 27 No Travel_Rarely 591 Research & Development
## 6 32 No Travel_Frequently 1005 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 1 2 Life Sciences 1 1
## 2 8 1 Life Sciences 1 2
## 3 2 2 Other 1 4
## 4 3 4 Life Sciences 1 5
## 5 2 1 Medical 1 7
## 6 2 2 Life Sciences 1 8
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Female 94 3 2
## 2 3 Male 61 2 2
## 3 4 Male 92 2 1
## 4 4 Female 56 3 1
## 5 1 Male 40 3 1
## 6 4 Male 79 3 1
## JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1 Sales Executive 4 Single 5993 19479
## 2 Research Scientist 2 Married 5130 24907
## 3 Laboratory Technician 3 Single 2090 2396
## 4 Research Scientist 3 Married 2909 23159
## 5 Laboratory Technician 2 Married 3468 16632
## 6 Laboratory Technician 4 Single 3068 11864
## NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1 8 Y Yes 11 3
## 2 1 Y No 23 4
## 3 6 Y Yes 15 3
## 4 1 Y Yes 11 3
## 5 9 Y No 12 3
## 6 0 Y No 13 3
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1 1 80 0 8
## 2 4 80 1 10
## 3 2 80 0 7
## 4 3 80 0 8
## 5 4 80 1 6
## 6 3 80 0 8
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 0 1 6 4
## 2 3 3 10 7
## 3 3 3 0 0
## 4 3 3 8 7
## 5 3 3 2 2
## 6 2 2 7 7
## YearsSinceLastPromotion YearsWithCurrManager
## 1 0 5
## 2 1 7
## 3 0 0
## 4 3 0
## 5 2 2
## 6 3 6
The head() function in R displays the first few rows (six by
default) of a data object, such as a data frame, matrix, or vector.
Its main benefit is that it provides a quick initial view of the
data.
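As a small illustration of how head() can be adjusted, the sketch below uses its n argument on a toy data frame. The data frame toy and its values are invented for this example and are not part of the attrition dataset.

```r
# Toy data frame standing in for attrition.data (values invented)
toy <- data.frame(id = 1:10, score = seq(10, 100, by = 10))

head(toy, n = 3)    # first 3 rows instead of the default 6
head(toy, n = -8)   # everything except the last 8 rows, i.e. rows 1 and 2
tail(toy, n = 3)    # last 3 rows, the counterpart of head()
```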
str(attrition.data)
## 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : chr "Yes" "No" "Yes" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : chr "Female" "Male" "Male" "Female" ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
The str() function in R compactly displays the structure of a data
object: here, the number of observations and variables, plus the type
and first few values of each column.
Next, we compute descriptive statistics for the integer columns by excluding the character columns from summary(), as follows:
# Names of the character columns to exclude
exclude_cols <- c("Attrition", "BusinessTravel", "Department", "EducationField",
                  "Gender", "JobRole", "MaritalStatus", "Over18", "OverTime")
# Run summary() on every column except the excluded ones
summary(attrition.data[, !(names(attrition.data) %in% exclude_cols)])
## Age DailyRate DistanceFromHome Education
## Min. :18.00 Min. : 102.0 Min. : 1.000 Min. :1.000
## 1st Qu.:30.00 1st Qu.: 465.0 1st Qu.: 2.000 1st Qu.:2.000
## Median :36.00 Median : 802.0 Median : 7.000 Median :3.000
## Mean :36.92 Mean : 802.5 Mean : 9.193 Mean :2.913
## 3rd Qu.:43.00 3rd Qu.:1157.0 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :60.00 Max. :1499.0 Max. :29.000 Max. :5.000
## EmployeeCount EmployeeNumber EnvironmentSatisfaction HourlyRate
## Min. :1 Min. : 1.0 Min. :1.000 Min. : 30.00
## 1st Qu.:1 1st Qu.: 491.2 1st Qu.:2.000 1st Qu.: 48.00
## Median :1 Median :1020.5 Median :3.000 Median : 66.00
## Mean :1 Mean :1024.9 Mean :2.722 Mean : 65.89
## 3rd Qu.:1 3rd Qu.:1555.8 3rd Qu.:4.000 3rd Qu.: 83.75
## Max. :1 Max. :2068.0 Max. :4.000 Max. :100.00
## JobInvolvement JobLevel JobSatisfaction MonthlyIncome MonthlyRate
## Min. :1.00 Min. :1.000 Min. :1.000 Min. : 1009 Min. : 2094
## 1st Qu.:2.00 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 2911 1st Qu.: 8047
## Median :3.00 Median :2.000 Median :3.000 Median : 4919 Median :14236
## Mean :2.73 Mean :2.064 Mean :2.729 Mean : 6503 Mean :14313
## 3rd Qu.:3.00 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.: 8379 3rd Qu.:20462
## Max. :4.00 Max. :5.000 Max. :4.000 Max. :19999 Max. :26999
## NumCompaniesWorked PercentSalaryHike PerformanceRating
## Min. :0.000 Min. :11.00 Min. :3.000
## 1st Qu.:1.000 1st Qu.:12.00 1st Qu.:3.000
## Median :2.000 Median :14.00 Median :3.000
## Mean :2.693 Mean :15.21 Mean :3.154
## 3rd Qu.:4.000 3rd Qu.:18.00 3rd Qu.:3.000
## Max. :9.000 Max. :25.00 Max. :4.000
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## Min. :1.000 Min. :80 Min. :0.0000 Min. : 0.00
## 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000 1st Qu.: 6.00
## Median :3.000 Median :80 Median :1.0000 Median :10.00
## Mean :2.712 Mean :80 Mean :0.7939 Mean :11.28
## 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000 3rd Qu.:15.00
## Max. :4.000 Max. :80 Max. :3.0000 Max. :40.00
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 Median :3.000 Median : 5.000 Median : 3.000
## Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000
## YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 1.000 Median : 3.000
## Mean : 2.188 Mean : 4.123
## 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :15.000 Max. :17.000
attrition.data[, !(names(attrition.data) %in% exclude_cols)]
selects the columns of attrition.data whose names are not in
exclude_cols. The %in% operator returns TRUE for each column name
found in exclude_cols, the ! operator negates that result, and the
resulting logical vector is used to index the columns.
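An alternative to listing the excluded names by hand is to let R detect the numeric columns. The sketch below shows this on a toy data frame whose values are invented for illustration; it is one possible approach, not the method used in this article.

```r
# Toy data frame with mixed column types (values invented)
toy <- data.frame(
  Age           = c(41L, 49L, 37L),
  Attrition     = c("Yes", "No", "Yes"),
  MonthlyIncome = c(5993L, 5130L, 2090L),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE for every numeric (integer or double) column
is_num <- sapply(toy, is.numeric)

# drop = FALSE keeps a data frame even if only one column qualifies
summary(toy[, is_num, drop = FALSE])
```

Note that this keeps every numeric column, including identifiers and constant columns, so the result still needs a manual review.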
Next, we check the value counts of the variables with character data types.
table(attrition.data$Attrition)
##
## No Yes
## 1233 237
table(attrition.data$BusinessTravel)
##
## Non-Travel Travel_Frequently Travel_Rarely
## 150 277 1043
table(attrition.data$Department)
##
## Human Resources Research & Development Sales
## 63 961 446
table(attrition.data$EducationField)
##
## Human Resources Life Sciences Marketing Medical
## 27 606 159 464
## Other Technical Degree
## 82 132
table(attrition.data$Gender)
##
## Female Male
## 588 882
table(attrition.data$JobRole)
##
## Healthcare Representative Human Resources Laboratory Technician
## 131 52 259
## Manager Manufacturing Director Research Director
## 102 145 80
## Research Scientist Sales Executive Sales Representative
## 292 326 83
table(attrition.data$MaritalStatus)
##
## Divorced Married Single
## 327 673 470
table(attrition.data$Over18)
##
## Y
## 1470
table(attrition.data$OverTime)
##
## No Yes
## 1054 416
The table() function in R is used to create contingency
tables, which count how often each value, or each combination of
values, of one or more variables occurs.
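Raw counts are easier to compare as proportions. The sketch below rebuilds the Attrition column from the counts printed above (1233 "No" and 237 "Yes") and converts them to percentages with prop.table(); the vector name attrition is introduced here only for illustration.

```r
# Rebuild the Attrition values from the counts shown in the table() output
attrition <- rep(c("No", "Yes"), times = c(1233, 237))

counts <- table(attrition)
shares <- prop.table(counts)   # each count divided by the total

round(100 * shares, 1)         # percentages per category
```

This makes the class imbalance explicit: only about 16% of the employees in the dataset left the company.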
Then we do a final check to see whether the dataset contains any NA values.
# Calculates the total value of NA in all columns
total_na <- sum(colSums(is.na(attrition.data)))
# Print total_na
total_na
## [1] 0
The result shows that there are no NA values in any column of the dataset.
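When a dataset does contain missing values, a single total is less helpful than knowing where they are. The sketch below, on a toy data frame with invented values, shows a few base R checks that could be used in that case.

```r
# Toy data frame with two deliberately missing values (values invented)
toy <- data.frame(
  Age    = c(41, NA, 37),
  Gender = c("Female", "Male", NA),
  Income = c(5993, 5130, 2090)
)

anyNA(toy)                    # TRUE if at least one NA exists anywhere
colSums(is.na(toy))           # number of NA values per column
toy[!complete.cases(toy), ]   # rows that contain at least one NA
```

Keep in mind that is.na() only detects real NA values; empty strings or placeholder codes would need a separate check.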
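Next, after these checks, we need to make some adjustments to our
dataset. We remove the columns that cannot provide useful insights,
namely EmployeeCount, Over18, and StandardHours, because each holds a
single constant value (1, “Y”, and 80, respectively, as the outputs
above show). We also add a new column, EducationLevel, which maps the
numeric Education codes 1 to 5 to descriptive labels.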
# Data cleaning
df.attrition <- attrition.data %>%
  select(-EmployeeCount, -Over18, -StandardHours) %>% # Delete columns
  mutate(EducationLevel = c("Below College", "College", "Bachelor", "Master", "Doctor")[Education]) # Add new column
# Review new table
head(df.attrition)
## Age Attrition BusinessTravel DailyRate Department
## 1 41 Yes Travel_Rarely 1102 Sales
## 2 49 No Travel_Frequently 279 Research & Development
## 3 37 Yes Travel_Rarely 1373 Research & Development
## 4 33 No Travel_Frequently 1392 Research & Development
## 5 27 No Travel_Rarely 591 Research & Development
## 6 32 No Travel_Frequently 1005 Research & Development
## DistanceFromHome Education EducationField EmployeeNumber
## 1 1 2 Life Sciences 1
## 2 8 1 Life Sciences 2
## 3 2 2 Other 4
## 4 3 4 Life Sciences 5
## 5 2 1 Medical 7
## 6 2 2 Life Sciences 8
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Female 94 3 2
## 2 3 Male 61 2 2
## 3 4 Male 92 2 1
## 4 4 Female 56 3 1
## 5 1 Male 40 3 1
## 6 4 Male 79 3 1
## JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1 Sales Executive 4 Single 5993 19479
## 2 Research Scientist 2 Married 5130 24907
## 3 Laboratory Technician 3 Single 2090 2396
## 4 Research Scientist 3 Married 2909 23159
## 5 Laboratory Technician 2 Married 3468 16632
## 6 Laboratory Technician 4 Single 3068 11864
## NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 1 8 Yes 11 3
## 2 1 No 23 4
## 3 6 Yes 15 3
## 4 1 Yes 11 3
## 5 9 No 12 3
## 6 0 No 13 3
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## 1 1 0 8
## 2 4 1 10
## 3 2 0 7
## 4 3 0 8
## 5 4 1 6
## 6 3 0 8
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 0 1 6 4
## 2 3 3 10 7
## 3 3 3 0 0
## 4 3 3 8 7
## 5 3 3 2 2
## 6 2 2 7 7
## YearsSinceLastPromotion YearsWithCurrManager EducationLevel
## 1 0 5 College
## 2 1 7 Below College
## 3 0 0 College
## 4 3 0 Master
## 5 2 2 Below College
## 6 3 6 College
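The %>% operator, known as the pipe operator, comes from the
magrittr package and is re-exported by dplyr (and loaded with the
tidyverse). It passes the result of the expression on its left as the
first argument of the function on its right, which makes multi-step
data processing more concise and readable.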
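The cleaning step above builds EducationLevel by indexing a character vector with the integer Education codes. The sketch below isolates that idea in base R, using the first six Education values shown by head(); the variable names are introduced here for illustration, and the factor() variant is an optional alternative rather than what the article uses.

```r
# Labels ordered so that position 1 matches code 1, position 2 matches code 2, ...
education_labels <- c("Below College", "College", "Bachelor", "Master", "Doctor")

# First six Education codes from the head() output above
education_codes <- c(2, 1, 2, 4, 1, 2)

# Indexing the label vector by the codes translates each code to its label
education_labels[education_codes]

# Alternative: an ordered factor also records that Master ranks above College
education_factor <- factor(education_codes, levels = 1:5,
                           labels = education_labels, ordered = TRUE)
education_factor
```

The plain character version is enough for display, while the ordered factor keeps the levels in a meaningful order for plots and comparisons.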
In an increasingly data-driven world, data exploration is a critical first step before conducting in-depth research. It is the process of examining a dataset to learn about its characteristics, patterns, and potential. Data exploration helps to spot early trends, identify possible problems, and formulate deeper questions for research.
Before starting data exploration, let’s load the
ggplot2 and reshape2 libraries, because at
this stage we will create some visualizations such as histograms, box
plots, and a heatmap.
library(ggplot2)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.3
# after_stat(density) replaces the older ..density.. notation, deprecated since ggplot2 3.4.0
ggplot(df.attrition, aes(x = Age, y = after_stat(density), fill = Attrition)) +
  geom_histogram(binwidth = 5, color = "black", position = "stack") +
  geom_density(alpha = 0.5) +
  labs(title = "Employee Age Distribution by Attrition", y = "Density") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(~Attrition)
Based on the histogram above, the age distributions of employees with Attrition “Yes” and “No” are both right-skewed (the longer tail is on the right). This means that the mean, median, and mode of the age column do not coincide. The histogram also shows that employees with Attrition “No” are predominantly aged 30–40 years, while employees with Attrition “Yes” are concentrated around age 30 (roughly 27–33 years), where the density almost reaches 0.06.
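Skewness can also be checked numerically instead of only by eye. The sketch below defines a simple moment-based skewness function (a helper written for this example, not a base R function) and applies it to a small vector of invented ages; for the real data, the earlier summary() output already hints at right skew because the mean age (36.92) is above the median (36.00).

```r
# Moment-based sample skewness: positive values mean a longer right tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Invented ages for illustration, not the real dataset
ages <- c(22, 25, 27, 28, 29, 30, 31, 33, 38, 45, 58)

mean(ages) > median(ages)   # TRUE hints at right skew
skewness(ages)              # positive for this sample
```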
ggplot(attrition.data, aes(x = Attrition, y = MonthlyIncome, fill = Attrition)) +
  geom_boxplot() +
  labs(title = "Monthly Income Distribution by Attrition", y = "Monthly Income (USD)") +
  theme(plot.title = element_text(hjust = 0.5))
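Based on the box plot above, monthly income for employees with “No” attrition is higher than for employees with “Yes” attrition. The figures of roughly 6800 USD (“No”) and 4700 USD (“Yes”) are best read as group averages rather than medians: in the earlier summary() output the overall median is only 4919 USD while the overall mean is 6503 USD, so the median line inside each box sits below these values. This may indicate that higher income is a factor influencing an employee’s decision to stay with the company.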
It can also be seen that there are outliers, indicating extreme values in the monthly income data: some employees have incomes far from those of the majority. Both boxes are also asymmetric, as shown by the unequal heights of the upper and lower halves of each box, which indicates that the income distribution is skewed rather than normal.
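Values read off a box plot are estimates, so it helps to compute the group statistics directly. The sketch below does this on a toy data frame with invented incomes; the same calls could be applied to the Attrition and MonthlyIncome columns of the real dataset.

```r
# Toy data (values invented): one high earner in each group
toy <- data.frame(
  Attrition     = c("No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes"),
  MonthlyIncome = c(3000, 4000, 5000, 6000, 19000, 2000, 2500, 3000, 3500, 12000)
)

# Median and mean per group; the mean is pulled upward by the high earners
medians <- aggregate(MonthlyIncome ~ Attrition, data = toy, FUN = median)
means   <- aggregate(MonthlyIncome ~ Attrition, data = toy, FUN = mean)
medians
means

# Points a box plot would draw as outliers (beyond 1.5 * IQR from the box)
outliers_no <- boxplot.stats(toy$MonthlyIncome[toy$Attrition == "No"])$out
outliers_no
```

In skewed data like this, the mean and the median can differ considerably, which is why it matters which of the two is quoted.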
# Names of the numeric columns to include in the correlation matrix
numeric_columns <- c("Age", "DailyRate", "DistanceFromHome", "Education", "EnvironmentSatisfaction",
"HourlyRate", "JobInvolvement", "JobLevel", "JobSatisfaction", "MonthlyIncome",
"MonthlyRate", "NumCompaniesWorked", "PerformanceRating", "RelationshipSatisfaction",
"PercentSalaryHike", "StockOptionLevel", "TotalWorkingYears", "TrainingTimesLastYear",
"WorkLifeBalance", "YearsAtCompany", "YearsInCurrentRole",
"YearsSinceLastPromotion", "YearsWithCurrManager")
# Calculate the correlation matrix
correlation_matrix <- cor(attrition.data[, numeric_columns])
# Create a correlation heatmap
ggplot(melt(correlation_matrix), aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient(low = "lightblue", high = "red") +
labs(title = "Numeric Variables Correlation Heatmap", x = "Variable", y = "Variable") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
plot.title = element_text(hjust = 0.5))
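There is a strong positive correlation between the variables
MonthlyIncome and JobLevel, MonthlyIncome and
TotalWorkingYears, YearsWithCurrManager and
YearsAtCompany, JobLevel and
TotalWorkingYears, and Age and
TotalWorkingYears. These pairs are marked with a deep red color,
corresponding to an estimated correlation value of 0.75 to 1.0.
Note that cor() defaults to the Pearson method, so the heatmap shows
Pearson correlations, whereas the matrix printed next uses Spearman's
rank correlation.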
cor(df.attrition[, numeric_columns], method = "spearman")
## Age DailyRate DistanceFromHome
## Age 1.000000e+00 0.0072897280 -0.019290911
## DailyRate 7.289728e-03 1.0000000000 -0.002753667
## DistanceFromHome -1.929091e-02 -0.0027536668 1.000000000
## Education 2.049367e-01 -0.0136071321 0.015708005
## EnvironmentSatisfaction 9.820116e-03 0.0189611918 -0.010400913
## HourlyRate 2.885849e-02 0.0235114333 0.020445908
## JobInvolvement 3.445622e-02 0.0424687818 0.034430070
## JobLevel 4.896178e-01 0.0038163552 0.022147607
## JobSatisfaction -5.184852e-03 0.0278287617 -0.013078054
## MonthlyIncome 4.719021e-01 0.0162596518 0.002512448
## MonthlyRate 1.745100e-02 -0.0323595200 0.039617805
## NumCompaniesWorked 3.532126e-01 0.0365483454 -0.009591877
## PerformanceRating 9.338863e-05 0.0006244329 0.011319674
## RelationshipSatisfaction 4.606332e-02 0.0096845284 0.005851811
## PercentSalaryHike 7.708838e-03 0.0250704034 0.029666428
## StockOptionLevel 5.663306e-02 0.0385139335 0.030190294
## TotalWorkingYears 6.568958e-01 0.0209506994 -0.002912375
## TrainingTimesLastYear 3.158049e-04 -0.0113389745 -0.024848361
## WorkLifeBalance -3.706791e-03 -0.0403523149 -0.020401887
## YearsAtCompany 2.516860e-01 -0.0097783439 0.010513095
## YearsInCurrentRole 1.979776e-01 0.0072075388 0.013708096
## YearsSinceLastPromotion 1.736472e-01 -0.0376312932 -0.004685211
## YearsWithCurrManager 1.948176e-01 -0.0047165189 0.004447868
## Education EnvironmentSatisfaction HourlyRate
## Age 0.204936684 0.0098201155 0.0288584852
## DailyRate -0.013607132 0.0189611918 0.0235114333
## DistanceFromHome 0.015708005 -0.0104009131 0.0204459078
## Education 1.000000000 -0.0276246896 0.0144319003
## EnvironmentSatisfaction -0.027624690 1.0000000000 -0.0523803635
## HourlyRate 0.014431900 -0.0523803635 1.0000000000
## JobInvolvement 0.037230797 -0.0153011188 0.0438843132
## JobLevel 0.107419164 -0.0001924317 -0.0338761113
## JobSatisfaction -0.005175469 -0.0029927823 -0.0683401726
## MonthlyIncome 0.120028491 -0.0151630774 -0.0197617219
## MonthlyRate -0.021213828 0.0374765094 -0.0148884998
## NumCompaniesWorked 0.135103375 0.0061514181 0.0192092504
## PerformanceRating -0.025080669 -0.0291598169 -0.0021846442
## RelationshipSatisfaction -0.013172612 0.0053534581 0.0002585028
## PercentSalaryHike 0.004299936 -0.0304894261 -0.0098755483
## StockOptionLevel 0.013793504 0.0098261697 0.0505430780
## TotalWorkingYears 0.162176793 -0.0138819760 -0.0120716328
## TrainingTimesLastYear -0.023748617 -0.0116589377 0.0002918715
## WorkLifeBalance 0.017350435 0.0271689716 -0.0100031257
## YearsAtCompany 0.064196129 0.0084245043 -0.0290323277
## YearsInCurrentRole 0.054567211 0.0201401658 -0.0340160475
## YearsSinceLastPromotion 0.032203024 0.0260816744 -0.0524124022
## YearsWithCurrManager 0.051291866 -0.0017318242 -0.0138114155
## JobInvolvement JobLevel JobSatisfaction
## Age 0.034456224 0.4896178109 -0.0051848521
## DailyRate 0.042468782 0.0038163552 0.0278287617
## DistanceFromHome 0.034430070 0.0221476071 -0.0130780540
## Education 0.037230797 0.1074191638 -0.0051754687
## EnvironmentSatisfaction -0.015301119 -0.0001924317 -0.0029927823
## HourlyRate 0.043884313 -0.0338761113 -0.0683401726
## JobInvolvement 1.000000000 -0.0184235130 -0.0121482118
## JobLevel -0.018423513 1.0000000000 -0.0008519730
## JobSatisfaction -0.012148212 -0.0008519730 1.0000000000
## MonthlyIncome -0.024552352 0.9204286748 0.0048807779
## MonthlyRate -0.018117452 0.0527918878 -0.0027017288
## NumCompaniesWorked 0.015448159 0.1782701536 -0.0515158889
## PerformanceRating -0.024732712 -0.0186083014 0.0069785019
## RelationshipSatisfaction 0.037857297 0.0113112331 -0.0146785538
## PercentSalaryHike -0.016998737 -0.0324527523 0.0239695457
## StockOptionLevel 0.034464290 0.0477861699 0.0127854959
## TotalWorkingYears 0.006444104 0.7346775906 -0.0158747168
## TrainingTimesLastYear 0.002013915 -0.0197285879 -0.0116809933
## WorkLifeBalance -0.019888634 0.0404657805 -0.0297808628
## YearsAtCompany 0.013836362 0.4722827149 0.0122804055
## YearsInCurrentRole 0.015547840 0.3910854019 0.0005310846
## YearsSinceLastPromotion -0.008306725 0.2690960789 0.0074971306
## YearsWithCurrManager 0.037397014 0.3708892877 -0.0167721793
## MonthlyIncome MonthlyRate NumCompaniesWorked
## Age 0.471902130 0.0174510008 3.532126e-01
## DailyRate 0.016259652 -0.0323595200 3.654835e-02
## DistanceFromHome 0.002512448 0.0396178052 -9.591877e-03
## Education 0.120028491 -0.0212138278 1.351034e-01
## EnvironmentSatisfaction -0.015163077 0.0374765094 6.151418e-03
## HourlyRate -0.019761722 -0.0148884998 1.920925e-02
## JobInvolvement -0.024552352 -0.0181174516 1.544816e-02
## JobLevel 0.920428675 0.0527918878 1.782702e-01
## JobSatisfaction 0.004880778 -0.0027017288 -5.151589e-02
## MonthlyIncome 1.000000000 0.0542767660 1.903072e-01
## MonthlyRate 0.054276766 1.0000000000 1.955330e-02
## NumCompaniesWorked 0.190307217 0.0195532984 1.000000e+00
## PerformanceRating -0.026999475 -0.0096975883 -8.298387e-03
## RelationshipSatisfaction 0.003885241 -0.0003728148 4.029637e-02
## PercentSalaryHike -0.033767076 -0.0054705369 4.628802e-05
## StockOptionLevel 0.045851881 -0.0372742936 3.227691e-02
## TotalWorkingYears 0.710024314 0.0133598231 3.151956e-01
## TrainingTimesLastYear -0.034846762 -0.0100179472 -4.733649e-02
## WorkLifeBalance 0.030759146 0.0063162635 9.102504e-03
## YearsAtCompany 0.464315235 -0.0298618681 -1.710698e-01
## YearsInCurrentRole 0.394711834 -0.0068654973 -1.276730e-01
## YearsSinceLastPromotion 0.264599332 -0.0162854734 -6.695018e-02
## YearsWithCurrManager 0.365385678 -0.0350591418 -1.441290e-01
## PerformanceRating RelationshipSatisfaction
## Age 9.338863e-05 0.0460633199
## DailyRate 6.244329e-04 0.0096845284
## DistanceFromHome 1.131967e-02 0.0058518112
## Education -2.508067e-02 -0.0131726125
## EnvironmentSatisfaction -2.915982e-02 0.0053534581
## HourlyRate -2.184644e-03 0.0002585028
## JobInvolvement -2.473271e-02 0.0378572969
## JobLevel -1.860830e-02 0.0113112331
## JobSatisfaction 6.978502e-03 -0.0146785538
## MonthlyIncome -2.699948e-02 0.0038852411
## MonthlyRate -9.697588e-03 -0.0003728148
## NumCompaniesWorked -8.298387e-03 0.0402963651
## PerformanceRating 1.000000e+00 -0.0329887789
## RelationshipSatisfaction -3.298878e-02 1.0000000000
## PercentSalaryHike 6.285191e-01 -0.0349145727
## StockOptionLevel 1.102806e-02 -0.0562490052
## TotalWorkingYears 1.167810e-02 0.0039712744
## TrainingTimesLastYear -1.667579e-02 0.0054241038
## WorkLifeBalance 6.808391e-03 0.0176841171
## YearsAtCompany 1.722425e-02 -0.0012671613
## YearsInCurrentRole 3.271927e-02 -0.0213997688
## YearsSinceLastPromotion -6.578150e-03 0.0369629999
## YearsWithCurrManager 2.556002e-02 0.0002800476
## PercentSalaryHike StockOptionLevel TotalWorkingYears
## Age 7.708838e-03 0.056633063 0.656895823
## DailyRate 2.507040e-02 0.038513934 0.020950699
## DistanceFromHome 2.966643e-02 0.030190294 -0.002912375
## Education 4.299936e-03 0.013793504 0.162176793
## EnvironmentSatisfaction -3.048943e-02 0.009826170 -0.013881976
## HourlyRate -9.875548e-03 0.050543078 -0.012071633
## JobInvolvement -1.699874e-02 0.034464290 0.006444104
## JobLevel -3.245275e-02 0.047786170 0.734677591
## JobSatisfaction 2.396955e-02 0.012785496 -0.015874717
## MonthlyIncome -3.376708e-02 0.045851881 0.710024314
## MonthlyRate -5.470537e-03 -0.037274294 0.013359823
## NumCompaniesWorked 4.628802e-05 0.032276911 0.315195582
## PerformanceRating 6.285191e-01 0.011028055 0.011678101
## RelationshipSatisfaction -3.491457e-02 -0.056249005 0.003971274
## PercentSalaryHike 1.000000e+00 0.023445876 -0.025527604
## StockOptionLevel 2.344588e-02 1.000000000 0.052618281
## TotalWorkingYears -2.552760e-02 0.052618281 1.000000000
## TrainingTimesLastYear -4.106182e-03 0.003388463 -0.014150578
## WorkLifeBalance 9.304377e-04 -0.016567956 0.003004074
## YearsAtCompany -5.411676e-02 0.064974119 0.594193253
## YearsInCurrentRole -2.552848e-02 0.071626914 0.492721324
## YearsSinceLastPromotion -5.536242e-02 0.027502390 0.334995640
## YearsWithCurrManager -2.604883e-02 0.053646188 0.495254103
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## Age 0.0003158049 -0.0037067910 0.251685970
## DailyRate -0.0113389745 -0.0403523149 -0.009778344
## DistanceFromHome -0.0248483609 -0.0204018869 0.010513095
## Education -0.0237486170 0.0173504351 0.064196129
## EnvironmentSatisfaction -0.0116589377 0.0271689716 0.008424504
## HourlyRate 0.0002918715 -0.0100031257 -0.029032328
## JobInvolvement 0.0020139150 -0.0198886343 0.013836362
## JobLevel -0.0197285879 0.0404657805 0.472282715
## JobSatisfaction -0.0116809933 -0.0297808628 0.012280406
## MonthlyIncome -0.0348467617 0.0307591463 0.464315235
## MonthlyRate -0.0100179472 0.0063162635 -0.029861868
## NumCompaniesWorked -0.0473364941 0.0091025036 -0.171069831
## PerformanceRating -0.0166757921 0.0068083913 0.017224252
## RelationshipSatisfaction 0.0054241038 0.0176841171 -0.001267161
## PercentSalaryHike -0.0041061817 0.0009304377 -0.054116761
## StockOptionLevel 0.0033884631 -0.0165679563 0.064974119
## TotalWorkingYears -0.0141505775 0.0030040739 0.594193253
## TrainingTimesLastYear 1.0000000000 0.0236895575 0.001389345
## WorkLifeBalance 0.0236895575 1.0000000000 0.004675134
## YearsAtCompany 0.0013893447 0.0046751344 1.000000000
## YearsInCurrentRole 0.0045810569 0.0232142605 0.853999533
## YearsSinceLastPromotion 0.0102154346 0.0021511105 0.519966444
## YearsWithCurrManager -0.0116275408 -0.0045905701 0.842803342
## YearsInCurrentRole YearsSinceLastPromotion
## Age 0.1979775610 0.173647224
## DailyRate 0.0072075388 -0.037631293
## DistanceFromHome 0.0137080964 -0.004685211
## Education 0.0545672109 0.032203024
## EnvironmentSatisfaction 0.0201401658 0.026081674
## HourlyRate -0.0340160475 -0.052412402
## JobInvolvement 0.0155478405 -0.008306725
## JobLevel 0.3910854019 0.269096079
## JobSatisfaction 0.0005310846 0.007497131
## MonthlyIncome 0.3947118335 0.264599332
## MonthlyRate -0.0068654973 -0.016285473
## NumCompaniesWorked -0.1276729762 -0.066950179
## PerformanceRating 0.0327192666 -0.006578150
## RelationshipSatisfaction -0.0213997688 0.036963000
## PercentSalaryHike -0.0255284805 -0.055362419
## StockOptionLevel 0.0716269138 0.027502390
## TotalWorkingYears 0.4927213242 0.334995640
## TrainingTimesLastYear 0.0045810569 0.010215435
## WorkLifeBalance 0.0232142605 0.002151110
## YearsAtCompany 0.8539995333 0.519966444
## YearsInCurrentRole 1.0000000000 0.505656503
## YearsSinceLastPromotion 0.5056565032 1.000000000
## YearsWithCurrManager 0.7247542193 0.466712745
## YearsWithCurrManager
## Age 0.1948175841
## DailyRate -0.0047165189
## DistanceFromHome 0.0044478680
## Education 0.0512918662
## EnvironmentSatisfaction -0.0017318242
## HourlyRate -0.0138114155
## JobInvolvement 0.0373970145
## JobLevel 0.3708892877
## JobSatisfaction -0.0167721793
## MonthlyIncome 0.3653856782
## MonthlyRate -0.0350591418
## NumCompaniesWorked -0.1441289757
## PerformanceRating 0.0255600151
## RelationshipSatisfaction 0.0002800476
## PercentSalaryHike -0.0260488322
## StockOptionLevel 0.0536461877
## TotalWorkingYears 0.4952541030
## TrainingTimesLastYear -0.0116275408
## WorkLifeBalance -0.0045905701
## YearsAtCompany 0.8428033422
## YearsInCurrentRole 0.7247542193
## YearsSinceLastPromotion 0.4667127452
## YearsWithCurrManager 1.0000000000
JobLevel and MonthlyIncome have a strong
positive correlation of around 0.920, indicating that the higher the
JobLevel, the higher the MonthlyIncome tends to be.
Additionally, YearsAtCompany and
YearsWithCurrManager have a strong positive
correlation of around 0.843, indicating that the longer someone has
worked at the company (YearsAtCompany), the more time they tend to
have spent with their current manager
(YearsWithCurrManager).
Because including both variables of such a highly correlated pair can
cause multicollinearity and make the coefficient estimates unstable,
we will exclude JobLevel and
YearsAtCompany from the model.
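Scanning a large correlation matrix by eye is error-prone, so the highly correlated pairs can also be listed programmatically. The sketch below does this on simulated data; the 0.8 cutoff and the simulated columns (which only borrow names from the dataset) are assumptions made for illustration.

```r
set.seed(42)
n <- 200
tenure <- rpois(n, 7)

# Simulated stand-ins: the second column is built from the first, the third is unrelated
toy <- data.frame(
  YearsAtCompany       = tenure,
  YearsWithCurrManager = pmax(0, tenure - rpois(n, 1)),
  DailyRate            = sample(100:1500, n, replace = TRUE)
)

rho <- cor(toy, method = "spearman")

# upper.tri() keeps each pair once and skips the diagonal
idx <- which(abs(rho) > 0.8 & upper.tri(rho), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(rho)[idx[, "row"]],
  var2 = colnames(rho)[idx[, "col"]],
  rho  = rho[idx]
)
high_pairs
```

Each row of high_pairs is a candidate pair from which one variable might be dropped before modeling.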
In this section, we walk through the steps of analyzing the factors that contribute to employee turnover and show how these steps can provide valuable insights for company decision-making. We will use logistic regression to build the model.
The glm() function used below belongs to the stats package, which R
attaches by default, so the library(stats) call is optional.
Before fitting the first model, we also recode the Attrition variable as 1 and 0, with 1 for “Yes” and 0 for “No”.
library(stats)
df.attrition$Attrition <- ifelse(df.attrition$Attrition == "Yes", 1, 0)
# Create first model as logit1
logit1 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + Education + EnvironmentSatisfaction +
HourlyRate + JobInvolvement + JobSatisfaction + MonthlyIncome +
MonthlyRate + NumCompaniesWorked + PerformanceRating + RelationshipSatisfaction +
PercentSalaryHike + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
WorkLifeBalance + YearsInCurrentRole +
YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
data = df.attrition)
# Call logit1 using the summary function to see the estimation results
summary(logit1)
##
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome +
## Education + EnvironmentSatisfaction + HourlyRate + JobInvolvement +
## JobSatisfaction + MonthlyIncome + MonthlyRate + NumCompaniesWorked +
## PerformanceRating + RelationshipSatisfaction + PercentSalaryHike +
## StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
## WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion +
## YearsWithCurrManager, family = "binomial", data = df.attrition)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4921 -0.6029 -0.3914 -0.2006 3.3009
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.324e+00 1.089e+00 3.970 7.19e-05 ***
## Age -3.060e-02 1.221e-02 -2.507 0.012192 *
## DailyRate -2.457e-04 1.944e-04 -1.264 0.206215
## DistanceFromHome 3.252e-02 9.285e-03 3.503 0.000460 ***
## Education 1.065e-02 7.828e-02 0.136 0.891781
## EnvironmentSatisfaction -2.963e-01 7.101e-02 -4.173 3.00e-05 ***
## HourlyRate -4.602e-04 3.857e-03 -0.119 0.905020
## JobInvolvement -4.919e-01 1.075e-01 -4.575 4.76e-06 ***
## JobSatisfaction -2.965e-01 7.063e-02 -4.198 2.69e-05 ***
## MonthlyIncome -7.521e-05 3.171e-05 -2.372 0.017701 *
## MonthlyRate 4.904e-06 1.102e-05 0.445 0.656248
## NumCompaniesWorked 1.244e-01 3.260e-02 3.815 0.000136 ***
## PerformanceRating 3.001e-01 3.462e-01 0.867 0.386035
## RelationshipSatisfaction -1.460e-01 7.222e-02 -2.021 0.043268 *
## PercentSalaryHike -4.301e-02 3.440e-02 -1.250 0.211168
## StockOptionLevel -5.204e-01 1.047e-01 -4.969 6.73e-07 ***
## TotalWorkingYears -3.154e-02 2.328e-02 -1.355 0.175469
## TrainingTimesLastYear -1.565e-01 6.196e-02 -2.526 0.011523 *
## WorkLifeBalance -2.468e-01 1.064e-01 -2.320 0.020327 *
## YearsInCurrentRole -9.613e-02 3.838e-02 -2.505 0.012250 *
## YearsSinceLastPromotion 1.741e-01 3.551e-02 4.902 9.48e-07 ***
## YearsWithCurrManager -8.765e-02 3.811e-02 -2.300 0.021438 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1298.6 on 1469 degrees of freedom
## Residual deviance: 1072.8 on 1448 degrees of freedom
## AIC: 1116.8
##
## Number of Fisher Scoring iterations: 6
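The estimates above are on the log-odds scale, which is hard to interpret directly. Exponentiating a coefficient gives an odds ratio. The sketch below does this for three estimates copied from the summary(logit1) output; with the fitted model available, exp(coef(logit1)) would give the full set.

```r
# Estimates copied from the summary(logit1) output above (log-odds scale)
log_odds <- c(DistanceFromHome        = 3.252e-02,
              StockOptionLevel        = -5.204e-01,
              YearsSinceLastPromotion = 1.741e-01)

# Odds ratio: multiplicative change in the odds of attrition per one-unit increase
odds_ratios <- exp(log_odds)
round(odds_ratios, 3)

# Same information as a percent change in the odds
round(100 * (odds_ratios - 1), 1)
```

Read this way, each additional unit of DistanceFromHome raises the odds of attrition by roughly 3%, while each additional StockOptionLevel lowers them by roughly 40%, holding the other variables fixed.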
Several independent variables are still not significant at the 0.05
level, so we refit the model after removing one of them,
namely Education (p-value of about 0.89). The resulting model
is logit2.
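Rather than retyping the whole formula, a fitted model can also be refit with update(). The sketch below shows the idea on simulated data; the variables x1, x2, x3, and y are invented for this example and are not part of the attrition dataset.

```r
set.seed(1)

# Simulated data: only x1 actually influences the outcome
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
toy$y <- rbinom(100, 1, plogis(0.8 * toy$x1))

m1 <- glm(y ~ x1 + x2 + x3, family = "binomial", data = toy)

# ". ~ . - x3" means: same response, same predictors, minus x3
m2 <- update(m1, . ~ . - x3)
formula(m2)

# Compare the two nested models
anova(m2, m1, test = "Chisq")   # likelihood-ratio test
AIC(m1, m2)                     # lower AIC suggests a better trade-off
```

The article's own refit, with the formula written out in full, follows.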
# Create second model as logit2
logit2 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + EnvironmentSatisfaction +
HourlyRate + JobInvolvement + JobSatisfaction + MonthlyIncome +
MonthlyRate + NumCompaniesWorked + PerformanceRating + RelationshipSatisfaction +
PercentSalaryHike + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
WorkLifeBalance + YearsInCurrentRole +
YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
data = df.attrition)
# Call logit2 using the summary function to see the estimation results
summary(logit2)
##
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome +
## EnvironmentSatisfaction + HourlyRate + JobInvolvement + JobSatisfaction +
## MonthlyIncome + MonthlyRate + NumCompaniesWorked + PerformanceRating +
## RelationshipSatisfaction + PercentSalaryHike + StockOptionLevel +
## TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance +
## YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager,
## family = "binomial", data = df.attrition)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4995 -0.6009 -0.3913 -0.2012 3.3069
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.349e+00 1.073e+00 4.054 5.03e-05 ***
## Age -3.032e-02 1.203e-02 -2.521 0.011718 *
## DailyRate -2.466e-04 1.943e-04 -1.269 0.204300
## DistanceFromHome 3.253e-02 9.283e-03 3.504 0.000458 ***
## EnvironmentSatisfaction -2.964e-01 7.101e-02 -4.175 2.99e-05 ***
## HourlyRate -4.700e-04 3.856e-03 -0.122 0.902996
## JobInvolvement -4.915e-01 1.075e-01 -4.572 4.83e-06 ***
## JobSatisfaction -2.965e-01 7.063e-02 -4.198 2.69e-05 ***
## MonthlyIncome -7.513e-05 3.171e-05 -2.370 0.017804 *
## MonthlyRate 4.836e-06 1.101e-05 0.439 0.660382
## NumCompaniesWorked 1.247e-01 3.250e-02 3.837 0.000125 ***
## PerformanceRating 2.985e-01 3.460e-01 0.863 0.388303
## RelationshipSatisfaction -1.461e-01 7.220e-02 -2.023 0.043024 *
## PercentSalaryHike -4.291e-02 3.439e-02 -1.248 0.212066
## StockOptionLevel -5.202e-01 1.047e-01 -4.968 6.77e-07 ***
## TotalWorkingYears -3.165e-02 2.327e-02 -1.360 0.173736
## TrainingTimesLastYear -1.565e-01 6.196e-02 -2.526 0.011522 *
## WorkLifeBalance -2.472e-01 1.063e-01 -2.324 0.020122 *
## YearsInCurrentRole -9.597e-02 3.836e-02 -2.502 0.012345 *
## YearsSinceLastPromotion 1.742e-01 3.549e-02 4.909 9.15e-07 ***
## YearsWithCurrManager -8.757e-02 3.809e-02 -2.299 0.021496 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1298.6 on 1469 degrees of freedom
## Residual deviance: 1072.8 on 1449 degrees of freedom
## AIC: 1114.8
##
## Number of Fisher Scoring iterations: 6
Several independent variables are still not significant, so we refit
the model again after removing one more variable, HourlyRate, and
store the result as logit3. We repeat this process, removing one
variable at a time, until all remaining variables are significant.
# Create third model as logit3
logit3 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + EnvironmentSatisfaction +
JobInvolvement + JobSatisfaction + MonthlyIncome +
MonthlyRate + NumCompaniesWorked + PerformanceRating + RelationshipSatisfaction +
PercentSalaryHike + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
WorkLifeBalance + YearsInCurrentRole +
YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
data = df.attrition)
# Call logit3 using the summary function to see the estimation results
summary(logit3)
##
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome +
## EnvironmentSatisfaction + JobInvolvement + JobSatisfaction +
## MonthlyIncome + MonthlyRate + NumCompaniesWorked + PerformanceRating +
## RelationshipSatisfaction + PercentSalaryHike + StockOptionLevel +
## TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance +
## YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager,
## family = "binomial", data = df.attrition)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5058 -0.6019 -0.3914 -0.2008 3.3052
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.321e+00 1.047e+00 4.126 3.70e-05 ***
## Age -3.040e-02 1.202e-02 -2.530 0.011406 *
## DailyRate -2.474e-04 1.942e-04 -1.274 0.202594
## DistanceFromHome 3.251e-02 9.281e-03 3.503 0.000460 ***
## EnvironmentSatisfaction -2.963e-01 7.101e-02 -4.173 3.01e-05 ***
## JobInvolvement -4.921e-01 1.074e-01 -4.584 4.57e-06 ***
## JobSatisfaction -2.959e-01 7.046e-02 -4.200 2.67e-05 ***
## MonthlyIncome -7.515e-05 3.171e-05 -2.370 0.017780 *
## MonthlyRate 4.853e-06 1.101e-05 0.441 0.659226
## NumCompaniesWorked 1.248e-01 3.250e-02 3.839 0.000124 ***
## PerformanceRating 2.981e-01 3.459e-01 0.862 0.388756
## RelationshipSatisfaction -1.460e-01 7.220e-02 -2.023 0.043112 *
## PercentSalaryHike -4.285e-02 3.438e-02 -1.246 0.212638
## StockOptionLevel -5.206e-01 1.047e-01 -4.974 6.54e-07 ***
## TotalWorkingYears -3.162e-02 2.326e-02 -1.359 0.174065
## TrainingTimesLastYear -1.566e-01 6.195e-02 -2.527 0.011501 *
## WorkLifeBalance -2.472e-01 1.064e-01 -2.324 0.020118 *
## YearsInCurrentRole -9.599e-02 3.836e-02 -2.502 0.012333 *
## YearsSinceLastPromotion 1.744e-01 3.547e-02 4.918 8.75e-07 ***
## YearsWithCurrManager -8.761e-02 3.810e-02 -2.300 0.021463 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1298.6 on 1469 degrees of freedom
## Residual deviance: 1072.8 on 1450 degrees of freedom
## AIC: 1112.8
##
## Number of Fisher Scoring iterations: 6
# Create fourth model as logit4
logit4 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + EnvironmentSatisfaction +
JobInvolvement + JobSatisfaction + MonthlyIncome +
NumCompaniesWorked + PerformanceRating + RelationshipSatisfaction +
PercentSalaryHike + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
WorkLifeBalance + YearsInCurrentRole +
YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
data = df.attrition)
# Call logit4 using the summary function to see the estimation results
summary(logit4)
##
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome +
## EnvironmentSatisfaction + JobInvolvement + JobSatisfaction +
## MonthlyIncome + NumCompaniesWorked + PerformanceRating +
## RelationshipSatisfaction + PercentSalaryHike + StockOptionLevel +
## TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance +
## YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager,
## family = "binomial", data = df.attrition)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5220 -0.6020 -0.3918 -0.1983 3.3054
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.403e+00 1.032e+00 4.267 1.98e-05 ***
## Age -3.040e-02 1.202e-02 -2.529 0.011442 *
## DailyRate -2.507e-04 1.940e-04 -1.292 0.196375
## DistanceFromHome 3.268e-02 9.270e-03 3.526 0.000422 ***
## EnvironmentSatisfaction -2.946e-01 7.089e-02 -4.155 3.25e-05 ***
## JobInvolvement -4.921e-01 1.074e-01 -4.583 4.59e-06 ***
## JobSatisfaction -2.951e-01 7.040e-02 -4.191 2.77e-05 ***
## MonthlyIncome -7.456e-05 3.167e-05 -2.355 0.018547 *
## NumCompaniesWorked 1.247e-01 3.249e-02 3.838 0.000124 ***
## PerformanceRating 2.899e-01 3.455e-01 0.839 0.401385
## RelationshipSatisfaction -1.465e-01 7.221e-02 -2.029 0.042460 *
## PercentSalaryHike -4.233e-02 3.436e-02 -1.232 0.217987
## StockOptionLevel -5.237e-01 1.045e-01 -5.012 5.38e-07 ***
## TotalWorkingYears -3.164e-02 2.327e-02 -1.359 0.174050
## TrainingTimesLastYear -1.562e-01 6.200e-02 -2.520 0.011730 *
## WorkLifeBalance -2.467e-01 1.063e-01 -2.321 0.020283 *
## YearsInCurrentRole -9.590e-02 3.835e-02 -2.501 0.012390 *
## YearsSinceLastPromotion 1.750e-01 3.545e-02 4.936 7.98e-07 ***
## YearsWithCurrManager -8.842e-02 3.804e-02 -2.324 0.020115 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1298.6 on 1469 degrees of freedom
## Residual deviance: 1073.0 on 1451 degrees of freedom
## AIC: 1111
##
## Number of Fisher Scoring iterations: 6
# Create fifth model as logit5
logit5 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + EnvironmentSatisfaction +
JobInvolvement + JobSatisfaction + MonthlyIncome +
NumCompaniesWorked + RelationshipSatisfaction +
PercentSalaryHike + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
WorkLifeBalance + YearsInCurrentRole +
YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
data = df.attrition)
# Call logit5 using the summary function to see the estimation results
summary(logit5)
##
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome +
## EnvironmentSatisfaction + JobInvolvement + JobSatisfaction +
## MonthlyIncome + NumCompaniesWorked + RelationshipSatisfaction +
## PercentSalaryHike + StockOptionLevel + TotalWorkingYears +
## TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole +
## YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
## data = df.attrition)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5425 -0.6056 -0.3906 -0.1988 3.2800
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.992e+00 7.639e-01 6.536 6.33e-11 ***
## Age -3.059e-02 1.201e-02 -2.548 0.010847 *
## DailyRate -2.543e-04 1.940e-04 -1.311 0.189929
## DistanceFromHome 3.249e-02 9.273e-03 3.504 0.000459 ***
## EnvironmentSatisfaction -2.966e-01 7.086e-02 -4.185 2.85e-05 ***
## JobInvolvement -4.936e-01 1.073e-01 -4.602 4.19e-06 ***
## JobSatisfaction -2.954e-01 7.039e-02 -4.197 2.70e-05 ***
## MonthlyIncome -7.559e-05 3.168e-05 -2.386 0.017036 *
## NumCompaniesWorked 1.238e-01 3.244e-02 3.816 0.000136 ***
## RelationshipSatisfaction -1.469e-01 7.212e-02 -2.036 0.041703 *
## PercentSalaryHike -2.013e-02 2.177e-02 -0.925 0.355088
## StockOptionLevel -5.260e-01 1.045e-01 -5.036 4.76e-07 ***
## TotalWorkingYears -3.072e-02 2.324e-02 -1.322 0.186124
## TrainingTimesLastYear -1.570e-01 6.199e-02 -2.532 0.011337 *
## WorkLifeBalance -2.432e-01 1.062e-01 -2.289 0.022055 *
## YearsInCurrentRole -9.532e-02 3.835e-02 -2.485 0.012940 *
## YearsSinceLastPromotion 1.757e-01 3.539e-02 4.963 6.95e-07 ***
## YearsWithCurrManager -8.814e-02 3.801e-02 -2.319 0.020396 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1298.6 on 1469 degrees of freedom
## Residual deviance: 1073.7 on 1452 degrees of freedom
## AIC: 1109.7
##
## Number of Fisher Scoring iterations: 5
# Create sixth model as logit6
logit6 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + EnvironmentSatisfaction +
JobInvolvement + JobSatisfaction + MonthlyIncome +
NumCompaniesWorked + RelationshipSatisfaction +
StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
WorkLifeBalance + YearsInCurrentRole +
YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
data = df.attrition)
# Call logit6 using the summary function to see the estimation results
summary(logit6)
##
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome +
## EnvironmentSatisfaction + JobInvolvement + JobSatisfaction +
## MonthlyIncome + NumCompaniesWorked + RelationshipSatisfaction +
## StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
## WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion +
## YearsWithCurrManager, family = "binomial", data = df.attrition)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5380 -0.6020 -0.3903 -0.1987 3.2739
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.681e+00 6.835e-01 6.848 7.49e-12 ***
## Age -3.077e-02 1.199e-02 -2.566 0.010287 *
## DailyRate -2.622e-04 1.937e-04 -1.353 0.175905
## DistanceFromHome 3.212e-02 9.261e-03 3.468 0.000523 ***
## EnvironmentSatisfaction -2.961e-01 7.083e-02 -4.180 2.92e-05 ***
## JobInvolvement -4.908e-01 1.071e-01 -4.585 4.55e-06 ***
## JobSatisfaction -2.951e-01 7.033e-02 -4.196 2.71e-05 ***
## MonthlyIncome -7.452e-05 3.165e-05 -2.355 0.018540 *
## NumCompaniesWorked 1.240e-01 3.242e-02 3.824 0.000131 ***
## RelationshipSatisfaction -1.440e-01 7.197e-02 -2.000 0.045468 *
## StockOptionLevel -5.265e-01 1.045e-01 -5.037 4.74e-07 ***
## TotalWorkingYears -3.080e-02 2.325e-02 -1.325 0.185241
## TrainingTimesLastYear -1.573e-01 6.203e-02 -2.536 0.011202 *
## WorkLifeBalance -2.436e-01 1.063e-01 -2.292 0.021900 *
## YearsInCurrentRole -9.516e-02 3.835e-02 -2.482 0.013078 *
## YearsSinceLastPromotion 1.758e-01 3.534e-02 4.976 6.49e-07 ***
## YearsWithCurrManager -8.769e-02 3.800e-02 -2.307 0.021029 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1298.6 on 1469 degrees of freedom
## Residual deviance: 1074.6 on 1453 degrees of freedom
## AIC: 1108.6
##
## Number of Fisher Scoring iterations: 5
# Create seventh model as logit7
logit7 <- glm(Attrition ~ Age + DailyRate + DistanceFromHome + EnvironmentSatisfaction +
JobInvolvement + JobSatisfaction + MonthlyIncome +
NumCompaniesWorked + RelationshipSatisfaction +
StockOptionLevel + TrainingTimesLastYear +
WorkLifeBalance + YearsInCurrentRole +
YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
data = df.attrition)
# Call logit7 using the summary function to see the estimation results
summary(logit7)
##
## Call:
## glm(formula = Attrition ~ Age + DailyRate + DistanceFromHome +
## EnvironmentSatisfaction + JobInvolvement + JobSatisfaction +
## MonthlyIncome + NumCompaniesWorked + RelationshipSatisfaction +
## StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance +
## YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager,
## family = "binomial", data = df.attrition)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5518 -0.6004 -0.3873 -0.2018 3.1323
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.828e+00 6.764e-01 7.138 9.45e-13 ***
## Age -3.875e-02 1.058e-02 -3.664 0.000248 ***
## DailyRate -2.666e-04 1.938e-04 -1.375 0.169013
## DistanceFromHome 3.173e-02 9.237e-03 3.436 0.000591 ***
## EnvironmentSatisfaction -2.913e-01 7.061e-02 -4.126 3.70e-05 ***
## JobInvolvement -4.837e-01 1.068e-01 -4.530 5.90e-06 ***
## JobSatisfaction -2.927e-01 7.022e-02 -4.168 3.07e-05 ***
## MonthlyIncome -9.772e-05 2.643e-05 -3.697 0.000218 ***
## NumCompaniesWorked 1.160e-01 3.196e-02 3.628 0.000285 ***
## RelationshipSatisfaction -1.391e-01 7.193e-02 -1.934 0.053072 .
## StockOptionLevel -5.212e-01 1.044e-01 -4.992 5.99e-07 ***
## TrainingTimesLastYear -1.559e-01 6.209e-02 -2.511 0.012033 *
## WorkLifeBalance -2.411e-01 1.063e-01 -2.268 0.023302 *
## YearsInCurrentRole -1.016e-01 3.773e-02 -2.692 0.007099 **
## YearsSinceLastPromotion 1.712e-01 3.497e-02 4.897 9.74e-07 ***
## YearsWithCurrManager -9.801e-02 3.688e-02 -2.657 0.007878 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1298.6 on 1469 degrees of freedom
## Residual deviance: 1076.4 on 1454 degrees of freedom
## AIC: 1108.4
##
## Number of Fisher Scoring iterations: 5
# Create eighth model as logit8
logit8 <- glm(Attrition ~ Age + DistanceFromHome + EnvironmentSatisfaction +
JobInvolvement + JobSatisfaction + MonthlyIncome +
NumCompaniesWorked + RelationshipSatisfaction +
StockOptionLevel + TrainingTimesLastYear +
WorkLifeBalance + YearsInCurrentRole +
YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
data = df.attrition)
# Call logit8 using the summary function to see the estimation results
summary(logit8)
##
## Call:
## glm(formula = Attrition ~ Age + DistanceFromHome + EnvironmentSatisfaction +
## JobInvolvement + JobSatisfaction + MonthlyIncome + NumCompaniesWorked +
## RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear +
## WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion +
## YearsWithCurrManager, family = "binomial", data = df.attrition)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4986 -0.6117 -0.3896 -0.2054 3.1520
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.649e+00 6.620e-01 7.023 2.17e-12 ***
## Age -3.906e-02 1.056e-02 -3.701 0.000215 ***
## DistanceFromHome 3.184e-02 9.225e-03 3.451 0.000558 ***
## EnvironmentSatisfaction -2.922e-01 7.053e-02 -4.142 3.44e-05 ***
## JobInvolvement -4.900e-01 1.066e-01 -4.596 4.31e-06 ***
## JobSatisfaction -2.963e-01 7.016e-02 -4.223 2.41e-05 ***
## MonthlyIncome -9.859e-05 2.646e-05 -3.726 0.000195 ***
## NumCompaniesWorked 1.144e-01 3.187e-02 3.590 0.000330 ***
## RelationshipSatisfaction -1.388e-01 7.185e-02 -1.932 0.053367 .
## StockOptionLevel -5.260e-01 1.042e-01 -5.051 4.40e-07 ***
## TrainingTimesLastYear -1.577e-01 6.200e-02 -2.544 0.010953 *
## WorkLifeBalance -2.314e-01 1.061e-01 -2.181 0.029191 *
## YearsInCurrentRole -1.034e-01 3.765e-02 -2.746 0.006036 **
## YearsSinceLastPromotion 1.729e-01 3.501e-02 4.937 7.92e-07 ***
## YearsWithCurrManager -9.769e-02 3.698e-02 -2.642 0.008240 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1298.6 on 1469 degrees of freedom
## Residual deviance: 1078.3 on 1455 degrees of freedom
## AIC: 1108.3
##
## Number of Fisher Scoring iterations: 5
The RelationshipSatisfaction variable is still not
significant at the 0.05 level, so let’s remove it and fit one more model,
logit9.
# Create ninth model as logit9
logit9 <- glm(Attrition ~ Age + DistanceFromHome + EnvironmentSatisfaction +
JobInvolvement + JobSatisfaction + MonthlyIncome +
NumCompaniesWorked +
StockOptionLevel + TrainingTimesLastYear +
WorkLifeBalance + YearsInCurrentRole +
YearsSinceLastPromotion + YearsWithCurrManager, family = "binomial",
data = df.attrition)
# Call logit9 using the summary function to see the estimation results
summary(logit9)
##
## Call:
## glm(formula = Attrition ~ Age + DistanceFromHome + EnvironmentSatisfaction +
## JobInvolvement + JobSatisfaction + MonthlyIncome + NumCompaniesWorked +
## StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance +
## YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager,
## family = "binomial", data = df.attrition)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4148 -0.6044 -0.3957 -0.2069 3.1739
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.334e+00 6.381e-01 6.792 1.10e-11 ***
## Age -4.021e-02 1.052e-02 -3.821 0.000133 ***
## DistanceFromHome 3.194e-02 9.212e-03 3.467 0.000526 ***
## EnvironmentSatisfaction -2.946e-01 7.035e-02 -4.188 2.81e-05 ***
## JobInvolvement -4.989e-01 1.062e-01 -4.697 2.65e-06 ***
## JobSatisfaction -2.947e-01 7.002e-02 -4.208 2.57e-05 ***
## MonthlyIncome -9.766e-05 2.640e-05 -3.699 0.000217 ***
## NumCompaniesWorked 1.128e-01 3.180e-02 3.548 0.000388 ***
## StockOptionLevel -5.136e-01 1.036e-01 -4.960 7.06e-07 ***
## TrainingTimesLastYear -1.576e-01 6.189e-02 -2.547 0.010872 *
## WorkLifeBalance -2.342e-01 1.059e-01 -2.211 0.027049 *
## YearsInCurrentRole -9.996e-02 3.738e-02 -2.674 0.007488 **
## YearsSinceLastPromotion 1.679e-01 3.476e-02 4.831 1.36e-06 ***
## YearsWithCurrManager -9.610e-02 3.675e-02 -2.615 0.008925 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1298.6 on 1469 degrees of freedom
## Residual deviance: 1082.0 on 1456 degrees of freedom
## AIC: 1110
##
## Number of Fisher Scoring iterations: 5
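The nine manual refits above follow a backward-elimination pattern: fit, drop one non-significant variable, refit. As a rough sketch of how that loop could be automated, the function below repeatedly removes the predictor with the largest p-value until every remaining one falls below a threshold. It runs on simulated data rather than the attrition dataset. The names `backward_eliminate`, `sim`, and `alpha` are hypothetical, the rule of always dropping the largest p-value is an assumption (the manual steps did not always pick the largest), and the code assumes numeric predictors so that coefficient names match column names.

```r
# Sketch: backward elimination by p-value on simulated data (not the attrition data)
set.seed(42)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# Only x1 and x2 truly influence the outcome; x3 and x4 are pure noise
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

# Hypothetical helper: refit, drop the least significant predictor, repeat
backward_eliminate <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    fit <- glm(reformulate(predictors, response = response),
               family = "binomial", data = data)
    # Coefficient table without the intercept row
    tab <- summary(fit)$coefficients[-1, , drop = FALSE]
    pvals <- setNames(tab[, "Pr(>|z|)"], rownames(tab))
    worst <- names(which.max(pvals))
    # Stop when everything is significant or only one predictor is left
    if (pvals[[worst]] <= alpha || length(predictors) == 1) {
      return(fit)
    }
    predictors <- setdiff(predictors, worst)
  }
}

final_model <- backward_eliminate(sim, "y", c("x1", "x2", "x3", "x4"))
kept <- attr(terms(final_model), "term.labels")
print(kept)
```

Base R's `step()` performs a similar search, but it uses AIC rather than p-values as the criterion for dropping terms.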
In logit9, all independent variables are finally significant. Now we need to
choose the best logit model by comparing the AIC value of each model.
AIC (Akaike Information Criterion) is a metric used to compare candidate models fitted to the same data. It rewards goodness of fit and penalizes the number of parameters, so a lower AIC indicates a better trade-off between fit and complexity. The following code collects the AIC values of all nine models for comparison.
# Create a vector that contains previously created models as logit_model
logit_model <- c("logit1", "logit2", "logit3", "logit4", "logit5", "logit6", "logit7",
"logit8", "logit9")
# Create a vector containing the AIC values of each model
AIC <- c(logit1$aic, logit2$aic, logit3$aic, logit4$aic, logit5$aic, logit6$aic, logit7$aic,
logit8$aic, logit9$aic)
# Create a data frame containing the logit_model vector and AIC as criteria
criteria <- data.frame(logit_model, AIC)
# Print the data frame `criteria`
criteria
## logit_model AIC
## 1 logit1 1116.786
## 2 logit2 1114.805
## 3 logit3 1112.819
## 4 logit4 1111.014
## 5 logit5 1109.719
## 6 logit6 1108.584
## 7 logit7 1108.376
## 8 logit8 1108.273
## 9 logit9 1109.996
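As a side note on where these numbers come from: AIC is defined as 2k - 2 ln(L), where k is the number of estimated coefficients and L is the maximized likelihood. For a binomial model with a 0/1 response this works out to the residual deviance plus 2k, which agrees with the summaries above (for logit8, 1078.3 + 2 × 15 ≈ 1108.3). The sketch below checks that identity on simulated data; `sim`, `fit_small`, and `fit_large` are hypothetical names, not objects from this analysis.

```r
# Sketch: AIC for a 0/1 logistic regression equals residual deviance + 2 * k
set.seed(1)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.3 + 0.9 * sim$x1))

fit_small <- glm(y ~ x1, family = "binomial", data = sim)
fit_large <- glm(y ~ x1 + x2, family = "binomial", data = sim)

# k = number of estimated coefficients, including the intercept
k <- length(coef(fit_large))
manual_aic <- deviance(fit_large) + 2 * k

# The built-in AIC() should agree with the manual calculation
print(c(manual = manual_aic, builtin = AIC(fit_large)))

# Comparing several models at once from a named list
models <- list(fit_small = fit_small, fit_large = fit_large)
aic_table <- data.frame(model = names(models), AIC = sapply(models, AIC))
print(aic_table)
```

Keeping the models in a named list, as in the last step, avoids typing each model name twice when the comparison table is built.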
From these results, logit8 has the lowest AIC value, about 1108.273,
so we treat it as the most suitable of the models compared, even though
its RelationshipSatisfaction term is only significant at the 0.1 level.
Next, we check the goodness of fit of logit8.
Goodness of fit in logistic regression refers to how well the fitted model matches the observed data, that is, how well the model explains the variation in the actual outcomes. There are several ways to test this; in this article, I will use the deviance, through a likelihood ratio test against a model that contains only the intercept.
# Create a null model with only the intercept
null_model <- glm(Attrition ~ 1, data = df.attrition, family = "binomial")
# Performing a likelihood ratio test between the null model and the complex model
lr_test <- anova(null_model, logit8, test = "Chisq")
# Print the results of the likelihood ratio test
lr_test
## Analysis of Deviance Table
##
## Model 1: Attrition ~ 1
## Model 2: Attrition ~ Age + DistanceFromHome + EnvironmentSatisfaction +
## JobInvolvement + JobSatisfaction + MonthlyIncome + NumCompaniesWorked +
## RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear +
## WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion +
## YearsWithCurrManager
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1469 1298.6
## 2 1455 1078.3 14 220.31 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results of the “Analysis of Deviance Table” show that Model 2 (with a number of additional predictor variables) is significantly better at explaining variation in the Attrition data compared to Model 1 (base model). The addition of predictor variables increases the model’s ability to predict the possibility of Attrition. Therefore, Model 2 is recommended for Attrition analysis and prediction.
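The test statistic in this table is simply the drop in deviance between the two models, compared against a chi-squared distribution whose degrees of freedom equal the number of added predictors. A small sketch reproducing the p-value from the rounded figures printed above (the variable names are illustrative):

```r
# Values copied (rounded) from the Analysis of Deviance Table above
null_deviance <- 1298.6
resid_deviance <- 1078.3
df_diff <- 1469 - 1455   # 14 added predictors

# Likelihood ratio statistic: drop in deviance
lr_stat <- null_deviance - resid_deviance

# Upper-tail chi-squared probability
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
print(c(lr_stat = lr_stat, df = df_diff, p_value = p_value))
```

Because the inputs are rounded, the statistic comes out as 220.3 rather than the 220.31 reported by `anova()`; the conclusion is unchanged.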
Next, print the summary of logit8 again and interpret it.
summary(logit8)
##
## Call:
## glm(formula = Attrition ~ Age + DistanceFromHome + EnvironmentSatisfaction +
## JobInvolvement + JobSatisfaction + MonthlyIncome + NumCompaniesWorked +
## RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear +
## WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion +
## YearsWithCurrManager, family = "binomial", data = df.attrition)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4986 -0.6117 -0.3896 -0.2054 3.1520
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.649e+00 6.620e-01 7.023 2.17e-12 ***
## Age -3.906e-02 1.056e-02 -3.701 0.000215 ***
## DistanceFromHome 3.184e-02 9.225e-03 3.451 0.000558 ***
## EnvironmentSatisfaction -2.922e-01 7.053e-02 -4.142 3.44e-05 ***
## JobInvolvement -4.900e-01 1.066e-01 -4.596 4.31e-06 ***
## JobSatisfaction -2.963e-01 7.016e-02 -4.223 2.41e-05 ***
## MonthlyIncome -9.859e-05 2.646e-05 -3.726 0.000195 ***
## NumCompaniesWorked 1.144e-01 3.187e-02 3.590 0.000330 ***
## RelationshipSatisfaction -1.388e-01 7.185e-02 -1.932 0.053367 .
## StockOptionLevel -5.260e-01 1.042e-01 -5.051 4.40e-07 ***
## TrainingTimesLastYear -1.577e-01 6.200e-02 -2.544 0.010953 *
## WorkLifeBalance -2.314e-01 1.061e-01 -2.181 0.029191 *
## YearsInCurrentRole -1.034e-01 3.765e-02 -2.746 0.006036 **
## YearsSinceLastPromotion 1.729e-01 3.501e-02 4.937 7.92e-07 ***
## YearsWithCurrManager -9.769e-02 3.698e-02 -2.642 0.008240 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1298.6 on 1469 degrees of freedom
## Residual deviance: 1078.3 on 1455 degrees of freedom
## AIC: 1108.3
##
## Number of Fisher Scoring iterations: 5
The stars next to each coefficient indicate how statistically significant the independent variable is for the dependent variable: the more stars, the smaller the p-value. The following is an interpretation of the influence of each variable: click here.
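One common way to read these estimates is to exponentiate them into odds ratios. Holding the other variables fixed, a value above 1 means each one-unit increase in the variable raises the odds of attrition, and a value below 1 means it lowers them. The sketch below does this for three coefficients copied from the logit8 summary above; with the fitted object at hand, `exp(coef(logit8))` would do the same for every term. The object names are illustrative.

```r
# Estimates copied from the summary(logit8) output above
estimates <- c(
  StockOptionLevel        = -0.5260,
  YearsSinceLastPromotion =  0.1729,
  DistanceFromHome        =  0.03184
)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(estimates)

# Percentage change in the odds of attrition per one-unit increase
pct_change <- (odds_ratios - 1) * 100
print(round(cbind(odds_ratios, pct_change), 3))
```

The sign of each estimate already tells the direction of the effect; the odds ratio adds a sense of its size on a scale that is easier to communicate.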
Next, we will evaluate the model’s classification results using a confusion matrix.
A confusion matrix is a table used to evaluate the performance of predictive models, such as logistic regression or machine learning models, in distinguishing between categories or classes. In the code below, the rows are the actual Attrition values and the columns are the predictions, obtained by rounding the fitted probabilities (a 0.5 cutoff).
table(TRUE == df.attrition$Attrition, pred = round(fitted(logit8)))
## pred
## 0 1
## FALSE 1215 18
## TRUE 190 47
Interpretation: click here.
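Several standard summary metrics can be read off this table. The sketch below rebuilds the matrix from the counts printed above and computes accuracy, sensitivity, specificity, and precision; the object names are illustrative.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
cm <- matrix(c(1215, 190, 18, 47), nrow = 2,
             dimnames = list(actual = c("FALSE", "TRUE"), pred = c("0", "1")))

tn <- cm["FALSE", "0"]; fp <- cm["FALSE", "1"]
fn <- cm["TRUE", "0"];  tp <- cm["TRUE", "1"]

accuracy    <- (tp + tn) / sum(cm)   # share of all employees classified correctly
sensitivity <- tp / (tp + fn)        # share of actual leavers the model catches
specificity <- tn / (tn + fp)        # share of actual stayers classified as staying
precision   <- tp / (tp + fp)        # share of predicted leavers who actually left

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 3))
```

On these counts, accuracy is about 86% while sensitivity is only about 20%, so the model identifies employees who stay far more reliably than those who leave.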
After the analysis, the next step is to visualize several of the variables we have reviewed. Visualizations make patterns and trends in the data easier to understand, and they are a crucial tool for conveying findings and insights to stakeholders and team members.
Instead of just looking at a table of numbers, a chart lets us see patterns, trends, and relationships among the data at a glance, including relationships that may not be visible in the numbers alone, which in turn facilitates decision-making. In this stage, I will show how to create informative graphs from the attrition data.
Before starting, let’s add an AttritionStatus variable with Yes/No labels,
because we previously changed the values of the Attrition
variable.
df.attrition <- df.attrition %>%
mutate(AttritionStatus = ifelse(Attrition == 1, "Yes", "No"))
Now we are ready to visualize the data.
# Count attrition value
attrition.count <- df.attrition %>%
count(AttritionStatus)
# Create attrition distribution pie chart
ggplot(attrition.count, aes(x = "", y = n, fill = AttritionStatus)) +
geom_bar(stat = "identity", width = 1) +
geom_text(aes(label = scales::percent(n / sum(n))), position = position_stack(vjust = 0.5)) +
coord_polar(theta = "y") +
labs(title = "Attrition Distribution (Percentage)", x = "", y = "") +
theme(plot.title = element_text(hjust = 0.5))
Insight: click here.
ggplot(df.attrition, aes(x = Gender, fill = AttritionStatus)) +
geom_bar(position = "stack") +
labs(title = "Attrition Status by Gender") +
# Stack the labels like the bars so each count sits inside its own segment
geom_text(aes(label = after_stat(count)), stat = "count",
position = position_stack(vjust = 0.5)) +
theme(plot.title = element_text(hjust = 0.5))
Insight: click here.
# Calculates the number of "Yes" Attrition and percentages within each Job Role
jobrole.data <- df.attrition %>%
group_by(JobRole) %>%
summarise(AttritionYesCount = sum(AttritionStatus == "Yes"),
TotalCount = n(),
AttritionPercentage = (AttritionYesCount / TotalCount) * 100)
# Create a bar chart with Attrition percentage "Yes"
ggplot(jobrole.data %>%
arrange(desc(AttritionPercentage)),
aes(y = reorder(JobRole, AttritionPercentage),
x = AttritionPercentage, fill = JobRole)) +
geom_bar(stat = "identity") +
labs(title = "Turnover Rate by Job Role", subtitle = "From Highest to Lowest",
y = "Job Role", x = "Turnover Rate") +
geom_text(aes(label = scales::percent(AttritionPercentage / 100, accuracy = 0.01),
x = AttritionPercentage),
hjust = -0.2) +
scale_x_continuous(limits = c(0, 100)) +
scale_y_discrete() +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
Insight: click here.
ggplot(df.attrition, aes(x = DistanceFromHome, fill = AttritionStatus)) +
geom_density(alpha = 0.5) +
labs(title = "Distance From Home Distribution",
x = "Distance From Home", y = "Density") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_continuous(limits = c(0, 30))
Insight: click here.
# Calculates the number of "Yes" Attrition and percentages within each Marital Status
marital.data <- df.attrition %>%
group_by(MaritalStatus) %>%
summarise(AttritionYesCount = sum(AttritionStatus == "Yes"),
TotalCount = n(),
AttritionPercentage = (AttritionYesCount / TotalCount) * 100) %>%
arrange(desc(AttritionPercentage))
# Create a bar chart with Attrition percentage "Yes"
ggplot(marital.data, aes(y = MaritalStatus,
x = AttritionPercentage, fill = MaritalStatus)) +
geom_bar(stat = "identity") +
labs(title = "Turnover Rate by Marital Status", subtitle = "From Highest to Lowest",
y = "Marital Status", x = "Turnover Rate") +
geom_text(aes(label = scales::percent(AttritionPercentage / 100, accuracy = 0.01),
x = AttritionPercentage),
hjust = -0.2) +
scale_x_continuous(limits = c(0, 100)) +
scale_y_discrete() +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
Insight: click here.
ggplot(df.attrition, aes(x = BusinessTravel, fill = AttritionStatus)) +
geom_bar() +
# Stack the labels like the bars so each count sits inside its own segment
geom_text(aes(label = after_stat(count)), stat = "count",
position = position_stack(vjust = 0.5)) +
labs(title = "Attrition Status by Business Travel") +
theme(plot.title = element_text(hjust = 0.5))
Insight: click here.
# Calculates the number of "Yes" Attrition and percentages within each Education Level
education.data <- df.attrition %>%
group_by(EducationLevel) %>%
summarise(AttritionYesCount = sum(AttritionStatus == "Yes"),
TotalCount = n(),
AttritionPercentage = (AttritionYesCount / TotalCount) * 100)
# Create a bar chart with Attrition percentage "Yes"
ggplot(education.data, aes(y = reorder(EducationLevel, AttritionPercentage),
x = AttritionPercentage, fill = EducationLevel)) +
geom_bar(stat = "identity") +
labs(title = "Turnover Rate by Education Level", subtitle = "From Highest to Lowest",
y = "Education Level", x = "Turnover Rate") +
geom_text(aes(label = scales::percent(AttritionPercentage / 100, accuracy = 0.01),
x = AttritionPercentage),
hjust = -0.2) +
scale_x_continuous(limits = c(0, 100)) +
scale_y_discrete() +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
Insight: click here.
# Calculates the number of "Yes" Attrition and percentages within each Overtime
overtime.data <- df.attrition %>%
group_by(OverTime) %>%
summarise(AttritionYesCount = sum(AttritionStatus == "Yes"),
TotalCount = n(),
AttritionPercentage = (AttritionYesCount / TotalCount) * 100)
# Create a bar chart with Attrition percentage "Yes"
ggplot(overtime.data, aes(y = reorder(OverTime, -AttritionPercentage),
x = AttritionPercentage, fill = OverTime)) +
geom_bar(stat = "identity") +
labs(title = "Turnover Rate by Over Time",
y = "Over Time", x = "Turnover Rate") +
geom_text(aes(label = scales::percent(AttritionPercentage / 100, accuracy = 0.01),
x = AttritionPercentage),
vjust = -0.5) +
scale_x_continuous(limits = c(0, 100)) +
scale_y_discrete() +
theme(plot.title = element_text(hjust = 0.5)) +
coord_flip()
Insight: click here.
# Calculates the number of "Yes" Attrition and percentages within each number of companies worked
companiesworked.data <- df.attrition %>%
group_by(NumCompaniesWorked) %>%
summarise(AttritionYesCount = sum(AttritionStatus == "Yes"),
TotalCount = n(),
AttritionPercentage = (AttritionYesCount / TotalCount) * 100)
# Create a bar chart with Attrition percentage "Yes"
ggplot(companiesworked.data, aes(y = reorder(NumCompaniesWorked, AttritionPercentage),
x = AttritionPercentage, fill = as.character(NumCompaniesWorked))) +
geom_bar(stat = "identity") +
labs(title = "Turnover Rate by Number of Companies Worked", fill = "Number of Companies Worked",
y = "Number of Companies Worked", x = "Turnover Rate") +
geom_text(aes(label = scales::percent(AttritionPercentage / 100, accuracy = 0.01),
x = AttritionPercentage),
hjust = -0.2) +
scale_x_continuous(limits = c(0, 100)) +
scale_y_discrete() +
theme(plot.title = element_text(hjust = 0.5))
Insight: click here.
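The group-and-summarise step above is repeated almost verbatim for each grouping variable. As one possible refactoring, a small helper could take the column name as an argument. The sketch below uses made-up toy data in place of df.attrition; `attrition_rate_by` and `toy` are hypothetical names, and the code assumes dplyr 1.0 or later for `across()` and the `.groups` argument.

```r
library(dplyr)

# Toy data standing in for df.attrition (values are made up for illustration)
toy <- data.frame(
  OverTime = c("Yes", "Yes", "No", "No", "No", "Yes"),
  AttritionStatus = c("Yes", "No", "No", "No", "Yes", "Yes")
)

# Hypothetical helper: turnover rate (%) within each level of one grouping column
attrition_rate_by <- function(data, group_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarise(AttritionYesCount = sum(AttritionStatus == "Yes"),
              TotalCount = n(),
              AttritionPercentage = (AttritionYesCount / TotalCount) * 100,
              .groups = "drop") %>%
    arrange(desc(AttritionPercentage))
}

overtime.rates <- attrition_rate_by(toy, "OverTime")
print(overtime.rates)
```

With such a helper, each chart would only need a call like `attrition_rate_by(df.attrition, "JobRole")` before the `ggplot()` code.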