Introduction
This data set was obtained from kaggle.com. It contains information on
several thousand employees from an unnamed company. Since no details are
given about the company, it cannot be said exactly how the data was
collected. The response variable we are trying to predict is whether an
employee stays at or leaves the company. The explanatory variables
relate to each subject's work life. For my simple logistic regression
analysis, the variable that I am going to use is “satisfaction_level”,
the employee's self-reported satisfaction level.
Variable Description
- satisfaction_level (x1) - the employee's self-reported satisfaction
level. (Numeric from 0-1)
- last_evaluation (x2) - the employee's last performance review.
(Numeric from 0-1)
- number_project (x3) - the number of projects an employee has done
for the company. (Numeric)
- average_montly_hours (x4) - the average number of hours an employee
works per month; the column name is spelled this way in the data.
(Numeric)
- time_spend_company (x5) - how long the employee has worked at the
company in years. (Numeric)
- Work_accident (x6) - number of work-related accidents the employee
has had. (Numeric)
- promotion_last_5years (x7) - Has the employee had a promotion in the
last 5 years? (Binary: 1 = yes, 0 = no)
- Department (x8) - the department the employee is in.
(Categorical)
- salary (x9) - salary level. (Categorical)
- left (y) - whether the employee stays at or leaves the company.
(Binary: 0 = stay, 1 = leave)
Practical Question
For this study, we want to identify which factors about an employee’s
work life indicate they will leave the company.
Data Download
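library(MASS)   # stepAIC(); loaded before dplyr so that select() is not masked
library(dplyr)  # select()
library(psych)  # pairs.panels()
library(knitr)  # kable()
hr_data <- read.csv("https://raw.githubusercontent.com/AvaDeSt/STA-321/refs/heads/main/HR_comma_sep.csv", header = TRUE)
pred_vars <- select(hr_data, -left)  # explanatory variables only
hr.0 = hr_data
hr_d = na.omit(hr.0)                 # drop rows with missing values
duplicates <- duplicated(hr_d)       # TRUE for each repeat of an earlier row
sum(duplicates)                      # number of duplicate rows
hr <- unique(hr_d)                   # keep one copy of each row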
There are no missing values in the data set, but there are 3008
duplicate rows, so I removed them from the data set.
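The two cleaning checks can be illustrated on a small made-up data frame. This is only a sketch: `toy` and its values are hypothetical and are not taken from the HR data. `duplicated()` flags every repeat of an earlier row, so its sum is the number of rows that `unique()` drops.

```r
# Toy data frame (hypothetical values) containing two repeated rows
toy <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.38, 0.11, 0.80),
  left               = c(1,    0,    1,    1,    0)
)

n_missing    <- sum(!complete.cases(toy))  # rows with at least one NA
n_duplicates <- sum(duplicated(toy))       # repeats of an earlier row
toy_clean    <- unique(na.omit(toy))       # same two steps as in the report

n_missing
n_duplicates
nrow(toy) - nrow(toy_clean)                # equals n_duplicates when there are no NAs
```

Applied to the report's own objects, `sum(duplicated(hr_d))` should give the duplicate count quoted above.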
Exploratory Analysis
# data set of only the numeric variables
data.num <- dplyr::select(hr_data, "satisfaction_level", "last_evaluation", "number_project", "average_montly_hours", "time_spend_company")
pairs.panels(data.num,          # data.num has five columns, so no column needs dropping
             method = "pearson",
             hist.col = "#00AFBB",
             density = TRUE,
             ellipses = TRUE
)
We can see from the graphs that the variable for the number of years
spent at the company is skewed. Here is a closer look:
par(mfrow=c(1,2))
hist(hr$time_spend_company, xlab="Years at the company", main = "")
To fix this, I am going to discretize “time_spend_company” based on the
histogram.
time = hr$time_spend_company
grp.time = time
grp.time[time %in% c(2:4)] = "2-4"
grp.time[time %in% c(5:7)] = "5-7"
grp.time[time %in% c(8:10)] = "8-10"
hr$grp.time = grp.time
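The same grouping can also be written with `cut()`, which returns a factor with an explicit level order. The following is a minimal sketch on hypothetical tenure values, not the author's code; the break points are an assumption chosen to reproduce the three groups above.

```r
# Hypothetical tenure values covering the 2-10 year range used above
time <- c(2, 3, 4, 5, 6, 7, 8, 10)

# Right-closed intervals (1,4], (4,7], (7,10] map to the three groups
grp.time <- cut(time,
                breaks = c(1, 4, 7, 10),
                labels = c("2-4", "5-7", "8-10"))

table(grp.time)
levels(grp.time)[1]   # the first level is the reference category in glm()
```

With the character version used in the report, `glm()` converts the labels to a factor in alphabetical order, so "2-4" ends up as the reference category there as well.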
There is a moderate correlation between the number of projects and
the average monthly hours an employee has worked. Since the correlation
is not strong enough to make the two variables redundant, both will be
kept for the time being. There is no need to transform any of the
variables since we are only doing association analysis.
Model Building
full.model = glm(left ~ satisfaction_level + last_evaluation + number_project + average_montly_hours + Work_accident + promotion_last_5years + Department + salary + grp.time,
family = binomial(link = "logit"), # logit(p) = log(p/(1-p))!
data = hr)
kable(summary(full.model)$coef,
caption="Summary of inferential statistics of the full model")
Summary of inferential statistics of the full model
| Term | Estimate | Std. Error | z value | p-value |
|---|---|---|---|---|
| (Intercept) | -0.9867879 | 0.2450354 | -4.0271237 | 0.0000565 |
| satisfaction_level | -4.2525608 | 0.1242625 | -34.2224070 | 0.0000000 |
| last_evaluation | 0.5919079 | 0.1826236 | 3.2411352 | 0.0011905 |
| number_project | -0.3052465 | 0.0263733 | -11.5740741 | 0.0000000 |
| average_montly_hours | 0.0043584 | 0.0006329 | 6.8864733 | 0.0000000 |
| Work_accident | -1.4801817 | 0.1136691 | -13.0218516 | 0.0000000 |
| promotion_last_5years | -1.2925052 | 0.3859450 | -3.3489364 | 0.0008112 |
| Departmenthr | 0.0657740 | 0.1691216 | 0.3889154 | 0.6973387 |
| DepartmentIT | -0.0451906 | 0.1555504 | -0.2905204 | 0.7714181 |
| Departmentmanagement | -0.0866355 | 0.2069245 | -0.4186819 | 0.6754487 |
| Departmentmarketing | 0.0722944 | 0.1692217 | 0.4272169 | 0.6692214 |
| Departmentproduct_mng | -0.0514320 | 0.1672970 | -0.3074294 | 0.7585166 |
| DepartmentRandD | -0.4590236 | 0.1779200 | -2.5799431 | 0.0098817 |
| Departmentsales | 0.0447254 | 0.1318253 | 0.3392778 | 0.7344004 |
| Departmentsupport | 0.0902813 | 0.1396548 | 0.6464604 | 0.5179812 |
| Departmenttechnical | 0.0614581 | 0.1361925 | 0.4512593 | 0.6518027 |
| salarylow | 1.7717805 | 0.1653125 | 10.7177616 | 0.0000000 |
| salarymedium | 1.3334114 | 0.1665768 | 8.0047854 | 0.0000000 |
| grp.time5-7 | 1.3512536 | 0.0701784 | 19.2545459 | 0.0000000 |
| grp.time8-10 | -14.2440493 | 155.6796437 | -0.0914959 | 0.9270986 |
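For reference, the models in this section have the standard logistic regression form. The notation below is added as a reading aid and is not part of the original output: p is the probability that left = 1, and x_1, ..., x_k are the predictors, with dummy variables standing in for the levels of Department, salary, and grp.time.

```latex
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
p = P(\text{left} = 1 \mid x_1, \dots, x_k)
```

Under this form, exponentiating a coefficient gives the multiplicative change in the odds of leaving for a one-unit increase in that predictor, with the other predictors held fixed.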
Reduced Model
reduced.model = glm(left ~ satisfaction_level + last_evaluation + number_project + promotion_last_5years + average_montly_hours + Work_accident,
family = binomial(link = "logit"), # logit(p) = log(p/(1-p))!
data = hr)
kable(summary(reduced.model)$coef,
caption="Summary of inferential statistics of the reduced model")
Summary of inferential statistics of the reduced model
| Term | Estimate | Std. Error | z value | p-value |
|---|---|---|---|---|
| (Intercept) | 0.3371415 | 0.1487394 | 2.266659 | 0.0234110 |
| satisfaction_level | -4.1861392 | 0.1201123 | -34.851889 | 0.0000000 |
| last_evaluation | 0.7862953 | 0.1752611 | 4.486422 | 0.0000072 |
| number_project | -0.2339762 | 0.0250456 | -9.342022 | 0.0000000 |
| promotion_last_5years | -1.4115858 | 0.3708276 | -3.806582 | 0.0001409 |
| average_montly_hours | 0.0042296 | 0.0006067 | 6.970990 | 0.0000000 |
| Work_accident | -1.3338246 | 0.1078854 | -12.363349 | 0.0000000 |
Final Model
final.model.forward = stepAIC(reduced.model,
scope = list(lower=formula(reduced.model),upper=formula(full.model)),
direction = "forward",
trace = 0
)
kable(summary(final.model.forward)$coef,
caption="Summary of inferential statistics of the final model")
Summary of inferential statistics of the final model
| Term | Estimate | Std. Error | z value | p-value |
|---|---|---|---|---|
| (Intercept) | -0.9779961 | 0.2147754 | -4.553576 | 0.0000053 |
| satisfaction_level | -4.2413818 | 0.1239701 | -34.212943 | 0.0000000 |
| last_evaluation | 0.5927693 | 0.1823071 | 3.251488 | 0.0011480 |
| number_project | -0.3046392 | 0.0263297 | -11.570158 | 0.0000000 |
| promotion_last_5years | -1.3257470 | 0.3845411 | -3.447608 | 0.0005656 |
| average_montly_hours | 0.0043328 | 0.0006318 | 6.857459 | 0.0000000 |
| Work_accident | -1.4841619 | 0.1136100 | -13.063650 | 0.0000000 |
| grp.time5-7 | 1.3416190 | 0.0700726 | 19.146130 | 0.0000000 |
| grp.time8-10 | -14.2350334 | 156.2040875 | -0.091131 | 0.9273885 |
| salarylow | 1.7761701 | 0.1647064 | 10.783859 | 0.0000000 |
| salarymedium | 1.3355641 | 0.1660276 | 8.044227 | 0.0000000 |
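To make the `scope` argument of the forward search concrete, here is a minimal sketch on R's built-in `mtcars` data. The data set and the models `small` and `big` are assumptions chosen purely for illustration and have nothing to do with the HR data. `stepAIC()` starts from the smaller model and only considers adding terms that appear in `upper`.

```r
library(MASS)  # provides stepAIC()

# Hypothetical lower and upper models on the built-in mtcars data
small <- glm(am ~ wt,      family = binomial(link = "logit"), data = mtcars)
big   <- glm(am ~ wt + hp, family = binomial(link = "logit"), data = mtcars)

# Forward search: start at `small`, add a term from `big` only if it lowers AIC
chosen <- stepAIC(small,
                  scope = list(lower = formula(small), upper = formula(big)),
                  direction = "forward",
                  trace = 0)

formula(chosen)                     # always contains wt, may also contain hp
c(small = AIC(small), big = AIC(big), chosen = AIC(chosen))
```

Because the search never drops terms below `lower`, every variable in the starting model is guaranteed to remain in the result.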
Even though we discovered that the number of projects an employee has
and the average number of hours worked have a moderate correlation, the
final model still includes both variables. We can see that the final
model only takes out the variable “Department”. The model still keeps
the variable for the number of years spent at the company, “grp.time”,
even though the coefficient for the 8-10 year group is not significant
(p = 0.927). Its very large standard error (about 156) suggests that few
or no employees in that group left the company, so this coefficient
cannot be estimated reliably.
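One common reason for a large coefficient paired with a huge standard error is a category in which the outcome never occurs. The sketch below shows the pattern on made-up data; `toy` and its counts are hypothetical, and whether the HR data behaves the same way is an assumption that would need to be checked.

```r
# Made-up data: nobody in the "8-10" group leaves (left is always 0 there)
toy <- data.frame(
  grp  = rep(c("2-4", "5-7", "8-10"), times = c(40, 40, 20)),
  left = c(rep(c(0, 1), times = c(30, 10)),   # 2-4:  10 of 40 leave
           rep(c(0, 1), times = c(20, 20)),   # 5-7:  20 of 40 leave
           rep(0, 20))                        # 8-10:  0 of 20 leave
)

table(toy$grp, toy$left)   # the 8-10 row has a zero in the "1" column

# Warnings about fitted probabilities of 0 or 1 are expected here
fit <- suppressWarnings(glm(left ~ grp, family = binomial, data = toy))
round(summary(fit)$coef, 3)  # grp8-10: large negative estimate, very large SE
```

On the report's data, a cross-tabulation such as `table(hr$grp.time, hr$left)` would show whether the 8-10 group contains any employees who left.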
global.measure=function(s.logit){
dev.resid = s.logit$deviance
dev.0.resid = s.logit$null.deviance
aic = s.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
AIC = aic)
goodness
}
goodness=rbind(full.model = global.measure(full.model),
reduced.model=global.measure(reduced.model),
final.model=global.measure(final.model.forward))
row.names(goodness) = c("full.model", "reduced.model", "final.model")
kable(goodness, caption ="Comparison of global goodness-of-fit statistics")
Comparison of global goodness-of-fit statistics
| Model | Deviance.residual | Null.Deviance.Residual | AIC |
|---|---|---|---|
| full.model | 8395.920 | 10781.18 | 8435.920 |
| reduced.model | 9005.841 | 10781.18 | 9019.841 |
| final.model | 8413.683 | 10781.18 | 8435.683 |
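We can see that the final model has the lowest AIC (8435.683), although
it is only marginally lower than the full model's (8435.920). Since the
final model is also simpler than the full model, it is the one we will
use.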
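The AIC column can be reproduced from the deviance column. For a logistic regression on individual 0/1 outcomes, AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below redoes that arithmetic with the values printed above; the coefficient counts (20, 7, and 11) are read off the three coefficient tables.

```r
# Residual deviances copied from the goodness-of-fit table
deviance <- c(full = 8395.920, reduced = 9005.841, final = 8413.683)

# Number of estimated coefficients (rows of each coefficient table)
k <- c(full = 20, reduced = 7, final = 11)

aic <- deviance + 2 * k
aic                          # matches the AIC column
names(which.min(aic))        # model with the smallest AIC
aic["full"] - aic["final"]   # size of the gap between the full and final models
```

The gap between the full and final models is well under one AIC unit, so the preference for the final model rests mainly on its having nine fewer coefficients.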
Odds Ratio
model.coef.stats = summary(final.model.forward)$coef
odds.ratio = exp(coef(final.model.forward))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)
kable(out.stats,caption = "Summary Stats with Odds Ratios")
Summary Stats with Odds Ratios
| Term | Estimate | Std. Error | z value | p-value | odds.ratio |
|---|---|---|---|---|---|
| (Intercept) | -0.9779961 | 0.2147754 | -4.553576 | 0.0000053 | 0.3760639 |
| satisfaction_level | -4.2413818 | 0.1239701 | -34.212943 | 0.0000000 | 0.0143877 |
| last_evaluation | 0.5927693 | 0.1823071 | 3.251488 | 0.0011480 | 1.8089912 |
| number_project | -0.3046392 | 0.0263297 | -11.570158 | 0.0000000 | 0.7373894 |
| promotion_last_5years | -1.3257470 | 0.3845411 | -3.447608 | 0.0005656 | 0.2656045 |
| average_montly_hours | 0.0043328 | 0.0006318 | 6.857459 | 0.0000000 | 1.0043422 |
| Work_accident | -1.4841619 | 0.1136100 | -13.063650 | 0.0000000 | 0.2266923 |
| grp.time5-7 | 1.3416190 | 0.0700726 | 19.146130 | 0.0000000 | 3.8252314 |
| grp.time8-10 | -14.2350334 | 156.2040875 | -0.091131 | 0.9273885 | 0.0000007 |
| salarylow | 1.7761701 | 0.1647064 | 10.783859 | 0.0000000 | 5.9071891 |
| salarymedium | 1.3355641 | 0.1660276 | 8.044227 | 0.0000000 | 3.8021401 |
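The highest odds ratio belongs to the salarylow variable at 5.907.
This means that an employee with a low salary has about 5.9 times the
odds of leaving the company compared with an employee with a high salary
(the baseline category), holding the other variables fixed. (This does
not by itself mean that a low salary is the best indicator of whether an
employee leaves the company.) The variable for time has three
categories, with 2-4 years as the baseline. Employees with 5-7 years at
the company have about 3.8 times the odds of leaving compared with the
baseline. The odds ratio for the 8-10 year group is close to zero, but
that estimate is not significant (p = 0.927), so no firm conclusion can
be drawn for that group.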
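An odds ratio is easier to judge alongside an interval. The sketch below recomputes a few odds ratios from the estimates printed in the table and adds approximate 95% Wald limits, exp(estimate ± 1.96 × SE). The intervals are an addition for illustration and were not part of the original analysis.

```r
# Estimates and standard errors copied from the final-model table
est <- c(salarylow = 1.7761701, `grp.time5-7` = 1.3416190,
         satisfaction_level = -4.2413818)
se  <- c(salarylow = 0.1647064, `grp.time5-7` = 0.0700726,
         satisfaction_level = 0.1239701)

odds.ratio <- exp(est)                 # multiplicative change in the odds
lower      <- exp(est - 1.96 * se)     # approximate 95% Wald limits
upper      <- exp(est + 1.96 * se)
round(cbind(odds.ratio, lower, upper), 3)

# satisfaction_level runs from 0 to 1, so a 0.1-point change is easier to read
exp(0.1 * est["satisfaction_level"])
```

Rescaling matters for satisfaction_level: its odds ratio of 0.014 describes a move across the whole 0-1 scale, which is a much larger change than most employees would differ by.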
Summary and Conclusion
To summarize, the data set we did an association analysis on looks at
several factors associated with an employee leaving a company. The data
has nine explanatory variables. After discretising the variable
“time_spend_company” into three categories, we built a full model, a
reduced model, and a final model. All of the explanatory variables
except for “Department” were kept in the final model, which had the
lowest AIC. After calculating the odds ratios for each explanatory
variable in the final model, we found that employees with a low salary
had the highest odds of leaving relative to their baseline group, and
that employees with 5-7 years at the company had higher odds of leaving
than those with 2-4 years; the estimate for 8-10 years was not
significant. To conclude, most of the variables in the data set are
strongly associated with whether or not an employee leaves this
particular company. This might suggest that other companies should look
at things such as their employee satisfaction rate and employee
evaluation scores to predict whether employees will stay with them or
leave.
