This data set was obtained from kaggle.com. It contains information on several thousand employees from an unnamed company. Since no details are given about the company, it cannot be said exactly how the data were collected. The response variable we are trying to predict is whether or not an employee stays at or leaves the company, and the explanatory variables describe each subject's work life. For my logistic regression analysis, a variable of particular interest is “satisfaction_level”, the employee's self-reported satisfaction level.
This data set contains five numeric variables, three binary variables, and two categorical variables: the numeric variables are satisfaction_level, last_evaluation, number_project, average_montly_hours, and time_spend_company; the binary variables are left (the response), Work_accident, and promotion_last_5years; and the categorical variables are Department and salary.
For this study, we want to identify which factors about an employee’s work life indicate they will leave the company.
First, we download the data and look for missing values.
## packages used throughout the analysis
library(MASS)     # stepAIC()
library(psych)    # pairs.panels()
library(dplyr)    # select()
library(knitr)    # kable()
library(pROC)     # roc(), auc()
hr_data <- read.csv("https://raw.githubusercontent.com/AvaDeSt/STA-321/refs/heads/main/HR_comma_sep.csv", header = TRUE)
hr.0 = hr_data
hr_d = na.omit(hr.0)            ## drop rows with missing values
duplicates <- duplicated(hr_d)  ## flag duplicated rows
left <- unique(hr_d)            ## keep one copy of each row
There are no missing values in the data set, but there are 3,008 duplicated rows, so those were removed. This leaves us with 11,991 observations.
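These counts can be double-checked directly; the quick check below simply recomputes the number of missing values, the number of duplicated rows, and the size of the cleaned data set.
sum(is.na(hr_d))       ## total number of missing values (0)
sum(duplicated(hr_d))  ## number of duplicated rows (3,008)
dim(left)              ## 11,991 observations remain after removing duplicates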
Here, we make a pairwise scatter plot to check for potential issues with the variables and to see whether any of them are correlated.
data.num <- select(left, "satisfaction_level", "last_evaluation", "number_project", "average_montly_hours", "time_spend_company")
## pairwise scatter plots, correlations, and histograms of the numeric variables
pairs.panels(data.num,
             method = "pearson",
             hist.col = "#00AFBB",
             density = TRUE,
             ellipses = TRUE
             )
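As a quick numerical complement to the plot, the pairwise Pearson correlations of the same variables can also be printed as a matrix:
round(cor(data.num), 2)  ## correlation matrix of the numeric variables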
We can see from the graphs that the variable for the number of years spent at the company is skewed. Here is a closer look with a histogram:
hist(left$time_spend_company, xlab = "Years at the company", main = "")
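A frequency table of the raw variable makes the skewness, and the choice of groups used below, easier to see:
table(left$time_spend_company)  ## counts of employees by years at the company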
To fix this, I am going to discretize “time_spend_company” based on the histogram.
## discretize years at the company into three groups
time = left$time_spend_company
grp.time = time
grp.time[time %in% c(2:4)] = "2-4"
grp.time[time %in% c(5:7)] = "5-7"
grp.time[time %in% c(8:10)] = "8-10"
left$grp.time = grp.time
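As a quick sanity check on the grouping, the new variable can be tabulated against the response:
table(left$grp.time, left$left)  ## tenure group vs. whether the employee left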
There is a small correlation between the number of projects and the average monthly hours an employee works, but since the two variables are not too similar, both will be kept for the time being. There is no need to transform any of the variables since we are only doing association analysis.
To improve predictive power and reduce potential bias, the numeric variables in this data set will be standardized. This no longer includes the variable for time spent at the company, since it has been discretized.
## standardize the remaining numeric variables (mean 0, standard deviation 1)
left$sd.last_evaluation = (left$last_evaluation - mean(left$last_evaluation))/sd(left$last_evaluation)
left$sd.satisfaction_level = (left$satisfaction_level - mean(left$satisfaction_level))/sd(left$satisfaction_level)
left$sd.number_project = (left$number_project - mean(left$number_project))/sd(left$number_project)
left$sd.average_montly_hours = (left$average_montly_hours - mean(left$average_montly_hours))/sd(left$average_montly_hours)
## keep a copy of just the standardized predictors
sd.left = left[, c("sd.satisfaction_level", "sd.last_evaluation", "sd.number_project", "sd.average_montly_hours")]
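A quick check confirms that a standardized variable has mean 0 and standard deviation 1; for example, for the satisfaction level:
round(c(mean = mean(left$sd.satisfaction_level), sd = sd(left$sd.satisfaction_level)), 3)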
Here, the data is randomly split into two subsets: 70% of the data will be used as training data. The training data will be used to generate the candidate models, evaluate them, and eventually choose the best one.
set.seed(2024)   ## seed value is arbitrary; set only so the random split is reproducible
n <- dim(left)[1]
train.n <- round(0.7*n)
train.id <- sample(1:n, train.n, replace = FALSE)
train <- left[train.id, ]
test <- left[-train.id, ]
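A quick look at the resulting subset sizes and the proportion of employees who left in each subset confirms that the split behaved as intended:
c(train = nrow(train), test = nrow(test))                      ## sizes of the two subsets
round(c(train = mean(train$left), test = mean(test$left)), 3)  ## proportion of leavers in each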
Now that the variables have been standardized, they will be used to build three candidate models: a full model, a reduced model, and a stepwise model. We will then find the average predictive accuracy (the proportion of correct predictions) of each model. 8-fold cross-validation will be used since the data set is large.
k = 8
## floor() must be used to avoid producing NA in the subsequent results
fold.size = floor(dim(train)[1]/k)
## vectors to store the predictive accuracy of each candidate model
PE1 = rep(0, k)
PE2 = rep(0, k)
PE3 = rep(0, k)
for(i in 1:k){
  ## validation and training folds for this iteration
  valid.id = (fold.size*(i-1)+1):(fold.size*i)
  valid = train[valid.id, ]
  train.dat = train[-valid.id, ]
  ## full model
  candidate01 = glm(left ~ grp.time + sd.satisfaction_level + sd.number_project + sd.average_montly_hours + sd.last_evaluation + salary + Work_accident + promotion_last_5years + Department,
                    family = binomial(link = "logit"),
                    data = train.dat)
  ## reduced model
  candidate03 = glm(left ~ grp.time + sd.satisfaction_level + sd.last_evaluation + sd.average_montly_hours + sd.number_project + salary,
                    family = binomial(link = "logit"),
                    data = train.dat)
  ## forward stepwise selection within the reduced-to-full scope
  candidate02 = stepAIC(candidate01,
                        scope = list(lower = formula(candidate03), upper = formula(candidate01)),
                        direction = "forward",
                        trace = 0)
  ## predicted probabilities on the validation fold
  pred01 = predict(candidate01, newdata = valid, type = "response")
  pred02 = predict(candidate02, newdata = valid, type = "response")
  pred03 = predict(candidate03, newdata = valid, type = "response")
  ## classify with a 0.5 cut-off
  pre.outcome01 = ifelse(as.vector(pred01) > 0.5, "1", "0")
  pre.outcome02 = ifelse(as.vector(pred02) > 0.5, "1", "0")
  pre.outcome03 = ifelse(as.vector(pred03) > 0.5, "1", "0")
  ## proportion of correct predictions (predictive accuracy) in the fold
  PE1[i] = sum(pre.outcome01 == valid$left)/length(pred01)
  PE2[i] = sum(pre.outcome02 == valid$left)/length(pred02)
  PE3[i] = sum(pre.outcome03 == valid$left)/length(pred03)
}
avg.pe = cbind(PE1 = mean(PE1), PE2 = mean(PE2), PE3 = mean(PE3))
kable(avg.pe, caption = "Average predictive accuracy of the candidate models")
PE1 | PE2 | PE3 |
---|---|---|
0.8205434 | 0.8205434 | 0.8138704 |
Candidate models 1 and 2 have the same average predictive accuracy, and model 2 is chosen as the final model since it is the simpler model.
Here, the actual accuracy of candidate model 2 is reported, based on the withheld test data. This is the statistic that would be shown to clients.
## predicted probabilities and class labels on the withheld test data
pred02 = predict(candidate02, newdata = test, type = "response")
pred02.outcome = ifelse(as.vector(pred02) > 0.5, "1", "0")
accuracy = sum(pred02.outcome == test$left)/length(pred02)
kable(accuracy, caption = "The actual accuracy of the final model")
Accuracy |
---|
0.8187378 |
The final model has an accuracy of about 0.819 on the test data.
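Beyond the single accuracy number, a confusion matrix on the test data, built from the same predictions, shows how the errors split between employees who stayed and those who left:
table(Predicted = pred02.outcome, Actual = test$left)  ## confusion matrix on the test data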
For this analysis, we will construct ROC curves to compare the global performance of our models. To build these curves, we first compute the true positive and false positive rates over a grid of cut-off probabilities.
TPR.FPR = function(pred){
  ## sequence of cut-off probabilities
  prob.seq = seq(0, 1, length = 50)
  pn = length(prob.seq)
  ## true labels (the predictions passed in are assumed to be on the training set)
  true.lab = as.vector(train$left)
  TPR = NULL
  FPR = NULL
  for (i in 1:pn){
    pred.lab = as.vector(ifelse(pred > prob.seq[i], "1", "0"))
    TPR[i] = length(which(true.lab == "1" & pred.lab == "1"))/length(which(true.lab == "1"))
    FPR[i] = length(which(true.lab == "0" & pred.lab == "1"))/length(which(true.lab == "0"))
  }
  cbind(FPR = FPR, TPR = TPR)
}
Here are the curves for the three candidate models.
## full model
candidate01 = glm(left ~ grp.time + sd.satisfaction_level + sd.average_montly_hours + sd.number_project + sd.last_evaluation + salary + Work_accident + promotion_last_5years + Department,
                  family = binomial(link = "logit"),
                  data = train)
## reduced model
candidate03 = glm(left ~ grp.time + sd.satisfaction_level + sd.average_montly_hours + sd.last_evaluation + salary + sd.number_project,
                  family = binomial(link = "logit"),
                  data = train)
## forward stepwise model, searching from the reduced model toward the full model
candidate02 = stepAIC(candidate03,
                      scope = list(lower = formula(candidate03), upper = formula(candidate01)),
                      direction = "forward",
                      trace = 0)
pred01 = predict.glm(candidate01, newdata = train, type="response")
pred02 = predict.glm(candidate02, newdata = train, type="response")
pred03 = predict.glm(candidate03, newdata = train, type="response")
plot(TPR.FPR(pred01)[,1], TPR.FPR(pred01)[,2],
type="l", col=2, lty=1, xlim=c(0,1), ylim=c(0,1),
xlab = "FPR: 1 - specificity",
ylab ="TPR: sensitivity",
main = "ROC curves of the three candidate models",
cex.main = 0.8,
col.main = "navy")
lines(TPR.FPR(pred02)[,1], TPR.FPR(pred02)[,2], col=3, lty=2)
lines(TPR.FPR(pred03)[,1], TPR.FPR(pred03)[,2], col=4, lty=3)
## ROC objects and AUC values from the pROC package
category = train$left == "1"
ROCobj01 <- roc(category, as.vector(pred01), quiet = TRUE)
ROCobj02 <- roc(category, as.vector(pred02), quiet = TRUE)
ROCobj03 <- roc(category, as.vector(pred03), quiet = TRUE)
AUC01 = round(auc(ROCobj01), 4)
AUC02 = round(auc(ROCobj02), 4)
AUC03 = round(auc(ROCobj03), 4)
##
legend("bottomright", c(paste("Full model: AUC = ",AUC01),
paste("Stepwise model: AUC =",AUC02),
paste("reduced model: AUC =", AUC03)),
col=2:4, lty=1:3, cex = 0.8, bty="n")
We can see from the graph that the full model and the stepwise model have the same AUC, and both perform better than the reduced model. Since the stepwise model is the simpler of the two, it is the final model chosen to report to the client.
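If the stepwise model is reported to the client, its coefficients can also be presented as odds ratios, which are easier to interpret than log-odds; a minimal sketch:
round(exp(coef(candidate02)), 3)  ## odds ratios of the final stepwise model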
For this analysis, we examined the work-life characteristics of employees at a company to see whether they can be used to predict if an employee will leave. After checking for missing and duplicated values, the variable for time spent at the company was discretized, since its distribution was skewed. We also standardized the remaining numeric variables to use in the model-building process. Three candidate models were built and compared by their average predictive accuracy from 8-fold cross-validation. The second (stepwise) model was chosen as the final model, with an accuracy of about 0.819 on the withheld test data. We also examined the ROC curves of the models, which confirmed that the stepwise model is the best one to report.