Logistic regression is the most appropriate model to predict employee attrition (“Yes”/“No”) from variables such as age, salary, and years at the job. This is because logistic regression effectively models binary outcomes and allows for estimating the probability that an employee will leave, given their attributes.
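Concretely, logistic regression models the probability of leaving as

\[
P(\text{Attrition} = \text{Yes}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Salary} + \beta_3\,\text{YearsAtJob})}},
\]

where each coefficient \(\beta_j\) is the change in the log-odds of attrition per one-unit increase in the corresponding predictor.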
```r
# Simulate an employee dataset with Age, Salary, and YearsAtJob
set.seed(42)
n <- 200
employee_data <- data.frame(
  Age = sample(20:60, n, replace = TRUE),
  Salary = sample(20000:150000, n, replace = TRUE),
  YearsAtJob = sample(1:21, n, replace = TRUE)
)

# Generate attrition from a known logistic relationship
logit <- -30 + 0.1 * employee_data$Age + 0.0002 * employee_data$Salary - 0.3 * employee_data$YearsAtJob
prob <- 1 / (1 + exp(-logit))
employee_data$AttritionBinary <- rbinom(n, 1, prob)
employee_data$Attrition <- ifelse(employee_data$AttritionBinary == 1, "Yes", "No")
```
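Because attrition is generated from a steep logistic relationship, the two classes can be very unbalanced. A quick sketch for checking the simulated class balance (the counts depend on the random draw):

```r
# Inspect how many simulated leavers vs. stayers we ended up with
table(employee_data$Attrition)
```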
```r
# 70/30 train/test split, then fit the logistic regression on the training set
set.seed(123)
train_indices <- sample(seq_len(nrow(employee_data)), size = 0.7 * nrow(employee_data))
train_data <- employee_data[train_indices, ]
test_data <- employee_data[-train_indices, ]
model <- glm(AttritionBinary ~ Age + Salary + YearsAtJob, data = train_data, family = binomial())
```

```
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
```

The warning reflects near-perfect separation in the simulated data: the relationship is strong enough that some fitted probabilities are numerically 0 or 1.
```r
# Predict attrition probabilities on the test set and classify at a 0.5 threshold
probs <- predict(model, test_data, type = "response")
preds <- ifelse(probs > 0.5, "Yes", "No")
test_data$PredictedProb <- probs
test_data$PredictedAttrition <- preds
head(test_data[, c("Age", "Salary", "YearsAtJob", "Attrition", "PredictedProb", "PredictedAttrition")])
```
```
##    Age Salary YearsAtJob Attrition PredictedProb PredictedAttrition
## 2   20  99414         13        No  2.307048e-10                 No
## 3   44 104990         20        No  4.003740e-08                 No
## 10  44  80554         13        No  4.130856e-11                 No
## 11  56  96049          7        No  2.147062e-06                 No
## 15  60  46812          5        No  2.220446e-16                 No
## 18  55  82549          7        No  9.565932e-09                 No
```
```r
library(caret)
```

```
## Warning: package 'caret' was built under R version 4.5.2
## Loading required package: ggplot2
## Loading required package: lattice
```
```r
# Confusion matrix of predicted vs. actual attrition on the test set
confusionMatrix(as.factor(test_data$PredictedAttrition), as.factor(test_data$Attrition))
```
```
## Confusion Matrix and Statistics
##
##           Reference
## Prediction No Yes
##        No  59   0
##        Yes  0   1
##
##                Accuracy : 1
##                  95% CI : (0.9404, 1)
##     No Information Rate : 0.9833
##     P-Value [Acc > NIR] : 0.3648
##
##                   Kappa : 1
##
##  Mcnemar's Test P-Value : NA
##
##             Sensitivity : 1.0000
##             Specificity : 1.0000
##          Pos Pred Value : 1.0000
##          Neg Pred Value : 1.0000
##              Prevalence : 0.9833
##          Detection Rate : 0.9833
##    Detection Prevalence : 0.9833
##       Balanced Accuracy : 1.0000
##
##        'Positive' Class : No
##
```
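Note that caret treats "No" as the positive class because it is the first factor level alphabetically. A minimal sketch of the same comparison with "Yes" (the leavers) as the positive class, using the `positive` argument of `confusionMatrix()`:

```r
# Treat "Yes" as the positive class so sensitivity measures how well leavers are caught
confusionMatrix(factor(test_data$PredictedAttrition, levels = c("No", "Yes")),
                factor(test_data$Attrition, levels = c("No", "Yes")),
                positive = "Yes")
```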
```r
library(pROC)
```

```
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
##     cov, smooth, var
```
```r
# ROC curve and AUC based on the predicted probabilities for the test set
roc_obj <- roc(test_data$AttritionBinary, test_data$PredictedProb)
```

```
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
```

```r
plot(roc_obj)
auc(roc_obj)
```

```
## Area under the curve: 1
```
Model performance is assessed with the confusion matrix and with the ROC curve and its AUC. On this simulated test set, accuracy and AUC are both a perfect 1, but the result should be read with care: the No Information Rate is 0.9833 (only one employee in the test set actually left), so the accuracy is not significantly better than always predicting "No" (P-Value [Acc > NIR] = 0.3648), and the separation warning during fitting points to the same artificially clean relationship. Cross-validation gives a more reliable estimate of out-of-sample performance.
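A minimal sketch of k-fold cross-validation with caret's `train()`, assuming the `employee_data` object and loaded `caret` package from above (the factor conversion and the fold count of 5 are illustrative choices):

```r
# 5-fold cross-validation of the same logistic regression with caret
employee_data$Attrition <- factor(employee_data$Attrition, levels = c("No", "Yes"))
cv_model <- train(Attrition ~ Age + Salary + YearsAtJob,
                  data = employee_data,
                  method = "glm",
                  family = binomial(),
                  trControl = trainControl(method = "cv", number = 5))
cv_model$results  # cross-validated accuracy and kappa
```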
Logistic regression is specifically designed for binary classification, making it ideal for “Yes”/“No” attrition prediction. It provides interpretable coefficients, allows for probability threshold adjustments, is robust to various predictor types, and is widely applicable in business contexts.
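For example, the fitted coefficients can be read as odds ratios, and the decision threshold can be lowered to flag more potential leavers. A brief sketch using the `model`, `probs`, and `test_data` objects from above (the 0.3 cutoff is an arbitrary illustration):

```r
# Odds ratios: multiplicative change in the odds of leaving per one-unit increase
exp(coef(model))

# Re-classify with a lower threshold to catch more likely leavers
preds_low <- ifelse(probs > 0.3, "Yes", "No")
table(Predicted = preds_low, Actual = test_data$Attrition)
```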
The area of a rectangle is given by \(A = l \times w\), where \(l\) is the length and \(w\) is the width.
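For example, a rectangle with \(l = 8\) and \(w = 3\) has area \(A = 8 \times 3 = 24\).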
To display an image from the web (the R logo) at 200 × 100 pixels:
```