In the realm of predictive modeling, logistic regression stands as a foundational technique for tackling binary classification problems. It allows us to analyze the relationship between a set of predictor variables and a binary outcome, making it particularly well-suited for scenarios like employee attrition prediction. However, the journey doesn’t end with model construction; we need robust mechanisms to validate our model’s performance and ensure its generalizability.
This exploration delves into the crucial steps of using logistic regression: employing training, validation, and test datasets, harnessing the power of cross-validation, and evaluating model performance through precision, recall, the F1-score, and the Receiver Operating Characteristic (ROC) curve. This multi-faceted approach enables us to not only build a predictive model but also rigorously assess its accuracy, sensitivity, and overall effectiveness.
Training Dataset: The training dataset serves as the foundation for teaching the logistic regression model. This dataset contains labeled examples where the model learns the underlying patterns and relationships between predictor variables and the binary response variable. During training, the logistic regression algorithm adjusts its parameters, specifically the coefficients, to minimize the error between predicted probabilities and actual outcomes.
Validation Dataset: Incorporating a validation dataset is crucial to assess the model’s performance and prevent overfitting. This separate dataset is not used during training but rather during the model development process. By evaluating the model on the validation dataset, one can fine-tune hyperparameters and make decisions about the model’s complexity. Validation aids in selecting the best version of the model that generalizes well to unseen data.
Test Dataset: The test dataset is entirely independent of both the training and validation datasets. It serves as a final benchmark to evaluate the model’s performance on new, unseen data. The test dataset provides an unbiased estimate of how well the trained model will perform when deployed in a real-world scenario.
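To make these three roles concrete, the sketch below shows one common way to carve a data frame into the three subsets. The proportions and the data frame name df are purely illustrative; the case study below handles validation differently, drawing validation folds from the training portion during cross-validation.
# Illustrative three-way split (60% train / 20% validation / 20% test); df is a placeholder data frame
n = nrow(df)
shuffled.id = sample(1:n, n, replace = FALSE)                   # shuffle the row indices
train.set = df[shuffled.id[1:round(0.6*n)], ]                   # 60% for training
valid.set = df[shuffled.id[(round(0.6*n)+1):round(0.8*n)], ]    # 20% for validation
test.set  = df[shuffled.id[(round(0.8*n)+1):n], ]               # 20% for final testing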
First, we split our dataset into two subsets: a combined training/validation set and a test set.
Since the sample size is large, we use a 70%:30% split, with 70% of the data for training and validating models and 30% for testing. The labels (attrition status) of the validation and testing data are withheld when calculating the accuracy measures.
attrition.data = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/cleanedattrition2.csv")
# recode attrition variable: Yes = 1 and No = 0
yes.id = which(attrition.data$Attrition == "Yes")
no.id = which(attrition.data$Attrition == "No")
# Add a new binary variable for attrition status
attrition.data$attrition.status = 0
attrition.data$attrition.status[yes.id] = 1
# Calculate the number of rows
nn = dim(attrition.data)[1]
# Generate indices for the training dataset (calling set.seed() first would make the split reproducible)
train.id = sample(1:nn, round(nn*0.7), replace = FALSE)
# Create training and testing datasets
training = attrition.data[train.id,]
testing = attrition.data[-train.id,]
However, a single split of the dataset may not provide a comprehensive assessment of our model’s robustness. This is where cross-validation steps in. By employing techniques like k-fold cross-validation, we repeatedly divide the dataset into subsets, training the model on a combination of these subsets and validating it on the remaining data. This iterative process yields a more reliable estimate of the model’s performance and guards against potential overfitting.
Cross-Validation: Cross-validation is a methodology that enhances the validation process by mitigating potential biases in model evaluation. K-fold cross-validation, a popular technique, involves partitioning the dataset into K subsets. The model is trained K times, each time using K-1 subsets for training and one subset for validation. This process ensures a comprehensive assessment of the model’s performance and reduces the risk of over-optimistic evaluations.
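The case study below implements the folds manually with consecutive blocks of rows, which is reasonable here because the training rows were already drawn in random order. A random fold assignment is a common alternative; a minimal sketch using the training set created above:
# Illustrative: randomly assign each training row to one of K = 5 folds
K = 5
fold.id = sample(rep(1:K, length.out = nrow(training)))
table(fold.id)   # roughly equal fold sizes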
We define a sequence of 20 candidate cut-off probabilities and then use 5-fold cross-validation to identify the optimal cut-off probability for the final attrition model.
n0 = floor(dim(training)[1]/5)                   # fold size for 5-fold CV
cut.off.prob = seq(0, 1, length = 22)[-c(1,22)]  # 20 candidate cut-off probabilities
pred.accuracy = matrix(0, ncol = 20, nrow = 5)   # matrix for storing prediction accuracy (one row per fold)
## 5-fold CV
for (i in 1:5){
valid.id = ((i-1)*n0 + 1):(i*n0)
valid.data = training[valid.id,]
train.data = training[-valid.id,]
train.model = glm(attrition.status ~ Age + BusinessTravel + Department + DistanceFromHome + Education + EducationField + EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, family = binomial(link = logit), data = train.data)
newdata = valid.data   # predict() uses only the covariates that appear in the model formula
pred.prob = predict.glm(train.model, newdata, type = "response")
# compute classification accuracy for each candidate cut-off
for(j in 1:20){
pred.status = as.numeric(pred.prob > cut.off.prob[j])                    # predicted status at this cut-off
pred.accuracy[i,j] = mean(pred.status == valid.data$attrition.status)    # proportion correctly classified
}
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
###
avg.accuracy = apply(pred.accuracy, 2, mean)   # average accuracy across the 5 folds for each cut-off
max.id = which.max(avg.accuracy)               # index of the cut-off with the highest average accuracy
### visual representation
tick.label = as.character(round(cut.off.prob, 2))
plot(1:20, avg.accuracy, type = "b",
xlim=c(1,20),
ylim=c(0.5,1),
axes = FALSE,
xlab = "Cut-off Probability",
ylab = "Accuracy",
main = "5-fold CV performance"
)
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "red")
text(max.id, avg.accuracy[max.id]+0.03, as.character(round(avg.accuracy[max.id],4)), col = "red", cex = 0.8)
Figure 7. 5-fold CV performance plot
The above figure indicates that the optimal cut-off probability that yields the best accuracy is 0.57.
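Rather than reading the value off the figure, the optimal cut-off can also be pulled directly from the cross-validation results computed above (a minimal sketch using the objects already defined):
# cut-off probability with the highest average 5-fold CV accuracy
optimal.cut.off = cut.off.prob[which.max(avg.accuracy)]
optimal.cut.off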
This subsection reports the performance of the model using the test data set. Note that the model is refit to the full training data to obtain the regression coefficients, and the holdout testing sample is then used to estimate the accuracy.
test.model = glm(attrition.status ~ Age + BusinessTravel + Department + DistanceFromHome + Education + EducationField + EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, family = binomial(link = logit), data = training)
newdata = testing   # predict() uses only the covariates that appear in the model formula
pred.prob.test = predict.glm(test.model, newdata, type = "response")
testing$test.status = as.numeric(pred.prob.test > 0.57)
a11 = sum(testing$test.status == testing$attrition.status)
test.accuracy = a11 / nrow(testing)
library(knitr)   # for kable()
kable(as.data.frame(test.accuracy), align = 'c')
| test.accuracy |
|---|
| 0.8888889 |
The test accuracy is approximately 89%, which suggests the model is not under-fitting: it correctly classified about 89% of the instances in the test dataset. In other words, out of all the employees in the test dataset, approximately 89% were correctly predicted as either having attrition or not, based on the chosen cut-off of 0.57.
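As a quick sanity check, the same test-set accuracy can also be recovered from a confusion matrix built with table() (a minimal sketch using the objects created above):
# cross-tabulate predicted versus actual attrition status on the test set
conf.mat = table(predicted = testing$test.status, actual = testing$attrition.status)
conf.mat
sum(diag(conf.mat)) / sum(conf.mat)   # overall accuracy; should match test.accuracy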
In conclusion, this study aimed to predict employee attrition using logistic regression and leveraged a robust 5-fold cross-validation approach to assess the model’s accuracy. The use of cross-validation allowed us to effectively evaluate the model’s performance on multiple subsets of the training data, enhancing its generalizability and reducing the risk of overfitting.
The logistic regression model demonstrated promising results, achieving a cross-validated accuracy of 87.75% in predicting employee attrition. This suggests that the selected features and the logistic regression algorithm have the potential to be valuable tools in identifying employees who might be at risk of attrition. However, it is important to note that accuracy alone might not provide a complete picture of the model's performance, and further evaluation using other metrics, such as precision, recall, the F1-score, and the ROC curve, can provide additional insights into the model's strengths and weaknesses.
Now we calculate the local and global performance metrics for the logistic predictive model. We used a confusion matrix in the case study in the previous note; here we use the optimal cut-off probability as the decision threshold to define the confusion matrix and then define the performance measures based on this matrix.
We use the training and testing data sets created for the previous model, and we take the optimal cut-off probability to be the one obtained through cross-validation. The testing data set is used to report the local and global performance measures.
Since we have identified the optimal cut-off probability to be 0.57, we next use the testing data set to report the local measures.
test.model = glm(attrition.status ~ Age + BusinessTravel + Department + DistanceFromHome + Education + EducationField + EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, family = binomial(link = logit), data = training)
newdata = testing   # the model formula picks out the required covariates from the test set
pred.prob.test = predict.glm(test.model, newdata, type = "response")
testing$test.status = as.numeric(pred.prob.test > 0.57)
### components for defining various measures
TN = sum(testing$test.status ==0 & testing$attrition.status ==0)
FN = sum(testing$test.status ==0 & testing$attrition.status ==1)
FP = sum(testing$test.status ==1 & testing$attrition.status ==0)
TP = sum(testing$test.status ==1 & testing$attrition.status ==1)
###
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
###
precision = TP / (TP + FP)
recall = sensitivity
F1 = 2*precision*recall/(precision + recall)
metric.list = cbind(sensitivity = sensitivity,
specificity = specificity,
precision = precision,
recall = recall,
F1 = F1)
kable(as.data.frame(metric.list), align='c', caption = "Local performance metrics")
| sensitivity | specificity | precision | recall | F1 |
|---|---|---|---|---|
| 0.4027778 | 0.9837398 | 0.8285714 | 0.4027778 | 0.5420561 |
Sensitivity (True Positive Rate): Sensitivity, also known as the True Positive Rate or Recall, measures the proportion of actual positive cases correctly identified by the model. In this case, it is 0.4027778, which means that the model correctly identified approximately 40.28% of the actual positive cases (attrition) in the test dataset.
Specificity (True Negative Rate): Specificity measures the proportion of actual negative cases correctly identified by the model. It is also known as the True Negative Rate. Here, it is 0.9837398, which indicates that the model correctly identified approximately 98.37% of the actual negative cases (non-attrition) in the test dataset.
Precision: Precision is the ratio of true positive predictions to the total number of cases the model predicted as positive. In this context, it is 0.8285714, meaning that of all the cases predicted as positive by the model, approximately 82.86% were actually positive cases (correctly predicted attrition).
Recall: Recall is another term for Sensitivity, as explained above. It measures the proportion of actual positive cases correctly identified by the model. Here, it is 0.4027778, indicating the same as Sensitivity: approximately 40.28% of actual positive cases were correctly identified.
F1 Score: The F1 Score is the harmonic mean of Precision and Recall. It provides a balanced measure of the two, especially when there is an imbalance between positive and negative cases. Here, it is 0.5420561, reflecting the trade-off between the model's high precision and its relatively low recall.
These metrics collectively provide insights into how well the model is performing in terms of correctly classifying attrition and non-attrition cases. A higher sensitivity is desired when correctly identifying positive cases is crucial, whereas a higher specificity is important when correctly identifying negative cases is a priority. Precision and Recall help evaluate the trade-off between true positives and false positives, while the F1 Score provides a balanced measure between Precision and Recall.
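For reference, these metrics can be written compactly in terms of the confusion-matrix counts (TP, FP, TN, FN) computed in the code above:

$$\text{Sensitivity (Recall)} = \frac{TP}{TP+FN}, \qquad \text{Specificity} = \frac{TN}{TN+FP}, \qquad \text{Precision} = \frac{TP}{TP+FP}, \qquad F_1 = \frac{2\cdot \text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}.$$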
In order to create an ROC curve, we need to select a sequence of decision thresholds and calculate the corresponding sensitivity and specificity.
CAUTION: when ROC and AUC are used for model selection, the ROC should be constructed and the AUC calculated on the training data; here, since the model has already been chosen, we construct the curve from the predictions on the holdout test data.
Performance Metrics and ROC Curve:
To quantitatively measure the model’s performance, various metrics come into play. The ROC (Receiver Operating Characteristic) curve is a graphical representation of a model’s ability to distinguish between the classes, showing the trade-off between sensitivity and specificity. The area under the ROC curve (AUC-ROC) quantifies the model’s discriminatory power, where a higher AUC indicates better performance.
In this exploration, we delve into the synergy of logistic regression, training, validation, and test datasets, cross-validation techniques, and the visualization of model performance through ROC curves. By combining these elements, we aim to build robust and accurate logistic regression models that effectively predict binary outcomes in various domains.
With a well-trained model in hand and a thorough understanding of its cross-validated performance, we turn our attention to measuring its predictive accuracy using the ROC curve. The ROC curve is a graphical representation of the trade-off between the true positive rate and the false positive rate. It allows us to choose an optimal threshold for classification, balancing the sensitivity and specificity of the model. The area under the ROC curve (AUC-ROC) summarizes the model’s overall discriminatory power, providing a single metric to assess its performance.
# Load the pROC package (install it first with install.packages("pROC") if necessary)
library(pROC)
cut.off.seq = seq(0, 1, length = 100)
sensitivity.vec = NULL
specificity.vec = NULL
training.model = glm(attrition.status ~ Age + BusinessTravel + Department + DistanceFromHome + Education + EducationField + EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, family = binomial(link = logit), data = training)
newdata = testing   # the model formula picks out the required covariates from the test set
pred.prob.train = predict.glm(training.model, newdata, type = "response")   # predicted probabilities on the holdout test set
for (i in 1:100){
train.status = as.numeric(pred.prob.train > cut.off.seq[i])
### components for defining various measures
TN = sum(train.status == 0 & testing$attrition.status == 0)
FN = sum(train.status == 0 & testing$attrition.status == 1)
FP = sum(train.status == 1 & testing$attrition.status == 0)
TP = sum(train.status == 1 & testing$attrition.status == 1)
###
sensitivity.vec[i] = TP / (TP + FN)
specificity.vec[i] = TN / (TN + FP)
}
one.minus.spec = 1 - specificity.vec
sens.vec = sensitivity.vec
## A better approx of ROC
prediction = pred.prob.train
category = testing$attrition.status == 1
ROCobj <- roc(category, prediction)
AUC = round(auc(ROCobj),4)
par(pty = "s") # make a square figure
plot(one.minus.spec, sens.vec, type = "l", xlim = c(0,1), ylim = c(0,1),
xlab ="1 - specificity",
ylab = "sensitivity",
main = "ROC curve of Logistic Employee Attrition Model",
lwd = 2,
col = "blue", )
segments(0,0,1,1, col = "red", lty = 2, lwd = 2)
#AUC.approx = round(sum(sens.vec[-100]*(one.minus.spec[-100]-one.minus.spec[-1])),4)   # crude Riemann-sum approximation of the AUC
text(0.8, 0.3, paste("AUC = ", AUC), col = "blue", cex = 0.8)
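As an aside, the pROC package can also report the decision threshold that maximizes Youden's J statistic (sensitivity + specificity - 1) directly from the ROC object. This is a different optimality criterion from the accuracy-based cut-off of 0.57 found earlier, so the two thresholds need not agree. A minimal sketch:
# threshold maximizing Youden's J, with the corresponding sensitivity and specificity
coords(ROCobj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))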
An AUC (Area Under the Curve) value of 0.8151 indicates that our logistic regression model performs reasonably well in distinguishing between the two classes (attrition and non-attrition). The AUC ranges from 0 to 1, where a higher value indicates better predictive performance; a value of 0.8151 suggests that the model has a good ability to discriminate between employees who will leave and those who will not.
When interpreting the AUC value, keep in mind that it is just one measure of the model's performance. It is important to also consider other metrics, such as sensitivity, specificity, precision, and the F1-score, as well as domain knowledge, when evaluating and interpreting the results of our logistic regression model.
Concluding Remarks:
In conclusion, the combination of logistic regression, training and test datasets, cross-validation, and ROC analysis forms a comprehensive framework for predictive modeling. By embracing these techniques, we empower ourselves to not only build accurate models but also to validate their effectiveness and make informed decisions based on their predictive capabilities. This multifaceted approach is a cornerstone of modern data science, enabling us to unlock insights and drive informed actions from complex datasets.