We use the IBM HR employee-attrition data to estimate three logistic-regression models on a training set and assess each with a confusion matrix on a test set, following Hands-On Machine Learning with R, Chapter 5 (https://bradleyboehmke.github.io/HOML/logistic-regression.html).
Models:
Attrition ~ MonthlyIncomeAttrition ~ MonthlyIncome + OverTimeAttrition ~ . (all predictors)This uses base R only — no packages to install. The
code reads the data directly from the web, so you do
not need any local file: just knit. (If you prefer to work offline, put
attrition.csv in the same folder as this .Rmd
and the code will use that instead.)
data_url <- "https://raw.githubusercontent.com/prernagoel/IBM-Employee-Attrition-Analysis-in-Python/master/WA_Fn-UseC_-HR-Employee-Attrition.csv"
if (file.exists("attrition.csv")) {
df <- read.csv("attrition.csv", stringsAsFactors = TRUE) # local copy if present
} else {
df <- read.csv(data_url, stringsAsFactors = TRUE) # otherwise download it
}
cat("Dimensions:", nrow(df), "rows,", ncol(df), "columns\n")
## Dimensions: 1470 rows, 35 columns
str(df[, c("Attrition", "MonthlyIncome", "OverTime")])
## 'data.frame': 1470 obs. of 3 variables:
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ MonthlyIncome: int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
EmployeeCount, Over18 and
StandardHours are constant, and EmployeeNumber
is just an ID — none carry information, and they would break the
~ . model, so we drop them. We also code
Attrition as a factor with "No" as the
reference level, so the model predicts the probability of
Attrition = "Yes".
drop_cols <- c("EmployeeCount", "Over18", "StandardHours", "EmployeeNumber")
df <- df[, !names(df) %in% drop_cols]
df$Attrition <- factor(df$Attrition, levels = c("No", "Yes"))
positive <- "Yes" # the "attrition" class
negative <- "No"
prop.table(table(df$Attrition)) # ~16% attrition
##
## No Yes
## 0.8387755 0.1612245
We hold out 30% of the data as a test set, sampling within each class so the attrition rate is preserved in both sets (stratified split, as in HOML).
set.seed(123) # reproducibility
idx_yes <- which(df$Attrition == "Yes")
idx_no <- which(df$Attrition == "No")
train_idx <- c(sample(idx_yes, round(0.70 * length(idx_yes))),
sample(idx_no, round(0.70 * length(idx_no))))
train <- df[train_idx, ]
test <- df[-train_idx, ]
cat("Training rows:", nrow(train), " | Test rows:", nrow(test), "\n")
## Training rows: 1029 | Test rows: 441
# align factor predictor levels in test to those seen in training
for (col in names(train)) {
if (is.factor(train[[col]]) && col != "Attrition") {
test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
}
}
model1 <- glm(Attrition ~ MonthlyIncome, data = train, family = binomial)
model2 <- glm(Attrition ~ MonthlyIncome + OverTime, data = train, family = binomial)
model3 <- glm(Attrition ~ ., data = train, family = binomial)
round(summary(model1)$coefficients, 5)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.97183 0.15319 -6.34379 0
## MonthlyIncome -0.00012 0.00003 -4.69644 0
round(summary(model2)$coefficients, 5)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.41435 0.17365 -8.14472 0
## MonthlyIncome -0.00013 0.00003 -4.83568 0
## OverTimeYes 1.33507 0.17887 7.46390 0
Consistent with HOML, the coefficient on MonthlyIncome
is negative (higher income -> lower probability of
attrition) and the coefficient on OverTimeYes is
positive (working overtime -> higher probability of
attrition).
We classify an employee as "Yes" (will attrite) when the
predicted probability exceeds 0.5, then compare predictions to the true
test labels. We report accuracy, sensitivity (true-positive rate for
attrition) and specificity.
confusion <- function(model, data) {
prob <- predict(model, newdata = data, type = "response") # P(Attrition = Yes)
pred <- ifelse(prob > 0.5, "Yes", "No")
pred <- factor(pred, levels = c("No", "Yes"))
ok <- !is.na(pred) & !is.na(data$Attrition)
cm <- table(Predicted = pred[ok], Actual = data$Attrition[ok])
TP <- cm["Yes", "Yes"]; FN <- cm["No", "Yes"]
FP <- cm["Yes", "No"]; TN <- cm["No", "No"]
list(
confusion_matrix = cm,
accuracy = round((TP + TN) / sum(cm), 4),
sensitivity = round(TP / (TP + FN), 4),
specificity = round(TN / (TN + FP), 4)
)
}
Attrition ~ MonthlyIncomeconfusion(model1, test)
## $confusion_matrix
## Actual
## Predicted No Yes
## No 370 71
## Yes 0 0
##
## $accuracy
## [1] 0.839
##
## $sensitivity
## [1] 0
##
## $specificity
## [1] 1
Attrition ~ MonthlyIncome + OverTimeconfusion(model2, test)
## $confusion_matrix
## Actual
## Predicted No Yes
## No 370 71
## Yes 0 0
##
## $accuracy
## [1] 0.839
##
## $sensitivity
## [1] 0
##
## $specificity
## [1] 1
Attrition ~ . (all predictors)confusion(model3, test)
## $confusion_matrix
## Actual
## Predicted No Yes
## No 356 39
## Yes 14 32
##
## $accuracy
## [1] 0.8798
##
## $sensitivity
## [1] 0.4507
##
## $specificity
## [1] 0.9622
res <- rbind(
"MonthlyIncome" = unlist(confusion(model1, test)[c("accuracy","sensitivity","specificity")]),
"MonthlyIncome + OverTime" = unlist(confusion(model2, test)[c("accuracy","sensitivity","specificity")]),
"All predictors (.)" = unlist(confusion(model3, test)[c("accuracy","sensitivity","specificity")])
)
res
## accuracy sensitivity specificity
## MonthlyIncome 0.8390 0.0000 1.0000
## MonthlyIncome + OverTime 0.8390 0.0000 1.0000
## All predictors (.) 0.8798 0.4507 0.9622
As in HOML, the simple models (1 and 2) predict the majority no-attrition class very well (high specificity) but miss most genuine cases of attrition (low sensitivity, because attrition is the rare ~16% class). The full model (3), using all predictors, generally attains the highest accuracy and a better balance between sensitivity and specificity. Roughly 84% of employees did not attrite, so any useful model must beat that “no-information” baseline.