We use the IBM HR employee-attrition data to estimate three logistic-regression models on a training set and assess each with a confusion matrix on a test set, following Hands-On Machine Learning with R, Chapter 5 (https://bradleyboehmke.github.io/HOML/logistic-regression.html).

Models:

  1. Attrition ~ MonthlyIncome
  2. Attrition ~ MonthlyIncome + OverTime
  3. Attrition ~ . (all predictors)

This uses base R only — no packages to install. The code reads the data directly from the web, so you do not need any local file: just knit. (If you prefer to work offline, put attrition.csv in the same folder as this .Rmd and the code will use that instead.)

1 Load the data

data_url <- "https://raw.githubusercontent.com/prernagoel/IBM-Employee-Attrition-Analysis-in-Python/master/WA_Fn-UseC_-HR-Employee-Attrition.csv"

if (file.exists("attrition.csv")) {
  df <- read.csv("attrition.csv", stringsAsFactors = TRUE)   # local copy if present
} else {
  df <- read.csv(data_url, stringsAsFactors = TRUE)          # otherwise download it
}
cat("Dimensions:", nrow(df), "rows,", ncol(df), "columns\n")
## Dimensions: 1470 rows, 35 columns
str(df[, c("Attrition", "MonthlyIncome", "OverTime")])
## 'data.frame':    1470 obs. of  3 variables:
##  $ Attrition    : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ MonthlyIncome: int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ OverTime     : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...

1.1 Clean up

EmployeeCount, Over18 and StandardHours are constant, and EmployeeNumber is just an ID — none carry information, and they would break the ~ . model, so we drop them. We also code Attrition as a factor with "No" as the reference level, so the model predicts the probability of Attrition = "Yes".

drop_cols <- c("EmployeeCount", "Over18", "StandardHours", "EmployeeNumber")
df <- df[, !names(df) %in% drop_cols]

df$Attrition <- factor(df$Attrition, levels = c("No", "Yes"))
positive <- "Yes"   # the "attrition" class
negative <- "No"

prop.table(table(df$Attrition))   # ~16% attrition
## 
##        No       Yes 
## 0.8387755 0.1612245

2 Train / test split (70% / 30%)

We hold out 30% of the data as a test set, sampling within each class so the attrition rate is preserved in both sets (stratified split, as in HOML).

set.seed(123)   # reproducibility
idx_yes <- which(df$Attrition == "Yes")
idx_no  <- which(df$Attrition == "No")

train_idx <- c(sample(idx_yes, round(0.70 * length(idx_yes))),
               sample(idx_no,  round(0.70 * length(idx_no))))

train <- df[train_idx, ]
test  <- df[-train_idx, ]

cat("Training rows:", nrow(train), " | Test rows:", nrow(test), "\n")
## Training rows: 1029  | Test rows: 441
# align factor predictor levels in test to those seen in training
for (col in names(train)) {
  if (is.factor(train[[col]]) && col != "Attrition") {
    test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
  }
}

3 Fit the three models on the training data

model1 <- glm(Attrition ~ MonthlyIncome,            data = train, family = binomial)
model2 <- glm(Attrition ~ MonthlyIncome + OverTime, data = train, family = binomial)
model3 <- glm(Attrition ~ .,                        data = train, family = binomial)
round(summary(model1)$coefficients, 5)
##               Estimate Std. Error  z value Pr(>|z|)
## (Intercept)   -0.97183    0.15319 -6.34379        0
## MonthlyIncome -0.00012    0.00003 -4.69644        0
round(summary(model2)$coefficients, 5)
##               Estimate Std. Error  z value Pr(>|z|)
## (Intercept)   -1.41435    0.17365 -8.14472        0
## MonthlyIncome -0.00013    0.00003 -4.83568        0
## OverTimeYes    1.33507    0.17887  7.46390        0

Consistent with HOML, the coefficient on MonthlyIncome is negative (higher income -> lower probability of attrition) and the coefficient on OverTimeYes is positive (working overtime -> higher probability of attrition).

4 Confusion matrices on the test data

We classify an employee as "Yes" (will attrite) when the predicted probability exceeds 0.5, then compare predictions to the true test labels. We report accuracy, sensitivity (true-positive rate for attrition) and specificity.

confusion <- function(model, data) {
  prob <- predict(model, newdata = data, type = "response")   # P(Attrition = Yes)
  pred <- ifelse(prob > 0.5, "Yes", "No")
  pred <- factor(pred, levels = c("No", "Yes"))
  ok   <- !is.na(pred) & !is.na(data$Attrition)
  cm   <- table(Predicted = pred[ok], Actual = data$Attrition[ok])

  TP <- cm["Yes", "Yes"]; FN <- cm["No", "Yes"]
  FP <- cm["Yes", "No"];  TN <- cm["No", "No"]

  list(
    confusion_matrix = cm,
    accuracy    = round((TP + TN) / sum(cm), 4),
    sensitivity = round(TP / (TP + FN), 4),
    specificity = round(TN / (TN + FP), 4)
  )
}

4.1 Model 1 — Attrition ~ MonthlyIncome

confusion(model1, test)
## $confusion_matrix
##          Actual
## Predicted  No Yes
##       No  370  71
##       Yes   0   0
## 
## $accuracy
## [1] 0.839
## 
## $sensitivity
## [1] 0
## 
## $specificity
## [1] 1

4.2 Model 2 — Attrition ~ MonthlyIncome + OverTime

confusion(model2, test)
## $confusion_matrix
##          Actual
## Predicted  No Yes
##       No  370  71
##       Yes   0   0
## 
## $accuracy
## [1] 0.839
## 
## $sensitivity
## [1] 0
## 
## $specificity
## [1] 1

4.3 Model 3 — Attrition ~ . (all predictors)

confusion(model3, test)
## $confusion_matrix
##          Actual
## Predicted  No Yes
##       No  356  39
##       Yes  14  32
## 
## $accuracy
## [1] 0.8798
## 
## $sensitivity
## [1] 0.4507
## 
## $specificity
## [1] 0.9622

5 Summary

res <- rbind(
  "MonthlyIncome"            = unlist(confusion(model1, test)[c("accuracy","sensitivity","specificity")]),
  "MonthlyIncome + OverTime" = unlist(confusion(model2, test)[c("accuracy","sensitivity","specificity")]),
  "All predictors (.)"       = unlist(confusion(model3, test)[c("accuracy","sensitivity","specificity")])
)
res
##                          accuracy sensitivity specificity
## MonthlyIncome              0.8390      0.0000      1.0000
## MonthlyIncome + OverTime   0.8390      0.0000      1.0000
## All predictors (.)         0.8798      0.4507      0.9622

As in HOML, the simple models (1 and 2) predict the majority no-attrition class very well (high specificity) but miss most genuine cases of attrition (low sensitivity, because attrition is the rare ~16% class). The full model (3), using all predictors, generally attains the highest accuracy and a better balance between sensitivity and specificity. Roughly 84% of employees did not attrite, so any useful model must beat that “no-information” baseline.