#2023000308 # 🔒 Force knitting from home directory
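knitr::opts_knit$set(root.dir = "~")
# Install ggplot2 only if it is missing, so knitting does not reinstall it on every run
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")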
Employee attrition is a binary outcome (“Yes” or “No”),
so logistic regression is a natural choice of model for predicting it.
Why not Linear Regression?
Linear regression predicts an unbounded continuous outcome, so its fitted values can fall below 0 or above 1 and cannot be read as probabilities.
Logistic regression models the probability of an event (attrition =
“Yes”).
Why Logistic Regression works here:

- It handles binary dependent variables.
- It produces interpretable coefficients (log-odds or odds ratios).
- It outputs probabilities between 0 and 1.
\[ \text{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1(\text{Age}) + \beta_2(\text{Salary}) + \beta_3(\text{YearsAtJob}) \]

Here \(p\) is the probability that Attrition is “Yes”.

---

## b. Implementing the Model in R
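Before fitting anything, the relationship between the log-odds scale and probabilities can be checked in base R. This is a minimal sketch: `qlogis()` and `plogis()` are base R's logit and inverse-logit functions, and the coefficients `b0` and `b_years` are hypothetical values chosen only for illustration, not estimates from any data.

```r
# Convert between probability, odds, and log-odds (logit) in base R
p <- 0.8
odds <- p / (1 - p)      # odds of the event
log_odds <- log(odds)    # the logit; identical to qlogis(p)

# plogis() is the inverse logit: it maps any real number back into (0, 1)
p_back <- plogis(log_odds)

# Hypothetical coefficients (illustration only, not fitted values)
b0 <- 2
b_years <- -0.5
years_at_job <- c(0, 4, 10)

# Linear predictor on the log-odds scale, then back to probabilities
p_leave <- plogis(b0 + b_years * years_at_job)
print(round(p_leave, 3))   # 0.881 0.500 0.047
```

With a negative coefficient, each additional year lowers the log-odds linearly, while the probability moves along an S-shaped curve that stays between 0 and 1.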
# --- Data Preparation (Creating data directly in R) ---
employee_data <- data.frame(
  Age = c(25, 30, 40, 22, 50, 28, 35, 45),
  Salary = c(35000, 48000, 65000, 30000, 80000, 42000, 54000, 70000),
  YearsAtJob = c(1, 3, 10, 0.5, 20, 2, 6, 15),
  Attrition = c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No")
)
# Convert Attrition to a factor; with levels "No" then "Yes", glm() models P(Attrition = "Yes")
employee_data$Attrition <- factor(employee_data$Attrition, levels = c("No", "Yes"))
# --- Model Fitting ---
attrition_model <- glm(
  Attrition ~ Age + Salary + YearsAtJob,
  data = employee_data,
  family = binomial
)
# --- Model Summary ---
summary(attrition_model)
##
## Call:
## glm(formula = Attrition ~ Age + Salary + YearsAtJob, family = binomial,
## data = employee_data)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  3.577e+02  1.607e+06       0        1
## Age          4.948e+00  1.006e+05       0        1
## Salary      -1.202e-02  3.799e+01       0        1
## YearsAtJob   1.577e+01  8.298e+04       0        1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.0585e+01 on 7 degrees of freedom
## Residual deviance: 4.8982e-10 on 4 degrees of freedom
## AIC: 8
##
## Number of Fisher Scoring iterations: 24
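The summary above shows very large standard errors, z values of 0, a residual deviance near zero, and 24 Fisher scoring iterations. Together these are the usual signature of complete separation: some predictor splits the “Yes” and “No” rows perfectly, so the maximum-likelihood estimates are not finite. The sketch below is one way to check for this; the helper `separates()` and the variable names are introduced here for illustration and are not part of the original analysis.

```r
# Same toy data as in the example
employee_data <- data.frame(
  Age = c(25, 30, 40, 22, 50, 28, 35, 45),
  Salary = c(35000, 48000, 65000, 30000, 80000, 42000, 54000, 70000),
  YearsAtJob = c(1, 3, 10, 0.5, 20, 2, 6, 15),
  Attrition = factor(c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No"))
)

# Collect any warnings raised while fitting instead of letting them scroll past
fit_warnings <- character(0)
fit <- withCallingHandlers(
  glm(Attrition ~ Age + Salary + YearsAtJob,
      data = employee_data, family = binomial),
  warning = function(w) {
    fit_warnings <<- c(fit_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(fit_warnings)

# Hypothetical helper: a numeric predictor separates the classes when the
# value ranges of the two groups do not overlap
separates <- function(x, y) {
  r_yes <- range(x[y == "Yes"])
  r_no <- range(x[y == "No"])
  r_yes[2] < r_no[1] || r_no[2] < r_yes[1]
}

separation_check <- sapply(
  employee_data[c("Age", "Salary", "YearsAtJob")],
  separates,
  y = employee_data$Attrition
)
print(separation_check)
```

In this dataset every predictor separates the two groups on its own, so the coefficients and p-values in the summary should not be interpreted. A larger sample, or a penalized fit, would be needed before drawing conclusions.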
# --- Predictions ---
predicted_prob <- predict(attrition_model, type = "response")
predicted_class <- ifelse(predicted_prob > 0.5, "Yes", "No")
head(predicted_class)
## 1 2 3 4 5 6
## "Yes" "No" "No" "Yes" "No" "Yes"
library(caret)
library(pROC)
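# --- Confusion Matrix ---
# Give predictions the same factor levels as the reference, so the table is
# well-formed even if one class is never predicted.
# Note: caret treats the first level ("No") as the positive class by default;
# pass positive = "Yes" to report sensitivity for attrition instead.
conf_matrix <- confusionMatrix(
  factor(predicted_class, levels = levels(employee_data$Attrition)),
  employee_data$Attrition
)
conf_matrix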
## Confusion Matrix and Statistics
##
##           Reference
## Prediction No Yes
##        No   5   0
##        Yes  0   3
##
## Accuracy : 1
## 95% CI : (0.6306, 1)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 0.02328
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.625
## Detection Rate : 0.625
## Detection Prevalence : 0.625
## Balanced Accuracy : 1.000
##
## 'Positive' Class : No
##
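# --- ROC Curve and AUC ---
# State the class order and direction explicitly instead of letting pROC infer them
roc_obj <- roc(employee_data$Attrition, predicted_prob, levels = c("No", "Yes"), direction = "<")
plot(roc_obj, main = "ROC Curve for Employee Attrition Model")
auc(roc_obj)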
## Area under the curve: 1
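The accuracy and AUC of 1 reported above are computed on the same eight rows used to fit the model, so they are likely to be optimistic. One common way to get a less optimistic estimate on a sample this small is leave-one-out cross-validation. The sketch below is an illustration, not part of the original analysis: the names `loo_prob`, `loo_class`, and `loo_accuracy` are introduced here, and warnings from the separated fits are deliberately suppressed.

```r
# Same toy data as in the example
employee_data <- data.frame(
  Age = c(25, 30, 40, 22, 50, 28, 35, 45),
  Salary = c(35000, 48000, 65000, 30000, 80000, 42000, 54000, 70000),
  YearsAtJob = c(1, 3, 10, 0.5, 20, 2, 6, 15),
  Attrition = factor(c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No"))
)

n <- nrow(employee_data)
loo_prob <- numeric(n)

for (i in seq_len(n)) {
  # Fit on every row except row i
  fit_i <- suppressWarnings(
    glm(Attrition ~ Age + Salary + YearsAtJob,
        data = employee_data[-i, ], family = binomial)
  )
  # Predict the probability of "Yes" for the held-out row
  loo_prob[i] <- suppressWarnings(
    predict(fit_i, newdata = employee_data[i, ], type = "response")
  )
}

# Apply the same 0.5 threshold as in the main analysis
loo_class <- ifelse(loo_prob > 0.5, "Yes", "No")
loo_accuracy <- mean(loo_class == employee_data$Attrition)
print(loo_accuracy)
```

Each prediction here comes from a model that never saw the row being predicted, which is closer to how the model would be used on new employees.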
The chosen regression model, logistic regression, is appropriate for this prediction task because:

- Binary Outcome Variable: The dependent variable (Attrition) has only two possible outcomes, “Yes” (leaves) or “No” (stays). Logistic regression is designed for exactly this kind of binary classification problem.
- Probability-Based Output: Instead of predicting only labels, logistic regression predicts the probability that an employee leaves. These probabilities (values between 0 and 1) can be compared against a threshold, such as the 0.5 cutoff used in the predictions, to make decisions.
- Interpretable Coefficients: Each coefficient is the change in the log-odds of attrition for a one-unit increase in a predictor (e.g., age, salary, or years at job), holding the other predictors fixed. Exponentiating a coefficient gives an odds ratio. This allows HR teams to identify which factors most influence employee turnover.
- Handles Different Data Types: The model accommodates both numeric (e.g., age, salary) and categorical (e.g., gender, department) predictors if needed.
- Computational Efficiency: Logistic regression is fast to fit and usually works well on small to medium datasets, which suits HR analytics. The eight-row example here is an exception: the two classes are perfectly separated, so the estimates and standard errors in the model summary are not reliable, and a larger sample would be needed in practice.
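The points about interpretable coefficients and categorical predictors can be sketched on simulated data. Everything in this block is an assumption made for illustration: the sample size, the `Department` categories, and the "true" effect of `YearsAtJob` are invented, and none of the numbers relate to the eight-row example.

```r
# Simulated data, for illustration only
set.seed(42)
n <- 300
sim <- data.frame(
  YearsAtJob = round(runif(n, 0, 20), 1),
  Department = factor(sample(c("Sales", "Engineering", "HR"), n, replace = TRUE))
)

# Assumed true relationship: each extra year lowers the log-odds of leaving by 0.25
sim$Attrition <- rbinom(n, size = 1, prob = plogis(1 - 0.25 * sim$YearsAtJob))

# glm() expands the factor into dummy variables automatically,
# using the first level ("Engineering") as the baseline
fit <- glm(Attrition ~ YearsAtJob + Department, data = sim, family = binomial)

# Exponentiated coefficients are odds ratios:
# a value below 1 means lower odds of leaving per one-unit increase
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))
```

An odds ratio of, say, 0.8 for `YearsAtJob` would mean that each additional year multiplies the odds of leaving by 0.8, holding department fixed.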
In R Markdown, you can include mathematical formulas using LaTeX syntax.
To display math inline, wrap it between single
dollar signs $...$.
Example sentence:
The area of a rectangle is given by the formula \(A = l \times w\),
where \(l\) represents the
length and \(w\)
represents the width.
To display the same formula centered on a new line
(display mode),
use double dollar signs like this:
\[A = l \times w\]
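The rendered text above shows only the result. As a sketch of what the corresponding R Markdown source might look like (the exact original source is not shown here, so this is an assumption), the two forms would be typed as:

```latex
The area of a rectangle is given by the formula $A = l \times w$.

$$A = l \times w$$
```

Single dollar signs keep the formula in the running text, while double dollar signs place it on its own centered line.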
In R Markdown, you can insert images either using Markdown syntax or R code.
We can use the knitr::include_graphics() function to
include an image in the document.
The chunk options out.width and out.height
control the image size.
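A minimal sketch of the R approach is shown below. To keep it self-contained, it first writes a small PNG into a temporary directory; the file name `example-plot.png` and the `out.width` value are assumptions chosen for illustration. In a real document the path would point at an existing image instead.

```r
library(knitr)

# Create a small example image so the sketch does not depend on an existing file
img_path <- file.path(tempdir(), "example-plot.png")   # hypothetical file name
png(img_path, width = 600, height = 400)
plot(1:10, main = "Example image")
dev.off()

# In the .Rmd source, the size is set in the chunk header, for example:
#   {r, out.width = "60%", echo = FALSE}
# The Markdown alternative is: ![Caption text](path/to/image.png)
img <- include_graphics(img_path)
img
```

The chunk options apply to whatever image the chunk returns, so the same header works for both generated plots and files included this way.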