1. Predicting Employee Attrition

a. Selecting and Justifying a Regression Model

Employee attrition is a binary outcome (“Yes” or “No”).
Therefore, the most suitable model is Logistic Regression.

Why not Linear Regression?
Linear regression predicts an unbounded continuous value, so its fitted values can fall below 0 or above 1 and cannot be read as probabilities.
Logistic regression instead models the probability of an event (attrition = “Yes”).
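
As a rough illustration of why a linear model is a poor fit for a binary outcome, the sketch below applies ordinary least squares to a 0/1 coding of attrition. It reuses the YearsAtJob and Attrition values from the data frame defined in part b; the single-predictor model and the variable names are assumptions made for this example, not part of the analysis itself.

```r
# Values copied from the employee_data frame used in part b
years_at_job <- c(1, 3, 10, 0.5, 20, 2, 6, 15)
attrition    <- c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No")

# Code the outcome as 1 = "Yes", 0 = "No" so that lm() can be applied
left <- as.numeric(attrition == "Yes")

# Ordinary least squares on the binary outcome (a linear probability model)
linear_fit  <- lm(left ~ years_at_job)
linear_pred <- unname(fitted(linear_fit))

# Fitted values are not constrained to the interval [0, 1]
round(linear_pred, 3)
range(linear_pred)
```

For the employee with 20 years at the job, the fitted value is negative, which cannot be interpreted as a probability.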

Why Logistic Regression works here:
- It handles binary dependent variables.
- It produces interpretable coefficients (log-odds or odds ratios).
- It outputs probabilities between 0 and 1.
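
The link between log-odds and probabilities can be sketched with base R's plogis() (inverse logit) and qlogis() (logit). The log-odds values below are arbitrary illustration values, not estimates from any model.

```r
# Arbitrary log-odds values (a linear predictor can be any real number)
log_odds <- c(-4, -1, 0, 1, 4)

# Inverse logit: p = 1 / (1 + exp(-log_odds)), always strictly between 0 and 1
prob <- plogis(log_odds)

# Logit: log(p / (1 - p)) recovers the original log-odds
recovered <- qlogis(prob)

data.frame(log_odds, prob = round(prob, 3), recovered)
```

A log-odds of 0 corresponds to a probability of 0.5; negative log-odds give probabilities below 0.5 and positive log-odds give probabilities above it.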

(mathematical expression)

\[ \text{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1(\text{Age}) + \beta_2(\text{Salary}) + \beta_3(\text{YearsAtJob}) \]

b. Implementing the Model in R

# --- Data Preparation (Creating data directly in R) ---
employee_data <- data.frame(
  Age = c(25, 30, 40, 22, 50, 28, 35, 45),
  Salary = c(35000, 48000, 65000, 30000, 80000, 42000, 54000, 70000),
  YearsAtJob = c(1, 3, 10, 0.5, 20, 2, 6, 15),
  Attrition = c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No")
)

# Convert Attrition column to a factor
employee_data$Attrition <- as.factor(employee_data$Attrition)

# --- Model Fitting ---
attrition_model <- glm(
  Attrition ~ Age + Salary + YearsAtJob,
  data   = employee_data,
  family = binomial
)

# --- Model Summary ---
summary(attrition_model)

## 
## Call:
## glm(formula = Attrition ~ Age + Salary + YearsAtJob, family = binomial, 
##     data = employee_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  3.577e+02  1.607e+06       0        1
## Age          4.948e+00  1.006e+05       0        1
## Salary      -1.202e-02  3.799e+01       0        1
## YearsAtJob   1.577e+01  8.298e+04       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.0585e+01  on 7  degrees of freedom
## Residual deviance: 4.8982e-10  on 4  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 24

# --- Predictions ---
predicted_prob  <- predict(attrition_model, type = "response")
predicted_class <- ifelse(predicted_prob > 0.5, "Yes", "No")

head(predicted_class)

##     1     2     3     4     5     6 
## "Yes"  "No"  "No" "Yes"  "No" "Yes"

c. Evaluating Model Accuracy

library(caret)
library(pROC)

# --- Confusion Matrix ---

conf_matrix <- confusionMatrix(
  as.factor(predicted_class),
  employee_data$Attrition
)
conf_matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No   5   0
##        Yes  0   3
##                                      
##                Accuracy : 1          
##                  95% CI : (0.6306, 1)
##     No Information Rate : 0.625      
##     P-Value [Acc > NIR] : 0.02328    
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.000      
##             Specificity : 1.000      
##          Pos Pred Value : 1.000      
##          Neg Pred Value : 1.000      
##              Prevalence : 0.625      
##          Detection Rate : 0.625      
##    Detection Prevalence : 0.625      
##       Balanced Accuracy : 1.000      
##                                      
##        'Positive' Class : No         
##
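
The headline figures in this output can be reproduced with base R as a sanity check. The sketch below assumes the predicted labels equal the actual labels for all eight rows, which is what the confusion matrix reports, and treats "No" as the positive class, as caret did here. The variable names are illustrative.

```r
# Actual labels from employee_data, with levels ordered as in the output
actual <- factor(c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No"),
                 levels = c("No", "Yes"))

# Assumption: every prediction matched, as the confusion matrix reports
predicted <- actual

# Rows = predictions, columns = reference labels
cm <- table(Prediction = predicted, Reference = actual)

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["No", "No"] / sum(cm[, "No"])     # true "No" predicted as "No"
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # true "Yes" predicted as "Yes"
prevalence  <- sum(cm[, "No"]) / sum(cm)            # share of the positive class

c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, prevalence = prevalence)
```

The prevalence of 0.625 is the share of "No" cases (5 of 8), which is also the No Information Rate shown in the output.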

# --- ROC Curve and AUC ---


roc_obj <- roc(employee_data$Attrition, predicted_prob)
plot(roc_obj, main = "ROC Curve for Employee Attrition Model")

auc(roc_obj)

## Area under the curve: 1
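
One way to read the AUC is as the proportion of (leaver, stayer) pairs in which the leaver receives the higher predicted probability. The sketch below computes this directly. The scores are hypothetical stand-ins chosen so that every "Yes" outranks every "No", since the fitted probabilities themselves are not printed in this report.

```r
actual <- c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No")

# Hypothetical predicted probabilities (not the model's actual output)
score <- c(0.90, 0.20, 0.10, 0.80, 0.05, 0.70, 0.30, 0.15)

pos <- score[actual == "Yes"]
neg <- score[actual == "No"]

# Compare every "Yes" score with every "No" score; ties count as half a win
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_manual <- mean(wins)
auc_manual
```

When every leaver outranks every stayer, all 15 pairwise comparisons are wins and the result is 1, matching a perfectly separated ROC curve.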

d. Discuss Why the Chosen Regression Approach is Appropriate

The chosen regression model, Logistic Regression, is appropriate for this prediction task because:

- Binary outcome variable: The dependent variable (Attrition) has only two possible outcomes, “Yes” (leaves) or “No” (stays). Logistic regression is designed specifically for such binary classification problems.
- Probability-based output: Instead of predicting only labels, logistic regression predicts the probability of an employee leaving. These probabilities (values between 0 and 1) make the model interpretable and support threshold-based decisions.
- Interpretable coefficients: Each coefficient gives the change in the log-odds of attrition for a one-unit increase in a variable (e.g., age, salary, or years at job), holding the others fixed; exponentiating it gives an odds ratio. This allows HR teams to identify which factors most influence employee turnover.
- Handles different data types: The model accommodates both numeric (e.g., age, salary) and categorical (e.g., gender, department) variables if needed.
- Computational efficiency: Logistic regression is fast to fit and works well on small to medium datasets, which suits HR analytics. The eight-row example above is too small to show this: the two classes are perfectly separated, which is why the summary reports very large standard errors and a residual deviance near zero, so the accuracy and AUC of 1 describe the training data only.
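
The points about interpretable coefficients and threshold-based decisions can be illustrated with a small sketch. The coefficient and the probabilities below are hypothetical values chosen for illustration; they are not taken from the fitted model.

```r
# Hypothetical coefficient for YearsAtJob on the log-odds scale
beta_years <- -0.2

# exp() turns a log-odds coefficient into an odds ratio
odds_ratio <- exp(beta_years)
pct_change <- (odds_ratio - 1) * 100  # percent change in odds per extra year

# Hypothetical predicted probabilities of attrition
prob <- c(0.15, 0.45, 0.55, 0.90)

# The same probabilities give different labels under different cutoffs
label_50 <- ifelse(prob > 0.5, "Yes", "No")
label_30 <- ifelse(prob > 0.3, "Yes", "No")

round(c(odds_ratio = odds_ratio, pct_change = pct_change), 2)
rbind(label_50, label_30)
```

Under these assumed values, an odds ratio of about 0.82 would mean each additional year at the job lowers the odds of attrition by roughly 18%. Lowering the cutoff from 0.5 to 0.3 flags one more employee as a likely leaver.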

2. a) Inline Mathematical Expression

In R Markdown, you can include mathematical formulas using LaTeX syntax.

To display math inline, wrap it in single dollar signs: $...$.

Example sentence:

The area of a rectangle is given by the formula $A = l \times w$,
where $l$ represents the length and $w$ represents the width.

To display the same formula centered on a new line (display mode),
use double dollar signs like this:

$$A = l \times w$$

b) Inserting an Image in R Markdown

In R Markdown, you can insert images using either Markdown syntax, ![caption](path/to/image.png), or R code.

🧠 Using R Code

We can use the knitr::include_graphics() function to include an image in the document.
The chunk options out.width and out.height control the image size.
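
A minimal sketch of such a chunk is shown below. The file path is a hypothetical placeholder and must point to an image that actually exists relative to the .Rmd file. The chunk label and option values are also assumptions, written here as a comment because they belong in the chunk header.

```r
# Chunk header (assumed): {r case-study-image, echo=FALSE, out.width="60%"}

# Hypothetical path; replace with a real image file in the project
knitr::include_graphics("images/case_study.png")
```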

[Image: casetstudy]