1. Employee Attrition Prediction

a. Select and justify a suitable regression model for predicting employee attrition

Logistic regression is the most appropriate model for predicting employee attrition (“Yes”/“No”) from variables such as age, salary, and years at the job. It models a binary outcome directly and estimates the probability that an employee will leave given their attributes, whereas an ordinary linear regression cannot guarantee predictions between 0 and 1.
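
For reference, the model described above can be written out explicitly. The coefficient symbols \(\beta_0, \dots, \beta_3\) are notation introduced here for illustration and would be estimated from the data:

\[
P(\text{Attrition} = \text{Yes} \mid \text{Age}, \text{Salary}, \text{YearsAtJob}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Salary} + \beta_3\,\text{YearsAtJob})}}
\]

Equivalently, the log-odds of leaving are a linear function of the predictors, which is what makes the coefficients directly interpretable.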

b. Outline R code to implement the model, including data preparation, model fitting, and prediction

# Simulate 200 employees with three predictors
set.seed(42)
n <- 200
employee_data <- data.frame(
  Age = sample(20:60, n, replace = TRUE),
  Salary = sample(20000:150000, n, replace = TRUE),
  YearsAtJob = sample(1:21, n, replace = TRUE)
)
# Generate attrition from a known logistic model; the intercept of -30 makes "Yes" rare
logit <- -30 + 0.1 * employee_data$Age + 0.0002 * employee_data$Salary - 0.3 * employee_data$YearsAtJob
prob <- 1 / (1 + exp(-logit))
employee_data$AttritionBinary <- rbinom(n, 1, prob)
employee_data$Attrition <- ifelse(employee_data$AttritionBinary == 1, "Yes", "No")

# Hold out 30% of the rows as a test set
set.seed(123)
train_indices <- sample(seq_len(nrow(employee_data)), size = 0.7 * nrow(employee_data))
train_data <- employee_data[train_indices, ]
test_data <- employee_data[-train_indices, ]

# Fit the logistic regression on the training rows
model <- glm(AttritionBinary ~ Age + Salary + YearsAtJob, data = train_data, family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Predict attrition probabilities for the test rows and classify at a 0.5 cutoff
probs <- predict(model, newdata = test_data, type = "response")
preds <- ifelse(probs > 0.5, "Yes", "No")
test_data$PredictedProb <- probs
test_data$PredictedAttrition <- preds

head(test_data[, c("Age", "Salary", "YearsAtJob", "Attrition", "PredictedProb", "PredictedAttrition")])
##    Age Salary YearsAtJob Attrition PredictedProb PredictedAttrition
## 2   20  99414         13        No  2.307048e-10                 No
## 3   44 104990         20        No  4.003740e-08                 No
## 10  44  80554         13        No  4.130856e-11                 No
## 11  56  96049          7        No  2.147062e-06                 No
## 15  60  46812          5        No  2.220446e-16                 No
## 18  55  82549          7        No  9.565932e-09                 No

c. Describe how you would evaluate the accuracy of your model using appropriate metrics and validation techniques

library(caret)
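# Explicit factor levels keep the call valid even if one class is never predicted
confusionMatrix(
  factor(test_data$PredictedAttrition, levels = c("No", "Yes")),
  factor(test_data$Attrition, levels = c("No", "Yes"))
)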
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  59   0
##        Yes  0   1
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9404, 1)
##     No Information Rate : 0.9833     
##     P-Value [Acc > NIR] : 0.3648     
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.9833     
##          Detection Rate : 0.9833     
##    Detection Prevalence : 0.9833     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : No         
## 
library(pROC)
roc_obj <- roc(test_data$AttritionBinary, test_data$PredictedProb)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_obj)

auc(roc_obj)
## Area under the curve: 1

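The confusion matrix reports accuracy, sensitivity, and specificity on the held-out 30% test set, and the ROC curve with its AUC summarizes how well the predicted probabilities rank leavers above stayers across all thresholds. The perfect scores here should be read with caution. The test set contains only one “Yes” case out of 60, so the no-information rate is already 0.9833 and the accuracy is not significantly above it (p = 0.3648). caret also treats “No” as the positive class by default, so the reported sensitivity describes employees who stay. The glm warning about fitted probabilities of 0 or 1 suggests that the simulated classes are almost perfectly separable. k-fold cross-validation gives a more reliable estimate than a single train/test split, because every observation is used for testing exactly once.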
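
A minimal sketch of k-fold cross-validation in base R follows. It is self-contained, so it simulates its own data: the data frame `sim`, the coefficients used to generate it, and the helper `cv_accuracy()` are all illustrative assumptions rather than part of the analysis above.

```r
# Sketch: k-fold cross-validation for a logistic regression (base R only).
# The simulated data and its coefficients are illustrative assumptions.
set.seed(1)
n <- 300
sim <- data.frame(
  Age = sample(20:60, n, replace = TRUE),
  Salary = sample(20000:150000, n, replace = TRUE),
  YearsAtJob = sample(1:21, n, replace = TRUE)
)
true_logit <- 1.5 - 0.02 * sim$Age - 0.00001 * sim$Salary - 0.1 * sim$YearsAtJob
sim$AttritionBinary <- rbinom(n, 1, plogis(true_logit))

# Hypothetical helper: returns the test accuracy of each of the k folds
cv_accuracy <- function(data, k = 5, threshold = 0.5) {
  # Randomly assign every row to one of k folds of (nearly) equal size
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  sapply(seq_len(k), function(i) {
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    fit <- glm(AttritionBinary ~ Age + Salary + YearsAtJob,
               data = train, family = binomial())
    p <- predict(fit, newdata = test, type = "response")
    # Share of held-out rows whose predicted class matches the observed class
    mean((p > threshold) == (test$AttritionBinary == 1))
  })
}

fold_acc <- cv_accuracy(sim, k = 5)
mean(fold_acc)  # average accuracy across the five held-out folds
```

With a rare outcome such as the one simulated earlier, folds stratified by class would be a sensible refinement, since a purely random fold may contain no “Yes” cases at all.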

d. Discuss why the chosen regression approach is appropriate for this prediction task

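Logistic regression is designed for binary outcomes, which makes it a natural fit for “Yes”/“No” attrition prediction. Its coefficients are interpretable as changes in the log-odds of leaving (or, once exponentiated, as odds ratios). The classification threshold on the predicted probability can be adjusted to trade off false positives against false negatives. The model accommodates both numeric and categorical predictors, and it is widely used in business settings.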
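
The two properties mentioned above, interpretable coefficients and an adjustable threshold, can be illustrated with a short sketch. The simulated data frame `sim`, its generating coefficients, and the 0.3 cutoff are assumptions chosen only for illustration.

```r
# Sketch: reading a logistic regression as odds ratios and varying the cutoff.
# The simulated data and coefficient values are illustrative assumptions.
set.seed(2)
n <- 500
sim <- data.frame(
  Age = sample(20:60, n, replace = TRUE),
  YearsAtJob = sample(1:21, n, replace = TRUE)
)
sim$AttritionBinary <- rbinom(n, 1, plogis(1 - 0.02 * sim$Age - 0.1 * sim$YearsAtJob))

fit <- glm(AttritionBinary ~ Age + YearsAtJob, data = sim, family = binomial())

# exp(coefficient) is the multiplicative change in the odds of leaving
# for a one-unit increase in that predictor, holding the other fixed
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))

# Lowering the cutoff flags at least as many employees as likely leavers
p <- predict(fit, type = "response")
flagged_50 <- sum(p > 0.5)
flagged_30 <- sum(p > 0.3)
c(cutoff_0.5 = flagged_50, cutoff_0.3 = flagged_30)
```

An odds ratio below 1 means the predictor is associated with lower odds of leaving. A lower cutoff catches more potential leavers at the cost of more false alarms, which may be acceptable when missing a departure is the costlier error.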


2. R Markdown Demonstration

a) Inline Mathematical Expression

The area of a rectangle is given by \(A = l \times w\), where \(l\) is the length and \(w\) is the width.

b) Inserting an Image

To display an online image (the R logo) at 200x100 pixels:

[Image: Company Logo]