CA8 - Predicting Home

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

employee <- read_csv("employee.csv")

## Rows: 1470 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): Attrition, BusinessTravel, Department, EducationField, Gender, Job...
## dbl (26): Age, DailyRate, DistanceFromHome, Education, EmployeeCount, Employ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

employee$Attrition <- ifelse(employee$Attrition == "Yes", 1, 0)

employee %>%
  ggplot(aes(JobSatisfaction, Education, z = Attrition)) +
  stat_summary_hex(alpha = 0.8, bins = 4) +
  scale_fill_viridis_c(labels = scales::percent) +
  labs(fill = "Attrition")

employee %>%
  ggplot(aes(DailyRate, YearsAtCompany, z = Attrition)) +
  stat_summary_hex(alpha = 0.8) +
  scale_fill_viridis_c(labels = scales::percent) +
  labs(fill = "Attrition")

Build a model

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.4     ✔ workflows    1.1.3
## ✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.2.0
## ✔ recipes      1.0.8

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org

set.seed(123)
employee_split <- employee %>%
  mutate(
    Attrition = if_else(as.logical(Attrition), "Left", "Stayed"),
    Attrition = factor(Attrition)
  ) %>%
  initial_split(strata = Attrition)
employee_train <- training(employee_split)
employee_test <- testing(employee_split)

set.seed(234)
employee_folds <- vfold_cv(employee_train, strata = Attrition)
employee_folds

## #  10-fold cross-validation using stratification 
## # A tibble: 10 × 2
##    splits            id    
##    <list>            <chr> 
##  1 <split [990/111]> Fold01
##  2 <split [990/111]> Fold02
##  3 <split [990/111]> Fold03
##  4 <split [990/111]> Fold04
##  5 <split [991/110]> Fold05
##  6 <split [991/110]> Fold06
##  7 <split [991/110]> Fold07
##  8 <split [992/109]> Fold08
##  9 <split [992/109]> Fold09
## 10 <split [992/109]> Fold10

employee_rec <-
  recipe(Attrition ~ DailyRate + DistanceFromHome + HourlyRate + JobLevel +
    MonthlyIncome + PerformanceRating + StandardHours +
    YearsAtCompany + WorkLifeBalance + YearsInCurrentRole,
  data = employee_train
  ) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_nzv(all_predictors())

## we can `prep()` just to check that it works
prep(employee_rec)

##

## ── Recipe ──────────────────────────────────────────────────────────────────────

##

## ── Inputs

## Number of variables by role

## outcome:    1
## predictor: 10

##

## ── Training information

## Training data contained 1101 data points and no incomplete rows.

##

## ── Operations

## • Unknown factor level assignment for: <none> | Trained

## • Dummy variables from: <none> | Trained

## • Sparse, unbalanced variable filter removed: StandardHours | Trained

xgb_spec <-
  boost_tree(
    trees = tune(),
    min_n = tune(),
    mtry = tune(),
    learn_rate = 0.01
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_wf <- workflow(employee_rec, xgb_spec)

Use racing to tune xgboost

library(finetune)
doParallel::registerDoParallel()

set.seed(345)
xgb_rs <- tune_race_anova(
  xgb_wf,
  resamples = employee_folds,
  grid = 15,
  metrics = metric_set(mn_log_loss),
  control = control_race(verbose_elim = TRUE)
)

## i Creating pre-processing data to finalize unknown parameter: mtry

## ℹ Racing will minimize the mn_log_loss metric.
## ℹ Resamples are analyzed in a random order.
## ℹ Fold10: 6 eliminated; 9 candidates remain.
## 
## ℹ Fold07: 3 eliminated; 6 candidates remain.
## 
## ℹ Fold03: 3 eliminated; 3 candidates remain.
## 
## ℹ Fold05: 0 eliminated; 3 candidates remain.
## 
## ℹ Fold09: 0 eliminated; 3 candidates remain.
## 
## ℹ Fold04: 0 eliminated; 3 candidates remain.
## 
## ℹ Fold06: 0 eliminated; 3 candidates remain.

plot_race(xgb_rs)

show_best(xgb_rs)

## # A tibble: 3 × 9
##    mtry trees min_n .metric     .estimator  mean     n std_err .config          
##   <int> <int> <int> <chr>       <chr>      <dbl> <int>   <dbl> <chr>            
## 1     4   442    39 mn_log_loss binary     0.414    10 0.00683 Preprocessor1_Mo…
## 2     1   599     8 mn_log_loss binary     0.415    10 0.00853 Preprocessor1_Mo…
## 3     2  1805    31 mn_log_loss binary     0.417    10 0.00868 Preprocessor1_Mo…

xgb_last <- xgb_wf %>%
  finalize_workflow(select_best(xgb_rs, metric = "mn_log_loss")) %>%
  last_fit(employee_split)

xgb_last

## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits             id               .metrics .notes   .predictions .workflow 
##   <list>             <chr>            <list>   <list>   <list>       <list>    
## 1 <split [1101/369]> train/test split <tibble> <tibble> <tibble>     <workflow>

collect_predictions(xgb_last) %>%
  mn_log_loss(Attrition, .pred_Left)

## # A tibble: 1 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 mn_log_loss binary         0.409

library(vip)

## 
## Attaching package: 'vip'

## The following object is masked from 'package:utils':
## 
##     vi

extract_workflow(xgb_last) %>%
  extract_fit_parsnip() %>%
  vip(geom = "point", num_features = 15)

Can we predict if an employee will leave the company (i.e., attrition) based on various features related to their job satisfaction, daily rates, years at the company, education, and several other attributes?

The dataset describes 1,470 employees of a single company across 35 columns. The target variable, Attrition, records whether an employee has left the company. The exploratory plots use JobSatisfaction (an employee’s contentment) and Education (academic qualifications) alongside DailyRate (the daily wage) and YearsAtCompany (tenure). Other attributes such as HourlyRate, JobLevel, and MonthlyIncome describe an employee’s remuneration and professional status. Together these variables form the basis for a predictive model of employee attrition.

The original dataset mixes categorical attributes (9 character columns) and numerical attributes (26 numeric columns). Attrition was first converted from “Yes”/“No” to 1/0 so that the hex-bin plots could display attrition rates. Before modeling it was converted again, this time to a factor with the levels “Left” and “Stayed”, because a classification model in tidymodels expects a factor outcome. The recipe includes step_unknown() and step_dummy(one_hot = TRUE) for categorical predictors, but the ten predictors selected for the model are all numeric, so both steps were applied to no columns (the prep() output lists <none> for each). The only step that changed the data was step_nzv(), which removed StandardHours as a near-zero-variance predictor and left nine predictors for the model.
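
To make the near-zero-variance filter concrete, here is a minimal base R sketch of the kind of check step_nzv() performs. The data frame `toy` and the helper `is_near_zero_var()` are hypothetical, and the thresholds (a frequency ratio of 95/5 and 10 percent unique values) are assumed to match the step's defaults; treat this as a simplified approximation rather than the recipes implementation.

```r
# Hypothetical toy data: one constant column, one balanced column,
# and one column dominated by a single value.
toy <- data.frame(
  StandardHours = rep(80, 100),
  JobLevel      = rep(1:5, times = 20),
  MostlyZero    = c(rep(0, 98), 1, 1)
)

# Simplified near-zero-variance check (thresholds are assumptions):
# flag a column if it is constant, or if its most common value is far more
# frequent than the runner-up AND it has few distinct values.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  if (length(counts) == 1) {
    return(TRUE)  # constant column: zero variance
  }
  freq_ratio <- counts[1] / counts[2]
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

flags <- vapply(toy, is_near_zero_var, logical(1))
flags

# Keep only the columns that pass the filter
names(toy)[!flags]
```

In this toy example the constant and the dominated columns are flagged, while the balanced JobLevel column is kept.
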
In summary, data preparation consisted of recoding the Attrition target, a stratified split into a training set (1,101 rows) and a test set (369 rows), 10-fold stratified cross-validation on the training set, and a recipe that handles unknown or categorical predictor values and drops uninformative predictors. The prep() output reports no incomplete rows in the training data, so no imputation was needed.
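
The idea behind a stratified split can be sketched in base R: sample the same fraction of rows within each outcome class so that the class balance is preserved in both sets. The data frame `toy_emp` and its class counts are made up for illustration, and the 75% training proportion is an assumption chosen to be consistent with the 1,101/369 split reported above; this is not how rsample implements initial_split() internally.

```r
set.seed(123)

# Hypothetical toy data: 200 employees, 16% of whom left
toy_emp <- data.frame(
  id        = 1:200,
  Attrition = factor(rep(c("Left", "Stayed"), times = c(32, 168)))
)

# Sample 75% of the row indices separately within each class
rows_by_class <- split(seq_len(nrow(toy_emp)), toy_emp$Attrition)
train_idx <- unlist(
  lapply(rows_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
  }),
  use.names = FALSE
)

toy_train <- toy_emp[train_idx, ]
toy_test  <- toy_emp[-train_idx, ]

# The share of "Left" is the same in both sets
prop.table(table(toy_train$Attrition))
prop.table(table(toy_test$Attrition))
```

Without stratification, a random split of an imbalanced outcome can leave the smaller class over- or under-represented in the test set, which makes the test metric noisier.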

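The model used in the analysis is xgboost (eXtreme Gradient Boosting), a gradient boosting framework that builds an ensemble of decision trees sequentially, with each new tree correcting the errors of the trees before it. It is widely used for supervised learning tasks such as classification and regression. Here the learning rate was fixed at 0.01, while the number of trees, the minimum node size (min_n), and the number of predictors sampled at each split (mtry) were tuned over 15 candidate combinations. Tuning used ANOVA-based racing (tune_race_anova()), which drops candidates that perform clearly worse on the early resamples; 3 of the 15 candidates survived and were evaluated on all 10 folds.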
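
The following base R sketch illustrates the core idea of gradient boosting for a binary outcome: start from a neutral score, then repeatedly fit a small tree to the current errors and take a small step in that direction. It is a toy under strong assumptions (one numeric feature, depth-1 trees, made-up data) and omits everything that makes xgboost fast and robust, such as regularization, second-order updates, and column subsampling. The function names are hypothetical; `trees` and `learn_rate` only mirror the argument names in the boost_tree() specification.

```r
# Fit a depth-1 tree (stump) to residuals r using one numeric feature x
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between values
  best <- NULL
  for (threshold in thresholds) {
    left <- x <= threshold
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse <- sum((r - pred)^2)
    if (is.null(best) || sse < best$sse) {
      best <- list(threshold = threshold, left = mean(r[left]),
                   right = mean(r[!left]), sse = sse)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$threshold, stump$left, stump$right)
}

# Boosting loop: each stump is fit to y - p, the negative gradient of the
# log loss with respect to the current log-odds score
boost <- function(x, y, trees = 50, learn_rate = 0.1) {
  score <- rep(0, length(y))  # log-odds of 0, i.e. probability 0.5
  stumps <- vector("list", trees)
  for (i in seq_len(trees)) {
    p <- 1 / (1 + exp(-score))
    stumps[[i]] <- fit_stump(x, y - p)
    score <- score + learn_rate * predict_stump(stumps[[i]], x)
  }
  list(stumps = stumps, learn_rate = learn_rate)
}

predict_boost <- function(model, x) {
  score <- rep(0, length(x))
  for (s in model$stumps) {
    score <- score + model$learn_rate * predict_stump(s, x)
  }
  1 / (1 + exp(-score))
}

# Made-up data: the outcome is mostly 1 above x = 0.6, with two flipped labels
toy_x <- seq(0, 1, length.out = 40)
toy_y <- as.numeric(toy_x > 0.6)
toy_y[c(5, 30)] <- 1 - toy_y[c(5, 30)]

fit   <- boost(toy_x, toy_y, trees = 50, learn_rate = 0.1)
p_hat <- predict_boost(fit, toy_x)

# Training log loss after boosting, compared with log(2) for a constant 0.5
-mean(toy_y * log(p_hat) + (1 - toy_y) * log(1 - p_hat))
log(2)
```

A small learning rate such as the 0.01 used in this analysis makes each tree contribute only a little, which is why it is usually paired with a large number of trees.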

The metric used for model evaluation is mn_log_loss, the mean log loss (also known as cross-entropy). Log loss scores the predicted class probabilities rather than the hard class labels: it is the average negative logarithm of the probability the model assigned to the true class. Lower values are better, and confident wrong predictions are penalized most heavily, which makes the metric well suited to models that output probabilities.
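
As a worked illustration, mean log loss for a binary outcome can be computed by hand in a few lines of base R. The helper `mean_log_loss()` and the toy vectors are hypothetical and are not output from the model; the event class "Left" follows the probability column (.pred_Left) used in the analysis.

```r
# Mean log loss for a binary outcome, computed by hand
mean_log_loss <- function(truth, prob_event, event = "Left", eps = 1e-15) {
  # Clip probabilities away from 0 and 1 so that log() stays finite
  p <- pmin(pmax(prob_event, eps), 1 - eps)
  y <- as.numeric(truth == event)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Made-up true classes and predicted probabilities of "Left"
truth     <- c("Left", "Stayed", "Stayed", "Left")
prob_left <- c(0.80, 0.10, 0.30, 0.40)

mean_log_loss(truth, prob_left)

# Reference point: always predicting 0.5 gives log(2), about 0.693
mean_log_loss(c("Left", "Stayed"), c(0.5, 0.5))

# A confident wrong prediction costs far more than a hesitant one
mean_log_loss("Left", 0.01) > mean_log_loss("Left", 0.40)
```

The log(2) reference is useful when reading the results: a binary classifier whose mean log loss is well below 0.693 is doing better than an uninformative 50/50 guess.
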
The analysis tuned an xgboost model on the employee dataset to predict attrition. The best candidate (mtry = 4, trees = 442, min_n = 39) reached a cross-validated mean log loss of 0.414, and the finalized model scored 0.409 on the held-out test set, so its performance on unseen data is in line with the resampling estimate. The variable importance plot ranks the predictors by their contribution to the model and shows which attributes are most strongly associated with the decision to leave or stay. These findings can help a company understand the drivers of attrition and target its retention efforts.

Anton Jellvik

2023-10-26
