Decision Trees - Toolkit

##   Age self_employed family_history remote_work tech_company   benefits
## 1  37          <NA>             No          No          Yes        Yes
## 2  44          <NA>             No          No           No Don't know
## 3  32          <NA>             No          No          Yes         No
## 4  31          <NA>            Yes          No          Yes         No
## 5  31          <NA>             No         Yes          Yes        Yes
## 6  33          <NA>            Yes          No          Yes        Yes
##   care_options wellness_program  anonymity              leave
## 1     Not sure               No        Yes      Somewhat easy
## 2           No       Don't know Don't know         Don't know
## 3           No               No Don't know Somewhat difficult
## 4          Yes               No         No Somewhat difficult
## 5           No       Don't know Don't know         Don't know
## 6     Not sure               No Don't know         Don't know
##   mental_health_consequence phys_health_consequence    coworkers supervisor
## 1                        No                      No Some of them        Yes
## 2                     Maybe                      No           No         No
## 3                        No                      No          Yes        Yes
## 4                       Yes                     Yes Some of them         No
## 5                        No                      No Some of them        Yes
## 6                        No                      No          Yes        Yes
##   mental_health_interview phys_health_interview mental_vs_physical
## 1                      No                 Maybe                Yes
## 2                      No                    No         Don't know
## 3                     Yes                   Yes                 No
## 4                   Maybe                 Maybe                 No
## 5                     Yes                   Yes         Don't know
## 6                      No                 Maybe         Don't know
##   obs_consequence no_employess_mid    resp_var
## 1              No               15 sought_help
## 2              No             1250     no_help
## 3              No               15     no_help
## 4             Yes               63 sought_help
## 5              No              300     no_help
## 6              No               15     no_help

Imbalanced data

This dataset is slightly imbalanced: 637 respondents sought help and 362 did not, so the minority class makes up about 36% of the 999 observations. A classifier trained on these data may favor the majority class at the expense of the minority class. To remedy this we can use:

Bootstrap re-sampling - Sample with replacement from the minority class to re-balance

To carry out bootstrap re-sampling, we draw a random selection of row indices from the minority class, with replacement, until it matches the size of the majority class. In effect, this randomly duplicates some of the minority class samples:

# Split data into help and no help classes
no_help <- mh_dat[which(mh_dat$resp_var == "no_help"),] # Select minority samples
help <- mh_dat[which(mh_dat$resp_var == "sought_help"),] # Select majority samples
nrow(no_help) # Rows in no help

## [1] 362

nrow(help) # Rows in help

## [1] 637

set.seed(123456) # Set seed for sampling
no_help_boot <- no_help[sample(seq_len(nrow(no_help)), size = nrow(help), replace = TRUE), ] # Bootstrap the minority class up to the majority class size
nrow(no_help_boot) # Check rows of bootstrap sample

## [1] 637

use_dat <- rbind.data.frame(help, no_help_boot) # Join data together
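The same re-balancing step can be wrapped in a small helper so that the minority class does not have to be identified by hand. This is only a sketch: `bootstrap_balance()` and the toy data frame are hypothetical names and values introduced for illustration, not part of the original analysis.

```r
# Hypothetical helper: bootstrap every smaller class up to the size of the largest class
bootstrap_balance <- function(dat, class_col) {
  groups <- split(dat, dat[[class_col]], drop = TRUE)   # one data frame per class
  target <- max(vapply(groups, nrow, integer(1)))       # size of the largest class
  balanced <- lapply(groups, function(g) {
    if (nrow(g) == target) return(g)                    # largest class is kept as-is
    # Sample row indices with replacement, as in the code above
    g[sample(seq_len(nrow(g)), size = target, replace = TRUE), , drop = FALSE]
  })
  out <- do.call(rbind, balanced)
  rownames(out) <- NULL
  out
}

# Toy stand-in for mh_dat (made-up values): 5 sought_help rows, 3 no_help rows
set.seed(123456)
toy_dat <- data.frame(
  Age = c(37, 44, 32, 31, 31, 33, 29, 41),
  resp_var = c("sought_help", "no_help", "no_help", "sought_help",
               "sought_help", "sought_help", "sought_help", "no_help")
)
balanced <- bootstrap_balance(toy_dat, "resp_var")
table(balanced$resp_var) # Both classes now have the same number of rows
```

Under these assumptions, calling `bootstrap_balance(mh_dat, "resp_var")` should give a data set equivalent in shape to `use_dat`, although the exact rows drawn depend on the seed.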

We can now fit a tree on the bootstrapped data, use_dat, in which both classes have 637 rows.
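As a rough illustration of fitting a classification tree on balanced data, the sketch below uses the `rpart` package. That choice is an assumption, since this section does not name the tree package, and `toy_dat` is a made-up stand-in for `use_dat`.

```r
library(rpart) # Assumption: rpart is used for the tree; the section does not name a package

# Toy stand-in for use_dat (made-up values, 100 rows per class)
set.seed(123456)
n <- 200
resp <- rep(c("sought_help", "no_help"), each = n / 2)
toy_dat <- data.frame(
  Age = round(runif(n, 18, 65)),
  # family_history is made more common among those who sought help
  family_history = factor(ifelse(runif(n) < ifelse(resp == "sought_help", 0.7, 0.3),
                                 "Yes", "No")),
  resp_var = factor(resp)
)

# Fit a classification tree using all other columns as predictors
tree_fit <- rpart(resp_var ~ ., data = toy_dat, method = "class")

# Predicted classes on the training data and a confusion table
pred <- predict(tree_fit, newdata = toy_dat, type = "class")
table(predicted = pred, actual = toy_dat$resp_var)
```

With the real data, the equivalent call would be `rpart(resp_var ~ ., data = use_dat, method = "class")`. Because bootstrapped rows are duplicates, accuracy measured on the training data is likely to be optimistic.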

Also: SMOTE - Synthetic Minority Over-sampling Technique

SMOTE creates new synthetic observations for the minority class by interpolating between existing minority observations and their nearest neighbors within that class. It is often combined with under-sampling of the majority class.
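The interpolation idea behind SMOTE can be sketched in base R for numeric features. This is a simplified illustration, not a packaged implementation: `smote_sketch()` is a hypothetical function, the toy values are made up, and categorical columns such as most of those in this survey would need to be encoded or handled by a SMOTE variant first.

```r
# Minimal SMOTE-style sketch for numeric features (base R only, hypothetical helper)
smote_sketch <- function(X, n_new, k = 5) {
  X <- as.matrix(X)
  n <- nrow(X)
  stopifnot(n >= 2)                        # Need at least one neighbor
  k <- min(k, n - 1)
  D <- as.matrix(dist(X))                  # Pairwise Euclidean distances
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X),
                  dimnames = list(NULL, colnames(X)))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1)               # Pick a random minority observation
    nn <- order(D[base, ])                 # Other observations, nearest first
    nn <- nn[nn != base][seq_len(k)]       # Keep its k nearest neighbors
    nb <- nn[sample.int(length(nn), 1)]    # Pick one neighbor at random
    gap <- runif(1)                        # Random position along the segment
    synth[i, ] <- X[base, ] + gap * (X[nb, ] - X[base, ])
  }
  synth
}

# Toy minority-class data (made-up values)
set.seed(123456)
minority <- data.frame(Age = c(25, 31, 38, 44, 52),
                       no_employess_mid = c(15, 63, 300, 15, 1250))
synthetic <- smote_sketch(minority, n_new = 10, k = 3)
dim(synthetic) # 10 synthetic rows, 2 columns
```

Each synthetic row lies on the line segment between a minority observation and one of its neighbors, so the new points stay within the range of the observed minority data rather than being exact duplicates as in bootstrap re-sampling.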

Ben Scartz

2024-09-25
