load("mh_dat.rda") # Load data into workspace
head(mh_dat) # View first six rows
##   Age self_employed family_history remote_work tech_company   benefits
## 1  37          <NA>             No          No          Yes        Yes
## 2  44          <NA>             No          No           No Don't know
## 3  32          <NA>             No          No          Yes         No
## 4  31          <NA>            Yes          No          Yes         No
## 5  31          <NA>             No         Yes          Yes        Yes
## 6  33          <NA>            Yes          No          Yes        Yes
##   care_options wellness_program  anonymity              leave
## 1     Not sure               No        Yes      Somewhat easy
## 2           No       Don't know Don't know         Don't know
## 3           No               No Don't know Somewhat difficult
## 4          Yes               No         No Somewhat difficult
## 5           No       Don't know Don't know         Don't know
## 6     Not sure               No Don't know         Don't know
##   mental_health_consequence phys_health_consequence    coworkers supervisor
## 1                        No                      No Some of them        Yes
## 2                     Maybe                      No           No         No
## 3                        No                      No          Yes        Yes
## 4                       Yes                     Yes Some of them         No
## 5                        No                      No Some of them        Yes
## 6                        No                      No          Yes        Yes
##   mental_health_interview phys_health_interview mental_vs_physical
## 1                      No                 Maybe                Yes
## 2                      No                    No         Don't know
## 3                     Yes                   Yes                 No
## 4                   Maybe                 Maybe                 No
## 5                     Yes                   Yes         Don't know
## 6                      No                 Maybe         Don't know
##   obs_consequence no_employess_mid    resp_var
## 1              No               15 sought_help
## 2              No             1250     no_help
## 3              No               15     no_help
## 4             Yes               63 sought_help
## 5              No              300     no_help
## 6              No               15     no_help

The tree

tree_1 <- rpart::rpart(resp_var ~ ., # Set tree formula
                       data = mh_dat) # Set dataset

rattle::fancyRpartPlot(tree_1) # Plot fancy tree

Parameters

Imbalanced data

This dataset is slightly imbalanced: 637 respondents sought help and 362 did not. A classifier trained on it may pay more attention to the majority class. One remedy is to oversample the minority class, drawing a bootstrap sample (sampling with replacement) of the same size as the majority class:

# Split data into help and no help classes
no_help <- mh_dat[which(mh_dat$resp_var == "no_help"),] # Select minority samples
help <- mh_dat[which(mh_dat$resp_var == "sought_help"),] # Select majority samples
nrow(no_help) # Rows in no help
## [1] 362
nrow(help) # Rows in help
## [1] 637
set.seed(123456) # Set seed for sampling
no_help_boot <- no_help[sample(1:nrow(no_help), size = nrow(help), replace = TRUE),] # Create bootstrap sample
nrow(no_help_boot) # Check rows of bootstrap sample
## [1] 637
use_dat <- rbind.data.frame(help, no_help_boot) # Join data together

We can now fit a tree on the balanced data in use_dat, which holds the 637 majority samples plus the 637 bootstrapped minority samples.
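
A minimal sketch of that fit, mirroring the earlier rpart call. The names tree_2 and pred are illustrative and do not come from the original analysis, and the small stand-in data frame is made up so that the snippet runs on its own; it is skipped when use_dat already exists in the workspace.

```r
# Stand-in data so the sketch runs on its own (made-up values);
# skipped when use_dat already exists from the steps above
if (!exists("use_dat")) {
  set.seed(123456)
  use_dat <- data.frame(
    Age = round(runif(200, 20, 60)),
    family_history = factor(sample(c("No", "Yes"), 200, replace = TRUE)),
    resp_var = factor(rep(c("sought_help", "no_help"), each = 100))
  )
}

tree_2 <- rpart::rpart(resp_var ~ ., # Same formula as before
                       data = use_dat, # Balanced data
                       method = "class") # Classification tree

pred <- predict(tree_2, newdata = use_dat, type = "class") # Predicted classes
table(observed = use_dat$resp_var, predicted = pred) # Confusion matrix
```

Because the bootstrapped rows are duplicates of rows the tree has already seen, a confusion matrix computed on use_dat itself is likely to look optimistic. Splitting off a test set before oversampling would give a fairer picture.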

Also: SMOTE - Synthetic Minority Over-sampling Technique

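SMOTE creates new synthetic observations for the minority class. For a given minority sample it picks one of that sample's nearest minority-class neighbors and places a new point at a random position on the line segment between the two. Some implementations also let you under-sample the majority class at the same time.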
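
The interpolation idea can be sketched in a few lines of base R. This is a simplified illustration for numeric features only, not the implementation from any package: smote_sketch is a hypothetical helper, and the values in minority are made up. Most columns in mh_dat are categorical, so a variant that handles mixed data types would be needed to apply the idea to this dataset directly.

```r
# Hypothetical helper: SMOTE-style interpolation for numeric features only
smote_sketch <- function(x_min, n_new, k = 5) {
  x_min <- as.matrix(x_min)
  n <- nrow(x_min)
  stopifnot(n >= 2) # Need at least one neighbour
  k <- min(k, n - 1) # Cannot use more neighbours than other points
  d <- as.matrix(dist(x_min)) # Pairwise Euclidean distances
  diag(d) <- Inf # A point is not its own neighbour
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(x_min),
                  dimnames = list(NULL, colnames(x_min)))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1) # Pick a minority sample at random
    nn <- order(d[base, ])[seq_len(k)] # Its k nearest minority neighbours
    nb <- nn[sample.int(k, 1)] # Choose one neighbour at random
    gap <- runif(1) # Random position between the two points
    synth[i, ] <- x_min[base, ] + gap * (x_min[nb, ] - x_min[base, ])
  }
  synth
}

set.seed(123456) # Set seed for sampling
minority <- cbind(Age = c(25, 31, 33, 40, 44, 52),
                  no_employess_mid = c(15, 15, 63, 300, 300, 1250)) # Made-up values
new_rows <- smote_sketch(minority, n_new = 4, k = 3) # Create 4 synthetic rows
new_rows
```

Each synthetic row lies between two existing minority rows, so unlike the bootstrap sample above it is not an exact duplicate of an observed row.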