load("mh_dat.rda") # Load data into workspace
head(mh_dat) # View first six rows
## Age self_employed family_history remote_work tech_company benefits
## 1 37 <NA> No No Yes Yes
## 2 44 <NA> No No No Don't know
## 3 32 <NA> No No Yes No
## 4 31 <NA> Yes No Yes No
## 5 31 <NA> No Yes Yes Yes
## 6 33 <NA> Yes No Yes Yes
## care_options wellness_program anonymity leave
## 1 Not sure No Yes Somewhat easy
## 2 No Don't know Don't know Don't know
## 3 No No Don't know Somewhat difficult
## 4 Yes No No Somewhat difficult
## 5 No Don't know Don't know Don't know
## 6 Not sure No Don't know Don't know
## mental_health_consequence phys_health_consequence coworkers supervisor
## 1 No No Some of them Yes
## 2 Maybe No No No
## 3 No No Yes Yes
## 4 Yes Yes Some of them No
## 5 No No Some of them Yes
## 6 No No Yes Yes
## mental_health_interview phys_health_interview mental_vs_physical
## 1 No Maybe Yes
## 2 No No Don't know
## 3 Yes Yes No
## 4 Maybe Maybe No
## 5 Yes Yes Don't know
## 6 No Maybe Don't know
## obs_consequence no_employess_mid resp_var
## 1 No 15 sought_help
## 2 No 1250 no_help
## 3 No 15 no_help
## 4 Yes 63 sought_help
## 5 No 300 no_help
## 6 No 15 no_help
tree_1 <- rpart::rpart(resp_var ~ ., # Set tree formula
                       data = mh_dat) # Set dataset
rattle::fancyRpartPlot(tree_1) # Plot fancy tree
This dataset is slightly imbalanced: 637 respondents sought help and 362 did not. A classifier fit to it may pay more attention to the majority class. One remedy is to oversample the minority class by bootstrapping, that is, sampling it with replacement until it matches the majority class in size:
# Split data into help and no help classes
no_help <- mh_dat[which(mh_dat$resp_var == "no_help"),] # Select minority samples
help <- mh_dat[which(mh_dat$resp_var == "sought_help"),] # Select majority samples
nrow(no_help) # Rows in no help
## [1] 362
nrow(help) # Rows in help
## [1] 637
set.seed(123456) # Set seed for sampling
no_help_boot <- no_help[sample(seq_len(nrow(no_help)), size = nrow(help), replace = TRUE), ] # Create bootstrap sample
nrow(no_help_boot) # Check rows of bootstrap sample
## [1] 637
use_dat <- rbind.data.frame(help, no_help_boot) # Join data together
We can now fit a tree on the balanced data in use_dat, which combines the original majority class with the bootstrapped minority class.
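A minimal sketch of that fit, mirroring the earlier `rpart::rpart()` call. Since `mh_dat.rda` is not available here, the block builds a small simulated stand-in (`toy_dat`, with made-up `Age` and `remote_work` values) that has the same 637/362 class split. The names `toy_dat`, `sought` and `tree_2` are illustrative assumptions; with the real data, `use_dat` from above would be passed to `rpart::rpart()` directly.

```r
# Simulated stand-in for mh_dat (hypothetical values, same class split)
set.seed(123456)
toy_dat <- data.frame(
  Age = round(runif(999, min = 18, max = 65)),
  remote_work = factor(sample(c("Yes", "No"), 999, replace = TRUE)),
  resp_var = factor(rep(c("sought_help", "no_help"), times = c(637, 362)))
)

# Oversample the minority class with replacement, as in the steps above
no_help <- toy_dat[toy_dat$resp_var == "no_help", ]
sought <- toy_dat[toy_dat$resp_var == "sought_help", ]
no_help_boot <- no_help[sample(seq_len(nrow(no_help)),
                               size = nrow(sought), replace = TRUE), ]
use_dat <- rbind(sought, no_help_boot)

table(use_dat$resp_var) # Both classes now have 637 rows

# Fit a classification tree on the balanced data
tree_2 <- rpart::rpart(resp_var ~ ., data = use_dat, method = "class")

# Predicted classes for the original (unbalanced) rows
pred <- predict(tree_2, newdata = toy_dat, type = "class")
table(predicted = pred, actual = toy_dat$resp_var)
```

Because the bootstrapped rows are copies of rows the tree has already seen, a table like this one is optimistic; a held-out test set drawn before oversampling gives a fairer picture.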
Also: SMOTE - Synthetic Minority Over-sampling Technique
SMOTE creates new synthetic observations for the minority class. Each synthetic observation is placed between an existing minority observation and one of its nearest neighbors in the same class. SMOTE can also be combined with under-sampling of the majority class.
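The interpolation idea can be sketched in a few lines of base R. This is a hand-rolled illustration, not the implementation from any package: `smote_sketch()`, its arguments, and the toy `minority` data are all hypothetical, and it handles numeric features only, so it would not apply directly to the mostly categorical survey data above without some encoding.

```r
# Hand-rolled SMOTE sketch for numeric features (illustrative, not a package API)
smote_sketch <- function(minority, n_new, k = 5) {
  x <- as.matrix(minority)
  stopifnot(k < nrow(x))
  d <- as.matrix(dist(x)) # Pairwise Euclidean distances
  diag(d) <- Inf          # A point is not its own neighbor
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(x),
                  dimnames = list(NULL, colnames(x)))
  for (i in seq_len(n_new)) {
    base <- sample(nrow(x), 1)         # Pick a minority observation
    nn <- order(d[base, ])[seq_len(k)] # Its k nearest minority neighbors
    nb <- nn[sample(length(nn), 1)]    # Pick one neighbor at random
    gap <- runif(1)                    # Random position between the two points
    synth[i, ] <- x[base, ] + gap * (x[nb, ] - x[base, ])
  }
  as.data.frame(synth)
}

set.seed(123456)
minority <- data.frame(x1 = rnorm(20), x2 = rnorm(20)) # Toy minority class
new_obs <- smote_sketch(minority, n_new = 30, k = 5)
nrow(new_obs) # Number of synthetic rows created
```

Each synthetic row lies on the line segment between two real minority rows, so the new points stay inside the region the minority class already occupies rather than duplicating rows exactly as the bootstrap does.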