Introduction

Like many other (laid-off) software engineers, I have had concerns throughout my career about continuing in the current incarnation of the technology industry because of its effects on individuals’ mental health. There is no shortage of issues generating endless discussion in traditional and new media that directly affect tech work and workers: the perceived impact of AI capabilities on job prospects, the effects of increased social media use on society, the hardware economy and its geopolitics, and so on.

My focus in this analysis is to examine possible relationships between the reporting of mental health issues by tech workers and the characteristics of the companies that employ them, including workplace culture and the resources provided. This information could be valuable in developing actionable changes that tech companies can make to foster more supportive and rewarding environments. Studies on data like this could help improve productivity and attract the best possible labor force; but more importantly, I believe employers and decision-makers have a moral and ethical responsibility to their employees. They must remember that their workers’ health and well-being are the base upon which businesses are built.

Using the survey data described below, my goal is to develop an explanatory model that identifies the factors most strongly associated with higher or lower likelihoods of mental health reporting, enabling data-driven recommendations for improving workplace support systems in the tech industry.

Exploratory Data Analysis

The data for this study comes from OSMI (Open Sourcing Mental Illness), a non-profit that aims to provide resources and raise awareness of mental health issues in the tech community. This particular dataset is a compilation of survey responses from workers in tech jobs, collected from 2017 to 2021.

The Variables

The response variable will be mental_health, described as “Whether or not respondents currently have a mental health disorder” as self-reported on the surveys. To me, this variable is less an indication of the actual rates of poor mental health in employees than it is an expression of openness about and awareness of those issues. In other words, the model will really be trained to predict “Yes” when employees feel safe enough to talk about sensitive personal issues in the context of their workplace, even when the survey is anonymized.

Some of the predictor variables include:

  • tech_company Whether or not the respondent’s employer is primarily a tech company/organization.

  • benefits Whether or not the employer provides mental health benefits as part of healthcare coverage.

  • medical_coverage Whether or not respondents have medical coverage (private insurance or state-provided) that includes treatment of mental health disorders.

  • workplace_resources Whether or not the employer offers resources to learn more about mental health disorders and options for seeking help.

  • mh_employer_discussion Whether or not the respondent has ever discussed their mental health with the employer.

  • mh_coworker_discussion Whether or not the respondent has ever discussed their mental health with coworkers.

  • mh_share The respondent’s willingness to share mental health issues with friends and family, on a scale from 0 to 10.
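
The wrangling and modeling code below assumes the following packages are loaded (a setup sketch; exact package versions are not specified in the source):

# packages assumed to be loaded for this analysis
library(dplyr)         # glimpse, mutate, na_if, case_match
library(caret)         # createDataPartition, confusionMatrix
library(randomForest)  # na.roughfix, randomForest, importance
library(e1071)         # svm, tune
library(neuralnet)     # neuralnet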

osmi_raw <- read.csv(file = "osmi.csv")
glimpse(osmi_raw)
## Rows: 1,242
## Columns: 11
## $ tech_company           <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No",…
## $ benefits               <chr> "No", "Yes", "I don't know", "Yes", "Yes", "Yes…
## $ workplace_resources    <chr> "I don't know", "No", "No", "I don't know", "No…
## $ mh_employer_discussion <chr> "No", "No", "Yes", "Yes", "No", "No", "Yes", "N…
## $ mh_coworker_discussion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes",…
## $ medical_coverage       <chr> "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes",…
## $ mental_health          <chr> "Possibly", "Possibly", "Yes", "Yes", "Yes", "N…
## $ mh_share               <int> 5, 4, 5, 10, 8, 3, 2, 10, 3, 9, 8, 7, 3, 10, 5,…
## $ age                    <dbl> 27, 31, 36, 22, 36, 38, 40, 35, 22, 28, 21, 35,…
## $ gender                 <chr> "Female", "Male", "Male", "Male", "Female", "Fe…
## $ country                <chr> "United Kingdom", "United Kingdom", "United Sta…

The data will be cleaned and transformed as follows:

  • some unclear or unsure responses are converted to NAs

  • character columns are converted to factors with no more than 3 levels

  • the country column is converted to “USA” or “Other”, since the majority of respondents are from the US

Also, as discussed above, since the goal is binary classification on the mental_health column, the “Don’t Know” value will be converted to “No” and “Possibly” to “Yes” (interpreted here as respondents being comfortable enough in their workplace to answer affirmatively).

osmi <- osmi_raw

osmi <- osmi |>
  mutate(benefits = na_if(benefits, "I don't know"),
         workplace_resources = na_if(workplace_resources, "I don't know"),
         country = case_match(country, "United States of America" ~ "USA", .default = "Other"),
         mental_health = case_match(mental_health, "Don't Know" ~ "No", .default = mental_health),
         mental_health = case_match(mental_health, "Possibly" ~ "Yes", .default = mental_health)) |>
  mutate(across(where(is.character), as.factor))

head(osmi)

Distributions

The sole numeric feature among the mental health-specific columns in this set is the employees’ willingness to share mental health issues with friends and family (mh_share). Here are the distributions across the demographic data: age, gender and country. There are slightly higher frequencies of responses for workers in their 30s and for those with gender identity reported as ‘other’.
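
A minimal sketch of how such distribution plots could be drawn, assuming ggplot2 (not necessarily the code behind the original figures):

library(ggplot2)

# illustrative sketch: distribution of mh_share, faceted by gender
ggplot(osmi, aes(x = mh_share)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ gender) +
  labs(x = "Willingness to share (0-10)", y = "Count")

# illustrative sketch: age distribution, faceted by country
ggplot(osmi, aes(x = age)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ country) +
  labs(x = "Age", y = "Count")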

The binary categorical variables representing company resources and culture are visualized below. There are clear majorities in the data for companies that are in the tech industry and for employees having never discussed mental health with their employer, even though more employees overall reported “Yes” for mental health issues.
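
A similar sketch for the categorical counts, assuming ggplot2 and tidyr (again, illustrative rather than the original plotting code):

library(ggplot2)
library(tidyr)

# illustrative sketch: response counts for the company resource and culture variables
osmi |>
  select(tech_company, benefits, workplace_resources,
         mh_employer_discussion, mh_coworker_discussion, medical_coverage) |>
  mutate(across(everything(), as.character)) |>
  pivot_longer(everything(), names_to = "variable", values_to = "response") |>
  count(variable, response) |>
  ggplot(aes(x = response, y = n)) +
  geom_col() +
  facet_wrap(~ variable) +
  labs(x = NULL, y = "Count")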

Algorithm Selection

Support Vector Machines

Since the dataset is relatively small with only 1242 observations, I have chosen to begin with SVMs because these models can be effective at avoiding overfitting during the training stage, which could be a concern for a decision tree or even ensembles of trees like XGBoost. Cross-validation will be applied to help further tune hyperparameters and increase accuracy and efficiency.

Neural Networks

I have also chosen to experiment with a simple feedforward neural network because the end goal is just a binary classification. More complex architectures tend to work well on larger datasets and problems like image processing, and would be overkill for this more focused task.

Experimentation & Model Training

The data is split for training and testing. Both types of algorithms require the NAs to be imputed, so I applied na.roughfix, which replaces missing numeric values with column medians and missing factor values with the most frequent level (breaking ties at random).

set.seed(101)

splitIndex <- createDataPartition(osmi$mental_health, p = 0.8, list = FALSE)
osmi_train <- osmi[splitIndex,]
osmi_test <- osmi[-splitIndex,]

osmi_train <- na.roughfix(osmi_train)
osmi_test <- na.roughfix(osmi_test)

# breakdown of the response variable in each dataset
round(prop.table(table(select(osmi, mental_health))), 2)
## mental_health
##   No  Yes 
## 0.35 0.65
round(prop.table(table(select(osmi_train, mental_health))), 2)
## mental_health
##   No  Yes 
## 0.35 0.65
round(prop.table(table(select(osmi_test, mental_health))), 2)
## mental_health
##   No  Yes 
## 0.35 0.65

A simple experiment log is initialized first for tracking.

experiment_log <- data.frame(
  ID = integer(),
  Model = character(),
  Features = character(),
  Hyperparameters = character(),
  Train = numeric(),
  Test = numeric(),
  Notes = character(),
  stringsAsFactors = FALSE
)
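
Each experiment’s results are appended to this log; a small helper of the kind sketched below keeps the entries consistent (log_experiment is a hypothetical convenience function, not from any package):

# hypothetical helper: append one row to the experiment log
log_experiment <- function(log, id, model, features, hyperparameters,
                           train_acc, test_acc, notes) {
  rbind(log, data.frame(
    ID = id,
    Model = model,
    Features = features,
    Hyperparameters = hyperparameters,
    Train = round(train_acc, 2),
    Test = round(test_acc, 2),
    Notes = notes,
    stringsAsFactors = FALSE
  ))
}

# example usage, once Experiment 1's accuracies have been computed:
# experiment_log <- log_experiment(experiment_log, 1, "SVM", "all", "linear kernel, cost = 1",
#                                  0.70, 0.66, "mediocre accuracy, too many support vectors")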

Support Vector Machines

Experiment 1:

Objective: Since SVMs generalize well on smaller datasets like this one, this model can serve as a baseline for accuracy comparisons.

Variations: This first model will use a linear kernel with the default cost (hardness/softness of margin) of 1. The model has built-in scaling for the numeric columns to balance the features’ contributions in determining the hyperplane.

Evaluation: Generating the confusion matrix for accuracy.

Experiment:

set.seed(101)

# vector of numeric columns for scaling
num_cols <- sapply(osmi_train, is.numeric)

svm1 <- svm(mental_health ~.,
            data = osmi_train,
            scale = num_cols,
            kernel = "linear")

print(svm1)
## 
## Call:
## svm(formula = mental_health ~ ., data = osmi_train, kernel = "linear", 
##     scale = num_cols)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  662
# predict and evaluate on training data
svm1_train_pred <- predict(svm1, osmi_train)
svm1_train_cm <- confusionMatrix(svm1_train_pred, osmi_train$mental_health)
svm1_train_cm$overall["Accuracy"]
##  Accuracy 
## 0.7022133
# predict and evaluate on testing data
svm1_test_pred <- predict(svm1, osmi_test)
svm1_test_cm <- confusionMatrix(svm1_test_pred, osmi_test$mental_health)
svm1_test_cm$overall["Accuracy"]
##  Accuracy 
## 0.6572581

Review: The accuracy rates for training and testing indicate decent performance, but improvement on this metric would be ideal. There are also 662 support vectors in this model out of 994 observations in the training set, implying overfitting or noisy data.
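
That ratio can be checked directly from the fitted object (a quick sketch using the tot.nSV component of e1071’s svm object):

# proportion of training observations that ended up as support vectors
svm1$tot.nSV / nrow(osmi_train)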

Experiment 2:

Objective: To see if we can improve on the accuracy and performance, the kernel will be updated.

Variations: The kernel will be changed to the nonlinear RBF (Radial Basis Function), which is more flexible and better able to capture nonlinear patterns; this may be helpful since the dataset is on the smaller side.

Evaluation: Generating the confusion matrix for accuracy.

Experiment:

set.seed(101)

svm2 <- svm(mental_health ~.,
            data = osmi_train,
            scale = num_cols,
            kernel = "radial")

print(svm2)
## 
## Call:
## svm(formula = mental_health ~ ., data = osmi_train, kernel = "radial", 
##     scale = num_cols)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  684
# predict and evaluate on training data
svm2_train_pred <- predict(svm2, osmi_train)
svm2_train_cm <- confusionMatrix(svm2_train_pred, osmi_train$mental_health)
svm2_train_cm$overall["Accuracy"]
##  Accuracy 
## 0.7354125
# predict and evaluate on testing data
svm2_test_pred <- predict(svm2, osmi_test)
svm2_test_cm <- confusionMatrix(svm2_test_pred, osmi_test$mental_health)
svm2_test_cm$overall["Accuracy"]
##  Accuracy 
## 0.6814516

Review: The accuracy improved slightly, but with a larger difference between training and testing. The number of support vectors also increased, so we are still overfitting on noise.

Experiment 3:

Objective: 10-fold cross-validation will be applied with different, commonly-used cost and gamma values to determine the best-performing hyperparameters for a final SVM test.

Variations: Based on the above, the cost may be changed to 0.01, 0.1, 1 (same), or 10. The best gamma will be chosen from 0.001, 0.025 (default), 0.1, or 1.

Evaluation: Generating the confusion matrix for accuracy.

Experiment:

set.seed(101)

tune_mod <- tune(svm,
                 mental_health ~.,
                 data = osmi_train,
                 kernel = "radial",
                 ranges = list(cost = c(0.01, 0.1, 1, 10), gamma = c(0.001, 0.025, 0.1, 1)))

print(tune_mod)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     1   0.1
## 
## - best performance: 0.3058283
best_mod <- tune_mod$best.model
print(best_mod)
## 
## Call:
## best.tune(METHOD = svm, train.x = mental_health ~ ., data = osmi_train, 
##     ranges = list(cost = c(0.01, 0.1, 1, 10), gamma = c(0.001, 0.025, 
##         0.1, 1)), kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  676
# predict and evaluate on training data
best_train_pred <- predict(best_mod, osmi_train)
best_train_cm <- confusionMatrix(best_train_pred, osmi_train$mental_health)
best_train_cm$overall["Accuracy"]
##  Accuracy 
## 0.7193159
# predict and evaluate on testing data
best_test_pred <- predict(best_mod, osmi_test)
best_test_cm <- confusionMatrix(best_test_pred, osmi_test$mental_health)
best_test_cm$overall["Accuracy"]
##  Accuracy 
## 0.6491935

Review: The hyperparameter tuning determined that the best model raised the gamma to 0.1, which shrinks each support vector’s radius of influence and allows a more flexible decision boundary; as expected, this resulted in the model overfitting on the training data.

The number of support vectors remains too high. Overall, I would say the performance of SVMs on this small dataset has been merely passable.

Simple Neural Network

Experiment 4:

Objective: We will test whether a simple neural network can predict more effectively and accurately on this dataset than SVM.

Variations:

  • For the neuralnet formula, features must be specified explicitly, so a subset is hand-selected. A simple randomForest model can quickly rank feature importance using MeanDecreaseGini (the mean reduction in node impurity), and the top 5 features will be selected.

  • Factor columns must be converted to dummy variables (one-hot encoded); predictions are then converted back to factor labels for interpretation.

  • The first neural network will have 2 hidden layers, the first with 3 neurons and the second with 2 (arbitrarily chosen starting values).

Evaluation: Generating the confusion matrix for accuracy.

Experiment:

set.seed(101)

rf <- randomForest(mental_health ~ ., data = osmi_train)
importance(rf)
##                        MeanDecreaseGini
## tech_company                  12.410531
## benefits                       9.606341
## workplace_resources           12.884414
## mh_employer_discussion        21.054504
## mh_coworker_discussion        34.408542
## medical_coverage               5.990652
## mh_share                      59.006817
## age                           83.580049
## gender                        18.292597
## country                       21.333865
# scale numeric columns and one-hot encode factors for the training set;
# the binary response is kept as 0/1 for neuralnet
otrain_scaled <- osmi_train
otrain_scaled[num_cols] <- scale(otrain_scaled[num_cols])
mh_1h_train <- ifelse(otrain_scaled$mental_health == "Yes", 1, 0)
otrain_scaled <- model.matrix(mental_health ~ ., data = otrain_scaled)[, -1]
otrain_scaled <- cbind(mental_health = mh_1h_train, otrain_scaled)

# same transformation for the test set (note: scaled with its own statistics)
otest_scaled <- osmi_test
otest_scaled[num_cols] <- scale(otest_scaled[num_cols])
mh_1h_test <- ifelse(otest_scaled$mental_health == "Yes", 1, 0)
otest_scaled <- model.matrix(mental_health ~ ., data = otest_scaled)[, -1]
otest_scaled <- cbind(mental_health = mh_1h_test, otest_scaled)

set.seed(101)

nn1 <- neuralnet(mental_health ~ age + mh_share + mh_coworker_discussionYes + mh_employer_discussionYes + countryUSA,
                 data = otrain_scaled,
                 hidden = c(3, 2))

plot(nn1, rep = "best")

# predict and evaluate on training data
nn1_train_pred <- predict(nn1, otrain_scaled)

# convert probabilities to Yes/No strings at threshold = 0.5
nn1_train_pred_factor <- ifelse(nn1_train_pred > 0.5, "Yes", "No")

# convert back to factors for the confusion matrix
nn1_train_pred_factor <- factor(nn1_train_pred_factor, levels = c("No", "Yes"))
nn1_train_cm <- confusionMatrix(nn1_train_pred_factor, osmi_train$mental_health, positive = "Yes")
nn1_train_cm$overall["Accuracy"]
##  Accuracy 
## 0.7152918
# predict and evaluate on testing data
nn1_test_pred <- predict(nn1, otest_scaled)
nn1_test_pred_factor <- ifelse(nn1_test_pred > 0.5, "Yes", "No")
nn1_test_pred_factor <- factor(nn1_test_pred_factor, levels = c("No", "Yes"))
nn1_test_cm <- confusionMatrix(nn1_test_pred_factor, osmi_test$mental_health, positive = "Yes")
nn1_test_cm$overall["Accuracy"]
##  Accuracy 
## 0.6612903

Review: The accuracy is essentially the same as the SVMs, with some signs of overfitting.

Experiment 5:

Objective: In this next test, I want to refocus on the business problem and restrict the model to features describing the employer and the company resources it provides.

Variations: We will test using only the variables that employers can affect: tech_company, benefits, workplace_resources, medical_coverage and mh_employer_discussion.

Evaluation: Generating the confusion matrix for accuracy.

Experiment:

set.seed(101)

nn2 <- neuralnet(
  mental_health ~ tech_companyYes + benefitsYes + workplace_resourcesYes + medical_coverageYes + mh_employer_discussionYes,
  data = otrain_scaled,
  hidden = c(3, 2))

plot(nn2, rep = "best")

# predict and evaluate on training data
nn2_train_pred <- predict(nn2, otrain_scaled)
nn2_train_pred_factor <- ifelse(nn2_train_pred > 0.5, "Yes", "No")
nn2_train_pred_factor <- factor(nn2_train_pred_factor, levels = c("No", "Yes"))
nn2_train_cm <- confusionMatrix(nn2_train_pred_factor, osmi_train$mental_health, positive = "Yes")
nn2_train_cm$overall["Accuracy"]
##  Accuracy 
## 0.6539235
# predict and evaluate on testing data
nn2_test_pred <- predict(nn2, otest_scaled)
nn2_test_pred_factor <- ifelse(nn2_test_pred > 0.5, "Yes", "No")
nn2_test_pred_factor <- factor(nn2_test_pred_factor, levels = c("No", "Yes"))
nn2_test_cm <- confusionMatrix(nn2_test_pred_factor, osmi_test$mental_health, positive = "Yes")
nn2_test_cm$overall["Accuracy"]
##  Accuracy 
## 0.6491935

Review: As expected, selecting the features of lower importance per the RF resulted in a drop in accuracy, and the sum of squared errors increased from the first model’s 89.9 to 103.4. However, the overfitting has been mitigated.
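
The SSE figures quoted above are the training errors reported by neuralnet; they can be read from each fitted object’s result matrix (a quick sketch):

# training error (SSE) reported by neuralnet for each model
nn1$result.matrix["error", 1]
nn2$result.matrix["error", 1]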

Experiment 6:

Objective: For the final experiment, the hyperparameters of the neural network will be tuned to see if accuracy can be enhanced and errors reduced.

Variations: The number of hidden layers and the number of neurons in each layer will be increased.

Evaluation: Generating the confusion matrix for accuracy.

Experiment:

set.seed(101)

nn3 <- neuralnet(
  mental_health ~ tech_companyYes + benefitsYes + workplace_resourcesYes + medical_coverageYes + mh_employer_discussionYes,
  data = otrain_scaled,
  hidden = c(5, 4, 3, 2))

plot(nn3, rep = "best")

# predict and evaluate on training data
nn3_train_pred <- predict(nn3, otrain_scaled)
nn3_train_pred_factor <- ifelse(nn3_train_pred > 0.5, "Yes", "No")
nn3_train_pred_factor <- factor(nn3_train_pred_factor, levels = c("No", "Yes"))
nn3_train_cm <- confusionMatrix(nn3_train_pred_factor, osmi_train$mental_health, positive = "Yes")
nn3_train_cm$overall["Accuracy"]
##  Accuracy 
## 0.6539235
# predict and evaluate on testing data
nn3_test_pred <- predict(nn3, otest_scaled)
nn3_test_pred_factor <- ifelse(nn3_test_pred > 0.5, "Yes", "No")
nn3_test_pred_factor <- factor(nn3_test_pred_factor, levels = c("No", "Yes"))
nn3_test_cm <- confusionMatrix(nn3_test_pred_factor, osmi_test$mental_health, positive = "Yes")
nn3_test_cm$overall["Accuracy"]
##  Accuracy 
## 0.6451613

Review: The accuracy and SSE values appear virtually unchanged.

Results, Comparison & Conclusions

ID  Model              Features                Hyperparameters                  Train  Test  Notes
1   SVM                all                     linear kernel, cost = 1          0.70   0.66  mediocre accuracy, too many support vectors
2   SVM                all                     radial kernel, cost = 1          0.74   0.68  slightly higher accuracy, overfitting
3   SVM                all                     tuned to best gamma = 0.1        0.72   0.65  more overfitting, no improvement elsewhere
4   Simple Neural Net  top 5 based on RF       hidden layers: 3 -> 2 nodes      0.72   0.66  similar results to SVM
5   Simple Neural Net  employer features only  hidden layers: 3 -> 2 nodes      0.65   0.65  less accurate, no overfitting
6   Simple Neural Net  employer features only  hidden layers: 5 -> 4 -> 3 -> 2  0.65   0.65  same accuracy and SSE

In the above machine learning experiments, all the models achieved broadly similar accuracy levels, ranging from about 65% to 74%. My conclusion is that further tuning and testing based on this dataset will likely be of little significance. Despite the differences in model complexity and hyperparameter adjustments, the similar accuracy and error rates across all tests and data subsets suggest that the structure and content of the original data itself may be limiting predictive performance.

The SVMs reached 74% on training, a decent performance after switching the kernel from linear to radial, but predictions on the testing set suggested the models tended to overfit during training. The simple neural networks restricted to employer-controlled features did mitigate the overfitting, but at the expense of overall accuracy.

The overall performance of these models implies only modest confidence in identifying patterns in the support tech workers receive for their mental health. Insights drawn from these experiments should be considered only one piece of the much larger puzzle of how companies support their employees. Continuing to improve and expand the available data on tech employees’ mental health and wellness would be the most beneficial step toward usable predictive modeling in this area.