Like many other (laid-off) software engineers, I have had concerns throughout my career about continuing in the current incarnation of the technology industry because of its effects on individuals’ mental health. There are innumerable issues generating endless discussion in traditional and new media that directly affect tech work and workers: the perceived impact of AI on job prospects, the effects of increased social media use on society, the hardware economy and geopolitics, and so on.
My focus in this analysis will be to examine possible relationships between the reporting of mental health issues by tech workers and the characteristics of the companies that employ them, including workplace culture and the resources provided. This information could be valuable in developing actionable changes by tech companies to foster more supportive and rewarding environments. Studies of data like this could help improve productivity and attract the best possible labor force; but more importantly, I believe employers and decision-makers have a moral and ethical responsibility to their employees. They must remember that their workers’ health and well-being are the foundation on which businesses are built.
Using the below survey data, my goal is to develop an explanatory model to identify the most significant factors associated with higher or lower likelihoods of mental health reporting, enabling data-driven recommendations for improving workplace support systems in the tech industry.
The data for this study comes from OSMI (Open Sourcing Mental Illness), a non-profit that aims to provide resources and raise awareness of mental health issues in the tech community. This particular dataset is a compilation of survey responses from workers in tech jobs, collected from 2017 to 2021.
The response variable will be mental_health, described as “Whether or not respondents currently have a mental health disorder” as self-reported on the surveys. To me, this variable is less an indication of the actual rates of poor mental health in employees than it is an expression of the rate of openness and awareness of those issues. In other words, the model will really be trained to predict “Yes” if employees feel safe enough to talk about sensitive personal issues in the context of their workplace, even when the survey is anonymized.
Some of the predictor variables include:

- tech_company: Whether or not the respondent’s employer is primarily a tech company/organization.
- benefits: Whether or not the employer provides mental health benefits as part of healthcare coverage.
- medical_coverage: Whether or not respondents have medical coverage (private insurance or state-provided) that includes treatment of mental health disorders.
- workplace_resources: Whether or not the employer offers resources to learn more about mental health disorders and options for seeking help.
- mh_employer_discussion: Whether or not the respondent has ever discussed their mental health with the employer.
- mh_coworker_discussion: Whether or not the respondent has ever discussed their mental health with coworkers.
- mh_share: The willingness of the respondent to share mental health illness issues with friends and family, on a scale from 0 to 10.
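The analysis relies on a handful of R packages; the code that follows assumes they are already loaded. A minimal setup block (my assumption about the exact packages, based on the functions used) would be:

# packages assumed for the rest of the analysis
library(dplyr)         # glimpse(), mutate(), na_if(), case_match(), select(); case_match() needs dplyr >= 1.1.0
library(caret)         # createDataPartition(), confusionMatrix()
library(e1071)         # svm(), tune()
library(randomForest)  # na.roughfix(), randomForest(), importance()
library(neuralnet)     # neuralnet()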
osmi_raw <- read.csv(file = "osmi.csv")
glimpse(osmi_raw)
## Rows: 1,242
## Columns: 11
## $ tech_company <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No",…
## $ benefits <chr> "No", "Yes", "I don't know", "Yes", "Yes", "Yes…
## $ workplace_resources <chr> "I don't know", "No", "No", "I don't know", "No…
## $ mh_employer_discussion <chr> "No", "No", "Yes", "Yes", "No", "No", "Yes", "N…
## $ mh_coworker_discussion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes",…
## $ medical_coverage <chr> "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes",…
## $ mental_health <chr> "Possibly", "Possibly", "Yes", "Yes", "Yes", "N…
## $ mh_share <int> 5, 4, 5, 10, 8, 3, 2, 10, 3, 9, 8, 7, 3, 10, 5,…
## $ age <dbl> 27, 31, 36, 22, 36, 38, 40, 35, 22, 28, 21, 35,…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Fe…
## $ country <chr> "United Kingdom", "United Kingdom", "United Sta…
The data will be cleaned and transformed as follows:

- some unclear or unsure responses are converted to NAs
- character columns are converted to factors with no more than 3 levels
- the country column will be converted to “USA” or “Other”, since the majority of respondents are from the US

Also, as discussed above, since the goal is binary classification on the mental_health column, the “Don’t Know” value will be converted to “No” and the “Possibly” value will be converted to “Yes” (interpreted here as comfortable enough in their workplace to answer affirmatively).
osmi <- osmi_raw

osmi <- osmi |>
  # treat unsure responses as missing; collapse country and mental_health to two levels
  mutate(benefits = na_if(benefits, "I don't know"),
         workplace_resources = na_if(workplace_resources, "I don't know"),
         country = case_match(country, "United States of America" ~ "USA", .default = "Other"),
         mental_health = case_match(mental_health, "Don't Know" ~ "No", .default = mental_health),
         mental_health = case_match(mental_health, "Possibly" ~ "Yes", .default = mental_health)) |>
  # convert remaining character columns to factors
  mutate(across(where(is.character), as.factor))

head(osmi)
The sole numeric feature of the mental health-specific columns in this set is the employees’ willingness to share mental health illness issues with friends and family. Here are the distributions by demographic data: age, gender and country. There are slightly higher frequencies of responses for workers in their 30s and with gender identity reported as ‘other’.
The binary categorical variables representing company resources and culture are visualized below. There are clear majorities in the data for companies that are in the tech industry and for employees having never discussed mental health with their employer, even though more employees overall reported “Yes” for mental health issues.
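For reference, bar charts like those described above can be produced with a block along these lines (a sketch, assuming ggplot2 and tidyr are available in addition to dplyr):

library(ggplot2)
library(tidyr)

osmi |>
  select(tech_company, benefits, workplace_resources,
         mh_employer_discussion, mh_coworker_discussion, medical_coverage) |>
  # coerce to character so columns with different factor levels pivot cleanly
  mutate(across(everything(), as.character)) |>
  pivot_longer(everything(), names_to = "variable", values_to = "response") |>
  ggplot(aes(x = response)) +
  geom_bar() +
  facet_wrap(~ variable)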
Since the dataset is relatively small with only 1242 observations, I have chosen to begin with SVMs because these models can be effective at avoiding overfitting during the training stage, which could be a concern for a decision tree or even ensembles of trees like XGBoost. Cross-validation will be applied to help further tune hyperparameters and increase accuracy and efficiency.
I have also chosen to experiment with a simple feedforward neural network because the end-goal is just a binary classification. Other types of neural networks tend to work well on larger datasets and complex problems like image processing, which could be overkill on this more focused task.
The data is split for training and testing. Both types of algorithms require the NAs to be imputed, so I applied na.roughfix, which replaces missing numeric values with medians and factor variables with the most frequent levels (breaking ties at random).
set.seed(101)
splitIndex <- createDataPartition(osmi$mental_health, p = 0.8, list = FALSE)
osmi_train <- osmi[splitIndex,]
osmi_test <- osmi[-splitIndex,]
osmi_train <- na.roughfix(osmi_train)
osmi_test <- na.roughfix(osmi_test)
# breakdown of the response variable in each dataset
round(prop.table(table(select(osmi, mental_health))), 2)
## mental_health
## No Yes
## 0.35 0.65
round(prop.table(table(select(osmi_train, mental_health))), 2)
## mental_health
## No Yes
## 0.35 0.65
round(prop.table(table(select(osmi_test, mental_health))), 2)
## mental_health
## No Yes
## 0.35 0.65
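For context when reading the accuracy figures below, the class balance itself provides a simple reference point: always predicting the majority class (“Yes”) would be correct about 65% of the time.

# majority-class ("no information") baseline accuracy
max(prop.table(table(osmi$mental_health)))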
A simple experiment log is initiated first for tracking.
experiment_log <- data.frame(
ID = integer(),
Model = character(),
Features = character(),
Hyperparameters = character(),
Train = numeric(),
Test = numeric(),
Notes = character(),
stringsAsFactors = FALSE
)
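The log is appended to after each run; as a minimal sketch (illustrative only, using the first experiment’s results from the summary table at the end), an entry could be recorded like this:

# illustrative example: record experiment 1 in the log
experiment_log <- rbind(experiment_log, data.frame(
  ID = 1,
  Model = "SVM",
  Features = "all",
  Hyperparameters = "cost = 1",
  Train = 0.70,
  Test = 0.66,
  Notes = "mediocre accuracy, too many support vectors",
  stringsAsFactors = FALSE
))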
Objective: Since SVMs generalize well on smaller datasets like this one, this model can serve as a baseline for accuracy comparisons.
Variations: This first model will use a linear kernel with the default cost (hardness/softness of margin) of 1. The model has built-in scaling for the numeric columns to balance the features’ contributions in determining the hyperplane.
Evaluation: Generating the confusion matrix for accuracy.
Experiment:
set.seed(101)
# vector of numeric columns for scaling
num_cols <- sapply(osmi_train, is.numeric)
svm1 <- svm(mental_health ~.,
data = osmi_train,
scale = num_cols,
kernel = "linear")
print(svm1)
##
## Call:
## svm(formula = mental_health ~ ., data = osmi_train, kernel = "linear",
## scale = num_cols)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 662
# predict and evaluate on training data
svm1_train_pred <- predict(svm1, osmi_train)
svm1_train_cm <- confusionMatrix(svm1_train_pred, osmi_train$mental_health)
svm1_train_cm$overall["Accuracy"]
## Accuracy
## 0.7022133
# predict and evaluate on testing data
svm1_test_pred <- predict(svm1, osmi_test)
svm1_test_cm <- confusionMatrix(svm1_test_pred, osmi_test$mental_health)
svm1_test_cm$overall["Accuracy"]
## Accuracy
## 0.6572581
Review: The accuracy rates for training and testing indicate decent performance, but improvement on this metric would be ideal. There are also 662 support vectors in this model against 994 observations in the training set, implying overfitting or noisy data.
Objective: To see if we can improve on the accuracy and performance, the kernel will be updated.
Variations: The kernel will be changed to the nonlinear RBF (Radial Basis Function), which is more flexible and better able to capture nonlinear patterns; this may be helpful since the dataset is on the smaller side.
Evaluation: Generating the confusion matrix for accuracy.
Experiment:
set.seed(101)
svm2 <- svm(mental_health ~.,
data = osmi_train,
scale = num_cols,
kernel = "radial")
print(svm2)
##
## Call:
## svm(formula = mental_health ~ ., data = osmi_train, kernel = "radial",
## scale = num_cols)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 684
# predict and evaluate on training data
svm2_train_pred <- predict(svm2, osmi_train)
svm2_train_cm <- confusionMatrix(svm2_train_pred, osmi_train$mental_health)
svm2_train_cm$overall["Accuracy"]
## Accuracy
## 0.7354125
# predict and evaluate on testing data
svm2_test_pred <- predict(svm2, osmi_test)
svm2_test_cm <- confusionMatrix(svm2_test_pred, osmi_test$mental_health)
svm2_test_cm$overall["Accuracy"]
## Accuracy
## 0.6814516
Review: The accuracy improved slightly, but with a larger gap between training and testing. The number of support vectors also increased, so the model is still fitting noise.
Objective: 10-fold cross-validation will be applied with different, commonly used cost and gamma values to determine the best-performing hyperparameters for a final SVM test.
Variations: Based on the above, the cost may be changed to 0.01, 0.1, 1 (same), or 10. The best gamma will be chosen from 0.001, 0.025 (default), 0.1, or 1.
Evaluation: Generating the confusion matrix for accuracy.
Experiment:
set.seed(101)
tune_mod <- tune(svm,
mental_health ~.,
data = osmi_train,
kernel = "radial",
ranges = list(cost = c(0.01, 0.1, 1, 10), gamma = c(0.001, 0.025, 0.1, 1)))
print(tune_mod)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 0.1
##
## - best performance: 0.3058283
best_mod <- tune_mod$best.model
print(best_mod)
##
## Call:
## best.tune(METHOD = svm, train.x = mental_health ~ ., data = osmi_train,
## ranges = list(cost = c(0.01, 0.1, 1, 10), gamma = c(0.001, 0.025,
## 0.1, 1)), kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 676
# predict and evaluate on training data
best_train_pred <- predict(best_mod, osmi_train)
best_train_cm <- confusionMatrix(best_train_pred, osmi_train$mental_health)
best_train_cm$overall["Accuracy"]
## Accuracy
## 0.7193159
# predict and evaluate on testing data
best_test_pred <- predict(best_mod, osmi_test)
best_test_cm <- confusionMatrix(best_test_pred, osmi_test$mental_health)
best_test_cm$overall["Accuracy"]
## Accuracy
## 0.6491935
Review: The hyperparameter tuning determined that the best model raised the gamma to 0.1, which reduced the effect of points further from the decision boundary; as expected, this resulted in the model overfitting on training data.
The number of support vectors remains too high. Overall, I would say the performance of SVMs on this small dataset has been merely passable.
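A quick way to compare how heavily each fit relies on support vectors, relative to the 994 training observations, is the tot.nSV component of e1071’s fitted models (a small sketch):

# fraction of training observations used as support vectors by each SVM
c(linear = svm1$tot.nSV, radial = svm2$tot.nSV, tuned = best_mod$tot.nSV) / nrow(osmi_train)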
Objective: We will test whether a simple neural network can predict more effectively and accurately on this dataset than SVM.
Variations:

- In a neural network, features must be hand-selected for the formula. A simple randomForest model can quickly rank importance using MeanDecreaseGini (a.k.a. the reduction in impurity), and the top 5 features will be selected.
- Factor columns must be converted to dummy variables (one-hot encoded), then converted back for interpretation.
- The first neural network will have 2 hidden layers, the first layer with 3 neurons and the second with 2 (arbitrarily chosen / default values).
Evaluation: Generating the confusion matrix for accuracy.
Experiment:
set.seed(101)
rf <- randomForest(mental_health ~ ., data = osmi_train)
importance(rf)
## MeanDecreaseGini
## tech_company 12.410531
## benefits 9.606341
## workplace_resources 12.884414
## mh_employer_discussion 21.054504
## mh_coworker_discussion 34.408542
## medical_coverage 5.990652
## mh_share 59.006817
## age 83.580049
## gender 18.292597
## country 21.333865
# scale and one-hot encode
otrain_scaled <- osmi_train
otrain_scaled[num_cols] <- scale(otrain_scaled[num_cols])
mh_1h_train <- ifelse(otrain_scaled$mental_health == "Yes", 1, 0)
otrain_scaled <- model.matrix(mental_health ~ ., data = otrain_scaled)[, -1]
otrain_scaled <- cbind(mental_health = mh_1h_train, otrain_scaled)
otest_scaled <- osmi_test
otest_scaled[num_cols] <- scale(otest_scaled[num_cols])
mh_1h_test <- ifelse(otest_scaled$mental_health == "Yes", 1, 0)
otest_scaled <- model.matrix(mental_health ~ ., data = otest_scaled)[, -1]
otest_scaled <- cbind(mental_health = mh_1h_test, otest_scaled)
set.seed(101)
nn1 <- neuralnet(mental_health ~ age + mh_share + mh_coworker_discussionYes + mh_employer_discussionYes + countryUSA,
data = otrain_scaled,
hidden = c(3, 2))
plot(nn1, rep = "best")
# predict and evaluate on training data
nn1_train_pred <- predict(nn1, otrain_scaled)
# convert probabilities to Yes/No strings at threshold = 0.5
nn1_train_pred_factor <- ifelse(nn1_train_pred > 0.5, "Yes", "No")
# convert back to factors for the confusion matrix
nn1_train_pred_factor <- factor(nn1_train_pred_factor, levels = c("No", "Yes"))
nn1_train_cm <- confusionMatrix(nn1_train_pred_factor, osmi_train$mental_health, positive = "Yes")
nn1_train_cm$overall["Accuracy"]
## Accuracy
## 0.7152918
# predict and evaluate on testing data
nn1_test_pred <- predict(nn1, otest_scaled)
nn1_test_pred_factor <- ifelse(nn1_test_pred > 0.5, "Yes", "No")
nn1_test_pred_factor <- factor(nn1_test_pred_factor, levels = c("No", "Yes"))
nn1_test_cm <- confusionMatrix(nn1_test_pred_factor, osmi_test$mental_health, positive = "Yes")
nn1_test_cm$overall["Accuracy"]
## Accuracy
## 0.6612903
Review: The accuracy is essentially the same as the SVMs, with some signs of overfitting.
Objective: In this next test, I want to refocus on the business problem and select only tech_company and the features dealing solely with company resource data.
Variations: We will test using only the variables that employers can affect: tech_company, benefits, workplace_resources, medical_coverage and mh_employer_discussion.
Evaluation: Generating the confusion matrix for accuracy.
set.seed(101)
nn2 <- neuralnet(
mental_health ~ tech_companyYes + benefitsYes + workplace_resourcesYes + medical_coverageYes + mh_employer_discussionYes,
data = otrain_scaled,
hidden = c(3, 2))
plot(nn2, rep = "best")
# predict and evaluate on training data
nn2_train_pred <- predict(nn2, otrain_scaled)
nn2_train_pred_factor <- ifelse(nn2_train_pred > 0.5, "Yes", "No")
nn2_train_pred_factor <- factor(nn2_train_pred_factor, levels = c("No", "Yes"))
nn2_train_cm <- confusionMatrix(nn2_train_pred_factor, osmi_train$mental_health, positive = "Yes")
nn2_train_cm$overall["Accuracy"]
## Accuracy
## 0.6539235
# predict and evaluate on testing data
nn2_test_pred <- predict(nn2, otest_scaled)
nn2_test_pred_factor <- ifelse(nn2_test_pred > 0.5, "Yes", "No")
nn2_test_pred_factor <- factor(nn2_test_pred_factor, levels = c("No", "Yes"))
nn2_test_cm <- confusionMatrix(nn2_test_pred_factor, osmi_test$mental_health, positive = "Yes")
nn2_test_cm$overall["Accuracy"]
## Accuracy
## 0.6491935
Review: As expected, selecting the features of lower importance per the random forest ranking resulted in a drop in accuracy, and the sum of squared errors increased from the first model’s 89.9 to 103.4. However, the overfitting has been mitigated.
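Those error values are read off the neuralnet plots; they can also be pulled directly from the fitted objects (a sketch, assuming the default error function was used):

# training error reported by neuralnet for each fitted network
nn1$result.matrix["error", 1]
nn2$result.matrix["error", 1]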
Objective: For the final experiment, the hyperparameters of the neural network will be tuned to see if accuracy can be enhanced and errors reduced.
Variations: The number of hidden layers and the number of neurons in each layer will be increased.
Evaluation: Generating the confusion matrix for accuracy.
Experiment:
set.seed(101)
nn3 <- neuralnet(
mental_health ~ tech_companyYes + benefitsYes + workplace_resourcesYes + medical_coverageYes + mh_employer_discussionYes,
data = otrain_scaled,
hidden = c(5, 4, 3, 2))
plot(nn3, rep = "best")
# predict and evaluate on training data
nn3_train_pred <- predict(nn3, otrain_scaled)
nn3_train_pred_factor <- ifelse(nn3_train_pred > 0.5, "Yes", "No")
nn3_train_pred_factor <- factor(nn3_train_pred_factor, levels = c("No", "Yes"))
nn3_train_cm <- confusionMatrix(nn3_train_pred_factor, osmi_train$mental_health, positive = "Yes")
nn3_train_cm$overall["Accuracy"]
## Accuracy
## 0.6539235
# predict and evaluate on testing data
nn3_test_pred <- predict(nn3, otest_scaled)
nn3_test_pred_factor <- ifelse(nn3_test_pred > 0.5, "Yes", "No")
nn3_test_pred_factor <- factor(nn3_test_pred_factor, levels = c("No", "Yes"))
nn3_test_cm <- confusionMatrix(nn3_test_pred_factor, osmi_test$mental_health, positive = "Yes")
nn3_test_cm$overall["Accuracy"]
## Accuracy
## 0.6451613
Review: The accuracy and SSE values appear virtually unchanged.
ID | Model | Features | Hyperparameters | Train | Test | Notes |
---|---|---|---|---|---|---|
1 | SVM | all | cost = 1 | 0.70 | 0.66 | mediocre accuracy, too many support vectors |
2 | SVM | all | kernel = radial | 0.74 | 0.68 | slightly higher accuracy, overfitting |
3 | SVM | all | tuned to best gamma 0.1 | 0.72 | 0.65 | more overfitting, no improvement elsewhere |
4 | Simple Neural Net | Top 5 based on RF | hidden layers: 3 nodes -> 2 nodes | 0.72 | 0.66 | similar results to SVM |
5 | Simple Neural Net | Employer features only | same number of layer/nodes | 0.65 | 0.65 | less accurate, no overfitting |
6 | Simple Neural Net | Employer features only | hidden layers: 5 nodes -> 4 -> 3 -> 2 | 0.65 | 0.65 | same accuracy and SSE |
In the above machine learning experiments, all the models achieved fairly similar accuracy levels, ranging from about 65% to 74%. My conclusion is that any further tuning and testing on this dataset would likely be of little significance. Despite the differences in model complexity and hyperparameter adjustments, the similar accuracy and error rates across all tests and data subsets suggest that the structure and content of the original data itself may be limiting predictive performance.
The SVMs reached 74% accuracy on training data, a decent performance after changing the kernel from linear to radial, but predictions on the test set suggested the model tended to overfit during training. The simple neural networks mitigated that overfitting once the features were restricted to the employer-controlled variables, but at the expense of overall accuracy.
The overall performance of our models implies only modest predictive confidence in identifying patterns in the support for mental health of tech workers. Insights drawn from these experiments should be considered as only one piece in the much larger puzzle of companies’ support for their employees. Continuing to improve and expand on the available data regarding tech employees’ mental health and wellness would be the most beneficial step toward usable predictive modeling in this area.