Introduction

This analysis is based on the Mental Health in Tech Survey dataset, publicly available on Kaggle. Collected by Open Sourcing Mental Illness (OSMI), the dataset captures responses from individuals working in the technology sector about their experiences, attitudes, and perceptions related to mental health in the workplace. The survey includes a range of variables such as demographic information, employment details, mental health history, and workplace support structures.

The goal of this analysis is to explore factors associated with seeking mental health treatment among tech workers. Using descriptive statistics and confirmatory factor analysis (CFA), we examine how individual characteristics and organizational supports relate to treatment-seeking behavior. This helps identify key predictors and potential areas for improving mental health support in tech workplaces.

Load Packages

library(readr) # read CSV data
library(tm) # text cleaning
library(compareGroups) # descriptive tables
library(SnowballC) # text stemming
library(wordcloud) # text visualization
library(dplyr) # data manipulation
library(stringr) # wrappers for string operations
library(corrplot) # correlation matrix
library(psych) # factor analysis and visualization
library(lavaan) # confirmatory factor analysis
library(mice) # data imputation
library(ggplot2) # plotting
library(tidyr) # data reshaping

Import Data

data <- read_csv("survey.csv")
head(data)
## # A tibble: 6 × 27
##   Timestamp             Age Gender Country    state self_employed family_history
##   <dttm>              <dbl> <chr>  <chr>      <chr> <chr>         <chr>         
## 1 2014-08-27 11:29:31    37 Female United St… IL    <NA>          No            
## 2 2014-08-27 11:29:37    44 M      United St… IN    <NA>          No            
## 3 2014-08-27 11:29:44    32 Male   Canada     <NA>  <NA>          No            
## 4 2014-08-27 11:29:46    31 Male   United Ki… <NA>  <NA>          Yes           
## 5 2014-08-27 11:30:22    31 Male   United St… TX    <NA>          No            
## 6 2014-08-27 11:31:22    33 Male   United St… TN    <NA>          Yes           
## # ℹ 20 more variables: treatment <chr>, work_interfere <chr>,
## #   no_employees <chr>, remote_work <chr>, tech_company <chr>, benefits <chr>,
## #   care_options <chr>, wellness_program <chr>, seek_help <chr>,
## #   anonymity <chr>, leave <chr>, mental_health_consequence <chr>,
## #   phys_health_consequence <chr>, coworkers <chr>, supervisor <chr>,
## #   mental_health_interview <chr>, phys_health_interview <chr>,
## #   mental_vs_physical <chr>, obs_consequence <chr>, comments <chr>
#check missing values
naniar::vis_miss(data)

Only four columns contain missing values: state, self_employed, work_interfere, and comments.
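
The visual check can be complemented with per-column counts of missing values. The following is a minimal sketch, assuming a small made-up data frame `toy` in place of the survey data; the column values are illustrative only.

```r
# Toy stand-in for the survey data (values are made up for illustration)
toy <- data.frame(
  state = c("IL", NA, "TX", NA),
  self_employed = c(NA, "No", "Yes", "No"),
  treatment = c("Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Names of the columns with at least one missing value
cols_with_na <- names(na_counts[na_counts > 0])

print(na_counts)
print(cols_with_na)
```

Applied to the real data, the same two lines would use `data` instead of `toy`.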

Step 1: Data Exploration

Initial descriptive table to explore data categories

desc <- compareGroups(~ ., data = data, method = 4, max.ylev = 12, max.xlev = 20, chisq.test.perm = TRUE, byrow = FALSE)
## Warning in compareGroups.fit(X = X, y = y, include.label = include.label, :
## Variables 'Gender', 'Country', 'state', 'comments' have been removed since some
## errors occurred
desctab <- createTable(desc, type = 2, show.n = FALSE, show.p.mul = FALSE, show.all = TRUE, show.p.overall = TRUE)
desctab
## 
## --------Summary descriptives table ---------
## 
## _____________________________________________________________ 
##                                          [ALL]                
##                                          N=1259               
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## Timestamp                  1409193000 [1409149616;1409269008] 
## Age                                 31.0 [27.0;36.0]          
## self_employed:                                                
##     No                                1095 (88.2%)            
##     Yes                               146 (11.8%)             
## family_history:                                               
##     No                                767 (60.9%)             
##     Yes                               492 (39.1%)             
## treatment:                                                    
##     No                                622 (49.4%)             
##     Yes                               637 (50.6%)             
## work_interfere:                                               
##     Never                             213 (21.4%)             
##     Often                             144 (14.5%)             
##     Rarely                            173 (17.4%)             
##     Sometimes                         465 (46.7%)             
## no_employees:                                                 
##     1-5                               162 (12.9%)             
##     100-500                           176 (14.0%)             
##     26-100                            289 (23.0%)             
##     500-1000                           60 (4.77%)             
##     6-25                              290 (23.0%)             
##     More than 1000                    282 (22.4%)             
## remote_work:                                                  
##     No                                883 (70.1%)             
##     Yes                               376 (29.9%)             
## tech_company:                                                 
##     No                                228 (18.1%)             
##     Yes                               1031 (81.9%)            
## benefits:                                                     
##     Don't know                        408 (32.4%)             
##     No                                374 (29.7%)             
##     Yes                               477 (37.9%)             
## care_options:                                                 
##     No                                501 (39.8%)             
##     Not sure                          314 (24.9%)             
##     Yes                               444 (35.3%)             
## wellness_program:                                             
##     Don't know                        188 (14.9%)             
##     No                                842 (66.9%)             
##     Yes                               229 (18.2%)             
## seek_help:                                                    
##     Don't know                        363 (28.8%)             
##     No                                646 (51.3%)             
##     Yes                               250 (19.9%)             
## anonymity:                                                    
##     Don't know                        819 (65.1%)             
##     No                                 65 (5.16%)             
##     Yes                               375 (29.8%)             
## leave:                                                        
##     Don't know                        563 (44.7%)             
##     Somewhat difficult                126 (10.0%)             
##     Somewhat easy                     266 (21.1%)             
##     Very difficult                     98 (7.78%)             
##     Very easy                         206 (16.4%)             
## mental_health_consequence:                                    
##     Maybe                             477 (37.9%)             
##     No                                490 (38.9%)             
##     Yes                               292 (23.2%)             
## phys_health_consequence:                                      
##     Maybe                             273 (21.7%)             
##     No                                925 (73.5%)             
##     Yes                                61 (4.85%)             
## coworkers:                                                    
##     No                                260 (20.7%)             
##     Some of them                      774 (61.5%)             
##     Yes                               225 (17.9%)             
## supervisor:                                                   
##     No                                393 (31.2%)             
##     Some of them                      350 (27.8%)             
##     Yes                               516 (41.0%)             
## mental_health_interview:                                      
##     Maybe                             207 (16.4%)             
##     No                                1008 (80.1%)            
##     Yes                                44 (3.49%)             
## phys_health_interview:                                        
##     Maybe                             557 (44.2%)             
##     No                                500 (39.7%)             
##     Yes                               202 (16.0%)             
## mental_vs_physical:                                           
##     Don't know                        576 (45.8%)             
##     No                                340 (27.0%)             
##     Yes                               343 (27.2%)             
## obs_consequence:                                              
##     No                                1075 (85.4%)            
##     Yes                               184 (14.6%)             
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

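The warning reports that Gender, Country, state, and comments were dropped from the table. A likely cause is that each has more distinct values than the max.xlev = 20 limit set in the call; in the frequency tables that follow, Gender and Country have 47 and 48 distinct values respectively. Let’s inspect the Gender and Country variables.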
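
One way to see which variables are likely to be dropped is to count distinct values per column and compare against the level limit. This is a minimal sketch on a made-up data frame; `toy` and the threshold `max_levels` are hypothetical stand-ins for the survey data and for max.xlev = 20.

```r
# Toy stand-in for the survey data (values are made up for illustration)
toy <- data.frame(
  Gender = c("Male", "male", "M", "Female", "f", "Woman"),
  treatment = c("Yes", "No", "No", "Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Hypothetical threshold standing in for max.xlev = 20
max_levels <- 3

# Distinct non-missing values in each column
n_levels <- sapply(toy, function(x) length(unique(x[!is.na(x)])))

# Columns whose number of distinct values exceeds the threshold
too_many <- names(n_levels[n_levels > max_levels])

print(n_levels)
print(too_many)
```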

Gender preparation

as.data.frame(table(data$Gender))
##                                              Var1 Freq
## 1                              A little about you    1
## 2                                         Agender    1
## 3                                             All    1
## 4                                       Androgyne    1
## 5                                cis-female/femme    1
## 6                                      Cis Female    1
## 7                                        cis male    1
## 8                                        Cis Male    2
## 9                                         Cis Man    1
## 10                                           Enby    1
## 11                                              f   15
## 12                                              F   38
## 13                                         femail    1
## 14                                         Femake    1
## 15                                         female   62
## 16                                         Female  123
## 17                                   Female (cis)    1
## 18                                 Female (trans)    2
## 19                                          fluid    1
## 20                                    Genderqueer    1
## 21                                 Guy (-ish) ^_^    1
## 22                                              m   34
## 23                                              M  116
## 24                                           Mail    1
## 25                                          maile    1
## 26                                           Make    4
## 27                                            Mal    1
## 28                                           male  206
## 29                                           Male  618
## 30                                       Male-ish    1
## 31                                     Male (CIS)    1
## 32                       male leaning androgynous    1
## 33                                           Malr    1
## 34                                            Man    2
## 35                                           msle    1
## 36                                            Nah    1
## 37                                         Neuter    1
## 38                                     non-binary    1
## 39 ostensibly male, unsure what that really means    1
## 40                                              p    1
## 41                                          queer    1
## 42                                 queer/she/they    1
## 43                          something kinda male?    1
## 44                                   Trans-female    1
## 45                                    Trans woman    1
## 46                                          woman    1
## 47                                          Woman    3

The Gender variable is free text, so it needs cleaning and consolidation into a small set of well-defined groups.

data <- data %>%
  mutate(
    # Normalize case and whitespace before pattern matching
    Gender = tolower(Gender),
    Gender = str_trim(Gender),
    # case_when() applies the first rule that matches, so rule order matters
    Gender = case_when(
      # "^" anchors at the start: any value beginning with one of these alternatives matches
      str_detect(Gender, "^(f|female|cis[-\\s]?female|femail|femake|woman|femme)") ~ "Female",
      str_detect(Gender, "^(m|male|cis[-\\s]?male|man|make|mail|maile|malr|mal|msle)") ~ "Male",
      str_detect(Gender, "trans|androgyne|non[-\\s]?binary|genderqueer|enby|fluid|agender|neuter|queer|androgynous") ~ "Non-binary / Gender diverse",
      str_detect(Gender, "nah|ostensibly|something|unsure|a little about you|^p$") ~ "Other / Ambiguous",
      # Everything not matched above
      TRUE ~ "Other / Ambiguous"
    )
  )
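
Because the first two patterns are anchored only at the start, they match any response that merely begins with "f" or "m". The sketch below checks a few lowercased responses from the Gender frequency table against the patterns used above. It uses base R `grepl()` with `perl = TRUE` as a stand-in for `str_detect()`, and `strict_female` is a hypothetical stricter rule shown only for comparison, not the author's method.

```r
# Patterns copied from the case_when() rules above
female_pattern <- "^(f|female|cis[-\\s]?female|femail|femake|woman|femme)"
male_pattern <- "^(m|male|cis[-\\s]?male|man|make|mail|maile|malr|mal|msle)"

# Lowercased responses taken from the Gender frequency table
responses <- c("female", "fluid", "female (trans)",
               "male leaning androgynous", "cis man")

# Base R grepl() with perl = TRUE stands in for stringr::str_detect()
matches_female <- grepl(female_pattern, responses, perl = TRUE)
matches_male <- grepl(male_pattern, responses, perl = TRUE)

# Hypothetical stricter rule (an assumption): whole-string match only
strict_female <- "^(f|female|woman)$"
matches_strict <- grepl(strict_female, responses, perl = TRUE)

print(data.frame(responses, matches_female, matches_male, matches_strict))
```

With the original rules, "fluid" and "female (trans)" appear to fall into Female, "male leaning androgynous" into Male, and "cis man" into Other / Ambiguous. Tightening the rules would change the group counts reported in the tables, so this is a diagnostic rather than a drop-in replacement.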

Country preparation

as.data.frame(table(data$Country))
##                      Var1 Freq
## 1               Australia   21
## 2                 Austria    3
## 3            Bahamas, The    1
## 4                 Belgium    6
## 5  Bosnia and Herzegovina    1
## 6                  Brazil    6
## 7                Bulgaria    4
## 8                  Canada   72
## 9                   China    1
## 10               Colombia    2
## 11             Costa Rica    1
## 12                Croatia    2
## 13         Czech Republic    1
## 14                Denmark    2
## 15                Finland    3
## 16                 France   13
## 17                Georgia    1
## 18                Germany   45
## 19                 Greece    2
## 20                Hungary    1
## 21                  India   10
## 22                Ireland   27
## 23                 Israel    5
## 24                  Italy    7
## 25                  Japan    1
## 26                 Latvia    1
## 27                 Mexico    3
## 28                Moldova    1
## 29            Netherlands   27
## 30            New Zealand    8
## 31                Nigeria    1
## 32                 Norway    1
## 33            Philippines    1
## 34                 Poland    7
## 35               Portugal    2
## 36                Romania    1
## 37                 Russia    3
## 38              Singapore    4
## 39               Slovenia    1
## 40           South Africa    6
## 41                  Spain    1
## 42                 Sweden    7
## 43            Switzerland    7
## 44               Thailand    1
## 45         United Kingdom  185
## 46          United States  751
## 47                Uruguay    1
## 48               Zimbabwe    1

Many countries have very few respondents, so we combine countries into regions.

data <- data %>%
  mutate(
    Region = case_when(
      Country %in% c("United States", "Canada", "Mexico") ~ "North America",
      Country %in% c("Bahamas, The", "Brazil", "Colombia", "Costa Rica", "Uruguay") ~ "Central/South America",
      Country %in% c(
        "Austria", "Belgium", "Bosnia and Herzegovina", "Bulgaria", "Croatia",
        "Czech Republic", "Denmark", "Finland", "France", "Georgia", "Germany",
        "Greece", "Hungary", "Ireland", "Italy", "Latvia", "Moldova",
        "Netherlands", "Norway", "Poland", "Portugal", "Romania", "Russia",
        "Slovenia", "Spain", "Sweden", "Switzerland", "United Kingdom"
      ) ~ "Europe",
      Country %in% c("India", "China", "Japan", "Israel", "Singapore", "Thailand", "Philippines") ~ "Asia",
      Country %in% c("Nigeria", "South Africa", "Zimbabwe") ~ "Africa",
      Country %in% c("Australia", "New Zealand") ~ "Oceania",
      TRUE ~ "Other / Unknown"
    )
  )
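
A quick sanity check after this kind of recoding is to list any country that fell through to the fallback category. The following is a minimal sketch using a named lookup vector in base R rather than `case_when()`; `region_lookup` is a shortened, illustrative mapping and "Atlantis" is a made-up value included to trigger the fallback.

```r
# Illustrative input; "Atlantis" is a made-up value with no mapping
countries <- c("United States", "Canada", "Germany", "India", "Atlantis")

# Shortened lookup table (an assumption; the full mapping is in the code above)
region_lookup <- c(
  "United States" = "North America",
  "Canada" = "North America",
  "Germany" = "Europe",
  "India" = "Asia"
)

# Look up each country; names missing from the table come back as NA
region <- unname(region_lookup[countries])
region[is.na(region)] <- "Other / Unknown"

# Countries that were not covered by the mapping
unmapped <- unique(countries[region == "Other / Unknown"])

print(table(region))
print(unmapped)
```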

Let’s redo the descriptive table and inspect the changes. Gender now appears, and Region stands in for Country, which is still dropped.

desc <- compareGroups(~ ., data = data, method = 4, max.ylev = 12, max.xlev = 20, chisq.test.perm = TRUE, byrow = FALSE)
## Warning in compareGroups.fit(X = X, y = y, include.label = include.label, :
## Variables 'Country', 'state', 'comments' have been removed since some errors
## occurred
desctab <- createTable(desc, type = 2, show.n = FALSE, show.p.mul = FALSE, show.all = TRUE, show.p.overall = TRUE)
desctab
## 
## --------Summary descriptives table ---------
## 
## __________________________________________________________________ 
##                                               [ALL]                
##                                               N=1259               
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## Timestamp                       1409193000 [1409149616;1409269008] 
## Age                                      31.0 [27.0;36.0]          
## Gender:                                                            
##     Female                                 250 (19.9%)             
##     Male                                   991 (78.7%)             
##     Non-binary / Gender diverse             10 (0.79%)             
##     Other / Ambiguous                       8 (0.64%)              
## self_employed:                                                     
##     No                                     1095 (88.2%)            
##     Yes                                    146 (11.8%)             
## family_history:                                                    
##     No                                     767 (60.9%)             
##     Yes                                    492 (39.1%)             
## treatment:                                                         
##     No                                     622 (49.4%)             
##     Yes                                    637 (50.6%)             
## work_interfere:                                                    
##     Never                                  213 (21.4%)             
##     Often                                  144 (14.5%)             
##     Rarely                                 173 (17.4%)             
##     Sometimes                              465 (46.7%)             
## no_employees:                                                      
##     1-5                                    162 (12.9%)             
##     100-500                                176 (14.0%)             
##     26-100                                 289 (23.0%)             
##     500-1000                                60 (4.77%)             
##     6-25                                   290 (23.0%)             
##     More than 1000                         282 (22.4%)             
## remote_work:                                                       
##     No                                     883 (70.1%)             
##     Yes                                    376 (29.9%)             
## tech_company:                                                      
##     No                                     228 (18.1%)             
##     Yes                                    1031 (81.9%)            
## benefits:                                                          
##     Don't know                             408 (32.4%)             
##     No                                     374 (29.7%)             
##     Yes                                    477 (37.9%)             
## care_options:                                                      
##     No                                     501 (39.8%)             
##     Not sure                               314 (24.9%)             
##     Yes                                    444 (35.3%)             
## wellness_program:                                                  
##     Don't know                             188 (14.9%)             
##     No                                     842 (66.9%)             
##     Yes                                    229 (18.2%)             
## seek_help:                                                         
##     Don't know                             363 (28.8%)             
##     No                                     646 (51.3%)             
##     Yes                                    250 (19.9%)             
## anonymity:                                                         
##     Don't know                             819 (65.1%)             
##     No                                      65 (5.16%)             
##     Yes                                    375 (29.8%)             
## leave:                                                             
##     Don't know                             563 (44.7%)             
##     Somewhat difficult                     126 (10.0%)             
##     Somewhat easy                          266 (21.1%)             
##     Very difficult                          98 (7.78%)             
##     Very easy                              206 (16.4%)             
## mental_health_consequence:                                         
##     Maybe                                  477 (37.9%)             
##     No                                     490 (38.9%)             
##     Yes                                    292 (23.2%)             
## phys_health_consequence:                                           
##     Maybe                                  273 (21.7%)             
##     No                                     925 (73.5%)             
##     Yes                                     61 (4.85%)             
## coworkers:                                                         
##     No                                     260 (20.7%)             
##     Some of them                           774 (61.5%)             
##     Yes                                    225 (17.9%)             
## supervisor:                                                        
##     No                                     393 (31.2%)             
##     Some of them                           350 (27.8%)             
##     Yes                                    516 (41.0%)             
## mental_health_interview:                                           
##     Maybe                                  207 (16.4%)             
##     No                                     1008 (80.1%)            
##     Yes                                     44 (3.49%)             
## phys_health_interview:                                             
##     Maybe                                  557 (44.2%)             
##     No                                     500 (39.7%)             
##     Yes                                    202 (16.0%)             
## mental_vs_physical:                                                
##     Don't know                             576 (45.8%)             
##     No                                     340 (27.0%)             
##     Yes                                    343 (27.2%)             
## obs_consequence:                                                   
##     No                                     1075 (85.4%)            
##     Yes                                    184 (14.6%)             
## Region:                                                            
##     Africa                                  8 (0.64%)              
##     Asia                                    23 (1.83%)             
##     Central/South America                   11 (0.87%)             
##     Europe                                 362 (28.8%)             
##     North America                          826 (65.6%)             
##     Oceania                                 29 (2.30%)             
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Descriptive table stratified by the ‘treatment’ variable
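
The stratified call produces several "Chi-squared approximation may be incorrect" warnings, which R raises when some expected cell counts are small. The sketch below computes expected counts by hand for the Gender by treatment counts reported in the stratified table in this section; the object names are illustrative.

```r
# Observed counts from the stratified table (rows: Gender, columns: treatment)
gender_by_treatment <- matrix(
  c( 77, 173,
    542, 449,
      2,   8,
      1,   7),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male", "Non-binary / Gender diverse", "Other / Ambiguous"),
    treatment = c("No", "Yes")
  )
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(gender_by_treatment), colSums(gender_by_treatment)) /
  sum(gender_by_treatment)

# Number of cells with an expected count below 5
small_cells <- sum(expected < 5)

print(round(expected, 2))
print(small_cells)
```

The two smallest gender groups account for the low expected counts here, so p-values for sparsely populated categories should be read with caution.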

# data[-1] drops the first column (Timestamp) before stratifying
desc <- compareGroups(treatment ~ ., data = data[-1], method = 4, max.ylev = 12, max.xlev = 20, chisq.test.perm = TRUE, byrow = TRUE)
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in compareGroups.fit(X = X, y = y, include.label = include.label, :
## Variables 'Country', 'state', 'comments' have been removed since some errors
## occurred
desctab <- createTable(desc, type = 2, show.n = FALSE, show.p.mul = FALSE, show.all = TRUE, show.p.overall = TRUE)
desctab
## 
## --------Summary descriptives table by 'treatment'---------
## 
## ____________________________________________________________________________________________ 
##                                      [ALL]              No              Yes        p.overall 
##                                      N=1259           N=622            N=637                 
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## Age                             31.0 [27.0;36.0] 31.0 [27.0;35.0] 32.0 [27.0;37.0]   0.011   
## Gender:                                                                             <0.001   
##     Female                        250 (19.9%)       77 (30.8%)      173 (69.2%)              
##     Male                          991 (78.7%)      542 (54.7%)      449 (45.3%)              
##     Non-binary / Gender diverse    10 (0.79%)       2 (20.0%)        8 (80.0%)               
##     Other / Ambiguous              8 (0.64%)        1 (12.5%)        7 (87.5%)               
## self_employed:                                                                       0.524   
##     No                            1095 (88.2%)     545 (49.8%)      550 (50.2%)              
##     Yes                           146 (11.8%)       68 (46.6%)       78 (53.4%)              
## family_history:                                                                     <0.001   
##     No                            767 (60.9%)      495 (64.5%)      272 (35.5%)              
##     Yes                           492 (39.1%)      127 (25.8%)      365 (74.2%)              
## work_interfere:                                                                     <0.001   
##     Never                         213 (21.4%)      183 (85.9%)       30 (14.1%)              
##     Often                         144 (14.5%)       21 (14.6%)      123 (85.4%)              
##     Rarely                        173 (17.4%)       51 (29.5%)      122 (70.5%)              
##     Sometimes                     465 (46.7%)      107 (23.0%)      358 (77.0%)              
## no_employees:                                                                        0.119   
##     1-5                           162 (12.9%)       71 (43.8%)       91 (56.2%)              
##     100-500                       176 (14.0%)       81 (46.0%)       95 (54.0%)              
##     26-100                        289 (23.0%)      139 (48.1%)      150 (51.9%)              
##     500-1000                       60 (4.77%)       33 (55.0%)       27 (45.0%)              
##     6-25                          290 (23.0%)      162 (55.9%)      128 (44.1%)              
##     More than 1000                282 (22.4%)      136 (48.2%)      146 (51.8%)              
## remote_work:                                                                         0.371   
##     No                            883 (70.1%)      444 (50.3%)      439 (49.7%)              
##     Yes                           376 (29.9%)      178 (47.3%)      198 (52.7%)              
## tech_company:                                                                        0.296   
##     No                            228 (18.1%)      105 (46.1%)      123 (53.9%)              
##     Yes                           1031 (81.9%)     517 (50.1%)      514 (49.9%)              
## benefits:                                                                           <0.001   
##     Don't know                    408 (32.4%)      257 (63.0%)      151 (37.0%)              
##     No                            374 (29.7%)      193 (51.6%)      181 (48.4%)              
##     Yes                           477 (37.9%)      172 (36.1%)      305 (63.9%)              
## care_options:                                                                       <0.001   
##     No                            501 (39.8%)      294 (58.7%)      207 (41.3%)              
##     Not sure                      314 (24.9%)      191 (60.8%)      123 (39.2%)              
##     Yes                           444 (35.3%)      137 (30.9%)      307 (69.1%)              
## wellness_program:                                                                    0.003   
##     Don't know                    188 (14.9%)      107 (56.9%)       81 (43.1%)              
##     No                            842 (66.9%)      422 (50.1%)      420 (49.9%)              
##     Yes                           229 (18.2%)       93 (40.6%)      136 (59.4%)              
## seek_help:                                                                           0.004   
##     Don't know                    363 (28.8%)      197 (54.3%)      166 (45.7%)              
##     No                            646 (51.3%)      323 (50.0%)      323 (50.0%)              
##     Yes                           250 (19.9%)      102 (40.8%)      148 (59.2%)              
## anonymity:                                                                          <0.001   
##     Don't know                    819 (65.1%)      448 (54.7%)      371 (45.3%)              
##     No                             65 (5.16%)       27 (41.5%)       38 (58.5%)              
##     Yes                           375 (29.8%)      147 (39.2%)      228 (60.8%)              
## leave:                                                                              <0.001   
##     Don't know                    563 (44.7%)      309 (54.9%)      254 (45.1%)              
##     Somewhat difficult            126 (10.0%)       44 (34.9%)       82 (65.1%)              
##     Somewhat easy                 266 (21.1%)      135 (50.8%)      131 (49.2%)              
##     Very difficult                 98 (7.78%)       31 (31.6%)       67 (68.4%)              
##     Very easy                     206 (16.4%)      103 (50.0%)      103 (50.0%)              
## mental_health_consequence:                                                          <0.001   
##     Maybe                         477 (37.9%)      224 (47.0%)      253 (53.0%)              
##     No                            490 (38.9%)      280 (57.1%)      210 (42.9%)              
##     Yes                           292 (23.2%)      118 (40.4%)      174 (59.6%)              
## phys_health_consequence:                                                             0.185   
##     Maybe                         273 (21.7%)      127 (46.5%)      146 (53.5%)              
##     No                            925 (73.5%)      470 (50.8%)      455 (49.2%)              
##     Yes                            61 (4.85%)       25 (41.0%)       36 (59.0%)              
## coworkers:                                                                           0.050   
##     No                            260 (20.7%)      141 (54.2%)      119 (45.8%)              
##     Some of them                  774 (61.5%)      384 (49.6%)      390 (50.4%)              
##     Yes                           225 (17.9%)       97 (43.1%)      128 (56.9%)              
## supervisor:                                                                          0.422   
##     No                            393 (31.2%)      186 (47.3%)      207 (52.7%)              
##     Some of them                  350 (27.8%)      170 (48.6%)      180 (51.4%)              
##     Yes                           516 (41.0%)      266 (51.6%)      250 (48.4%)              
## mental_health_interview:                                                             0.002   
##     Maybe                         207 (16.4%)      125 (60.4%)       82 (39.6%)              
##     No                            1008 (80.1%)     479 (47.5%)      529 (52.5%)              
##     Yes                            44 (3.49%)       18 (40.9%)       26 (59.1%)              
## phys_health_interview:                                                               0.183   
##     Maybe                         557 (44.2%)      290 (52.1%)      267 (47.9%)              
##     No                            500 (39.7%)      241 (48.2%)      259 (51.8%)              
##     Yes                           202 (16.0%)       91 (45.0%)      111 (55.0%)              
## mental_vs_physical:                                                                 <0.001   
##     Don't know                    576 (45.8%)      316 (54.9%)      260 (45.1%)              
##     No                            340 (27.0%)      138 (40.6%)      202 (59.4%)              
##     Yes                           343 (27.2%)      168 (49.0%)      175 (51.0%)              
## obs_consequence:                                                                    <0.001   
##     No                            1075 (85.4%)     566 (52.7%)      509 (47.3%)              
##     Yes                           184 (14.6%)       56 (30.4%)      128 (69.6%)              
## Region:                                                                              0.001   
##     Africa                         8 (0.64%)        3 (37.5%)        5 (62.5%)               
##     Asia                           23 (1.83%)       18 (78.3%)       5 (21.7%)               
##     Central/South America          11 (0.87%)       8 (72.7%)        3 (27.3%)               
##     Europe                        362 (28.8%)      204 (56.4%)      158 (43.6%)              
##     North America                 826 (65.6%)      378 (45.8%)      448 (54.2%)              
##     Oceania                        29 (2.30%)       11 (37.9%)       18 (62.1%)              
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Several factors showed significant differences between those who received treatment and those who did not. Age was slightly higher in the treatment group (p = 0.011). Gender was strongly associated with treatment (p < 0.001), with females, non-binary, and other gender-diverse individuals more likely to report receiving treatment than males. Those with a family history of mental illness were significantly more likely to have received treatment (p < 0.001). The frequency of work interference due to mental health also had a strong association (p < 0.001), with those reporting more interference more likely to be in the treatment group.

Other significant variables included mental health consequences, benefits availability, care options, wellness programs, help-seeking behavior, leave policies, perceptions of mental vs physical health treatment, interview willingness regarding mental health, anonymity, observed workplace consequences, and region (all p < 0.05). These findings suggest that both individual and workplace-related factors are significantly associated with seeking or receiving mental health treatment.
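
As a rough check on how one of these p-values arises, the obs_consequence rows of the table can be re-tested by hand. This is only a sketch: it assumes the first count column is the no-treatment group and the second the treatment group, and it uses a plain Pearson chi-square test, which may not match the exact options compareGroups applies.

```r
# Counts copied from the obs_consequence rows of the table above.
# Assumption: columns are treatment = No, then treatment = Yes.
obs_tab <- matrix(
  c(566, 509,
     56, 128),
  nrow = 2, byrow = TRUE,
  dimnames = list(obs_consequence = c("No", "Yes"),
                  treatment = c("No", "Yes"))
)

# Pearson chi-square test of independence (no continuity correction)
obs_test <- chisq.test(obs_tab, correct = FALSE)
obs_test$p.value < 0.001

# Row percentages, for comparison with the percentages in the table
round(100 * prop.table(obs_tab, margin = 1), 1)
```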

Comments column exploration

#create text corpus
corpus <- VCorpus(VectorSource(data$comments))

corpus_clean <- tm_map(corpus,content_transformer(tolower)) #converting to lower case letters
corpus_clean <- tm_map(corpus_clean,removeNumbers) #removing numbers
corpus_clean <- tm_map(corpus_clean,removeWords,stopwords()) #removing stop words
corpus_clean <- tm_map(corpus_clean,removePunctuation) #removing punctuation
corpus_clean <- tm_map(corpus_clean,stemDocument) #stemming the document
corpus_clean <- tm_map(corpus_clean,stripWhitespace)#removing spaces after doing above process

#visualize the most frequent words
wordcloud(corpus_clean, scale = c(4,.1),max.words = 100,random.order = FALSE, random.color = FALSE, colors = brewer.pal(6, 'Dark2'))

Mental health and work-related terms are the most frequent terms in the comments. Depression is the most frequent mental condition term.
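
A word cloud shows relative frequency only visually, so a term-frequency table can back up a statement about the most frequent terms. The sketch below uses a hypothetical three-comment corpus (toy_comments is invented for illustration, not taken from the survey) and the same tm cleaning steps, without stemming.

```r
library(tm)

# Hypothetical comments, for illustration only
toy_comments <- c("I have depression and anxiety",
                  "Depression affects my work",
                  "My employer is supportive")

toy_corpus <- VCorpus(VectorSource(toy_comments))
toy_corpus <- tm_map(toy_corpus, content_transformer(tolower))
toy_corpus <- tm_map(toy_corpus, removeWords, stopwords("english"))
toy_corpus <- tm_map(toy_corpus, removePunctuation)
toy_corpus <- tm_map(toy_corpus, stripWhitespace)

# Count how often each term occurs across all comments
tdm <- TermDocumentMatrix(toy_corpus)
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(term_freq, 5)
```

Applied to corpus_clean, the same two final lines would list the stemmed terms behind the word cloud together with their counts.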

Step 2: Psychometric Analysis

Select the questionnaire items by removing the demographic variables

items <- data[ , -which(names(data) %in% c("Timestamp","Country", "state", "Age", "Region", "Gender", "comments"))]

Recode responses

#functions to recode responses
#1- Yes/No/Maybe/Some of them/Don't know/Not sure
recode_yes <- function(x) {
  ifelse(x == "Yes", 3, ifelse(x == "No", 1, ifelse(x == "Maybe", 2, ifelse(x == "Some of them", 2, ifelse(x == "Don't know", 2, ifelse(x == "Not sure", 2, NA))))))
}

#2- Don't know/Very difficult/Somewhat difficult/Somewhat easy/Very easy
recode_difficult <- function(x) {
  ifelse(x == "Don't know", 1, ifelse(x == "Very difficult", 2, ifelse(x == "Somewhat difficult", 3, ifelse(x == "Somewhat easy", 4, ifelse(x == "Very easy", 5, NA)))))
}

#3- Never/Often/Rarely/Sometimes   
recode_freq <- function(x){
  ifelse(x == "Never", 1, ifelse(x == "Rarely", 2, ifelse(x == "Sometimes", 3, ifelse(x == "Often", 4, NA))))
}

#4- Number of employees
recode_employ <- function(x) {
  ifelse(x == "1-5", 1, ifelse(x == "6-25", 2, ifelse(x == "26-100", 3, ifelse(x == "100-500", 4, ifelse(x == "500-1000", 5, ifelse(x == "More than 1000", 6, NA))))))
}
#Apply functions
items[,c(1:3, 6:12, 14:21)] <- lapply(items[,c(1:3, 6:12, 14:21)], recode_yes)
items[,13] <- lapply(items[,13], recode_difficult)
items[,4] <- lapply(items[,4], recode_freq)
items[,5] <- lapply(items[,5], recode_employ)
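
The nested ifelse() calls can also be expressed as a lookup in a named vector, which keeps each label-to-score mapping in one visible place. This is a sketch of an alternative style, not the method used above; yes_map and recode_lookup are hypothetical names.

```r
# Hypothetical lookup table mirroring recode_yes()
yes_map <- c("No" = 1, "Maybe" = 2, "Some of them" = 2,
             "Don't know" = 2, "Not sure" = 2, "Yes" = 3)

# Look each response up by name; unmatched labels and NA both become NA
recode_lookup <- function(x, map) {
  unname(map[as.character(x)])
}

recode_lookup(c("Yes", "No", "Some of them", NA, "Other"), yes_map)
```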

Data imputation for NA values

imputed <- mice(items, m = 1, method = "pmm")
## 
##  iter imp variable
##   1   1  self_employed  work_interfere
##   2   1  self_employed  work_interfere
##   3   1  self_employed  work_interfere
##   4   1  self_employed  work_interfere
##   5   1  self_employed  work_interfere
items <- complete(imputed, 1)
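
Predictive mean matching draws imputed values at random, so downstream estimates can shift slightly between runs unless a seed is fixed. The sketch below illustrates this on hypothetical toy data; the data frame toy and the seed value 123 are assumptions made for the example.

```r
library(mice)

# Hypothetical toy data with a few missing values in one column
set.seed(42)
toy <- data.frame(
  a = sample(1:3, 60, replace = TRUE),
  b = sample(1:4, 60, replace = TRUE),
  c = sample(1:5, 60, replace = TRUE)
)
toy$b[c(3, 17, 40)] <- NA

# Passing the same `seed` to mice() makes the single imputation reproducible
imp1 <- mice::complete(mice(toy, m = 1, method = "pmm", seed = 123, printFlag = FALSE), 1)
imp2 <- mice::complete(mice(toy, m = 1, method = "pmm", seed = 123, printFlag = FALSE), 1)

identical(imp1, imp2)  # compare the two imputed data sets
anyNA(imp1)            # check whether any missing values remain
```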

Exploratory Factor Analysis

1- Correlation between items

Let’s explore the highly correlated items.

cor_items <- cor(items, use = "complete.obs")

cor_items[lower.tri(cor_items, diag = TRUE)] <- NA  # remove lower triangle and diagonal

# Convert to long/tidy format
cor_table <- as.data.frame(as.table(cor_items)) %>%
  filter(!is.na(Freq), Freq > 0.35) %>%  # keep positive correlations above 0.35
  arrange(desc(Freq)) %>%
  rename(Variable1 = Var1, Variable2 = Var2, Correlation = Freq)

# View table
print(cor_table)
##                    Variable1               Variable2 Correlation
## 1           wellness_program               seek_help   0.6181306
## 2                  coworkers              supervisor   0.5743100
## 3  mental_health_consequence phys_health_consequence   0.5156194
## 4                  treatment          work_interfere   0.5038236
## 5                   benefits               seek_help   0.4904305
## 6               no_employees                benefits   0.4593412
## 7    mental_health_interview   phys_health_interview   0.4488442
## 8                   benefits        wellness_program   0.4092124
## 9               no_employees               seek_help   0.4059284
## 10            family_history               treatment   0.3779177

The highest inter-item correlation (0.62) is between wellness_program (the employer has discussed mental health as part of an employee wellness program) and seek_help (the employer provides resources on mental health and how to seek help). Company size (no_employees) also correlates positively with benefits (0.46) and with seek_help (0.41).
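
Because the filter keeps only Freq > 0.35, strong negative correlations never appear in this table. The sketch below, on hypothetical simulated data (toy and its columns are invented), shows how filtering on the absolute value would also surface them.

```r
# Hypothetical data: b is strongly negatively related to a, c is noise
set.seed(1)
x <- rnorm(200)
toy <- data.frame(
  a = x,
  b = -x + rnorm(200, sd = 0.5),
  c = rnorm(200)
)

cor_toy <- cor(toy)
cor_toy[lower.tri(cor_toy, diag = TRUE)] <- NA  # keep each pair once

cor_long <- as.data.frame(as.table(cor_toy))
cor_long <- cor_long[!is.na(cor_long$Freq), ]

positive_only <- cor_long[cor_long$Freq > 0.35, ]       # filter as used above
either_sign   <- cor_long[abs(cor_long$Freq) > 0.35, ]  # also keeps negatives

either_sign
```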

2- Reliability analysis

# load library ltm
library(ltm)

# calculate cronbach's alpha
cronbach.alpha(items, CI=TRUE, standardized=TRUE)
## 
## Standardized Cronbach's alpha for the 'items' data-set
## 
## Items: 21
## Sample units: 1259
## alpha: 0.49
## 
## Bootstrap 95% CI based on 1000 samples
##  2.5% 97.5% 
## 0.448 0.537

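Cronbach's alpha is 0.49 (bootstrap 95% CI 0.448 to 0.537), which falls below 0.5 and is conventionally considered unacceptable. A low value is plausible when items measuring different constructs are pooled into a single scale, so we will continue with exploratory factor analysis to uncover the underlying dimensions.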
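
To see what an alpha of this size implies, the standardized alpha can be written in terms of the number of items k and the mean inter-item correlation. The sketch below uses the textbook formula alpha = k * r_bar / (1 + (k - 1) * r_bar); std_alpha and R_toy are hypothetical names, and ltm's internal computation is not reproduced here.

```r
# Standardized alpha from a correlation matrix R
std_alpha <- function(R) {
  k <- ncol(R)
  r_bar <- mean(R[lower.tri(R)])  # mean off-diagonal correlation
  k * r_bar / (1 + (k - 1) * r_bar)
}

# Toy check: 3 items, all pairwise correlations equal to 0.5
R_toy <- matrix(0.5, nrow = 3, ncol = 3)
diag(R_toy) <- 1
std_alpha(R_toy)  # 0.75

# Mean inter-item correlation implied by alpha = 0.49 with k = 21 items
alpha <- 0.49
k <- 21
r_bar_implied <- alpha / (k - alpha * (k - 1))
round(r_bar_implied, 3)  # 0.044
```

Under this formula, the 21 items correlate at only about 0.04 on average, which is consistent with a heterogeneous item set.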

3- Exploratory factor analysis

We will start with a 1-factor solution.

efa_1 <- factanal(x = items, factors = 1)
efa_1
## 
## Call:
## factanal(x = items, factors = 1)
## 
## Uniquenesses:
##             self_employed            family_history                 treatment 
##                     0.993                     0.993                     0.991 
##            work_interfere              no_employees               remote_work 
##                     0.968                     0.987                     0.996 
##              tech_company                  benefits              care_options 
##                     0.986                     0.993                     0.991 
##          wellness_program                 seek_help                 anonymity 
##                     0.961                     0.966                     0.898 
##                     leave mental_health_consequence   phys_health_consequence 
##                     0.886                     0.344                     0.743 
##                 coworkers                supervisor   mental_health_interview 
##                     0.653                     0.474                     0.834 
##     phys_health_interview        mental_vs_physical           obs_consequence 
##                     0.967                     0.734                     0.949 
## 
## Loadings:
##                           Factor1
## self_employed                    
## family_history                   
## treatment                        
## work_interfere            -0.180 
## no_employees              -0.113 
## remote_work                      
## tech_company               0.119 
## benefits                         
## care_options                     
## wellness_program           0.198 
## seek_help                  0.184 
## anonymity                  0.320 
## leave                      0.338 
## mental_health_consequence -0.810 
## phys_health_consequence   -0.507 
## coworkers                  0.589 
## supervisor                 0.725 
## mental_health_interview    0.407 
## phys_health_interview      0.182 
## mental_vs_physical         0.516 
## obs_consequence           -0.226 
## 
##                Factor1
## SS loadings      2.694
## Proportion Var   0.128
## 
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 4028.37 on 189 degrees of freedom.
## The p-value is 0

The single factor explains only 12.8% of the total variance, which is very low.

The p-value of the chi-square test is reported as 0, so we reject the null hypothesis that 1 factor is sufficient.
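
The degrees of freedom reported by factanal() follow from the number of items p and the number of factors k as ((p - k)^2 - (p + k)) / 2. A small sketch (efa_df is a hypothetical helper name) reproduces the value for this model and for the larger models with the same 21 items.

```r
# Degrees of freedom of the maximum-likelihood factor model test
efa_df <- function(p, k) {
  ((p - k)^2 - (p + k)) / 2
}

efa_df(21, 1)    # 189, as in the 1-factor output above
efa_df(21, 2:3)  # 169 150 for the 2- and 3-factor solutions
```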

Now, we will try the 2-factor solution.

efa_2 <- factanal(x = items, factors = 2)
efa_2
## 
## Call:
## factanal(x = items, factors = 2)
## 
## Uniquenesses:
##             self_employed            family_history                 treatment 
##                     0.928                     0.979                     0.973 
##            work_interfere              no_employees               remote_work 
##                     0.969                     0.614                     0.970 
##              tech_company                  benefits              care_options 
##                     0.936                     0.550                     0.857 
##          wellness_program                 seek_help                 anonymity 
##                     0.499                     0.400                     0.817 
##                     leave mental_health_consequence   phys_health_consequence 
##                     0.879                     0.357                     0.754 
##                 coworkers                supervisor   mental_health_interview 
##                     0.641                     0.475                     0.813 
##     phys_health_interview        mental_vs_physical           obs_consequence 
##                     0.946                     0.710                     0.948 
## 
## Loadings:
##                           Factor1 Factor2
## self_employed              0.108  -0.247 
## family_history                     0.116 
## treatment                          0.135 
## work_interfere            -0.175         
## no_employees              -0.133   0.607 
## remote_work                       -0.156 
## tech_company               0.126  -0.218 
## benefits                           0.666 
## care_options               0.100   0.365 
## wellness_program           0.220   0.673 
## seek_help                  0.205   0.747 
## anonymity                  0.325   0.278 
## leave                      0.345         
## mental_health_consequence -0.800         
## phys_health_consequence   -0.496         
## coworkers                  0.592         
## supervisor                 0.724         
## mental_health_interview    0.414  -0.124 
## phys_health_interview      0.186  -0.138 
## mental_vs_physical         0.526   0.114 
## obs_consequence           -0.221         
## 
##                Factor1 Factor2
## SS loadings      2.721   2.263
## Proportion Var   0.130   0.108
## Cumulative Var   0.130   0.237
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 2265.94 on 169 degrees of freedom.
## The p-value is 0

The 2-factor model explains 23.7% of the variance cumulatively. This is better than the 1-factor model, but it is still a low proportion.

Now, we will try the 3-factor solution.

efa_3 <- factanal(x = items, factors = 3)
efa_3
## 
## Call:
## factanal(x = items, factors = 3)
## 
## Uniquenesses:
##             self_employed            family_history                 treatment 
##                     0.888                     0.778                     0.449 
##            work_interfere              no_employees               remote_work 
##                     0.551                     0.562                     0.949 
##              tech_company                  benefits              care_options 
##                     0.935                     0.536                     0.784 
##          wellness_program                 seek_help                 anonymity 
##                     0.515                     0.415                     0.798 
##                     leave mental_health_consequence   phys_health_consequence 
##                     0.849                     0.349                     0.757 
##                 coworkers                supervisor   mental_health_interview 
##                     0.602                     0.462                     0.810 
##     phys_health_interview        mental_vs_physical           obs_consequence 
##                     0.940                     0.712                     0.899 
## 
## Loadings:
##                           Factor1 Factor2 Factor3
## self_employed              0.151  -0.276   0.115 
## family_history                             0.467 
## treatment                                  0.740 
## work_interfere                    -0.116   0.656 
## no_employees              -0.191   0.631         
## remote_work                0.105  -0.176         
## tech_company               0.138  -0.213         
## benefits                           0.668   0.122 
## care_options               0.131   0.337   0.291 
## wellness_program           0.190   0.667         
## seek_help                  0.163   0.746         
## anonymity                  0.332   0.282   0.113 
## leave                      0.373                 
## mental_health_consequence -0.772           0.231 
## phys_health_consequence   -0.474           0.132 
## coworkers                  0.623                 
## supervisor                 0.732                 
## mental_health_interview    0.421  -0.101         
## phys_health_interview      0.202  -0.135         
## mental_vs_physical         0.505   0.148  -0.108 
## obs_consequence           -0.180           0.261 
## 
##                Factor1 Factor2 Factor3
## SS loadings      2.680   2.271   1.508
## Proportion Var   0.128   0.108   0.072
## Cumulative Var   0.128   0.236   0.308
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 1414.44 on 150 degrees of freedom.
## The p-value is 1.8e-204

The cumulative variance explained by the model is 30.8%, which is modest but not unusual for social science/psychometric data.
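
The "Proportion Var" and "Cumulative Var" rows can be recomputed from the SS loadings: each factor's share is its sum of squared loadings divided by the number of items. The sketch below copies the three SS loadings from the output above; the variable names are hypothetical.

```r
# SS loadings copied from the 3-factor output on all 21 items
ss_loadings <- c(Factor1 = 2.680, Factor2 = 2.271, Factor3 = 1.508)
n_items <- 21

prop_var <- ss_loadings / n_items  # Proportion Var row
cum_var  <- cumsum(prop_var)       # Cumulative Var row

round(prop_var, 3)
round(cum_var, 3)
```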

Some variables are not well accounted for by any of the 3 factors, as shown by their high uniqueness: remote_work (0.949), phys_health_interview (0.940), obs_consequence (0.899), self_employed (0.888), mental_health_interview (0.810), and anonymity (0.798).

We will remove these variables from the questionnaire:

items <- items[ , -which(names(items) %in% c("remote_work","phys_health_interview", "obs_consequence", "self_employed", "mental_health_interview", "anonymity"))]
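
The removal can be cross-checked against the uniquenesses printed in the 3-factor output. The sketch below copies those values and flags items above a cutoff; the cutoff of 0.79 is an assumption chosen only so that all six removed items are flagged, not a rule stated in the analysis.

```r
# Uniquenesses copied from the 3-factor output on all 21 items
uniq <- c(
  self_employed = 0.888, family_history = 0.778, treatment = 0.449,
  work_interfere = 0.551, no_employees = 0.562, remote_work = 0.949,
  tech_company = 0.935, benefits = 0.536, care_options = 0.784,
  wellness_program = 0.515, seek_help = 0.415, anonymity = 0.798,
  leave = 0.849, mental_health_consequence = 0.349,
  phys_health_consequence = 0.757, coworkers = 0.602, supervisor = 0.462,
  mental_health_interview = 0.810, phys_health_interview = 0.940,
  mental_vs_physical = 0.712, obs_consequence = 0.899
)

removed <- c("remote_work", "phys_health_interview", "obs_consequence",
             "self_employed", "mental_health_interview", "anonymity")

# Assumed cutoff, for illustration only
flagged <- names(sort(uniq[uniq > 0.79], decreasing = TRUE))

flagged
setdiff(flagged, removed)  # flagged by the cutoff but kept in the analysis
```

With this cutoff, tech_company (0.935) and leave (0.849) are also flagged although they were retained, so the removal above reflects a judgment call rather than a strict threshold.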

Let’s repeat the 3-factor analysis

efa_3 <- factanal(x = items, factors = 3)
efa_3
## 
## Call:
## factanal(x = items, factors = 3)
## 
## Uniquenesses:
##            family_history                 treatment            work_interfere 
##                     0.775                     0.381                     0.570 
##              no_employees              tech_company                  benefits 
##                     0.644                     0.942                     0.586 
##              care_options          wellness_program                 seek_help 
##                     0.805                     0.474                     0.342 
##                     leave mental_health_consequence   phys_health_consequence 
##                     0.872                     0.324                     0.742 
##                 coworkers                supervisor        mental_vs_physical 
##                     0.601                     0.452                     0.723 
## 
## Loadings:
##                           Factor1 Factor2 Factor3
## family_history                             0.467 
## treatment                                  0.782 
## work_interfere                             0.645 
## no_employees              -0.173   0.567         
## tech_company               0.136  -0.199         
## benefits                           0.634   0.104 
## care_options                       0.352   0.250 
## wellness_program           0.158   0.708         
## seek_help                  0.138   0.799         
## leave                      0.347                 
## mental_health_consequence -0.800           0.188 
## phys_health_consequence   -0.499                 
## coworkers                  0.622                 
## supervisor                 0.740                 
## mental_vs_physical         0.487   0.170  -0.103 
## 
##                Factor1 Factor2 Factor3
## SS loadings      2.294   2.079   1.393
## Proportion Var   0.153   0.139   0.093
## Cumulative Var   0.153   0.292   0.384
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 488.99 on 63 degrees of freedom.
## The p-value is 3.54e-67

The cumulative variance explained by the model increased to 38.4%.

Visualize the factor structure using the psych package

# Run EFA with varimax rotation (psych::fa uses the minres method by default)
efa_result <- fa(items, nfactors = 3, rotate = "varimax")

# Draw the path diagram in the current plotting area
fa.diagram(efa_result)

# Convert loadings to data frame
load_df <- as.data.frame(efa_result$loadings[1:ncol(items), ])
load_df$Variable <- rownames(load_df)

# Convert to long format
load_long <- pivot_longer(load_df, cols = starts_with("MR"), names_to = "Factor", values_to = "Loading")

# Plot
ggplot(load_long, aes(x = Factor, y = Loading, fill = abs(Loading))) +
  geom_col(position = "dodge") +
  facet_wrap(~ Variable, scales = "free_y") +
  coord_flip() +
  scale_fill_gradient2(low = "blue", mid = "gray90", high = "red", midpoint = 0.4) +
  theme_minimal() +
  labs(title = "Factor Loadings by Variable", x = "Factor", y = "Loading")

The factor analysis revealed three underlying dimensions related to workplace mental health attitudes and experiences.

The first factor, which we can describe as Openness and Willingness, reflects how comfortable and supported individuals feel when discussing mental health in the workplace. Items that loaded highly on this factor include willingness to talk to a supervisor (loading = 0.7), comfort discussing mental health with coworkers (0.6), and the perception that the employer takes mental health as seriously as physical health (0.5). Ease of taking medical leave for a mental health condition contributed more weakly (0.3). Notably, there were strong negative loadings for beliefs that disclosing mental or physical health issues would have negative consequences (−0.8 and −0.5, respectively), suggesting that higher scorers on this factor are less likely to anticipate stigma or adverse outcomes.

The second factor reflects Organizational Mental Health Resources, emphasizing the structural and informational support provided by employers. Items with high loadings include employer-provided resources on mental health and how to seek help (0.8), the inclusion of mental health in wellness programs (0.7), and the availability of mental health benefits (0.6). Company size also loaded moderately (0.6), which may reflect the tendency of larger organizations to have more comprehensive mental health resources. Awareness of available care options contributed more weakly (0.4).

The third factor can be described as Personal Experience, focusing on individual encounters with mental health challenges. This includes having received treatment for mental health concerns (loading = 0.8), experiencing interference with work due to mental health (0.6), and having a family history of mental illness (0.5). These items collectively represent the personal side of mental health.

4- Confirmatory factor analysis

model <- 'Factor1 =~ mental_health_consequence + phys_health_consequence + supervisor + coworkers + leave + mental_vs_physical
          Factor2 =~ seek_help + benefits + wellness_program + care_options + no_employees
          Factor3 =~ treatment + family_history + work_interfere'
fit <- cfa(model, data = items)
summary(fit, fit.measures=TRUE)
## lavaan 0.6-19 ended normally after 37 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        31
## 
##   Number of observations                          1259
## 
## Model Test User Model:
##                                                       
##   Test statistic                               781.144
##   Degrees of freedom                                74
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                              4490.244
##   Degrees of freedom                                91
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.839
##   Tucker-Lewis Index (TLI)                       0.802
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -21108.013
##   Loglikelihood unrestricted model (H1)     -20717.441
##                                                       
##   Akaike (AIC)                               42278.025
##   Bayesian (BIC)                             42437.305
##   Sample-size adjusted Bayesian (SABIC)      42338.835
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.087
##   90 Percent confidence interval - lower         0.082
##   90 Percent confidence interval - upper         0.093
##   P-value H_0: RMSEA <= 0.050                    0.000
##   P-value H_0: RMSEA >= 0.080                    0.984
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.071
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Factor1 =~                                          
##     mntl_hlth_cnsq    1.000                           
##     phys_hlth_cnsq    0.460    0.027   17.338    0.000
##     supervisor       -0.957    0.041  -23.423    0.000
##     coworkers        -0.568    0.029  -19.303    0.000
##     leave            -0.793    0.077  -10.349    0.000
##     mntl_vs_physcl   -0.562    0.035  -16.034    0.000
##   Factor2 =~                                          
##     seek_help         1.000                           
##     benefits          0.806    0.040   19.953    0.000
##     wellness_prgrm    0.896    0.040   22.293    0.000
##     care_options      0.505    0.042   11.931    0.000
##     no_employees      1.370    0.084   16.390    0.000
##   Factor3 =~                                          
##     treatment         1.000                           
##     family_history    0.582    0.051   11.464    0.000
##     work_interfere    0.815    0.067   12.143    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Factor1 ~~                                          
##     Factor2          -0.061    0.014   -4.253    0.000
##     Factor3           0.079    0.019    4.183    0.000
##   Factor2 ~~                                          
##     Factor3           0.041    0.019    2.200    0.028
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .mntl_hlth_cnsq    0.186    0.015   12.665    0.000
##    .phys_hlth_cnsq    0.225    0.010   22.957    0.000
##    .supervisor        0.337    0.018   18.386    0.000
##    .coworkers         0.252    0.011   22.151    0.000
##    .leave             2.286    0.093   24.481    0.000
##    .mntl_vs_physcl    0.413    0.018   23.366    0.000
##    .seek_help         0.207    0.016   12.794    0.000
##    .benefits          0.405    0.019   21.034    0.000
##    .wellness_prgrm    0.288    0.016   17.591    0.000
##    .care_options      0.645    0.027   24.137    0.000
##    .no_employees      2.153    0.094   22.951    0.000
##    .treatment         0.368    0.050    7.308    0.000
##    .family_history    0.738    0.034   21.784    0.000
##    .work_interfere    0.631    0.041   15.463    0.000
##     Factor1           0.410    0.026   15.841    0.000
##     Factor2           0.406    0.027   15.121    0.000
##     Factor3           0.632    0.061   10.413    0.000

The results of the confirmatory factor analysis (CFA) suggest that the proposed three-factor model fits the data only moderately. The chi-square test was significant (χ²(74) = 781.14, p < .001), indicating that the model does not perfectly reproduce the observed data. This is common in large samples, so other fit indices offer more practical insight. The CFI (0.839) and TLI (0.802) fall below the conventional threshold of 0.90, suggesting room for improvement in model fit. Similarly, the RMSEA (0.087) is above the commonly used cutoff of 0.08, and its 90% confidence interval (0.082 to 0.093) lies entirely above that cutoff. However, the SRMR (0.071) falls within the acceptable range (<0.08), providing some support for the model.
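
The fit indices can be recomputed from the chi-square statistics in the lavaan output as a sanity check on the reported values. The sketch below uses common textbook formulas; lavaan's internal computation may differ in details (for example N versus N - 1 in the RMSEA), so agreement to three decimals is what is being checked, not the exact algorithm.

```r
# Values copied from the lavaan summary above
chisq_m <- 781.144   # user model test statistic
df_m    <- 74
chisq_b <- 4490.244  # baseline model test statistic
df_b    <- 91
n       <- 1259      # number of observations

# Root mean square error of approximation
rmsea <- sqrt(max(chisq_m - df_m, 0) / (df_m * n))

# Comparative fit index
cfi <- 1 - max(chisq_m - df_m, 0) / max(chisq_b - df_b, chisq_m - df_m, 0)

# Tucker-Lewis index
tli <- (chisq_b / df_b - chisq_m / df_m) / (chisq_b / df_b - 1)

round(c(RMSEA = rmsea, CFI = cfi, TLI = tli), 3)
```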