This analysis is based on the Mental Health in Tech Survey dataset, publicly available on Kaggle. Collected by Open Sourcing Mental Illness (OSMI), the dataset captures responses from individuals working in the technology sector about their experiences, attitudes, and perceptions related to mental health in the workplace. The survey includes a range of variables such as demographic information, employment details, mental health history, and workplace support structures.
The goal of this analysis is to explore factors associated with seeking mental health treatment among tech workers. Using descriptive statistics, exploratory factor analysis (EFA), and confirmatory factor analysis (CFA), we examine how individual characteristics and organizational supports relate to treatment-seeking behavior. This helps identify key predictors and potential areas for improving mental health support in tech workplaces.
library(readr) # Read CSV data
library(tm) # text cleaning
library(compareGroups) #Descriptive tables
library(SnowballC) #text stemming
library(wordcloud) #text visualization
library(dplyr) #data manipulation
library(stringr) #wrapper for string operation
library(corrplot) #correlation matrix
library(psych) #factor analysis and visualization
library(lavaan) #confirmatory factor analysis
library(mice) #Data imputation
library(ggplot2)
library(tidyr)
data <- read_csv("survey.csv")
head(data)
## # A tibble: 6 × 27
## Timestamp Age Gender Country state self_employed family_history
## <dttm> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2014-08-27 11:29:31 37 Female United St… IL <NA> No
## 2 2014-08-27 11:29:37 44 M United St… IN <NA> No
## 3 2014-08-27 11:29:44 32 Male Canada <NA> <NA> No
## 4 2014-08-27 11:29:46 31 Male United Ki… <NA> <NA> Yes
## 5 2014-08-27 11:30:22 31 Male United St… TX <NA> No
## 6 2014-08-27 11:31:22 33 Male United St… TN <NA> Yes
## # ℹ 20 more variables: treatment <chr>, work_interfere <chr>,
## # no_employees <chr>, remote_work <chr>, tech_company <chr>, benefits <chr>,
## # care_options <chr>, wellness_program <chr>, seek_help <chr>,
## # anonymity <chr>, leave <chr>, mental_health_consequence <chr>,
## # phys_health_consequence <chr>, coworkers <chr>, supervisor <chr>,
## # mental_health_interview <chr>, phys_health_interview <chr>,
## # mental_vs_physical <chr>, obs_consequence <chr>, comments <chr>
#check missing values
naniar::vis_miss(data)
Only four columns contain missing values: state, self_employed, work_interfere, and comments.
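As a numeric cross-check on the plot, the missing counts per column can also be tabulated directly. The following is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical and only stand in for the survey data.

```r
# Hypothetical toy data standing in for the survey tibble
toy <- data.frame(
  state = c("IL", NA, NA, "TX"),
  self_employed = c(NA, "No", "Yes", "No"),
  treatment = c("Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Count missing values per column and keep only columns that have any
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]
```

Applied to the real data, the same two lines would list the affected columns together with how many values each is missing.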
desc <- compareGroups(~ ., data = data, method = 4, max.ylev = 12, max.xlev = 20, chisq.test.perm = T, byrow = F)
## Warning in compareGroups.fit(X = X, y = y, include.label = include.label, :
## Variables 'Gender', 'Country', 'state', 'comments' have been removed since some
## errors occurred
desctab <- createTable(desc, type = 2, show.n = F,show.p.mul = F, show.all = T, show.p.overall = T)
desctab
##
## --------Summary descriptives table ---------
##
## _____________________________________________________________
## [ALL]
## N=1259
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## Timestamp 1409193000 [1409149616;1409269008]
## Age 31.0 [27.0;36.0]
## self_employed:
## No 1095 (88.2%)
## Yes 146 (11.8%)
## family_history:
## No 767 (60.9%)
## Yes 492 (39.1%)
## treatment:
## No 622 (49.4%)
## Yes 637 (50.6%)
## work_interfere:
## Never 213 (21.4%)
## Often 144 (14.5%)
## Rarely 173 (17.4%)
## Sometimes 465 (46.7%)
## no_employees:
## 1-5 162 (12.9%)
## 100-500 176 (14.0%)
## 26-100 289 (23.0%)
## 500-1000 60 (4.77%)
## 6-25 290 (23.0%)
## More than 1000 282 (22.4%)
## remote_work:
## No 883 (70.1%)
## Yes 376 (29.9%)
## tech_company:
## No 228 (18.1%)
## Yes 1031 (81.9%)
## benefits:
## Don't know 408 (32.4%)
## No 374 (29.7%)
## Yes 477 (37.9%)
## care_options:
## No 501 (39.8%)
## Not sure 314 (24.9%)
## Yes 444 (35.3%)
## wellness_program:
## Don't know 188 (14.9%)
## No 842 (66.9%)
## Yes 229 (18.2%)
## seek_help:
## Don't know 363 (28.8%)
## No 646 (51.3%)
## Yes 250 (19.9%)
## anonymity:
## Don't know 819 (65.1%)
## No 65 (5.16%)
## Yes 375 (29.8%)
## leave:
## Don't know 563 (44.7%)
## Somewhat difficult 126 (10.0%)
## Somewhat easy 266 (21.1%)
## Very difficult 98 (7.78%)
## Very easy 206 (16.4%)
## mental_health_consequence:
## Maybe 477 (37.9%)
## No 490 (38.9%)
## Yes 292 (23.2%)
## phys_health_consequence:
## Maybe 273 (21.7%)
## No 925 (73.5%)
## Yes 61 (4.85%)
## coworkers:
## No 260 (20.7%)
## Some of them 774 (61.5%)
## Yes 225 (17.9%)
## supervisor:
## No 393 (31.2%)
## Some of them 350 (27.8%)
## Yes 516 (41.0%)
## mental_health_interview:
## Maybe 207 (16.4%)
## No 1008 (80.1%)
## Yes 44 (3.49%)
## phys_health_interview:
## Maybe 557 (44.2%)
## No 500 (39.7%)
## Yes 202 (16.0%)
## mental_vs_physical:
## Don't know 576 (45.8%)
## No 340 (27.0%)
## Yes 343 (27.2%)
## obs_consequence:
## No 1075 (85.4%)
## Yes 184 (14.6%)
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
The warning shows that ‘Gender’, ‘Country’, ‘state’, and ‘comments’ were removed. A likely reason is that each has more distinct values than the max.xlev = 20 limit allows. Let’s check the Gender and Country variables.
as.data.frame(table(data$Gender))
## Var1 Freq
## 1 A little about you 1
## 2 Agender 1
## 3 All 1
## 4 Androgyne 1
## 5 cis-female/femme 1
## 6 Cis Female 1
## 7 cis male 1
## 8 Cis Male 2
## 9 Cis Man 1
## 10 Enby 1
## 11 f 15
## 12 F 38
## 13 femail 1
## 14 Femake 1
## 15 female 62
## 16 Female 123
## 17 Female (cis) 1
## 18 Female (trans) 2
## 19 fluid 1
## 20 Genderqueer 1
## 21 Guy (-ish) ^_^ 1
## 22 m 34
## 23 M 116
## 24 Mail 1
## 25 maile 1
## 26 Make 4
## 27 Mal 1
## 28 male 206
## 29 Male 618
## 30 Male-ish 1
## 31 Male (CIS) 1
## 32 male leaning androgynous 1
## 33 Malr 1
## 34 Man 2
## 35 msle 1
## 36 Nah 1
## 37 Neuter 1
## 38 non-binary 1
## 39 ostensibly male, unsure what that really means 1
## 40 p 1
## 41 queer 1
## 42 queer/she/they 1
## 43 something kinda male? 1
## 44 Trans-female 1
## 45 Trans woman 1
## 46 woman 1
## 47 Woman 3
The Gender variable is free text, so it needs cleaning and its many spellings need to be combined into a small number of well-defined groups.
data <- data %>%
  mutate(
    Gender = tolower(Gender),
    Gender = str_trim(Gender),
    # Branches are tried in order. The first two are anchored on a bare leading
    # "f" or "m", so any answer that starts with those letters (for example
    # "fluid" or "male leaning androgynous") is assigned there first.
    Gender = case_when(
      str_detect(Gender, "^(f|female|cis[-\\s]?female|femail|femake|woman|femme)") ~ "Female",
      str_detect(Gender, "^(m|male|cis[-\\s]?male|man|make|mail|maile|malr|mal|msle)") ~ "Male",
      str_detect(Gender, "trans|androgyne|non[-\\s]?binary|genderqueer|enby|fluid|agender|neuter|queer|androgynous") ~ "Non-binary / Gender diverse",
      # This branch and the fallback below assign the same label
      str_detect(Gender, "nah|ostensibly|something|unsure|a little about you|^p$") ~ "Other / Ambiguous",
      TRUE ~ "Other / Ambiguous"
    )
  )
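Because the first two patterns start with a bare `f` or `m`, it is worth checking how individual free-text answers are routed before trusting the grouped counts. The sketch below uses base R `grepl()` on a few values from the frequency table above. The shortened patterns `female_pat`, `male_pat`, and `strict_female` are illustrative assumptions, not the full expressions used in the analysis.

```r
# A few lowercased answers taken from the Gender frequency table
answers <- c("female", "fluid", "male leaning androgynous", "cis man", "m")

# Shortened stand-ins for the first two case_when() patterns (assumption)
female_pat <- "^(f|female|woman)"
male_pat   <- "^(m|male|man)"

routing <- data.frame(
  answer      = answers,
  hits_female = grepl(female_pat, answers),
  hits_male   = grepl(male_pat, answers)
)
routing

# A stricter variant: the alternatives must match the whole answer
strict_female <- "^(f|female|woman)$"
strict_check <- grepl(strict_female, c("f", "female", "fluid"))
strict_check
```

With prefix matching, "fluid" hits the female pattern and "male leaning androgynous" hits the male pattern, while "cis man" hits neither. Anchoring both ends, as in `strict_female`, is one way to keep single-letter answers without catching longer words.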
as.data.frame(table(data$Country))
## Var1 Freq
## 1 Australia 21
## 2 Austria 3
## 3 Bahamas, The 1
## 4 Belgium 6
## 5 Bosnia and Herzegovina 1
## 6 Brazil 6
## 7 Bulgaria 4
## 8 Canada 72
## 9 China 1
## 10 Colombia 2
## 11 Costa Rica 1
## 12 Croatia 2
## 13 Czech Republic 1
## 14 Denmark 2
## 15 Finland 3
## 16 France 13
## 17 Georgia 1
## 18 Germany 45
## 19 Greece 2
## 20 Hungary 1
## 21 India 10
## 22 Ireland 27
## 23 Israel 5
## 24 Italy 7
## 25 Japan 1
## 26 Latvia 1
## 27 Mexico 3
## 28 Moldova 1
## 29 Netherlands 27
## 30 New Zealand 8
## 31 Nigeria 1
## 32 Norway 1
## 33 Philippines 1
## 34 Poland 7
## 35 Portugal 2
## 36 Romania 1
## 37 Russia 3
## 38 Singapore 4
## 39 Slovenia 1
## 40 South Africa 6
## 41 Spain 1
## 42 Sweden 7
## 43 Switzerland 7
## 44 Thailand 1
## 45 United Kingdom 185
## 46 United States 751
## 47 Uruguay 1
## 48 Zimbabwe 1
There are some countries with very low frequencies. Therefore, we will combine countries into regions.
data <- data %>%
  mutate(
    Region = case_when(
      Country %in% c("United States", "Canada", "Mexico") ~ "North America",
      Country %in% c("Bahamas, The", "Brazil", "Colombia", "Costa Rica", "Uruguay") ~ "Central/South America",
      Country %in% c(
        "Austria", "Belgium", "Bosnia and Herzegovina", "Bulgaria", "Croatia",
        "Czech Republic", "Denmark", "Finland", "France", "Georgia", "Germany",
        "Greece", "Hungary", "Ireland", "Italy", "Latvia", "Moldova",
        "Netherlands", "Norway", "Poland", "Portugal", "Romania", "Russia",
        "Slovenia", "Spain", "Sweden", "Switzerland", "United Kingdom"
      ) ~ "Europe",
      Country %in% c("India", "China", "Japan", "Israel", "Singapore", "Thailand", "Philippines") ~ "Asia",
      Country %in% c("Nigeria", "South Africa", "Zimbabwe") ~ "Africa",
      Country %in% c("Australia", "New Zealand") ~ "Oceania",
      TRUE ~ "Other / Unknown"
    )
  )
desc <- compareGroups(~ ., data = data, method = 4, max.ylev = 12, max.xlev = 20, chisq.test.perm = T, byrow = F)
## Warning in compareGroups.fit(X = X, y = y, include.label = include.label, :
## Variables 'Country', 'state', 'comments' have been removed since some errors
## occurred
desctab <- createTable(desc, type = 2, show.n = F,show.p.mul = F, show.all = T, show.p.overall = T)
desctab
##
## --------Summary descriptives table ---------
##
## __________________________________________________________________
## [ALL]
## N=1259
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## Timestamp 1409193000 [1409149616;1409269008]
## Age 31.0 [27.0;36.0]
## Gender:
## Female 250 (19.9%)
## Male 991 (78.7%)
## Non-binary / Gender diverse 10 (0.79%)
## Other / Ambiguous 8 (0.64%)
## self_employed:
## No 1095 (88.2%)
## Yes 146 (11.8%)
## family_history:
## No 767 (60.9%)
## Yes 492 (39.1%)
## treatment:
## No 622 (49.4%)
## Yes 637 (50.6%)
## work_interfere:
## Never 213 (21.4%)
## Often 144 (14.5%)
## Rarely 173 (17.4%)
## Sometimes 465 (46.7%)
## no_employees:
## 1-5 162 (12.9%)
## 100-500 176 (14.0%)
## 26-100 289 (23.0%)
## 500-1000 60 (4.77%)
## 6-25 290 (23.0%)
## More than 1000 282 (22.4%)
## remote_work:
## No 883 (70.1%)
## Yes 376 (29.9%)
## tech_company:
## No 228 (18.1%)
## Yes 1031 (81.9%)
## benefits:
## Don't know 408 (32.4%)
## No 374 (29.7%)
## Yes 477 (37.9%)
## care_options:
## No 501 (39.8%)
## Not sure 314 (24.9%)
## Yes 444 (35.3%)
## wellness_program:
## Don't know 188 (14.9%)
## No 842 (66.9%)
## Yes 229 (18.2%)
## seek_help:
## Don't know 363 (28.8%)
## No 646 (51.3%)
## Yes 250 (19.9%)
## anonymity:
## Don't know 819 (65.1%)
## No 65 (5.16%)
## Yes 375 (29.8%)
## leave:
## Don't know 563 (44.7%)
## Somewhat difficult 126 (10.0%)
## Somewhat easy 266 (21.1%)
## Very difficult 98 (7.78%)
## Very easy 206 (16.4%)
## mental_health_consequence:
## Maybe 477 (37.9%)
## No 490 (38.9%)
## Yes 292 (23.2%)
## phys_health_consequence:
## Maybe 273 (21.7%)
## No 925 (73.5%)
## Yes 61 (4.85%)
## coworkers:
## No 260 (20.7%)
## Some of them 774 (61.5%)
## Yes 225 (17.9%)
## supervisor:
## No 393 (31.2%)
## Some of them 350 (27.8%)
## Yes 516 (41.0%)
## mental_health_interview:
## Maybe 207 (16.4%)
## No 1008 (80.1%)
## Yes 44 (3.49%)
## phys_health_interview:
## Maybe 557 (44.2%)
## No 500 (39.7%)
## Yes 202 (16.0%)
## mental_vs_physical:
## Don't know 576 (45.8%)
## No 340 (27.0%)
## Yes 343 (27.2%)
## obs_consequence:
## No 1075 (85.4%)
## Yes 184 (14.6%)
## Region:
## Africa 8 (0.64%)
## Asia 23 (1.83%)
## Central/South America 11 (0.87%)
## Europe 362 (28.8%)
## North America 826 (65.6%)
## Oceania 29 (2.30%)
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
desc <- compareGroups(treatment ~ ., data = data[-1], method = 4, max.ylev = 12, max.xlev = 20, chisq.test.perm = T, byrow = T)
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in chisq.test(xx, correct = FALSE): Chi-squared approximation may be
## incorrect
## Warning in compareGroups.fit(X = X, y = y, include.label = include.label, :
## Variables 'Country', 'state', 'comments' have been removed since some errors
## occurred
desctab <- createTable(desc, type = 2, show.n = F,show.p.mul = F, show.all = T, show.p.overall = T)
desctab
##
## --------Summary descriptives table by 'treatment'---------
##
## ____________________________________________________________________________________________
## [ALL] No Yes p.overall
## N=1259 N=622 N=637
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## Age 31.0 [27.0;36.0] 31.0 [27.0;35.0] 32.0 [27.0;37.0] 0.011
## Gender: <0.001
## Female 250 (19.9%) 77 (30.8%) 173 (69.2%)
## Male 991 (78.7%) 542 (54.7%) 449 (45.3%)
## Non-binary / Gender diverse 10 (0.79%) 2 (20.0%) 8 (80.0%)
## Other / Ambiguous 8 (0.64%) 1 (12.5%) 7 (87.5%)
## self_employed: 0.524
## No 1095 (88.2%) 545 (49.8%) 550 (50.2%)
## Yes 146 (11.8%) 68 (46.6%) 78 (53.4%)
## family_history: <0.001
## No 767 (60.9%) 495 (64.5%) 272 (35.5%)
## Yes 492 (39.1%) 127 (25.8%) 365 (74.2%)
## work_interfere: <0.001
## Never 213 (21.4%) 183 (85.9%) 30 (14.1%)
## Often 144 (14.5%) 21 (14.6%) 123 (85.4%)
## Rarely 173 (17.4%) 51 (29.5%) 122 (70.5%)
## Sometimes 465 (46.7%) 107 (23.0%) 358 (77.0%)
## no_employees: 0.119
## 1-5 162 (12.9%) 71 (43.8%) 91 (56.2%)
## 100-500 176 (14.0%) 81 (46.0%) 95 (54.0%)
## 26-100 289 (23.0%) 139 (48.1%) 150 (51.9%)
## 500-1000 60 (4.77%) 33 (55.0%) 27 (45.0%)
## 6-25 290 (23.0%) 162 (55.9%) 128 (44.1%)
## More than 1000 282 (22.4%) 136 (48.2%) 146 (51.8%)
## remote_work: 0.371
## No 883 (70.1%) 444 (50.3%) 439 (49.7%)
## Yes 376 (29.9%) 178 (47.3%) 198 (52.7%)
## tech_company: 0.296
## No 228 (18.1%) 105 (46.1%) 123 (53.9%)
## Yes 1031 (81.9%) 517 (50.1%) 514 (49.9%)
## benefits: <0.001
## Don't know 408 (32.4%) 257 (63.0%) 151 (37.0%)
## No 374 (29.7%) 193 (51.6%) 181 (48.4%)
## Yes 477 (37.9%) 172 (36.1%) 305 (63.9%)
## care_options: <0.001
## No 501 (39.8%) 294 (58.7%) 207 (41.3%)
## Not sure 314 (24.9%) 191 (60.8%) 123 (39.2%)
## Yes 444 (35.3%) 137 (30.9%) 307 (69.1%)
## wellness_program: 0.003
## Don't know 188 (14.9%) 107 (56.9%) 81 (43.1%)
## No 842 (66.9%) 422 (50.1%) 420 (49.9%)
## Yes 229 (18.2%) 93 (40.6%) 136 (59.4%)
## seek_help: 0.004
## Don't know 363 (28.8%) 197 (54.3%) 166 (45.7%)
## No 646 (51.3%) 323 (50.0%) 323 (50.0%)
## Yes 250 (19.9%) 102 (40.8%) 148 (59.2%)
## anonymity: <0.001
## Don't know 819 (65.1%) 448 (54.7%) 371 (45.3%)
## No 65 (5.16%) 27 (41.5%) 38 (58.5%)
## Yes 375 (29.8%) 147 (39.2%) 228 (60.8%)
## leave: <0.001
## Don't know 563 (44.7%) 309 (54.9%) 254 (45.1%)
## Somewhat difficult 126 (10.0%) 44 (34.9%) 82 (65.1%)
## Somewhat easy 266 (21.1%) 135 (50.8%) 131 (49.2%)
## Very difficult 98 (7.78%) 31 (31.6%) 67 (68.4%)
## Very easy 206 (16.4%) 103 (50.0%) 103 (50.0%)
## mental_health_consequence: <0.001
## Maybe 477 (37.9%) 224 (47.0%) 253 (53.0%)
## No 490 (38.9%) 280 (57.1%) 210 (42.9%)
## Yes 292 (23.2%) 118 (40.4%) 174 (59.6%)
## phys_health_consequence: 0.185
## Maybe 273 (21.7%) 127 (46.5%) 146 (53.5%)
## No 925 (73.5%) 470 (50.8%) 455 (49.2%)
## Yes 61 (4.85%) 25 (41.0%) 36 (59.0%)
## coworkers: 0.050
## No 260 (20.7%) 141 (54.2%) 119 (45.8%)
## Some of them 774 (61.5%) 384 (49.6%) 390 (50.4%)
## Yes 225 (17.9%) 97 (43.1%) 128 (56.9%)
## supervisor: 0.422
## No 393 (31.2%) 186 (47.3%) 207 (52.7%)
## Some of them 350 (27.8%) 170 (48.6%) 180 (51.4%)
## Yes 516 (41.0%) 266 (51.6%) 250 (48.4%)
## mental_health_interview: 0.002
## Maybe 207 (16.4%) 125 (60.4%) 82 (39.6%)
## No 1008 (80.1%) 479 (47.5%) 529 (52.5%)
## Yes 44 (3.49%) 18 (40.9%) 26 (59.1%)
## phys_health_interview: 0.183
## Maybe 557 (44.2%) 290 (52.1%) 267 (47.9%)
## No 500 (39.7%) 241 (48.2%) 259 (51.8%)
## Yes 202 (16.0%) 91 (45.0%) 111 (55.0%)
## mental_vs_physical: <0.001
## Don't know 576 (45.8%) 316 (54.9%) 260 (45.1%)
## No 340 (27.0%) 138 (40.6%) 202 (59.4%)
## Yes 343 (27.2%) 168 (49.0%) 175 (51.0%)
## obs_consequence: <0.001
## No 1075 (85.4%) 566 (52.7%) 509 (47.3%)
## Yes 184 (14.6%) 56 (30.4%) 128 (69.6%)
## Region: 0.001
## Africa 8 (0.64%) 3 (37.5%) 5 (62.5%)
## Asia 23 (1.83%) 18 (78.3%) 5 (21.7%)
## Central/South America 11 (0.87%) 8 (72.7%) 3 (27.3%)
## Europe 362 (28.8%) 204 (56.4%) 158 (43.6%)
## North America 826 (65.6%) 378 (45.8%) 448 (54.2%)
## Oceania 29 (2.30%) 11 (37.9%) 18 (62.1%)
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
Several factors differed significantly between respondents who had received treatment and those who had not. Median age was slightly higher in the treatment group (32 vs. 31 years, p = 0.011). Gender was strongly associated with treatment (p < 0.001): female, non-binary / gender-diverse, and other respondents were more likely to report treatment than male respondents, although the latter two groups are very small (n = 10 and n = 8). Those with a family history of mental illness were significantly more likely to have received treatment (74.2% vs. 35.5%, p < 0.001). The frequency of work interference due to mental health was also strongly associated with treatment (p < 0.001): 14.1% of those reporting no interference had received treatment, compared with 70.5% to 85.4% of those reporting any.
Other significant variables included anticipated mental health consequences, benefits availability, knowledge of care options, wellness programs, employer-provided help-seeking resources, anonymity, ease of taking leave, whether the employer treats mental health as seriously as physical health, willingness to raise mental health in an interview, observed workplace consequences, and region (all p < 0.05). The region result should be read with caution, since several regions have very few respondents and the chi-squared approximation warnings above point to small expected counts in some tables. These findings suggest that both individual and workplace-related factors are associated with receiving mental health treatment.
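The family-history comparison can be reproduced from the counts in the table as a quick sanity check. This is a minimal sketch using base R `chisq.test()` without continuity correction (the warnings above show `chisq.test(xx, correct = FALSE)` being called). The matrix is typed in by hand from the table rather than computed from the data, and whether compareGroups reports exactly this test for every variable is an assumption.

```r
# Counts copied from the 'family_history' rows of the table above
fam <- matrix(
  c(495, 272,   # family_history = No:  treatment No, treatment Yes
    127, 365),  # family_history = Yes: treatment No, treatment Yes
  nrow = 2, byrow = TRUE,
  dimnames = list(family_history = c("No", "Yes"),
                  treatment = c("No", "Yes"))
)

# Row percentages, matching byrow = T in compareGroups()
row_pct <- round(100 * prop.table(fam, margin = 1), 1)
row_pct

# Pearson chi-square test without continuity correction
res <- chisq.test(fam, correct = FALSE)
res$p.value < 0.001
```

The row percentages match the 64.5% / 35.5% and 25.8% / 74.2% shown in the table, and the p-value falls well below 0.001.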
#create text corpus
corpus <- VCorpus(VectorSource(data$comments))
corpus_clean <- tm_map(corpus,content_transformer(tolower)) #converting to lower case letters
corpus_clean <- tm_map(corpus_clean,removeNumbers) #removing numbers
corpus_clean <- tm_map(corpus_clean,removeWords,stopwords()) #removing stop words
corpus_clean <- tm_map(corpus_clean,removePunctuation) #removing punctuation
corpus_clean <- tm_map(corpus_clean,stemDocument) #stemming the document
corpus_clean <- tm_map(corpus_clean,stripWhitespace)#removing spaces after doing above process
#visualize the most frequent words
wordcloud(corpus_clean, scale = c(4,.1),max.words = 100,random.order = FALSE, random.color = FALSE, colors = brewer.pal(6, 'Dark2'))
Mental health and work-related terms are the most frequent terms in the comments. Depression is the most frequent mental condition term.
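The ranking suggested by the word cloud can be cross-checked with a plain frequency table. The sketch below uses base R only and skips stemming, so its counts would not match the tm pipeline exactly. The `comments` vector and the three-word `stop_words` list are hypothetical.

```r
# Hypothetical comments standing in for data$comments
comments <- c("Depression affected my work.", NA,
              "My employer offers no mental health benefits.",
              "Work stress and depression.")

# Drop missing comments, lowercase, keep only letters and spaces, split into words
clean <- gsub("[^a-z ]", "", tolower(comments[!is.na(comments)]))
words <- unlist(strsplit(clean, " +"))

# Remove a tiny illustrative stop-word list, then rank by frequency
stop_words <- c("my", "and", "no")
freq <- sort(table(words[!words %in% stop_words]), decreasing = TRUE)
head(freq, 3)
```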
items <- data[ , -which(names(data) %in% c("Timestamp","Country", "state", "Age", "Region", "Gender", "comments"))]
#functions to recode responses
#1- Yes/No/Maybe/Some of them/Don't know/Not sure
# Note: every intermediate or uncertain answer shares the middle score (2)
recode_yes <- function(x) {
  ifelse(x == "Yes", 3, ifelse(x == "No", 1, ifelse(x == "Maybe", 2, ifelse(x == "Some of them", 2, ifelse(x == "Don't know", 2, ifelse(x == "Not sure", 2, NA))))))
}
#2- Don't know/Very difficult/Somewhat difficult/Somewhat easy/Very easy
# Note: "Don't know" is scored 1, below "Very difficult", so the scale is not strictly ordered by ease of leave
recode_difficult <- function(x) {
  ifelse(x == "Don't know", 1, ifelse(x == "Very difficult", 2, ifelse(x == "Somewhat difficult", 3, ifelse(x == "Somewhat easy", 4, ifelse(x == "Very easy", 5, NA)))))
}
#3- Never/Often/Rarely/Sometimes
recode_freq <- function(x){
  ifelse(x == "Never", 1, ifelse(x == "Rarely", 2, ifelse(x == "Sometimes", 3, ifelse(x == "Often", 4, NA))))
}
#4- Number of employees
recode_employ <- function(x) {
  ifelse(x == "1-5", 1, ifelse(x == "6-25", 2, ifelse(x == "26-100", 3, ifelse(x == "100-500", 4, ifelse(x == "500-1000", 5, ifelse(x == "More than 1000", 6, NA))))))
}
#Apply functions
items[,c(1:3, 6:12, 14:21)] <- lapply(items[,c(1:3, 6:12, 14:21)], recode_yes)
items[,13] <- lapply(items[,13], recode_difficult)
items[,4] <- lapply(items[,4], recode_freq)
items[,5] <- lapply(items[,5], recode_employ)
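The nested `ifelse()` calls can also be written as a named lookup vector, which keeps each label-to-score mapping in one place. This is only a sketch of an alternative: `freq_levels` reproduces the mapping of `recode_freq`, and `recode_lookup` is a hypothetical helper that is not part of the analysis.

```r
# Named lookup: response label -> ordinal score (same mapping as recode_freq)
freq_levels <- c("Never" = 1, "Rarely" = 2, "Sometimes" = 3, "Often" = 4)

# Hypothetical helper: index the lookup by label; unknown labels and NA become NA
recode_lookup <- function(x, lookup) unname(lookup[as.character(x)])

recoded <- recode_lookup(c("Never", "Often", "Rarely", NA, "Sometimes"), freq_levels)
recoded
```

The same helper would work for the other scales by passing a different lookup vector.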
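# No seed is set here, so the imputed values (and downstream estimates) can differ slightly between runs
imputed <- mice(items, m = 1, method = "pmm")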
##
## iter imp variable
## 1 1 self_employed work_interfere
## 2 1 self_employed work_interfere
## 3 1 self_employed work_interfere
## 4 1 self_employed work_interfere
## 5 1 self_employed work_interfere
items <- complete(imputed, 1)
Let’s explore the highly correlated items.
cor_items <- cor(items, use = "complete.obs")
cor_items[lower.tri(cor_items, diag = TRUE)] <- NA # remove lower triangle and diagonal
# Convert to long/tidy format
cor_table <- as.data.frame(as.table(cor_items)) %>%
filter(!is.na(Freq), Freq > 0.35) %>% # keep positive correlations above 0.35
arrange(desc(Freq)) %>%
rename(Variable1 = Var1, Variable2 = Var2, Correlation = Freq)
# View table
print(cor_table)
## Variable1 Variable2 Correlation
## 1 wellness_program seek_help 0.6181306
## 2 coworkers supervisor 0.5743100
## 3 mental_health_consequence phys_health_consequence 0.5156194
## 4 treatment work_interfere 0.5038236
## 5 benefits seek_help 0.4904305
## 6 no_employees benefits 0.4593412
## 7 mental_health_interview phys_health_interview 0.4488442
## 8 benefits wellness_program 0.4092124
## 9 no_employees seek_help 0.4059284
## 10 family_history treatment 0.3779177
The highest inter-item correlation (0.62) is between the employer discussing mental health as part of a wellness program and the employer providing resources on mental health and how to seek help. Company size also correlates positively with both mental health benefits (0.46) and help-seeking resources (0.41).
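The filter `Freq > 0.35` keeps positive correlations only, so any strong negative pair would be dropped from the table. A minimal sketch of the same reshaping with an absolute-value threshold, on a hypothetical 3 x 3 correlation matrix `r`:

```r
# Hypothetical symmetric correlation matrix
r <- matrix(c( 1.0,  0.6, -0.5,
               0.6,  1.0,  0.1,
              -0.5,  0.1,  1.0),
            nrow = 3, dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

r[lower.tri(r, diag = TRUE)] <- NA        # keep each pair once
pairs <- as.data.frame(as.table(r))       # long format: Var1, Var2, Freq
pairs <- pairs[!is.na(pairs$Freq) & abs(pairs$Freq) > 0.35, ]
pairs[order(-abs(pairs$Freq)), ]
```

Here the pair with a correlation of -0.5 is retained alongside the pair at 0.6, whereas a one-sided filter would have kept only the latter.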
# load library ltm
library(ltm)
# calculate cronbach's alpha
cronbach.alpha(items, CI=TRUE, standardized=TRUE)
##
## Standardized Cronbach's alpha for the 'items' data-set
##
## Items: 21
## Sample units: 1259
## alpha: 0.49
##
## Bootstrap 95% CI based on 1000 samples
## 2.5% 97.5%
## 0.448 0.537
Cronbach’s alpha is unacceptably low (0.49, bootstrap 95% CI 0.448 to 0.537), which suggests the 21 items do not measure a single construct. We will continue with exploratory factor analysis to uncover underlying dimensions.
We start with a 1-factor solution.
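To see what an alpha of this size implies, standardized alpha can be written in terms of the number of items k and the mean inter-item correlation r̄ as α = k·r̄ / (1 + (k − 1)·r̄). The sketch below implements this textbook formula in base R; `std_alpha` is a hypothetical helper, not the ltm implementation, and the 3 x 3 matrix is made up.

```r
# Standardized alpha from a correlation matrix R with k items
std_alpha <- function(R) {
  k <- ncol(R)
  r_bar <- mean(R[upper.tri(R)])   # mean inter-item correlation
  k * r_bar / (1 + (k - 1) * r_bar)
}

# Hypothetical check: three items that all correlate 0.5
R <- matrix(0.5, 3, 3)
diag(R) <- 1
std_alpha(R)

# Mean inter-item correlation implied by alpha = 0.49 with k = 21 items
k <- 21
alpha <- 0.49
r_bar_implied <- alpha / (k - alpha * (k - 1))
round(r_bar_implied, 3)
```

Under this formula, an alpha of 0.49 across 21 items corresponds to a mean inter-item correlation of roughly 0.04, which is what one would expect when many items are nearly unrelated or correlate in opposite directions.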
efa_1 <- factanal(x = items, factors = 1)
efa_1
##
## Call:
## factanal(x = items, factors = 1)
##
## Uniquenesses:
## self_employed family_history treatment
## 0.993 0.993 0.991
## work_interfere no_employees remote_work
## 0.968 0.987 0.996
## tech_company benefits care_options
## 0.986 0.993 0.991
## wellness_program seek_help anonymity
## 0.961 0.966 0.898
## leave mental_health_consequence phys_health_consequence
## 0.886 0.344 0.743
## coworkers supervisor mental_health_interview
## 0.653 0.474 0.834
## phys_health_interview mental_vs_physical obs_consequence
## 0.967 0.734 0.949
##
## Loadings:
## Factor1
## self_employed
## family_history
## treatment
## work_interfere -0.180
## no_employees -0.113
## remote_work
## tech_company 0.119
## benefits
## care_options
## wellness_program 0.198
## seek_help 0.184
## anonymity 0.320
## leave 0.338
## mental_health_consequence -0.810
## phys_health_consequence -0.507
## coworkers 0.589
## supervisor 0.725
## mental_health_interview 0.407
## phys_health_interview 0.182
## mental_vs_physical 0.516
## obs_consequence -0.226
##
## Factor1
## SS loadings 2.694
## Proportion Var 0.128
##
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 4028.37 on 189 degrees of freedom.
## The p-value is 0
The 1-factor solution explains only 12.8% of the total variance, which is very low.
The chi-square test of the hypothesis that 1 factor is sufficient gives a p-value reported as 0 (χ² = 4028.37 on 189 degrees of freedom), so we reject that hypothesis.
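Both summary numbers in this output can be checked by hand. The degrees of freedom of a maximum-likelihood factor model with p variables and k factors are ((p − k)² − (p + k)) / 2, and the proportion of variance is the sum of squared loadings divided by the number of items. A minimal sketch, where `efa_df` is a hypothetical helper name:

```r
# Degrees of freedom of the ML factor model: p observed variables, k factors
efa_df <- function(p, k) ((p - k)^2 - (p + k)) / 2

df_1 <- efa_df(p = 21, k = 1)
df_1                      # 189, as printed in the output above

# Proportion of variance = SS loadings / number of items
prop_var_1 <- round(2.694 / 21, 3)
prop_var_1                # 0.128
```

The same function gives 169 and 150 for two and three factors on 21 items.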
Now, we will try the 2-factor solution.
efa_2 <- factanal(x = items, factors = 2)
efa_2
##
## Call:
## factanal(x = items, factors = 2)
##
## Uniquenesses:
## self_employed family_history treatment
## 0.928 0.979 0.973
## work_interfere no_employees remote_work
## 0.969 0.614 0.970
## tech_company benefits care_options
## 0.936 0.550 0.857
## wellness_program seek_help anonymity
## 0.499 0.400 0.817
## leave mental_health_consequence phys_health_consequence
## 0.879 0.357 0.754
## coworkers supervisor mental_health_interview
## 0.641 0.475 0.813
## phys_health_interview mental_vs_physical obs_consequence
## 0.946 0.710 0.948
##
## Loadings:
## Factor1 Factor2
## self_employed 0.108 -0.247
## family_history 0.116
## treatment 0.135
## work_interfere -0.175
## no_employees -0.133 0.607
## remote_work -0.156
## tech_company 0.126 -0.218
## benefits 0.666
## care_options 0.100 0.365
## wellness_program 0.220 0.673
## seek_help 0.205 0.747
## anonymity 0.325 0.278
## leave 0.345
## mental_health_consequence -0.800
## phys_health_consequence -0.496
## coworkers 0.592
## supervisor 0.724
## mental_health_interview 0.414 -0.124
## phys_health_interview 0.186 -0.138
## mental_vs_physical 0.526 0.114
## obs_consequence -0.221
##
## Factor1 Factor2
## SS loadings 2.721 2.263
## Proportion Var 0.130 0.108
## Cumulative Var 0.130 0.237
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 2265.94 on 169 degrees of freedom.
## The p-value is 0
The 2-factor model explains 23.7% of the variance. This is better than the 1-factor model but still low.
Now, we will try the 3-factor solution.
efa_3 <- factanal(x = items, factors = 3)
efa_3
##
## Call:
## factanal(x = items, factors = 3)
##
## Uniquenesses:
## self_employed family_history treatment
## 0.888 0.778 0.449
## work_interfere no_employees remote_work
## 0.551 0.562 0.949
## tech_company benefits care_options
## 0.935 0.536 0.784
## wellness_program seek_help anonymity
## 0.515 0.415 0.798
## leave mental_health_consequence phys_health_consequence
## 0.849 0.349 0.757
## coworkers supervisor mental_health_interview
## 0.602 0.462 0.810
## phys_health_interview mental_vs_physical obs_consequence
## 0.940 0.712 0.899
##
## Loadings:
## Factor1 Factor2 Factor3
## self_employed 0.151 -0.276 0.115
## family_history 0.467
## treatment 0.740
## work_interfere -0.116 0.656
## no_employees -0.191 0.631
## remote_work 0.105 -0.176
## tech_company 0.138 -0.213
## benefits 0.668 0.122
## care_options 0.131 0.337 0.291
## wellness_program 0.190 0.667
## seek_help 0.163 0.746
## anonymity 0.332 0.282 0.113
## leave 0.373
## mental_health_consequence -0.772 0.231
## phys_health_consequence -0.474 0.132
## coworkers 0.623
## supervisor 0.732
## mental_health_interview 0.421 -0.101
## phys_health_interview 0.202 -0.135
## mental_vs_physical 0.505 0.148 -0.108
## obs_consequence -0.180 0.261
##
## Factor1 Factor2 Factor3
## SS loadings 2.680 2.271 1.508
## Proportion Var 0.128 0.108 0.072
## Cumulative Var 0.128 0.236 0.308
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 1414.44 on 150 degrees of freedom.
## The p-value is 1.8e-204
The cumulative variance explained by the model is 30.8%, which is decent for social science/psychometric data.
Some variables are not well accounted for by any of the 3 factors, as shown by their high uniquenesses: remote_work (0.949), phys_health_interview (0.940), obs_consequence (0.899), self_employed (0.888), mental_health_interview (0.810), and anonymity (0.798).
We will remove these variables from the item set:
items <- items[ , -which(names(items) %in% c("remote_work","phys_health_interview", "obs_consequence", "self_employed", "mental_health_interview", "anonymity"))]
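The choice of items to drop can be made explicit by flagging uniquenesses above a cutoff. In the sketch below the uniquenesses are copied by hand from the 21-item, 3-factor output above, and the cutoff of 0.79 is an illustrative assumption rather than a rule stated in the analysis.

```r
# Uniquenesses copied from the 21-item, 3-factor output above
u <- c(self_employed = 0.888, family_history = 0.778, treatment = 0.449,
       work_interfere = 0.551, no_employees = 0.562, remote_work = 0.949,
       tech_company = 0.935, benefits = 0.536, care_options = 0.784,
       wellness_program = 0.515, seek_help = 0.415, anonymity = 0.798,
       leave = 0.849, mental_health_consequence = 0.349,
       phys_health_consequence = 0.757, coworkers = 0.602, supervisor = 0.462,
       mental_health_interview = 0.810, phys_health_interview = 0.940,
       mental_vs_physical = 0.712, obs_consequence = 0.899)

cutoff <- 0.79                      # illustrative threshold (assumption)
flagged <- names(u)[u > cutoff]
flagged

# Items above the cutoff that the analysis nevertheless retains
removed <- c("remote_work", "phys_health_interview", "obs_consequence",
             "self_employed", "mental_health_interview", "anonymity")
kept_anyway <- setdiff(flagged, removed)
kept_anyway
```

With this cutoff, tech_company and leave are also flagged, so the removal above reflects a judgment call rather than a purely mechanical threshold.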
Let’s repeat the 3-factor analysis on the remaining 15 items.
efa_3 <- factanal(x = items, factors = 3)
efa_3
##
## Call:
## factanal(x = items, factors = 3)
##
## Uniquenesses:
## family_history treatment work_interfere
## 0.775 0.381 0.570
## no_employees tech_company benefits
## 0.644 0.942 0.586
## care_options wellness_program seek_help
## 0.805 0.474 0.342
## leave mental_health_consequence phys_health_consequence
## 0.872 0.324 0.742
## coworkers supervisor mental_vs_physical
## 0.601 0.452 0.723
##
## Loadings:
## Factor1 Factor2 Factor3
## family_history 0.467
## treatment 0.782
## work_interfere 0.645
## no_employees -0.173 0.567
## tech_company 0.136 -0.199
## benefits 0.634 0.104
## care_options 0.352 0.250
## wellness_program 0.158 0.708
## seek_help 0.138 0.799
## leave 0.347
## mental_health_consequence -0.800 0.188
## phys_health_consequence -0.499
## coworkers 0.622
## supervisor 0.740
## mental_vs_physical 0.487 0.170 -0.103
##
## Factor1 Factor2 Factor3
## SS loadings 2.294 2.079 1.393
## Proportion Var 0.153 0.139 0.093
## Cumulative Var 0.153 0.292 0.384
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 488.99 on 63 degrees of freedom.
## The p-value is 3.54e-67
The cumulative variance explained by the model increased to 38.4%.
# Run EFA with rotation
efa_result <- fa(items, nfactors = 3, rotate = "varimax")
# Path diagram
# Call fa.diagram() to draw within current plotting area
fa.diagram(efa_result)
# Convert loadings to data frame
load_df <- as.data.frame(efa_result$loadings[1:ncol(items), ])
load_df$Variable <- rownames(load_df)
# Convert to long format
load_long <- pivot_longer(load_df, cols = starts_with("MR"), names_to = "Factor", values_to = "Loading")
# Plot
ggplot(load_long, aes(x = Factor, y = Loading, fill = abs(Loading))) +
  geom_col(position = "dodge") + # geom_col() already uses stat = "identity"
  facet_wrap(~ Variable, scales = "free_y") +
  coord_flip() +
  scale_fill_gradient2(low = "blue", mid = "gray90", high = "red", midpoint = 0.4) +
  theme_minimal() +
  labs(title = "Factor Loadings by Variable", x = "Factor", y = "Loading")
The factor analysis revealed three underlying dimensions related to workplace mental health attitudes and experiences.
The first factor, which we can describe as Openness and Willingness, reflects how comfortable and supported individuals feel when it comes to discussing mental health in the workplace. Items that loaded highly on this factor include willingness to talk to a supervisor (loading = 0.7), comfort discussing mental health with coworkers (0.6), and the perception that the employer takes mental health as seriously as physical health (0.5). Ease of taking medical leave for a mental health condition contributed moderately (0.4). Notably, there were strong negative loadings for beliefs that disclosing mental or physical health issues would have negative consequences (−0.8 and −0.5, respectively), suggesting that higher scorers on this factor are less likely to anticipate stigma or adverse outcomes.
The second factor reflects Organizational Mental Health Resources, emphasizing the structural and informational support provided by employers. Items with high loadings include employer-provided resources on mental health and how to seek help (0.8), the availability of mental health benefits (0.7), and the inclusion of mental health in wellness programs (0.7). Company size also loaded moderately (0.6), which may reflect the tendency of larger organizations to have more comprehensive mental health resources. Awareness of available care options contributed weakly to this factor (0.3).
The third factor can be described as Personal Experience, focusing on individual encounters with mental health challenges. This includes having received treatment for mental health concerns (loading = 0.8), experiencing interference with work due to mental health (0.6), and having a family history of mental illness (0.5). These items collectively represent the personal side of mental health.
model <- 'Factor1 =~ mental_health_consequence + phys_health_consequence + supervisor + coworkers + leave +mental_vs_physical
Factor2 =~ seek_help + benefits + wellness_program + care_options + no_employees
Factor3 =~ treatment +family_history + work_interfere'
fit <- cfa(model, data = items)
summary(fit, fit.measures=TRUE)
## lavaan 0.6-19 ended normally after 37 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 31
##
## Number of observations 1259
##
## Model Test User Model:
##
## Test statistic 781.144
## Degrees of freedom 74
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 4490.244
## Degrees of freedom 91
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.839
## Tucker-Lewis Index (TLI) 0.802
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -21108.013
## Loglikelihood unrestricted model (H1) -20717.441
##
## Akaike (AIC) 42278.025
## Bayesian (BIC) 42437.305
## Sample-size adjusted Bayesian (SABIC) 42338.835
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.087
## 90 Percent confidence interval - lower 0.082
## 90 Percent confidence interval - upper 0.093
## P-value H_0: RMSEA <= 0.050 0.000
## P-value H_0: RMSEA >= 0.080 0.984
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.071
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## Factor1 =~
## mntl_hlth_cnsq 1.000
## phys_hlth_cnsq 0.460 0.027 17.338 0.000
## supervisor -0.957 0.041 -23.423 0.000
## coworkers -0.568 0.029 -19.303 0.000
## leave -0.793 0.077 -10.349 0.000
## mntl_vs_physcl -0.562 0.035 -16.034 0.000
## Factor2 =~
## seek_help 1.000
## benefits 0.806 0.040 19.953 0.000
## wellness_prgrm 0.896 0.040 22.293 0.000
## care_options 0.505 0.042 11.931 0.000
## no_employees 1.370 0.084 16.390 0.000
## Factor3 =~
## treatment 1.000
## family_history 0.582 0.051 11.464 0.000
## work_interfere 0.815 0.067 12.143 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## Factor1 ~~
## Factor2 -0.061 0.014 -4.253 0.000
## Factor3 0.079 0.019 4.183 0.000
## Factor2 ~~
## Factor3 0.041 0.019 2.200 0.028
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .mntl_hlth_cnsq 0.186 0.015 12.665 0.000
## .phys_hlth_cnsq 0.225 0.010 22.957 0.000
## .supervisor 0.337 0.018 18.386 0.000
## .coworkers 0.252 0.011 22.151 0.000
## .leave 2.286 0.093 24.481 0.000
## .mntl_vs_physcl 0.413 0.018 23.366 0.000
## .seek_help 0.207 0.016 12.794 0.000
## .benefits 0.405 0.019 21.034 0.000
## .wellness_prgrm 0.288 0.016 17.591 0.000
## .care_options 0.645 0.027 24.137 0.000
## .no_employees 2.153 0.094 22.951 0.000
## .treatment 0.368 0.050 7.308 0.000
## .family_history 0.738 0.034 21.784 0.000
## .work_interfere 0.631 0.041 15.463 0.000
## Factor1 0.410 0.026 15.841 0.000
## Factor2 0.406 0.027 15.121 0.000
## Factor3 0.632 0.061 10.413 0.000
The confirmatory factor analysis (CFA) suggests that the proposed three-factor model fits the data only moderately. The chi-square test was significant (χ²(74) = 781.14, p < .001), indicating that the model does not perfectly reproduce the observed covariances. This is common in large samples, so other fit indices offer more practical insight. The CFI (0.839) and TLI (0.802) fall below the conventional threshold of 0.90, suggesting room for improvement in model fit. Similarly, the RMSEA (0.087) is slightly above the commonly used 0.08 cutoff, and its 90% confidence interval (0.082–0.093) lies entirely above that cutoff. However, the SRMR (0.071) falls within the acceptable range (< 0.08), providing some support for the model.
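The incremental and absolute fit indices can be recomputed from the two chi-square statistics printed in the lavaan summary, which makes it easy to verify the quoted values. The sketch below uses the standard textbook formulas for CFI, TLI, and RMSEA. Using N rather than N − 1 in the RMSEA denominator is an assumption; both give the same value to three decimals here.

```r
# Values copied from the lavaan summary above
chisq_m <- 781.144;  df_m <- 74    # user model
chisq_b <- 4490.244; df_b <- 91    # baseline model
n <- 1259                          # number of observations

# Noncentrality estimates, floored at zero
nc_m <- max(chisq_m - df_m, 0)
nc_b <- max(chisq_b - df_b, nc_m)

cfi   <- 1 - nc_m / nc_b
tli   <- (chisq_b / df_b - chisq_m / df_m) / (chisq_b / df_b - 1)
rmsea <- sqrt(nc_m / (df_m * n))

fit_check <- round(c(CFI = cfi, TLI = tli, RMSEA = rmsea), 3)
fit_check
```

The results, 0.839, 0.802, and 0.087, agree with the values lavaan reports in its summary.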