Open Sourcing Mental Illness (OSMI) has run an ongoing survey since 2016, which “aims to measure attitudes towards mental health in the tech workplace, and examine the frequency of mental health disorders among tech workers.” The survey is conducted online at the OSMI website, and the OSMI team intends to use these data to help drive awareness and improve conditions for individuals with mental illness in the IT workplace.
It should be noted that the survey may be prone to certain biases. The sample of respondents (\(n = 1433\)) was not obtained through any random sampling approach. Furthermore, as the survey is conducted online, voluntary response bias should also be considered: self-selected respondents may hold particular opinions or have particular experiences that are overrepresented in the data.
Lastly, as this is an observational study with potential sampling biases, causality cannot be inferred. Because of the lack of random sampling, the results may not generalize to the entire population of tech/IT workers.
Bearing the above limitations in mind and being cautious with our interpretations, we can still use these data to glean some insight into the state of mental health in the tech workplace.
Please see the Final Remarks at the end of the document for more detail on these findings (as well as the actual exploratory analyses!).
GitHub repo of all relevant files.
Survey results were downloaded from data.world in .csv format. The file was downloaded on March 25, 2017.
# load packages
library(dplyr)
library(stringr)
library(ggplot2)
library(plotly)
library(wesanderson)
set.seed(1000)
# load dataset
mental.health <- read.csv("~/Dropbox/Data4Democracy/mental_health_in_tech/mental-heath-in-tech-2016_20161114.csv", header = TRUE, stringsAsFactors = TRUE)
# check structure
#str(mental.health)
The first major issue that needs to be fixed is the variable names. The variable names in the .csv file are the actual survey questions themselves, which are too long for easy handling. I will create new variable names based on my own abbreviations of the survey’s questions. In time, I may add a correspondence table to cross-reference the original variable names to my abbreviated ones.
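A correspondence table of the kind mentioned above could be built just before the columns are renamed. The following is only a minimal sketch on a toy data frame: `toy`, `short.names`, `name.map` and the two question-style headers are hypothetical and do not come from the survey file.

```r
# Toy data frame standing in for the survey; the two column names are
# made-up examples of question-style headers, not the real survey text.
toy <- data.frame(
  "Are you self-employed?" = c(0, 1),
  "What is your age?" = c(33, 41),
  check.names = FALSE
)

# Hypothetical short names, in the same order as the columns
short.names <- c("self.employed", "age")

# Build the lookup before overwriting the column names
name.map <- data.frame(
  original = colnames(toy),
  short = short.names,
  stringsAsFactors = FALSE
)

# Guard against a length mismatch, then rename
stopifnot(length(short.names) == ncol(toy))
colnames(toy) <- short.names

name.map
```

Keeping the lookup in its own data frame means it can be written out next to the cleaned data, so that any abbreviated name can be traced back to its survey question.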
# create new variable names
new.names <- c("self.employed", "num.employees", "tech.company", "tech.role", "mental.health.coverage", "mental.health.options", "mental.health.formally.discussed", "mental.health.resources", "anonymity.protected", "medical.leave", "mental.health.negative", "physical.health.negative", "mental.health.comfort.coworker", "mental.health.comfort.supervisor", "mental.health.taken.seriously", "coworker.negative.consequences", "private.med.coverage", "resources", "reveal.diagnosis.clients.or.business", "revealed.negative.consequences.CB", "reveal.diagnosis.coworkers", "revealed.negative.consequences.CW", "productivity.effected", "percentage", "previous.employer", "prevemp.mental.health.coverage", "prevemp.mental.health.options", "prevemp.mental.health.formally.discussed", "prevemp.mental.health.resources", "prevemp.anonymity.protected", "prevemp.mental.health.negative",
"prevemp.physical.health.negative", "prevemp.mental.health.coworker", "prevemp.mental.health.comfort.supervisor", "prevemp.mental.health.taken.seriously", "prevemp.coworker.negative.consequences", "mention.phsyical.issue.interview", "why.whynot.physical", "mention.mental.health.interview", "why.whynot.mental", "career.hurt", "viewed.negatively.by.coworkers", "share.with.family", "observed.poor.handling", "observations.lead.less.likely.to.reveal", "family.history", "ever.had.mental.disorder", "currently.have.mental.disorder", "if.yes.what", "if.maybe.what", "medical.prof.diagnosis", "what.conditions", "sought.prof.treatment", "treatment.affects.work", "no.treatment.affects.work", "age", "gender", "country.live", "US.state", "country.work", "state.work", "work.position", "remotely" )
# change names
colnames(mental.health) <- new.names
# check
#str(mental.health)
# what does gender variable look like?
head(table(mental.health$gender))
##
## Female AFAB Agender Androgynous Bigender
## 1 1 1 2 1 1
tail(table(mental.health$gender))
##
## Sex is male Transgender woman Transitioned, M2F Unicorn
## 1 1 1 1
## woman Woman
## 4 3
# ok, we have some issues with gender that need to be cleaned up - see next section below
# convert some factor variables to character
mental.health$why.whynot.physical <- as.character(mental.health$why.whynot.physical)
mental.health$why.whynot.mental <- as.character(mental.health$why.whynot.mental)
mental.health$gender <- as.character(mental.health$gender)
mental.health$work.position <- as.character(mental.health$work.position)
# convert boolean to factor
mental.health$self.employed <- as.factor(mental.health$self.employed)
levels(mental.health$self.employed) <- c("No", "Yes")
mental.health$tech.role <- as.factor(mental.health$tech.role)
levels(mental.health$tech.role) <- c("No", "Yes")
The open-ended, free-form text response to gender on the survey necessitates some data cleaning. There are several non-standard responses, e.g. “Unicorn” and “I’m a man why didn’t you make this a drop down question. You should of asked sex? And I would of answered yes please. Seriously how much text can this take?” Other responses included variants of being Gender Queer, non-binary, etc.
If someone responded as transgender, I have coded their gender as TG. If someone responded with a variation of being Gender Queer, e.g. “gender fluid”, “human”, “androgynous”, I have coded their gender as GQ. Otherwise, gender has been encoded as either M or F, unless the field was left blank (NA) or the survey participant refused to answer (Refused).
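Much of the one-line-per-variant recoding can be shortened by normalizing case and whitespace first and then matching against a lookup vector. The sketch below illustrates the idea on a handful of hypothetical responses; `raw`, `clean`, `lookup` and `coded` are illustrative names, and the lookup is deliberately incomplete.

```r
# Hypothetical raw responses, mimicking the kinds of variants seen in the survey
raw <- c("Male", "male ", "MALE", " Female", "woman", "Cis male", "Genderqueer", "")

# Normalize case and surrounding whitespace before matching
clean <- tolower(trimws(raw))

# Illustrative lookup table; the real data would need many more entries
lookup <- c(
  "male" = "M", "man" = "M", "m" = "M", "cis male" = "M",
  "female" = "F", "woman" = "F", "f" = "F", "cis female" = "F",
  "genderqueer" = "GQ", "non-binary" = "GQ", "nonbinary" = "GQ"
)

# Unmatched responses (including blanks) come back as NA and can be reviewed by hand
coded <- unname(lookup[clean])
coded
```

Responses that stay `NA` after the lookup would still need the kind of manual review applied to the leftover values in this analysis.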
# let's try to standardize responses
mental.health[mental.health$gender == "Male", "gender"] <- "M"
mental.health[mental.health$gender == "male", "gender"] <- "M"
mental.health[mental.health$gender == "MALE", "gender"] <- "M"
mental.health[mental.health$gender == "Man", "gender"] <- "M"
mental.health[mental.health$gender == "man", "gender"] <- "M"
mental.health[mental.health$gender == "m", "gender"] <- "M"
mental.health[mental.health$gender == "man ", "gender"] <- "M"
mental.health[mental.health$gender == "Dude", "gender"] <- "M"
mental.health[mental.health$gender == "mail", "gender"] <- "M"
mental.health[mental.health$gender == "M|", "gender"] <- "M"
mental.health[mental.health$gender == "Cis male", "gender"] <- "M"
mental.health[mental.health$gender == "Male (cis)", "gender"] <- "M"
mental.health[mental.health$gender == "Cis Male", "gender"] <- "M"
mental.health[mental.health$gender == "cis male", "gender"] <- "M"
mental.health[mental.health$gender == "cisdude", "gender"] <- "M"
mental.health[mental.health$gender == "cis man", "gender"] <- "M"
mental.health[mental.health$gender == "Male.", "gender"] <- "M"
mental.health[mental.health$gender == "Male ", "gender"] <- "M"
mental.health[mental.health$gender == "male ", "gender"] <- "M"
mental.health[mental.health$gender == "Malr", "gender"] <- "M"
mental.health[841,"gender"] <- "M"
mental.health[mental.health$gender == "Female", "gender"] <- "F"
mental.health[mental.health$gender == "Female ", "gender"] <- "F"
mental.health[mental.health$gender == " Female", "gender"] <- "F"
mental.health[mental.health$gender == "female", "gender"] <- "F"
mental.health[mental.health$gender == "female ", "gender"] <- "F"
mental.health[mental.health$gender == "Woman", "gender"] <- "F"
mental.health[mental.health$gender == "woman", "gender"] <- "F"
mental.health[mental.health$gender == "f", "gender"] <- "F"
mental.health[mental.health$gender == "Cis female", "gender"] <- "F"
mental.health[mental.health$gender == "Cis female ", "gender"] <- "F"
mental.health[mental.health$gender == "Cisgender Female", "gender"] <- "F"
mental.health[mental.health$gender == "Cis-woman", "gender"] <- "F"
mental.health[mental.health$gender == "fem", "gender"] <- "F"
mental.health[1091, "gender"] <- "F"
mental.health[17, "gender"] <- "F"
# gender queer (GQ)
mental.health[mental.health$gender == "Agender", "gender"] <- "GQ"
mental.health[mental.health$gender == "Androgynous", "gender"] <- "GQ"
mental.health[mental.health$gender == "Bigender", "gender"] <- "GQ"
mental.health[mental.health$gender == "Female or Multi-Gender Femme", "gender"] <- "GQ"
mental.health[mental.health$gender == "female-bodied; no feelings about gender", "gender"] <- "GQ"
mental.health[mental.health$gender == "Fluid", "gender"] <- "GQ"
mental.health[mental.health$gender == "fm", "gender"] <- "GQ"
mental.health[mental.health$gender == "GenderFluid", "gender"] <- "GQ"
mental.health[mental.health$gender == "GenderFluid (born female)", "gender"] <- "GQ"
mental.health[mental.health$gender == "Genderflux demi-girl", "gender"] <- "GQ"
mental.health[mental.health$gender == "genderqueer", "gender"] <- "GQ"
mental.health[mental.health$gender == "Genderqueer", "gender"] <- "GQ"
mental.health[mental.health$gender == "genderqueer woman", "gender"] <- "GQ"
mental.health[mental.health$gender == "human", "gender"] <- "GQ"
mental.health[mental.health$gender == "Human", "gender"] <- "GQ"
mental.health[mental.health$gender == "Unicorn", "gender"] <- "GQ"
mental.health[mental.health$gender == "Male/genderqueer", "gender"] <- "GQ"
mental.health[mental.health$gender == "nb masculine", "gender"] <- "GQ"
mental.health[mental.health$gender == "non-binary", "gender"] <- "GQ"
mental.health[mental.health$gender == "Nonbinary", "gender"] <- "GQ"
mental.health[mental.health$gender == "AFAB", "gender"] <- "GQ"
# transgender (TG)
mental.health[mental.health$gender == "Male (trans, FtM)", "gender"] <- "TG"
mental.health[mental.health$gender == "Transgender woman", "gender"] <- "TG"
# see what's left
index <- which(mental.health$gender != "M" & mental.health$gender != "F" & mental.health$gender != "GQ" & mental.health$gender != "TG")
mental.health[index, "gender"]
## [1] "Female assigned at birth " "Transitioned, M2F"
## [3] "Genderfluid (born female)" "Other/Transfeminine"
## [5] "female/woman" "male 9:1 female, roughly"
## [7] "N/A" "Other"
## [9] "Sex is male" "none of your business"
## [11] "Genderfluid" "N/A"
## [13] "Enby" "mtf"
## [15] "Queer" ""
# create vector of final gender values to fill in based on index
last.genders <- c("F", "TG", "GQ", "GQ", "F", "GQ", "GQ", "GQ", "M", "Refused", "GQ", "GQ", "GQ", "TG", "GQ", NA)
# fill in remaining values
mental.health[index, "gender"] <- last.genders
# check gender
table(mental.health$gender)
##
## F GQ M Refused TG
## 337 33 1057 1 4
# convert gender back to factor
mental.health$gender <- as.factor(mental.health$gender)
To answer this question, I will restrict myself to only those respondents who provided an answer to the question, “Is your primary role within your company related to tech/IT?”
# exclude NA responses in the tech.role variable
tekkies <- mental.health %>% filter(!is.na(tech.role))
#how many respondents total?
nrow(tekkies)
## [1] 263
# how many tech workers?
nrow(tekkies[tekkies$tech.role == "Yes", ])
## [1] 248
# get only tech workers
tekkies <- tekkies[tekkies$tech.role == "Yes", ]
# what is the gender breakdown?
table(tekkies$gender)
##
## F GQ M Refused TG
## 60 4 182 0 1
#group by variables of interest
tekkies.grouped <- tekkies %>% group_by(gender, ever.had.mental.disorder)
# plot counts
ggplot(tekkies, aes(ever.had.mental.disorder)) + geom_bar() + facet_wrap(~gender, scales = "free_y") +
ggtitle("Raw counts of tech role workers with previous diagnosis of mental\n disorder") +
xlab("Ever diagnosed with mental disorder?") +
theme_bw()
# create summary of counts and relative frequencies for each class
forPlotting <- tekkies.grouped %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# show frequencies
forPlotting
## Source: local data frame [9 x 4]
## Groups: gender [5]
##
## gender ever.had.mental.disorder n freq
## <fctr> <fctr> <int> <dbl>
## 1 F Maybe 9 0.1500000
## 2 F No 13 0.2166667
## 3 F Yes 38 0.6333333
## 4 GQ Yes 4 1.0000000
## 5 M Maybe 30 0.1648352
## 6 M No 69 0.3791209
## 7 M Yes 83 0.4560440
## 8 TG No 1 1.0000000
## 9 NA Yes 1 1.0000000
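# remove the one TG, one NA and 4 GQ respondents
# (is.na() drops the missing gender explicitly; comparing against the string "NA" only works because filter() discards rows where the condition evaluates to NA)
tekkies.grouped <- tekkies %>% filter(!is.na(gender), !gender %in% c("TG", "GQ")) %>% group_by(gender, ever.had.mental.disorder)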
## create updated summary of counts and relative frequencies for each class
forPlotting <- tekkies.grouped %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# plot relative frequencies with a stacked bar plot
gg <- ggplot(forPlotting, aes(x = gender, y = freq, fill = ever.had.mental.disorder)) +
geom_bar(stat = "identity") +
ggtitle("Tech role workers with previous diagnosis of mental disorder") +
xlab("Gender") +
ylab("Relative frequency") +
guides(fill=guide_legend(title=NULL)) +
scale_fill_manual(values=wes_palette(n=3, name="GrandBudapest2")) +
theme_bw()
ggplotly(gg)
The first thing to note is that many of the survey respondents did not provide an answer to the question, “Is your primary role within your company related to tech/IT?”. In fact, out of 1433 total respondents, only 263 provided a response, while 1170 left this field empty. As I found this to be out of the ordinary, I checked the survey online. It appears that this question is no longer present in the current form of the survey, which would explain the large amount of missing data.
Perhaps unsurprisingly, there are roughly three times more male (182) than female (60) respondents in this restricted dataset (tech role only). However, there is a higher proportion of female respondents who have had a diagnosis of a mental health disorder (roughly 63% vs 46%, respectively). The reasons for this are not clear, but it could be that men are less likely to seek out professional medical assistance as it pertains to mental health issues and therefore would have a lower frequency of medically diagnosed mental health disorders.
Lastly, of the four respondents who identified as being GQ or Gender Queer, all had been diagnosed with a mental health disorder in the past, while the single TG (transgender) respondent had no past diagnosis of a mental health disorder.
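The within-gender proportions quoted above can be checked directly from the printed counts. The sketch below rebuilds the F and M rows as a small matrix (the object names `counts` and `freq` are illustrative) and uses base R's `prop.table()` in place of the `group_by()`/`summarise()`/`mutate()` chain.

```r
# Counts copied from the frequency table printed above (F and M rows only)
counts <- matrix(
  c( 9, 13, 38,
    30, 69, 83),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("F", "M"),
                  ever.had.mental.disorder = c("Maybe", "No", "Yes"))
)

# margin = 1 divides each cell by its row total, giving within-gender proportions
freq <- prop.table(counts, margin = 1)
round(freq, 3)
```

Dividing by row totals is what makes the female and male bars comparable despite the roughly threefold difference in group size.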
To answer this question, I will use the same subset of data that I used above.
# quick count
table(tekkies$currently.have.mental.disorder)
##
## Maybe No Yes
## 56 95 97
# group by variables of interest
tekkies.grouped <- tekkies %>% group_by(gender, currently.have.mental.disorder)
# create summary of counts and relative frequencies for each class
forPlotting <- tekkies.grouped %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# show updated frequencies
forPlotting
## Source: local data frame [10 x 4]
## Groups: gender [5]
##
## gender currently.have.mental.disorder n freq
## <fctr> <fctr> <int> <dbl>
## 1 F Maybe 10 0.1666667
## 2 F No 19 0.3166667
## 3 F Yes 31 0.5166667
## 4 GQ Maybe 1 0.2500000
## 5 GQ Yes 3 0.7500000
## 6 M Maybe 45 0.2472527
## 7 M No 76 0.4175824
## 8 M Yes 61 0.3351648
## 9 TG Yes 1 1.0000000
## 10 NA Yes 1 1.0000000
# remove the one TG, one NA and 4 GQ respondents (missing gender is dropped explicitly with is.na())
tekkies.grouped <- tekkies %>% filter(!is.na(gender), !gender %in% c("TG", "GQ")) %>% group_by(gender, currently.have.mental.disorder)
## create updated summary of counts and relative frequencies for each class
forPlotting <- tekkies.grouped %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# plot relative frequencies with a stacked bar plot filled by current diagnosis
gg <- ggplot(forPlotting, aes(x = gender, y = freq, fill = currently.have.mental.disorder)) +
geom_bar(stat = "identity") +
ggtitle("Female tech workers have higher incidence of mental health disorders") +
xlab("Gender") +
ylab("Relative frequency") +
guides(fill=guide_legend(title=NULL)) +
scale_fill_manual(values=wes_palette(n=3, name="GrandBudapest2")) +
theme_bw()
ggplotly(gg)
As was the case with previous history of a mental disorder, we see similar trends among those who currently have a mental health disorder. Roughly 52% of female respondents indicated they currently have a mental health disorder, in contrast to only 34% of male respondents. However, a higher proportion of male than female respondents indicated that they “maybe” have a mental disorder (25% versus 17%, respectively).
As it appears that the majority of survey respondents did not receive the question, “Is your primary role within your company related to tech/IT?”, I will conduct the remainder of the analyses using the full set of respondents.
# quick table count
table(mental.health$currently.have.mental.disorder)
##
## Maybe No Yes
## 327 531 575
# group the data
all <- mental.health %>% group_by(gender, currently.have.mental.disorder)
# calculate frequencies
forPlotting <- all %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# show freqs
forPlotting
## Source: local data frame [12 x 4]
## Groups: gender [6]
##
## gender currently.have.mental.disorder n freq
## <fctr> <fctr> <int> <dbl>
## 1 F Maybe 57 0.16913947
## 2 F No 99 0.29376855
## 3 F Yes 181 0.53709199
## 4 GQ Maybe 9 0.27272727
## 5 GQ No 2 0.06060606
## 6 GQ Yes 22 0.66666667
## 7 M Maybe 260 0.24597919
## 8 M No 430 0.40681173
## 9 M Yes 367 0.34720908
## 10 Refused Maybe 1 1.00000000
## 11 TG Yes 4 1.00000000
## 12 NA Yes 1 1.00000000
# remove Refused and NA respondents and regroup data (missing gender is dropped explicitly with is.na())
all <- mental.health %>% filter(!is.na(gender), gender != "Refused") %>%
group_by(gender, currently.have.mental.disorder)
# recalculate frequencies
forPlotting <- all %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# plot relative frequencies with a stacked bar plot filled by current diagnosis
gg <- ggplot(forPlotting, aes(x = gender, y = freq, fill = currently.have.mental.disorder)) +
geom_bar(stat = "identity") +
ggtitle("Gender Queer and Female respondents have higher incidence of mental health disorders\n") +
xlab("Gender") +
ylab("Relative frequency") +
guides(fill=guide_legend(title=NULL)) +
scale_fill_manual(values=wes_palette(n=3, name="GrandBudapest2")) +
theme_bw()
ggplotly(gg)
First, note that while the TG frequency of Yes is 100%, there are only four TG respondents in the entire survey. With more data, this frequency will likely change.
Interestingly, among the respondents who consider themselves GQ, 67% currently have a mental health disorder, a higher proportion than among either males or females. As was observed in the tech worker subset of the data, a higher proportion of females (54%) than males (35%) currently have a mental health disorder.
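To get a feel for how little a 100% rate based on four respondents pins down, one can look at exact binomial confidence intervals for the TG and GQ groups. This is a rough sketch only: it treats the respondents as a simple random sample, which this survey is not, and `tg.ci` and `gq.ci` are illustrative names.

```r
# Counts copied from the frequency table printed above
tg.ci <- binom.test(x = 4, n = 4)$conf.int    # TG: 4 of 4 answered "Yes"
gq.ci <- binom.test(x = 22, n = 33)$conf.int  # GQ: 22 of 33 answered "Yes"

# 95% exact (Clopper-Pearson) intervals for the proportion answering "Yes"
round(tg.ci, 2)
round(gq.ci, 2)
```

The TG interval is very wide, which is consistent with the caution above about reading much into that group's frequency.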
# check entire survey
all <- mental.health %>% group_by(self.employed, currently.have.mental.disorder)
# calculate frequencies
forPlotting <- all %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# show freqs
forPlotting
## Source: local data frame [6 x 4]
## Groups: self.employed [2]
##
## self.employed currently.have.mental.disorder n freq
## <fctr> <fctr> <int> <dbl>
## 1 No Maybe 254 0.2216405
## 2 No No 441 0.3848168
## 3 No Yes 451 0.3935428
## 4 Yes Maybe 73 0.2543554
## 5 Yes No 90 0.3135889
## 6 Yes Yes 124 0.4320557
# plot relative frequencies with a stacked bar plot filled by current diagnosis
gg <- ggplot(forPlotting, aes(x = self.employed, y = freq, fill = currently.have.mental.disorder)) +
geom_bar(stat = "identity") +
ggtitle("Self-employed respondents appear more likely to suffer from mental illness") +
xlab("Self-employment status") +
ylab("Relative frequency") +
guides(fill=guide_legend(title=NULL)) +
scale_fill_manual(values=wes_palette(n=3, name="GrandBudapest2")) +
theme_bw()
ggplotly(gg)
# Are the differences in proportions statistically significant?
# Ho: There is no difference in proportion of mental health disorders among the self- and non-self-employed
# Ha: There is some difference in proportion of mental health disorders among the self- and non-self-employed
chisq.test(table(all$self.employed, all$currently.have.mental.disorder))
##
## Pearson's Chi-squared test
##
## data: table(all$self.employed, all$currently.have.mental.disorder)
## X-squared = 5.0674, df = 2, p-value = 0.07937
From a survey-wide standpoint, it appears that those who are self-employed have a slightly higher incidence of a mental health disorder (43%) than those who are not self-employed (39%). However, according to the results of a chi-squared test of independence, we fail to reject the null hypothesis of no difference between the self-employed and non-self-employed groups in the population at large. That said, the result falls only narrowly short of the conventional 0.05 significance threshold (\(p = 0.07937\)).
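For readers who want to see where the test statistic comes from, the sketch below rebuilds the 2 x 3 table from the counts printed above and computes the chi-squared statistic by hand alongside `chisq.test()`. The object names (`tab`, `expected`, `x.squared`, `res`) are illustrative.

```r
# Counts copied from the frequency table printed above
tab <- matrix(
  c(254, 441, 451,
     73,  90, 124),
  nrow = 2, byrow = TRUE,
  dimnames = list(self.employed = c("No", "Yes"),
                  currently.have.mental.disorder = c("Maybe", "No", "Yes"))
)

res <- chisq.test(tab)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# The statistic is the sum of (observed - expected)^2 / expected over all cells
x.squared <- sum((tab - expected)^2 / expected)

x.squared
res$p.value
```

Comparing `tab` with `expected` cell by cell also shows which cells contribute most to the statistic.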
# first, what is the distribution of ages?
summary(mental.health$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 28.00 33.00 34.29 39.00 323.00
Okay... it is very unlikely that a 3-year-old and a 323-year-old completed this survey. I will replace these ages with the median value of 33.
# impute the incorrect ages with median age of 33
mental.health[which(mental.health$age == 3), "age"] <- 33
mental.health[which(mental.health$age == 323), "age"] <- 33
# check
summary(mental.health$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.0 28.0 33.0 34.1 39.0 99.0
We still have some issues with age. There’s a respondent who answered 99. Let’s explore age graphically.
# plot ages
gg <- ggplot(mental.health, aes(age)) + geom_histogram() +
ggtitle("Distribution of ages among all respondents") +
theme_bw()
ggplotly(gg)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Although possible, it is unlikely that a 99-year-old would still be working rather than enjoying the golden years of their life. Therefore, I will replace this age with the median as well.
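The separate replacements for 3, 323 and 99 could also be expressed as a single range-based rule. The sketch below uses a made-up vector of ages, and the cutoffs `min.age` and `max.age` are assumptions chosen for illustration rather than values taken from the survey. Note that it takes the median of the plausible values only, whereas the analysis here uses the overall median of 33.

```r
# Hypothetical ages, including the kinds of implausible entries seen in the survey
age <- c(3, 28, 33, 39, 99, 323, 45)

# Assumed plausible range for a working respondent; these cutoffs are a judgment call
min.age <- 15
max.age <- 80

# Flag out-of-range values and compute the median from the plausible ones only
implausible <- age < min.age | age > max.age
age.median <- median(age[!implausible])

# Replace flagged values with that median
age[implausible] <- age.median
age
```

A rule like this makes the cleaning decision explicit and would catch any further implausible values without hard-coding each one.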
# fix last age
mental.health[which(mental.health$age == 99), "age"] <- 33
# how does age relate to comfort with discussing mental health issues with supervisor?
non.empty.response <- mental.health %>% filter(mental.health.comfort.supervisor != "")
# plot
gg <- ggplot(non.empty.response, aes(x = mental.health.comfort.supervisor, y = age)) +
geom_boxplot() +
ggtitle("No age differences according to comfort with supervisor") +
xlab("comfort with supervisor") +
ylab("age") +
scale_fill_manual(values=wes_palette(n=3, name="GrandBudapest2")) +
theme_bw()
ggplotly(gg)
# Are there differences in ages according to comfort discussing mental illness with supervisor?
av <- aov(non.empty.response$age ~ non.empty.response$mental.health.comfort.supervisor)
summary(av)
## Df Sum Sq Mean Sq
## non.empty.response$mental.health.comfort.supervisor 2 127 63.29
## Residuals 1143 67679 59.21
## F value Pr(>F)
## non.empty.response$mental.health.comfort.supervisor 1.069 0.344
## Residuals
# how does age relate to comfort with discussing mental health issues with coworkers?
non.empty.response.co <- mental.health %>% filter(mental.health.comfort.coworker != "")
# plot
gg <- ggplot(non.empty.response.co, aes(x = mental.health.comfort.coworker, y = age)) +
geom_boxplot() +
ggtitle("No age differences according to comfort with coworkers") +
xlab("comfort with coworkers") +
ylab("age") +
scale_fill_manual(values=wes_palette(n=3, name="GrandBudapest2")) +
theme_bw()
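ggplotly(gg)
# Are there differences in ages according to comfort discussing mental illness with coworkers?
# note: the output printed below was produced with mental.health.comfort.supervisor on the right-hand side, so it repeats the supervisor ANOVA
av <- aov(non.empty.response.co$age ~ non.empty.response.co$mental.health.comfort.coworker)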
summary(av)
## Df Sum Sq Mean Sq
## non.empty.response.co$mental.health.comfort.supervisor 2 127 63.29
## Residuals 1143 67679 59.21
## F value Pr(>F)
## non.empty.response.co$mental.health.comfort.supervisor 1.069 0.344
## Residuals
According to the plots and ANOVA results above, age does not differ significantly by level of comfort in discussing mental health issues with a supervisor (\(F = 1.069\), \(p = 0.344\)), and the boxplots for coworkers show a similar pattern. This suggests that targeting mental health wellness outreach programs by age may be unnecessary.
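As a quick check on how the ANOVA table is read, the F value and p-value can be recovered from the printed mean squares and degrees of freedom. The sketch below only re-derives numbers already shown above; the variable names are illustrative.

```r
# Values copied from the ANOVA table printed above
ms.between <- 63.29   # mean square for the comfort-level factor
ms.within  <- 59.21   # residual mean square
df.between <- 2
df.within  <- 1143

# F is the ratio of the two mean squares
f.stat <- ms.between / ms.within

# Upper-tail probability of the F distribution gives the p-value
p.value <- pf(f.stat, df.between, df.within, lower.tail = FALSE)

round(f.stat, 3)
round(p.value, 3)
```

An F value close to 1 means the variation in age between comfort levels is about the same size as the variation within them, which is why the test is not significant.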
To answer this question, I will restrict the data to only those survey respondents who have answered the survey question: “Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?”
# quick summary of number of employees breakdown
summary(mental.health$num.employees)
## 1-5 100-500 26-100 500-1000
## 287 60 248 292 80
## 6-25 More than 1000
## 210 256
As we can see, a lot of data are missing (287 blank responses). Otherwise, the breakdown looks fairly homogeneous, with the fewest survey respondents (60) belonging to companies with five or fewer people.
# get only respondents who have answered the question
size.company <- mental.health %>% filter(num.employees != "")
# order levels of factor to be in ascending order
size.company$num.employees <- factor(size.company$num.employees, levels = c("1-5", "6-25", "26-100", "100-500", "500-1000", "More than 1000"))
# group by variables of interest
size.company.grouped <- size.company %>% group_by(num.employees, currently.have.mental.disorder)
# calculate frequencies
forPlotting <- size.company.grouped %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# show freqs
#forPlotting
# plot relative frequencies with a stacked bar plot filled by current diagnosis
gg <- ggplot(forPlotting, aes(x = num.employees, y = freq, fill = currently.have.mental.disorder)) +
geom_bar(stat = "identity") +
ggtitle("No relationship between company size and incidence of mental disorders") +
xlab("Company size (number of employees)") +
ylab("Relative frequency") +
guides(fill=guide_legend(title=NULL)) +
scale_fill_manual(values=wes_palette(n=3, name="GrandBudapest2")) +
theme_bw()
ggplotly(gg)
# Are the differences in proportions statistically significant?
# Ho: There is no difference in proportion of mental health disorders dependent on company size
# Ha: There is some difference in proportion of mental health disorders dependent on company size
chisq.test(table(size.company$num.employees, size.company$currently.have.mental.disorder))
##
## Pearson's Chi-squared test
##
## data: table(size.company$num.employees, size.company$currently.have.mental.disorder)
## X-squared = 13.231, df = 10, p-value = 0.2111
Given the data we have, we fail to reject the null hypothesis that the proportion of respondents with a mental health disorder is the same across company sizes (\(p = 0.2111\)). In other words, company size is not significantly associated with currently having a mental health disorder.
Or, in other words: Are smaller or larger companies more likely to formally discuss mental health issues?
To answer this question, I will use a similar approach as above.
# quick summary
summary(size.company$mental.health.formally.discussed)
## I don't know No Yes
## 0 103 813 230
# fix empty level issue
size.company$mental.health.formally.discussed <- droplevels(size.company$mental.health.formally.discussed)
# group by variables of interest
size.company.grouped <- size.company %>% group_by(num.employees, mental.health.formally.discussed)
# calculate frequencies
forPlotting <- size.company.grouped %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# show freqs
#forPlotting
# plot relative frequencies with a stacked bar plot filled by whether mental health was formally discussed
gg <- ggplot(forPlotting, aes(x = num.employees, y = freq, fill = mental.health.formally.discussed)) +
geom_bar(stat = "identity") +
ggtitle("Larger companies formally discuss mental health issues more") +
xlab("Company size (number of employees)") +
ylab("Relative frequency") +
guides(fill=guide_legend(title=NULL)) +
scale_fill_manual(values=wes_palette(n=3, name="GrandBudapest2")) +
theme_bw()
ggplotly(gg)
# Are the differences in proportions statistically significant?
# Ho: There is no difference in proportion of mental health being formally discussed being dependent on company size
# Ha: There is some difference in proportion of mental health being formally discussed being dependent on company size
chisq.test(table(size.company$num.employees, size.company$mental.health.formally.discussed))
##
## Pearson's Chi-squared test
##
## data: table(size.company$num.employees, size.company$mental.health.formally.discussed)
## X-squared = 66.212, df = 10, p-value = 2.375e-10
Based on the data above, we have sufficient evidence to reject the null hypothesis of formal discussion of mental health issues and company size being independent. The data suggest that larger companies (greater than 500 employees) tend to have formal discussions/policies about mental health in place.
Do larger companies tend to take mental health more seriously?
# quick summary
table(size.company$mental.health.taken.seriously)
##
## I don't know No Yes
## 0 493 303 350
# fix empty level issue
size.company$mental.health.taken.seriously <- droplevels(size.company$mental.health.taken.seriously)
# group by variables of interest
size.company.grouped <- size.company %>% group_by(num.employees, mental.health.taken.seriously)
# calculate frequencies
forPlotting <- size.company.grouped %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# show freqs
#forPlotting
# plot relative frequencies with a stacked bar plot filled by mental.health.taken.seriously
gg <- ggplot(forPlotting, aes(x = num.employees, y = freq, fill = mental.health.taken.seriously)) +
geom_bar(stat = "identity") +
ggtitle("Companies with more than 1000 employees take mental health less seriously") +
xlab("Company size (number of employees)") +
ylab("Relative frequency") +
guides(fill=guide_legend(title=NULL)) +
scale_fill_manual(values=wes_palette(n=3, name="GrandBudapest2")) +
theme_bw()
ggplotly(gg)
# Are the differences in proportions statistically significant?
# Ho: There is no difference in proportion of mental health taken seriously being dependent on company size
# Ha: There is some difference in proportion of mental health taken seriously being dependent on company size
chisq.test(table(size.company$num.employees, size.company$mental.health.taken.seriously))
##
## Pearson's Chi-squared test
##
## data: table(size.company$num.employees, size.company$mental.health.taken.seriously)
## X-squared = 34.42, df = 10, p-value = 0.0001568
Surprisingly, even though the largest companies have the highest proportion of survey respondents indicating that formal discussions of mental health take place at their companies, they also have the lowest proportion of respondents indicating that their companies take mental health issues seriously. Regarding the chi-squared test of independence, we have convincing evidence to reject the null hypothesis of independence between company size and taking mental health seriously (\(p = 0.0001568\)).
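The three company-size tests above all report 10 degrees of freedom, which follows from the shape of the tables (six company-size categories by three response categories). The sketch below recomputes the degrees of freedom and the p-values from the printed statistics; the vector names are illustrative, and the p-values may differ slightly in the last digits because the printed statistics are rounded.

```r
# Test statistics as printed in the three chi-squared outputs above
x.squared <- c(disorder = 13.231, discussed = 66.212, seriously = 34.42)

# Each table has 6 company-size rows and 3 response columns
n.rows <- 6
n.cols <- 3
deg.free <- (n.rows - 1) * (n.cols - 1)

# Upper-tail chi-squared probabilities
p.values <- pchisq(x.squared, df = deg.free, lower.tail = FALSE)
signif(p.values, 4)
```

Putting the three results side by side makes the contrast clear: company size shows no detectable association with having a disorder, but a strong one with both formal discussion and perceived seriousness.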
To answer this question, I will limit the dataset to only a few discrete categories of work position. Note that in the survey, respondents could select multiple answers to describe their roles. I have decided to focus on the following single answer choices: Back-end Developer, Front-end Developer, DevOps/SysAdmin, Supervisor/Team Lead, Support, and Designer.
# group by variables of interest
workpo.grouped <- mental.health %>% filter(work.position %in% c("Back-end Developer", "Front-end Developer", "DevOps/SysAdmin", "Supervisor/Team Lead", "Support", "Designer")) %>% group_by(work.position, currently.have.mental.disorder)
# calculate frequencies
forPlotting <- workpo.grouped %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# show frequencies
forPlotting
## Source: local data frame [18 x 4]
## Groups: work.position [6]
##
## work.position currently.have.mental.disorder n freq
## <chr> <fctr> <int> <dbl>
## 1 Back-end Developer Maybe 52 0.19771863
## 2 Back-end Developer No 108 0.41064639
## 3 Back-end Developer Yes 103 0.39163498
## 4 Designer Maybe 2 0.07142857
## 5 Designer No 9 0.32142857
## 6 Designer Yes 17 0.60714286
## 7 DevOps/SysAdmin Maybe 17 0.31481481
## 8 DevOps/SysAdmin No 13 0.24074074
## 9 DevOps/SysAdmin Yes 24 0.44444444
## 10 Front-end Developer Maybe 34 0.27200000
## 11 Front-end Developer No 46 0.36800000
## 12 Front-end Developer Yes 45 0.36000000
## 13 Supervisor/Team Lead Maybe 9 0.13235294
## 14 Supervisor/Team Lead No 33 0.48529412
## 15 Supervisor/Team Lead Yes 26 0.38235294
## 16 Support Maybe 5 0.14705882
## 17 Support No 12 0.35294118
## 18 Support Yes 17 0.50000000
gg <- ggplot(forPlotting, aes(x = work.position, y = freq, fill = currently.have.mental.disorder)) +
  geom_bar(stat = "identity") +
  ggtitle("Designers have the highest incidence of mental health disorders") +
  xlab("Work position") +
  ylab("Relative frequency") +
  guides(fill = guide_legend(title = NULL)) +
  scale_fill_manual(values = wes_palette(n = 3, name = "GrandBudapest2")) +
  # apply the complete theme first so it does not reset the axis-text rotation below
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
ggplotly(gg)
Interestingly, Designers have the highest proportion of respondents who currently have a mental disorder (17/28, roughly 61%), whereas Front-end Developers have the lowest, at 36%.
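Since some of these groups are small (only 28 Designers), it may help to look at how uncertain each proportion is. The following is a minimal sketch, not part of the original analysis, that uses base R's `binom.test` to get exact 95% confidence intervals from the counts in the frequency table above. The variable names are introduced here for illustration.

```r
# Counts taken from the frequency table above
designer.yes <- 17
designer.n <- 2 + 9 + 17      # Maybe + No + Yes
frontend.yes <- 45
frontend.n <- 34 + 46 + 45    # Maybe + No + Yes

# Exact (Clopper-Pearson) 95% confidence intervals for each proportion
designer.ci <- binom.test(designer.yes, designer.n)
frontend.ci <- binom.test(frontend.yes, frontend.n)

# Point estimate and interval for Designers
designer.ci$estimate
designer.ci$conf.int

# Point estimate and interval for Front-end Developers
frontend.ci$estimate
frontend.ci$conf.int
```

The interval for Designers is wider than the one for Front-end Developers because it rests on far fewer respondents, so the ranking of roles should be read with caution.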
Here I have performed a cursory exploratory data analysis in an attempt to gain some insight into the incidence of mental health/illness in the tech workplace based on the Open Sourcing Mental Illness (OSMI) ongoing survey, which started back in 2016.
Based on the exploratory analysis conducted, there were several potentially interesting considerations/findings:
Having an open-ended survey question related to gender led to a minor data-cleaning headache. While I do understand the motivation behind leaving this survey question free-form, perhaps providing a handful of choices (e.g. “Male”, “Female”, “Non-binary/Gender Queer”, “Transgender” and so on) would mitigate the huge diversity of answers, some of which included entire sentences.
For whatever reason, at least one of the survey questions (“Is your primary role within your company related to tech/IT”) was removed from the survey, leading to a large amount of missing data (roughly 82%) for this variable. Nevertheless, in the subset of data corresponding to those with a tech role (\(n = 248\)), we observed that:
(63% females vs 46% males)
(52% females vs 34% males)
The reasons for these differences between males and females are unclear, but it can be speculated that males could be less willing to admit to having issues with their mental health and/or could be reluctant to see a psychiatrist, given the larger proportion of males who were unsure if they have a mental disorder or not (25% males vs 17% females).
Considering the entire set of respondents (\(n = 1433\)), we observed that Gender Queer and Female respondents have a higher incidence of currently having a mental disorder (67% Gender Queer and 54% Female), while all 4 Transgender respondents currently have a mental health disorder. Survey-wide, males have the lowest frequency of currently having a mental disorder (35%).
Self-employed respondents are slightly more likely to currently have a mental disorder, but this difference was not found to be statistically significant.
A respondent’s age does not appear to influence their level of comfort in discussing mental health issues with their supervisors or co-workers. The median age of respondents is 33 years.
With the current data, company size and incidence of mental health disorders appear to be independent.
Larger companies tend to formally discuss mental health issues more, i.e. perhaps they have formal policies in place; however, respondents at companies with more than 1000 employees were the least likely to report that their company takes mental health seriously. Working with larger companies to improve their handling of mental health issues could therefore be a priority.
Designers have the highest incidence of mental health disorders at 61%, followed by those in a Support role (50%). Front-end Developers have the lowest incidence of mental health disorders at 36%.
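On the free-form gender answers mentioned above, the kind of cleaning involved can be sketched roughly as follows. This is a hypothetical helper written in base R. The function name `normalize.gender`, its patterns, and its output categories are illustrative assumptions and are not the rules actually used earlier in this analysis.

```r
# Hypothetical helper: collapse free-form gender answers into a few categories.
# The patterns below are illustrative assumptions, not the rules used in this analysis.
normalize.gender <- function(x) {
  x <- tolower(trimws(x))
  out <- rep("Other/Unclear", length(x))
  out[grepl("^(m|male|man)$", x)] <- "Male"
  out[grepl("^(f|female|woman)$", x)] <- "Female"
  out[grepl("non-?binary|queer", x)] <- "Non-binary/Gender Queer"
  out[grepl("trans", x)] <- "Transgender"
  # blank or missing answers stay missing
  out[is.na(x) | x == ""] <- NA
  out
}

cleaned <- normalize.gender(c("Male", " m ", "FEMALE", "genderqueer", "Trans woman", ""))
cleaned
```

Anything that matches none of the patterns lands in "Other/Unclear", so full-sentence answers would still need a manual look.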
I hope that these insights can be considered useful in some way. It would be nice to continue exploring the data and working with some of the other variables, and even building some sort of model to predict the incidence of mental health disorders. Unfortunately, I will have to cut my analysis off here due to time constraints!
Thank you!
Feel free to reach me at dalesan@gmail.com with any questions. All relevant files are located in a GitHub repo.