Final Report - Mental Health in Tech Survey

Synopsis

The objective of this report is to analyse the attitude of companies towards mental health and to examine the frequency of mental health illnesses among tech workers.

The analysis addresses the following questions:

Are mental health illnesses more frequent among tech workers than among non-tech workers?
How does the size of company relate to an employer formally discussing mental health?
How does Age relate to comfort discussing mental health issues with peers?

Packages Required

To analyze this data, I will use the following R packages:

library(plyr)       # data manipulation
library(dplyr)      # data manipulation
library(ggplot2)    # data visualizations
library(stringr)    # tidying data
library(knitr)      # display an aligned table on screen
library(DT)         # display the data on screen in a scrollable format
library(plotly)     # interactive visualization plots
library(rworldmap)  # plot data on a world map
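Before running the report, it can help to check which of these packages are missing from the local R library. The following is a minimal sketch in base R; `missing_packages` is a hypothetical helper name, not part of any package, and the install line is left commented out.

```r
# Packages named in the library() calls of this report
required <- c("plyr", "dplyr", "ggplot2", "stringr",
              "knitr", "DT", "plotly", "rworldmap")

# Hypothetical helper: return the subset of `pkgs` that is not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```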

Data Exploration

Data Source

The dataset is from a 2014 survey conducted by Open Sourcing Mental Illness (OSMI) and can be downloaded here. The survey was conducted online at the OSMI website, and the OSMI team intends to use these data to help drive awareness and improve conditions for individuals with mental illness in the IT workplace. Because this is an online survey, it may be prone to voluntary response bias, so some groups of respondents may be over-represented. The sample of respondents was not obtained through any random sampling approach.

Furthermore, as this is an observational study with potential sampling biases present, causality cannot be inferred. The results of the survey may not be generalizable to the entire population of Tech/IT workers due to the lack of random sampling.

Keeping the above limitations in mind and being cautious with our interpretations, we can still use the data to gain some insight into the state of mental health in the tech workplace.

Data Import

We will import the CSV file as a dataset and take a look at its structure and dimensions.

#load dataset
mental.health <- read.csv("C:/Users/ananya/Documents/Swidle/Data wrangling/Mental health in Tech Survey.csv", header = TRUE, stringsAsFactors = TRUE)

#check structure
#str(mental.health)

Dimension of the dataset:

dim(mental.health)

## [1] 1259   27

Data Description

The dataset contains 1259 rows, one for each person who participated in the survey. The survey examined respondents on 27 factors (columns) that might be associated with mental health status.

This dataset contains the following variables:

Timestamp
Age
Gender
Country
state: If you live in the United States, which state or territory do you live in?
self_employed: Are you self-employed?
family_history: Do you have a family history of mental illness?
treatment: Have you sought treatment for a mental health condition?
work_interfere: If you have a mental health condition, do you feel that it interferes with your work?
no_employees: How many employees does your company or organization have?
remote_work: Do you work remotely (outside of an office) at least 50% of the time?
tech_company: Is your employer primarily a tech company/organization?
benefits: Does your employer provide mental health benefits?
care_options: Do you know the options for mental health care your employer provides?
wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?
seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?
anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
leave: How easy is it for you to take medical leave for a mental health condition?
mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?
phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences?
coworkers: Would you be willing to discuss a mental health issue with your coworkers?
supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview?
phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview?
mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health?
obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
comments: Any additional notes or comments

Data Cleaning

The first part of data cleaning involves cleaning up the Gender variable. Based on the responses, there are 49 unique entries of gender. I will divide these values into four main categories: Male, Female, Genderqueer and Transgender.

mental.health$Gender <- as.character(mental.health$Gender) # convert the variable to string

mental.health$Gender <- tolower(mental.health$Gender)
Gender.clean <- as.vector(mental.health$Gender)
# unique(Gender.clean)  # uncomment to check the unique values

Female <- c('female', 'cis female', 'f', 'woman', 'femake', 'female ', 'cis-female/femme', 'female (cis)', 'femail')

# all entries are lower case because Gender was passed through tolower() above
Male <- c('m', 'male', 'male-ish', 'maile', 'cis male', 'mal', 'male (cis)', 'make', 'male ', 'man', 'msle', 'mail', 'malr', 'cis man', 'mle')

GQ <- c('queer/she/they', 'non-binary', 'nah', 'enby', 'fluid', 'genderqueer', 'androgyne', 'agender', 'guy (-ish) ^_^', 'male leaning androgynous', 'neuter', 'queer', 'ostensibly male, unsure what that really means','a little about you','p','all')

TG <- c('trans-female', 'trans woman', 'female (trans)','something kinda male?')


Gender.clean <- sapply(as.vector(Gender.clean), function(x) if(x %in% Male) "Male" else x)
Gender.clean <- sapply(as.vector(Gender.clean), function(x) if(x %in% Female) "Female" else x)
Gender.clean <- sapply(as.vector(Gender.clean), function(x) if(x %in% TG) "Transgender" else x)
Gender.clean <- sapply(as.vector(Gender.clean), function(x) if(x %in% GQ) "Genderqueer" else x)


# Update the Gender column
mental.health$Gender <- Gender.clean

# Create the frequency table of gender
table(mental.health$Gender)

## 
##      Female Genderqueer        Male Transgender 
##         247          16         991           5

# convert gender back to factor
mental.health$Gender <- as.factor(mental.health$Gender)
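The same kind of recoding can also be expressed as a single named lookup instead of one `sapply()` pass per category. The sketch below runs on a handful of made-up responses rather than the survey data, and the lookup table is deliberately incomplete. It also applies `trimws()`, which is an addition not used in the report, so that entries such as 'female ' do not need their own spelling in the list.

```r
# Toy responses standing in for mental.health$Gender (not the survey data)
raw <- c("Male", "female ", "M", "Cis Female", "non-binary", "Trans woman")

# Normalise case and surrounding whitespace before matching
clean <- trimws(tolower(raw))

# Named lookup: names are cleaned responses, values are target categories
lookup <- c(
  "male" = "Male", "m" = "Male",
  "female" = "Female", "cis female" = "Female",
  "non-binary" = "Genderqueer",
  "trans woman" = "Transgender"
)

# Indexing by name returns NA for any response missing from the lookup
recoded <- unname(lookup[clean])
recoded[is.na(recoded)] <- "Unmatched"

table(recoded)
```

Flagging unmatched responses explicitly makes it easy to spot spellings that the lookup does not yet cover.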

Next, I will convert the comments variable to character. We also do not need the Timestamp and state variables, so we will drop them.

# convert factor variables to character
mental.health$comments <- as.character(mental.health$comments)

# drop the Timestamp and state variables
mental.health <- mental.health[,-1]  # Timestamp is column 1
mental.health <- mental.health[,-4]  # once Timestamp is gone, state is column 4

If we check for missing values in the dataset, we observe that most of the missing values are in the comments variable (1095). Missing values are also present in the work_interfere (264) and self_employed (18) variables. As these are categorical variables, it is difficult to impute the missing values. Hence, I will retain these variables until the Exploratory Data Analysis phase and then decide whether to keep or drop them.

Table with count of missing values:

colSums(is.na(mental.health))

##                       Age                    Gender 
##                         0                         0 
##                   Country             self_employed 
##                         0                        18 
##            family_history                 treatment 
##                         0                         0 
##            work_interfere              no_employees 
##                       264                         0 
##               remote_work              tech_company 
##                         0                         0 
##                  benefits              care_options 
##                         0                         0 
##          wellness_program                 seek_help 
##                         0                         0 
##                 anonymity                     leave 
##                         0                         0 
## mental_health_consequence   phys_health_consequence 
##                         0                         0 
##                 coworkers                supervisor 
##                         0                         0 
##   mental_health_interview     phys_health_interview 
##                         0                         0 
##        mental_vs_physical           obs_consequence 
##                         0                         0 
##                  comments 
##                      1095
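Counts like these are often easier to judge as a share of all rows. The sketch below uses a small made-up data frame, not the survey data, to show one way of computing the percentage of missing values per column in base R. It also shows how `addNA()` keeps missing answers as their own category instead of imputing them.

```r
# Toy data frame standing in for mental.health (values are made up)
df <- data.frame(
  self_employed  = c("No", NA, "Yes", "No"),
  work_interfere = c(NA, "Often", NA, "Never"),
  treatment      = c("Yes", "No", "Yes", "No")
)

# Share of missing values per column, in percent
pct_missing <- round(100 * colMeans(is.na(df)), 1)
pct_missing[pct_missing > 0]

# Keep NA as its own factor level so those rows still show up in tables
df$work_interfere <- addNA(factor(df$work_interfere))
table(df$work_interfere)
```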

Clean Data

Preview top 50 rows of the clean dataset:

Exploratory Data Analysis

Distribution of participant Age

We will first visualize the age of survey participants. Before we go ahead with the graphs, we will check the data to see if we have any outliers or invalid values.

#Check the distribution of Ages
summary(mental.health$Age)

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.726e+03  2.700e+01  3.100e+01  7.943e+07  3.600e+01  1.000e+11

We see unusual values in the Age variable: a few observations have a negative age and one has an unusually high value. We will replace ages of 15 or below and ages above 100 with the median and then categorize the Age variable into classes.

mental.health[which(mental.health$Age <= 15), "Age"] <- median(mental.health$Age)
mental.health[which(mental.health$Age > 100), "Age"] <- median(mental.health$Age)

#Categorization of Age variable

mental.health$Age<-cut(mental.health$Age, c(0,20,35,65,100))

theme_set(theme_classic())

g <- ggplot(mental.health, aes(Age))

g + geom_bar(aes(fill=Age),width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Distribution of participant Age")

We see that most of the participants are in the (20,35] age group, i.e. older than 20 and up to 35.
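The cleaning and binning steps can be checked on a small vector. This is a sketch on made-up ages, not the survey data. Unlike the report's code, it marks implausible values as NA first, so the median is computed from the valid ages only.

```r
# Toy ages, including invalid entries like those reported by summary()
age <- c(-1726, 5, 18, 20, 21, 35, 36, 1e11)

# Treat implausible values as missing, then fill with the median of valid ages
valid <- age > 15 & age <= 100
age[!valid] <- NA
age[is.na(age)] <- median(age, na.rm = TRUE)

# cut() builds right-closed intervals by default:
# 20 falls in (0,20], while 21 falls in (20,35]
age_group <- cut(age, breaks = c(0, 20, 35, 65, 100))
table(age_group)
```

Because the intervals are closed on the right, a respondent aged exactly 20 is counted in the youngest group, and one aged exactly 35 in the (20,35] group.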

Distribution of participants by country

Out of curiosity, let’s see where our participants are from.

freqtable <- table(mental.health$Country)
Country <- as.data.frame.table(freqtable)

world <-joinCountryData2Map(Country, joinCode="NAME", nameJoinColumn="Var1")

## 48 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 195 codes from the map weren't represented in your data

mapCountryData(world,nameColumnToPlot = "Freq",
               mapTitle = "Distribution of Participants by country", colourPalette = "rainbow")

g2 <- ggplot(mental.health, aes(Country))
g2 + geom_bar(width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  labs(title="Countrywise Distribution of participants")

We see that most of the participants are from North America and Europe. Hence, the data is not representative of the population, and the results of the survey cannot be generalized over the entire population.
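With many countries on one axis, a sorted frequency table of the most common countries can be easier to read than a bar chart. The following base R sketch uses a made-up country vector, and `top_n_countries` is a hypothetical helper name.

```r
# Toy country column standing in for mental.health$Country
country <- c("United States", "United Kingdom", "United States", "Canada",
             "United States", "Germany", "United Kingdom")

# Hypothetical helper: sort frequencies in decreasing order, keep the top n
top_n_countries <- function(x, n = 10) {
  head(sort(table(x), decreasing = TRUE), n)
}

top3 <- top_n_countries(country, n = 3)
top3

# Share of each country among all responses, in percent
round(100 * top3 / length(country), 1)
```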

How does Age relate to comfort discussing mental health issues with peers?

g1 <- ggplot(mental.health, aes(Age))

g1 + geom_bar(aes(fill=supervisor),width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Comfort with Supervisors according to Age")

g1 + geom_bar(aes(fill=coworkers),width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Comfort with coworkers according to Age")

According to the plots above, the level of comfort in discussing mental health issues with either supervisors or coworkers does not differ noticeably between age groups, although no formal test was run here. This suggests that developing targeted mental health wellness outreach programs based on age may be unnecessary.
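A visual impression like this could be backed by a chi-squared test of independence between age group and comfort level. The sketch below runs on simulated responses, not the survey data, so its output says nothing about the survey itself. The response labels "No", "Some of them" and "Yes" are assumed here for illustration.

```r
set.seed(1)

# Simulated stand-ins for the Age groups and the coworkers responses
toy <- data.frame(
  Age       = sample(c("(0,20]", "(20,35]", "(35,65]"), 300, replace = TRUE),
  coworkers = sample(c("No", "Some of them", "Yes"), 300, replace = TRUE)
)

tab <- table(toy$Age, toy$coworkers)

# Row-wise proportions: distribution of answers within each age group
round(prop.table(tab, margin = 1), 2)

# Test of independence between age group and comfort level
res <- chisq.test(tab)
res$p.value
```

If some age groups contain only a few respondents, the expected counts may be too small for the chi-squared approximation. In that case `chisq.test(tab, simulate.p.value = TRUE)` is one alternative.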

Does size of company relate to an employer formally discussing mental health?

Here, we will check whether the size of a company is related to its willingness to formally discuss mental health issues.

size.company <- mental.health %>% filter(no_employees != "")

# quick summary
summary(size.company$wellness_program)

## Don't know         No        Yes 
##        188        842        229

# fix empty level issue
size.company$wellness_program <- droplevels(size.company$wellness_program)

# group by variables of interest
size.company.grouped <- size.company %>% group_by(no_employees, wellness_program)

# calculate frequencies
forPlotting1 <- size.company.grouped %>%
        summarise(n = n()) %>%
        mutate(freq = n / sum(n))

# plot relative frequencies with a stacked bar plot filled by wellness_program response
gg3 <- ggplot(forPlotting1, aes(x = no_employees, y = freq, fill = wellness_program)) + 
        geom_bar(stat = "identity") +
        ggtitle("Plot of Company size against Willingness to discuss mental health issues") +
        xlab("Company size (number of employees)") +
        ylab("Relative frequency") +
        guides(fill=guide_legend(title=NULL)) +
        theme_bw()

ggplotly(gg3)
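The relative frequencies used for this plot are computed within each company size, so the bars for every size sum to 1. The same quantity can be sketched in base R with `prop.table()`, shown here on made-up responses rather than the survey data.

```r
# Toy responses standing in for no_employees and wellness_program
size <- c("1-5", "1-5", "1-5",
          "More than 1000", "More than 1000", "More than 1000", "More than 1000")
wellness <- c("No", "No", "Yes",
              "Yes", "Yes", "No", "Don't know")

counts <- table(size, wellness)

# margin = 1 divides each row by its row total,
# so the proportions for every company size sum to 1
rel_freq <- prop.table(counts, margin = 1)
round(rel_freq, 2)

# Long format (size, wellness, Freq), convenient for plotting
as.data.frame(rel_freq)
```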

We observe that large companies formally discuss mental health issues more often. We will perform a chi-squared test to verify this.

# Ho: Whether mental health is formally discussed is independent of company size
# Ha: Whether mental health is formally discussed depends on company size

chisq.test(table(size.company$no_employees, size.company$wellness_program))

## 
##  Pearson's Chi-squared test
## 
## data:  table(size.company$no_employees, size.company$wellness_program)
## X-squared = 179.48, df = 10, p-value < 2.2e-16

Based on the test result above (p-value < 2.2e-16), we have sufficient evidence to reject the null hypothesis that formal discussion of mental health issues and company size are independent. Together with the plot, the data suggest that larger companies (more than 500 employees) are more likely to have formal discussions/policies about mental health in place.
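A very small p-value shows that an association exists but not how strong it is. One common effect size for a chi-squared test is Cramér's V, which this report does not compute. The sketch below derives it from the figures printed above. It assumes a 6 x 3 table (six company sizes, three responses), which is consistent with df = 10, and n = 1259, the total of the three wellness_program counts.

```r
# Values reported by chisq.test() and summary() in this report
chi_sq <- 179.48
n      <- 1259
n_rows <- 6   # company size categories
n_cols <- 3   # Don't know / No / Yes

# Cramér's V = sqrt(chi^2 / (n * min(rows - 1, cols - 1)))
cramers_v <- sqrt(chi_sq / (n * min(n_rows - 1, n_cols - 1)))
round(cramers_v, 3)  # 0.267
```

V ranges from 0 (no association) to 1 (perfect association).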

Does Family history play a role in willingness to seek treatment for mental illness?

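# revalue() returns a factor; convert through as.character() so that
# "No"/"Yes" become 0/1 rather than the internal level codes 1/2
mental.health$treatment <- as.numeric(as.character(revalue(mental.health$treatment,
                                              c("No"="0", "Yes"="1"))))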


# fun replaces the fun.y argument, which is deprecated since ggplot2 3.3.0
ggplot(mental.health,aes(x=family_history,y=treatment,fill=Gender)) +  
  stat_summary(fun=mean,position=position_dodge(),geom="bar") + 
  labs(x = "Family History", y = "Willingness to seek treatment", 
       title = "Plot of Family history vs willingness to seek treatment")

We see from the plot that participants who have a family history of mental illness are more likely to have sought treatment than participants who do not have a family history.
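The bar heights in this plot are means of treatment, so they can be read as proportions only when treatment is coded as 0/1. A minimal sketch of the factor-to-number conversion in base R, on toy values, shows why the conversion needs care:

```r
# Toy factor standing in for mental.health$treatment
treatment <- factor(c("No", "Yes", "Yes", "No"))

as.numeric(treatment)           # 1 2 2 1: internal level codes, not 0/1
as.numeric(treatment == "Yes")  # 0 1 1 0: a true 0/1 indicator
mean(treatment == "Yes")        # 0.5: proportion answering "Yes"
```

Calling `as.numeric()` directly on a factor returns the position of each level, so a mean computed from it lies between 1 and 2 instead of between 0 and 1.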

How does the frequency of mental health illness differ by workplace type (i.e., tech vs. non-tech) and size?

#re-value the variables of interest
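# set the levels explicitly so that company sizes plot in increasing order
mental.health$no_employees <- factor(revalue(mental.health$no_employees,c("1-5"="1", "6-25"="2", "26-100"="3", "100-500"="4", "500-1000"="5", "More than 1000"="6")), levels = as.character(1:6))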


# fun replaces the fun.y argument, which is deprecated since ggplot2 3.3.0
ggplot(mental.health,aes(x=no_employees,y=treatment, fill=factor(tech_company))) +  
  stat_summary(fun=mean,position=position_dodge(),geom="bar") + 
  labs(x = "Number of employees", y = "Probability of mental health condition", 
       title = "Probability of mental health illness by workplace type and size") +
  scale_x_discrete(labels=c("1" = "1-5", "2" = "6-25", "3" = "26-100", "4"="100-500", "5"="500-1000", "6"=">1000"))

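The graph suggests that the proportion of respondents who sought treatment, used here as a proxy for mental health illness, is not markedly different between tech and non-tech companies. No formal test was run, so this is a visual impression rather than a statement of statistical significance.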
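A comparison like this could be tested formally with a two-sample test of proportions. The sketch below uses `prop.test()` from base R on hypothetical counts that are NOT taken from the survey. It only illustrates the mechanics.

```r
# Hypothetical counts, NOT taken from the survey:
# number who sought treatment and group size for each workplace type
sought <- c(tech = 50, non_tech = 45)
total  <- c(tech = 100, non_tech = 100)

res <- prop.test(x = sought, n = total)

res$estimate  # the two sample proportions
res$p.value   # p-value for the null hypothesis of equal proportions
res$conf.int  # confidence interval for the difference in proportions
```

With the survey data, the counts could be taken from a cross-tabulation of tech_company and treatment.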

Summary

After performing a cursory exploratory data analysis on the data, there were several interesting findings:

Most of the participants fall in the age category between 20 and 35, and a majority of them are from the United States. Hence, this is a biased sample. A respondent’s age does not appear to have any influence on their level of comfort in discussing mental health issues with their supervisors or co-workers. However, participants who have a family history of mental illness are more likely to have sought treatment than those who do not.

We also observe that larger companies tend to formally discuss mental health issues more often, i.e. perhaps they have formal policies in place. Also, the proportion of respondents who sought treatment, used as a proxy for mental health illness, does not appear to differ markedly between tech and non-tech companies.