The objective of this report is to analyse the attitude of companies towards mental health and to examine the frequency of mental health illnesses among tech workers.
Some of the questions that will be addressed in the analysis are:
To analyze this data, I will use the following R packages:
library(plyr) # for manipulating data (loaded before dplyr)
library(dplyr) # for manipulating data
library(ggplot2) # for data visualizations
library(stringr) # for tidying data
library(knitr) # used to display an aligned table on screen
library(DT) # used to display the data on screen in a scrollable format
library(plotly) # for interactive visualization plots
library(rworldmap) # for plotting data on a world map
The dataset is from a 2014 survey conducted by Open Sourcing Mental Illness (OSMI) and can be downloaded here. The survey was conducted online at the OSMI website, and the OSMI team intends to use these data to help drive awareness and improve conditions for individuals with mental illness in the IT workplace. It should be noted that, as this is an online survey, it may be prone to voluntary response bias: people with strong opinions about or personal experience of mental illness may be over-represented among the respondents. The sample of respondents was not obtained through any random sampling approach.
Furthermore, as this is an observational study with potential sampling biases present, causality cannot be inferred. The results of the survey may not be generalizable to the entire population of Tech/IT workers due to the lack of random sampling.
Keeping the above limitations in mind and being cautious with our interpretations, we can still use the data to gain some insight into the state of mental health in the tech workplace.
We will import the CSV file as a dataset and take a look at its structure and dimensions.
#load dataset
mental.health <- read.csv("C:/Users/ananya/Documents/Swidle/Data wrangling/Mental health in Tech Survey.csv", header = TRUE, stringsAsFactors = TRUE)
#check structure
#str(mental.health)
Dimension of the dataset:
dim(mental.health)
## [1] 1259 27
The dataset contains 1259 rows, one for each person who responded to the survey, and 27 columns holding factors that might be associated with mental health status.
This dataset contains the following variables:
The first part of data cleaning involves cleaning up the Gender variable. Based on the responses, there are 49 unique entries of gender. I will divide these values into four main categories: Male, Female, Genderqueer and Transgender.
mental.health$Gender <- as.character(mental.health$Gender) # convert the variable to string
mental.health$Gender <- tolower(mental.health$Gender)
Gender.clean <- as.vector(mental.health$Gender)
#unique(Gender.clean) check unique values
Female <- c('female', 'cis female', 'f', 'woman', 'femake', 'female ', 'cis-female/femme', 'female (cis)', 'femail')
Male <- c('m', 'male', 'male-ish', 'maile', 'cis male', 'mal', 'male (cis)', 'make', 'male ', 'man', 'msle', 'mail', 'malr', 'cis man','mle')
GQ <- c('queer/she/they', 'non-binary', 'nah', 'enby', 'fluid', 'genderqueer', 'androgyne', 'agender', 'guy (-ish) ^_^', 'male leaning androgynous', 'neuter', 'queer', 'ostensibly male, unsure what that really means','a little about you','p','all')
TG <- c('trans-female', 'trans woman', 'female (trans)','something kinda male?')
Gender.clean <- sapply(as.vector(Gender.clean), function(x) if(x %in% Male) "Male" else x)
Gender.clean <- sapply(as.vector(Gender.clean), function(x) if(x %in% Female) "Female" else x)
Gender.clean <- sapply(as.vector(Gender.clean), function(x) if(x %in% TG) "Transgender" else x)
Gender.clean <- sapply(as.vector(Gender.clean), function(x) if(x %in% GQ) "Genderqueer" else x)
# Update the Gender column
mental.health$Gender <- Gender.clean
# Create the frequency table of gender
table(mental.health$Gender)
##
## Female Genderqueer Male Transgender
## 247 16 991 5
# convert gender back to factor
mental.health$Gender <- as.factor(mental.health$Gender)
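The recoding above can also be written as a single lookup table, which makes it easy to list any responses that were not assigned to a category. This is a minimal sketch on made-up responses, not the survey file; the names `raw`, `gender_lookup`, `normalised`, `recoded` and `unmatched` are hypothetical, and the lookup covers only a handful of spellings.

```r
# Toy responses (made up for illustration, not taken from the survey file)
raw <- c("Male", "m", "F", "female ", "Cis Male", "non-binary", "Trans woman", "xyz")

# Hypothetical lookup: names are normalised raw values, values are target categories
gender_lookup <- c(
  "male" = "Male", "m" = "Male", "cis male" = "Male",
  "female" = "Female", "f" = "Female",
  "non-binary" = "Genderqueer",
  "trans woman" = "Transgender"
)

# Normalise once: lower-case and strip surrounding whitespace
normalised <- trimws(tolower(raw))

# Look up every value; entries without a match come back as NA
recoded <- unname(gender_lookup[normalised])

# List anything the lookup did not cover so it can be reviewed by hand
unmatched <- unique(normalised[is.na(recoded)])
print(unmatched)

table(recoded, useNA = "ifany")
```

Trimming whitespace before the lookup means that entries such as 'female ' and 'male ' do not need their own rows, and the `unmatched` vector shows which responses still need a decision.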
Next, I will convert the comments variable to character. We also do not need the timestamp and state variables, so we will drop them.
# convert factor variables to character
mental.health$comments <- as.character(mental.health$comments)
#drop timestamp and State variables
# timestamp is the first column
mental.health <- mental.health[,-1]
# once timestamp is gone, State is the fourth column
mental.health <- mental.health[,-4]
If we check for missing values in the dataset, we observe that most of them are in the comments variable (1095 of 1259 rows). Missing values are also present in the work_interfere (264) and self_employed (18) variables. As these are categorical variables, it is difficult to impute the missing values. Hence, I will retain these variables until the exploratory data analysis phase and then decide whether to keep or drop them.
Table with count of missing values:
colSums(is.na(mental.health))
## Age Gender
## 0 0
## Country self_employed
## 0 18
## family_history treatment
## 0 0
## work_interfere no_employees
## 264 0
## remote_work tech_company
## 0 0
## benefits care_options
## 0 0
## wellness_program seek_help
## 0 0
## anonymity leave
## 0 0
## mental_health_consequence phys_health_consequence
## 0 0
## coworkers supervisor
## 0 0
## mental_health_interview phys_health_interview
## 0 0
## mental_vs_physical obs_consequence
## 0 0
## comments
## 1095
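Raw counts are easier to judge as a share of all rows: in the table above, 1095, 264 and 18 missing values out of 1259 rows correspond to roughly 87.0%, 21.0% and 1.4%. The following is a small sketch of how such a summary could be built, using a made-up data frame `toy` rather than the survey data; `missing_summary` is a hypothetical name.

```r
# Toy data frame (made up) with missing values in three categorical columns
toy <- data.frame(
  self_employed  = c("Yes", NA, "No", "No"),
  work_interfere = c(NA, NA, "Often", "Never"),
  comments       = c(NA, NA, NA, "text"),
  stringsAsFactors = TRUE
)

# Count missing values per column
n_missing <- colSums(is.na(toy))

# Combine counts and percentages in one table
missing_summary <- data.frame(
  variable    = names(n_missing),
  n_missing   = as.integer(n_missing),
  pct_missing = round(100 * as.numeric(n_missing) / nrow(toy), 1)
)

# Sort so that the most incomplete variables come first
missing_summary <- missing_summary[order(-missing_summary$n_missing), ]
missing_summary
```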
Preview top 50 rows of the clean dataset:
We will first visualize the age of survey participants. Before we go ahead with the graphs, we will check the data to see if we have any outliers or invalid values.
#Check the distribution of Ages
summary(mental.health$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.726e+03 2.700e+01 3.100e+01 7.943e+07 3.600e+01 1.000e+11
We see unusual values in the Age variable. There are a few observations with a negative age and one with an unusually high value. We will treat ages of 15 or below and ages above 100 as invalid, impute them with the median, and then categorize the age variable into classes.
mental.health[which(mental.health$Age <= 15), "Age"] <- median(mental.health$Age)
mental.health[which(mental.health$Age > 100), "Age"] <- median(mental.health$Age)
#Categorization of Age variable
mental.health$Age<-cut(mental.health$Age, c(0,20,35,65,100))
theme_set(theme_classic())
g <- ggplot(mental.health, aes(Age))
g + geom_bar(aes(fill=Age),width = 0.5) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Distribution of participant Age")
We see that most of the participants are in the age group of 20 to 35.
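One variation on the age cleaning above is to compute the median from the valid ages only and to give the age classes readable labels. This is a sketch on made-up ages, assuming the same validity rule (above 15 and at most 100) and the same break points; the vector `ages` and the label strings are hypothetical.

```r
# Toy ages (made up), including invalid entries similar to those in the survey
ages <- c(25, 31, -1726, 42, 5, 1e11, 36, 29, 67)

# Flag plausible ages: above 15 and at most 100
valid <- !is.na(ages) & ages > 15 & ages <= 100

# Median computed from the valid ages only
median_valid <- median(ages[valid])

# Replace invalid ages with that median
ages_clean <- ifelse(valid, ages, median_valid)

# Same break points as above, with readable labels instead of interval notation
age_group <- cut(ages_clean,
                 breaks = c(0, 20, 35, 65, 100),
                 labels = c("20 or under", "21-35", "36-65", "66-100"))

table(age_group)
```

Because the median is robust to a small number of extreme values, this is likely to give the same imputed value as the code above on the survey data; the difference only matters when invalid entries are numerous.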
Out of curiosity, let’s see where our participants are from.
freqtable <- table(mental.health$Country)
Country <- as.data.frame.table(freqtable)
world <-joinCountryData2Map(Country, joinCode="NAME", nameJoinColumn="Var1")
## 48 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 195 codes from the map weren't represented in your data
mapCountryData(world,nameColumnToPlot = "Freq",
mapTitle = "Distribution of Participants by country", colourPalette = "rainbow")
g2 <- ggplot(mental.health, aes(Country))
g2 + geom_bar(width = 0.5) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Countrywise Distribution of participants")
We see that most of the participants are from North America and Europe. Hence, the data is not representative of tech workers worldwide, and the results of the survey cannot be generalized to the entire population.
g1 <- ggplot(mental.health, aes(Age))
g1 + geom_bar(aes(fill=supervisor),width = 0.5) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Comfort with Supervisors according to Age")
g1 + geom_bar(aes(fill=coworkers),width = 0.5) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Comfort with coworkers according to Age")
According to the plots above, there are no obvious differences between age groups in the level of comfort in discussing mental health issues with either supervisors or coworkers. No formal test was performed, but this suggests that developing targeted mental health wellness outreach programs based on age may be unnecessary.
Here, we will check whether the size of the company is related to its willingness to formally discuss mental health issues.
size.company <- mental.health %>% filter(no_employees != "")
# quick summary
summary(size.company$wellness_program)
## Don't know No Yes
## 188 842 229
# fix empty level issue
size.company$wellness_program <- droplevels(size.company$wellness_program)
# group by variables of interest
size.company.grouped <- size.company %>% group_by(no_employees, wellness_program)
# calculate frequencies
forPlotting1 <- size.company.grouped %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n))
# plot relative frequencies with a stacked bar plot filled by wellness program response
gg3 <- ggplot(forPlotting1, aes(x = no_employees, y = freq, fill = wellness_program)) +
geom_bar(stat = "identity") +
ggtitle("Plot of Company size against Willingness to discuss mental health issues") +
xlab("Company size (number of employees)") +
ylab("Relative frequency") +
guides(fill=guide_legend(title=NULL)) +
theme_bw()
ggplotly(gg3)
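The relative frequencies used in the plot are proportions within each company size, so they sum to 1 for every size class. As a cross-check, the same quantities can be obtained in base R with `prop.table()`. This is a sketch on a made-up data frame `toy`, not the survey data.

```r
# Toy data (made up): company size and wellness programme response
toy <- data.frame(
  no_employees     = c("1-5", "1-5", "1-5", "1-5", "More than 1000", "More than 1000"),
  wellness_program = c("No", "No", "No", "Yes", "Yes", "No")
)

# Cross-tabulate counts: rows are company sizes, columns are responses
counts <- table(toy$no_employees, toy$wellness_program)

# margin = 1 divides each row by its row total,
# giving the share of each response within a company size
row_props <- prop.table(counts, margin = 1)

round(row_props, 2)
```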
We observe that large companies formally discuss mental health issues more. We will perform a chi-square test to verify this.
# Ho: Formal discussion of mental health (wellness_program) is independent of company size
# Ha: Formal discussion of mental health (wellness_program) is associated with company size
chisq.test(table(size.company$no_employees, size.company$wellness_program))
##
## Pearson's Chi-squared test
##
## data: table(size.company$no_employees, size.company$wellness_program)
## X-squared = 179.48, df = 10, p-value < 2.2e-16
Based on the test above, we have sufficient evidence to reject the null hypothesis that formal discussion of mental health issues and company size are independent. The test itself does not indicate the direction of the association, but together with the plot above the data suggest that larger companies (more than 500 employees) tend to have formal discussions/policies about mental health in place.
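With more than a thousand responses, even a weak association can produce a very small p-value, so it can help to check the expected cell counts and to report an effect size as well. The sketch below uses a made-up 3 x 2 table and computes Cramér's V, defined as sqrt(X-squared / (n * (min(rows, columns) - 1))). Cramér's V is a suggested addition, not part of the original analysis, and the counts are hypothetical.

```r
# Toy contingency table (made up): company size class by wellness programme response
counts <- matrix(c(30, 10,
                   20, 20,
                   10, 30),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(size = c("small", "medium", "large"),
                                 program = c("No", "Yes")))

test <- chisq.test(counts)

# The chi-square approximation is usually considered adequate
# when expected counts are not too small (a common rule of thumb is 5 or more)
min(test$expected)

# Cramér's V: 0 means no association, 1 means perfect association
n <- sum(counts)
k <- min(dim(counts)) - 1
cramers_v <- sqrt(unname(test$statistic) / (n * k))
cramers_v
```

On the survey data, the same calculation could be applied to `table(size.company$no_employees, size.company$wellness_program)`.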
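# revalue() on a factor only renames its levels, so convert to character before
# as.numeric(); otherwise the result is the level codes 1/2 instead of 0/1
mental.health$treatment <- as.numeric(as.character(revalue(mental.health$treatment,
c("No"="0", "Yes"="1"))))
# fun is the current name of the summary-function argument (fun.y before ggplot2 3.3.0)
ggplot(mental.health,aes(x=family_history,y=treatment,fill=Gender)) +
stat_summary(fun=mean,position=position_dodge(),geom="bar") +
labs(x = "Family History", y = "Willingness to seek treatment",
title = "Plot of Family history vs willingness to seek treatment")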
We see from the plot that participants who have a family history of mental illness are more likely to seek treatment as compared to folks who do not have a family history.
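The bar heights in this kind of plot are group means of a 0/1 variable, so they only read as proportions if "Yes" is really coded as 1 and "No" as 0. A point worth checking is that `as.numeric()` applied to a factor returns the internal level codes (1, 2, ...) rather than the labels. The sketch below shows the difference on made-up vectors; `treatment01` and `prop_by_history` are hypothetical names.

```r
# Toy factors (made up) standing in for the survey columns
treatment      <- factor(c("No", "Yes", "Yes", "No", "Yes"))
family_history <- factor(c("No", "No", "Yes", "Yes", "Yes"))

# as.numeric() on a factor gives the level codes, here 1 for "No" and 2 for "Yes"
codes <- as.numeric(treatment)

# A logical comparison gives a genuine 0/1 indicator
treatment01 <- as.integer(treatment == "Yes")

# The mean of a 0/1 indicator is the proportion of "Yes" answers
overall_prop <- mean(treatment01)

# Proportion of "Yes" within each family-history group
prop_by_history <- tapply(treatment01, family_history, mean)
prop_by_history
```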
#re-value the variables of interest
# revalue() keeps the original (alphabetical) level order, so the levels are set
# explicitly to have the bars sorted by company size
mental.health$no_employees <- factor(revalue(mental.health$no_employees,c("1-5"="1", "6-25"="2", "26-100"="3", "100-500"="4", "500-1000"="5", "More than 1000"="6")), levels = as.character(1:6))
# fun is the current name of the summary-function argument (fun.y before ggplot2 3.3.0)
ggplot(mental.health,aes(x=no_employees,y=treatment, fill=factor(tech_company))) +
stat_summary(fun=mean,position=position_dodge(),geom="bar") +
labs(x = "Number of employees", y = "Probability of mental health condition",
title = "Probability of mental health illness by workplace type and size") +
scale_x_discrete(labels=c("1" = "1-5", "2" = "6-25", "3" = "26-100", "4"="100-500", "5"="500-1000", "6"=">1000"))
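The graph suggests that the proportion of respondents who sought treatment, used here as a proxy for mental health illness, is similar in tech and non-tech companies across company sizes. No formal test was performed, so this should be read as a visual impression rather than a statistical result.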
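A comparison like this can be backed by a formal test of two proportions. The sketch below uses `prop.test()` on made-up counts (50 of 100 versus 45 of 100) purely to show the mechanics; the numbers are hypothetical and are not taken from the survey.

```r
# Hypothetical counts (made up): respondents who sought treatment, by workplace type
sought <- c(tech = 50, non_tech = 45)
total  <- c(tech = 100, non_tech = 100)

# Two-sample test of equal proportions (continuity correction is applied by default)
result <- prop.test(x = sought, n = total)

# Estimated proportions in each group
result$estimate

# A large p-value means the data are compatible with equal proportions;
# it does not prove that the proportions are equal
result$p.value
```

On the survey data, the inputs would be the number of "Yes" answers to `treatment` and the number of respondents within each level of `tech_company`.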
A cursory exploratory analysis of the data produced several interesting findings:
Most of the participants fall in the age category between 20 and 35, and a majority of them are from the United States. Hence this is a biased sample. A respondent’s age does not appear to have any influence on their level of comfort in discussing mental health issues with their supervisors or co-workers. However, participants who have a family history of mental illness are more likely to seek treatment than those who do not.
We also observe that larger companies tend to formally discuss mental health issues more, i.e. perhaps they have formal policies in place. Also, judging from the plot, the rate of mental health illness in tech companies does not appear to differ markedly from the rate in non-tech companies.