Background

Tech’s Battle with Depression

The tech and startup world is built on the backs of incredibly bright minds. It’s known for its innovation, its resilience, and a culture that fosters high productivity.

But it has a dark underbelly.

Tech is a fast-paced game with high stakes: founders of startups have to transform an idea into a successful, scalable business, and quickly. They’re under intense pressure to run a successful company and stay on top of a competitive industry, all while maintaining the same image as the tech titans before them.

Employees of these companies operate under the same high stress: late nights, abnormal hours, and tight deadlines, all while wearing multiple hats and being available at any time of day.

The above isn’t unique to startups: the pressure to excel and climb the corporate ladder in the career world creates a culture that exacerbates mental health issues. It’s also important to remember that there is no fixed state: mental health ebbs and flows along a spectrum, just like our physical health, ranging from thriving to coping or struggling to clinically treated mental illness.

But the tech industry fosters a “crunch” culture, where demanding work must be completed in a short amount of time. This creates an incentive to neglect one’s health by forgoing proper diet, exercise, and sleep in the name of increased output. Left unchecked, this can lead to a rise in burnout, depression, anxiety, and loneliness.

The Growing Mental Health Crisis

Everyone around the world has mental health, but not everyone talks about it.

According to OSMI data, 51% of tech professionals have been diagnosed with a mental health condition. By comparison, 19.1% of U.S. adults experience mental illness, according to the National Alliance on Mental Illness.

A study by Michael Freeman found that entrepreneurs are 50% more likely to report having a mental health condition:

Founders are:

2x more likely to suffer from depression
6x more likely to suffer from ADHD
3x more likely to suffer from substance abuse
10x more likely to suffer from bipolar disorder
2x more likely to have a psychiatric hospitalization
2x more likely to have suicidal thoughts

The terrifying problem with mental illness is that it is invisible; it’s a private battle that people have, and it’s hard to know when people need help.

Problem Statement

Mental health includes our emotional, psychological, and social well-being. It affects how we think, feel, and act. It also helps determine how we handle stress, relate to others, and make choices. In the workplace, communication and inclusion are key skills for successful, high-performing teams and employees. For an organization, poor mental health can mean more days absent from work and lower productivity and engagement. In the United States, approximately 70% of adults with depression are in the workforce. Employees with depression miss an estimated 35 million workdays a year due to mental illness. Workers experiencing unresolved depression are estimated to see a 35% drop in their productivity, costing employers $105 billion each year. In the UK, better mental health support in the workplace can save businesses up to EUR 8 billion per year.

Project Idea

Open Sourcing Mental Illness (OSMI) is a non-profit corporation dedicated to raising awareness, educating, and providing resources to support mental wellness in the tech and open source communities. OSMI began in 2013, with Ed Finkler speaking at tech conferences about his personal experiences as a web developer and open source advocate with a mental health disorder. The response was overwhelming, and thus OSMI was born.

Every year, OSMI releases a new survey on how employees in tech companies around the world want to get mental health treatment. I picked the survey from 2014.

This survey was filled in by respondents in tech companies who suffer from mental health disorders (diagnosed or undiagnosed, even if it is just a feeling). We use it to see whether any factors affect whether an employee gets treatment.

From this research, we will build a machine learning model that helps HR see which factors the company needs to support so that employees are willing to get mental health treatment. We call it Mental Health First Aid.

Problem Scope

Mental Health First Aid teaches HR how to notice and support an individual who may be experiencing a mental health or substance use concern or crisis and connect them with the appropriate employee resources. It teaches employees critical communication and support skills that can influence your organization’s bottom line.

Research shows that employees who go through Mental Health First Aid have an increased awareness of mental health among themselves and their co-workers. It allows them to recognize the signs of someone who may be struggling and teaches them when to reach out and what resources are available. This in turn creates beneficial interventions that increase engagement and create an environment of inclusion and support.

Employers can also offer robust benefit packages to support employees who go through mental health issues. These include employee assistance programs, wellness programs that focus on mental and physical health, health and disability insurance, flexible working schedules, and time-off policies.

Organizations that incorporate mental health awareness help to create a healthy and productive work environment that reduces the stigma associated with mental illness, increases the organization’s mental health literacy, and teaches the skills to safely and responsibly respond to a co-worker’s mental health concern.

Incorporating mental health awareness in the workplace can help lead the way for mental health issues throughout your community by equipping people with the tools they need to start a dialogue so that more people can get the help they need.

Output

The output of this project is an R Shiny dashboard that combines the analysis with a machine learning prediction. The HR team can use this dashboard to predict whether an individual may be experiencing a mental health concern.

Business Impact

As mentioned in the problem statement, employees with depression miss an estimated 35 million workdays a year due to mental illness. Workers experiencing unresolved depression are estimated to see a 35% drop in their productivity, costing employers $105 billion each year. This is a huge financial loss for businesses.

If employers can address this issue, they can not only retain their employees, decrease turnover, and increase productivity, but also save a great deal of money.

Exploratory Data Analysis

Load Library

Before starting the analysis, we load the required libraries.

library(dplyr)
library(ggplot2)
library(plotly)
library(esquisse)

Load Data

Now we load the data for further analysis.

Data source : https://www.kaggle.com/osmi/mental-health-in-tech-survey

mental <- read.csv("survey.csv")
mental

Data Description

Below is a description of each column for our understanding:

Timestamp
Age
Gender
Country
state: If you live in the United States, which state or territory do you live in?
self_employed: Are you self-employed?
family_history: Do you have a family history of mental illness?
treatment: Have you sought treatment for a mental health condition?
work_interfere: If you have a mental health condition, do you feel that it interferes with your work?
no_employees: How many employees does your company or organization have?
remote_work: Do you work remotely (outside of an office) at least 50% of the time?
tech_company: Is your employer primarily a tech company/organization?
benefits: Does your employer provide mental health benefits?
care_options: Do you know the options for mental health care your employer provides?
wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?
seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?
anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
leave: How easy is it for you to take medical leave for a mental health condition?
mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?
phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences?
coworkers: Would you be willing to discuss a mental health issue with your coworkers?
supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview?
phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview?
mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health?
obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
comments: Any additional notes or comments

glimpse(mental) #Check Data Types

#> Rows: 1,259
#> Columns: 27
#> $ Timestamp                 <chr> "2014-08-27 11:29:31", "2014-08-27 11:29:37"~
#> $ Age                       <dbl> 37, 44, 32, 31, 31, 33, 35, 39, 42, 23, 31, ~
#> $ Gender                    <chr> "Female", "M", "Male", "Male", "Male", "Male~
#> $ Country                   <chr> "United States", "United States", "Canada", ~
#> $ state                     <chr> "IL", "IN", NA, NA, "TX", "TN", "MI", NA, "I~
#> $ self_employed             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
#> $ family_history            <chr> "No", "No", "No", "Yes", "No", "Yes", "Yes",~
#> $ treatment                 <chr> "Yes", "No", "No", "Yes", "No", "No", "Yes",~
#> $ work_interfere            <chr> "Often", "Rarely", "Rarely", "Often", "Never~
#> $ no_employees              <chr> "6-25", "More than 1000", "6-25", "26-100", ~
#> $ remote_work               <chr> "No", "No", "No", "No", "Yes", "No", "Yes", ~
#> $ tech_company              <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Ye~
#> $ benefits                  <chr> "Yes", "Don't know", "No", "No", "Yes", "Yes~
#> $ care_options              <chr> "Not sure", "No", "No", "Yes", "No", "Not su~
#> $ wellness_program          <chr> "No", "Don't know", "No", "No", "Don't know"~
#> $ seek_help                 <chr> "Yes", "Don't know", "No", "No", "Don't know~
#> $ anonymity                 <chr> "Yes", "Don't know", "Don't know", "No", "Do~
#> $ leave                     <chr> "Somewhat easy", "Don't know", "Somewhat dif~
#> $ mental_health_consequence <chr> "No", "Maybe", "No", "Yes", "No", "No", "May~
#> $ phys_health_consequence   <chr> "No", "No", "No", "Yes", "No", "No", "Maybe"~
#> $ coworkers                 <chr> "Some of them", "No", "Yes", "Some of them",~
#> $ supervisor                <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "No"~
#> $ mental_health_interview   <chr> "No", "No", "Yes", "Maybe", "Yes", "No", "No~
#> $ phys_health_interview     <chr> "Maybe", "No", "Yes", "Maybe", "Yes", "Maybe~
#> $ mental_vs_physical        <chr> "Yes", "Don't know", "No", "No", "Don't know~
#> $ obs_consequence           <chr> "No", "No", "No", "Yes", "No", "No", "No", "~
#> $ comments                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~

colSums(is.na(mental)) # Check missing values (NA) per column

#>                 Timestamp                       Age                    Gender 
#>                         0                         0                         0 
#>                   Country                     state             self_employed 
#>                         0                       515                        18 
#>            family_history                 treatment            work_interfere 
#>                         0                         0                       264 
#>              no_employees               remote_work              tech_company 
#>                         0                         0                         0 
#>                  benefits              care_options          wellness_program 
#>                         0                         0                         0 
#>                 seek_help                 anonymity                     leave 
#>                         0                         0                         0 
#> mental_health_consequence   phys_health_consequence                 coworkers 
#>                         0                         0                         0 
#>                supervisor   mental_health_interview     phys_health_interview 
#>                         0                         0                         0 
#>        mental_vs_physical           obs_consequence                  comments 
#>                         0                         0                      1095

Some information we get from the glimpse and the missing-value summary of the data:

There are 1,259 rows and 27 columns in the dataset.
Some of the columns are not stored in the correct data type. We will change these data types later.
The comments column contains the most missing values (1,095 of 1,259, about 87%), which makes sense because it was an optional text box, so it’s reasonable to expect that most respondents would leave it blank.
We will drop the Timestamp column because it only contains the date and time at which the respondent took the questionnaire, which is irrelevant for us.
The state and work_interfere columns also contain many missing values (515 and 264), and self_employed has 18. We’ll dig deeper into that.
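The share of missing values per column can be computed directly rather than estimated by eye. The following is a minimal sketch on a small made-up data frame; the `toy` object and its values are hypothetical, and on the real data one would pass `mental` instead.

```r
# Hypothetical toy data standing in for `mental`
toy <- data.frame(
  state    = c("IL", NA, NA, "TX"),
  comments = c(NA, NA, NA, "some text"),
  Age      = c(37, 44, 32, 31),
  stringsAsFactors = FALSE
)

# Percentage of missing values per column, sorted from most to least missing
na_share <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
round(na_share, 1)
```

Applied to the counts printed above, 1,095 missing comments out of 1,259 rows works out to roughly 87%.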

mental %>% count(Country) %>% arrange(desc(n))

Some notes on the data above:

It would be misleading to conclude that a certain country faces more problems with employee mental health, because around 60% of the respondents are from the US.
Moreover, many countries have only one respondent.
The Country column therefore adds little, so we will drop it.
A quick look at the state column shows that it applies only to respondents in the US, so we’ll drop it as well.
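To put a number on how concentrated the respondents are, the counts can be turned into shares. This is a sketch on hypothetical toy data (the `toy` data frame is made up for illustration); with the real data, `mental` would take its place.

```r
library(dplyr)

# Hypothetical toy data standing in for `mental`
toy <- data.frame(
  Country = c("United States", "United States", "United States",
              "Canada", "Germany"),
  stringsAsFactors = FALSE
)

country_share <- toy %>%
  count(Country, name = "n") %>%
  mutate(share = n / sum(n)) %>%   # proportion of all respondents
  arrange(desc(n))

country_share

# Number of countries represented by a single respondent
sum(country_share$n == 1)
```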

Data Cleaning

Drop Columns

As explained in the data summary above, we will drop the columns Timestamp, Country, state, and comments.

mental_clean <- mental %>%  select(-Timestamp,-Country,-state,-comments)
head(mental_clean)

Feature’s Value Checking

In this section, I clean up columns whose values are inconsistent or invalid.

Age

unique(mental_clean$Age)

#>  [1]          37          44          32          31          33          35
#>  [7]          39          42          23          29          36          27
#> [13]          46          41          34          30          40          38
#> [19]          50          24          18          28          26          22
#> [25]          19          25          45          21         -29          43
#> [31]          56          60          54         329          55 99999999999
#> [37]          48          20          57          58          47          62
#> [43]          51          65          49       -1726           5          53
#> [49]          61           8          11          -1          72

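Looking at the data above, some ages are negative, some are below 15, and some are implausibly large (for example 329 and 99999999999).
The ILO (International Labour Organization) has set a minimum age of 15 (fifteen) years, which applies in all sectors, so we use it as the lower bound for the Age column. We also drop ages of 100 and above.

# Keep only plausible working ages: at least 15 (ILO minimum) and below 100
mental_clean1 <- mental_clean %>%
  filter(Age >= 15,
         Age < 100)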

unique(mental_clean1$Age)

#>  [1] 37 44 32 31 33 35 39 42 23 29 36 27 46 41 34 30 40 38 50 24 18 28 26 22 19
#> [26] 25 45 21 43 56 60 54 55 48 20 57 58 47 62 51 65 49 53 61 72
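Before dropping rows, it can help to look at exactly which values fall outside the accepted range. The sketch below uses a hypothetical vector of ages (a few of the implausible values seen earlier mixed with valid ones) and assumes a lower bound of 15 and an upper bound of 100.

```r
# Hypothetical ages, including some implausible values
ages <- c(37, 44, -29, 329, 5, 18, 72, -1)

# Flag values inside the assumed plausible range [15, 100)
plausible <- ages >= 15 & ages < 100

ages[!plausible]   # values that would be dropped
sum(plausible)     # number of values that would be kept
```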

Gender

unique(mental_clean1$Gender)

#>  [1] "Female"                                        
#>  [2] "M"                                             
#>  [3] "Male"                                          
#>  [4] "male"                                          
#>  [5] "female"                                        
#>  [6] "m"                                             
#>  [7] "Male-ish"                                      
#>  [8] "maile"                                         
#>  [9] "Trans-female"                                  
#> [10] "Cis Female"                                    
#> [11] "F"                                             
#> [12] "something kinda male?"                         
#> [13] "Cis Male"                                      
#> [14] "Woman"                                         
#> [15] "f"                                             
#> [16] "Mal"                                           
#> [17] "Male (CIS)"                                    
#> [18] "queer/she/they"                                
#> [19] "non-binary"                                    
#> [20] "Femake"                                        
#> [21] "woman"                                         
#> [22] "Make"                                          
#> [23] "Nah"                                           
#> [24] "Enby"                                          
#> [25] "fluid"                                         
#> [26] "Genderqueer"                                   
#> [27] "Female "                                       
#> [28] "Androgyne"                                     
#> [29] "Agender"                                       
#> [30] "cis-female/femme"                              
#> [31] "Guy (-ish) ^_^"                                
#> [32] "male leaning androgynous"                      
#> [33] "Male "                                         
#> [34] "Man"                                           
#> [35] "Trans woman"                                   
#> [36] "msle"                                          
#> [37] "Neuter"                                        
#> [38] "Female (trans)"                                
#> [39] "queer"                                         
#> [40] "Female (cis)"                                  
#> [41] "Mail"                                          
#> [42] "cis male"                                      
#> [43] "Malr"                                          
#> [44] "femail"                                        
#> [45] "Cis Man"                                       
#> [46] "ostensibly male, unsure what that really means"

The Gender column has 46 distinct responses. I rename and combine responses that have the same meaning, which reduces the column to the following categories:

- Male: respondents who identify as male, including cis male.
- Female: respondents who identify as female, including cis female.
- Queer: a word that describes sexual and gender identities other than straight and cisgender. Lesbian, gay, bisexual, and transgender people may all identify with the word queer.

# Raw spellings grouped by the category they are mapped to
male_labels <- c("Male ", "male", "M", "m", "Male", "Cis Male", "Man",
                 "cis male", "Mail", "Male-ish", "Male (CIS)", "Cis Man",
                 "msle", "Malr", "Mal", "maile", "Make")

female_labels <- c("Female ", "female", "F", "f", "Woman", "Female", "femail",
                   "cis Female", "cis-female/femme", "Femake", "Female (cis)",
                   "Cis Female", "woman")

# "A little about you", "All" and "p" do not appear in the output above;
# they are kept from the original mapping
queer_labels <- c("Female (trans)", "queer/she/they", "non-binary", "fluid",
                  "queer", "Androgyne", "Trans-female",
                  "male leaning androgynous", "Agender", "A little about you",
                  "Nah", "All",
                  "ostensibly male, unsure what that really means",
                  "Genderqueer", "Enby", "p", "Neuter",
                  "something kinda male?", "Guy (-ish) ^_^", "Trans woman")

# Recode in one pass; any value not listed keeps its original spelling
mental_clean1 <- mental_clean1 %>%
  mutate(Gender = case_when(
    Gender %in% male_labels   ~ "Male",
    Gender %in% female_labels ~ "Female",
    Gender %in% queer_labels  ~ "Queer",
    TRUE                      ~ Gender
  ))

unique(mental_clean1$Gender)

#> [1] "Female" "Male"   "Queer"
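Many of the raw variants differ only in letter case or trailing spaces. One possible way to shrink the list before mapping, shown here as a sketch on a hypothetical sample of responses (this is not the approach used in this section), is to trim and lowercase the values first.

```r
# Hypothetical sample of raw responses
raw <- c("Male ", "male", "M", "Female", "female ", "F", "Woman")

# Trim whitespace and lowercase so that "Male ", "male" and "Male" collapse
norm <- tolower(trimws(raw))

unique(norm)
```

The remaining distinct values would still need to be mapped to the three categories, but the lookup lists become shorter.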

Self Employed & Work Interfere

We have missing (NA) values in the self_employed and work_interfere columns.

colSums(is.na(mental_clean1))

#>                       Age                    Gender             self_employed 
#>                         0                         0                        18 
#>            family_history                 treatment            work_interfere 
#>                         0                         0                       262 
#>              no_employees               remote_work              tech_company 
#>                         0                         0                         0 
#>                  benefits              care_options          wellness_program 
#>                         0                         0                         0 
#>                 seek_help                 anonymity                     leave 
#>                         0                         0                         0 
#> mental_health_consequence   phys_health_consequence                 coworkers 
#>                         0                         0                         0 
#>                supervisor   mental_health_interview     phys_health_interview 
#>                         0                         0                         0 
#>        mental_vs_physical           obs_consequence 
#>                         0                         0

Let us fill these missing values and make our data ready for further processing.

For work_interfere, let’s change NA to “Don’t Know”.
For self_employed, let’s change NA to “No” (not self-employed).

mental_clean2 <- mental_clean1 %>%  
                mutate(work_interfere=ifelse(is.na(work_interfere),"Don't Know",work_interfere),
                       self_employed=ifelse(is.na(self_employed),"No",self_employed)
                       )

colSums(is.na(mental_clean2))

#>                       Age                    Gender             self_employed 
#>                         0                         0                         0 
#>            family_history                 treatment            work_interfere 
#>                         0                         0                         0 
#>              no_employees               remote_work              tech_company 
#>                         0                         0                         0 
#>                  benefits              care_options          wellness_program 
#>                         0                         0                         0 
#>                 seek_help                 anonymity                     leave 
#>                         0                         0                         0 
#> mental_health_consequence   phys_health_consequence                 coworkers 
#>                         0                         0                         0 
#>                supervisor   mental_health_interview     phys_health_interview 
#>                         0                         0                         0 
#>        mental_vs_physical           obs_consequence 
#>                         0                         0
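The same replacement can also be written with `dplyr::coalesce()`, which returns the first non-missing value among its arguments. This is a sketch on hypothetical toy data (the `toy` data frame is made up), offered as an alternative spelling of the `ifelse(is.na(...))` pattern rather than a change to the analysis.

```r
library(dplyr)

# Hypothetical toy data standing in for `mental_clean1`
toy <- data.frame(
  work_interfere = c("Often", NA, "Never"),
  self_employed  = c(NA, "Yes", "No"),
  stringsAsFactors = FALSE
)

toy_filled <- toy %>%
  mutate(
    work_interfere = coalesce(work_interfere, "Don't Know"),  # NA -> "Don't Know"
    self_employed  = coalesce(self_employed, "No")            # NA -> "No"
  )

colSums(is.na(toy_filled))
```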

Data Type

Now that the cleaning is done, we convert the columns with incorrect data types to the correct ones.

mental_clean2 <- mental_clean2 %>%  
                mutate_if(is.character,as.factor)

glimpse(mental_clean2)

#> Rows: 1,251
#> Columns: 23
#> $ Age                       <dbl> 37, 44, 32, 31, 31, 33, 35, 39, 42, 23, 31, ~
#> $ Gender                    <fct> Female, Male, Male, Male, Male, Male, Female~
#> $ self_employed             <fct> No, No, No, No, No, No, No, No, No, No, No, ~
#> $ family_history            <fct> No, No, No, Yes, No, Yes, Yes, No, Yes, No, ~
#> $ treatment                 <fct> Yes, No, No, Yes, No, No, Yes, No, Yes, No, ~
#> $ work_interfere            <fct> Often, Rarely, Rarely, Often, Never, Sometim~
#> $ no_employees              <fct> 6-25, More than 1000, 6-25, 26-100, 100-500,~
#> $ remote_work               <fct> No, No, No, No, Yes, No, Yes, Yes, No, No, Y~
#> $ tech_company              <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, ~
#> $ benefits                  <fct> Yes, Don't know, No, No, Yes, Yes, No, No, Y~
#> $ care_options              <fct> Not sure, No, No, Yes, No, Not sure, No, Yes~
#> $ wellness_program          <fct> No, Don't know, No, No, Don't know, No, No, ~
#> $ seek_help                 <fct> Yes, Don't know, No, No, Don't know, Don't k~
#> $ anonymity                 <fct> Yes, Don't know, Don't know, No, Don't know,~
#> $ leave                     <fct> Somewhat easy, Don't know, Somewhat difficul~
#> $ mental_health_consequence <fct> No, Maybe, No, Yes, No, No, Maybe, No, Maybe~
#> $ phys_health_consequence   <fct> No, No, No, Yes, No, No, Maybe, No, No, No, ~
#> $ coworkers                 <fct> Some of them, No, Yes, Some of them, Some of~
#> $ supervisor                <fct> Yes, No, Yes, No, Yes, Yes, No, No, Yes, Yes~
#> $ mental_health_interview   <fct> No, No, Yes, Maybe, Yes, No, No, No, No, May~
#> $ phys_health_interview     <fct> Maybe, No, Yes, Maybe, Yes, Maybe, No, No, M~
#> $ mental_vs_physical        <fct> Yes, Don't know, No, No, Don't know, Don't k~
#> $ obs_consequence           <fct> No, No, No, Yes, No, No, No, No, No, No, No,~

mental_clean2
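Converting with `as.factor` orders the levels alphabetically, which also determines the order of bars in the plots. If a meaningful order is preferred, the levels can be set explicitly. The sketch below uses a hypothetical vector with the answer labels of work_interfere; the chosen ordering is an assumption, not something prescribed by the survey.

```r
# Hypothetical responses using the answer labels of work_interfere
work_interfere <- c("Often", "Rarely", "Never", "Sometimes", "Don't Know")

# Assumed ordering from least to most interference
wi_levels <- c("Don't Know", "Never", "Rarely", "Sometimes", "Often")
work_interfere <- factor(work_interfere, levels = wi_levels)

levels(work_interfere)
```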

Data Analysis

Target Data

Let us begin the data analysis by understanding the target data

plot1 <- ggplot(mental_clean2) +
  aes(x = treatment, fill = treatment) +
  geom_bar() +
  scale_fill_hue(direction = 1) +
  labs(
    x = "Treatment (Yes/No)",
    y = "Counts",
    title = "Do Respondents receive Treatments?"
  ) +
  theme_classic()


ggplotly(plot1)

This is the respondents’ answer to the question ‘Have you sought treatment for a mental health condition?’.

This is our target variable. Looking at the first graph, we see that almost 50% of respondents have sought treatment. Workplaces that promote mental health and support people with mental disorders are more likely to see increased productivity, reduced absenteeism, and the associated economic gains. Employees who enjoy good mental health can:

Be more productive
Take an active part in employee engagement activities and build better relationships, both at work and in their personal lives.
Be more joyous and make people around them happy.
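The balance of the target variable can also be read off as proportions instead of from the bar chart. A minimal sketch on a hypothetical vector of answers (on the real data, `mental_clean2$treatment` would be used):

```r
# Hypothetical answers standing in for `mental_clean2$treatment`
treatment <- factor(c("Yes", "No", "No", "Yes", "No", "Yes"))

# Share of each answer
shares <- prop.table(table(treatment))
round(shares, 2)
```

A roughly even split like this matters for the later modelling step, because a balanced target usually needs no resampling.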

Profiling Analysis

Age

Now let’s take a look at our respondents’ age distribution.

plot2 <- ggplot(mental_clean2) +
 aes(x = Age, colour = Age) +
 geom_histogram(bins = 50L, fill = "orange") +
 scale_color_distiller(palette = "PuBu", 
 direction = -1) +
 labs(title = "Age Distribution") +
 theme_classic()

ggplotly(plot2)

plot3 <- ggplot(mental_clean2) +
 aes(x = treatment, y = Age, fill = treatment) +
 geom_boxplot(shape = "circle") +
 scale_fill_hue(direction = 1) +
 labs(title = "Treatments with Age Distribution") +
 theme_classic()

ggplotly(plot3)

If we look at Plot 2 and Plot 3:

Most of the employees who filled in the survey are in their late 20s to early 40s. I assume that they are in mid- to senior-level positions. The distribution of ages is right-skewed, which is expected as the tech industry tends to have younger employees. According to an article that I read, the young (usually white, mostly male) faces of start-up founders like Mark Zuckerberg and other “tech bros” have become the symbol and stereotypical image that tends to represent the tech industry.
From the boxplot, there is no clear difference in age between respondents who sought treatment and those who did not. A boxplot alone cannot establish statistical significance, so this is a visual impression.
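To back the visual comparison of ages with a test, one option is a Wilcoxon rank-sum test, which does not assume normally distributed ages. The sketch below runs on hypothetical toy data; the `toy` data frame and its values are made up, and the choice of test is an assumption rather than part of the original analysis.

```r
# Hypothetical toy data standing in for `mental_clean2`
toy <- data.frame(
  Age       = c(25, 31, 38, 29, 42, 35, 27, 33, 40, 30),
  treatment = rep(c("Yes", "No"), times = 5),
  stringsAsFactors = FALSE
)

# Wilcoxon rank-sum test: do the age distributions of the two groups differ?
age_test <- wilcox.test(Age ~ treatment, data = toy)
age_test$p.value
```

A large p-value would be consistent with the impression from the boxplot, while a small one would suggest that the age distributions differ.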

Gender

Now we will take a look at Gender Distribution

plot4 <- ggplot(mental_clean2) +
 aes(x = Gender, fill = treatment) +
 geom_bar() +
 scale_fill_hue(direction = 1) +
  labs(title = "Treatments with Gender Distribution") +
 theme_classic()

ggplotly(plot4)

If we look at plot4 above, the majority of respondents are male, which is not surprising in the tech field. The very large gap between men and women may put women under higher competitive pressure than men. Based on the plot, the share of female respondents who have sought treatment is high, around 70%. Perhaps some of them experience sexual harassment or racism at work because women are scarce in the tech industry.

Queer respondents make up less than 2% of the entries. Although this share is very low, it can still provide some insight. For example, even such a small group shows a noticeable difference between the number who have sought treatment and the number who have not, indicating that mental health problems are serious for queer respondents too. In my opinion, they may have faced hate speech or discrimination in the workplace.
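Stacked counts make it hard to compare groups of very different sizes, so it can help to compute the share of "Yes" answers within each gender together with the group size. A sketch on hypothetical toy data (the values are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data standing in for `mental_clean2`
toy <- data.frame(
  Gender    = c("Male", "Male", "Male", "Male", "Female", "Female", "Queer"),
  treatment = c("Yes", "No", "No", "No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

gender_rate <- toy %>%
  group_by(Gender) %>%
  summarise(
    n         = n(),                      # group size
    share_yes = mean(treatment == "Yes")  # share who sought treatment
  )

gender_rate
```

Reporting `n` next to the share is a reminder that rates from very small groups are noisy.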

Family History of Mental Illness

plot5 <- ggplot(mental_clean2) +
 aes(x = family_history, fill = treatment) +
 geom_bar() +
 scale_fill_hue(direction = 1) +
 labs(title = "Family History with Illness") +
 theme_classic()

ggplotly(plot5)

Among respondents who say that they have a family history of mental illness, the plot shows that a much larger share have sought treatment than among those without a family history. This makes sense, given that people with a family history tend to pay more attention to mental illness. Family history is a significant risk factor for many mental health disorders. The apple does not fall far from the tree: it is relatively common for people with mental illness symptoms to have one or more relatives with histories of similar difficulties.
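The association between family history and treatment can be checked with a chi-squared test of independence on the two-way table. The counts below are hypothetical and chosen only to illustrate the call; they are not taken from the survey.

```r
# Hypothetical toy data: 40 respondents with a family history, 60 without
toy <- data.frame(
  family_history = rep(c("Yes", "No"), times = c(40, 60)),
  treatment      = c(rep(c("Yes", "No"), times = c(30, 10)),
                     rep(c("Yes", "No"), times = c(20, 40))),
  stringsAsFactors = FALSE
)

# Two-way table of counts, then a test of independence
tab <- table(toy$family_history, toy$treatment)
fh_test <- chisq.test(tab)

tab
fh_test$p.value
```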

Work Environment Analysis

Work Interfere

plot6 <- ggplot(mental_clean2) +
 aes(x = work_interfere, fill = treatment) +
 geom_bar() +
 scale_fill_hue(direction = 1) +
 labs(title = "Work Interfere Survey Respondents") +
 theme_classic()

ggplotly(plot6)

This plot shows the responses to the question, ‘If you have a mental health condition, do you feel that it interferes with your work?’. More than half of the respondents have experienced interference at work (rarely, sometimes, or often), and the majority of them have sought treatment. Surprisingly, even among respondents whose mental health has never interfered with work, a small group has still sought treatment, perhaps before the problem could turn into job stress. Job stress can be triggered when the requirements of the job do not match the capabilities, resources, or needs of the worker.

Working Style

plot7 <- ggplot(mental_clean2) +
 aes(x = remote_work, fill = treatment) +
 geom_bar() +
 scale_fill_hue(direction = 1) +
 labs(title = "Working Style (Remote or Not)") +
 theme_classic()

ggplotly(plot7)

The majority of respondents do not work remotely, which suggests that most of the mental health issues in this sample arise in the workplace. Within this group, the numbers of employees who have and have not sought treatment differ only slightly. It gets more interesting for respondents who work remotely at least 50% of the time: the share who have sought treatment is a little higher. The data does not say why these employees work remotely, so I cannot analyze this further.

Company Type

plot8 <- ggplot(mental_clean2) +
 aes(x = tech_company, fill = treatment) +
 geom_bar() +
 scale_fill_hue(direction = 1) +
 labs(title = "Company Type") +
 theme_classic()

ggplotly(plot8)

Although the survey mainly targets the tech field, a small number of respondents work for non-tech companies. The plot shows that mental health is a big problem whether or not the company belongs to the tech field. I think the work environment affects many employees, and some of them cannot simply shrug off things like abuse at the workplace.

However, I found that among employees of tech companies, those who have sought treatment are slightly outnumbered by those who have not. For non-tech companies it is the opposite. Maybe non-tech companies give employees more support to get treatment?

Coworkers & Supervisors

plot9 <- ggplot(mental_clean2) +
 aes(x = coworkers, fill = treatment) +
 geom_bar() +
 scale_fill_hue(direction = 1) +
 labs(title = "Coworkers of Survey Respondents") +
 theme_classic()

plot10 <- ggplot(mental_clean2) +
 aes(x = supervisor, fill = treatment) +
 geom_bar() +
 scale_fill_hue(direction = 1) +
 labs(title = "Supervisor of Survey Respondents") +
 theme_classic()

ggplotly(plot9)

ggplotly(plot10)

The first plot shows the responses to the question, ‘Would you be willing to discuss a mental health issue with your coworkers?’.

Among respondents who answered yes, around 60% have sought treatment. More than half of all respondents would discuss it with only some of their coworkers, and about half of that group have sought treatment. Let’s see whether respondents would discuss it with a supervisor.

The second plot shows the responses to the question, ‘Would you be willing to discuss a mental health issue with your direct supervisor(s)?’.

Among respondents who would discuss it with their supervisor, only 55% have sought treatment. Perhaps talking to someone in a higher position already provides some relief. This is the opposite of the pattern for coworkers.

Observed Consequence

plot11 <- ggplot(mental_clean2) +
 aes(x = obs_consequence, fill = treatment) +
 geom_bar() +
 scale_fill_hue(direction = 1) +
 labs(title = "Observed Consequence of Survey Respondents") +
 theme_classic()

ggplotly(plot11)

This plot shows the responses to the question, ‘Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?’. Among respondents who have observed negative consequences for coworkers with a mental health condition, almost 70% have sought treatment. Knowing about these negative consequences may be a strong trigger for someone to get treatment before their own mental health deteriorates.

Mental Health Facilities Analysis

Employer Benefits

plot12 <- ggplot(mental_clean2) +
 aes(x = benefits, fill = treatment) +
 geom_bar() +
 scale_fill_hue(direction = 1) +
 labs(title = "Employer Benefits Survey Respondents") +
 theme_classic()

ggplotly(plot12)

This plot shows the responses to the question, ‘Does your employer provide mental health benefits?’. Only around one third of respondents know about the mental health benefits their company provides. Among those who do, almost 60% have sought treatment. Surprisingly, some respondents who do not know about the benefits, or who say their company provides none, have still sought treatment. I assume that some companies cannot provide benefits properly because of budget constraints or financial difficulties.

Wellness Program

plot13 <- ggplot(mental_clean2) +
 aes(x = wellness_program, fill = treatment) +
 geom_bar() +
 scale_fill_hue(direction = 1) +
 labs(title = "Wellness Program Survey Respondents") +
 theme_classic()

ggplotly(plot13)

This plot shows the responses to the question, ‘Has your employer ever discussed mental health as part of an employee wellness program?’. Among respondents who answered yes, around 60% have sought treatment. I assume that employees whose employer discusses mental health in a wellness program feel more positive about seeking help.

The majority of respondents say that their company does not provide any wellness program. Yet about half of them have sought treatment, which suggests that companies should provide such programs soon. Following up on my earlier question about company benefits, I think budgeting is a plausible explanation. A wellness program costs a lot of money, all the more so when the company has many employees to take care of. For a small company, the budget is likely to be an even bigger struggle.

Anonymity

plot14 <- ggplot(mental_clean2) +
 aes(x = anonymity, fill = treatment) +
 geom_bar() +
 scale_fill_hue(direction = 1) +
 labs(title = "Anonymity Survey Respondents") +
 theme_classic()

ggplotly(plot14)

This plot shows the responses to the question, ‘Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?’. Around 30% of respondents say that their anonymity is protected when they use mental health or substance abuse treatment resources, and more than half of them have sought treatment. These employees feel that the company protects their privacy, which is a good way for a company to build trust with its employees and may make them more willing to seek treatment.

Exploratory Analysis Conclusion

By giving employees access to mental health benefits, a tech company can begin to create a culture of understanding and compassion. And having employees who feel cared for and happy isn’t just good, it’s good business.

Based on the respondent profiles, companies should know that gender and family history are strongly associated with whether an employee seeks treatment. A company that wants to provide more support should therefore assess its employees’ personalities, because different characters can have different needs. Age can also play a role: most respondents are young, so there is a good chance that they are open-minded about getting treatment.

Based on the respondents’ work environment, work interference is the factor most strongly associated with seeking treatment. This means companies should consider providing facilities that help employees cope with job stress. Some companies set up a private or silent room where employees who suddenly feel stressed can take a moment to recover.

Based on the mental health facilities available to respondents, companies need to provide good benefits so that employees can maintain their mental health. If a company does not have the resources for this, many third parties can be hired to run a wellness program on its behalf. Building trust, for example by keeping private who is receiving treatment, can also encourage employees to seek treatment.

With the EDA done, the next step is to build the machine learning app using R Shiny. The details are as follows:

Machine Learning: Supervised learning - Classification. I will try several basic and ensemble models to predict the target.

Basic models:

Logistic Regression (logreg)
Decision Tree Classifier (tree)
K-Nearest Neighbor (knn)

Ensemble models:

Random Forest Classifier (rf)

Target Variable :

treatment: Have you sought treatment for a mental health condition?

Predictor Variable :

Age
Gender
self_employed: Are you self-employed?
family_history: Do you have a family history of mental illness?
work_interfere: If you have a mental health condition, do you feel that it interferes with your work?
no_employees: How many employees does your company or organization have?
remote_work: Do you work remotely (outside of an office) at least 50% of the time?
tech_company: Is your employer primarily a tech company/organization?
benefits: Does your employer provide mental health benefits?
care_options: Do you know the options for mental health care your employer provides?
wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?
seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?
anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
leave: How easy is it for you to take medical leave for a mental health condition?
mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?
phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences?
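coworkers: Would you be willing to discuss a mental health issue with your coworkers?
supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview?
phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview?
mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health?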
obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?

Modelling

Logistic Regression

Check Proportion of the target variable

prop.table(table(mental_clean2$treatment))

#> 
#>        No       Yes 
#> 0.4948042 0.5051958

Cross validation (train-test split)

RNGkind(sample.kind = "Rounding")
 
set.seed(901)

index <- sample(nrow(mental_clean2), 
                nrow(mental_clean2) *0.8) 

mental_train <- mental_clean2[index, ] 
mental_test <- mental_clean2[-index, ]

prop.table(table(mental_train$treatment))

#> 
#>    No   Yes 
#> 0.492 0.508

prop.table(table(mental_test$treatment))

#> 
#>        No       Yes 
#> 0.5059761 0.4940239
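The random split above happens to keep the two classes close to balanced in both sets. If it had not, a stratified split would enforce similar class proportions by sampling within each class. A minimal base-R sketch, where `stratified_index` is a hypothetical helper and `toy` is made-up data rather than the survey:

```r
# Hypothetical helper: sample a fraction `p` of the row indices within each class of `y`.
stratified_index <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class,
                   function(i) i[sample.int(length(i), floor(length(i) * p))])
  sort(unlist(picked, use.names = FALSE))
}

set.seed(901)
# Made-up target with a 40/60 class split.
toy <- data.frame(treatment = factor(rep(c("No", "Yes"), times = c(40, 60))))

index     <- stratified_index(toy$treatment, p = 0.8)
toy_train <- toy[index, , drop = FALSE]
toy_test  <- toy[-index, , drop = FALSE]

prop.table(table(toy_train$treatment))  # same class shares as the full toy data
```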

Model Fitting

set.seed(901)
model_mental1 <- glm(treatment ~ ., data = mental_train, family = "binomial")
summary(model_mental1)

#> 
#> Call:
#> glm(formula = treatment ~ ., family = "binomial", data = mental_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.6042  -0.3345   0.1470   0.5657   3.0040  
#> 
#> Coefficients:
#>                               Estimate Std. Error z value             Pr(>|z|)
#> (Intercept)                  -6.270245   1.050890  -5.967  0.00000000242243824
#> Age                           0.030615   0.014844   2.062              0.03916
#> GenderMale                   -0.625525   0.264289  -2.367              0.01794
#> GenderQueer                  -0.106362   0.752660  -0.141              0.88762
#> self_employedYes             -0.269167   0.371419  -0.725              0.46864
#> family_historyYes             0.994806   0.207528   4.794  0.00000163821341525
#> work_interfereNever           2.527367   0.641595   3.939  0.00008175598010119
#> work_interfereOften           6.269190   0.677950   9.247 < 0.0000000000000002
#> work_interfereRarely          4.979120   0.634174   7.851  0.00000000000000412
#> work_interfereSometimes       5.629618   0.621845   9.053 < 0.0000000000000002
#> no_employees100-500           0.370461   0.447437   0.828              0.40769
#> no_employees26-100            0.235395   0.396009   0.594              0.55223
#> no_employees500-1000          0.240185   0.601238   0.399              0.68954
#> no_employees6-25              0.095547   0.368728   0.259              0.79554
#> no_employeesMore than 1000   -0.107364   0.434280  -0.247              0.80474
#> remote_workYes               -0.105585   0.235832  -0.448              0.65436
#> tech_companyYes               0.026521   0.264340   0.100              0.92008
#> benefitsNo                    0.246491   0.305606   0.807              0.41992
#> benefitsYes                   0.453807   0.297866   1.524              0.12763
#> care_optionsNot sure         -0.127078   0.275436  -0.461              0.64453
#> care_optionsYes               0.775483   0.270404   2.868              0.00413
#> wellness_programNo            0.041164   0.343696   0.120              0.90467
#> wellness_programYes          -0.002521   0.415695  -0.006              0.99516
#> seek_helpNo                  -0.641289   0.296784  -2.161              0.03071
#> seek_helpYes                 -0.885077   0.372569  -2.376              0.01752
#> anonymityNo                  -0.118298   0.455709  -0.260              0.79518
#> anonymityYes                  0.545620   0.263178   2.073              0.03815
#> leaveSomewhat difficult       0.311562   0.352825   0.883              0.37721
#> leaveSomewhat easy           -0.541614   0.262469  -2.064              0.03906
#> leaveVery difficult          -0.168617   0.387557  -0.435              0.66351
#> leaveVery easy                0.193794   0.336972   0.575              0.56522
#> mental_health_consequenceNo  -0.157437   0.280416  -0.561              0.57450
#> mental_health_consequenceYes -0.249186   0.285645  -0.872              0.38301
#> phys_health_consequenceNo     0.169670   0.263001   0.645              0.51884
#> phys_health_consequenceYes   -0.009532   0.473185  -0.020              0.98393
#> coworkersSome of them         0.437436   0.272071   1.608              0.10788
#> coworkersYes                  1.089209   0.408831   2.664              0.00772
#> supervisorSome of them       -0.389383   0.274613  -1.418              0.15621
#> supervisorYes                -0.250359   0.324283  -0.772              0.44009
#> mental_health_interviewNo     0.533873   0.337731   1.581              0.11393
#> mental_health_interviewYes    0.681839   0.712726   0.957              0.33874
#> phys_health_interviewNo       0.210555   0.232338   0.906              0.36481
#> phys_health_interviewYes      0.722901   0.331206   2.183              0.02906
#> mental_vs_physicalNo         -0.061048   0.255130  -0.239              0.81089
#> mental_vs_physicalYes         0.026089   0.278549   0.094              0.92538
#> obs_consequenceYes            0.310931   0.290390   1.071              0.28429
#>                                 
#> (Intercept)                  ***
#> Age                          *  
#> GenderMale                   *  
#> GenderQueer                     
#> self_employedYes                
#> family_historyYes            ***
#> work_interfereNever          ***
#> work_interfereOften          ***
#> work_interfereRarely         ***
#> work_interfereSometimes      ***
#> no_employees100-500             
#> no_employees26-100              
#> no_employees500-1000            
#> no_employees6-25                
#> no_employeesMore than 1000      
#> remote_workYes                  
#> tech_companyYes                 
#> benefitsNo                      
#> benefitsYes                     
#> care_optionsNot sure            
#> care_optionsYes              ** 
#> wellness_programNo              
#> wellness_programYes             
#> seek_helpNo                  *  
#> seek_helpYes                 *  
#> anonymityNo                     
#> anonymityYes                 *  
#> leaveSomewhat difficult         
#> leaveSomewhat easy           *  
#> leaveVery difficult             
#> leaveVery easy                  
#> mental_health_consequenceNo     
#> mental_health_consequenceYes    
#> phys_health_consequenceNo       
#> phys_health_consequenceYes      
#> coworkersSome of them           
#> coworkersYes                 ** 
#> supervisorSome of them          
#> supervisorYes                   
#> mental_health_interviewNo       
#> mental_health_interviewYes      
#> phys_health_interviewNo         
#> phys_health_interviewYes     *  
#> mental_vs_physicalNo            
#> mental_vs_physicalYes           
#> obs_consequenceYes              
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1386.04  on 999  degrees of freedom
#> Residual deviance:  694.46  on 954  degrees of freedom
#> AIC: 786.46
#> 
#> Number of Fisher Scoring iterations: 7
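The estimates above are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio. The sketch below copies three estimates from the summary; with the fitted object available, `exp(coef(model_mental1))` would do the same for every term.

```r
# Estimates copied from the summary output above (log-odds scale).
log_odds <- c(family_historyYes = 0.994806,
              GenderMale        = -0.625525,
              care_optionsYes   = 0.775483)

# exp() turns a log-odds coefficient into an odds ratio relative to the reference level.
odds_ratio <- exp(log_odds)
round(odds_ratio, 2)
```

Read this way, a family history of mental illness multiplies the odds of having sought treatment by roughly 2.7, holding the other predictors fixed, while being male roughly halves the odds relative to the reference gender level.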

library(car)
vif(model_mental1)

#>                               GVIF Df GVIF^(1/(2*Df))
#> Age                       1.265467  1        1.124930
#> Gender                    1.217911  2        1.050519
#> self_employed             1.748977  1        1.322489
#> family_history            1.096573  1        1.047174
#> work_interfere            1.424562  4        1.045226
#> no_employees              3.101563  5        1.119845
#> remote_work               1.242115  1        1.114502
#> tech_company              1.156530  1        1.075421
#> benefits                  2.918420  2        1.307034
#> care_options              1.998940  2        1.189050
#> wellness_program          2.742849  2        1.286917
#> seek_help                 3.384830  2        1.356389
#> anonymity                 1.671851  2        1.137102
#> leave                     2.048889  4        1.093805
#> mental_health_consequence 2.692632  2        1.280986
#> phys_health_consequence   1.738562  2        1.148279
#> coworkers                 1.948017  2        1.181403
#> supervisor                2.339783  2        1.236784
#> mental_health_interview   1.825899  2        1.162436
#> phys_health_interview     1.643544  2        1.132258
#> mental_vs_physical        1.871099  2        1.169564
#> obs_consequence           1.220991  1        1.104984

No multicollinearity is detected: every GVIF is below 10.
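For factors with more than one degree of freedom, the last column, GVIF^(1/(2*Df)), is the dimension-adjusted quantity, and squaring it puts it on roughly the same scale as an ordinary VIF. The worked check below uses two rows copied from the table; comparing the squared value against 10 is a rule-of-thumb assumption, not something the `car` package prescribes.

```r
# GVIF and Df copied from the vif() output above.
gvif <- c(no_employees = 3.101563, seek_help = 3.384830)
df   <- c(no_employees = 5,        seek_help = 2)

# Adjusted value reported in the last column, and its square (comparable to an ordinary VIF).
adjusted    <- gvif^(1 / (2 * df))
adjusted_sq <- adjusted^2

round(adjusted, 6)
all(adjusted_sq < 10)
```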

#linearity check

data.frame(prediction=model_mental1$fitted.values,
     error=model_mental1$residuals) %>% 
  ggplot(aes(prediction,error)) +
  geom_hline(yintercept=0) +
  geom_point() +
  geom_smooth() +
  theme_bw()

saveRDS(model_mental1, "model_logreg.RDS")

Prediction

mental_test$pred_result <- predict(object = model_mental1, 
        newdata = mental_test, 
        type = "response")

mental_test$pred_label <- ifelse(mental_test$pred_result < 0.5 ,"No", "Yes")
mental_test$pred_label <- as.factor(mental_test$pred_label)
head(mental_test)

library(caret)

confusionMatrix(mental_test$pred_label, mental_test$treatment, positive = "Yes")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No   97   9
#>        Yes  30 115
#>                                                
#>                Accuracy : 0.8446               
#>                  95% CI : (0.7938, 0.8871)     
#>     No Information Rate : 0.506                
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.6898               
#>                                                
#>  Mcnemar's Test P-Value : 0.001362             
#>                                                
#>             Sensitivity : 0.9274               
#>             Specificity : 0.7638               
#>          Pos Pred Value : 0.7931               
#>          Neg Pred Value : 0.9151               
#>              Prevalence : 0.4940               
#>          Detection Rate : 0.4582               
#>    Detection Prevalence : 0.5777               
#>       Balanced Accuracy : 0.8456               
#>                                                
#>        'Positive' Class : Yes                  
#>
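The headline statistics follow directly from the four cells of the confusion matrix. Recomputing them by hand, with "Yes" as the positive class, is a quick sanity check on how to read the output:

```r
# Cells copied from the logistic regression confusion matrix above.
tp <- 115  # predicted Yes, actual Yes
fn <- 9    # predicted No,  actual Yes
fp <- 30   # predicted Yes, actual No
tn <- 97   # predicted No,  actual No

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of actual Yes that the model catches
specificity <- tn / (tn + fp)   # share of actual No that the model catches
precision   <- tp / (tp + fp)   # "Pos Pred Value" in the caret output

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```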

Decision Tree

Model Fitting

library(partykit)
set.seed(901)

model_dt <- ctree(treatment ~ ., mental_train)

Prediction

pred_dt <- predict(model_dt, newdata = mental_test, type = "response")

confusionMatrix(pred_dt, mental_test$treatment, positive = "Yes")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No   86   8
#>        Yes  41 116
#>                                                
#>                Accuracy : 0.8048               
#>                  95% CI : (0.7503, 0.8519)     
#>     No Information Rate : 0.506                
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.6107               
#>                                                
#>  Mcnemar's Test P-Value : 0.000004844          
#>                                                
#>             Sensitivity : 0.9355               
#>             Specificity : 0.6772               
#>          Pos Pred Value : 0.7389               
#>          Neg Pred Value : 0.9149               
#>              Prevalence : 0.4940               
#>          Detection Rate : 0.4622               
#>    Detection Prevalence : 0.6255               
#>       Balanced Accuracy : 0.8063               
#>                                                
#>        'Positive' Class : Yes                  
#>

plot(model_dt, type="simple")

Random Forest

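The random forest is trained using 5-fold cross-validation with 3 repeats. The training code is commented out because the fitted model was saved to disk and is loaded from the RDS file instead.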

#set.seed(901)
 
#ctrl <- trainControl(method = "repeatedcv",
#                      number = 5,
#                      repeats = 3) 
 
#model_forest <- train(treatment ~ .,
#                    data = mental_train,
#                    method = "rf", 
#                    trControl = ctrl)
 
#saveRDS(model_forest, "model_forest_update.RDS")

model_rf <- readRDS("model_forest_update.RDS")
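Repeated 5-fold cross-validation with 3 repeats means the training rows are shuffled into 5 folds three separate times, giving 15 fit-and-evaluate rounds. The base-R sketch below only builds the fold assignments to show the mechanics; it is not how `caret` creates its resamples internally, and `make_folds` is a hypothetical helper.

```r
# Hypothetical helper: randomly assign n rows to k roughly equal folds.
make_folds <- function(n, k = 5) {
  sample(rep(seq_len(k), length.out = n))
}

set.seed(901)
n_train <- 1000   # size of mental_train, implied by the 999 null degrees of freedom
repeats <- 3

# One column of fold labels per repeat.
folds <- replicate(repeats, make_folds(n_train, k = 5))

dim(folds)           # rows x repeats
table(folds[, 1])    # fold sizes in the first repeat
```

In each round, one fold is held out for evaluation and the model is fitted on the other four.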

Prediction

pred_rf <- predict(model_rf, mental_test, type = "raw")

confusionMatrix(pred_rf, mental_test$treatment, positive = "Yes")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  No Yes
#>        No   94  13
#>        Yes  33 111
#>                                                
#>                Accuracy : 0.8167               
#>                  95% CI : (0.7632, 0.8626)     
#>     No Information Rate : 0.506                
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.6341               
#>                                                
#>  Mcnemar's Test P-Value : 0.005088             
#>                                                
#>             Sensitivity : 0.8952               
#>             Specificity : 0.7402               
#>          Pos Pred Value : 0.7708               
#>          Neg Pred Value : 0.8785               
#>              Prevalence : 0.4940               
#>          Detection Rate : 0.4422               
#>    Detection Prevalence : 0.5737               
#>       Balanced Accuracy : 0.8177               
#>                                                
#>        'Positive' Class : Yes                  
#>

Comparing the three models on the test set, logistic regression gives the best result, with an accuracy of 0.8446 versus 0.8167 for the random forest and 0.8048 for the decision tree.
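To make the comparison explicit, the three test-set confusion matrices can be put side by side. The sketch below re-enters the cell counts reported above; `metrics` is a hypothetical helper, not part of `caret`.

```r
# Hypothetical helper: accuracy, sensitivity and specificity from confusion-matrix cells.
metrics <- function(tn, fn, fp, tp) {
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Cell counts copied from the three confusion matrices above (positive class = "Yes").
comparison <- rbind(
  logreg = metrics(tn = 97, fn = 9,  fp = 30, tp = 115),
  tree   = metrics(tn = 86, fn = 8,  fp = 41, tp = 116),
  rf     = metrics(tn = 94, fn = 13, fp = 33, tp = 111)
)

round(comparison, 4)
rownames(comparison)[which.max(comparison[, "accuracy"])]
```

Logistic regression leads on accuracy and specificity, while the decision tree has slightly higher sensitivity (0.9355 versus 0.9274), which may matter if missing someone who needs treatment is the costlier error.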

Mental Health First Aid : Predicting Mental Health in Tech Industry using Machine Learning

By : Syabaruddin Malik
