Homework2

Import Dataset

Importing the Mental Health Survey from the csv file

The dataset was retrieved from Kaggle.

This dataset contains responses from an anonymous 2014 survey on mental health among employees in the tech industry. It was conducted by OSMI (Open Sourcing Mental Illness).

library(tidyverse)  # provides read_csv(), the dplyr verbs, stringr, and ggplot2

survey_data <- read_csv("survey.csv")

## Rows: 1259 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (25): Gender, Country, state, self_employed, family_history, treatment,...
## dbl   (1): Age
## dttm  (1): Timestamp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(survey_data)

## # A tibble: 6 × 27
##   Timestamp             Age Gender Country    state self_employed family_history
##   <dttm>              <dbl> <chr>  <chr>      <chr> <chr>         <chr>         
## 1 2014-08-27 11:29:31    37 Female United St… IL    <NA>          No            
## 2 2014-08-27 11:29:37    44 M      United St… IN    <NA>          No            
## 3 2014-08-27 11:29:44    32 Male   Canada     <NA>  <NA>          No            
## 4 2014-08-27 11:29:46    31 Male   United Ki… <NA>  <NA>          Yes           
## 5 2014-08-27 11:30:22    31 Male   United St… TX    <NA>          No            
## 6 2014-08-27 11:31:22    33 Male   United St… TN    <NA>          Yes           
## # ℹ 20 more variables: treatment <chr>, work_interfere <chr>,
## #   no_employees <chr>, remote_work <chr>, tech_company <chr>, benefits <chr>,
## #   care_options <chr>, wellness_program <chr>, seek_help <chr>,
## #   anonymity <chr>, leave <chr>, mental_health_consequence <chr>,
## #   phys_health_consequence <chr>, coworkers <chr>, supervisor <chr>,
## #   mental_health_interview <chr>, phys_health_interview <chr>,
## #   mental_vs_physical <chr>, obs_consequence <chr>, comments <chr>

Cleaning the Data

## Replace NA in self_employed with No
survey_data <- survey_data %>%
  mutate(self_employed = replace(self_employed, 
                                 is.na(self_employed), "No")) %>%
  ## Mutate work_interfere for easier analysis
  mutate(work_interfere = if_else(work_interfere %in% c("Often", "Sometimes"), 
                               "Yes", "No"))
# Company sizes are grouped as small (startup), medium (growing company), and large (corporate).
# no_employees is stored as text ranges (e.g. "6-25"), so the labels are matched directly
# instead of compared with `<`; confirm the labels with unique(survey_data$no_employees).
survey_data <- survey_data %>%
  select(-Timestamp) %>% 
  mutate(family_history = family_history == "Yes") %>% 
  mutate(company_size = case_when(
    no_employees == "1-5" ~ "small",
    no_employees %in% c("6-25", "26-100") ~ "medium",
    TRUE ~ "large"))
# The columns were checked with unique values and I've organized the different answers into categories.
survey_data <- survey_data %>%   
  mutate(standardized_gender = 
           case_when(
             Gender %in% c('M', 'Male', 'male','Male ,', 'Man','m','Male ',"msle",'Cis Man', 'Malr','Mail', 'Man','Male-ish','maile','Mal','cis male','All','Cis Male','Male (CIS)', 'Make') ~ "Male",
             Gender %in% c('Female','Cis Female', 'Woman','Female ', 'f', 'Femake', 'woman','femail','female','Female (cis)','F') ~ "Female",  
             TRUE ~ "Other"
           )
  ) %>%
  ## Filter to ages that seem reasonable for a tech workplace survey
  filter(between(Age, 10, 95)) %>%
  ## work_interfere was already collapsed to "Yes"/"No" above, so it needs no further recoding
  mutate(standardized_gender = str_to_lower(standardized_gender))


table(survey_data$standardized_gender)

## 
## female   male  other 
##    246    987     19

I edited some columns so their formatting is consistent, then removed or standardized others to make later analysis easier.
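As a possible simplification of the gender cleaning, here is a minimal sketch that trims whitespace and lowercases the answers before matching, so fewer spelling variants need to be listed. The `toy` data and the `gender_key` column are hypothetical and only for illustration; the real column would still need its misspellings added to the lists.

```r
library(dplyr)
library(stringr)

# Toy responses (hypothetical, not taken from survey.csv)
toy <- tibble(Gender = c("Male ", "M", "female", "Cis Female", "Woman", "non-binary"))

toy_clean <- toy %>%
  mutate(
    # Trim and lowercase first so "Male " and "male" collapse to the same key
    gender_key = str_to_lower(str_trim(Gender)),
    standardized_gender = case_when(
      gender_key %in% c("m", "male", "man", "cis male", "cis man") ~ "male",
      gender_key %in% c("f", "female", "woman", "cis female") ~ "female",
      TRUE ~ "other"
    )
  )

count(toy_clean, standardized_gender)
```

Matching on a normalized key keeps the lists short, but any answer that is not listed still falls into "other", so the unique values should be checked after cleaning.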

Narrative of the data

The column variables are as follows for this dataset:

Age, Gender, Country

state: If you live in the United States, which state or territory do you live in?

self_employed: Are you self-employed?

family_history: Do you have a family history of mental illness?

treatment: Have you sought treatment for a mental health condition?

work_interfere: If you have a mental health condition, do you feel that it interferes with your work?

no_employees: How many employees does your company or organization have?

remote_work: Do you work remotely (outside of an office) at least 50% of the time?

tech_company: Is your employer primarily a tech company/organization?

benefits: Does your employer provide mental health benefits?

care_options: Do you know the options for mental health care your employer provides?

wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?

seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?

anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?

leave: How easy is it for you to take medical leave for a mental health condition?

mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?

phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences?

coworkers: Would you be willing to discuss a mental health issue with your coworkers?

supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?

mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview?

phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview?

mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health?

obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?

comments: Any additional notes or comments

These descriptions are taken from the Kaggle page the dataset was retrieved from.

Descriptive Statistics

# Age

unique(survey_data$Age)

##  [1] 37 44 32 31 33 35 39 42 23 29 36 27 46 41 34 30 40 38 50 24 18 28 26 22 19
## [26] 25 45 21 43 56 60 54 55 48 20 57 58 47 62 51 65 49 53 61 11 72

mean(survey_data$Age, na.rm = TRUE)

## [1] 32.0599

sd(survey_data$Age, na.rm = TRUE)

## [1] 7.309669

# Gender 
survey_data %>% 
  count(standardized_gender) %>%
  mutate(prop = n / sum(n))

## # A tibble: 3 × 3
##   standardized_gender     n   prop
##   <chr>               <int>  <dbl>
## 1 female                246 0.196 
## 2 male                  987 0.788 
## 3 other                  19 0.0152

# Self-Employed
mean(survey_data$self_employed == "Yes")

## [1] 0.1142173

# Treatment
mean(survey_data$treatment == "Yes")

## [1] 0.5047923

# Work Interference  
mean(survey_data$work_interfere == "Yes")

## [1] 0.4824281

# Gender bar plot
survey_data %>%
  ggplot(aes(x = standardized_gender)) +
  geom_bar()

# Age density plot
ggplot(data = survey_data, aes(x = Age)) +
  geom_density()

These statistics already show the average age in the sample (about 32), the dominant share of men (about 79%), and that only about 50% of respondents have sought treatment. It is also worth noting that about 48% of respondents report that a mental health condition sometimes or often interferes with their work. Because missing answers to that question were recoded as "No" during cleaning, this share is out of all respondents, not only those who answered. This is worth examining further.
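To see how much the treatment of missing answers matters for the work interference share, here is a small sketch comparing the two denominators. The `toy` data is hypothetical; on the real data this would have to run on the raw `work_interfere` column, before it is recoded to "Yes"/"No".

```r
library(dplyr)

# Toy answers (hypothetical); NA means the question was left blank
toy <- tibble(work_interfere = c("Often", "Sometimes", "Rarely", "Never", NA, NA))

shares <- toy %>%
  summarize(
    # Blank answers counted as "No", as in the cleaning step
    share_all = mean(work_interfere %in% c("Often", "Sometimes")),
    # Blank answers dropped from the denominator
    share_answered = mean(
      work_interfere[!is.na(work_interfere)] %in% c("Often", "Sometimes")
    )
  )

shares
```

If many respondents skipped the question, the two shares can differ noticeably, so it helps to state which one is being reported.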

Potential Research Questions

Questions on who is actually seeking treatment:

What proportion of tech workers report mental health issues or seeking treatment? How does this compare to other professions or the general population? How frequently do tech professionals experience work interference or negative consequences due to mental health issues?

Questions on demographics: Are younger tech workers more or less likely to seek treatment than older employees? Do negative consequences differ based on seniority, job role, or company size?

Questions on employer support: What workplace resources and policies are most positively associated with an employee's likelihood of disclosing and discussing mental health issues? Does organizational culture affect discussion of mental health? Do observed negative consequences differ between employees at tech companies perceived to be more supportive versus less supportive?

Questions on help-seeking: How comfortable are tech professionals discussing mental health issues in the workplace?

GGplot Visualization

ggplot(data = survey_data,  
       mapping = aes(x = treatment, fill = treatment)) +
  # show.legend = FALSE already hides the legend, so no theme() call is needed
  geom_bar(color = "black", show.legend = FALSE) +
  scale_fill_manual(values = c("#69b3a2", "#404080")) + 
  labs(
    title = "Respondents Who Have Sought Treatment",
    x = "Treatment",
    y = "Count"
  )

ggplot(data = filter(survey_data, !is.na(benefits)), 
       mapping = aes(x = benefits, fill = benefits)) +
  geom_bar(color="black") + 
  scale_fill_manual(values = c("#69b3a2", "#404080", "#bbbbbb")) +
  labs(
    title = "Mental Health Benefits Offered By Employer",
    x = "Benefits Offered",
    y = "Count",
    fill = "Legend"
  ) +
  theme(legend.position = "bottom")

The first plot uses the treatment variable, which can help address how many tech workers seek help for mental health and whether that differs from other industries. The split between yes and no is around 50/50, which is relatively low. The second plot, on whether a workplace offers mental health benefits, also shows a split of around 50/50. A possible next step is to explore whether these two variables go hand in hand.
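One way to start checking whether benefits and treatment go hand in hand is to compute the treatment rate within each benefits answer. This is only a sketch on hypothetical `toy` data; the name `treatment_by_benefits` is made up for illustration, and a difference in rates would not by itself show that benefits cause treatment seeking.

```r
library(dplyr)

# Toy data (hypothetical, not taken from survey.csv)
toy <- tibble(
  benefits  = c("Yes", "Yes", "Yes", "No", "No", "Don't know"),
  treatment = c("Yes", "Yes", "No",  "No", "Yes", "No")
)

treatment_by_benefits <- toy %>%
  group_by(benefits) %>%
  summarize(
    n = n(),                                   # respondents per benefits answer
    treatment_rate = mean(treatment == "Yes")  # share who sought treatment
  )

treatment_by_benefits
```

On the real data, the same pipeline could be run on `survey_data` after dropping missing `benefits` values, as in the second plot.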

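age_group_data <- survey_data %>%
  mutate(age_group = case_when(
    Age < 18 ~ "Under 18",  # the Age filter keeps ages from 10, so label these explicitly
    between(Age, 18, 25) ~ "18-25",
    between(Age, 26, 35) ~ "26-35",
    between(Age, 36, 45) ~ "36-45",
    between(Age, 46, 55) ~ "46-55",
    between(Age, 56, 65) ~ "56-65",
    TRUE ~ "66+"
  )) %>% 
  group_by(age_group) %>%
  summarize(
    # Proportion (not count) of each group that sought treatment,
    # so the bars share a scale with the overall rate drawn below
    mean_treatment = mean(treatment == "Yes", na.rm = TRUE),
    n = n()
  )

ggplot(age_group_data, aes(x = age_group, y = mean_treatment)) +
  geom_col() +
  # Dashed line: overall treatment rate across all respondents
  geom_hline(yintercept = mean(survey_data$treatment == "Yes"), 
             linetype = "dashed",
             color = "red")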

Some future improvements

Some questions remain open, such as how the age-based treatment rates compare by gender or company size, which I could explore in further analysis. Age group seems especially important, and a next step is to test whether the differences among groups are statistically significant.
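A rough sketch of the age-by-gender comparison, using hypothetical `toy` data with the same column names as the cleaned survey. It uses `cut()` instead of `case_when()` to build the age groups, which is a swapped-in alternative and not the approach used in the analysis above.

```r
library(dplyr)

# Toy data (hypothetical, not taken from survey.csv)
toy <- tibble(
  Age = c(22, 24, 30, 33, 41, 44),
  standardized_gender = c("female", "male", "female", "male", "male", "male"),
  treatment = c("Yes", "No", "Yes", "No", "Yes", "Yes")
)

rates <- toy %>%
  # Intervals are right-closed: (17, 25], (25, 35], (35, 45]
  mutate(age_group = cut(Age,
                         breaks = c(17, 25, 35, 45),
                         labels = c("18-25", "26-35", "36-45"))) %>%
  group_by(age_group, standardized_gender) %>%
  summarize(
    n = n(),
    treatment_rate = mean(treatment == "Yes"),
    .groups = "drop"
  )

rates
```

Keeping `n` next to each rate matters here, because some age and gender combinations will have very few respondents and their rates will be unstable.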

Captions, highlighting, and overall theme consistency can also be improved for the final project!