Data 607 Final: Remote Work & Mental Health

Introduction:

My project explores the relationship between work location (remote, onsite, hybrid) and various mental health indicators, such as stress levels, mental fatigue, and social isolation. By analyzing survey data from employees, the project aims to uncover how different work setups influence employee well-being. Chi-squared tests, one-way ANOVA, and logistic regression are used to evaluate these relationships and provide insights into the potential mental health challenges associated with different work environments.

I love working remotely and wanted to see if other people agree. Remote work has made me happier and allowed me to pursue hobbies outside of work during hours I would normally be commuting. This flexibility is one of the many reasons I became interested in studying data science, which offers remote and hybrid career opportunities.

Null Hypothesis (H₀):

There is no significant relationship between work location (onsite, remote, hybrid) and mental health indicators (stress level, work-life balance, mental fatigue score, burn rate, resource allocation, etc.).

Alternative Hypothesis (H₁):

Work location (onsite, remote, hybrid) significantly influences mental health indicators (stress level, work-life balance, mental fatigue score, burn rate, etc.).

Step 1: loading dataset 1

library(readr)

# Load data from GitHub into R
url <- "https://raw.githubusercontent.com/leslietavarez/remotework-mentalhealth/refs/heads/main/Impact_of_Remote_Work_on_Mental_Health%20(1).csv"
df <- read.csv(url)

# Check the data
head(df)

##   Employee_ID Age     Gender          Job_Role   Industry Years_of_Experience
## 1     EMP0001  32 Non-binary                HR Healthcare                  13
## 2     EMP0002  40     Female    Data Scientist         IT                   3
## 3     EMP0003  59 Non-binary Software Engineer  Education                  22
## 4     EMP0004  27       Male Software Engineer    Finance                  20
## 5     EMP0005  49       Male             Sales Consulting                  32
## 6     EMP0006  59 Non-binary             Sales         IT                  31
##   Work_Location Hours_Worked_Per_Week Number_of_Virtual_Meetings
## 1        Hybrid                    47                          7
## 2        Remote                    52                          4
## 3        Hybrid                    46                         11
## 4        Onsite                    32                          8
## 5        Onsite                    35                         12
## 6        Hybrid                    39                          3
##   Work_Life_Balance_Rating Stress_Level Mental_Health_Condition
## 1                        2       Medium              Depression
## 2                        1       Medium                 Anxiety
## 3                        5       Medium                 Anxiety
## 4                        4         High              Depression
## 5                        2         High                    None
## 6                        4         High                    None
##   Access_to_Mental_Health_Resources Productivity_Change Social_Isolation_Rating
## 1                                No            Decrease                       1
## 2                                No            Increase                       3
## 3                                No           No Change                       4
## 4                               Yes            Increase                       3
## 5                               Yes            Decrease                       3
## 6                                No            Increase                       5
##   Satisfaction_with_Remote_Work Company_Support_for_Remote_Work
## 1                   Unsatisfied                               1
## 2                     Satisfied                               2
## 3                   Unsatisfied                               5
## 4                   Unsatisfied                               3
## 5                   Unsatisfied                               3
## 6                   Unsatisfied                               1
##   Physical_Activity Sleep_Quality        Region
## 1            Weekly          Good        Europe
## 2            Weekly          Good          Asia
## 3              None          Poor North America
## 4              None          Poor        Europe
## 5            Weekly       Average North America
## 6              None       Average South America

Step 2: cleaning the data

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

colnames(df) <- tolower(colnames(df)) #change all column names to lowercase

df <- df %>%
  mutate(across(where(is.character), tolower)) #change all characters to lowercase

sum(is.na(df)) #check for missing values returned 0

## [1] 0

df <- distinct(df) #removes duplicates if any

head(df)

##   employee_id age     gender          job_role   industry years_of_experience
## 1     emp0001  32 non-binary                hr healthcare                  13
## 2     emp0002  40     female    data scientist         it                   3
## 3     emp0003  59 non-binary software engineer  education                  22
## 4     emp0004  27       male software engineer    finance                  20
## 5     emp0005  49       male             sales consulting                  32
## 6     emp0006  59 non-binary             sales         it                  31
##   work_location hours_worked_per_week number_of_virtual_meetings
## 1        hybrid                    47                          7
## 2        remote                    52                          4
## 3        hybrid                    46                         11
## 4        onsite                    32                          8
## 5        onsite                    35                         12
## 6        hybrid                    39                          3
##   work_life_balance_rating stress_level mental_health_condition
## 1                        2       medium              depression
## 2                        1       medium                 anxiety
## 3                        5       medium                 anxiety
## 4                        4         high              depression
## 5                        2         high                    none
## 6                        4         high                    none
##   access_to_mental_health_resources productivity_change social_isolation_rating
## 1                                no            decrease                       1
## 2                                no            increase                       3
## 3                                no           no change                       4
## 4                               yes            increase                       3
## 5                               yes            decrease                       3
## 6                                no            increase                       5
##   satisfaction_with_remote_work company_support_for_remote_work
## 1                   unsatisfied                               1
## 2                     satisfied                               2
## 3                   unsatisfied                               5
## 4                   unsatisfied                               3
## 5                   unsatisfied                               3
## 6                   unsatisfied                               1
##   physical_activity sleep_quality        region
## 1            weekly          good        europe
## 2            weekly          good          asia
## 3              none          poor north america
## 4              none          poor        europe
## 5            weekly       average north america
## 6              none       average south america

Chi-Squared Test (work_location vs. stress_level)

# Create contingency table
stress_table <- table(df$work_location, df$stress_level)

# Perform Chi-squared test
chisq_test <- chisq.test(stress_table)

# Output results
chisq_test

## 
##  Pearson's Chi-squared test
## 
## data:  stress_table
## X-squared = 1.9223, df = 4, p-value = 0.75

# Count of each stress level for each work location
table(df$work_location, df$stress_level)

##         
##          high low medium
##   hybrid  561 543    545
##   onsite  535 555    547
##   remote  590 547    577

# Proportions of stress levels by work location
prop_table <- prop.table(table(df$work_location, df$stress_level), 1)

# Print proportions
prop_table

##         
##               high       low    medium
##   hybrid 0.3402062 0.3292905 0.3305033
##   onsite 0.3268173 0.3390348 0.3341478
##   remote 0.3442240 0.3191365 0.3366394

library(ggplot2)
# Create bar plot
ggplot(df, aes(x = work_location, fill = stress_level)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Stress Level by Work Location",
    x = "Work Location",
    y = "Count",
    fill = "Stress Level"
  ) +
  theme_minimal()

Interpretation of Results From Chi-squared test output:

Chi-squared statistic: X-squared = 1.92

Degrees of freedom: df = 4

p-value = 0.75

H₀ (Null): There is no association between work location and stress level.

H₁ (Alternative): There is an association between work location and stress level.

Since p = 0.75 is much greater than the conventional threshold of 0.05, we fail to reject the null hypothesis: the data show no statistically significant association between work location and stress level.
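
To show where the statistic comes from, and to put a size on the association, the counts printed above can be re-entered by hand. The sketch below recomputes Pearson's chi-squared from the expected counts and adds Cramér's V as an effect size. Cramér's V is not part of the original analysis, and the names stress_counts, dof, and cramers_v are illustrative.

```r
# Counts copied from the work_location x stress_level table above
stress_counts <- matrix(
  c(561, 543, 545,
    535, 555, 547,
    590, 547, 577),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    work_location = c("hybrid", "onsite", "remote"),
    stress_level  = c("high", "low", "medium")
  )
)

n <- sum(stress_counts)

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(stress_counts), colSums(stress_counts)) / n

# Pearson's chi-squared statistic, degrees of freedom, and p-value
chi_sq  <- sum((stress_counts - expected)^2 / expected)
dof     <- (nrow(stress_counts) - 1) * (ncol(stress_counts) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Cramer's V (an addition, not in the original analysis):
# 0 means no association, 1 means perfect association
cramers_v <- sqrt(chi_sq / (n * (min(dim(stress_counts)) - 1)))

round(c(chi_sq = chi_sq, dof = dof, p_value = p_value, cramers_v = cramers_v), 4)
```

With these counts the statistic agrees with the value reported by chisq.test(), and Cramér's V comes out at roughly 0.01, so the association is negligible in size as well as non-significant.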

ANOVA (work life balance vs. work location)

##                 Df Sum Sq Mean Sq F value Pr(>F)
## work_location    2      5   2.348    1.18  0.307
## Residuals     4997   9941   1.989

F value = 1.18: This is the ratio of between-group variance to within-group variance (the mean square of 2.348 for work_location divided by 1.989 for the residuals).

p-value = 0.31: This is the probability of seeing an F value at least this large if the group means were truly equal.

Interpretation: If p<0.05, we would reject the null hypothesis. Here, 0.31 > 0.05, so there is no statistically significant difference in the work-life balance ratings across the three work_location groups.

Based on this ANOVA: The mean work-life balance rating does not differ significantly across the groups onsite, hybrid, and remote. This suggests that location (onsite, hybrid, or remote) has no statistically significant impact on how employees rate their work-life balance.
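
The call that produced the ANOVA table is not shown. Given the work_location and Residuals rows, it was presumably something like summary(aov(work_life_balance_rating ~ work_location, data = df)), but that is an assumption. The sketch below only re-derives the F value and p-value from the numbers printed in the table, which makes clear which column holds the F statistic.

```r
# Values copied from the ANOVA table above
ms_between <- 2.348   # Mean Sq for work_location
ms_within  <- 1.989   # Mean Sq for Residuals
df_between <- 2
df_within  <- 4997

# F is the ratio of the two mean squares
f_value <- ms_between / ms_within

# The upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(c(f_value = f_value, p_value = p_value), 3)
```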

Work Location vs. Mental Health Condition

# Create contingency table between work_location and mental_health_condition
mental_health_table <- table(df$work_location, df$mental_health_condition)

# View the table
print(mental_health_table)

##         
##          anxiety burnout depression none
##   hybrid     428     400        421  400
##   onsite     407     442        412  376
##   remote     443     438        413  420

# Perform Chi-squared test
chi_test <- chisq.test(mental_health_table)

# Display the test result
print(chi_test)

## 
##  Pearson's Chi-squared test
## 
## data:  mental_health_table
## X-squared = 4.5809, df = 6, p-value = 0.5986

# Calculate counts for mental health condition by work_location
mental_health_plot <- df %>%
  group_by(work_location, mental_health_condition) %>%
  summarize(count = n())

## `summarise()` has grouped output by 'work_location'. You can override using the
## `.groups` argument.

# Plot
ggplot(mental_health_plot, aes(x = work_location, y = count, fill = mental_health_condition)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Mental Health Conditions by Work Location",
    x = "Work Location",
    y = "Count",
    fill = "Mental Health Condition"
  ) +
  theme_minimal()

p-value = 0.6:

Since p > 0.05, we fail to reject the null hypothesis. This means there is no statistically significant association between work location and mental health conditions in this dataset.

Implications:

From this analysis, mental health condition (anxiety, burnout, depression, none) does not appear to be associated with the type of work location (hybrid, onsite, remote).
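
A non-significant overall test can still hide one unusual cell, so it can help to look at the residuals. The sketch below re-enters the printed counts and pulls the standardized residuals from chisq.test(); values beyond roughly ±2 would flag a cell that departs from independence. This check is an addition to the original analysis, and mh_counts, mh_test, and max_abs_resid are illustrative names.

```r
# Counts copied from the work_location x mental_health_condition table above
mh_counts <- matrix(
  c(428, 400, 421, 400,
    407, 442, 412, 376,
    443, 438, 413, 420),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    work_location = c("hybrid", "onsite", "remote"),
    condition     = c("anxiety", "burnout", "depression", "none")
  )
)

mh_test <- chisq.test(mh_counts)

# Standardized residuals: (observed - expected) scaled by their standard error
round(mh_test$stdres, 2)

# Largest departure from independence, in absolute terms
max_abs_resid <- max(abs(mh_test$stdres))
max_abs_resid
```

No cell exceeds 2 in absolute value here, which is consistent with the overall p-value of about 0.6.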

Work location vs Social Isolation

# Create contingency table
social_isolation_table <- table(df$work_location, df$social_isolation_rating)

# Perform Chi-squared test
social_isolation_chi2 <- chisq.test(social_isolation_table)

# Output results
social_isolation_chi2

## 
##  Pearson's Chi-squared test
## 
## data:  social_isolation_table
## X-squared = 2.73, df = 8, p-value = 0.9501

# Calculate proportions for social_isolation_rating by work_location
social_isolation_plot <- df %>%
  group_by(work_location, social_isolation_rating) %>%
  summarize(count = n()) %>%
  mutate(proportion = count / sum(count))

## `summarise()` has grouped output by 'work_location'. You can override using the
## `.groups` argument.

# Plot
ggplot(social_isolation_plot, aes(x = work_location, y = proportion, fill = as.factor(social_isolation_rating))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Social Isolation Ratings by Work Location",
    x = "Work Location",
    y = "Proportion",
    fill = "Social Isolation Rating"
  ) +
  theme_minimal()

The very high p-value (0.95) indicates that there is no statistically significant relationship between work location (hybrid, onsite, remote) and social isolation rating. This suggests that the social isolation ratings (ranging from 1–5) are distributed similarly across all work location categories, with no meaningful association between these two variables.

Load dataset 2

# Load data from GitHub into R
url2 <- "https://raw.githubusercontent.com/leslietavarez/remotework-mentalhealth/refs/heads/main/train.csv"
df2 <- read.csv(url2)

# Check the data
head(df2)

##                Employee.ID Date.of.Joining Gender Company.Type
## 1 fffe32003000360033003200      2008-09-30 Female      Service
## 2     fffe3700360033003500      2008-11-30   Male      Service
## 3 fffe31003300320037003900      2008-03-10 Female      Product
## 4 fffe32003400380032003900      2008-11-03   Male      Service
## 5 fffe31003900340031003600      2008-07-24 Female      Service
## 6     fffe3300350037003500      2008-11-26   Male      Product
##   WFH.Setup.Available Designation Resource.Allocation Mental.Fatigue.Score
## 1                  No           2                   3                  3.8
## 2                 Yes           1                   2                  5.0
## 3                 Yes           2                  NA                  5.8
## 4                 Yes           1                   1                  2.6
## 5                  No           3                   7                  6.9
## 6                 Yes           2                   4                  3.6
##   Burn.Rate
## 1      0.16
## 2      0.36
## 3      0.49
## 4      0.20
## 5      0.52
## 6      0.29

Clean dataset 2

colnames(df2) <- tolower(colnames(df2)) #change all column names to lowercase

df2 <- df2 %>%
  mutate(across(where(is.character), tolower)) #change all characters to lowercase

sum(is.na(df2)) #count missing values across all columns

## [1] 4622

df2 <- distinct(df2) #removes duplicates if any


colSums(is.na(df2))

##          employee.id      date.of.joining               gender 
##                    0                    0                    0 
##         company.type  wfh.setup.available          designation 
##                    0                    0                    0 
##  resource.allocation mental.fatigue.score            burn.rate 
##                 1381                 2117                 1124

library(naniar)

# Visualize missing values
vis_miss(df2)

df2_cleaned <- df2[complete.cases(df2), ]

# Visualize missing values again to make sure
vis_miss(df2_cleaned)

Logistic regression of dataset 2

# Convert the dependent variable to binary numeric
df2_cleaned$wfh.setup.available <- ifelse(df2_cleaned$wfh.setup.available == "yes", 1, 0)

# Fit a logistic regression model
model_logistic <- glm(wfh.setup.available ~ mental.fatigue.score + burn.rate + resource.allocation, 
                       family = binomial, 
                       data = df2_cleaned)

# Display the summary
summary(model_logistic)

## 
## Call:
## glm(formula = wfh.setup.available ~ mental.fatigue.score + burn.rate + 
##     resource.allocation, family = binomial, data = df2_cleaned)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           1.57023    0.05889  26.665  < 2e-16 ***
## mental.fatigue.score  0.11604    0.02506   4.631 3.64e-06 ***
## burn.rate            -3.72424    0.27947 -13.326  < 2e-16 ***
## resource.allocation  -0.08458    0.01452  -5.825 5.72e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25655  on 18589  degrees of freedom
## Residual deviance: 23763  on 18586  degrees of freedom
## AIC: 23771
## 
## Number of Fisher Scoring iterations: 4

library(ggplot2)
library(tidyr)

# Aggregate data for grouped bar plot
# Summarize the data
df_summary <- df2_cleaned %>%
  group_by(wfh.setup.available) %>%
  summarize(
    avg_mental_fatigue = mean(mental.fatigue.score, na.rm = TRUE),
    avg_burn_rate = mean(burn.rate, na.rm = TRUE),
    avg_resource_allocation = mean(resource.allocation, na.rm = TRUE)
  ) %>%
  gather(key = "Variable", value = "Value", avg_mental_fatigue, avg_burn_rate, avg_resource_allocation)

# Create grouped bar plot
ggplot(df_summary, aes(x = wfh.setup.available, y = Value, fill = Variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Comparison of Mental Fatigue, Burn Rate, and Resource Allocation by WFH Setup",
    x = "WFH Setup Availability",
    y = "Average Value",
    fill = "Variables"
  ) +
  theme_minimal()

Interpretation:

Intercept (1.5702): The baseline log-odds of having a WFH setup available when all predictors are zero.

Mental Fatigue Score (0.1160): Positive coefficient. Holding the other predictors constant, a higher mental fatigue score is associated with higher odds of having a WFH setup available. Highly significant (p < 0.001).

Burn Rate (-3.7242): Negative coefficient. Holding the other predictors constant, higher burn rate values are associated with lower odds of having a WFH setup available. Highly significant (p < 0.001).

Resource Allocation (-0.0846): Negative coefficient. Holding the other predictors constant, greater resource allocation is associated with lower odds of having a WFH setup available. Highly significant (p < 0.001).
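
Log-odds coefficients are hard to read directly, so converting them to odds ratios can help. The sketch below re-enters the estimates and standard errors from the summary above and exponentiates them, with Wald 95% intervals. With the fitted model in memory, exp(coef(model_logistic)) would give the same point estimates. The names est, se, odds_ratio, and or_table are illustrative.

```r
# Estimates and standard errors copied from summary(model_logistic)
est <- c(mental.fatigue.score = 0.11604,
         burn.rate            = -3.72424,
         resource.allocation  = -0.08458)
se  <- c(mental.fatigue.score = 0.02506,
         burn.rate            = 0.27947,
         resource.allocation  = 0.01452)

# Odds ratio per one-unit increase in each predictor, others held constant
odds_ratio <- exp(est)

# Wald 95% confidence interval on the odds-ratio scale
z <- qnorm(0.975)
or_table <- cbind(odds_ratio,
                  lower = exp(est - z * se),
                  upper = exp(est + z * se))

round(or_table, 3)
```

An odds ratio above 1 means higher odds of a WFH setup being available; below 1 means lower odds. The burn rate values in the rows shown sit between 0 and 1, so a one-unit increase spans the whole apparent scale, which is why its odds ratio looks so extreme.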

Conclusion:

My analysis aimed to explore the relationship between work location (onsite, remote, hybrid) and mental health indicators, including mental fatigue, stress levels, work-life balance, burnout, and resource allocation, to better understand how different work arrangements might impact employees’ mental well-being.

The findings from dataset 1 were not statistically significant: every test returned a p-value well above 0.05, so I failed to reject the null hypothesis. The findings from dataset 2 were statistically significant, with p-values consistently below the conventional threshold (0.05), indicating strong evidence against the null hypothesis. Dataset 2 records whether a WFH setup is available rather than work location itself, so this result suggests that WFH availability is associated with mental fatigue, burn rate, and resource allocation; it does not show that work location determines mental health outcomes. Because the two datasets disagree, I believe more research is needed to reach a more reliable conclusion.