My project explores the relationship between work location (remote, onsite, hybrid) and mental health indicators such as stress level, mental fatigue, and social isolation. By analyzing survey data from employees, it aims to uncover how different work setups relate to employee well-being. Chi-squared tests, a one-way ANOVA, and logistic regression are used to evaluate these relationships and to highlight the potential mental health challenges associated with different work environments.
I love working remotely and wanted to see whether the data tell the same story for other people. Remote work has made me happier and lets me pursue hobbies during the hours I would otherwise spend commuting. This flexibility is one of the many reasons I became interested in studying data science, a field that offers remote and hybrid career opportunities.
Null hypothesis (H₀): There is no significant relationship between work location (onsite, remote, hybrid) and mental health indicators (stress level, work-life balance, mental fatigue score, burn rate, resource allocation, etc.).
Alternative hypothesis (H₁): Work location (onsite, remote, hybrid) is significantly related to mental health indicators (stress level, work-life balance, mental fatigue score, burn rate, resource allocation, etc.).
library(readr)
# Load data from GitHub into R
url <- "https://raw.githubusercontent.com/leslietavarez/remotework-mentalhealth/refs/heads/main/Impact_of_Remote_Work_on_Mental_Health%20(1).csv"
df <- read.csv(url)
# Check the data
head(df)
## Employee_ID Age Gender Job_Role Industry Years_of_Experience
## 1 EMP0001 32 Non-binary HR Healthcare 13
## 2 EMP0002 40 Female Data Scientist IT 3
## 3 EMP0003 59 Non-binary Software Engineer Education 22
## 4 EMP0004 27 Male Software Engineer Finance 20
## 5 EMP0005 49 Male Sales Consulting 32
## 6 EMP0006 59 Non-binary Sales IT 31
## Work_Location Hours_Worked_Per_Week Number_of_Virtual_Meetings
## 1 Hybrid 47 7
## 2 Remote 52 4
## 3 Hybrid 46 11
## 4 Onsite 32 8
## 5 Onsite 35 12
## 6 Hybrid 39 3
## Work_Life_Balance_Rating Stress_Level Mental_Health_Condition
## 1 2 Medium Depression
## 2 1 Medium Anxiety
## 3 5 Medium Anxiety
## 4 4 High Depression
## 5 2 High None
## 6 4 High None
## Access_to_Mental_Health_Resources Productivity_Change Social_Isolation_Rating
## 1 No Decrease 1
## 2 No Increase 3
## 3 No No Change 4
## 4 Yes Increase 3
## 5 Yes Decrease 3
## 6 No Increase 5
## Satisfaction_with_Remote_Work Company_Support_for_Remote_Work
## 1 Unsatisfied 1
## 2 Satisfied 2
## 3 Unsatisfied 5
## 4 Unsatisfied 3
## 5 Unsatisfied 3
## 6 Unsatisfied 1
## Physical_Activity Sleep_Quality Region
## 1 Weekly Good Europe
## 2 Weekly Good Asia
## 3 None Poor North America
## 4 None Poor Europe
## 5 Weekly Average North America
## 6 None Average South America
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
colnames(df) <- tolower(colnames(df)) # change all column names to lowercase
df <- df %>%
  mutate(across(where(is.character), tolower)) # change all character values to lowercase
sum(is.na(df)) # count missing values (returns 0)
## [1] 0
df <- distinct(df) # remove duplicate rows, if any
head(df)
## employee_id age gender job_role industry years_of_experience
## 1 emp0001 32 non-binary hr healthcare 13
## 2 emp0002 40 female data scientist it 3
## 3 emp0003 59 non-binary software engineer education 22
## 4 emp0004 27 male software engineer finance 20
## 5 emp0005 49 male sales consulting 32
## 6 emp0006 59 non-binary sales it 31
## work_location hours_worked_per_week number_of_virtual_meetings
## 1 hybrid 47 7
## 2 remote 52 4
## 3 hybrid 46 11
## 4 onsite 32 8
## 5 onsite 35 12
## 6 hybrid 39 3
## work_life_balance_rating stress_level mental_health_condition
## 1 2 medium depression
## 2 1 medium anxiety
## 3 5 medium anxiety
## 4 4 high depression
## 5 2 high none
## 6 4 high none
## access_to_mental_health_resources productivity_change social_isolation_rating
## 1 no decrease 1
## 2 no increase 3
## 3 no no change 4
## 4 yes increase 3
## 5 yes decrease 3
## 6 no increase 5
## satisfaction_with_remote_work company_support_for_remote_work
## 1 unsatisfied 1
## 2 satisfied 2
## 3 unsatisfied 5
## 4 unsatisfied 3
## 5 unsatisfied 3
## 6 unsatisfied 1
## physical_activity sleep_quality region
## 1 weekly good europe
## 2 weekly good asia
## 3 none poor north america
## 4 none poor europe
## 5 weekly average north america
## 6 none average south america
# Create contingency table
stress_table <- table(df$work_location, df$stress_level)
# Perform Chi-squared test
chisq_test <- chisq.test(stress_table)
# Output results
chisq_test
##
## Pearson's Chi-squared test
##
## data: stress_table
## X-squared = 1.9223, df = 4, p-value = 0.75
# Count of each stress level for each work location
table(df$work_location, df$stress_level)
##
## high low medium
## hybrid 561 543 545
## onsite 535 555 547
## remote 590 547 577
# Proportions of stress levels by work location
prop_table <- prop.table(table(df$work_location, df$stress_level), 1)
# Print proportions
prop_table
##
## high low medium
## hybrid 0.3402062 0.3292905 0.3305033
## onsite 0.3268173 0.3390348 0.3341478
## remote 0.3442240 0.3191365 0.3366394
library(ggplot2)
# Create bar plot
ggplot(df, aes(x = work_location, fill = stress_level)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Stress Level by Work Location",
    x = "Work Location",
    y = "Count",
    fill = "Stress Level"
  ) +
  theme_minimal()
Interpretation of results from the chi-squared test output:
Chi-squared statistic: X-squared = 1.92; degrees of freedom: df = 4; p-value = 0.75. The p-value is much greater than the conventional threshold of 0.05.
H₀ (null): There is no association between work location and stress level. H₁ (alternative): There is an association between work location and stress level.
Since p = 0.75 > 0.05, we fail to reject the null hypothesis: the data show no statistically significant association between work location and stress level. The row proportions point the same way, with each stress level accounting for roughly a third of employees in every work location.
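A non-significant p-value can be complemented with an effect size. The sketch below rebuilds the contingency table from the counts printed above, reruns the test, and computes Cramér's V, a measure the original analysis did not include. The object names (`stress_counts`, `cramers_v`) are illustrative and not part of the original code.

```r
# Counts copied from the work_location x stress_level table printed above
stress_counts <- matrix(
  c(561, 543, 545,
    535, 555, 547,
    590, 547, 577),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    work_location = c("hybrid", "onsite", "remote"),
    stress_level  = c("high", "low", "medium")
  )
)

test <- chisq.test(stress_counts)

# Cramer's V = sqrt(chi-squared / (n * (min(rows, cols) - 1)))
n <- sum(stress_counts)
k <- min(dim(stress_counts)) - 1
cramers_v <- sqrt(unname(test$statistic) / (n * k))

round(test$expected, 1)  # counts expected if the two variables were independent
cramers_v                # values near 0 indicate a very weak association
```

On the real data frame, the same calculation could start from `table(df$work_location, df$stress_level)` instead of the hand-typed matrix.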
## Df Sum Sq Mean Sq F value Pr(>F)
## work_location 2 5 2.348 1.18 0.307
## Residuals 4997 9941 1.989
F value = 1.18: This is the ratio of between-group variance to within-group variance, that is, the Mean Sq for work_location (2.348) divided by the Mean Sq for the residuals (1.989). p-value = 0.31: The p-value tells us whether the differences in means across groups are statistically significant.
Interpretation: If p < 0.05, we would reject the null hypothesis. Here, p = 0.31 > 0.05, so there is no statistically significant difference in work-life balance ratings across the three work_location groups.
Based on this ANOVA, the mean work-life balance rating does not differ significantly across the onsite, hybrid, and remote groups. In this dataset, work location shows no statistically significant effect on how employees rate their work-life balance.
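The code that produced the ANOVA table is not shown in this report. A call along the lines of `summary(aov(work_life_balance_rating ~ work_location, data = df))` would produce a table of this shape, but that is an assumption about what was run. Independently of the call, the F value and p-value can be checked from the printed Df and Mean Sq entries:

```r
# Values copied from the ANOVA table above
ms_between <- 2.348   # Mean Sq for work_location
ms_within  <- 1.989   # Mean Sq for Residuals
df_between <- 2
df_within  <- 4997

# F is the ratio of the two mean squares
f_value <- ms_between / ms_within

# The upper tail of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(f_value, 2)
round(p_value, 3)
```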
# Create contingency table between work_location and mental_health_condition
mental_health_table <- table(df$work_location, df$mental_health_condition)
# View the table
print(mental_health_table)
##
## anxiety burnout depression none
## hybrid 428 400 421 400
## onsite 407 442 412 376
## remote 443 438 413 420
# Perform Chi-squared test
chi_test <- chisq.test(mental_health_table)
# Display the test result
print(chi_test)
##
## Pearson's Chi-squared test
##
## data: mental_health_table
## X-squared = 4.5809, df = 6, p-value = 0.5986
# Calculate counts for mental health condition by work_location
mental_health_plot <- df %>%
  group_by(work_location, mental_health_condition) %>%
  summarize(count = n(), .groups = "drop") # drop grouping so no regrouping message is printed
# Plot
ggplot(mental_health_plot, aes(x = work_location, y = count, fill = mental_health_condition)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Mental Health Conditions by Work Location",
    x = "Work Location",
    y = "Count",
    fill = "Mental Health Condition"
  ) +
  theme_minimal()
p-value = 0.5986 (about 0.60):
Since p > 0.05, we fail to reject the null hypothesis. This means there is no statistically significant association between work location and mental health condition in this dataset.
Implications: From this analysis, mental health condition (anxiety, burnout, depression, none) does not appear to depend on the type of work location (hybrid, onsite, remote).
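To see where, if anywhere, individual cells depart from independence, the row proportions and standardized residuals of the test can be inspected. This is a supplementary sketch rather than part of the original analysis; it retypes the counts from the table above, and the name `mh_counts` is illustrative.

```r
# Counts copied from the work_location x mental_health_condition table above
mh_counts <- matrix(
  c(428, 400, 421, 400,
    407, 442, 412, 376,
    443, 438, 413, 420),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    work_location = c("hybrid", "onsite", "remote"),
    condition     = c("anxiety", "burnout", "depression", "none")
  )
)

mh_test <- chisq.test(mh_counts)

# Share of each condition within a work location (each row sums to 1)
round(prop.table(mh_counts, margin = 1), 3)

# Standardized residuals: cells with an absolute value above about 2
# would indicate a notable departure from independence
round(mh_test$stdres, 2)
```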
# Load data from GitHub into R
url2 <- "https://raw.githubusercontent.com/leslietavarez/remotework-mentalhealth/refs/heads/main/train.csv"
df2 <- read.csv(url2)
# Check the data
head(df2)
## Employee.ID Date.of.Joining Gender Company.Type
## 1 fffe32003000360033003200 2008-09-30 Female Service
## 2 fffe3700360033003500 2008-11-30 Male Service
## 3 fffe31003300320037003900 2008-03-10 Female Product
## 4 fffe32003400380032003900 2008-11-03 Male Service
## 5 fffe31003900340031003600 2008-07-24 Female Service
## 6 fffe3300350037003500 2008-11-26 Male Product
## WFH.Setup.Available Designation Resource.Allocation Mental.Fatigue.Score
## 1 No 2 3 3.8
## 2 Yes 1 2 5.0
## 3 Yes 2 NA 5.8
## 4 Yes 1 1 2.6
## 5 No 3 7 6.9
## 6 Yes 2 4 3.6
## Burn.Rate
## 1 0.16
## 2 0.36
## 3 0.49
## 4 0.20
## 5 0.52
## 6 0.29
colnames(df2) <- tolower(colnames(df2)) # change all column names to lowercase
df2 <- df2 %>%
  mutate(across(where(is.character), tolower)) # change all character values to lowercase
sum(is.na(df2)) # count missing values across the whole data frame
## [1] 4622
df2 <- distinct(df2) # remove duplicate rows, if any
colSums(is.na(df2))
## employee.id date.of.joining gender
## 0 0 0
## company.type wfh.setup.available designation
## 0 0 0
## resource.allocation mental.fatigue.score burn.rate
## 1381 2117 1124
library(naniar)
# Visualize missing values
vis_miss(df2)
df2_cleaned <- df2[complete.cases(df2), ]
# Visualize missing values again to make sure
vis_miss(df2_cleaned)
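Dropping incomplete rows changes the sample size, so it is worth recording how many rows are lost. The sketch below uses a small made-up data frame (`toy`, not the project data) to show the bookkeeping; the same calls could be applied to `df2` and `df2_cleaned`.

```r
# Hypothetical toy data standing in for df2; the values are invented for illustration
toy <- data.frame(
  resource.allocation  = c(3, 2, NA, 1, 7),
  mental.fatigue.score = c(3.8, NA, 5.8, 2.6, 6.9),
  burn.rate            = c(0.16, 0.36, 0.49, NA, 0.52)
)

# Share of missing values in each column
missing_share <- colMeans(is.na(toy))

# Keep only rows with no missing values, as done for df2 above
toy_cleaned <- toy[complete.cases(toy), ]

# Number of rows removed by listwise deletion
rows_dropped <- nrow(toy) - nrow(toy_cleaned)

missing_share
rows_dropped
```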
## Logistic regression of dataset 2
# Convert the dependent variable to binary numeric
df2_cleaned$wfh.setup.available <- ifelse(df2_cleaned$wfh.setup.available == "yes", 1, 0)
# Fit a logistic regression model
model_logistic <- glm(
  wfh.setup.available ~ mental.fatigue.score + burn.rate + resource.allocation,
  family = binomial,
  data = df2_cleaned
)
# Display the summary
summary(model_logistic)
##
## Call:
## glm(formula = wfh.setup.available ~ mental.fatigue.score + burn.rate +
## resource.allocation, family = binomial, data = df2_cleaned)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.57023 0.05889 26.665 < 2e-16 ***
## mental.fatigue.score 0.11604 0.02506 4.631 3.64e-06 ***
## burn.rate -3.72424 0.27947 -13.326 < 2e-16 ***
## resource.allocation -0.08458 0.01452 -5.825 5.72e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25655 on 18589 degrees of freedom
## Residual deviance: 23763 on 18586 degrees of freedom
## AIC: 23771
##
## Number of Fisher Scoring iterations: 4
library(ggplot2)
library(tidyr)
# Summarize the three indicators by WFH setup availability for a grouped bar plot
df_summary <- df2_cleaned %>%
  group_by(wfh.setup.available) %>%
  summarize(
    avg_mental_fatigue = mean(mental.fatigue.score, na.rm = TRUE),
    avg_burn_rate = mean(burn.rate, na.rm = TRUE),
    avg_resource_allocation = mean(resource.allocation, na.rm = TRUE)
  ) %>%
  # pivot_longer() is the current tidyr replacement for gather()
  pivot_longer(
    cols = c(avg_mental_fatigue, avg_burn_rate, avg_resource_allocation),
    names_to = "Variable",
    values_to = "Value"
  )
# Create grouped bar plot
# wfh.setup.available was recoded to 0/1 above, so treat it as a category (0 = no, 1 = yes)
ggplot(df_summary, aes(x = factor(wfh.setup.available), y = Value, fill = Variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Comparison of Mental Fatigue, Burn Rate, and Resource Allocation by WFH Setup",
    x = "WFH Setup Availability",
    y = "Average Value",
    fill = "Variables"
  ) +
  theme_minimal()
Interpretation (all coefficients are on the log-odds scale):
Intercept (1.5702): The baseline log-odds of having a WFH setup available when all predictors are zero.
Mental Fatigue Score (0.1160): Positive coefficient. Holding the other predictors constant, a higher mental fatigue score is associated with a higher likelihood of having WFH setup availability. Highly significant (p < 0.001).
Burn Rate (-3.7242): Negative coefficient. Holding the other predictors constant, higher burn rate values are associated with a lower likelihood of having WFH setup availability. Highly significant (p < 0.001).
Resource Allocation (-0.0846): Negative coefficient. Holding the other predictors constant, greater resource allocation is associated with a lower likelihood of having WFH setup availability. Highly significant (p < 0.001).
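Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below works from the estimates and deviances printed in the model summary rather than from the fitted object; the names `coefs`, `odds_ratios`, and `lr_stat` are illustrative.

```r
# Estimates copied from the summary(model_logistic) output above
coefs <- c(
  "(Intercept)"        = 1.57023,
  mental.fatigue.score = 0.11604,
  burn.rate            = -3.72424,
  resource.allocation  = -0.08458
)

# exp(coefficient) is the multiplicative change in odds per one-unit increase
odds_ratios <- exp(coefs[-1])

# Predicted probability of WFH availability when every predictor equals zero
baseline_prob <- plogis(coefs[["(Intercept)"]])

# Likelihood-ratio test of the fitted model against the intercept-only model
lr_stat <- 25655 - 23763   # null deviance minus residual deviance
lr_df   <- 18589 - 18586   # difference in degrees of freedom
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

round(odds_ratios, 3)
round(baseline_prob, 3)
lr_stat
```

With the fitted model in memory, `exp(coef(model_logistic))` gives the same odds ratios directly. Note that a one-unit change in burn rate is large relative to the values shown in the data preview (all below 1), so its odds ratio describes a very big step.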
My analysis aimed to explore the relationship between work location (onsite, remote, hybrid) and mental health indicators, including mental fatigue, stress level, work-life balance, burnout, and resource allocation, to better understand how different work arrangements might affect employees' mental well-being. The findings from dataset 1 were not statistically significant: the two chi-squared tests (p = 0.75 and p = 0.60) and the ANOVA (p = 0.31) all failed to reject the null hypothesis. The findings from dataset 2 were statistically significant, with every p-value in the logistic regression well below the conventional 0.05 threshold, indicating strong evidence against the null hypothesis. This suggests that the availability of a WFH setup is meaningfully associated with mental fatigue, burn rate, and resource allocation. Because the model predicts WFH availability from those indicators, it shows an association and does not by itself show that work location determines mental health outcomes. I believe that more research needs to be done to reach a more accurate conclusion.