# Load required packages
library(htmltools)
library(caret)
library(pROC)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(corrplot)
library(skimr)
library(DataExplorer)
library(miscTools)
# MASS masks dplyr::select(), so select() is called as dplyr::select() below
library(MASS)
library(performance)
library(lmtest)
library(mice)
library(glmnet)
library(Metrics)
library(patchwork) # for combining ggplots
library(e1071)
library(car)
library(forcats) # For better factor handling
remote_work_df <- read_csv("https://raw.githubusercontent.com/uzmabb182/Data_621/refs/heads/main/Final_Project/Impact_of_Remote_Work_on_Mental_Health.csv")
head(remote_work_df)
## # A tibble: 6 × 20
## Employee_ID Age Gender Job_Role Industry Years_of_Experience Work_Location
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 EMP0001 32 Non-bin… HR Healthc… 13 Hybrid
## 2 EMP0002 40 Female Data Sc… IT 3 Remote
## 3 EMP0003 59 Non-bin… Softwar… Educati… 22 Hybrid
## 4 EMP0004 27 Male Softwar… Finance 20 Onsite
## 5 EMP0005 49 Male Sales Consult… 32 Onsite
## 6 EMP0006 59 Non-bin… Sales IT 31 Hybrid
## # ℹ 13 more variables: Hours_Worked_Per_Week <dbl>,
## # Number_of_Virtual_Meetings <dbl>, Work_Life_Balance_Rating <dbl>,
## # Stress_Level <chr>, Mental_Health_Condition <chr>,
## # Access_to_Mental_Health_Resources <chr>, Productivity_Change <chr>,
## # Social_Isolation_Rating <dbl>, Satisfaction_with_Remote_Work <chr>,
## # Company_Support_for_Remote_Work <dbl>, Physical_Activity <chr>,
## # Sleep_Quality <chr>, Region <chr>
glimpse(remote_work_df)
## Rows: 5,000
## Columns: 20
## $ Employee_ID <chr> "EMP0001", "EMP0002", "EMP0003", "EM…
## $ Age <dbl> 32, 40, 59, 27, 49, 59, 31, 42, 56, …
## $ Gender <chr> "Non-binary", "Female", "Non-binary"…
## $ Job_Role <chr> "HR", "Data Scientist", "Software En…
## $ Industry <chr> "Healthcare", "IT", "Education", "Fi…
## $ Years_of_Experience <dbl> 13, 3, 22, 20, 32, 31, 24, 6, 9, 28,…
## $ Work_Location <chr> "Hybrid", "Remote", "Hybrid", "Onsit…
## $ Hours_Worked_Per_Week <dbl> 47, 52, 46, 32, 35, 39, 51, 54, 24, …
## $ Number_of_Virtual_Meetings <dbl> 7, 4, 11, 8, 12, 3, 7, 7, 4, 6, 3, 1…
## $ Work_Life_Balance_Rating <dbl> 2, 1, 5, 4, 2, 4, 3, 3, 2, 1, 3, 4, …
## $ Stress_Level <chr> "Medium", "Medium", "Medium", "High"…
## $ Mental_Health_Condition <chr> "Depression", "Anxiety", "Anxiety", …
## $ Access_to_Mental_Health_Resources <chr> "No", "No", "No", "Yes", "Yes", "No"…
## $ Productivity_Change <chr> "Decrease", "Increase", "No Change",…
## $ Social_Isolation_Rating <dbl> 1, 3, 4, 3, 3, 5, 5, 5, 2, 2, 4, 4, …
## $ Satisfaction_with_Remote_Work <chr> "Unsatisfied", "Satisfied", "Unsatis…
## $ Company_Support_for_Remote_Work <dbl> 1, 2, 5, 3, 3, 1, 3, 4, 4, 1, 2, 3, …
## $ Physical_Activity <chr> "Weekly", "Weekly", "None", "None", …
## $ Sleep_Quality <chr> "Good", "Good", "Poor", "Poor", "Ave…
## $ Region <chr> "Europe", "Asia", "North America", "…
summary(remote_work_df)
## Employee_ID Age Gender Job_Role
## Length:5000 Min. :22.00 Length:5000 Length:5000
## Class :character 1st Qu.:31.00 Class :character Class :character
## Mode :character Median :41.00 Mode :character Mode :character
## Mean :40.99
## 3rd Qu.:51.00
## Max. :60.00
## Industry Years_of_Experience Work_Location
## Length:5000 Min. : 1.00 Length:5000
## Class :character 1st Qu.: 9.00 Class :character
## Mode :character Median :18.00 Mode :character
## Mean :17.81
## 3rd Qu.:26.00
## Max. :35.00
## Hours_Worked_Per_Week Number_of_Virtual_Meetings Work_Life_Balance_Rating
## Min. :20.00 Min. : 0.000 Min. :1.000
## 1st Qu.:29.00 1st Qu.: 4.000 1st Qu.:2.000
## Median :40.00 Median : 8.000 Median :3.000
## Mean :39.61 Mean : 7.559 Mean :2.984
## 3rd Qu.:50.00 3rd Qu.:12.000 3rd Qu.:4.000
## Max. :60.00 Max. :15.000 Max. :5.000
## Stress_Level Mental_Health_Condition Access_to_Mental_Health_Resources
## Length:5000 Length:5000 Length:5000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Productivity_Change Social_Isolation_Rating Satisfaction_with_Remote_Work
## Length:5000 Min. :1.000 Length:5000
## Class :character 1st Qu.:2.000 Class :character
## Mode :character Median :3.000 Mode :character
## Mean :2.994
## 3rd Qu.:4.000
## Max. :5.000
## Company_Support_for_Remote_Work Physical_Activity Sleep_Quality
## Min. :1.000 Length:5000 Length:5000
## 1st Qu.:2.000 Class :character Class :character
## Median :3.000 Mode :character Mode :character
## Mean :3.008
## 3rd Qu.:4.000
## Max. :5.000
## Region
## Length:5000
## Class :character
## Mode :character
##
##
##
skim(remote_work_df)
Name | remote_work_df |
Number of rows | 5000 |
Number of columns | 20 |
_______________________ | |
Column type frequency: | |
character | 13 |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Employee_ID | 0 | 1 | 7 | 7 | 0 | 5000 | 0 |
Gender | 0 | 1 | 4 | 17 | 0 | 4 | 0 |
Job_Role | 0 | 1 | 2 | 17 | 0 | 7 | 0 |
Industry | 0 | 1 | 2 | 13 | 0 | 7 | 0 |
Work_Location | 0 | 1 | 6 | 6 | 0 | 3 | 0 |
Stress_Level | 0 | 1 | 3 | 6 | 0 | 3 | 0 |
Mental_Health_Condition | 0 | 1 | 4 | 10 | 0 | 4 | 0 |
Access_to_Mental_Health_Resources | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Productivity_Change | 0 | 1 | 8 | 9 | 0 | 3 | 0 |
Satisfaction_with_Remote_Work | 0 | 1 | 7 | 11 | 0 | 3 | 0 |
Physical_Activity | 0 | 1 | 4 | 6 | 0 | 3 | 0 |
Sleep_Quality | 0 | 1 | 4 | 7 | 0 | 3 | 0 |
Region | 0 | 1 | 4 | 13 | 0 | 6 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Age | 0 | 1 | 40.99 | 11.30 | 22 | 31 | 41 | 51 | 60 | ▇▇▆▇▇ |
Years_of_Experience | 0 | 1 | 17.81 | 10.02 | 1 | 9 | 18 | 26 | 35 | ▇▇▇▇▇ |
Hours_Worked_Per_Week | 0 | 1 | 39.61 | 11.86 | 20 | 29 | 40 | 50 | 60 | ▇▇▆▆▆ |
Number_of_Virtual_Meetings | 0 | 1 | 7.56 | 4.64 | 0 | 4 | 8 | 12 | 15 | ▇▆▆▆▆ |
Work_Life_Balance_Rating | 0 | 1 | 2.98 | 1.41 | 1 | 2 | 3 | 4 | 5 | ▇▇▇▇▇ |
Social_Isolation_Rating | 0 | 1 | 2.99 | 1.39 | 1 | 2 | 3 | 4 | 5 | ▇▇▇▇▇ |
Company_Support_for_Remote_Work | 0 | 1 | 3.01 | 1.40 | 1 | 2 | 3 | 4 | 5 | ▇▇▇▇▇ |
colSums(is.na(remote_work_df))
## Employee_ID Age
## 0 0
## Gender Job_Role
## 0 0
## Industry Years_of_Experience
## 0 0
## Work_Location Hours_Worked_Per_Week
## 0 0
## Number_of_Virtual_Meetings Work_Life_Balance_Rating
## 0 0
## Stress_Level Mental_Health_Condition
## 0 0
## Access_to_Mental_Health_Resources Productivity_Change
## 0 0
## Social_Isolation_Rating Satisfaction_with_Remote_Work
## 0 0
## Company_Support_for_Remote_Work Physical_Activity
## 0 0
## Sleep_Quality Region
## 0 0
remote_work_df <- remote_work_df %>%
mutate(across(where(is.character), as.factor))
str(remote_work_df)
## tibble [5,000 × 20] (S3: tbl_df/tbl/data.frame)
## $ Employee_ID : Factor w/ 5000 levels "EMP0001","EMP0002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : num [1:5000] 32 40 59 27 49 59 31 42 56 30 ...
## $ Gender : Factor w/ 4 levels "Female","Male",..: 3 1 3 2 2 3 4 3 4 1 ...
## $ Job_Role : Factor w/ 7 levels "Data Scientist",..: 3 1 7 7 6 6 6 1 1 3 ...
## $ Industry : Factor w/ 7 levels "Consulting","Education",..: 4 5 2 3 1 5 5 6 4 5 ...
## $ Years_of_Experience : num [1:5000] 13 3 22 20 32 31 24 6 9 28 ...
## $ Work_Location : Factor w/ 3 levels "Hybrid","Onsite",..: 1 3 1 2 2 1 3 2 1 1 ...
## $ Hours_Worked_Per_Week : num [1:5000] 47 52 46 32 35 39 51 54 24 57 ...
## $ Number_of_Virtual_Meetings : num [1:5000] 7 4 11 8 12 3 7 7 4 6 ...
## $ Work_Life_Balance_Rating : num [1:5000] 2 1 5 4 2 4 3 3 2 1 ...
## $ Stress_Level : Factor w/ 3 levels "High","Low","Medium": 3 3 3 1 1 1 2 3 1 2 ...
## $ Mental_Health_Condition : Factor w/ 4 levels "Anxiety","Burnout",..: 3 1 1 3 4 4 1 3 4 3 ...
## $ Access_to_Mental_Health_Resources: Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 2 1 2 2 ...
## $ Productivity_Change : Factor w/ 3 levels "Decrease","Increase",..: 1 2 3 2 1 2 1 1 1 1 ...
## $ Social_Isolation_Rating : num [1:5000] 1 3 4 3 3 5 5 5 2 2 ...
## $ Satisfaction_with_Remote_Work : Factor w/ 3 levels "Neutral","Satisfied",..: 3 2 3 3 3 3 1 2 3 1 ...
## $ Company_Support_for_Remote_Work : num [1:5000] 1 2 5 3 3 1 3 4 4 1 ...
## $ Physical_Activity : Factor w/ 3 levels "Daily","None",..: 3 3 2 2 3 2 1 2 1 3 ...
## $ Sleep_Quality : Factor w/ 3 levels "Average","Good",..: 2 2 3 3 1 1 3 1 3 3 ...
## $ Region : Factor w/ 6 levels "Africa","Asia",..: 3 2 4 3 4 6 2 4 3 4 ...
# Drop the Employee_ID column
remote_work_df <- remote_work_df %>% dplyr::select(-Employee_ID)
# Verify that the column is removed
glimpse(remote_work_df)
## Rows: 5,000
## Columns: 19
## $ Age <dbl> 32, 40, 59, 27, 49, 59, 31, 42, 56, …
## $ Gender <fct> Non-binary, Female, Non-binary, Male…
## $ Job_Role <fct> HR, Data Scientist, Software Enginee…
## $ Industry <fct> Healthcare, IT, Education, Finance, …
## $ Years_of_Experience <dbl> 13, 3, 22, 20, 32, 31, 24, 6, 9, 28,…
## $ Work_Location <fct> Hybrid, Remote, Hybrid, Onsite, Onsi…
## $ Hours_Worked_Per_Week <dbl> 47, 52, 46, 32, 35, 39, 51, 54, 24, …
## $ Number_of_Virtual_Meetings <dbl> 7, 4, 11, 8, 12, 3, 7, 7, 4, 6, 3, 1…
## $ Work_Life_Balance_Rating <dbl> 2, 1, 5, 4, 2, 4, 3, 3, 2, 1, 3, 4, …
## $ Stress_Level <fct> Medium, Medium, Medium, High, High, …
## $ Mental_Health_Condition <fct> Depression, Anxiety, Anxiety, Depres…
## $ Access_to_Mental_Health_Resources <fct> No, No, No, Yes, Yes, No, Yes, No, Y…
## $ Productivity_Change <fct> Decrease, Increase, No Change, Incre…
## $ Social_Isolation_Rating <dbl> 1, 3, 4, 3, 3, 5, 5, 5, 2, 2, 4, 4, …
## $ Satisfaction_with_Remote_Work <fct> Unsatisfied, Satisfied, Unsatisfied,…
## $ Company_Support_for_Remote_Work <dbl> 1, 2, 5, 3, 3, 1, 3, 4, 4, 1, 2, 3, …
## $ Physical_Activity <fct> Weekly, Weekly, None, None, Weekly, …
## $ Sleep_Quality <fct> Good, Good, Poor, Poor, Average, Ave…
## $ Region <fct> Europe, Asia, North America, Europe,…
remote_work_df %>%
dplyr::select(where(is.numeric)) %>%
summary()
## Age Years_of_Experience Hours_Worked_Per_Week
## Min. :22.00 Min. : 1.00 Min. :20.00
## 1st Qu.:31.00 1st Qu.: 9.00 1st Qu.:29.00
## Median :41.00 Median :18.00 Median :40.00
## Mean :40.99 Mean :17.81 Mean :39.61
## 3rd Qu.:51.00 3rd Qu.:26.00 3rd Qu.:50.00
## Max. :60.00 Max. :35.00 Max. :60.00
## Number_of_Virtual_Meetings Work_Life_Balance_Rating Social_Isolation_Rating
## Min. : 0.000 Min. :1.000 Min. :1.000
## 1st Qu.: 4.000 1st Qu.:2.000 1st Qu.:2.000
## Median : 8.000 Median :3.000 Median :3.000
## Mean : 7.559 Mean :2.984 Mean :2.994
## 3rd Qu.:12.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :15.000 Max. :5.000 Max. :5.000
## Company_Support_for_Remote_Work
## Min. :1.000
## 1st Qu.:2.000
## Median :3.000
## Mean :3.008
## 3rd Qu.:4.000
## Max. :5.000
remote_work_df %>%
dplyr::select(where(is.factor)) %>%
map(table)
## $Gender
##
## Female Male Non-binary Prefer not to say
## 1274 1270 1214 1242
##
## $Job_Role
##
## Data Scientist Designer HR Marketing
## 696 723 716 683
## Project Manager Sales Software Engineer
## 738 733 711
##
## $Industry
##
## Consulting Education Finance Healthcare IT
## 680 690 747 728 746
## Manufacturing Retail
## 683 726
##
## $Work_Location
##
## Hybrid Onsite Remote
## 1649 1637 1714
##
## $Stress_Level
##
## High Low Medium
## 1686 1645 1669
##
## $Mental_Health_Condition
##
## Anxiety Burnout Depression None
## 1278 1280 1246 1196
##
## $Access_to_Mental_Health_Resources
##
## No Yes
## 2553 2447
##
## $Productivity_Change
##
## Decrease Increase No Change
## 1737 1586 1677
##
## $Satisfaction_with_Remote_Work
##
## Neutral Satisfied Unsatisfied
## 1648 1675 1677
##
## $Physical_Activity
##
## Daily None Weekly
## 1616 1629 1755
##
## $Sleep_Quality
##
## Average Good Poor
## 1628 1687 1685
##
## $Region
##
## Africa Asia Europe North America Oceania
## 860 829 840 777 867
## South America
## 827
ggplot(remote_work_df, aes(x = Age)) +
geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
theme_minimal() +
labs(title = "Age Distribution", x = "Age", y = "Count")
### Distribution Plots for Work Location vs Stress Level
ggplot(remote_work_df, aes(x = Work_Location, fill = Stress_Level)) +
geom_bar(position = "dodge") +
labs(title = "Stress Level by Work Location", x = "Work Location", y = "Count") +
theme_minimal()
### Distribution Plots for Mental Health Condition by Gender
ggplot(remote_work_df, aes(x = Gender, fill = Mental_Health_Condition)) +
geom_bar(position = "fill") +
labs(title = "Mental Health Condition Distribution by Gender", y = "Proportion") +
theme_minimal() +
coord_flip()
### Correlation (Numeric Features)
# Correlation matrix
numeric_data <- remote_work_df %>%
dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_data, use = "complete.obs")
round(cor_matrix, 2)
## Age Years_of_Experience Hours_Worked_Per_Week
## Age 1.00 0.00 0.00
## Years_of_Experience 0.00 1.00 -0.02
## Hours_Worked_Per_Week 0.00 -0.02 1.00
## Number_of_Virtual_Meetings 0.00 0.02 0.00
## Work_Life_Balance_Rating 0.02 0.00 0.00
## Social_Isolation_Rating -0.02 0.00 -0.01
## Company_Support_for_Remote_Work 0.02 0.01 0.01
## Number_of_Virtual_Meetings
## Age 0.00
## Years_of_Experience 0.02
## Hours_Worked_Per_Week 0.00
## Number_of_Virtual_Meetings 1.00
## Work_Life_Balance_Rating 0.01
## Social_Isolation_Rating 0.00
## Company_Support_for_Remote_Work 0.00
## Work_Life_Balance_Rating
## Age 0.02
## Years_of_Experience 0.00
## Hours_Worked_Per_Week 0.00
## Number_of_Virtual_Meetings 0.01
## Work_Life_Balance_Rating 1.00
## Social_Isolation_Rating 0.00
## Company_Support_for_Remote_Work -0.01
## Social_Isolation_Rating
## Age -0.02
## Years_of_Experience 0.00
## Hours_Worked_Per_Week -0.01
## Number_of_Virtual_Meetings 0.00
## Work_Life_Balance_Rating 0.00
## Social_Isolation_Rating 1.00
## Company_Support_for_Remote_Work 0.02
## Company_Support_for_Remote_Work
## Age 0.02
## Years_of_Experience 0.01
## Hours_Worked_Per_Week 0.01
## Number_of_Virtual_Meetings 0.00
## Work_Life_Balance_Rating -0.01
## Social_Isolation_Rating 0.02
## Company_Support_for_Remote_Work 1.00
# Optional: Use corrplot for better visuals
library(corrplot)
corrplot(cor_matrix, method = "color", tl.cex = 0.8)
### Relationships Exploration for Productivity Change vs Mental Health
ggplot(remote_work_df, aes(x = Mental_Health_Condition, fill = Productivity_Change)) +
geom_bar(position = "dodge") +
labs(title = "Productivity Change by Mental Health Condition") +
theme_minimal()
### Relationships Exploration for Sleep Quality by Region
ggplot(remote_work_df, aes(x = Region, fill = Sleep_Quality)) +
geom_bar(position = "dodge") +
labs(title = "Sleep Quality Across Regions") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
### Automated Exploratory Report
# Generate an HTML report
create_report(remote_work_df, output_file = "eda_remote_work_report.html")
## Output created: eda_remote_work_report.html
# Boxplot of weekly hours worked to check for outliers
boxplot(remote_work_df$Hours_Worked_Per_Week, main = "Boxplot of Weekly Hours Worked")
# Check unique values in categorical columns to identify typos or categories
# like "Prefer not to say" that may need special handling
sapply(remote_work_df %>% dplyr::select(where(is.factor)), levels)
## $Gender
## [1] "Female" "Male" "Non-binary"
## [4] "Prefer not to say"
##
## $Job_Role
## [1] "Data Scientist" "Designer" "HR"
## [4] "Marketing" "Project Manager" "Sales"
## [7] "Software Engineer"
##
## $Industry
## [1] "Consulting" "Education" "Finance" "Healthcare"
## [5] "IT" "Manufacturing" "Retail"
##
## $Work_Location
## [1] "Hybrid" "Onsite" "Remote"
##
## $Stress_Level
## [1] "High" "Low" "Medium"
##
## $Mental_Health_Condition
## [1] "Anxiety" "Burnout" "Depression" "None"
##
## $Access_to_Mental_Health_Resources
## [1] "No" "Yes"
##
## $Productivity_Change
## [1] "Decrease" "Increase" "No Change"
##
## $Satisfaction_with_Remote_Work
## [1] "Neutral" "Satisfied" "Unsatisfied"
##
## $Physical_Activity
## [1] "Daily" "None" "Weekly"
##
## $Sleep_Quality
## [1] "Average" "Good" "Poor"
##
## $Region
## [1] "Africa" "Asia" "Europe" "North America"
## [5] "Oceania" "South America"
The dataset includes a diverse set of variables: demographic (age, gender, region), work-related (job role, industry, work location), mental health (stress level, mental health condition, access to resources), and lifestyle factors (sleep quality, physical activity).
No missing values were reported (every column shows 0 in the missing-value check), so the dataset is complete and suitable for direct analysis.
All 13 character columns were converted to factors, and the 7 numeric variables are in usable form.
Employee_ID was removed because it is a unique identifier (5,000 distinct values) that carries no predictive information.
With 5-year bins, the 25–30, 40–50, and 50–55 age bins have the highest frequencies.
Each of these bins has 650–700 employees.
The lowest bin (18–22) has the smallest count (under 150), but the minimum age in the data is 22, so this bin is only partly populated.
The highest bin (around 60) also has fewer participants (around 400). The maximum age in the data is 60, so this is more likely a binning effect than evidence of retirements or reduced remote work participation.
This is not a normal (bell-shaped) distribution. Instead, it is close to a uniform or flat distribution across the observed range of 22 to 60 (mean 40.99, median 41).
The slight dip at the edges (youngest and oldest) should therefore be read with caution rather than as a sign that fewer very young or senior employees are present.
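The dip at the edges can be checked without a plot. The sketch below uses hypothetical, perfectly uniform ages from 22 to 60 and assumes bin edges at 17.5, 22.5, ..., 62.5 (5-year bins centered on multiples of 5, which is assumed to match what `geom_histogram(binwidth = 5)` produced above). Both `ages_demo` and `bin_edges` are illustrative names, not part of the original analysis.

```r
# Hypothetical data: 10 people at every age from 22 to 60 (perfectly uniform)
ages_demo <- rep(22:60, each = 10)

# Assumed bin edges: 5-year bins centered on multiples of 5
bin_edges <- seq(17.5, 62.5, by = 5)

# Count how many observations fall into each bin
bin_counts <- table(cut(ages_demo, breaks = bin_edges))
print(bin_counts)
```

Under these assumptions the first bin contains only age 22 and the last only ages 58 to 60, so both come out lower than the interior bins even though every age is equally frequent.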
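In the bar chart, remote workers show the highest count of high stress, more than hybrid or onsite workers.
Onsite workers have the most balanced distribution, with slightly more reporting low stress than high.
Hybrid workers show a fairly even spread across all stress levels, suggesting a moderate stress profile.
These differences should be interpreted cautiously: overall, the three stress levels are almost equally frequent (1,686 High, 1,645 Low, 1,669 Medium), and the Remote group is also the largest (1,714 versus 1,649 Hybrid and 1,637 Onsite), so raw counts partly reflect group size.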
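One way to check whether such differences between groups are more than noise is to compare row proportions and run a chi-squared test of independence. This is a minimal sketch on hypothetical random data (`demo_df` is made up for illustration); on the real data the same calls would take `remote_work_df$Work_Location` and `remote_work_df$Stress_Level`.

```r
set.seed(621)

# Hypothetical data with the same two columns as the real dataset
demo_df <- data.frame(
  Work_Location = sample(c("Hybrid", "Onsite", "Remote"), 600, replace = TRUE),
  Stress_Level  = sample(c("Low", "Medium", "High"), 600, replace = TRUE)
)

# Contingency table of counts: rows = work location, columns = stress level
tab <- table(demo_df$Work_Location, demo_df$Stress_Level)

# Row proportions remove the effect of unequal group sizes
row_props <- prop.table(tab, margin = 1)
print(round(row_props, 3))

# Chi-squared test of independence between the two factors
test_result <- chisq.test(tab)
print(test_result$p.value)
```

A large p-value would mean the plotted differences are compatible with chance. The same approach applies to the other factor-by-factor charts in this report.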
In the proportional bar chart, anxiety appears to be the most common condition across genders, especially among males and females.
Non-binary and “Prefer not to say” groups have higher proportions of depression and burnout compared to the binary genders.
The “None” category (no mental health condition) is least prevalent in the non-binary and “Prefer not to say” groups.
Males have the highest proportion of “None” (no mental health condition).
Overall, the four conditions are nearly equally frequent (1,278 Anxiety, 1,280 Burnout, 1,246 Depression, 1,196 None), so the differences between genders are small.
Strongest Correlation:
There is no strong correlation in the matrix. All off-diagonal values lie between -0.02 and 0.02.
Age and Years_of_Experience:
Although older individuals would normally be expected to have more work experience, Age and Years_of_Experience are uncorrelated in this dataset (r = 0.00).
Hours and Meetings:
Hours_Worked_Per_Week and Number_of_Virtual_Meetings are also uncorrelated (r = 0.00), so the data do not show that those who attend more meetings work more hours.
Weak or No Correlations:
All other relationships (e.g., Company_Support_for_Remote_Work, Work_Life_Balance_Rating, Social_Isolation_Rating) show negligible correlations, indicating that the numeric variables vary independently of one another.
No Evidence of Multicollinearity:
No pair of numeric variables comes close to a very high correlation (>0.85), so there is low risk of multicollinearity among the numeric predictors in predictive modeling.
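Reading a wrapped correlation matrix by eye is error-prone, so it can help to list the pairs sorted by absolute correlation. The sketch below uses hypothetical independent columns (`numeric_demo`) and the 0.85 threshold mentioned above; on the real data, `cor_matrix` from the earlier chunk would replace `cor_demo`.

```r
set.seed(621)

# Hypothetical numeric data, drawn independently of each other
numeric_demo <- data.frame(
  Age                   = sample(22:60, 500, replace = TRUE),
  Years_of_Experience   = sample(1:35, 500, replace = TRUE),
  Hours_Worked_Per_Week = sample(20:60, 500, replace = TRUE)
)

cor_demo <- cor(numeric_demo)

# Keep each pair once by taking only the upper triangle
upper <- upper.tri(cor_demo)
cor_pairs <- data.frame(
  var1 = rownames(cor_demo)[row(cor_demo)[upper]],
  var2 = colnames(cor_demo)[col(cor_demo)[upper]],
  r    = cor_demo[upper]
)

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)

# Flag pairs above the multicollinearity threshold used in the text
high_pairs <- cor_pairs[abs(cor_pairs$r) > 0.85, ]
print(nrow(high_pairs))
```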
In the bar chart, a decrease in productivity is most prominent among individuals with:
Depression (highest count)
Burnout and Anxiety (also show high counts of decreased productivity)
Individuals with no mental health condition still report productivity decreases, but less frequently compared to those with depression.
Increase in productivity (green bars) is lowest among those with depression, which may suggest a negative association between depression and work output. A bar chart alone cannot establish a causal effect.
Africa and Oceania have the highest counts of “Good” sleep quality. They are also the two largest groups (860 and 867 respondents), so the higher counts partly reflect group size rather than better rest patterns alone.
Asia and North America show a slightly higher proportion of “Poor” sleep quality, which may indicate higher stress, longer work hours, or less work-life balance, although the data shown here do not test these explanations.
Europe and South America present a balanced distribution, with no extreme dominance of any sleep quality level.
The median (black line) is 40 hours/week, which aligns with a standard full-time workload.
The interquartile range (IQR) spans 29 to 50 hours, covering the middle half of employees.
The minimum is 20 hours, and the maximum is 60 hours.
There are no outliers shown, but a few employees are working near the upper limit of 60 hours, which may indicate potential overwork or burnout risk.
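The absence of outliers follows from the usual 1.5 × IQR rule applied to the quartiles that `summary()` reported for Hours_Worked_Per_Week. The arithmetic below is a sketch under that assumption; `boxplot()` computes its whiskers from hinges, which can differ slightly from these quartiles.

```r
# Quartiles and range reported by summary() for Hours_Worked_Per_Week
q1 <- 29
q3 <- 50
observed_range <- c(min = 20, max = 60)

# Interquartile range and the 1.5 * IQR fences
iqr <- q3 - q1                      # 21
lower_fence <- q1 - 1.5 * iqr       # -2.5
upper_fence <- q3 + 1.5 * iqr       # 81.5

print(c(lower = lower_fence, upper = upper_fence))

# TRUE would mean at least one end of the observed range is an outlier
any_outside <- any(observed_range < lower_fence | observed_range > upper_fence)
print(any_outside)
```

Because the observed range of 20 to 60 lies well inside the fences of -2.5 and 81.5, no point is flagged.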
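Handle missing values (none were found in this dataset, so this applies only if new data contain gaps) by imputing or dropping:
For numerical features: use median or mean imputation.
For categorical features: use mode or a new category like “Unknown”.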
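Both imputation options can be sketched in base R. The data frame below is hypothetical and contains deliberately inserted `NA` values, since the real dataset has none.

```r
# Hypothetical data with missing values in one numeric and one factor column
demo_df <- data.frame(
  Hours_Worked_Per_Week = c(40, NA, 35, 50, NA),
  Sleep_Quality = factor(c("Good", NA, "Poor", "Good", "Average"))
)

# Numeric column: replace NA with the median of the observed values
hours_median <- median(demo_df$Hours_Worked_Per_Week, na.rm = TRUE)
demo_df$Hours_Worked_Per_Week[is.na(demo_df$Hours_Worked_Per_Week)] <- hours_median

# Factor column: turn NA into an explicit "Unknown" level
sleep_chr <- as.character(demo_df$Sleep_Quality)
sleep_chr[is.na(sleep_chr)] <- "Unknown"
demo_df$Sleep_Quality <- factor(sleep_chr)

print(demo_df)
```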
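One-hot encoding: for unordered categories (e.g., Job_Role, Industry, Region).
Ordinal encoding: for ordered factors (e.g., Stress_Level, Satisfaction_with_Remote_Work). The factor levels are currently stored alphabetically (e.g., High, Low, Medium for Stress_Level), so they must be reordered before being mapped to integers.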
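A minimal sketch of both encodings on hypothetical rows follows. The ordering Low < Medium < High is an assumption about the intended scale. `as.factor()` sorts levels alphabetically, so the order is set explicitly before the factor is converted to integers.

```r
# Hypothetical rows using level names from the dataset
demo_df <- data.frame(
  Stress_Level  = factor(c("Medium", "High", "Low", "High")),
  Work_Location = factor(c("Hybrid", "Remote", "Onsite", "Remote"))
)

# Default levels are alphabetical: "High" "Low" "Medium"
print(levels(demo_df$Stress_Level))

# Ordinal encoding: set the intended order, then map to integers
demo_df$Stress_Level <- factor(
  demo_df$Stress_Level,
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)
demo_df$Stress_Score <- as.integer(demo_df$Stress_Level)  # Low = 1, Medium = 2, High = 3

# One-hot encoding: one indicator column per level (no intercept)
one_hot <- model.matrix(~ Work_Location - 1, data = demo_df)

print(demo_df)
print(one_hot)
```

For regression models that include an intercept, one indicator column is usually dropped to avoid perfect collinearity. Omitting `- 1` from the formula does this.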
Binary flags (e.g., Has_Mental_Health_Issue = 1 if Mental_Health_Condition ≠ “None”, otherwise 0)
Group rare categories if any appear (e.g., combine job roles or industries with low frequency); in this dataset every job role and industry has a similar count
Bucket Age or Years_of_Experience into groups if needed
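The flag and the bucketing can be sketched as follows. The rows are hypothetical, and the age break points (30, 40, 50) are illustrative choices rather than values taken from the analysis.

```r
# Hypothetical rows
demo_df <- data.frame(
  Age = c(22, 35, 47, 60),
  Mental_Health_Condition = factor(c("None", "Anxiety", "Burnout", "None"))
)

# Binary flag: 1 if any condition is reported, 0 for "None"
demo_df$Has_Mental_Health_Issue <- as.integer(demo_df$Mental_Health_Condition != "None")

# Age buckets; intervals are (21,30], (30,40], (40,50], (50,60]
demo_df$Age_Group <- cut(
  demo_df$Age,
  breaks = c(21, 30, 40, 50, 60),
  labels = c("22-30", "31-40", "41-50", "51-60")
)

print(demo_df)
```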
Scale numeric features such as Hours_Worked_Per_Week, Social_Isolation_Rating, and Work_Life_Balance_Rating: use standardization (z-score) or min-max scaling depending on model choice.
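The two scaling options differ only in which statistics they use. A short sketch on hypothetical values spanning the observed 20 to 60 hour range:

```r
# Hypothetical weekly hours
hours <- c(20, 30, 40, 50, 60)

# Standardization (z-score): subtract the mean, divide by the standard deviation
hours_z <- (hours - mean(hours)) / sd(hours)

# Min-max scaling: map the smallest value to 0 and the largest to 1
hours_minmax <- (hours - min(hours)) / (max(hours) - min(hours))

print(round(hours_z, 3))
print(hours_minmax)
```

When the data are split for modeling, the mean, standard deviation, minimum, and maximum would normally be computed on the training set only and then reused for the test set.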
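Consider SMOTE, undersampling, or class weighting for classification models only if the target classes turn out to be imbalanced; Mental_Health_Condition and Productivity_Change are both close to balanced in this dataset.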
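Whether rebalancing is needed can be judged from the class counts already printed for Mental_Health_Condition. The sketch below computes class shares, the ratio of the largest to the smallest class, and inverse-frequency weights. That weighting formula is one common convention, used here as an assumption rather than the author's method.

```r
# Class counts reported above for Mental_Health_Condition
class_counts <- c(Anxiety = 1278, Burnout = 1280, Depression = 1246, None = 1196)

# Share of each class in the 5,000 rows
class_share <- class_counts / sum(class_counts)

# Ratio of the largest to the smallest class (1 means perfectly balanced)
imbalance_ratio <- max(class_counts) / min(class_counts)

# Inverse-frequency weights: n / (k * n_class), equal to 1 for a balanced class
class_weights <- sum(class_counts) / (length(class_counts) * class_counts)

print(round(class_share, 3))
print(round(imbalance_ratio, 3))
print(round(class_weights, 3))
```

The weights all stay close to 1, which is consistent with treating the classes as approximately balanced.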
Drop Employee_ID (irrelevant for modeling), and possibly Region or Work_Location if highly correlated with other variables.
Review Hours_Worked_Per_Week, Social_Isolation_Rating, etc. for outliers and decide whether to cap, transform, or remove them.
Collapse categories for sparsity or interpretability
Classification: Mental_Health_Condition, Productivity_Change
Regression: could be derived scores (e.g., scale of mental health burden)