Data621_Final

Load Libraries

# Load required packages
library(htmltools)
library(caret)
library(pROC)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(corrplot)
library(skimr)
library(DataExplorer)
library(miscTools)
library(MASS)         # masks dplyr::select, so dplyr::select() is called explicitly below
library(performance)
library(lmtest)
library(mice)
library(glmnet)
library(Metrics)
library(patchwork)    # for combining ggplots
library(e1071)
library(car)
library(forcats)      # for better factor handling

Load the dataset and understand its structure.

# echo=FALSE, include=FALSE

remote_work_df <- read_csv("https://raw.githubusercontent.com/uzmabb182/Data_621/refs/heads/main/Final_Project/Impact_of_Remote_Work_on_Mental_Health.csv")


head(remote_work_df)

## # A tibble: 6 × 20
##   Employee_ID   Age Gender   Job_Role Industry Years_of_Experience Work_Location
##   <chr>       <dbl> <chr>    <chr>    <chr>                  <dbl> <chr>        
## 1 EMP0001        32 Non-bin… HR       Healthc…                  13 Hybrid       
## 2 EMP0002        40 Female   Data Sc… IT                         3 Remote       
## 3 EMP0003        59 Non-bin… Softwar… Educati…                  22 Hybrid       
## 4 EMP0004        27 Male     Softwar… Finance                   20 Onsite       
## 5 EMP0005        49 Male     Sales    Consult…                  32 Onsite       
## 6 EMP0006        59 Non-bin… Sales    IT                        31 Hybrid       
## # ℹ 13 more variables: Hours_Worked_Per_Week <dbl>,
## #   Number_of_Virtual_Meetings <dbl>, Work_Life_Balance_Rating <dbl>,
## #   Stress_Level <chr>, Mental_Health_Condition <chr>,
## #   Access_to_Mental_Health_Resources <chr>, Productivity_Change <chr>,
## #   Social_Isolation_Rating <dbl>, Satisfaction_with_Remote_Work <chr>,
## #   Company_Support_for_Remote_Work <dbl>, Physical_Activity <chr>,
## #   Sleep_Quality <chr>, Region <chr>

Checking & Cleaning Data

glimpse(remote_work_df)

## Rows: 5,000
## Columns: 20
## $ Employee_ID                       <chr> "EMP0001", "EMP0002", "EMP0003", "EM…
## $ Age                               <dbl> 32, 40, 59, 27, 49, 59, 31, 42, 56, …
## $ Gender                            <chr> "Non-binary", "Female", "Non-binary"…
## $ Job_Role                          <chr> "HR", "Data Scientist", "Software En…
## $ Industry                          <chr> "Healthcare", "IT", "Education", "Fi…
## $ Years_of_Experience               <dbl> 13, 3, 22, 20, 32, 31, 24, 6, 9, 28,…
## $ Work_Location                     <chr> "Hybrid", "Remote", "Hybrid", "Onsit…
## $ Hours_Worked_Per_Week             <dbl> 47, 52, 46, 32, 35, 39, 51, 54, 24, …
## $ Number_of_Virtual_Meetings        <dbl> 7, 4, 11, 8, 12, 3, 7, 7, 4, 6, 3, 1…
## $ Work_Life_Balance_Rating          <dbl> 2, 1, 5, 4, 2, 4, 3, 3, 2, 1, 3, 4, …
## $ Stress_Level                      <chr> "Medium", "Medium", "Medium", "High"…
## $ Mental_Health_Condition           <chr> "Depression", "Anxiety", "Anxiety", …
## $ Access_to_Mental_Health_Resources <chr> "No", "No", "No", "Yes", "Yes", "No"…
## $ Productivity_Change               <chr> "Decrease", "Increase", "No Change",…
## $ Social_Isolation_Rating           <dbl> 1, 3, 4, 3, 3, 5, 5, 5, 2, 2, 4, 4, …
## $ Satisfaction_with_Remote_Work     <chr> "Unsatisfied", "Satisfied", "Unsatis…
## $ Company_Support_for_Remote_Work   <dbl> 1, 2, 5, 3, 3, 1, 3, 4, 4, 1, 2, 3, …
## $ Physical_Activity                 <chr> "Weekly", "Weekly", "None", "None", …
## $ Sleep_Quality                     <chr> "Good", "Good", "Poor", "Poor", "Ave…
## $ Region                            <chr> "Europe", "Asia", "North America", "…

summary(remote_work_df)

##  Employee_ID             Age           Gender            Job_Role        
##  Length:5000        Min.   :22.00   Length:5000        Length:5000       
##  Class :character   1st Qu.:31.00   Class :character   Class :character  
##  Mode  :character   Median :41.00   Mode  :character   Mode  :character  
##                     Mean   :40.99                                        
##                     3rd Qu.:51.00                                        
##                     Max.   :60.00                                        
##    Industry         Years_of_Experience Work_Location     
##  Length:5000        Min.   : 1.00       Length:5000       
##  Class :character   1st Qu.: 9.00       Class :character  
##  Mode  :character   Median :18.00       Mode  :character  
##                     Mean   :17.81                         
##                     3rd Qu.:26.00                         
##                     Max.   :35.00                         
##  Hours_Worked_Per_Week Number_of_Virtual_Meetings Work_Life_Balance_Rating
##  Min.   :20.00         Min.   : 0.000             Min.   :1.000           
##  1st Qu.:29.00         1st Qu.: 4.000             1st Qu.:2.000           
##  Median :40.00         Median : 8.000             Median :3.000           
##  Mean   :39.61         Mean   : 7.559             Mean   :2.984           
##  3rd Qu.:50.00         3rd Qu.:12.000             3rd Qu.:4.000           
##  Max.   :60.00         Max.   :15.000             Max.   :5.000           
##  Stress_Level       Mental_Health_Condition Access_to_Mental_Health_Resources
##  Length:5000        Length:5000             Length:5000                      
##  Class :character   Class :character        Class :character                 
##  Mode  :character   Mode  :character        Mode  :character                 
##                                                                              
##                                                                              
##                                                                              
##  Productivity_Change Social_Isolation_Rating Satisfaction_with_Remote_Work
##  Length:5000         Min.   :1.000           Length:5000                  
##  Class :character    1st Qu.:2.000           Class :character             
##  Mode  :character    Median :3.000           Mode  :character             
##                      Mean   :2.994                                        
##                      3rd Qu.:4.000                                        
##                      Max.   :5.000                                        
##  Company_Support_for_Remote_Work Physical_Activity  Sleep_Quality     
##  Min.   :1.000                   Length:5000        Length:5000       
##  1st Qu.:2.000                   Class :character   Class :character  
##  Median :3.000                   Mode  :character   Mode  :character  
##  Mean   :3.008                                                        
##  3rd Qu.:4.000                                                        
##  Max.   :5.000                                                        
##     Region         
##  Length:5000       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

skim(remote_work_df)

Data summary
Name	remote_work_df
Number of rows	5000
Number of columns	20
_______________________
Column type frequency:
character	13
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Employee_ID	1	7	7	5000
Gender	1	4	17	4
Job_Role	1	2	17	7
Industry	1	2	13	7
Work_Location	1	6	6	3
Stress_Level	1	3	6	3
Mental_Health_Condition	1	4	10	4
Access_to_Mental_Health_Resources	1	2	3	2
Productivity_Change	1	8	9	3
Satisfaction_with_Remote_Work	1	7	11	3
Physical_Activity	1	4	6	3
Sleep_Quality	1	4	7	3
Region	1	4	13	6

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Age	1	40.99	11.30	22	31	41	51	60	▇▇▆▇▇
Years_of_Experience	1	17.81	10.02	1	9	18	26	35	▇▇▇▇▇
Hours_Worked_Per_Week	1	39.61	11.86	20	29	40	50	60	▇▇▆▆▆
Number_of_Virtual_Meetings	1	7.56	4.64	0	4	8	12	15	▇▆▆▆▆
Work_Life_Balance_Rating	1	2.98	1.41	1	2	3	4	5	▇▇▇▇▇
Social_Isolation_Rating	1	2.99	1.39	1	2	3	4	5	▇▇▇▇▇
Company_Support_for_Remote_Work	1	3.01	1.40	1	2	3	4	5	▇▇▇▇▇

Checking for missing values

colSums(is.na(remote_work_df))

##                       Employee_ID                               Age 
##                                 0                                 0 
##                            Gender                          Job_Role 
##                                 0                                 0 
##                          Industry               Years_of_Experience 
##                                 0                                 0 
##                     Work_Location             Hours_Worked_Per_Week 
##                                 0                                 0 
##        Number_of_Virtual_Meetings          Work_Life_Balance_Rating 
##                                 0                                 0 
##                      Stress_Level           Mental_Health_Condition 
##                                 0                                 0 
## Access_to_Mental_Health_Resources               Productivity_Change 
##                                 0                                 0 
##           Social_Isolation_Rating     Satisfaction_with_Remote_Work 
##                                 0                                 0 
##   Company_Support_for_Remote_Work                 Physical_Activity 
##                                 0                                 0 
##                     Sleep_Quality                            Region 
##                                 0                                 0

Convert categorical variables to factors

remote_work_df <- remote_work_df %>%
  mutate(across(where(is.character), as.factor))

Checking Datatypes

str(remote_work_df)

## tibble [5,000 × 20] (S3: tbl_df/tbl/data.frame)
##  $ Employee_ID                      : Factor w/ 5000 levels "EMP0001","EMP0002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Age                              : num [1:5000] 32 40 59 27 49 59 31 42 56 30 ...
##  $ Gender                           : Factor w/ 4 levels "Female","Male",..: 3 1 3 2 2 3 4 3 4 1 ...
##  $ Job_Role                         : Factor w/ 7 levels "Data Scientist",..: 3 1 7 7 6 6 6 1 1 3 ...
##  $ Industry                         : Factor w/ 7 levels "Consulting","Education",..: 4 5 2 3 1 5 5 6 4 5 ...
##  $ Years_of_Experience              : num [1:5000] 13 3 22 20 32 31 24 6 9 28 ...
##  $ Work_Location                    : Factor w/ 3 levels "Hybrid","Onsite",..: 1 3 1 2 2 1 3 2 1 1 ...
##  $ Hours_Worked_Per_Week            : num [1:5000] 47 52 46 32 35 39 51 54 24 57 ...
##  $ Number_of_Virtual_Meetings       : num [1:5000] 7 4 11 8 12 3 7 7 4 6 ...
##  $ Work_Life_Balance_Rating         : num [1:5000] 2 1 5 4 2 4 3 3 2 1 ...
##  $ Stress_Level                     : Factor w/ 3 levels "High","Low","Medium": 3 3 3 1 1 1 2 3 1 2 ...
##  $ Mental_Health_Condition          : Factor w/ 4 levels "Anxiety","Burnout",..: 3 1 1 3 4 4 1 3 4 3 ...
##  $ Access_to_Mental_Health_Resources: Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 2 1 2 2 ...
##  $ Productivity_Change              : Factor w/ 3 levels "Decrease","Increase",..: 1 2 3 2 1 2 1 1 1 1 ...
##  $ Social_Isolation_Rating          : num [1:5000] 1 3 4 3 3 5 5 5 2 2 ...
##  $ Satisfaction_with_Remote_Work    : Factor w/ 3 levels "Neutral","Satisfied",..: 3 2 3 3 3 3 1 2 3 1 ...
##  $ Company_Support_for_Remote_Work  : num [1:5000] 1 2 5 3 3 1 3 4 4 1 ...
##  $ Physical_Activity                : Factor w/ 3 levels "Daily","None",..: 3 3 2 2 3 2 1 2 1 3 ...
##  $ Sleep_Quality                    : Factor w/ 3 levels "Average","Good",..: 2 2 3 3 1 1 3 1 3 3 ...
##  $ Region                           : Factor w/ 6 levels "Africa","Asia",..: 3 2 4 3 4 6 2 4 3 4 ...

Dropping Employee_ID Column and Update df

# Drop the Employee_ID column
remote_work_df <- remote_work_df %>% dplyr::select(-Employee_ID)

# Verify that the column is removed
glimpse(remote_work_df)

## Rows: 5,000
## Columns: 19
## $ Age                               <dbl> 32, 40, 59, 27, 49, 59, 31, 42, 56, …
## $ Gender                            <fct> Non-binary, Female, Non-binary, Male…
## $ Job_Role                          <fct> HR, Data Scientist, Software Enginee…
## $ Industry                          <fct> Healthcare, IT, Education, Finance, …
## $ Years_of_Experience               <dbl> 13, 3, 22, 20, 32, 31, 24, 6, 9, 28,…
## $ Work_Location                     <fct> Hybrid, Remote, Hybrid, Onsite, Onsi…
## $ Hours_Worked_Per_Week             <dbl> 47, 52, 46, 32, 35, 39, 51, 54, 24, …
## $ Number_of_Virtual_Meetings        <dbl> 7, 4, 11, 8, 12, 3, 7, 7, 4, 6, 3, 1…
## $ Work_Life_Balance_Rating          <dbl> 2, 1, 5, 4, 2, 4, 3, 3, 2, 1, 3, 4, …
## $ Stress_Level                      <fct> Medium, Medium, Medium, High, High, …
## $ Mental_Health_Condition           <fct> Depression, Anxiety, Anxiety, Depres…
## $ Access_to_Mental_Health_Resources <fct> No, No, No, Yes, Yes, No, Yes, No, Y…
## $ Productivity_Change               <fct> Decrease, Increase, No Change, Incre…
## $ Social_Isolation_Rating           <dbl> 1, 3, 4, 3, 3, 5, 5, 5, 2, 2, 4, 4, …
## $ Satisfaction_with_Remote_Work     <fct> Unsatisfied, Satisfied, Unsatisfied,…
## $ Company_Support_for_Remote_Work   <dbl> 1, 2, 5, 3, 3, 1, 3, 4, 4, 1, 2, 3, …
## $ Physical_Activity                 <fct> Weekly, Weekly, None, None, Weekly, …
## $ Sleep_Quality                     <fct> Good, Good, Poor, Poor, Average, Ave…
## $ Region                            <fct> Europe, Asia, North America, Europe,…

Descriptive Statistics for Numeric Columns Summary

remote_work_df %>%
  dplyr::select(where(is.numeric)) %>%
  summary()

##       Age        Years_of_Experience Hours_Worked_Per_Week
##  Min.   :22.00   Min.   : 1.00       Min.   :20.00        
##  1st Qu.:31.00   1st Qu.: 9.00       1st Qu.:29.00        
##  Median :41.00   Median :18.00       Median :40.00        
##  Mean   :40.99   Mean   :17.81       Mean   :39.61        
##  3rd Qu.:51.00   3rd Qu.:26.00       3rd Qu.:50.00        
##  Max.   :60.00   Max.   :35.00       Max.   :60.00        
##  Number_of_Virtual_Meetings Work_Life_Balance_Rating Social_Isolation_Rating
##  Min.   : 0.000             Min.   :1.000            Min.   :1.000          
##  1st Qu.: 4.000             1st Qu.:2.000            1st Qu.:2.000          
##  Median : 8.000             Median :3.000            Median :3.000          
##  Mean   : 7.559             Mean   :2.984            Mean   :2.994          
##  3rd Qu.:12.000             3rd Qu.:4.000            3rd Qu.:4.000          
##  Max.   :15.000             Max.   :5.000            Max.   :5.000          
##  Company_Support_for_Remote_Work
##  Min.   :1.000                  
##  1st Qu.:2.000                  
##  Median :3.000                  
##  Mean   :3.008                  
##  3rd Qu.:4.000                  
##  Max.   :5.000

Descriptive Statistics for Categorical Columns Frequency

remote_work_df %>%
  dplyr::select(where(is.factor)) %>%
  map(table)

## $Gender
## 
##            Female              Male        Non-binary Prefer not to say 
##              1274              1270              1214              1242 
## 
## $Job_Role
## 
##    Data Scientist          Designer                HR         Marketing 
##               696               723               716               683 
##   Project Manager             Sales Software Engineer 
##               738               733               711 
## 
## $Industry
## 
##    Consulting     Education       Finance    Healthcare            IT 
##           680           690           747           728           746 
## Manufacturing        Retail 
##           683           726 
## 
## $Work_Location
## 
## Hybrid Onsite Remote 
##   1649   1637   1714 
## 
## $Stress_Level
## 
##   High    Low Medium 
##   1686   1645   1669 
## 
## $Mental_Health_Condition
## 
##    Anxiety    Burnout Depression       None 
##       1278       1280       1246       1196 
## 
## $Access_to_Mental_Health_Resources
## 
##   No  Yes 
## 2553 2447 
## 
## $Productivity_Change
## 
##  Decrease  Increase No Change 
##      1737      1586      1677 
## 
## $Satisfaction_with_Remote_Work
## 
##     Neutral   Satisfied Unsatisfied 
##        1648        1675        1677 
## 
## $Physical_Activity
## 
##  Daily   None Weekly 
##   1616   1629   1755 
## 
## $Sleep_Quality
## 
## Average    Good    Poor 
##    1628    1687    1685 
## 
## $Region
## 
##        Africa          Asia        Europe North America       Oceania 
##           860           829           840           777           867 
## South America 
##           827

Distribution Plots for Age column

ggplot(remote_work_df, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  theme_minimal() +
  labs(title = "Age Distribution", x = "Age", y = "Count")

### Distribution Plots for Work Location vs Stress Level

ggplot(remote_work_df, aes(x = Work_Location, fill = Stress_Level)) +
  geom_bar(position = "dodge") +
  labs(title = "Stress Level by Work Location", x = "Work Location", y = "Count") +
  theme_minimal()

### Distribution Plots for Mental Health Condition by Gender

ggplot(remote_work_df, aes(x = Gender, fill = Mental_Health_Condition)) +
  geom_bar(position = "fill") +
  labs(title = "Mental Health Condition Distribution by Gender", y = "Proportion") +
  theme_minimal() +
  coord_flip()

### Correlation (Numeric Features)

# Correlation matrix
numeric_data <- remote_work_df %>%
  dplyr::select(where(is.numeric))

cor_matrix <- cor(numeric_data, use = "complete.obs")
round(cor_matrix, 2)

##                                   Age Years_of_Experience Hours_Worked_Per_Week
## Age                              1.00                0.00                  0.00
## Years_of_Experience              0.00                1.00                 -0.02
## Hours_Worked_Per_Week            0.00               -0.02                  1.00
## Number_of_Virtual_Meetings       0.00                0.02                  0.00
## Work_Life_Balance_Rating         0.02                0.00                  0.00
## Social_Isolation_Rating         -0.02                0.00                 -0.01
## Company_Support_for_Remote_Work  0.02                0.01                  0.01
##                                 Number_of_Virtual_Meetings
## Age                                                   0.00
## Years_of_Experience                                   0.02
## Hours_Worked_Per_Week                                 0.00
## Number_of_Virtual_Meetings                            1.00
## Work_Life_Balance_Rating                              0.01
## Social_Isolation_Rating                               0.00
## Company_Support_for_Remote_Work                       0.00
##                                 Work_Life_Balance_Rating
## Age                                                 0.02
## Years_of_Experience                                 0.00
## Hours_Worked_Per_Week                               0.00
## Number_of_Virtual_Meetings                          0.01
## Work_Life_Balance_Rating                            1.00
## Social_Isolation_Rating                             0.00
## Company_Support_for_Remote_Work                    -0.01
##                                 Social_Isolation_Rating
## Age                                               -0.02
## Years_of_Experience                                0.00
## Hours_Worked_Per_Week                             -0.01
## Number_of_Virtual_Meetings                         0.00
## Work_Life_Balance_Rating                           0.00
## Social_Isolation_Rating                            1.00
## Company_Support_for_Remote_Work                    0.02
##                                 Company_Support_for_Remote_Work
## Age                                                        0.02
## Years_of_Experience                                        0.01
## Hours_Worked_Per_Week                                      0.01
## Number_of_Virtual_Meetings                                 0.00
## Work_Life_Balance_Rating                                  -0.01
## Social_Isolation_Rating                                    0.02
## Company_Support_for_Remote_Work                            1.00

# Optional: use corrplot (already loaded above) for better visuals
corrplot(cor_matrix, method = "color", tl.cex = 0.8)

### Relationships Exploration for Productivity Change vs Mental Health

ggplot(remote_work_df, aes(x = Mental_Health_Condition, fill = Productivity_Change)) +
  geom_bar(position = "dodge") +
  labs(title = "Productivity Change by Mental Health Condition") +
  theme_minimal()

### Relationships Exploration for Sleep Quality by Region

ggplot(remote_work_df, aes(x = Region, fill = Sleep_Quality)) +
  geom_bar(position = "dodge") +
  labs(title = "Sleep Quality Across Regions") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

### Automated Exploratory Report

# Generate an HTML report
create_report(remote_work_df, output_file = "eda_remote_work_report.html")

## 
## Output created: eda_remote_work_report.html

Check for outliers or inconsistent entries

# Check Hours_Worked_Per_Week for outliers
boxplot(remote_work_df$Hours_Worked_Per_Week, main = "Boxplot of Weekly Hours Worked")

# Check unique values in categorical columns to identify typos or
# categories like "Prefer not to say" that may need special handling
sapply(remote_work_df %>% dplyr::select(where(is.factor)), levels)

## $Gender
## [1] "Female"            "Male"              "Non-binary"       
## [4] "Prefer not to say"
## 
## $Job_Role
## [1] "Data Scientist"    "Designer"          "HR"               
## [4] "Marketing"         "Project Manager"   "Sales"            
## [7] "Software Engineer"
## 
## $Industry
## [1] "Consulting"    "Education"     "Finance"       "Healthcare"   
## [5] "IT"            "Manufacturing" "Retail"       
## 
## $Work_Location
## [1] "Hybrid" "Onsite" "Remote"
## 
## $Stress_Level
## [1] "High"   "Low"    "Medium"
## 
## $Mental_Health_Condition
## [1] "Anxiety"    "Burnout"    "Depression" "None"      
## 
## $Access_to_Mental_Health_Resources
## [1] "No"  "Yes"
## 
## $Productivity_Change
## [1] "Decrease"  "Increase"  "No Change"
## 
## $Satisfaction_with_Remote_Work
## [1] "Neutral"     "Satisfied"   "Unsatisfied"
## 
## $Physical_Activity
## [1] "Daily"  "None"   "Weekly"
## 
## $Sleep_Quality
## [1] "Average" "Good"    "Poor"   
## 
## $Region
## [1] "Africa"        "Asia"          "Europe"        "North America"
## [5] "Oceania"       "South America"

Summary of Insights from Data Exploration

Dataset Structure and Quality

The dataset includes a diverse set of variables:

demographic (age, gender, region), work-related (job role, industry, work location), mental health (stress level, mental health condition, access to resources), and lifestyle factors (sleep quality, physical activity).

No missing values were found in any column, so the dataset is complete and can be analyzed without imputation.

All categorical columns were successfully converted to factors, and numeric variables are in usable form.

Employee_ID was removed correctly, avoiding unnecessary noise.

Interpretation of Charts:

Age Distribution Chart

The 25–30, 40–50, and 50–55 age bins have the highest frequencies, each with roughly 650–700 employees.

The lowest bin has the smallest count (under 150), and the highest bin is also smaller (around 400). Ages in the data run only from 22 to 60 (see the summary output above), so with binwidth = 5 these edge bins extend past the observed range and cover fewer distinct ages than the interior bins. The dips are therefore most likely a binning artifact rather than evidence of fewer young or senior employees.

This is not a normal (bell-shaped) distribution. The quartiles (31, 41, 51) are evenly spaced between the minimum and the maximum, which is consistent with an approximately uniform distribution of ages from 22 to 60.
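The effect of bin alignment on the edge bins can be sketched in base R. This is an illustration only: it assumes the histogram bins are centered on multiples of 5 (breaks at 17.5, 22.5, ..., 62.5), which is our understanding of the default alignment for binwidth = 5, and it counts how many distinct integer ages between 22 and 60 fall into each bin.

```r
# Distinct integer ages observed in the data (min 22, max 60 per summary())
ages <- 22:60

# Assumed bin edges: width 5, centered on multiples of 5
breaks <- seq(17.5, 62.5, by = 5)

# Number of distinct ages that each bin can contain
ages_per_bin <- table(cut(ages, breaks = breaks))
print(ages_per_bin)
```

Under this assumption the first bin holds a single age and the last bin holds three, while every interior bin holds five, so lower bars at the edges are expected even for perfectly uniform ages.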

Stress Level by Work Location Chart

Remote workers report the highest number of high stress levels, noticeably more than hybrid or onsite workers.

Onsite workers have the most balanced distribution, with slightly more reporting low stress than high.

Hybrid workers show a fairly even spread across all stress levels, suggesting a moderate stress profile.
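Bar heights in a dodged chart also depend on how many employees fall in each work location, so raw counts alone can mislead. A minimal sketch of comparing within-group proportions and running a chi-squared test of independence is shown here. It uses a made-up toy data frame (`toy`, randomly generated), not the project data, so the numbers it prints say nothing about the real dataset.

```r
set.seed(621)

# Hypothetical toy data standing in for remote_work_df
toy <- data.frame(
  Work_Location = sample(c("Hybrid", "Onsite", "Remote"), 300, replace = TRUE),
  Stress_Level  = sample(c("Low", "Medium", "High"), 300, replace = TRUE)
)

# Counts of stress level within each work location
counts <- table(toy$Work_Location, toy$Stress_Level)

# Row proportions: share of each stress level within a work location
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))

# Chi-squared test of independence between the two factors
chi_result <- chisq.test(counts)
print(chi_result)
```

The same three calls (`table()`, `prop.table()`, `chisq.test()`) could be applied to `remote_work_df$Work_Location` and `remote_work_df$Stress_Level` to check whether the visual differences are larger than sampling noise.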

Mental Health Condition Distribution by Gender Chart

Anxiety is the most common condition across all genders, especially among males and females.

Non-binary and “Prefer not to say” groups have higher proportions of depression and burnout compared to the binary genders.

The “None” category (no mental health condition) is least prevalent in the non-binary and “Prefer not to say” groups.

Males have the highest proportion of “None” (no mental health condition).

Correlation Matrix of Numeric Variables Chart

All Correlations Are Negligible:

Every off-diagonal entry in the correlation matrix lies between -0.02 and 0.02.

Age and Years_of_Experience, which would normally be strongly related, have a correlation of 0.00. Hours_Worked_Per_Week and Number_of_Virtual_Meetings are likewise uncorrelated (0.00).

The absence of the expected age/experience relationship is unusual for workforce data and may indicate that the numeric variables were generated independently, so conclusions drawn from them should be treated with caution.

No Evidence of Multicollinearity:

No pair of numeric variables shows a high correlation, so there is little risk of multicollinearity among the numeric predictors in predictive modeling.
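Statements about the matrix can be checked against the printed values directly. The sketch below re-enters the rounded correlation matrix from the output above by hand (so it carries the same two-decimal rounding) and extracts the largest absolute off-diagonal correlation. The object name `cor_printed` is introduced here for illustration.

```r
vars <- c("Age", "Years_of_Experience", "Hours_Worked_Per_Week",
          "Number_of_Virtual_Meetings", "Work_Life_Balance_Rating",
          "Social_Isolation_Rating", "Company_Support_for_Remote_Work")

# Rounded correlations copied from the printed output of round(cor_matrix, 2)
cor_printed <- matrix(c(
   1.00,  0.00,  0.00,  0.00,  0.02, -0.02,  0.02,
   0.00,  1.00, -0.02,  0.02,  0.00,  0.00,  0.01,
   0.00, -0.02,  1.00,  0.00,  0.00, -0.01,  0.01,
   0.00,  0.02,  0.00,  1.00,  0.01,  0.00,  0.00,
   0.02,  0.00,  0.00,  0.01,  1.00,  0.00, -0.01,
  -0.02,  0.00, -0.01,  0.00,  0.00,  1.00,  0.02,
   0.02,  0.01,  0.01,  0.00, -0.01,  0.02,  1.00
), nrow = 7, byrow = TRUE, dimnames = list(vars, vars))

# Largest absolute correlation between two different variables
off_diag <- cor_printed[upper.tri(cor_printed)]
max_abs  <- max(abs(off_diag))
print(max_abs)

# Correlation between age and experience specifically
print(cor_printed["Age", "Years_of_Experience"])
```

On the live object, the same check would be `max(abs(cor_matrix[upper.tri(cor_matrix)]))`.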

Productivity Change by Mental Health Condition Chart

Decrease in Productivity is most prominent among individuals with:

Depression (highest drop)

Burnout and Anxiety (also show high counts of decreased productivity)

Individuals with no mental health condition still report productivity decreases, but less frequently compared to those with depression.

Increase in productivity (green bars) is lowest among those with depression, suggesting a clear negative impact of depression on work output.

Sleep Quality Across Regions Chart

Africa and Oceania have the highest counts of “Good” sleep quality, suggesting relatively better rest patterns among employees in those regions.

Asia and North America show a slightly higher proportion of “Poor” sleep quality, which may indicate higher stress, longer work hours, or less work-life balance.

Europe and South America present a balanced distribution, with no extreme dominance of any sleep quality level.

Boxplot of Weekly Hours Worked Chart

The median (black line) is 40 hours/week, which aligns with a standard full-time workload.

The interquartile range (IQR) spans 29 to 50 hours, the range that contains the middle half of employees.

The minimum is 20 hours and the maximum is 60 hours.

There are no extreme outliers shown, but a few employees are working near the upper threshold, which may indicate potential overwork or burnout risk.
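The absence of outlier points can be confirmed with the usual 1.5 × IQR whisker rule. This sketch plugs in the quartiles, minimum, and maximum that summary() reported for Hours_Worked_Per_Week; boxplot() computes its hinges slightly differently from summary(), so treat the fences as approximate.

```r
# Values reported by summary() for Hours_Worked_Per_Week
q1      <- 29
q3      <- 50
obs_min <- 20
obs_max <- 60

# 1.5 * IQR fences: points beyond these would be drawn as outliers
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# TRUE only if some observation lies outside the fences
any_outliers <- obs_min < lower_fence || obs_max > upper_fence
print(c(lower = lower_fence, upper = upper_fence))
print(any_outliers)
```

Because the observed range of 20 to 60 sits well inside the fences, no observation qualifies as an outlier under this rule.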

Possible Data Preparation Steps Outline (Post-Exploration)

Handle Missing Values: Check for NA values in both categorical and numerical columns. None were found in this dataset, so the options below apply only if new data introduce gaps.

Impute or drop:

For numerical features: use median or mean imputation.

For categorical features: use mode or a new category like “Unknown”.
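A minimal sketch of median and mode imputation in base R follows. The data frame `toy` and its values are invented for illustration, since the project data has no missing values.

```r
# Hypothetical toy data with gaps
toy <- data.frame(
  Hours_Worked_Per_Week = c(40, NA, 52, 35, NA),
  Sleep_Quality = factor(c("Good", NA, "Poor", "Good", "Average"))
)

# Numeric column: replace NA with the median of the observed values
hours_median <- median(toy$Hours_Worked_Per_Week, na.rm = TRUE)
toy$Hours_Worked_Per_Week[is.na(toy$Hours_Worked_Per_Week)] <- hours_median

# Categorical column: replace NA with the most frequent level (the mode)
mode_level <- names(which.max(table(toy$Sleep_Quality)))
toy$Sleep_Quality[is.na(toy$Sleep_Quality)] <- mode_level

print(toy)
```

For larger amounts of missingness, the mice package loaded at the top of this report offers multiple imputation as an alternative to these single-value fills.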

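Encode Categorical Variables: Convert categorical features for ML models:

One-hot encoding: for unordered categories (e.g., Job_Role, Industry, Region).

Ordinal encoding: for ordered factors (e.g., Stress_Level, Satisfaction_with_Remote_Work). Because as.factor() sorted the levels alphabetically (e.g., "High", "Low", "Medium"), the intended order must be set explicitly before encoding.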
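Both encodings can be sketched in base R. The toy data frame and the column name `Stress_Ordered` are made up for illustration, and the Low < Medium < High ordering is our assumption about the intended scale.

```r
# Hypothetical toy data
toy <- data.frame(
  Region       = factor(c("Asia", "Europe", "Africa", "Asia")),
  Stress_Level = c("Medium", "High", "Low", "Medium")
)

# One-hot encoding: one indicator column per Region level (no intercept)
one_hot <- model.matrix(~ Region - 1, data = toy)
print(one_hot)

# Ordinal encoding: set the level order explicitly, then take integer codes
toy$Stress_Ordered <- factor(toy$Stress_Level,
                             levels = c("Low", "Medium", "High"),
                             ordered = TRUE)
stress_code <- as.integer(toy$Stress_Ordered)
print(stress_code)
```

Without the explicit `levels` argument the integer codes would follow alphabetical order, which would place "High" below "Low".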

Feature Engineering: Create new features if beneficial:

Binary flags (e.g., Has_Mental_Health_Issue = if Mental_Health_Condition ≠ “None”)

Group rare categories (e.g., combine job roles or industries with low frequency)

Bucket Age or Years_of_Experience into groups if needed
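The binary flag and the age buckets could look like the sketch below. The toy rows, the bucket boundaries, and the labels are illustrative choices, not decisions made in this project.

```r
# Hypothetical toy data
toy <- data.frame(
  Age = c(22, 35, 47, 60),
  Mental_Health_Condition = c("None", "Anxiety", "Burnout", "None")
)

# Binary flag: 1 if any condition is reported, 0 for "None"
toy$Has_Mental_Health_Issue <- as.integer(toy$Mental_Health_Condition != "None")

# Age buckets: right-closed intervals (20,30], (30,40], (40,50], (50,60]
toy$Age_Group <- cut(toy$Age,
                     breaks = c(20, 30, 40, 50, 60),
                     labels = c("21-30", "31-40", "41-50", "51-60"))

print(toy)
```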

Scale / Normalize Numeric Features: Normalize features like:

Hours_Worked_Per_Week

Social_Isolation_Rating

Work_Life_Balance_Rating

Use standardization (z-score) or min-max scaling depending on model choice
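The two scaling options differ only in the centering and the divisor. A short base R sketch on an example vector (the five values are the summary statistics of Hours_Worked_Per_Week, used here purely as sample input):

```r
hours <- c(20, 29, 40, 50, 60)

# Standardization (z-score): subtract the mean, divide by the standard deviation
z_score <- as.numeric(scale(hours))

# Min-max scaling: map the smallest value to 0 and the largest to 1
min_max <- (hours - min(hours)) / (max(hours) - min(hours))

print(round(z_score, 3))
print(round(min_max, 3))
```

When a train-test split is used, the mean, standard deviation, minimum, and maximum should be computed on the training set only and then applied to the test set.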

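Balance the Dataset (if needed): If the target variable (e.g., Mental_Health_Condition or Productivity_Change) is imbalanced:

Consider SMOTE, undersampling, or class weighting for classification models. The frequency tables above show that both candidate targets are close to evenly split, so rebalancing is unlikely to be needed here.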
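Whether rebalancing is needed can be judged from the class proportions. The sketch below uses the Mental_Health_Condition counts from the frequency table earlier in this report; the 1.5 threshold mentioned in the comment is an arbitrary rule of thumb, not a standard.

```r
# Counts copied from the Mental_Health_Condition frequency table
condition_counts <- c(Anxiety = 1278, Burnout = 1280,
                      Depression = 1246, None = 1196)

# Share of each class
class_props <- condition_counts / sum(condition_counts)
print(round(class_props, 3))

# Ratio of largest to smallest class; values near 1 mean balanced classes
# (a ratio above roughly 1.5 might prompt a closer look)
imbalance_ratio <- max(condition_counts) / min(condition_counts)
print(round(imbalance_ratio, 3))
```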

Remove Redundant Features: Drop non-informative or leakage-prone features:

Employee_ID (irrelevant for modeling; already dropped above)

Possibly Region or Work_Location if highly correlated with other variables

Outlier Detection: Use boxplots or z-scores to detect outliers in:

Hours_Worked_Per_Week, Social_Isolation_Rating, etc.

Decide whether to cap, transform, or remove

Ensure Consistent Factor Levels: Standardize labels for categorical values (e.g., “Prefer not to say” may need special handling)

Collapse categories for sparsity or interpretability

Create Target Variable (if modeling): Define the dependent variable:

Classification: Mental_Health_Condition, Productivity_Change

Regression: could be derived scores (e.g., scale of mental health burden)

Train-Test Split (if modeling): Split dataset (e.g., 70% training, 30% testing) using caret or rsample.
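A stratified 70/30 split can be sketched in base R so that each class keeps roughly the same share in both sets. The toy data frame, the seed, and the object names are illustrative; caret::createDataPartition() provides a packaged alternative for the same purpose.

```r
set.seed(621)

# Hypothetical toy data with a three-class target
toy <- data.frame(
  id = 1:100,
  Productivity_Change = rep(c("Decrease", "Increase", "No Change"),
                            length.out = 100)
)

# Sample 70% of the row indices within each class (stratified sampling)
rows_by_class <- split(seq_len(nrow(toy)), toy$Productivity_Change)
train_idx <- unlist(
  lapply(rows_by_class, function(rows) sample(rows, size = floor(0.7 * length(rows)))),
  use.names = FALSE
)

train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

print(table(train_set$Productivity_Change))
print(table(test_set$Productivity_Change))
```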

Data621_Final_Project

Mubashira Qari

2025-04-26