Context

For this project, I will be analyzing a data set related to mental health and productivity in a work environment.
Data used: https://www.kaggle.com/datasets/shadab80k/mental-health-productivity-2026

Variables Found in Data:

  • Employee ID - unique identifier
  • Age - numeric
  • Gender - categorical
  • Country - categorical
  • Industry - categorical
  • Work_Mode - remote/hybrid/onsite
  • Work_Hour_Per_week - numeric
  • Stress_Level - categorical
  • Sleep_Hours - numeric
  • Physical_Activity_Hours - numeric
  • Mental_Health_Support_Access - yes/no
  • Burnout_Risk - categorical
  • Productivity_Score - numeric

Predictor Variables

  • Mental Health Factors
    • Stress_Level
    • Burnout_Risk
    • Sleep_Hours
    • Mental_Health_Support_Access
    • Physical_Activity_Hours
  • Work And Lifestyle Factors
    • Work_Hour_Per_week
    • Work_Mode
    • Industry *Demographics
    • Age
    • Gender
    • Country

Research Question

Productivity is the efficiency of being able to take inputs and convert them into outputs, measuring how effectively work goals can be reached.
Therefore, it doesn’t come as a surprise that employers seek to improve productivity in their businesses to increase profits and perform efficiently.
Research Question: What is the impact of mental health factors alone on productivity in the work force? Does the addition of workforce-related variables effect productivity? Would a full model be the best option based on this data set?

Data

Importing Data

data <- read.csv("mental_health_productivity_2026.csv")

str(data)
## 'data.frame':    1500 obs. of  13 variables:
##  $ Employee_ID                 : chr  "EMP_0001" "EMP_0002" "EMP_0003" "EMP_0004" ...
##  $ Age                         : int  50 36 29 42 40 44 32 32 45 57 ...
##  $ Gender                      : chr  "Male" "Male" "Male" "Female" ...
##  $ Country                     : chr  "USA" "UK" "Brazil" "UK" ...
##  $ Industry                    : chr  "Manufacturing" "Finance" "Finance" "Manufacturing" ...
##  $ Work_Mode                   : chr  "Remote" "Remote" "Remote" "Remote" ...
##  $ Work_Hours_Per_Week         : int  40 35 52 56 35 31 51 49 54 44 ...
##  $ Stress_Level                : int  9 1 4 10 2 7 7 7 4 8 ...
##  $ Sleep_Hours                 : num  7.38 6.19 7.47 7.17 9.52 ...
##  $ Productivity_Score          : int  89 61 47 49 53 54 77 69 65 83 ...
##  $ Physical_Activity_Hours     : num  6.565 0.185 8.319 1.747 5.155 ...
##  $ Mental_Health_Support_Access: chr  "Yes" "Yes" "No" "No" ...
##  $ Burnout_Risk                : chr  "High" "Low" "Low" "High" ...
summary(data)
##  Employee_ID             Age        Gender            Country         
##  Length:1500        Min.   :22   Length:1500        Length:1500       
##  Class :character   1st Qu.:31   Class :character   Class :character  
##  Mode  :character   Median :42   Mode  :character   Mode  :character  
##                     Mean   :41                                        
##                     3rd Qu.:50                                        
##                     Max.   :59                                        
##    Industry          Work_Mode         Work_Hours_Per_Week  Stress_Level   
##  Length:1500        Length:1500        Min.   :30.00       Min.   : 1.000  
##  Class :character   Class :character   1st Qu.:38.00       1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Median :47.00       Median : 7.000  
##                                        Mean   :46.98       Mean   : 6.245  
##                                        3rd Qu.:55.25       3rd Qu.: 9.000  
##                                        Max.   :64.00       Max.   :10.000  
##   Sleep_Hours     Productivity_Score Physical_Activity_Hours
##  Min.   : 4.000   Min.   : 40.00     Min.   :0.01087        
##  1st Qu.: 6.138   1st Qu.: 55.00     1st Qu.:2.52587        
##  Median : 6.982   Median : 70.00     Median :5.02290        
##  Mean   : 6.973   Mean   : 69.84     Mean   :5.00524        
##  3rd Qu.: 7.818   3rd Qu.: 84.00     3rd Qu.:7.52587        
##  Max.   :10.000   Max.   :100.00     Max.   :9.99636        
##  Mental_Health_Support_Access Burnout_Risk      
##  Length:1500                  Length:1500       
##  Class :character             Class :character  
##  Mode  :character             Mode  :character  
##                                                 
##                                                 
## 

Cleaning Data

After seeing a summary of the data, it can be seen that there are the following categorical variables that needed to be factored:

data$Employee_ID <- NULL
data$Gender <- as.factor(data$Gender)
data$Country <- as.factor(data$Country)
data$Industry <- as.factor(data$Industry)
data$Work_Mode <- as.factor(data$Work_Mode)
data$Mental_Health_Support_Access <- as.factor(data$Mental_Health_Support_Access)
data$Burnout_Risk <- as.factor(data$Burnout_Risk)

Check for missing values

data <- na.omit(data)

colSums(is.na(data))
##                          Age                       Gender 
##                            0                            0 
##                      Country                     Industry 
##                            0                            0 
##                    Work_Mode          Work_Hours_Per_Week 
##                            0                            0 
##                 Stress_Level                  Sleep_Hours 
##                            0                            0 
##           Productivity_Score      Physical_Activity_Hours 
##                            0                            0 
## Mental_Health_Support_Access                 Burnout_Risk 
##                            0                            0

The previous command finds all NA’s and counts them for each of the columns of the data. The command also prints this count for each column, which returned all 0’s after removing NA’s.

Check ranges

summary(data$Sleep_Hours)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   6.138   6.982   6.973   7.818  10.000
summary(data$Work_Hours_Per_Week)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.00   38.00   47.00   46.98   55.25   64.00
summary(data$Productivity_Score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40.00   55.00   70.00   69.84   84.00  100.00

#Data Visualization ## Distribution of Productivity

ggplot(data, aes(x = Productivity_Score)) +
  geom_histogram(
    bins = 30,
    fill = "#4C72B0",
    color = "white"
  ) +
  labs(
    title = "Distribution of Employee Productivity Scores",
    x = "Productivity Score",
    y = "Count"
  ) +
  theme_minimal()

This graph shows frequencies of different productivity scores ranging from 40 to 100. The distribution is noticeably widespread and contains spikes, showing that productivity score fluctuates.

Stress Levels and Productivity

boxplot(Productivity_Score ~ Stress_Level,
        data = data,
        main = "Productivity by Stress Level",
        xlab = "Stress Level",
        ylab = "Productivity Score")

This graph compares productivity score to stress level (scale from 1 to 10). Interestingly, there seems to be no noticeable relationship between these two variables. Analyzing this graph further shows that productivity fluctuates.

Sleep and Productivity

ggplot(data, aes(x = Sleep_Hours, y = Productivity_Score)) +
  geom_point(alpha = 0.4, color = "#55A868") +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(
    title = "Relationship Between Sleep Duration and Productivity",
    x = "Sleep Hours per Night",
    y = "Productivity Score"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

This graph shows the relationship between hours of sleep and productivity score. There shows a clear positive correlation between sleep and productivity, showing that as hours of sleep increases (from 4 to 10) productivity score gradually increases as well.

Burnout Risk and Productivity

ggplot(data, aes(x = Burnout_Risk, y = Productivity_Score)) +
  geom_boxplot(fill = "#C44E52", alpha = 0.8) +
  labs(
    title = "Productivity Across Burnout Risk Levels",
    x = "Burnout Risk",
    y = "Productivity Score"
  ) +
  theme_minimal()

Employees with “low” and “medium” risk tend to have higher median productivity scores than employees with “high” risk, though the difference between “low” and “medium” is minimal.

Mental Health Support Access and Productivity

ggplot(data, aes(x = Mental_Health_Support_Access, y = Productivity_Score)) +
  geom_boxplot(fill = "#8172B3", alpha = 0.8) +
  labs(
    title = "Productivity and Access to Mental Health Support",
    x = "Mental Health Support Access",
    y = "Productivity Score"
  ) +
  theme_minimal()

This boxplot compares employees who have access to mental health support and their productivity scores. Employees with access to mental health support show a slightly higher median productivity score and a higher upper quartile, which suggests that mental health support access could provide a small boost in performance.

Work Mode and Productivity

ggplot(data, aes(x = Work_Mode, y = Productivity_Score)) +
  geom_boxplot(fill = "#937860", alpha = 0.8) +
  labs(
    title = "Productivity by Work Mode",
    x = "Work Mode",
    y = "Productivity Score"
  ) +
  theme_minimal()

In these boxplot productivity is compared between employees with different work modes (hybrid/remote/on-site). Remote workers seem to have the highest median productivity score and greatest range, followed by hybrid and on-site, respectively.

Creating Linear Models

Model 1: Mental Health Variables Only

model1 <- lm(
  Productivity_Score ~ Stress_Level + Sleep_Hours + Burnout_Risk + 
                       Mental_Health_Support_Access + Physical_Activity_Hours,
  data = data
)
summary(model1)
## 
## Call:
## lm(formula = Productivity_Score ~ Stress_Level + Sleep_Hours + 
##     Burnout_Risk + Mental_Health_Support_Access + Physical_Activity_Hours, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.6412 -14.6699   0.0708  14.9384  30.6264 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     39.645518   3.377340  11.739   <2e-16 ***
## Stress_Level                     0.266617   0.222827   1.197   0.2317    
## Sleep_Hours                      3.845615   0.356671  10.782   <2e-16 ***
## Burnout_RiskLow                  2.706128   1.485127   1.822   0.0686 .  
## Burnout_RiskMedium               1.459211   1.347022   1.083   0.2789    
## Mental_Health_Support_AccessYes  0.717751   0.868501   0.826   0.4087    
## Physical_Activity_Hours         -0.009194   0.149427  -0.062   0.9509    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.77 on 1493 degrees of freedom
## Multiple R-squared:  0.07485,    Adjusted R-squared:  0.07114 
## F-statistic: 20.13 on 6 and 1493 DF,  p-value: < 2.2e-16

Model 2: Add Work/Lifestyle Factors

model2 <- lm(
  Productivity_Score ~ Stress_Level + Sleep_Hours + Burnout_Risk +
                       Work_Hours_Per_Week + Work_Mode +
                       Mental_Health_Support_Access + 
                       Physical_Activity_Hours + Industry,
  data = data
)
summary(model2)
## 
## Call:
## lm(formula = Productivity_Score ~ Stress_Level + Sleep_Hours + 
##     Burnout_Risk + Work_Hours_Per_Week + Work_Mode + Mental_Health_Support_Access + 
##     Physical_Activity_Hours + Industry, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.021 -14.435   0.077  15.123  29.989 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     43.200108   4.061460  10.637   <2e-16 ***
## Stress_Level                     0.439766   0.236992   1.856   0.0637 .  
## Sleep_Hours                      3.830665   0.357597  10.712   <2e-16 ***
## Burnout_RiskLow                  2.503986   1.488029   1.683   0.0926 .  
## Burnout_RiskMedium               1.294039   1.349629   0.959   0.3378    
## Work_Hours_Per_Week             -0.118747   0.050438  -2.354   0.0187 *  
## Work_ModeOn-site                -0.305735   1.064160  -0.287   0.7739    
## Work_ModeRemote                  0.704007   1.075282   0.655   0.5128    
## Mental_Health_Support_AccessYes  0.801517   0.870382   0.921   0.3573    
## Physical_Activity_Hours         -0.007321   0.149827  -0.049   0.9610    
## IndustryFinance                  0.733547   1.480218   0.496   0.6203    
## IndustryHealthcare               0.817595   1.496373   0.546   0.5849    
## IndustryManufacturing            1.392901   1.478598   0.942   0.3463    
## IndustryRetail                   1.438464   1.468529   0.980   0.3275    
## IndustryTech                     1.581521   1.512651   1.046   0.2959    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.77 on 1485 degrees of freedom
## Multiple R-squared:  0.07987,    Adjusted R-squared:  0.07119 
## F-statistic: 9.207 on 14 and 1485 DF,  p-value: < 2.2e-16

Model 3: Full Model

model3 <- lm(
  Productivity_Score ~ Stress_Level + Sleep_Hours + Burnout_Risk +
                       Work_Hours_Per_Week + Work_Mode +
                       Mental_Health_Support_Access +
                       Physical_Activity_Hours +
                       Age + Gender + Country + Industry,
  data = data
)
summary(model3)
## 
## Call:
## lm(formula = Productivity_Score ~ Stress_Level + Sleep_Hours + 
##     Burnout_Risk + Work_Hours_Per_Week + Work_Mode + Mental_Health_Support_Access + 
##     Physical_Activity_Hours + Age + Gender + Country + Industry, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.865 -14.381  -0.096  14.977  30.255 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     44.102978   4.608060   9.571   <2e-16 ***
## Stress_Level                     0.452736   0.238179   1.901   0.0575 .  
## Sleep_Hours                      3.884120   0.359874  10.793   <2e-16 ***
## Burnout_RiskLow                  2.458211   1.494700   1.645   0.1003    
## Burnout_RiskMedium               1.296134   1.354810   0.957   0.3389    
## Work_Hours_Per_Week             -0.120003   0.050627  -2.370   0.0179 *  
## Work_ModeOn-site                -0.271594   1.068701  -0.254   0.7994    
## Work_ModeRemote                  0.800114   1.081183   0.740   0.4594    
## Mental_Health_Support_AccessYes  0.771524   0.876577   0.880   0.3789    
## Physical_Activity_Hours          0.007584   0.151092   0.050   0.9600    
## Age                              0.004994   0.039503   0.126   0.8994    
## GenderMale                       0.917029   1.053981   0.870   0.3844    
## GenderNon-binary                 0.384666   1.077461   0.357   0.7211    
## CountryBrazil                   -2.968255   2.002482  -1.482   0.1385    
## CountryCanada                   -2.304110   1.956153  -1.178   0.2390    
## CountryFrance                   -1.262794   2.011517  -0.628   0.5302    
## CountryGermany                  -1.993783   2.006422  -0.994   0.3205    
## CountryIndia                    -1.587723   2.025967  -0.784   0.4334    
## CountryJapan                    -3.212362   2.008602  -1.599   0.1100    
## CountrySingapore                -1.964145   2.019699  -0.972   0.3310    
## CountryUK                       -2.321139   1.971205  -1.178   0.2392    
## CountryUSA                      -2.329396   1.942693  -1.199   0.2307    
## IndustryFinance                  0.762925   1.488771   0.512   0.6084    
## IndustryHealthcare               0.954760   1.509937   0.632   0.5273    
## IndustryManufacturing            1.312573   1.486770   0.883   0.3775    
## IndustryRetail                   1.358470   1.477532   0.919   0.3580    
## IndustryTech                     1.605569   1.520338   1.056   0.2911    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.81 on 1473 degrees of freedom
## Multiple R-squared:  0.08269,    Adjusted R-squared:  0.0665 
## F-statistic: 5.107 on 26 and 1473 DF,  p-value: 1.81e-15
  • Significant Factors
    • Sleep_Hours: Based on its p-value, <2e-16, this variable is the strongest predictor for productivity.
    • Work_Hours_Per_Week: This variable also has a significant p-value and it’s negative. This confirms that as employees work more hours in a week, their productivity diminishes.
  • Marginally Significant -Stress_Level: This variable is slightly above, 0.0575, our significance value of 0.05. This suggests a slight positive trend, but not statistically reliable.
  • Statistics:
    • Overfitting: Having a greater Multiple R-squared than Adjusted R-squared is a sign that too many unneccessary variables have been added.

Model Comparison

anova(model1, model2, model3)

Looking at this Analysis of Variance Table shows a clear trend that adding more variables is not improving my model in terms of predicting productivity.
This test is looking to see if adding variables and making the model more complex explains a significant amount of the remaining variation. Since the p-value is much greater than the significance level of 0.05, the answer is no.

Model 1 VS Model 2

AIC(model1, model2)
BIC(model1, model2)
summary(model1)$adj.r.squared
## [1] 0.07113507
summary(model2)$adj.r.squared
## [1] 0.07119223
  • Added Variables:
    Work_Hours_Per_Week, Work_Mode, and Industry.
  • Results:
    • p-value: 0.4290
    • AIC: 12724.05 -> 12731.90 (worse)
    • BIC: 12766.56 -> 12816.91 (worse)
    • Adjusted R2: 0.07113507 -> 0.07119223 (improvement)
  • Interpretation: Looking at the statistics acquired from this comparison, it can be seen that model 2 had a very slight improvement; however, the p-value states that this “improvement” has a 42.9% chance of being random noise.

Model 2 VS Model 3

AIC(model2, model3)
BIC(model2, model3)
summary(model2)$adj.r.squared
## [1] 0.07119223
summary(model3)$adj.r.squared
## [1] 0.06650271
  • Added Variables:
    Age, Gender, and Country.
  • Results:
    • p-value: 0.9714
    • AIC: 12731.90 -> 12751.28 (worse)
    • BIC: 12816.91 -> 12900.05 (worse)
    • Adjusted R2: 0.07119223 -> 0.06650271 (worse)
  • Interpretation: The p-value for this is extremely high, which means the added variables in model 3 provide almost zero insight into productivity score. The full model is the worst out of the three linear models created.

Optimization

drop1(model3, test = "F")

Based on these results and previous findings, we come up with the following final linear model:

final_model <- lm(
  Productivity_Score ~ Stress_Level + Sleep_Hours + 
                       Work_Hours_Per_Week,
  data = data
)

Conclusion

What is the impact of mental health factors alone on productivity in the work force?

  • The Power of Sleep: Sleep duration is the single most reliable predictor of performance. For every additional hour of sleep, productivity scores increase by approximately 3.85 to 3.88 points (p<2e-16).

  • Stress and Burnout: While “Low” Burnout_Risk showed a marginal positive trend in simpler models, it lost statistical significance in the full model. This suggests that the negative effects of stress and burnout on productivity are largely mediated by or reflected in poor sleep quality.

  • Support Systems: Surprisingly, Mental_Health_Support_Access did not significantly move the needle on productivity scores (p = 0.3789), suggesting that the presence of a program is less impactful than the actual physiological recovery of the employee (sleep).

Would a full model be the best option based on this data set?

No, a full model would not be the best option. * Overfitting: The full model includes 26 variables but only explains 6.65% of the variance (Adjusted R2=0.0665). The high p-values for Age, Gender, Country, and Industry indicate these variables are “noise” that clutter the model without adding insight.

  • (AIC/BIC/ANOVA): The drop1 analysis and ANOVA tables consistently showed that removing demographic and environmental variables lowered the AIC or resulted in non-significant p-values, making the simpler model mathematically superior.

Final Verdict

The optimal model for this dataset is:
Productivity_Score \(\approx\) Sleep_Hours + Work_Hours_Per_Week + Stress_Level