For this project, I will be analyzing a data set related to mental
health and productivity in a work environment.
Data used: https://www.kaggle.com/datasets/shadab80k/mental-health-productivity-2026
Productivity is the efficiency of being able to take inputs and
convert them into outputs, measuring how effectively work goals can be
reached.
Therefore, it doesn’t come as a surprise that employers seek to improve
productivity in their businesses to increase profits and perform
efficiently.
Research Question: What is the impact of mental health
factors alone on productivity in the work force? Does the addition of
workforce-related variables effect productivity? Would a full model be
the best option based on this data set?
data <- read.csv("mental_health_productivity_2026.csv")
str(data)
## 'data.frame': 1500 obs. of 13 variables:
## $ Employee_ID : chr "EMP_0001" "EMP_0002" "EMP_0003" "EMP_0004" ...
## $ Age : int 50 36 29 42 40 44 32 32 45 57 ...
## $ Gender : chr "Male" "Male" "Male" "Female" ...
## $ Country : chr "USA" "UK" "Brazil" "UK" ...
## $ Industry : chr "Manufacturing" "Finance" "Finance" "Manufacturing" ...
## $ Work_Mode : chr "Remote" "Remote" "Remote" "Remote" ...
## $ Work_Hours_Per_Week : int 40 35 52 56 35 31 51 49 54 44 ...
## $ Stress_Level : int 9 1 4 10 2 7 7 7 4 8 ...
## $ Sleep_Hours : num 7.38 6.19 7.47 7.17 9.52 ...
## $ Productivity_Score : int 89 61 47 49 53 54 77 69 65 83 ...
## $ Physical_Activity_Hours : num 6.565 0.185 8.319 1.747 5.155 ...
## $ Mental_Health_Support_Access: chr "Yes" "Yes" "No" "No" ...
## $ Burnout_Risk : chr "High" "Low" "Low" "High" ...
summary(data)
## Employee_ID Age Gender Country
## Length:1500 Min. :22 Length:1500 Length:1500
## Class :character 1st Qu.:31 Class :character Class :character
## Mode :character Median :42 Mode :character Mode :character
## Mean :41
## 3rd Qu.:50
## Max. :59
## Industry Work_Mode Work_Hours_Per_Week Stress_Level
## Length:1500 Length:1500 Min. :30.00 Min. : 1.000
## Class :character Class :character 1st Qu.:38.00 1st Qu.: 4.000
## Mode :character Mode :character Median :47.00 Median : 7.000
## Mean :46.98 Mean : 6.245
## 3rd Qu.:55.25 3rd Qu.: 9.000
## Max. :64.00 Max. :10.000
## Sleep_Hours Productivity_Score Physical_Activity_Hours
## Min. : 4.000 Min. : 40.00 Min. :0.01087
## 1st Qu.: 6.138 1st Qu.: 55.00 1st Qu.:2.52587
## Median : 6.982 Median : 70.00 Median :5.02290
## Mean : 6.973 Mean : 69.84 Mean :5.00524
## 3rd Qu.: 7.818 3rd Qu.: 84.00 3rd Qu.:7.52587
## Max. :10.000 Max. :100.00 Max. :9.99636
## Mental_Health_Support_Access Burnout_Risk
## Length:1500 Length:1500
## Class :character Class :character
## Mode :character Mode :character
##
##
##
After seeing a summary of the data, it can be seen that there are the following categorical variables that needed to be factored:
data$Employee_ID <- NULL
data$Gender <- as.factor(data$Gender)
data$Country <- as.factor(data$Country)
data$Industry <- as.factor(data$Industry)
data$Work_Mode <- as.factor(data$Work_Mode)
data$Mental_Health_Support_Access <- as.factor(data$Mental_Health_Support_Access)
data$Burnout_Risk <- as.factor(data$Burnout_Risk)
data <- na.omit(data)
colSums(is.na(data))
## Age Gender
## 0 0
## Country Industry
## 0 0
## Work_Mode Work_Hours_Per_Week
## 0 0
## Stress_Level Sleep_Hours
## 0 0
## Productivity_Score Physical_Activity_Hours
## 0 0
## Mental_Health_Support_Access Burnout_Risk
## 0 0
The previous command finds all NA’s and counts them for each of the columns of the data. The command also prints this count for each column, which returned all 0’s after removing NA’s.
summary(data$Sleep_Hours)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 6.138 6.982 6.973 7.818 10.000
summary(data$Work_Hours_Per_Week)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.00 38.00 47.00 46.98 55.25 64.00
summary(data$Productivity_Score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40.00 55.00 70.00 69.84 84.00 100.00
#Data Visualization ## Distribution of Productivity
ggplot(data, aes(x = Productivity_Score)) +
geom_histogram(
bins = 30,
fill = "#4C72B0",
color = "white"
) +
labs(
title = "Distribution of Employee Productivity Scores",
x = "Productivity Score",
y = "Count"
) +
theme_minimal()
This graph shows frequencies of different productivity scores ranging
from 40 to 100. The distribution is noticeably widespread and contains
spikes, showing that productivity score fluctuates.
boxplot(Productivity_Score ~ Stress_Level,
data = data,
main = "Productivity by Stress Level",
xlab = "Stress Level",
ylab = "Productivity Score")
This graph compares productivity score to stress level (scale from 1 to
10). Interestingly, there seems to be no noticeable relationship between
these two variables. Analyzing this graph further shows that
productivity fluctuates.
ggplot(data, aes(x = Sleep_Hours, y = Productivity_Score)) +
geom_point(alpha = 0.4, color = "#55A868") +
geom_smooth(method = "lm", se = TRUE, color = "black") +
labs(
title = "Relationship Between Sleep Duration and Productivity",
x = "Sleep Hours per Night",
y = "Productivity Score"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
This graph shows the relationship between hours of sleep and
productivity score. There shows a clear positive correlation between
sleep and productivity, showing that as hours of sleep increases (from 4
to 10) productivity score gradually increases as well.
ggplot(data, aes(x = Burnout_Risk, y = Productivity_Score)) +
geom_boxplot(fill = "#C44E52", alpha = 0.8) +
labs(
title = "Productivity Across Burnout Risk Levels",
x = "Burnout Risk",
y = "Productivity Score"
) +
theme_minimal()
Employees with “low” and “medium” risk tend to have higher median
productivity scores than employees with “high” risk, though the
difference between “low” and “medium” is minimal.
ggplot(data, aes(x = Mental_Health_Support_Access, y = Productivity_Score)) +
geom_boxplot(fill = "#8172B3", alpha = 0.8) +
labs(
title = "Productivity and Access to Mental Health Support",
x = "Mental Health Support Access",
y = "Productivity Score"
) +
theme_minimal()
This boxplot compares employees who have access to mental health support
and their productivity scores. Employees with access to mental health
support show a slightly higher median productivity score and a higher
upper quartile, which suggests that mental health support access could
provide a small boost in performance.
ggplot(data, aes(x = Work_Mode, y = Productivity_Score)) +
geom_boxplot(fill = "#937860", alpha = 0.8) +
labs(
title = "Productivity by Work Mode",
x = "Work Mode",
y = "Productivity Score"
) +
theme_minimal()
In these boxplot productivity is compared between employees with
different work modes (hybrid/remote/on-site). Remote workers seem to
have the highest median productivity score and greatest range, followed
by hybrid and on-site, respectively.
model1 <- lm(
Productivity_Score ~ Stress_Level + Sleep_Hours + Burnout_Risk +
Mental_Health_Support_Access + Physical_Activity_Hours,
data = data
)
summary(model1)
##
## Call:
## lm(formula = Productivity_Score ~ Stress_Level + Sleep_Hours +
## Burnout_Risk + Mental_Health_Support_Access + Physical_Activity_Hours,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.6412 -14.6699 0.0708 14.9384 30.6264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.645518 3.377340 11.739 <2e-16 ***
## Stress_Level 0.266617 0.222827 1.197 0.2317
## Sleep_Hours 3.845615 0.356671 10.782 <2e-16 ***
## Burnout_RiskLow 2.706128 1.485127 1.822 0.0686 .
## Burnout_RiskMedium 1.459211 1.347022 1.083 0.2789
## Mental_Health_Support_AccessYes 0.717751 0.868501 0.826 0.4087
## Physical_Activity_Hours -0.009194 0.149427 -0.062 0.9509
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.77 on 1493 degrees of freedom
## Multiple R-squared: 0.07485, Adjusted R-squared: 0.07114
## F-statistic: 20.13 on 6 and 1493 DF, p-value: < 2.2e-16
model2 <- lm(
Productivity_Score ~ Stress_Level + Sleep_Hours + Burnout_Risk +
Work_Hours_Per_Week + Work_Mode +
Mental_Health_Support_Access +
Physical_Activity_Hours + Industry,
data = data
)
summary(model2)
##
## Call:
## lm(formula = Productivity_Score ~ Stress_Level + Sleep_Hours +
## Burnout_Risk + Work_Hours_Per_Week + Work_Mode + Mental_Health_Support_Access +
## Physical_Activity_Hours + Industry, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.021 -14.435 0.077 15.123 29.989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.200108 4.061460 10.637 <2e-16 ***
## Stress_Level 0.439766 0.236992 1.856 0.0637 .
## Sleep_Hours 3.830665 0.357597 10.712 <2e-16 ***
## Burnout_RiskLow 2.503986 1.488029 1.683 0.0926 .
## Burnout_RiskMedium 1.294039 1.349629 0.959 0.3378
## Work_Hours_Per_Week -0.118747 0.050438 -2.354 0.0187 *
## Work_ModeOn-site -0.305735 1.064160 -0.287 0.7739
## Work_ModeRemote 0.704007 1.075282 0.655 0.5128
## Mental_Health_Support_AccessYes 0.801517 0.870382 0.921 0.3573
## Physical_Activity_Hours -0.007321 0.149827 -0.049 0.9610
## IndustryFinance 0.733547 1.480218 0.496 0.6203
## IndustryHealthcare 0.817595 1.496373 0.546 0.5849
## IndustryManufacturing 1.392901 1.478598 0.942 0.3463
## IndustryRetail 1.438464 1.468529 0.980 0.3275
## IndustryTech 1.581521 1.512651 1.046 0.2959
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.77 on 1485 degrees of freedom
## Multiple R-squared: 0.07987, Adjusted R-squared: 0.07119
## F-statistic: 9.207 on 14 and 1485 DF, p-value: < 2.2e-16
model3 <- lm(
Productivity_Score ~ Stress_Level + Sleep_Hours + Burnout_Risk +
Work_Hours_Per_Week + Work_Mode +
Mental_Health_Support_Access +
Physical_Activity_Hours +
Age + Gender + Country + Industry,
data = data
)
summary(model3)
##
## Call:
## lm(formula = Productivity_Score ~ Stress_Level + Sleep_Hours +
## Burnout_Risk + Work_Hours_Per_Week + Work_Mode + Mental_Health_Support_Access +
## Physical_Activity_Hours + Age + Gender + Country + Industry,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.865 -14.381 -0.096 14.977 30.255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.102978 4.608060 9.571 <2e-16 ***
## Stress_Level 0.452736 0.238179 1.901 0.0575 .
## Sleep_Hours 3.884120 0.359874 10.793 <2e-16 ***
## Burnout_RiskLow 2.458211 1.494700 1.645 0.1003
## Burnout_RiskMedium 1.296134 1.354810 0.957 0.3389
## Work_Hours_Per_Week -0.120003 0.050627 -2.370 0.0179 *
## Work_ModeOn-site -0.271594 1.068701 -0.254 0.7994
## Work_ModeRemote 0.800114 1.081183 0.740 0.4594
## Mental_Health_Support_AccessYes 0.771524 0.876577 0.880 0.3789
## Physical_Activity_Hours 0.007584 0.151092 0.050 0.9600
## Age 0.004994 0.039503 0.126 0.8994
## GenderMale 0.917029 1.053981 0.870 0.3844
## GenderNon-binary 0.384666 1.077461 0.357 0.7211
## CountryBrazil -2.968255 2.002482 -1.482 0.1385
## CountryCanada -2.304110 1.956153 -1.178 0.2390
## CountryFrance -1.262794 2.011517 -0.628 0.5302
## CountryGermany -1.993783 2.006422 -0.994 0.3205
## CountryIndia -1.587723 2.025967 -0.784 0.4334
## CountryJapan -3.212362 2.008602 -1.599 0.1100
## CountrySingapore -1.964145 2.019699 -0.972 0.3310
## CountryUK -2.321139 1.971205 -1.178 0.2392
## CountryUSA -2.329396 1.942693 -1.199 0.2307
## IndustryFinance 0.762925 1.488771 0.512 0.6084
## IndustryHealthcare 0.954760 1.509937 0.632 0.5273
## IndustryManufacturing 1.312573 1.486770 0.883 0.3775
## IndustryRetail 1.358470 1.477532 0.919 0.3580
## IndustryTech 1.605569 1.520338 1.056 0.2911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.81 on 1473 degrees of freedom
## Multiple R-squared: 0.08269, Adjusted R-squared: 0.0665
## F-statistic: 5.107 on 26 and 1473 DF, p-value: 1.81e-15
anova(model1, model2, model3)
Looking at this Analysis of Variance Table shows a clear trend that
adding more variables is not improving my model in terms of predicting
productivity.
This test is looking to see if adding variables and making the model
more complex explains a significant amount of the remaining variation.
Since the p-value is much greater than the significance level of 0.05,
the answer is no.
AIC(model1, model2)
BIC(model1, model2)
summary(model1)$adj.r.squared
## [1] 0.07113507
summary(model2)$adj.r.squared
## [1] 0.07119223
AIC(model2, model3)
BIC(model2, model3)
summary(model2)$adj.r.squared
## [1] 0.07119223
summary(model3)$adj.r.squared
## [1] 0.06650271
drop1(model3, test = "F")
Based on these results and previous findings, we come up with the following final linear model:
final_model <- lm(
Productivity_Score ~ Stress_Level + Sleep_Hours +
Work_Hours_Per_Week,
data = data
)
The Power of Sleep: Sleep duration is the single most reliable predictor of performance. For every additional hour of sleep, productivity scores increase by approximately 3.85 to 3.88 points (p<2e-16).
Stress and Burnout: While “Low” Burnout_Risk showed a marginal positive trend in simpler models, it lost statistical significance in the full model. This suggests that the negative effects of stress and burnout on productivity are largely mediated by or reflected in poor sleep quality.
Support Systems: Surprisingly, Mental_Health_Support_Access did not significantly move the needle on productivity scores (p = 0.3789), suggesting that the presence of a program is less impactful than the actual physiological recovery of the employee (sleep).
No, a full model would not be the best option. * Overfitting: The full model includes 26 variables but only explains 6.65% of the variance (Adjusted R2=0.0665). The high p-values for Age, Gender, Country, and Industry indicate these variables are “noise” that clutter the model without adding insight.
The optimal model for this dataset is:
Productivity_Score \(\approx\)
Sleep_Hours + Work_Hours_Per_Week + Stress_Level