In today’s fast-paced world, achieving a healthy work-life balance
has become increasingly challenging. The global pandemic has further
highlighted the importance of finding a harmonious equilibrium between
work and personal life. Recognizing this need, we present a project
aimed at developing a Work Life Balance Calculator, which will empower
employees and citizens to assess their work-life balance and identify
areas for improvement.
The objective of this project is to develop a Work-Life Balance Calculator that can assess and predict work-life balance based on various variables. The dataset contains information related to different aspects of individuals’ lives, such as daily habits, stress levels, social connections, and achievements. By analyzing this data, we aim to:
Predict the “WORK_LIFE_BALANCE_SCORE” variable using regression models: The goal is to understand the relationship between work-life balance and other variables in the dataset. We want to identify which factors significantly influence work-life balance and develop predictive models that can estimate work-life balance scores based on those factors.
Predict the “BMI_RANGE” variable using classification models: Here, the focus is on predicting the categorical variable “BMI_RANGE” based on the available features. The goal is to assess the accuracy of different classification models in predicting BMI ranges and identify the most effective model.
The questions we are interested in answering from this dataset include:
How accurately can we predict work-life balance scores using regression models? Which regression model performs the best in terms of predicting work-life balance?
How accurately can we classify individuals into different BMI ranges using classification models? Which classification model achieves the highest accuracy in predicting BMI ranges?
By addressing these questions, we aim to gain insights into the factors influencing work-life balance and the ability to predict work-life balance scores, as well as the effectiveness of different models in predicting BMI ranges. These findings will contribute to the development of the Work-Life Balance Calculator and enable individuals and organizations to improve work-life balance and overall well-being.
dataset <- read.csv("BALANCESCORE.csv")
# Drop the Timestamp column; the logical mask is safe even if the column is absent
dataset <- dataset[, names(dataset) != "Timestamp"]
head(dataset)
n_rows <- nrow(dataset)
n_cols <- ncol(dataset)
cat("Number of rows is", n_rows, "\n")
## Number of rows is 15972
cat("Number of columns is", n_cols, "\n")
## Number of columns is 23
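Dropping a column by name can be written with a negative index or with a logical mask. The sketch below, on a toy data frame whose values are made up for illustration, shows that the two forms agree when the column exists but differ when it does not: `-which(...)` then yields an empty index and selects zero columns.

```r
# Toy data frame; the values are illustrative only.
df <- data.frame(
  Timestamp = c("a", "b"),
  AGE = c("21 to 35", "36 to 50"),
  SLEEP_HOURS = c(7, 8)
)

# Logical mask: keeps every column whose name is not "Timestamp".
dropped_mask <- df[, names(df) != "Timestamp", drop = FALSE]

# Negative which(): same result when the column exists.
dropped_which <- df[, -which(names(df) == "Timestamp"), drop = FALSE]

# When the name is absent, which() returns integer(0) and the
# negative index selects no columns at all.
absent_which <- df[, -which(names(df) == "NOT_A_COLUMN"), drop = FALSE]
absent_mask  <- df[, names(df) != "NOT_A_COLUMN", drop = FALSE]

ncol(dropped_mask)
ncol(absent_which)
ncol(absent_mask)
```

The mask form leaves the data frame unchanged when there is nothing to drop, which is the safer behaviour in a script that may be rerun on an already cleaned file.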
Source: https://www.kaggle.com/datasets/ydalat/lifestyle-and-wellbeing-data
Title: Lifestyle_and_Wellbeing_Data
Year: 2021
Purpose: To evaluate and understand how individuals can reinvent their lifestyles to optimize their overall well-being while supporting the UN Sustainable Development Goals.
Total number of rows: 15972
Total number of columns: 24 in the raw file (23 after the Timestamp column is dropped)
Target Variable: WORK_LIFE_BALANCE_SCORE
Features: FRUITS_VEGGIES, DAILY_STRESS, PLACES_VISITED, CORE_CIRCLE, SUPPORTING_OTHERS, SOCIAL_NETWORK, ACHIEVEMENT, DONATION, BMI_RANGE, TODO_COMPLETED, FLOW, DAILY_STEPS, LIVE_VISION, SLEEP_HOURS, LOST_VACATION, DAILY_SHOUTING, SUFFICIENT_INCOME, PERSONAL_AWARDS, TIME_FOR_PASSION, WEEKLY_MEDITATION, AGE, and GENDER.
##### 2.3 Data Pre-Processing
# str() prints the structure and returns NULL, so there is nothing to assign
str(dataset)
## 'data.frame': 15972 obs. of 23 variables:
## $ FRUITS_VEGGIES : int 4 1 2 4 4 3 2 5 2 1 ...
## $ DAILY_STRESS : chr "2" "4" "2" "2" ...
## $ PLACES_VISITED : int 10 3 10 10 10 6 3 8 6 1 ...
## $ CORE_CIRCLE : int 6 8 5 4 10 10 8 6 10 3 ...
## $ SUPPORTING_OTHERS : int 10 0 2 6 5 10 6 10 10 6 ...
## $ SOCIAL_NETWORK : int 10 2 8 10 10 6 5 10 10 5 ...
## $ ACHIEVEMENT : int 3 1 3 4 0 3 1 6 10 3 ...
## $ DONATION : int 5 0 4 0 1 5 2 4 5 5 ...
## $ BMI_RANGE : int 2 1 2 1 1 2 1 2 1 1 ...
## $ TODO_COMPLETED : int 8 2 7 8 7 8 8 4 10 2 ...
## $ FLOW : int 8 1 1 2 1 4 1 4 6 2 ...
## $ DAILY_STEPS : int 7 8 6 1 10 1 4 3 7 8 ...
## $ LIVE_VISION : int 5 2 10 1 2 5 6 3 4 1 ...
## $ SLEEP_HOURS : int 7 7 8 8 8 8 4 7 7 7 ...
## $ LOST_VACATION : int 10 7 0 1 0 0 0 1 0 0 ...
## $ DAILY_SHOUTING : int 0 1 0 1 3 0 3 2 1 6 ...
## $ SUFFICIENT_INCOME : int 2 2 2 2 2 2 2 2 2 2 ...
## $ PERSONAL_AWARDS : int 10 4 5 3 4 5 3 10 10 5 ...
## $ TIME_FOR_PASSION : int 8 1 2 3 8 1 2 6 5 1 ...
## $ WEEKLY_MEDITATION : int 10 7 7 3 6 2 5 8 7 1 ...
## $ AGE : chr "51 or more" "21 to 35" "21 to 35" "21 to 35" ...
## $ GENDER : chr "Male" "Male" "Male" "Male" ...
## $ WORK_LIFE_BALANCE_SCORE: num 727 619 686 674 708 ...
summary(dataset)
## FRUITS_VEGGIES DAILY_STRESS PLACES_VISITED CORE_CIRCLE
## Min. :0.000 Length:15972 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 Class :character 1st Qu.: 2.000 1st Qu.: 3.000
## Median :3.000 Mode :character Median : 5.000 Median : 5.000
## Mean :2.923 Mean : 5.233 Mean : 5.508
## 3rd Qu.:4.000 3rd Qu.: 8.000 3rd Qu.: 8.000
## Max. :5.000 Max. :10.000 Max. :10.000
## SUPPORTING_OTHERS SOCIAL_NETWORK ACHIEVEMENT DONATION
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.000
## 1st Qu.: 3.000 1st Qu.: 4.000 1st Qu.: 2.000 1st Qu.:1.000
## Median : 5.000 Median : 6.000 Median : 3.000 Median :3.000
## Mean : 5.616 Mean : 6.474 Mean : 4.001 Mean :2.715
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.: 6.000 3rd Qu.:5.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :5.000
## BMI_RANGE TODO_COMPLETED FLOW DAILY_STEPS
## Min. :1.000 Min. : 0.000 Min. : 0.000 Min. : 1.000
## 1st Qu.:1.000 1st Qu.: 4.000 1st Qu.: 1.000 1st Qu.: 3.000
## Median :1.000 Median : 6.000 Median : 3.000 Median : 5.000
## Mean :1.411 Mean : 5.746 Mean : 3.195 Mean : 5.704
## 3rd Qu.:2.000 3rd Qu.: 8.000 3rd Qu.: 5.000 3rd Qu.: 8.000
## Max. :2.000 Max. :10.000 Max. :10.000 Max. :10.000
## LIVE_VISION SLEEP_HOURS LOST_VACATION DAILY_SHOUTING
## Min. : 0.000 Min. : 1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 6.000 1st Qu.: 0.000 1st Qu.: 1.000
## Median : 3.000 Median : 7.000 Median : 0.000 Median : 2.000
## Mean : 3.752 Mean : 7.043 Mean : 2.899 Mean : 2.931
## 3rd Qu.: 5.000 3rd Qu.: 8.000 3rd Qu.: 5.000 3rd Qu.: 4.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
## SUFFICIENT_INCOME PERSONAL_AWARDS TIME_FOR_PASSION WEEKLY_MEDITATION
## Min. :1.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.: 3.000 1st Qu.: 1.000 1st Qu.: 4.000
## Median :2.000 Median : 5.000 Median : 3.000 Median : 7.000
## Mean :1.729 Mean : 5.712 Mean : 3.327 Mean : 6.233
## 3rd Qu.:2.000 3rd Qu.: 9.000 3rd Qu.: 5.000 3rd Qu.:10.000
## Max. :2.000 Max. :10.000 Max. :10.000 Max. :10.000
## AGE GENDER WORK_LIFE_BALANCE_SCORE
## Length:15972 Length:15972 Min. :480.0
## Class :character Class :character 1st Qu.:636.0
## Mode :character Mode :character Median :667.7
## Mean :666.8
## 3rd Qu.:698.5
## Max. :820.2
dataset$DAILY_STRESS <- as.numeric(dataset$DAILY_STRESS)
## Warning: NAs introduced by coercion
value_counts <- table(dataset$DAILY_STRESS)
print(value_counts)
##
## 0 1 2 3 4 5
## 676 2478 3407 4398 2960 2052
dataset <- na.omit(dataset)
missing_counts <- colSums(is.na(dataset))
print(missing_counts)
## FRUITS_VEGGIES DAILY_STRESS PLACES_VISITED
## 0 0 0
## CORE_CIRCLE SUPPORTING_OTHERS SOCIAL_NETWORK
## 0 0 0
## ACHIEVEMENT DONATION BMI_RANGE
## 0 0 0
## TODO_COMPLETED FLOW DAILY_STEPS
## 0 0 0
## LIVE_VISION SLEEP_HOURS LOST_VACATION
## 0 0 0
## DAILY_SHOUTING SUFFICIENT_INCOME PERSONAL_AWARDS
## 0 0 0
## TIME_FOR_PASSION WEEKLY_MEDITATION AGE
## 0 0 0
## GENDER WORK_LIFE_BALANCE_SCORE
## 0 0
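Missing values can distort statistical analysis and lead to inaccurate or biased results. Coercing DAILY_STRESS to numeric turned one non-numeric entry into NA (the value counts above sum to 15971 of the 15972 rows). After na.omit() drops that row, no column contains missing values.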
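The coercion warning above means that at least one DAILY_STRESS entry could not be read as a number. A minimal sketch of how such entries can be located before they are dropped is shown below; the vector `raw_stress` and its contents are hypothetical stand-ins for the real column.

```r
# Hypothetical raw values: one entry is not a number.
raw_stress <- c("2", "4", "n/a", "3", "5")

# Coerce quietly, then flag entries that were present but failed to convert.
stress_num <- suppressWarnings(as.numeric(raw_stress))
bad <- is.na(stress_num) & !is.na(raw_stress)

raw_stress[bad]  # the offending raw values
sum(bad)         # number of rows na.omit() would drop because of this column
```

Inspecting the flagged values first makes it clear whether dropping the rows is reasonable or whether the entries could be repaired instead.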
library(dplyr)
result <- dataset %>%
group_by(AGE, GENDER) %>%
summarise(mean_BMI_RANGE = mean(BMI_RANGE), .groups = "drop")
library(tidyr)
result_table <- result %>%
pivot_wider(names_from = GENDER, values_from = mean_BMI_RANGE)
print(result_table)
## # A tibble: 4 × 3
## AGE Female Male
## <chr> <dbl> <dbl>
## 1 21 to 35 1.36 1.33
## 2 36 to 50 1.47 1.52
## 3 51 or more 1.53 1.52
## 4 Less than 20 1.23 1.22
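The age groups in the table above appear in alphabetical order, which places "Less than 20" last. A small sketch of one way to restore the natural order is to convert AGE to a factor with explicit levels; the vector `age` below is toy data standing in for `dataset$AGE`.

```r
# Intended order of the age groups.
age_levels <- c("Less than 20", "21 to 35", "36 to 50", "51 or more")

# Toy values standing in for dataset$AGE.
age <- c("51 or more", "21 to 35", "Less than 20", "36 to 50", "21 to 35")

# Character values sort alphabetically ...
sort(unique(age))

# ... while a factor with explicit levels keeps the intended order
# in tables, summaries, and plot axes.
age_f <- factor(age, levels = age_levels)
levels(age_f)
table(age_f)
```

Applied to the real column, the same conversion would order the groups consistently in every later table and plot, without needing `scale_x_discrete(limits = ...)` in each one.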
library(ggplot2)
# AGE is a categorical age group, so a bar chart of counts is used instead of a density plot
plot1 <- ggplot(dataset, aes(x = AGE)) +
geom_bar(fill = "lightblue") +
labs(title = "Distribution of Age")
plot1
plot2 <- ggplot(dataset, aes(x = GENDER, fill = GENDER)) +
geom_bar() +
labs(title = "Distribution of Gender")
plot2
plot3 <- ggplot(dataset, aes(x = GENDER, y = DAILY_STRESS, fill = GENDER)) +
geom_violin(scale ="width") +
scale_fill_manual(values = c("pink", "blue")) +
labs(x = "Gender", title = "Distribution of Daily Stress by Gender") +
theme_minimal()
plot3
ggplot(dataset, aes(x = WORK_LIFE_BALANCE_SCORE, y = WEEKLY_MEDITATION)) +
geom_point(color = "lightblue") +
labs(title = "Work-Life Balance Score vs. Weekly Meditation")
ggplot(dataset, aes(x = AGE)) +
geom_bar(stat = "count", fill = "steelblue", color = "black") +
labs(x = "Age", y = "Frequency") +
ggtitle("Distribution of Age")
ggplot(dataset, aes(x = AGE, y = WORK_LIFE_BALANCE_SCORE)) +
geom_boxplot(fill = "orange", color = "black") +
labs(x = "", y = "Work-Life Balance Score") +
ggtitle("Distribution of Work-Life Balance Score by Age")
# Daily Steps
plot5 <- ggplot(dataset, aes(x = DAILY_STEPS)) +
geom_histogram(fill = "lightblue", bins = 20) +
labs(title = "Histogram of Daily Steps")
plot5
# BMI_RANGE is a code (1 or 2), not a BMI value, so the original filter BMI_RANGE < 25 kept every row
subset_data <- dataset
plot6 <- ggplot(subset_data, aes(x = AGE, y = BMI_RANGE)) +
geom_bar(stat = "summary", fun = "mean", fill = "salmon") +
labs(x = "AGE", y = "BMI") +
ggtitle("BODY_MASS_INDEX BY AGE")
plot6
plot8 <- ggplot(subset_data, aes(x = AGE, y = BMI_RANGE, fill = GENDER)) +
stat_summary(fun = "mean", geom = "bar", position = "dodge") +
labs(title = "BODY_MASS_INDEX BY GENDER & AGE") +
scale_fill_manual(values = c("darksalmon", "cornflowerblue"))
plot8
plot4 <- ggplot(subset_data, aes(x = SLEEP_HOURS, y = BMI_RANGE)) +
geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
labs(x = "Sleep Hours", y = "BMI") +
ggtitle("BODY_MASS_INDEX & SLEEP HOURS")
plot4
## `geom_smooth()` using formula = 'y ~ x'
plot5 <- ggplot(subset_data, aes(x = FRUITS_VEGGIES, y = BMI_RANGE)) +
geom_bar(stat = "summary", fun = "mean", fill = "yellow") +
labs(x = "Servings of Fruits/Veggies", y = "BMI") +
ggtitle("BODY_MASS_INDEX & SERVINGS OF FRUITS/VEGGIES")
plot5
plot6 <- ggplot(subset_data, aes(x = DAILY_STEPS, y = BMI_RANGE)) +
geom_smooth(method = "lm", se = FALSE, color = "grey") +
labs(x = "Daily Steps", y = "BMI") +
ggtitle("BODY_MASS_INDEX & DAILY STEPS")
plot6
## `geom_smooth()` using formula = 'y ~ x'
library(reshape2)  # dcast() and melt()
# Mean daily stress per age group and gender; without fun.aggregate, dcast() falls back to counting rows
df3 <- dcast(dataset, AGE ~ GENDER, value.var = "DAILY_STRESS", fun.aggregate = mean)
head(df3)
plot1 <- ggplot(dataset, aes(x = AGE, y = DAILY_STRESS, fill = GENDER)) +
geom_bar(stat = "summary", fun = "mean", position = "dodge", color = "black") +
labs(x = "Age Group", y = "Average Daily Stress") +
ggtitle("AVERAGE DAILY_STRESS BY AGE GROUP")
plot1
plot2 <- ggplot(dataset, aes(x = GENDER, y = DAILY_STRESS, fill = GENDER)) +
geom_violin(trim = FALSE, scale = "count") +
labs(x = "Gender", y = "Daily Stress") +
ggtitle("DAILY_STRESS BY GENDER")
plot2
plot1 <- ggplot(dataset, aes(x = GENDER, y = CORE_CIRCLE, fill = GENDER)) +
geom_violin() +
labs(x = "Gender", y = "Core Circle") +
ggtitle("CORE_CIRCLE BY GENDER")
plot1
plot2 <- ggplot(dataset, aes(x = AGE, y = LOST_VACATION)) +
geom_boxplot() +
labs(x = "Age Group", y = "Lost Vacation") +
scale_x_discrete(limits = c("Less than 20", "21 to 35", "36 to 50", "51 or more")) +
ggtitle("LOST_VACATION BY AGE GROUP")
plot2
plot3 <- ggplot(dataset, aes(x = PLACES_VISITED, y = DAILY_STRESS)) +
geom_bar(stat = "summary", fun = "mean", fill = "steelblue") +
labs(x = "Places Visited", y = "Daily Stress") +
ggtitle("PLACES & DAILY_STRESS")
plot3
# LOST_VACATION is numeric; converting it to a factor gives one box per value
plot4 <- ggplot(dataset, aes(x = factor(LOST_VACATION), y = DAILY_STRESS)) +
geom_boxplot() +
labs(x = "Lost Vacation", y = "Daily Stress") +
ggtitle("LOST VACATION & DAILY_STRESS")
plot4
plot5 <- ggplot(dataset, aes(x = SOCIAL_NETWORK, y = DAILY_STRESS)) +
geom_bar(stat = "summary", fun = "mean", fill = "steelblue") +
labs(x = "Social Network", y = "Daily Stress") +
ggtitle("FRIENDS & DAILY_STRESS")
plot5
columns <- setdiff(names(dataset), c("GENDER", "AGE", "DAILY_STRESS"))
cor_matrix <- cor(dataset[, columns])
cor_df <- reshape2::melt(cor_matrix)
ggplot(cor_df, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
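scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0, limits = c(-1, 1)) +
labs(x = "Features", y = "Features", title = "Correlation Matrix") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))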
#### Modeling
In this part, we address two problems using our dataset.
The first is a regression problem: predicting the “WORK_LIFE_BALANCE_SCORE” variable from the other variables in the dataset. The code fits three regression models: Linear Regression, Support Vector Regression (SVR), and Random Forest.
##### Load the necessary libraries
library(caret)         # createDataPartition()
library(e1071)         # svm(), used for support vector regression
library(randomForest)  # randomForest()
set.seed(123)
train_indices <- createDataPartition(dataset$WORK_LIFE_BALANCE_SCORE, p = 0.8, list = FALSE)
train_data <- dataset[train_indices, ]
test_data <- dataset[-train_indices, ]
# Check the sizes of the two partitions instead of printing every row
dim(train_data)
dim(test_data)
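The 80/20 partition can also be illustrated without `caret`. The sketch below uses plain random sampling on toy data; this is a simplification, since `createDataPartition()` additionally balances the split across the distribution of the outcome. The data frame `toy` and its columns are made up.

```r
set.seed(123)

# Toy data standing in for the real dataset.
toy <- data.frame(x = rnorm(100), y = rnorm(100))

# Draw 80% of the row indices at random for training.
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two parts are disjoint and together cover every row.
dim(toy_train)
dim(toy_test)
```

Fixing the seed before sampling keeps the partition reproducible, so that reported errors can be compared across runs.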
lm_model <- lm(WORK_LIFE_BALANCE_SCORE ~ ., data = train_data)
svr_model <- svm(WORK_LIFE_BALANCE_SCORE ~ ., data = train_data)
rf_model <- randomForest(WORK_LIFE_BALANCE_SCORE ~ ., data = train_data)
# Compare the performance of the models
lm_predictions <- predict(lm_model, test_data)
svr_predictions <- predict(svr_model, test_data)
rf_predictions <- predict(rf_model, test_data)
lm_rmse <- sqrt(mean((test_data$WORK_LIFE_BALANCE_SCORE - lm_predictions)^2))
svr_rmse <- sqrt(mean((test_data$WORK_LIFE_BALANCE_SCORE - svr_predictions)^2))
rf_rmse <- sqrt(mean((test_data$WORK_LIFE_BALANCE_SCORE - rf_predictions)^2))
cat("Linear Regression RMSE:", lm_rmse, "\n")
## Linear Regression RMSE: 1.597709e-12
cat("SVR RMSE:", svr_rmse, "\n")
## SVR RMSE: 3.087885
cat("Random Forest RMSE:", rf_rmse, "\n")
## Random Forest RMSE: 10.83695
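A linear-regression RMSE on the order of 1e-12 is floating-point noise. This usually indicates that the target is an exact linear function of the predictors (for example, a score computed as a weighted sum of the survey answers) rather than that the model generalizes unusually well. We have not confirmed how the score is constructed; the sketch below only reproduces the effect on synthetic data, with made-up variable names and weights.

```r
set.seed(1)
n <- 500

# Synthetic predictors (hypothetical stand-ins for survey answers).
toy <- data.frame(
  x1 = sample(0:10, n, replace = TRUE),
  x2 = sample(0:10, n, replace = TRUE),
  x3 = sample(0:5,  n, replace = TRUE)
)

# Target defined as an exact weighted sum, with no noise term.
toy$score <- 500 + 3 * toy$x1 + 2 * toy$x2 - 4 * toy$x3

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

toy_train <- toy[1:400, ]
toy_test  <- toy[401:500, ]

fit <- lm(score ~ ., data = toy_train)
toy_rmse <- rmse(toy_test$score, predict(fit, toy_test))

toy_rmse             # effectively zero
round(coef(fit), 6)  # recovers the weights used to build the score
```

If the real score behaves like this, linear regression is simply recovering the scoring formula, and the comparison with SVR and Random Forest says more about how well each model can represent a linear function than about predictive skill.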
plot_data <- data.frame(
Actual = test_data$WORK_LIFE_BALANCE_SCORE,
Linear_Regression = lm_predictions,
SVR = svr_predictions,
Random_Forest = rf_predictions
)
plot_data <- reshape2::melt(plot_data, id.vars = "Actual", variable.name = "Model")
ggplot(plot_data, aes(x = Actual, y = value, color = Model)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +
labs(x = "Actual WORK_LIFE_BALANCE_SCORE", y = "Predicted WORK_LIFE_BALANCE_SCORE") +
ggtitle("Comparison of Predicted vs Actual WORK_LIFE_BALANCE_SCORE") +
theme_minimal()
The second problem is a classification problem. The goal is to predict a categorical variable (BMI_RANGE) and to evaluate the accuracy, which measures the proportion of correctly predicted class labels.
library(glmnet)  # loaded in the original session (attaches Matrix)
library(rpart)   # rpart(), used for the decision tree
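dataset$BMI_RANGE <- as.factor(dataset$BMI_RANGE)
dataset$GENDER <- as.factor(dataset$GENDER)
dataset$AGE <- as.factor(dataset$AGE)
# The split that creates `train` and `test` is not shown in the original;
# an 80/20 split stratified on BMI_RANGE is assumed here
set.seed(123)
split_indices <- createDataPartition(dataset$BMI_RANGE, p = 0.8, list = FALSE)
train <- dataset[split_indices, ]
test <- dataset[-split_indices, ]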
model_rf <- randomForest(BMI_RANGE ~ ., data = train)
predictions_rf <- predict(model_rf, newdata = test)
accuracy_rf <- sum(predictions_rf == test$BMI_RANGE) / nrow(test)
model_lr <- glm(BMI_RANGE ~ ., data = train, family = binomial, control = list(maxit = 1000))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predictions_lr <- predict(model_lr, newdata = test, type = "response")
predictions_lr <- ifelse(predictions_lr > 0.5, "2", "1")
accuracy_lr <- sum(predictions_lr == test$BMI_RANGE) / nrow(test)
model_dt <- rpart(BMI_RANGE ~ ., data = train, method = "class")
predictions_dt <- predict(model_dt, newdata = test, type = "class")
accuracy_dt <- sum(predictions_dt == test$BMI_RANGE) / nrow(test)
cat("Random Forest Accuracy:", accuracy_rf, "\n")
## Random Forest Accuracy: 0.7647673
cat("Logistic Regression Accuracy:", accuracy_lr, "\n")
## Logistic Regression Accuracy: 1
cat("Decision Tree Accuracy:", accuracy_dt, "\n")
## Decision Tree Accuracy: 0.6530996
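The logistic regression reaches an accuracy of 1 and triggers the warning that fitted probabilities are numerically 0 or 1. Together these typically signal perfect separation. One plausible cause, which we have not verified on the real data, is that WORK_LIFE_BALANCE_SCORE is among the predictors and is itself computed from the survey answers, including BMI_RANGE. The sketch below shows the mechanism on synthetic data; the column `leaky` is a hypothetical score built from the label.

```r
set.seed(2)
n <- 300

x <- rnorm(n)
label <- rbinom(n, 1, plogis(x))  # noisy binary outcome
leaky <- 10 * label + 0.5 * x     # hypothetical score computed from the label

toy <- data.frame(label = factor(label), x = x, leaky = leaky)
toy_train <- toy[1:200, ]
toy_test  <- toy[201:300, ]

accuracy <- function(fit, newdata) {
  pred <- ifelse(predict(fit, newdata, type = "response") > 0.5, "1", "0")
  mean(pred == newdata$label)
}

# With the leaky column the classes are perfectly separable, and glm()
# warns about fitted probabilities of 0 or 1 (suppressed here).
fit_leaky <- suppressWarnings(
  glm(label ~ x + leaky, data = toy_train, family = binomial)
)
fit_clean <- glm(label ~ x, data = toy_train, family = binomial)

acc_leaky <- accuracy(fit_leaky, toy_test)
acc_clean <- accuracy(fit_clean, toy_test)
```

Refitting the real models without WORK_LIFE_BALANCE_SCORE as a predictor would be one way to check whether the perfect accuracy survives.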
plot_data <- data.frame(
Actual = test$BMI_RANGE,
Predicted = predictions_rf
)
ggplot(plot_data, aes(x = Actual, y = Predicted)) +
geom_jitter(width = 0.1, height = 0.1) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +
labs(x = "Actual BMI_RANGE", y = "Predicted BMI_RANGE") +
ggtitle("Comparison of Predicted vs Actual BMI_RANGE") +
theme_minimal()
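Accuracy alone hides which class is being misclassified, and it should be read against the share of the majority class. The summary earlier reports a mean BMI_RANGE of 1.411, so roughly 59% of rows fall in class 1, and a model that always predicts class 1 would already score about 0.59. A minimal sketch of a confusion matrix and this baseline, on made-up labels, is:

```r
# Hypothetical labels; "1" and "2" mirror the two BMI_RANGE codes.
actual    <- factor(c("1", "1", "2", "2", "1", "2", "1", "1"), levels = c("1", "2"))
predicted <- factor(c("1", "2", "2", "1", "1", "2", "1", "1"), levels = c("1", "2"))

# Rows are predictions, columns are true classes.
cm <- table(Predicted = predicted, Actual = actual)
cm

# Accuracy is the share of observations on the diagonal.
toy_accuracy <- sum(diag(cm)) / sum(cm)

# Accuracy of always predicting the most common class.
toy_baseline <- max(table(actual)) / length(actual)
```

The same `table()` call applied to `predictions_rf` and `test$BMI_RANGE` would show the per-class errors that the jitter plot only hints at.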
In conclusion, the development of a Work Life Balance Calculator through this project addresses the pressing need for individuals and organizations to prioritize work-life balance in today’s fast-paced world. By leveraging data mining techniques and machine learning algorithms, we have taken steps toward understanding the key factors that contribute to work-life balance and identifying areas for improvement.
The Work Life Balance Calculator serves as a valuable tool for individuals to assess their work-life balance, understand their strengths and areas for improvement, and make informed decisions to enhance their overall well-being. For organizations, the calculator offers insights into employees’ work-life balance, enabling them to develop tailored plans to optimize productivity and support their workforce.
Ultimately, this project contributes to the promotion of work-life balance and the improvement of overall performance and well-being. By prioritizing work-life balance, individuals can achieve greater satisfaction and fulfillment, leading to a more productive and harmonious society as a whole.