WQD 7004 - Predicting the Work Life Balance Score

Group 11

AMANI ALSHANQITI (S2127083)

PRIYADARSHINI NAIR AP MUNIANDY (22062712)

DHIVASHINI LINGADARAN (S2127834)

CHE NADZIRAH CHE AB RAZAK (S2170502)

IZZAH ATHIRAH MOHAMAD RADZI (S2179297)

1.0 Introduction

In today’s fast-paced world, achieving a healthy work-life balance has become increasingly challenging. The global pandemic has further highlighted the importance of finding a harmonious equilibrium between work and personal life. Recognizing this need, we present a project aimed at developing a Work Life Balance Calculator, which will empower employees and citizens to assess their work-life balance and identify areas for improvement.

1.1 Asking Questions

The questions we are interested in answering from this dataset include:

How accurately can we predict work-life balance scores using regression models? Which regression model performs the best in terms of predicting work-life balance?
How accurately can we classify individuals into different BMI ranges using classification models? Which classification model achieves the highest accuracy in predicting BMI ranges?

By addressing these questions, we aim to gain insights into the factors influencing work-life balance and the ability to predict work-life balance scores, as well as the effectiveness of different models in predicting BMI ranges. These findings will contribute to the development of the Work-Life Balance Calculator and enable individuals and organizations to improve work-life balance and overall well-being.

1.2 Objectives

The objective of this project is to develop a Work-Life Balance Calculator that can assess and predict work-life balance based on various variables. The dataset contains information related to different aspects of individuals’ lives, such as daily habits, stress levels, social connections, and achievements. By analyzing this data, we aim to:

Predict the “WORK_LIFE_BALANCE_SCORE” variable using regression models: The goal is to understand the relationship between work-life balance and other variables in the dataset. We want to identify which factors significantly influence work-life balance and develop predictive models that can estimate work-life balance scores based on those factors.
Predict the “BMI_RANGE” variable using classification models: Here, the focus is on predicting the categorical variable “BMI_RANGE” based on the available features. The goal is to assess the accuracy of different classification models in predicting BMI ranges and identify the most effective model.

2.0 Data Understanding

Source: https://www.kaggle.com/datasets/ydalat/lifestyle-and-wellbeing-data
Title: Lifestyle_and_Wellbeing_Data
Year : 2021
Purpose: To evaluate and understand how individuals can reinvent their lifestyles to optimize their overall well-being while supporting the UN Sustainable Development Goals
Target Variable: WORK_LIFE_BALANCE_SCORE
Features:
- FRUITS_VEGGIES : Fruits or vegetables eaten daily
- DAILY_STRESS : Stress experienced daily
- PLACES_VISITED : New places visited
- CORE_CIRCLE : Number of people who are very close to you
- SUPPORTING_OTHERS : Number of people you help to achieve better life
- SOCIAL_NETWORK : Number of people you interact with during a typical day
- ACHIEVEMENT : Remarkable achievements you’re proud of
- DONATION : Number of times you donate your time or money to good causes
- BMI_RANGE : BMI range
- TODO_COMPLETED : Completion of weekly to-do lists
- FLOW : Hours you experience “flow”
- DAILY_STEPS : Number of steps taken in a day
- LIVE_VISION : Number of years to your life vision
- SLEEP_HOURS : Hours of sleep
- LOST_VACATION : Number of vacation days lost in a year
- DAILY_SHOUTING : Tendency to shout or sulk at people daily
- SUFFICIENT_INCOME : Sufficient income to cover basic needs
- PERSONAL_AWARDS : Recognitions received in life
- TIME_FOR_PASSION : Number of hours spent doing your hobby
- WEEKLY_MEDITATION : Number of times you get to meditate in a week
- AGE
- GENDER

Load the dataset

dataset <- read.csv("BALANCESCORE.csv")
dataset <- dataset[, -which(names(dataset) == "Timestamp")]
dataset

Number of rows and columns:

n_rows <- nrow(dataset)
n_cols <- ncol(dataset)
cat("Number of rows is", n_rows, "\n")

## Number of rows is 15972

cat("Number of columns is", n_cols, "\n")

## Number of columns is 23

2.1 Data Cleaning & Data Pre-Processing

Load packages

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data structure

# str() prints the structure and returns NULL, so its result is not assigned
str(dataset)

## 'data.frame':    15972 obs. of  23 variables:
##  $ FRUITS_VEGGIES         : int  3 2 2 3 5 3 4 3 5 4 ...
##  $ DAILY_STRESS           : chr  "2" "3" "3" "3" ...
##  $ PLACES_VISITED         : int  2 4 3 10 3 3 10 5 6 2 ...
##  $ CORE_CIRCLE            : int  5 3 4 3 3 9 6 3 4 6 ...
##  $ SUPPORTING_OTHERS      : int  0 8 4 10 10 10 10 5 3 10 ...
##  $ SOCIAL_NETWORK         : int  5 10 10 7 4 10 10 7 3 10 ...
##  $ ACHIEVEMENT            : int  2 5 3 2 2 2 3 4 5 0 ...
##  $ DONATION               : int  0 2 2 5 4 3 5 0 4 4 ...
##  $ BMI_RANGE              : int  1 2 2 2 2 1 2 1 1 2 ...
##  $ TODO_COMPLETED         : int  6 5 2 3 5 6 8 8 10 3 ...
##  $ FLOW                   : int  4 2 2 5 0 1 8 2 2 2 ...
##  $ DAILY_STEPS            : int  5 5 4 5 5 7 7 8 1 3 ...
##  $ LIVE_VISION            : int  0 5 5 0 0 10 5 10 5 0 ...
##  $ SLEEP_HOURS            : int  7 8 8 5 7 8 7 6 10 6 ...
##  $ LOST_VACATION          : int  5 2 10 7 0 0 10 0 0 0 ...
##  $ DAILY_SHOUTING         : int  5 2 2 5 0 2 0 2 2 0 ...
##  $ SUFFICIENT_INCOME      : int  1 2 2 1 2 2 2 2 2 1 ...
##  $ PERSONAL_AWARDS        : int  4 3 4 5 8 10 10 8 10 3 ...
##  $ TIME_FOR_PASSION       : int  0 2 8 2 1 8 8 2 3 8 ...
##  $ WEEKLY_MEDITATION      : int  5 6 3 0 5 3 10 2 10 1 ...
##  $ AGE                    : chr  "36 to 50" "36 to 50" "36 to 50" "51 or more" ...
##  $ GENDER                 : chr  "Female" "Female" "Female" "Female" ...
##  $ WORK_LIFE_BALANCE_SCORE: num  610 656 632 623 664 ...

Summary of the data frame

summary(dataset)

##  FRUITS_VEGGIES  DAILY_STRESS       PLACES_VISITED    CORE_CIRCLE    
##  Min.   :0.000   Length:15972       Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:2.000   Class :character   1st Qu.: 2.000   1st Qu.: 3.000  
##  Median :3.000   Mode  :character   Median : 5.000   Median : 5.000  
##  Mean   :2.923                      Mean   : 5.233   Mean   : 5.508  
##  3rd Qu.:4.000                      3rd Qu.: 8.000   3rd Qu.: 8.000  
##  Max.   :5.000                      Max.   :10.000   Max.   :10.000  
##  SUPPORTING_OTHERS SOCIAL_NETWORK    ACHIEVEMENT        DONATION    
##  Min.   : 0.000    Min.   : 0.000   Min.   : 0.000   Min.   :0.000  
##  1st Qu.: 3.000    1st Qu.: 4.000   1st Qu.: 2.000   1st Qu.:1.000  
##  Median : 5.000    Median : 6.000   Median : 3.000   Median :3.000  
##  Mean   : 5.616    Mean   : 6.474   Mean   : 4.001   Mean   :2.715  
##  3rd Qu.:10.000    3rd Qu.:10.000   3rd Qu.: 6.000   3rd Qu.:5.000  
##  Max.   :10.000    Max.   :10.000   Max.   :10.000   Max.   :5.000  
##    BMI_RANGE     TODO_COMPLETED        FLOW         DAILY_STEPS    
##  Min.   :1.000   Min.   : 0.000   Min.   : 0.000   Min.   : 1.000  
##  1st Qu.:1.000   1st Qu.: 4.000   1st Qu.: 1.000   1st Qu.: 3.000  
##  Median :1.000   Median : 6.000   Median : 3.000   Median : 5.000  
##  Mean   :1.411   Mean   : 5.746   Mean   : 3.195   Mean   : 5.704  
##  3rd Qu.:2.000   3rd Qu.: 8.000   3rd Qu.: 5.000   3rd Qu.: 8.000  
##  Max.   :2.000   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##   LIVE_VISION      SLEEP_HOURS     LOST_VACATION    DAILY_SHOUTING  
##  Min.   : 0.000   Min.   : 1.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 1.000   1st Qu.: 6.000   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median : 3.000   Median : 7.000   Median : 0.000   Median : 2.000  
##  Mean   : 3.752   Mean   : 7.043   Mean   : 2.899   Mean   : 2.931  
##  3rd Qu.: 5.000   3rd Qu.: 8.000   3rd Qu.: 5.000   3rd Qu.: 4.000  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##  SUFFICIENT_INCOME PERSONAL_AWARDS  TIME_FOR_PASSION WEEKLY_MEDITATION
##  Min.   :1.000     Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   
##  1st Qu.:1.000     1st Qu.: 3.000   1st Qu.: 1.000   1st Qu.: 4.000   
##  Median :2.000     Median : 5.000   Median : 3.000   Median : 7.000   
##  Mean   :1.729     Mean   : 5.712   Mean   : 3.327   Mean   : 6.233   
##  3rd Qu.:2.000     3rd Qu.: 9.000   3rd Qu.: 5.000   3rd Qu.:10.000   
##  Max.   :2.000     Max.   :10.000   Max.   :10.000   Max.   :10.000   
##      AGE               GENDER          WORK_LIFE_BALANCE_SCORE
##  Length:15972       Length:15972       Min.   :480.0          
##  Class :character   Class :character   1st Qu.:636.0          
##  Mode  :character   Mode  :character   Median :667.7          
##                                        Mean   :666.8          
##                                        3rd Qu.:698.5          
##                                        Max.   :820.2

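Convert the “DAILY_STRESS” column to numeric and count the occurrences of each score. The column is read as character, so any entry that is not a number becomes NA during the conversion, which triggers the warning below.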

## Warning: NAs introduced by coercion

## 
##    0    1    2    3    4    5 
##  676 2478 3407 4398 2960 2052
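The conversion code is not shown in the report. A minimal sketch of how the non-numeric entries could be located before they are dropped, assuming the conversion was done with as.numeric(); the vector `daily_stress` and its contents are hypothetical stand-ins, not values from the survey.

```r
# Hypothetical stand-in for the DAILY_STRESS column, which is read as character.
daily_stress <- c("2", "3", "3", "x", "5", "0")

# as.numeric() turns non-numeric strings into NA; the warning is silenced here
# because the NA values are inspected explicitly on the next lines.
stress_num <- suppressWarnings(as.numeric(daily_stress))

# Raw entries that failed to convert (NA after conversion, but not NA before).
bad_values <- daily_stress[is.na(stress_num) & !is.na(daily_stress)]
print(bad_values)

# Counts per score, with the NA bucket shown instead of silently dropped.
print(table(stress_num, useNA = "ifany"))
```

Inspecting `bad_values` first makes it clear how many rows a later na.omit() call will remove and why.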

Check for missing values and remove them

dataset <- na.omit(dataset)

Confirm that no missing values remain in any column

missing_counts <- colSums(is.na(dataset))
print(missing_counts)

##          FRUITS_VEGGIES            DAILY_STRESS          PLACES_VISITED 
##                       0                       0                       0 
##             CORE_CIRCLE       SUPPORTING_OTHERS          SOCIAL_NETWORK 
##                       0                       0                       0 
##             ACHIEVEMENT                DONATION               BMI_RANGE 
##                       0                       0                       0 
##          TODO_COMPLETED                    FLOW             DAILY_STEPS 
##                       0                       0                       0 
##             LIVE_VISION             SLEEP_HOURS           LOST_VACATION 
##                       0                       0                       0 
##          DAILY_SHOUTING       SUFFICIENT_INCOME         PERSONAL_AWARDS 
##                       0                       0                       0 
##        TIME_FOR_PASSION       WEEKLY_MEDITATION                     AGE 
##                       0                       0                       0 
##                  GENDER WORK_LIFE_BALANCE_SCORE 
##                       0                       0

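Missing values can distort statistical analysis and lead to inaccurate or biased results. The DAILY_STRESS counts above sum to 15,971, one fewer than the 15,972 rows loaded, so the numeric conversion appears to have produced a single NA. na.omit() drops that row, and the check above confirms that no missing values remain in the dataset.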

3.0 Exploratory Data Analysis (EDA)

Load the tidyverse package

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Calculate the mean BMI range by age and gender using group_by and summarise:

library(dplyr)
result <- dataset %>%
  group_by(AGE, GENDER) %>%
  summarise(mean_BMI_RANGE = mean(BMI_RANGE), .groups = "drop")

Pivot the result table

library(tidyr)
result_table <- result %>%
pivot_wider(names_from = GENDER, values_from = mean_BMI_RANGE)
print(result_table)

## # A tibble: 4 × 3
##   AGE          Female  Male
##   <chr>         <dbl> <dbl>
## 1 21 to 35       1.36  1.33
## 2 36 to 50       1.47  1.52
## 3 51 or more     1.53  1.52
## 4 Less than 20   1.23  1.22
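In the table above the age bands are sorted alphabetically, which places “Less than 20” last. A minimal sketch of how the bands could be put in their natural order, assuming AGE is converted to a factor with explicit levels; the data frame `toy` is a hypothetical miniature and not the survey data.

```r
# Hypothetical miniature of the AGE and BMI_RANGE columns.
toy <- data.frame(
  AGE = c("Less than 20", "21 to 35", "36 to 50",
          "51 or more", "21 to 35", "Less than 20"),
  BMI_RANGE = c(1, 1, 2, 2, 2, 1)
)

# Declare the natural order of the age bands explicitly.
age_levels <- c("Less than 20", "21 to 35", "36 to 50", "51 or more")
toy$AGE <- factor(toy$AGE, levels = age_levels)

# Group means now follow the declared order, youngest band first.
mean_bmi_by_age <- tapply(toy$BMI_RANGE, toy$AGE, mean)
print(mean_bmi_by_age)
```

The same factor conversion would also fix the axis order in the age-based plots, since ggplot2 follows factor levels on discrete axes.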

Distribution of Age

library(ggplot2)
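plot1 <- ggplot(dataset, aes(x = AGE)) +
  # AGE is a categorical band, so a bar chart of shares is used in place of a density curve
  geom_bar(aes(y = after_stat(count / sum(count))), fill = "lightblue") +
  labs(y = "Proportion", title = "Distribution of Age (Proportion)")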
plot1

Distribution of Gender

plot2 <- ggplot(dataset, aes(x = GENDER, fill = GENDER)) +
  geom_bar() +
  labs(title = "Distribution of Gender")
plot2

Distribution of Daily Stress by Gender

plot3 <- ggplot(dataset, aes(x = GENDER, y = DAILY_STRESS, fill = GENDER)) +
  geom_violin(scale ="width") +
  scale_fill_manual(values = c("pink", "blue")) +
  labs(x = "Gender", title = "Distribution of Daily Stress by Gender") +
  theme_minimal()

plot3

Work-Life Balance Score vs. Weekly Meditation

ggplot(dataset, aes(x = WORK_LIFE_BALANCE_SCORE, y = WEEKLY_MEDITATION)) +
  geom_point(color = "green") +
  labs(title = "Work-Life Balance Score vs. Weekly Meditation")

Bar chart of Age

ggplot(dataset, aes(x = AGE)) +
  geom_bar(stat = "count", fill = "steelblue", color = "black") +
  labs(x = "Age", y = "Frequency") +
  ggtitle("Distribution of Age")

Boxplot of Work Life Balance Score by Age

ggplot(dataset, aes(x = AGE, y = WORK_LIFE_BALANCE_SCORE)) +
  geom_boxplot(fill = "orange", color = "black") +
  labs(x = "", y = "Work-Life Balance Score") +
  ggtitle("Distribution of Work-Life Balance Score by Age")

Histogram of Daily Steps

plot5 <- ggplot(dataset, aes(x = DAILY_STEPS)) +
  geom_histogram(fill = "lightblue", bins = 20) +
  labs(title = "Histogram of Daily Steps")
plot5

A) HEALTHY BODY (HOW TO KEEP OUR BMI BELOW 25)

Load the necessary libraries

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

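Create a subset of the dataset where BMI_RANGE is below 25. Note that BMI_RANGE is stored as a code that takes only the values 1 and 2 (see the summary above), so the condition BMI_RANGE < 25 is true for every row and the plots in this section are drawn from the full dataset.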

subset_data <- subset(dataset, BMI_RANGE < 25)

Plot 1: Body Mass Index by Age

plot6 <- ggplot(subset_data, aes(x = AGE, y = BMI_RANGE)) +
  geom_bar(stat = "summary", fun = "mean", fill = "salmon") +
  labs(x = "AGE", y = "BMI") +
  ggtitle("BODY_MASS_INDEX BY AGE")
plot6

Plot 2: Body Mass Index by Age and Gender

plot8 <- ggplot(subset_data, aes(x = AGE, y = BMI_RANGE, fill = GENDER)) +
  stat_summary(fun = "mean", geom = "bar", position = "dodge") +
  labs(title = "BODY_MASS_INDEX BY GENDER & AGE") +
  scale_fill_manual(values = c("darksalmon", "cornflowerblue"))
plot8

Plot 3: Body Mass Index by Sleep Hours

plot4 <- ggplot(subset_data, aes(x = SLEEP_HOURS, y = BMI_RANGE)) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(x = "Sleep Hours", y = "BMI") +
  ggtitle("BODY_MASS_INDEX vs SLEEP HOURS")
plot4

## `geom_smooth()` using formula = 'y ~ x'

Plot 4: Body Mass Index by Servings of Fruits/Veggies

plot5 <- ggplot(subset_data, aes(x = FRUITS_VEGGIES, y = BMI_RANGE)) +
  geom_bar(stat = "summary", fun = "mean", fill = "yellow") +
  labs(x = "Servings of Fruits/Veggies", y = "BMI") +
  ggtitle("BODY_MASS_INDEX vs. SERVINGS OF FRUITS/VEGGIES")
plot5

Plot 5: Body Mass Index by Daily Steps Taken

plot6 <- ggplot(subset_data, aes(x = DAILY_STEPS, y = BMI_RANGE)) +
  geom_smooth(method = "lm", se = FALSE, color = "grey") +
  labs(x = "Daily Steps", y = "BMI") +
  ggtitle("BODY_MASS_INDEX BY DAILY STEPS TAKEN")
plot6

## `geom_smooth()` using formula = 'y ~ x'

B) HEALTHY MIND (WHAT DRIVES OUR DAILY_STRESS?)

Load library

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

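Create the pivot table using the dcast() function from the reshape2 package. No aggregation function is supplied, so dcast() falls back to length and each cell holds the number of respondents for that age group and gender, not an average stress level.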

df3 <- dcast(dataset, AGE ~ GENDER, value.var = "DAILY_STRESS")

## Aggregation function missing: defaulting to length

head(df3)

Plot 1: Average Daily Stress by Age Group

plot1 <- ggplot(dataset, aes(x = AGE, y = DAILY_STRESS, fill = GENDER)) +
  geom_bar(stat = "summary", fun = "mean", position = "dodge", color = "black") +
  labs(x = "Age Group", y = "Average Daily Stress") +
  ggtitle("AVERAGE DAILY_STRESS BY AGE GROUP")
plot1

Plot 2: Daily Stress by Gender

plot2 <- ggplot(dataset, aes(x = GENDER, y = DAILY_STRESS, fill = GENDER)) +
  geom_violin(trim = FALSE, scale = "count") +
  labs(x = "Gender", y = "Daily Stress") +
  ggtitle("DAILY_STRESS BY GENDER")
plot2

C) PERSONAL ACHIEVEMENTS (WHAT DRIVES US TO ACHIEVE REMARKABLE THINGS?)

Plot 1: Core Circle by Gender

plot1 <- ggplot(dataset, aes(x = GENDER, y = CORE_CIRCLE, fill = GENDER)) +
  geom_violin() +
  labs(x = "Gender", y = "Core Circle") +
  ggtitle("CORE CIRCLE BY GENDER")
plot1

Plot 2: Lost Vacation by Age Group

plot2 <- ggplot(dataset, aes(x = AGE, y = LOST_VACATION)) +
  geom_boxplot() +
  labs(x = "Age Group", y = "Lost Vacation") +
  scale_x_discrete(limits = c("Less than 20", "21 to 35", "36 to 50", "51 or more")) +
  ggtitle("LOST VACATION BY AGE GROUP")
plot2

Plot 3: Places visited vs Daily Stress

plot3 <- ggplot(dataset, aes(x = PLACES_VISITED, y = DAILY_STRESS)) +
  geom_bar(stat = "summary", fun = "mean", fill = "steelblue") +
  labs(x = "Places Visited", y = "Daily Stress") +
  ggtitle("PLACES VISITED vs DAILY STRESS")
plot3

Plot 4: Lost vacation vs Daily Stress

# LOST_VACATION is converted to a factor so that one box is drawn per value;
# with a continuous x, ggplot2 draws a single box and warns about a missing group
plot4 <- ggplot(dataset, aes(x = factor(LOST_VACATION), y = DAILY_STRESS)) +
  geom_boxplot() +
  labs(x = "Lost Vacation", y = "Daily Stress") +
  ggtitle("LOST VACATION vs DAILY STRESS")
plot4

Plot 5: Correlation matrix as a heatmap for the numeric variables (GENDER, AGE and DAILY_STRESS are excluded)

columns <- setdiff(names(dataset), c("GENDER", "AGE", "DAILY_STRESS"))
cor_matrix <- cor(dataset[, columns])
cor_df <- reshape2::melt(cor_matrix)
ggplot(cor_df, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(x = "Features", y = "Features", title = "Correlation Matrix")
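A heatmap of this size is hard to read for a single target. A minimal sketch of how the correlations with the target could be ranked directly, assuming only base R; the data frame `toy`, its coefficients and the column name SCORE are synthetic and hypothetical, chosen only to make the example self-contained.

```r
set.seed(1)

# Synthetic stand-in with hypothetical relationships; not the survey data.
n <- 200
toy <- data.frame(
  SLEEP_HOURS = rnorm(n),
  DAILY_STEPS = rnorm(n),
  LOST_VACATION = rnorm(n)
)
toy$SCORE <- 3 * toy$SLEEP_HOURS + toy$DAILY_STEPS -
  2 * toy$LOST_VACATION + rnorm(n)

cor_matrix <- cor(toy)

# Keep the target's column, drop its self-correlation, sort by absolute size.
target_cor <- cor_matrix[, "SCORE"]
target_cor <- target_cor[names(target_cor) != "SCORE"]
ranked <- target_cor[order(abs(target_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Sorting by absolute value keeps strong negative relationships near the top of the list instead of at the bottom.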

4.0 Modelling

In this part, we address two different problems using our dataset.

4.1 Addressing Regression problem

The first problem is a regression problem: predicting the “WORK_LIFE_BALANCE_SCORE” variable from the other variables in the dataset. Three regression models are fitted: Linear Regression, Support Vector Regression (SVR), and Random Forest.

Load the necessary libraries

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

Split the data into training and testing sets

set.seed(123)
train_indices <- createDataPartition(dataset$WORK_LIFE_BALANCE_SCORE, p = 0.8, list = FALSE)
train_data <- dataset[train_indices, ]
test_data <- dataset[-train_indices, ]
train_data

test_data

Linear Regression Model

lm_model <- lm(WORK_LIFE_BALANCE_SCORE ~ ., data = train_data)

Support Vector Regression (SVR) Model

svr_model <- svm(WORK_LIFE_BALANCE_SCORE ~ ., data = train_data)

Random Forest Model

rf_model <- randomForest(WORK_LIFE_BALANCE_SCORE ~ ., data = train_data)

Generate predictions on the test set for each model

lm_predictions <- predict(lm_model, test_data)
svr_predictions <- predict(svr_model, test_data)
rf_predictions <- predict(rf_model, test_data)

Calculate the Root Mean Squared Error (RMSE) for each model

lm_rmse <- sqrt(mean((test_data$WORK_LIFE_BALANCE_SCORE - lm_predictions)^2))
svr_rmse <- sqrt(mean((test_data$WORK_LIFE_BALANCE_SCORE - svr_predictions)^2))
rf_rmse <- sqrt(mean((test_data$WORK_LIFE_BALANCE_SCORE - rf_predictions)^2))
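The RMSE expression is repeated once per model. A minimal sketch of how it could be wrapped in a helper, alongside two companion metrics (MAE and R squared) that the report does not use; the function names and the `predicted` values are hypothetical, and the `actual` values are rounded scores taken from the structure output earlier in the report.

```r
# Root mean squared error: penalizes large errors more heavily.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute error: average size of the error in score points.
mae <- function(actual, predicted) mean(abs(actual - predicted))

# R squared: share of the variance in the target explained by the predictions.
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

actual <- c(610, 656, 632, 623)      # rounded scores from the str() output
predicted <- c(612, 650, 632, 625)   # hypothetical predictions

cat("RMSE:", rmse(actual, predicted), "\n")
cat("MAE:", mae(actual, predicted), "\n")
cat("R squared:", r_squared(actual, predicted), "\n")
```

Reporting R squared next to RMSE helps put an error value in context, because RMSE alone depends on the scale of the target.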

Print the RMSE values

cat("Linear Regression RMSE:", lm_rmse, "\n")

## Linear Regression RMSE: 9.21053e-13

cat("SVR RMSE:", svr_rmse, "\n")

## SVR RMSE: 2.971714

cat("Random Forest RMSE:", rf_rmse, "\n")

## Random Forest RMSE: 10.73447

Comparing these root mean square error (RMSE) values, Linear Regression has by far the lowest RMSE, 9.21053e-13, which is zero up to floating-point precision. An error this small on held-out data suggests that WORK_LIFE_BALANCE_SCORE is computed as an exact linear combination of the predictors, so the linear model is recovering a formula and not generalizing from noisy data. SVR (Support Vector Regression) follows with an RMSE of 2.971714, and Random Forest has the highest RMSE, 10.73447. Both values are small relative to the score’s range of 480 to 820.2.
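An RMSE near 1e-13 is what a linear model returns when the target is an exact linear function of its predictors. A minimal sketch on synthetic data illustrating this, under the assumption that the score is built from such a formula; the columns `x1`, `x2`, `x3` and the coefficients are hypothetical and unrelated to the survey.

```r
set.seed(42)
n <- 500
toy <- data.frame(
  x1 = runif(n, 0, 10),
  x2 = runif(n, 0, 10),
  x3 = runif(n, 0, 5)
)

# Target built as an exact linear formula of the predictors (no noise).
toy$score <- 600 + 4 * toy$x1 - 3 * toy$x2 + 7 * toy$x3

train_idx <- sample(n, size = 0.8 * n)
fit_exact <- lm(score ~ x1 + x2 + x3, data = toy[train_idx, ])
pred_exact <- predict(fit_exact, toy[-train_idx, ])
rmse_exact <- sqrt(mean((toy$score[-train_idx] - pred_exact)^2))

# Same formula with measurement noise added: the error no longer vanishes.
toy$noisy <- toy$score + rnorm(n, sd = 5)
fit_noisy <- lm(noisy ~ x1 + x2 + x3, data = toy[train_idx, ])
pred_noisy <- predict(fit_noisy, toy[-train_idx, ])
rmse_noisy <- sqrt(mean((toy$noisy[-train_idx] - pred_noisy)^2))

cat("RMSE, exact linear target:", rmse_exact, "\n")
cat("RMSE, noisy target:", rmse_noisy, "\n")
```

If the real score behaves like the first case, inspecting coef(lm_model) would reveal the weights of the scoring formula.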

Scatter plot to compare predicted vs actual values

plot_data <- data.frame(
  Actual = test_data$WORK_LIFE_BALANCE_SCORE,
  Linear_Regression = lm_predictions,
  SVR = svr_predictions,
  Random_Forest = rf_predictions)

plot_data <- reshape2::melt(plot_data, id.vars = "Actual", variable.name = "Model")
ggplot(plot_data, aes(x = Actual, y = value, color = Model)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +
  labs(x = "Actual WORK_LIFE_BALANCE_SCORE", y = "Predicted WORK_LIFE_BALANCE_SCORE") +
  ggtitle("Comparison of Predicted vs Actual WORK_LIFE_BALANCE_SCORE") +
  theme_minimal()

4.2 Addressing Classification problem

The second problem is a classification problem. The goal is to predict a categorical variable (BMI_RANGE) and to evaluate the accuracy, which measures the proportion of correctly predicted class labels.

Load the necessary libraries

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.1-7

Convert relevant columns to factors

dataset$BMI_RANGE <- as.factor(dataset$BMI_RANGE)
dataset$GENDER <- as.factor(dataset$GENDER)
dataset$AGE <- as.factor(dataset$AGE)

Split the data into training and testing sets
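The report does not show the code for this split. A minimal sketch of one way to create the `train` and `test` objects used below, assuming a simple random 80/20 split in base R; the data frame `toy`, the ratio (borrowed from the regression section) and the optional removal of WORK_LIFE_BALANCE_SCORE are assumptions, not the authors' code.

```r
set.seed(123)

# Hypothetical stand-in for the cleaned dataset.
toy <- data.frame(
  BMI_RANGE = factor(sample(c("1", "2"), 100, replace = TRUE)),
  SLEEP_HOURS = sample(1:10, 100, replace = TRUE),
  WORK_LIFE_BALANCE_SCORE = runif(100, 480, 820)
)

# Optional assumption: drop the score so that a column derived from the
# other answers cannot leak information about BMI_RANGE into the models.
toy <- toy[, setdiff(names(toy), "WORK_LIFE_BALANCE_SCORE")]

# 80/20 random split by row index.
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

cat("Training rows:", nrow(train), " Test rows:", nrow(test), "\n")
```

Because BMI_RANGE is converted to a factor before the split, both subsets share the same factor levels.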

Random Forest

model_rf <- randomForest(BMI_RANGE ~ ., data = train)
predictions_rf <- predict(model_rf, newdata = test)
accuracy_rf <- sum(predictions_rf == test$BMI_RANGE) / nrow(test)

Logistic Regression

model_lr <- glm(BMI_RANGE ~ ., data = train, family = binomial,control = list(maxit = 1000))

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

predictions_lr <- predict(model_lr, newdata = test, type = "response")
predictions_lr <- ifelse(predictions_lr > 0.5, "2", "1")
accuracy_lr <- sum(predictions_lr == test$BMI_RANGE) / nrow(test)

Decision Tree

model_dt <- rpart(BMI_RANGE ~ ., data = train, method = "class")
predictions_dt <- predict(model_dt, newdata = test, type = "class")
accuracy_dt <- sum(predictions_dt == test$BMI_RANGE) / nrow(test)

Print the accuracies

cat("Random Forest Accuracy:", accuracy_rf, "\n")

## Random Forest Accuracy: 0.768733

cat("Logistic Regression Accuracy:", accuracy_lr, "\n")

## Logistic Regression Accuracy: 1

cat("Decision Tree Accuracy:", accuracy_dt, "\n")

## Decision Tree Accuracy: 0.6551868

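Comparing these accuracy values, Logistic Regression reaches an accuracy of 1 on the test set, Random Forest 0.768733, and Decision Tree 0.6551868. A perfect score, together with the glm.fit warning about fitted probabilities of 0 or 1, points to perfect separation of the two classes. The likely cause is target leakage: WORK_LIFE_BALANCE_SCORE is among the predictors and, judging by the regression results, is an exact linear function of the other variables, BMI_RANGE included, so a linear classifier can recover BMI_RANGE from it.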
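Accuracy is easier to judge against a baseline. The summary earlier reports a mean BMI_RANGE of 1.411, so roughly 41% of respondents are coded 2 and always predicting class 1 would already score about 0.59. A minimal sketch of a confusion matrix and that majority-class baseline in base R; the `actual` and `predicted` vectors are hypothetical.

```r
# Hypothetical labels and predictions for ten respondents.
actual <- factor(c("1", "1", "1", "2", "2", "1", "2", "1", "1", "2"),
                 levels = c("1", "2"))
predicted <- factor(c("1", "1", "2", "2", "1", "1", "2", "1", "1", "1"),
                    levels = c("1", "2"))

# Rows are predictions, columns are true classes.
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Baseline: accuracy of always predicting the most frequent class.
baseline <- max(table(actual)) / length(actual)

cat("Accuracy:", accuracy, "\n")
cat("Majority-class baseline:", baseline, "\n")
```

The confusion matrix also shows which class the errors fall in, which a single accuracy figure hides.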

Scatter plot to compare predicted vs actual BMI range

# Create data frames for plotting
plot_data_rf <- data.frame(
  Actual = test$BMI_RANGE,
  Predicted = predictions_rf,
  Model = "Random Forest"
)

plot_data_lr <- data.frame(
  Actual = test$BMI_RANGE,
  Predicted = predictions_lr,
  Model = "Logistic Regression"
)

plot_data_dt <- data.frame(
  Actual = test$BMI_RANGE,
  Predicted = predictions_dt,
  Model = "Decision Tree"
)

# Combine the data frames
plot_data <- rbind(plot_data_rf, plot_data_lr, plot_data_dt)

# Plot the comparisons
ggplot(plot_data, aes(x = Actual, y = Predicted, color = Model)) +
  geom_jitter(width = 0.1, height = 0.1) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +
  labs(x = "Actual BMI_RANGE", y = "Predicted BMI_RANGE") +
  ggtitle("Comparison of Predicted vs Actual BMI_RANGE") +
  theme_minimal()

5.0 Results & Evaluation

Question 1: How accurately can we predict work-life balance scores using regression models? Which regression model performs the best in terms of predicting work-life balance?

To answer the first question, we compare the Root Mean Square Error (RMSE) of each regression model on the test set. Linear Regression has the lowest RMSE, 9.21053e-13, which is effectively zero. SVR (Support Vector Regression) follows at 2.971714, and Random Forest has the highest at 10.73447. Linear Regression is therefore the best of the three models. Its near-zero error, however, suggests that the score is an exact linear formula of the survey answers, so the result shows that the formula can be recovered and says little about how well an independently measured outcome could be predicted.

Question 2: How accurately can we classify individuals into different BMI ranges using classification models? Which classification model has the highest accuracy in predicting BMI ranges?

Random Forest achieves an accuracy of 0.768733, Logistic Regression 1, and Decision Tree 0.6551868. Logistic Regression ranks highest, but a perfect score may indicate overfitting or, more likely, leakage of the target through WORK_LIFE_BALANCE_SCORE, so it should be re-evaluated with that column excluded before it is treated as the best model. Of the other two models, Random Forest is more accurate than Decision Tree.

6.0 Conclusion

In conclusion, the development of a Work Life Balance Calculator through this project addresses the pressing need for individuals and organizations to prioritize work-life balance in today’s fast-paced world. By leveraging data mining techniques and machine learning algorithms, we have made significant strides in understanding the key factors that contribute to work-life balance and identifying areas for improvement.

The Work Life Balance Calculator serves as a valuable tool for individuals to assess their work-life balance, understand their strengths and areas for improvement, and make informed decisions to enhance their overall well-being. For organizations, the calculator offers insights into employees’ work-life balance, enabling them to develop tailored plans to optimize productivity and support their workforce.

Ultimately, this project contributes to the promotion of work-life balance and the improvement of overall performance and well-being. By prioritizing work-life balance, individuals can achieve greater satisfaction and fulfillment, leading to a more productive and harmonious society as a whole.