Group Members
Name Matrics
Chew Tian Feng S2180850
Chua Ching Cheng 23052278
Lee Wei Tong 22099644
Lina Tay 22117352
Low Pei Ying 22119138

1.0 Introduction

In the fast-paced and dynamic landscape of today’s professional world, achieving a harmonious balance between work and personal life is an on-going challenge. As our lives become increasingly interconnected and demanding, the importance of a healthy work-life balance cannot be neglected. Hence, in our project, we seek to understand the determinants of work-life balance as well as to develop predictive models that can predict and quantify this critical metric base on lifestyle choices. Through this analysis and prediction, our goal is to uncover valuable insights that can assist individuals in creating a positive impact on their overall well-being and productivity.

1.1 Problem statements

  • What are the factors affecting work-life balance score?
  • What is the best way to predict work-life balance score?
  • What is the effective method to predict work-life balance rank?

1.2 Objectives

There are 3 objectives we would like to achieve through this project. The objectives as follows:

  • To identify the factors affecting work-life balance score.
  • To build regression models to predict the work-life balance score.
  • To build classification models to predict the work-life balance rank of an individual.

2.0 Data Pre-processing

2.1 Data Obtain

The dataset was obtained from Kaggle open-source dataset. The data source comes from Authentic Happiness website which measure Work-Life Balance Score based on research published in the 360 Living guide. The dataset contains 15,972 survey responses conducted between year 2015 to year 2021 and consists of 24 attributes as below:

  1. Timestamp - Date when survey was completed
  2. FRUITS_VEGGIES - How many fruits or vegetables do you eat every day?
  3. DAILY_STRESS - How much stress do you typically experience every day?
  4. PLACES_VISITED - How many new places do you visit over 12 months?
  5. CORE_CIRCLE - How many people are very close to you?
  6. SUPPORTING_OTHERS - How many people do you help achieve a better life over 12 months?
  7. SOCIAL_NETWORK - With how many people do you interact with during a typical day?
  8. ACHIEVEMENT - How many remarkable achievements are you proud of over the last 12 months?
  9. DONATION - How many times do you donate your time or money to good causes throughout 12 months?
  10. BMI_RANGE - What is your body mass index (BMI) range?
  11. TODO_COMPLETED - How well do you complete your weekly to-do lists?
  12. FLOW - In a typical day, how many hours do you experience “flow”?
  13. DAILY_STEPS - How many steps (in thousands) do you typically walk every day
  14. LIVE_VISION - For how many years ahead is your life vision very clear?
  15. SLEEP_HOURS - About how long you typically sleep over a typical working week, including weekends?
  16. LOST_VACATION - How many days of vacation do you typically lose every year unused vacation days?
  17. DAILY_SHOUTING - How often do you shout or sulk at somebody in a typical week?
  18. SUFFICIENT_INCOME - How sufficient is your income to cover basic living expenses?
  19. PERSONAL_AWARDS - How many recognitions have you received in your life throughout 12 months?
  20. TIME_FOR_PASSION - How many hours do you spend every day doing what you are passionate about?
  21. WEEKLY_MEDITATION - In a typical week, how many times do you have the opportunity to think about yourself?
  22. AGE - Age range
  23. GENDER - Male or Female
  24. WORK_LIFE_BALANCE_SCORE - Calculated score by algorithm

2.2 Data Preparation

## Import libraries
library(dplyr)
library(reshape2)
library(ggplot2)
library(caret)
library(xgboost)
library(e1071)
library(randomForest)
library(lares)
library(rpart.plot)
library(rsvg)
library(ggimage)
library(cvms)
library(adabag)

First, we load the dataset from csv file by using read.csv(). Then, we use str() function to display the structure of imported dataframe.

## Load data
file_path <- "wellbeing_and_lifestyle_dataset.csv"
dataset_raw <- read.csv(file_path)

## Display the structure of the dataset
str(dataset_raw)
## 'data.frame':    15972 obs. of  24 variables:
##  $ Timestamp              : chr  "7/7/15" "7/7/15" "7/7/15" "7/7/15" ...
##  $ FRUITS_VEGGIES         : int  3 2 2 3 5 3 4 3 5 4 ...
##  $ DAILY_STRESS           : chr  "2" "3" "3" "3" ...
##  $ PLACES_VISITED         : int  2 4 3 10 3 3 10 5 6 2 ...
##  $ CORE_CIRCLE            : int  5 3 4 3 3 9 6 3 4 6 ...
##  $ SUPPORTING_OTHERS      : int  0 8 4 10 10 10 10 5 3 10 ...
##  $ SOCIAL_NETWORK         : int  5 10 10 7 4 10 10 7 3 10 ...
##  $ ACHIEVEMENT            : int  2 5 3 2 2 2 3 4 5 0 ...
##  $ DONATION               : int  0 2 2 5 4 3 5 0 4 4 ...
##  $ BMI_RANGE              : int  1 2 2 2 2 1 2 1 1 2 ...
##  $ TODO_COMPLETED         : int  6 5 2 3 5 6 8 8 10 3 ...
##  $ FLOW                   : int  4 2 2 5 0 1 8 2 2 2 ...
##  $ DAILY_STEPS            : int  5 5 4 5 5 7 7 8 1 3 ...
##  $ LIVE_VISION            : int  0 5 5 0 0 10 5 10 5 0 ...
##  $ SLEEP_HOURS            : int  7 8 8 5 7 8 7 6 10 6 ...
##  $ LOST_VACATION          : int  5 2 10 7 0 0 10 0 0 0 ...
##  $ DAILY_SHOUTING         : int  5 2 2 5 0 2 0 2 2 0 ...
##  $ SUFFICIENT_INCOME      : int  1 2 2 1 2 2 2 2 2 1 ...
##  $ PERSONAL_AWARDS        : int  4 3 4 5 8 10 10 8 10 3 ...
##  $ TIME_FOR_PASSION       : int  0 2 8 2 1 8 8 2 3 8 ...
##  $ WEEKLY_MEDITATION      : int  5 6 3 0 5 3 10 2 10 1 ...
##  $ AGE                    : chr  "36 to 50" "36 to 50" "36 to 50" "51 or more" ...
##  $ GENDER                 : chr  "Female" "Female" "Female" "Female" ...
##  $ WORK_LIFE_BALANCE_SCORE: num  610 656 632 623 664 ...
## Display the summary of the dataset
summary(dataset_raw)
##   Timestamp         FRUITS_VEGGIES  DAILY_STRESS       PLACES_VISITED  
##  Length:15972       Min.   :0.000   Length:15972       Min.   : 0.000  
##  Class :character   1st Qu.:2.000   Class :character   1st Qu.: 2.000  
##  Mode  :character   Median :3.000   Mode  :character   Median : 5.000  
##                     Mean   :2.923                      Mean   : 5.233  
##                     3rd Qu.:4.000                      3rd Qu.: 8.000  
##                     Max.   :5.000                      Max.   :10.000  
##   CORE_CIRCLE     SUPPORTING_OTHERS SOCIAL_NETWORK    ACHIEVEMENT    
##  Min.   : 0.000   Min.   : 0.000    Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 3.000   1st Qu.: 3.000    1st Qu.: 4.000   1st Qu.: 2.000  
##  Median : 5.000   Median : 5.000    Median : 6.000   Median : 3.000  
##  Mean   : 5.508   Mean   : 5.616    Mean   : 6.474   Mean   : 4.001  
##  3rd Qu.: 8.000   3rd Qu.:10.000    3rd Qu.:10.000   3rd Qu.: 6.000  
##  Max.   :10.000   Max.   :10.000    Max.   :10.000   Max.   :10.000  
##     DONATION       BMI_RANGE     TODO_COMPLETED        FLOW       
##  Min.   :0.000   Min.   :1.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 4.000   1st Qu.: 1.000  
##  Median :3.000   Median :1.000   Median : 6.000   Median : 3.000  
##  Mean   :2.715   Mean   :1.411   Mean   : 5.746   Mean   : 3.195  
##  3rd Qu.:5.000   3rd Qu.:2.000   3rd Qu.: 8.000   3rd Qu.: 5.000  
##  Max.   :5.000   Max.   :2.000   Max.   :10.000   Max.   :10.000  
##   DAILY_STEPS      LIVE_VISION      SLEEP_HOURS     LOST_VACATION   
##  Min.   : 1.000   Min.   : 0.000   Min.   : 1.000   Min.   : 0.000  
##  1st Qu.: 3.000   1st Qu.: 1.000   1st Qu.: 6.000   1st Qu.: 0.000  
##  Median : 5.000   Median : 3.000   Median : 7.000   Median : 0.000  
##  Mean   : 5.704   Mean   : 3.752   Mean   : 7.043   Mean   : 2.899  
##  3rd Qu.: 8.000   3rd Qu.: 5.000   3rd Qu.: 8.000   3rd Qu.: 5.000  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##  DAILY_SHOUTING   SUFFICIENT_INCOME PERSONAL_AWARDS  TIME_FOR_PASSION
##  Min.   : 0.000   Min.   :1.000     Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 1.000   1st Qu.:1.000     1st Qu.: 3.000   1st Qu.: 1.000  
##  Median : 2.000   Median :2.000     Median : 5.000   Median : 3.000  
##  Mean   : 2.931   Mean   :1.729     Mean   : 5.712   Mean   : 3.327  
##  3rd Qu.: 4.000   3rd Qu.:2.000     3rd Qu.: 9.000   3rd Qu.: 5.000  
##  Max.   :10.000   Max.   :2.000     Max.   :10.000   Max.   :10.000  
##  WEEKLY_MEDITATION     AGE               GENDER         
##  Min.   : 0.000    Length:15972       Length:15972      
##  1st Qu.: 4.000    Class :character   Class :character  
##  Median : 7.000    Mode  :character   Mode  :character  
##  Mean   : 6.233                                         
##  3rd Qu.:10.000                                         
##  Max.   :10.000                                         
##  WORK_LIFE_BALANCE_SCORE
##  Min.   :480.0          
##  1st Qu.:636.0          
##  Median :667.7          
##  Mean   :666.8          
##  3rd Qu.:698.5          
##  Max.   :820.2
## Display the first few rows of the dataset
head(dataset_raw)
##   Timestamp FRUITS_VEGGIES DAILY_STRESS PLACES_VISITED CORE_CIRCLE
## 1    7/7/15              3            2              2           5
## 2    7/7/15              2            3              4           3
## 3    7/7/15              2            3              3           4
## 4    7/7/15              3            3             10           3
## 5    7/7/15              5            1              3           3
## 6    7/8/15              3            2              3           9
##   SUPPORTING_OTHERS SOCIAL_NETWORK ACHIEVEMENT DONATION BMI_RANGE
## 1                 0              5           2        0         1
## 2                 8             10           5        2         2
## 3                 4             10           3        2         2
## 4                10              7           2        5         2
## 5                10              4           2        4         2
## 6                10             10           2        3         1
##   TODO_COMPLETED FLOW DAILY_STEPS LIVE_VISION SLEEP_HOURS LOST_VACATION
## 1              6    4           5           0           7             5
## 2              5    2           5           5           8             2
## 3              2    2           4           5           8            10
## 4              3    5           5           0           5             7
## 5              5    0           5           0           7             0
## 6              6    1           7          10           8             0
##   DAILY_SHOUTING SUFFICIENT_INCOME PERSONAL_AWARDS TIME_FOR_PASSION
## 1              5                 1               4                0
## 2              2                 2               3                2
## 3              2                 2               4                8
## 4              5                 1               5                2
## 5              0                 2               8                1
## 6              2                 2              10                8
##   WEEKLY_MEDITATION        AGE GENDER WORK_LIFE_BALANCE_SCORE
## 1                 5   36 to 50 Female                   609.5
## 2                 6   36 to 50 Female                   655.6
## 3                 3   36 to 50 Female                   631.6
## 4                 0 51 or more Female                   622.7
## 5                 5 51 or more Female                   663.9
## 6                 3 51 or more Female                   722.3

2.3 Data Cleaning

In data cleaning stage, we drop off the Timestamp column which is not required for our analysis. Next, we check if there is any missing, duplicate, inconsistency value in the dataset.

## Remove the Timstamp column which is not relevant
dataset <- subset(dataset_raw, select = -c(Timestamp))

## Check the number of column in dataset after removing the Timestamp
cat("Total columns AFTER remove Timestamp: ", ncol(dataset), "\n")
## Total columns AFTER remove Timestamp:  23
## Check for any missing value in the dataset
any(is.na(dataset))
## [1] FALSE
## Check for any duplicate rows in the dataset
sum(duplicated(dataset))
## [1] 683

There are 683 duplicate rows found in the dataset, we assumed that these data was submitted due to user clicked the submit button too many times while page loading or unintentionally fill out the surveys multiple times. Hence, we removed the duplicate rows to prevent biasing result.

## Remove duplicate rows
dataset <- dataset %>% distinct()
cat("Total rows AFTER remove duplicates: ", nrow(dataset), "\n")
## Total rows AFTER remove duplicates:  15289

By using str() function, we noticed that DAILY_STRESS column is shown as “character” which means there is some cells is not numeric. There might have structural error under this column. We proceed to remove it.

## Check the class attribute of "Daily Stress" column
class(dataset$DAILY_STRESS)
## [1] "character"
## Check whether the class can be changed to integer
dataset$DAILY_STRESS <- as.integer(dataset$DAILY_STRESS)
## Warning: NAs introduced by coercion

The warning message “NAs introduced by coercion” indicates that there is a type coercion happening in code, and as a result, some values are being coerced (converted) to the special value NA (Not Available) which mean that not available to convert into numeric.

## Identify the rows where DAILY_STRESS cannot be converted to numeric
problematic_rows <- which(is.na(as.integer(dataset$DAILY_STRESS)))

## Print the row numbers and values
if (length(problematic_rows) > 0) {
  cat("Problematic rows:", problematic_rows, "\n")
  cat("Corresponding values:", dataset$DAILY_STRESS[problematic_rows], "\n")
} else {
  cat("No problematic rows found.\n")
}
## Problematic rows: 9389 
## Corresponding values: NA
## Remove the problematic row number 9389
cleaned_dataset <- dataset[-9389 ,]
cat("Total rows AFTER remove inconsistency: ", nrow(cleaned_dataset), "\n")
## Total rows AFTER remove inconsistency:  15288
## Compare the dataset before and after data cleaning process
str(cleaned_dataset)
## 'data.frame':    15288 obs. of  23 variables:
##  $ FRUITS_VEGGIES         : int  3 2 2 3 5 3 4 3 5 4 ...
##  $ DAILY_STRESS           : int  2 3 3 3 1 2 2 4 3 4 ...
##  $ PLACES_VISITED         : int  2 4 3 10 3 3 10 5 6 2 ...
##  $ CORE_CIRCLE            : int  5 3 4 3 3 9 6 3 4 6 ...
##  $ SUPPORTING_OTHERS      : int  0 8 4 10 10 10 10 5 3 10 ...
##  $ SOCIAL_NETWORK         : int  5 10 10 7 4 10 10 7 3 10 ...
##  $ ACHIEVEMENT            : int  2 5 3 2 2 2 3 4 5 0 ...
##  $ DONATION               : int  0 2 2 5 4 3 5 0 4 4 ...
##  $ BMI_RANGE              : int  1 2 2 2 2 1 2 1 1 2 ...
##  $ TODO_COMPLETED         : int  6 5 2 3 5 6 8 8 10 3 ...
##  $ FLOW                   : int  4 2 2 5 0 1 8 2 2 2 ...
##  $ DAILY_STEPS            : int  5 5 4 5 5 7 7 8 1 3 ...
##  $ LIVE_VISION            : int  0 5 5 0 0 10 5 10 5 0 ...
##  $ SLEEP_HOURS            : int  7 8 8 5 7 8 7 6 10 6 ...
##  $ LOST_VACATION          : int  5 2 10 7 0 0 10 0 0 0 ...
##  $ DAILY_SHOUTING         : int  5 2 2 5 0 2 0 2 2 0 ...
##  $ SUFFICIENT_INCOME      : int  1 2 2 1 2 2 2 2 2 1 ...
##  $ PERSONAL_AWARDS        : int  4 3 4 5 8 10 10 8 10 3 ...
##  $ TIME_FOR_PASSION       : int  0 2 8 2 1 8 8 2 3 8 ...
##  $ WEEKLY_MEDITATION      : int  5 6 3 0 5 3 10 2 10 1 ...
##  $ AGE                    : chr  "36 to 50" "36 to 50" "36 to 50" "51 or more" ...
##  $ GENDER                 : chr  "Female" "Female" "Female" "Female" ...
##  $ WORK_LIFE_BALANCE_SCORE: num  610 656 632 623 664 ...

Here is the comparison before and after cleaning the dataset. [Before] 15972 obs. of 24 variables VS [After] 15288 obs. of 23 variables

2.4 Data Categorization

We added a column named “RANKING” to rank the work life balance using WORK_LIFE_BALANCE_SCORE in our dataset. We also transform 2 categorical variables AGE and GENDER to numerical.

## Plot boxplot to identify the quartile range
wlb <- cleaned_dataset$WORK_LIFE_BALANCE_SCORE
boxplot(wlb, main="Work Life Balance Score")
text(y=boxplot.stats(wlb)$stats, labels=boxplot.stats(wlb)$stats, x=1.25)

Based on the summary of boxplot, the value higher than 698.5 is considered as HIGH, less than 636 is considered as LOW, and the remaining (between 636 to 698.5) will be categorized as MEDIUM.

## Categorize the work-life balance score by adding new column 'RANKING'
cleaned_dataset <- mutate(cleaned_dataset, RANKING=if_else(WORK_LIFE_BALANCE_SCORE>698.5,"High", if_else(WORK_LIFE_BALANCE_SCORE<636,"Low", "Medium")))
## Transform AGE and GENDER categorical variables to numerical
cleaned_dataset <- mutate(cleaned_dataset, AGE_GROUP=if_else(AGE=="Less than 20",1, if_else(AGE=="21 to 35",2, if_else(AGE=="36 to 50",3, 4))))
cleaned_dataset <- mutate(cleaned_dataset, GENDER_GROUP=if_else(GENDER=="Male", 0, 1))
## Let's check the final structure of cleaned dataset
str(cleaned_dataset)
## 'data.frame':    15288 obs. of  26 variables:
##  $ FRUITS_VEGGIES         : int  3 2 2 3 5 3 4 3 5 4 ...
##  $ DAILY_STRESS           : int  2 3 3 3 1 2 2 4 3 4 ...
##  $ PLACES_VISITED         : int  2 4 3 10 3 3 10 5 6 2 ...
##  $ CORE_CIRCLE            : int  5 3 4 3 3 9 6 3 4 6 ...
##  $ SUPPORTING_OTHERS      : int  0 8 4 10 10 10 10 5 3 10 ...
##  $ SOCIAL_NETWORK         : int  5 10 10 7 4 10 10 7 3 10 ...
##  $ ACHIEVEMENT            : int  2 5 3 2 2 2 3 4 5 0 ...
##  $ DONATION               : int  0 2 2 5 4 3 5 0 4 4 ...
##  $ BMI_RANGE              : int  1 2 2 2 2 1 2 1 1 2 ...
##  $ TODO_COMPLETED         : int  6 5 2 3 5 6 8 8 10 3 ...
##  $ FLOW                   : int  4 2 2 5 0 1 8 2 2 2 ...
##  $ DAILY_STEPS            : int  5 5 4 5 5 7 7 8 1 3 ...
##  $ LIVE_VISION            : int  0 5 5 0 0 10 5 10 5 0 ...
##  $ SLEEP_HOURS            : int  7 8 8 5 7 8 7 6 10 6 ...
##  $ LOST_VACATION          : int  5 2 10 7 0 0 10 0 0 0 ...
##  $ DAILY_SHOUTING         : int  5 2 2 5 0 2 0 2 2 0 ...
##  $ SUFFICIENT_INCOME      : int  1 2 2 1 2 2 2 2 2 1 ...
##  $ PERSONAL_AWARDS        : int  4 3 4 5 8 10 10 8 10 3 ...
##  $ TIME_FOR_PASSION       : int  0 2 8 2 1 8 8 2 3 8 ...
##  $ WEEKLY_MEDITATION      : int  5 6 3 0 5 3 10 2 10 1 ...
##  $ AGE                    : chr  "36 to 50" "36 to 50" "36 to 50" "51 or more" ...
##  $ GENDER                 : chr  "Female" "Female" "Female" "Female" ...
##  $ WORK_LIFE_BALANCE_SCORE: num  610 656 632 623 664 ...
##  $ RANKING                : chr  "Low" "Medium" "Low" "Low" ...
##  $ AGE_GROUP              : num  3 3 3 4 4 4 4 2 2 4 ...
##  $ GENDER_GROUP           : num  1 1 1 1 1 1 0 1 1 1 ...
head(cleaned_dataset)
##   FRUITS_VEGGIES DAILY_STRESS PLACES_VISITED CORE_CIRCLE SUPPORTING_OTHERS
## 1              3            2              2           5                 0
## 2              2            3              4           3                 8
## 3              2            3              3           4                 4
## 4              3            3             10           3                10
## 5              5            1              3           3                10
## 6              3            2              3           9                10
##   SOCIAL_NETWORK ACHIEVEMENT DONATION BMI_RANGE TODO_COMPLETED FLOW DAILY_STEPS
## 1              5           2        0         1              6    4           5
## 2             10           5        2         2              5    2           5
## 3             10           3        2         2              2    2           4
## 4              7           2        5         2              3    5           5
## 5              4           2        4         2              5    0           5
## 6             10           2        3         1              6    1           7
##   LIVE_VISION SLEEP_HOURS LOST_VACATION DAILY_SHOUTING SUFFICIENT_INCOME
## 1           0           7             5              5                 1
## 2           5           8             2              2                 2
## 3           5           8            10              2                 2
## 4           0           5             7              5                 1
## 5           0           7             0              0                 2
## 6          10           8             0              2                 2
##   PERSONAL_AWARDS TIME_FOR_PASSION WEEKLY_MEDITATION        AGE GENDER
## 1               4                0                 5   36 to 50 Female
## 2               3                2                 6   36 to 50 Female
## 3               4                8                 3   36 to 50 Female
## 4               5                2                 0 51 or more Female
## 5               8                1                 5 51 or more Female
## 6              10                8                 3 51 or more Female
##   WORK_LIFE_BALANCE_SCORE RANKING AGE_GROUP GENDER_GROUP
## 1                   609.5     Low         3            1
## 2                   655.6  Medium         3            1
## 3                   631.6     Low         3            1
## 4                   622.7     Low         4            1
## 5                   663.9  Medium         4            1
## 6                   722.3    High         4            1

3.0 Data Exploration

3.1 Correlation Matrix

To explore the dataset, we first plot a simple Correlation Matrix Heatmap to understand relationship between two variables in dataset. The melt() function in R (from the reshape2 package) transforms a data frame from wide format to long format.

## Drop columns "AGE", "GENDER", "RANKING", "AGE_GROUP", "GENDER_GROUP"
df1 <- cleaned_dataset[, !(names(cleaned_dataset) %in% c("AGE", "GENDER", "RANKING", "AGE_GROUP", "GENDER_GROUP"))]
cor_matrix <- round(cor(df1),2)
melted_corr_mat <- melt(cor_matrix)

## Plot correlation matrix heatmap
ggplot(data = melted_corr_mat, aes(x = Var1, y = Var2, fill = value, label = value)) +
  geom_tile() +
  geom_text(color = "black") +
  scale_fill_gradient(low = "lightblue", high = "red") +
  labs(title = "Correlation Matrix",
       x = "Variables",
       y = "Variables") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  theme(plot.title = element_text(hjust = 0.5))

## Create a table of correlation between work life balance score VS variables, sort in descending order
cor_with_work_life <- cor_matrix["WORK_LIFE_BALANCE_SCORE", ]
cor_table <- as.data.frame(cor_with_work_life)
colnames(cor_table) <- "WORK_LIFE_BALANCE_SCORE"
cor_table <- arrange(cor_table, desc(WORK_LIFE_BALANCE_SCORE))
cor_table <- cor_table[-1, , drop = FALSE]
print(cor_table)
##                   WORK_LIFE_BALANCE_SCORE
## ACHIEVEMENT                          0.56
## SUPPORTING_OTHERS                    0.55
## TODO_COMPLETED                       0.54
## PLACES_VISITED                       0.53
## TIME_FOR_PASSION                     0.52
## CORE_CIRCLE                          0.51
## PERSONAL_AWARDS                      0.50
## FLOW                                 0.48
## LIVE_VISION                          0.48
## DONATION                             0.46
## FRUITS_VEGGIES                       0.45
## DAILY_STEPS                          0.42
## WEEKLY_MEDITATION                    0.42
## SOCIAL_NETWORK                       0.41
## SUFFICIENT_INCOME                    0.40
## SLEEP_HOURS                          0.20
## BMI_RANGE                           -0.25
## LOST_VACATION                       -0.26
## DAILY_SHOUTING                      -0.27
## DAILY_STRESS                        -0.37
## Plot a bar chart to visualize the correlation score
ggplot(cor_table, aes(x = reorder(rownames(cor_table), -WORK_LIFE_BALANCE_SCORE), y = WORK_LIFE_BALANCE_SCORE)) +
  geom_bar(stat = "identity", fill = "pink") +
  geom_text(aes(label = WORK_LIFE_BALANCE_SCORE), vjust = -0.5) +
  labs(title = "Work Life Balance Scores Correlation", x = "Factors", y = "Correlation Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Based on Correlation matrix, ACHIEVEMENT, SUPPORTING_OTHERS and TODO_COMPLETED have the slightly stronger positive correlations with work-life balance, while DAILY_STRESS, DAILY_SHOUTING, and LOST_VACATION have moderate negative correlations with work-life balance.

3.2 Data analysis

3.2.1 Univariate analysis

## Plot bar chart to check the distribution of survey collected by Gender
df2 <- cleaned_dataset
ggplot(df2, aes(x = GENDER, fill = GENDER)) +
  geom_bar() +
  ggtitle("Number of Survey by Gender") +
  geom_text(stat = "count", aes(label = after_stat(count), vjust = -0.5)) +
  theme(plot.title = element_text(hjust = 0.5)) 

## Plot bar chart to check the distribution of survey collected by Age Group
df2$AGE <- factor(df2$AGE, levels = c("Less than 20","21 to 35","36 to 50","51 or more"))
ggplot(df2, aes(x = AGE, fill = AGE)) +
  geom_bar() +
  ggtitle("Number of Survey by Age Group") +
  geom_text(stat = "count", aes(label = after_stat(count), vjust = -0.5)) +
  theme(plot.title = element_text(hjust = 0.5)) 

As observed from the bar chart, there are more female than male contributed to this survey. Among these participants, there are most of them aged between 21-50 years old, which reflects the working age population.

## Plot Work-Life Balance Grade By Gender
df2$RANKING <- factor(df2$RANKING, levels = c("Low", "Medium", "High"))
ggplot(df2, aes(x = RANKING, fill = GENDER)) + 
  geom_bar(position = "dodge") +
  scale_x_discrete(labels = c("High"="High(Score>698.5)", "Medium"="Medium(636<=Score<=698.5)", "Low"="Low(Score<636)")) +
  geom_text(stat = "count", aes(label = after_stat(count), group = GENDER), position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  labs(x = "Work Life Balance Grade", y = "Count", title = "Work Life Balance Grade By Gender") +
  theme(plot.title = element_text(hjust = 0.5))

Based on the bar chart above, most of the work-life balance grades fall into the medium category, while the counts for low and high grades are nearly equal for both male and female genders.

## Plot Work-Life Balance Grade By Age Group
df2$AGE_GROUP <- factor(df2$AGE_GROUP)
ggplot(df2, aes(x = RANKING, fill = AGE_GROUP)) +
  geom_bar() +
  labs(x = "Work Life Balance Grade", y = "Count", title = "Work Life Balance Grade by Age Group") +
  scale_x_discrete(labels = c("High" = "High(Score>698.5)", "Medium" = "Medium(636<=Score<=698.5)", "Low" = "Low(Score<636)")) +
  scale_fill_discrete(labels = c("18-25", "26-35", "36-50", "51 or more")) +
  theme(plot.title = element_text(hjust = 0.5))

Based on the bar chart above, observations can conclude that for each work-life balance grade (low, medium, and high), each grade contains four different age groups. There is no significant indication that a certain age group only appears in a specific grade.

3.2.2 Bivariate analysis

The top six highest correlations between factors and work-life balance scores are as follows: Achievement (0.56), followed by Supporting Others (0.55), Todo Completed (0.54), Places Visited (0.53), Time for Passion (0.52), Core Circle (0.51), and Personal Awards (0.5). The lowest negative correlation with work-life balance score is Daily Stress (-0.37). The corresponding graphs are plotted below:

## Convert integer to factor
df2 <- df2 %>% mutate_if(is.integer, as.factor)

## Create a function to plot Box-plot of Work-Life Balance Score vs input variable
plot_boxplot_line <- function(data, x_variable) {
  ggplot(data, aes_string(x = x_variable, y = "WORK_LIFE_BALANCE_SCORE")) +
    geom_boxplot(fill = "lightblue", color = "blue") +
    labs(title = paste("Box-plot of Work-Life Balance Score vs", x_variable),
         x = x_variable,
         y = "Work-Life Balance Score") +
    theme_minimal() +
    stat_summary(aes(shape = "Mean"), fun = mean, geom = "point", size = 3, color = "red") +
    geom_line(data = data %>% group_by(!!sym(x_variable)) %>% summarize(mean_score = mean(WORK_LIFE_BALANCE_SCORE)),
              aes_string(x = x_variable, y = "mean_score", group = "1"),
              color = "red", linetype = "dashed") +
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_shape_manual(name = "", values = 11, labels = "Mean")
}

plot_boxplot_line(df2, "ACHIEVEMENT")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Based on the above box plot, observations can conclude that as the number of achievements increases, both the median and mean of the work-life balance score also increase.

## Create a Box-plot of Work-Life Balance Score vs Supporting Others: Correlation 0.55
plot_boxplot_line(df2, "SUPPORTING_OTHERS")

Based on the above box plot, observations can conclude that as the number of supporting others increases, both the median and mean of the work-life balance score also increase.

# Create a Box-plot of Work-Life Balance Score vs TODO Completed: Correlation 0.54
plot_boxplot_line(df2, "TODO_COMPLETED")

Based on the above box plot, observations can conclude that as the number of Todo Completed increases, both the median and mean of the work-life balance score also increase.

# Create a Box-plot of Work-Life Balance Score vs Places Visited: Correlation 0.53
plot_boxplot_line(df2, "PLACES_VISITED")

Based on the above box plot, observations can conclude that as the number of places visited increases, both the median and mean of the work-life balance score also increase.

# Create a Box-plot of Work-Life Balance Score vs Time For Passion: Correlation 0.52
plot_boxplot_line(df2, "TIME_FOR_PASSION")

Based on the above box plot, observations can conclude that as the number of Time for Passion increases, both the median and mean of the work-life balance score also increase.

# Create a Box-plot of Work-Life Balance Score vs Core Circle: Correlation 0.51
plot_boxplot_line(df2, "CORE_CIRCLE")

Based on the above box plot, observations can conclude that as the number of Core Circle increases, both the median and mean of the work-life balance score also increase.

# Create a Box-plot of Work-Life Balance Score vs Personal Awards: Correlation 0.50
plot_boxplot_line(df2, "PERSONAL_AWARDS")

Based on the above box plot, observations can conclude that as the number of Personal Awards increases, both the median and mean of the work-life balance score also increase.

## Create a Box-plot of Work-Life Balance Score vs Daily Stress: Correlation -0.37
ggplot(df2, aes(x = DAILY_STRESS, y = WORK_LIFE_BALANCE_SCORE)) +
  geom_boxplot(fill = "pink", color = "red") +
  labs(title = "Box-plot of Work-Life Balance Score vs Daily Stress",
       x = "Daily Stress",
       y = "Work-Life Balance Score") +
  theme_minimal() +
  stat_summary(aes(shape = "Mean"), fun = mean, geom = "point", size = 3, color = "blue") +
  geom_line(data = df2 %>% group_by(DAILY_STRESS) %>% summarize(mean_score = mean(WORK_LIFE_BALANCE_SCORE)),
            aes(x = DAILY_STRESS, y = mean_score, group = 1),
            color = "blue", linetype = "dashed") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_shape_manual(name = "", values = 11, labels = "Mean")

Based on the above box plot, observations can conclude that as the number of Daily stress increases, both the median and mean of the work-life balance score decrease.

4.0 Data Modelling

4.1 Regression model

The data is going to be split into 2 portion, 80% for training and 20% for testing the model. To be fair for each regression model evaluation, set.seed function is used to ensure consistent data input for model training and model testing.

## Prepare dataset for modelling
df3 <- cleaned_dataset[, !colnames(cleaned_dataset) %in% c("AGE", "GENDER","RANKING")]
set.seed(123)  # For reproducibility

## Split the data into training (80%) and testing (20%) sets
train_index <- createDataPartition(df3$WORK_LIFE_BALANCE_SCORE, p = 0.8, list = FALSE)
train_data <- df3[train_index, ]
test_data <- df3[-train_index, ]
glimpse(df3)
## Rows: 15,288
## Columns: 23
## $ FRUITS_VEGGIES          <int> 3, 2, 2, 3, 5, 3, 4, 3, 5, 4, 2, 1, 2, 5, 3, 3…
## $ DAILY_STRESS            <int> 2, 3, 3, 3, 1, 2, 2, 4, 3, 4, 4, 4, 2, 3, 3, 3…
## $ PLACES_VISITED          <int> 2, 4, 3, 10, 3, 3, 10, 5, 6, 2, 7, 3, 10, 0, 2…
## $ CORE_CIRCLE             <int> 5, 3, 4, 3, 3, 9, 6, 3, 4, 6, 7, 8, 5, 3, 3, 4…
## $ SUPPORTING_OTHERS       <int> 0, 8, 4, 10, 10, 10, 10, 5, 3, 10, 1, 0, 2, 6,…
## $ SOCIAL_NETWORK          <int> 5, 10, 10, 7, 4, 10, 10, 7, 3, 10, 1, 2, 8, 6,…
## $ ACHIEVEMENT             <int> 2, 5, 3, 2, 2, 2, 3, 4, 5, 0, 3, 1, 3, 6, 3, 2…
## $ DONATION                <int> 0, 2, 2, 5, 4, 3, 5, 0, 4, 4, 0, 0, 4, 3, 2, 5…
## $ BMI_RANGE               <int> 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1…
## $ TODO_COMPLETED          <int> 6, 5, 2, 3, 5, 6, 8, 8, 10, 3, 2, 2, 7, 8, 8, …
## $ FLOW                    <int> 4, 2, 2, 5, 0, 1, 8, 2, 2, 2, 2, 1, 1, 7, 0, 2…
## $ DAILY_STEPS             <int> 5, 5, 4, 5, 5, 7, 7, 8, 1, 3, 7, 8, 6, 7, 2, 8…
## $ LIVE_VISION             <int> 0, 5, 5, 0, 0, 10, 5, 10, 5, 0, 4, 2, 10, 2, 5…
## $ SLEEP_HOURS             <int> 7, 8, 8, 5, 7, 8, 7, 6, 10, 6, 8, 7, 8, 6, 8, …
## $ LOST_VACATION           <int> 5, 2, 10, 7, 0, 0, 10, 0, 0, 0, 7, 7, 0, 0, 3,…
## $ DAILY_SHOUTING          <int> 5, 2, 2, 5, 0, 2, 0, 2, 2, 0, 3, 1, 0, 3, 2, 2…
## $ SUFFICIENT_INCOME       <int> 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2…
## $ PERSONAL_AWARDS         <int> 4, 3, 4, 5, 8, 10, 10, 8, 10, 3, 3, 4, 5, 5, 4…
## $ TIME_FOR_PASSION        <int> 0, 2, 8, 2, 1, 8, 8, 2, 3, 8, 0, 1, 2, 6, 5, 3…
## $ WEEKLY_MEDITATION       <int> 5, 6, 3, 0, 5, 3, 10, 2, 10, 1, 6, 7, 7, 5, 4,…
## $ WORK_LIFE_BALANCE_SCORE <dbl> 609.5, 655.6, 631.6, 622.7, 663.9, 722.3, 727.…
## $ AGE_GROUP               <dbl> 3, 3, 3, 4, 4, 4, 4, 2, 2, 4, 3, 2, 2, 3, 2, 4…
## $ GENDER_GROUP            <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1…

4.1.1 Multilinear Regression

## Build Multilinear Regression model
model_LR <- lm(WORK_LIFE_BALANCE_SCORE ~ .-1, data = train_data)
predictions_LR <- predict(model_LR, newdata = test_data)

## Evaluation metrics
LR_metrics <- data.frame(
  R2 = R2(predictions_LR, test_data$WORK_LIFE_BALANCE_SCORE),
  RMSE = RMSE(predictions_LR, test_data$WORK_LIFE_BALANCE_SCORE),
  MAE = MAE(predictions_LR, test_data$WORK_LIFE_BALANCE_SCORE)
)
print(LR_metrics)
##         R2     RMSE      MAE
## 1 0.545125 58.01379 45.69006
## Plot Regression results
print(lares::mplot_lineal(tag = test_data$WORK_LIFE_BALANCE_SCORE,
                          score = predictions_LR,
                          subtitle = "Multilinear Regression Model",
                          model_name = "model_LR"))

This chart represents predicted value against real value from multi-linear regression.
The plot shows quite a big portion of the prediction is deviated from real value.
The result shows adjusted R square = 0.545, RMSE = 58.01 & MAE = 45.69, which is not a satisfying results.

## Density plot
print(lares::mplot_density(tag = test_data$WORK_LIFE_BALANCE_SCORE,
                           score = predictions_LR,
                           subtitle = "Multilinear Regression Model",
                           model_name = "model_LR"))

The distribution plot indicates the distribution of both model predicted and real work life balance score.
The model prediction has wider range compared to real distribution.

## Split and compare quantiles
print(lares::mplot_splits(tag = test_data$WORK_LIFE_BALANCE_SCORE,
                          score = predictions_LR,
                          split = 5))

The predicted and real results are split into 5 groups.
The predict score for each group are 59 8%, 32.08%, 31.26%, 32.9% and 61.37% respectively.

4.1.2 SVM Regression

## Build SVM Regression model
model_SVM <- svm(WORK_LIFE_BALANCE_SCORE ~ .-1, data = train_data, kernel = "linear")
predictions_SVM <- predict(model_SVM, newdata = test_data)

## Evaluation metrics
SVM_metrics <- data.frame(
  R2 = R2(predictions_SVM, test_data$WORK_LIFE_BALANCE_SCORE),
  RMSE = RMSE(predictions_SVM, test_data$WORK_LIFE_BALANCE_SCORE),
  MAE = MAE(predictions_SVM, test_data$WORK_LIFE_BALANCE_SCORE)
)
print(SVM_metrics)
##          R2     RMSE      MAE
## 1 0.9957653 3.362268 2.961172
## Plot Regression results
print(lares::mplot_lineal(tag = test_data$WORK_LIFE_BALANCE_SCORE,
                          score = predictions_SVM,
                          subtitle = "SVM Regression Model",
                          model_name = "model_SVM"))

This chart represents predicted value against real value from SVM regression.
The plot shows the prediction and real values are pretty well-aligned and fitted.
The result shows adjusted R square = 0.9958, RMSE = 3.362 & MAE = 2.961, which is a very good prediction results.

## Density plot
print(lares::mplot_density(tag = test_data$WORK_LIFE_BALANCE_SCORE,
                           score = predictions_SVM,
                           subtitle = "SVM Regression Model",
                           model_name = "model_SVM"))

The distribution plot indicates the distribution of both model predicted and real work life balance score.
We can see the prediction and real value are very well overlapped compared to multi-linear regression model.

## Split and compare quantiles
print(lares::mplot_splits(tag = test_data$WORK_LIFE_BALANCE_SCORE,
                          score = predictions_SVM,
                          split = 5))

The predicted and real results are split into 5 groups.
The predict score for each group are 96.73%, 91.33%, 88.71%, 89.53% and 95.42% respectively.

4.1.3 Random Forest

## Build Random Forest model
model_RF <- randomForest(WORK_LIFE_BALANCE_SCORE ~ .-1, data = train_data)
predictions_RF <- predict(model_RF, newdata = test_data)

## Evaluation metrics
RF_metrics <- data.frame(
  R2 = R2(predictions_RF, test_data$WORK_LIFE_BALANCE_SCORE),
  RMSE = RMSE(predictions_RF, test_data$WORK_LIFE_BALANCE_SCORE),
  MAE = MAE(predictions_RF, test_data$WORK_LIFE_BALANCE_SCORE)
)
print(RF_metrics)
##          R2     RMSE      MAE
## 1 0.9572659 11.23245 9.047555
## Plot Regression results
print(lares::mplot_lineal(tag = test_data$WORK_LIFE_BALANCE_SCORE,
                          score = predictions_RF,
                          subtitle = "Random Forest Regression Model",
                          model_name = "model_RF"))

This chart represents predicted value against real value from random forest regression.
The plot shows the predictions are slightly deviated from real value.
The result shows adjusted R square = 0.9573, RMSE = 11.23 & MAE = 9.048, which is a good prediction results.

## Density plot
print(lares::mplot_density(tag = test_data$WORK_LIFE_BALANCE_SCORE,
                           score = predictions_RF,
                           subtitle = "Random Forest Regression Model",
                           model_name = "model_RF"))

The distribution plot indicates the distribution of both model predicted and real work life balance score.
We can see the prediction and real value distributions are pretty much aligned, the only downside for the model would be there are more predictions in range of 650 to 700 compared to real value.

## Split and compare quantiles
print(lares::mplot_splits(tag = test_data$WORK_LIFE_BALANCE_SCORE,
                          score = predictions_RF,
                          split = 5))

The predicted and real results are split into 5 groups.
The predict score for each group are 88.07%, 73.49%, 71.21%, 70.7% and 89.85% respectively.

4.1.4 XGBoost Regression

## Separate predictors and response variable
response_col <- "WORK_LIFE_BALANCE_SCORE"

## Training data
train_response <- train_data[[response_col]]
train_matrix <- as.matrix(train_data[, !names(train_data) %in% response_col])

## Testing data
test_response <- test_data[[response_col]]
test_matrix <- as.matrix(test_data[, !names(test_data) %in% response_col])
model_XG <- xgboost(data = train_matrix, label = train_response, objective = "reg:squarederror", nrounds = 100)
predictions_XG <- predict(model_XG, test_matrix)
## Evaluation metrics
XG_metrics <- data.frame(
  R2 = R2(predictions_XG, test_data$WORK_LIFE_BALANCE_SCORE),
  RMSE = RMSE(predictions_XG, test_data$WORK_LIFE_BALANCE_SCORE),
  MAE = MAE(predictions_XG, test_data$WORK_LIFE_BALANCE_SCORE)
)
print(XG_metrics)
##          R2     RMSE      MAE
## 1 0.9809891 6.231552 4.909307
## Plot Regression results
print(lares::mplot_lineal(tag = test_data$WORK_LIFE_BALANCE_SCORE,
                          score = predictions_XG,
                          subtitle = "XGBoost Regression Model",
                          model_name = "model_XG"))

This chart represents predicted value against real value from random forest regression.
The plot shows the predictions are slightly deviated from real value.
The result shows adjusted R square = 0.981, RMSE = 6.232 & MAE = 4.909, which is a good prediction results.

## Density plot
print(lares::mplot_density(tag = test_data$WORK_LIFE_BALANCE_SCORE,
                           score = predictions_XG,
                           subtitle = "XGBoost Regression Model",
                           model_name = "model_XG"))

The distribution plot indicates the distribution of both model predicted and real work life balance score.
We can see the prediction and real value distributions are pretty much aligned.
The deviation from model is distributed across the continuous values, which is more well-distributed compared to random forest model.

## Split and compare quantiles
print(lares::mplot_splits(tag = test_data$WORK_LIFE_BALANCE_SCORE,
                          score = predictions_XG,
                          split = 5))

The predicted and real results are split into 5 groups.
The predict score for each group are 91.99%, 82.98%, 81.01%, 82.82% and 92.8% respectively.

4.1.5 Summary

4.1.5.1 Model evaluation

Based on evaluation on several conventional regression models above, we can summarise the performance of models by their R-squared value as following:

## Model Assessment
regression_assessment <- data.frame(
  R2_LR = R2(predictions_LR, test_data$WORK_LIFE_BALANCE_SCORE),
  R2_SVM = R2(predictions_SVM, test_data$WORK_LIFE_BALANCE_SCORE),
  R2_RF = R2(predictions_RF, test_data$WORK_LIFE_BALANCE_SCORE),
  R2_XG = R2(predictions_XG, test_data$WORK_LIFE_BALANCE_SCORE)
)
print(regression_assessment)
##      R2_LR    R2_SVM     R2_RF     R2_XG
## 1 0.545125 0.9957653 0.9572659 0.9809891

SVM has highest R-squared score of 0.9958 while the worst performer linear regression only has 0.5451. Random forest and XGBoost results are close to SVM model.

A few reasons linear regression are not performing well could be:

  1. Linear regression assumes a linear relationship between independent and dependent variable.
  2. Linear regression is sensitive to multicollinearity, which means independent variables are highly correlated with each other. To utilize linear regression, variance inflation factors (VIF) among the independent variables should be assessed and high VIF variables should be considered to remove from regression modelling.
  3. Linear regression assumes no heteroskedasticity (residual plot gap increases across independent variable).

In contrast, SVMs are capable of capturing complex and non-linear relationships between variables. It is also relatively robust to outliers due to its usage of support vectors, which focus on the most critical instances for decision boundary construction.

4.1.5.2 Factors impact to work life balance score

Based on best regression model (SVM), let’s have an overall view on each independents variables’ correlation with dependent variables.

## Sort coefficients and names in descending order
coefficients_SVM <- coef(model_SVM)[-1]  # Exclude intercept
coeff_names_SVM <- names(coefficients_SVM)

sorted_indices_SVM <- order((coefficients_SVM), decreasing = TRUE)
sorted_coefficients_SVM <- coefficients_SVM[sorted_indices_SVM]
sorted_names_SVM <- coeff_names_SVM[sorted_indices_SVM]

data <- data.frame(Predictor = sorted_names_SVM, Coefficient = sorted_coefficients_SVM)

## Create a horizontal bar plot with data labels using ggplot2
ggplot(data, aes(x = Coefficient, y = reorder(Predictor, Coefficient))) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  geom_text(aes(label = round(Coefficient, digits = 2)), hjust = -0.1, size = 3) +  # Add data labels
  labs(title = "Coefficients of Predictors (Descending Order)",
       x = "Coefficients", y = "Predictors") +
  theme_minimal() +
  theme(axis.text.y = element_text(hjust = 0),  # Aligns y-axis labels to the left
        axis.title.y = element_blank())  # Removes the y-axis title

From the bar chart, the highest positive coefficients towards work life balance is donation (how many times of donating time or money to good causes), which is quite a surprising factor. The second highest factor is sufficient income which a reasonable factor to most of the individuals, followed by number of places visited.

On the other hand, lost vacation (total days of vacation typically lose every year) gives highest negative coefficient towards work life balance, followed by BMI range and daily stress.

4.2 Classification model

Same as Regression modelling, the dataset is split into 2 portion, 80% for training and 20% for testing the model.

## Prepare dataset for modelling
df4 <- cleaned_dataset[, !colnames(cleaned_dataset) %in% c("AGE", "GENDER", "WORK_LIFE_BALANCE_SCORE")]
df4$RANKING <- as.factor(df4$RANKING)
set.seed(456)  # For reproducibility

## Split the data into training (80%) and testing (20%) sets
index_train <- createDataPartition(df4$RANKING, p = 0.8, list = FALSE)
data_train <- df4[index_train, ]
data_test <- df4[-index_train, ]
glimpse(df4)
## Rows: 15,288
## Columns: 23
## $ FRUITS_VEGGIES    <int> 3, 2, 2, 3, 5, 3, 4, 3, 5, 4, 2, 1, 2, 5, 3, 3, 4, 4…
## $ DAILY_STRESS      <int> 2, 3, 3, 3, 1, 2, 2, 4, 3, 4, 4, 4, 2, 3, 3, 3, 2, 2…
## $ PLACES_VISITED    <int> 2, 4, 3, 10, 3, 3, 10, 5, 6, 2, 7, 3, 10, 0, 2, 5, 1…
## $ CORE_CIRCLE       <int> 5, 3, 4, 3, 3, 9, 6, 3, 4, 6, 7, 8, 5, 3, 3, 4, 4, 1…
## $ SUPPORTING_OTHERS <int> 0, 8, 4, 10, 10, 10, 10, 5, 3, 10, 1, 0, 2, 6, 7, 10…
## $ SOCIAL_NETWORK    <int> 5, 10, 10, 7, 4, 10, 10, 7, 3, 10, 1, 2, 8, 6, 5, 10…
## $ ACHIEVEMENT       <int> 2, 5, 3, 2, 2, 2, 3, 4, 5, 0, 3, 1, 3, 6, 3, 2, 4, 0…
## $ DONATION          <int> 0, 2, 2, 5, 4, 3, 5, 0, 4, 4, 0, 0, 4, 3, 2, 5, 0, 1…
## $ BMI_RANGE         <int> 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1…
## $ TODO_COMPLETED    <int> 6, 5, 2, 3, 5, 6, 8, 8, 10, 3, 2, 2, 7, 8, 8, 7, 8, …
## $ FLOW              <int> 4, 2, 2, 5, 0, 1, 8, 2, 2, 2, 2, 1, 1, 7, 0, 2, 2, 1…
## $ DAILY_STEPS       <int> 5, 5, 4, 5, 5, 7, 7, 8, 1, 3, 7, 8, 6, 7, 2, 8, 1, 1…
## $ LIVE_VISION       <int> 0, 5, 5, 0, 0, 10, 5, 10, 5, 0, 4, 2, 10, 2, 5, 10, …
## $ SLEEP_HOURS       <int> 7, 8, 8, 5, 7, 8, 7, 6, 10, 6, 8, 7, 8, 6, 8, 6, 8, …
## $ LOST_VACATION     <int> 5, 2, 10, 7, 0, 0, 10, 0, 0, 0, 7, 7, 0, 0, 3, 1, 1,…
## $ DAILY_SHOUTING    <int> 5, 2, 2, 5, 0, 2, 0, 2, 2, 0, 3, 1, 0, 3, 2, 2, 1, 3…
## $ SUFFICIENT_INCOME <int> 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2…
## $ PERSONAL_AWARDS   <int> 4, 3, 4, 5, 8, 10, 10, 8, 10, 3, 3, 4, 5, 5, 4, 8, 3…
## $ TIME_FOR_PASSION  <int> 0, 2, 8, 2, 1, 8, 8, 2, 3, 8, 0, 1, 2, 6, 5, 3, 3, 8…
## $ WEEKLY_MEDITATION <int> 5, 6, 3, 0, 5, 3, 10, 2, 10, 1, 6, 7, 7, 5, 4, 10, 3…
## $ RANKING           <fct> Low, Medium, Low, Low, Medium, High, High, Medium, H…
## $ AGE_GROUP         <dbl> 3, 3, 3, 4, 4, 4, 4, 2, 2, 4, 3, 2, 2, 3, 2, 4, 2, 2…
## $ GENDER_GROUP      <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0…
## Setup k-fold cross validation 
tc <- trainControl(
  method = "repeatedcv", 
  number = 10, 
  classProbs = FALSE,
  savePredictions = TRUE, 
  repeats = 3,
  summaryFunction = multiClassSummary
)

4.2.1 K-Nearest Neighbors model

## Build K-Nearest Neighbors model with tuning
knn_model <- train(
  RANKING ~ .,
  data = data_train,
  method = "knn",
  trControl = tc,
  preProcess = c("center", "scale"),
  metric = 'Accuracy',
  tuneLength = 20,
  tuneGrid = data.frame(k = seq(1, 20, by = 1))  # Specify the range of k values
)
print(knn_model)
## k-Nearest Neighbors 
## 
## 12232 samples
##    22 predictor
##     3 classes: 'High', 'Low', 'Medium' 
## 
## Pre-processing: centered (22), scaled (22) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 11007, 11008, 11009, 11010, 11009, 11009, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa      Mean_F1    Mean_Sensitivity  Mean_Specificity
##    1  0.8235244  0.7177429  0.8235894  0.8242233         0.9021540       
##    2  0.8143115  0.7029464  0.8143160  0.8148604         0.8969825       
##    3  0.8612934  0.7771753  0.8607725  0.8583294         0.9219276       
##    4  0.8578315  0.7714129  0.8572115  0.8541241         0.9197498       
##    5  0.8800147  0.8066136  0.8792090  0.8743829         0.9314365       
##    6  0.8770981  0.8016310  0.8761079  0.8702500         0.9294096       
##    7  0.8892249  0.8210775  0.8882656  0.8818093         0.9359602       
##    8  0.8876179  0.8184312  0.8866256  0.8799525         0.9349847       
##    9  0.8969633  0.8333261  0.8959173  0.8883296         0.9398543       
##   10  0.8947289  0.8296395  0.8936275  0.8857298         0.9384907       
##   11  0.9004787  0.8387582  0.8993109  0.8905798         0.9413854       
##   12  0.8993352  0.8368552  0.8981309  0.8891431         0.9406520       
##   13  0.9048392  0.8457116  0.9036527  0.8943588         0.9436143       
##   14  0.9038847  0.8440225  0.9026005  0.8926458         0.9428309       
##   15  0.9064739  0.8481289  0.9051462  0.8947339         0.9441028       
##   16  0.9066105  0.8483478  0.9052829  0.8948800         0.9441817       
##   17  0.9085717  0.8514150  0.9071639  0.8961856         0.9450526       
##   18  0.9069914  0.8488509  0.9055806  0.8946232         0.9441802       
##   19  0.9101804  0.8540045  0.9087586  0.8976420         0.9458956       
##   20  0.9106167  0.8547462  0.9092199  0.8983160         0.9462173       
##   Mean_Pos_Pred_Value  Mean_Neg_Pred_Value  Mean_Precision  Mean_Recall
##   0.8232843            0.9019230            0.8232843       0.8242233  
##   0.8141732            0.8968210            0.8141732       0.8148604  
##   0.8637197            0.9233913            0.8637197       0.8583294  
##   0.8609374            0.9215523            0.8609374       0.8541241  
##   0.8848478            0.9343739            0.8848478       0.8743829  
##   0.8830691            0.9330042            0.8830691       0.8702500  
##   0.8959117            0.9400003            0.8959117       0.8818093  
##   0.8946201            0.9391623            0.8946201       0.8799525  
##   0.9050919            0.9447375            0.9050919       0.8883296  
##   0.9032628            0.9435659            0.9032628       0.8857298  
##   0.9101003            0.9471132            0.9101003       0.8905798  
##   0.9092432            0.9465424            0.9092432       0.8891431  
##   0.9151852            0.9497926            0.9151852       0.8943588  
##   0.9150881            0.9494888            0.9150881       0.8926458  
##   0.9182996            0.9511479            0.9182996       0.8947339  
##   0.9184585            0.9512394            0.9184585       0.8948800  
##   0.9212849            0.9526082            0.9212849       0.8961856  
##   0.9196796            0.9516847            0.9196796       0.8946232  
##   0.9231007            0.9535938            0.9231007       0.8976420  
##   0.9232917            0.9537601            0.9232917       0.8983160  
##   Mean_Detection_Rate  Mean_Balanced_Accuracy
##   0.2745081            0.8631886             
##   0.2714372            0.8559214             
##   0.2870978            0.8901285             
##   0.2859438            0.8869369             
##   0.2933382            0.9029097             
##   0.2923660            0.8998298             
##   0.2964083            0.9088848             
##   0.2958726            0.9074686             
##   0.2989878            0.9140919             
##   0.2982430            0.9121102             
##   0.3001596            0.9159826             
##   0.2997784            0.9148975             
##   0.3016131            0.9189865             
##   0.3012949            0.9177383             
##   0.3021580            0.9194183             
##   0.3022035            0.9195309             
##   0.3028572            0.9206191             
##   0.3023305            0.9194017             
##   0.3033935            0.9217688             
##   0.3035389            0.9222667             
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 20.
## Predict Work Life Balance Rank with K-Nearest Neighbours model using test data
knn_pred <- predict(knn_model, data_test)

## Plot Confusion Matrix 
knn_confusionmatrix <- confusionMatrix(knn_pred, data_test$RANKING)
print(knn_confusionmatrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High  Low Medium
##     High    687    0     42
##     Low       0  657     24
##     Medium   75  106   1465
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9192          
##                  95% CI : (0.9089, 0.9286)
##     No Information Rate : 0.501           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8686          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium
## Sensitivity               0.9016     0.8611        0.9569
## Specificity               0.9817     0.9895        0.8813
## Pos Pred Value            0.9424     0.9648        0.8900
## Neg Pred Value            0.9678     0.9554        0.9532
## Prevalence                0.2493     0.2497        0.5010
## Detection Rate            0.2248     0.2150        0.4794
## Detection Prevalence      0.2385     0.2228        0.5386
## Balanced Accuracy         0.9416     0.9253        0.9191
knn_cfm <- as_tibble(knn_confusionmatrix$table)
print(plot_confusion_matrix(knn_cfm, 
                            target_col = "Prediction", 
                            prediction_col = "Reference",
                            counts_col = "n"))

## Calculate Feature Importance of K-Nearest Neighbors Model
knn_importance <- varImp(knn_model)
## Create box plot for variable importance
gg.knn_feat_importance <- ggplot(data = knn_importance, mapping = aes(x = knn_importance[,1])) + 
  geom_boxplot() + 
  labs(title = "Variable importance: K-Nearest Neighbors ") + 
  theme_minimal() 
plot(gg.knn_feat_importance)

KNN model achieved 91.92% accuracy and 86.86% Kappa which is relatively high.
From the feature importance, we can see that Achievement contributed the most to KNN model.

4.2.2 NaiveBayes model

## Build NaiveBayes model 
nb_model <- train(
  RANKING ~ .,
  data_train,
  method="naive_bayes",
  preProcess = c("center","scale"),
  metric='Accuracy',
  trControl=tc)
print(nb_model)
## Naive Bayes 
## 
## 12232 samples
##    22 predictor
##     3 classes: 'High', 'Low', 'Medium' 
## 
## Pre-processing: centered (22), scaled (22) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 11009, 11009, 11009, 11009, 11008, 11009, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa      Mean_F1    Mean_Sensitivity
##   FALSE      0.8773159  0.8020360  0.8764003  0.8707402       
##    TRUE      0.8452139  0.7419982  0.8369096  0.8123877       
##   Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value  Mean_Precision
##   0.9296211         0.8831756            0.9330678            0.8831756     
##   0.9030140         0.8837514            0.9241576            0.8837514     
##   Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
##   0.8707402    0.2924386            0.9001806             
##   0.8123877    0.2817380            0.8577009             
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = FALSE
##  and adjust = 1.
## Predict Work Life Balance Rank with Naive Bayes model using test data
nb_pred <- predict(nb_model, data_test)

## Plot Confusion Matrix 
nb_confusionmatrix <- confusionMatrix(nb_pred, data_test$RANKING)
print(nb_confusionmatrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High  Low Medium
##     High    637    0     58
##     Low       0  659     97
##     Medium  125  104   1376
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8743          
##                  95% CI : (0.8621, 0.8859)
##     No Information Rate : 0.501           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7968          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium
## Sensitivity               0.8360     0.8637        0.8988
## Specificity               0.9747     0.9577        0.8498
## Pos Pred Value            0.9165     0.8717        0.8573
## Neg Pred Value            0.9471     0.9548        0.8932
## Prevalence                0.2493     0.2497        0.5010
## Detection Rate            0.2084     0.2156        0.4503
## Detection Prevalence      0.2274     0.2474        0.5252
## Balanced Accuracy         0.9053     0.9107        0.8743
nb_cfm <- as_tibble(nb_confusionmatrix$table)
print(plot_confusion_matrix(nb_cfm, 
                            target_col = "Prediction", 
                            prediction_col = "Reference",
                            counts_col = "n"))

## Calculate Feature Importance of Naive Bayes Model
nb_importance <- varImp(nb_model)
## Create box plot for variable importance
gg.nb_feat_importance <- ggplot(data = nb_importance, mapping = aes(x = nb_importance[,1])) +
  geom_boxplot() + 
  labs(title = "Variable importance: Naive Bayes") + 
  theme_minimal() 
plot(gg.nb_feat_importance)

Naive Bayes model achieved 87.43% accuracy and 79.68% Kappa.
From the feature importance, we can see that Achievement and supporting others contributed the most to Naive Bayes model.

4.2.3 AdaBoost Model

## Create the Model without Cross Validation with the help of Boosting Function
ada_model <- boosting(RANKING~., data=data_train, boos=TRUE, mfinal=50)

## Predict Work Life Balance Rank with AdaBoost model using test data
ada_pred <- predict(ada_model , newdata = data_test)

## Plot Confusion Matrix 
ada_confusionmatrix <- confusionMatrix(as.factor(ada_pred$class), data_test$RANKING)
print(ada_confusionmatrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High  Low Medium
##     High    563    0     28
##     Low       0  580     37
##     Medium  199  183   1466
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8537          
##                  95% CI : (0.8407, 0.8661)
##     No Information Rate : 0.501           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7556          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: High Class: Low Class: Medium
## Sensitivity               0.7388     0.7602        0.9575
## Specificity               0.9878     0.9839        0.7495
## Pos Pred Value            0.9526     0.9400        0.7933
## Neg Pred Value            0.9193     0.9250        0.9462
## Prevalence                0.2493     0.2497        0.5010
## Detection Rate            0.1842     0.1898        0.4797
## Detection Prevalence      0.1934     0.2019        0.6047
## Balanced Accuracy         0.8633     0.8720        0.8535
ada_cfm <- as_tibble(ada_confusionmatrix$table)
print(plot_confusion_matrix(ada_cfm, 
                            target_col = "Prediction", 
                            prediction_col = "Reference",
                            counts_col = "n"))

## Calculate Feature Importance of AdaBoost Model
## Plot Feature Importance
print(importanceplot(ada_model, horiz=TRUE))

##       [,1]
##  [1,]  0.7
##  [2,]  1.9
##  [3,]  3.1
##  [4,]  4.3
##  [5,]  5.5
##  [6,]  6.7
##  [7,]  7.9
##  [8,]  9.1
##  [9,] 10.3
## [10,] 11.5
## [11,] 12.7
## [12,] 13.9
## [13,] 15.1
## [14,] 16.3
## [15,] 17.5
## [16,] 18.7
## [17,] 19.9
## [18,] 21.1
## [19,] 22.3
## [20,] 23.5
## [21,] 24.7
## [22,] 25.9

AdaBoost model achieved 85.37% accuracy and 75.56% Kappa.
From the feature importance, we can see that SUPPORTING_OTHERS contributed the most to AdaBoost model.

4.2.4 Model Assessment

## Model evaluation
classification_evaluation <- data.frame(
  K_Nearest_Neighbours= knn_confusionmatrix$overall[1],
  Naive_Bayes=  nb_confusionmatrix$overall[1],
  AdaBoost = ada_confusionmatrix$overall[1]
)
print(classification_evaluation)
##          K_Nearest_Neighbours Naive_Bayes  AdaBoost
## Accuracy            0.9191754   0.8743455 0.8537304

K-Nearest Neighbors classification model is the best model among the 3 models for predicting Work-Life Balance Rank as it achieved the highest model accuracy of 91.92%.

5.0 Discussion

5.1 The factors affecting Work-Life Balance Score

Based on our data exploration, the factors affecting Work-Life Balance Score are ACHIEVEMENT, SUPPORTING_OTHERS and TODO_COMPLETED which have slightly stronger correlation to the Work-Life Balance Score compared to other factors. People who experience sense of accomplishment in their professional lives tend to perceive a better work-life balance. This means setting and achieving goals contributes positively to overall well-being and work-life satisfaction. Another factor which is supporting others, whether helping out colleagues or contributing to a team may experience a more favorable work-life balance. On top of that, people who completed more to-do items means they can effectively manage and complete their tasks, and this may perceive a better work-life balance. It shows that efficient task completion and time management are crucial components of work-life balance.

5.2 Work-life Balance Score Prediction (Regression Model)

There are four regression models used to predict Work-life Balance Score which are Multi-linear, Support Vector(SVM), Random Forest and XGBoost. SVM is the best prediction model among the four models as it achieved the highest R-squared score of 0.9958 in our project. The predicted results are the closest to actual values, compared to Random Forest and XGBoost models which also have high accuracy. The worst model is Multi-linear regression which only has 0.5451 R-squared score.

5.3 Work-life Balance Rank Prediction (Classification Model)

There are three classification models used to predict Work-life Balance Rank which are K-Nearest Neighbors (KNN), Naive Bayes and AdaBoost. From the results, KNN is the best classification model compared to Naive Bayes and AdaBoost classification model. This is because the prediction accuracy of KNN classification model is highest which achieved 91.92%.

6.0 Conclusion

In conclusion, this project highlight that the work-life balance could be influenced by different factors. Based on the findings, it suggested the importance of recognizing achievements (ACHIEVEMENT), fostering teamwork (SUPPORTING_OTHERS), and optimizing task management (TODO_COMPLETED) in achieving a better work-life balance. These outcomes offer practical implications for organizations aiming to enhance workplace well-being and suggest avenues for future research in understanding and improving work-life balance dynamics.