| Name | Matrics |
|---|---|
| Chew Tian Feng | S2180850 |
| Chua Ching Cheng | 23052278 |
| Lee Wei Tong | 22099644 |
| Lina Tay | 22117352 |
| Low Pei Ying | 22119138 |
In the fast-paced and dynamic landscape of today’s professional world, achieving a harmonious balance between work and personal life is an on-going challenge. As our lives become increasingly interconnected and demanding, the importance of a healthy work-life balance cannot be neglected. Hence, in our project, we seek to understand the determinants of work-life balance as well as to develop predictive models that can predict and quantify this critical metric base on lifestyle choices. Through this analysis and prediction, our goal is to uncover valuable insights that can assist individuals in creating a positive impact on their overall well-being and productivity.
There are 3 objectives we would like to achieve through this project. The objectives as follows:
The dataset was obtained from Kaggle open-source dataset. The data source comes from Authentic Happiness website which measure Work-Life Balance Score based on research published in the 360 Living guide. The dataset contains 15,972 survey responses conducted between year 2015 to year 2021 and consists of 24 attributes as below:
## Import libraries
library(dplyr)
library(reshape2)
library(ggplot2)
library(caret)
library(xgboost)
library(e1071)
library(randomForest)
library(lares)
library(rpart.plot)
library(rsvg)
library(ggimage)
library(cvms)
library(adabag)
First, we load the dataset from csv file by using read.csv(). Then, we use str() function to display the structure of imported dataframe.
## Load data
file_path <- "wellbeing_and_lifestyle_dataset.csv"
dataset_raw <- read.csv(file_path)
## Display the structure of the dataset
str(dataset_raw)
## 'data.frame': 15972 obs. of 24 variables:
## $ Timestamp : chr "7/7/15" "7/7/15" "7/7/15" "7/7/15" ...
## $ FRUITS_VEGGIES : int 3 2 2 3 5 3 4 3 5 4 ...
## $ DAILY_STRESS : chr "2" "3" "3" "3" ...
## $ PLACES_VISITED : int 2 4 3 10 3 3 10 5 6 2 ...
## $ CORE_CIRCLE : int 5 3 4 3 3 9 6 3 4 6 ...
## $ SUPPORTING_OTHERS : int 0 8 4 10 10 10 10 5 3 10 ...
## $ SOCIAL_NETWORK : int 5 10 10 7 4 10 10 7 3 10 ...
## $ ACHIEVEMENT : int 2 5 3 2 2 2 3 4 5 0 ...
## $ DONATION : int 0 2 2 5 4 3 5 0 4 4 ...
## $ BMI_RANGE : int 1 2 2 2 2 1 2 1 1 2 ...
## $ TODO_COMPLETED : int 6 5 2 3 5 6 8 8 10 3 ...
## $ FLOW : int 4 2 2 5 0 1 8 2 2 2 ...
## $ DAILY_STEPS : int 5 5 4 5 5 7 7 8 1 3 ...
## $ LIVE_VISION : int 0 5 5 0 0 10 5 10 5 0 ...
## $ SLEEP_HOURS : int 7 8 8 5 7 8 7 6 10 6 ...
## $ LOST_VACATION : int 5 2 10 7 0 0 10 0 0 0 ...
## $ DAILY_SHOUTING : int 5 2 2 5 0 2 0 2 2 0 ...
## $ SUFFICIENT_INCOME : int 1 2 2 1 2 2 2 2 2 1 ...
## $ PERSONAL_AWARDS : int 4 3 4 5 8 10 10 8 10 3 ...
## $ TIME_FOR_PASSION : int 0 2 8 2 1 8 8 2 3 8 ...
## $ WEEKLY_MEDITATION : int 5 6 3 0 5 3 10 2 10 1 ...
## $ AGE : chr "36 to 50" "36 to 50" "36 to 50" "51 or more" ...
## $ GENDER : chr "Female" "Female" "Female" "Female" ...
## $ WORK_LIFE_BALANCE_SCORE: num 610 656 632 623 664 ...
## Display the summary of the dataset
summary(dataset_raw)
## Timestamp FRUITS_VEGGIES DAILY_STRESS PLACES_VISITED
## Length:15972 Min. :0.000 Length:15972 Min. : 0.000
## Class :character 1st Qu.:2.000 Class :character 1st Qu.: 2.000
## Mode :character Median :3.000 Mode :character Median : 5.000
## Mean :2.923 Mean : 5.233
## 3rd Qu.:4.000 3rd Qu.: 8.000
## Max. :5.000 Max. :10.000
## CORE_CIRCLE SUPPORTING_OTHERS SOCIAL_NETWORK ACHIEVEMENT
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.: 4.000 1st Qu.: 2.000
## Median : 5.000 Median : 5.000 Median : 6.000 Median : 3.000
## Mean : 5.508 Mean : 5.616 Mean : 6.474 Mean : 4.001
## 3rd Qu.: 8.000 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.: 6.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
## DONATION BMI_RANGE TODO_COMPLETED FLOW
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 4.000 1st Qu.: 1.000
## Median :3.000 Median :1.000 Median : 6.000 Median : 3.000
## Mean :2.715 Mean :1.411 Mean : 5.746 Mean : 3.195
## 3rd Qu.:5.000 3rd Qu.:2.000 3rd Qu.: 8.000 3rd Qu.: 5.000
## Max. :5.000 Max. :2.000 Max. :10.000 Max. :10.000
## DAILY_STEPS LIVE_VISION SLEEP_HOURS LOST_VACATION
## Min. : 1.000 Min. : 0.000 Min. : 1.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 1.000 1st Qu.: 6.000 1st Qu.: 0.000
## Median : 5.000 Median : 3.000 Median : 7.000 Median : 0.000
## Mean : 5.704 Mean : 3.752 Mean : 7.043 Mean : 2.899
## 3rd Qu.: 8.000 3rd Qu.: 5.000 3rd Qu.: 8.000 3rd Qu.: 5.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
## DAILY_SHOUTING SUFFICIENT_INCOME PERSONAL_AWARDS TIME_FOR_PASSION
## Min. : 0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.:1.000 1st Qu.: 3.000 1st Qu.: 1.000
## Median : 2.000 Median :2.000 Median : 5.000 Median : 3.000
## Mean : 2.931 Mean :1.729 Mean : 5.712 Mean : 3.327
## 3rd Qu.: 4.000 3rd Qu.:2.000 3rd Qu.: 9.000 3rd Qu.: 5.000
## Max. :10.000 Max. :2.000 Max. :10.000 Max. :10.000
## WEEKLY_MEDITATION AGE GENDER
## Min. : 0.000 Length:15972 Length:15972
## 1st Qu.: 4.000 Class :character Class :character
## Median : 7.000 Mode :character Mode :character
## Mean : 6.233
## 3rd Qu.:10.000
## Max. :10.000
## WORK_LIFE_BALANCE_SCORE
## Min. :480.0
## 1st Qu.:636.0
## Median :667.7
## Mean :666.8
## 3rd Qu.:698.5
## Max. :820.2
## Display the first few rows of the dataset
head(dataset_raw)
## Timestamp FRUITS_VEGGIES DAILY_STRESS PLACES_VISITED CORE_CIRCLE
## 1 7/7/15 3 2 2 5
## 2 7/7/15 2 3 4 3
## 3 7/7/15 2 3 3 4
## 4 7/7/15 3 3 10 3
## 5 7/7/15 5 1 3 3
## 6 7/8/15 3 2 3 9
## SUPPORTING_OTHERS SOCIAL_NETWORK ACHIEVEMENT DONATION BMI_RANGE
## 1 0 5 2 0 1
## 2 8 10 5 2 2
## 3 4 10 3 2 2
## 4 10 7 2 5 2
## 5 10 4 2 4 2
## 6 10 10 2 3 1
## TODO_COMPLETED FLOW DAILY_STEPS LIVE_VISION SLEEP_HOURS LOST_VACATION
## 1 6 4 5 0 7 5
## 2 5 2 5 5 8 2
## 3 2 2 4 5 8 10
## 4 3 5 5 0 5 7
## 5 5 0 5 0 7 0
## 6 6 1 7 10 8 0
## DAILY_SHOUTING SUFFICIENT_INCOME PERSONAL_AWARDS TIME_FOR_PASSION
## 1 5 1 4 0
## 2 2 2 3 2
## 3 2 2 4 8
## 4 5 1 5 2
## 5 0 2 8 1
## 6 2 2 10 8
## WEEKLY_MEDITATION AGE GENDER WORK_LIFE_BALANCE_SCORE
## 1 5 36 to 50 Female 609.5
## 2 6 36 to 50 Female 655.6
## 3 3 36 to 50 Female 631.6
## 4 0 51 or more Female 622.7
## 5 5 51 or more Female 663.9
## 6 3 51 or more Female 722.3
In data cleaning stage, we drop off the Timestamp column which is not required for our analysis. Next, we check if there is any missing, duplicate, inconsistency value in the dataset.
## Remove the Timstamp column which is not relevant
dataset <- subset(dataset_raw, select = -c(Timestamp))
## Check the number of column in dataset after removing the Timestamp
cat("Total columns AFTER remove Timestamp: ", ncol(dataset), "\n")
## Total columns AFTER remove Timestamp: 23
## Check for any missing value in the dataset
any(is.na(dataset))
## [1] FALSE
## Check for any duplicate rows in the dataset
sum(duplicated(dataset))
## [1] 683
There are 683 duplicate rows found in the dataset, we assumed that these data was submitted due to user clicked the submit button too many times while page loading or unintentionally fill out the surveys multiple times. Hence, we removed the duplicate rows to prevent biasing result.
## Remove duplicate rows
dataset <- dataset %>% distinct()
cat("Total rows AFTER remove duplicates: ", nrow(dataset), "\n")
## Total rows AFTER remove duplicates: 15289
By using str() function, we noticed that DAILY_STRESS column is shown as “character” which means there is some cells is not numeric. There might have structural error under this column. We proceed to remove it.
## Check the class attribute of "Daily Stress" column
class(dataset$DAILY_STRESS)
## [1] "character"
## Check whether the class can be changed to integer
dataset$DAILY_STRESS <- as.integer(dataset$DAILY_STRESS)
## Warning: NAs introduced by coercion
The warning message “NAs introduced by coercion” indicates that there is a type coercion happening in code, and as a result, some values are being coerced (converted) to the special value NA (Not Available) which mean that not available to convert into numeric.
## Identify the rows where DAILY_STRESS cannot be converted to numeric
problematic_rows <- which(is.na(as.integer(dataset$DAILY_STRESS)))
## Print the row numbers and values
if (length(problematic_rows) > 0) {
cat("Problematic rows:", problematic_rows, "\n")
cat("Corresponding values:", dataset$DAILY_STRESS[problematic_rows], "\n")
} else {
cat("No problematic rows found.\n")
}
## Problematic rows: 9389
## Corresponding values: NA
## Remove the problematic row number 9389
cleaned_dataset <- dataset[-9389 ,]
cat("Total rows AFTER remove inconsistency: ", nrow(cleaned_dataset), "\n")
## Total rows AFTER remove inconsistency: 15288
## Compare the dataset before and after data cleaning process
str(cleaned_dataset)
## 'data.frame': 15288 obs. of 23 variables:
## $ FRUITS_VEGGIES : int 3 2 2 3 5 3 4 3 5 4 ...
## $ DAILY_STRESS : int 2 3 3 3 1 2 2 4 3 4 ...
## $ PLACES_VISITED : int 2 4 3 10 3 3 10 5 6 2 ...
## $ CORE_CIRCLE : int 5 3 4 3 3 9 6 3 4 6 ...
## $ SUPPORTING_OTHERS : int 0 8 4 10 10 10 10 5 3 10 ...
## $ SOCIAL_NETWORK : int 5 10 10 7 4 10 10 7 3 10 ...
## $ ACHIEVEMENT : int 2 5 3 2 2 2 3 4 5 0 ...
## $ DONATION : int 0 2 2 5 4 3 5 0 4 4 ...
## $ BMI_RANGE : int 1 2 2 2 2 1 2 1 1 2 ...
## $ TODO_COMPLETED : int 6 5 2 3 5 6 8 8 10 3 ...
## $ FLOW : int 4 2 2 5 0 1 8 2 2 2 ...
## $ DAILY_STEPS : int 5 5 4 5 5 7 7 8 1 3 ...
## $ LIVE_VISION : int 0 5 5 0 0 10 5 10 5 0 ...
## $ SLEEP_HOURS : int 7 8 8 5 7 8 7 6 10 6 ...
## $ LOST_VACATION : int 5 2 10 7 0 0 10 0 0 0 ...
## $ DAILY_SHOUTING : int 5 2 2 5 0 2 0 2 2 0 ...
## $ SUFFICIENT_INCOME : int 1 2 2 1 2 2 2 2 2 1 ...
## $ PERSONAL_AWARDS : int 4 3 4 5 8 10 10 8 10 3 ...
## $ TIME_FOR_PASSION : int 0 2 8 2 1 8 8 2 3 8 ...
## $ WEEKLY_MEDITATION : int 5 6 3 0 5 3 10 2 10 1 ...
## $ AGE : chr "36 to 50" "36 to 50" "36 to 50" "51 or more" ...
## $ GENDER : chr "Female" "Female" "Female" "Female" ...
## $ WORK_LIFE_BALANCE_SCORE: num 610 656 632 623 664 ...
Here is the comparison before and after cleaning the dataset. [Before] 15972 obs. of 24 variables VS [After] 15288 obs. of 23 variables
We added a column named “RANKING” to rank the work life balance using WORK_LIFE_BALANCE_SCORE in our dataset. We also transform 2 categorical variables AGE and GENDER to numerical.
## Plot boxplot to identify the quartile range
wlb <- cleaned_dataset$WORK_LIFE_BALANCE_SCORE
boxplot(wlb, main="Work Life Balance Score")
text(y=boxplot.stats(wlb)$stats, labels=boxplot.stats(wlb)$stats, x=1.25)
Based on the summary of boxplot, the value higher than 698.5 is considered as HIGH, less than 636 is considered as LOW, and the remaining (between 636 to 698.5) will be categorized as MEDIUM.
## Categorize the work-life balance score by adding new column 'RANKING'
cleaned_dataset <- mutate(cleaned_dataset, RANKING=if_else(WORK_LIFE_BALANCE_SCORE>698.5,"High", if_else(WORK_LIFE_BALANCE_SCORE<636,"Low", "Medium")))
## Transform AGE and GENDER categorical variables to numerical
cleaned_dataset <- mutate(cleaned_dataset, AGE_GROUP=if_else(AGE=="Less than 20",1, if_else(AGE=="21 to 35",2, if_else(AGE=="36 to 50",3, 4))))
cleaned_dataset <- mutate(cleaned_dataset, GENDER_GROUP=if_else(GENDER=="Male", 0, 1))
## Let's check the final structure of cleaned dataset
str(cleaned_dataset)
## 'data.frame': 15288 obs. of 26 variables:
## $ FRUITS_VEGGIES : int 3 2 2 3 5 3 4 3 5 4 ...
## $ DAILY_STRESS : int 2 3 3 3 1 2 2 4 3 4 ...
## $ PLACES_VISITED : int 2 4 3 10 3 3 10 5 6 2 ...
## $ CORE_CIRCLE : int 5 3 4 3 3 9 6 3 4 6 ...
## $ SUPPORTING_OTHERS : int 0 8 4 10 10 10 10 5 3 10 ...
## $ SOCIAL_NETWORK : int 5 10 10 7 4 10 10 7 3 10 ...
## $ ACHIEVEMENT : int 2 5 3 2 2 2 3 4 5 0 ...
## $ DONATION : int 0 2 2 5 4 3 5 0 4 4 ...
## $ BMI_RANGE : int 1 2 2 2 2 1 2 1 1 2 ...
## $ TODO_COMPLETED : int 6 5 2 3 5 6 8 8 10 3 ...
## $ FLOW : int 4 2 2 5 0 1 8 2 2 2 ...
## $ DAILY_STEPS : int 5 5 4 5 5 7 7 8 1 3 ...
## $ LIVE_VISION : int 0 5 5 0 0 10 5 10 5 0 ...
## $ SLEEP_HOURS : int 7 8 8 5 7 8 7 6 10 6 ...
## $ LOST_VACATION : int 5 2 10 7 0 0 10 0 0 0 ...
## $ DAILY_SHOUTING : int 5 2 2 5 0 2 0 2 2 0 ...
## $ SUFFICIENT_INCOME : int 1 2 2 1 2 2 2 2 2 1 ...
## $ PERSONAL_AWARDS : int 4 3 4 5 8 10 10 8 10 3 ...
## $ TIME_FOR_PASSION : int 0 2 8 2 1 8 8 2 3 8 ...
## $ WEEKLY_MEDITATION : int 5 6 3 0 5 3 10 2 10 1 ...
## $ AGE : chr "36 to 50" "36 to 50" "36 to 50" "51 or more" ...
## $ GENDER : chr "Female" "Female" "Female" "Female" ...
## $ WORK_LIFE_BALANCE_SCORE: num 610 656 632 623 664 ...
## $ RANKING : chr "Low" "Medium" "Low" "Low" ...
## $ AGE_GROUP : num 3 3 3 4 4 4 4 2 2 4 ...
## $ GENDER_GROUP : num 1 1 1 1 1 1 0 1 1 1 ...
head(cleaned_dataset)
## FRUITS_VEGGIES DAILY_STRESS PLACES_VISITED CORE_CIRCLE SUPPORTING_OTHERS
## 1 3 2 2 5 0
## 2 2 3 4 3 8
## 3 2 3 3 4 4
## 4 3 3 10 3 10
## 5 5 1 3 3 10
## 6 3 2 3 9 10
## SOCIAL_NETWORK ACHIEVEMENT DONATION BMI_RANGE TODO_COMPLETED FLOW DAILY_STEPS
## 1 5 2 0 1 6 4 5
## 2 10 5 2 2 5 2 5
## 3 10 3 2 2 2 2 4
## 4 7 2 5 2 3 5 5
## 5 4 2 4 2 5 0 5
## 6 10 2 3 1 6 1 7
## LIVE_VISION SLEEP_HOURS LOST_VACATION DAILY_SHOUTING SUFFICIENT_INCOME
## 1 0 7 5 5 1
## 2 5 8 2 2 2
## 3 5 8 10 2 2
## 4 0 5 7 5 1
## 5 0 7 0 0 2
## 6 10 8 0 2 2
## PERSONAL_AWARDS TIME_FOR_PASSION WEEKLY_MEDITATION AGE GENDER
## 1 4 0 5 36 to 50 Female
## 2 3 2 6 36 to 50 Female
## 3 4 8 3 36 to 50 Female
## 4 5 2 0 51 or more Female
## 5 8 1 5 51 or more Female
## 6 10 8 3 51 or more Female
## WORK_LIFE_BALANCE_SCORE RANKING AGE_GROUP GENDER_GROUP
## 1 609.5 Low 3 1
## 2 655.6 Medium 3 1
## 3 631.6 Low 3 1
## 4 622.7 Low 4 1
## 5 663.9 Medium 4 1
## 6 722.3 High 4 1
To explore the dataset, we first plot a simple Correlation Matrix Heatmap to understand relationship between two variables in dataset. The melt() function in R (from the reshape2 package) transforms a data frame from wide format to long format.
## Drop columns "AGE", "GENDER", "RANKING", "AGE_GROUP", "GENDER_GROUP"
df1 <- cleaned_dataset[, !(names(cleaned_dataset) %in% c("AGE", "GENDER", "RANKING", "AGE_GROUP", "GENDER_GROUP"))]
cor_matrix <- round(cor(df1),2)
melted_corr_mat <- melt(cor_matrix)
## Plot correlation matrix heatmap
ggplot(data = melted_corr_mat, aes(x = Var1, y = Var2, fill = value, label = value)) +
geom_tile() +
geom_text(color = "black") +
scale_fill_gradient(low = "lightblue", high = "red") +
labs(title = "Correlation Matrix",
x = "Variables",
y = "Variables") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(hjust = 0.5))
## Create a table of correlation between work life balance score VS variables, sort in descending order
cor_with_work_life <- cor_matrix["WORK_LIFE_BALANCE_SCORE", ]
cor_table <- as.data.frame(cor_with_work_life)
colnames(cor_table) <- "WORK_LIFE_BALANCE_SCORE"
cor_table <- arrange(cor_table, desc(WORK_LIFE_BALANCE_SCORE))
cor_table <- cor_table[-1, , drop = FALSE]
print(cor_table)
## WORK_LIFE_BALANCE_SCORE
## ACHIEVEMENT 0.56
## SUPPORTING_OTHERS 0.55
## TODO_COMPLETED 0.54
## PLACES_VISITED 0.53
## TIME_FOR_PASSION 0.52
## CORE_CIRCLE 0.51
## PERSONAL_AWARDS 0.50
## FLOW 0.48
## LIVE_VISION 0.48
## DONATION 0.46
## FRUITS_VEGGIES 0.45
## DAILY_STEPS 0.42
## WEEKLY_MEDITATION 0.42
## SOCIAL_NETWORK 0.41
## SUFFICIENT_INCOME 0.40
## SLEEP_HOURS 0.20
## BMI_RANGE -0.25
## LOST_VACATION -0.26
## DAILY_SHOUTING -0.27
## DAILY_STRESS -0.37
## Plot a bar chart to visualize the correlation score
ggplot(cor_table, aes(x = reorder(rownames(cor_table), -WORK_LIFE_BALANCE_SCORE), y = WORK_LIFE_BALANCE_SCORE)) +
geom_bar(stat = "identity", fill = "pink") +
geom_text(aes(label = WORK_LIFE_BALANCE_SCORE), vjust = -0.5) +
labs(title = "Work Life Balance Scores Correlation", x = "Factors", y = "Correlation Score") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Based on Correlation matrix, ACHIEVEMENT, SUPPORTING_OTHERS and TODO_COMPLETED have the slightly stronger positive correlations with work-life balance, while DAILY_STRESS, DAILY_SHOUTING, and LOST_VACATION have moderate negative correlations with work-life balance.
## Plot bar chart to check the distribution of survey collected by Gender
df2 <- cleaned_dataset
ggplot(df2, aes(x = GENDER, fill = GENDER)) +
geom_bar() +
ggtitle("Number of Survey by Gender") +
geom_text(stat = "count", aes(label = after_stat(count), vjust = -0.5)) +
theme(plot.title = element_text(hjust = 0.5))
## Plot bar chart to check the distribution of survey collected by Age Group
df2$AGE <- factor(df2$AGE, levels = c("Less than 20","21 to 35","36 to 50","51 or more"))
ggplot(df2, aes(x = AGE, fill = AGE)) +
geom_bar() +
ggtitle("Number of Survey by Age Group") +
geom_text(stat = "count", aes(label = after_stat(count), vjust = -0.5)) +
theme(plot.title = element_text(hjust = 0.5))
As observed from the bar chart, there are more female than male contributed to this survey. Among these participants, there are most of them aged between 21-50 years old, which reflects the working age population.
## Plot Work-Life Balance Grade By Gender
df2$RANKING <- factor(df2$RANKING, levels = c("Low", "Medium", "High"))
ggplot(df2, aes(x = RANKING, fill = GENDER)) +
geom_bar(position = "dodge") +
scale_x_discrete(labels = c("High"="High(Score>698.5)", "Medium"="Medium(636<=Score<=698.5)", "Low"="Low(Score<636)")) +
geom_text(stat = "count", aes(label = after_stat(count), group = GENDER), position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
labs(x = "Work Life Balance Grade", y = "Count", title = "Work Life Balance Grade By Gender") +
theme(plot.title = element_text(hjust = 0.5))
Based on the bar chart above, most of the work-life balance grades fall into the medium category, while the counts for low and high grades are nearly equal for both male and female genders.
## Plot Work-Life Balance Grade By Age Group
df2$AGE_GROUP <- factor(df2$AGE_GROUP)
ggplot(df2, aes(x = RANKING, fill = AGE_GROUP)) +
geom_bar() +
labs(x = "Work Life Balance Grade", y = "Count", title = "Work Life Balance Grade by Age Group") +
scale_x_discrete(labels = c("High" = "High(Score>698.5)", "Medium" = "Medium(636<=Score<=698.5)", "Low" = "Low(Score<636)")) +
scale_fill_discrete(labels = c("18-25", "26-35", "36-50", "51 or more")) +
theme(plot.title = element_text(hjust = 0.5))
Based on the bar chart above, observations can conclude that for each work-life balance grade (low, medium, and high), each grade contains four different age groups. There is no significant indication that a certain age group only appears in a specific grade.
The top six highest correlations between factors and work-life balance scores are as follows: Achievement (0.56), followed by Supporting Others (0.55), Todo Completed (0.54), Places Visited (0.53), Time for Passion (0.52), Core Circle (0.51), and Personal Awards (0.5). The lowest negative correlation with work-life balance score is Daily Stress (-0.37). The corresponding graphs are plotted below:
## Convert integer to factor
df2 <- df2 %>% mutate_if(is.integer, as.factor)
## Create a function to plot Box-plot of Work-Life Balance Score vs input variable
plot_boxplot_line <- function(data, x_variable) {
ggplot(data, aes_string(x = x_variable, y = "WORK_LIFE_BALANCE_SCORE")) +
geom_boxplot(fill = "lightblue", color = "blue") +
labs(title = paste("Box-plot of Work-Life Balance Score vs", x_variable),
x = x_variable,
y = "Work-Life Balance Score") +
theme_minimal() +
stat_summary(aes(shape = "Mean"), fun = mean, geom = "point", size = 3, color = "red") +
geom_line(data = data %>% group_by(!!sym(x_variable)) %>% summarize(mean_score = mean(WORK_LIFE_BALANCE_SCORE)),
aes_string(x = x_variable, y = "mean_score", group = "1"),
color = "red", linetype = "dashed") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_shape_manual(name = "", values = 11, labels = "Mean")
}
plot_boxplot_line(df2, "ACHIEVEMENT")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Based on the above box plot, observations can conclude that as the number of achievements increases, both the median and mean of the work-life balance score also increase.
## Create a Box-plot of Work-Life Balance Score vs Supporting Others: Correlation 0.55
plot_boxplot_line(df2, "SUPPORTING_OTHERS")
Based on the above box plot, observations can conclude that as the number of supporting others increases, both the median and mean of the work-life balance score also increase.
# Create a Box-plot of Work-Life Balance Score vs TODO Completed: Correlation 0.54
plot_boxplot_line(df2, "TODO_COMPLETED")
Based on the above box plot, observations can conclude that as the number of Todo Completed increases, both the median and mean of the work-life balance score also increase.
# Create a Box-plot of Work-Life Balance Score vs Places Visited: Correlation 0.53
plot_boxplot_line(df2, "PLACES_VISITED")
Based on the above box plot, observations can conclude that as the number of places visited increases, both the median and mean of the work-life balance score also increase.
# Create a Box-plot of Work-Life Balance Score vs Time For Passion: Correlation 0.52
plot_boxplot_line(df2, "TIME_FOR_PASSION")
Based on the above box plot, observations can conclude that as the number of Time for Passion increases, both the median and mean of the work-life balance score also increase.
# Create a Box-plot of Work-Life Balance Score vs Core Circle: Correlation 0.51
plot_boxplot_line(df2, "CORE_CIRCLE")
Based on the above box plot, observations can conclude that as the number of Core Circle increases, both the median and mean of the work-life balance score also increase.
# Create a Box-plot of Work-Life Balance Score vs Personal Awards: Correlation 0.50
plot_boxplot_line(df2, "PERSONAL_AWARDS")
Based on the above box plot, observations can conclude that as the number of Personal Awards increases, both the median and mean of the work-life balance score also increase.
## Create a Box-plot of Work-Life Balance Score vs Daily Stress: Correlation -0.37
ggplot(df2, aes(x = DAILY_STRESS, y = WORK_LIFE_BALANCE_SCORE)) +
geom_boxplot(fill = "pink", color = "red") +
labs(title = "Box-plot of Work-Life Balance Score vs Daily Stress",
x = "Daily Stress",
y = "Work-Life Balance Score") +
theme_minimal() +
stat_summary(aes(shape = "Mean"), fun = mean, geom = "point", size = 3, color = "blue") +
geom_line(data = df2 %>% group_by(DAILY_STRESS) %>% summarize(mean_score = mean(WORK_LIFE_BALANCE_SCORE)),
aes(x = DAILY_STRESS, y = mean_score, group = 1),
color = "blue", linetype = "dashed") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_shape_manual(name = "", values = 11, labels = "Mean")
Based on the above box plot, observations can conclude that as the number of Daily stress increases, both the median and mean of the work-life balance score decrease.
The data is going to be split into 2 portion, 80% for training and 20% for testing the model. To be fair for each regression model evaluation, set.seed function is used to ensure consistent data input for model training and model testing.
## Prepare dataset for modelling
df3 <- cleaned_dataset[, !colnames(cleaned_dataset) %in% c("AGE", "GENDER","RANKING")]
set.seed(123) # For reproducibility
## Split the data into training (80%) and testing (20%) sets
train_index <- createDataPartition(df3$WORK_LIFE_BALANCE_SCORE, p = 0.8, list = FALSE)
train_data <- df3[train_index, ]
test_data <- df3[-train_index, ]
glimpse(df3)
## Rows: 15,288
## Columns: 23
## $ FRUITS_VEGGIES <int> 3, 2, 2, 3, 5, 3, 4, 3, 5, 4, 2, 1, 2, 5, 3, 3…
## $ DAILY_STRESS <int> 2, 3, 3, 3, 1, 2, 2, 4, 3, 4, 4, 4, 2, 3, 3, 3…
## $ PLACES_VISITED <int> 2, 4, 3, 10, 3, 3, 10, 5, 6, 2, 7, 3, 10, 0, 2…
## $ CORE_CIRCLE <int> 5, 3, 4, 3, 3, 9, 6, 3, 4, 6, 7, 8, 5, 3, 3, 4…
## $ SUPPORTING_OTHERS <int> 0, 8, 4, 10, 10, 10, 10, 5, 3, 10, 1, 0, 2, 6,…
## $ SOCIAL_NETWORK <int> 5, 10, 10, 7, 4, 10, 10, 7, 3, 10, 1, 2, 8, 6,…
## $ ACHIEVEMENT <int> 2, 5, 3, 2, 2, 2, 3, 4, 5, 0, 3, 1, 3, 6, 3, 2…
## $ DONATION <int> 0, 2, 2, 5, 4, 3, 5, 0, 4, 4, 0, 0, 4, 3, 2, 5…
## $ BMI_RANGE <int> 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1…
## $ TODO_COMPLETED <int> 6, 5, 2, 3, 5, 6, 8, 8, 10, 3, 2, 2, 7, 8, 8, …
## $ FLOW <int> 4, 2, 2, 5, 0, 1, 8, 2, 2, 2, 2, 1, 1, 7, 0, 2…
## $ DAILY_STEPS <int> 5, 5, 4, 5, 5, 7, 7, 8, 1, 3, 7, 8, 6, 7, 2, 8…
## $ LIVE_VISION <int> 0, 5, 5, 0, 0, 10, 5, 10, 5, 0, 4, 2, 10, 2, 5…
## $ SLEEP_HOURS <int> 7, 8, 8, 5, 7, 8, 7, 6, 10, 6, 8, 7, 8, 6, 8, …
## $ LOST_VACATION <int> 5, 2, 10, 7, 0, 0, 10, 0, 0, 0, 7, 7, 0, 0, 3,…
## $ DAILY_SHOUTING <int> 5, 2, 2, 5, 0, 2, 0, 2, 2, 0, 3, 1, 0, 3, 2, 2…
## $ SUFFICIENT_INCOME <int> 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2…
## $ PERSONAL_AWARDS <int> 4, 3, 4, 5, 8, 10, 10, 8, 10, 3, 3, 4, 5, 5, 4…
## $ TIME_FOR_PASSION <int> 0, 2, 8, 2, 1, 8, 8, 2, 3, 8, 0, 1, 2, 6, 5, 3…
## $ WEEKLY_MEDITATION <int> 5, 6, 3, 0, 5, 3, 10, 2, 10, 1, 6, 7, 7, 5, 4,…
## $ WORK_LIFE_BALANCE_SCORE <dbl> 609.5, 655.6, 631.6, 622.7, 663.9, 722.3, 727.…
## $ AGE_GROUP <dbl> 3, 3, 3, 4, 4, 4, 4, 2, 2, 4, 3, 2, 2, 3, 2, 4…
## $ GENDER_GROUP <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1…
## Build Multilinear Regression model
model_LR <- lm(WORK_LIFE_BALANCE_SCORE ~ .-1, data = train_data)
predictions_LR <- predict(model_LR, newdata = test_data)
## Evaluation metrics
LR_metrics <- data.frame(
R2 = R2(predictions_LR, test_data$WORK_LIFE_BALANCE_SCORE),
RMSE = RMSE(predictions_LR, test_data$WORK_LIFE_BALANCE_SCORE),
MAE = MAE(predictions_LR, test_data$WORK_LIFE_BALANCE_SCORE)
)
print(LR_metrics)
## R2 RMSE MAE
## 1 0.545125 58.01379 45.69006
## Plot Regression results
print(lares::mplot_lineal(tag = test_data$WORK_LIFE_BALANCE_SCORE,
score = predictions_LR,
subtitle = "Multilinear Regression Model",
model_name = "model_LR"))
This chart represents predicted value against real value from
multi-linear regression.
The plot shows quite a big portion of the prediction is deviated from
real value.
The result shows adjusted R square = 0.545, RMSE = 58.01 & MAE =
45.69, which is not a satisfying results.
## Density plot
print(lares::mplot_density(tag = test_data$WORK_LIFE_BALANCE_SCORE,
score = predictions_LR,
subtitle = "Multilinear Regression Model",
model_name = "model_LR"))
The distribution plot indicates the distribution of both model
predicted and real work life balance score.
The model prediction has wider range compared to real distribution.
## Split and compare quantiles
print(lares::mplot_splits(tag = test_data$WORK_LIFE_BALANCE_SCORE,
score = predictions_LR,
split = 5))
The predicted and real results are split into 5 groups.
The predict score for each group are 59 8%, 32.08%, 31.26%, 32.9% and
61.37% respectively.
## Build SVM Regression model
model_SVM <- svm(WORK_LIFE_BALANCE_SCORE ~ .-1, data = train_data, kernel = "linear")
predictions_SVM <- predict(model_SVM, newdata = test_data)
## Evaluation metrics
SVM_metrics <- data.frame(
R2 = R2(predictions_SVM, test_data$WORK_LIFE_BALANCE_SCORE),
RMSE = RMSE(predictions_SVM, test_data$WORK_LIFE_BALANCE_SCORE),
MAE = MAE(predictions_SVM, test_data$WORK_LIFE_BALANCE_SCORE)
)
print(SVM_metrics)
## R2 RMSE MAE
## 1 0.9957653 3.362268 2.961172
## Plot Regression results
print(lares::mplot_lineal(tag = test_data$WORK_LIFE_BALANCE_SCORE,
score = predictions_SVM,
subtitle = "SVM Regression Model",
model_name = "model_SVM"))
This chart represents predicted value against real value from SVM
regression.
The plot shows the prediction and real values are pretty well-aligned
and fitted.
The result shows adjusted R square = 0.9958, RMSE = 3.362 & MAE =
2.961, which is a very good prediction results.
## Density plot
print(lares::mplot_density(tag = test_data$WORK_LIFE_BALANCE_SCORE,
score = predictions_SVM,
subtitle = "SVM Regression Model",
model_name = "model_SVM"))
The distribution plot indicates the distribution of both model
predicted and real work life balance score.
We can see the prediction and real value are very well overlapped
compared to multi-linear regression model.
## Split and compare quantiles
print(lares::mplot_splits(tag = test_data$WORK_LIFE_BALANCE_SCORE,
score = predictions_SVM,
split = 5))
The predicted and real results are split into 5 groups.
The predict score for each group are 96.73%, 91.33%, 88.71%, 89.53% and
95.42% respectively.
## Build Random Forest model
model_RF <- randomForest(WORK_LIFE_BALANCE_SCORE ~ .-1, data = train_data)
predictions_RF <- predict(model_RF, newdata = test_data)
## Evaluation metrics
RF_metrics <- data.frame(
R2 = R2(predictions_RF, test_data$WORK_LIFE_BALANCE_SCORE),
RMSE = RMSE(predictions_RF, test_data$WORK_LIFE_BALANCE_SCORE),
MAE = MAE(predictions_RF, test_data$WORK_LIFE_BALANCE_SCORE)
)
print(RF_metrics)
## R2 RMSE MAE
## 1 0.9572659 11.23245 9.047555
## Plot Regression results
print(lares::mplot_lineal(tag = test_data$WORK_LIFE_BALANCE_SCORE,
score = predictions_RF,
subtitle = "Random Forest Regression Model",
model_name = "model_RF"))
This chart represents predicted value against real value from random
forest regression.
The plot shows the predictions are slightly deviated from real
value.
The result shows adjusted R square = 0.9573, RMSE = 11.23 & MAE =
9.048, which is a good prediction results.
## Density plot
print(lares::mplot_density(tag = test_data$WORK_LIFE_BALANCE_SCORE,
score = predictions_RF,
subtitle = "Random Forest Regression Model",
model_name = "model_RF"))
The distribution plot indicates the distribution of both model
predicted and real work life balance score.
We can see the prediction and real value distributions are pretty much
aligned, the only downside for the model would be there are more
predictions in range of 650 to 700 compared to real value.
## Split and compare quantiles
print(lares::mplot_splits(tag = test_data$WORK_LIFE_BALANCE_SCORE,
score = predictions_RF,
split = 5))
The predicted and real results are split into 5 groups.
The predict score for each group are 88.07%, 73.49%, 71.21%, 70.7% and
89.85% respectively.
## Separate predictors and response variable
response_col <- "WORK_LIFE_BALANCE_SCORE"
## Training data
train_response <- train_data[[response_col]]
train_matrix <- as.matrix(train_data[, !names(train_data) %in% response_col])
## Testing data
test_response <- test_data[[response_col]]
test_matrix <- as.matrix(test_data[, !names(test_data) %in% response_col])
model_XG <- xgboost(data = train_matrix, label = train_response, objective = "reg:squarederror", nrounds = 100)
predictions_XG <- predict(model_XG, test_matrix)
## Evaluation metrics
XG_metrics <- data.frame(
R2 = R2(predictions_XG, test_data$WORK_LIFE_BALANCE_SCORE),
RMSE = RMSE(predictions_XG, test_data$WORK_LIFE_BALANCE_SCORE),
MAE = MAE(predictions_XG, test_data$WORK_LIFE_BALANCE_SCORE)
)
print(XG_metrics)
## R2 RMSE MAE
## 1 0.9809891 6.231552 4.909307
## Plot Regression results
print(lares::mplot_lineal(tag = test_data$WORK_LIFE_BALANCE_SCORE,
score = predictions_XG,
subtitle = "XGBoost Regression Model",
model_name = "model_XG"))
This chart represents predicted value against real value from random
forest regression.
The plot shows the predictions are slightly deviated from real
value.
The result shows adjusted R square = 0.981, RMSE = 6.232 & MAE =
4.909, which is a good prediction results.
## Density plot
print(lares::mplot_density(tag = test_data$WORK_LIFE_BALANCE_SCORE,
score = predictions_XG,
subtitle = "XGBoost Regression Model",
model_name = "model_XG"))
The distribution plot indicates the distribution of both model
predicted and real work life balance score.
We can see the prediction and real value distributions are pretty much
aligned.
The deviation from model is distributed across the continuous values,
which is more well-distributed compared to random forest model.
## Split and compare quantiles
print(lares::mplot_splits(tag = test_data$WORK_LIFE_BALANCE_SCORE,
score = predictions_XG,
split = 5))
The predicted and real results are split into 5 groups.
The predict score for each group are 91.99%, 82.98%, 81.01%, 82.82% and
92.8% respectively.
Based on evaluation on several conventional regression models above, we can summarise the performance of models by their R-squared value as following:
## Model Assessment
regression_assessment <- data.frame(
R2_LR = R2(predictions_LR, test_data$WORK_LIFE_BALANCE_SCORE),
R2_SVM = R2(predictions_SVM, test_data$WORK_LIFE_BALANCE_SCORE),
R2_RF = R2(predictions_RF, test_data$WORK_LIFE_BALANCE_SCORE),
R2_XG = R2(predictions_XG, test_data$WORK_LIFE_BALANCE_SCORE)
)
print(regression_assessment)
## R2_LR R2_SVM R2_RF R2_XG
## 1 0.545125 0.9957653 0.9572659 0.9809891
SVM has highest R-squared score of 0.9958 while the worst performer linear regression only has 0.5451. Random forest and XGBoost results are close to SVM model.
A few reasons linear regression are not performing well could be:
In contrast, SVMs are capable of capturing complex and non-linear relationships between variables. It is also relatively robust to outliers due to its usage of support vectors, which focus on the most critical instances for decision boundary construction.
Based on best regression model (SVM), let’s have an overall view on each independents variables’ correlation with dependent variables.
## Sort coefficients and names in descending order
coefficients_SVM <- coef(model_SVM)[-1] # Exclude intercept
coeff_names_SVM <- names(coefficients_SVM)
sorted_indices_SVM <- order((coefficients_SVM), decreasing = TRUE)
sorted_coefficients_SVM <- coefficients_SVM[sorted_indices_SVM]
sorted_names_SVM <- coeff_names_SVM[sorted_indices_SVM]
data <- data.frame(Predictor = sorted_names_SVM, Coefficient = sorted_coefficients_SVM)
## Create a horizontal bar plot with data labels using ggplot2
ggplot(data, aes(x = Coefficient, y = reorder(Predictor, Coefficient))) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
geom_text(aes(label = round(Coefficient, digits = 2)), hjust = -0.1, size = 3) + # Add data labels
labs(title = "Coefficients of Predictors (Descending Order)",
x = "Coefficients", y = "Predictors") +
theme_minimal() +
theme(axis.text.y = element_text(hjust = 0), # Aligns y-axis labels to the left
axis.title.y = element_blank()) # Removes the y-axis title
From the bar chart, the highest positive coefficients towards work life balance is donation (how many times of donating time or money to good causes), which is quite a surprising factor. The second highest factor is sufficient income which a reasonable factor to most of the individuals, followed by number of places visited.
On the other hand, lost vacation (total days of vacation typically lose every year) gives highest negative coefficient towards work life balance, followed by BMI range and daily stress.
Same as Regression modelling, the dataset is split into 2 portion, 80% for training and 20% for testing the model.
## Prepare dataset for modelling
df4 <- cleaned_dataset[, !colnames(cleaned_dataset) %in% c("AGE", "GENDER", "WORK_LIFE_BALANCE_SCORE")]
df4$RANKING <- as.factor(df4$RANKING)
set.seed(456) # For reproducibility
## Split the data into training (80%) and testing (20%) sets
index_train <- createDataPartition(df4$RANKING, p = 0.8, list = FALSE)
data_train <- df4[index_train, ]
data_test <- df4[-index_train, ]
glimpse(df4)
## Rows: 15,288
## Columns: 23
## $ FRUITS_VEGGIES <int> 3, 2, 2, 3, 5, 3, 4, 3, 5, 4, 2, 1, 2, 5, 3, 3, 4, 4…
## $ DAILY_STRESS <int> 2, 3, 3, 3, 1, 2, 2, 4, 3, 4, 4, 4, 2, 3, 3, 3, 2, 2…
## $ PLACES_VISITED <int> 2, 4, 3, 10, 3, 3, 10, 5, 6, 2, 7, 3, 10, 0, 2, 5, 1…
## $ CORE_CIRCLE <int> 5, 3, 4, 3, 3, 9, 6, 3, 4, 6, 7, 8, 5, 3, 3, 4, 4, 1…
## $ SUPPORTING_OTHERS <int> 0, 8, 4, 10, 10, 10, 10, 5, 3, 10, 1, 0, 2, 6, 7, 10…
## $ SOCIAL_NETWORK <int> 5, 10, 10, 7, 4, 10, 10, 7, 3, 10, 1, 2, 8, 6, 5, 10…
## $ ACHIEVEMENT <int> 2, 5, 3, 2, 2, 2, 3, 4, 5, 0, 3, 1, 3, 6, 3, 2, 4, 0…
## $ DONATION <int> 0, 2, 2, 5, 4, 3, 5, 0, 4, 4, 0, 0, 4, 3, 2, 5, 0, 1…
## $ BMI_RANGE <int> 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1…
## $ TODO_COMPLETED <int> 6, 5, 2, 3, 5, 6, 8, 8, 10, 3, 2, 2, 7, 8, 8, 7, 8, …
## $ FLOW <int> 4, 2, 2, 5, 0, 1, 8, 2, 2, 2, 2, 1, 1, 7, 0, 2, 2, 1…
## $ DAILY_STEPS <int> 5, 5, 4, 5, 5, 7, 7, 8, 1, 3, 7, 8, 6, 7, 2, 8, 1, 1…
## $ LIVE_VISION <int> 0, 5, 5, 0, 0, 10, 5, 10, 5, 0, 4, 2, 10, 2, 5, 10, …
## $ SLEEP_HOURS <int> 7, 8, 8, 5, 7, 8, 7, 6, 10, 6, 8, 7, 8, 6, 8, 6, 8, …
## $ LOST_VACATION <int> 5, 2, 10, 7, 0, 0, 10, 0, 0, 0, 7, 7, 0, 0, 3, 1, 1,…
## $ DAILY_SHOUTING <int> 5, 2, 2, 5, 0, 2, 0, 2, 2, 0, 3, 1, 0, 3, 2, 2, 1, 3…
## $ SUFFICIENT_INCOME <int> 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2…
## $ PERSONAL_AWARDS <int> 4, 3, 4, 5, 8, 10, 10, 8, 10, 3, 3, 4, 5, 5, 4, 8, 3…
## $ TIME_FOR_PASSION <int> 0, 2, 8, 2, 1, 8, 8, 2, 3, 8, 0, 1, 2, 6, 5, 3, 3, 8…
## $ WEEKLY_MEDITATION <int> 5, 6, 3, 0, 5, 3, 10, 2, 10, 1, 6, 7, 7, 5, 4, 10, 3…
## $ RANKING <fct> Low, Medium, Low, Low, Medium, High, High, Medium, H…
## $ AGE_GROUP <dbl> 3, 3, 3, 4, 4, 4, 4, 2, 2, 4, 3, 2, 2, 3, 2, 4, 2, 2…
## $ GENDER_GROUP <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0…
## Setup k-fold cross validation
tc <- trainControl(
method = "repeatedcv",
number = 10,
classProbs = FALSE,
savePredictions = TRUE,
repeats = 3,
summaryFunction = multiClassSummary
)
## Build K-Nearest Neighbors model with tuning
knn_model <- train(
RANKING ~ .,
data = data_train,
method = "knn",
trControl = tc,
preProcess = c("center", "scale"),
metric = 'Accuracy',
tuneLength = 20,
tuneGrid = data.frame(k = seq(1, 20, by = 1)) # Specify the range of k values
)
print(knn_model)
## k-Nearest Neighbors
##
## 12232 samples
## 22 predictor
## 3 classes: 'High', 'Low', 'Medium'
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 11007, 11008, 11009, 11010, 11009, 11009, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa Mean_F1 Mean_Sensitivity Mean_Specificity
## 1 0.8235244 0.7177429 0.8235894 0.8242233 0.9021540
## 2 0.8143115 0.7029464 0.8143160 0.8148604 0.8969825
## 3 0.8612934 0.7771753 0.8607725 0.8583294 0.9219276
## 4 0.8578315 0.7714129 0.8572115 0.8541241 0.9197498
## 5 0.8800147 0.8066136 0.8792090 0.8743829 0.9314365
## 6 0.8770981 0.8016310 0.8761079 0.8702500 0.9294096
## 7 0.8892249 0.8210775 0.8882656 0.8818093 0.9359602
## 8 0.8876179 0.8184312 0.8866256 0.8799525 0.9349847
## 9 0.8969633 0.8333261 0.8959173 0.8883296 0.9398543
## 10 0.8947289 0.8296395 0.8936275 0.8857298 0.9384907
## 11 0.9004787 0.8387582 0.8993109 0.8905798 0.9413854
## 12 0.8993352 0.8368552 0.8981309 0.8891431 0.9406520
## 13 0.9048392 0.8457116 0.9036527 0.8943588 0.9436143
## 14 0.9038847 0.8440225 0.9026005 0.8926458 0.9428309
## 15 0.9064739 0.8481289 0.9051462 0.8947339 0.9441028
## 16 0.9066105 0.8483478 0.9052829 0.8948800 0.9441817
## 17 0.9085717 0.8514150 0.9071639 0.8961856 0.9450526
## 18 0.9069914 0.8488509 0.9055806 0.8946232 0.9441802
## 19 0.9101804 0.8540045 0.9087586 0.8976420 0.9458956
## 20 0.9106167 0.8547462 0.9092199 0.8983160 0.9462173
## Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision Mean_Recall
## 0.8232843 0.9019230 0.8232843 0.8242233
## 0.8141732 0.8968210 0.8141732 0.8148604
## 0.8637197 0.9233913 0.8637197 0.8583294
## 0.8609374 0.9215523 0.8609374 0.8541241
## 0.8848478 0.9343739 0.8848478 0.8743829
## 0.8830691 0.9330042 0.8830691 0.8702500
## 0.8959117 0.9400003 0.8959117 0.8818093
## 0.8946201 0.9391623 0.8946201 0.8799525
## 0.9050919 0.9447375 0.9050919 0.8883296
## 0.9032628 0.9435659 0.9032628 0.8857298
## 0.9101003 0.9471132 0.9101003 0.8905798
## 0.9092432 0.9465424 0.9092432 0.8891431
## 0.9151852 0.9497926 0.9151852 0.8943588
## 0.9150881 0.9494888 0.9150881 0.8926458
## 0.9182996 0.9511479 0.9182996 0.8947339
## 0.9184585 0.9512394 0.9184585 0.8948800
## 0.9212849 0.9526082 0.9212849 0.8961856
## 0.9196796 0.9516847 0.9196796 0.8946232
## 0.9231007 0.9535938 0.9231007 0.8976420
## 0.9232917 0.9537601 0.9232917 0.8983160
## Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.2745081 0.8631886
## 0.2714372 0.8559214
## 0.2870978 0.8901285
## 0.2859438 0.8869369
## 0.2933382 0.9029097
## 0.2923660 0.8998298
## 0.2964083 0.9088848
## 0.2958726 0.9074686
## 0.2989878 0.9140919
## 0.2982430 0.9121102
## 0.3001596 0.9159826
## 0.2997784 0.9148975
## 0.3016131 0.9189865
## 0.3012949 0.9177383
## 0.3021580 0.9194183
## 0.3022035 0.9195309
## 0.3028572 0.9206191
## 0.3023305 0.9194017
## 0.3033935 0.9217688
## 0.3035389 0.9222667
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 20.
## Predict Work Life Balance Rank with K-Nearest Neighbours model using test data
knn_pred <- predict(knn_model, data_test)
## Plot Confusion Matrix
knn_confusionmatrix <- confusionMatrix(knn_pred, data_test$RANKING)
print(knn_confusionmatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low Medium
## High 687 0 42
## Low 0 657 24
## Medium 75 106 1465
##
## Overall Statistics
##
## Accuracy : 0.9192
## 95% CI : (0.9089, 0.9286)
## No Information Rate : 0.501
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8686
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: High Class: Low Class: Medium
## Sensitivity 0.9016 0.8611 0.9569
## Specificity 0.9817 0.9895 0.8813
## Pos Pred Value 0.9424 0.9648 0.8900
## Neg Pred Value 0.9678 0.9554 0.9532
## Prevalence 0.2493 0.2497 0.5010
## Detection Rate 0.2248 0.2150 0.4794
## Detection Prevalence 0.2385 0.2228 0.5386
## Balanced Accuracy 0.9416 0.9253 0.9191
knn_cfm <- as_tibble(knn_confusionmatrix$table)
print(plot_confusion_matrix(knn_cfm,
target_col = "Prediction",
prediction_col = "Reference",
counts_col = "n"))
## Calculate Feature Importance of K-Nearest Neighbors Model
knn_importance <- varImp(knn_model)
## Create box plot for variable importance
gg.knn_feat_importance <- ggplot(data = knn_importance, mapping = aes(x = knn_importance[,1])) +
geom_boxplot() +
labs(title = "Variable importance: K-Nearest Neighbors ") +
theme_minimal()
plot(gg.knn_feat_importance)
KNN model achieved 91.92% accuracy and 86.86% Kappa. From the feature importance, we can see that Achievement contributed the most to KNN model.
## Build NaiveBayes model
nb_model <- train(
RANKING ~ .,
data_train,
method="naive_bayes",
preProcess = c("center","scale"),
metric='Accuracy',
trControl=tc)
print(nb_model)
## Naive Bayes
##
## 12232 samples
## 22 predictor
## 3 classes: 'High', 'Low', 'Medium'
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 11009, 11009, 11009, 11009, 11008, 11009, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa Mean_F1 Mean_Sensitivity
## FALSE 0.8773159 0.8020360 0.8764003 0.8707402
## TRUE 0.8452139 0.7419982 0.8369096 0.8123877
## Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision
## 0.9296211 0.8831756 0.9330678 0.8831756
## 0.9030140 0.8837514 0.9241576 0.8837514
## Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.8707402 0.2924386 0.9001806
## 0.8123877 0.2817380 0.8577009
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = FALSE
## and adjust = 1.
## Predict Work Life Balance Rank with Naive Bayes model using test data
nb_pred <- predict(nb_model, data_test)
## Plot Confusion Matrix
nb_confusionmatrix <- confusionMatrix(nb_pred, data_test$RANKING)
print(nb_confusionmatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low Medium
## High 637 0 58
## Low 0 659 97
## Medium 125 104 1376
##
## Overall Statistics
##
## Accuracy : 0.8743
## 95% CI : (0.8621, 0.8859)
## No Information Rate : 0.501
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7968
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: High Class: Low Class: Medium
## Sensitivity 0.8360 0.8637 0.8988
## Specificity 0.9747 0.9577 0.8498
## Pos Pred Value 0.9165 0.8717 0.8573
## Neg Pred Value 0.9471 0.9548 0.8932
## Prevalence 0.2493 0.2497 0.5010
## Detection Rate 0.2084 0.2156 0.4503
## Detection Prevalence 0.2274 0.2474 0.5252
## Balanced Accuracy 0.9053 0.9107 0.8743
nb_cfm <- as_tibble(nb_confusionmatrix$table)
print(plot_confusion_matrix(nb_cfm,
target_col = "Prediction",
prediction_col = "Reference",
counts_col = "n"))
## Calculate Feature Importance of Naive Bayes Model
nb_importance <- varImp(nb_model)
## Create box plot for variable importance
gg.nb_feat_importance <- ggplot(data = nb_importance, mapping = aes(x = nb_importance[,1])) +
geom_boxplot() +
labs(title = "Variable importance: Naive Bayes") +
theme_minimal()
plot(gg.nb_feat_importance)
Naive Bayes model achieved 87.43% accuracy and 79.68% Kappa. From the feature importance, we can see that Achievement contributed the most to KNN model.
## Create the Model without Cross Validation with the help of Boosting Function
ada_model <- boosting(RANKING~., data=data_train, boos=TRUE, mfinal=50)
## Predict Work Life Balance Rank with AdaBoost model using test data
ada_pred <- predict(ada_model , newdata = data_test)
## Plot Confusion Matrix
ada_confusionmatrix <- confusionMatrix(as.factor(ada_pred$class), data_test$RANKING)
print(ada_confusionmatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Low Medium
## High 563 0 28
## Low 0 580 37
## Medium 199 183 1466
##
## Overall Statistics
##
## Accuracy : 0.8537
## 95% CI : (0.8407, 0.8661)
## No Information Rate : 0.501
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7556
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: High Class: Low Class: Medium
## Sensitivity 0.7388 0.7602 0.9575
## Specificity 0.9878 0.9839 0.7495
## Pos Pred Value 0.9526 0.9400 0.7933
## Neg Pred Value 0.9193 0.9250 0.9462
## Prevalence 0.2493 0.2497 0.5010
## Detection Rate 0.1842 0.1898 0.4797
## Detection Prevalence 0.1934 0.2019 0.6047
## Balanced Accuracy 0.8633 0.8720 0.8535
ada_cfm <- as_tibble(ada_confusionmatrix$table)
print(plot_confusion_matrix(ada_cfm,
target_col = "Prediction",
prediction_col = "Reference",
counts_col = "n"))
## Calculate Feature Importance of AdaBoost Model
## Plot Feature Importance
print(importanceplot(ada_model, horiz=TRUE))
## [,1]
## [1,] 0.7
## [2,] 1.9
## [3,] 3.1
## [4,] 4.3
## [5,] 5.5
## [6,] 6.7
## [7,] 7.9
## [8,] 9.1
## [9,] 10.3
## [10,] 11.5
## [11,] 12.7
## [12,] 13.9
## [13,] 15.1
## [14,] 16.3
## [15,] 17.5
## [16,] 18.7
## [17,] 19.9
## [18,] 21.1
## [19,] 22.3
## [20,] 23.5
## [21,] 24.7
## [22,] 25.9
AdaBoost model achieved 85.37% accuracy and 75.56% Kappa. From the feature importance, we can see that SUPPORTING_OTHERS contributed the most to AdaBoost model.
## Model evaluation
classification_evaluation <- data.frame(
K_Nearest_Neighbours= knn_confusionmatrix$overall[1],
AdaBoost = ada_confusionmatrix$overall[1],
Naive_Bayes= nb_confusionmatrix$overall[1]
)
print(classification_evaluation)
## K_Nearest_Neighbours AdaBoost Naive_Bayes
## Accuracy 0.9191754 0.8537304 0.8743455
K-Nearest Neighbors classification model is the best model among the 3 models for predicting Work-Life Balance Rank as it achieved the highest model accuracy of 91.92%.
The overall survey was conducted with was conducted with a higher participation of female than male. This gender imbalance means we need to be careful when interpreting the results since gender-related differences might be influencing how people assess their work-life balance. The work-life balance score among the participants falls in medium range. This suggests that on average, people in the survey feel they have a moderate level of satisfaction and balance between their work and personal lives.
Based on our data exploration, the factors affecting Work-Life Balance Score are ACHIEVEMENT, SUPPORTING_OTHERS and TODO_COMPLETED which have slightly stronger correlation to the Work-Life Balance Score compared to other factors. People who experience sense of accomplishment in their professional lives tend to perceive a better work-life balance. This means setting and achieving goals contributes positively to overall well-being and work-life satisfaction. Another factor which is supporting others, whether helping out colleagues or contributing to a team may experience a more favorable work-life balance. On top of that, people who completed more to-do items means they can effectively manage and complete their tasks, and this may perceive a better work-life balance. It shows that efficient task completion and time management are crucial components of work-life balance.
There are four regression models used to predict Work-life Balance Score which are Multi-linear, Support Vector(SVM), Random Forest and XGBoost. SVM is the best prediction model among the four models as it achieved the highest R-squared score of 0.9958 in our project. The predicted results are the closest to actual values, compared to Random Forest and XGBoost models which also have high accuracy. The worst model is Multi-linear regression which only has 0.5451 R-squared score.
There are three classification models used to predict Work-life Balance Rank which are K-Nearest Neighbors (KNN), Naive Bayes and AdaBoost. From the results, KNN is the best classification model compared to Naive Bayes and AdaBoost classification model. This is because the prediction accuracy of KNN classification model is highest which achieved 91.92%.
In conclusion, this project highlight that the work-life balance could be influenced by different factors. Based on the findings, it suggested the importance of recognizing achievements (ACHIEVEMENT), fostering teamwork (SUPPORTING_OTHERS), and optimizing task management (TODO_COMPLETED) in achieving a better work-life balance. These outcomes offer practical implications for organizations aiming to enhance workplace well-being and suggest avenues for future research in understanding and improving work-life balance dynamics.