This report is the first in a series designed to provide a comprehensive understanding of employee engagement and happiness in the workplace, using the HF 2021 Employee Engagement Dataset. The series aims to evaluate the quality and reliability of the data, assess the construct validity of survey measures, identify key workplace drivers of happiness, explore group differences, and model changes in employee happiness over time. By synthesizing these insights, the series seeks to tell a cohesive, data-driven story and offer actionable recommendations for HR teams, leaders, and employees to foster workplace happiness.
The focus of this initial report is on data quality and exploration, establishing a foundation for deeper analyses. A clean, well-prepared dataset and a solid understanding of its structure and characteristics are essential for uncovering meaningful insights. However, this analysis revealed that key demographic data—such as gender, hire date, and birth date—are systematically missing in some cases, which limits the ability to fully explore certain group differences. As a result, interpretations of gender, age, and tenure-related trends should be approached with caution.
Report Highlights
Data Quality: Addressed systematic missingness in gender, hire date, and birth date data, along with inconsistencies in gender labels and improbable dates, to ensure robust analyses.
Happiness Trends: Significant variations in happiness levels were observed by age group, particularly during the COVID-19 pandemic.
Organizational Insights: Company-level differences in average happiness scores underscore the critical role of organizational culture in shaping employee well-being.
For this analysis, I used the HF 2021 Employee Engagement Dataset, created by Jose Berengueres and Tristan Romanov using data from the Happyforce platform. I selected this dataset because it is freely available, based on real-world data, and presents a level of complexity and incomplete documentation that realistically reflects the challenges of working with organizational data in practice.
Every day, through the platform, employees were asked: ‘How do you feel today?’. This is called HI in the data, which I will refer to as the Happiness Index.
Additionally, once a month employees were asked a set of questions measuring motivation, relationships, feedback, alignment, wellness, and rewards and recognition. These are called Scores in the data
Lastly, every 3 months, employees were asked whether they would recommend their company as a place to work (eNPS).
The data set is composed of 5 files:
companyMetadata.csv: This file contains information
about client companies, including hashed ID, industry, and time
zone.employees.csv: This file contains information about all
employees that have responded to at least one question across all
datasets. Employees are identified by a combination of companyId and
employeeId. It also includes demographic data about employees, including
hiring date, birth date, and gender.scoreMetadata.csv: This file contains includes
descriptions of what is measured by Scores, Factors and Questions. These
are hierarchical, with questions nested within factors, and factors
nested within scores.hiVotes.csv: This file contains all responses to the HI
question, with unique values identified by a combination of companyId,
employeeId, and date.scoreVotes.csv: This file contains all responses across
all employees, companies, dates, scores, questions, and dates.Citation: Berengueres, J., & Romanov, T. (2021). HF 2021 Employee Engagement Dataset. Kaggle. Retrieved January 6, 2025, from https://www.kaggle.com/datasets/harriken/myhappyforce-survey-employee-stress.
This file contains information about client companies, including hashed ID, industry, and time zone. There are 147 unique company IDs spanning 16 industries (with an 87% completion rate) and 18 time zones.The most common industry is computer software and IT services.
# Load and transform data
companyMetadata <- read_csv("companyMetadata.csv", show_col_types = FALSE)
companyMetadata <- companyMetadata %>%
mutate_if(is.character, as.factor)
# Inspect data
head(companyMetadata, 10)
skim(companyMetadata)
| Name | companyMetadata |
| Number of rows | 147 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| companyId | 0 | 1.00 | FALSE | 147 | 56a: 1, 56f: 1, 574: 1, 579: 1 |
| industry | 19 | 0.87 | FALSE | 16 | COM: 38, FIN: 15, MAN: 13, HEA: 8 |
| timezone | 0 | 1.00 | FALSE | 18 | Eur: 49, GMT: 25, Eur: 24, Ame: 14 |
# Visualize counts of companies by industry
companyMetadata %>%
filter(!is.na(industry)) %>%
count(industry, sort = TRUE) %>%
ggplot(aes(x = reorder(industry, n), y = n)) +
geom_bar(stat = "identity", fill = "steelblue", color = "black") +
labs(
title = "Count of Companies by Industry",
x = "Count",
y = "Industry"
) +
theme_minimal() +
theme(axis.text.y = element_text(size = 10)) +
coord_flip()
This file provides data on 20,341 employees who have responded to at least one question across all datasets. Employees are uniquely identified by a combination of companyId and employeeId. The file also includes demographic details, such as hiring date, birth date, and gender. The date formats in the dataset were inconsistent, so I standardized them by parsing. I observed inconsistent gender labels in multiple languages and recoded them to standardized English terms.
# Load data
employees <- read_csv("employees.csv", show_col_types = FALSE)
# Format date and character variables.
employees <- employees %>%
mutate(
hiringDateParsed = coalesce(as.Date(ymd_hms(hiringDate, quiet = TRUE)), as.Date(ymd(hiringDate, quiet = TRUE))),
birthDateParsed = coalesce(as.Date(ymd_hms(birthDate, quiet = TRUE)), as.Date(ymd(birthDate, quiet = TRUE))),
deletionDateParsed = coalesce(as.Date(ymd_hms(deletionDate, quiet = TRUE)), as.Date(ymd(deletionDate, quiet = TRUE)))
) %>%
mutate(across(where(is.character), as.factor)) %>%
mutate(employeeId = as.factor(employeeId))
# Recode gender
unique(employees$gender)
## [1] <NA> Hombre Mujer Male Masculino Female Femenino
## [8] Femení
## Levels: Female Femení Femenino Hombre Male Masculino Mujer
employees <- employees %>%
mutate(
gender_recode = case_when(
gender %in% c("Female", "Femení", "Femenino", "Mujer") ~ "Female",
gender %in% c("Hombre", "Male", "Masculino") ~ "Male",
TRUE ~ NA_character_ # Assign NA for other or missing values
),
gender_recode = as.factor(gender_recode) # Convert to factor
)
There is a considerable amount of missing data, as not all companies provided demographic information. Specifically, hire dates are approximately 38% complete, birth dates are 37% complete, and gender information is available for 61% of the sample. Additionally, around 20% of employees have a deletion date, indicating they were no longer active on the platform at the time of the data collection.
head(employees, 10)
skim(employees)
| Name | employees |
| Number of rows | 20341 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| Date | 3 |
| factor | 4 |
| logical | 1 |
| POSIXct | 3 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| hiringDateParsed | 12719 | 0.37 | 1914-03-08 | 2021-10-31 | 2016-06-22 | 3061 |
| birthDateParsed | 12830 | 0.37 | 1944-11-11 | 2097-02-08 | 1981-05-14 | 5775 |
| deletionDateParsed | 16206 | 0.20 | 2017-05-05 | 2021-08-10 | 2020-04-11 | 846 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| companyId | 0 | 1.00 | FALSE | 90 | 601: 1285, 5ed: 1045, 5be: 1022, 5a9: 971 |
| employeeId | 0 | 1.00 | FALSE | 2481 | 1: 71, 2: 67, 8: 67, 7: 66 |
| gender | 7982 | 0.61 | FALSE | 7 | Hom: 3744, Mal: 3430, Muj: 3209, Fem: 920 |
| gender_recode | 7982 | 0.61 | FALSE | 2 | Mal: 7279, Fem: 5080 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| deleted | 461 | 0.98 | 0.21 | FAL: 15678, TRU: 4202 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| hiringDate | 12716 | 0.37 | 0025-09-18 00:00:00 | 2021-10-31 00:00:00 | 2016-06-22 00:00:00 | 3287 |
| birthDate | 12805 | 0.37 | 0001-01-28 00:00:00 | 2097-02-08 00:00:00 | 1981-05-01 00:00:00 | 5985 |
| deletionDate | 16206 | 0.20 | 2017-05-05 07:34:11 | 2021-08-10 12:23:49 | 2020-04-11 14:20:43 | 3540 |
Upon examining the gender data more closely, I found that missing values are not missing at random. No company includes data for both genders; instead, each company only explicitly labels one gender, leaving the other implied or missing. Additionally, hire date and birth date information are also systematically missing when gender is missing, further complicating analyses involving demographic data. While it is possible that missing values for gender, hire date, and birth date could be inferred based on patterns in the data, this assumption cannot be reliably validated.
Given these issues, I will interpret gender, age, and tenure differences with caution to avoid introducing potential bias or inaccuracies. These limitations highlight the need for improved data collection practices to ensure a more comprehensive and inclusive understanding of employee demographics.
# Compute counts and percentages of gender by companyId with non-empty cell counts for hiringDate and birthDate
gender_summary <- employees %>%
group_by(companyId, Gender = gender_recode) %>% # Group by companyId and gender
summarise(
Count = n(), # Count the number of employees in each gender group
HiringDate_NonEmpty = sum(!is.na(hiringDateParsed)), # Count non-NA hiring dates
BirthDate_NonEmpty = sum(!is.na(birthDateParsed)), # Count non-NA birth dates
.groups = "drop"
) %>%
group_by(companyId) %>% # Regroup by companyId to calculate percentages
arrange(companyId, desc(Count)) # Sort by companyId and Count
# View the resulting summary table
head(gender_summary)
The date data includes improbable outliers, such as hiring dates in 1914. To address this, I recoded any hiring dates before August 11, 1976, as missing, as employees with such dates would have a tenure of over 45 years as of the last data collection date, August 11, 2021—a highly unlikely scenario. This adjustment resulted in an additional 4 hiring dates being marked as missing, but the percentage of complete hire dates remains 38%.
Similarly, some birth dates were set far in the future. To ensure all employees were at least 18 years old as of August 11, 2021, I recoded any birth dates after August 11, 2003, as missing. This adjustment led to 74 additional birth dates being marked as missing, but the percentage of complete birth dates remains 37%.
# Code outliers as missing
min_hiring_date <- as.Date("1976-08-11")
max_birth_date <- as.Date("2003-08-11")
employees <- employees %>%
mutate(
hiringDateParsedClean = case_when(
hiringDateParsed < min_hiring_date ~ NA_Date_,
TRUE ~ hiringDateParsed
),
birthDateParsedClean = case_when(
birthDateParsed > max_birth_date ~ NA_Date_,
TRUE ~ birthDateParsed
)
)
# Visualize date data
ggplot(employees, aes(x = hiringDateParsedClean)) +
geom_histogram(binwidth = 30, color = "black") +
labs(title = "Histogram of Hiring Dates", x = "Hiring Date", y = "Count") +
theme_minimal()
ggplot(employees, aes(x = birthDateParsedClean)) +
geom_histogram(binwidth = 30, color = "black") +
labs(title = "Histogram of Birth Dates", x = "Birth Date", y = "Count") +
theme_minimal()
ggplot(employees, aes(x = deletionDateParsed)) +
geom_histogram(binwidth = 30, color = "black") +
labs(title = "Histogram of Deletion Dates", x = "Deletion Date", y = "Count") +
theme_minimal()
This file contains all employees’ responses to the question: “How do you feel today?” which they were asked daily through the platform. Responses are uniquely identified by a combination of companyId, employeeId, and date.
# Read data
hiVotes <- read_csv("hiVotes.csv", show_col_types = FALSE)
# Format data
hiVotes <- hiVotes %>%
mutate(dateParsed = as.Date(ymd(date, quiet = TRUE))) %>%
mutate(across(where(is.character), as.factor)) %>%
mutate(employeeId = as.factor(employeeId))
The dataset includes a total of 2,302,358 responses from 2,481 employees across 90 companies. The dates range from February 1, 2016, to August 11, 2021. Scores for this question range from 1 to 4 (M = 2.92, SD = 0.98), with higher scores indicating more positive feelings and lower scores indicating more negative feelings.
head(hiVotes, 10)
skim(hiVotes)
| Name | hiVotes |
| Number of rows | 2302358 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| factor | 3 |
| numeric | 1 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| dateParsed | 0 | 1 | 2016-02-01 | 2021-08-11 | 2019-12-13 | 2019 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| companyId | 0 | 1 | FALSE | 90 | 595: 326213, 56a: 240567, 590: 214268, 5a9: 206427 |
| employeeId | 0 | 1 | FALSE | 2481 | 1: 13188, 24: 11003, 4: 10984, 7: 10008 |
| departmentId | 0 | 1 | FALSE | 743 | 590: 128713, 594: 76724, 59a: 67343, 5aa: 61516 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| hiVote | 0 | 1 | 2.92 | 0.98 | 1 | 2 | 3 | 4 | 4 | ▂▂▁▇▆ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2016-02-01 | 2021-08-11 | 2019-12-13 | 2019 |
To determine the number of hiVote scores per employee, I grouped the data by employeeId and companyId. This resulted in 20,335 unique combinations of employeeId and companyId, just 6 fewer than the 20,341 employees listed in the employees data frame. The distribution of responses is highly skewed. The median number of responses per employee is 27, meaning that a typical employee responded to the question “How do you feel today?” 27 times. However, the mean number of responses is 133, significantly higher than the median, reflecting the influence of outliers. For example, one employee answered the question nearly every single day, or an impressive 2,006 times!
In addition to analyzing the average number of times an employee responded, I examined the variability of scores among repeat responders. The average standard deviation of scores within employees is 0.51 on a scale of 1 to 4, suggesting that employees’ feelings tend to fluctuate day-to-day, which aligns with the transient nature of moods. However, the standard deviation of these individual standard deviations is 0.32, indicating that some employees’ feelings vary more than others. This is consistent with expected differences between individuals, influenced by personality traits such as emotional stability.
# Define a mode function
get_mode <- function(x) {
ux <- unique(x) # Unique values
ux[which.max(tabulate(match(x, ux)))] # Return the value with the highest frequency
}
# Calculate summary statistics including mode and median
employee_summary <- hiVotes %>%
group_by(companyId, employeeId) %>% # Group by companyId and employeeId
summarise(
mean_score = mean(hiVote, na.rm = TRUE), # Mean score per employee
median_score = median(hiVote, na.rm = TRUE), # Median score per employee
mode_score = get_mode(hiVote), # Mode of hiVote scores
sd_score = sd(hiVote, na.rm = TRUE), # Standard deviation of scores
min_score = min(hiVote, na.rm = TRUE), # Minimum score
max_score = max(hiVote, na.rm = TRUE), # Maximum score
score_range = max_score - min_score, # Range of scores
n_scores = n(), # Number of scores
.groups = "drop"
)
ggplot(employee_summary, aes(x = n_scores)) +
geom_histogram(binwidth = 10, fill = "steelblue", color = "black") +
labs(
title = "Distribution of HI Counts Per Employee",
x = "Count of HI Scores",
y = "Frequency"
) +
theme_minimal()
sd_score_summary <- employee_summary %>%
summarise(
mean_sd_score = mean(sd_score, na.rm = TRUE),
sd_sd_score = sd(sd_score, na.rm = TRUE)
)
sd_score_summary
I also examined summary statistics for the 90 companies using the employee summary table. The distribution of company-level scores is approximately normal, suggesting that some companies are, on average, happier than others. These differences could potentially be attributed to variations in HR practices aimed at promoting employee well-being.
# Compute summary statistics by company
company_summary <- employee_summary %>%
group_by(companyId) %>% # Group by companyId
summarise(
mean_sd_score = mean(sd_score, na.rm = TRUE), # Mean standard deviation within company
mean_mean_score = mean(mean_score, na.rm = TRUE), # Mean of mean scores within company
median_mean_score = median(mean_score, na.rm = TRUE), # Median of mean scores
sd_mean_score = sd(mean_score, na.rm = TRUE), # Standard deviation of mean scores within company
employee_count = n(), # Number of employees in the company
.groups = "drop"
)
# Extract mean and standard deviation of mean company scores
mean_value <- mean(company_summary$mean_mean_score, na.rm = TRUE)
sd_value <- sd(company_summary$mean_mean_score, na.rm = TRUE)
# Plot histogram with a fitted normal curve
ggplot(company_summary, aes(x = mean_mean_score)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 0.1, fill = "steelblue", color = "black") +
stat_function(
fun = dnorm,
args = list(mean = mean_value, sd = sd_value),
color = "red",
linewidth = 1
) +
labs(
title = "Histogram of Mean Company Scores with Normal Curve",
x = "Mean Score",
y = "Density"
) +
theme_minimal()
This file contains descriptions of workplace aspects measured by Scores, Factors and Questions. These are hierarchical, with questions nested within factors, and factors nested within scores. I created understandable IDs for the questions, rather than their alphanumeric codes by manually entering key words into the dataframe.
# Read and inspect data
scoreMetadata <- read_csv("scoreMetadataKeywords.csv", show_col_types = FALSE)
scoreMetadata <- scoreMetadata %>%
mutate_all(as.factor)
# Display the interactive table, hiding columns 1, 3, and 5
datatable(
scoreMetadata,
caption = "Interactive Table of scoreMetadata",
options = list(
pageLength = 10,
autoWidth = TRUE,
scrollX = TRUE,
columnDefs = list(
list(visible = FALSE, targets = c(1, 3, 5))
)
)
)
This file contains all responses from employees across companies, dates, and questions. Once a month, employees answered a set of questions measuring the factors outlined in the table above, which address workplace aspects such as compensation, management, and stress. Additionally, every three months, employees were asked whether they would recommend their company as a place to work (eNPS). All questions were rated on a 1 to 10 scale. The data covers responses collected between December 9, 2018, and August 11, 2021.
scoreVotes <- read_csv("scoreVotes.csv", show_col_types = FALSE)
head(scoreVotes)
scoreVotes <- scoreVotes %>%
mutate(across(ends_with("Id"), as.factor)) %>%
mutate(dateParsed = as.Date(ymd(date, quiet = TRUE)))
skim(scoreVotes)
| Name | scoreVotes |
| Number of rows | 495924 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| factor | 6 |
| numeric | 1 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| dateParsed | 0 | 1 | 2018-12-09 | 2021-08-11 | 2021-03-09 | 799 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| companyId | 0 | 1 | FALSE | 73 | 5f4: 55102, 56a: 49848, 5e2: 39800, 59e: 24286 |
| employeeId | 0 | 1 | FALSE | 2322 | 1: 4536, 4: 3818, 11: 3727, 2: 3273 |
| departmentId | 0 | 1 | FALSE | 527 | 5ff: 22639, 576: 15726, 57d: 14791, 5e2: 13236 |
| scoreId | 0 | 1 | FALSE | 7 | 5de: 95710, 5df: 86291, 5dd: 79992, 5dd: 79931 |
| factorId | 0 | 1 | FALSE | 19 | 5de: 32768, 5de: 32345, 5dd: 32109, 5dd: 30934 |
| questionId | 0 | 1 | FALSE | 55 | 5bf: 20338, 5de: 11702, 5de: 11525, 5de: 11384 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| scoreVote | 0 | 1 | 6.23 | 2.75 | 1 | 4 | 7 | 9 | 10 | ▂▇▃▇▇ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2018-12-09 | 2021-08-11 | 2021-03-09 | 799 |
To align with the survey’s monthly measurement schedule, some
analyses may require organizing the
scoreVotes dataset into monthly subsets.
(The eNPS item has been excluded from this analysis since it was only
collected quarterly.) To prepare the dataset, I first merged
scoreVotes with
scoreMetadata by matching on
questionId, adding the
key_words column from
scoreMetadata. Then, I filtered
scoreVotes to retain only responses from
August 2021, creating scoreVotesMonth to
focus on a specific time period. Finally, I transformed
scoreVotesMonth into a wide format, where
each row represents a unique companyId and
employeeId combination, and each column
corresponds to a key_words category, with
values from scoreVote. There should be one
response per employee per month, but in cases of multiple resonses to
the same question, I tool the mean. This restructuring facilitates the
analysis of individual employees’ responses across different survey
topics.
# Step 0: Remove rows where questionId = "5bfff05de6caa5000429c543"
scoreVotesNoNps <- scoreVotes %>%
filter(questionId != "5bfff05de6caa5000429c543")
# Step 1: Merge key_words from scoreMetadata into scoreVotes by matching on questionId
scoreVotes <- scoreVotes %>%
left_join(scoreMetadata %>% select(questionId, key_words), by = "questionId")
# Step 2: Create a dataframe for one month of data (e.g., August 2021)
scoreVotesMonth <- scoreVotes %>%
filter(format(dateParsed, "%Y-%m") == "2021-08") # Adjust month as needed
# Step 3: Reshape scoreVotesMonth to wide format
scoreVotesWide <- scoreVotesMonth %>%
select(companyId, employeeId, key_words, scoreVote) %>% # Keep necessary columns
group_by(companyId, employeeId, key_words) %>% # Group to handle duplicates
summarise(scoreVote = mean(scoreVote, na.rm = TRUE), .groups = "drop") %>% # Take the mean response
pivot_wider(names_from = key_words, values_from = scoreVote)
# View the final dataset
head(scoreVotesWide)
skim(scoreVotesWide)
| Name | scoreVotesWide |
| Number of rows | 4128 |
| Number of columns | 57 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 55 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| companyId | 0 | 1 | FALSE | 61 | 601: 870, 602: 533, 5ed: 273, 5f4: 256 |
| employeeId | 0 | 1 | FALSE | 1843 | 12: 23, 1: 22, 11: 22, 16: 20 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| values_align | 3495 | 0.15 | 7.87 | 2.14 | 1 | 7 | 8 | 10.00 | 10 | ▁▁▂▆▇ |
| values_decisions | 3530 | 0.14 | 3.03 | 0.72 | 1 | 3 | 3 | 3.00 | 4 | ▁▂▁▇▃ |
| clear_communication | 3194 | 0.23 | 7.17 | 2.28 | 1 | 6 | 8 | 9.00 | 10 | ▁▂▅▇▇ |
| committed_coworkers | 3324 | 0.19 | 8.06 | 2.14 | 1 | 7 | 9 | 10.00 | 10 | ▁▁▂▃▇ |
| contribute_outcome | 3754 | 0.09 | 8.37 | 1.88 | 1 | 8 | 9 | 10.00 | 10 | ▁▁▂▅▇ |
| make_decision | 3489 | 0.15 | 7.34 | 2.32 | 1 | 6 | 8 | 9.00 | 10 | ▁▂▅▆▇ |
| manager_honest | 3047 | 0.26 | 7.81 | 2.39 | 1 | 7 | 9 | 10.00 | 10 | ▁▁▂▃▇ |
| manager_skill | 3233 | 0.22 | 7.74 | 2.13 | 1 | 7 | 8 | 9.00 | 10 | ▁▁▂▆▇ |
| meaning_pride | 3584 | 0.13 | 8.54 | 1.72 | 1 | 8 | 9 | 10.00 | 10 | ▁▁▁▃▇ |
| purposeful_work | 3669 | 0.11 | 3.36 | 0.68 | 1 | 3 | 3 | 4.00 | 4 | ▁▂▁▇▇ |
| rate_benefits | 3510 | 0.15 | 7.54 | 2.17 | 1 | 6 | 8 | 9.00 | 10 | ▁▁▅▇▇ |
| reach_objective | 3628 | 0.12 | 7.78 | 2.20 | 1 | 7 | 8 | 9.25 | 10 | ▁▁▂▆▇ |
| resource_invest | 3514 | 0.15 | 6.80 | 2.47 | 1 | 5 | 7 | 9.00 | 10 | ▂▃▆▇▇ |
| right_goal | 3557 | 0.14 | 7.27 | 2.33 | 1 | 6 | 8 | 9.00 | 10 | ▁▂▃▇▇ |
| sr_leader_confidence | 3270 | 0.21 | 7.04 | 2.59 | 1 | 5 | 8 | 9.00 | 10 | ▂▂▅▆▇ |
| tools_resource | 3539 | 0.14 | 3.55 | 0.90 | 1 | 3 | 4 | 4.00 | 5 | ▁▁▆▇▂ |
| vision_inspire | 3692 | 0.11 | 3.37 | 0.80 | 1 | 3 | 4 | 4.00 | 4 | ▁▁▁▆▇ |
| work_environment | 3595 | 0.13 | 7.96 | 2.10 | 1 | 7 | 8 | 10.00 | 10 | ▁▁▂▅▇ |
| work_feedback | 3079 | 0.25 | 2.85 | 0.82 | 1 | 2 | 3 | 3.00 | 4 | ▁▅▁▇▃ |
| advance_career | 3636 | 0.12 | 6.88 | 2.62 | 1 | 6 | 7 | 9.00 | 10 | ▂▁▅▇▇ |
| count_on_coworkers | 3120 | 0.24 | 8.25 | 2.09 | 1 | 7 | 9 | 10.00 | 10 | ▁▁▂▃▇ |
| coworkers_opinion | 3465 | 0.16 | 7.97 | 1.77 | 1 | 7 | 8 | 9.00 | 10 | ▁▁▂▇▇ |
| employee_wellbeing | 3583 | 0.13 | 3.57 | 1.06 | 1 | 3 | 4 | 4.00 | 5 | ▁▂▅▇▃ |
| fair_pay | 3573 | 0.13 | 6.33 | 2.72 | 1 | 4 | 7 | 8.00 | 10 | ▃▃▆▇▇ |
| goals_leader | 3626 | 0.12 | 7.12 | 2.52 | 1 | 6 | 8 | 9.00 | 10 | ▂▂▅▆▇ |
| know_expect | 3567 | 0.14 | 3.43 | 0.75 | 1 | 3 | 4 | 4.00 | 4 | ▁▁▁▅▇ |
| learn_skill | 3682 | 0.11 | 3.84 | 1.00 | 1 | 3 | 4 | 5.00 | 5 | ▁▂▆▇▇ |
| make_friends | 3184 | 0.23 | 7.46 | 2.33 | 1 | 6 | 8 | 10.00 | 10 | ▁▂▃▆▇ |
| manager_care | 3617 | 0.12 | 7.50 | 2.43 | 1 | 6 | 8 | 9.00 | 10 | ▁▁▂▆▇ |
| manager_develops | 3734 | 0.10 | 7.42 | 2.54 | 1 | 6 | 8 | 10.00 | 10 | ▁▂▃▅▇ |
| manager_opinion | 3433 | 0.17 | 6.66 | 1.76 | 1 | 6 | 7 | 8.00 | 8 | ▁▁▁▂▇ |
| manager_pay_convo | 3610 | 0.13 | 2.69 | 1.09 | 1 | 2 | 3 | 4.00 | 4 | ▅▇▁▆▇ |
| opinion_value | 3197 | 0.23 | 6.97 | 2.43 | 1 | 5 | 7 | 9.00 | 10 | ▂▂▅▇▇ |
| part_of_team | 3264 | 0.21 | 3.18 | 0.76 | 1 | 3 | 3 | 4.00 | 4 | ▁▃▁▇▇ |
| rate_pay_process | 3705 | 0.10 | 5.50 | 2.72 | 1 | 3 | 6 | 8.00 | 10 | ▅▅▆▇▃ |
| use_strengths | 3800 | 0.08 | 7.21 | 2.49 | 1 | 6 | 8 | 9.00 | 10 | ▂▂▃▇▇ |
| work_goal | 3703 | 0.10 | 3.51 | 0.73 | 1 | 3 | 4 | 4.00 | 4 | ▁▁▁▃▇ |
| work_stress | 3553 | 0.14 | 7.56 | 2.04 | 1 | 7 | 8 | 9.00 | 10 | ▁▁▃▇▆ |
| manager_inclusion | 3695 | 0.10 | 3.36 | 0.81 | 1 | 3 | 4 | 4.00 | 4 | ▁▁▁▅▇ |
| manager_success | 3652 | 0.12 | 7.59 | 2.43 | 1 | 6 | 8 | 10.00 | 10 | ▁▁▃▅▇ |
| fair_treatment | 3365 | 0.18 | 8.00 | 2.29 | 1 | 7 | 9 | 10.00 | 10 | ▁▁▂▃▇ |
| include_respect | 3627 | 0.12 | 8.00 | 2.34 | 1 | 7 | 9 | 10.00 | 10 | ▁▁▂▃▇ |
| accomplishment_celebrate | 3363 | 0.19 | 3.17 | 0.78 | 1 | 3 | 3 | 4.00 | 4 | ▁▃▁▇▇ |
| control_work | 3598 | 0.13 | 7.89 | 1.94 | 1 | 7 | 8 | 9.00 | 10 | ▁▁▂▇▇ |
| different_opinion | 3064 | 0.26 | 3.21 | 0.72 | 1 | 3 | 3 | 4.00 | 4 | ▁▂▁▇▆ |
| freedom_work | 3460 | 0.16 | 7.99 | 2.01 | 1 | 7 | 8 | 10.00 | 10 | ▁▁▂▆▇ |
| involved_decision | 3551 | 0.14 | 2.86 | 0.68 | 1 | 3 | 3 | 3.00 | 4 | ▁▂▁▇▂ |
| like_work_manager | 3284 | 0.20 | 4.07 | 0.73 | 2 | 4 | 4 | 5.00 | 5 | ▁▂▁▇▃ |
| manager_brings_best | 3237 | 0.22 | 7.53 | 2.39 | 1 | 6 | 8 | 10.00 | 10 | ▁▂▃▅▇ |
| manager_helps | 3373 | 0.18 | 3.37 | 0.75 | 1 | 3 | 4 | 4.00 | 4 | ▁▂▁▆▇ |
| recognition_encouraged | 3495 | 0.15 | 6.91 | 2.64 | 1 | 5 | 7 | 9.00 | 10 | ▂▂▅▆▇ |
| upward_feedback | 3258 | 0.21 | 7.74 | 2.24 | 1 | 7 | 8 | 10.00 | 10 | ▁▁▂▆▇ |
| voice_opinion | 3030 | 0.27 | 7.39 | 2.39 | 1 | 6 | 8 | 10.00 | 10 | ▁▂▃▆▇ |
| work_recognized | 3564 | 0.14 | 2.77 | 0.82 | 1 | 2 | 3 | 3.00 | 4 | ▁▆▁▇▃ |
| recommend_company | 3312 | 0.20 | 8.20 | 1.90 | 1 | 7 | 9 | 10.00 | 10 | ▁▁▂▆▇ |
Now that the data is loaded and cleaned, I can create any number of datasets for analysis. For example, I analyzed happiness trends by age group in 2020, focusing on the impact of the COVID-19 pandemic on workplace happiness. To that end, I created a dataset using the hiVotes and employees datasets. First, I computed age in years using birthDateParsedClean, which was merged into the hiVotes dataset using companyId and employeeId.
Next, I grouped employees into four age categories to create optimal bins: <29, 30-39, 40-49, and 50+. I aggregated the data by age group and month, calculating the average happiness index (Avg_HI) and the number of responses (Count_Responses) for each group. I also added a vertical reference line to March 2020 to mark the onset of significant COVID-19-related changes, such as the widespread adoption of work-from-home policies across Europe.
The analysis revealed that happiness levels dropped notably after the start of the pandemic. The most significant drop occurred among employees aged 30-39, a group often associated with child-rearing responsibilities. Employees aged <29 and 30-39 continued to report lower-than-pre-pandemic happiness levels throughout 2020. Conversely, employees aged 50+ showed an increase in happiness levels compared to pre-pandemic levels. Employees in the 40-49 group displayed relatively stable trends, with less pronounced changes than the younger or older cohorts.
# Merge birthDateParsedClean into hiVotes
hiVotes_with_age <- hiVotes %>%
inner_join(
employees %>%
select(companyId, employeeId, birthDateParsedClean),
by = c("companyId", "employeeId")
) %>%
mutate(
# Compute age in years
Age_Years = as.numeric(difftime(date, birthDateParsedClean, units = "days")) / 365.25,
# Bin ages into age groups
Age_Group = case_when(
Age_Years < 30 ~ "<29",
Age_Years >= 30 & Age_Years < 40 ~ "30-39",
Age_Years >= 40 & Age_Years < 50 ~ "40-49",
Age_Years >= 50 ~ "50+",
TRUE ~ "Unknown"
),
Month = floor_date(date, unit = "month")
)
# Aggregate data by age group and month
happiness_by_age_group <- hiVotes_with_age %>%
group_by(Age_Group, Month) %>% # Removed 'country'
summarise(
Avg_HI = mean(hiVote, na.rm = TRUE), # Calculate average happiness index
Count_Responses = n(), # Count the number of responses
.groups = "drop"
)
# Filter out 'Unknown' age group and restrict to the year 2020
happiness_by_age_group_filtered <- happiness_by_age_group %>%
filter(Age_Group != "Unknown",
year(Month) == 2020)
# Create the line graph with a vertical line for March 2020
ggplot(happiness_by_age_group_filtered, aes(x = Month, y = Avg_HI, color = Age_Group)) +
geom_line(linewidth = 1) +
geom_vline(xintercept = as.POSIXct("2020-03-01"), linetype = "dashed", color = "black") +
scale_x_datetime(date_labels = "%b %Y", date_breaks = "1 month") +
labs(
title = "Happiness Trends by Age Group in 2020",
subtitle = "Dashed line indicates March 2020 (Start of COVID-19 impact on work-from-home)",
x = "Month",
y = "Average Happiness Index",
color = "Age Group"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This report provides a foundational exploration of the HF 2021 Employee Engagement Dataset, emphasizing data quality, reliability, and initial insights into workplace happiness trends. By cleaning, exploring, and visualizing key aspects of the data, this analysis establishes a robust starting point for deeper investigations into employee engagement.
<29 and 30-39)
maintained lower-than-pre-pandemic happiness levels throughout
2020.50+) experienced an increase in
happiness levels, potentially reflecting reduced work-life pressures or
greater adaptability to remote work.<29 and 30-39, particularly in managing
work-life balance during challenging times like the pandemic.