1. Cleaning, Restructuring, Exploring Data

Introduction

This report is the first in a series designed to provide a comprehensive understanding of employee engagement and happiness in the workplace, using the HF 2021 Employee Engagement Dataset. The series aims to evaluate the quality and reliability of the data, assess the construct validity of survey measures, identify key workplace drivers of happiness, explore group differences, and model changes in employee happiness over time. By synthesizing these insights, the series seeks to tell a cohesive, data-driven story and offer actionable recommendations for HR teams, leaders, and employees to foster workplace happiness.

The focus of this initial report is on data quality and exploration, establishing a foundation for deeper analyses. A clean, well-prepared dataset and a solid understanding of its structure and characteristics are essential for uncovering meaningful insights. However, this analysis revealed that key demographic data—such as gender, hire date, and birth date—are systematically missing in some cases, which limits the ability to fully explore certain group differences. As a result, interpretations of gender, age, and tenure-related trends should be approached with caution.

Report Highlights

Data Quality: Addressed systematic missingness in gender, hire date, and birth date data, along with inconsistencies in gender labels and improbable dates, to ensure robust analyses.
Happiness Trends: Significant variations in happiness levels were observed by age group, particularly during the COVID-19 pandemic.
Organizational Insights: Company-level differences in average happiness scores underscore the critical role of organizational culture in shaping employee well-being.

Data Overview

For this analysis, I used the HF 2021 Employee Engagement Dataset, created by Jose Berengueres and Tristan Romanov using data from the Happyforce platform. I selected this dataset because it is freely available, based on real-world data, and presents a level of complexity and incomplete documentation that realistically reflects the challenges of working with organizational data in practice.

Every day, through the platform, employees were asked: ‘How do you feel today?’. This is called HI in the data, which I will refer to as the Happiness Index.

Additionally, once a month employees were asked a set of questions measuring motivation, relationships, feedback, alignment, wellness, and rewards and recognition. These are called Scores in the data

Lastly, every 3 months, employees were asked whether they would recommend their company as a place to work (eNPS).

The data set is composed of 5 files:

companyMetadata.csv: This file contains information about client companies, including hashed ID, industry, and time zone.
employees.csv: This file contains information about all employees that have responded to at least one question across all datasets. Employees are identified by a combination of companyId and employeeId. It also includes demographic data about employees, including hiring date, birth date, and gender.
scoreMetadata.csv: This file contains includes descriptions of what is measured by Scores, Factors and Questions. These are hierarchical, with questions nested within factors, and factors nested within scores.
hiVotes.csv: This file contains all responses to the HI question, with unique values identified by a combination of companyId, employeeId, and date.
scoreVotes.csv: This file contains all responses across all employees, companies, dates, scores, questions, and dates.

Citation: Berengueres, J., & Romanov, T. (2021). HF 2021 Employee Engagement Dataset. Kaggle. Retrieved January 6, 2025, from https://www.kaggle.com/datasets/harriken/myhappyforce-survey-employee-stress.

Data: companyMetadata

This file contains information about client companies, including hashed ID, industry, and time zone. There are 147 unique company IDs spanning 16 industries (with an 87% completion rate) and 18 time zones.The most common industry is computer software and IT services.

# Load and transform data
companyMetadata <- read_csv("companyMetadata.csv", show_col_types = FALSE)
companyMetadata <- companyMetadata %>%
  mutate_if(is.character, as.factor)

# Inspect data
head(companyMetadata, 10)

skim(companyMetadata)

Data summary
Name	companyMetadata
Number of rows	147
Number of columns	3
_______________________
Column type frequency:
factor	3
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
companyId	0	1.00	FALSE	147	56a: 1, 56f: 1, 574: 1, 579: 1
industry	19	0.87	FALSE	16	COM: 38, FIN: 15, MAN: 13, HEA: 8
timezone	0	1.00	FALSE	18	Eur: 49, GMT: 25, Eur: 24, Ame: 14

# Visualize counts of companies by industry
companyMetadata %>%
  filter(!is.na(industry)) %>%
  count(industry, sort = TRUE) %>%
  ggplot(aes(x = reorder(industry, n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue", color = "black") +
  labs(
    title = "Count of Companies by Industry",
    x = "Count",
    y = "Industry"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 10)) +
  coord_flip()

Data: employees

This file provides data on 20,341 employees who have responded to at least one question across all datasets. Employees are uniquely identified by a combination of companyId and employeeId. The file also includes demographic details, such as hiring date, birth date, and gender. The date formats in the dataset were inconsistent, so I standardized them by parsing. I observed inconsistent gender labels in multiple languages and recoded them to standardized English terms.

# Load data
employees <- read_csv("employees.csv", show_col_types = FALSE)

# Format date and character variables.
employees <- employees %>%
  mutate(
    hiringDateParsed = coalesce(as.Date(ymd_hms(hiringDate, quiet = TRUE)), as.Date(ymd(hiringDate, quiet = TRUE))),
    birthDateParsed = coalesce(as.Date(ymd_hms(birthDate, quiet = TRUE)), as.Date(ymd(birthDate, quiet = TRUE))),
    deletionDateParsed = coalesce(as.Date(ymd_hms(deletionDate, quiet = TRUE)), as.Date(ymd(deletionDate, quiet = TRUE)))
  ) %>%
  mutate(across(where(is.character), as.factor))  %>%
  mutate(employeeId = as.factor(employeeId))

# Recode gender
unique(employees$gender)

## [1] <NA>      Hombre    Mujer     Male      Masculino Female    Femenino 
## [8] Femení   
## Levels: Female Femení Femenino Hombre Male Masculino Mujer

employees <- employees %>%
  mutate(
    gender_recode = case_when(
      gender %in% c("Female", "Femení", "Femenino", "Mujer") ~ "Female",
      gender %in% c("Hombre", "Male", "Masculino") ~ "Male",
      TRUE ~ NA_character_  # Assign NA for other or missing values
    ),
    gender_recode = as.factor(gender_recode)  # Convert to factor
  )

There is a considerable amount of missing data, as not all companies provided demographic information. Specifically, hire dates are approximately 38% complete, birth dates are 37% complete, and gender information is available for 61% of the sample. Additionally, around 20% of employees have a deletion date, indicating they were no longer active on the platform at the time of the data collection.

head(employees, 10)

skim(employees)

Data summary
Name	employees
Number of rows	20341
Number of columns	11
_______________________
Column type frequency:
Date	3
factor	4
logical	1
POSIXct	3
________________________
Group variables	None

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
hiringDateParsed	12719	0.37	1914-03-08	2021-10-31	2016-06-22	3061
birthDateParsed	12830	0.37	1944-11-11	2097-02-08	1981-05-14	5775
deletionDateParsed	16206	0.20	2017-05-05	2021-08-10	2020-04-11	846

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
companyId	0	1.00	FALSE	90	601: 1285, 5ed: 1045, 5be: 1022, 5a9: 971
employeeId	0	1.00	FALSE	2481	1: 71, 2: 67, 8: 67, 7: 66
gender	7982	0.61	FALSE	7	Hom: 3744, Mal: 3430, Muj: 3209, Fem: 920
gender_recode	7982	0.61	FALSE	2	Mal: 7279, Fem: 5080

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
deleted	461	0.98	0.21	FAL: 15678, TRU: 4202

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
hiringDate	12716	0.37	0025-09-18 00:00:00	2021-10-31 00:00:00	2016-06-22 00:00:00	3287
birthDate	12805	0.37	0001-01-28 00:00:00	2097-02-08 00:00:00	1981-05-01 00:00:00	5985
deletionDate	16206	0.20	2017-05-05 07:34:11	2021-08-10 12:23:49	2020-04-11 14:20:43	3540

Upon examining the gender data more closely, I found that missing values are not missing at random. No company includes data for both genders; instead, each company only explicitly labels one gender, leaving the other implied or missing. Additionally, hire date and birth date information are also systematically missing when gender is missing, further complicating analyses involving demographic data. While it is possible that missing values for gender, hire date, and birth date could be inferred based on patterns in the data, this assumption cannot be reliably validated.

Given these issues, I will interpret gender, age, and tenure differences with caution to avoid introducing potential bias or inaccuracies. These limitations highlight the need for improved data collection practices to ensure a more comprehensive and inclusive understanding of employee demographics.

# Compute counts and percentages of gender by companyId with non-empty cell counts for hiringDate and birthDate
gender_summary <- employees %>%
  group_by(companyId, Gender = gender_recode) %>%  # Group by companyId and gender
  summarise(
    Count = n(),  # Count the number of employees in each gender group
    HiringDate_NonEmpty = sum(!is.na(hiringDateParsed)),  # Count non-NA hiring dates
    BirthDate_NonEmpty = sum(!is.na(birthDateParsed)),    # Count non-NA birth dates
    .groups = "drop"
  ) %>%
  group_by(companyId) %>%  # Regroup by companyId to calculate percentages
  arrange(companyId, desc(Count))  # Sort by companyId and Count

# View the resulting summary table
head(gender_summary)

The date data includes improbable outliers, such as hiring dates in 1914. To address this, I recoded any hiring dates before August 11, 1976, as missing, as employees with such dates would have a tenure of over 45 years as of the last data collection date, August 11, 2021—a highly unlikely scenario. This adjustment resulted in an additional 4 hiring dates being marked as missing, but the percentage of complete hire dates remains 38%.

Similarly, some birth dates were set far in the future. To ensure all employees were at least 18 years old as of August 11, 2021, I recoded any birth dates after August 11, 2003, as missing. This adjustment led to 74 additional birth dates being marked as missing, but the percentage of complete birth dates remains 37%.

# Code outliers as missing
min_hiring_date <- as.Date("1976-08-11")
max_birth_date <- as.Date("2003-08-11")

employees <- employees %>%
  mutate(
    hiringDateParsedClean = case_when(
      hiringDateParsed < min_hiring_date ~ NA_Date_,
      TRUE ~ hiringDateParsed
    ),
    birthDateParsedClean = case_when(
      birthDateParsed > max_birth_date ~ NA_Date_,
      TRUE ~ birthDateParsed
    )
  )

# Visualize date data
ggplot(employees, aes(x = hiringDateParsedClean)) +
  geom_histogram(binwidth = 30, color = "black") +
  labs(title = "Histogram of Hiring Dates", x = "Hiring Date", y = "Count") +
  theme_minimal()

ggplot(employees, aes(x = birthDateParsedClean)) +
  geom_histogram(binwidth = 30, color = "black") +
  labs(title = "Histogram of Birth Dates", x = "Birth Date", y = "Count") +
  theme_minimal()

ggplot(employees, aes(x = deletionDateParsed)) +
  geom_histogram(binwidth = 30, color = "black") +
  labs(title = "Histogram of Deletion Dates", x = "Deletion Date", y = "Count") +
  theme_minimal()

Data: hiVotes

This file contains all employees’ responses to the question: “How do you feel today?” which they were asked daily through the platform. Responses are uniquely identified by a combination of companyId, employeeId, and date.

# Read data
hiVotes <- read_csv("hiVotes.csv", show_col_types = FALSE)

# Format data
hiVotes <- hiVotes %>%
  mutate(dateParsed = as.Date(ymd(date, quiet = TRUE))) %>%
  mutate(across(where(is.character), as.factor)) %>%
  mutate(employeeId = as.factor(employeeId))

The dataset includes a total of 2,302,358 responses from 2,481 employees across 90 companies. The dates range from February 1, 2016, to August 11, 2021. Scores for this question range from 1 to 4 (M = 2.92, SD = 0.98), with higher scores indicating more positive feelings and lower scores indicating more negative feelings.

head(hiVotes, 10)

skim(hiVotes)

Data summary
Name	hiVotes
Number of rows	2302358
Number of columns	6
_______________________
Column type frequency:
Date	1
factor	3
numeric	1
POSIXct	1
________________________
Group variables	None

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
dateParsed	0	1	2016-02-01	2021-08-11	2019-12-13	2019

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
companyId	1	FALSE	90	595: 326213, 56a: 240567, 590: 214268, 5a9: 206427
employeeId	1	FALSE	2481	1: 13188, 24: 11003, 4: 10984, 7: 10008
departmentId	1	FALSE	743	590: 128713, 594: 76724, 59a: 67343, 5aa: 61516

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
hiVote	0	1	2.92	0.98	1	2	3	4	4	▂▂▁▇▆

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2016-02-01	2021-08-11	2019-12-13	2019

To determine the number of hiVote scores per employee, I grouped the data by employeeId and companyId. This resulted in 20,335 unique combinations of employeeId and companyId, just 6 fewer than the 20,341 employees listed in the employees data frame. The distribution of responses is highly skewed. The median number of responses per employee is 27, meaning that a typical employee responded to the question “How do you feel today?” 27 times. However, the mean number of responses is 133, significantly higher than the median, reflecting the influence of outliers. For example, one employee answered the question nearly every single day, or an impressive 2,006 times!

In addition to analyzing the average number of times an employee responded, I examined the variability of scores among repeat responders. The average standard deviation of scores within employees is 0.51 on a scale of 1 to 4, suggesting that employees’ feelings tend to fluctuate day-to-day, which aligns with the transient nature of moods. However, the standard deviation of these individual standard deviations is 0.32, indicating that some employees’ feelings vary more than others. This is consistent with expected differences between individuals, influenced by personality traits such as emotional stability.

# Define a mode function
get_mode <- function(x) {
  ux <- unique(x)                     # Unique values
  ux[which.max(tabulate(match(x, ux)))]  # Return the value with the highest frequency
}

# Calculate summary statistics including mode and median
employee_summary <- hiVotes %>%
  group_by(companyId, employeeId) %>%  # Group by companyId and employeeId
  summarise(
    mean_score = mean(hiVote, na.rm = TRUE),       # Mean score per employee
    median_score = median(hiVote, na.rm = TRUE),   # Median score per employee
    mode_score = get_mode(hiVote),                 # Mode of hiVote scores
    sd_score = sd(hiVote, na.rm = TRUE),           # Standard deviation of scores
    min_score = min(hiVote, na.rm = TRUE),         # Minimum score
    max_score = max(hiVote, na.rm = TRUE),         # Maximum score
    score_range = max_score - min_score,           # Range of scores
    n_scores = n(),                                # Number of scores
    .groups = "drop"
  )

ggplot(employee_summary, aes(x = n_scores)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "black") +
  labs(
    title = "Distribution of HI Counts Per Employee",
    x = "Count of HI Scores",
    y = "Frequency"
  ) +
  theme_minimal()

sd_score_summary <- employee_summary %>%
  summarise(
    mean_sd_score = mean(sd_score, na.rm = TRUE),
    sd_sd_score = sd(sd_score, na.rm = TRUE)
  )
sd_score_summary

I also examined summary statistics for the 90 companies using the employee summary table. The distribution of company-level scores is approximately normal, suggesting that some companies are, on average, happier than others. These differences could potentially be attributed to variations in HR practices aimed at promoting employee well-being.

# Compute summary statistics by company
company_summary <- employee_summary %>%
  group_by(companyId) %>%  # Group by companyId
  summarise(
    mean_sd_score = mean(sd_score, na.rm = TRUE),  # Mean standard deviation within company
    mean_mean_score = mean(mean_score, na.rm = TRUE),  # Mean of mean scores within company
    median_mean_score = median(mean_score, na.rm = TRUE),  # Median of mean scores
    sd_mean_score = sd(mean_score, na.rm = TRUE),  # Standard deviation of mean scores within company
    employee_count = n(),  # Number of employees in the company
    .groups = "drop"
  )

# Extract mean and standard deviation of mean company scores
mean_value <- mean(company_summary$mean_mean_score, na.rm = TRUE)
sd_value <- sd(company_summary$mean_mean_score, na.rm = TRUE)

# Plot histogram with a fitted normal curve
ggplot(company_summary, aes(x = mean_mean_score)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1, fill = "steelblue", color = "black") +
  stat_function(
    fun = dnorm,
    args = list(mean = mean_value, sd = sd_value),
    color = "red",
    linewidth = 1
  ) +
  labs(
    title = "Histogram of Mean Company Scores with Normal Curve",
    x = "Mean Score",
    y = "Density"
  ) +
  theme_minimal()

Data: scoreMetadata

This file contains descriptions of workplace aspects measured by Scores, Factors and Questions. These are hierarchical, with questions nested within factors, and factors nested within scores. I created understandable IDs for the questions, rather than their alphanumeric codes by manually entering key words into the dataframe.

# Read and inspect data
scoreMetadata <- read_csv("scoreMetadataKeywords.csv", show_col_types = FALSE)
scoreMetadata <- scoreMetadata %>%
  mutate_all(as.factor)

# Display the interactive table, hiding columns 1, 3, and 5
datatable(
  scoreMetadata,
  caption = "Interactive Table of scoreMetadata",
  options = list(
    pageLength = 10,
    autoWidth = TRUE,
    scrollX = TRUE,
    columnDefs = list(
      list(visible = FALSE, targets = c(1, 3, 5))
    )
  )
)

Data: scoreVotes

This file contains all responses from employees across companies, dates, and questions. Once a month, employees answered a set of questions measuring the factors outlined in the table above, which address workplace aspects such as compensation, management, and stress. Additionally, every three months, employees were asked whether they would recommend their company as a place to work (eNPS). All questions were rated on a 1 to 10 scale. The data covers responses collected between December 9, 2018, and August 11, 2021.

scoreVotes <- read_csv("scoreVotes.csv", show_col_types = FALSE)
head(scoreVotes)

scoreVotes <- scoreVotes %>%
  mutate(across(ends_with("Id"), as.factor)) %>%
  mutate(dateParsed = as.Date(ymd(date, quiet = TRUE)))

skim(scoreVotes)

Data summary
Name	scoreVotes
Number of rows	495924
Number of columns	9
_______________________
Column type frequency:
Date	1
factor	6
numeric	1
POSIXct	1
________________________
Group variables	None

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
dateParsed	0	1	2018-12-09	2021-08-11	2021-03-09	799

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
companyId	1	FALSE	73	5f4: 55102, 56a: 49848, 5e2: 39800, 59e: 24286
employeeId	1	FALSE	2322	1: 4536, 4: 3818, 11: 3727, 2: 3273
departmentId	1	FALSE	527	5ff: 22639, 576: 15726, 57d: 14791, 5e2: 13236
scoreId	1	FALSE	7	5de: 95710, 5df: 86291, 5dd: 79992, 5dd: 79931
factorId	1	FALSE	19	5de: 32768, 5de: 32345, 5dd: 32109, 5dd: 30934
questionId	1	FALSE	55	5bf: 20338, 5de: 11702, 5de: 11525, 5de: 11384

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
scoreVote	0	1	6.23	2.75	1	4	7	9	10	▂▇▃▇▇

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2018-12-09	2021-08-11	2021-03-09	799

To align with the survey’s monthly measurement schedule, some analyses may require organizing the scoreVotes dataset into monthly subsets. (The eNPS item has been excluded from this analysis since it was only collected quarterly.) To prepare the dataset, I first merged scoreVotes with scoreMetadata by matching on questionId, adding the key_words column from scoreMetadata. Then, I filtered scoreVotes to retain only responses from August 2021, creating scoreVotesMonth to focus on a specific time period. Finally, I transformed scoreVotesMonth into a wide format, where each row represents a unique companyId and employeeId combination, and each column corresponds to a key_words category, with values from scoreVote. There should be one response per employee per month, but in cases of multiple resonses to the same question, I tool the mean. This restructuring facilitates the analysis of individual employees’ responses across different survey topics.

# Step 0: Remove rows where questionId = "5bfff05de6caa5000429c543"
scoreVotesNoNps <- scoreVotes %>%
  filter(questionId != "5bfff05de6caa5000429c543")

# Step 1: Merge key_words from scoreMetadata into scoreVotes by matching on questionId
scoreVotes <- scoreVotes %>%
  left_join(scoreMetadata %>% select(questionId, key_words), by = "questionId")

# Step 2: Create a dataframe for one month of data (e.g., August 2021)
scoreVotesMonth <- scoreVotes %>%
  filter(format(dateParsed, "%Y-%m") == "2021-08")  # Adjust month as needed

# Step 3: Reshape scoreVotesMonth to wide format
scoreVotesWide <- scoreVotesMonth %>%
  select(companyId, employeeId, key_words, scoreVote) %>%  # Keep necessary columns
  group_by(companyId, employeeId, key_words) %>%           # Group to handle duplicates
  summarise(scoreVote = mean(scoreVote, na.rm = TRUE), .groups = "drop") %>%  # Take the mean response
  pivot_wider(names_from = key_words, values_from = scoreVote)

# View the final dataset
head(scoreVotesWide)

skim(scoreVotesWide)

Data summary
Name	scoreVotesWide
Number of rows	4128
Number of columns	57
_______________________
Column type frequency:
factor	2
numeric	55
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
companyId	0	1	FALSE	61	601: 870, 602: 533, 5ed: 273, 5f4: 256
employeeId	0	1	FALSE	1843	12: 23, 1: 22, 11: 22, 16: 20

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
values_align	3495	0.15	7.87	2.14	1	7	8	10.00	10	▁▁▂▆▇
values_decisions	3530	0.14	3.03	0.72	1	3	3	3.00	4	▁▂▁▇▃
clear_communication	3194	0.23	7.17	2.28	1	6	8	9.00	10	▁▂▅▇▇
committed_coworkers	3324	0.19	8.06	2.14	1	7	9	10.00	10	▁▁▂▃▇
contribute_outcome	3754	0.09	8.37	1.88	1	8	9	10.00	10	▁▁▂▅▇
make_decision	3489	0.15	7.34	2.32	1	6	8	9.00	10	▁▂▅▆▇
manager_honest	3047	0.26	7.81	2.39	1	7	9	10.00	10	▁▁▂▃▇
manager_skill	3233	0.22	7.74	2.13	1	7	8	9.00	10	▁▁▂▆▇
meaning_pride	3584	0.13	8.54	1.72	1	8	9	10.00	10	▁▁▁▃▇
purposeful_work	3669	0.11	3.36	0.68	1	3	3	4.00	4	▁▂▁▇▇
rate_benefits	3510	0.15	7.54	2.17	1	6	8	9.00	10	▁▁▅▇▇
reach_objective	3628	0.12	7.78	2.20	1	7	8	9.25	10	▁▁▂▆▇
resource_invest	3514	0.15	6.80	2.47	1	5	7	9.00	10	▂▃▆▇▇
right_goal	3557	0.14	7.27	2.33	1	6	8	9.00	10	▁▂▃▇▇
sr_leader_confidence	3270	0.21	7.04	2.59	1	5	8	9.00	10	▂▂▅▆▇
tools_resource	3539	0.14	3.55	0.90	1	3	4	4.00	5	▁▁▆▇▂
vision_inspire	3692	0.11	3.37	0.80	1	3	4	4.00	4	▁▁▁▆▇
work_environment	3595	0.13	7.96	2.10	1	7	8	10.00	10	▁▁▂▅▇
work_feedback	3079	0.25	2.85	0.82	1	2	3	3.00	4	▁▅▁▇▃
advance_career	3636	0.12	6.88	2.62	1	6	7	9.00	10	▂▁▅▇▇
count_on_coworkers	3120	0.24	8.25	2.09	1	7	9	10.00	10	▁▁▂▃▇
coworkers_opinion	3465	0.16	7.97	1.77	1	7	8	9.00	10	▁▁▂▇▇
employee_wellbeing	3583	0.13	3.57	1.06	1	3	4	4.00	5	▁▂▅▇▃
fair_pay	3573	0.13	6.33	2.72	1	4	7	8.00	10	▃▃▆▇▇
goals_leader	3626	0.12	7.12	2.52	1	6	8	9.00	10	▂▂▅▆▇
know_expect	3567	0.14	3.43	0.75	1	3	4	4.00	4	▁▁▁▅▇
learn_skill	3682	0.11	3.84	1.00	1	3	4	5.00	5	▁▂▆▇▇
make_friends	3184	0.23	7.46	2.33	1	6	8	10.00	10	▁▂▃▆▇
manager_care	3617	0.12	7.50	2.43	1	6	8	9.00	10	▁▁▂▆▇
manager_develops	3734	0.10	7.42	2.54	1	6	8	10.00	10	▁▂▃▅▇
manager_opinion	3433	0.17	6.66	1.76	1	6	7	8.00	8	▁▁▁▂▇
manager_pay_convo	3610	0.13	2.69	1.09	1	2	3	4.00	4	▅▇▁▆▇
opinion_value	3197	0.23	6.97	2.43	1	5	7	9.00	10	▂▂▅▇▇
part_of_team	3264	0.21	3.18	0.76	1	3	3	4.00	4	▁▃▁▇▇
rate_pay_process	3705	0.10	5.50	2.72	1	3	6	8.00	10	▅▅▆▇▃
use_strengths	3800	0.08	7.21	2.49	1	6	8	9.00	10	▂▂▃▇▇
work_goal	3703	0.10	3.51	0.73	1	3	4	4.00	4	▁▁▁▃▇
work_stress	3553	0.14	7.56	2.04	1	7	8	9.00	10	▁▁▃▇▆
manager_inclusion	3695	0.10	3.36	0.81	1	3	4	4.00	4	▁▁▁▅▇
manager_success	3652	0.12	7.59	2.43	1	6	8	10.00	10	▁▁▃▅▇
fair_treatment	3365	0.18	8.00	2.29	1	7	9	10.00	10	▁▁▂▃▇
include_respect	3627	0.12	8.00	2.34	1	7	9	10.00	10	▁▁▂▃▇
accomplishment_celebrate	3363	0.19	3.17	0.78	1	3	3	4.00	4	▁▃▁▇▇
control_work	3598	0.13	7.89	1.94	1	7	8	9.00	10	▁▁▂▇▇
different_opinion	3064	0.26	3.21	0.72	1	3	3	4.00	4	▁▂▁▇▆
freedom_work	3460	0.16	7.99	2.01	1	7	8	10.00	10	▁▁▂▆▇
involved_decision	3551	0.14	2.86	0.68	1	3	3	3.00	4	▁▂▁▇▂
like_work_manager	3284	0.20	4.07	0.73	2	4	4	5.00	5	▁▂▁▇▃
manager_brings_best	3237	0.22	7.53	2.39	1	6	8	10.00	10	▁▂▃▅▇
manager_helps	3373	0.18	3.37	0.75	1	3	4	4.00	4	▁▂▁▆▇
recognition_encouraged	3495	0.15	6.91	2.64	1	5	7	9.00	10	▂▂▅▆▇
upward_feedback	3258	0.21	7.74	2.24	1	7	8	10.00	10	▁▁▂▆▇
voice_opinion	3030	0.27	7.39	2.39	1	6	8	10.00	10	▁▂▃▆▇
work_recognized	3564	0.14	2.77	0.82	1	2	3	3.00	4	▁▆▁▇▃
recommend_company	3312	0.20	8.20	1.90	1	7	9	10.00	10	▁▁▂▆▇

Happiness Trends by Age Group: Insights from the COVID-19 Pandemic

Now that the data is loaded and cleaned, I can create any number of datasets for analysis. For example, I analyzed happiness trends by age group in 2020, focusing on the impact of the COVID-19 pandemic on workplace happiness. To that end, I created a dataset using the hiVotes and employees datasets. First, I computed age in years using birthDateParsedClean, which was merged into the hiVotes dataset using companyId and employeeId.

Next, I grouped employees into four age categories to create optimal bins: <29, 30-39, 40-49, and 50+. I aggregated the data by age group and month, calculating the average happiness index (Avg_HI) and the number of responses (Count_Responses) for each group. I also added a vertical reference line to March 2020 to mark the onset of significant COVID-19-related changes, such as the widespread adoption of work-from-home policies across Europe.

The analysis revealed that happiness levels dropped notably after the start of the pandemic. The most significant drop occurred among employees aged 30-39, a group often associated with child-rearing responsibilities. Employees aged <29 and 30-39 continued to report lower-than-pre-pandemic happiness levels throughout 2020. Conversely, employees aged 50+ showed an increase in happiness levels compared to pre-pandemic levels. Employees in the 40-49 group displayed relatively stable trends, with less pronounced changes than the younger or older cohorts.

# Merge birthDateParsedClean into hiVotes
hiVotes_with_age <- hiVotes %>%
  inner_join(
    employees %>% 
      select(companyId, employeeId, birthDateParsedClean),
    by = c("companyId", "employeeId")
  ) %>%
  mutate(
    # Compute age in years
    Age_Years = as.numeric(difftime(date, birthDateParsedClean, units = "days")) / 365.25,
    # Bin ages into age groups
    Age_Group = case_when(
      Age_Years < 30 ~ "<29",
      Age_Years >= 30 & Age_Years < 40 ~ "30-39",
      Age_Years >= 40 & Age_Years < 50 ~ "40-49",
      Age_Years >= 50 ~ "50+",
      TRUE ~ "Unknown"
    ),
    Month = floor_date(date, unit = "month")
  )

# Aggregate data by age group and month
happiness_by_age_group <- hiVotes_with_age %>%
  group_by(Age_Group, Month) %>%  # Removed 'country'
  summarise(
    Avg_HI = mean(hiVote, na.rm = TRUE),  # Calculate average happiness index
    Count_Responses = n(),                # Count the number of responses
    .groups = "drop"
  )

# Filter out 'Unknown' age group and restrict to the year 2020
happiness_by_age_group_filtered <- happiness_by_age_group %>%
  filter(Age_Group != "Unknown",  
         year(Month) == 2020)     

# Create the line graph with a vertical line for March 2020
ggplot(happiness_by_age_group_filtered, aes(x = Month, y = Avg_HI, color = Age_Group)) +
  geom_line(linewidth = 1) +  
  geom_vline(xintercept = as.POSIXct("2020-03-01"), linetype = "dashed", color = "black") +
  scale_x_datetime(date_labels = "%b %Y", date_breaks = "1 month") +
  labs(
    title = "Happiness Trends by Age Group in 2020",
    subtitle = "Dashed line indicates March 2020 (Start of COVID-19 impact on work-from-home)",
    x = "Month",
    y = "Average Happiness Index",
    color = "Age Group"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusion

This report provides a foundational exploration of the HF 2021 Employee Engagement Dataset, emphasizing data quality, reliability, and initial insights into workplace happiness trends. By cleaning, exploring, and visualizing key aspects of the data, this analysis establishes a robust starting point for deeper investigations into employee engagement.

Insights About the Data

Data Quality and Reliability:
- Missing Data: Critical demographic fields such as hire dates (38%), birth dates (37%), and gender (61%) showed significant gaps. Additionally, gender data was not missing at random; companies only labeled one gender, leaving the other implied or missing. This inconsistency led to the exclusion of gender from subsequent analyses.
- Date Anomalies: The data included improbable outliers, such as hiring dates in 1914 and birth dates far in the future. These were corrected to ensure logical consistency (e.g., excluding birth dates after August 11, 2003, to ensure all employees were at least 18 years old).
- Employee Deletion: Approximately 20% of employees had deletion dates, indicating they were no longer active on the platform.
Variability in Participation:
- The number of responses per employee varied widely, with a median of 27 responses and a mean of 133. This discrepancy highlights the presence of outliers, including one employee who responded 2,006 times.
Company and Industry Coverage:
- The dataset spans 147 companies across 16 industries, with “Computer Software and IT Services” being the most represented. Company-level happiness scores followed an approximately normal distribution, suggesting variation in workplace engagement across organizations.

Meaningful Insights

Happiness Trends by Age Group in 2020:
- Happiness levels dropped significantly after the onset of COVID-19 in March 2020, particularly among employees aged 30-39, a demographic often associated with child-rearing responsibilities.
- Younger employees (<29 and 30-39) maintained lower-than-pre-pandemic happiness levels throughout 2020.
- Older employees (50+) experienced an increase in happiness levels, potentially reflecting reduced work-life pressures or greater adaptability to remote work.
Mood Variability:
- Employee moods fluctuated day-to-day, with an average within-person standard deviation of 0.51 on a 4-point scale. This variability aligns with the transient nature of mood but also highlights between-person differences, likely influenced by personality traits like emotional stability.
Engagement and Workplace Practices:
- Differences in average happiness scores across companies suggest that HR practices and organizational culture significantly influence employee well-being. Industries and companies with higher average scores likely have more robust engagement and well-being strategies.

Actionable Recommendations

Improve Data Collection:
- Standardize the collection of demographic data, particularly gender, to ensure inclusivity and accuracy for future analyses.
- Implement stricter validation for date entries to avoid improbable outliers.
Focus on Younger Employees:
- Develop targeted strategies to support employees aged <29 and 30-39, particularly in managing work-life balance during challenging times like the pandemic.
Leverage Positive Outliers:
- Investigate high-scoring companies and industries to identify best practices in promoting workplace happiness.
Longitudinal Studies:
- Build on this analysis with time-series models to understand how engagement evolves over time and predict future trends.

1. Cleaning, Restructuring, Exploring Data

Dr. Rena Rasch

January 29, 2025

Introduction

Data Overview

Data: companyMetadata

Data: employees

Data: hiVotes

Data: scoreMetadata

Data: scoreVotes

Happiness Trends by Age Group: Insights from the COVID-19 Pandemic

Conclusion

Insights About the Data

Meaningful Insights

Actionable Recommendations