Introduction

This report is the first in a series designed to provide a comprehensive understanding of employee engagement and happiness in the workplace, using the HF 2021 Employee Engagement Dataset. The series aims to evaluate the quality and reliability of the data, assess the construct validity of survey measures, identify key workplace drivers of happiness, explore group differences, and model changes in employee happiness over time. By synthesizing these insights, the series seeks to tell a cohesive, data-driven story and offer actionable recommendations for HR teams, leaders, and employees to foster workplace happiness.

The focus of this initial report is on data quality and exploration, establishing a foundation for deeper analyses. A clean, well-prepared dataset and a solid understanding of its structure and characteristics are essential for uncovering meaningful insights. However, this analysis revealed that key demographic data—such as gender, hire date, and birth date—are systematically missing in some cases, which limits the ability to fully explore certain group differences. As a result, interpretations of gender, age, and tenure-related trends should be approached with caution.

Report Highlights


Data Overview

For this analysis, I used the HF 2021 Employee Engagement Dataset, created by Jose Berengueres and Tristan Romanov using data from the Happyforce platform. I selected this dataset because it is freely available, based on real-world data, and presents a level of complexity and incomplete documentation that realistically reflects the challenges of working with organizational data in practice.

Every day, through the platform, employees were asked: ‘How do you feel today?’. This is called HI in the data, which I will refer to as the Happiness Index.

Additionally, once a month employees were asked a set of questions measuring motivation, relationships, feedback, alignment, wellness, and rewards and recognition. These are called Scores in the data

Lastly, every 3 months, employees were asked whether they would recommend their company as a place to work (eNPS).

The data set is composed of 5 files:

Citation: Berengueres, J., & Romanov, T. (2021). HF 2021 Employee Engagement Dataset. Kaggle. Retrieved January 6, 2025, from https://www.kaggle.com/datasets/harriken/myhappyforce-survey-employee-stress.


Data: companyMetadata

This file contains information about client companies, including hashed ID, industry, and time zone. There are 147 unique company IDs spanning 16 industries (with an 87% completion rate) and 18 time zones.The most common industry is computer software and IT services.

# Load and transform data
companyMetadata <- read_csv("companyMetadata.csv", show_col_types = FALSE)
companyMetadata <- companyMetadata %>%
  mutate_if(is.character, as.factor)

# Inspect data
head(companyMetadata, 10)
skim(companyMetadata)
Data summary
Name companyMetadata
Number of rows 147
Number of columns 3
_______________________
Column type frequency:
factor 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
companyId 0 1.00 FALSE 147 56a: 1, 56f: 1, 574: 1, 579: 1
industry 19 0.87 FALSE 16 COM: 38, FIN: 15, MAN: 13, HEA: 8
timezone 0 1.00 FALSE 18 Eur: 49, GMT: 25, Eur: 24, Ame: 14
# Visualize counts of companies by industry
companyMetadata %>%
  filter(!is.na(industry)) %>%
  count(industry, sort = TRUE) %>%
  ggplot(aes(x = reorder(industry, n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue", color = "black") +
  labs(
    title = "Count of Companies by Industry",
    x = "Count",
    y = "Industry"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 10)) +
  coord_flip()


Data: employees

This file provides data on 20,341 employees who have responded to at least one question across all datasets. Employees are uniquely identified by a combination of companyId and employeeId. The file also includes demographic details, such as hiring date, birth date, and gender. The date formats in the dataset were inconsistent, so I standardized them by parsing. I observed inconsistent gender labels in multiple languages and recoded them to standardized English terms.

# Load data
employees <- read_csv("employees.csv", show_col_types = FALSE)

# Format date and character variables.
employees <- employees %>%
  mutate(
    hiringDateParsed = coalesce(as.Date(ymd_hms(hiringDate, quiet = TRUE)), as.Date(ymd(hiringDate, quiet = TRUE))),
    birthDateParsed = coalesce(as.Date(ymd_hms(birthDate, quiet = TRUE)), as.Date(ymd(birthDate, quiet = TRUE))),
    deletionDateParsed = coalesce(as.Date(ymd_hms(deletionDate, quiet = TRUE)), as.Date(ymd(deletionDate, quiet = TRUE)))
  ) %>%
  mutate(across(where(is.character), as.factor))  %>%
  mutate(employeeId = as.factor(employeeId))

# Recode gender
unique(employees$gender)
## [1] <NA>      Hombre    Mujer     Male      Masculino Female    Femenino 
## [8] Femení   
## Levels: Female Femení Femenino Hombre Male Masculino Mujer
employees <- employees %>%
  mutate(
    gender_recode = case_when(
      gender %in% c("Female", "Femení", "Femenino", "Mujer") ~ "Female",
      gender %in% c("Hombre", "Male", "Masculino") ~ "Male",
      TRUE ~ NA_character_  # Assign NA for other or missing values
    ),
    gender_recode = as.factor(gender_recode)  # Convert to factor
  )

There is a considerable amount of missing data, as not all companies provided demographic information. Specifically, hire dates are approximately 38% complete, birth dates are 37% complete, and gender information is available for 61% of the sample. Additionally, around 20% of employees have a deletion date, indicating they were no longer active on the platform at the time of the data collection.

head(employees, 10)
skim(employees)
Data summary
Name employees
Number of rows 20341
Number of columns 11
_______________________
Column type frequency:
Date 3
factor 4
logical 1
POSIXct 3
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
hiringDateParsed 12719 0.37 1914-03-08 2021-10-31 2016-06-22 3061
birthDateParsed 12830 0.37 1944-11-11 2097-02-08 1981-05-14 5775
deletionDateParsed 16206 0.20 2017-05-05 2021-08-10 2020-04-11 846

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
companyId 0 1.00 FALSE 90 601: 1285, 5ed: 1045, 5be: 1022, 5a9: 971
employeeId 0 1.00 FALSE 2481 1: 71, 2: 67, 8: 67, 7: 66
gender 7982 0.61 FALSE 7 Hom: 3744, Mal: 3430, Muj: 3209, Fem: 920
gender_recode 7982 0.61 FALSE 2 Mal: 7279, Fem: 5080

Variable type: logical

skim_variable n_missing complete_rate mean count
deleted 461 0.98 0.21 FAL: 15678, TRU: 4202

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
hiringDate 12716 0.37 0025-09-18 00:00:00 2021-10-31 00:00:00 2016-06-22 00:00:00 3287
birthDate 12805 0.37 0001-01-28 00:00:00 2097-02-08 00:00:00 1981-05-01 00:00:00 5985
deletionDate 16206 0.20 2017-05-05 07:34:11 2021-08-10 12:23:49 2020-04-11 14:20:43 3540

Upon examining the gender data more closely, I found that missing values are not missing at random. No company includes data for both genders; instead, each company only explicitly labels one gender, leaving the other implied or missing. Additionally, hire date and birth date information are also systematically missing when gender is missing, further complicating analyses involving demographic data. While it is possible that missing values for gender, hire date, and birth date could be inferred based on patterns in the data, this assumption cannot be reliably validated.

Given these issues, I will interpret gender, age, and tenure differences with caution to avoid introducing potential bias or inaccuracies. These limitations highlight the need for improved data collection practices to ensure a more comprehensive and inclusive understanding of employee demographics.

# Compute counts and percentages of gender by companyId with non-empty cell counts for hiringDate and birthDate
gender_summary <- employees %>%
  group_by(companyId, Gender = gender_recode) %>%  # Group by companyId and gender
  summarise(
    Count = n(),  # Count the number of employees in each gender group
    HiringDate_NonEmpty = sum(!is.na(hiringDateParsed)),  # Count non-NA hiring dates
    BirthDate_NonEmpty = sum(!is.na(birthDateParsed)),    # Count non-NA birth dates
    .groups = "drop"
  ) %>%
  group_by(companyId) %>%  # Regroup by companyId to calculate percentages
  arrange(companyId, desc(Count))  # Sort by companyId and Count

# View the resulting summary table
head(gender_summary)

The date data includes improbable outliers, such as hiring dates in 1914. To address this, I recoded any hiring dates before August 11, 1976, as missing, as employees with such dates would have a tenure of over 45 years as of the last data collection date, August 11, 2021—a highly unlikely scenario. This adjustment resulted in an additional 4 hiring dates being marked as missing, but the percentage of complete hire dates remains 38%.

Similarly, some birth dates were set far in the future. To ensure all employees were at least 18 years old as of August 11, 2021, I recoded any birth dates after August 11, 2003, as missing. This adjustment led to 74 additional birth dates being marked as missing, but the percentage of complete birth dates remains 37%.

# Code outliers as missing
min_hiring_date <- as.Date("1976-08-11")
max_birth_date <- as.Date("2003-08-11")

employees <- employees %>%
  mutate(
    hiringDateParsedClean = case_when(
      hiringDateParsed < min_hiring_date ~ NA_Date_,
      TRUE ~ hiringDateParsed
    ),
    birthDateParsedClean = case_when(
      birthDateParsed > max_birth_date ~ NA_Date_,
      TRUE ~ birthDateParsed
    )
  )

# Visualize date data
ggplot(employees, aes(x = hiringDateParsedClean)) +
  geom_histogram(binwidth = 30, color = "black") +
  labs(title = "Histogram of Hiring Dates", x = "Hiring Date", y = "Count") +
  theme_minimal()

ggplot(employees, aes(x = birthDateParsedClean)) +
  geom_histogram(binwidth = 30, color = "black") +
  labs(title = "Histogram of Birth Dates", x = "Birth Date", y = "Count") +
  theme_minimal()

ggplot(employees, aes(x = deletionDateParsed)) +
  geom_histogram(binwidth = 30, color = "black") +
  labs(title = "Histogram of Deletion Dates", x = "Deletion Date", y = "Count") +
  theme_minimal()


Data: hiVotes

This file contains all employees’ responses to the question: “How do you feel today?” which they were asked daily through the platform. Responses are uniquely identified by a combination of companyId, employeeId, and date.

# Read data
hiVotes <- read_csv("hiVotes.csv", show_col_types = FALSE)

# Format data
hiVotes <- hiVotes %>%
  mutate(dateParsed = as.Date(ymd(date, quiet = TRUE))) %>%
  mutate(across(where(is.character), as.factor)) %>%
  mutate(employeeId = as.factor(employeeId))

The dataset includes a total of 2,302,358 responses from 2,481 employees across 90 companies. The dates range from February 1, 2016, to August 11, 2021. Scores for this question range from 1 to 4 (M = 2.92, SD = 0.98), with higher scores indicating more positive feelings and lower scores indicating more negative feelings.

head(hiVotes, 10)
skim(hiVotes)
Data summary
Name hiVotes
Number of rows 2302358
Number of columns 6
_______________________
Column type frequency:
Date 1
factor 3
numeric 1
POSIXct 1
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
dateParsed 0 1 2016-02-01 2021-08-11 2019-12-13 2019

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
companyId 0 1 FALSE 90 595: 326213, 56a: 240567, 590: 214268, 5a9: 206427
employeeId 0 1 FALSE 2481 1: 13188, 24: 11003, 4: 10984, 7: 10008
departmentId 0 1 FALSE 743 590: 128713, 594: 76724, 59a: 67343, 5aa: 61516

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hiVote 0 1 2.92 0.98 1 2 3 4 4 ▂▂▁▇▆

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2016-02-01 2021-08-11 2019-12-13 2019

To determine the number of hiVote scores per employee, I grouped the data by employeeId and companyId. This resulted in 20,335 unique combinations of employeeId and companyId, just 6 fewer than the 20,341 employees listed in the employees data frame. The distribution of responses is highly skewed. The median number of responses per employee is 27, meaning that a typical employee responded to the question “How do you feel today?” 27 times. However, the mean number of responses is 133, significantly higher than the median, reflecting the influence of outliers. For example, one employee answered the question nearly every single day, or an impressive 2,006 times!

In addition to analyzing the average number of times an employee responded, I examined the variability of scores among repeat responders. The average standard deviation of scores within employees is 0.51 on a scale of 1 to 4, suggesting that employees’ feelings tend to fluctuate day-to-day, which aligns with the transient nature of moods. However, the standard deviation of these individual standard deviations is 0.32, indicating that some employees’ feelings vary more than others. This is consistent with expected differences between individuals, influenced by personality traits such as emotional stability.

# Define a mode function
get_mode <- function(x) {
  ux <- unique(x)                     # Unique values
  ux[which.max(tabulate(match(x, ux)))]  # Return the value with the highest frequency
}

# Calculate summary statistics including mode and median
employee_summary <- hiVotes %>%
  group_by(companyId, employeeId) %>%  # Group by companyId and employeeId
  summarise(
    mean_score = mean(hiVote, na.rm = TRUE),       # Mean score per employee
    median_score = median(hiVote, na.rm = TRUE),   # Median score per employee
    mode_score = get_mode(hiVote),                 # Mode of hiVote scores
    sd_score = sd(hiVote, na.rm = TRUE),           # Standard deviation of scores
    min_score = min(hiVote, na.rm = TRUE),         # Minimum score
    max_score = max(hiVote, na.rm = TRUE),         # Maximum score
    score_range = max_score - min_score,           # Range of scores
    n_scores = n(),                                # Number of scores
    .groups = "drop"
  )

ggplot(employee_summary, aes(x = n_scores)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "black") +
  labs(
    title = "Distribution of HI Counts Per Employee",
    x = "Count of HI Scores",
    y = "Frequency"
  ) +
  theme_minimal()

sd_score_summary <- employee_summary %>%
  summarise(
    mean_sd_score = mean(sd_score, na.rm = TRUE),
    sd_sd_score = sd(sd_score, na.rm = TRUE)
  )
sd_score_summary

I also examined summary statistics for the 90 companies using the employee summary table. The distribution of company-level scores is approximately normal, suggesting that some companies are, on average, happier than others. These differences could potentially be attributed to variations in HR practices aimed at promoting employee well-being.

# Compute summary statistics by company
company_summary <- employee_summary %>%
  group_by(companyId) %>%  # Group by companyId
  summarise(
    mean_sd_score = mean(sd_score, na.rm = TRUE),  # Mean standard deviation within company
    mean_mean_score = mean(mean_score, na.rm = TRUE),  # Mean of mean scores within company
    median_mean_score = median(mean_score, na.rm = TRUE),  # Median of mean scores
    sd_mean_score = sd(mean_score, na.rm = TRUE),  # Standard deviation of mean scores within company
    employee_count = n(),  # Number of employees in the company
    .groups = "drop"
  )

# Extract mean and standard deviation of mean company scores
mean_value <- mean(company_summary$mean_mean_score, na.rm = TRUE)
sd_value <- sd(company_summary$mean_mean_score, na.rm = TRUE)

# Plot histogram with a fitted normal curve
ggplot(company_summary, aes(x = mean_mean_score)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1, fill = "steelblue", color = "black") +
  stat_function(
    fun = dnorm,
    args = list(mean = mean_value, sd = sd_value),
    color = "red",
    linewidth = 1
  ) +
  labs(
    title = "Histogram of Mean Company Scores with Normal Curve",
    x = "Mean Score",
    y = "Density"
  ) +
  theme_minimal()


Data: scoreMetadata

This file contains descriptions of workplace aspects measured by Scores, Factors and Questions. These are hierarchical, with questions nested within factors, and factors nested within scores. I created understandable IDs for the questions, rather than their alphanumeric codes by manually entering key words into the dataframe.

# Read and inspect data
scoreMetadata <- read_csv("scoreMetadataKeywords.csv", show_col_types = FALSE)
scoreMetadata <- scoreMetadata %>%
  mutate_all(as.factor)

# Display the interactive table, hiding columns 1, 3, and 5
datatable(
  scoreMetadata,
  caption = "Interactive Table of scoreMetadata",
  options = list(
    pageLength = 10,
    autoWidth = TRUE,
    scrollX = TRUE,
    columnDefs = list(
      list(visible = FALSE, targets = c(1, 3, 5))
    )
  )
)

Data: scoreVotes

This file contains all responses from employees across companies, dates, and questions. Once a month, employees answered a set of questions measuring the factors outlined in the table above, which address workplace aspects such as compensation, management, and stress. Additionally, every three months, employees were asked whether they would recommend their company as a place to work (eNPS). All questions were rated on a 1 to 10 scale. The data covers responses collected between December 9, 2018, and August 11, 2021.

scoreVotes <- read_csv("scoreVotes.csv", show_col_types = FALSE)
head(scoreVotes)
scoreVotes <- scoreVotes %>%
  mutate(across(ends_with("Id"), as.factor)) %>%
  mutate(dateParsed = as.Date(ymd(date, quiet = TRUE)))

skim(scoreVotes)
Data summary
Name scoreVotes
Number of rows 495924
Number of columns 9
_______________________
Column type frequency:
Date 1
factor 6
numeric 1
POSIXct 1
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
dateParsed 0 1 2018-12-09 2021-08-11 2021-03-09 799

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
companyId 0 1 FALSE 73 5f4: 55102, 56a: 49848, 5e2: 39800, 59e: 24286
employeeId 0 1 FALSE 2322 1: 4536, 4: 3818, 11: 3727, 2: 3273
departmentId 0 1 FALSE 527 5ff: 22639, 576: 15726, 57d: 14791, 5e2: 13236
scoreId 0 1 FALSE 7 5de: 95710, 5df: 86291, 5dd: 79992, 5dd: 79931
factorId 0 1 FALSE 19 5de: 32768, 5de: 32345, 5dd: 32109, 5dd: 30934
questionId 0 1 FALSE 55 5bf: 20338, 5de: 11702, 5de: 11525, 5de: 11384

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
scoreVote 0 1 6.23 2.75 1 4 7 9 10 ▂▇▃▇▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2018-12-09 2021-08-11 2021-03-09 799

To align with the survey’s monthly measurement schedule, some analyses may require organizing the scoreVotes dataset into monthly subsets. (The eNPS item has been excluded from this analysis since it was only collected quarterly.) To prepare the dataset, I first merged scoreVotes with scoreMetadata by matching on questionId, adding the key_words column from scoreMetadata. Then, I filtered scoreVotes to retain only responses from August 2021, creating scoreVotesMonth to focus on a specific time period. Finally, I transformed scoreVotesMonth into a wide format, where each row represents a unique companyId and employeeId combination, and each column corresponds to a key_words category, with values from scoreVote. There should be one response per employee per month, but in cases of multiple resonses to the same question, I tool the mean. This restructuring facilitates the analysis of individual employees’ responses across different survey topics.

# Step 0: Remove rows where questionId = "5bfff05de6caa5000429c543"
scoreVotesNoNps <- scoreVotes %>%
  filter(questionId != "5bfff05de6caa5000429c543")

# Step 1: Merge key_words from scoreMetadata into scoreVotes by matching on questionId
scoreVotes <- scoreVotes %>%
  left_join(scoreMetadata %>% select(questionId, key_words), by = "questionId")

# Step 2: Create a dataframe for one month of data (e.g., August 2021)
scoreVotesMonth <- scoreVotes %>%
  filter(format(dateParsed, "%Y-%m") == "2021-08")  # Adjust month as needed

# Step 3: Reshape scoreVotesMonth to wide format
scoreVotesWide <- scoreVotesMonth %>%
  select(companyId, employeeId, key_words, scoreVote) %>%  # Keep necessary columns
  group_by(companyId, employeeId, key_words) %>%           # Group to handle duplicates
  summarise(scoreVote = mean(scoreVote, na.rm = TRUE), .groups = "drop") %>%  # Take the mean response
  pivot_wider(names_from = key_words, values_from = scoreVote)

# View the final dataset
head(scoreVotesWide)
skim(scoreVotesWide)
Data summary
Name scoreVotesWide
Number of rows 4128
Number of columns 57
_______________________
Column type frequency:
factor 2
numeric 55
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
companyId 0 1 FALSE 61 601: 870, 602: 533, 5ed: 273, 5f4: 256
employeeId 0 1 FALSE 1843 12: 23, 1: 22, 11: 22, 16: 20

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
values_align 3495 0.15 7.87 2.14 1 7 8 10.00 10 ▁▁▂▆▇
values_decisions 3530 0.14 3.03 0.72 1 3 3 3.00 4 ▁▂▁▇▃
clear_communication 3194 0.23 7.17 2.28 1 6 8 9.00 10 ▁▂▅▇▇
committed_coworkers 3324 0.19 8.06 2.14 1 7 9 10.00 10 ▁▁▂▃▇
contribute_outcome 3754 0.09 8.37 1.88 1 8 9 10.00 10 ▁▁▂▅▇
make_decision 3489 0.15 7.34 2.32 1 6 8 9.00 10 ▁▂▅▆▇
manager_honest 3047 0.26 7.81 2.39 1 7 9 10.00 10 ▁▁▂▃▇
manager_skill 3233 0.22 7.74 2.13 1 7 8 9.00 10 ▁▁▂▆▇
meaning_pride 3584 0.13 8.54 1.72 1 8 9 10.00 10 ▁▁▁▃▇
purposeful_work 3669 0.11 3.36 0.68 1 3 3 4.00 4 ▁▂▁▇▇
rate_benefits 3510 0.15 7.54 2.17 1 6 8 9.00 10 ▁▁▅▇▇
reach_objective 3628 0.12 7.78 2.20 1 7 8 9.25 10 ▁▁▂▆▇
resource_invest 3514 0.15 6.80 2.47 1 5 7 9.00 10 ▂▃▆▇▇
right_goal 3557 0.14 7.27 2.33 1 6 8 9.00 10 ▁▂▃▇▇
sr_leader_confidence 3270 0.21 7.04 2.59 1 5 8 9.00 10 ▂▂▅▆▇
tools_resource 3539 0.14 3.55 0.90 1 3 4 4.00 5 ▁▁▆▇▂
vision_inspire 3692 0.11 3.37 0.80 1 3 4 4.00 4 ▁▁▁▆▇
work_environment 3595 0.13 7.96 2.10 1 7 8 10.00 10 ▁▁▂▅▇
work_feedback 3079 0.25 2.85 0.82 1 2 3 3.00 4 ▁▅▁▇▃
advance_career 3636 0.12 6.88 2.62 1 6 7 9.00 10 ▂▁▅▇▇
count_on_coworkers 3120 0.24 8.25 2.09 1 7 9 10.00 10 ▁▁▂▃▇
coworkers_opinion 3465 0.16 7.97 1.77 1 7 8 9.00 10 ▁▁▂▇▇
employee_wellbeing 3583 0.13 3.57 1.06 1 3 4 4.00 5 ▁▂▅▇▃
fair_pay 3573 0.13 6.33 2.72 1 4 7 8.00 10 ▃▃▆▇▇
goals_leader 3626 0.12 7.12 2.52 1 6 8 9.00 10 ▂▂▅▆▇
know_expect 3567 0.14 3.43 0.75 1 3 4 4.00 4 ▁▁▁▅▇
learn_skill 3682 0.11 3.84 1.00 1 3 4 5.00 5 ▁▂▆▇▇
make_friends 3184 0.23 7.46 2.33 1 6 8 10.00 10 ▁▂▃▆▇
manager_care 3617 0.12 7.50 2.43 1 6 8 9.00 10 ▁▁▂▆▇
manager_develops 3734 0.10 7.42 2.54 1 6 8 10.00 10 ▁▂▃▅▇
manager_opinion 3433 0.17 6.66 1.76 1 6 7 8.00 8 ▁▁▁▂▇
manager_pay_convo 3610 0.13 2.69 1.09 1 2 3 4.00 4 ▅▇▁▆▇
opinion_value 3197 0.23 6.97 2.43 1 5 7 9.00 10 ▂▂▅▇▇
part_of_team 3264 0.21 3.18 0.76 1 3 3 4.00 4 ▁▃▁▇▇
rate_pay_process 3705 0.10 5.50 2.72 1 3 6 8.00 10 ▅▅▆▇▃
use_strengths 3800 0.08 7.21 2.49 1 6 8 9.00 10 ▂▂▃▇▇
work_goal 3703 0.10 3.51 0.73 1 3 4 4.00 4 ▁▁▁▃▇
work_stress 3553 0.14 7.56 2.04 1 7 8 9.00 10 ▁▁▃▇▆
manager_inclusion 3695 0.10 3.36 0.81 1 3 4 4.00 4 ▁▁▁▅▇
manager_success 3652 0.12 7.59 2.43 1 6 8 10.00 10 ▁▁▃▅▇
fair_treatment 3365 0.18 8.00 2.29 1 7 9 10.00 10 ▁▁▂▃▇
include_respect 3627 0.12 8.00 2.34 1 7 9 10.00 10 ▁▁▂▃▇
accomplishment_celebrate 3363 0.19 3.17 0.78 1 3 3 4.00 4 ▁▃▁▇▇
control_work 3598 0.13 7.89 1.94 1 7 8 9.00 10 ▁▁▂▇▇
different_opinion 3064 0.26 3.21 0.72 1 3 3 4.00 4 ▁▂▁▇▆
freedom_work 3460 0.16 7.99 2.01 1 7 8 10.00 10 ▁▁▂▆▇
involved_decision 3551 0.14 2.86 0.68 1 3 3 3.00 4 ▁▂▁▇▂
like_work_manager 3284 0.20 4.07 0.73 2 4 4 5.00 5 ▁▂▁▇▃
manager_brings_best 3237 0.22 7.53 2.39 1 6 8 10.00 10 ▁▂▃▅▇
manager_helps 3373 0.18 3.37 0.75 1 3 4 4.00 4 ▁▂▁▆▇
recognition_encouraged 3495 0.15 6.91 2.64 1 5 7 9.00 10 ▂▂▅▆▇
upward_feedback 3258 0.21 7.74 2.24 1 7 8 10.00 10 ▁▁▂▆▇
voice_opinion 3030 0.27 7.39 2.39 1 6 8 10.00 10 ▁▂▃▆▇
work_recognized 3564 0.14 2.77 0.82 1 2 3 3.00 4 ▁▆▁▇▃
recommend_company 3312 0.20 8.20 1.90 1 7 9 10.00 10 ▁▁▂▆▇

Conclusion

This report provides a foundational exploration of the HF 2021 Employee Engagement Dataset, emphasizing data quality, reliability, and initial insights into workplace happiness trends. By cleaning, exploring, and visualizing key aspects of the data, this analysis establishes a robust starting point for deeper investigations into employee engagement.


Insights About the Data

  1. Data Quality and Reliability:
    • Missing Data: Critical demographic fields such as hire dates (38%), birth dates (37%), and gender (61%) showed significant gaps. Additionally, gender data was not missing at random; companies only labeled one gender, leaving the other implied or missing. This inconsistency led to the exclusion of gender from subsequent analyses.
    • Date Anomalies: The data included improbable outliers, such as hiring dates in 1914 and birth dates far in the future. These were corrected to ensure logical consistency (e.g., excluding birth dates after August 11, 2003, to ensure all employees were at least 18 years old).
    • Employee Deletion: Approximately 20% of employees had deletion dates, indicating they were no longer active on the platform.
  2. Variability in Participation:
    • The number of responses per employee varied widely, with a median of 27 responses and a mean of 133. This discrepancy highlights the presence of outliers, including one employee who responded 2,006 times.
  3. Company and Industry Coverage:
    • The dataset spans 147 companies across 16 industries, with “Computer Software and IT Services” being the most represented. Company-level happiness scores followed an approximately normal distribution, suggesting variation in workplace engagement across organizations.

Meaningful Insights

  1. Happiness Trends by Age Group in 2020:
    • Happiness levels dropped significantly after the onset of COVID-19 in March 2020, particularly among employees aged 30-39, a demographic often associated with child-rearing responsibilities.
    • Younger employees (<29 and 30-39) maintained lower-than-pre-pandemic happiness levels throughout 2020.
    • Older employees (50+) experienced an increase in happiness levels, potentially reflecting reduced work-life pressures or greater adaptability to remote work.
  2. Mood Variability:
    • Employee moods fluctuated day-to-day, with an average within-person standard deviation of 0.51 on a 4-point scale. This variability aligns with the transient nature of mood but also highlights between-person differences, likely influenced by personality traits like emotional stability.
  3. Engagement and Workplace Practices:
    • Differences in average happiness scores across companies suggest that HR practices and organizational culture significantly influence employee well-being. Industries and companies with higher average scores likely have more robust engagement and well-being strategies.

Actionable Recommendations

  1. Improve Data Collection:
    • Standardize the collection of demographic data, particularly gender, to ensure inclusivity and accuracy for future analyses.
    • Implement stricter validation for date entries to avoid improbable outliers.
  2. Focus on Younger Employees:
    • Develop targeted strategies to support employees aged <29 and 30-39, particularly in managing work-life balance during challenging times like the pandemic.
  3. Leverage Positive Outliers:
    • Investigate high-scoring companies and industries to identify best practices in promoting workplace happiness.
  4. Longitudinal Studies:
    • Build on this analysis with time-series models to understand how engagement evolves over time and predict future trends.