Introduction

This dataset contains information about employees within an organization, encompassing demographic, employment, and performance-related attributes. It appears to be structured for human resources (HR) analytics purposes, allowing for insights into employee retention, performance evaluation, and organizational diversity. The data combines categorical, numerical, and date variables and can be used to understand workforce patterns, assess performance metrics, and identify areas for HR intervention.

Variable Description

Employee_Name: (Categorical) The name of the employee.
EmpID: (Numeric) A unique identifier assigned to each employee.
MarriedID: (Numeric) Indicates if the employee is married (1 = married, 0 = not married).
MaritalStatusID: (Numeric) Encodes the marital status of employees with unique IDs.
GenderID: (Numeric) Encodes gender (0 = Female, 1 = Male).
EmpStatusID: (Numeric) Employment status represented by unique IDs.
DeptID: (Numeric) Encodes the department of the employee with unique IDs.
PerfScoreID: (Numeric) Encodes performance scores with unique IDs.
FromDiversityJobFairID: (Numeric) Indicates if the employee was recruited through a diversity job fair (1 = yes, 0 = no).
Salary: (Numeric) Annual salary of the employee.
Termd: (Numeric) Indicates if the employee has been terminated (1 = yes, 0 = no).
PositionID: (Numeric) Encodes job position with unique IDs.
Position: (Categorical) The job title of the employee.
State: (Categorical) The state where the employee is located.
Zip: (Categorical) ZIP code of the employee’s residence, stored as text so that leading zeros are preserved.
DOB: (Date) Date of birth of the employee.
Sex: (Categorical) Gender of the employee (e.g., Male, Female).
MaritalDesc: (Categorical) Marital status of the employee (e.g., Married, Single).
CitizenDesc: (Categorical) Citizenship status of the employee (e.g., US Citizen, Non-Citizen).
HispanicLatino: (Categorical) Indicates if the employee identifies as Hispanic/Latino (Yes/No).
RaceDesc: (Categorical) Race or ethnicity of the employee.
DateofHire: (Date) The date the employee was hired.
DateofTermination: (Date) The date the employee was terminated (if applicable).
TermReason: (Categorical) The reason for termination (e.g., Resigned, Terminated for Cause).
EmploymentStatus: (Categorical) Current employment status (e.g., Active, Terminated).
Department: (Categorical) The department where the employee works (e.g., IT, Sales).
ManagerName: (Categorical) Name of the employee’s manager.
ManagerID: (Numeric) A unique identifier for the manager.
RecruitmentSource: (Categorical) The source through which the employee was recruited (e.g., LinkedIn, Job Fair).
PerformanceScore: (Categorical) The employee’s performance evaluation score (e.g., Exceeds, Fully Meets).
EngagementSurvey: (Numeric) A score representing the employee’s engagement level in surveys (typically 1-5 scale).
EmpSatisfaction: (Numeric) Employee satisfaction score (typically 1-5 scale).
SpecialProjectsCount: (Numeric) Number of special projects the employee has been assigned to.
LastPerformanceReview_Date: (Date) The date of the last performance review.
DaysLateLast30: (Numeric) Number of days the employee was late to work in the last 30 days.
Absences: (Numeric) Total number of work absences for the employee.

Data Cleaning

##                        Column NA_Count
## 1               Employee_Name        0
## 2                       EmpID        0
## 3                   MarriedID        0
## 4             MaritalStatusID        0
## 5                    GenderID        0
## 6                 EmpStatusID        0
## 7                      DeptID        0
## 8                 PerfScoreID        0
## 9      FromDiversityJobFairID        0
## 10                     Salary        0
## 11                      Termd        0
## 12                 PositionID        0
## 13                   Position        0
## 14                      State        0
## 15                        Zip        0
## 16                        DOB        0
## 17                        Sex        0
## 18                MaritalDesc        0
## 19                CitizenDesc        0
## 20             HispanicLatino        0
## 21                   RaceDesc        0
## 22                 DateofHire        0
## 23          DateofTermination      207
## 24                 TermReason        0
## 25           EmploymentStatus        0
## 26                 Department        0
## 27                ManagerName        0
## 28                  ManagerID        8
## 29          RecruitmentSource        0
## 30           PerformanceScore        0
## 31           EngagementSurvey        0
## 32            EmpSatisfaction        0
## 33       SpecialProjectsCount        0
## 34 LastPerformanceReview_Date        0
## 35             DaysLateLast30        0
## 36                   Absences        0
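
A table like the one above can be produced by counting missing values per column. The sketch below is only an illustration: it uses a small hypothetical data frame (`toy`) instead of the real data, and only the column names are taken from the data set.

```r
# Hypothetical three-row sample; the values are made up for illustration
toy <- data.frame(
  EmpID             = c(10026, 10084, 10196),
  DateofTermination = as.Date(c(NA, "2016-06-16", "2012-09-24")),
  ManagerID         = c(22, NA, 20)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_table <- data.frame(
  Column    = names(toy),
  NA_Count  = colSums(is.na(toy)),
  row.names = NULL
)
print(na_table)
```

A missing DateofTermination is expected for employees who have not left the company, so those rows do not necessarily need to be dropped or imputed.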

We begin the analysis by loading the data and adding two derived columns: Age, the employee’s age in years, and DaysWorked, the number of days the employee has spent with the company. We then filter the data to keep only rows with a valid age. A number of employees had negative ages, which implies a date of birth after 2024 and therefore points to errors in those rows.
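
The code for this step is not shown in the report, so the following is a minimal base-R sketch of how the two derived columns and the age filter could be computed. The helper `age_years`, the data frame `toy`, and the reference date are assumptions made for illustration; the reference date 2025-01-12 was chosen only because it reproduces the Age and DaysWorked values printed for the first two employees further down. One possible cause of future birth dates, which the report does not confirm, is two-digit birth years: R’s `%y` format maps 00 to 68 onto 2000 to 2068.

```r
# Hypothetical sample rows; the second row has an impossible birth date
toy <- data.frame(
  DOB               = as.Date(c("1983-07-10", "2068-01-15", "1975-05-05")),
  DateofHire        = as.Date(c("2011-07-05", "2014-02-01", "2015-03-30")),
  DateofTermination = as.Date(c(NA, NA, "2016-06-16"))
)
ref_date <- as.Date("2025-01-12")  # assumed "today"; not stated in the report

# Age in completed years: subtract one if the birthday has not yet occurred
age_years <- function(dob, ref) {
  had_birthday <- format(ref, "%m%d") >= format(dob, "%m%d")
  as.integer(format(ref, "%Y")) - as.integer(format(dob, "%Y")) -
    ifelse(had_birthday, 0L, 1L)
}
toy$Age <- age_years(toy$DOB, ref_date)

# Tenure ends at the termination date, or at the reference date if still employed
end_date <- toy$DateofTermination
end_date[is.na(end_date)] <- ref_date
toy$DaysWorked <- as.numeric(end_date - toy$DateofHire)

# Keep only rows with a non-negative age
clean <- toy[toy$Age >= 0, ]
print(clean)
```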

factor_columns <- c(
  "Employee_Name",        # Name (could be treated as categorical if needed)
  "MarriedID",            # Binary indicator for marital status
  "MaritalStatusID",      # Encoded marital status
  "GenderID",             # Encoded gender
  "EmpStatusID",          # Employment status ID
  "DeptID",               # Department ID
  "PerfScoreID",          # Performance score ID
  "FromDiversityJobFairID", # Binary indicator for diversity job fair
  "Termd",                # Binary indicator for termination
  "PositionID",           # Encoded position
  "Position",             # Position title
  "State",                # State location
  "Sex",                  # Gender
  "MaritalDesc",          # Marital status description
  "CitizenDesc",          # Citizenship status
  "HispanicLatino",       # Binary indicator for Hispanic/Latino
  "RaceDesc",             # Race/ethnicity
  "TermReason",           # Termination reason
  "EmploymentStatus",     # Employment status
  "Department",           # Department name
  "RecruitmentSource"    # Recruitment source
)

# Convert specified columns to factors
hr_data <- datos %>%
  mutate(across(all_of(factor_columns), as.factor))

# Check structure of the dataset to confirm factorization
str(hr_data)

## tibble [269 × 38] (S3: tbl_df/tbl/data.frame)
##  $ Employee_Name             : Factor w/ 269 levels "Adinolfi, Wilson  K",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ EmpID                     : num [1:269] 10026 10084 10196 10088 10069 ...
##  $ MarriedID                 : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 1 1 ...
##  $ MaritalStatusID           : Factor w/ 5 levels "0","1","2","3",..: 1 2 2 2 3 1 1 5 1 3 ...
##  $ GenderID                  : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 2 1 2 ...
##  $ EmpStatusID               : Factor w/ 5 levels "1","2","3","4",..: 1 5 5 1 5 1 1 1 3 1 ...
##  $ DeptID                    : Factor w/ 5 levels "1","3","4","5",..: 4 2 4 4 4 4 3 4 4 2 ...
##  $ PerfScoreID               : Factor w/ 4 levels "1","2","3","4": 4 3 3 3 3 4 3 3 3 3 ...
##  $ FromDiversityJobFairID    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
##  $ Salary                    : num [1:269] 62506 104437 64955 64991 50825 ...
##  $ Termd                     : Factor w/ 2 levels "0","1": 1 2 2 1 2 1 1 1 1 1 ...
##  $ PositionID                : Factor w/ 25 levels "1","2","3","4",..: 16 23 17 16 16 16 21 16 16 13 ...
##  $ Position                  : Factor w/ 27 levels "Accountant I",..: 19 26 20 19 19 19 24 19 19 15 ...
##  $ State                     : Factor w/ 23 levels "AL","AZ","CA",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Zip                       : chr [1:269] "01960" "02148" "01810" "01886" ...
##  $ DOB                       : Date[1:269], format: "1983-07-10" "1975-05-05" ...
##  $ Sex                       : Factor w/ 2 levels "F","M": 2 2 1 1 1 1 1 2 1 2 ...
##  $ MaritalDesc               : Factor w/ 5 levels "Divorced","Married",..: 4 2 2 2 1 4 4 5 4 1 ...
##  $ CitizenDesc               : Factor w/ 3 levels "Eligible NonCitizen",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ HispanicLatino            : Factor w/ 4 levels "no","No","yes",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RaceDesc                  : Factor w/ 6 levels "American Indian or Alaska Native",..: 6 6 6 6 6 6 6 6 3 6 ...
##  $ DateofHire                : Date[1:269], format: "2011-07-05" "2015-03-30" ...
##  $ DateofTermination         : Date[1:269], format: NA "2016-06-16" ...
##  $ TermReason                : Factor w/ 17 levels "Another position",..: 11 3 6 11 16 11 11 11 11 11 ...
##  $ EmploymentStatus          : Factor w/ 3 levels "Active","Terminated for Cause",..: 1 3 3 1 3 1 1 1 1 1 ...
##  $ Department                : Factor w/ 5 levels "Admin Offices",..: 3 2 3 3 3 3 5 3 3 2 ...
##  $ ManagerName               : chr [1:269] "Michael Albert" "Simon Roup" "Kissy Sullivan" "Elijiah Gray" ...
##  $ ManagerID                 : num [1:269] 22 4 20 16 39 11 10 19 12 7 ...
##  $ RecruitmentSource         : Factor w/ 9 levels "CareerBuilder",..: 6 5 6 5 4 6 6 3 2 5 ...
##  $ PerformanceScore          : chr [1:269] "Exceeds" "Fully Meets" "Fully Meets" "Fully Meets" ...
##  $ EngagementSurvey          : num [1:269] 4.6 4.96 3.02 4.84 5 5 3.04 5 4.46 5 ...
##  $ EmpSatisfaction           : num [1:269] 5 3 3 5 4 5 3 4 3 5 ...
##  $ SpecialProjectsCount      : num [1:269] 0 6 0 0 0 0 4 0 0 6 ...
##  $ LastPerformanceReview_Date: chr [1:269] "1/17/2019" "2/24/2016" "5/15/2012" "1/3/2019" ...
##  $ DaysLateLast30            : num [1:269] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Absences                  : num [1:269] 1 17 3 15 2 15 19 19 4 16 ...
##  $ Age                       : int [1:269] 41 49 36 36 35 47 45 41 54 37 ...
##  $ DaysWorked                : num [1:269] 4940 444 447 6215 1884 ...

hr_data

## # A tibble: 269 × 38
##    Employee_Name     EmpID MarriedID MaritalStatusID GenderID EmpStatusID DeptID
##    <fct>             <dbl> <fct>     <fct>           <fct>    <fct>       <fct> 
##  1 Adinolfi, Wilson… 10026 0         0               1        1           5     
##  2 Ait Sidi, Karthi… 10084 1         1               1        5           3     
##  3 Akinkuolie, Sarah 10196 1         1               0        5           5     
##  4 Alagbe,Trina      10088 1         1               0        1           5     
##  5 Anderson, Carol   10069 0         2               0        5           5     
##  6 Anderson, Linda   10002 0         0               0        1           5     
##  7 Andreola, Colby   10194 0         0               0        1           4     
##  8 Athwal, Sam       10062 0         4               1        1           5     
##  9 Bachiochi, Linda  10114 0         0               0        3           5     
## 10 Bacong, Alejandro 10250 0         2               1        1           3     
## # ℹ 259 more rows
## # ℹ 31 more variables: PerfScoreID <fct>, FromDiversityJobFairID <fct>,
## #   Salary <dbl>, Termd <fct>, PositionID <fct>, Position <fct>, State <fct>,
## #   Zip <chr>, DOB <date>, Sex <fct>, MaritalDesc <fct>, CitizenDesc <fct>,
## #   HispanicLatino <fct>, RaceDesc <fct>, DateofHire <date>,
## #   DateofTermination <date>, TermReason <fct>, EmploymentStatus <fct>,
## #   Department <fct>, ManagerName <chr>, ManagerID <dbl>, …
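
The structure output above shows two issues that may be worth handling before analysis: HispanicLatino has four levels that differ only in capitalization, and LastPerformanceReview_Date is still stored as text. The sketch below shows one possible way to handle both on hypothetical sample values; `normalise_case` is an illustrative helper, not part of the original analysis.

```r
# Hypothetical sample values mimicking the inconsistent capitalization
hispanic <- factor(c("No", "no", "Yes", "yes", "No"))

# Capitalize the first letter and lower-case the rest
normalise_case <- function(x) {
  x <- trimws(as.character(x))
  paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
}
hispanic_clean <- factor(normalise_case(hispanic))
levels(hispanic_clean)

# Parse month/day/year strings such as "1/17/2019" into Date values
review_date <- as.Date(c("1/17/2019", "2/24/2016"), format = "%m/%d/%Y")
review_date
```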

Descriptive Analytics

We now explore the composition of the company and its demographics by posing a series of questions and answering each with a visualization.

1) What is the overall distribution of employee satisfaction?

satisfaction_counts <- hr_data %>%
  group_by(EmpSatisfaction) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)


# Plot with percentage labels
ggplot(satisfaction_counts, aes(x = EmpSatisfaction, y = Count, fill = factor(EmpSatisfaction))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), vjust = -0.5, size = 4) + # Add percentage labels
  labs(
    x = "Satisfaction",
    y = "Count",
    fill = "Satisfaction Rating",
    title = "Employee Satisfaction Distribution"
  ) +
  scale_fill_brewer(palette = "Paired")+
  theme_minimal()+
  ylim(0,100)

This bar chart shows the distribution of employee satisfaction ratings on a scale from 1 to 5, with percentages displayed above each bar. Almost all employees rate their satisfaction at 3 or above: 32.3% give a rating of 3, 31.6% a rating of 4, and 32.7% a rating of 5. Only a small proportion of employees report lower satisfaction, with 2.6% giving a rating of 2 and just 0.7% giving a rating of 1. Overall, employee satisfaction is skewed toward higher ratings, suggesting a generally positive sentiment among employees.

2) Company Demographic Proportions

gndr_counts <- hr_data %>%
  group_by(Sex) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)

# Plot with percentage labels
j1 = ggplot(gndr_counts, aes(x = factor(Sex), y = Count, fill = factor(Sex))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), vjust = -0.5, size = 4) + # Add percentage labels
  labs(
    x = "Sex",
    y = "Count",
    title = "Employee Numbers by Gender"
  ) +
  scale_fill_manual(values = c('pink','darkblue'))+
  theme_minimal()+
  theme(legend.position = "none")+
  ylim(0,180)

race_counts <- hr_data %>%
  group_by(RaceDesc) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)

# Plot with percentage labels
j2 = ggplot(race_counts, aes(x = factor(RaceDesc), y = Count, fill = factor(RaceDesc))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), vjust = -0.5, size = 4) + # Add percentage labels
  labs(
    x = "Race",
    y = "Count",
    title = "Employee Numbers by Race"
  ) +
  theme_minimal()+
  theme(legend.position = "none")+
  ylim(0,180)

grid.arrange(j1, j2,  nrow = 2)

The charts display the distribution of employees by gender and race. In terms of gender, 55.4% of employees are female, while 44.6% are male, indicating a slight majority of female employees. Regarding race, the largest proportion of employees, 59.5%, are White, followed by 25.3% who identify as Black or African American. Asian employees make up 9.7%, while smaller proportions include individuals identifying as American Indian or Alaska Native (1.1%), Two or More Races (4.1%), and Hispanic (0.4%). These distributions highlight the gender balance and racial diversity within the workforce, with certain racial groups being more represented than others.

We can expand on these demographics by also looking at satisfaction levels within each group.

3) How does employee satisfaction vary by gender and race?

p1 = ggplot(hr_data, aes(x = EmpSatisfaction, fill = factor(Sex))) +
  geom_bar() +
  labs(
    x = "Employee Satisfaction Rating",
    y = "Count",
    title = "Employee Satisfaction by Gender"
  )  +
  # after_stat(count) replaces the ..count.. notation deprecated in ggplot2 3.4.0
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -.5) +
  theme_minimal()+
  theme(legend.position = "none",
        axis.title.x = element_blank(),   # Hide the x-axis title on the top panel
        strip.background = element_rect(fill = "gray90"))+
  scale_fill_manual(values = c('pink','darkblue'))+
  facet_wrap(~ Sex)+
  ylim(0,80)

p2 = ggplot(hr_data, aes(x = EmpSatisfaction, fill = factor(RaceDesc))) +
  geom_bar()+
  labs(
    x = "Employee Satisfaction Rating",
    y = "Count",
    title = "Employee Satisfaction by Race"
  )  +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -.5) +
  theme_minimal()+
  theme(legend.position = "none",
        strip.background = element_rect(fill = "gray90"))+
  facet_wrap(~ RaceDesc)+
  ylim(0,100)

# Stack the two panels vertically
grid.arrange(p1, p2,  nrow = 2)

Among genders, ratings for women are concentrated at 3, 4, and 5, with 45, 52, and 47 employees, respectively, and there are very few ratings of 1 or 2. Ratings for men follow the same pattern, with 42, 33, and 41 employees at ratings 3, 4, and 5. Men have lower counts in each of these categories, most visibly at rating 4, but they also make up a smaller share of the workforce (44.6%), so differences in raw counts should not be read directly as differences in satisfaction.

When analyzed by race, White employees dominate the dataset, with most reporting higher satisfaction levels (ratings of 3, 4, and 5). Black or African American employees also cluster around the mid to high satisfaction ratings (3, 4, and 5), although their counts are smaller than White employees. Other racial groups like Asian, Two or More Races, and American Indian or Alaska Native have relatively fewer employees, with their satisfaction ratings distributed sparsely. Hispanics are underrepresented, with only one employee in the dataset reporting a satisfaction rating of 3.

The data suggests that the workforce has higher overall satisfaction levels (3, 4, and 5) across all demographics, but the representation of different races is uneven, with some groups being significantly underrepresented.

4) Which departments have the highest and lowest performance scores?

ggplot(hr_data, aes(x = PerfScoreID, fill = factor(Department))) +
  geom_bar() +
  labs(
    x = "Performance Ratings",
    y = "Count",
    title = "Performance Score Distribution"
  )  +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -.5) +
  theme_minimal()+
  theme(legend.position = "none")+
  facet_wrap(~ Department)+
  ylim(0,150)

The performance score distribution by department reveals interesting trends across different teams. The Production department has the largest representation, with a majority of employees receiving a performance rating of 3, which corresponds to “Fully Meets” in PerformanceScore. A smaller group in this department achieves the top rating of 4 (“Exceeds”), while a few fall into the lower ratings (1 and 2).

The IT/IS department also exhibits a strong concentration in performance rating 3, though it has fewer employees overall compared to Production. A handful of IT/IS employees have a higher rating of 4 or lower ratings of 1 and 2.

The Software Engineering department shows a similar trend, with most employees receiving a performance rating of 3, but there is notable representation in the high-performance category (rating 4). Lower performance ratings are rare in this department.

The Admin Offices and Sales departments have minimal representation, with all employees in these areas achieving performance ratings of 3.

Overall, the data suggests that the majority of employees across departments fully meet expectations (rating 3), with relatively few employees in either the high- or low-performance categories. This could highlight the need for interventions in underperforming areas or reward structures for high performers.

5) How does salary vary in general?

ggplot(hr_data, aes(x = Salary)) +
  geom_boxplot(fill = "skyblue") +
  theme_minimal() +
  labs(
    title = "Distribution of Employee Salaries",
    x = "Salary in $",
    y = ""
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_blank())

This boxplot visualizes the distribution of salaries within the dataset. The majority of salaries are concentrated around the median value, which is approximately $50,000. However, there are several significant outliers extending beyond $100,000, with a few reaching up to $200,000. These outliers indicate a small group of employees earning substantially higher salaries compared to the rest of the workforce. The narrow interquartile range suggests limited salary variation among the majority of employees, but the presence of these outliers points to possible managerial or specialized roles with higher pay.
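
The points drawn as outliers in a boxplot follow the 1.5 × IQR rule, which is the default in geom_boxplot. The sketch below applies that rule by hand. The salary values are hypothetical and chosen only to illustrate the calculation; they are not taken from the data set.

```r
# Hypothetical salaries (not from the data set)
salary <- c(45000, 48000, 50000, 52000, 55000, 58000,
            60000, 62000, 65000, 70000, 150000, 220000)

# Quartiles and interquartile range
q   <- quantile(salary, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences: points beyond these are flagged as outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

outliers <- salary[salary < lower_fence | salary > upper_fence]
print(outliers)
```

Listing the flagged rows together with Position or Department would be one way to check whether the high salaries really belong to managerial or specialized roles.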

6) How do salaries vary by department?

ggplot(hr_data, aes(Salary, fill = Department)) +
  geom_density(alpha = 0.5) +
  labs(title = "Salary Distribution by Department",
       x = "Salary",
       y = "Density")+
  theme_minimal() +
  theme(axis.text.y = element_blank())

This density plot highlights salary distributions across departments, showing distinct trends. The Production department has a concentrated salary distribution at lower ranges, indicating uniform, lower pay scales. In contrast, Software Engineering and IT/IS exhibit peaks in higher salary ranges, suggesting these departments have better compensation. Sales and Admin Offices display broader distributions, reflecting greater variability in pay, with Admin Offices showing a smaller peak at higher salaries, likely for managerial roles. While there is some overlap in salary ranges among IT/IS, Software Engineering, and Sales, the peaks reveal department-specific salary structures, emphasizing clear differences in compensation trends.

7) How does salary vary with age and time worked?

w1 = ggplot(hr_data, aes(x= Age, y = Salary))+
  geom_point()+
  geom_smooth(color = "darkgreen")+
  labs(title = "Salary by Age",
         x = "Age",
         y = "Salary") +
  theme_minimal()

w2 =ggplot(hr_data, aes(x= DaysWorked, y = Salary))+
  geom_point()+
  geom_smooth(color = "darkblue")+
  labs(title = "Salary by Time with Company",
         x = "Days with Company",
         y = "Salary") +
  theme_minimal()

grid.arrange(w1, w2, nrow=2)

These plots examine the relationship between salary and two variables: age and tenure with the company. In the “Salary by Age” chart, there is a slight upward trend in salaries with increasing age, indicating that more experienced or senior employees may earn slightly higher salaries, although the overall increase is modest. The “Salary by Time with Company” chart shows a different pattern, with salaries remaining relatively constant regardless of the length of tenure, except for a minor upward trend among employees with the longest tenure. These patterns suggest that salary may depend more on other factors, such as market value or role-specific qualifications, than on age or company tenure alone.

8) Which variables are most strongly associated with salary?

numeric_cols = hr_data[, sapply(hr_data, is.numeric)]%>%
  dplyr::select(-ManagerID, -EmpID)

# Correlation matrix
cor_matrix = cor(numeric_cols)

# Correlation visualization
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.6, addCoef.col = "black", number.cex = .4)

This correlation matrix highlights relationships between variables in the dataset. Notably, there is a moderate positive correlation (0.60) between Salary and SpecialProjectsCount, suggesting that employees involved in more special projects tend to earn higher salaries. EngagementSurvey negatively correlates with DaysLateLast30 (-0.57), indicating that higher employee engagement is associated with fewer late arrivals in the last 30 days. The weak correlations between Salary and other variables such as Age or DaysWorked show little linear association between these factors and salary. Additionally, EmpSatisfaction shows a slight negative correlation (-0.22) with DaysLateLast30, hinting that employees with higher satisfaction levels may be more punctual. Because correlations describe association rather than causation, these results are best read as pointers to factors worth investigating further, such as special project involvement and engagement.
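
To judge which of these correlations are distinguishable from zero, it helps to know the smallest absolute correlation that would be significant for a sample of this size. The sketch below derives that threshold from the t distribution. It assumes a two-sided Pearson test at the 5% level and that all 269 rows enter each correlation; neither assumption is stated in the report.

```r
n  <- 269       # number of employees after filtering
df <- n - 2     # degrees of freedom for a Pearson correlation test

# Critical t value for a two-sided test at the 5% level
t_crit <- qt(0.975, df)

# Convert to a critical correlation using t = r * sqrt(df) / sqrt(1 - r^2)
r_crit <- t_crit / sqrt(t_crit^2 + df)
round(r_crit, 3)

# Correlations quoted in the text, compared against the threshold
quoted <- c(Salary_SpecialProjects = 0.60,
            Engagement_DaysLate    = -0.57,
            Satisfaction_DaysLate  = -0.22)
abs(quoted) > r_crit
```

Under these assumptions, any correlation whose absolute value is clearly above roughly 0.12 would be statistically significant, which is a separate question from whether it is large enough to matter in practice.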

Principal Component Analysis (PCA)

We will use PCA to analyze underlying factors that may affect the distribution of our data. The goal of this section is to determine whether there are significant latent factors that explain the variance in our variables.

Feature Engineering

pca_data <- datos %>%
  dplyr::select(Salary, EngagementSurvey, EmpSatisfaction, SpecialProjectsCount, DaysLateLast30, Absences, PerfScoreID, DeptID, DaysWorked, Age ) %>%
  na.omit()

# Run PCA on standardized variables (centered and scaled to unit variance)
pca_result = prcomp(pca_data, scale. = TRUE)
pca_result_2 = PCA(pca_data, scale.unit = TRUE, graph = FALSE)
summary(pca_result)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.6045 1.4424 1.0716 1.0002 0.96517 0.91422 0.71118
## Proportion of Variance 0.2575 0.2080 0.1148 0.1000 0.09316 0.08358 0.05058
## Cumulative Proportion  0.2575 0.4655 0.5803 0.6804 0.77352 0.85710 0.90767
##                            PC8     PC9    PC10
## Standard deviation     0.70005 0.50784 0.41866
## Proportion of Variance 0.04901 0.02579 0.01753
## Cumulative Proportion  0.95668 0.98247 1.00000

We will only be looking at numerical values for this analysis, removing any variables that act as labels with no numerical meaning, such as employee ID.

Principal Component Analysis

fviz_eig(pca_result, addlabels = TRUE, main = "Explained Variance by PCA Components")+
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  geom_line(aes(group = 1), color = "black")+
  theme_minimal()

This scree plot displays the proportion of variance explained by each principal component (PC). The first two components, PC1 and PC2, together explain 46.5% of the variance (25.7% and 20.8%, respectively), so they capture a significant portion of the dataset’s information. Subsequent components contribute progressively less: PC3 to PC6 each explain between 8.4% and 11.5%, and PC7 to PC10 each explain about 5% or less. Dimensionality reduction using the first few principal components is therefore feasible, although the first two components alone leave more than half of the variance unexplained.
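
The percentages in the scree plot follow directly from the standard deviations printed by summary(pca_result): each component’s variance is its squared standard deviation, divided by the total. The sketch below reproduces the calculation from the printed values. The 80% threshold is an arbitrary choice made here for illustration, not a rule used in the report.

```r
# Standard deviations copied from the summary(pca_result) output
sdev <- c(1.6045, 1.4424, 1.0716, 1.0002, 0.96517,
          0.91422, 0.71118, 0.70005, 0.50784, 0.41866)

eigenvalues <- sdev^2                           # variance of each component
prop_var    <- eigenvalues / sum(eigenvalues)   # proportion of variance
cum_var     <- cumsum(prop_var)                 # cumulative proportion

round(prop_var[1:2], 4)   # PC1 and PC2
round(cum_var[2], 4)      # PC1 + PC2 together

# Number of components needed to reach an (arbitrary) 80% threshold
n_80 <- which(cum_var >= 0.80)[1]
n_80
```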

p1 = fviz_contrib(pca_result, choice = "var", axes = 1) + 
  ggtitle("Contributions to PC1")+
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  theme_minimal() + theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10)
    )+
      labs(x = "Column Name")


p2  = fviz_contrib(pca_result, choice = "var", axes = 2) + 
  ggtitle("Contributions to PC2")+
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  theme_minimal() + theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10)
    )+
      labs(x = "Column Name")

grid.arrange(p1, p2, ncol = 2)

The bar plots show how much each variable contributes to Principal Component 1 (PC1) and Principal Component 2 (PC2).

For PC1, the top contributing features are PerfScoreID and DaysLateLast30, each contributing a little over 20% of the variance explained by PC1, followed by EngagementSurvey at roughly 15%. These features likely capture performance-related trends and behaviors. Other notable contributors include DeptID, SpecialProjectsCount, and Salary, suggesting organizational and compensation factors also play a role.

For PC2, SpecialProjectsCount, DeptID, and Salary are the most significant contributors, with SpecialProjectsCount leading the variance explanation. This suggests that PC2 captures a different dimension of variability, possibly emphasizing the interplay between project involvement and departmental roles.

EngagementSurvey, DaysLateLast30, and PerfScoreID also contribute meaningfully to PC2, but to a lesser extent. The lowest contributors to both components are Age, DaysWorked, and Absences, indicating these variables have minimal influence on the primary axes of variability in the data. This highlights the dominant role of performance, engagement, and departmental factors in shaping the structure of the dataset.
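
The contribution of a variable to a component is its cos² on that component divided by the sum of cos² over all variables, expressed as a percentage. The sketch below applies this to the Dim.1 column of the rounded cos² table printed in this report, so the results are approximate. With ten variables, a contribution above 10% is larger than what a uniform split would give.

```r
# Dim.1 cos2 values copied from the rounded cos2 table in this report
cos2_dim1 <- c(Salary = 0.28, EngagementSurvey = 0.39, EmpSatisfaction = 0.13,
               SpecialProjectsCount = 0.32, DaysLateLast30 = 0.54,
               Absences = 0.01, PerfScoreID = 0.57, DeptID = 0.32,
               DaysWorked = 0.02, Age = 0.01)

# Contribution (%) = share of the component's total cos2
contrib_dim1 <- 100 * cos2_dim1 / sum(cos2_dim1)
round(sort(contrib_dim1, decreasing = TRUE), 1)

# Reference level: contribution expected if all variables contributed equally
expected <- 100 / length(cos2_dim1)
names(contrib_dim1)[contrib_dim1 > expected]
```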

x.f <- factanal(pca_data, 3, scores="Bartlett", rotation="varimax")
x.f

## 
## Call:
## factanal(x = pca_data, factors = 3, scores = "Bartlett", rotation = "varimax")
## 
## Uniquenesses:
##               Salary     EngagementSurvey      EmpSatisfaction 
##                0.493                0.595                0.816 
## SpecialProjectsCount       DaysLateLast30             Absences 
##                0.039                0.213                0.950 
##          PerfScoreID               DeptID           DaysWorked 
##                0.219                0.338                0.982 
##                  Age 
##                0.967 
## 
## Loadings:
##                      Factor1 Factor2 Factor3
## Salary                0.649           0.293 
## EngagementSurvey              0.630         
## EmpSatisfaction               0.225   0.364 
## SpecialProjectsCount  0.973          -0.120 
## DaysLateLast30               -0.882         
## Absences                              0.222 
## PerfScoreID                   0.788   0.390 
## DeptID               -0.790  -0.106   0.161 
## DaysWorked                    0.133         
## Age                                   0.153 
## 
##                Factor1 Factor2 Factor3
## SS loadings      2.009   1.883   0.495
## Proportion Var   0.201   0.188   0.050
## Cumulative Var   0.201   0.389   0.439
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 16.57 on 18 degrees of freedom.
## The p-value is 0.553

From the factor analysis output, we can make the following interpretations about the three factors:

Factor 1:
- The highest loadings are observed for SpecialProjectsCount (0.973), DeptID (-0.790), and Salary (0.649).
- This factor likely represents Project Involvement and Compensation: employees assigned to more special projects tend to earn more, and both are tied to department. Because DeptID is a nominal department code, the sign of its loading has no substantive meaning.
Factor 2:
- The dominant contributors are DaysLateLast30 (-0.882), PerfScoreID (0.788), and EngagementSurvey (0.630).
- This factor may reflect Performance and Workplace Discipline, as higher performance scores and engagement go together with fewer late days.
Factor 3:
- The notable contributors are PerfScoreID (0.390), EmpSatisfaction (0.364), and Absences (0.222); all loadings are below 0.4.
- This factor appears to relate to Employee Satisfaction and Well-Being, emphasizing the relationship between satisfaction, performance, and attendance, although its weak loadings call for a cautious interpretation.

Variance Explained:

Factor 1 accounts for the largest proportion of variance (20.1%), highlighting the importance of project involvement and compensation in the dataset.
Factor 2 explains 18.8% of the variance, indicating the significance of performance and workplace discipline.
Factor 3 explains 5.0% of the variance, focusing on employee well-being and satisfaction.

The cumulative variance explained is 43.9%, so the three factors capture less than half of the dataset’s variability, and most of the variation in Absences, DaysWorked, and Age (uniquenesses of 0.95 or higher) remains unexplained. The hypothesis test nonetheless does not reject a three-factor model (chi-square = 16.57 on 18 degrees of freedom, p = 0.553).
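
The summary figures in the factanal output can be checked with a few lines of arithmetic. The sketch below uses only numbers copied from the output above: the proportion of variance is each factor’s sum of squared loadings divided by the number of variables, the p-value is the upper tail of a chi-square distribution, and a variable’s communality is one minus its uniqueness.

```r
# SS loadings copied from the factanal output
ss_loadings <- c(Factor1 = 2.009, Factor2 = 1.883, Factor3 = 0.495)
n_vars <- 10   # number of variables in pca_data

# Proportion of total variance per factor, and the running total
prop_var <- ss_loadings / n_vars
cum_var  <- cumsum(prop_var)
prop_var
cum_var

# p-value of the test that 3 factors are sufficient
p_value <- pchisq(16.57, df = 18, lower.tail = FALSE)
p_value

# Communality = 1 - uniqueness, e.g. for SpecialProjectsCount and Absences
communality <- 1 - c(SpecialProjectsCount = 0.039, Absences = 0.950)
communality
```

A large p-value here means the three-factor model is not rejected; it does not mean the factors explain most of the variance, which is why the two figures should be read together.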

We can observe similar relationships with variables and latent factors:

# PCA variable contributions
var = get_pca_var(pca_result)

# Round values to 2 decimal places
rounded_coord <- round(var$coord, 2)
rounded_contrib <- round(var$contrib, 2)
rounded_cos2 <- round(var$cos2, 2)

rounded_cos2

##                      Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9
## Salary                0.28  0.30  0.02  0.00  0.10  0.00  0.19  0.08  0.00
## EngagementSurvey      0.39  0.19  0.01  0.01  0.07  0.01  0.05  0.27  0.00
## EmpSatisfaction       0.13  0.09  0.10  0.14  0.08  0.41  0.05  0.00  0.00
## SpecialProjectsCount  0.32  0.55  0.00  0.00  0.00  0.00  0.02  0.01  0.01
## DaysLateLast30        0.54  0.21  0.02  0.00  0.03  0.01  0.01  0.07  0.11
## Absences              0.01  0.00  0.53  0.13  0.01  0.31  0.01  0.00  0.00
## PerfScoreID           0.57  0.21  0.00  0.00  0.00  0.00  0.06  0.04  0.10
## DeptID                0.32  0.47  0.00  0.01  0.03  0.00  0.07  0.02  0.03
## DaysWorked            0.02  0.01  0.38  0.08  0.40  0.08  0.02  0.00  0.00
## Age                   0.01  0.04  0.08  0.63  0.22  0.00  0.02  0.00  0.00
##                      Dim.10
## Salary                 0.01
## EngagementSurvey       0.00
## EmpSatisfaction        0.00
## SpecialProjectsCount   0.09
## DaysLateLast30         0.01
## Absences               0.00
## PerfScoreID            0.01
## DeptID                 0.06
## DaysWorked             0.00
## Age                    0.00

# Correlation Plot of cos2 values
corrplot(var$cos2, is.corr = FALSE)

Dimension 4:

Age has by far its highest cos² value here (0.63), with only small values for EmpSatisfaction (0.14) and Absences (0.13). This suggests that Dimension 4 mainly captures age-related variation, such as generational differences, that is largely separate from the productivity and performance metrics on the first two dimensions.

Dimension 5:

DaysWorked has its highest cos² value here (0.40), followed by Age (0.22). This could indicate that Dimension 5 captures differences in employee tenure that are partly tied to age and have limited overlap with the broader performance indicators.

Dimension 6:

EmpSatisfaction has its highest cos² value here (0.41), together with Absences (0.31). Dimension 6 may therefore reflect a link between satisfaction and attendance behavior. Absences itself is best represented on Dimension 3 (0.53).
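To make the cos² values easier to read, it helps to see where they come from. For a PCA on standardized variables, a variable's coordinate on a component is its correlation with that component, and cos² is the square of that coordinate, so each row of the cos² table sums to 1 across all dimensions. The sketch below reproduces this with base R's `prcomp()` on the built-in `USArrests` data (not the HR dataset). We assume this matches what `get_pca_var()` reports.

```r
# Illustration on the built-in USArrests data, not the HR dataset.
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: loadings scaled by each component's standard deviation.
coord <- sweep(pca_demo$rotation, 2, pca_demo$sdev, `*`)

# cos2: squared coordinates, i.e. quality of representation per dimension.
cos2 <- coord^2
round(cos2, 2)

# Each variable's cos2 values add up to 1 over all components.
rowSums(cos2)
```

A variable is therefore "well represented" on a dimension when a large share of its row total falls in that column.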

For the sake of visualizing we will look at only our first two principal components:

Visualization and Insight

fviz_pca_var(pca_result_2, col.var = "contrib", gradient.cols = c("blue", "green", "red"), repel = T)

The graph above shows the plane spanned by our first two principal components and the direction in which each variable pulls individuals on it. We can add our employees to see where they fall on this plane:

fviz_pca_biplot(pca_result_2, geom = "point", repel = TRUE, col.var = "black", col.ind = "lightblue")

With each of our employees now plotted on the principal component plane, we can extend this analysis and see how certain variables affect the spread of our employees on this plane, namely Salary, Performance Scores, and Employee Satisfaction.

1. PCA by Performance Rating:

# Add each employee's coordinates on the principal components
pca_plot <- as.data.frame(pca_result_2$ind$coord)

# Attach the remaining employee attributes (rows must be in the same order)
pca_plot <- pca_plot %>%
  bind_cols(datos %>% 
              dplyr::select(-c(Employee_Name, EmpID)))

ggplot(pca_plot, aes(x = Dim.1, y = Dim.2, color = PerformanceScore)) +
  geom_point() +
  ggtitle("PCA Results by Performance Rating") +
  labs(x = "Principal Component 1", y = "Principal Component 2", color = "Performance") +
  theme_minimal()

This plot illustrates distinct clusters for different performance categories, such as “Exceeds,” “Fully Meets,” “Needs Improvement,” and “PIP.” The separation indicates that the principal components capture meaningful variance related to performance ratings. Employees with “Exceeds” and “Fully Meets” ratings form dominant clusters, while “PIP” and “Needs Improvement” are more scattered.

2. PCA by Salary:

ggplot(pca_plot, aes(x = Dim.1, y = Dim.2, color = Salary))+
  geom_point()+
  ggtitle("PCA Results by Salary") +
  labs(x = "Principal Component 1", y = "Principal Component 2") +
  scale_color_gradientn(colors = c("red", "blue", "green"))+ 
  theme_minimal()

The second plot reveals how salary aligns with the principal components. Higher salaries are predominantly associated with one cluster, while lower salaries are distributed across the other clusters. This suggests a potential link between salary and underlying factors in the PCA dimensions, such as performance, experience, or job role.

3. PCA by Satisfaction Level:

ggplot(pca_plot, aes(x = Dim.1, y = Dim.2, color = factor(EmpSatisfaction)))+
  geom_point()+
  ggtitle("PCA Results by Satisfaction Level") +
  labs(x = "Principal Component 1", y = "Principal Component 2",
       color = "Employee Satisfaction") + 
  theme_minimal()

This plot indicates clusters based on employee satisfaction levels. Satisfaction ratings of 4 and 5 are grouped closely, implying similarities in factors influencing high satisfaction. Lower satisfaction ratings (e.g., 1 and 2) show a broader spread, reflecting diverse underlying reasons for dissatisfaction.

Overall PC analysis

The analysis reveals clear patterns and relationships in the dataset. High-performing, well-compensated employees tend to exhibit higher satisfaction and form cohesive clusters. In contrast, employees with lower performance ratings, salaries, or satisfaction are more dispersed, reflecting diverse challenges and experiences. These insights emphasize the value of targeted interventions, such as rewarding high performers, addressing salary disparities, and resolving dissatisfaction among outliers, to enhance organizational efficiency and employee well-being.

Clustering Analysis

From our initial PCA plot we can already see the data begin to separate into what appear to be distinct groups. We will now use clustering algorithms to derive any latent cluster groups within our data and make sense of them.

Validating Cluster Methods

pca_data_scale <- scale(pca_data)
hopkins_stat = hopkins(pca_data_scale, m=90)
hopkins_stat

## [1] 0.9998572

The Hopkins statistic for this dataset is exceptionally high, indicating a very strong clustering tendency. This suggests that the data naturally forms distinct groups or clusters. Therefore, applying clustering algorithms, such as K-means or hierarchical clustering, is highly appropriate and likely to yield meaningful results. This insight aligns with prior PCA-based visualizations, which also showed clear groupings among variables like performance, salary, and satisfaction.
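For intuition about what this statistic measures, the sketch below implements a Hopkins-type statistic in base R. It compares nearest-neighbor distances from uniformly placed artificial points (u) with those from sampled real points (w). This is not the implementation behind the `hopkins()` call above: the function name `hopkins_sketch` and the toy data are hypothetical, distances are raised to the power d (the number of columns) where some implementations use raw distances, and packages differ on whether values near 1 or near 0 indicate clustering. Here we assume values near 1 indicate clustering.

```r
# Sketch of a Hopkins-type statistic; values near 1 suggest clustered data,
# values near 0.5 suggest spatially random data (convention assumed here).
hopkins_sketch <- function(X, m) {
  X <- as.matrix(X)
  n <- nrow(X)
  d <- ncol(X)

  # Distance from point p to its nearest neighbour among the rows of pts.
  nn_dist <- function(p, pts) min(sqrt(colSums((t(pts) - p)^2)))

  # m real observations, sampled without replacement.
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))

  # m artificial points drawn uniformly inside the bounding box of the data.
  lo <- apply(X, 2, min)
  hi <- apply(X, 2, max)
  U <- matrix(runif(m * d, rep(lo, each = m), rep(hi, each = m)), nrow = m)
  u <- sapply(seq_len(m), function(i) nn_dist(U[i, ], X))

  sum(u^d) / (sum(u^d) + sum(w^d))
}

# Toy data (hypothetical): two tight, well-separated blobs.
set.seed(42)
blobs <- rbind(matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
               matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2))
h <- hopkins_sketch(blobs, m = 20)
h
```

Because real points in tight groups sit very close to their neighbors while uniform points mostly land in empty space, the w terms are tiny relative to the u terms and the ratio approaches 1.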

fviz_nbclust(pca_data_scale, kmeans, method = "wss",nstart = 200)

From the elbow plot above, 3 appears to be an appropriate number of clusters for this dataset. We will use this number of clusters going forward to visualize and interpret our data.
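The elbow plot is built from the total within-cluster sum of squares (WSS) for each candidate number of clusters. A minimal sketch of that computation is shown here on the built-in `USArrests` data rather than the HR data. The range of k and the `nstart` value are illustrative choices.

```r
# Illustration on the built-in USArrests data, not the HR dataset.
set.seed(123)
X <- scale(USArrests)

# Total within-cluster sum of squares for k = 1, ..., 8.
ks <- 1:8
wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)

# The "elbow" is the k after which additional clusters reduce WSS only slightly.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With k = 1 the WSS equals the total sum of squares of the scaled data, so the curve always starts at its maximum and the choice of k rests on where the decline flattens.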

Visualizations and Analysis

K-Means

set.seed(123)
datos.km = kmeans(pca_data_scale, 3, nstart = 200)

clustermeans <- aggregate(pca_data, by=list(cluster=datos.km$cluster), mean)

fviz_cluster(datos.km, 
             pca_data,
             ellipse.type = "t", 
             repel = T, geom = "point")+
  theme_minimal()+
  labs(title = "K-Means Cluster Plot")+
  scale_fill_brewer(palette = "Set1")+
  scale_color_brewer(palette = "Set1")

We can see that our clusters show genuine distinctness from one another while capturing the vast majority of the data. Clusters 1 and 2 appear much more closely related to one another than to Cluster 3. Based on their positions, we can tentatively take the red cluster to denote lower-performing employees, the blue cluster average-performing employees, and the green cluster high-level employees.

Dendrogram

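# Euclidean distances between employees (computed on pca_data, which is
# unscaled here, whereas K-means above used pca_data_scale)
datos.dist.eucl <- dist(pca_data, method = "euclidean")
# Agglomerative clustering with Ward's minimum-variance linkage
datos.hc_ward <- hclust(d = datos.dist.eucl, method = "ward.D2")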
fviz_dend(datos.hc_ward, 
          cex = 0.5,
          k = 3,
          k_colors = c("#2E9FDF", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, 
          rect = TRUE 
          )

Here we have another visualization of our data’s clusters, produced with a different algorithm. For the sake of interpretability and insight, we will continue with the K-means algorithm for clustering.
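One way to check how far the two algorithms agree is to cross-tabulate their cluster assignments. The sketch below does this on the built-in `USArrests` data rather than the HR data, running both methods on the same standardized matrix. In this document the analogous objects would be `datos.km$cluster` and a `cutree()` of `datos.hc_ward`.

```r
# Illustration on the built-in USArrests data, not the HR dataset.
X <- scale(USArrests)

# K-means with 3 clusters.
set.seed(123)
km <- kmeans(X, centers = 3, nstart = 25)

# Ward hierarchical clustering, cut into 3 groups.
hc <- hclust(dist(X), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# Rows: K-means clusters; columns: Ward clusters. Cluster labels are arbitrary,
# so agreement shows up as one dominant cell per row, not as a diagonal.
tab <- table(kmeans = km$cluster, ward = hc_groups)
tab
```

If most observations fall into one cell per row, the two methods recover essentially the same grouping and the choice between them is mainly one of interpretability.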

Analysis

To continue the exploration of these clusters we can run some of our previous descriptive analytics and filter by our clusters.

ggplot(clustermeans, aes(x = cluster, y = Salary, fill = factor(cluster))) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Cluster-wise Average Salaries",
       x = "Cluster",
       y = "Average Salary",
       fill = "Cluster")+
  scale_fill_brewer(palette = "Set1")

The bar chart visualizes the average salary across the three clusters. With the Set1 palette used here, Cluster 1 is drawn in red, Cluster 2 in blue, and Cluster 3 in green. Cluster 3 exhibits the highest average salary, indicating that this group likely comprises higher-level roles or employees with better compensation. Cluster 1 has a moderately high average salary, suggesting a middle-tier group. Cluster 2 shows the lowest average salary, implying this cluster consists of employees with entry-level or lower-tier roles. This differentiation in salary averages among clusters reflects potential variations in role types, seniority, or performance levels across the groups, providing a clear opportunity for further analysis of the factors driving these differences.

ggplot(clustermeans, aes(x = cluster, y = PerfScoreID, fill = factor(cluster))) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Cluster-wise Average Performance Scores",
       x = "Clusters",
       y = "Performance Scores",
       fill = "Cluster")+
  scale_fill_brewer(palette = "Set1")

This bar plot illustrates the average performance scores across three clusters. Cluster 1, depicted in red, has the lowest average performance score, indicating a potential area for improvement or identifying characteristics of lower-performing groups. Cluster 2, shown in blue, achieves a significantly higher average score, suggesting this group consistently meets or exceeds expectations. Cluster 3, represented in green, also performs strongly, closely matching Cluster 2’s average performance score. The clear distinctions in average scores among clusters could guide targeted interventions, resource allocation, or strategies tailored to the needs and strengths of each group.

# Add cluster assignments to the original dataset
pca_data_with_clusters <- cbind(pca_data, cluster = datos.km$cluster)

# Keep the EmpSatisfaction grouping after summarizing so that prop sums to 1
# within each satisfaction level
pca_data_with_clusters %>%
  group_by(EmpSatisfaction, cluster) %>%
  summarize(count = n(), .groups = "drop_last") %>%
  mutate(prop = count / sum(count)) -> prop_data

# Create the stacked bar plot with proportions
ggplot(prop_data, aes(x = factor(EmpSatisfaction), y = prop, fill = factor(cluster))) +
  geom_bar(stat = "identity", position = "stack") +
  theme_minimal() +
  labs(title = "Employee Satisfaction Distribution by Cluster",
       x = "Employee Satisfaction",
       y = "Proportion",
       fill = "Cluster") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette = "Set1")

Cluster 1 dominates at the lowest satisfaction level (1), indicating a group of largely dissatisfied employees, while its proportion decreases as satisfaction levels rise. Cluster 3, on the other hand, prevails at higher satisfaction levels (3, 4, and 5), suggesting it represents the most satisfied employees. Cluster 2 maintains a consistent but smaller proportion across all satisfaction levels, indicating a moderately satisfied group. These insights suggest that employees in Cluster 1 may require targeted interventions to improve satisfaction, while the factors contributing to high satisfaction in Cluster 3 could be leveraged to enhance overall workplace morale.

Cluster Analysis Conclusion

Cluster 1 primarily consists of employees with lower performance scores, lower satisfaction levels, and lower salaries. This group also shows a concentration of dissatisfaction, as evidenced by their dominance in the lowest satisfaction rating categories. Employees in this cluster may benefit from targeted interventions such as professional development programs, better workload management, or initiatives to improve job engagement and satisfaction.

Cluster 2 represents a moderately balanced group, with average performance scores, salaries, and satisfaction levels. While not significantly excelling, this group maintains a steady presence across satisfaction levels. Management could focus on this cluster to understand and address potential barriers to higher performance or satisfaction. Providing opportunities for growth or incentives may help move this group towards higher satisfaction and productivity.

Cluster 3 stands out as the high-performing, highly satisfied, and better-compensated group. This cluster includes employees who predominantly fall into the higher satisfaction and performance categories. This group serves as a benchmark for organizational success. Retaining employees in this cluster is critical, which could involve reinforcing positive workplace practices, offering continued career growth opportunities, and maintaining competitive compensation.

In conclusion, Cluster 1 requires immediate attention to address dissatisfaction and performance issues, Cluster 2 needs strategic investments to unlock potential and elevate performance, and Cluster 3 should be nurtured to sustain excellence and serve as a model for other clusters. A tailored approach to each cluster will enable the organization to optimize workforce satisfaction and productivity effectively.

Conclusion

Based on the PCA and clustering analyses, we can draw comprehensive insights into the underlying structure and segmentation of the data. The PCA revealed that the first two components explained a significant portion of the variance, providing a clear dimensionality reduction for visualization and analysis. Variables such as “SpecialProjectsCount,” “DaysLateLast30,” and “PerfScoreID” exhibited strong contributions to the principal components, highlighting their importance in explaining variability across employees.

The clustering analysis, performed using k-means, identified three distinct groups of employees based on their characteristics. Cluster 1 represented employees with lower salaries and performance scores, potentially indicating underperforming or less experienced individuals. Cluster 2 comprised mid-range performers with stable but average scores across metrics, suggesting employees who meet expectations but lack standout traits. Cluster 3 included high performers with the highest salaries and performance scores, likely encompassing experienced and high-value employees who drive organizational outcomes.

Actionable Insight

From a managerial perspective, the following actionable insights are proposed:

Focus on Cluster 3: This cluster contains high performers, so retaining and further developing these employees should be prioritized. Consider offering competitive incentives, leadership opportunities, and career development programs to ensure their satisfaction and continued contribution.
Support Cluster 1: Employees in this cluster may benefit from additional support, such as training programs, mentorship, or performance improvement plans. Addressing gaps in their skills and motivation could improve their output and align them closer to organizational goals.
Enhance Cluster 2’s Potential: Employees in this group exhibit stability but may require targeted initiatives to unlock their potential. Providing professional development opportunities or assigning challenging projects could encourage growth and engagement.
Variable Analysis for Policy Development: Variables such as “SpecialProjectsCount” and “DaysLateLast30” contribute strongly to the leading principal components and should be monitored. Policies encouraging participation in special projects and addressing punctuality issues may positively impact overall performance.

By aligning resources and strategies to address the needs of these clusters, the organization can optimize workforce performance and satisfaction.

Data Mining Project on HR Dataset

Jeffrey Fernandez

2024-12-25