Association between variables

Scoring model

Analyzed data set: HRDataset_v14 available here

Loading the necesary packages…

Project Objectives

Exploratory data analysis - EDA was done in another post and available here
Identification and detailed analysis of the studied phenomenon through the development of scoring models available here and here
Analyzing the association between variables
Interpretation of results and presentation of conclusions

Dataset description

In order to determine the factors that significantly influence an individual’s salary within a company, we selected the dataset entitled Human Resources Data Set, published on the Kaggle platform: https://www.kaggle.com/datasets/rhuebner/human-resources-data-set

This dataset was created by Dr. Carla Patalano and Dr. Rich. It was designed as an educational resource to help students learn how to perform exploratory data analysis (EDA). The dataset provides a wide range of features that enable both data visualization and the development of machine learning / predictive analytics models.

Within this dataset, we decided to explore and attempt to answer several research questions, such as:

Are there areas within the company where salary distribution is not equitable?
Is an individual’s salary influenced by any specific factors present in the dataset?

loading the dataset

Dataset description

The structure of the dataset

glimpse(HRDataset_v14)

## Rows: 311
## Columns: 36
## $ Employee_Name              <chr> "Adinolfi, Wilson  K", "Ait Sidi, Karthikey…
## $ EmpID                      <dbl> 10026, 10084, 10196, 10088, 10069, 10002, 1…
## $ MarriedID                  <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ MaritalStatusID            <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ GenderID                   <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ EmpStatusID                <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ DeptID                     <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ PerfScoreID                <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ FromDiversityJobFairID     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ Salary                     <dbl> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ Termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ PositionID                 <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ Position                   <chr> "Production Technician I", "Sr. DBA", "Prod…
## $ State                      <chr> "MA", "MA", "MA", "MA", "MA", "MA", "MA", "…
## $ Zip                        <chr> "01960", "02148", "01810", "01886", "02169"…
## $ DOB                        <chr> "07/10/83", "05/05/75", "09/19/88", "09/27/…
## $ Sex                        <chr> "M", "M", "F", "F", "F", "F", "F", "M", "F"…
## $ MaritalDesc                <chr> "Single", "Married", "Married", "Married", …
## $ CitizenDesc                <chr> "US Citizen", "US Citizen", "US Citizen", "…
## $ HispanicLatino             <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ RaceDesc                   <chr> "White", "White", "White", "White", "White"…
## $ DateofHire                 <chr> "7/5/2011", "3/30/2015", "7/5/2011", "1/7/2…
## $ DateofTermination          <chr> NA, "6/16/2016", "9/24/2012", NA, "9/6/2016…
## $ TermReason                 <chr> "N/A-StillEmployed", "career change", "hour…
## $ EmploymentStatus           <chr> "Active", "Voluntarily Terminated", "Volunt…
## $ Department                 <chr> "Production", "IT/IS", "Production", "Produ…
## $ ManagerName                <chr> "Michael Albert", "Simon Roup", "Kissy Sull…
## $ ManagerID                  <dbl> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 2…
## $ RecruitmentSource          <chr> "LinkedIn", "Indeed", "LinkedIn", "Indeed",…
## $ PerformanceScore           <chr> "Exceeds", "Fully Meets", "Fully Meets", "F…
## $ EngagementSurvey           <dbl> 4.60, 4.96, 3.02, 4.84, 5.00, 5.00, 3.04, 5…
## $ EmpSatisfaction            <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ SpecialProjectsCount       <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ LastPerformanceReview_Date <chr> "1/17/2019", "2/24/2016", "5/15/2012", "1/3…
## $ DaysLateLast30             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…

The key findings from the EDA and their implications for further modeling

Based on EDA analysis we so that:

Salary exhibits non-normal behavior and strong right skewness.
Extreme values (executive-level salaries) may influence regression results.
Transformation techniques (e.g., log transformation of salary) may improve model performance.
Categorical variables such as department, position, and employment status are likely strong predictors of salary.
Gender alone may not fully explain salary variation without controlling for position and department.

## Rows: 311
## Columns: 15
## $ married_id                 <fct> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id          <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id                  <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id              <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id                    <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id              <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id                <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction           <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count     <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ status                     <chr> "no_married", "married", "married", "marrie…
## $ log10_salary               <dbl> 4.795922, 5.018854, 4.812613, 4.812853, 4.7…

## Rows: 311
## Columns: 15
## $ married_id                 <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ marital_status_id          <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ gender_id                  <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ emp_status_id              <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ dept_id                    <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ perf_score_id              <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ from_diversity_job_fair_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ position_id                <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ emp_satisfaction           <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ special_projects_count     <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ days_late_last30           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
## $ salary                     <int> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ status                     <chr> "no_married", "married", "married", "marrie…

Correlation matrix

Difference between gropus

## 
##  Welch Two Sample t-test
## 
## data:  salary by sex
## t = -0.9956, df = 296.39, p-value = 0.3203
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -8461.792  2776.447
## sample estimates:
## mean in group female   mean in group male 
##             67786.73             70629.40

## 
##  Welch Two Sample t-test
## 
## data:  salary by status
## t = 0.45511, df = 253.63, p-value = 0.6494
## alternative hypothesis: true difference in means between group married and group no_married is not equal to 0
## 95 percent confidence interval:
##  -4465.733  7150.088
## sample estimates:
##    mean in group married mean in group no_married 
##                 69827.72                 68485.54

##              Df       Sum Sq   Mean Sq F value Pr(>F)
## race_desc     5   4051661020 810332204   1.286   0.27
## Residuals   305 192133817279 629946942

##              Df       Sum Sq   Mean Sq F value Pr(>F)
## race_desc     5   4051661020 810332204   1.286   0.27
## Residuals   305 192133817279 629946942

Association between variables

To visually explore the relationships between multiple categorical variables and salary groups among employees we will use a parallel plot.

department → emp_satisfaction → employment_status → marital_desc → performance_score → position → salary_group → sex

Interpretation

The parallel sets plot highlights strong associations between department, position and salary group. Most employees in the Production department fall into the lowest salary category (<55000), while higher salary categories are primarily associated with IT and Software Engineering positions. Gender differences appear visually, with higher salary bands showing a greater proportion of male employees.

Association between variables

by Irimia Mihaela

2026-03-09

Scoring model

Analyzed data set: HRDataset_v14 available here