Scoring

Exploratory data analysis - EDA

Analyzed data set: HRDataset_v14 available here

Loading the necesary packages…

Project Objectives

Exploratory Data Analysis
Identification and detailed analysis of the studied phenomenon through the development of scoring models
Interpretation of results and presentation of conclusions

Dataset description

In order to determine the factors that significantly influence an individual’s salary within a company, we selected the dataset entitled Human Resources Data Set, published on the Kaggle platform: https://www.kaggle.com/datasets/rhuebner/human-resources-data-set

This dataset was created by Dr. Carla Patalano and Dr. Rich. It was designed as an educational resource to help students learn how to perform exploratory data analysis (EDA). The dataset provides a wide range of features that enable both data visualization and the development of machine learning / predictive analytics models.

Within this dataset, we decided to explore and attempt to answer several research questions, such as:

Are there areas within the company where salary distribution is not equitable?
Is an individual’s salary influenced by any specific factors present in the dataset?
Can we build a scoring model capable of estimating an employee’s salary? If so, to what extent is the model accurate?

loading the dataset

Dataset description

The structure of the dataset

glimpse(HRDataset_v14)

## Rows: 311
## Columns: 36
## $ Employee_Name              <chr> "Adinolfi, Wilson  K", "Ait Sidi, Karthikey…
## $ EmpID                      <dbl> 10026, 10084, 10196, 10088, 10069, 10002, 1…
## $ MarriedID                  <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ MaritalStatusID            <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ GenderID                   <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ EmpStatusID                <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ DeptID                     <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ PerfScoreID                <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ FromDiversityJobFairID     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ Salary                     <dbl> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ Termd                      <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ PositionID                 <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ Position                   <chr> "Production Technician I", "Sr. DBA", "Prod…
## $ State                      <chr> "MA", "MA", "MA", "MA", "MA", "MA", "MA", "…
## $ Zip                        <chr> "01960", "02148", "01810", "01886", "02169"…
## $ DOB                        <chr> "07/10/83", "05/05/75", "09/19/88", "09/27/…
## $ Sex                        <chr> "M", "M", "F", "F", "F", "F", "F", "M", "F"…
## $ MaritalDesc                <chr> "Single", "Married", "Married", "Married", …
## $ CitizenDesc                <chr> "US Citizen", "US Citizen", "US Citizen", "…
## $ HispanicLatino             <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ RaceDesc                   <chr> "White", "White", "White", "White", "White"…
## $ DateofHire                 <chr> "7/5/2011", "3/30/2015", "7/5/2011", "1/7/2…
## $ DateofTermination          <chr> NA, "6/16/2016", "9/24/2012", NA, "9/6/2016…
## $ TermReason                 <chr> "N/A-StillEmployed", "career change", "hour…
## $ EmploymentStatus           <chr> "Active", "Voluntarily Terminated", "Volunt…
## $ Department                 <chr> "Production", "IT/IS", "Production", "Produ…
## $ ManagerName                <chr> "Michael Albert", "Simon Roup", "Kissy Sull…
## $ ManagerID                  <dbl> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 2…
## $ RecruitmentSource          <chr> "LinkedIn", "Indeed", "LinkedIn", "Indeed",…
## $ PerformanceScore           <chr> "Exceeds", "Fully Meets", "Fully Meets", "F…
## $ EngagementSurvey           <dbl> 4.60, 4.96, 3.02, 4.84, 5.00, 5.00, 3.04, 5…
## $ EmpSatisfaction            <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ SpecialProjectsCount       <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ LastPerformanceReview_Date <chr> "1/17/2019", "2/24/2016", "5/15/2012", "1/3…
## $ DaysLateLast30             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Absences                   <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…

hr <- HRDataset_v14

Dataset dimensions

dim(hr)

## [1] 311  36

The dataset contains 36 variables and 311 observations

Checking the structure of the dataset

str(hr)       # Displays the variable types assigned by R for each column

## spc_tbl_ [311 × 36] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Employee_Name             : chr [1:311] "Adinolfi, Wilson  K" "Ait Sidi, Karthikeyan" "Akinkuolie, Sarah" "Alagbe,Trina" ...
##  $ EmpID                     : num [1:311] 10026 10084 10196 10088 10069 ...
##  $ MarriedID                 : num [1:311] 0 1 1 1 0 0 0 0 0 0 ...
##  $ MaritalStatusID           : num [1:311] 0 1 1 1 2 0 0 4 0 2 ...
##  $ GenderID                  : num [1:311] 1 1 0 0 0 0 0 1 0 1 ...
##  $ EmpStatusID               : num [1:311] 1 5 5 1 5 1 1 1 3 1 ...
##  $ DeptID                    : num [1:311] 5 3 5 5 5 5 4 5 5 3 ...
##  $ PerfScoreID               : num [1:311] 4 3 3 3 3 4 3 3 3 3 ...
##  $ FromDiversityJobFairID    : num [1:311] 0 0 0 0 0 0 0 0 1 0 ...
##  $ Salary                    : num [1:311] 62506 104437 64955 64991 50825 ...
##  $ Termd                     : num [1:311] 0 1 1 0 1 0 0 0 0 0 ...
##  $ PositionID                : num [1:311] 19 27 20 19 19 19 24 19 19 14 ...
##  $ Position                  : chr [1:311] "Production Technician I" "Sr. DBA" "Production Technician II" "Production Technician I" ...
##  $ State                     : chr [1:311] "MA" "MA" "MA" "MA" ...
##  $ Zip                       : chr [1:311] "01960" "02148" "01810" "01886" ...
##  $ DOB                       : chr [1:311] "07/10/83" "05/05/75" "09/19/88" "09/27/88" ...
##  $ Sex                       : chr [1:311] "M" "M" "F" "F" ...
##  $ MaritalDesc               : chr [1:311] "Single" "Married" "Married" "Married" ...
##  $ CitizenDesc               : chr [1:311] "US Citizen" "US Citizen" "US Citizen" "US Citizen" ...
##  $ HispanicLatino            : chr [1:311] "No" "No" "No" "No" ...
##  $ RaceDesc                  : chr [1:311] "White" "White" "White" "White" ...
##  $ DateofHire                : chr [1:311] "7/5/2011" "3/30/2015" "7/5/2011" "1/7/2008" ...
##  $ DateofTermination         : chr [1:311] NA "6/16/2016" "9/24/2012" NA ...
##  $ TermReason                : chr [1:311] "N/A-StillEmployed" "career change" "hours" "N/A-StillEmployed" ...
##  $ EmploymentStatus          : chr [1:311] "Active" "Voluntarily Terminated" "Voluntarily Terminated" "Active" ...
##  $ Department                : chr [1:311] "Production" "IT/IS" "Production" "Production" ...
##  $ ManagerName               : chr [1:311] "Michael Albert" "Simon Roup" "Kissy Sullivan" "Elijiah Gray" ...
##  $ ManagerID                 : num [1:311] 22 4 20 16 39 11 10 19 12 7 ...
##  $ RecruitmentSource         : chr [1:311] "LinkedIn" "Indeed" "LinkedIn" "Indeed" ...
##  $ PerformanceScore          : chr [1:311] "Exceeds" "Fully Meets" "Fully Meets" "Fully Meets" ...
##  $ EngagementSurvey          : num [1:311] 4.6 4.96 3.02 4.84 5 5 3.04 5 4.46 5 ...
##  $ EmpSatisfaction           : num [1:311] 5 3 3 5 4 5 3 4 3 5 ...
##  $ SpecialProjectsCount      : num [1:311] 0 6 0 0 0 0 4 0 0 6 ...
##  $ LastPerformanceReview_Date: chr [1:311] "1/17/2019" "2/24/2016" "5/15/2012" "1/3/2019" ...
##  $ DaysLateLast30            : num [1:311] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Absences                  : num [1:311] 1 17 3 15 2 15 19 19 4 16 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Employee_Name = col_character(),
##   ..   EmpID = col_double(),
##   ..   MarriedID = col_double(),
##   ..   MaritalStatusID = col_double(),
##   ..   GenderID = col_double(),
##   ..   EmpStatusID = col_double(),
##   ..   DeptID = col_double(),
##   ..   PerfScoreID = col_double(),
##   ..   FromDiversityJobFairID = col_double(),
##   ..   Salary = col_double(),
##   ..   Termd = col_double(),
##   ..   PositionID = col_double(),
##   ..   Position = col_character(),
##   ..   State = col_character(),
##   ..   Zip = col_character(),
##   ..   DOB = col_character(),
##   ..   Sex = col_character(),
##   ..   MaritalDesc = col_character(),
##   ..   CitizenDesc = col_character(),
##   ..   HispanicLatino = col_character(),
##   ..   RaceDesc = col_character(),
##   ..   DateofHire = col_character(),
##   ..   DateofTermination = col_character(),
##   ..   TermReason = col_character(),
##   ..   EmploymentStatus = col_character(),
##   ..   Department = col_character(),
##   ..   ManagerName = col_character(),
##   ..   ManagerID = col_double(),
##   ..   RecruitmentSource = col_character(),
##   ..   PerformanceScore = col_character(),
##   ..   EngagementSurvey = col_double(),
##   ..   EmpSatisfaction = col_double(),
##   ..   SpecialProjectsCount = col_double(),
##   ..   LastPerformanceReview_Date = col_character(),
##   ..   DaysLateLast30 = col_double(),
##   ..   Absences = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

# glimpse(hr)   # Displays the variable types assigned by R for each column

We get the same information as above, but more detailed: 311 observations and 35 variables and the type of the variables

DATA CLEANING

We remove variables that are irrelevant or contain redundant information All variable names are converted to lowercase and formatted without spaces

hr <- hr %>%
  janitor::clean_names()

Identifying numerical variables that should be treated as categorical

numerical_categorical_vars <- hr %>%
  dplyr::select(-emp_id, -salary, -zip, -engagement_survey, -days_late_last30, -absences) %>%
  dplyr::select_if(is.numeric) %>%
  names()
numerical_categorical_vars

##  [1] "married_id"                 "marital_status_id"         
##  [3] "gender_id"                  "emp_status_id"             
##  [5] "dept_id"                    "perf_score_id"             
##  [7] "from_diversity_job_fair_id" "termd"                     
##  [9] "position_id"                "manager_id"                
## [11] "emp_satisfaction"           "special_projects_count"

Alternative method (commented out)

numerical_categorical_vars <- hr %>%
  dplyr::select(-emp_id, -salary, -zip, -engagement_survey, -days_late_last30, -absences) %>%
  dplyr::select_if(is.numeric) %>%
  gather(var, value) %>%
  group_by(var) %>%
  summarise(n = n(),
            missing = sum(is.na(value)))
numerical_categorical_vars

var	n	missing
dept_id	311	0
emp_satisfaction	311	0
emp_status_id	311	0
from_diversity_job_fair_id	311	0
gender_id	311	0
manager_id	311	8
marital_status_id	311	0
married_id	311	0
perf_score_id	311	0
position_id	311	0
special_projects_count	311	0
termd	311	0

Converting selected numerical variables into categorical variables (factors)

hr <- hr %>%
  mutate(across(c(married_id, marital_status_id, gender_id, emp_status_id, 
                  dept_id, perf_score_id, from_diversity_job_fair_id, 
                  termd, position_id, emp_satisfaction, 
                  special_projects_count), as.factor))

Missing values identification

# summary(hr)
any(is.na(hr))

## [1] TRUE

sum(is.na(hr))

## [1] 215

There are 8 missing values in manager_id

any(is.na(hr$manager_id))

## [1] TRUE

print('The number of the missing values is', sum(is.na(hr$manager_id)))

## [1] "The number of the missing values is"

converting the data type variable form string type to date type using separate(data, col, into, sep) function

hr1 <- hr%>%
  as.data.frame()%>%
  separate(dob, c('zi', 'luna', 'an'), '/')%>%
  mutate(an1 = paste0('19', an))%>%
  mutate(dob1 = paste0(zi, '/', luna, '/', an1))

hr1 <- hr1%>%
  mutate_at(vars(dob1, dateof_hire, dateof_hire, dateof_termination, last_performance_review_date), funs(as.Date(., format = '%d/%m/%Y')))

DT::datatable(
  hr1,
  options = list(
    scrollX = TRUE,
    pageLength = 5
  )
)

data_type<- hr1%>%
  dplyr::select(dob1, dateof_hire, dateof_hire, dateof_termination, last_performance_review_date)
# summary(data_type)

checking the missing values of the data type variables

summary(hr1)

##  employee_name          emp_id      married_id marital_status_id gender_id
##  Length:311         Min.   :10001   0:187      0:137             0:176    
##  Class :character   1st Qu.:10078   1:124      1:124             1:135    
##  Mode  :character   Median :10156              2: 30                      
##                     Mean   :10156              3: 12                      
##                     3rd Qu.:10234              4:  8                      
##                     Max.   :10311                                         
##                                                                           
##  emp_status_id dept_id perf_score_id from_diversity_job_fair_id
##  1:184         1: 10   1: 13         0:282                     
##  2: 11         2:  1   2: 18         1: 29                     
##  3: 14         3: 50   3:243                                   
##  4: 14         4: 10   4: 37                                   
##  5: 88         5:208                                           
##                6: 32                                           
##                                                                
##      salary       termd    position_id    position            state          
##  Min.   : 45046   0:207   19     :137   Length:311         Length:311        
##  1st Qu.: 55502   1:104   20     : 57   Class :character   Class :character  
##  Median : 62810           3      : 27   Mode  :character   Mode  :character  
##  Mean   : 69021           18     : 13                                        
##  3rd Qu.: 72036           24     :  9                                        
##  Max.   :250000           9      :  8                                        
##                           (Other): 60                                        
##      zip                 zi                luna                an           
##  Length:311         Length:311         Length:311         Length:311        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      sex            marital_desc       citizen_desc       hispanic_latino   
##  Length:311         Length:311         Length:311         Length:311        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   race_desc          dateof_hire         dateof_termination  
##  Length:311         Min.   :2006-09-01   Min.   :2012-02-01  
##  Class :character   1st Qu.:2011-10-01   1st Qu.:2014-04-04  
##  Mode  :character   Median :2012-09-01   Median :2015-06-08  
##                     Mean   :2013-02-04   Mean   :2015-06-15  
##                     3rd Qu.:2014-10-11   3rd Qu.:2016-08-05  
##                     Max.   :2018-09-07   Max.   :2018-10-11  
##                     NA's   :138          NA's   :265         
##  term_reason        employment_status   department        manager_name      
##  Length:311         Length:311         Length:311         Length:311        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    manager_id    recruitment_source performance_score  engagement_survey
##  Min.   : 1.00   Length:311         Length:311         Min.   :1.12     
##  1st Qu.:10.00   Class :character   Class :character   1st Qu.:3.69     
##  Median :15.00   Mode  :character   Mode  :character   Median :4.28     
##  Mean   :14.57                                         Mean   :4.11     
##  3rd Qu.:19.00                                         3rd Qu.:4.70     
##  Max.   :39.00                                         Max.   :5.00     
##  NA's   :8                                                              
##  emp_satisfaction special_projects_count last_performance_review_date
##  1:  2            0      :241            Min.   :2011-04-05          
##  2:  9            5      : 21            1st Qu.:2015-02-07          
##  3:108            6      : 21            Median :2018-05-02          
##  4: 94            7      : 12            Mean   :2017-03-14          
##  5: 98            4      :  9            3rd Qu.:2019-07-01          
##                   3      :  3            Max.   :2019-12-02          
##                   (Other):  4            NA's   :184                 
##  days_late_last30    absences         an1                 dob1           
##  Min.   :0.0000   Min.   : 1.00   Length:311         Min.   :1951-02-01  
##  1st Qu.:0.0000   1st Qu.: 5.00   Class :character   1st Qu.:1973-12-13  
##  Median :0.0000   Median :10.00   Mode  :character   Median :1980-06-21  
##  Mean   :0.4148   Mean   :10.24                      Mean   :1979-02-14  
##  3rd Qu.:0.0000   3rd Qu.:15.00                      3rd Qu.:1986-02-20  
##  Max.   :6.0000   Max.   :20.00                      Max.   :1992-07-05  
##                                                      NA's   :169

numer_var_char <- hr1 %>%
   dplyr::select(-employee_name, -zip, -manager_name, -zi, -luna, -an, -an1)%>%
   dplyr::select_if(is.character) %>%
  names()
numer_var_char

##  [1] "position"           "state"              "sex"               
##  [4] "marital_desc"       "citizen_desc"       "hispanic_latino"   
##  [7] "race_desc"          "term_reason"        "employment_status" 
## [10] "department"         "recruitment_source" "performance_score"

number of persons as function of status variable

denumire_status_married_id <- hr1 %>%
  dplyr::select(married_id, marital_desc)%>%
  arrange(married_id) %>%
  mutate(id = row_number())%>%
  spread(marital_desc, married_id)%>%
  dplyr::select(-id) %>%
  gather(var, values) %>%
  na.omit()%>%
  group_by(var)%>%
  summarise(n = n())
denumire_status_married_id %>%
  arrange(desc(n))

var	n
Single	137
Married	124
Divorced	30
Separated	12
Widowed	8

The number of married/unmarried persons

status_married_id <- hr1 %>%
  dplyr::select(married_id)%>%
  arrange(married_id) %>%
  mutate(id = row_number(),
         status = if_else(married_id == '1', 'Married', 'No_Married'))%>%
  # mutate(married_id = as.numeric(married_id))%>%
  spread(status, married_id)%>%
  dplyr::select(-id) %>%
  gather(var, values) %>%
  na.omit()%>%
  group_by(var)%>%
  summarise(n = n())
status_married_id

var	n
Married	124
No_Married	187

converting the string variables into categorical or factor variables

hr2 <- hr1 %>%
  mutate_at(vars(position, state, sex, marital_desc, citizen_desc, hispanic_latino, race_desc,
                 term_reason, employment_status, department, recruitment_source, performance_score), funs(factor))

or the alternative way

identifying the mangers with missing values for the id_manager

hr_manageri_departamente <- hr1 %>%
  dplyr::select(manager_id, manager_name, dept_id, department) %>%
  arrange(dept_id, department, manager_id, manager_name) %>%
  group_by(manager_name)%>%
  distinct(manager_id, manager_name, department, dept_id)

hr_manageri_departamente %>%
  filter(is.na(manager_id))

## # A tibble: 1 × 4
## # Groups:   manager_name [1]
##   manager_id manager_name   department dept_id
##        <dbl> <chr>          <chr>      <fct>  
## 1         NA Webster Butler Production 5

extracting the name of the manager with missing value for the manager_id variable

nume_manager <- hr1 %>%
  dplyr::select(manager_id, manager_name, dept_id, department) %>%
  arrange(dept_id, department, manager_id, manager_name) %>%
  group_by(manager_name)%>%
  distinct(manager_id, manager_name, department, dept_id)%>%
  filter(is.na(manager_id))%>%
  pull(manager_name)
print(paste('The name of the manager is: ', nume_manager, collapse = ', '))

## [1] "The name of the manager is:  Webster Butler"

The name and the number of the department in which Webster Butler activates

departament <- hr_manageri_departamente%>%
  filter(manager_name == nume_manager)
departament

## # A tibble: 2 × 4
## # Groups:   manager_name [1]
##   manager_id manager_name   department dept_id
##        <dbl> <chr>          <chr>      <fct>  
## 1         39 Webster Butler Production 5      
## 2         NA Webster Butler Production 5

extracting the manager_id of the Webster Butler

nr_departament <- hr_manageri_departamente%>%
  filter(manager_name == nume_manager) %>%
  filter(!is.na(manager_id))%>%
  pull(manager_id)
print(paste('The number of the id is: ', nr_departament))

## [1] "The number of the id is:  39"

filling de missing manager_id value

hr1 <- hr1 %>%
  mutate(manager_id = if_else(manager_name == nume_manager, nr_departament, manager_id))

##    manager_id   manager_name      
##  Min.   : 1.0   Length:311        
##  1st Qu.:10.0   Class :character  
##  Median :16.0   Mode  :character  
##  Mean   :15.2                     
##  3rd Qu.:19.5                     
##  Max.   :39.0

temp <- hr_manageri_departamente %>%
  dplyr::select(manager_name, manager_id, department, dept_id) %>%
  arrange(manager_name)%>%
  mutate(nr = rank(manager_name))%>%
  arrange(nr)

the name and number of departments

departamente <- hr1 %>%
  distinct(department, dept_id)%>%
  arrange(dept_id)

we notice that production department has assigned 2 different id department number (5 and 6)

checking for the number of observation/employees registered in each department

departamente <- hr1 %>%
  dplyr::select(department, dept_id)%>%
  filter(department == 'Production')%>%
  arrange(dept_id) %>%
  mutate(id = row_number())%>%
  spread(dept_id, department)%>%
  dplyr::select(-id) %>%
  gather(var, values) %>%
  na.omit()%>%
  group_by(var)%>%
  summarise(n = n())
departamente

var	n
5	208
6	1

correcting the number of the department for the Production section

hr1 <- hr1 %>%
  mutate(dept_id = if_else(department == 'Production', '5', dept_id))

modifying the department number of the Admin Offices

hr1 <- hr1 %>%
  mutate(dept_id = if_else(department == 'Admin Offices', '7', dept_id))

summary(hr1)

##  employee_name          emp_id      married_id marital_status_id gender_id
##  Length:311         Min.   :10001   0:187      0:137             0:176    
##  Class :character   1st Qu.:10078   1:124      1:124             1:135    
##  Mode  :character   Median :10156              2: 30                      
##                     Mean   :10156              3: 12                      
##                     3rd Qu.:10234              4:  8                      
##                     Max.   :10311                                         
##                                                                           
##  emp_status_id   dept_id          perf_score_id from_diversity_job_fair_id
##  1:184         Length:311         1: 13         0:282                     
##  2: 11         Class :character   2: 18         1: 29                     
##  3: 14         Mode  :character   3:243                                   
##  4: 14                            4: 37                                   
##  5: 88                                                                    
##                                                                           
##                                                                           
##      salary       termd    position_id    position            state          
##  Min.   : 45046   0:207   19     :137   Length:311         Length:311        
##  1st Qu.: 55502   1:104   20     : 57   Class :character   Class :character  
##  Median : 62810           3      : 27   Mode  :character   Mode  :character  
##  Mean   : 69021           18     : 13                                        
##  3rd Qu.: 72036           24     :  9                                        
##  Max.   :250000           9      :  8                                        
##                           (Other): 60                                        
##      zip                 zi                luna                an           
##  Length:311         Length:311         Length:311         Length:311        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      sex            marital_desc       citizen_desc       hispanic_latino   
##  Length:311         Length:311         Length:311         Length:311        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   race_desc          dateof_hire         dateof_termination  
##  Length:311         Min.   :2006-09-01   Min.   :2012-02-01  
##  Class :character   1st Qu.:2011-10-01   1st Qu.:2014-04-04  
##  Mode  :character   Median :2012-09-01   Median :2015-06-08  
##                     Mean   :2013-02-04   Mean   :2015-06-15  
##                     3rd Qu.:2014-10-11   3rd Qu.:2016-08-05  
##                     Max.   :2018-09-07   Max.   :2018-10-11  
##                     NA's   :138          NA's   :265         
##  term_reason        employment_status   department        manager_name      
##  Length:311         Length:311         Length:311         Length:311        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    manager_id   recruitment_source performance_score  engagement_survey
##  Min.   : 1.0   Length:311         Length:311         Min.   :1.12     
##  1st Qu.:10.0   Class :character   Class :character   1st Qu.:3.69     
##  Median :16.0   Mode  :character   Mode  :character   Median :4.28     
##  Mean   :15.2                                         Mean   :4.11     
##  3rd Qu.:19.5                                         3rd Qu.:4.70     
##  Max.   :39.0                                         Max.   :5.00     
##                                                                        
##  emp_satisfaction special_projects_count last_performance_review_date
##  1:  2            0      :241            Min.   :2011-04-05          
##  2:  9            5      : 21            1st Qu.:2015-02-07          
##  3:108            6      : 21            Median :2018-05-02          
##  4: 94            7      : 12            Mean   :2017-03-14          
##  5: 98            4      :  9            3rd Qu.:2019-07-01          
##                   3      :  3            Max.   :2019-12-02          
##                   (Other):  4            NA's   :184                 
##  days_late_last30    absences         an1                 dob1           
##  Min.   :0.0000   Min.   : 1.00   Length:311         Min.   :1951-02-01  
##  1st Qu.:0.0000   1st Qu.: 5.00   Class :character   1st Qu.:1973-12-13  
##  Median :0.0000   Median :10.00   Mode  :character   Median :1980-06-21  
##  Mean   :0.4148   Mean   :10.24                      Mean   :1979-02-14  
##  3rd Qu.:0.0000   3rd Qu.:15.00                      3rd Qu.:1986-02-20  
##  Max.   :6.0000   Max.   :20.00                      Max.   :1992-07-05  
##                                                      NA's   :169

a list containing the positions or jobs from the company

jobs <- hr1 %>%
  dplyr::select(dept_id, department, position_id, position, manager_name, employee_name)%>%
  distinct(dept_id, department, position_id, position,  manager_name)%>%
  arrange(department, position_id,  manager_name)

jobs_denumire_it_mamanger <- hr1 %>%
  dplyr::select(dept_id, department, position_id, position, manager_name, employee_name)%>%
  distinct(position_id, position, dept_id, department, manager_name)%>%
  arrange(position_id, department, manager_name)%>%
  filter(position_id == 13)%>%
  pull(position)
print(paste('The name of the department are: ', jobs_denumire_it_mamanger, collapse = ', '))

## [1] "The name of the department are:  IT Manager - Support, The name of the department are:  IT Manager - Infra, The name of the department are:  IT Manager - DB"

changing the jab name int IT Manager

hr1 <- hr1 %>%
  mutate(position = if_else(position %in% jobs_denumire_it_mamanger, 'IT Manager', position),
         position = if_else(position %in% c('IT Director'), 'IT Manager', position))

we wandering if Janet King fills 5 different job positions or there are 5 distinct persons with the same name

jobs_Janet_King <- hr1 %>%
  dplyr::select(dept_id, department, position_id, position, manager_name, employee_name)%>%
  distinct(position_id, position, dept_id, department, manager_name)%>%
  arrange(position_id, department, manager_name)%>%
  filter(manager_name == 'Janet King')
jobs_Janet_King

position_id	position	dept_id	department	manager_name
6	CIO	3	IT/IS	Janet King
10	Director of Operations	5	Production	Janet King
11	Director of Sales	6	Sales	Janet King
17	Production Manager	5	Production	Janet King
18	Production Manager	5	Production	Janet King
23	Shared Services Manager	7	Admin Offices	Janet King
23	Software Engineer	1	Software Engineering	Janet King

so Janet King appears as Production manager of the department with the same name

jobs_Janet_King_prod <- hr1 %>%
  dplyr::select(dept_id, department, position_id, position, manager_name, employee_name)%>%
  distinct(position_id, position, dept_id, department, manager_name)%>%
  arrange(position_id, department, manager_name)%>%
  filter(manager_name == 'Janet King' & department == 'Production')
jobs_Janet_King_prod

position_id	position	dept_id	department	manager_name
10	Director of Operations	5	Production	Janet King
17	Production Manager	5	Production	Janet King
18	Production Manager	5	Production	Janet King

changing the name jab into Production Manager

hr1 <- hr1 %>%
  mutate(position = if_else(position == 'Director of Operations', 'Production Manager', position))

jobs_id_10_17_18 <- hr1 %>%
  dplyr::select(dept_id, department, position_id, position, manager_name, employee_name)%>%
  distinct(position_id, position, dept_id, department, manager_name)%>%
  arrange(position_id, department, manager_name)%>%
  filter(position_id %in% c(10, 17, 18))
jobs_id_10_17_18

position_id	position	dept_id	department	manager_name
10	Production Manager	5	Production	Janet King
17	Production Manager	5	Production	Janet King
18	Production Manager	5	Production	Janet King

so we have 31 distinct values for the name of the managers

distinct_nume_dep<-hr_manageri_departamente %>%
  dplyr::select(manager_name, manager_id, department, dept_id) %>%
  distinct(manager_name, manager_id, department, dept_id)

var_term_reason <- hr1 %>%
  distinct(term_reason)

changing the N/A-StillEmployed into StillEmployed

hr1 <- hr1 %>%
  mutate(term_reason = if_else(term_reason == 'N/A-StillEmployed', 'StillEmployed', term_reason))

Basic information about data types and missing values

…as text…

temp <- DataExplorer::introduce(hr1)
temp %>%
  gather(Metric, Value)

Metric	Value
rows	311
columns	40
discrete_columns	34
continuous_columns	6
all_missing_columns	0
total_missing_values	756
complete_rows	7
total_observations	1.24e+04
memory_usage	1.52e+05

… and as plot

plot_intro(temp)

Missing values

… as plot

Display the data types as text…

type	cnt	pcnt	col_name
character	20	50	c(`1` = "employee_name", `7` = "dept_id", `13` = "position", `14` = "state", `15` = "zip", `16` = "zi", `17` = "luna", `18` = "an", `19` = "sex", `20` = "marital_desc", `21` = "citizen_desc", `22` = "hispanic_latino", `23` = "race_desc", `26` = "term_reason", `27` = "employment_status", `28` = "department", `29` = "manager_name", `31` = "recruitment_source", `32` = "performance_score", `39` = "an1")
factor	10	25	c(`3` = "married_id", `4` = "marital_status_id", `5` = "gender_id", `6` = "emp_status_id", `8` = "perf_score_id", `9` = "from_diversity_job_fair_id", `11` = "termd", `12` = "position_id", `34` = "emp_satisfaction", `35` = "special_projects_count")
numeric	6	15	c(`2` = "emp_id", `10` = "salary", `30` = "manager_id", `33` = "engagement_survey", `37` = "days_late_last30", `38` = "absences")
Date	4	10	c(`24` = "dateof_hire", `25` = "dateof_termination", `36` = "last_performance_review_date", `40` = "dob1")

… and as chart

Display the size of each column

Missing values: we have 4 data type variables that contains over 40% missing values selecting the new dataset

selecting the character type variables

numer_var_char <- hr2 %>%
  dplyr::select_if(is.character) %>%
  names()
numer_var_char

##  [1] "employee_name"      "dept_id"            "position"          
##  [4] "state"              "sex"                "marital_desc"      
##  [7] "citizen_desc"       "hispanic_latino"    "race_desc"         
## [10] "term_reason"        "employment_status"  "department"        
## [13] "manager_name"       "recruitment_source" "performance_score"

hr2 <- hr2 %>%
  mutate_at(vars(dept_id), funs(factor))%>%
  mutate_at(vars("position", "state" , "sex", "marital_desc","citizen_desc",      
                 "hispanic_latino", "race_desc", "term_reason",  "employment_status", "department",
                 "recruitment_source", "performance_score"), funs(factor))

we have 28 distinct job position

jobs<- hr2 %>%
  distinct(position_id, position)%>%
  arrange(position_id)

jobs2 <- hr2 %>%
  dplyr::select(position_id, position)%>%
  mutate(id = row_number())%>%
  group_by(position_id, position)%>%
  summarise(n = n())

we changing the job position with the same name

hr2 <- hr2 %>%
  mutate(position_id = if_else(position == 'Production Manager', '18', position_id),
         position_id = if_else(position == 'Software Engineer', '24', position_id),
         position_id = if_else(position == 'IT Manager', '12', position_id))

salary is a numerica variable and will be the dependent variable

the rest of the variable are the predectors

after a few changes…

## 'data.frame':    311 obs. of  31 variables:
##  $ employee_name             : chr  "Adinolfi, Wilson  K" "Ait Sidi, Karthikeyan" "Akinkuolie, Sarah" "Alagbe,Trina" ...
##  $ emp_id                    : num  10026 10084 10196 10088 10069 ...
##  $ married_id                : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 1 1 ...
##  $ status                    : chr  "no_married" "married" "married" "married" ...
##  $ marital_status_id         : Factor w/ 5 levels "0","1","2","3",..: 1 2 2 2 3 1 1 5 1 3 ...
##  $ gender_id                 : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 2 1 2 ...
##  $ emp_status_id             : Factor w/ 5 levels "1","2","3","4",..: 1 5 5 1 5 1 1 1 3 1 ...
##  $ dept_id                   : Factor w/ 7 levels "1","2","3","4",..: 5 3 5 5 5 5 4 5 5 3 ...
##  $ perf_score_id             : Factor w/ 4 levels "1","2","3","4": 4 3 3 3 3 4 3 3 3 3 ...
##  $ from_diversity_job_fair_id: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
##  $ salary                    : num  62506 104437 64955 64991 50825 ...
##  $ termd                     : Factor w/ 2 levels "0","1": 1 2 2 1 2 1 1 1 1 1 ...
##  $ position_id               : Factor w/ 27 levels "1","11","12",..: 8 17 10 8 8 8 14 8 8 4 ...
##  $ position                  : Factor w/ 27 levels "accountant_i",..: 18 26 19 18 18 18 23 18 18 13 ...
##  $ state                     : Factor w/ 28 levels "AL","AZ","CA",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ sex                       : Factor w/ 2 levels "female","male": 2 2 1 1 1 1 1 2 1 2 ...
##  $ marital_desc              : Factor w/ 5 levels "divorced","married",..: 4 2 2 2 1 4 4 5 4 1 ...
##  $ citizen_desc              : Factor w/ 3 levels "eligible_noncitizen",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ hispanic_latino           : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ race_desc                 : Factor w/ 6 levels "american_indian_or_alaska_native",..: 6 6 6 6 6 6 6 6 3 6 ...
##  $ term_reason               : Factor w/ 18 levels "another_position",..: 17 3 6 17 16 17 17 17 17 17 ...
##  $ employment_status         : Factor w/ 3 levels "active","terminated_for_cause",..: 1 3 3 1 3 1 1 1 1 1 ...
##  $ department                : Factor w/ 6 levels "admin_offices",..: 4 3 4 4 4 4 6 4 4 3 ...
##  $ manager_name              : chr  "Michael Albert" "Simon Roup" "Kissy Sullivan" "Elijiah Gray" ...
##  $ manager_id                : num  22 4 20 16 39 11 10 19 12 7 ...
##  $ recruitment_source        : Factor w/ 9 levels "careerbuilder",..: 6 5 6 5 4 6 6 3 2 5 ...
##  $ performance_score         : Factor w/ 4 levels "exceeds","fully_meets",..: 1 2 2 2 2 1 2 2 2 2 ...
##  $ emp_satisfaction          : Factor w/ 5 levels "1","2","3","4",..: 5 3 3 5 4 5 3 4 3 5 ...
##  $ special_projects_count    : Factor w/ 9 levels "0","1","2","3",..: 1 7 1 1 1 1 5 1 1 7 ...
##  $ days_late_last30          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ absences                  : num  1 17 3 15 2 15 19 19 4 16 ...

… and as chart

Exploratory data analysis

Graphical and numerica analysis of variables

Descriptive statistic of the salary variable

temp <-  hr2 %>%
  summarise(
    n_of_individuals = n(), # nr. indivizi din firma
    min_Salary = min(hr2$salary), # valoare minim
    first_quartile = quantile(hr2$salary,.25), # prima quartila
    mean_Salary = mean(hr2$salary), # media varstei
    median_Salary = median(hr2$salary), # mediana salariului
    third_quartile = quantile(hr2$salary, .75), # a treia quartila
    max_Salary = max(hr2$salary), # maxim
    trim_mean_Salary =mean(x = hr2$salary, trim = .2),# 20% media trunchiata
    range_Salary = paste(range(hr2$salary), collapse = ' - '), # range 
    iqr_Salary= IQR(hr2$salary), # interval interquartilic IQ
    var_Salary = var(hr2$salary), # varianta
    sd_Salary = sd(hr2$salary), # abaterea standard
    skewness = PerformanceAnalytics::skewness(hr2$salary), #asimetria
    kurtosis = PerformanceAnalytics::kurtosis(hr2$salary), #boltirea
    modul = DescTools::Mode(hr2$salary)) %>% #modul
  gather(variabile, values)%>%
  group_by(variabile)%>%
  slice_head(n = 1)

Most employees in the company earn a salary of 61,242 USD

The IQR is 16,534.5 USD – meaning that 50% of employees earn between 55,501.5 USD and 72,036 USD

Numerical Variables Indicators

Display the first and last 6 rows of the dataset

print(hr2 %>% head(6)) # Displaying the first 6 observations of the dataset

##           employee_name emp_id married_id     status marital_status_id
## 1   Adinolfi, Wilson  K  10026          0 no_married                 0
## 2 Ait Sidi, Karthikeyan  10084          1    married                 1
## 3     Akinkuolie, Sarah  10196          1    married                 1
## 4          Alagbe,Trina  10088          1    married                 1
## 5       Anderson, Carol  10069          0 no_married                 2
## 6       Anderson, Linda  10002          0 no_married                 0
##   gender_id emp_status_id dept_id perf_score_id from_diversity_job_fair_id
## 1         1             1       5             4                          0
## 2         1             5       3             3                          0
## 3         0             5       5             3                          0
## 4         0             1       5             3                          0
## 5         0             5       5             3                          0
## 6         0             1       5             4                          0
##   salary termd position_id                 position state    sex marital_desc
## 1  62506     0          19  production_technician_i    MA   male       single
## 2 104437     1          27                  sr._dba    MA   male      married
## 3  64955     1          20 production_technician_ii    MA female      married
## 4  64991     0          19  production_technician_i    MA female      married
## 5  50825     1          19  production_technician_i    MA female     divorced
## 6  57568     0          19  production_technician_i    MA female       single
##   citizen_desc hispanic_latino race_desc      term_reason
## 1   us_citizen              no     white    stillemployed
## 2   us_citizen              no     white    career_change
## 3   us_citizen              no     white            hours
## 4   us_citizen              no     white    stillemployed
## 5   us_citizen              no     white return_to_school
## 6   us_citizen              no     white    stillemployed
##        employment_status department   manager_name manager_id
## 1                 active production Michael Albert         22
## 2 voluntarily_terminated      it/is     Simon Roup          4
## 3 voluntarily_terminated production Kissy Sullivan         20
## 4                 active production   Elijiah Gray         16
## 5 voluntarily_terminated production Webster Butler         39
## 6                 active production       Amy Dunn         11
##   recruitment_source performance_score emp_satisfaction special_projects_count
## 1           linkedin           exceeds                5                      0
## 2             indeed       fully_meets                3                      6
## 3           linkedin       fully_meets                3                      0
## 4             indeed       fully_meets                5                      0
## 5      google_search       fully_meets                4                      0
## 6           linkedin           exceeds                5                      0
##   days_late_last30 absences
## 1                0        1
## 2                0       17
## 3                0        3
## 4                0       15
## 5                0        2
## 6                0       15

print(hr2 %>% tail(6)) # Displaying the last 6 observations of the dataset

##         employee_name emp_id married_id     status marital_status_id gender_id
## 306     Wolk, Hang  T  10174          0 no_married                 0         0
## 307    Woodson, Jason  10135          0 no_married                 0         1
## 308 Ybarra, Catherine  10301          0 no_married                 0         0
## 309  Zamora, Jennifer  10010          0 no_married                 0         0
## 310       Zhou, Julia  10043          0 no_married                 0         0
## 311     Zima, Colleen  10271          0 no_married                 4         0
##     emp_status_id dept_id perf_score_id from_diversity_job_fair_id salary termd
## 306             1       5             3                          0  60446     0
## 307             1       5             3                          0  65893     0
## 308             5       5             1                          0  48513     1
## 309             1       3             4                          0 220450     0
## 310             1       3             3                          0  89292     0
## 311             1       5             3                          0  45046     0
##     position_id                 position state    sex marital_desc citizen_desc
## 306          20 production_technician_ii    MA female       single   us_citizen
## 307          20 production_technician_ii    MA   male       single   us_citizen
## 308          19  production_technician_i    MA female       single   us_citizen
## 309           6                      cio    MA female       single   us_citizen
## 310           9             data_analyst    MA female       single   us_citizen
## 311          19  production_technician_i    MA female      widowed   us_citizen
##     hispanic_latino race_desc      term_reason      employment_status
## 306              no     white    stillemployed                 active
## 307              no     white    stillemployed                 active
## 308              no     asian another_position voluntarily_terminated
## 309              no     white    stillemployed                 active
## 310              no     white    stillemployed                 active
## 311              no     asian    stillemployed                 active
##     department   manager_name manager_id recruitment_source performance_score
## 306 production  David Stanley         14           linkedin       fully_meets
## 307 production Kissy Sullivan         20           linkedin       fully_meets
## 308 production Brannon Miller         12      google_search               pip
## 309      it/is     Janet King          2  employee_referral           exceeds
## 310      it/is     Simon Roup          4  employee_referral       fully_meets
## 311 production  David Stanley         14           linkedin       fully_meets
##     emp_satisfaction special_projects_count days_late_last30 absences
## 306                4                      0                0       14
## 307                4                      0                0       13
## 308                2                      0                5        4
## 309                5                      6                0       16
## 310                3                      5                0       11
## 311                5                      0                0        2

Descriptive statistics for numerical variables

The minimum salary offered by the company is 45,046 USD

The maximum salary offered by the company is 250,000 USD

The average salary earned by an employee in the company is 69,021 USD

25% of employees earn up to 55,502 USD, while 75% earn more than 55,502 USD

75% of employees earn up to 72,036 USD, while 25% earn more than 72,036 USD

Half of the employees earn up to 62,810 USD, while the remaining 50% earn more than 62,810 USD

Data summary
Name	Piped data
Number of rows	311
Number of columns	31
_______________________
Column type frequency:
character	3
factor	23
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
employee_name	1	8	23	311
status	1	7	10	2
manager_name	1	8	18	21

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
married_id	1	FALSE	2	0: 187, 1: 124
marital_status_id	1	FALSE	5	0: 137, 1: 124, 2: 30, 3: 12
gender_id	1	FALSE	2	0: 176, 1: 135
emp_status_id	1	FALSE	5	1: 184, 5: 88, 3: 14, 4: 14
dept_id	1	FALSE	7	5: 209, 3: 50, 6: 31, 4: 10
perf_score_id	1	FALSE	4	3: 243, 4: 37, 2: 18, 1: 13
from_diversity_job_fair_id	1	FALSE	2	0: 282, 1: 29
termd	1	FALSE	2	0: 207, 1: 104
position_id	1	FALSE	27	19: 137, 20: 57, 3: 27, 18: 15
position	1	FALSE	27	pro: 137, pro: 57, are: 27, pro: 15
state	1	FALSE	28	MA: 276, CT: 6, TX: 3, VT: 2
sex	1	FALSE	2	fem: 176, mal: 135
marital_desc	1	FALSE	5	sin: 137, mar: 124, div: 30, sep: 12
citizen_desc	1	FALSE	3	us_: 295, eli: 12, non: 4
hispanic_latino	1	FALSE	2	no: 283, yes: 28
race_desc	1	FALSE	6	whi: 187, bla: 80, asi: 29, two: 11
term_reason	1	FALSE	18	sti: 207, ano: 20, unh: 14, mor: 11
employment_status	1	FALSE	3	act: 207, vol: 88, ter: 16
department	1	FALSE	6	pro: 209, it/: 50, sal: 31, sof: 11
recruitment_source	1	FALSE	9	ind: 87, lin: 76, goo: 49, emp: 31
performance_score	1	FALSE	4	ful: 243, exc: 37, nee: 18, pip: 13
emp_satisfaction	1	FALSE	5	3: 108, 5: 98, 4: 94, 2: 9
special_projects_count	1	FALSE	9	0: 241, 5: 21, 6: 21, 7: 12

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
emp_id	1	10156.00	89.92	10001	10078.5	10156	10233.5	10311	▇▇▇▇▇
salary	1	69020.68	25156.64	45046	55501.5	62810	72036.0	250000	▇▁▁▁▁
manager_id	1	15.20	8.86	1	10.0	16	19.5	39	▅▇▇▁▂
days_late_last30	1	0.41	1.29	0	0.0	0	0.0	6	▇▁▁▁▁
absences	1	10.24	5.85	1	5.0	10	15.0	20	▇▆▅▇▆

On average, an employee’s salary deviates from the mean by ±25,157 USD

The salary range within the company is 204,954 USD

The total payroll at company level amounts to 21,465,433 USD

The coefficient of variation indicates that the mean is broadly representative,

although the dispersion is relatively high, suggesting low homogeneity of the population

The standard error of the mean salary is 1,426.50 USD

With a 95% probability, the true mean salary lies within the interval:

mean ± 2,806.85 USD, i.e., [66,213 USD, 71,827 USD]

psych::describe(hr2)

The skewness coefficient indicates that the salary distribution is positively skewed (right-skewed)

The kurtosis coefficient indicates that the distribution is leptokurtic

The median absolute deviation (MAD) is 11,834.11 USD

## hr2 
## 
##  31  Variables      311  Observations
## --------------------------------------------------------------------------------
## employee_name 
##        n  missing distinct 
##      311        0      311 
## 
## lowest : Adinolfi, Wilson  K   Ait Sidi, Karthikeyan Akinkuolie, Sarah     Alagbe,Trina          Anderson, Carol      
## highest: Woodson, Jason        Ybarra, Catherine     Zamora, Jennifer      Zhou, Julia           Zima, Colleen        
## --------------------------------------------------------------------------------
## emp_id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      311        0      311        1    10156      104    10016    10032 
##      .25      .50      .75      .90      .95 
##    10078    10156    10234    10280    10296 
## 
## lowest : 10001 10002 10003 10004 10005, highest: 10307 10308 10309 10310 10311
## --------------------------------------------------------------------------------
## married_id 
##        n  missing distinct 
##      311        0        2 
##                       
## Value          0     1
## Frequency    187   124
## Proportion 0.601 0.399
## --------------------------------------------------------------------------------
## status 
##        n  missing distinct 
##      311        0        2 
##                                 
## Value         married no_married
## Frequency         124        187
## Proportion      0.399      0.601
## --------------------------------------------------------------------------------
## marital_status_id 
##        n  missing distinct 
##      311        0        5 
##                                         
## Value          0     1     2     3     4
## Frequency    137   124    30    12     8
## Proportion 0.441 0.399 0.096 0.039 0.026
## --------------------------------------------------------------------------------
## gender_id 
##        n  missing distinct 
##      311        0        2 
##                       
## Value          0     1
## Frequency    176   135
## Proportion 0.566 0.434
## --------------------------------------------------------------------------------
## emp_status_id 
##        n  missing distinct 
##      311        0        5 
##                                         
## Value          1     2     3     4     5
## Frequency    184    11    14    14    88
## Proportion 0.592 0.035 0.045 0.045 0.283
## --------------------------------------------------------------------------------
## dept_id 
##        n  missing distinct 
##      311        0        7 
##                                                     
## Value          1     2     3     4     5     6     7
## Frequency      1     1    50    10   209    31     9
## Proportion 0.003 0.003 0.161 0.032 0.672 0.100 0.029
## --------------------------------------------------------------------------------
## perf_score_id 
##        n  missing distinct 
##      311        0        4 
##                                   
## Value          1     2     3     4
## Frequency     13    18   243    37
## Proportion 0.042 0.058 0.781 0.119
## --------------------------------------------------------------------------------
## from_diversity_job_fair_id 
##        n  missing distinct 
##      311        0        2 
##                       
## Value          0     1
## Frequency    282    29
## Proportion 0.907 0.093
## --------------------------------------------------------------------------------
## salary 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      311        0      308        1    69021    21997    47000    50178 
##      .25      .50      .75      .90      .95 
##    55502    62810    72036    95660   108107 
## 
## lowest :  45046  45069  45115  45395  45433, highest: 170500 178000 180000 220450 250000
## --------------------------------------------------------------------------------
## termd 
##        n  missing distinct 
##      311        0        2 
##                       
## Value          0     1
## Frequency    207   104
## Proportion 0.666 0.334
## --------------------------------------------------------------------------------
## position_id 
##        n  missing distinct 
##      311        0       27 
## 
## lowest : 1  11 12 14 15, highest: 5  6  7  8  9 
## --------------------------------------------------------------------------------
## position 
##        n  missing distinct 
##      311        0       27 
## 
## lowest : accountant_i                 administrative_assistant     area_sales_manager           bi_developer                 bi_director                 
## highest: software_engineer            software_engineering_manager sr._accountant               sr._dba                      sr._network_engineer        
## --------------------------------------------------------------------------------
## state 
##        n  missing distinct 
##      311        0       28 
## 
## lowest : AL AZ CA CO CT, highest: TX UT VA VT WA
## --------------------------------------------------------------------------------
## sex 
##        n  missing distinct 
##      311        0        2 
##                         
## Value      female   male
## Frequency     176    135
## Proportion  0.566  0.434
## --------------------------------------------------------------------------------
## marital_desc 
##        n  missing distinct 
##      311        0        5 
##                                                             
## Value       divorced   married separated    single   widowed
## Frequency         30       124        12       137         8
## Proportion     0.096     0.399     0.039     0.441     0.026
## --------------------------------------------------------------------------------
## citizen_desc 
##        n  missing distinct 
##      311        0        3 
##                                                                       
## Value      eligible_noncitizen         non-citizen          us_citizen
## Frequency                   12                   4                 295
## Proportion               0.039               0.013               0.949
## --------------------------------------------------------------------------------
## hispanic_latino 
##        n  missing distinct 
##      311        0        2 
##                     
## Value        no  yes
## Frequency   283   28
## Proportion 0.91 0.09
## --------------------------------------------------------------------------------
## race_desc 
##        n  missing distinct 
##      311        0        6 
## 
## american_indian_or_alaska_native (3, 0.010), asian (29, 0.093),
## black_or_african_american (80, 0.257), hispanic (1, 0.003), two_or_more_races
## (11, 0.035), white (187, 0.601)
## --------------------------------------------------------------------------------
## term_reason 
##        n  missing distinct 
##      311        0       18 
## 
## lowest : another_position       attendance             career_change          fatal_attraction       gross_misconduct      
## highest: relocation_out_of_area retiring               return_to_school       stillemployed          unhappy               
## --------------------------------------------------------------------------------
## employment_status 
##        n  missing distinct 
##      311        0        3 
##                                                                                
## Value                      active   terminated_for_cause voluntarily_terminated
## Frequency                     207                     16                     88
## Proportion                  0.666                  0.051                  0.283
## --------------------------------------------------------------------------------
## department 
##        n  missing distinct 
##      311        0        6 
##                                                                          
## Value             admin_offices     executive_office                it/is
## Frequency                     9                    1                   50
## Proportion                0.029                0.003                0.161
##                                                                          
## Value                production                sales software_engineering
## Frequency                   209                   31                   11
## Proportion                0.672                0.100                0.035
## --------------------------------------------------------------------------------
## manager_name 
##        n  missing distinct 
##      311        0       21 
## 
## lowest : Alex Sweetwater    Amy Dunn           Board of Directors Brandon R. LeBlanc Brannon Miller    
## highest: Lynn Daneault      Michael Albert     Peter Monroe       Simon Roup         Webster Butler    
## --------------------------------------------------------------------------------
## manager_id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      311        0       23    0.996     15.2     9.47      2.0      4.0 
##      .25      .50      .75      .90      .95 
##     10.0     16.0     19.5     22.0     39.0 
## 
## lowest :  1  2  3  4  5, highest: 20 21 22 30 39
## --------------------------------------------------------------------------------
## recruitment_source 
##        n  missing distinct 
##      311        0        9 
## 
## lowest : careerbuilder           diversity_job_fair      employee_referral       google_search           indeed                 
## highest: indeed                  linkedin                on-line_web_application other                   website                
## --------------------------------------------------------------------------------
## performance_score 
##        n  missing distinct 
##      311        0        4 
##                                                                 
## Value                exceeds       fully_meets needs_improvement
## Frequency                 37               243                18
## Proportion             0.119             0.781             0.058
##                             
## Value                    pip
## Frequency                 13
## Proportion             0.042
## --------------------------------------------------------------------------------
## emp_satisfaction 
##        n  missing distinct 
##      311        0        5 
##                                         
## Value          1     2     3     4     5
## Frequency      2     9   108    94    98
## Proportion 0.006 0.029 0.347 0.302 0.315
## --------------------------------------------------------------------------------
## special_projects_count 
##        n  missing distinct 
##      311        0        9 
##                                                                 
## Value          0     1     2     3     4     5     6     7     8
## Frequency    241     1     1     3     9    21    21    12     2
## Proportion 0.775 0.003 0.003 0.010 0.029 0.068 0.068 0.039 0.006
## --------------------------------------------------------------------------------
## days_late_last30 
##        n  missing distinct     Info     Mean      Gmd 
##      311        0        7    0.286   0.4148   0.7624 
##                                                     
## Value          0     1     2     3     4     5     6
## Frequency    278     1     6     6     8     6     6
## Proportion 0.894 0.003 0.019 0.019 0.026 0.019 0.019
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## absences 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      311        0       20    0.997    10.24    6.749        2        2 
##      .25      .50      .75      .90      .95 
##        5       10       15       18       19 
##                                                                             
## Value          1     2     3     4     5     6     7     8     9    10    11
## Frequency     14    21    16    23    12    16    21    11    14    10    15
## Proportion 0.045 0.068 0.051 0.074 0.039 0.051 0.068 0.035 0.045 0.032 0.048
##                                                                 
## Value         12    13    14    15    16    17    18    19    20
## Frequency      8    17    17    20    23    15     8    16    14
## Proportion 0.026 0.055 0.055 0.064 0.074 0.048 0.026 0.051 0.045
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------

## hr2 %>% select_if(is.numeric) 
## 
##  5  Variables      311  Observations
## --------------------------------------------------------------------------------
## emp_id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      311        0      311        1    10156      104    10016    10032 
##      .25      .50      .75      .90      .95 
##    10078    10156    10234    10280    10296 
## 
## lowest : 10001 10002 10003 10004 10005, highest: 10307 10308 10309 10310 10311
## --------------------------------------------------------------------------------
## salary 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      311        0      308        1    69021    21997    47000    50178 
##      .25      .50      .75      .90      .95 
##    55502    62810    72036    95660   108107 
## 
## lowest :  45046  45069  45115  45395  45433, highest: 170500 178000 180000 220450 250000
## --------------------------------------------------------------------------------
## manager_id 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      311        0       23    0.996     15.2     9.47      2.0      4.0 
##      .25      .50      .75      .90      .95 
##     10.0     16.0     19.5     22.0     39.0 
## 
## lowest :  1  2  3  4  5, highest: 20 21 22 30 39
## --------------------------------------------------------------------------------
## days_late_last30 
##        n  missing distinct     Info     Mean      Gmd 
##      311        0        7    0.286   0.4148   0.7624 
##                                                     
## Value          0     1     2     3     4     5     6
## Frequency    278     1     6     6     8     6     6
## Proportion 0.894 0.003 0.019 0.019 0.026 0.019 0.019
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## absences 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      311        0       20    0.997    10.24    6.749        2        2 
##      .25      .50      .75      .90      .95 
##        5       10       15       18       19 
##                                                                             
## Value          1     2     3     4     5     6     7     8     9    10    11
## Frequency     14    21    16    23    12    16    21    11    14    10    15
## Proportion 0.045 0.068 0.051 0.074 0.039 0.051 0.068 0.035 0.045 0.032 0.048
##                                                                 
## Value         12    13    14    15    16    17    18    19    20
## Frequency      8    17    17    20    23    15     8    16    14
## Proportion 0.026 0.055 0.055 0.064 0.074 0.048 0.026 0.051 0.045
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------

Skewness coefficient = 3.290213 indicates a right-skewed salary distribution (>0)

Kurtosis coefficient = 15.18561 indicates a leptokurtic distribution (>0)

The trimmed mean salary (excluding the lowest and highest 10% of values) is 64,523.67 USD

Minimum salary case

The minimum salary (45,046 USD) corresponds to employee Zima, Colleen, who holds the position of Production Technician I in the Production department, managed by David Stanley

Maximum salary case

The maximum salary (250,000 USD) corresponds to employee King, Janet, who is President & CEO on the Board of Directors

Graphical Analysis of Numerical Variables

Creating a dataset containing only numerical variables

num_variables <- hr2 %>%
  select_if(is.numeric) %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = "variable", values_to = "value")

###### Histograms of numerical variables 
num_variables %>%
  ggplot(aes(x = value, fill = variable)) +
  geom_histogram() +
  facet_wrap(~variable, scale = "free") +
  guides(fill = FALSE) +
  theme(axis.text.x = element_text(size = 9)) +
  theme(strip.text.x = element_text(size = 12)) +
  xlab("") + ylab("Frequency")

Salary Histogram

hist(hr2$salary,
     breaks = 'Freedman-Diaconis',
     freq = FALSE,
     col = 'thistle1',
     main = 'Salary Distribution Histogram',
     ylab = 'Frequency',
     xlab = 'Salary Value (USD)')

abline(v = mean(hr2$salary), col = 'black', lwd = 3)

curve(dnorm(x, mean = mean(hr2$salary), sd = sd(hr2$salary)),
      col = 'thistle4',
      lwd = 3,
      add = TRUE)

lines(density(hr2$salary), col = 'blue', lwd = 3)

Salary Boxplot, Q-Q Plot

###### Salary Distribution by Gender ######
hr2_m <- hr2 %>% filter(sex == 'male')
hr2_f <- hr2 %>% filter(sex == 'female')

Female and male employees

Categorical Variables Analysis

Analysis of some selected categorical variables

Absolute frequency table for the variable Sex

table(hr2$sex)

## 
## female   male 
##    176    135

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  311 
## 
##  
##           |    female |      male | 
##           |-----------|-----------|
##           |       176 |       135 | 
##           |     0.566 |     0.434 | 
##           |-----------|-----------|
## 
## 
## 
##

Most employees in the company are female

Absolute frequency table for the variable MarriedID / Status

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  311 
## 
##  
##            |    married | no_married | 
##            |------------|------------|
##            |        124 |        187 | 
##            |      0.399 |      0.601 | 
##            |------------|------------|
## 
## 
## 
##

Approximately 40% of employees are married, while 60% are unmarried (i.e., 187 employees out of a total of 311 are unmarried)

Absolute frequency table for the variable Marital Status

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  311 
## 
##  
##           |  divorced |   married | separated |    single |   widowed | 
##           |-----------|-----------|-----------|-----------|-----------|
##           |        30 |       124 |        12 |       137 |         8 | 
##           |     0.096 |     0.399 |     0.039 |     0.441 |     0.026 | 
##           |-----------|-----------|-----------|-----------|-----------|
## 
## 
## 
##

Absolute frequency table for the variable Department

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  311 
## 
##  
##                      |        admin_offices |     executive_office |                it/is |           production |                sales | 
##                      |----------------------|----------------------|----------------------|----------------------|----------------------|
##                      |                    9 |                    1 |                   50 |                  209 |                   31 | 
##                      |                0.029 |                0.003 |                0.161 |                0.672 |                0.100 | 
##                      |----------------------|----------------------|----------------------|----------------------|----------------------|
## 
## 
##                      | software_engineering | 
##                      |----------------------|
##                      |                   11 | 
##                      |                0.035 | 
##                      |----------------------|
## 
## 
## 
##

9.6% of employees are divorced, 40% are married, 4% are separated, 44% are single, and 2.6% are widowed.

Descriptive analysis of numerical variables by categories of categorical variables

Statistical indicators for salary and absences by gender

hr2_f <- hr2 %>% filter(sex == 'female') 
hr2_m <- hr2 %>% filter(sex == 'male')

Female employees analysis

The minimum salary for female employees is 45,046 USD, corresponding to employee Zima, Colleen, who holds the position of Production Technician I in the Production department

hr2_f[hr2_f$salary == min(hr2_f$salary),]%>%
  DT::datatable(
  .,
  options = list(
    scrollX = TRUE,
    pageLength = 5
  )
)

The minimum number of absences is 1 day, corresponding to 8 employees working in the following departments: IT/IS, Production, and Sales, holding positions such as Production Technician I, Production Technician II, IT Support, and Area Sales Manager

hr2_f[hr2_f$absences == min(hr2_f$absences),]%>%
  DT::datatable(
  .,
  options = list(
    scrollX = TRUE,
    pageLength = 5
  )
)

The maximum salary for female employees is 250,000 USD, corresponding to employee King, Janet, who holds the position of President & CEO in the Executive Office (Board of Directors)

hr2_f[hr2_f$salary == max(hr2_f$salary),]%>%
  DT::datatable(
  .,
  options = list(
    scrollX = TRUE,
    pageLength = 5
  )
)

The maximum number of absences is 20 days, corresponding to 6 employees working in the Production and Sales departments, holding positions such as Production Technician I and II

hr2_f[hr2_f$absences == max(hr2_f$absences),] %>%
  DT::datatable(
  .,
  options = list(
    scrollX = TRUE,
    pageLength = 5
  )
)

Skewness and Kurtosis by Gender

## # A tibble: 4 × 4
## # Groups:   sex [2]
##   sex    variable skewness kurtosis
##   <fct>  <chr>       <dbl>    <dbl>
## 1 female absences   0.0232    -1.40
## 2 female salary     4.04      21.5 
## 3 male   absences   0.0363    -1.15
## 4 male   salary     2.16       5.51

sex	mode_absences	mode_salary	n
female	16	6.12e+04	176
male	13		135

Graphical analysis of numerical variables

Tabular representation of numerical variable frequencies

Generate a normal distribution with the same mean and standard deviation

boxplot(hr2$salary, salary_norm, main = “Comparison of the Salary Variable Boxplotthe Normal Distribution Boxplot”, at = c(1,2), las = 2, col = c(“orange”,“red”), border = “brown”, horizontal = FALSE, notch = TRUE, ylim = c(0, 250000) )

text(x = 1:length(hr2), y = par(“usr”)[3] - 0.45, at = c(1, 2), labels = c(“Salary”, “Normal”), xpd = NA, srt = 35, adj = 1.2, cex = 1.2)

Outliers (extreme values) – Graphical representation

Calculate percentage of outliers for each variable

Graphical representation of outliers

data.frame(variable.name, outlier.percentage_g1) %>% 
  mutate(variable.name = as.factor(variable.name),
         outlier.percentage_g1 = round(outlier.percentage_g1, 2)) %>% 
  ggplot(aes(y = fct_reorder(variable.name, outlier.percentage_g1),
             x = outlier.percentage_g1)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold")) +
  labs(x = "Number of Outliers (%)",
       y = "Numerical Variables",
       title = "Proportion of Outliers for Each Variable\n(Determined Using the IQR Method)") +
  geom_label(aes(label = paste0(outlier.percentage_g1, "%"))) +
  scale_x_continuous(limits = c(0, 15))

Graphical analysis of categorical variables

p1 = plot_bar_sorted(hr2, sex, "Employee distribution by Gender")

p2 = plot_bar_sorted(hr2, state, "Employee distribution by state")

p1 = plot_bar_sorted(hr2, position, "Employee distribution by position")

p2 = plot_bar_sorted(hr2, marital_desc, "Employee distribution by marital status")

p1 = plot_bar_sorted(hr2, citizen_desc, "Employee distribution by citizenship status")

p2 = plot_bar_sorted(hr2, race_desc, "Employee distribution by race")

p1 = plot_bar_sorted(hr2, employment_status, "Employee distribution by employment status")

p2 = plot_bar_sorted(hr2, recruitment_source, "Employee distribution by recruitment source")

p1 = plot_bar_sorted(hr2, department, "Employee distribution by department")

p2 = plot_bar_sorted(hr2, special_projects_count, 
                "Employee Distribution by Special Projects Count")

Basic Barplot

Histograms of the gen variable

Histograms of the departament variable

Histograms of the employee status variable

The exploratory data analysis (EDA) emphases some important aspects like:

Salary distribution characteristics

The salary distribution within the company is positively skewed (right-skewed), as indicated by the high skewness coefficient (Skewness > 0).

This suggests that:

Most employees earn salaries clustered around the lower and middle range.

A small number of employees earn significantly higher salaries, pulling the mean upward.

The maximum salary (250,000 USD) substantially exceeds the median, indicating the presence of extreme values.

The kurtosis coefficient indicates a leptokurtic distribution.

The presence of some extreme values (outliers) was observed.

Salary variability and representativeness

The standard deviation (≈ 25,157 USD) indicates a relatively high dispersion of salaries around the mean.

The coefficient of variation suggests that:

        - The population is relatively heterogeneous.

        - The mean salary is broadly representative, but variability is significant.

Gender-based differences

The dataset shows that:

    -   The majority of employees are female.
    -   Both male and female salary distributions are right-skewed.

Extreme salary values are present in both groups.

The maximum salary (President & CEO position) belongs to a female employee.

Absences and late days

The analysis of absences shows:

        - The presence of outliers in all three numerical variables (salary, absences, days late).

        - A non-normal distribution pattern.

Some employees have significantly higher absence counts compared to the majority.

Categorical variable insights

The categorical analysis reveals that:

        - Most employees are female.

        - Approximately 40% of employees are married, while the majority are unmarried.

        - The workforce is distributed across multiple departments, with some departments significantly larger than                  others.

        - The majority of employees hold operational or production-related positions.

Implications for further modeling

Based on EDA analysis we so that:

-   Salary exhibits non-normal behavior and strong right skewness.

-   Extreme values (executive-level salaries) may influence regression results.

-   Transformation techniques (e.g., log transformation of salary) may improve model performance.

-   Categorical variables such as department, position, and employment status are likely strong predictors of salary.

-   Gender alone may not fully explain salary variation without controlling for position and department.

Conclusion

The company exhibits substantial salary variability, structural heterogeneity across departments, and a right-skewed compensation structure influenced by a small number of high-paying executive roles.

The exploratory analysis confirms that salary is influenced by multiple structural and demographic factors, justifying the development of a predictive scoring model in the next stage of analysis.

Bibliography

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Kuhn et al., (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. R package version 2.1.5, https://CRAN.R-project.org/package=skimr.
Peterson BG, Carl P (2020). PerformanceAnalytics: Econometric Tools for Performance and Risk Analysis. R package version 2.0.4, https://CRAN.R-project.org/package=PerformanceAnalytics.
Wickham H, Pedersen T, Seidel D (2025). scales: Scale Functions for Visualization. R package version 1.4.0, https://CRAN.R-project.org/package=scales.
Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
Schloerke B, Cook D, Larmarange J, Briatte F, Marbach M, Thoen E, Elberg A, Crowley J (2021). GGally: Extension to ‘ggplot2’. R package version 2.1.2, https://CRAN.R-project.org/package=GGally.
Taiyun Wei and Viliam Simko (2021). R package ‘corrplot’: Visualization of a Correlation Matrix (Version 0.92). Available from https://github.com/taiyun/corrplot
Rubba C (2023). htmltab: Assemble Data Frames from HTML Tables. R package version 0.8.2.9000, https://github.com/htmltab/htmltab.
Brandon M. Greenwell and Bradley C. Boehmke (2020). Variable Importance Plots—An Introduction to the vip Package. The R Journal, 12(1), 343–366. URL https://doi.org/10.32614/RJ-2020-013.
Marvin N. Wright, Andreas Ziegler (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1), 1-17. doi:10.18637/jss.v077.i01
Robinson D, Hayes A, Couch S (2023). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.5, https://CRAN.R-project.org/package=broom. 13.H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Wickham H, Hester J, Bryan J (2023). readr: Read Rectangular Text Data. R package version 2.1.4, https://CRAN.R-project.org/package=readr.
Zhu H (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4, https://CRAN.R-project.org/package=kableExtra.
Cui B (2020). DataExplorer: Automate Data Exploration and Treatment. R package version 0.8.2, https://CRAN.R-project.org/package=DataExplorer.
Rushworth A (2022). inspectdf: Inspection, Comparison and Visualisation of Data Frames. R package version 0.0.12, https://CRAN.R-project.org/package=inspectdf.
Grosjean P, Ibanez F (2018). pastecs: Package for Analysis of Space-Time Ecological Series. R package version 1.3.21, https://CRAN.R-project.org/package=pastecs.
https://www.kaggle.com/datasets/rhuebner/human-resources-data-set
William Revelle (2023). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois. R package version 2.3.9, https://CRAN.R-project.org/package=psych.

Scoring

by Irimia Mihaela

2026-02-26

Exploratory data analysis - EDA

Analyzed data set: HRDataset_v14 available here

Project Objectives

Dataset description

DATA CLEANING

Exploratory data analysis

Graphical and numerica analysis of variables

Numerical Variables Indicators

Descriptive statistics for numerical variables

Graphical Analysis of Numerical Variables

Creating a dataset containing only numerical variables

Salary Histogram

Salary Boxplot, Q-Q Plot

Female and male employees

Categorical Variables Analysis

Descriptive analysis of numerical variables by categories of categorical variables

Female employees analysis

Skewness and Kurtosis by Gender

Graphical analysis of numerical variables

Tabular representation of numerical variable frequencies

Generate a normal distribution with the same mean and standard deviation

Outliers (extreme values) – Graphical representation

Graphical analysis of categorical variables

Basic Barplot

Histograms of the gen variable

Histograms of the departament variable

Histograms of the employee status variable

The exploratory data analysis (EDA) emphases some important aspects like:

Conclusion

Bibliography

Scoring

by Irimia Mihaela

2026-02-26

Exploratory data analysis - EDA

Analyzed data set: HRDataset_v14 available here

Project Objectives

Dataset description

DATA CLEANING

Exploratory data analysis

Graphical and numerica analysis of variables

Numerical Variables Indicators

Descriptive statistics for numerical variables

Graphical Analysis of Numerical Variables

Creating a dataset containing only numerical variables

Salary Histogram

Salary Boxplot, Q-Q Plot

Female and male employees

Categorical Variables Analysis

Descriptive analysis of numerical variables by categories of categorical variables

Female employees analysis

Skewness and Kurtosis by Gender

Modal values of numerical variables by gender

Graphical analysis of numerical variables

Tabular representation of numerical variable frequencies

Generate a normal distribution with the same mean and standard deviation

Outliers (extreme values) – Graphical representation

Graphical analysis of categorical variables

Basic Barplot

Histograms of the gen variable

Histograms of the departament variable

Histograms of the employee status variable

The exploratory data analysis (EDA) emphases some important aspects like:

Conclusion

Bibliography