Loading the necesary packages…
In order to determine the factors that significantly influence an individual’s salary within a company, we selected the dataset entitled Human Resources Data Set, published on the Kaggle platform: https://www.kaggle.com/datasets/rhuebner/human-resources-data-set
This dataset was created by Dr. Carla Patalano and Dr. Rich. It was designed as an educational resource to help students learn how to perform exploratory data analysis (EDA). The dataset provides a wide range of features that enable both data visualization and the development of machine learning / predictive analytics models.
Within this dataset, we decided to explore and attempt to answer several research questions, such as:
Are there areas within the company where salary distribution is not equitable?
Is an individual’s salary influenced by any specific factors present in the dataset?
Can we build a scoring model capable of estimating an employee’s salary? If so, to what extent is the model accurate?
loading the dataset
Dataset description
The structure of the dataset
glimpse(HRDataset_v14)
## Rows: 311
## Columns: 36
## $ Employee_Name <chr> "Adinolfi, Wilson K", "Ait Sidi, Karthikey…
## $ EmpID <dbl> 10026, 10084, 10196, 10088, 10069, 10002, 1…
## $ MarriedID <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
## $ MaritalStatusID <dbl> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
## $ GenderID <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
## $ EmpStatusID <dbl> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
## $ DeptID <dbl> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
## $ PerfScoreID <dbl> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
## $ FromDiversityJobFairID <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
## $ Salary <dbl> 62506, 104437, 64955, 64991, 50825, 57568, …
## $ Termd <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ PositionID <dbl> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
## $ Position <chr> "Production Technician I", "Sr. DBA", "Prod…
## $ State <chr> "MA", "MA", "MA", "MA", "MA", "MA", "MA", "…
## $ Zip <chr> "01960", "02148", "01810", "01886", "02169"…
## $ DOB <chr> "07/10/83", "05/05/75", "09/19/88", "09/27/…
## $ Sex <chr> "M", "M", "F", "F", "F", "F", "F", "M", "F"…
## $ MaritalDesc <chr> "Single", "Married", "Married", "Married", …
## $ CitizenDesc <chr> "US Citizen", "US Citizen", "US Citizen", "…
## $ HispanicLatino <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ RaceDesc <chr> "White", "White", "White", "White", "White"…
## $ DateofHire <chr> "7/5/2011", "3/30/2015", "7/5/2011", "1/7/2…
## $ DateofTermination <chr> NA, "6/16/2016", "9/24/2012", NA, "9/6/2016…
## $ TermReason <chr> "N/A-StillEmployed", "career change", "hour…
## $ EmploymentStatus <chr> "Active", "Voluntarily Terminated", "Volunt…
## $ Department <chr> "Production", "IT/IS", "Production", "Produ…
## $ ManagerName <chr> "Michael Albert", "Simon Roup", "Kissy Sull…
## $ ManagerID <dbl> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 2…
## $ RecruitmentSource <chr> "LinkedIn", "Indeed", "LinkedIn", "Indeed",…
## $ PerformanceScore <chr> "Exceeds", "Fully Meets", "Fully Meets", "F…
## $ EngagementSurvey <dbl> 4.60, 4.96, 3.02, 4.84, 5.00, 5.00, 3.04, 5…
## $ EmpSatisfaction <dbl> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
## $ SpecialProjectsCount <dbl> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
## $ LastPerformanceReview_Date <chr> "1/17/2019", "2/24/2016", "5/15/2012", "1/3…
## $ DaysLateLast30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Absences <dbl> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…
hr <- HRDataset_v14
Dataset dimensions
dim(hr)
## [1] 311 36
The dataset contains 36 variables and 311 observations
Checking the structure of the dataset
str(hr) # Displays the variable types assigned by R for each column
## spc_tbl_ [311 × 36] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Employee_Name : chr [1:311] "Adinolfi, Wilson K" "Ait Sidi, Karthikeyan" "Akinkuolie, Sarah" "Alagbe,Trina" ...
## $ EmpID : num [1:311] 10026 10084 10196 10088 10069 ...
## $ MarriedID : num [1:311] 0 1 1 1 0 0 0 0 0 0 ...
## $ MaritalStatusID : num [1:311] 0 1 1 1 2 0 0 4 0 2 ...
## $ GenderID : num [1:311] 1 1 0 0 0 0 0 1 0 1 ...
## $ EmpStatusID : num [1:311] 1 5 5 1 5 1 1 1 3 1 ...
## $ DeptID : num [1:311] 5 3 5 5 5 5 4 5 5 3 ...
## $ PerfScoreID : num [1:311] 4 3 3 3 3 4 3 3 3 3 ...
## $ FromDiversityJobFairID : num [1:311] 0 0 0 0 0 0 0 0 1 0 ...
## $ Salary : num [1:311] 62506 104437 64955 64991 50825 ...
## $ Termd : num [1:311] 0 1 1 0 1 0 0 0 0 0 ...
## $ PositionID : num [1:311] 19 27 20 19 19 19 24 19 19 14 ...
## $ Position : chr [1:311] "Production Technician I" "Sr. DBA" "Production Technician II" "Production Technician I" ...
## $ State : chr [1:311] "MA" "MA" "MA" "MA" ...
## $ Zip : chr [1:311] "01960" "02148" "01810" "01886" ...
## $ DOB : chr [1:311] "07/10/83" "05/05/75" "09/19/88" "09/27/88" ...
## $ Sex : chr [1:311] "M" "M" "F" "F" ...
## $ MaritalDesc : chr [1:311] "Single" "Married" "Married" "Married" ...
## $ CitizenDesc : chr [1:311] "US Citizen" "US Citizen" "US Citizen" "US Citizen" ...
## $ HispanicLatino : chr [1:311] "No" "No" "No" "No" ...
## $ RaceDesc : chr [1:311] "White" "White" "White" "White" ...
## $ DateofHire : chr [1:311] "7/5/2011" "3/30/2015" "7/5/2011" "1/7/2008" ...
## $ DateofTermination : chr [1:311] NA "6/16/2016" "9/24/2012" NA ...
## $ TermReason : chr [1:311] "N/A-StillEmployed" "career change" "hours" "N/A-StillEmployed" ...
## $ EmploymentStatus : chr [1:311] "Active" "Voluntarily Terminated" "Voluntarily Terminated" "Active" ...
## $ Department : chr [1:311] "Production" "IT/IS" "Production" "Production" ...
## $ ManagerName : chr [1:311] "Michael Albert" "Simon Roup" "Kissy Sullivan" "Elijiah Gray" ...
## $ ManagerID : num [1:311] 22 4 20 16 39 11 10 19 12 7 ...
## $ RecruitmentSource : chr [1:311] "LinkedIn" "Indeed" "LinkedIn" "Indeed" ...
## $ PerformanceScore : chr [1:311] "Exceeds" "Fully Meets" "Fully Meets" "Fully Meets" ...
## $ EngagementSurvey : num [1:311] 4.6 4.96 3.02 4.84 5 5 3.04 5 4.46 5 ...
## $ EmpSatisfaction : num [1:311] 5 3 3 5 4 5 3 4 3 5 ...
## $ SpecialProjectsCount : num [1:311] 0 6 0 0 0 0 4 0 0 6 ...
## $ LastPerformanceReview_Date: chr [1:311] "1/17/2019" "2/24/2016" "5/15/2012" "1/3/2019" ...
## $ DaysLateLast30 : num [1:311] 0 0 0 0 0 0 0 0 0 0 ...
## $ Absences : num [1:311] 1 17 3 15 2 15 19 19 4 16 ...
## - attr(*, "spec")=
## .. cols(
## .. Employee_Name = col_character(),
## .. EmpID = col_double(),
## .. MarriedID = col_double(),
## .. MaritalStatusID = col_double(),
## .. GenderID = col_double(),
## .. EmpStatusID = col_double(),
## .. DeptID = col_double(),
## .. PerfScoreID = col_double(),
## .. FromDiversityJobFairID = col_double(),
## .. Salary = col_double(),
## .. Termd = col_double(),
## .. PositionID = col_double(),
## .. Position = col_character(),
## .. State = col_character(),
## .. Zip = col_character(),
## .. DOB = col_character(),
## .. Sex = col_character(),
## .. MaritalDesc = col_character(),
## .. CitizenDesc = col_character(),
## .. HispanicLatino = col_character(),
## .. RaceDesc = col_character(),
## .. DateofHire = col_character(),
## .. DateofTermination = col_character(),
## .. TermReason = col_character(),
## .. EmploymentStatus = col_character(),
## .. Department = col_character(),
## .. ManagerName = col_character(),
## .. ManagerID = col_double(),
## .. RecruitmentSource = col_character(),
## .. PerformanceScore = col_character(),
## .. EngagementSurvey = col_double(),
## .. EmpSatisfaction = col_double(),
## .. SpecialProjectsCount = col_double(),
## .. LastPerformanceReview_Date = col_character(),
## .. DaysLateLast30 = col_double(),
## .. Absences = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# glimpse(hr) # Displays the variable types assigned by R for each column
We get the same information as above, but more detailed: 311 observations and 35 variables and the type of the variables
We remove variables that are irrelevant or contain redundant information All variable names are converted to lowercase and formatted without spaces
hr <- hr %>%
janitor::clean_names()
Identifying numerical variables that should be treated as categorical
numerical_categorical_vars <- hr %>%
dplyr::select(-emp_id, -salary, -zip, -engagement_survey, -days_late_last30, -absences) %>%
dplyr::select_if(is.numeric) %>%
names()
numerical_categorical_vars
## [1] "married_id" "marital_status_id"
## [3] "gender_id" "emp_status_id"
## [5] "dept_id" "perf_score_id"
## [7] "from_diversity_job_fair_id" "termd"
## [9] "position_id" "manager_id"
## [11] "emp_satisfaction" "special_projects_count"
Alternative method (commented out)
numerical_categorical_vars <- hr %>%
dplyr::select(-emp_id, -salary, -zip, -engagement_survey, -days_late_last30, -absences) %>%
dplyr::select_if(is.numeric) %>%
gather(var, value) %>%
group_by(var) %>%
summarise(n = n(),
missing = sum(is.na(value)))
numerical_categorical_vars
| var | n | missing |
|---|---|---|
| dept_id | 311 | 0 |
| emp_satisfaction | 311 | 0 |
| emp_status_id | 311 | 0 |
| from_diversity_job_fair_id | 311 | 0 |
| gender_id | 311 | 0 |
| manager_id | 311 | 8 |
| marital_status_id | 311 | 0 |
| married_id | 311 | 0 |
| perf_score_id | 311 | 0 |
| position_id | 311 | 0 |
| special_projects_count | 311 | 0 |
| termd | 311 | 0 |
Converting selected numerical variables into categorical variables (factors)
hr <- hr %>%
mutate(across(c(married_id, marital_status_id, gender_id, emp_status_id,
dept_id, perf_score_id, from_diversity_job_fair_id,
termd, position_id, emp_satisfaction,
special_projects_count), as.factor))
Missing values identification
# summary(hr)
any(is.na(hr))
## [1] TRUE
sum(is.na(hr))
## [1] 215
There are 8 missing values in manager_id
any(is.na(hr$manager_id))
## [1] TRUE
print('The number of the missing values is', sum(is.na(hr$manager_id)))
## [1] "The number of the missing values is"
converting the data type variable form string type to date type using separate(data, col, into, sep) function
hr1 <- hr%>%
as.data.frame()%>%
separate(dob, c('zi', 'luna', 'an'), '/')%>%
mutate(an1 = paste0('19', an))%>%
mutate(dob1 = paste0(zi, '/', luna, '/', an1))
hr1 <- hr1%>%
mutate_at(vars(dob1, dateof_hire, dateof_hire, dateof_termination, last_performance_review_date), funs(as.Date(., format = '%d/%m/%Y')))
DT::datatable(
hr1,
options = list(
scrollX = TRUE,
pageLength = 5
)
)
data_type<- hr1%>%
dplyr::select(dob1, dateof_hire, dateof_hire, dateof_termination, last_performance_review_date)
# summary(data_type)
checking the missing values of the data type variables
summary(hr1)
## employee_name emp_id married_id marital_status_id gender_id
## Length:311 Min. :10001 0:187 0:137 0:176
## Class :character 1st Qu.:10078 1:124 1:124 1:135
## Mode :character Median :10156 2: 30
## Mean :10156 3: 12
## 3rd Qu.:10234 4: 8
## Max. :10311
##
## emp_status_id dept_id perf_score_id from_diversity_job_fair_id
## 1:184 1: 10 1: 13 0:282
## 2: 11 2: 1 2: 18 1: 29
## 3: 14 3: 50 3:243
## 4: 14 4: 10 4: 37
## 5: 88 5:208
## 6: 32
##
## salary termd position_id position state
## Min. : 45046 0:207 19 :137 Length:311 Length:311
## 1st Qu.: 55502 1:104 20 : 57 Class :character Class :character
## Median : 62810 3 : 27 Mode :character Mode :character
## Mean : 69021 18 : 13
## 3rd Qu.: 72036 24 : 9
## Max. :250000 9 : 8
## (Other): 60
## zip zi luna an
## Length:311 Length:311 Length:311 Length:311
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## sex marital_desc citizen_desc hispanic_latino
## Length:311 Length:311 Length:311 Length:311
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## race_desc dateof_hire dateof_termination
## Length:311 Min. :2006-09-01 Min. :2012-02-01
## Class :character 1st Qu.:2011-10-01 1st Qu.:2014-04-04
## Mode :character Median :2012-09-01 Median :2015-06-08
## Mean :2013-02-04 Mean :2015-06-15
## 3rd Qu.:2014-10-11 3rd Qu.:2016-08-05
## Max. :2018-09-07 Max. :2018-10-11
## NA's :138 NA's :265
## term_reason employment_status department manager_name
## Length:311 Length:311 Length:311 Length:311
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## manager_id recruitment_source performance_score engagement_survey
## Min. : 1.00 Length:311 Length:311 Min. :1.12
## 1st Qu.:10.00 Class :character Class :character 1st Qu.:3.69
## Median :15.00 Mode :character Mode :character Median :4.28
## Mean :14.57 Mean :4.11
## 3rd Qu.:19.00 3rd Qu.:4.70
## Max. :39.00 Max. :5.00
## NA's :8
## emp_satisfaction special_projects_count last_performance_review_date
## 1: 2 0 :241 Min. :2011-04-05
## 2: 9 5 : 21 1st Qu.:2015-02-07
## 3:108 6 : 21 Median :2018-05-02
## 4: 94 7 : 12 Mean :2017-03-14
## 5: 98 4 : 9 3rd Qu.:2019-07-01
## 3 : 3 Max. :2019-12-02
## (Other): 4 NA's :184
## days_late_last30 absences an1 dob1
## Min. :0.0000 Min. : 1.00 Length:311 Min. :1951-02-01
## 1st Qu.:0.0000 1st Qu.: 5.00 Class :character 1st Qu.:1973-12-13
## Median :0.0000 Median :10.00 Mode :character Median :1980-06-21
## Mean :0.4148 Mean :10.24 Mean :1979-02-14
## 3rd Qu.:0.0000 3rd Qu.:15.00 3rd Qu.:1986-02-20
## Max. :6.0000 Max. :20.00 Max. :1992-07-05
## NA's :169
numer_var_char <- hr1 %>%
dplyr::select(-employee_name, -zip, -manager_name, -zi, -luna, -an, -an1)%>%
dplyr::select_if(is.character) %>%
names()
numer_var_char
## [1] "position" "state" "sex"
## [4] "marital_desc" "citizen_desc" "hispanic_latino"
## [7] "race_desc" "term_reason" "employment_status"
## [10] "department" "recruitment_source" "performance_score"
number of persons as function of status variable
denumire_status_married_id <- hr1 %>%
dplyr::select(married_id, marital_desc)%>%
arrange(married_id) %>%
mutate(id = row_number())%>%
spread(marital_desc, married_id)%>%
dplyr::select(-id) %>%
gather(var, values) %>%
na.omit()%>%
group_by(var)%>%
summarise(n = n())
denumire_status_married_id %>%
arrange(desc(n))
| var | n |
|---|---|
| Single | 137 |
| Married | 124 |
| Divorced | 30 |
| Separated | 12 |
| Widowed | 8 |
The number of married/unmarried persons
status_married_id <- hr1 %>%
dplyr::select(married_id)%>%
arrange(married_id) %>%
mutate(id = row_number(),
status = if_else(married_id == '1', 'Married', 'No_Married'))%>%
# mutate(married_id = as.numeric(married_id))%>%
spread(status, married_id)%>%
dplyr::select(-id) %>%
gather(var, values) %>%
na.omit()%>%
group_by(var)%>%
summarise(n = n())
status_married_id
| var | n |
|---|---|
| Married | 124 |
| No_Married | 187 |
converting the string variables into categorical or factor variables
hr2 <- hr1 %>%
mutate_at(vars(position, state, sex, marital_desc, citizen_desc, hispanic_latino, race_desc,
term_reason, employment_status, department, recruitment_source, performance_score), funs(factor))
or the alternative way
identifying the mangers with missing values for the id_manager
hr_manageri_departamente <- hr1 %>%
dplyr::select(manager_id, manager_name, dept_id, department) %>%
arrange(dept_id, department, manager_id, manager_name) %>%
group_by(manager_name)%>%
distinct(manager_id, manager_name, department, dept_id)
hr_manageri_departamente %>%
filter(is.na(manager_id))
## # A tibble: 1 × 4
## # Groups: manager_name [1]
## manager_id manager_name department dept_id
## <dbl> <chr> <chr> <fct>
## 1 NA Webster Butler Production 5
extracting the name of the manager with missing value for the manager_id variable
nume_manager <- hr1 %>%
dplyr::select(manager_id, manager_name, dept_id, department) %>%
arrange(dept_id, department, manager_id, manager_name) %>%
group_by(manager_name)%>%
distinct(manager_id, manager_name, department, dept_id)%>%
filter(is.na(manager_id))%>%
pull(manager_name)
print(paste('The name of the manager is: ', nume_manager, collapse = ', '))
## [1] "The name of the manager is: Webster Butler"
The name and the number of the department in which Webster Butler activates
departament <- hr_manageri_departamente%>%
filter(manager_name == nume_manager)
departament
## # A tibble: 2 × 4
## # Groups: manager_name [1]
## manager_id manager_name department dept_id
## <dbl> <chr> <chr> <fct>
## 1 39 Webster Butler Production 5
## 2 NA Webster Butler Production 5
extracting the manager_id of the Webster Butler
nr_departament <- hr_manageri_departamente%>%
filter(manager_name == nume_manager) %>%
filter(!is.na(manager_id))%>%
pull(manager_id)
print(paste('The number of the id is: ', nr_departament))
## [1] "The number of the id is: 39"
filling de missing manager_id value
hr1 <- hr1 %>%
mutate(manager_id = if_else(manager_name == nume_manager, nr_departament, manager_id))
## manager_id manager_name
## Min. : 1.0 Length:311
## 1st Qu.:10.0 Class :character
## Median :16.0 Mode :character
## Mean :15.2
## 3rd Qu.:19.5
## Max. :39.0
temp <- hr_manageri_departamente %>%
dplyr::select(manager_name, manager_id, department, dept_id) %>%
arrange(manager_name)%>%
mutate(nr = rank(manager_name))%>%
arrange(nr)
the name and number of departments
departamente <- hr1 %>%
distinct(department, dept_id)%>%
arrange(dept_id)
we notice that production department has assigned 2 different id department number (5 and 6)
checking for the number of observation/employees registered in each department
departamente <- hr1 %>%
dplyr::select(department, dept_id)%>%
filter(department == 'Production')%>%
arrange(dept_id) %>%
mutate(id = row_number())%>%
spread(dept_id, department)%>%
dplyr::select(-id) %>%
gather(var, values) %>%
na.omit()%>%
group_by(var)%>%
summarise(n = n())
departamente
| var | n |
|---|---|
| 5 | 208 |
| 6 | 1 |
correcting the number of the department for the Production section
hr1 <- hr1 %>%
mutate(dept_id = if_else(department == 'Production', '5', dept_id))
modifying the department number of the Admin Offices
hr1 <- hr1 %>%
mutate(dept_id = if_else(department == 'Admin Offices', '7', dept_id))
summary(hr1)
## employee_name emp_id married_id marital_status_id gender_id
## Length:311 Min. :10001 0:187 0:137 0:176
## Class :character 1st Qu.:10078 1:124 1:124 1:135
## Mode :character Median :10156 2: 30
## Mean :10156 3: 12
## 3rd Qu.:10234 4: 8
## Max. :10311
##
## emp_status_id dept_id perf_score_id from_diversity_job_fair_id
## 1:184 Length:311 1: 13 0:282
## 2: 11 Class :character 2: 18 1: 29
## 3: 14 Mode :character 3:243
## 4: 14 4: 37
## 5: 88
##
##
## salary termd position_id position state
## Min. : 45046 0:207 19 :137 Length:311 Length:311
## 1st Qu.: 55502 1:104 20 : 57 Class :character Class :character
## Median : 62810 3 : 27 Mode :character Mode :character
## Mean : 69021 18 : 13
## 3rd Qu.: 72036 24 : 9
## Max. :250000 9 : 8
## (Other): 60
## zip zi luna an
## Length:311 Length:311 Length:311 Length:311
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## sex marital_desc citizen_desc hispanic_latino
## Length:311 Length:311 Length:311 Length:311
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## race_desc dateof_hire dateof_termination
## Length:311 Min. :2006-09-01 Min. :2012-02-01
## Class :character 1st Qu.:2011-10-01 1st Qu.:2014-04-04
## Mode :character Median :2012-09-01 Median :2015-06-08
## Mean :2013-02-04 Mean :2015-06-15
## 3rd Qu.:2014-10-11 3rd Qu.:2016-08-05
## Max. :2018-09-07 Max. :2018-10-11
## NA's :138 NA's :265
## term_reason employment_status department manager_name
## Length:311 Length:311 Length:311 Length:311
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## manager_id recruitment_source performance_score engagement_survey
## Min. : 1.0 Length:311 Length:311 Min. :1.12
## 1st Qu.:10.0 Class :character Class :character 1st Qu.:3.69
## Median :16.0 Mode :character Mode :character Median :4.28
## Mean :15.2 Mean :4.11
## 3rd Qu.:19.5 3rd Qu.:4.70
## Max. :39.0 Max. :5.00
##
## emp_satisfaction special_projects_count last_performance_review_date
## 1: 2 0 :241 Min. :2011-04-05
## 2: 9 5 : 21 1st Qu.:2015-02-07
## 3:108 6 : 21 Median :2018-05-02
## 4: 94 7 : 12 Mean :2017-03-14
## 5: 98 4 : 9 3rd Qu.:2019-07-01
## 3 : 3 Max. :2019-12-02
## (Other): 4 NA's :184
## days_late_last30 absences an1 dob1
## Min. :0.0000 Min. : 1.00 Length:311 Min. :1951-02-01
## 1st Qu.:0.0000 1st Qu.: 5.00 Class :character 1st Qu.:1973-12-13
## Median :0.0000 Median :10.00 Mode :character Median :1980-06-21
## Mean :0.4148 Mean :10.24 Mean :1979-02-14
## 3rd Qu.:0.0000 3rd Qu.:15.00 3rd Qu.:1986-02-20
## Max. :6.0000 Max. :20.00 Max. :1992-07-05
## NA's :169
a list containing the positions or jobs from the company
jobs <- hr1 %>%
dplyr::select(dept_id, department, position_id, position, manager_name, employee_name)%>%
distinct(dept_id, department, position_id, position, manager_name)%>%
arrange(department, position_id, manager_name)
jobs_denumire_it_mamanger <- hr1 %>%
dplyr::select(dept_id, department, position_id, position, manager_name, employee_name)%>%
distinct(position_id, position, dept_id, department, manager_name)%>%
arrange(position_id, department, manager_name)%>%
filter(position_id == 13)%>%
pull(position)
print(paste('The name of the department are: ', jobs_denumire_it_mamanger, collapse = ', '))
## [1] "The name of the department are: IT Manager - Support, The name of the department are: IT Manager - Infra, The name of the department are: IT Manager - DB"
changing the jab name int IT Manager
hr1 <- hr1 %>%
mutate(position = if_else(position %in% jobs_denumire_it_mamanger, 'IT Manager', position),
position = if_else(position %in% c('IT Director'), 'IT Manager', position))
we wandering if Janet King fills 5 different job positions or there are 5 distinct persons with the same name
jobs_Janet_King <- hr1 %>%
dplyr::select(dept_id, department, position_id, position, manager_name, employee_name)%>%
distinct(position_id, position, dept_id, department, manager_name)%>%
arrange(position_id, department, manager_name)%>%
filter(manager_name == 'Janet King')
jobs_Janet_King
| position_id | position | dept_id | department | manager_name |
|---|---|---|---|---|
| 6 | CIO | 3 | IT/IS | Janet King |
| 10 | Director of Operations | 5 | Production | Janet King |
| 11 | Director of Sales | 6 | Sales | Janet King |
| 17 | Production Manager | 5 | Production | Janet King |
| 18 | Production Manager | 5 | Production | Janet King |
| 23 | Shared Services Manager | 7 | Admin Offices | Janet King |
| 23 | Software Engineer | 1 | Software Engineering | Janet King |
so Janet King appears as Production manager of the department with the same name
jobs_Janet_King_prod <- hr1 %>%
dplyr::select(dept_id, department, position_id, position, manager_name, employee_name)%>%
distinct(position_id, position, dept_id, department, manager_name)%>%
arrange(position_id, department, manager_name)%>%
filter(manager_name == 'Janet King' & department == 'Production')
jobs_Janet_King_prod
| position_id | position | dept_id | department | manager_name |
|---|---|---|---|---|
| 10 | Director of Operations | 5 | Production | Janet King |
| 17 | Production Manager | 5 | Production | Janet King |
| 18 | Production Manager | 5 | Production | Janet King |
changing the name jab into Production Manager
hr1 <- hr1 %>%
mutate(position = if_else(position == 'Director of Operations', 'Production Manager', position))
jobs_id_10_17_18 <- hr1 %>%
dplyr::select(dept_id, department, position_id, position, manager_name, employee_name)%>%
distinct(position_id, position, dept_id, department, manager_name)%>%
arrange(position_id, department, manager_name)%>%
filter(position_id %in% c(10, 17, 18))
jobs_id_10_17_18
| position_id | position | dept_id | department | manager_name |
|---|---|---|---|---|
| 10 | Production Manager | 5 | Production | Janet King |
| 17 | Production Manager | 5 | Production | Janet King |
| 18 | Production Manager | 5 | Production | Janet King |
so we have 31 distinct values for the name of the managers
distinct_nume_dep<-hr_manageri_departamente %>%
dplyr::select(manager_name, manager_id, department, dept_id) %>%
distinct(manager_name, manager_id, department, dept_id)
var_term_reason <- hr1 %>%
distinct(term_reason)
changing the N/A-StillEmployed into StillEmployed
hr1 <- hr1 %>%
mutate(term_reason = if_else(term_reason == 'N/A-StillEmployed', 'StillEmployed', term_reason))
Basic information about data types and missing values
…as text…
temp <- DataExplorer::introduce(hr1)
temp %>%
gather(Metric, Value)
| Metric | Value |
|---|---|
| rows | 311 |
| columns | 40 |
| discrete_columns | 34 |
| continuous_columns | 6 |
| all_missing_columns | 0 |
| total_missing_values | 756 |
| complete_rows | 7 |
| total_observations | 1.24e+04 |
| memory_usage | 1.52e+05 |
… and as plot
plot_intro(temp)
Missing values
… as plot
Display the data types as text…
| type | cnt | pcnt | col_name |
|---|---|---|---|
| character | 20 | 50 | c(`1` = "employee_name", `7` = "dept_id", `13` = "position", `14` = "state", `15` = "zip", `16` = "zi", `17` = "luna", `18` = "an", `19` = "sex", `20` = "marital_desc", `21` = "citizen_desc", `22` = "hispanic_latino", `23` = "race_desc", `26` = "term_reason", `27` = "employment_status", `28` = "department", `29` = "manager_name", `31` = "recruitment_source", `32` = "performance_score", `39` = "an1") |
| factor | 10 | 25 | c(`3` = "married_id", `4` = "marital_status_id", `5` = "gender_id", `6` = "emp_status_id", `8` = "perf_score_id", `9` = "from_diversity_job_fair_id", `11` = "termd", `12` = "position_id", `34` = "emp_satisfaction", `35` = "special_projects_count") |
| numeric | 6 | 15 | c(`2` = "emp_id", `10` = "salary", `30` = "manager_id", `33` = "engagement_survey", `37` = "days_late_last30", `38` = "absences") |
| Date | 4 | 10 | c(`24` = "dateof_hire", `25` = "dateof_termination", `36` = "last_performance_review_date", `40` = "dob1") |
… and as chart
Display the size of each column
Missing values: we have 4 data type variables that contains over 40%
missing values
selecting the new dataset
selecting the character type variables
numer_var_char <- hr2 %>%
dplyr::select_if(is.character) %>%
names()
numer_var_char
## [1] "employee_name" "dept_id" "position"
## [4] "state" "sex" "marital_desc"
## [7] "citizen_desc" "hispanic_latino" "race_desc"
## [10] "term_reason" "employment_status" "department"
## [13] "manager_name" "recruitment_source" "performance_score"
hr2 <- hr2 %>%
mutate_at(vars(dept_id), funs(factor))%>%
mutate_at(vars("position", "state" , "sex", "marital_desc","citizen_desc",
"hispanic_latino", "race_desc", "term_reason", "employment_status", "department",
"recruitment_source", "performance_score"), funs(factor))
we have 28 distinct job position
jobs<- hr2 %>%
distinct(position_id, position)%>%
arrange(position_id)
jobs2 <- hr2 %>%
dplyr::select(position_id, position)%>%
mutate(id = row_number())%>%
group_by(position_id, position)%>%
summarise(n = n())
we changing the job position with the same name
hr2 <- hr2 %>%
mutate(position_id = if_else(position == 'Production Manager', '18', position_id),
position_id = if_else(position == 'Software Engineer', '24', position_id),
position_id = if_else(position == 'IT Manager', '12', position_id))
salary is a numerica variable and will be the dependent variable
the rest of the variable are the predectors
after a few changes…
## 'data.frame': 311 obs. of 31 variables:
## $ employee_name : chr "Adinolfi, Wilson K" "Ait Sidi, Karthikeyan" "Akinkuolie, Sarah" "Alagbe,Trina" ...
## $ emp_id : num 10026 10084 10196 10088 10069 ...
## $ married_id : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 1 1 ...
## $ status : chr "no_married" "married" "married" "married" ...
## $ marital_status_id : Factor w/ 5 levels "0","1","2","3",..: 1 2 2 2 3 1 1 5 1 3 ...
## $ gender_id : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 2 1 2 ...
## $ emp_status_id : Factor w/ 5 levels "1","2","3","4",..: 1 5 5 1 5 1 1 1 3 1 ...
## $ dept_id : Factor w/ 7 levels "1","2","3","4",..: 5 3 5 5 5 5 4 5 5 3 ...
## $ perf_score_id : Factor w/ 4 levels "1","2","3","4": 4 3 3 3 3 4 3 3 3 3 ...
## $ from_diversity_job_fair_id: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
## $ salary : num 62506 104437 64955 64991 50825 ...
## $ termd : Factor w/ 2 levels "0","1": 1 2 2 1 2 1 1 1 1 1 ...
## $ position_id : Factor w/ 27 levels "1","11","12",..: 8 17 10 8 8 8 14 8 8 4 ...
## $ position : Factor w/ 27 levels "accountant_i",..: 18 26 19 18 18 18 23 18 18 13 ...
## $ state : Factor w/ 28 levels "AL","AZ","CA",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 1 1 1 1 2 1 2 ...
## $ marital_desc : Factor w/ 5 levels "divorced","married",..: 4 2 2 2 1 4 4 5 4 1 ...
## $ citizen_desc : Factor w/ 3 levels "eligible_noncitizen",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ hispanic_latino : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ race_desc : Factor w/ 6 levels "american_indian_or_alaska_native",..: 6 6 6 6 6 6 6 6 3 6 ...
## $ term_reason : Factor w/ 18 levels "another_position",..: 17 3 6 17 16 17 17 17 17 17 ...
## $ employment_status : Factor w/ 3 levels "active","terminated_for_cause",..: 1 3 3 1 3 1 1 1 1 1 ...
## $ department : Factor w/ 6 levels "admin_offices",..: 4 3 4 4 4 4 6 4 4 3 ...
## $ manager_name : chr "Michael Albert" "Simon Roup" "Kissy Sullivan" "Elijiah Gray" ...
## $ manager_id : num 22 4 20 16 39 11 10 19 12 7 ...
## $ recruitment_source : Factor w/ 9 levels "careerbuilder",..: 6 5 6 5 4 6 6 3 2 5 ...
## $ performance_score : Factor w/ 4 levels "exceeds","fully_meets",..: 1 2 2 2 2 1 2 2 2 2 ...
## $ emp_satisfaction : Factor w/ 5 levels "1","2","3","4",..: 5 3 3 5 4 5 3 4 3 5 ...
## $ special_projects_count : Factor w/ 9 levels "0","1","2","3",..: 1 7 1 1 1 1 5 1 1 7 ...
## $ days_late_last30 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ absences : num 1 17 3 15 2 15 19 19 4 16 ...
… and as chart
Descriptive statistic of the salary variable
temp <- hr2 %>%
summarise(
n_of_individuals = n(), # nr. indivizi din firma
min_Salary = min(hr2$salary), # valoare minim
first_quartile = quantile(hr2$salary,.25), # prima quartila
mean_Salary = mean(hr2$salary), # media varstei
median_Salary = median(hr2$salary), # mediana salariului
third_quartile = quantile(hr2$salary, .75), # a treia quartila
max_Salary = max(hr2$salary), # maxim
trim_mean_Salary =mean(x = hr2$salary, trim = .2),# 20% media trunchiata
range_Salary = paste(range(hr2$salary), collapse = ' - '), # range
iqr_Salary= IQR(hr2$salary), # interval interquartilic IQ
var_Salary = var(hr2$salary), # varianta
sd_Salary = sd(hr2$salary), # abaterea standard
skewness = PerformanceAnalytics::skewness(hr2$salary), #asimetria
kurtosis = PerformanceAnalytics::kurtosis(hr2$salary), #boltirea
modul = DescTools::Mode(hr2$salary)) %>% #modul
gather(variabile, values)%>%
group_by(variabile)%>%
slice_head(n = 1)
Most employees in the company earn a salary of 61,242 USD
The IQR is 16,534.5 USD – meaning that 50% of employees earn between 55,501.5 USD and 72,036 USD
Display the first and last 6 rows of the dataset
print(hr2 %>% head(6)) # Displaying the first 6 observations of the dataset
## employee_name emp_id married_id status marital_status_id
## 1 Adinolfi, Wilson K 10026 0 no_married 0
## 2 Ait Sidi, Karthikeyan 10084 1 married 1
## 3 Akinkuolie, Sarah 10196 1 married 1
## 4 Alagbe,Trina 10088 1 married 1
## 5 Anderson, Carol 10069 0 no_married 2
## 6 Anderson, Linda 10002 0 no_married 0
## gender_id emp_status_id dept_id perf_score_id from_diversity_job_fair_id
## 1 1 1 5 4 0
## 2 1 5 3 3 0
## 3 0 5 5 3 0
## 4 0 1 5 3 0
## 5 0 5 5 3 0
## 6 0 1 5 4 0
## salary termd position_id position state sex marital_desc
## 1 62506 0 19 production_technician_i MA male single
## 2 104437 1 27 sr._dba MA male married
## 3 64955 1 20 production_technician_ii MA female married
## 4 64991 0 19 production_technician_i MA female married
## 5 50825 1 19 production_technician_i MA female divorced
## 6 57568 0 19 production_technician_i MA female single
## citizen_desc hispanic_latino race_desc term_reason
## 1 us_citizen no white stillemployed
## 2 us_citizen no white career_change
## 3 us_citizen no white hours
## 4 us_citizen no white stillemployed
## 5 us_citizen no white return_to_school
## 6 us_citizen no white stillemployed
## employment_status department manager_name manager_id
## 1 active production Michael Albert 22
## 2 voluntarily_terminated it/is Simon Roup 4
## 3 voluntarily_terminated production Kissy Sullivan 20
## 4 active production Elijiah Gray 16
## 5 voluntarily_terminated production Webster Butler 39
## 6 active production Amy Dunn 11
## recruitment_source performance_score emp_satisfaction special_projects_count
## 1 linkedin exceeds 5 0
## 2 indeed fully_meets 3 6
## 3 linkedin fully_meets 3 0
## 4 indeed fully_meets 5 0
## 5 google_search fully_meets 4 0
## 6 linkedin exceeds 5 0
## days_late_last30 absences
## 1 0 1
## 2 0 17
## 3 0 3
## 4 0 15
## 5 0 2
## 6 0 15
print(hr2 %>% tail(6)) # Displaying the last 6 observations of the dataset
## employee_name emp_id married_id status marital_status_id gender_id
## 306 Wolk, Hang T 10174 0 no_married 0 0
## 307 Woodson, Jason 10135 0 no_married 0 1
## 308 Ybarra, Catherine 10301 0 no_married 0 0
## 309 Zamora, Jennifer 10010 0 no_married 0 0
## 310 Zhou, Julia 10043 0 no_married 0 0
## 311 Zima, Colleen 10271 0 no_married 4 0
## emp_status_id dept_id perf_score_id from_diversity_job_fair_id salary termd
## 306 1 5 3 0 60446 0
## 307 1 5 3 0 65893 0
## 308 5 5 1 0 48513 1
## 309 1 3 4 0 220450 0
## 310 1 3 3 0 89292 0
## 311 1 5 3 0 45046 0
## position_id position state sex marital_desc citizen_desc
## 306 20 production_technician_ii MA female single us_citizen
## 307 20 production_technician_ii MA male single us_citizen
## 308 19 production_technician_i MA female single us_citizen
## 309 6 cio MA female single us_citizen
## 310 9 data_analyst MA female single us_citizen
## 311 19 production_technician_i MA female widowed us_citizen
## hispanic_latino race_desc term_reason employment_status
## 306 no white stillemployed active
## 307 no white stillemployed active
## 308 no asian another_position voluntarily_terminated
## 309 no white stillemployed active
## 310 no white stillemployed active
## 311 no asian stillemployed active
## department manager_name manager_id recruitment_source performance_score
## 306 production David Stanley 14 linkedin fully_meets
## 307 production Kissy Sullivan 20 linkedin fully_meets
## 308 production Brannon Miller 12 google_search pip
## 309 it/is Janet King 2 employee_referral exceeds
## 310 it/is Simon Roup 4 employee_referral fully_meets
## 311 production David Stanley 14 linkedin fully_meets
## emp_satisfaction special_projects_count days_late_last30 absences
## 306 4 0 0 14
## 307 4 0 0 13
## 308 2 0 5 4
## 309 5 6 0 16
## 310 3 5 0 11
## 311 5 0 0 2
The minimum salary offered by the company is 45,046 USD
The maximum salary offered by the company is 250,000 USD
The average salary earned by an employee in the company is 69,021 USD
25% of employees earn up to 55,502 USD, while 75% earn more than 55,502 USD
75% of employees earn up to 72,036 USD, while 25% earn more than 72,036 USD
Half of the employees earn up to 62,810 USD, while the remaining 50% earn more than 62,810 USD
| Name | Piped data |
| Number of rows | 311 |
| Number of columns | 31 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| factor | 23 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| employee_name | 0 | 1 | 8 | 23 | 0 | 311 | 0 |
| status | 0 | 1 | 7 | 10 | 0 | 2 | 0 |
| manager_name | 0 | 1 | 8 | 18 | 0 | 21 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| married_id | 0 | 1 | FALSE | 2 | 0: 187, 1: 124 |
| marital_status_id | 0 | 1 | FALSE | 5 | 0: 137, 1: 124, 2: 30, 3: 12 |
| gender_id | 0 | 1 | FALSE | 2 | 0: 176, 1: 135 |
| emp_status_id | 0 | 1 | FALSE | 5 | 1: 184, 5: 88, 3: 14, 4: 14 |
| dept_id | 0 | 1 | FALSE | 7 | 5: 209, 3: 50, 6: 31, 4: 10 |
| perf_score_id | 0 | 1 | FALSE | 4 | 3: 243, 4: 37, 2: 18, 1: 13 |
| from_diversity_job_fair_id | 0 | 1 | FALSE | 2 | 0: 282, 1: 29 |
| termd | 0 | 1 | FALSE | 2 | 0: 207, 1: 104 |
| position_id | 0 | 1 | FALSE | 27 | 19: 137, 20: 57, 3: 27, 18: 15 |
| position | 0 | 1 | FALSE | 27 | pro: 137, pro: 57, are: 27, pro: 15 |
| state | 0 | 1 | FALSE | 28 | MA: 276, CT: 6, TX: 3, VT: 2 |
| sex | 0 | 1 | FALSE | 2 | fem: 176, mal: 135 |
| marital_desc | 0 | 1 | FALSE | 5 | sin: 137, mar: 124, div: 30, sep: 12 |
| citizen_desc | 0 | 1 | FALSE | 3 | us_: 295, eli: 12, non: 4 |
| hispanic_latino | 0 | 1 | FALSE | 2 | no: 283, yes: 28 |
| race_desc | 0 | 1 | FALSE | 6 | whi: 187, bla: 80, asi: 29, two: 11 |
| term_reason | 0 | 1 | FALSE | 18 | sti: 207, ano: 20, unh: 14, mor: 11 |
| employment_status | 0 | 1 | FALSE | 3 | act: 207, vol: 88, ter: 16 |
| department | 0 | 1 | FALSE | 6 | pro: 209, it/: 50, sal: 31, sof: 11 |
| recruitment_source | 0 | 1 | FALSE | 9 | ind: 87, lin: 76, goo: 49, emp: 31 |
| performance_score | 0 | 1 | FALSE | 4 | ful: 243, exc: 37, nee: 18, pip: 13 |
| emp_satisfaction | 0 | 1 | FALSE | 5 | 3: 108, 5: 98, 4: 94, 2: 9 |
| special_projects_count | 0 | 1 | FALSE | 9 | 0: 241, 5: 21, 6: 21, 7: 12 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| emp_id | 0 | 1 | 10156.00 | 89.92 | 10001 | 10078.5 | 10156 | 10233.5 | 10311 | ▇▇▇▇▇ |
| salary | 0 | 1 | 69020.68 | 25156.64 | 45046 | 55501.5 | 62810 | 72036.0 | 250000 | ▇▁▁▁▁ |
| manager_id | 0 | 1 | 15.20 | 8.86 | 1 | 10.0 | 16 | 19.5 | 39 | ▅▇▇▁▂ |
| days_late_last30 | 0 | 1 | 0.41 | 1.29 | 0 | 0.0 | 0 | 0.0 | 6 | ▇▁▁▁▁ |
| absences | 0 | 1 | 10.24 | 5.85 | 1 | 5.0 | 10 | 15.0 | 20 | ▇▆▅▇▆ |
On average, an employee’s salary deviates from the mean by ±25,157 USD
The salary range within the company is 204,954 USD
The total payroll at company level amounts to 21,465,433 USD
The coefficient of variation indicates that the mean is broadly representative,
although the dispersion is relatively high, suggesting low homogeneity of the population
The standard error of the mean salary is 1,426.50 USD
With a 95% probability, the true mean salary lies within the interval:
mean ± 2,806.85 USD, i.e., [66,213 USD, 71,827 USD]
psych::describe(hr2)
The skewness coefficient indicates that the salary distribution is positively skewed (right-skewed)
The kurtosis coefficient indicates that the distribution is leptokurtic
The median absolute deviation (MAD) is 11,834.11 USD
## hr2
##
## 31 Variables 311 Observations
## --------------------------------------------------------------------------------
## employee_name
## n missing distinct
## 311 0 311
##
## lowest : Adinolfi, Wilson K Ait Sidi, Karthikeyan Akinkuolie, Sarah Alagbe,Trina Anderson, Carol
## highest: Woodson, Jason Ybarra, Catherine Zamora, Jennifer Zhou, Julia Zima, Colleen
## --------------------------------------------------------------------------------
## emp_id
## n missing distinct Info Mean Gmd .05 .10
## 311 0 311 1 10156 104 10016 10032
## .25 .50 .75 .90 .95
## 10078 10156 10234 10280 10296
##
## lowest : 10001 10002 10003 10004 10005, highest: 10307 10308 10309 10310 10311
## --------------------------------------------------------------------------------
## married_id
## n missing distinct
## 311 0 2
##
## Value 0 1
## Frequency 187 124
## Proportion 0.601 0.399
## --------------------------------------------------------------------------------
## status
## n missing distinct
## 311 0 2
##
## Value married no_married
## Frequency 124 187
## Proportion 0.399 0.601
## --------------------------------------------------------------------------------
## marital_status_id
## n missing distinct
## 311 0 5
##
## Value 0 1 2 3 4
## Frequency 137 124 30 12 8
## Proportion 0.441 0.399 0.096 0.039 0.026
## --------------------------------------------------------------------------------
## gender_id
## n missing distinct
## 311 0 2
##
## Value 0 1
## Frequency 176 135
## Proportion 0.566 0.434
## --------------------------------------------------------------------------------
## emp_status_id
## n missing distinct
## 311 0 5
##
## Value 1 2 3 4 5
## Frequency 184 11 14 14 88
## Proportion 0.592 0.035 0.045 0.045 0.283
## --------------------------------------------------------------------------------
## dept_id
## n missing distinct
## 311 0 7
##
## Value 1 2 3 4 5 6 7
## Frequency 1 1 50 10 209 31 9
## Proportion 0.003 0.003 0.161 0.032 0.672 0.100 0.029
## --------------------------------------------------------------------------------
## perf_score_id
## n missing distinct
## 311 0 4
##
## Value 1 2 3 4
## Frequency 13 18 243 37
## Proportion 0.042 0.058 0.781 0.119
## --------------------------------------------------------------------------------
## from_diversity_job_fair_id
## n missing distinct
## 311 0 2
##
## Value 0 1
## Frequency 282 29
## Proportion 0.907 0.093
## --------------------------------------------------------------------------------
## salary
## n missing distinct Info Mean Gmd .05 .10
## 311 0 308 1 69021 21997 47000 50178
## .25 .50 .75 .90 .95
## 55502 62810 72036 95660 108107
##
## lowest : 45046 45069 45115 45395 45433, highest: 170500 178000 180000 220450 250000
## --------------------------------------------------------------------------------
## termd
## n missing distinct
## 311 0 2
##
## Value 0 1
## Frequency 207 104
## Proportion 0.666 0.334
## --------------------------------------------------------------------------------
## position_id
## n missing distinct
## 311 0 27
##
## lowest : 1 11 12 14 15, highest: 5 6 7 8 9
## --------------------------------------------------------------------------------
## position
## n missing distinct
## 311 0 27
##
## lowest : accountant_i administrative_assistant area_sales_manager bi_developer bi_director
## highest: software_engineer software_engineering_manager sr._accountant sr._dba sr._network_engineer
## --------------------------------------------------------------------------------
## state
## n missing distinct
## 311 0 28
##
## lowest : AL AZ CA CO CT, highest: TX UT VA VT WA
## --------------------------------------------------------------------------------
## sex
## n missing distinct
## 311 0 2
##
## Value female male
## Frequency 176 135
## Proportion 0.566 0.434
## --------------------------------------------------------------------------------
## marital_desc
## n missing distinct
## 311 0 5
##
## Value divorced married separated single widowed
## Frequency 30 124 12 137 8
## Proportion 0.096 0.399 0.039 0.441 0.026
## --------------------------------------------------------------------------------
## citizen_desc
## n missing distinct
## 311 0 3
##
## Value eligible_noncitizen non-citizen us_citizen
## Frequency 12 4 295
## Proportion 0.039 0.013 0.949
## --------------------------------------------------------------------------------
## hispanic_latino
## n missing distinct
## 311 0 2
##
## Value no yes
## Frequency 283 28
## Proportion 0.91 0.09
## --------------------------------------------------------------------------------
## race_desc
## n missing distinct
## 311 0 6
##
## american_indian_or_alaska_native (3, 0.010), asian (29, 0.093),
## black_or_african_american (80, 0.257), hispanic (1, 0.003), two_or_more_races
## (11, 0.035), white (187, 0.601)
## --------------------------------------------------------------------------------
## term_reason
## n missing distinct
## 311 0 18
##
## lowest : another_position attendance career_change fatal_attraction gross_misconduct
## highest: relocation_out_of_area retiring return_to_school stillemployed unhappy
## --------------------------------------------------------------------------------
## employment_status
## n missing distinct
## 311 0 3
##
## Value active terminated_for_cause voluntarily_terminated
## Frequency 207 16 88
## Proportion 0.666 0.051 0.283
## --------------------------------------------------------------------------------
## department
## n missing distinct
## 311 0 6
##
## Value admin_offices executive_office it/is
## Frequency 9 1 50
## Proportion 0.029 0.003 0.161
##
## Value production sales software_engineering
## Frequency 209 31 11
## Proportion 0.672 0.100 0.035
## --------------------------------------------------------------------------------
## manager_name
## n missing distinct
## 311 0 21
##
## lowest : Alex Sweetwater Amy Dunn Board of Directors Brandon R. LeBlanc Brannon Miller
## highest: Lynn Daneault Michael Albert Peter Monroe Simon Roup Webster Butler
## --------------------------------------------------------------------------------
## manager_id
## n missing distinct Info Mean Gmd .05 .10
## 311 0 23 0.996 15.2 9.47 2.0 4.0
## .25 .50 .75 .90 .95
## 10.0 16.0 19.5 22.0 39.0
##
## lowest : 1 2 3 4 5, highest: 20 21 22 30 39
## --------------------------------------------------------------------------------
## recruitment_source
## n missing distinct
## 311 0 9
##
## lowest : careerbuilder diversity_job_fair employee_referral google_search indeed
## highest: indeed linkedin on-line_web_application other website
## --------------------------------------------------------------------------------
## performance_score
## n missing distinct
## 311 0 4
##
## Value exceeds fully_meets needs_improvement
## Frequency 37 243 18
## Proportion 0.119 0.781 0.058
##
## Value pip
## Frequency 13
## Proportion 0.042
## --------------------------------------------------------------------------------
## emp_satisfaction
## n missing distinct
## 311 0 5
##
## Value 1 2 3 4 5
## Frequency 2 9 108 94 98
## Proportion 0.006 0.029 0.347 0.302 0.315
## --------------------------------------------------------------------------------
## special_projects_count
## n missing distinct
## 311 0 9
##
## Value 0 1 2 3 4 5 6 7 8
## Frequency 241 1 1 3 9 21 21 12 2
## Proportion 0.775 0.003 0.003 0.010 0.029 0.068 0.068 0.039 0.006
## --------------------------------------------------------------------------------
## days_late_last30
## n missing distinct Info Mean Gmd
## 311 0 7 0.286 0.4148 0.7624
##
## Value 0 1 2 3 4 5 6
## Frequency 278 1 6 6 8 6 6
## Proportion 0.894 0.003 0.019 0.019 0.026 0.019 0.019
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## absences
## n missing distinct Info Mean Gmd .05 .10
## 311 0 20 0.997 10.24 6.749 2 2
## .25 .50 .75 .90 .95
## 5 10 15 18 19
##
## Value 1 2 3 4 5 6 7 8 9 10 11
## Frequency 14 21 16 23 12 16 21 11 14 10 15
## Proportion 0.045 0.068 0.051 0.074 0.039 0.051 0.068 0.035 0.045 0.032 0.048
##
## Value 12 13 14 15 16 17 18 19 20
## Frequency 8 17 17 20 23 15 8 16 14
## Proportion 0.026 0.055 0.055 0.064 0.074 0.048 0.026 0.051 0.045
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## hr2 %>% select_if(is.numeric)
##
## 5 Variables 311 Observations
## --------------------------------------------------------------------------------
## emp_id
## n missing distinct Info Mean Gmd .05 .10
## 311 0 311 1 10156 104 10016 10032
## .25 .50 .75 .90 .95
## 10078 10156 10234 10280 10296
##
## lowest : 10001 10002 10003 10004 10005, highest: 10307 10308 10309 10310 10311
## --------------------------------------------------------------------------------
## salary
## n missing distinct Info Mean Gmd .05 .10
## 311 0 308 1 69021 21997 47000 50178
## .25 .50 .75 .90 .95
## 55502 62810 72036 95660 108107
##
## lowest : 45046 45069 45115 45395 45433, highest: 170500 178000 180000 220450 250000
## --------------------------------------------------------------------------------
## manager_id
## n missing distinct Info Mean Gmd .05 .10
## 311 0 23 0.996 15.2 9.47 2.0 4.0
## .25 .50 .75 .90 .95
## 10.0 16.0 19.5 22.0 39.0
##
## lowest : 1 2 3 4 5, highest: 20 21 22 30 39
## --------------------------------------------------------------------------------
## days_late_last30
## n missing distinct Info Mean Gmd
## 311 0 7 0.286 0.4148 0.7624
##
## Value 0 1 2 3 4 5 6
## Frequency 278 1 6 6 8 6 6
## Proportion 0.894 0.003 0.019 0.019 0.026 0.019 0.019
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## absences
## n missing distinct Info Mean Gmd .05 .10
## 311 0 20 0.997 10.24 6.749 2 2
## .25 .50 .75 .90 .95
## 5 10 15 18 19
##
## Value 1 2 3 4 5 6 7 8 9 10 11
## Frequency 14 21 16 23 12 16 21 11 14 10 15
## Proportion 0.045 0.068 0.051 0.074 0.039 0.051 0.068 0.035 0.045 0.032 0.048
##
## Value 12 13 14 15 16 17 18 19 20
## Frequency 8 17 17 20 23 15 8 16 14
## Proportion 0.026 0.055 0.055 0.064 0.074 0.048 0.026 0.051 0.045
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
Skewness coefficient = 3.290213 indicates a right-skewed salary distribution (>0)
Kurtosis coefficient = 15.18561 indicates a leptokurtic distribution (>0)
The trimmed mean salary (excluding the lowest and highest 10% of values) is 64,523.67 USD
Minimum salary case
The minimum salary (45,046 USD) corresponds to employee Zima, Colleen, who holds the position of Production Technician I in the Production department, managed by David Stanley
Maximum salary case
The maximum salary (250,000 USD) corresponds to employee King, Janet, who is President & CEO on the Board of Directors
num_variables <- hr2 %>%
select_if(is.numeric) %>%
mutate(id = row_number()) %>%
pivot_longer(-id, names_to = "variable", values_to = "value")
###### Histograms of numerical variables
num_variables %>%
ggplot(aes(x = value, fill = variable)) +
geom_histogram() +
facet_wrap(~variable, scale = "free") +
guides(fill = FALSE) +
theme(axis.text.x = element_text(size = 9)) +
theme(strip.text.x = element_text(size = 12)) +
xlab("") + ylab("Frequency")
hist(hr2$salary,
breaks = 'Freedman-Diaconis',
freq = FALSE,
col = 'thistle1',
main = 'Salary Distribution Histogram',
ylab = 'Frequency',
xlab = 'Salary Value (USD)')
abline(v = mean(hr2$salary), col = 'black', lwd = 3)
curve(dnorm(x, mean = mean(hr2$salary), sd = sd(hr2$salary)),
col = 'thistle4',
lwd = 3,
add = TRUE)
lines(density(hr2$salary), col = 'blue', lwd = 3)
###### Salary Distribution by Gender ######
hr2_m <- hr2 %>% filter(sex == 'male')
hr2_f <- hr2 %>% filter(sex == 'female')
Analysis of some selected categorical variables
Absolute frequency table for the variable Sex
table(hr2$sex)
##
## female male
## 176 135
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 311
##
##
## | female | male |
## |-----------|-----------|
## | 176 | 135 |
## | 0.566 | 0.434 |
## |-----------|-----------|
##
##
##
##
Most employees in the company are female
Absolute frequency table for the variable MarriedID / Status
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 311
##
##
## | married | no_married |
## |------------|------------|
## | 124 | 187 |
## | 0.399 | 0.601 |
## |------------|------------|
##
##
##
##
Approximately 40% of employees are married, while 60% are unmarried (i.e., 187 employees out of a total of 311 are unmarried)
Absolute frequency table for the variable Marital Status
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 311
##
##
## | divorced | married | separated | single | widowed |
## |-----------|-----------|-----------|-----------|-----------|
## | 30 | 124 | 12 | 137 | 8 |
## | 0.096 | 0.399 | 0.039 | 0.441 | 0.026 |
## |-----------|-----------|-----------|-----------|-----------|
##
##
##
##
Absolute frequency table for the variable Department
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 311
##
##
## | admin_offices | executive_office | it/is | production | sales |
## |----------------------|----------------------|----------------------|----------------------|----------------------|
## | 9 | 1 | 50 | 209 | 31 |
## | 0.029 | 0.003 | 0.161 | 0.672 | 0.100 |
## |----------------------|----------------------|----------------------|----------------------|----------------------|
##
##
## | software_engineering |
## |----------------------|
## | 11 |
## | 0.035 |
## |----------------------|
##
##
##
##
9.6% of employees are divorced, 40% are married, 4% are separated, 44% are single, and 2.6% are widowed.
Statistical indicators for salary and absences by gender
hr2_f <- hr2 %>% filter(sex == 'female')
hr2_m <- hr2 %>% filter(sex == 'male')
The minimum salary for female employees is 45,046 USD, corresponding to employee Zima, Colleen, who holds the position of Production Technician I in the Production department
hr2_f[hr2_f$salary == min(hr2_f$salary),]%>%
DT::datatable(
.,
options = list(
scrollX = TRUE,
pageLength = 5
)
)
The minimum number of absences is 1 day, corresponding to 8 employees working in the following departments: IT/IS, Production, and Sales, holding positions such as Production Technician I, Production Technician II, IT Support, and Area Sales Manager
hr2_f[hr2_f$absences == min(hr2_f$absences),]%>%
DT::datatable(
.,
options = list(
scrollX = TRUE,
pageLength = 5
)
)
The maximum salary for female employees is 250,000 USD, corresponding to employee King, Janet, who holds the position of President & CEO in the Executive Office (Board of Directors)
hr2_f[hr2_f$salary == max(hr2_f$salary),]%>%
DT::datatable(
.,
options = list(
scrollX = TRUE,
pageLength = 5
)
)
The maximum number of absences is 20 days, corresponding to 6 employees working in the Production and Sales departments, holding positions such as Production Technician I and II
hr2_f[hr2_f$absences == max(hr2_f$absences),] %>%
DT::datatable(
.,
options = list(
scrollX = TRUE,
pageLength = 5
)
)
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex variable skewness kurtosis
## <fct> <chr> <dbl> <dbl>
## 1 female absences 0.0232 -1.40
## 2 female salary 4.04 21.5
## 3 male absences 0.0363 -1.15
## 4 male salary 2.16 5.51
| sex | mode_absences | mode_salary | n |
|---|---|---|---|
| female | 16 | 6.12e+04 | 176 |
| male | 13 | 135 |
boxplot(hr2$salary, salary_norm, main = “Comparison of the Salary Variable Boxplotthe Normal Distribution Boxplot”, at = c(1,2), las = 2, col = c(“orange”,“red”), border = “brown”, horizontal = FALSE, notch = TRUE, ylim = c(0, 250000) )
text(x = 1:length(hr2), y = par(“usr”)[3] - 0.45, at = c(1, 2), labels = c(“Salary”, “Normal”), xpd = NA, srt = 35, adj = 1.2, cex = 1.2)
Calculate percentage of outliers for each variable
Graphical representation of outliers
data.frame(variable.name, outlier.percentage_g1) %>%
mutate(variable.name = as.factor(variable.name),
outlier.percentage_g1 = round(outlier.percentage_g1, 2)) %>%
ggplot(aes(y = fct_reorder(variable.name, outlier.percentage_g1),
x = outlier.percentage_g1)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(plot.title = element_text(face = "bold")) +
labs(x = "Number of Outliers (%)",
y = "Numerical Variables",
title = "Proportion of Outliers for Each Variable\n(Determined Using the IQR Method)") +
geom_label(aes(label = paste0(outlier.percentage_g1, "%"))) +
scale_x_continuous(limits = c(0, 15))
p1 = plot_bar_sorted(hr2, sex, "Employee distribution by Gender")
p2 = plot_bar_sorted(hr2, state, "Employee distribution by state")
p1 = plot_bar_sorted(hr2, position, "Employee distribution by position")
p2 = plot_bar_sorted(hr2, marital_desc, "Employee distribution by marital status")
p1 = plot_bar_sorted(hr2, citizen_desc, "Employee distribution by citizenship status")
p2 = plot_bar_sorted(hr2, race_desc, "Employee distribution by race")
p1 = plot_bar_sorted(hr2, employment_status, "Employee distribution by employment status")
p2 = plot_bar_sorted(hr2, recruitment_source, "Employee distribution by recruitment source")
p1 = plot_bar_sorted(hr2, department, "Employee distribution by department")
p2 = plot_bar_sorted(hr2, special_projects_count,
"Employee Distribution by Special Projects Count")
The salary distribution within the company is positively skewed (right-skewed), as indicated by the high skewness coefficient (Skewness > 0).
This suggests that:
Most employees earn salaries clustered around the lower and middle range.
A small number of employees earn significantly higher salaries, pulling the mean upward.
The maximum salary (250,000 USD) substantially exceeds the median, indicating the presence of extreme values.
The kurtosis coefficient indicates a leptokurtic distribution.
The presence of some extreme values (outliers) was observed.
The standard deviation (≈ 25,157 USD) indicates a relatively high dispersion of salaries around the mean.
The coefficient of variation suggests that:
- The population is relatively heterogeneous.
- The mean salary is broadly representative, but variability is significant.
The dataset shows that:
- The majority of employees are female.
- Both male and female salary distributions are right-skewed.
Extreme salary values are present in both groups.
The maximum salary (President & CEO position) belongs to a female employee.
The analysis of absences shows:
- The presence of outliers in all three numerical variables (salary, absences, days late).
- A non-normal distribution pattern.
Some employees have significantly higher absence counts compared to the majority.
The categorical analysis reveals that:
- Most employees are female.
- Approximately 40% of employees are married, while the majority are unmarried.
- The workforce is distributed across multiple departments, with some departments significantly larger than others.
- The majority of employees hold operational or production-related positions.
Based on EDA analysis we so that:
- Salary exhibits non-normal behavior and strong right skewness.
- Extreme values (executive-level salaries) may influence regression results.
- Transformation techniques (e.g., log transformation of salary) may improve model performance.
- Categorical variables such as department, position, and employment status are likely strong predictors of salary.
- Gender alone may not fully explain salary variation without controlling for position and department.
The company exhibits substantial salary variability, structural heterogeneity across departments, and a right-skewed compensation structure influenced by a small number of high-paying executive roles.
The exploratory analysis confirms that salary is influenced by multiple structural and demographic factors, justifying the development of a predictive scoring model in the next stage of analysis.