Final Project for R

Synopsis

My experience interning at Willis Towers Watson, an HR consulting company, drew my attention to the topic of human resources. You might think that this is just a personal interest. In fact, human resources is a topic that touches everyone. People, whether aiming for a better life or trying to fulfill personal goals, cannot avoid taking part in the labor market. Human resources is a field that can gradually improve how efficiently people work, creating more value for society and pushing global development forward.

What do you think is the main reason employees quit? Is it salary? How much does salary affect people’s motivation and their career paths in general?

Read this article from The Balance before we take a deeper look at the topic.

Contact me via:

LinkedIn: Kathy Sun

Handshake: Kathy Sun

Instagram: Supermoooe

Twitter: BizAnalyticsKat

1. Introduction

1.1 Objective of the Project

In this final project I would like to discuss the core of the sector, rewards and benefits. To keep the scope manageable, I will mainly focus on salary management.

1.2 Details of the Datasets

I collected most of my data from Kaggle. Here are some details about the datasets:

Kaggle
- Human Resources Core Data Set
  (This is the core HR data set.)
  - Year: 2017
  - Variables: Employee Name, Employee Number, State, Zip, DOB, Age, Sex, MaritalDesc, CitizenDesc, Hispanic/Latino, RaceDesc, Date of Hire, Date of Termination, Reason For Term, Employment Status, Department, Position, Pay Rate, Manager Name, Employee Source, Performance Score.
- Human Resources Production Staff Data Set
  (This is the production staff data set, complete with information about productivity, performance score, etc. It would be interesting to see the relationship between productivity and performance score. You would think higher productivity would indicate better overall performance. Is this necessarily the case?)
  - Year: 2017
  - Variables: Employee Name, Race Desc, Date of Hire, TermDate, Reason for Term, Employment Status, Department, Position, Pay, Manager Name, Performance Score, Abutments/Hour Wk 1, Abutments/Hour Wk 2, Daily Error Rate, 90-day Complaints.
- Human Resources Analytics
  (Why are our best and most experienced employees leaving prematurely?)
  - Year: 2015
  - Variables: Satisfaction Level, Last evaluation, Number of projects, Average monthly hours, Time spent at the company, Whether they have had a work accident, Whether they have had a promotion in the last 5 years, Departments, Salary, Whether the employee has left.

1.3 Methodology of Data Analysis

I will use data visualization extensively to illustrate and compare the different data sets, looking for differences in employees’ wages and how they correlate with promotion, satisfaction and, most importantly, employees’ termination decisions. On the technical side, I will use packages including ggplot2, tidyverse, dplyr, and more to explore the data.
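
As a rough illustration of what "correlation between wage and termination" can mean in practice, here is a minimal sketch in base R. The data frame `toy` and its values are made up for illustration and are not taken from the Kaggle data sets. It computes the Pearson correlation between pay and a 0/1 termination indicator (sometimes called a point-biserial correlation) and, as a cross-check, the mean pay in each group.

```r
# Hypothetical toy data: hourly pay and whether the employee has left (1) or not (0)
toy <- data.frame(
  pay  = c(18, 22, 25, 30, 35, 40, 45, 55),
  left = c(1, 1, 0, 1, 0, 0, 0, 0)
)

# Pearson correlation with a 0/1 indicator (point-biserial correlation)
r <- cor(toy$pay, toy$left)

# Mean pay in each group, to compare against the sign of r
group_means <- tapply(toy$pay, toy$left, mean)
```

A negative `r` here would mean that leavers tend to have lower pay in the toy data. It says nothing about cause, which is why the confounders mentioned in this project matter.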

1.4 Projections of the project

I want to use the visualizations described above to compare all the data sets that I can find about employee information. I want to find the correlation between salary and termination decisions. More importantly, I want to look carefully for confounders among the variables in the datasets. Does salary affect employees’ satisfaction level, their chances of promotion, or even their production outcomes? After this research, I hope to settle on a hypothesis and return to the article we read at the beginning. I also want to use other data sets to underscore the extra cost of hiring new people to replace departing ones and to stress the importance of reasonable salary management. Only with that can firms operate at their highest efficiency and contribute to the development of humanity and of our home, the Earth.

2. Packages Required

Here is a list of the packages used, with a description of what each does:

tidyverse: for cleaning data and creating a tidy format
tidyr: included in tidyverse, also for tidying data
dplyr: transforming data
knitr: displaying tables
ggplot2: for data visualization
magrittr: providing the pipe operator to avoid repeated code
DT: for nice HTML table output

library(tidyverse)
library(tidyr)        
library(dplyr)
library(knitr)
library(ggplot2) 
library(magrittr)     
library(DT)
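
To show what the pipe operator buys us, here is a small sketch with made-up numbers. The vector `pay` is hypothetical and not taken from the data sets. Both versions compute the same value; the piped one reads left to right in the order the steps happen.

```r
library(magrittr)

# Hypothetical hourly pay rates, for illustration only
pay <- c(20, 35, 50)

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(pay)), 2)

# Piped call: reads left to right, one step per function
piped <- pay %>% sqrt() %>% mean() %>% round(2)
```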

3. Data Preparation

3.1 Data Source

The original dataset is called “Human Resources Data Set” and was collected by Dr. Rich, a principal data architect at New England Quality Care Alliance. The dataset can be found on Kaggle.
Click here to download the original dataset.

3.2 Variables of source data

1. HR Core Dataset: This dataset contains 21 variables on 300+ employees. The variables in the core HR dataset include:

library(DT)
v_core <- read_csv("C:/Users/WFU/Desktop/temporary/R/Final Project Prep/v_core.csv")
## cols(
##   Variable = col_character(),
##   Description = col_character()
## )

datatable(v_core)

2. Production Staff Dataset: This dataset only includes people working in production. It contains 15 variables on 200+ employees in the production department. The following is a list of the name and description of each variable used.

library(DT)
v_prod <- read_csv("C:/Users/WFU/Desktop/temporary/R/Final Project Prep/v_prod.csv")
datatable(v_prod)

3.3 Data Importing and Cleaning

These two data sets include some of the same employees but provide different variables, so I decided to clean them and merge them to give a more detailed and explicit overview of the employees’ behavior and status.

Here are the steps I took to prepare the data:

Step One: Importing the core and prod datasets and the salary grid, treating empty strings as missing values

core <- read.csv("core_dataset.csv", na.strings = c("", "NA"))
prod <- read.csv("production_staff.csv", na.strings = c("", "NA"))

# Keep the position and median-salary columns of the salary grid
Salary_Scale <- read.csv("salary_grid.csv") %>%
  select(X, X.3) %>%
  rename(Position = X, Median_Salary = X.3)
# Drop the first row of the grid
Salary_Scale <- Salary_Scale[-1, ]

Step Two: Sorting both data sets by employee name and keeping only the necessary variables

core_clean <- core %>%
  arrange (Employee.Name) 
core_clean <- core_clean[ c("Employee.Name", "Reason.For.Term", "Employment.Status", "Pay.Rate", "Performance.Score", "Position")]


prod_clean <- prod %>%
  arrange(Employee.Name)
prod_clean <- prod_clean[ c("Employee.Name", "Daily.Error.Rate", "X90.day.Complaints")]

Step Three: Merging the two data sets on employee name, renaming the redundant variables, and separating currently active employees from terminated employees into two subsets to compare

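# Join the production metrics onto the core records by employee name,
# then drop the name and shorten the column names.
# (left_join() has no na.strings argument; missing values were already
# handled when the files were read.)
join <- core_clean %>%
  left_join(prod_clean, by = "Employee.Name") %>%
  select(-Employee.Name) %>%
  rename(
    Status = Employment.Status,
    Salary = Pay.Rate,
    Score = Performance.Score,
    complaints = X90.day.Complaints
  )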

join_Active <- join %>%
  filter(Status == "Active") %>%
  mutate(category = "Active")

join_Terminated <- join %>%
  filter(Status == "Voluntarily Terminated" | Status == "Terminated for Cause") %>%
  mutate(category = "Terminated")

Clean <- rbind (join_Active, join_Terminated)
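
Merging on employee name has two side effects worth checking: employees who are not in the production data get NA for the production columns, and a repeated name would silently duplicate rows. The sketch below illustrates both on tiny made-up data frames. `core_toy` and `prod_toy` are hypothetical stand-ins for core_clean and prod_clean, not the real data.

```r
library(dplyr)

# Hypothetical stand-ins for core_clean and prod_clean
core_toy <- data.frame(
  Employee.Name = c("A, Ann", "B, Bob", "C, Cy"),
  Pay.Rate = c(20, 30, 55),
  stringsAsFactors = FALSE
)
prod_toy <- data.frame(
  Employee.Name = c("A, Ann", "B, Bob"),
  Daily.Error.Rate = c(1, 0),
  stringsAsFactors = FALSE
)

# A repeated name on the right side would duplicate rows, so check first
stopifnot(!anyDuplicated(prod_toy$Employee.Name))

# left_join keeps every core row; employees absent from prod get NA
joined <- left_join(core_toy, prod_toy, by = "Employee.Name")
```

Because most non-production employees will have NA in Daily.Error.Rate and complaints, any later analysis of those two columns effectively covers production staff only.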

3.4 Data Preview

datatable(Clean)

4. Exploratory Data Analysis

Distribution of the Reasons for Termination

Through observation, we can see that several reasons lead to employees’ terminations.
Those reasons include:
1. “Another position”: the person may have been promoted or transferred to a different sector.
2. “attendance”: there was an issue with whether the employee’s attendance met the company’s requirements.
3. “career change”: the employee changed the field he or she works in.
4. “gross misconduct”: some misconduct led to the employee’s termination.
5. “hours”: long working hours pushed the employee away.
6. “medical issue”
7. “military”
8. “more money”
9. “no-call no-show”
10. “performance”
11. “relocation out of area”
12. “retiring”
13. “returning to school”
14. “unhappy”

First, I want to take a look at how the reasons behind termination are distributed.

Clean%>%
  filter (category == "Terminated") %>%
  ggplot( aes(x= Reason.For.Term, fill = Reason.For.Term)) +
  geom_bar() +
  coord_polar()
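
Bar lengths can be hard to compare in polar coordinates, so a plain count table may be a useful complement to the chart. The sketch below uses a made-up vector `reasons` as a stand-in for the Reason.For.Term column; the values are hypothetical and do not reflect the real counts.

```r
# Hypothetical reasons, standing in for Clean$Reason.For.Term
reasons <- c("more money", "unhappy", "more money", "hours",
             "career change", "more money", "unhappy")

# Count each reason and sort from most to least common
reason_counts <- sort(table(reasons), decreasing = TRUE)

# Convert the counts to shares of all terminations
reason_share <- prop.table(reason_counts)
```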

Comparing Wages of Employed and Terminated Employees

From the polar bar chart we can see that salary is a major factor in employees’ termination decisions.

Thus, I decided to draw a boxplot of hourly pay for people who are still employed and those already terminated to capture the differences.

Clean %>%
  ggplot ( aes(x = category, y = Salary, color = category)) +
  geom_boxplot(fill = "white")
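
To put numbers next to the boxplot, one option is to compare the median pay of the two groups and run a rank-based test of the difference. This is a sketch on made-up values, not a result from the HR data. The data frame `toy` is hypothetical, and the Wilcoxon rank-sum test is my own choice of method here, not something the original analysis used.

```r
# Hypothetical toy data, standing in for Clean
toy <- data.frame(
  category = c("Active", "Active", "Active",
               "Terminated", "Terminated", "Terminated"),
  Salary   = c(22, 30, 55, 18, 24, 28)
)

# Median pay within each category
pay_median <- tapply(toy$Salary, toy$category, median)

# Rank-based test of whether the two pay distributions differ
pay_test <- wilcox.test(Salary ~ category, data = toy)
```

With only a handful of rows the p-value in `pay_test$p.value` is not meaningful. The point is just the shape of the calls, which could be applied to Clean in the same way.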

5. Summary

Final Project for R

Kathy Sun

August 9, 2017

Human Resources Analysis
