Human Resources Analytics: Exploring Employee Data in R

Chapter 1:

Looking at the recruiting data Real HR datasets are tough to find because of privacy and ethical concerns about sharing sensitive employee data. The dataset you’ll be using throughout this course is a synthetic one produced by IBM, and modified for learning purposes.

In this chapter, you’ll be focusing on the sales department and the recruiting channels they were hired from.

library(readr)
recruitment <- read_csv("recruitment_data.csv")
recruitment

## # A tibble: 446 x 4
##    attrition performance_rating sales_quota_pct recruiting_source
##        <dbl>              <dbl>           <dbl> <chr>            
##  1         1                  3           1.09  Applied Online   
##  2         0                  3           2.39  <NA>             
##  3         1                  2           0.498 Campus           
##  4         0                  2           2.51  <NA>             
##  5         0                  3           1.42  Applied Online   
##  6         1                  3           0.548 Referral         
##  7         1                  3           0.794 Applied Online   
##  8         0                  2           1.01  Referral         
##  9         0                  3           1.52  Campus           
## 10         0                  3           2.07  <NA>             
## # ... with 436 more rows

Identifying groups in data You would like to help the talent acquisition team understand which recruiting channel will produce the best sales hires. You can apply the HR analytics process to help them. Start by examining the recruiting channels in the data.

# Load the dplyr package
library(dplyr)
# Get an overview of the recruitment data
summary(recruitment)

##    attrition     performance_rating sales_quota_pct   recruiting_source 
##  Min.   :0.000   Min.   :1.000      Min.   :-0.7108   Length:446        
##  1st Qu.:0.000   1st Qu.:2.000      1st Qu.: 0.5844   Class :character  
##  Median :0.000   Median :3.000      Median : 1.0701   Mode  :character  
##  Mean   :0.213   Mean   :2.895      Mean   : 1.0826                     
##  3rd Qu.:0.000   3rd Qu.:3.000      3rd Qu.: 1.5325                     
##  Max.   :1.000   Max.   :5.000      Max.   : 3.6667

# See which recruiting sources the company has been using
recruiting_source_count <- recruitment %>% 
  count(recruiting_source)
recruiting_source_count

## # A tibble: 5 x 2
##   recruiting_source     n
##   <chr>             <int>
## 1 Applied Online      130
## 2 Campus               56
## 3 Referral             45
## 4 Search Firm          10
## 5 <NA>                205

Sales numbers by recruiting source Which recruiting channel produces the best salespeople? One quality of hire metric you can use is sales quota attainment, or how much a salesperson sold last year relative to their quota. An employee whose sales_quota_pct equals .75 sold 75% of their quota, for example. This metric can be helpful because raw sales numbers are not always comparable between employees.

Calculate the average sales quota attainment achieved by hires from each recruiting source.

# Find the average sales quota attainment for each recruiting source
avg_sales <- recruitment %>%
  group_by(recruiting_source) %>% 
  summarize(avg_sales_quota_pct = mean(sales_quota_pct)) 
  
# Display the result
avg_sales

## # A tibble: 5 x 2
##   recruiting_source avg_sales_quota_pct
##   <chr>                           <dbl>
## 1 Applied Online                  1.06 
## 2 Campus                          0.908
## 3 Referral                        1.02 
## 4 Search Firm                     0.887
## 5 <NA>                            1.17

Attrition rates by recruiting source Another quality of hire metric you can consider is the attrition rate, or how often hires leave the company. Determine which recruiting channels have the highest and lowest attrition rates.

In the last exercise, the output was a data frame with the recruiting channels and the average quota attainment. It would have been easier to tell which channel had the highest-performing employees if it were sorted with arrange().

# Find the average attrition for the sales team, by recruiting source, sorted from lowest attrition rate to highest
avg_attrition <- recruitment %>%
  group_by(recruiting_source) %>% 
  summarize(attrition_rate = mean(attrition)) %>% 
  arrange(attrition_rate)
  
# Display the result
avg_attrition

## # A tibble: 5 x 2
##   recruiting_source attrition_rate
##   <chr>                      <dbl>
## 1 <NA>                       0.132
## 2 Applied Online             0.246
## 3 Campus                     0.286
## 4 Referral                   0.333
## 5 Search Firm                0.5

Visualizing the sales performance differences The last step in the HR analytics process is to test and plot the results. For now, you’ll focus on visualizing the data from the previous exercises. You’ll be making a bar chart so you can more easily see the average sales quota attainment for each recruiting channel.

# Load the ggplot2 package
library(ggplot2)

# Plot the bar chart
ggplot(avg_sales, aes(x = recruiting_source, y = avg_sales_quota_pct)) +
  geom_col()

Visualizing the attrition differences You’ve been using two quality of hire metrics to compare the recruiting channels. In addition to looking at the sales output of the hires, you are also looking at the attrition rates. Plot a bar chart again, but this time plot average attrition instead of sales quota attainment.

# Load ggplot2
library(ggplot2)

# Plot the bar chart
ggplot(avg_attrition, aes(x = recruiting_source, y = attrition_rate)) +
  geom_col()

Drawing conclusions Now it’s time to draw conclusions from your analysis. Both avg_attrition and avg_sales are available in the workspace. Which of the recruiting sources in this dataset produced the best hires, measured by attrition and sales? Which source produced the worst hires?

# Load the packages
library(readr)
library(dplyr)

# Import the data
survey <- read_csv("survey_data.csv")

# Get an overview of the data
summary(survey)

##   employee_id      department          engagement       salary      
##  Min.   :   1.0   Length:1470        Min.   :1.00   Min.   : 45530  
##  1st Qu.: 491.2   Class :character   1st Qu.:3.00   1st Qu.: 59407  
##  Median :1020.5   Mode  :character   Median :3.00   Median : 70481  
##  Mean   :1024.9                      Mean   :3.05   Mean   : 74162  
##  3rd Qu.:1555.8                      3rd Qu.:4.00   3rd Qu.: 84763  
##  Max.   :2068.0                      Max.   :5.00   Max.   :164073  
##  vacation_days_taken
##  Min.   : 0.00      
##  1st Qu.: 6.00      
##  Median :10.00      
##  Mean   :11.27      
##  3rd Qu.:16.00      
##  Max.   :38.00

# Examine the counts of the department variable
survey %>% 
  count(department)

## # A tibble: 3 x 2
##   department      n
##   <chr>       <int>
## 1 Engineering   961
## 2 Finance        63
## 3 Sales         446

CHAPTER 2

Which department has the lowest engagement? You and the HR analytics team have been asked to help the department with the lowest engagement score improve. The first step is to determine which department has the lowest score.

# Output the average engagement score for each department, sorted
survey %>%
  group_by(department) %>%
  summarize(avg_engagement = mean(engagement)) %>%
  arrange(avg_engagement)

## # A tibble: 3 x 2
##   department  avg_engagement
##   <chr>                <dbl>
## 1 Sales                 2.81
## 2 Engineering           3.15
## 3 Finance               3.24

Comparing other factors by department Another common way to think about engagement is identifying which employees are disengaged, which you’ll define as having an engagement score of 1 or 2. The survey dataset doesn’t have a column called disengaged, but you can create it yourself.

Then you can use your dplyr tools to summarize several variables at once.

# Create the disengaged variable and assign the result to survey
survey_disengaged <- survey %>% 
  mutate(disengaged = ifelse(engagement <= 2, 1, 0)) 

survey_disengaged

## # A tibble: 1,470 x 6
##    employee_id department  engagement  salary vacation_days_taken disengaged
##          <dbl> <chr>            <dbl>   <dbl>               <dbl>      <dbl>
##  1           1 Sales                3 103264.                   7          0
##  2           2 Engineering          3  80709.                  12          0
##  3           4 Engineering          3  60737.                  12          0
##  4           5 Engineering          3  99116.                   7          0
##  5           7 Engineering          3  51022.                  18          0
##  6           8 Engineering          3  98400.                   9          0
##  7          10 Engineering          3  57106.                  18          0
##  8          11 Engineering          1  55065.                   4          1
##  9          12 Engineering          4  77158.                  12          0
## 10          13 Engineering          2  48365.                  14          1
## # ... with 1,460 more rows

# Summarize the three variables by department
survey_summary <- survey_disengaged %>% 
  group_by(department) %>% 
  summarize(pct_disengaged = mean(disengaged),
            avg_salary = mean(salary),
            avg_vacation_days = mean(vacation_days_taken))

survey_summary

## # A tibble: 3 x 4
##   department  pct_disengaged avg_salary avg_vacation_days
##   <chr>                <dbl>      <dbl>             <dbl>
## 1 Engineering          0.206     73576.             12.2 
## 2 Finance              0.190     76652.             11.5 
## 3 Sales                0.330     75074.              9.22

Visualizing several variables at once Now you can use the visualization technique from the video to graph the data in survey_summary. This technique is especially useful if you have more departments or more summary measures than you have in this example.

# Load packages
library(ggplot2)
library(tidyr)

# Gather data for plotting
survey_gathered <- survey_summary %>% 
  gather(key = "measure", value = "value",
         pct_disengaged, avg_salary, avg_vacation_days)

# Create three bar charts
ggplot(survey_gathered, aes(measure, value, fill = department)) +
  geom_col(position = "dodge")

Visualizing several variables at once with facets In the visualization you just created, the y-axis extended to fit avg_salary, which was in the thousands. That meant that you couldn’t see the bars for the other two variables you were trying to study. Here you’ll use faceting so that each variable plot has its own y-axis.

library(ggplot2)

# Create three faceted bar charts
ggplot(survey_gathered, aes(measure, value, fill = department)) +
  geom_col(position = "dodge") +
  facet_wrap(~ measure, scales = "free")

Drawing conclusions from graphs Your exploratory analysis showed that one of the departments has the lowest average employee engagement score. You’ve also looked at other variables, averaged by department. The survey_summary data frame is available to you in the console. What can you say about the department with the highest percentage of disengaged employees?

Statistical significance - disengaged employees You’ve seen some evidence that the sales department has a higher proportion of disengaged employees than the rest of the company, but you aren’t yet certain if that difference is significant. Now you can test whether that difference is statistically significant using the tests you just learned about.

# Add the in_sales variable
survey_sales <- survey_disengaged %>%
  mutate(in_sales = ifelse(department == "Sales", "Sales", "Other"))

# Test the hypothesis using survey_sales
chisq.test(survey_sales$in_sales, survey_sales$disengaged)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  survey_sales$in_sales and survey_sales$disengaged
## X-squared = 25.524, df = 1, p-value = 4.368e-07

# Is the result significant?
significant <- TRUE

Statistical significance - vacation days

Your other observation was that employees in the sales department take fewer vacation days on average than the rest of the company. You can test whether that observation is statistically significant as well.

# Test the hypothesis using the survey_sales data
t.test(vacation_days_taken ~ in_sales, data = survey_sales)

## 
##  Welch Two Sample t-test
## 
## data:  vacation_days_taken by in_sales
## t = 8.1549, df = 1022.9, p-value = 1.016e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.229473 3.642409
## sample estimates:
## mean in group Other mean in group Sales 
##           12.160156            9.224215

# Is the result significant?
significant <- TRUE

Drawing conclusions from the tests You’ve tested the differences between the sales department and the rest of the company, and interpreted the results. Now, which of these conclusions can you draw about the sales department’s employees?

CHAPTER 3

Importing the pay data In this chapter you will analyze the salaries of new hires and current employees to determine whether new hires are earning more. The salary dataset for this chapter is saved as fair_pay_data.csv. Import the dataset and do an initial examination.

#Load the packages
library(readr)
library(dplyr)

# Import the data
pay <- read_csv("fair_pay_data.csv")

# Get an overview of the data
summary(pay)

##   employee_id      department            salary         new_hire        
##  Min.   :   1.0   Length:1470        Min.   : 43820   Length:1470       
##  1st Qu.: 491.2   Class :character   1st Qu.: 59378   Class :character  
##  Median :1020.5   Mode  :character   Median : 70425   Mode  :character  
##  Mean   :1024.9                      Mean   : 74142                     
##  3rd Qu.:1555.8                      3rd Qu.: 84809                     
##  Max.   :2068.0                      Max.   :164073                     
##   job_level        
##  Length:1470       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

# Check average salary of new hires and non-new hires
pay %>% 
  group_by(new_hire) %>% 
  summarize(avg_salary = mean(salary))

## # A tibble: 2 x 2
##   new_hire avg_salary
##   <chr>         <dbl>
## 1 No           73425.
## 2 Yes          76074.

Is the difference significant? The new hires have a higher average salary than the more tenured employees, but you don’t yet know whether that difference is statistically significant. Check whether the difference is significant using the appropriate statistical test.

# Perform the correct statistical test
t.test(salary ~ new_hire, data = pay)

## 
##  Welch Two Sample t-test
## 
## data:  salary by new_hire
## t = -2.3437, df = 685.16, p-value = 0.01938
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4869.4242  -429.9199
## sample estimates:
##  mean in group No mean in group Yes 
##          73424.60          76074.28

# Do the same test, and tidy up the output
library(broom)

## Warning: package 'broom' was built under R version 3.6.3

t.test(salary ~ new_hire, data = pay) %>%
  tidy()

## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1   -2650.    73425.    76074.     -2.34  0.0194      685.   -4869.     -430.
## # ... with 2 more variables: method <chr>, alternative <chr>

# Is the result significant?
significant <- TRUE

What other differences exist? One variable you didn’t consider in your earlier statistical test is job level. Different job levels tend to have different pay levels, so if the mix of job levels for new hires is different than the mix of job levels for current employees, you could be dealing with omitted variable bias.

Check whether the job level composition is the same for new hires as it is for current employees.

# Load package
library(ggplot2)

# Create a stacked bar chart
pay %>% 
  ggplot(aes(x = new_hire, fill = job_level)) +
  geom_bar()

# Create a 100% filled stacked bar chart
pay %>% 
  ggplot(aes(x = new_hire, fill = job_level)) +
  geom_bar(position = "fill")

What does the pay difference look like now? It appears that the job level mix is different for new hires. New hires are less likely to be hourly employees, and more likely to be salaried or managers. Do new hires have a higher average salary than current employees when job level is taken into account? Calculate the average salaries, and then recreate the bar chart from earlier in the chapter, adding faceting to split it up by the three job levels. Are the bar heights closer together than they were in the first plot?

# Calculate the average salary for each group of interest
pay_grouped <- pay %>% 
  group_by(new_hire, job_level) %>% 
  summarize(avg_salary = mean(salary))
  
# Graph the results using facet_wrap()  
pay_grouped %>%
  ggplot(aes(x = new_hire, y = avg_salary)) +
  geom_col() +
  facet_wrap(~ job_level)

Are hourly hires paid more? In the plot you made, the bars were nearly equal. This supports the idea that an omitted variable - job level - is driving the difference in pay for new hires and current employees. However, the graph shows a small difference in the average salaries for hourly workers. Test whether a significant pay difference exists between hourly new hires and hourly current employees.

# Load the package
library(broom)

# Filter the data to include only hourly employees
pay_filter <- pay %>%
  filter(job_level == "Hourly")
pay_filter

## # A tibble: 1,039 x 5
##    employee_id department  salary new_hire job_level
##          <dbl> <chr>        <dbl> <chr>    <chr>    
##  1           2 Engineering 80709. No       Hourly   
##  2           4 Engineering 60737. Yes      Hourly   
##  3           7 Engineering 51022. No       Hourly   
##  4          10 Engineering 57106. Yes      Hourly   
##  5          11 Engineering 55065. No       Hourly   
##  6          12 Engineering 77158. No       Hourly   
##  7          13 Engineering 48365. No       Hourly   
##  8          14 Engineering 60945. Yes      Hourly   
##  9          15 Engineering 59161. No       Hourly   
## 10          16 Engineering 79324. Yes      Hourly   
## # ... with 1,029 more rows

# Test the difference in pay
t.test(salary ~ new_hire, data = pay_filter) %>%
  tidy()

## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1   -1107.    63966.    65073.     -1.75  0.0807      500.   -2349.      136.
## # ... with 2 more variables: method <chr>, alternative <chr>

# Is the result significant?
significant <- FALSE

New hire pay: a simple regression Linear regression is another tool that can be used to test differences between groups. Start by using a simple linear regression to test the original idea that that new hires earn more than current employees, without taking any other factors into account. Use the pay dataset again for the following exercises.

# Run the simple regression
model_simple <- lm(salary ~ new_hire, data = pay)

# Display the summary of model_simple
model_simple %>% 
  summary()

## 
## Call:
## lm(formula = salary ~ new_hire, data = pay)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32255 -14466  -3681  10740  87998 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  73424.6      577.2 127.200   <2e-16 ***
## new_hireYes   2649.7     1109.4   2.388    0.017 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18900 on 1468 degrees of freedom
## Multiple R-squared:  0.003871,   Adjusted R-squared:  0.003193 
## F-statistic: 5.705 on 1 and 1468 DF,  p-value: 0.01704

# Display a tidy summary
model_simple %>% 
  tidy()

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   73425.      577.    127.    0     
## 2 new_hireYes    2650.     1109.      2.39  0.0170

# Is new hire pay significantly higher in this model?
significant <- TRUE

New hire pay: accounting for job levels

The power of linear regression is that it can combine the steps you took earlier in the chapter of testing the difference between groups, then adding a filter, and testing again. By adding the additional variable directly into the regression, you get a significance result that does take the additional information into account.

In this exercise you’ll use multiple regression to test the difference in pay between new hires and current employees, while accounting for the fact that the job level mix is not the same in the two groups.

# Run the multiple regression
model_multiple <- lm(salary ~ new_hire + job_level, data = pay)

# Display the summary of model_multiple
model_multiple %>% 
  summary()

## 
## Call:
## lm(formula = salary ~ new_hire + job_level, data = pay)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21013  -7091   -425   6771  44322 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        64049.3      308.3 207.722   <2e-16 ***
## new_hireYes          782.7      524.8   1.491    0.136    
## job_levelManager   54918.8      915.3  60.001   <2e-16 ***
## job_levelSalaried  26865.6      567.2  47.369   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8930 on 1466 degrees of freedom
## Multiple R-squared:  0.7779, Adjusted R-squared:  0.7775 
## F-statistic:  1712 on 3 and 1466 DF,  p-value: < 2.2e-16

# Display a tidy summary
model_multiple %>% 
  tidy()

## # A tibble: 4 x 5
##   term              estimate std.error statistic   p.value
##   <chr>                <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)         64049.      308.    208.   0.       
## 2 new_hireYes           783.      525.      1.49 1.36e-  1
## 3 job_levelManager    54919.      915.     60.0  0.       
## 4 job_levelSalaried   26866.      567.     47.4  7.39e-298

# Is new hire pay significantly higher in this model?
significant <- FALSE

Drawing conclusions from the tests Based on your t-tests and regressions, what is your conclusion?

New hires are being paid…

CHAPTER 4

Importing the two datasets To do analytics with HR data, you frequently need to merge or join together data from different data sources. In this chapter, you’ll start with some base HR data and join it with performance rating data to allow you to test some hypotheses about performance ratings differences between groups.

Before doing that, take a moment to examine the two datasets you’ll be working with for this chapter.

# Load the packages
library(readr)
library(dplyr)

# Import the data
hr_data <- read_csv("hr_data.csv")
performance_data <- read_csv("performance_data.csv")

# Examine the datasets
summary(hr_data)

##   employee_id      department         job_level            gender         
##  Min.   :   1.0   Length:1470        Length:1470        Length:1470       
##  1st Qu.: 491.2   Class :character   Class :character   Class :character  
##  Median :1020.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1024.9                                                           
##  3rd Qu.:1555.8                                                           
##  Max.   :2068.0

summary(performance_data)

##   employee_id         rating    
##  Min.   :   1.0   Min.   :1.00  
##  1st Qu.: 491.2   1st Qu.:2.00  
##  Median :1020.5   Median :3.00  
##  Mean   :1024.9   Mean   :2.83  
##  3rd Qu.:1555.8   3rd Qu.:4.00  
##  Max.   :2068.0   Max.   :5.00

Joining performance data Now that you’ve looked at hr_data and performance_data, it’s time to join them. Use the techniques from the video to put all the data you need in one data frame. Once you have the final data frame, check to see whether the average performance rating is higher for men or women.

# Join the two tables
joined_data <- left_join(hr_data, performance_data, by = "employee_id")

# Examine the result
summary(joined_data)

##   employee_id      department         job_level            gender         
##  Min.   :   1.0   Length:1470        Length:1470        Length:1470       
##  1st Qu.: 491.2   Class :character   Class :character   Class :character  
##  Median :1020.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1024.9                                                           
##  3rd Qu.:1555.8                                                           
##  Max.   :2068.0                                                           
##      rating    
##  Min.   :1.00  
##  1st Qu.:2.00  
##  Median :3.00  
##  Mean   :2.83  
##  3rd Qu.:4.00  
##  Max.   :5.00

# Check whether the average performance rating differs by gender 
joined_data %>%
  group_by(gender) %>%
  summarize(avg_rating = mean(rating))

## # A tibble: 2 x 2
##   gender avg_rating
##   <chr>       <dbl>
## 1 Female       2.75
## 2 Male         2.92

Focus on high performers

Performance ratings matter to organizations and individual employees. Being labeled a “high performer” can matter even more than the exact rating when the employee is considered for promotions, bonuses, and raises. Organizations can define “high performer” however they wish, and in this chapter, a high performer is any employee with a rating of 4 or 5 on the 5-point scale.

Create the new high_performer variable, and perform a statistical test on whether the women in this dataset are significantly less likely to be labeled high performers.

# Add the high_performer column
performance <- joined_data %>%  
  mutate(high_performer = ifelse(rating >= 4, 1, 0))

# Test whether one gender is more likely to be a high performer
chisq.test(performance$gender, performance$high_performer)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  performance$gender and performance$high_performer
## X-squared = 22.229, df = 1, p-value = 2.42e-06

# Do the same test, and tidy the output
library(broom)
chisq.test(performance$gender, performance$high_performer) %>% 
  tidy()

## # A tibble: 1 x 4
##   statistic    p.value parameter method                                         
##       <dbl>      <dbl>     <int> <chr>                                          
## 1      22.2 0.00000242         1 Pearson's Chi-squared test with Yates' continu~

# Is the test result significant?
significant <- TRUE

Comparing distributions of high performers

You just performed a statistical test of the difference in high_performer distribution between men and women in this dataset. You can also visualize the difference with 100% stacked bar charts, as you did with job_level in the last chapter.

In this exercise, you’ll use the same method you did before to visually compare the difference in the distribution of ratings given to men and women. You’ll start by comparing the mix of high performers, and finish by comparing the distribution of all ratings.

# Visualize the distribution of high_performer by gender
performance %>%
  ggplot(aes(gender, fill = factor(high_performer))) +
  geom_bar(position = "fill")

# Visualize the distribution of all ratings by gender
performance %>%
  ggplot(aes(gender, fill = factor(rating))) +
  geom_bar(position = "fill")

Checking for omitted variable bias In many organizations, employees at higher job levels in the organization are more likely to be considered high performers. That is, the distribution of performance ratings is not always the same at different job levels.

First, check the difference in job level distribution by gender to see if there is any potential for job level being an omitted variable we need to consider. Then plot the high performer distribution by both gender and job level. Does it look like the gender difference disappears?

# Visualize the distribution of job_level by gender
performance %>%
  ggplot(aes(x = gender, fill = job_level)) +
  geom_bar(position = "fill")

# Test whether men and women have different job level distributions
chisq.test(performance$gender, performance$job_level)

## 
##  Pearson's Chi-squared test
## 
## data:  performance$gender and performance$job_level
## X-squared = 44.506, df = 2, p-value = 2.166e-10

# Visualize the distribution of high_performer by gender, faceted by job level
performance %>% 
  ggplot(aes(x = gender, fill = factor(high_performer))) +
  geom_bar(position = "fill") +
  facet_wrap(~ job_level)

Performance ratings: a simple logistic regression

Since high_performer is a binary variable, you’ll need to use logistic regression to test whether men are statistically more likely to be labeled high performers in this dataset, even when job level is taken into account.

Do a warm-up by running a simple logistic regression without adding in job level.

# Run a simple logistic regression
logistic_simple <- glm(high_performer ~ gender, family = "binomial", data = performance) 

# View the result with summary()
logistic_simple %>%
  summary()

## 
## Call:
## glm(formula = high_performer ~ gender, family = "binomial", data = performance)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8871  -0.8871  -0.6957   1.4986   1.7535  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.29540    0.08813 -14.699  < 2e-16 ***
## genderMale   0.56596    0.11921   4.748 2.06e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1709.0  on 1469  degrees of freedom
## Residual deviance: 1686.1  on 1468  degrees of freedom
## AIC: 1690.1
## 
## Number of Fisher Scoring iterations: 4

# View a tidy version of the result
logistic_simple %>%
  tidy()

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -1.30     0.0881    -14.7  6.58e-49
## 2 genderMale     0.566    0.119       4.75 2.06e- 6

# Is the result significant?
significant <- TRUE

Performance ratings: accounting for job levels

Now that you’re warmed up on logistic regression, you can use it to statistically test what you saw in an earlier graph - are women less likely to be labeled a high performer, even when taking differences in job level into account? You can test this by adding job_level into the glm() formula and looking at the new p-value. The process is the same as adding a new variable to a linear regression.

# Run a multiple logistic regression
logistic_multiple <- glm(high_performer ~ gender + job_level, family = "binomial", data = performance)

# View the result with summary() or tidy()
logistic_multiple %>%
  tidy()

## # A tibble: 4 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)         -1.69      0.103    -16.5  2.74e-61
## 2 genderMale           0.319     0.129      2.47 1.34e- 2
## 3 job_levelManager     2.74      0.251     10.9  1.01e-27
## 4 job_levelSalaried    1.10      0.141      7.82 5.17e-15

# Is the result significant?  
significant <- TRUE

CHAPTER 5

Importing and joining the accident data It’s the final chapter! Congratulations on making it this far. This chapter’s scenario is that a senior executive believes that workplace accidents have increased this past year at the production sites. She wants you to find out if that’s true, and if it is, to look into what might be driving the increase.

Start by importing the HR and accident datasets. Then join them together, and add a had_accident variable to make it easier to analyze accident rates.

# Load the packages
library(readr)
library(dplyr)

# Import the data 
hr_data <- read_csv("hr_data_2.csv")
accident_data <- read_csv("accident_data.csv")

# Create hr_joined with left_join() and mutate()
hr_joined <- left_join(hr_data, accident_data, by = c("employee_id", "year")) %>% 
  mutate(had_accident = ifelse(is.na(accident_type), 0, 1))
  
hr_joined

## # A tibble: 2,940 x 6
##     year employee_id location    overtime_hours accident_type had_accident
##    <dbl>       <dbl> <chr>                <dbl> <chr>                <dbl>
##  1  2016           1 Northwood               14 <NA>                     0
##  2  2017           1 Northwood                8 Mild                     1
##  3  2016           2 East Valley              8 <NA>                     0
##  4  2017           2 East Valley             11 <NA>                     0
##  5  2016           4 East Valley              4 <NA>                     0
##  6  2017           4 East Valley              2 Mild                     1
##  7  2016           5 West River               0 <NA>                     0
##  8  2017           5 West River              17 <NA>                     0
##  9  2016           7 West River              21 <NA>                     0
## 10  2017           7 West River               9 <NA>                     0
## # ... with 2,930 more rows

Where is the highest accident rate? It’s time to answer the executive’s first question: did the accident rate increase from 2016 to 2017? Now that you have the had_accident variable, you can calculate the accident rate by taking the mean of had_accident. Use the appropriate statistical test to see if the accident rate increase is significant. Then, determine which site has the highest accident rate.

# Find accident rate for each year
hr_joined %>% 
  group_by(year) %>% 
  summarize(accident_rate = mean(had_accident))

## # A tibble: 2 x 2
##    year accident_rate
##   <dbl>         <dbl>
## 1  2016        0.0850
## 2  2017        0.120

# Test difference in accident rate between years
chisq.test(hr_joined$year, hr_joined$had_accident)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hr_joined$year and hr_joined$had_accident
## X-squared = 9.5986, df = 1, p-value = 0.001947

# Which location had the highest acccident rate?
hr_joined %>% 
  group_by(location) %>%  
  summarize(accident_rate = mean(had_accident)) %>% 
  arrange(desc(accident_rate))

## # A tibble: 4 x 2
##   location    accident_rate
##   <chr>               <dbl>
## 1 East Valley        0.128 
## 2 Southfield         0.103 
## 3 West River         0.0961
## 4 Northwood          0.0831

Where did the accident rate increase most?

The location with the highest accident rate is a fine place to start, but to answer the executive’s question, it will be more productive to focus on the location where the accident rate increased most from last year. You’ll use both a table and a graph in this exercise to answer that question: which location’s accident rate had the largest percentage point increase from 2016 to 2017?

# Compare annual accident rates by location
accident_rates <- hr_joined %>% 
  group_by(location, year) %>% 
  summarize(accident_rate = mean(had_accident))
  
accident_rates

## # A tibble: 8 x 3
## # Groups:   location [4]
##   location     year accident_rate
##   <chr>       <dbl>         <dbl>
## 1 East Valley  2016        0.113 
## 2 East Valley  2017        0.143 
## 3 Northwood    2016        0.0644
## 4 Northwood    2017        0.102 
## 5 Southfield   2016        0.0764
## 6 Southfield   2017        0.130 
## 7 West River   2016        0.0824
## 8 West River   2017        0.110

# Graph it
accident_rates %>% 
  ggplot(aes(factor(year), accident_rate)) +
  geom_col() +
  facet_wrap(~location)

# Answer the question
increased_most <- "Southfield"

Focusing on the problem location

Because Southfield was the location where the accident increased the most from 2016 to 2017, you should investigate what else changed there. Overworked employees are more likely to make mistakes that lead to accidents, so you can start by comparing average overtime hours worked in each year.

# Filter out the other locations
southfield <- hr_joined %>% 
  filter(location == "Southfield")

# Find the average overtime hours worked by year
southfield %>%
  group_by(year) %>% 
  summarize(average_overtime_hours = mean(overtime_hours))

## # A tibble: 2 x 2
##    year average_overtime_hours
##   <dbl>                  <dbl>
## 1  2016                   8.67
## 2  2017                   9.97

# Test difference in Southfield's overtime hours between years
t.test(overtime_hours ~ year, data = southfield)

## 
##  Welch Two Sample t-test
## 
## data:  overtime_hours by year
## t = -1.6043, df = 595.46, p-value = 0.1092
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.904043  0.292747
## sample estimates:
## mean in group 2016 mean in group 2017 
##           8.667774           9.973422

# Do the years have significantly different average overtime hours?  
significant <- FALSE

Bringing in more data

Since overtime hours didn’t have a significant change between the years, you should see what other variables you can check. In consulting with your team, somebody suggests trying engagement scores, and specifically the number of disengaged employees at the location. You don’t have the survey data ready yet, so you’ll need to load it in and join it to the data you’ve been working with. Performing multiple joins is common in HR analytics.

After the join, change year to factor(year). Since you only care about the year as a grouping variable, and not its actual numeric value, changing it to a factor will make further analysis easier.

# Import the survey data
survey_data <- read_csv("survey_data_2.csv")

# Create the safety dataset
safety <- left_join(hr_joined, survey_data, by = c("year", "employee_id")) %>% 
  mutate(disengaged = ifelse(engagement <= 2, 1, 0), 
         year = factor(year))

Checking for omitted variables With the survey data joined, you can test your colleague’s idea that employee engagement, or specifically, employee disengagment, changed at the same time that the accident rate changed. Use a bar chart to compare the distributions, just as you have in previous chapters, and then use a statistical test to check if the difference you can see is significant.

# Visualize the difference in % disengaged by year in Southfield
southfield <- safety %>% 
  filter(location == 'Southfield') 
southfield %>%
  ggplot(aes(x = year, fill = factor(disengaged))) +
  geom_bar(position = "fill")

# Test whether one year had significantly more disengaged employees
chisq.test(southfield$year, southfield$disengaged)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  southfield$year and southfield$disengaged
## X-squared = 7.1906, df = 1, p-value = 0.007329

# Is the result significant?
significant <- TRUE

Is that change isolated to the problem location?

You’ve found a difference in disengagement between the two years at Southfield, and no difference in overtime hours worked. Now you’ll check the other locations too, to make sure those locations didn’t also have a big change in disengagement. If the disengagement rate increased at the other locations where the accident rate did not change much, you’ll have to doubt whether disengagement and accident rates are connected.

To test the other locations all at once, create a new dataset where Southfield is filtered out. Then rerun the same statistical tests from the previous exercises, but use the new dataset to test the other locations.

# Filter out Southfield
other_locs <- safety %>% 
  filter(location != "Southfield")

# Test whether one year had significantly more overtime hours worked
t.test(overtime_hours ~ year, data = other_locs)

## 
##  Welch Two Sample t-test
## 
## data:  overtime_hours by year
## t = -0.48267, df = 2320.3, p-value = 0.6294
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9961022  0.6026035
## sample estimates:
## mean in group 2016 mean in group 2017 
##           9.278015           9.474765

# Test whether one year had significantly more disengaged employees
chisq.test(other_locs$year, other_locs$disengaged)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  other_locs$year and other_locs$disengaged
## X-squared = 0.0023091, df = 1, p-value = 0.9617

# Are the results significant?
significant_overtime <- FALSE
significant_disengaged <- FALSE

Using regression to identify change drivers

It’s time to use what you know about multiple regression to answer the executive’s question: why did the accident rate increase?

Phrased another way, is there a variable in your dataset that, when added into a multiple regression, can explain the difference in accident rates between the two years?

# Use multiple regression to test the impact of year and disengaged on accident rate in Southfield
regression <- glm(had_accident ~ year + disengaged, family = "binomial", data = southfield)

# Examine the output
library(broom)
tidy(regression)

## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -2.92      0.250    -11.7  1.74e-31
## 2 year2017       0.440     0.285      1.55 1.22e- 1
## 3 disengaged     1.44      0.278      5.19 2.13e- 7

# Interpret the regression output
significant_year <- FALSE
significant_disengaged <- TRUE

Human Resources Analytics: Exploring Employee Data in R

Ali Frady

5/31/2020