##Introduction Human resources analytics, also known as people analytics is a data-driven approach to managing people at work. It’s goal is to understande employee’s needs, improve employee safety, understand and improve employee fairness and diversity, identify drivers of employee attrition but also identify the best recruiting source and check if the workloads of employees are accomplished.

Here, I will be using eight datasets, and answer the questions: which recruiting source is the best, what is driving low employee engagement, are new hires getting paid too much, are performance ratings being given consistently, and try to mprove employee safety with data.

Also, I will use pakckages from tidyverse, such as: readr, tidyr, dplyr and ggplot.

##Which recruiting source is the best?

###Importing data

recruitment <- read_csv("D:/recruitment_data.csv")
## Parsed with column specification:
## cols(
##   attrition = col_double(),
##   performance_rating = col_double(),
##   sales_quota_pct = col_double(),
##   recruiting_source = col_character()
## )
head(recruitment)
## # A tibble: 6 x 4
##   attrition performance_rating sales_quota_pct recruiting_source
##       <dbl>              <dbl>           <dbl> <chr>            
## 1         1                  3           1.09  Applied Online   
## 2         0                  3           2.39  <NA>             
## 3         1                  2           0.498 Campus           
## 4         0                  2           2.51  <NA>             
## 5         0                  3           1.42  Applied Online   
## 6         1                  3           0.548 Referral

Here we see four coulumns: attrition which is metric of how often hires leave the company, performance rating - performance on scale from 1 to 5, sales_quota_pct - how much a salesperson sold last year relative to their quota and recruiting_source - through which source employees were hired.

Analyzing data

A general process for HR analytics is to first identify the groups to compare, calculate summary statistics about those groups and plot or test the differences between those groups.

Now I will try to find the best recuriting source that produces the best hires.

We will begin with examining recruiting channels in the data. Good ideas is to find out nubers of hires per each distinct recruting source.

recruitment %>% 
  count(recruiting_source)
## # A tibble: 5 x 2
##   recruiting_source     n
##   <chr>             <int>
## 1 <NA>                205
## 2 Applied Online      130
## 3 Campus               56
## 4 Referral             45
## 5 Search Firm          10

We see that some employees don’t have recruting source.

Since we are interested in which recruiting channel produces the best salespeople we can use metric sales quota attainment or how much a salesperson sold last year relative to their quota.

avg_sales <- recruitment %>%
  group_by(recruiting_source) %>% 
  summarize(avg_sales_quota_pct = mean(sales_quota_pct)) %>%
  arrange(desc(avg_sales_quota_pct))
avg_sales
## # A tibble: 5 x 2
##   recruiting_source avg_sales_quota_pct
##   <chr>                           <dbl>
## 1 <NA>                            1.17 
## 2 Applied Online                  1.06 
## 3 Referral                        1.02 
## 4 Campus                          0.908
## 5 Search Firm                     0.887

Another metric that we can consider is attrition rate or how often hires leave the company.

avg_attrition <- recruitment %>%
  group_by(recruiting_source) %>% 
  summarize(attrition_rate = mean(attrition)) %>% 
  arrange(desc(attrition_rate))
avg_attrition
## # A tibble: 5 x 2
##   recruiting_source attrition_rate
##   <chr>                      <dbl>
## 1 Search Firm                0.5  
## 2 Referral                   0.333
## 3 Campus                     0.286
## 4 Applied Online             0.246
## 5 <NA>                       0.132

The last step in the HR analytics process is to test and plot the results.

ggplot(avg_sales, aes(x = recruiting_source, y = avg_sales_quota_pct)) +
  geom_col()

ggplot(avg_attrition, aes(x = recruiting_source, y = attrition_rate)) +
  geom_col()

From plots we can see that those who found firm has the lowest average sales quota and most often leaves the firm, while candidates that Applied online have the highest average sales quota and leaves the company less, if we don’t consider NA. Therefore, we can conclude that the best recruiting system is through applying online, while the worst through searching by firm.

##What is driving low employee engagement?

It is believed that those employees that are involved in, enthusiastic about and commited to their work and workplace better performer. This is usually measured with a survey, bur behavioral data can be used as well.

Here, I will investigate the link between engagement and business outcomes.

###Importing the data

survey <- read_csv("D:/survey_data.csv")
## Parsed with column specification:
## cols(
##   employee_id = col_double(),
##   department = col_character(),
##   engagement = col_double(),
##   salary = col_double(),
##   vacation_days_taken = col_double()
## )
survey
## # A tibble: 1,470 x 5
##    employee_id department  engagement  salary vacation_days_taken
##          <dbl> <chr>            <dbl>   <dbl>               <dbl>
##  1           1 Sales                3 103264.                   7
##  2           2 Engineering          3  80709.                  12
##  3           4 Engineering          3  60737.                  12
##  4           5 Engineering          3  99116.                   7
##  5           7 Engineering          3  51022.                  18
##  6           8 Engineering          3  98400.                   9
##  7          10 Engineering          3  57106.                  18
##  8          11 Engineering          1  55065.                   4
##  9          12 Engineering          4  77158.                  12
## 10          13 Engineering          2  48365.                  14
## # ... with 1,460 more rows

The columns are self-descriptive. Coumn engagement contains information about how much every employee is engaged on scale from 1 to 5, where 1 is the least engaged and 5 the most.

###Analyzing data

survey %>% 
  count(department)
## # A tibble: 3 x 2
##   department      n
##   <chr>       <int>
## 1 Engineering   961
## 2 Finance        63
## 3 Sales         446
survey %>%
  group_by(department) %>%
  summarize(avg_engagement = mean(engagement)) %>%
  arrange(desc(avg_engagement))
## # A tibble: 3 x 2
##   department  avg_engagement
##   <chr>                <dbl>
## 1 Finance               3.24
## 2 Engineering           3.15
## 3 Sales                 2.81

We can see that sales deparment has the lowest engagement.

We will create a column to identify which employees are disangaged by assigning 0 if their score is 1 or 2 so that we can compare another factors.

survey_disengaged <- survey %>% 
  mutate(disengaged = ifelse(engagement <= 2, 1, 0)) 
survey_disengaged
## # A tibble: 1,470 x 6
##    employee_id department  engagement  salary vacation_days_tak~ disengaged
##          <dbl> <chr>            <dbl>   <dbl>              <dbl>      <dbl>
##  1           1 Sales                3 103264.                  7          0
##  2           2 Engineering          3  80709.                 12          0
##  3           4 Engineering          3  60737.                 12          0
##  4           5 Engineering          3  99116.                  7          0
##  5           7 Engineering          3  51022.                 18          0
##  6           8 Engineering          3  98400.                  9          0
##  7          10 Engineering          3  57106.                 18          0
##  8          11 Engineering          1  55065.                  4          1
##  9          12 Engineering          4  77158.                 12          0
## 10          13 Engineering          2  48365.                 14          1
## # ... with 1,460 more rows
survey_summary <- survey_disengaged %>% 
  group_by(department) %>% 
  summarize(pct_disengaged = mean(disengaged),
            avg_salary = mean(salary),
            avg_vacation_days = mean(vacation_days_taken))
survey_summary
## # A tibble: 3 x 4
##   department  pct_disengaged avg_salary avg_vacation_days
##   <chr>                <dbl>      <dbl>             <dbl>
## 1 Engineering          0.206     73576.             12.2 
## 2 Finance              0.190     76652.             11.5 
## 3 Sales                0.330     75074.              9.22

Next we are going to compare the percenage of disangeged, average salary and avarage of taken days and to visualize these variables.

survey_summary <- survey_disengaged %>% 
  group_by(department) %>% 
  summarize(pct_disengaged = mean(disengaged),
            avg_salary = mean(salary),
            avg_vacation_days = mean(vacation_days_taken))
survey_summary
## # A tibble: 3 x 4
##   department  pct_disengaged avg_salary avg_vacation_days
##   <chr>                <dbl>      <dbl>             <dbl>
## 1 Engineering          0.206     73576.             12.2 
## 2 Finance              0.190     76652.             11.5 
## 3 Sales                0.330     75074.              9.22
survey_gathered <- survey_summary %>% 
  gather(key = "measure", value = "value",
         pct_disengaged, avg_salary, avg_vacation_days)
survey_gathered
## # A tibble: 9 x 3
##   department  measure               value
##   <chr>       <chr>                 <dbl>
## 1 Engineering pct_disengaged        0.206
## 2 Finance     pct_disengaged        0.190
## 3 Sales       pct_disengaged        0.330
## 4 Engineering avg_salary        73576.   
## 5 Finance     avg_salary        76652.   
## 6 Sales       avg_salary        75074.   
## 7 Engineering avg_vacation_days    12.2  
## 8 Finance     avg_vacation_days    11.5  
## 9 Sales       avg_vacation_days     9.22
ggplot(survey_gathered, aes(measure, value, fill = department)) +
  geom_col(position = "dodge") +
  facet_wrap(~ measure, scales = "free")

From the graphs we can see that sales deprtment has the lowest engagement. When we look at other variables we see that this deparment has fewest average vacation days taken.

However to prove that this difference is significant we will have to use statistical test. To test the hypothesis that the sales department has the same proportion of disengaged employees as the rest of the company we are going to use chi-squared test. This test is alternative to t-test but it is used when the variables are categorical.

survey_sales <- survey %>%
  mutate(in_sales = ifelse(department == "Sales", "Sales", "Other"))%>%
  mutate(disengaged = ifelse(engagement <= 2, 1, 0)) 



chisq.test(survey_sales$in_sales, survey_sales$disengaged)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  survey_sales$in_sales and survey_sales$disengaged
## X-squared = 25.524, df = 1, p-value = 4.368e-07

Since the p-value is less than 0.05 we can say that this test result is significant.

Now we are going to test the hypothesis that employees in the sales department took fewer vacation days, on average, than the rest of the company.

In this case we are going to use t-test since the variables that we compare are continuous.

t.test(vacation_days_taken ~ in_sales, data = survey_sales)
## 
##  Welch Two Sample t-test
## 
## data:  vacation_days_taken by in_sales
## t = 8.1549, df = 1022.9, p-value = 1.016e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.229473 3.642409
## sample estimates:
## mean in group Other mean in group Sales 
##           12.160156            9.224215

At the 0.05 level the test result is significant.

At the end we can conclude that sales department is less engaged and that is takes fewer vacation days than other deprtments. But other investigation are necessary to find out what causes sales department to be less engaged and to take fewer vacation days.

##Are new hires getting paid too much?

Sometimes the company increases offered salary to make the job more attractive. In a dataset ‘pay’ we will find out whether new hires are getting paid more than current employees.

###Importing data

library(tidyverse)
## -- Attaching packages ------------------------------------------------ tidyverse 1.2.1 --
## v tibble  2.1.1     v stringr 1.4.0
## v purrr   0.3.2     v forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'stringr' was built under R version 3.5.2
## Warning: package 'forcats' was built under R version 3.5.2
## -- Conflicts --------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
setwd("~/Data_analysis-master")
pay <- read_csv("D:/Data_analysis-master/fair_pay.csv")
## Parsed with column specification:
## cols(
##   employee_id = col_double(),
##   department = col_character(),
##   salary = col_double(),
##   new_hire = col_character(),
##   job_level = col_character()
## )
head(pay)
## # A tibble: 6 x 5
##   employee_id department   salary new_hire job_level
##         <dbl> <chr>         <dbl> <chr>    <chr>    
## 1           1 Sales       103264. No       Salaried 
## 2           2 Engineering  80709. No       Hourly   
## 3           4 Engineering  60737. Yes      Hourly   
## 4           5 Engineering  99116. Yes      Salaried 
## 5           7 Engineering  51022. No       Hourly   
## 6           8 Engineering  98400. No       Salaried

###Inspecting and analyzing data

pay %>% 
  group_by(new_hire) %>% 
  summarize(avg_salary = mean(salary))
## # A tibble: 2 x 2
##   new_hire avg_salary
##   <chr>         <dbl>
## 1 No           73425.
## 2 Yes          76074.

From this output we can see that new hires on average gets more paid.

To prove if this difference is significant we will use t-test.

library(broom)
t.test(salary ~ new_hire, data = pay) %>%
  tidy()
## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
## 1   -2650.    73425.    76074.     -2.34  0.0194      685.   -4869.
## # ... with 3 more variables: conf.high <dbl>, method <chr>,
## #   alternative <chr>

The result of the test at the 0.05 level is significant. In another words there is a significant difference in salary.

However a problem that can occurs is ommited variable bias. Omitted variable bias occurs when an omitted variable is correlated with the dependent variable, and the way the groups are divided. A good way to inSpect this is visualizing it. In particular, by using 100% stacked bar chart which are perfect when we are more interested in the group composition than the actual number of employees in each group.

pay %>% 
  ggplot(aes(x = new_hire, fill = job_level)) +
  geom_bar(position = "fill")

It appears that the job level mix is different for new hires. New hires are less likely to be hourly employees, and more likely to be salaried or managers.

pay_grouped <- pay %>% 
  group_by(new_hire, job_level) %>% 
  summarize(avg_salary = mean(salary))
pay_grouped
## # A tibble: 6 x 3
## # Groups:   new_hire [2]
##   new_hire job_level avg_salary
##   <chr>    <chr>          <dbl>
## 1 No       Hourly        63966.
## 2 No       Manager      119132.
## 3 No       Salaried      91144.
## 4 Yes      Hourly        65073.
## 5 Yes      Manager      119423.
## 6 Yes      Salaried      91140.
pay_grouped %>%
  ggplot(aes(x = new_hire, y = avg_salary)) +
  geom_col() +
  facet_wrap(~ job_level)

From the plot we made the bars were nearly equal. This supports the idea that an omitted variable - job level - is driving the difference in pay for new hires and current employees.

However, the graph shows a small difference in the average salaries for hourly workers. We are going to test whether a significant pay difference exists between hourly new hires and hourly current employees.

In the plot we made, the bars were nearly equal. This supports the idea that an omitted variable - job level - is driving the difference in pay for new hires and current employees.

pay_filter <- pay %>%
  filter(job_level == "Hourly")

t.test(salary ~ new_hire, data = pay_filter) %>%
  tidy()
## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
## 1   -1107.    63966.    65073.     -1.75  0.0807      500.   -2349.
## # ... with 3 more variables: conf.high <dbl>, method <chr>,
## #   alternative <chr>

Since the test is not significant at the 0.05 level we can say that hourly new hires are not paid more than old hourly hires.

We have seen that the difference in salary between new hires and current employees seemed to disappear when another variable was taken into account.

We could have used multiple linear regression to test the significance of the difference between 2 groups while taking one or more other factors, such as omitted variables into account.

model_multiple <- lm(salary ~ new_hire + job_level, data = pay)
model_multiple%>%
  tidy()
## # A tibble: 4 x 5
##   term              estimate std.error statistic   p.value
##   <chr>                <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)         64049.      308.    208.   0.       
## 2 new_hireYes           783.      525.      1.49 1.36e-  1
## 3 job_levelManager    54919.      915.     60.0  0.       
## 4 job_levelSalaried   26866.      567.     47.4  7.39e-298

since we’re testing the effect that being a new hire has on salary we will use p-value from new_hireYes row. Since it is higher than 0.05 we can say that new hire are not payed significantly higher in this model.

New hires are being paid about the same as current employees when job level is considered.

However if we take into account deparments - the results of tests are different.

model_multiple <- lm(salary ~ new_hire + department, data = pay)
model_multiple%>%
  tidy()
## # A tibble: 4 x 5
##   term              estimate std.error statistic p.value
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)         72844.      679.    107.    0     
## 2 new_hireYes          2649.     1109.      2.39  0.0170
## 3 departmentFinance    3093.     2457.      1.26  0.208 
## 4 departmentSales      1477.     1082.      1.36  0.173

Here since p-value is lower than 0.05 we can say that new hire are payed significantly higher in this model.

Therefore we can conclude that that depending on variable that we take into account the result of test depends in that way.

New hires are being paid about the same as current employees when job level is considered, but paid less if deparments are included.

##Are performance ratings being given consistently?

Performance management helps an organization keep track of which employees are providing extra value, or below-average value, and compensating them accordingly. However cheching performance is sometimes subjective (individual managers’ preferences, or even biases - conscious or subconscious).

###Importing and joining datasets

Usually the data (depending on the type of informatons that contains) is splitted in different tables. Here we will be working with 2 datasets which we will join into one table.

hr_data <- read_csv("hr_data.csv")
## Parsed with column specification:
## cols(
##   employee_id = col_double(),
##   department = col_character(),
##   job_level = col_character(),
##   gender = col_character()
## )
head(hr_data)
## # A tibble: 6 x 4
##   employee_id department  job_level gender
##         <dbl> <chr>       <chr>     <chr> 
## 1           1 Sales       Salaried  Female
## 2           2 Engineering Hourly    Female
## 3           4 Engineering Hourly    Female
## 4           5 Engineering Salaried  Male  
## 5           7 Engineering Hourly    Male  
## 6           8 Engineering Salaried  Female
performance_data <- read_csv("performance_data.csv")
## Parsed with column specification:
## cols(
##   employee_id = col_double(),
##   rating = col_double()
## )
head(performance_data)
## # A tibble: 6 x 2
##   employee_id rating
##         <dbl>  <dbl>
## 1           1      4
## 2           2      4
## 3           4      4
## 4           5      4
## 5           7      2
## 6           8      5
joined_data <- left_join(hr_data, performance_data, by = "employee_id")
head(joined_data)
## # A tibble: 6 x 5
##   employee_id department  job_level gender rating
##         <dbl> <chr>       <chr>     <chr>   <dbl>
## 1           1 Sales       Salaried  Female      4
## 2           2 Engineering Hourly    Female      4
## 3           4 Engineering Hourly    Female      4
## 4           5 Engineering Salaried  Male        4
## 5           7 Engineering Hourly    Male        2
## 6           8 Engineering Salaried  Female      5

###Analyzing

We are interested in whether average performance rating differs by gender.

joined_data %>%
  group_by(gender) %>%
  summarize(avg_rating = mean(rating))
## # A tibble: 2 x 2
##   gender avg_rating
##   <chr>       <dbl>
## 1 Female       2.75
## 2 Male         2.92

It seems like male has better average performance ratings, but we should investigate if this difference is significant.

performance <- joined_data %>%  
  mutate(high_performer = ifelse(rating >= 4, 1, 0))
performance
## # A tibble: 1,470 x 6
##    employee_id department  job_level gender rating high_performer
##          <dbl> <chr>       <chr>     <chr>   <dbl>          <dbl>
##  1           1 Sales       Salaried  Female      4              1
##  2           2 Engineering Hourly    Female      4              1
##  3           4 Engineering Hourly    Female      4              1
##  4           5 Engineering Salaried  Male        4              1
##  5           7 Engineering Hourly    Male        2              0
##  6           8 Engineering Salaried  Female      5              1
##  7          10 Engineering Hourly    Female      3              0
##  8          11 Engineering Hourly    Male        4              1
##  9          12 Engineering Hourly    Female      3              0
## 10          13 Engineering Hourly    Male        3              0
## # ... with 1,460 more rows

Now we are going to test whether one gender is more likely to be a higher performer.

chisq.test(performance$gender, performance$high_performer)%>% 
  tidy()
## # A tibble: 1 x 4
##   statistic    p.value parameter method                                    
##       <dbl>      <dbl>     <int> <chr>                                     
## 1      22.2 0.00000242         1 Pearson's Chi-squared test with Yates' co~

Since the p-value is less than 0.05 difference between performers is significant.

###Visualizing difference

performance %>%
  ggplot(aes(gender, fill = factor(high_performer))) +
  geom_bar(position = "fill")  

From graph we can see that males are higher performers.

To create a more detailed graph we can use ratings variable.

performance %>%
  ggplot(aes(gender, fill = factor(rating))) +
  geom_bar(position = "fill")

The visualizations match what we found in the statistical test. However we need to check for omitted variable bias. Employees at higher job levels in the organization are more likely to be considered high performers.

First, we are going to check the difference in job level distribution by gender.

performance %>%
  ggplot(aes(x = gender, fill = job_level)) +
  geom_bar(position = "fill")

chisq.test(performance$gender, performance$job_level)
## 
##  Pearson's Chi-squared test
## 
## data:  performance$gender and performance$job_level
## X-squared = 44.506, df = 2, p-value = 2.166e-10

Finally we are going to visualize

performance %>% 
  ggplot(aes(x = gender, fill = factor(high_performer))) +
  geom_bar(position = "fill") +
  facet_wrap(~ job_level)

Logistic regression fits a binary dependent variable. We’ll need to use logistic regression to test whether men are statistically more likely to be labeled high performers in this dataset, even when job level is taken into account.

We are going to test if women are less likely to be labeled a high performer, even when taking differences in job level into account.

logistic_multiple <- glm(high_performer ~ gender + job_level, family = "binomial", data = performance)
logistic_multiple%>%
  tidy()
## # A tibble: 4 x 5
##   term              estimate std.error statistic  p.value
##   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)         -1.69      0.103    -16.5  2.74e-61
## 2 genderMale           0.319     0.129      2.47 1.34e- 2
## 3 job_levelManager     2.74      0.251     10.9  1.01e-27
## 4 job_levelSalaried    1.10      0.141      7.82 5.17e-15

Since the p-value is less than 0.05 we can say that difference is statistically significant.

We can conclude that even when differences in job levels are taken into account, women are less likely to be labeled high performers than men in this dataset.

##Improving employee safety with data

Let’s say that we know that workplace accidents have increased this past year at the production sites. We have to find out if that’s true, and if it is, to look into what might be driving the increase.

###Importing and joining datasets

hr_data <- read_csv("hr_2.csv")
## Parsed with column specification:
## cols(
##   year = col_double(),
##   employee_id = col_double(),
##   location = col_character(),
##   overtime_hours = col_double()
## )
accident_data <- read_csv("accident_data.csv")
## Parsed with column specification:
## cols(
##   year = col_double(),
##   employee_id = col_double(),
##   accident_type = col_character()
## )
head(hr_data)
## # A tibble: 6 x 4
##    year employee_id location    overtime_hours
##   <dbl>       <dbl> <chr>                <dbl>
## 1  2016           1 Northwood               14
## 2  2017           1 Northwood                8
## 3  2016           2 East Valley              8
## 4  2017           2 East Valley             11
## 5  2016           4 East Valley              4
## 6  2017           4 East Valley              2
head(accident_data)
## # A tibble: 6 x 3
##    year employee_id accident_type
##   <dbl>       <dbl> <chr>        
## 1  2017           1 Mild         
## 2  2017           4 Mild         
## 3  2017          11 Mild         
## 4  2017          19 Mild         
## 5  2017          22 Mild         
## 6  2016          23 Mild
hr_joined <- left_join(hr_data, accident_data, by = c("employee_id", "year")) 
head(hr_joined)
## # A tibble: 6 x 5
##    year employee_id location    overtime_hours accident_type
##   <dbl>       <dbl> <chr>                <dbl> <chr>        
## 1  2016           1 Northwood               14 <NA>         
## 2  2017           1 Northwood                8 Mild         
## 3  2016           2 East Valley              8 <NA>         
## 4  2017           2 East Valley             11 <NA>         
## 5  2016           4 East Valley              4 <NA>         
## 6  2017           4 East Valley              2 Mild

Since many employee didn’t have accident these values are represented as NAs. To distinguish these values we will mark NAs with 0, and any other with 1.

hr_joined <- hr_joined%>%
  mutate(had_accident = ifelse(is.na(accident_type), 0, 1))

We will begin with answering on question if the accident rate increase from 2016 to 2017?

hr_joined %>% 
  group_by(year) %>% 
  summarize(accident_rate = mean(had_accident))
## # A tibble: 2 x 2
##    year accident_rate
##   <dbl>         <dbl>
## 1  2016        0.0850
## 2  2017        0.120

We will use chisq.test to test difference in accident rate between years.

chisq.test(hr_joined$year, hr_joined$had_accident)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  hr_joined$year and hr_joined$had_accident
## X-squared = 9.5986, df = 1, p-value = 0.001947

Test proves that accidents has significantly increase in 2017 at the 0.05 level.

We can find which location had the highest rate of accidents.

hr_joined %>% 
  group_by(location) %>%  
  summarize(accident_rate = mean(had_accident)) %>% 
  arrange(desc(accident_rate))
## # A tibble: 4 x 2
##   location    accident_rate
##   <chr>               <dbl>
## 1 East Valley        0.128 
## 2 Southfield         0.103 
## 3 West River         0.0961
## 4 Northwood          0.0831

Now we will focus on location where the accident rate increased most from last year.

accident_rates <- hr_joined %>% 
  group_by(location, year) %>% 
  summarize(accident_rate = mean(had_accident))
accident_rates
## # A tibble: 8 x 3
## # Groups:   location [4]
##   location     year accident_rate
##   <chr>       <dbl>         <dbl>
## 1 East Valley  2016        0.113 
## 2 East Valley  2017        0.143 
## 3 Northwood    2016        0.0644
## 4 Northwood    2017        0.102 
## 5 Southfield   2016        0.0764
## 6 Southfield   2017        0.130 
## 7 West River   2016        0.0824
## 8 West River   2017        0.110
accident_rates %>% 
  ggplot(aes(factor(year), accident_rate)) +
  geom_col() +
  facet_wrap(~location)

We can observe that Southfield had the biggest increase of accidents.

Because Southfield was the location where the accident increased the most from 2016 to 2017, we should investigate what else changed there.

We also have overtime column, and the first thing on our mind can be overworked employees are more likely to make mistakes that lead to accidents. Therefore, we can compare average overtime hours worked in each year.

southfield <- hr_joined %>% 
  filter(location == "Southfield")

southfield %>%
  group_by(year) %>% 
  summarize(average_overtime_hours = mean(overtime_hours))
## # A tibble: 2 x 2
##    year average_overtime_hours
##   <dbl>                  <dbl>
## 1  2016                   8.67
## 2  2017                   9.97

We will use t-test to test whether the average of overtime hours worked was higher in either year.

t.test(overtime_hours ~ year, data = southfield)
## 
##  Welch Two Sample t-test
## 
## data:  overtime_hours by year
## t = -1.6043, df = 595.46, p-value = 0.1092
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.904043  0.292747
## sample estimates:
## mean in group 2016 mean in group 2017 
##           8.667774           9.973422

However difference in overtime hours is not significant at 0.05 level.

Since overtime hours didn’t have a significant change between the years, we should see what other variables we can check. We can check details about engagement of each employee. Engagement is stored in different dataset so we will join these two.

survey_data <- read_csv("survey_2.csv")
## Parsed with column specification:
## cols(
##   year = col_double(),
##   employee_id = col_double(),
##   engagement = col_double()
## )
safety <- left_join(hr_joined, survey_data, by = c("year", "employee_id")) %>% 
  mutate(disengaged = ifelse(engagement <= 2, 1, 0), 
         year = factor(year))

safety_southfield <- safety %>% 
  filter(location == "Southfield")

We can see if employee engagement could be an omitted variable that would explain the difference in the annual accident rates.

We can test whether employee disengagement changed at the same time that the accident rate changed.

safety_southfield %>% 
  ggplot(aes(x = year, fill = factor(disengaged))) +
  geom_bar(position = "fill")

chisq.test(safety_southfield$year, safety_southfield$disengaged)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  safety_southfield$year and safety_southfield$disengaged
## X-squared = 7.1906, df = 1, p-value = 0.007329

The result is significant at 0.05 level.

We’ve found a difference in disengagement between the two years at Southfield, and no difference in overtime hours worked.

We’ll check the other locations too, to make sure those locations didn’t also have a big change in disengagement. If the disengagement rate increased at the other locations where the accident rate did not change much, we’ll have to doubt whether disengagement and accident rates are connected.

other_locs <- safety %>% 
  filter(location != "Southfield")
t.test(overtime_hours ~ year, data = other_locs) 
## 
##  Welch Two Sample t-test
## 
## data:  overtime_hours by year
## t = -0.48267, df = 2320.3, p-value = 0.6294
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9961022  0.6026035
## sample estimates:
## mean in group 2016 mean in group 2017 
##           9.278015           9.474765
chisq.test(other_locs$year, other_locs$disengaged)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  other_locs$year and other_locs$disengaged
## X-squared = 0.0023091, df = 1, p-value = 0.9617

We can see that neither disangegement nor overtime hours changed significantly over years in other locations. We will test connection between disengagement and accident rates.

We will use multiple regression to answer the executive’s question: why did the accident rate increase?

Is there a variable in our dataset that, when added into a multiple regression, can explain the difference in accident rates between the two years?

Here, multiple regression can be used to test the impact of year and disengaged on accident rate in Southfield.

regression <- glm(had_accident ~ year + disengaged, family = "binomial", data = safety_southfield)
tidy(regression)
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -2.92      0.250    -11.7  1.74e-31
## 2 year2017       0.440     0.285      1.55 1.22e- 1
## 3 disengaged     1.44      0.278      5.19 2.13e- 7

Employee disengagement can explain the difference in accident rates between years, but also more analysis is necessary.