Introduction

The purpose of this analysis is to examine why employees leave an organization by analyzing categorical and quantitative variables and their relationship with employee attrition. Attrition reduces the size of the staff without any action by management, and it can leave gaps in companies and organizations. Some of these gaps are a real challenge: they can amount to a steady, hard-to-control reduction of the workforce driven by factors such as retirement, relocation, salary, and work-life balance. Determining the underlying causes of employee departures can therefore help businesses build the systems and recruitment strategies needed to lower the attrition rate.

HR_EmployeeAttrition is a data set of 1470 observations and 27 variables: 8 character, 18 numeric, and 1 logical (Attrition). The objective of this project is to perform an exploratory data analysis of the data set and of how its variables relate to employee attrition. We begin by looking at how the data set is organized, using functions such as head(), tail(), dim(), str(), and glimpse(). We then clean the data set by converting all variable names to lowercase and checking for missing values with anyNA().

Additionally, we use visualizations such as boxplots and histograms to see how some of the variables relate to one another and to attrition.

Load libraries

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(readr)
library(dplyr)

Import the data set

getwd()

## [1] "C:/Users/Mitcheyla$/Desktop/DATA110 -VISUALISATION"

setwd("C:/Users/Mitcheyla$/Desktop/DATA110 -VISUALISATION")

HR_EmployeeAttrition <- read_csv('HR_EmployeeAttrition.csv')

## Rows: 1470 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): BusinessTravel, Department, EducationField, Gender, JobRole, Marit...
## dbl (18): Resear, DistanceFromHomejobrole, Education, EmployeeCount, Environ...
## lgl  (1): Attrition
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explore the data

Let's look at the first and last 10 rows of the data

head(HR_EmployeeAttrition, 10)

## # A tibble: 10 × 27
##    Resear Attri…¹ Busin…² Depar…³ Dista…⁴ Educa…⁵ Educa…⁶ Emplo…⁷ Envir…⁸ Gender
##     <dbl> <lgl>   <chr>   <chr>     <dbl>   <dbl> <chr>     <dbl>   <dbl> <chr> 
##  1     41 TRUE    Travel… Sales         1       2 Life S…       1       2 Female
##  2     49 FALSE   Travel… Resear…       8       1 Life S…       1       3 Male  
##  3     37 TRUE    Travel… Resear…       2       2 Other         1       4 Male  
##  4     33 FALSE   Travel… Resear…       3       4 Life S…       1       4 Female
##  5     27 FALSE   Travel… Resear…       2       1 Medical       1       1 Male  
##  6     32 FALSE   Travel… Resear…       2       2 Life S…       1       4 Male  
##  7     59 FALSE   Travel… Resear…       3       3 Medical       1       3 Female
##  8     30 FALSE   Travel… Resear…      24       1 Life S…       1       4 Male  
##  9     38 FALSE   Travel… Resear…      23       3 Life S…       1       4 Male  
## 10     36 FALSE   Travel… Resear…      27       3 Medical       1       3 Male  
## # … with 17 more variables: JobInvolvement <dbl>, JobLevel <dbl>,
## #   JobRole <chr>, JobSatisfaction <dbl>, MaritalStatus <chr>,
## #   MonthlyIncome <dbl>, NumCompaniesWorked <dbl>, Over18 <chr>,
## #   OverTime <chr>, PercentSalaryHike <dbl>, PerformanceRating <dbl>,
## #   RelationshipSatisfaction <dbl>, StandardHours <dbl>,
## #   StockOptionLevel <dbl>, TotalWorkingYears <dbl>,
## #   TrainingTimesLastYear <dbl>, WorkLifeBalance <dbl>, and abbreviated …

tail(HR_EmployeeAttrition,10)

## # A tibble: 10 × 27
##    Resear Attri…¹ Busin…² Depar…³ Dista…⁴ Educa…⁵ Educa…⁶ Emplo…⁷ Envir…⁸ Gender
##     <dbl> <lgl>   <chr>   <chr>     <dbl>   <dbl> <chr>     <dbl>   <dbl> <chr> 
##  1     29 FALSE   Travel… Resear…      28       4 Medical       1       4 Female
##  2     50 TRUE    Travel… Sales        28       3 Market…       1       4 Male  
##  3     39 FALSE   Travel… Sales        24       1 Market…       1       2 Female
##  4     31 FALSE   Non-Tr… Resear…       5       3 Medical       1       2 Male  
##  5     26 FALSE   Travel… Sales         5       3 Other         1       4 Female
##  6     36 FALSE   Travel… Resear…      23       2 Medical       1       3 Male  
##  7     39 FALSE   Travel… Resear…       6       1 Medical       1       4 Male  
##  8     27 FALSE   Travel… Resear…       4       3 Life S…       1       2 Male  
##  9     49 FALSE   Travel… Sales         2       3 Medical       1       4 Male  
## 10     34 FALSE   Travel… Resear…       8       3 Medical       1       2 Male  
## # … with 17 more variables: JobInvolvement <dbl>, JobLevel <dbl>,
## #   JobRole <chr>, JobSatisfaction <dbl>, MaritalStatus <chr>,
## #   MonthlyIncome <dbl>, NumCompaniesWorked <dbl>, Over18 <chr>,
## #   OverTime <chr>, PercentSalaryHike <dbl>, PerformanceRating <dbl>,
## #   RelationshipSatisfaction <dbl>, StandardHours <dbl>,
## #   StockOptionLevel <dbl>, TotalWorkingYears <dbl>,
## #   TrainingTimesLastYear <dbl>, WorkLifeBalance <dbl>, and abbreviated …

Dimension and structure of the data

dim(HR_EmployeeAttrition)

## [1] 1470   27

There are 1470 rows and 27 variables.

str(HR_EmployeeAttrition)

## spec_tbl_df [1,470 × 27] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Resear                  : num [1:1470] 41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : logi [1:1470] TRUE FALSE TRUE FALSE FALSE FALSE ...
##  $ BusinessTravel          : chr [1:1470] "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
##  $ Department              : chr [1:1470] "Sales" "Research & Development" "Research & Development" "Research & Development" ...
##  $ DistanceFromHomejobrole : num [1:1470] 1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : num [1:1470] 2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : chr [1:1470] "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
##  $ EmployeeCount           : num [1:1470] 1 1 1 1 1 1 1 1 1 1 ...
##  $ EnvironmentSatisfaction : num [1:1470] 2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : chr [1:1470] "Female" "Male" "Male" "Female" ...
##  $ JobInvolvement          : num [1:1470] 3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : num [1:1470] 2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : chr [1:1470] "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
##  $ JobSatisfaction         : num [1:1470] 4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : chr [1:1470] "Single" "Married" "Single" "Married" ...
##  $ MonthlyIncome           : num [1:1470] 5993 5130 2090 2909 3468 ...
##  $ NumCompaniesWorked      : num [1:1470] 8 1 6 1 9 0 4 1 0 6 ...
##  $ Over18                  : chr [1:1470] "Y" "Y" "Y" "Y" ...
##  $ OverTime                : chr [1:1470] "Yes" "No" "Yes" "Yes" ...
##  $ PercentSalaryHike       : num [1:1470] 11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : num [1:1470] 3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: num [1:1470] 1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : num [1:1470] 80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : num [1:1470] 0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : num [1:1470] 8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : num [1:1470] 0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : num [1:1470] 1 3 3 3 3 2 2 3 3 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Resear = col_double(),
##   ..   Attrition = col_logical(),
##   ..   BusinessTravel = col_character(),
##   ..   Department = col_character(),
##   ..   DistanceFromHomejobrole = col_double(),
##   ..   Education = col_double(),
##   ..   EducationField = col_character(),
##   ..   EmployeeCount = col_double(),
##   ..   EnvironmentSatisfaction = col_double(),
##   ..   Gender = col_character(),
##   ..   JobInvolvement = col_double(),
##   ..   JobLevel = col_double(),
##   ..   JobRole = col_character(),
##   ..   JobSatisfaction = col_double(),
##   ..   MaritalStatus = col_character(),
##   ..   MonthlyIncome = col_double(),
##   ..   NumCompaniesWorked = col_double(),
##   ..   Over18 = col_character(),
##   ..   OverTime = col_character(),
##   ..   PercentSalaryHike = col_double(),
##   ..   PerformanceRating = col_double(),
##   ..   RelationshipSatisfaction = col_double(),
##   ..   StandardHours = col_double(),
##   ..   StockOptionLevel = col_double(),
##   ..   TotalWorkingYears = col_double(),
##   ..   TrainingTimesLastYear = col_double(),
##   ..   WorkLifeBalance = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

glimpse(HR_EmployeeAttrition)

## Rows: 1,470
## Columns: 27
## $ Resear                   <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition                <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE…
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ Department               <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHomejobrole  <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …

Clean up the data

Check for missing values, then summarize the data to confirm that none are present

anyNA(HR_EmployeeAttrition)

## [1] FALSE
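anyNA() only reports whether at least one value is missing anywhere in the data. A minimal sketch of a per-column count follows, using a small made-up data frame (toy_hr and its values are hypothetical and are not taken from HR_EmployeeAttrition):

```r
# Toy data frame with two deliberately missing values (hypothetical values).
toy_hr <- data.frame(
  monthlyincome = c(5993, NA, 2090, 2909),
  department    = c("Sales", "Sales", NA, "Human Resources"),
  joblevel      = c(2, 2, 1, 1)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
na_per_column <- colSums(is.na(toy_hr))
print(na_per_column)

# anyNA() collapses the same information into a single TRUE/FALSE.
print(anyNA(toy_hr))
```

Since anyNA() returns FALSE for the real data, the same per-column count on HR_EmployeeAttrition would be zero for every column.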

summary(HR_EmployeeAttrition)

##      Resear      Attrition       BusinessTravel      Department       
##  Min.   :18.00   Mode :logical   Length:1470        Length:1470       
##  1st Qu.:30.00   FALSE:1233      Class :character   Class :character  
##  Median :36.00   TRUE :237       Mode  :character   Mode  :character  
##  Mean   :36.92                                                        
##  3rd Qu.:43.00                                                        
##  Max.   :60.00                                                        
##  DistanceFromHomejobrole   Education     EducationField     EmployeeCount
##  Min.   : 1.000          Min.   :1.000   Length:1470        Min.   :1    
##  1st Qu.: 2.000          1st Qu.:2.000   Class :character   1st Qu.:1    
##  Median : 7.000          Median :3.000   Mode  :character   Median :1    
##  Mean   : 9.193          Mean   :2.913                      Mean   :1    
##  3rd Qu.:14.000          3rd Qu.:4.000                      3rd Qu.:1    
##  Max.   :29.000          Max.   :5.000                      Max.   :1    
##  EnvironmentSatisfaction    Gender          JobInvolvement    JobLevel    
##  Min.   :1.000           Length:1470        Min.   :1.00   Min.   :1.000  
##  1st Qu.:2.000           Class :character   1st Qu.:2.00   1st Qu.:1.000  
##  Median :3.000           Mode  :character   Median :3.00   Median :2.000  
##  Mean   :2.722                              Mean   :2.73   Mean   :2.064  
##  3rd Qu.:4.000                              3rd Qu.:3.00   3rd Qu.:3.000  
##  Max.   :4.000                              Max.   :4.00   Max.   :5.000  
##    JobRole          JobSatisfaction MaritalStatus      MonthlyIncome  
##  Length:1470        Min.   :1.000   Length:1470        Min.   : 1009  
##  Class :character   1st Qu.:2.000   Class :character   1st Qu.: 2911  
##  Mode  :character   Median :3.000   Mode  :character   Median : 4919  
##                     Mean   :2.729                      Mean   : 6503  
##                     3rd Qu.:4.000                      3rd Qu.: 8379  
##                     Max.   :4.000                      Max.   :19999  
##  NumCompaniesWorked    Over18            OverTime         PercentSalaryHike
##  Min.   :0.000      Length:1470        Length:1470        Min.   :11.00    
##  1st Qu.:1.000      Class :character   Class :character   1st Qu.:12.00    
##  Median :2.000      Mode  :character   Mode  :character   Median :14.00    
##  Mean   :2.693                                            Mean   :15.21    
##  3rd Qu.:4.000                                            3rd Qu.:18.00    
##  Max.   :9.000                                            Max.   :25.00    
##  PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
##  Min.   :3.000     Min.   :1.000            Min.   :80    Min.   :0.0000  
##  1st Qu.:3.000     1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000  
##  Median :3.000     Median :3.000            Median :80    Median :1.0000  
##  Mean   :3.154     Mean   :2.712            Mean   :80    Mean   :0.7939  
##  3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000  
##  Max.   :4.000     Max.   :4.000            Max.   :80    Max.   :3.0000  
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
##  Min.   : 0.00     Min.   :0.000         Min.   :1.000  
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000  
##  Median :10.00     Median :3.000         Median :3.000  
##  Mean   :11.28     Mean   :2.799         Mean   :2.761  
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000  
##  Max.   :40.00     Max.   :6.000         Max.   :4.000

names(HR_EmployeeAttrition) <- tolower(names(HR_EmployeeAttrition))
str(HR_EmployeeAttrition)

## spec_tbl_df [1,470 × 27] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ resear                  : num [1:1470] 41 49 37 33 27 32 59 30 38 36 ...
##  $ attrition               : logi [1:1470] TRUE FALSE TRUE FALSE FALSE FALSE ...
##  $ businesstravel          : chr [1:1470] "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
##  $ department              : chr [1:1470] "Sales" "Research & Development" "Research & Development" "Research & Development" ...
##  $ distancefromhomejobrole : num [1:1470] 1 8 2 3 2 2 3 24 23 27 ...
##  $ education               : num [1:1470] 2 1 2 4 1 2 3 1 3 3 ...
##  $ educationfield          : chr [1:1470] "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
##  $ employeecount           : num [1:1470] 1 1 1 1 1 1 1 1 1 1 ...
##  $ environmentsatisfaction : num [1:1470] 2 3 4 4 1 4 3 4 4 3 ...
##  $ gender                  : chr [1:1470] "Female" "Male" "Male" "Female" ...
##  $ jobinvolvement          : num [1:1470] 3 2 2 3 3 3 4 3 2 3 ...
##  $ joblevel                : num [1:1470] 2 2 1 1 1 1 1 1 3 2 ...
##  $ jobrole                 : chr [1:1470] "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
##  $ jobsatisfaction         : num [1:1470] 4 2 3 3 2 4 1 3 3 3 ...
##  $ maritalstatus           : chr [1:1470] "Single" "Married" "Single" "Married" ...
##  $ monthlyincome           : num [1:1470] 5993 5130 2090 2909 3468 ...
##  $ numcompaniesworked      : num [1:1470] 8 1 6 1 9 0 4 1 0 6 ...
##  $ over18                  : chr [1:1470] "Y" "Y" "Y" "Y" ...
##  $ overtime                : chr [1:1470] "Yes" "No" "Yes" "Yes" ...
##  $ percentsalaryhike       : num [1:1470] 11 23 15 11 12 13 20 22 21 13 ...
##  $ performancerating       : num [1:1470] 3 4 3 3 3 3 4 4 4 3 ...
##  $ relationshipsatisfaction: num [1:1470] 1 4 2 3 4 3 1 2 2 2 ...
##  $ standardhours           : num [1:1470] 80 80 80 80 80 80 80 80 80 80 ...
##  $ stockoptionlevel        : num [1:1470] 0 1 0 0 1 0 3 1 0 2 ...
##  $ totalworkingyears       : num [1:1470] 8 10 7 8 6 8 12 1 10 17 ...
##  $ trainingtimeslastyear   : num [1:1470] 0 3 3 3 3 2 3 2 2 3 ...
##  $ worklifebalance         : num [1:1470] 1 3 3 3 3 2 2 3 3 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Resear = col_double(),
##   ..   Attrition = col_logical(),
##   ..   BusinessTravel = col_character(),
##   ..   Department = col_character(),
##   ..   DistanceFromHomejobrole = col_double(),
##   ..   Education = col_double(),
##   ..   EducationField = col_character(),
##   ..   EmployeeCount = col_double(),
##   ..   EnvironmentSatisfaction = col_double(),
##   ..   Gender = col_character(),
##   ..   JobInvolvement = col_double(),
##   ..   JobLevel = col_double(),
##   ..   JobRole = col_character(),
##   ..   JobSatisfaction = col_double(),
##   ..   MaritalStatus = col_character(),
##   ..   MonthlyIncome = col_double(),
##   ..   NumCompaniesWorked = col_double(),
##   ..   Over18 = col_character(),
##   ..   OverTime = col_character(),
##   ..   PercentSalaryHike = col_double(),
##   ..   PerformanceRating = col_double(),
##   ..   RelationshipSatisfaction = col_double(),
##   ..   StandardHours = col_double(),
##   ..   StockOptionLevel = col_double(),
##   ..   TotalWorkingYears = col_double(),
##   ..   TrainingTimesLastYear = col_double(),
##   ..   WorkLifeBalance = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

First visualization: the number of employees who left and who stayed

library(wesanderson)

View the employee attrition rate in the data set

ggplot(HR_EmployeeAttrition, aes(x = attrition)) +
  geom_bar(position = "stack", fill = wes_palette("Cavalcanti1", n = 2))+
  theme_minimal() +
  labs(x = "Attrition", 
       y = "Count",
       title = "Employee Attrition",
       caption = "Source: HR_EmployeeAttrition")

HR_EmployeeAttrition %>%
  group_by(attrition) %>%
  summarise(n = n())

## # A tibble: 2 × 2
##   attrition     n
##   <lgl>     <int>
## 1 FALSE      1233
## 2 TRUE        237

We can see that 237 employees left their jobs and 1233 stayed. In other words, about 16.12% of the 1470 employees in the data set left.
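The percentage can be reproduced from the two counts in the table above. A small sketch in base R (attrition_counts is a helper name introduced here for illustration):

```r
# Counts taken from the summary table above.
attrition_counts <- c(stayed = 1233, left = 237)

# prop.table() divides each count by the total (1470 employees).
attrition_pct <- round(100 * prop.table(attrition_counts), 2)
print(attrition_pct)
```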

HR_EmployeeAttrition %>%
  group_by(attrition, department) %>%
  summarise(n = n())

## `summarise()` has grouped output by 'attrition'. You can override using the
## `.groups` argument.

## # A tibble: 6 × 3
## # Groups:   attrition [2]
##   attrition department                 n
##   <lgl>     <chr>                  <int>
## 1 FALSE     Human Resources           51
## 2 FALSE     Research & Development   828
## 3 FALSE     Sales                    354
## 4 TRUE      Human Resources           12
## 5 TRUE      Research & Development   133
## 6 TRUE      Sales                     92

Load library ggplot2

library(ggplot2)

Visualization

plot2 <- HR_EmployeeAttrition%>%
ggplot() +
  geom_bar(aes(y =..count.., x= (department), fill =(attrition)),  position = position_stack()) +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
plot2

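This visualization shows that the 237 employees who left came from all three departments: Human Resources, Research & Development, and Sales. Research & Development has the largest number of departures (133), but it is also by far the largest department. As a share of each department's staff, attrition is lower in Research & Development (133 of 961, about 13.8%) than in Sales (92 of 446, about 20.6%) or Human Resources (12 of 63, about 19.0%).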
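Counts and rates can tell different stories, so it helps to look at both. A minimal sketch that converts the counts from the department table above into attrition rates (dept_counts is a helper name introduced here for illustration):

```r
# Counts copied from the attrition-by-department table above.
dept_counts <- data.frame(
  department = c("Human Resources", "Research & Development", "Sales"),
  stayed     = c(51, 828, 354),
  left       = c(12, 133, 92)
)

# Attrition rate = employees who left / all employees in the department.
dept_counts$total <- dept_counts$stayed + dept_counts$left
dept_counts$attrition_pct <- round(100 * dept_counts$left / dept_counts$total, 1)
print(dept_counts)
```

The totals (63, 961, and 446) add up to the 1470 rows of the data set, which is a quick check that no group was dropped.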

Take a deeper look at the reasons those employees left their jobs

First, I will use group_by() and summarise() to find the minimum, average, and maximum monthly income of the employees who left and of those who stayed. Then I will plot monthly income for the three departments to see the relationship between monthly income and employee attrition.

HR_EmployeeAttrition %>%
  group_by(attrition) %>%
  summarise(n_employees = n(),
            min_monthlyincome = min(monthlyincome),
            avg_monthly = mean(monthlyincome),
            max_monthlyincome = max(monthlyincome),
            sd_monthlyincome = sd(monthlyincome),
            # share earning at most $5,000 per month (about $60k per year)
            pct_less_60k = mean(monthlyincome <= 5000))

## # A tibble: 2 × 7
##   attrition n_employees min_monthlyincome avg_monthly max_mont…¹ sd_mo…² pct_l…³
##   <lgl>           <int>             <dbl>       <dbl>      <dbl>   <dbl>   <dbl>
## 1 FALSE            1233              1051       6833.      19999   4818.   0.475
## 2 TRUE              237              1009       4787.      19859   3640.   0.688
## # … with abbreviated variable names ¹max_monthlyincome, ²sd_monthlyincome,
## #   ³pct_less_60k

Visualization to Show Relationship with Monthly Income, Department, and Attrition

ggplot(data = HR_EmployeeAttrition, aes(x = monthlyincome, fill = department)) +
   geom_histogram(aes(y = ..count..), color = "Black", bins = 20) +
   facet_wrap(~ attrition, nrow = 2) +
   labs(title = "Monthly Income Distribution by Department, Faceted by Attrition (TRUE/FALSE)",
           x = "Monthly Income (US Dollars)", y = "Number of Employees")

In this visualization, we can see that in all three departments most of the employees who left had a monthly income below about $5,000. Relatively few of the employees who left earned between $5,000 and $20,000.

Load libraries

library(ggplot2)
library(ggfortify)
library(htmltools)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

Visualization of Departments (Human Resources, Research & Development, Sales) and Job Satisfaction

HR_EmployeeAttrition%>%
  ggplot() + geom_boxplot(aes(y=jobsatisfaction, group=department, fill=department)) +
  scale_fill_manual(values=c("Dark Blue","Orange","Gray")) +
  theme(axis.text.y=element_blank()) +                       # Remove the useless y-axis tick values.
  ggtitle("Comparison Between Department and Job Satisfaction ") +
  coord_flip()

In this plot, job satisfaction looks similar across the three departments, with a typical value of about 3, so employees in each department were fairly satisfied with their jobs. Differences in job satisfaction between departments therefore do not appear to explain why attrition differs between them.

Business Travel, Department, and Employee Attrition Plot

Plot4 <- filter(HR_EmployeeAttrition, department == "Human Resources" | department == "Sales" | department == "Research & Development")
ggplot (Plot4, aes(x = attrition, y = businesstravel, color = department)) +
  ylab("Frequency of Business Travel") +
  theme_minimal(base_size = 12) +
  ggtitle("Attrition and Business Travel") +
  geom_jitter() + 
  scale_color_brewer(palette = 'Set1')

From this chart, the Sales employees who left appear to travel more than the other employees who left, even though the Research & Development department has more departures. For the Sales department, business travel might be one of the reasons employees leave.
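A jitter plot makes it hard to compare rates between groups. One way to check the impression is to compute the share of employees who left within each combination of department and travel frequency. The sketch below does this on a small made-up data frame (toy_travel and its values are hypothetical and are not taken from HR_EmployeeAttrition):

```r
# Hypothetical toy data: 8 employees, 2 departments, 2 travel categories.
toy_travel <- data.frame(
  department     = rep(c("Sales", "Research & Development"), each = 4),
  businesstravel = rep(c("Travel_Frequently", "Travel_Frequently",
                         "Travel_Rarely", "Travel_Rarely"), times = 2),
  attrition      = c(TRUE, TRUE, FALSE, TRUE,    # Sales
                     TRUE, FALSE, FALSE, FALSE)  # Research & Development
)

# The mean of a logical vector is the share of TRUE values,
# so this gives the attrition rate in each department/travel cell.
rates <- aggregate(attrition ~ department + businesstravel,
                   data = toy_travel, FUN = mean)
print(rates)
```

Applied to the real data, the same call would show whether frequent travelers in Sales actually leave at a higher rate than other groups.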

Select new variables in the data set

HR_Att <- HR_EmployeeAttrition%>%select(attrition, department, distancefromhomejobrole, environmentsatisfaction,  monthlyincome, totalworkingyears, jobrole, joblevel, percentsalaryhike, performancerating, worklifebalance, relationshipsatisfaction, maritalstatus)%>%
filter(department == "Human Resources" | department == "Research & Development" | department == "Sales")

Visualization to see whether monthly income is related to total working years

ggplot(data = HR_Att, aes(x = totalworkingyears, y = monthlyincome, color = attrition)) +
  geom_point(position = position_jitterdodge(), alpha = 1) +
  labs(title = "Monthly Income by Total Working Years and Attrition",
       x = "Total Working Years", y = "Monthly Income") +
  theme(plot.title = element_text(size = 12),
        axis.text.x = element_text(size = 12))

In this visualization, most employees who left have between 0 and 10 total working years and a monthly income below about $5,000. Employees with higher salaries appear more likely to stay.
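A correlation coefficient gives a single number for how closely two quantitative variables move together. A minimal sketch on made-up values (toy_pay and its numbers are hypothetical and are not taken from HR_EmployeeAttrition):

```r
# Hypothetical toy values: income rises with total working years.
toy_pay <- data.frame(
  totalworkingyears = c(1, 3, 5, 8, 10, 15, 20, 25),
  monthlyincome     = c(2000, 2500, 3000, 5000, 5500, 9000, 12000, 16000)
)

# Pearson measures linear association; Spearman measures rank (monotone) association.
r_pearson  <- cor(toy_pay$totalworkingyears, toy_pay$monthlyincome, method = "pearson")
r_spearman <- cor(toy_pay$totalworkingyears, toy_pay$monthlyincome, method = "spearman")

print(round(c(pearson = r_pearson, spearman = r_spearman), 3))
```

On the real data, cor(HR_Att$totalworkingyears, HR_Att$monthlyincome) would give the corresponding value for the scatter plot above.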

Use t-tests to see whether the mean of each variable differs between employees who left and employees who stayed

# Welch two-sample t-tests (two-sided, 95% confidence level): does the mean of each variable differ by attrition status?
jobsat_t <- t.test(jobsatisfaction ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
resear_t <- t.test(resear ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
wrklifebal_t <- t.test(worklifebalance ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
workyrs_t <- t.test(totalworkingyears ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
numcompw_t <- t.test(numcompaniesworked ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
perc_t <- t.test(percentsalaryhike ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
stock_t <- t.test(stockoptionlevel ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
envsat_t <- t.test(environmentsatisfaction ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)


p_values <- tribble(
  ~name, ~p.value,
  "Job Satisfaction", jobsat_t$p.value,
  "Resear", resear_t$p.value,
  "Work Life Balance", wrklifebal_t$p.value,
  "Total Working Years", workyrs_t$p.value,
  "Number of Companies Worked", numcompw_t$p.value,
  "Stock Option Level", stock_t$p.value,
  "Environment Satisfaction", envsat_t$p.value,
  "Percent Salary Hike", perc_t$p.value
)

knitr::kable(p_values)

name	p.value
Job Satisfaction	0.0001052
Resear	0.0000000
Work Life Balance	0.0304657
Total Working Years	0.0000000
Number of Companies Worked	0.1163340
Stock Option Level	0.0000003
Environment Satisfaction	0.0002092
Percent Salary Hike	0.6144301

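The t-tests show that the means of job satisfaction, resear, work-life balance, total working years, stock option level, and environment satisfaction differ significantly between employees who left and employees who stayed (p < 0.05). A p-value alone does not give the direction of a difference; for that, the group means have to be compared. The p-values for number of companies worked (about 0.12) and percent salary hike (about 0.61) are not significant, so for these two variables we cannot claim any effect on employee attrition.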
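The direction of a difference can be read from the group means that t.test() returns next to the p-value. A minimal sketch on made-up values (toy_years and its numbers are hypothetical and are not taken from HR_EmployeeAttrition):

```r
# Hypothetical toy data: 8 employees who stayed and 8 who left.
toy_years <- data.frame(
  attrition         = rep(c(FALSE, TRUE), each = 8),
  totalworkingyears = c(10, 12, 15, 8, 20, 11, 14, 9,   # stayed
                        2, 5, 3, 7, 4, 6, 1, 8)         # left
)

# Welch two-sample t-test (var.equal = FALSE is the default).
toy_t <- t.test(totalworkingyears ~ attrition, data = toy_years)

# estimate holds the mean of each group: first FALSE (stayed), then TRUE (left).
group_means <- unname(toy_t$estimate)
print(group_means)
print(toy_t$p.value)

# Positive difference: those who stayed have more working years on average.
print(group_means[1] - group_means[2])
```

The same fields (estimate and p.value) are available on each of the test objects created above, for example workyrs_t$estimate.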

ESSAY

Many organizations struggle to recruit talent, and recruitment costs money and time. It is possible to reduce training and recruitment costs by addressing the employee turnover problem. Therefore, my goal is to explore and analyze the data set to better understand why employees leave their jobs.

By analyzing the data, I found that there are many reasons employees quit their jobs. One of them is the commute from home to work. Instead of letting those employees quit, some companies could allow them to work from home. As I mentioned, recruitment costs money and time. If those employees are highly qualified, both parties (employers and employees) can benefit from working from home: employers save money, and employees save on gas and food while working in a safe environment (home).

Another aspect of the analysis that surprised me is how unlikely employees with more than 10 working years are to quit. The data suggest that the more years employees have worked, the less willing they are to leave, regardless of occupation. After 10 years of work, almost no employee leaves. Also, when comparing job satisfaction, employees in the three departments were fairly satisfied with their jobs. The one variable that those who quit seem to have in common is a lower monthly income.

Finally, there are many things that I wish I could have included in this analysis. For example, faceted graphs across more variables could explore the causes of employee attrition in more depth. I would also like to fit a multiple regression model that takes several variables as inputs and tries to predict employee attrition, and to use a correlation plot for visualization and exploration.
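A minimal sketch of what such a model could look like is given below. Logistic regression is one common choice when the outcome is TRUE/FALSE; it is used here as an assumption, not as a method of this project. The data are simulated (sim, p_leave, and the coefficients in the simulation rule are hypothetical), so the output says nothing about HR_EmployeeAttrition itself:

```r
set.seed(110)  # arbitrary seed so the simulation is reproducible

n <- 500
sim <- data.frame(
  monthlyincome     = runif(n, min = 1000, max = 20000),
  totalworkingyears = sample(0:40, n, replace = TRUE)
)

# Simulation rule (an assumption): higher income lowers the chance of leaving.
p_leave <- plogis(0.5 - 0.0003 * sim$monthlyincome)
sim$attrition <- rbinom(n, size = 1, prob = p_leave)  # 1 = left, 0 = stayed

# Logistic regression with two predictors.
fit <- glm(attrition ~ monthlyincome + totalworkingyears,
           data = sim, family = binomial)

# Coefficients are on the log-odds scale; a negative value means lower odds of leaving.
print(round(coef(summary(fit)), 4))
```

With the real data, the simulated data frame would be replaced by HR_Att and the formula extended with the other variables of interest.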

Project1

Sheyla Daccarett

2022-10-17