The purpose of this analysis is to examine why employees leave an organization by analyzing categorical and quantitative variables and their relationship with employee attrition. Attrition reduces the size of the staff without any direct action by management, and it can leave gaps in companies and organizations. Some of these gaps are a real challenge: they appear as a steady and hard-to-control reduction of the workforce driven by factors such as retirement, relocation, salary, and work-life balance. Determining the underlying causes of employee departures can therefore help businesses build the systems and recruitment strategies needed to lower the attrition rate.
HR_EmployeeAttrition is a dataset with 1470 observations and 27 variables: 8 character, 18 numeric, and 1 logical (Attrition). The objective of this project is to perform an exploratory data analysis of the dataset and of how its variables relate to employee attrition. We begin by looking at how the dataset is organized with functions such as head(), tail(), dim(), str(), and glimpse(). We then clean the dataset by checking for missing values with anyNA() and summary(), and by converting all variable names to lowercase.
Additionally, we will use visualizations such as bar charts, histograms, and boxplots to see how some of the variables relate to one another and to attrition.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
library(dplyr)
getwd()
## [1] "C:/Users/Mitcheyla$/Desktop/DATA110 -VISUALISATION"
setwd("C:/Users/Mitcheyla$/Desktop/DATA110 -VISUALISATION")
HR_EmployeeAttrition <- read_csv('HR_EmployeeAttrition.csv')
## Rows: 1470 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): BusinessTravel, Department, EducationField, Gender, JobRole, Marit...
## dbl (18): Resear, DistanceFromHomejobrole, Education, EmployeeCount, Environ...
## lgl (1): Attrition
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let's see the first and last 10 rows of the data.
head(HR_EmployeeAttrition, 10)
## # A tibble: 10 × 27
## Resear Attri…¹ Busin…² Depar…³ Dista…⁴ Educa…⁵ Educa…⁶ Emplo…⁷ Envir…⁸ Gender
## <dbl> <lgl> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 41 TRUE Travel… Sales 1 2 Life S… 1 2 Female
## 2 49 FALSE Travel… Resear… 8 1 Life S… 1 3 Male
## 3 37 TRUE Travel… Resear… 2 2 Other 1 4 Male
## 4 33 FALSE Travel… Resear… 3 4 Life S… 1 4 Female
## 5 27 FALSE Travel… Resear… 2 1 Medical 1 1 Male
## 6 32 FALSE Travel… Resear… 2 2 Life S… 1 4 Male
## 7 59 FALSE Travel… Resear… 3 3 Medical 1 3 Female
## 8 30 FALSE Travel… Resear… 24 1 Life S… 1 4 Male
## 9 38 FALSE Travel… Resear… 23 3 Life S… 1 4 Male
## 10 36 FALSE Travel… Resear… 27 3 Medical 1 3 Male
## # … with 17 more variables: JobInvolvement <dbl>, JobLevel <dbl>,
## # JobRole <chr>, JobSatisfaction <dbl>, MaritalStatus <chr>,
## # MonthlyIncome <dbl>, NumCompaniesWorked <dbl>, Over18 <chr>,
## # OverTime <chr>, PercentSalaryHike <dbl>, PerformanceRating <dbl>,
## # RelationshipSatisfaction <dbl>, StandardHours <dbl>,
## # StockOptionLevel <dbl>, TotalWorkingYears <dbl>,
## # TrainingTimesLastYear <dbl>, WorkLifeBalance <dbl>, and abbreviated …
tail(HR_EmployeeAttrition,10)
## # A tibble: 10 × 27
## Resear Attri…¹ Busin…² Depar…³ Dista…⁴ Educa…⁵ Educa…⁶ Emplo…⁷ Envir…⁸ Gender
## <dbl> <lgl> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 29 FALSE Travel… Resear… 28 4 Medical 1 4 Female
## 2 50 TRUE Travel… Sales 28 3 Market… 1 4 Male
## 3 39 FALSE Travel… Sales 24 1 Market… 1 2 Female
## 4 31 FALSE Non-Tr… Resear… 5 3 Medical 1 2 Male
## 5 26 FALSE Travel… Sales 5 3 Other 1 4 Female
## 6 36 FALSE Travel… Resear… 23 2 Medical 1 3 Male
## 7 39 FALSE Travel… Resear… 6 1 Medical 1 4 Male
## 8 27 FALSE Travel… Resear… 4 3 Life S… 1 2 Male
## 9 49 FALSE Travel… Sales 2 3 Medical 1 4 Male
## 10 34 FALSE Travel… Resear… 8 3 Medical 1 2 Male
## # … with 17 more variables: JobInvolvement <dbl>, JobLevel <dbl>,
## # JobRole <chr>, JobSatisfaction <dbl>, MaritalStatus <chr>,
## # MonthlyIncome <dbl>, NumCompaniesWorked <dbl>, Over18 <chr>,
## # OverTime <chr>, PercentSalaryHike <dbl>, PerformanceRating <dbl>,
## # RelationshipSatisfaction <dbl>, StandardHours <dbl>,
## # StockOptionLevel <dbl>, TotalWorkingYears <dbl>,
## # TrainingTimesLastYear <dbl>, WorkLifeBalance <dbl>, and abbreviated …
dim(HR_EmployeeAttrition)
## [1] 1470 27
There are 1470 rows and 27 variables.
str(HR_EmployeeAttrition)
## spec_tbl_df [1,470 × 27] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Resear : num [1:1470] 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : logi [1:1470] TRUE FALSE TRUE FALSE FALSE FALSE ...
## $ BusinessTravel : chr [1:1470] "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ Department : chr [1:1470] "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ DistanceFromHomejobrole : num [1:1470] 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : num [1:1470] 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr [1:1470] "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EmployeeCount : num [1:1470] 1 1 1 1 1 1 1 1 1 1 ...
## $ EnvironmentSatisfaction : num [1:1470] 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : chr [1:1470] "Female" "Male" "Male" "Female" ...
## $ JobInvolvement : num [1:1470] 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : num [1:1470] 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr [1:1470] "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ JobSatisfaction : num [1:1470] 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr [1:1470] "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : num [1:1470] 5993 5130 2090 2909 3468 ...
## $ NumCompaniesWorked : num [1:1470] 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : chr [1:1470] "Y" "Y" "Y" "Y" ...
## $ OverTime : chr [1:1470] "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : num [1:1470] 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : num [1:1470] 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: num [1:1470] 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : num [1:1470] 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : num [1:1470] 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : num [1:1470] 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : num [1:1470] 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : num [1:1470] 1 3 3 3 3 2 2 3 3 2 ...
## - attr(*, "spec")=
## .. cols(
## .. Resear = col_double(),
## .. Attrition = col_logical(),
## .. BusinessTravel = col_character(),
## .. Department = col_character(),
## .. DistanceFromHomejobrole = col_double(),
## .. Education = col_double(),
## .. EducationField = col_character(),
## .. EmployeeCount = col_double(),
## .. EnvironmentSatisfaction = col_double(),
## .. Gender = col_character(),
## .. JobInvolvement = col_double(),
## .. JobLevel = col_double(),
## .. JobRole = col_character(),
## .. JobSatisfaction = col_double(),
## .. MaritalStatus = col_character(),
## .. MonthlyIncome = col_double(),
## .. NumCompaniesWorked = col_double(),
## .. Over18 = col_character(),
## .. OverTime = col_character(),
## .. PercentSalaryHike = col_double(),
## .. PerformanceRating = col_double(),
## .. RelationshipSatisfaction = col_double(),
## .. StandardHours = col_double(),
## .. StockOptionLevel = col_double(),
## .. TotalWorkingYears = col_double(),
## .. TrainingTimesLastYear = col_double(),
## .. WorkLifeBalance = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
glimpse(HR_EmployeeAttrition)
## Rows: 1,470
## Columns: 27
## $ Resear <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE…
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ Department <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHomejobrole <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EnvironmentSatisfaction <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ JobInvolvement <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
Check for missing values, then summarize the data to confirm that none are present.
anyNA(HR_EmployeeAttrition)
## [1] FALSE
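Had `anyNA()` returned TRUE, a per-column count would show where the gaps are. The following is a minimal sketch on a small made-up data frame; `toy` and its values are illustrative assumptions and are not taken from the dataset.

```r
# Toy data frame with deliberately missing values (illustrative only).
toy <- data.frame(
  monthlyincome = c(5993, NA, 2090, 2909),
  department    = c("Sales", "Sales", NA, NA),
  attrition     = c(TRUE, FALSE, TRUE, FALSE)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the columns that actually contain missing values.
print(na_counts[na_counts > 0])
```

The same two lines could be applied to HR_EmployeeAttrition; with no missing values, every count would be 0.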
summary(HR_EmployeeAttrition)
## Resear Attrition BusinessTravel Department
## Min. :18.00 Mode :logical Length:1470 Length:1470
## 1st Qu.:30.00 FALSE:1233 Class :character Class :character
## Median :36.00 TRUE :237 Mode :character Mode :character
## Mean :36.92
## 3rd Qu.:43.00
## Max. :60.00
## DistanceFromHomejobrole Education EducationField EmployeeCount
## Min. : 1.000 Min. :1.000 Length:1470 Min. :1
## 1st Qu.: 2.000 1st Qu.:2.000 Class :character 1st Qu.:1
## Median : 7.000 Median :3.000 Mode :character Median :1
## Mean : 9.193 Mean :2.913 Mean :1
## 3rd Qu.:14.000 3rd Qu.:4.000 3rd Qu.:1
## Max. :29.000 Max. :5.000 Max. :1
## EnvironmentSatisfaction Gender JobInvolvement JobLevel
## Min. :1.000 Length:1470 Min. :1.00 Min. :1.000
## 1st Qu.:2.000 Class :character 1st Qu.:2.00 1st Qu.:1.000
## Median :3.000 Mode :character Median :3.00 Median :2.000
## Mean :2.722 Mean :2.73 Mean :2.064
## 3rd Qu.:4.000 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :4.000 Max. :4.00 Max. :5.000
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## Length:1470 Min. :1.000 Length:1470 Min. : 1009
## Class :character 1st Qu.:2.000 Class :character 1st Qu.: 2911
## Mode :character Median :3.000 Mode :character Median : 4919
## Mean :2.729 Mean : 6503
## 3rd Qu.:4.000 3rd Qu.: 8379
## Max. :4.000 Max. :19999
## NumCompaniesWorked Over18 OverTime PercentSalaryHike
## Min. :0.000 Length:1470 Length:1470 Min. :11.00
## 1st Qu.:1.000 Class :character Class :character 1st Qu.:12.00
## Median :2.000 Mode :character Mode :character Median :14.00
## Mean :2.693 Mean :15.21
## 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :9.000 Max. :25.00
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## Min. :3.000 Min. :1.000 Min. :80 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000
## Median :3.000 Median :3.000 Median :80 Median :1.0000
## Mean :3.154 Mean :2.712 Mean :80 Mean :0.7939
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000
## Max. :4.000 Max. :4.000 Max. :80 Max. :3.0000
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## Min. : 0.00 Min. :0.000 Min. :1.000
## 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000
## Median :10.00 Median :3.000 Median :3.000
## Mean :11.28 Mean :2.799 Mean :2.761
## 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :40.00 Max. :6.000 Max. :4.000
names(HR_EmployeeAttrition) <- tolower(names(HR_EmployeeAttrition))
str(HR_EmployeeAttrition)
## spec_tbl_df [1,470 × 27] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ resear : num [1:1470] 41 49 37 33 27 32 59 30 38 36 ...
## $ attrition : logi [1:1470] TRUE FALSE TRUE FALSE FALSE FALSE ...
## $ businesstravel : chr [1:1470] "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ department : chr [1:1470] "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ distancefromhomejobrole : num [1:1470] 1 8 2 3 2 2 3 24 23 27 ...
## $ education : num [1:1470] 2 1 2 4 1 2 3 1 3 3 ...
## $ educationfield : chr [1:1470] "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ employeecount : num [1:1470] 1 1 1 1 1 1 1 1 1 1 ...
## $ environmentsatisfaction : num [1:1470] 2 3 4 4 1 4 3 4 4 3 ...
## $ gender : chr [1:1470] "Female" "Male" "Male" "Female" ...
## $ jobinvolvement : num [1:1470] 3 2 2 3 3 3 4 3 2 3 ...
## $ joblevel : num [1:1470] 2 2 1 1 1 1 1 1 3 2 ...
## $ jobrole : chr [1:1470] "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ jobsatisfaction : num [1:1470] 4 2 3 3 2 4 1 3 3 3 ...
## $ maritalstatus : chr [1:1470] "Single" "Married" "Single" "Married" ...
## $ monthlyincome : num [1:1470] 5993 5130 2090 2909 3468 ...
## $ numcompaniesworked : num [1:1470] 8 1 6 1 9 0 4 1 0 6 ...
## $ over18 : chr [1:1470] "Y" "Y" "Y" "Y" ...
## $ overtime : chr [1:1470] "Yes" "No" "Yes" "Yes" ...
## $ percentsalaryhike : num [1:1470] 11 23 15 11 12 13 20 22 21 13 ...
## $ performancerating : num [1:1470] 3 4 3 3 3 3 4 4 4 3 ...
## $ relationshipsatisfaction: num [1:1470] 1 4 2 3 4 3 1 2 2 2 ...
## $ standardhours : num [1:1470] 80 80 80 80 80 80 80 80 80 80 ...
## $ stockoptionlevel : num [1:1470] 0 1 0 0 1 0 3 1 0 2 ...
## $ totalworkingyears : num [1:1470] 8 10 7 8 6 8 12 1 10 17 ...
## $ trainingtimeslastyear : num [1:1470] 0 3 3 3 3 2 3 2 2 3 ...
## $ worklifebalance : num [1:1470] 1 3 3 3 3 2 2 3 3 2 ...
## - attr(*, "spec")=
## .. cols(
## .. Resear = col_double(),
## .. Attrition = col_logical(),
## .. BusinessTravel = col_character(),
## .. Department = col_character(),
## .. DistanceFromHomejobrole = col_double(),
## .. Education = col_double(),
## .. EducationField = col_character(),
## .. EmployeeCount = col_double(),
## .. EnvironmentSatisfaction = col_double(),
## .. Gender = col_character(),
## .. JobInvolvement = col_double(),
## .. JobLevel = col_double(),
## .. JobRole = col_character(),
## .. JobSatisfaction = col_double(),
## .. MaritalStatus = col_character(),
## .. MonthlyIncome = col_double(),
## .. NumCompaniesWorked = col_double(),
## .. Over18 = col_character(),
## .. OverTime = col_character(),
## .. PercentSalaryHike = col_double(),
## .. PerformanceRating = col_double(),
## .. RelationshipSatisfaction = col_double(),
## .. StandardHours = col_double(),
## .. StockOptionLevel = col_double(),
## .. TotalWorkingYears = col_double(),
## .. TrainingTimesLastYear = col_double(),
## .. WorkLifeBalance = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
library(wesanderson)
ggplot(HR_EmployeeAttrition, aes(x = attrition)) +
  geom_bar(position = "stack", fill = wes_palette("Cavalcanti1", n = 2)) +
  theme_minimal() +
  labs(x = "Attrition",
       y = "Count",
       title = "Employee Attrition",
       caption = "Source: HR_EmployeeAttrition")
HR_EmployeeAttrition %>%
  group_by(attrition) %>%
  summarise(n = n())
## # A tibble: 2 × 2
## attrition n
## <lgl> <int>
## 1 FALSE 1233
## 2 TRUE 237
We can see that 237 employees left their jobs and 1233 stayed. In other words, about 16.12% of the 1470 employees in the dataset left.
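The percentage follows directly from the two counts in the table. As a short worked check (the variable names here are illustrative):

```r
# Counts taken from the attrition summary table.
n_left   <- 237
n_stayed <- 1233

# Attrition rate = employees who left / all employees.
attrition_rate <- n_left / (n_left + n_stayed)
print(round(100 * attrition_rate, 2))

# Equivalent shortcut for a logical column: the mean of TRUE/FALSE values
# is the share of TRUEs.
attrition_flag <- c(rep(TRUE, n_left), rep(FALSE, n_stayed))
print(round(100 * mean(attrition_flag), 2))
```

On the real data, `mean(HR_EmployeeAttrition$attrition)` should give the same proportion, since `attrition` is read as a logical column.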
HR_EmployeeAttrition %>%
  group_by(attrition, department) %>%
  summarise(n = n())
## `summarise()` has grouped output by 'attrition'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 3
## # Groups: attrition [2]
## attrition department n
## <lgl> <chr> <int>
## 1 FALSE Human Resources 51
## 2 FALSE Research & Development 828
## 3 FALSE Sales 354
## 4 TRUE Human Resources 12
## 5 TRUE Research & Development 133
## 6 TRUE Sales 92
library(ggplot2)
plot2 <- HR_EmployeeAttrition %>%
  ggplot() +
  geom_bar(aes(x = department, fill = attrition), position = position_stack()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
plot2
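This visualization shows how the 237 employees who left are spread across the Human Resources, Research & Development, and Sales departments. Research & Development accounts for the largest number of departures (133 of 237), but it is also by far the largest department. As a share of each department's headcount, attrition is actually lowest in Research & Development (133 of 961, about 13.8%) and higher in Human Resources (12 of 63, about 19.0%) and Sales (92 of 446, about 20.6%).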
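Because the departments differ so much in size, counts and rates can tell different stories. The sketch below recomputes the attrition rate per department from the counts in the table above; the object names `dept_counts` and `dept_rates` are illustrative.

```r
library(dplyr)

# Counts copied from the attrition-by-department table.
dept_counts <- tibble::tribble(
  ~attrition, ~department,              ~n,
  FALSE,      "Human Resources",         51,
  FALSE,      "Research & Development", 828,
  FALSE,      "Sales",                  354,
  TRUE,       "Human Resources",         12,
  TRUE,       "Research & Development", 133,
  TRUE,       "Sales",                   92
)

# For each department: leavers, total headcount, and leavers as a percentage.
dept_rates <- dept_counts %>%
  group_by(department) %>%
  summarise(
    left      = sum(n[attrition]),
    headcount = sum(n),
    rate_pct  = round(100 * left / headcount, 1)
  )
print(dept_rates)
```

A bar chart built with `position = "fill"` instead of `position_stack()` would show the same proportions visually.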
First, I will use group_by() and summarise() to find the minimum, average, and maximum monthly income of the employees who left and of those who stayed. Then I will plot monthly income by department to see the relationship between monthly income and employee attrition.
HR_EmployeeAttrition %>%
  group_by(attrition) %>%
  summarise(n_employees = n(),
            min_monthlyincome = min(monthlyincome),
            avg_monthly = mean(monthlyincome),
            max_monthlyincome = max(monthlyincome),
            sd_monthlyincome = sd(monthlyincome),
            # Share (0 to 1) earning at most $5,000 per month, i.e. at most $60k per year
            pct_less_60k = mean(monthlyincome <= 5000))
## # A tibble: 2 × 7
## attrition n_employees min_monthlyincome avg_monthly max_mont…¹ sd_mo…² pct_l…³
## <lgl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FALSE 1233 1051 6833. 19999 4818. 0.475
## 2 TRUE 237 1009 4787. 19859 3640. 0.688
## # … with abbreviated variable names ¹max_monthlyincome, ²sd_monthlyincome,
## # ³pct_less_60k
ggplot(data = HR_EmployeeAttrition, aes(x = monthlyincome, fill = department)) +
  geom_histogram(color = "Black", bins = 20) +
  facet_wrap(~ attrition, nrow = 2) +
  labs(title = "Monthly Income Distribution by Department and Attrition (TRUE/FALSE)",
       x = "Monthly Income (US Dollars)", y = "Number of Employees")
In this visualization, we can see that most of the employees who left, across all three departments, had monthly incomes of $5,000 or less (about 69% of leavers, compared with about 48% of those who stayed). Relatively few employees with monthly incomes between $5,000 and $20,000 left.
library(ggplot2)
library(ggfortify)
library(htmltools)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
HR_EmployeeAttrition %>%
  ggplot() +
  geom_boxplot(aes(y = jobsatisfaction, group = department, fill = department)) +
  scale_fill_manual(values = c("Dark Blue", "Orange", "Gray")) +
  theme(axis.text.y = element_blank()) + # Remove the useless y-axis tick values.
  ggtitle("Comparison Between Department and Job Satisfaction") +
  coord_flip()
In this plot, job satisfaction is centered around 3 in every department, so employees in the three departments report similar levels of satisfaction. Note that this plot compares departments and does not separate employees who left from those who stayed, so on its own it cannot show whether job satisfaction is related to attrition; the t-test on job satisfaction later in this report addresses that question.
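To compare leavers and stayers directly, the boxplot can put attrition on the x-axis and facet by department. The sketch below uses simulated data as a stand-in for the real file; the data frame `sim` and its values are assumptions made only so the example runs on its own.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the real data (values are made up).
sim <- data.frame(
  department      = rep(c("Human Resources", "Research & Development", "Sales"),
                        each = 40),
  attrition       = rep(c(TRUE, FALSE), times = 60),
  jobsatisfaction = sample(1:4, 120, replace = TRUE)
)

# One panel per department; within each panel, one box per attrition group.
p <- ggplot(sim, aes(x = attrition, y = jobsatisfaction, fill = attrition)) +
  geom_boxplot() +
  facet_wrap(~ department) +
  labs(x = "Attrition", y = "Job Satisfaction (1-4)",
       title = "Job Satisfaction by Attrition Within Each Department")
print(p)
```

With the real dataset, replacing `sim` by `HR_EmployeeAttrition` should be enough, since the column names match.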
Plot4 <- filter(HR_EmployeeAttrition, department == "Human Resources" | department == "Sales" | department == "Research & Development")
ggplot(Plot4, aes(x = attrition, y = businesstravel, color = department)) +
  ylab("Frequency of Business Travel") +
  theme_minimal(base_size = 12) +
  ggtitle("Attrition and Business Travel") +
  geom_jitter() +
  scale_color_brewer(palette = 'Set1')
From this chart, the Sales employees who left appear to travel more often than the leavers in the other departments, even though the Research & Development department has more leavers overall. For the Sales department, business travel might be one of the reasons employees leave.
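A jitter plot of two categorical variables can be hard to read precisely, so a table of proportions is a useful companion. The sketch below uses a handful of made-up rows (the data frame `toy` is an assumption, not the real data) to show the attrition rate within each travel category.

```r
# Made-up example rows (not the real data).
toy <- data.frame(
  attrition      = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  businesstravel = c("Travel_Frequently", "Travel_Frequently", "Travel_Rarely",
                     "Travel_Rarely", "Travel_Rarely", "Non-Travel",
                     "Travel_Frequently", "Travel_Rarely")
)

# Cross-tabulate travel frequency (rows) against attrition (columns).
counts <- table(toy$businesstravel, toy$attrition)

# margin = 1 makes each row sum to 1, so the "TRUE" column is the
# attrition rate within each travel category.
rates <- prop.table(counts, margin = 1)
print(round(rates, 2))
```

Adding `department` as a third variable in `table()` would give one such table per department.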
HR_Att <- HR_EmployeeAttrition %>%
  select(attrition, department, distancefromhomejobrole, environmentsatisfaction,
         monthlyincome, totalworkingyears, jobrole, joblevel, percentsalaryhike,
         performancerating, worklifebalance, relationshipsatisfaction, maritalstatus) %>%
  filter(department == "Human Resources" | department == "Research & Development" | department == "Sales")
# Welch two-sample t-tests (unequal variances): employees who left vs. employees who stayed
jobsat_t <- t.test(jobsatisfaction ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
resear_t <- t.test(resear ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
wrklifebal_t <- t.test(worklifebalance ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
workyrs_t <- t.test(totalworkingyears ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
numcompw_t <- t.test(numcompaniesworked ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
perc_t <- t.test(percentsalaryhike ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
stock_t <- t.test(stockoptionlevel ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
envsat_t <- t.test(environmentsatisfaction ~ attrition, data = HR_EmployeeAttrition, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE)
p_values <- tribble(
  ~name, ~p.value,
  "Job Satisfaction", jobsat_t$p.value,
  "Resear", resear_t$p.value,
  "Work Life Balance", wrklifebal_t$p.value,
  "Total Working Years", workyrs_t$p.value,
  "Number of Companies Worked", numcompw_t$p.value,
  "Stock Option Level", stock_t$p.value,
  "Environment Satisfaction", envsat_t$p.value,
  "Percent Salary Hike", perc_t$p.value
)
knitr::kable(p_values)
| name | p.value |
|---|---|
| Job Satisfaction | 0.0001052 |
| Resear | 0.0000000 |
| Work Life Balance | 0.0304657 |
| Total Working Years | 0.0000000 |
| Number of Companies Worked | 0.1163340 |
| Stock Option Level | 0.0000003 |
| Environment Satisfaction | 0.0002092 |
| Percent Salary Hike | 0.6144301 |
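The t-tests show that the means of six variables differ significantly between employees who left and those who stayed (p < 0.05): job satisfaction, Resear, work-life balance, total working years, stock option level, and environment satisfaction. The p-values for number of companies worked (about 0.12) and percent salary hike (about 0.61) are not significant, so for these two variables we cannot claim a difference, or its direction, with respect to employee attrition. A p-value by itself also does not give the direction of a difference; that comes from comparing the group means.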
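To read the direction of a difference, the object returned by `t.test()` also carries the two group means and a confidence interval for their difference. The sketch below uses simulated data as a stand-in (the data frame `sim` and its means are assumptions chosen for illustration, not results from the dataset).

```r
set.seed(42)
# Simulated stand-in data: the leavers are given a lower mean on purpose.
sim <- data.frame(
  attrition       = rep(c(FALSE, TRUE), times = c(200, 50)),
  jobsatisfaction = c(rnorm(200, mean = 2.8, sd = 1),
                      rnorm(50,  mean = 2.4, sd = 1))
)

# Welch two-sample t-test (the default when var.equal is not set to TRUE).
res <- t.test(jobsatisfaction ~ attrition, data = sim)

print(res$estimate)  # group means: "mean in group FALSE", "mean in group TRUE"
print(res$conf.int)  # CI for (mean of FALSE group) minus (mean of TRUE group)
print(res$p.value)

# A positive value means stayers score higher on average than leavers.
diff_means <- unname(res$estimate[1] - res$estimate[2])
print(diff_means)
```

Applied to the objects above, `jobsat_t$estimate` and `jobsat_t$conf.int` would give the same information for the real data.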
Many organizations must recruit talent, and recruitment costs money and time. Reducing employee turnover can lower both training and recruitment costs. My goal was therefore to explore and analyze the dataset to better understand why employees leave their jobs.
By analyzing the data, I found that there are many reasons employees quit their jobs. One of them may be the commute from home to work. Instead of letting those employees quit, some companies could allow them to work from home. As I mentioned, recruitment costs money and time. If those employees are highly qualified, both parties (employers and employees) can benefit from working from home: employers save money, and employees save on gas and food and work in a safe environment (home).
Another aspect of the analysis that stands out to me is the unwillingness of employees with more than 10 working years to quit. The data suggest that the more years an employee has worked, the lower the willingness to quit, regardless of occupation, and the t-test on total working years is consistent with this. After about 10 years of work, few employees appear to leave. Also, when comparing job satisfaction, employees in all three departments reported similar levels of satisfaction. The one variable that the leavers most clearly have in common is a lower monthly income.
Finally, there are many things that I wish I could have included in this analysis. For example, faceted graphs across more variables would explore the causes of employee attrition in more depth. A multiple regression model could also take several variables as inputs and try to predict attrition, and a correlation plot would help with visualization and exploration.
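As a rough illustration of the regression idea, a logistic regression is a common choice when the outcome is yes/no, as attrition is. The sketch below is not fitted to the real data: it uses simulated predictors, and the data frame `sim`, the effect sizes, and the choice of `monthlyincome` and `overtime` as predictors are all assumptions made for the example.

```r
set.seed(123)
n <- 500

# Simulated predictors (made-up values, loosely in the range of the dataset).
sim <- data.frame(
  monthlyincome = runif(n, 1000, 20000),
  overtime      = sample(c("No", "Yes"), n, replace = TRUE)
)

# Simulate attrition so that lower income and overtime raise the odds of leaving.
log_odds <- -1 - 0.0002 * sim$monthlyincome + 1.0 * (sim$overtime == "Yes")
sim$attrition <- rbinom(n, size = 1, prob = plogis(log_odds))  # 1 = left, 0 = stayed

# Logistic regression: models the log-odds of leaving as a linear function
# of the predictors.
fit <- glm(attrition ~ monthlyincome + overtime, data = sim, family = binomial)

print(summary(fit)$coefficients)  # estimates, standard errors, z values, p-values
print(exp(coef(fit)))             # odds ratios
```

An odds ratio below 1 for a predictor means higher values go with lower odds of leaving; above 1 means higher odds.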