Project 3a

HR Employee Attrition

In this analysis, we examine several categorical and quantitative variables and their connection to employee attrition. We start with a general overview of the dataset: summary statistics for three quantitative variables, a frequency distribution and a relative frequency distribution for a categorical variable, and a contingency table for two categorical variables. We then use visualizations (bar graphs, pie charts, histograms, and boxplots) to relate different variables to employee attrition. This dataset can be found at https://drive.google.com/drive/folders/1AMRfddeMwKRaNidOV87JP1iCVn_z-Uv_

Metadata list

Age: quantitative variable
Attrition: categorical variable (TRUE for ‘Yes’, FALSE for ‘No’); the departure of employees from the organization for any reason
BusinessTravel: categorical variable; travel undertaken for work or business purposes
Department: categorical variable; one part of a large organization
DistanceFromHomejobrole: quantitative variable
Education: quantitative variable
EducationField: categorical variable
EmployeeCount: quantitative variable
EnvironmentSatisfaction: categorical variable (1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’)
Gender: categorical variable (Male, Female)
JobInvolvement: categorical variable (1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’)
JobLevel: quantitative variable
JobRole: categorical variable
JobSatisfaction: categorical variable (1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’)
MaritalStatus: categorical variable (Single, Divorced, Married)
MonthlyIncome: quantitative variable
NumCompaniesWorked: quantitative variable
Over18: categorical variable
OverTime: categorical variable (Yes, No)
PercentSalaryHike: quantitative variable
PerformanceRating: categorical variable (1 ‘Low’, 2 ‘Good’, 3 ‘Excellent’, 4 ‘Outstanding’)
RelationshipSatisfaction: categorical variable (1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’)
StandardHours (80): quantitative variable
StockOptionLevel: quantitative variable
TotalWorkingYears: quantitative variable
TrainingTimesLastYear: quantitative variable
WorkLifeBalance: categorical variable (1 ‘Bad’, 2 ‘Good’, 3 ‘Better’, 4 ‘Best’)

Load library

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2

## Warning: package 'dplyr' was built under R version 4.2.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Import the dataset

library(readr)
HR_EmployeeAttrition <- read_csv("C:/Users/Mitcheyla$/Desktop/DATA 101, Fall Semester/HR_EmployeeAttrition.csv")

## Rows: 1470 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): BusinessTravel, Department, EducationField, Gender, JobRole, Marit...
## dbl (18): Age, DistanceFromHomejobrole, Education, EmployeeCount, Environmen...
## lgl  (1): Attrition
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(HR_EmployeeAttrition)

Summary statistics for three quantitative variables

# Keep the three quantitative columns in their own object so the full dataset stays available
hr_quant <- HR_EmployeeAttrition[, c("MonthlyIncome", "Age", "TotalWorkingYears")]
summary(hr_quant)

##  MonthlyIncome        Age        TotalWorkingYears
##  Min.   : 1009   Min.   :18.00   Min.   : 0.00    
##  1st Qu.: 2911   1st Qu.:30.00   1st Qu.: 6.00    
##  Median : 4919   Median :36.00   Median :10.00    
##  Mean   : 6503   Mean   :36.92   Mean   :11.28    
##  3rd Qu.: 8379   3rd Qu.:43.00   3rd Qu.:15.00    
##  Max.   :19999   Max.   :60.00   Max.   :40.00

sd(hr_quant$MonthlyIncome)

## [1] 4707.957

sd(hr_quant$Age)

## [1] 9.135373

sd(hr_quant$TotalWorkingYears)

## [1] 7.780782
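The mean, median, and standard deviation of several columns can also be computed in one pass instead of one call per column. The following is a minimal sketch, assuming a small made-up data frame named `toy` (hypothetical values, not the HR data) that only reuses the three column names.

```r
# Hypothetical toy data standing in for the three quantitative columns
toy <- data.frame(
  MonthlyIncome     = c(2000, 3500, 5000, 8000, 12000),
  Age               = c(22, 30, 36, 45, 58),
  TotalWorkingYears = c(1, 6, 10, 20, 35)
)

# Apply mean, median, and sd to every column; the result is a 3 x 3 matrix
stats <- sapply(toy, function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
round(stats, 2)
```

Replacing `toy` with the three-column subset of the real dataset should give the same means and standard deviations reported in this section.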

Frequency distribution and relative frequency distribution for a key categorical variable

table(HR_EmployeeAttrition$Gender)

## 
## Female   Male 
##    588    882

table(HR_EmployeeAttrition$JobSatisfaction)/length(HR_EmployeeAttrition$JobSatisfaction)

## 
##         1         2         3         4 
## 0.1965986 0.1904762 0.3006803 0.3122449

sum(table(HR_EmployeeAttrition$JobSatisfaction)/length(HR_EmployeeAttrition$JobSatisfaction))

## [1] 1
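A frequency distribution and its relative frequency distribution are usually reported for the same variable. The sketch below shows one way to put both side by side with `table()` and `prop.table()`. The vector `job_sat` is hypothetical toy data, not the HR dataset.

```r
# Hypothetical satisfaction scores (1 = Low ... 4 = Very High), not the HR data
job_sat <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)

freq     <- table(job_sat)      # frequency distribution (counts per level)
rel_freq <- prop.table(freq)    # relative frequencies (proportions that sum to 1)

# Combine both into one table, with percentages for readability
data.frame(
  level   = names(freq),
  count   = as.vector(freq),
  percent = round(100 * as.vector(rel_freq), 1)
)
```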

Contingency table for two categorical variables

table(HR_EmployeeAttrition$BusinessTravel, HR_EmployeeAttrition$MaritalStatus)

##                    
##                     Divorced Married Single
##   Non-Travel              44      59     47
##   Travel_Frequently       63     118     96
##   Travel_Rarely          220     496    327
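Raw counts are hard to compare because the travel categories differ greatly in size. As a sketch, the counts from the table above can be re-entered by hand and converted to row proportions. The object name `travel_marital` is our own choice for illustration.

```r
# Counts copied from the contingency table above
travel_marital <- as.table(matrix(
  c( 44,  59,  47,
     63, 118,  96,
    220, 496, 327),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    BusinessTravel = c("Non-Travel", "Travel_Frequently", "Travel_Rarely"),
    MaritalStatus  = c("Divorced", "Married", "Single")
  )
))

# Row proportions: share of each marital status within a travel category
row_prop <- prop.table(travel_marital, margin = 1)
round(row_prop, 3)

# The same counts with row and column totals added
addmargins(travel_marital)
```

With the full dataset loaded, `prop.table(table(HR_EmployeeAttrition$BusinessTravel, HR_EmployeeAttrition$MaritalStatus), margin = 1)` should give the same proportions without retyping the counts.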

Visualisation

library(ggplot2)
library(wesanderson)

HR_EmployeeAttrition %>%
  group_by(Attrition, Department) %>%
  summarise(n = n())

## `summarise()` has grouped output by 'Attrition'. You can override using the
## `.groups` argument.

## # A tibble: 6 × 3
## # Groups:   Attrition [2]
##   Attrition Department                 n
##   <lgl>     <chr>                  <int>
## 1 FALSE     Human Resources           51
## 2 FALSE     Research & Development   828
## 3 FALSE     Sales                    354
## 4 TRUE      Human Resources           12
## 5 TRUE      Research & Development   133
## 6 TRUE      Sales                     92
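Because the departments differ in size, the counts alone do not show where attrition is most likely. A small sketch, using only the counts printed above (the vector names `stayed` and `left` are ours), turns them into attrition rates.

```r
# Counts copied from the grouped summary above
stayed <- c("Human Resources" = 51, "Research & Development" = 828, "Sales" = 354)
left   <- c("Human Resources" = 12, "Research & Development" = 133, "Sales" = 92)

# Attrition rate = employees who left / all employees in the department
attrition_rate <- left / (left + stayed)
round(100 * attrition_rate, 1)
```

On these counts, Research & Development has the most departures but the lowest rate (about 13.8%), while Sales has the highest (about 20.6%).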

Bar graph of Employee Attrition

ggplot(HR_EmployeeAttrition, aes(x = Attrition)) +
  geom_bar(position = "stack", fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_minimal() +
  labs(x = "Attrition", 
       y = "Count",
       title = "Employee Attrition",
       caption = "Source: HR_EmployeeAttrition")

Pie Graph

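df <- table(HR_EmployeeAttrition$Gender, HR_EmployeeAttrition$Attrition)
lbl <- rownames(df)  # "Female", "Male": same order as the rows of the table
df

##         
##          FALSE TRUE
##   Female   501   87
##   Male     732  150

# Plot only the employees who left (the TRUE column), one slice per gender
pie(df[, "TRUE"], labels = lbl, main = "Gender Attrition", col = c("Yellow", "Green"))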
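Pie slices are easier to read when each label carries its percentage. The sketch below re-enters the counts of employees who left from the table above and builds such labels. The names `left_by_gender` and `slice_labels` are ours.

```r
# Counts of employees who left, copied from the TRUE column of the table above
left_by_gender <- c(Female = 87, Male = 150)

# Share of all leavers in each group, as a percentage
pct <- round(100 * left_by_gender / sum(left_by_gender), 1)
slice_labels <- paste0(names(left_by_gender), ": ", pct, "%")

pie(left_by_gender, labels = slice_labels, main = "Attrition by Gender",
    col = c("Yellow", "Green"))
```

These shares describe who makes up the group of leavers. Since the dataset contains more men than women (882 versus 588), they do not by themselves show which group is more likely to leave.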

Histogram of Employee Attrition and Age

ggplot(data = HR_EmployeeAttrition, aes(x = Age)) + 
        geom_histogram(breaks = seq(18, 60, by = 2),  # ages in the data run from 18 to 60
                       col = "gray", 
                       aes(fill = after_stat(count))) +
        labs(x = "Age", title = "Employee Attrition", y = "Count") +
        scale_fill_gradient("Count", low = "Orange", high = "dark green")
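To see how age relates to attrition directly, the histogram can be split into one panel per attrition group. This is a minimal sketch on simulated data: the data frame `toy` and its values are hypothetical, and only the column names `Age` and `Attrition` match the dataset.

```r
library(ggplot2)

# Hypothetical toy data (simulated ages), not the HR data
set.seed(1)
toy <- data.frame(
  Age       = c(round(rnorm(80, mean = 38, sd = 9)),
                round(rnorm(20, mean = 32, sd = 8))),
  Attrition = rep(c(FALSE, TRUE), times = c(80, 20))
)

# One panel per attrition group; free y scales because the groups differ in size
p <- ggplot(toy, aes(x = Age)) +
  geom_histogram(binwidth = 2, boundary = 18, colour = "gray", fill = "orange") +
  facet_wrap(~ Attrition, nrow = 2, scales = "free_y", labeller = label_both) +
  labs(x = "Age", y = "Count", title = "Age distribution by attrition (toy data)")
p
```

Passing `HR_EmployeeAttrition` instead of `toy` should produce the same kind of two-panel plot for the real data.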

Histogram of Employee Attrition and Monthly Income

ggplot(data = HR_EmployeeAttrition, aes(x = MonthlyIncome, fill = Department)) + 
   geom_histogram(aes(y = after_stat(count)), color = "Black", bins = 20) +
   facet_wrap(~ Attrition, nrow = 2) +
   labs(title = "Monthly Income Distribution by Department (Attrition - Yes/No)",
           x = "Monthly Income (US Dollars)", y = "Number of Employees")

Boxplot of Employee Attrition and Job Satisfaction

ggplot(HR_EmployeeAttrition, aes(x = Attrition, y = JobSatisfaction)) +
  geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
  theme_dark() +
  labs(y = "Job Satisfaction", title = "Relationship between Employee Attrition and Job Satisfaction")

Boxplot of Employee Attrition and Total Working Years

ggplot(HR_EmployeeAttrition, aes(x = Attrition, y = TotalWorkingYears)) +
  geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
  theme_dark() +
  labs(y = "Total Working Yrs", title = "Relationship between Employee Attrition and Number of working years")
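By default, `geom_boxplot()` draws as outliers the points that lie more than 1.5 times the interquartile range beyond the quartiles. As a worked sketch, the fences can be computed from the overall quartiles of TotalWorkingYears reported by `summary()` earlier. The plot computes them separately for each attrition group, so its fences will differ somewhat.

```r
# Overall quartiles of TotalWorkingYears copied from the summary() output
q1 <- 6
q3 <- 15

iqr         <- q3 - q1          # interquartile range: 9
upper_fence <- q3 + 1.5 * iqr   # 28.5
lower_fence <- q1 - 1.5 * iqr   # -7.5 (no value can fall below it, since the minimum is 0)

c(lower = lower_fence, upper = upper_fence)
```

With a maximum of 40 working years in the data, any employee with more than 28.5 years would fall outside the overall upper fence.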

Summary

The first graph shows that the number of employees who did not quit their jobs is much greater than the number who quit. The pie chart and its underlying table show that more males quit their jobs than females (150 versus 87). Also, the histogram (Employee Attrition and Age) peaks between ages 28 and 36, which suggests that most people who quit their jobs were relatively young. I think that as people get older, they know that quitting one job and finding another will not be easy, especially in a marketplace where recruiters are looking for younger talent. Moreover, the second histogram shows that employees quit in all three departments (Sales, Human Resources, and Research and Development), with most of the departures in Research and Development. For most of those who quit, monthly income was between 0 and 5,000 dollars.

Additionally, the first boxplot (Employee Attrition and Job Satisfaction) shows that the employees who quit their jobs were not highly satisfied: their job satisfaction level was between 1 and 3. Another aspect of the analysis that amazes me is the unwillingness to quit of employees who have more than 10 years of working experience. The last graph also shows a lot of outliers. In sum, when comparing job satisfaction, monthly income, and total working years, the variable that appears most relevant to employees' decision to quit is monthly income.