In this analysis, we examine several categorical and quantitative variables and their connection to employee attrition. We start with a general overview of the dataset, then provide summary statistics for three quantitative variables, a frequency distribution and a relative frequency distribution for categorical variables, and a contingency table for two categorical variables. We also use visualizations (a bar graph, a pie chart, histograms, and boxplots) to relate different variables to employee attrition. The dataset can be found at https://drive.google.com/drive/folders/1AMRfddeMwKRaNidOV87JP1iCVn_z-Uv_
Age: quantitative variable
Attrition: categorical variable (TRUE for ‘Yes’, FALSE for ‘No’); the departure of employees from the organization for any reason
BusinessTravel: categorical variable; travel undertaken for work or business purposes
Department: categorical variable; one part of a large organization
DistanceFromHomejobrole: quantitative variable
Education: quantitative variable
EducationField: categorical variable
EmployeeCount: quantitative variable
EnvironmentSatisfaction: categorical variable (1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’)
Gender: categorical variable (Male and Female)
JobInvolvement: categorical variable (1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’)
JobLevel: quantitative variable
JobRole: categorical variable
JobSatisfaction: categorical variable (1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’)
MaritalStatus: categorical variable (Single, Divorced, Married)
MonthlyIncome: quantitative variable
NumCompaniesWorked: quantitative variable
Over18: categorical variable
OverTime: categorical variable (Yes, No)
PercentSalaryHike: quantitative variable
PerformanceRating: categorical variable (1 ‘Low’, 2 ‘Good’, 3 ‘Excellent’, 4 ‘Outstanding’)
RelationshipSatisfaction: categorical variable (1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’)
StandardHours (80): quantitative variable
StockOptionLevel: quantitative variable
TotalWorkingYears: quantitative variable
TrainingTimesLastYear: quantitative variable
WorkLifeBalance: categorical variable (1 ‘Bad’, 2 ‘Good’, 3 ‘Better’, 4 ‘Best’)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## Warning: package 'dplyr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
HR_EmployeeAttrition <- read_csv("C:/Users/Mitcheyla$/Desktop/DATA 101, Fall Semester/HR_EmployeeAttrition.csv")
## Rows: 1470 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): BusinessTravel, Department, EducationField, Gender, JobRole, Marit...
## dbl (18): Age, DistanceFromHomejobrole, Education, EmployeeCount, Environmen...
## lgl (1): Attrition
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(HR_EmployeeAttrition)
HR_EmployeeAttrition<- HR_EmployeeAttrition[,c("MonthlyIncome", "Age", "TotalWorkingYears")]
summary(HR_EmployeeAttrition)
## MonthlyIncome Age TotalWorkingYears
## Min. : 1009 Min. :18.00 Min. : 0.00
## 1st Qu.: 2911 1st Qu.:30.00 1st Qu.: 6.00
## Median : 4919 Median :36.00 Median :10.00
## Mean : 6503 Mean :36.92 Mean :11.28
## 3rd Qu.: 8379 3rd Qu.:43.00 3rd Qu.:15.00
## Max. :19999 Max. :60.00 Max. :40.00
sd(HR_EmployeeAttrition$MonthlyIncome)
## [1] 4707.957
sd(HR_EmployeeAttrition$Age)
## [1] 9.135373
sd(HR_EmployeeAttrition$TotalWorkingYears)
## [1] 7.780782
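The three standard deviations above can also be computed in a single call. The following is a minimal sketch, assuming a small made-up data frame named `toy` (a hypothetical stand-in with the same three column names; its values are illustrative and are not rows from the dataset):

```r
# Hypothetical toy data; values are illustrative only, not taken from the dataset
toy <- data.frame(
  MonthlyIncome     = c(2000, 4000, 6000, 8000),
  Age               = c(25, 35, 45, 55),
  TotalWorkingYears = c(2, 8, 14, 20)
)

# Apply sd() to every column at once; the result is a named numeric vector
sds <- sapply(toy, sd)
sds
```

With the real data, the same idea would be `sapply(HR_EmployeeAttrition[, c("MonthlyIncome", "Age", "TotalWorkingYears")], sd)`.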
library(readr)
HR_EmployeeAttrition <- read_csv("C:/Users/Mitcheyla$/Desktop/DATA 101, Fall Semester/HR_EmployeeAttrition.csv")
## Rows: 1470 Columns: 27
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): BusinessTravel, Department, EducationField, Gender, JobRole, Marit...
## dbl (18): Age, DistanceFromHomejobrole, Education, EmployeeCount, Environmen...
## lgl (1): Attrition
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(HR_EmployeeAttrition)
table(HR_EmployeeAttrition$Gender)
##
## Female Male
## 588 882
table(HR_EmployeeAttrition$JobSatisfaction)/length(HR_EmployeeAttrition$JobSatisfaction)
##
## 1 2 3 4
## 0.1965986 0.1904762 0.3006803 0.3122449
sum(table(HR_EmployeeAttrition$JobSatisfaction)/length(HR_EmployeeAttrition$JobSatisfaction))
## [1] 1
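Dividing a frequency table by the number of observations, as done above, is equivalent to calling base R's `prop.table()`. A small sketch using the Gender counts reported earlier (588 female, 882 male); the vector name `gender_counts` is chosen here only for illustration:

```r
# Gender counts copied from the frequency table above
gender_counts <- c(Female = 588, Male = 882)

# prop.table() divides each count by the sum of all counts
gender_props <- prop.table(gender_counts)
gender_props

# Relative frequencies always sum to 1
sum(gender_props)
```

On the full dataset, `prop.table(table(HR_EmployeeAttrition$JobSatisfaction))` should give the same relative frequencies as the division shown above.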
table(HR_EmployeeAttrition$BusinessTravel, HR_EmployeeAttrition$MaritalStatus)
##
## Divorced Married Single
## Non-Travel 44 59 47
## Travel_Frequently 63 118 96
## Travel_Rarely 220 496 327
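A contingency table is often easier to read with its row and column totals attached. The sketch below rebuilds the table above from its printed counts (the object name `travel_by_marital` is an assumption made for this example) and adds the totals with `addmargins()`:

```r
# Counts copied from the contingency table above
travel_by_marital <- as.table(matrix(
  c( 44,  59,  47,
     63, 118,  96,
    220, 496, 327),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    BusinessTravel = c("Non-Travel", "Travel_Frequently", "Travel_Rarely"),
    MaritalStatus  = c("Divorced", "Married", "Single")
  )
))

# Append a "Sum" row and a "Sum" column holding the marginal totals
with_totals <- addmargins(travel_by_marital)
with_totals
```

The bottom-right cell of `with_totals` is the grand total, which should match the 1470 rows reported when the file was read.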
library(ggplot2)
library(wesanderson)
HR_EmployeeAttrition %>%
group_by(Attrition, Department) %>%
summarise(n = n())
## `summarise()` has grouped output by 'Attrition'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 3
## # Groups: Attrition [2]
## Attrition Department n
## <lgl> <chr> <int>
## 1 FALSE Human Resources 51
## 2 FALSE Research & Development 828
## 3 FALSE Sales 354
## 4 TRUE Human Resources 12
## 5 TRUE Research & Development 133
## 6 TRUE Sales 92
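Because the departments differ greatly in size, it can help to turn these counts into an attrition rate per department. A minimal sketch using the counts printed above; the data frame `dept_counts` and its column names are hypothetical names introduced for this example:

```r
# Counts copied from the grouped summary above
dept_counts <- data.frame(
  Department = c("Human Resources", "Research & Development", "Sales"),
  stayed     = c(51, 828, 354),   # Attrition == FALSE
  left       = c(12, 133, 92)     # Attrition == TRUE
)

# Attrition rate = employees who left / all employees in the department
dept_counts$attrition_rate <- dept_counts$left /
  (dept_counts$stayed + dept_counts$left)

dept_counts
```

Under these counts, Research & Development has the most leavers in absolute terms but the lowest rate (about 13.8%), while Sales has the highest (about 20.6%).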
ggplot(HR_EmployeeAttrition, aes(x = Attrition)) +
geom_bar(position = "stack", fill = wes_palette("GrandBudapest1", n = 2)) +
theme_minimal() +
labs(x = "Attrition",
y = "Count",
title = "Employee Attrition",
caption = "Source: HR_EmployeeAttrition")
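df <- table(HR_EmployeeAttrition$Gender, HR_EmployeeAttrition$Attrition)
df
##
## FALSE TRUE
## Female 501 87
## Male 732 150
# Plot only the employees who left (the TRUE column) and take the slice labels
# from the table's row names so that each label matches its count
left_by_gender <- df[, "TRUE"]
pie(left_by_gender, labels = names(left_by_gender), main = "Gender Attrition", col = c("Yellow", "Green"))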
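Since the company employs more men than women, the raw counts alone do not show which group quits more often. The sketch below, built from the gender-by-attrition counts printed above (the object name `gender_attrition` is an assumption for this example), computes the share of each gender that left:

```r
# Counts copied from the gender-by-attrition table above
gender_attrition <- matrix(
  c(501,  87,
    732, 150),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"),
                  Attrition = c("FALSE", "TRUE"))
)

# margin = 1 divides each row by its row total, so each row sums to 1;
# the "TRUE" column is then the attrition rate within each gender
attrition_rate <- prop.table(gender_attrition, margin = 1)[, "TRUE"]
round(attrition_rate, 3)
```

With these counts the rate is roughly 14.8% for women and 17.0% for men, so the gap is smaller in relative terms than the counts of 87 and 150 suggest.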
ggplot(data = HR_EmployeeAttrition, aes(x = Age)) +
  geom_histogram(breaks = seq(20, 60, by = 2),
                 col = "gray",
                 aes(fill = after_stat(count))) +
  labs(x = "Age", title = "Employee Attrition", y = "Count") +
  scale_fill_gradient("Count", low = "Orange", high = "dark green")
ggplot(data = HR_EmployeeAttrition, aes(x = MonthlyIncome, fill = Department)) +
  geom_histogram(aes(y = after_stat(count)), color = "Black", bins = 20) +
  facet_wrap(~ Attrition, nrow = 2) +
  labs(title = "Monthly Income Distribution by Department (Attrition - Yes/No)",
       x = "Monthly Income (US Dollars)", y = "Number of Employees")
ggplot(HR_EmployeeAttrition, aes(x = Attrition, y = JobSatisfaction)) +
geom_boxplot(fill = wes_palette("GrandBudapest1", n = 2)) +
theme_dark() +
labs(y = "Job Satisfaction", title = "Relationship between Employee Attrition and Job Satisfaction")
## Boxplot of Employee Attrition and Total Working Years
ggplot(HR_EmployeeAttrition, aes(x = Attrition, y = TotalWorkingYears)) +
geom_boxplot(fill = wes_palette("Darjeeling1", n = 2)) +
theme_dark() +
labs(y = "Total Working Yrs", title = "Relationship between Employee Attrition and Number of working years")
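The boxplots compare groups visually; the same comparison can be made numerically by summarising each variable within each attrition group. The following is only a sketch on made-up data: the data frame `toy` and all of its values are hypothetical and do not come from the dataset.

```r
# Hypothetical toy data; values are illustrative only, not rows from the dataset
toy <- data.frame(
  Attrition         = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  TotalWorkingYears = c(1, 4, 7, 8, 10, 15),
  MonthlyIncome     = c(2000, 2500, 3000, 5000, 6000, 7000)
)

# Median of each numeric variable within each attrition group
group_medians <- aggregate(
  cbind(TotalWorkingYears, MonthlyIncome) ~ Attrition,
  data = toy, FUN = median
)
group_medians
```

Replacing `toy` with `HR_EmployeeAttrition` would give the group medians for the real data, which can then be read alongside the boxplots.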
The first graph shows that far more employees stayed than quit (1,233 versus 237). The pie chart shows that more males than females quit their jobs (150 versus 87). In the histogram (Employee Attrition and Age), most employees are young (between 28 and 36); because this histogram includes all employees, it only suggests that most of those who quit were also young. I think that as people get older, they know that quitting one job and finding another will not be easy, especially in a marketplace where recruiters are looking for younger talent. Moreover, the second histogram shows that employees quit in all three departments (Sales, Human Resources, and Research and Development), with most of the departures (133 of 237) in Research and Development. For most of those who quit, monthly income was below 5,000 dollars.
Additionally, in the first boxplot (Employee Attrition and Job Satisfaction), we can see that the employees who quit were not highly satisfied: their job satisfaction level was mostly between 1 and 3. Another aspect of the analysis that amazes me is the unwillingness to quit of some employees who have worked for more than 10 years. We can also see many outliers in the last graph. In sum, when comparing job satisfaction, monthly income, and total working years, the variable that appears most relevant to the decision to quit is monthly income.