The IBM HR Analytics Employee Attrition & Performance dataset has become one of the most widely recognized datasets for those interested in people analytics. Although the dataset is fictional, it includes HR metrics commonly collected in organizations today.
The data is perfect for those interested in practicing data analytics skills; however, it should not be used as a template in organizational settings. My reasons are as follows:
First, it is important to note that the dataset simplifies a few metrics and does not explain how each construct was measured or what each construct means. Take job satisfaction, for example, which was measured on a scale of 1 to 4 (1 = low, 2 = medium, 3 = high, 4 = very high). In the organizational psychology literature, there are various scales that measure and define job satisfaction differently. Without knowing how an organization measures or defines these constructs, it may be difficult to understand why an effect is taking place.
Second, I would like to note how dangerous it can be to use some of the metrics included in the dataset for business decisions. Depending on your region’s labor regulations, you may be exposing yourself and your organization to potential discrimination charges. I advise you to consult with your legal team before making any decisions.
Overall, I loved playing with the dataset. I think it provides a glimpse of what people analytics can be like. Let me show you how I would’ve approached the dataset if I were in a real organizational setting.
For organizations, turnover is often extremely costly. Resources must be allocated to finding a suitable replacement, and even after finding someone, the organization must invest in the replacement’s learning and development. Because turnover comes at a premium, organizational psychologists often use turnover as a way to persuade company leaders to better care for their employees. This is one reason why I believe company culture, engagement, and well-being have recently become such hot topics.
With this dataset, I will therefore strive to identify the reasons why individuals may be leaving the organization. I will attempt the following:
Note that I will not be creating a prediction model. Rather than trying to predict, the goal is to analyze why employees have left.
Side note: although the data is fictional, I have attempted to treat it as a real-world dataset.
Loading Libraries
library(readr)
library(ggplot2)
library(ggcorrplot)
library(dplyr)
library(ggthemes)
library(scales)
library(ggthemr)
library(fabricatr)
library(Hmisc)
knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.width=8, fig.height =6)
Loading Data
ibm_data <- read_csv("data/ibm_dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv")
names(ibm_data)
## [1] "Age" "Attrition"
## [3] "BusinessTravel" "DailyRate"
## [5] "Department" "DistanceFromHome"
## [7] "Education" "EducationField"
## [9] "EmployeeCount" "EmployeeNumber"
## [11] "EnvironmentSatisfaction" "Gender"
## [13] "HourlyRate" "JobInvolvement"
## [15] "JobLevel" "JobRole"
## [17] "JobSatisfaction" "MaritalStatus"
## [19] "MonthlyIncome" "MonthlyRate"
## [21] "NumCompaniesWorked" "Over18"
## [23] "OverTime" "PercentSalaryHike"
## [25] "PerformanceRating" "RelationshipSatisfaction"
## [27] "StandardHours" "StockOptionLevel"
## [29] "TotalWorkingYears" "TrainingTimesLastYear"
## [31] "WorkLifeBalance" "YearsAtCompany"
## [33] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [35] "YearsWithCurrManager"
# setting up second dataset for wrangling and analysis
data1 <- ibm_data
Before any analysis, it is critical that the data be cleaned. For anyone interested in what clean data looks like, I highly recommend reading Hadley Wickham’s paper “Tidy Data”.
Overall, the data is relatively clean and follows the principles of “Tidy Data”. However, there are additional checks worth making.
Key Questions:
Counting all NA values within each column.
colSums(is.na(data1))
## Age Attrition BusinessTravel
## 0 0 0
## DailyRate Department DistanceFromHome
## 0 0 0
## Education EducationField EmployeeCount
## 0 0 0
## EmployeeNumber EnvironmentSatisfaction Gender
## 0 0 0
## HourlyRate JobInvolvement JobLevel
## 0 0 0
## JobRole JobSatisfaction MaritalStatus
## 0 0 0
## MonthlyIncome MonthlyRate NumCompaniesWorked
## 0 0 0
## Over18 OverTime PercentSalaryHike
## 0 0 0
## PerformanceRating RelationshipSatisfaction StandardHours
## 0 0 0
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## 0 0 0
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 0 0 0
## YearsSinceLastPromotion YearsWithCurrManager
## 0 0
Results: No missing data.
Checking the classes of all the columns.
str(data1)
Checking if all of the character columns are appropriate factor variables.
table(data1$Attrition)
table(data1$BusinessTravel)
table(data1$Department)
table(data1$EducationField)
table(data1$Gender)
table(data1$JobRole)
table(data1$MaritalStatus)
table(data1$Over18)
table(data1$OverTime)
I tabulate every character column because my hunch is that they are all factor (categorical) variables. The tables let me confirm that they are indeed categorical, and they also give me a better picture of the people in the organization.
Based on the tables, the character columns are all factor variables and should be converted as such.
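The individual `table()` calls above can also be run in one pass. This is a minimal sketch, assuming a small made-up data frame `toy` standing in for `data1`. The values are hypothetical, and the names `toy`, `char_cols`, `level_counts`, and `n_levels` are introduced only for illustration.

```r
# Hypothetical stand-in for data1; the values are made up for illustration.
toy <- data.frame(
  Attrition = c("Yes", "No", "No", "Yes"),
  Gender    = c("Male", "Female", "Female", "Male"),
  Age       = c(41, 49, 37, 33),
  stringsAsFactors = FALSE
)

# Identify the character columns, then tabulate each one.
char_cols <- names(toy)[vapply(toy, is.character, logical(1))]
level_counts <- lapply(toy[char_cols], table)

# Number of distinct values per character column; a small number relative
# to the row count suggests a categorical variable.
n_levels <- vapply(toy[char_cols], function(x) length(unique(x)), integer(1))

print(level_counts)
print(n_levels)
```

Counting distinct values first gives a quick screen, and the full tables can then be read only for the columns that look categorical.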
Changing character class columns into factor class.
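# Convert every character column to a factor.
# data1 is a tibble, so data1[, i] returns a one-column tibble rather than a
# vector; [[i]] extracts the column itself so the class check works.
for (i in seq_along(data1)) {
if (is.character(data1[[i]])) {
data1[[i]] <- as.factor(data1[[i]])
}
}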
str(data1)
Irrelevant data are often peppered into a dataset. I like to remove irrelevant variables to keep my dataset as neat and slim as possible (a stylistic preference). However, if you or the organization plan to add to the dataset or foresee this being a long-term project, I would recommend not taking any variables out. Since this is a completed dataset with no plans to add more participants, I will be removing irrelevant variables. Additionally, datasets often include participant identity variables such as names, computer IDs, etc., all of which should be removed to ensure participant anonymity.
Irrelevancy should be determined with careful consideration and discussed with relevant stakeholders.
Considering Irrelevancy
summary(data1)
Take Out:
- EmployeeCount: all participants are labeled as 1
- StandardHours: all participants have 80
- Over18: all participants are over 18
Taking out Irrelevant Data
data2 <- subset(data1, select = -c(EmployeeCount, StandardHours, Over18))
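Columns that hold a single value for everyone can also be found programmatically rather than by reading `summary()` output. This is a minimal sketch on a made-up data frame. `toy`, `is_constant`, `constant_cols`, and `toy_slim` are hypothetical names, and the values are invented for illustration.

```r
# Hypothetical stand-in for data1; values are made up for illustration.
toy <- data.frame(
  Age           = c(41, 49, 37),
  EmployeeCount = c(1, 1, 1),
  StandardHours = c(80, 80, 80),
  Over18        = c("Y", "Y", "Y")
)

# A column with a single distinct value carries no information for analysis.
is_constant <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))
constant_cols <- names(toy)[is_constant]
print(constant_cols)

# Drop the constant columns while keeping the result a data frame.
toy_slim <- toy[, !is_constant, drop = FALSE]
print(names(toy_slim))
```

A check like this only flags candidates. Whether a flagged column is truly irrelevant is still a decision to make with stakeholders.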
When dealing with people metrics some causes of errors are as follows:
Preliminary Check for Data Errors
I use the summary function to see whether the min and max values of each variable make sense. Here I am checking for any unreasonable min and max values (extreme outliers and numbers outside the expected range).
summary(data2)
I found all of the min and max values of the factor variables and survey questions to be within their appropriate ranges. However, without knowledge of the company, I could not properly evaluate the other variables.
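For survey items with a known scale, the range check can be made explicit. This is a minimal sketch that assumes every survey item uses the 1 to 4 scale described for job satisfaction, which is an assumption for the other items. The data frame `toy` is made up and contains one deliberately planted error.

```r
# Hypothetical survey responses; the 5 is a deliberately planted error.
toy <- data.frame(
  JobSatisfaction = c(1, 4, 3, 2),
  WorkLifeBalance = c(3, 3, 5, 1)
)

# TRUE for each column whose values all fall inside the assumed 1-4 scale.
in_range <- vapply(toy, function(x) all(x >= 1 & x <= 4), logical(1))
print(in_range)

# Row numbers that contain at least one out-of-range response.
bad_rows <- which(rowSums(toy < 1 | toy > 4) > 0)
print(bad_rows)
```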
Checking if all survey questions were answered the same. (Checking for straightlining)
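# Survey-style items, selected by name rather than by column position
# (these are columns 10, 13, 16, 24, and 28 of data2).
survey_cols <- c("EnvironmentSatisfaction", "JobInvolvement", "JobSatisfaction",
"RelationshipSatisfaction", "WorkLifeBalance")
survey_variables <- data2[, survey_cols]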
which(apply(survey_variables, 1, function(x) length(unique(x))==1))
## [1] 134 158 194 241 297 351 432 472 497 502 507 547 688 729 858
## [16] 932 1104 1108 1127 1142 1285
sum(apply(survey_variables, 1, function(x) length(unique(x))==1))
## [1] 21
Overall, I found 21 participants who gave the same response to every survey question. However, I will not remove these participants for the following reasons:
- All of the survey questions should theoretically be highly correlated with each other, meaning it is not unreasonable to see similar answers.
- I do not know the composition of the survey, meaning I do not know if all these questions were next to each other, making it difficult to identify “straightlining.”
- There are too few questions for me to comfortably declare straightlining.
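The first reason rests on an assumption that can be checked against the data. This is a minimal sketch, using randomly generated responses in a hypothetical `toy_survey` data frame rather than the real survey columns. The column subset and sample size are arbitrary.

```r
# Hypothetical stand-in for the survey columns; responses are random draws
# on a 1-4 scale, not real data.
set.seed(1)
toy_survey <- data.frame(
  EnvironmentSatisfaction = sample(1:4, 50, replace = TRUE),
  JobSatisfaction         = sample(1:4, 50, replace = TRUE),
  WorkLifeBalance         = sample(1:4, 50, replace = TRUE)
)

# Spearman rank correlation suits ordinal responses.
survey_cor <- cor(toy_survey, method = "spearman")
print(round(survey_cor, 2))

# Rows where every item received the same response.
same_response <- apply(toy_survey, 1, function(x) length(unique(x)) == 1)
print(sum(same_response))
```

If the items turn out to be only weakly correlated, identical answers across all of them become harder to explain by the constructs alone, which would strengthen the case for a closer look.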
Setting up data to use for EDA
f_data <- data2
I will be using f_data (final data) as my dataset for EDA and inferential analysis.
The purpose of the EDA is to help me understand the data and form my hypotheses. At an initial glance of the variables, employees are divided structurally by job role, department, and job level.
I will initially take a look at the attrition distribution of those three divisions. Then I will investigate if attrition rate differs for varying groups within these divisions. These are:
Summary: After looking at the attrition distribution as a whole and then by job role, department, and job level, here are my insights:
Overall, 16% of the company is leaving. Comparative metrics, such as competitors’ attrition rates, would help put this 16% in context.
Sales representatives had the highest within-role attrition rate at 40%, meaning that 40% of sales representatives left. Laboratory technicians and human resources had the next highest rates at 24% and 23%.
Attrition rates varied less across departments: Sales, HR, and R&D had attrition rates of 21%, 19%, and 14% respectively.
As expected, those in level 1 had the highest within-level attrition rate at 26%. However, there was a slight increase from level 2 to level 3, from 10% to 15%.
Final thoughts: The most concerning irregularities come from the job roles. Sales representatives at 40% attrition is an alarming number. Although less shocking, individuals in level 1 are also leaving at a higher rate of 26%. Identifying the key drivers of attrition, especially for the noted job role and level, will be critical.
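The "within" rates quoted in this summary divide the number of leavers in a group by the size of that group, not by the size of the company. This is a minimal sketch of the calculation on made-up data. `toy` and its values are hypothetical and do not reproduce the dataset's figures.

```r
# Hypothetical data: 5 sales reps (2 left) and 4 managers (1 left).
toy <- data.frame(
  JobRole   = c(rep("Sales Rep", 5), rep("Manager", 4)),
  Attrition = c("Yes", "Yes", "No", "No", "No",
                "No", "No", "No", "Yes")
)

# Cross-tabulate, then convert each row (job role) to proportions.
counts <- table(toy$JobRole, toy$Attrition)
within_rate <- prop.table(counts, margin = 1)[, "Yes"]
print(within_rate)

# Company-wide rate for comparison: leavers divided by all employees.
overall_rate <- mean(toy$Attrition == "Yes")
print(overall_rate)
```

`margin = 1` makes each row sum to 1, which is the same denominator the `group_by()` plus `prop.table()` pipelines in the plots use.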
Overall Attrition Distribution
ggthemr('dust')
f_data %>%
count(Attrition) %>%
mutate(pct = prop.table(n)) %>%
mutate(name = paste(round(pct,2)*100,"%", " (", n, ")", sep = "")) %>%
ggplot(aes(x = Attrition, y = pct, fill = Attrition)) +
geom_col(position = 'dodge', width = .5) +
geom_text(aes(label = name), vjust = -.5) +
scale_y_continuous(labels = scales::percent) +
labs(x = "Attrition", y = "", title = "Company Attrition Distribution", subtitle = "How many people actually left?") +
theme(axis.text.x = element_blank())
Quick Takeaways:
Attrition Within Job Role
f_data %>%
group_by(JobRole) %>%
count(Attrition) %>%
mutate(pct = prop.table(n)) %>%
mutate(name = paste(round(pct,2)*100,"%", sep = "")) %>%
mutate(JobRole = gsub(" ", "\n", JobRole)) %>%
subset(Attrition == "Yes") %>%
ggplot(aes(x = reorder(JobRole, -pct), y = pct)) +
geom_point(size = 5, aes(y = pct)) +
geom_segment(aes(x = JobRole, xend= JobRole, y = 0, yend = pct),
size = 1.2, linetype = 1, alpha = .8, color = "#8d7a64") +
labs(title = "Attrition Percentage Within Each Job Role",
subtitle = "What percentage of people left within each job role?",
x = "Job Roles",
y = "% of Attrition") +
geom_text(aes(label = name, x = JobRole, y= pct), vjust = -1.8) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, .5))
Quick Takeaways:
Attrition Within Each Department
f_data %>%
group_by(Department) %>%
count(Attrition) %>%
mutate(pct = prop.table(n)) %>%
mutate(name = paste(round(pct,2)*100,"%", sep = "")) %>%
mutate(Department = gsub(" ", "\n", Department)) %>%
subset(Attrition == "Yes") %>%
ggplot(aes(x = reorder(Department, -pct), y = pct)) +
geom_point(size = 5, aes(y = pct)) +
geom_segment(aes(x = Department, xend= Department, y = 0, yend = pct),
size = 1.2, linetype = 1, alpha = .8, color = "#8d7a64") +
labs(title = "Attrition Within Each Department",
subtitle = "What percentage of people left within each department?",
x = "Departments",
y = "% of Attrition") +
geom_text(aes(label = name, x = Department, y= pct), vjust = -1.8) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, .5))
Quick Takeaways:
Attrition Within Job Level
f_data %>%
group_by(JobLevel) %>%
count(Attrition) %>%
mutate(pct = prop.table(n)) %>%
mutate(name = paste(round(pct,2)*100,"%", sep = "")) %>%
mutate(JobLevel = paste("Level", JobLevel)) %>%
subset(Attrition == "Yes") %>%
ggplot(aes(x = JobLevel, y = pct)) +
geom_point(size = 6, aes(y = pct)) +
geom_segment(aes(x = JobLevel, xend= JobLevel, y = 0, yend = pct),
size = 1.2, linetype = 1, alpha = .8, color = "#8d7a64") +
labs(title = "Attrition Percentage Within Job Level",
subtitle = "What percentage of people left within each job level?",
x = "Job Levels",
y = "% of Attrition") +
geom_text(aes(label = name, x = JobLevel, y= pct), vjust = -1.8) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, .5))
Quick Takeaways:
Therefore, I will evaluate whether people of different ages and genders are leaving at different rates. I will ask the following:
Summary: After looking at the attrition distribution by age group and then by age group within job roles, departments, and job levels, here are my insights:
The average employee age is 39.
Approximately 55% of all attrition occurred among individuals aged 18-32. This seems to indicate that the company is struggling to retain younger talent more than older talent.
The company lost 36% of employees in the 18-25 bracket and 22% of employees in the 26-32 bracket.
The loss of young talent was especially severe among sales representatives, in the Sales department, and at level 1.
Final thoughts: The most concerning irregularities come from the loss of young talent. There may be cultural or policy issues that are unfavorable to younger individuals. Why younger individuals are leaving will be examined further when looking at income and satisfaction ratings. This high attrition rate may be fine if those who are leaving are lower performers. However, it is a serious issue if high performers are leaving as well.
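The summary uses two different percentages: the share of all leavers who fall in a bracket, and the share of a bracket's employees who left. This is a minimal sketch of the two denominators on made-up data. `toy` and its values are hypothetical.

```r
# Hypothetical data: 4 employees aged 18-25 (2 left), 6 aged 26-32 (1 left).
toy <- data.frame(
  AgeBracket = c(rep("18-25", 4), rep("26-32", 6)),
  Attrition  = c("Yes", "Yes", "No", "No",
                 "Yes", "No", "No", "No", "No", "No")
)

# Share of all leavers by bracket: denominator is the total number of leavers.
leavers <- toy[toy$Attrition == "Yes", ]
share_of_leavers <- prop.table(table(leavers$AgeBracket))
print(share_of_leavers)

# Attrition rate within each bracket: denominator is the bracket's headcount.
within_rate <- tapply(toy$Attrition == "Yes", toy$AgeBracket, mean)
print(within_rate)
```

The first measure is sensitive to bracket size, since a large bracket can supply many leavers at a modest rate. The second is the one that speaks to retention risk for a given age group.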
Age Distribution
ggthemr("dust")
f_data %>%
ggplot(aes(x = Age, fill = Attrition)) +
geom_histogram(binwidth = 2) +
geom_segment(aes(x = mean(Age), y = 0, xend = mean(Age), yend = Inf, linetype = "Mean"), col = "#484848", lwd = 1.2) +
labs(x = "Individual's Age", y = "count", title = "Age Distribution", subtitle = "What is the age distribution of the organization?") +
scale_linetype_manual(name = "Line", values = c("Mean" = 3)) +
guides(fill = guide_legend(order = 1), linetype = guide_legend(order = 2))
Quick Takeaways:
Creating Age Brackets
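# Bin Age into six equal-width brackets (7 years each across the 18-60 range).
# These are equal-width bins, not quantiles, despite the AgeQ name; the earlier
# split_quantile() call was overwritten immediately and has been dropped.
AgeQ <- cut(f_data$Age, breaks = 6,
labels = c("18-25", "26-32", "33-39", "40-46", "47-53", "54-60"))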
table(AgeQ)
## AgeQ
## 18-25 26-32 33-39 40-46 47-53 54-60
## 123 393 432 282 153 87
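Passing a single number to `cut()` leaves the bracket edges implicit. This is a minimal sketch showing the same brackets with explicit edges, assuming ages are whole numbers from 18 to 60 as the labels imply. `ages`, `bracket_labels`, `by_count`, and `by_edges` are hypothetical names.

```r
# Assumption: ages are whole numbers spanning 18 to 60.
ages <- 18:60
bracket_labels <- c("18-25", "26-32", "33-39", "40-46", "47-53", "54-60")

# Equal-width binning by number of bins.
by_count <- cut(ages, breaks = 6, labels = bracket_labels)

# The same brackets with explicit edges; intervals are right-closed, so
# (17, 25] holds ages 18 through 25.
by_edges <- cut(ages, breaks = c(17, 25, 32, 39, 46, 53, 60),
                labels = bracket_labels)

# Compare the two assignments and show how many ages fall in each bracket.
print(all(by_count == by_edges))
print(table(by_edges))
```

Explicit edges keep the brackets stable if the data later gains employees outside the current age range, whereas `breaks = 6` would silently shift every boundary.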
f_data %>%
select(Age, Attrition) %>%
mutate(AgeQ = AgeQ) %>%
filter(Attrition == "Yes") %>%
group_by(AgeQ) %>%
summarise(count = n()) %>%
mutate(pct = prop.table(count),
label = paste0(round(pct*100,0), "%", " (", count, ")")) %>%
ggplot(aes(x = AgeQ, y = pct)) +
geom_bar(stat = "identity") +
geom_text(aes(x = AgeQ, y = pct, label = label),
vjust = -1) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0,.5)) +
labs(title = "Company Attrition Distribution By Age Brackets",
subtitle = "What share of all leavers falls in each age bracket?",
x = "Age Brackets",
y = "")
Quick Takeaways:
f_data %>%
select(Age, Attrition) %>%
mutate(AgeQ = AgeQ) %>%
group_by(AgeQ, Attrition) %>%
summarise(count = n()) %>%
group_by(AgeQ) %>%
mutate(pct = prop.table(count),
label = paste0(round(pct*100,0), "%")) %>%
filter(Attrition == "Yes") %>%
ggplot(aes(x = AgeQ, y = pct)) +
geom_bar(stat = "identity") +
geom_text(aes(x = AgeQ, y = pct, label = label),
vjust = -1) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0,.5)) +
labs(title = "Company Attrition Distribution Within Age Brackets",
subtitle = "How many people left within each age bracket?",
x = "Age Bracket",
y = "")
Quick Takeaways:
f_data %>%
select(Age, JobRole, Department, JobLevel, Attrition) %>%
mutate(AgeQ = AgeQ) %>%
ggplot(aes(x = AgeQ, fill = Attrition)) +
geom_bar() +
facet_wrap(vars(JobRole)) +
labs(title = "Attrition by Age and Job Role", subtitle = "Are there attrition differences between age groups in certain job roles?",
x = "Age Brackets")
Quick Takeaways:
f_data %>%
select(Age, JobRole, Department, JobLevel, Attrition) %>%
mutate(AgeQ = AgeQ) %>%
ggplot(aes(x = AgeQ, fill = Attrition)) +
geom_bar() +
facet_wrap(vars(Department)) +
labs(title = "Attrition by Age and Department", subtitle = "Are there attrition differences between age groups in certain departments?",
x = "Age Bracket"
)
Quick Takeaways:
f_data %>%
select(Age, JobLevel, Attrition) %>%
mutate(AgeQ = AgeQ) %>%
ggplot(aes(x = AgeQ, fill = Attrition)) +
geom_bar() +
facet_wrap(vars(JobLevel)) +
labs(title = "Attrition by Age and Job Level", subtitle = "Are there attrition differences between age groups in certain job levels?")
Quick Takeaways:
Hypotheses:
Summary: After looking at the attrition distribution separated by gender and then gender within job roles, departments, and job levels here are my following insights:
Males within the organization seem to be leaving at a higher rate than females (17% vs. 14.8%).
Overall, attrition rates for the two genders seem to be proportional within each job role, department, and level.
Final thoughts: Although not conclusive, the exploratory analysis does not seem to suggest any gender differences in attrition. However, further analysis may be required to confirm this.
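One possible form of that further analysis is a chi-squared test of independence between gender and attrition. This is a minimal sketch on made-up counts and not the analysis performed in this post. The numbers in `counts` are hypothetical and would be replaced by `table(f_data$Gender, f_data$Attrition)` on the real data.

```r
# Hypothetical counts, made up for illustration only.
counts <- matrix(c(500, 90,
                   730, 150),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender = c("Female", "Male"),
                                 Attrition = c("No", "Yes")))

# Within-gender attrition rates.
print(prop.table(counts, margin = 1)[, "Yes"])

# Chi-squared test of independence; for a 2x2 table chisq.test() applies
# Yates' continuity correction by default.
test_result <- chisq.test(counts)
print(test_result$p.value)
```

A large p-value would be consistent with the exploratory reading that attrition does not differ by gender, though it would not prove the absence of a difference.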
f_data %>%
select(Attrition, Gender) %>%
group_by(Attrition, Gender) %>%
summarise(Count = n()) %>%
group_by(Gender) %>%
arrange(Count) %>%
mutate(Count_total = cumsum(Count),
CountPer = prop.table(Count),
final = paste(Count, " (", round(CountPer*100,1), "%)", sep = ""),
Totals = paste("Total:", sum(Count)),
Top = sum(Count)) %>%
ggplot(aes(x = Gender, y = Count, fill = Attrition)) +
geom_bar(stat = "identity") +
geom_text(aes(label = final, x = Gender, y = Count_total), vjust = 1.6, color = "white",
fontface = "bold") +
geom_text(aes(label = Totals, x = Gender, y = Top), vjust = -.6) +
labs(title = "Total Attrition by Gender", subtitle = "Was attrition more frequent within certain gender types?")
Quick Takeaways:
f_data %>%
ggplot(aes(x = Gender, fill = Attrition)) +
geom_bar() +
facet_wrap(vars(JobRole)) +
labs(title = "Attrition by Gender & Job Role", subtitle = "Are there attrition differences between genders in certain job roles?")
Quick Takeaways:
f_data %>%
ggplot(aes(x = Gender, fill = Attrition)) +
geom_bar() +
facet_wrap(vars(Department)) +
labs(title = "Attrition by Gender & Department", subtitle = "Are there attrition differences between genders in certain departments?")
Quick Takeaways:
- Overall, attrition is relatively proportional between genders across all departments.
f_data %>%
ggplot(aes(x = Gender, fill = Attrition)) +
geom_bar() +
facet_wrap(vars(JobLevel)) +
labs(title = "Attrition by Gender & Job Level", subtitle = "Are there attrition differences between genders in certain job levels?")
Quick Takeaways:
#### Potential Factor 2: Income Differences
Hypotheses to Test:
Summary:
Final thoughts: The analysis has shown that income seems to be a strong driver of attrition, especially for low-paying job roles. Sales representatives, who had the highest attrition rate, were among the lowest paid. The analysis has also shown that income disparities between peers are most likely not a primary reason for attrition. It may be worth evaluating whether the income levels of job roles such as sales representatives, research scientists, and laboratory technicians should be readjusted based on industry and location standards.
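One way to put a number on "income disparities between peers" is to express each person's income relative to the median of their job role. This is a minimal sketch on made-up data. `toy`, `RoleMedian`, and `RelIncome` are hypothetical names, and this is not the method behind the plots in this post.

```r
# Hypothetical data, made up for illustration.
toy <- data.frame(
  JobRole       = c("A", "A", "A", "B", "B", "B"),
  MonthlyIncome = c(2000, 3000, 4000, 9000, 10000, 11000),
  Attrition     = c("Yes", "No", "No", "No", "Yes", "No")
)

# Median income of each person's job role, repeated on every row of that role.
toy$RoleMedian <- ave(toy$MonthlyIncome, toy$JobRole, FUN = median)

# Income relative to peers: 1 means exactly at the role median.
toy$RelIncome <- toy$MonthlyIncome / toy$RoleMedian

# Compare the typical relative income of leavers and stayers.
print(aggregate(RelIncome ~ Attrition, data = toy, FUN = median))
```

If leavers and stayers sit at similar relative incomes, pay gaps within a role are unlikely to be the main driver, even when low-paying roles as a whole lose more people.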
Monthly Income and Attrition
ggthemr('dust')
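# One median per attrition group. summarise() returns one row per group, so
# each label is drawn once rather than once per employee as with mutate().
medianvalues <- f_data %>%
group_by(Attrition) %>%
summarise(medians = median(MonthlyIncome))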
f_data %>%
ggplot(aes(x = Attrition, y = MonthlyIncome)) +
geom_boxplot(aes(fill = Attrition)) +
geom_text(data = medianvalues, aes(x = Attrition, y = medians, label = medians),
color = "white", fontface = "bold",
vjust = -1) +
labs( x = "Attrition", y = "Monthly Income", title = "Attrition by Income", subtitle = "Can making less money lead to attrition?")
Quick Takeaways:
ggthemr_reset()
f_data %>%
ggplot(aes(x = Age , y = MonthlyIncome, col = JobRole)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Income by Job Role", subtitle = "How much money are people making within each job role?") +
scale_y_continuous(breaks = seq(0, 20000, 5000),
labels = paste0(as.character(seq(0,20,5)),"K")) +
facet_wrap(vars(JobRole)) +
theme_minimal() +
theme(text = element_text( color = '#5b4f41'),
plot.title = element_text(size = 16, face = "bold"),
panel.background = element_rect(fill = "#FAF7F2"),
panel.grid.major = element_line(colour = "#E3DDCC"))
Quick Takeaways:
Income Differences by Job Role and Attrition
ggthemr('dust')
f_data %>%
mutate(JobRole = gsub(" ", "\n", JobRole)) %>%
ggplot(aes(x = JobRole, y = MonthlyIncome, fill = Attrition)) +
geom_boxplot() +
theme(legend.position = "top") +
labs( x = "Job Role", y = "Monthly Income", title = "Income by Job Role and Attrition", subtitle = "Are people leaving because they are making less than their coworkers in the same role?")
Quick Takeaways:
Income Differences by Department and Attrition
ggthemr('dust')
f_data %>%
ggplot(aes(x = Department, y = MonthlyIncome, fill = Attrition)) +
geom_boxplot() +
theme(legend.position = "top") +
labs( x = "Department", y = "Monthly Income", title = "Income by Department", subtitle = "Can making less money among peers within departments lead to attrition?" )
Quick Takeaways:
Income Differences by Job Level and Attrition
ggthemr('dust')
f_data %>%
ggplot(aes(x = as.factor(JobLevel), y = MonthlyIncome, fill = Attrition)) +
geom_boxplot() +
theme(legend.position = "top") +
labs( x = "Job Level", y = "Monthly Income", title = "Income by Job Level and Attrition", subtitle = "Can making less money among peers within the same job level lead to attrition?" )
Quick Takeaways:
Income Differences by Job Satisfaction and Attrition
ggthemr_reset()
f_data %>%
group_by(JobSatisfaction, Attrition) %>%
summarise(AvgInc = mean(MonthlyIncome)) %>%
ggplot(aes(x = JobSatisfaction, y = AvgInc)) +
geom_point(size = 5, aes(y = AvgInc, color = Attrition)) +
geom_segment(aes(x = JobSatisfaction, xend = JobSatisfaction, y = 0,
yend = AvgInc, color = Attrition), size = 1.2, linetype = 1, alpha = .8) +
facet_wrap(vars(Attrition))+
scale_y_continuous(limits = c(0,8000)) +
labs(x = "Job Satisfaction Levels", y = "Average Income", title = "Average Income by Satisfaction Level and Attrition",
subtitle = "Are people leaving because they are unsatisfied due to their low income?") +
geom_text(aes(x = JobSatisfaction, y = AvgInc,
label = round(AvgInc,0)), vjust = -1) +
theme(text = element_text( color = '#5b4f41'),
plot.title = element_text(size = 16, face = "bold"),
panel.background = element_rect(fill = "#FAF7F2"),
panel.grid.major = element_line(colour = "#E3DDCC"),
strip.background = element_blank())+
scale_color_brewer(palette = "Set2", name = "Attrition")
Quick Takeaways:
#### Potential Factor 3: Satisfaction Variables
Hypotheses to Test:
Summary:
Attrition by Job Satisfaction by roles
ggthemr("fresh")
f_data %>%
select(JobRole, JobSatisfaction, Attrition) %>%
mutate(JobRole = gsub(" ", "\n", JobRole)) %>%
group_by(JobRole, Attrition) %>%
summarise(avg = mean(JobSatisfaction)) %>%
ggplot(aes(x = JobRole, y = avg, fill = Attrition)) +
geom_bar(stat = "identity", position = "dodge", width = .5) +
scale_y_continuous(limits = c(0,4)) +
coord_flip() +
labs(title = "Job Satisfaction by Job Role and Attrition",
x = " Job Roles ", y = "Avg Job Satisfaction",
subtitle = "Does having lower job satisfaction cause attrition within job roles?")
Quick Takeaways:
Attrition by Environmental Satisfaction
ggthemr("fresh")
f_data %>%
select(JobRole, EnvironmentSatisfaction, Attrition) %>%
mutate(JobRole = gsub(" ", "\n", JobRole)) %>%
group_by(JobRole, Attrition) %>%
summarise(avg = mean(EnvironmentSatisfaction)) %>%
ggplot(aes(x = JobRole, y = avg, fill = Attrition)) +
geom_bar(stat = "identity", position = "dodge", width = .5) +
scale_y_continuous(limits = c(0,4)) +
coord_flip() +
labs(title = "Environment Satisfaction by Job Role and Attrition",
x = " Job Roles ", y = "Avg Environment Satisfaction",
subtitle = "Does having lower Environment Satisfaction cause attrition within job roles?")