SQL Final Project

Introduction

This report presents an exploratory analysis of a dataset on employees and their tenure with the company, followed by a simulation of how hypothetical retention policies would affect churn. The main goal is to address employee churn: to understand the current situation among existing employees and to suggest data-driven policies that improve worker loyalty (i.e. reduce the number of people leaving their jobs).

Short Summary of the main findings

Exploratory data analysis shows that gender is distributed almost equally among workers in general and among churners as well. On average, people work for the company for about 13 years. Most workers come from Vancouver, Victoria and Nanaimo in British Columbia. The average age of all employees is about 45, and the most “vulnerable” category is people aged 25 to 45, who have the highest probability of churning, especially during their first year with the company. The highest churn rates are observed for the occupations “Dairy Person” and “Produce Clerk”. In addition, people from Kelowna are more likely to churn than employees from other cities.

For the modelling part, only people aged 25 to 45 were selected, and among them only those who left their job voluntarily by resignation or who are still working. A further restriction was placed on the city and job title of terminated employees: only the top-5 cities and top-5 job titles with the highest churn rates were kept in the final subset. Length of service turned out to be the most important predictor of churn; gender, age, job title and city were also taken into account. The initial predicted proportion of churners was about 4.4%.

Further analysis revealed that, in the selected subset, people are most likely to churn during their 1st, 2nd and 3rd year with the company. With this in mind, three scenarios were simulated in which the company manages to retain all the workers who would otherwise have churned.

First, a scenario was simulated in which the company retains all the workers who wanted to churn during their first year. The proportion of churners decreased by 1 percentage point, which is equivalent to a 25% decrease in raw numbers.

Second, a scenario was simulated in which the company retains all the workers during their 1st and 2nd year. The proportion of churners went down by 1.5 percentage points, a 35% decrease in raw numbers.

Third, a scenario was simulated in which the company retains all the workers during their 1st, 2nd and 3rd year. The proportion of churners fell by 2.9 percentage points, a 67% decrease in the actual number of churners.

Exploratory Data Analysis

Exploring socio-demographic and personal characteristics of existing workers

tab1 = dbGetQuery(con, "SELECT age, status
                        FROM info")

a <- ggplot(tab1) +
     geom_histogram(aes(x = age), binwidth = 2, alpha = 0.5, 
     fill = "lightblue", color = "black")+
     labs(title = "Employees' age distribution
            ", x = "", y = "Number of respondents") +
     scale_x_continuous(breaks = 0:65*5)+
     scale_y_continuous(breaks = 0:700*100)+
     ggplot2::annotate(x= 35, y=500, label=paste("Mean age = ", 
     round(mean(tab1$age),1)),geom="text", size= 3.5)+
     ggplot2::annotate(x= 35, y=400, label=paste("Median age = ", 
     round(median(tab1$age),1)),geom="text", size= 3.5)+
     theme_minimal()+
     theme(text = element_text(size = 10),
     plot.title = element_text(hjust = 0.5), title = element_text(size = 8))


tab2 = dbGetQuery(con, "SELECT city, COUNT(*) AS n
                        FROM cities JOIN info ON cities.cityID = info.cityID
                        GROUP BY city
                        ORDER by n DESC
                        LIMIT 10")

tab2$city = as.factor(tab2$city)


b <- ggplot(tab2, aes(x = reorder(city, n), y = n)) + 
     geom_bar(color = "black", stat = "identity", alpha = 0.5, fill = "lightblue") +
     geom_text(aes(x = city, y = n, label = paste("n = ", n)), 
     size = 2.5, hjust = -0.1)+
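     # set breaks and limits in one scale; a separate ylim() call would replace this scale
     scale_y_continuous(breaks = seq(0, 1500, 250), limits = c(0, 1700)) +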
     labs(title = "Top 10 cities occupied by employees
               ", y  = "Number of employees", x = "") +
     theme_minimal() +
     coord_flip()+
     theme(text = element_text(size = 11),
     plot.title = element_text(hjust = 0.5), title = element_text(size = 8))


tab3 = dbGetQuery(con, "SELECT gender, status
                        FROM info")
tab3$gender = as.factor(tab3$gender)


c <- ggplot(tab3, aes(x = gender, y = ..count../sum(..count..))) + 
     geom_bar(color = "black", alpha = 0.5, fill = "lightblue") +
     geom_text(aes(label = percent(..count../sum(..count..))), 
     size = 4, stat= "count", position = position_stack(vjust = 0.5)) +
     scale_y_continuous(labels = percent) +
     labs(title = "Employees gender distribution
               ", y  = "", x = "") +
     theme_minimal() +
     theme(text = element_text(size = 12),
     plot.title = element_text(hjust = 0.5), title = element_text(size = 8))


tab4 = dbGetQuery(con, "SELECT length_of_service, status
                        FROM info")


d <- ggplot(tab4) +
     geom_histogram(aes(x = length_of_service), binwidth = 2, alpha = 0.5, 
     fill = "lightblue", color = "black")+
     labs(title = "Employees' length of service distribution
            ", x = "", y = "Number of respondents") +
     scale_x_continuous(breaks = 0:26*2)+
     #scale_y_continuous(breaks = 0:3500*500)+
     ggplot2::annotate(x= 21, y=800, label=paste("Mean length of service = ", 
     round(mean(tab4$length_of_service),1)),geom="text", size= 2.5)+
     ggplot2::annotate(x= 21, y=700, label=paste("Median length of service = ", 
     round(median(tab4$length_of_service),1)),geom="text", size= 2.5)+
     theme_minimal()+
     theme(text = element_text(size = 10),
     plot.title = element_text(hjust = 0.5), title = element_text(size = 8))


grid.arrange(a,c,d,b, nrow = 2)

General employee overview

As the dashboard shows, an average employee stays with the company for about 13 years, and an impressive number of people stay even longer. Gender is almost equally distributed, with women slightly outnumbering men. As for age, the company represents different age groups well, although there is a noticeably large number of people over 60. Finally, the majority of employees come from Vancouver, British Columbia. Still, quite a lot of people come from Victoria and Nanaimo (Canada) as well.

Exploring characteristics of terminated workers VS remaining workers

For a meaningful analysis, I selected only those observations where an employee left the job voluntarily or is still working. Workers who were fired are excluded because there is no reason to study them here, and retirement cases are excluded as well.
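
Because the same filter recurs in every query that follows, it may help to keep it in one place. The sketch below only builds the query text; `voluntary_filter` and `build_query` are hypothetical names introduced for illustration, and the column names and the value 'Resignaton' are copied as they appear in the queries of this report.

```r
# Hypothetical helper: the shared filter kept as a single SQL fragment.
voluntary_filter <- paste(
  "termtype_desc IN ('Not Applicable', 'Voluntary')",
  "AND termreason_desc IN ('Not Applicable', 'Resignaton')"
)

# Hypothetical helper: append the shared filter (and optional extra
# conditions) to a "SELECT ... FROM ..." statement.
build_query <- function(select_from, extra = NULL) {
  conditions <- c(extra, voluntary_filter)
  paste(select_from, "WHERE", paste(conditions, collapse = " AND "))
}

q <- build_query("SELECT age, status FROM info")
cat(q, "\n")
```

The resulting string could then be passed to dbGetQuery(con, q) in the same way as the queries in this report.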

# find the job titles with the highest number of voluntary resignations

tab5 = dbGetQuery(con, "SELECT job_title,COUNT(*) as n
                        FROM info
                        WHERE status = 'terminated'
                        AND termtype_desc IN ('Not Applicable', 'Voluntary') 
                        AND termreason_desc IN ('Not Applicable', 'Resignaton')
                        GROUP BY job_title
                        ORDER BY n DESC
                        LIMIT 5")


tab6 = dbGetQuery(con, "SELECT job_title, status
                        FROM info
                        WHERE job_title IN ('Meat Cutter', 'Cashier', 
                        'Dairy Person', 'Produce Clerk', 'Baker') 
                        AND termtype_desc IN ('Not Applicable', 'Voluntary') 
                        AND termreason_desc IN ('Not Applicable', 'Resignaton')")


tab6$job_title = as.factor(tab6$job_title)
tab6$status = as.factor(tab6$status)

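# count employees per job title and status; after summarise() the data stay
# grouped by job_title, so perc is the share within each job title
tab6_perc = tab6 %>% 
      group_by(job_title, status) %>% 
      summarise(count = n()) %>% 
      mutate(perc = round((count/sum(count)*100),1)) %>% 
      arrange(-count)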

tab6_perc$job_title <- ordered(tab6_perc$job_title, 
            levels = c("Cashier","Meat Cutter","Baker","Produce Clerk","Dairy Person"))



gr1 <-ggplot(data = na.omit(subset(tab6_perc, select = c(job_title,status,count, perc))), 
       aes(x = job_title, y = count, fill = status)) + 
       geom_bar(stat = "identity",color = "black", alpha = 0.8, position = "fill") + 
       geom_text(aes(label = ifelse(status == "terminated",perc, "          %")), y = 0.07, size = 3) +
       geom_text(aes(label = ifelse(status == "active", perc, "           %")), 
       y = 0.8, size = 3) +
       scale_y_continuous(labels = scales::percent)+
       labs(title = "Relative proportion of employees churn\nby top-5 popular jobs
            ", y  = "", x = "") +
       theme_minimal() +
       scale_fill_brewer(palette = "Pastel2")+
       guides(fill = "none")+
       theme(text = element_text(size = 12), axis.text.x = element_text(angle = 20),
       plot.title = element_text(hjust = 0.5), title = element_text(size = 8))



tab1_1 = dbGetQuery(con, "SELECT age, status
                        FROM info
                        WHERE termtype_desc IN ('Not Applicable', 'Voluntary') 
                        AND termreason_desc IN ('Not Applicable', 'Resignaton')")

tab1_1$age_cut = cut(tab1_1$age, breaks = c(15,25,35,45,55,65))
ag = tab1_1 %>%  
      group_by(age_cut, status) %>% 
      summarise(count = n()) %>% mutate(perc = round((count/sum(count)*100),1)) %>% na.omit()
            
gr2 <- ggplot(data = na.omit(subset(ag, select = c(age_cut, status, count, perc))), 
       aes(x = age_cut, y = count, fill = status)) + 
       geom_bar(stat = "identity", color = "black", alpha = 0.8, position = "fill") + 
       geom_text(aes(label = ifelse(status == "terminated", perc, "          %")), y = 0.1, size = 3) +
       geom_text(aes(label = ifelse(status == "active", perc, "           %")), y = 0.8, size = 3) +
       scale_y_continuous(labels = scales::percent)+
       labs(title = "Relative proportion of employees churn\nby age groups
            ", y  = "", x = "Age group") +
       theme_minimal() +
       guides(fill = "none")+
       scale_fill_brewer(palette = "Pastel2")+
       theme(text = element_text(size = 10), axis.text.x = element_text(angle = 20),
       plot.title = element_text(hjust = 0.5), title = element_text(size = 8))

grid.arrange(gr1,gr2, nrow = 1)

As the graphs show, among the top-5 jobs where employee churn is the highest, “Dairy Person” is the position employees are most likely to abandon. In addition, people aged 25 to 45 are the most likely to leave their jobs voluntarily. So, middle-aged people are the most vulnerable age category in the company.
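
The percentages on the bars are shares within each group, so every bar sums to 100%. As a minimal sketch of that calculation in base R, the example below uses a small made-up table; the values are hypothetical and are not taken from the company data.

```r
# Toy data (hypothetical values, not taken from the company data set).
toy <- data.frame(
  job_title = c("Cashier", "Cashier", "Cashier", "Cashier", "Baker", "Baker"),
  status    = c("active", "active", "active", "terminated", "active", "terminated")
)

# Count employees per job title and status.
counts <- table(toy$job_title, toy$status)

# Row-wise shares: each job title sums to 100%.
shares <- round(prop.table(counts, margin = 1) * 100, 1)
print(shares)
```

Comparing shares rather than counts is what makes a small job title comparable with a large one.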

tab4_1 = dbGetQuery(con, "SELECT length_of_service, status
                        FROM info
                        WHERE termtype_desc IN ('Not Applicable', 'Voluntary') 
                        AND termreason_desc IN ('Not Applicable', 'Resignaton')")

tab4_1$length_cut = cut(tab4_1$length_of_service, breaks = c(0,3,5,10,15,20,26))

ag1 = tab4_1 %>%  
      group_by(length_cut, status) %>% 
      summarise(count = n()) %>% mutate(perc = round((count/sum(count)*100),1)) %>% na.omit()
            
gr3 <- ggplot(data = na.omit(subset(ag1, select = c(length_cut, status, count, perc))), 
       aes(x = length_cut, y = count, fill = status)) + 
       geom_bar(stat = "identity", color = "black", alpha = 0.8, position = "fill") + 
       geom_text(aes(label = ifelse(status == "terminated", perc, "          %")), y = 0.1, size = 3) +
       geom_text(aes(label = ifelse(status == "active", perc, "           %")), y = 0.8, size = 3) +
       scale_y_continuous(labels = scales::percent)+
       labs(title = "Relative proportion of employees churn\nby length of service
            ", y  = "", x = "Length of service") +
       theme_minimal() +
       guides(fill = "none")+
       scale_fill_brewer(palette = "Pastel2")+
       theme(text = element_text(size = 10), #axis.text.x = element_text(angle = 20),
       plot.title = element_text(hjust = 0.5), title = element_text(size = 8))



# find the cities with the highest number of voluntary resignations

city = dbGetQuery(con, "SELECT city ,COUNT(*) as n
                        FROM cities JOIN info ON cities.cityID = info.cityID
                        WHERE status = 'terminated'
                        AND termtype_desc IN ('Not Applicable', 'Voluntary') 
                        AND termreason_desc IN ('Not Applicable', 'Resignaton')
                        GROUP BY city
                        ORDER BY n DESC
                        LIMIT 5")

tab7 = dbGetQuery(con, "SELECT city, status
                        FROM cities JOIN info ON cities.cityID = info.cityID
                        WHERE city IN ('Vancouver', 'Victoria', 'Nanaimo', 'New Westminster', 'Kelowna')                         
                        AND termtype_desc IN ('Not Applicable', 'Voluntary') 
                        AND termreason_desc IN ('Not Applicable', 'Resignaton')")

tab7$city = as.factor(tab7$city)
tab7$status = as.factor(tab7$status)

# perc is the share of each status within a city
tab7_perc = tab7 %>% 
      group_by(city, status) %>% 
      summarise(count = n()) %>% 
      mutate(perc = round((count/sum(count)*100),1))

tab7_perc$city <- ordered(tab7_perc$city, 
         levels = c("Nanaimo","Victoria","Vancouver","New Westminster","Kelowna"))



gr4 <-ggplot(data = na.omit(subset(tab7_perc, select = c(city,status,count, perc))), 
       aes(x = city, y = count, fill = status)) + 
       geom_bar(stat = "identity",color = "black", alpha = 0.8, position = "fill") + 
       geom_text(aes(label = ifelse(status == "terminated",perc, "          %")), y = 0.1, size = 3) +
       geom_text(aes(label = ifelse(status == "active", perc, "           %")), 
       y = 0.8, size = 3) +
       scale_y_continuous(labels = scales::percent)+
       labs(title = "Relative proportion of employees churn\nfrom top-5 popular cities
            ", y  = "", x = "") +
       theme_minimal() +
       scale_fill_brewer(palette = "Pastel2")+
       guides(fill = "none")+
       theme(text = element_text(size = 12), axis.text.x = element_text(angle = 20),
       plot.title = element_text(hjust = 0.5), title = element_text(size = 8))

grid.arrange(gr3, gr4, nrow = 1)

Further exploration reveals that, in general, people are most likely to churn after spending 3 to 5 years with the company. Besides, the highest churn rates are observed among people from Kelowna and New Westminster.
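
When reading the length-of-service groups, note that cut() builds right-closed intervals by default, so a label such as (3,5] means more than 3 and up to 5 years. The sketch below, on made-up values, also shows that a value equal to the lowest break falls outside the first interval unless include.lowest = TRUE is set.

```r
# Hypothetical lengths of service, in years.
service <- c(0, 1, 3, 4, 5)

# Default: right-closed intervals, so 0 lies outside (0,3] and becomes NA.
default_bins <- cut(service, breaks = c(0, 3, 5))

# include.lowest = TRUE closes the first interval on the left: [0,3].
closed_bins <- cut(service, breaks = c(0, 3, 5), include.lowest = TRUE)

print(data.frame(service, default_bins, closed_bins))
```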

tab8 = dbGetQuery(con, "SELECT department, COUNT(*) AS n
                        FROM departments JOIN info ON departments.departmentID = info.departmentID
                        GROUP BY department
                        ORDER BY n DESC
                        LIMIT 5")

tab9 = dbGetQuery(con, "SELECT department, status
                        FROM departments JOIN info ON departments.departmentID = info.departmentID
                        WHERE department IN ('Meats', 'Customer Service', 'Produce', 'Dairy', 'Bakery')
                        AND termtype_desc IN ('Not Applicable', 'Voluntary') 
                        AND termreason_desc IN ('Not Applicable', 'Resignaton')")

tab9$department = as.factor(tab9$department)
tab9$status = as.factor(tab9$status)

# perc is the share of each status within a department
tab9_perc = tab9 %>% 
      group_by(department, status) %>% 
      summarise(count = n()) %>% 
      mutate(perc = round((count/sum(count)*100),1))

tab9_perc$department <- ordered(tab9_perc$department, 
         levels = c("Customer Service","Meats","Bakery","Produce","Dairy"))



gr5 <-ggplot(data = na.omit(subset(tab9_perc, select = c(department,status,count, perc))), 
       aes(x = department, y = count, fill = status)) + 
       geom_bar(stat = "identity",color = "black", alpha = 0.8, position = "fill") + 
       geom_text(aes(label = ifelse(status == "terminated",perc, "          %")), y = 0.07, size = 3) +
       geom_text(aes(label = ifelse(status == "active", perc, "           %")), 
       y = 0.8, size = 3) +
       scale_y_continuous(labels = scales::percent)+
       labs(title = "Relative proportion of employees churn\nby top-5 popular departments
            ", y  = "", x = "") +
       theme_minimal() +
       scale_fill_brewer(palette = "Pastel2")+
       guides(fill = "none")+
       theme(text = element_text(size = 12), axis.text.x = element_text(angle = 20),
       plot.title = element_text(hjust = 0.5), title = element_text(size = 8))



tab3_1 = dbGetQuery(con, "SELECT gender, status
                        FROM info
                        WHERE termtype_desc IN ('Not Applicable', 'Voluntary') 
                        AND termreason_desc IN ('Not Applicable', 'Resignaton')")

tab3_1$gender = as.factor(tab3_1$gender)
tab3_1$status = as.factor(tab3_1$status)

ag3 = tab3_1 %>%  
      group_by(gender, status) %>% 
      summarise(count = n()) %>% mutate(perc = round((count/sum(count)*100),1))



gr6 <-ggplot(data = na.omit(subset(ag3, select = c(gender,status,count,perc))), 
       aes(x = gender, y = count, fill = status)) + 
       geom_bar(stat = "identity",color = "black", alpha = 0.8, position = "fill") + 
       geom_text(aes(label = ifelse(status == "terminated",perc, "          %")), y = 0.07, size = 3) +
       geom_text(aes(label = ifelse(status == "active", perc, "           %")), 
       y = 0.8, size = 3) +
       scale_y_continuous(labels = scales::percent)+
       labs(title = "Relative proportion of employees churn\nby gender
            ", y  = "", x = "") +
       theme_minimal() +
       scale_fill_brewer(palette = "Pastel2")+
       guides(fill = "none")+
       theme(text = element_text(size = 12),
       plot.title = element_text(hjust = 0.5), title = element_text(size = 8))

grid.arrange(gr5, gr6, nrow = 1)

Finally, as already shown above, the Dairy department is the one most frequently abandoned by employees. As for gender differences, men and women are almost equally likely to quit their jobs at the company, with women doing so slightly more often than men.

Modelling for employee churn prediction

Taking the EDA results into account, for further analysis I took a subset of people aged 25 to 45, including only those who left their job voluntarily by resignation or who are still working. A further restriction was placed on the city and job title of terminated employees: I took the top-5 cities and top-5 job titles where the churn rates were the highest.

For modelling I built a decision tree with gender, age, city, job_title and length_of_service as predictors. Model accuracy turned out to be about 97%, which is high.
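
Since churners make up only a few percent of this subset, it can be useful to read accuracy next to the share of the majority class and the share of churners actually detected. The sketch below uses made-up labels to show how these figures are computed; the vectors are hypothetical and are not the predictions of the model in this report.

```r
# Hypothetical labels for 100 test employees: 96 active, 4 terminated.
actual <- factor(c(rep("active", 96), rep("terminated", 4)),
                 levels = c("active", "terminated"))

# Hypothetical predictions: one churner caught, three missed, no false alarms.
predicted <- factor(c(rep("active", 96), "terminated", rep("active", 3)),
                    levels = c("active", "terminated"))

confusion <- table(predicted, actual)
accuracy  <- sum(diag(confusion)) / sum(confusion)

# Baseline: always predict the majority class.
baseline <- max(table(actual)) / length(actual)

# Share of actual churners that the predictions identify (recall).
recall <- confusion["terminated", "terminated"] / sum(actual == "terminated")

print(c(accuracy = accuracy, baseline = baseline, recall = recall))
```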

library(vip)
vip(treemodel, scale = TRUE) + 
  labs(y = "Relative importance", title = "Relative importance of predictors in Decision Tree model")+
  theme_minimal()

The most important factor is an employee's length of service with the company, and in theory the company can influence this variable (unlike age!). Since I restricted age to 25 to 45, the period in which people most often leave their jobs, length of service in this subsample varies from 0 to 15 years. So, let's see at what length of service employees become less likely to churn.

Hypothetical actions to take and their effects

data.model_test = data.model
# one-year intervals (0,1], (1,2], ..., (14,15]; a value of exactly 0 falls outside and becomes NA
data.model_test$length_cut = cut(data.model_test$length_of_service, breaks = 0:15)

churn = data.model_test %>%  
      group_by(length_cut, status) %>% 
      summarise(count = n()) %>% mutate(perc = round((count/sum(count)*100),1)) %>% na.omit()
            
ggplot(data = na.omit(subset(churn, select = c(length_cut, status, count, perc))), 
       aes(x = length_cut, y = count, fill = status)) + 
       geom_bar(stat = "identity", color = "black", alpha = 0.8, position = "fill") + 
       geom_text(aes(label = ifelse(status == "terminated", perc, "          %")), y = 0.1, size = 3) +
       geom_text(aes(label = ifelse(status == "active", perc, "           %")), y = 0.8, size = 3) +
       scale_y_continuous(labels = scales::percent)+
       labs(title = "Relative proportion of employees churn\nby length of service in new subset
            ", y  = "", x = "Length of service") +
       theme_minimal() +
       guides(fill = "none")+
       scale_fill_brewer(palette = "Pastel2")+
       theme(text = element_text(size = 10), #axis.text.x = element_text(angle = 20),
       plot.title = element_text(hjust = 0.5), title = element_text(size = 10))

The graph shows that people from our subset (middle-aged, resigned voluntarily, from the top-5 cities and top-5 professions) are most likely to quit in their 1st year at work. More than half of the employees at this stage churn! So, the first year is crucial, and it is the first peak that has to be overcome to decrease the probability of churn. The second year is also dangerous for the employer, as 40% of employees from the specified subsample churn during this time. About 20% churn during the third year, and after 3+ years of service the probability of churn gradually decreases. However, it is important to notice the so-called 4th peak at 4-5 years of service, where the churn rate rises unexpectedly.

In short, the first peak of churning falls in the 1st year of service, the second in the 2nd year and the third in the 3rd. After these critical periods, people become less and less likely to churn.

So, practically speaking, the company should put extra effort into helping an employee get through at least 3 years of service, to avoid the consequences of the first 3 peaks of churning. After that the company may relax a little, BUT during this crucial period (years 1, 2 and 3) it is very important to encourage employees to keep working and to make them as happy as possible.

To put these assumptions into action, I will alter the data so that every employee who churned in the first year is assigned the “active” status, as if they had not churned. In other words, this is what would happen if the company really got the message and made an extreme effort to prevent churning during the first year. I will apply the same trick to the 2nd and 3rd year. Let's see what happens!
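
The relabelling idea can be sketched on toy data by selecting churners through a condition on length of service rather than by row number. In this sketch `retain_until` is a hypothetical helper, the values are made up, and "churned within the first N years" is assumed to mean a length of service of at most N.

```r
# Toy data (hypothetical values): years of service and current status.
toy <- data.frame(
  length_of_service = c(1, 1, 2, 3, 5, 8),
  status = factor(c("terminated", "active", "terminated",
                    "terminated", "active", "terminated"),
                  levels = c("active", "terminated"))
)

# Hypothetical helper: pretend nobody churned within the first `years` years.
retain_until <- function(df, years) {
  out <- df
  early <- out$status == "terminated" & out$length_of_service <= years
  out$status[early] <- "active"
  out
}

for (years in 1:3) {
  altered <- retain_until(toy, years)
  cat(years, "year(s) retained:",
      sum(altered$status == "terminated"), "churners left\n")
}
```

The original data frame is left untouched, so each scenario starts from the same baseline.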

Initial predicted churn distribution on test sample

data.model_test = data.model

ggplot(as.data.frame(predTest), aes(x = predTest, y = ..count../sum(..count..))) + 
     geom_bar(color = "black", alpha = 0.5, fill = "lightblue") +
     geom_text(aes(label = percent(..count../sum(..count..))), 
     size = 4, stat= "count", vjust = -0.5) +
     scale_y_continuous(labels = percent) +
     labs(title = "Employees churn proportions in new subset
               ", y  = "", x = "") +
     theme_minimal() +
     theme(text = element_text(size = 12),
     plot.title = element_text(hjust = 0.5), title = element_text(size = 10))

New prediction as if nobody churned during the 1st year of service

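# row ids make it possible to address individual employees in the altered copy
data.model_test$id = 1:nrow(data.model_test)

# relabel the employees in rows 618, 626 and 642 as "active"
data.model_test$status[data.model_test$id %in% c(618, 626, 642)] = "active"

set.seed(8)
ind = createDataPartition(data.model_test$status, p = 0.8, list = FALSE)
# draw train and test from the altered copy (not from data.model) so the new
# labels are used, and drop the helper id column so it is not a predictor
train = subset(data.model_test[ind,], select = -id)
test = subset(data.model_test[-ind,], select = -id)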



treemodel = ctree(status~., data = train)

predTest = predict(treemodel, test) 

ggplot(as.data.frame(predTest), aes(x = predTest, y = ..count../sum(..count..))) + 
     geom_bar(color = "black", alpha = 0.5, fill = "lightblue") +
     geom_text(aes(label = percent(..count../sum(..count..))), 
     size = 4, stat= "count", vjust = -0.5) +
     scale_y_continuous(labels = percent) +
     labs(title = "Employees churn proportion in new subset (1 year retention)
               ", y  = "", x = "") +
     theme_minimal() +
     theme(text = element_text(size = 12),
     plot.title = element_text(hjust = 0.5), title = element_text(size = 10))

As the new ratio shows, the relative proportion of terminated workers is now predicted to be 3.4%, whereas the initial prediction was almost 4.4%. In other words, if the company managed to retain all the employees who wanted to churn during their first year, the proportion of churners in the specified subsample would fall by about 1 percentage point.

It is also important to note that, in raw numbers, retaining all the employees who were willing to churn during the first year would result in a decrease of nearly 25% in the number of people quitting their jobs.
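
The two ways of expressing the change (percentage points versus a relative decrease) can be told apart with a short calculation. The sketch below uses the rounded shares quoted above, so the relative figure it prints is only an approximation of the value reported in the text.

```r
initial_share <- 4.4  # predicted share of churners before the change, in %
new_share     <- 3.4  # predicted share after first-year retention, in %

# Absolute change, in percentage points.
point_drop <- initial_share - new_share

# Relative change, as a percentage of the initial share.
relative_drop <- point_drop / initial_share * 100

cat("Drop:", round(point_drop, 1), "percentage points,",
    round(relative_drop, 1), "% relative\n")
```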

New prediction as if nobody churned during the 1st and 2nd year of service

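# relabel the employees in rows 454 and 604 as "active"
data.model_test$status[data.model_test$id %in% c(454, 604)] = "active"

set.seed(10)
ind = createDataPartition(data.model_test$status, p = 0.8, list = FALSE)
# draw train and test from the altered copy and drop the helper id column
train = subset(data.model_test[ind,], select = -id)
test = subset(data.model_test[-ind,], select = -id)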

treemodel = ctree(status~., data = train)

predTest = predict(treemodel, test) 

ggplot(as.data.frame(predTest), aes(x = predTest, y = ..count../sum(..count..))) + 
     geom_bar(color = "black", alpha = 0.5, fill = "lightblue") +
     geom_text(aes(label = percent(..count../sum(..count..))), 
     size = 4, stat= "count", vjust = -0.5) +
     scale_y_continuous(labels = percent) +
     labs(title = "Employees churn proportion in new subset (2 years retention)
               ", y  = "", x = "") +
     theme_minimal() +
     theme(text = element_text(size = 12),
     plot.title = element_text(hjust = 0.5), title = element_text(size = 10))

Again, we can observe that the relative proportion of churners became even lower. Now we have 2.9% of churners compared with the initial 4.4%, which is a decrease of 1.5 percentage points and a 35% decrease in the raw number of people.

New prediction as if nobody churned during the 1st, 2nd and 3rd year of service

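# relabel the employee in row 588 as "active"
data.model_test$status[data.model_test$id == 588] = "active"

set.seed(11)
ind = createDataPartition(data.model_test$status, p = 0.8, list = FALSE)
# draw train and test from the altered copy and drop the helper id column
train = subset(data.model_test[ind,], select = -id)
test = subset(data.model_test[-ind,], select = -id)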

treemodel = ctree(status~., data = train)

predTest = predict(treemodel, test) 

ggplot(as.data.frame(predTest), aes(x = predTest, y = ..count../sum(..count..))) + 
     geom_bar(color = "black", alpha = 0.5, fill = "lightblue") +
     geom_text(aes(label = percent(..count../sum(..count..))), 
     size = 4, stat= "count", vjust = -0.5) +
     scale_y_continuous(labels = percent) +
     labs(title = "Employees churn proportion in new subset (3 years retention)
               ", y  = "", x = "") +
     theme_minimal() +
     theme(text = element_text(size = 12),
     plot.title = element_text(hjust = 0.5), title = element_text(size = 10))

Amazing! If the company managed to retain employees from the specified subset for more than 3 years, then the relative proportion of churners would fall to 1.4% compared with the initial value of about 4.3%, which is a decrease of 2.9 percentage points in the churn rate and a 67% decrease in the raw number of people churning.