As someone working in HR, you are responsible not only for recruiting employees but also for handling their departure when they decide to resign. The problem is that, unlike recruitment, which can be planned well in advance, a resignation almost always catches you off guard. It tends to arrive without warning, even though the employee typically has at least two weeks before actually leaving.
You therefore need to prepare now so that you are not caught unprepared when an employee hands in a resignation. In this report, we run an analysis to predict which employees are likely to resign.
Data Source:
The data used in this report is employee termination data from the company KIFEST. It can be accessed at: https://bit.ly/KIFEST_HR .
Steps we will cover:
1. Data Extraction
2. Data Exploration
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendations
The data is pulled from a Google Sheets spreadsheet by installing and loading the “gsheet” package.
rm(list = ls())
#install.packages('gsheet')
library(gsheet)
## Warning: package 'gsheet' was built under R version 4.0.5
kifest_df <- gsheet2tbl('docs.google.com/spreadsheets/d/1RK3obhtQRzuKMljpzTIyzCAPaJnfAP6zjELx8bILkqg/edit?usp=sharing')
#install.packages("reactable")
library(reactable)
reactable(
kifest_df,
defaultPageSize = 5,
bordered = TRUE, striped = TRUE, highlight = TRUE
)
The dataset has 654 rows (observations) and 21 columns (variables).
dim(kifest_df)
## [1] 654 21
The 21 columns (variables) are:
colnames(kifest_df)
## [1] "id" "gender" "employee_type"
## [4] "level_of_position" "band" "org_unit"
## [7] "org_division" "funct_bus" "position"
## [10] "location" "age" "long_work"
## [13] "education_type" "entitas" "salary"
## [16] "attrition" "resign_type" "marital_status"
## [19] "dependent" "grade" "termination_date"
A summary of each column (variable) follows:
summary(kifest_df)
## id gender employee_type level_of_position
## Min. :10000012 Length:654 Length:654 Length:654
## 1st Qu.:10000931 Class :character Class :character Class :character
## Median :10001723 Mode :character Mode :character Mode :character
## Mean :10093226
## 3rd Qu.:10001996
## Max. :30000063
## band org_unit org_division funct_bus
## Length:654 Length:654 Length:654 Length:654
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## position location age long_work
## Length:654 Length:654 Min. :20.00 Min. : 0.00
## Class :character Class :character 1st Qu.:23.00 1st Qu.: 2.00
## Mode :character Mode :character Median :26.00 Median : 2.00
## Mean :34.95 Mean :11.47
## 3rd Qu.:55.00 3rd Qu.:27.00
## Max. :62.00 Max. :40.00
## education_type entitas salary attrition
## Length:654 Length:654 Min. : 2125000 Length:654
## Class :character Class :character 1st Qu.: 3803000 Class :character
## Mode :character Mode :character Median : 4383000 Mode :character
## Mean : 5557305
## 3rd Qu.: 5632000
## Max. :50000000
## resign_type marital_status dependent grade
## Length:654 Length:654 Min. :0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.:0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median :0.0000 Median : 0.000
## Mean :0.6269 Mean : 2.096
## 3rd Qu.:1.0000 3rd Qu.: 4.000
## Max. :4.0000 Max. :16.000
## termination_date
## Length:654
## Class :character
## Mode :character
##
##
##
Univariate analysis uses “ggplot” to show 4 variables as bar charts.
#install.packages("ggplot2")
library(ggplot2)
#install.packages("gridExtra")
library(gridExtra)
#install.packages("dplyr")
library(dplyr)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
p1 <- kifest_df %>%
group_by(resign_type) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(resign_type), y = counts)) + geom_bar(stat = 'identity',
fill = "coral1") + ggtitle("Jenis Terminasi") +geom_text(aes(label=counts), size = 2.5,
position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),
axis.text.x = element_text(size =7,angle = 45, hjust = 1),
axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 900))
p2 <- kifest_df %>%
group_by(dependent) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(dependent), y = counts)) + geom_bar(stat = 'identity',
fill = "coral1") + ggtitle("Tanggungan Keluarga") +geom_text(aes(label=counts), size = 2.5,
position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),
axis.text.x = element_text(size =7,angle = 45, hjust = 1),
axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 900))
p3 <- kifest_df %>%
group_by(gender) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(gender), y = counts)) + geom_bar(stat = 'identity',
fill = "coral1") + ggtitle("Jenis Kelamin") +geom_text(aes(label=counts), size = 2.5,
position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),
axis.text.x = element_text(size =7,angle = 45, hjust = 1),
axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 900))
p4 <- kifest_df %>%
group_by(level_of_position) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(level_of_position), y = counts)) + geom_bar(stat = 'identity',
fill = "coral1") + ggtitle("Level Jabatan") +geom_text(aes(label=counts), size = 2.5,
position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),
axis.text.x = element_text(size =7,angle = 45, hjust = 1),
axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 900))
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)
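The four panels above repeat the same pipeline with only the column and title changed. As a minimal sketch, that pattern could be wrapped in a helper. The function name `plot_counts`, its arguments, and the toy data frame are hypothetical and not part of the original report; the sketch assumes dplyr 1.0 or later and ggplot2 3.0 or later.

```r
library(ggplot2)
library(dplyr)

# Hypothetical helper (not in the original report): bar chart of row counts
# for one categorical column, styled like the four panels above.
plot_counts <- function(df, col, title, y_max = 900) {
  df %>%
    group_by(across(all_of(col))) %>%
    summarise(counts = n(), .groups = "drop") %>%
    ggplot(aes(x = as.factor(.data[[col]]), y = counts)) +
    geom_col(fill = "coral1") +
    geom_text(aes(label = counts), size = 2.5, vjust = -0.25) +
    ggtitle(title) +
    theme(plot.title = element_text(size = 10),
          axis.text.x = element_text(size = 7, angle = 45, hjust = 1),
          axis.title.x = element_blank()) +
    scale_y_continuous(limits = c(0, y_max))
}

# Toy data standing in for kifest_df
toy <- data.frame(gender = c("Male", "Female", "Male", "Male"))
p <- plot_counts(toy, "gender", "Gender")
```

With the real data, a call such as `plot_counts(kifest_df, "resign_type", "Jenis Terminasi")` would be expected to reproduce the first panel.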
Based on the charts above:
1. Termination type (Jenis Terminasi)
There are 6 termination types; in descending order of frequency they are Contract_End, Normal_Retirement, Resign, Died, Fraud, and Demotion.
2. Family dependents (Tanggungan Keluarga)
Employees with no dependents account for the most terminations, followed by those with 1, 2, 3, and 4 dependents.
3. Gender (Jenis Kelamin)
More male employees than female employees were terminated.
4. Position level (Level Jabatan)
Employees at the "pelaksana" (operational staff) position level account for more terminations than any other level.
Bivariate analysis uses “ggplot” to compare the target variable “resign_type” with the variable “salary” in a boxplot.
kifest_df$level_of_position <- factor(kifest_df$level_of_position)
# Tick labels use this report's shorthand: "1K" stands for a salary of 1,000,000.
ggplot(data=kifest_df, aes(x=resign_type,
y=salary)) +
geom_boxplot() +
scale_y_continuous(breaks = c(1000000,2000000,3000000,4000000,5000000,10000000,15000000,20000000,30000000,
40000000,50000000,
60000000,70000000), # last two breaks were 60000 and 70000, which did not match their labels
labels = c("1K","2K","3K","4K","5K","10K","15K","20K","30K","40K","50K","60K","70K"),
limits = c(1000000,60000000))+
geom_jitter(alpha = 0.3,
color = "blue",
width = 0.2)
From the boxplot above, terminations due to contract expiry (“Contract_End”) are the most numerous, with salaries ranging from about 3K to 40K on the plot axis (3,000,000 to 40,000,000 in the salary column), followed by “Normal_Retirement”, “Resign”, “Died” and “Demotion”.
Based on the univariate and bivariate analyses above, we narrow the termination target down to 2 categories:
1. Resign, containing the “Resign” type.
2. Termination, containing the remaining types: “Contract_End”, “Normal_Retirement”, “Died”, “Fraud” and “Demotion”.
This grouping is used because Resign is the termination type that is initiated by, or made at the request of, the employee.
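The data already carries this two-category target in the “attrition” column. As a hedged illustration of the grouping rule, the sketch below derives such a column from termination types. The toy vector and the exact spellings of the type labels are assumptions; only the RESIGN and TERMINATION class names come from the model output later in this report.

```r
# Toy values standing in for kifest_df$resign_type; the exact spellings of the
# termination types in the real data are an assumption.
resign_type <- c("Contract_End", "Resign", "Normal_Retirement",
                 "Died", "Resign", "Demotion", "Fraud")

# "Resign" stays its own class; every other termination type is pooled.
attrition_2class <- factor(
  ifelse(resign_type == "Resign", "RESIGN", "TERMINATION"),
  levels = c("RESIGN", "TERMINATION")
)
table(attrition_2class)
```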
Bivariate analysis against the two-category target (“attrition”) is as follows:
p5 <- kifest_df %>%
ggplot(aes(x = age, fill = attrition)) + geom_density(alpha = 0.5) + ggtitle("Age") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p6 <- kifest_df %>%
ggplot(aes(x = long_work, fill = attrition)) + geom_density(alpha = 0.5) + ggtitle("Long Work") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p7 <- kifest_df %>%
ggplot(aes(x = dependent, fill = attrition)) + geom_density(alpha = 0.5) + ggtitle("Dependent") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p8 <- kifest_df %>%
ggplot(aes(x = grade, fill = attrition)) + geom_density(alpha = 0.5) + ggtitle("Grade") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
grid.arrange(p5, p6, p7, p8, nrow = 2, ncol = 2)
For the multivariate analysis we examine the correlations among 4 numeric variables: salary (“salary”), age (“age”), length of service (“long_work”) and number of dependents (“dependent”).
#install.packages("corrgram")
library(corrgram)
corrgram(kifest_df[,c("salary", "age", "long_work",
"dependent")],
main="Correlation Between Continuous Variables")
Based on the correlogram above, length of service and age are very strongly correlated (0.92), followed by the correlations of age and of length of service with the number of dependents (0.65 and 0.59). The detailed correlations between these variables are as follows:
cor(kifest_df[,c("salary", "age", "long_work",
"dependent")])
## salary age long_work dependent
## salary 1.0000000 0.4078812 0.2017021 0.2450270
## age 0.4078812 1.0000000 0.9196373 0.6540971
## long_work 0.2017021 0.9196373 1.0000000 0.5874774
## dependent 0.2450270 0.6540971 0.5874774 1.0000000
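To make the "very strongly correlated" reading concrete, here is a small sketch that lists the variable pairs whose absolute correlation exceeds a cutoff. The matrix values are copied from the output above; the 0.8 cutoff is an arbitrary assumption chosen for illustration.

```r
vars <- c("salary", "age", "long_work", "dependent")

# Correlation matrix copied from the cor() output above
cm <- matrix(c(
  1.0000000, 0.4078812, 0.2017021, 0.2450270,
  0.4078812, 1.0000000, 0.9196373, 0.6540971,
  0.2017021, 0.9196373, 1.0000000, 0.5874774,
  0.2450270, 0.6540971, 0.5874774, 1.0000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once (upper triangle) and apply the assumed 0.8 cutoff
idx <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

Only the age and long_work pair passes this cutoff, which is worth keeping in mind because both variables enter the models later in the report.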
3.1.1 Handle missing value with Mean
Detect Missing Value on Data
kifest_df %>% summarise_all(
~ sum(is.na(.))
)
## # A tibble: 1 x 21
## id gender employee_type level_of_position band org_unit org_division
## <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0
## # ... with 14 more variables: funct_bus <int>, position <int>, location <int>,
## # age <int>, long_work <int>, education_type <int>, entitas <int>,
## # salary <int>, attrition <int>, resign_type <int>, marital_status <int>,
## # dependent <int>, grade <int>, termination_date <int>
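The heading mentions handling missing values with the mean, but the step above only counts them, and the columns displayed show zero. For completeness, this is a minimal sketch of what mean imputation could look like if gaps did exist. The toy data frame and the helper name `impute_mean` are hypothetical.

```r
# Toy data frame with gaps; column names mirror two numeric columns of kifest_df.
toy <- data.frame(age = c(25, NA, 40), salary = c(4e6, 5e6, NA))

# Replace each NA in a numeric column with the mean of that column's observed values.
impute_mean <- function(x) {
  if (is.numeric(x)) x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy_imputed <- as.data.frame(lapply(toy, impute_mean))
colSums(is.na(toy_imputed))
```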
From the data above, several columns of type “chr” need to be converted to type “factor”, as follows:
kifest_df$resign_type = as.factor(kifest_df$resign_type)
kifest_df$gender = as.factor(kifest_df$gender)
kifest_df$marital_status = as.factor(kifest_df$marital_status)
kifest_df$employee_type = as.factor(kifest_df$employee_type)
kifest_df$level_of_position = as.factor(kifest_df$level_of_position)
kifest_df$attrition = as.factor(kifest_df$attrition)
3.1.2 Handle Outliers
Get Outliers Values
boxplot((kifest_df$salary),
main="Salary")
out_salary <- boxplot.stats(kifest_df$salary)$out
Based on the boxplot above, there are salary outliers reaching up to 50K (50,000,000, the maximum salary in the summary), which we remove so that the predictions are better (less biased), as follows:
Get Outliers Index
out_idx <- which(kifest_df$salary%in% c(out_salary))
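Data without outliers
# Logical filter instead of kifest_df[-out_idx,], which would return zero rows if out_idx were empty
kifest_df_clean <- kifest_df[!(kifest_df$salary %in% out_salary),]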
boxplot(kifest_df_clean$salary)
After removing the outliers, the data looks like the boxplot above; the 654 observations are reduced to 594.
dim (kifest_df)
## [1] 654 21
dim (kifest_df_clean)
## [1] 594 21
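For readers wondering which values count as outliers here: with its default settings, boxplot.stats() flags values lying more than 1.5 times the hinge spread beyond the lower or upper hinge. The sketch below reproduces that rule by hand on a made-up vector (the numbers are not KIFEST salaries).

```r
# Made-up values; 50 is the obvious outlier
x <- c(2, 3, 3, 4, 4, 5, 5, 6, 50)

h <- fivenum(x)            # min, lower hinge, median, upper hinge, max
spread <- h[4] - h[2]      # hinge spread (close to the IQR)
lower_fence <- h[2] - 1.5 * spread
upper_fence <- h[4] + 1.5 * spread

manual_out <- x[x < lower_fence | x > upper_fence]

# Same values as the built-in rule used in this report
identical(manual_out, boxplot.stats(x)$out)
```

Because the rule is purely distributional, it removed 60 salary records here, not only the single maximum, so it is worth checking which termination types those records belong to before modelling.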
set.seed(2021)
#install.packages("caret")
library(caret)
kifest_df_clean$resign_type = as.factor(kifest_df_clean$resign_type)
TrainingIndex <- createDataPartition(kifest_df_clean$resign_type, p=0.8, list = FALSE)
TrainingSet <- kifest_df_clean[TrainingIndex,] # Training Set
TestingSet <- kifest_df_clean[-TrainingIndex,] # Test Set
TrainingSet %>% summarise_all(
~ sum(is.na(.))
)
## # A tibble: 1 x 21
## id gender employee_type level_of_position band org_unit org_division
## <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0
## # ... with 14 more variables: funct_bus <int>, position <int>, location <int>,
## # age <int>, long_work <int>, education_type <int>, entitas <int>,
## # salary <int>, attrition <int>, resign_type <int>, marital_status <int>,
## # dependent <int>, grade <int>, termination_date <int>
sum(is.na(TrainingSet))
## [1] 0
Create Model LR
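# Note: this model is labelled LR (logistic regression) in this report, but
# method = "svmPoly" with degree = 1 fits a linear-kernel SVM through caret.
Model_2 <- train(attrition ~ age+salary+marital_status+long_work+gender+employee_type+level_of_position,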
data = TrainingSet,
method = "svmPoly",
na.action = na.omit,
preProcess=c("scale","center"),
trControl= trainControl(method="none"),
tuneGrid = data.frame(degree=1,scale=1,C=1)
)
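The section is labelled LR, while the call above uses method = "svmPoly". As a hedged sketch of what an actual logistic regression could look like, the example below uses base R's glm() on simulated toy data. The data, coefficients, and the 0.5 cutoff are all assumptions made for illustration and say nothing about the KIFEST results.

```r
set.seed(2021)

# Simulated toy data (not the KIFEST data): younger staff resign more often.
n <- 200
toy <- data.frame(age = round(runif(n, 20, 60)))
toy$long_work <- pmax(0, toy$age - 20 - rpois(n, 3))
p_resign <- plogis(2 - 0.1 * toy$age)
toy$attrition <- factor(ifelse(runif(n) < p_resign, "RESIGN", "TERMINATION"),
                        levels = c("TERMINATION", "RESIGN"))

# With a factor response, glm() models the probability of the second level ("RESIGN").
fit_glm <- glm(attrition ~ age + long_work, data = toy, family = binomial)

# Predicted probability of RESIGN, turned into a class with an assumed 0.5 cutoff
prob_resign <- predict(fit_glm, newdata = toy, type = "response")
pred_class <- factor(ifelse(prob_resign > 0.5, "RESIGN", "TERMINATION"),
                     levels = levels(toy$attrition))
table(Actual = toy$attrition, Predicted = pred_class)
```

Within caret, the same kind of model is available through method = "glm", which would keep the train() and predict() workflow used in the rest of this report.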
Evaluation Testing LR
Model.pred_2 <-predict(Model_2, TestingSet)
model.matrix_2 <-confusionMatrix(Model.pred_2, TestingSet$attrition)
model.matrix_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction RESIGN TERMINATION
## RESIGN 4 0
## TERMINATION 12 100
##
## Accuracy : 0.8966
## 95% CI : (0.8263, 0.9454)
## No Information Rate : 0.8621
## P-Value [Acc > NIR] : 0.173671
##
## Kappa : 0.365
##
## Mcnemar's Test P-Value : 0.001496
##
## Sensitivity : 0.25000
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.89286
## Prevalence : 0.13793
## Detection Rate : 0.03448
## Detection Prevalence : 0.03448
## Balanced Accuracy : 0.62500
##
## 'Positive' Class : RESIGN
##
table(Model.pred_2)
## Model.pred_2
## RESIGN TERMINATION
## 4 112
table(TestingSet$attrition)
##
## RESIGN TERMINATION
## 16 100
logit.perf_2 <- table(TestingSet$attrition, Model.pred_2,
dnn = c("Actual", "Predicted"))
logit.perf_2
## Predicted
## Actual RESIGN TERMINATION
## RESIGN 4 12
## TERMINATION 0 100
Create Model Tree
#install.packages("party")
library(party)
fit.ctree_2 <- ctree(formula = attrition ~ age+salary+marital_status+long_work+gender+
employee_type+level_of_position,
data = TrainingSet)
plot(fit.ctree_2)
Evaluation Testing Tree
ctree.pred_2 <- predict(fit.ctree_2, TestingSet)
ctree.perf_2 <- table(TestingSet$attrition, ctree.pred_2,
dnn = c("Actual","Predict"))
ctree.perf_2
## Predict
## Actual RESIGN TERMINATION
## RESIGN 12 4
## TERMINATION 7 93
Create Model RF
#install.packages("randomForest")
library(randomForest)
set.seed(2021)
fit.forest_2 <- randomForest(formula = attrition ~ age+salary+marital_status+long_work+gender,
data = TrainingSet,
na.action = na.roughfix)
fit.forest_2
##
## Call:
## randomForest(formula = attrition ~ age + salary + marital_status + long_work + gender, data = TrainingSet, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 11.51%
## Confusion matrix:
## RESIGN TERMINATION class.error
## RESIGN 30 34 0.53125000
## TERMINATION 21 393 0.05072464
Evaluation Testing RF
forest.pred_2 <- predict(fit.forest_2, TestingSet)
forest.perf_2 <- table(forest.pred_2, TestingSet$attrition,
dnn = c("Predict","Actual"))
forest.perf_2
## Actual
## Predict RESIGN TERMINATION
## RESIGN 9 4
## TERMINATION 7 96
Create model SVM
#install.packages("e1071")
library(e1071)
library(caret)
Model_svm_2 <- train(attrition ~ age+salary+marital_status+long_work+gender, data = TrainingSet,
method = "svmPoly",
na.action = na.omit,
preProcess=c("scale","center"),
trControl= trainControl(method="none"),
tuneGrid = data.frame(degree=1,scale=1,C=1)
)
Model_svm_2
## Support Vector Machines with Polynomial Kernel
##
## 478 samples
## 5 predictor
## 2 classes: 'RESIGN', 'TERMINATION'
##
## Pre-processing: scaled (6), centered (6)
## Resampling: None
Evaluation Testing SVM
Model.Testing.SV_2 <- predict(Model_svm_2, TestingSet)
Model.testing.confusion_2 <-confusionMatrix(Model.Testing.SV_2,
TestingSet$attrition)
Model.testing.confusion_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction RESIGN TERMINATION
## RESIGN 0 0
## TERMINATION 16 100
##
## Accuracy : 0.8621
## 95% CI : (0.7857, 0.9191)
## No Information Rate : 0.8621
## P-Value [Acc > NIR] : 0.5661446
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.0001768
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8621
## Prevalence : 0.1379
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : RESIGN
##
svm.perf_2 <- table(Model.Testing.SV_2, TestingSet$attrition,
dnn = c("Predict","Actual"))
svm.perf_2
## Actual
## Predict RESIGN TERMINATION
## RESIGN 0 0
## TERMINATION 16 100
Performance Comparison
# Expects a 2x2 table with actual classes in rows and predicted classes in
# columns, in the order RESIGN, TERMINATION. TERMINATION (the second level) is
# treated as the positive class, unlike confusionMatrix() above, which
# reports RESIGN as the positive class.
performance <- function(table, n=2){
  tn <- table[1,1]
  fp <- table[1,2]
  fn <- table[2,1]
  tp <- table[2,2]
  sensitivity = tp/(tp+fn) # recall
  specificity = tn/(tn+fp)
  ppp = tp/(tp+fp) # precision
  npp = tn/(tn+fn)
  hitrate = (tp+tn)/(tp+tn+fp+fn) # accuracy
  f1score <- 2 * ppp * sensitivity /(ppp+sensitivity)
  result <- paste("Sensitivity = ", round(sensitivity, n) ,
                  "\nSpecificity = ", round(specificity, n),
                  "\nPositive Predictive Value = ", round(ppp, n),
                  "\nNegative Predictive Value = ", round(npp, n),
                  "\nAccuracy = ", round(hitrate, n),
                  "\nF1 Score = ",round(f1score,n))
  cat(result)
}
performance(logit.perf_2) #Show Result Logistic Regression
## Sensitivity = 1
## Specificity = 0.25
## Positive Predictive Value = 0.89
## Negative Predictive Value = 1
## Accuracy = 0.9
## F1 Score = 0.94
performance(ctree.perf_2) #Show Result Classification Tree
## Sensitivity = 0.93
## Specificity = 0.75
## Positive Predictive Value = 0.96
## Negative Predictive Value = 0.63
## Accuracy = 0.91
## F1 Score = 0.94
performance(forest.perf_2) #Show Result Random Forest (table is Predict x Actual, so Sensitivity/PPV and Specificity/NPV are swapped in this output)
## Sensitivity = 0.93
## Specificity = 0.69
## Positive Predictive Value = 0.96
## Negative Predictive Value = 0.56
## Accuracy = 0.91
## F1 Score = 0.95
performance(svm.perf_2) #Show Result Support Vector Machine (table is Predict x Actual, so Sensitivity/PPV and Specificity/NPV are swapped in this output)
## Sensitivity = 0.86
## Specificity = NaN
## Positive Predictive Value = 1
## Negative Predictive Value = 0
## Accuracy = 0.86
## F1 Score = 0.93
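The tables passed to performance() above do not all share the same orientation, and the function fixes the positive class by position. One way to avoid both issues, sketched here under the assumption that label vectors are available, is to compute the metrics from actual and predicted labels with an explicit positive class. The helper name `binary_metrics` is hypothetical; the counts used to rebuild the vectors are the classification-tree counts printed earlier.

```r
# Hypothetical helper: metrics from actual/predicted label vectors with an
# explicit positive class, so table orientation cannot be mixed up.
binary_metrics <- function(actual, predicted, positive) {
  tp <- sum(actual == positive & predicted == positive)
  fn <- sum(actual == positive & predicted != positive)
  fp <- sum(actual != positive & predicted == positive)
  tn <- sum(actual != positive & predicted != positive)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(sensitivity = recall,
    specificity = tn / (tn + fp),
    ppv = precision,
    npv = tn / (tn + fn),
    accuracy = (tp + tn) / (tp + tn + fp + fn),
    f1 = 2 * precision * recall / (precision + recall))
}

# Rebuild label vectors from the classification-tree table (ctree.perf_2):
# actual RESIGN: 12 predicted RESIGN, 4 predicted TERMINATION;
# actual TERMINATION: 7 predicted RESIGN, 93 predicted TERMINATION.
actual    <- rep(c("RESIGN", "TERMINATION"), times = c(16, 100))
predicted <- c(rep(c("RESIGN", "TERMINATION"), times = c(12, 4)),
               rep(c("RESIGN", "TERMINATION"), times = c(7, 93)))

m_resign <- binary_metrics(actual, predicted, positive = "RESIGN")
m_term   <- binary_metrics(actual, predicted, positive = "TERMINATION")
round(rbind(RESIGN = m_resign, TERMINATION = m_term), 2)
```

Reporting both rows side by side makes it clear that sensitivity and specificity trade places when the positive class changes, which matters when the class of interest is the rarer RESIGN group.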
The variables that matter most for predicting which employees will leave are: age, length of service, salary, gender, and number of dependents.
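No variable-importance computation is shown in this report, and “dependent” does not appear in the model formulas above. One way such a ranking could be checked is with the importance measures of the randomForest package. The sketch below uses the built-in iris data purely as a stand-in; with this report's objects, the call would use the attrition formula and TrainingSet instead.

```r
library(randomForest)
set.seed(2021)

# iris is only a stand-in data set for illustration.
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

# One row per predictor; columns include MeanDecreaseAccuracy and MeanDecreaseGini
imp <- importance(fit)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]
```

varImpPlot(fit) draws the same ranking as a dot chart.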
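The best-performing algorithm is Ctree. With RESIGN as the positive class, its sensitivity is 0.75 (12 of the 16 actual resignations are detected), its specificity is 0.93, and its accuracy is 0.91; the correctly predicted resignations amount to 10.34% of the test set (12 of 116). The performance() output above lists 0.75 as specificity and 0.93 as sensitivity because that function treats TERMINATION as the positive class.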
From a business perspective, the model labelled logistic regression is preferable: its sensitivity is lower, at 0.25, but its specificity is 1 and its accuracy is 0.90. In other words, every employee it flags as “resign” did in fact resign (4 of 4), although it finds only 4 of the 16 actual resignations among the 116 test employees. Ctree, by contrast, detects 12 of those 16 resignations and misses 4, and it also flags 7 employees as “resign” who left for other reasons. With such predictions:
- HR can focus on managing the employees who are predicted to resign.
- The company can save on recruitment, training, and severance costs.
We recommend adding KPI and work-unit variables to assess whether they contribute significantly.