Disini kita menggunakan data dari https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
library(dplyr)
library(keras)
library(caret)
library(recipes)
library(ROSE)
library(rsample)
library(tensorflow)
reticulate::use_python(python = "C:/Users/LENOVO/anaconda3/envs/r-tensorflow-gpu/python.exe",required=TRUE)
Pertama-tama kita akan membaca file yang berbentuk csv
df <- read.csv("dataset//WA_Fn-UseC_-HR-Employee-Attrition.csv")
str(df)
## 'data.frame': 1470 obs. of 35 variables:
## $ ï..Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : chr "Yes" "No" "Yes" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : chr "Female" "Male" "Male" "Female" ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
Di data tersebut, ternyata tidak ada data yang NA
anyNA(df)
## [1] FALSE
Disini kita menyesuaikan tipe columnya yang belum sesuai
df_clean <- df %>%
mutate(Attrition = as.factor(Attrition),
BusinessTravel = as.factor(BusinessTravel),
Department = as.factor(Department),
Education = as.factor(Education),
EducationField = as.factor(EducationField),
Gender = as.factor(Gender),
JobRole = as.factor(JobRole),
MaritalStatus = as.factor(MaritalStatus),
Over18 = as.factor(Over18),
OverTime = as.factor(OverTime)
) %>%
select(-c(Over18, EmployeeCount, EmployeeNumber, StandardHours))
str(df_clean)
## 'data.frame': 1470 obs. of 31 variables:
## $ ï..Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : Factor w/ 5 levels "1","2","3","4",..: 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
Kita akan melakukan spliting data menjadi training dan testing sebesar 0.8
set.seed(2021)
split_df <- initial_split(df_clean, 0.8)
train <- training(split_df)
test <- testing(split_df)
Disini kita cek apakah data train kita targetnya sudah balanced
table(train$Attrition) %>% prop.table()
##
## No Yes
## 0.8426871 0.1573129
total_n <- table(train$Attrition)[1] + table(train$Attrition)[2]
Setelah kita cek sebelumnya, ternyata data target dari train itu belum balanced, maka dari itu kita akan lakukan balancing terhadapt data target menggunakan ovun.sample dan menggunakan method upsample dan downsample (both)
train_balance <- ovun.sample(Attrition ~ ., data = df_clean, method = "both", N = total_n)
table(train_balance$data$Attrition) %>% prop.table()
##
## No Yes
## 0.5119048 0.4880952
Kita akan membuat obj untuk scaling dan membuat dummy dengan menggunakan function recipe
rec_obj <- recipe(Attrition ~ ., data = train_balance$data) %>%
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors()) %>%
prep(data = train_balance$data)
rec_obj
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 30
##
## Training data contained 1176 data points and no missing data.
##
## Operations:
##
## Centering for ï..Age, DailyRate, ... [trained]
## Scaling for ï..Age, DailyRate, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]
Setelah itu, kita akan memecah predictor dan target, dan merubahnya sesuai kebutuhan
train_x <- bake(object = rec_obj, new_data = train_balance$data) %>% select(-Attrition) %>% as.matrix()
train_y <- bake(object = rec_obj, new_data = train_balance$data) %>% select(Attrition)
test_x <- bake(object = rec_obj, new_data = test) %>% select(-Attrition) %>% as.matrix()
test_y <- bake(object = rec_obj, new_data = test) %>% select(Attrition)
train_x <- array_reshape(train_x, dim(train_x))
test_x <- array_reshape(test_x, dim(test_x))
train_y <- ifelse(pull(train_y) == "No", 1, 0)
test_y <- ifelse(pull(test_y) == "No", 1, 0)
train_y <- to_categorical(train_y)
test_y <- to_categorical(test_y, num_classes = 2)
n_input <- dim(train_x)[2]
Sekarang kita membuat arsitektur NN nya dengan menggunakan hidel_layer 3x, dan output_layer 1x. untuk seluruh hiden_layer kita menggunakan units 64 dan activation tanh
untuk optimizer kira menggunakan optimizer adam dengan learning rate default yaitu 0.0005
model <- keras_model_sequential(name = "Model1") %>%
layer_dense(activation = "tanh",
units = 64,
input_shape = n_input) %>%
layer_dense(activation = "tanh",
units = 64) %>%
layer_dense(activation = "tanh",
units = 64) %>%
layer_dense(activation = "sigmoid",
units = 2,
name = "output")
model %>%
compile(loss = "binary_crossentropy",
metrics = c("accuracy"),
optimizer = optimizer_adam(learning_rate = 0.0005))
model
## Model
## Model: "Model1"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_2 (Dense) (None, 64) 3072
## ________________________________________________________________________________
## dense_1 (Dense) (None, 64) 4160
## ________________________________________________________________________________
## dense (Dense) (None, 64) 4160
## ________________________________________________________________________________
## output (Dense) (None, 2) 130
## ================================================================================
## Total params: 11,522
## Trainable params: 11,522
## Non-trainable params: 0
## ________________________________________________________________________________
Untuk training model, kita memakai epoch 25 dan batch size sebesar 32
set.seed(2021)
history <- model %>%
fit(x = train_x,
y = train_y,
batch_size = 32,
epoch = 25,
validation_data = list(test_x, test_y),
verbose = 1)
plot(history)
## `geom_smooth()` using formula 'y ~ x'
Setelah model berhasil belajar, kita akan memprediksi data test nya
Dari hasi prediksi, maka kita harus merubah hasil tersebut menjadi category seperti di awal. dan juga n menjalankan confusionMatrix untuk melihat accuracy, recall, dan precision
preidct_test <- predict(model, test_x) %>% k_argmax() %>% as.array()
preidct_test <- ifelse(preidct_test == 1, "No", "Yes")
confusionMatrix(as.factor(preidct_test), test$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 216 7
## Yes 26 45
##
## Accuracy : 0.8878
## 95% CI : (0.846, 0.9215)
## No Information Rate : 0.8231
## P-Value [Acc > NIR] : 0.001498
##
## Kappa : 0.6629
##
## Mcnemar's Test P-Value : 0.001728
##
## Sensitivity : 0.8926
## Specificity : 0.8654
## Pos Pred Value : 0.9686
## Neg Pred Value : 0.6338
## Prevalence : 0.8231
## Detection Rate : 0.7347
## Detection Prevalence : 0.7585
## Balanced Accuracy : 0.8790
##
## 'Positive' Class : No
##
Dari hasil yang kita dapatkan, model tersebut menghasilkan accuracy yang cukup bagus di atas 90% Dan model ini juga masih dapat di improve kedepannya.