Predicting Employee Attrition with Keras in R

Here we use the IBM HR Analytics Employee Attrition dataset from https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

library(dplyr)
library(keras)
library(caret)
library(recipes)
library(ROSE)
library(rsample)
library(tensorflow)

reticulate::use_python(python = "C:/Users/LENOVO/anaconda3/envs/r-tensorflow-gpu/python.exe", required = TRUE)

First, we read the CSV file.

df <- read.csv("dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv", fileEncoding = "UTF-8-BOM")
str(df)
## 'data.frame':    1470 obs. of  35 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : chr  "Yes" "No" "Yes" "No" ...
##  $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : chr  "Sales" "Research & Development" "Research & Development" "Research & Development" ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : chr  "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  1 2 4 5 7 8 10 11 12 13 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : chr  "Female" "Male" "Male" "Female" ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : chr  "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : chr  "Single" "Married" "Single" "Married" ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ Over18                  : chr  "Y" "Y" "Y" "Y" ...
##  $ OverTime                : chr  "Yes" "No" "Yes" "Yes" ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

Checking for missing values, it turns out the data contains none:

anyNA(df)
## [1] FALSE

Here we adjust the column types that are not yet appropriate, converting the categorical columns to factors, and drop the columns that carry no information (Over18, EmployeeCount, and StandardHours are constant, and EmployeeNumber is just an identifier).

df_clean <- df %>%
      mutate(Attrition = as.factor(Attrition),
             BusinessTravel = as.factor(BusinessTravel),
             Department = as.factor(Department),
             Education = as.factor(Education),
             EducationField = as.factor(EducationField),
             Gender = as.factor(Gender),
             JobRole = as.factor(JobRole),
             MaritalStatus = as.factor(MaritalStatus),
             Over18 = as.factor(Over18),
             OverTime = as.factor(OverTime)
             ) %>%
      select(-c(Over18, EmployeeCount, EmployeeNumber, StandardHours))       
str(df_clean)
## 'data.frame':    1470 obs. of  31 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : Factor w/ 5 levels "1","2","3","4",..: 2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

We will split the data into training and testing sets with a proportion of 0.8.

set.seed(2021)

split_df <- initial_split(df_clean, prop = 0.8)

train <- training(split_df)
test <- testing(split_df)

Here we check whether the target classes in our training data are balanced.

table(train$Attrition) %>% prop.table()
## 
##        No       Yes 
## 0.8426871 0.1573129
total_n <- nrow(train) # number of training rows, used as N for the balanced resample

As the check above shows, the training target is imbalanced (about 84% "No" vs 16% "Yes"), so we balance it with ovun.sample() using method = "both", which combines oversampling the minority class and undersampling the majority class.

# balance only the training data, so that no test rows leak into model fitting
train_balance <- ovun.sample(Attrition ~ ., data = train, method = "both", N = total_n, seed = 2021)

table(train_balance$data$Attrition) %>% prop.table()
## 
##        No       Yes 
## 0.5119048 0.4880952

We create a recipe object that centers and scales the numeric predictors and turns the nominal predictors into dummy variables.

rec_obj <- recipe(Attrition ~ ., data = train_balance$data) %>%
    step_center(all_numeric_predictors()) %>%
    step_scale(all_numeric_predictors()) %>%
    step_dummy(all_nominal_predictors()) %>%
    prep(training = train_balance$data)

rec_obj
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         30
## 
## Training data contained 1176 data points and no missing data.
## 
## Operations:
## 
## Centering for Age, DailyRate, ... [trained]
## Scaling for Age, DailyRate, ... [trained]
## Dummy variables from BusinessTravel, Department, Education, ... [trained]

After that, we separate the predictors from the target and convert each to the form keras needs.

train_x <- bake(object = rec_obj, new_data = train_balance$data) %>% select(-Attrition) %>% as.matrix()
train_y <- bake(object = rec_obj, new_data = train_balance$data) %>% select(Attrition)

test_x <- bake(object = rec_obj, new_data = test) %>% select(-Attrition) %>%  as.matrix()
test_y <- bake(object = rec_obj, new_data = test) %>% select(Attrition)

train_x <- array_reshape(train_x, dim(train_x))
test_x <- array_reshape(test_x, dim(test_x))

train_y <- ifelse(pull(train_y) == "No", 1, 0) 
test_y <- ifelse(pull(test_y) == "No", 1, 0) 

train_y <- to_categorical(train_y, num_classes = 2)
test_y <- to_categorical(test_y, num_classes = 2)

n_input <- dim(train_x)[2]
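
As an optional sanity check (not part of the original pipeline), we can confirm that the matrices have the shapes keras expects before training:

# predictor rows must match the one-hot target rows,
# and train/test must have the same number of predictor columns
stopifnot(nrow(train_x) == nrow(train_y),
          ncol(train_y) == 2,
          ncol(test_x) == n_input)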

Now we build the network architecture with three hidden layers and one output layer. Every hidden layer uses 64 units with tanh activation.

For the optimizer we use Adam with a learning rate of 0.0005, which is half the Keras default of 0.001.

model <- keras_model_sequential(name = "Model1") %>%
      layer_dense(activation = "tanh",
                  units = 64,
                  input_shape = n_input) %>%
      layer_dense(activation = "tanh",
                  units = 64) %>%
      layer_dense(activation = "tanh",
                  units = 64) %>%
      layer_dense(activation = "sigmoid",
                  units = 2,
                  name = "output")    

model %>%
  compile(loss = "binary_crossentropy",
          metrics = c("accuracy"),
          optimizer = optimizer_adam(learning_rate = 0.0005))
model
## Model
## Model: "Model1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense_2 (Dense)                     (None, 64)                      3072        
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 64)                      4160        
## ________________________________________________________________________________
## dense (Dense)                       (None, 64)                      4160        
## ________________________________________________________________________________
## output (Dense)                      (None, 2)                       130         
## ================================================================================
## Total params: 11,522
## Trainable params: 11,522
## Non-trainable params: 0
## ________________________________________________________________________________

To train the model, we use 25 epochs and a batch size of 32.

set.seed(2021)

history <- model %>%
  fit(x = train_x,
      y = train_y,
      batch_size = 32,
      epochs = 25,
      validation_data = list(test_x, test_y),
      verbose = 1)
plot(history)
## `geom_smooth()` using formula 'y ~ x'

Once the model has finished learning, we predict on the test data.

The predictions come back as class indices, so we convert them back to the original labels and then run confusionMatrix() to look at accuracy, recall, and precision.

predict_test <- predict(model, test_x) %>% k_argmax() %>% as.array()
predict_test <- ifelse(predict_test == 1, "No", "Yes")

confusionMatrix(as.factor(predict_test), test$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  216   7
##        Yes  26  45
##                                          
##                Accuracy : 0.8878         
##                  95% CI : (0.846, 0.9215)
##     No Information Rate : 0.8231         
##     P-Value [Acc > NIR] : 0.001498       
##                                          
##                   Kappa : 0.6629         
##                                          
##  Mcnemar's Test P-Value : 0.001728       
##                                          
##             Sensitivity : 0.8926         
##             Specificity : 0.8654         
##          Pos Pred Value : 0.9686         
##          Neg Pred Value : 0.6338         
##              Prevalence : 0.8231         
##          Detection Rate : 0.7347         
##    Detection Prevalence : 0.7585         
##       Balanced Accuracy : 0.8790         
##                                          
##        'Positive' Class : No             
## 
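
In caret's output, recall is reported as Sensitivity and precision as Pos Pred Value (here with "No" as the positive class). As a small optional sketch, the same numbers can be pulled straight from the confusionMatrix object, which exposes both naming conventions:

cm <- confusionMatrix(as.factor(predict_test), test$Attrition)
cm$byClass[c("Sensitivity", "Pos Pred Value")] # recall and precision for the "No" class
cm$byClass[c("Recall", "Precision")]           # the same values under their usual names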

From these results, the model reaches an accuracy of about 88.8% on the test data, comfortably above the no-information rate of 82.3%, and it can still be improved in the future; one possible direction is sketched below.
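
As a minimal sketch of one such improvement (assuming the same preprocessed data as above; this is not a tuned configuration): add dropout between the hidden layers to reduce overfitting, allow more epochs, and stop early once the validation loss stops improving.

model_improved <- keras_model_sequential(name = "Model2") %>%
      layer_dense(units = 64, activation = "tanh", input_shape = n_input) %>%
      layer_dropout(rate = 0.3) %>% # randomly zero 30% of activations while training
      layer_dense(units = 64, activation = "tanh") %>%
      layer_dropout(rate = 0.3) %>%
      layer_dense(units = 2, activation = "sigmoid", name = "output")

model_improved %>%
      compile(loss = "binary_crossentropy",
              metrics = c("accuracy"),
              optimizer = optimizer_adam(learning_rate = 0.0005))

history_improved <- model_improved %>%
      fit(x = train_x,
          y = train_y,
          batch_size = 32,
          epochs = 50, # allow longer training; early stopping decides when to quit
          validation_data = list(test_x, test_y),
          callbacks = list(callback_early_stopping(monitor = "val_loss",
                                                   patience = 5,
                                                   restore_best_weights = TRUE)),
          verbose = 1)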