1 Objective

In this article I use a neural network to predict employee attrition and report the resulting accuracy. The dataset comes from https://github.com/bagasbgy/keras-examples/tree/classification-dense/data/data-clean.csv . It is tabular data describing employees who have or have not left the company (attrition).

2 Library

Before building the neural network, load all of the required libraries.

# import libs
library(neuralnet)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::compute() masks neuralnet::compute()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:MLmetrics':
## 
##     MAE, RMSE
## The following object is masked from 'package:purrr':
## 
##     lift
library(rsample)
library(keras)

3 Read Data

The second step is to read the dataset.

empl <- read.csv("dataset/data-clean.csv")
glimpse(empl)
## Rows: 1,470
## Columns: 35
## $ attrition                  <chr> "yes", "no", "yes", "no", "no", "no", "n...
## $ age                        <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, ...
## $ business_travel            <chr> "travel_rarely", "travel_frequently", "t...
## $ daily_rate                 <int> 1102, 279, 1373, 1392, 591, 1005, 1324, ...
## $ department                 <chr> "sales", "research_development", "resear...
## $ distance_from_home         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15,...
## $ education                  <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2...
## $ education_field            <chr> "life_sciences", "life_sciences", "other...
## $ employee_count             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ employee_number            <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15...
## $ environment_satisfaction   <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2...
## $ gender                     <chr> "female", "male", "male", "female", "mal...
## $ hourly_rate                <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, ...
## $ job_involvement            <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3...
## $ job_level                  <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1...
## $ job_role                   <chr> "sales_executive", "research_scientist",...
## $ job_satisfaction           <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4...
## $ marital_status             <chr> "single", "married", "single", "married"...
## $ monthly_income             <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670...
## $ monthly_rate               <int> 19479, 24907, 2396, 23159, 16632, 11864,...
## $ num_companies_worked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0...
## $ over_18                    <chr> "y", "y", "y", "y", "y", "y", "y", "y", ...
## $ over_time                  <chr> "yes", "no", "yes", "yes", "no", "no", "...
## $ percent_salary_hike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, ...
## $ performance_rating         <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3...
## $ relationship_satisfaction  <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3...
## $ standard_hours             <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, ...
## $ stock_option_level         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1...
## $ total_working_years        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10,...
## $ training_times_last_year   <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2...
## $ work_life_balance          <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3...
## $ years_at_company           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, ...
## $ years_in_current_role      <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2...
## $ years_since_last_promotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1...
## $ years_with_curr_manager    <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2...

4 Data Preparation

4.1 Convert mismatched data types

The next step is to convert the character columns to factors.

empl_clean <- empl %>% 
  mutate_if(is.character, as.factor)

4.2 Check Missing Value

Next, check whether any column contains missing values.

colSums(is.na(empl_clean))
##                  attrition                        age 
##                          0                          0 
##            business_travel                 daily_rate 
##                          0                          0 
##                 department         distance_from_home 
##                          0                          0 
##                  education            education_field 
##                          0                          0 
##             employee_count            employee_number 
##                          0                          0 
##   environment_satisfaction                     gender 
##                          0                          0 
##                hourly_rate            job_involvement 
##                          0                          0 
##                  job_level                   job_role 
##                          0                          0 
##           job_satisfaction             marital_status 
##                          0                          0 
##             monthly_income               monthly_rate 
##                          0                          0 
##       num_companies_worked                    over_18 
##                          0                          0 
##                  over_time        percent_salary_hike 
##                          0                          0 
##         performance_rating  relationship_satisfaction 
##                          0                          0 
##             standard_hours         stock_option_level 
##                          0                          0 
##        total_working_years   training_times_last_year 
##                          0                          0 
##          work_life_balance           years_at_company 
##                          0                          0 
##      years_in_current_role years_since_last_promotion 
##                          0                          0 
##    years_with_curr_manager 
##                          0

The output above shows that no column has missing values.

4.3 Check target class proportions

Next, check whether the target classes are balanced. The target here is attrition.

prop.table(table(empl_clean$attrition))
## 
##        no       yes 
## 0.8387755 0.1612245
empl_clean <- empl_clean %>% 
  select(-c(job_level, over_time, employee_count, employee_number, over_18, performance_rating, relationship_satisfaction, education, job_involvement))

4.4 Transform the predictor and target variables

Next, the predictors are expanded into dummy (0/1) columns and the target column is attached back to them.

empl_dummy <- model.matrix(~., empl_clean %>% select(-attrition)) %>% 
  as.data.frame() %>% 
  select(-1) %>% 
  bind_cols(empl_clean %>% select(attrition))

colnames(empl_dummy) <- str_replace_all(string = colnames(empl_dummy), pattern = "_", replacement = "")
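
To make the dummy-encoding step above more concrete, here is a minimal sketch on a toy data frame. The values and the `toy` / `toy_dummy` names are made up for illustration and are not taken from the dataset; only base R is used.

```r
# Toy data frame (hypothetical values, not from data-clean.csv)
toy <- data.frame(
  age = c(30, 41, 25),
  department = factor(c("sales", "hr", "sales"))
)

# model.matrix() keeps numeric columns, expands factors into 0/1 dummy
# columns (dropping the first level) and adds an "(Intercept)" column
toy_dummy <- as.data.frame(model.matrix(~ ., toy))

# drop the intercept column, which is what select(-1) does in the pipeline
toy_dummy <- toy_dummy[, -1, drop = FALSE]

print(toy_dummy)
```

The factor level `hr` becomes the reference level, so only a `departmentsales` column remains next to `age`.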

5 Cross Validation

The next step is cross validation, here a single hold-out split: 80% of the data is used for training and 20% for testing.

set.seed(100)
empl_split <- initial_split(data = empl_dummy, prop = 0.8, strata = "attrition")

empl_train <- training(empl_split)
empl_test <- testing(empl_split)

After that, run prop.table again to check the target class proportions in the training data.

prop.table(table(empl_train$attrition))
## 
##        no       yes 
## 0.8385726 0.1614274
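
The idea behind a stratified split can be sketched in base R as follows. This is only an illustration of sampling 80% within each class; it is not how `initial_split()` is implemented internally, and the toy label vector `y` is an assumption.

```r
set.seed(100)

# Toy target with roughly the same imbalance as attrition (hypothetical data)
y <- factor(c(rep("no", 84), rep("yes", 16)))

# sample 80% of the row indices separately inside each class
train_idx <- unlist(
  lapply(split(seq_along(y), y), function(i) {
    i[sample.int(length(i), size = round(0.8 * length(i)))]
  }),
  use.names = FALSE
)

# class proportions in the training part stay close to the original ones
prop_train <- prop.table(table(y[train_idx]))
print(prop_train)
```

Because sampling happens per class, the training part keeps roughly the same share of `yes` cases as the full data.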

Next, upsample the training data with the upSample() function so that both classes have the same number of rows.

set.seed(100)
empl_train_up <-  upSample(x = empl_train %>% select(-attrition), y = empl_train$attrition, yname = "attrition")
prop.table(table(empl_train_up$attrition))
## 
##  no yes 
## 0.5 0.5
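
Conceptually, upsampling repeats rows of the minority class (sampling with replacement) until it matches the majority class. The following base R sketch shows that idea on toy data; it is an approximation of what `upSample()` does, not its actual source, and all names here are hypothetical.

```r
set.seed(100)

# Toy training data: 8 "no" rows and 2 "yes" rows (hypothetical)
toy <- data.frame(
  x = 1:10,
  y = factor(c(rep("no", 8), rep("yes", 2)))
)

# size of the largest class
n_major <- max(table(toy$y))

# for every class, keep its rows and draw extra rows with replacement
# until the class reaches n_major rows
idx <- unlist(
  lapply(split(seq_len(nrow(toy)), toy$y), function(i) {
    extra <- i[sample.int(length(i), size = n_major - length(i), replace = TRUE)]
    c(i, extra)
  }),
  use.names = FALSE
)

toy_up <- toy[idx, ]
print(table(toy_up$y))
```

Only the training data is upsampled; the test data keeps its original class proportions so that the evaluation reflects the real distribution.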

6 Data Preprocessing of empl_dummy

The next step is to preprocess the data until it matches the input format that keras expects.

6.1 Convert to matrix

First, separate x and y as predictors and target, and convert the predictors from a data frame to a matrix.

# predictors
train_x <- empl_train_up %>% 
  select(-attrition) %>% 
  data.matrix()

test_x <- empl_test %>% 
  select(-attrition) %>% 
  data.matrix()

# target
train_y <- empl_train_up %>% 
  select(attrition)

test_y <- empl_test %>% 
  select(attrition)

6.2 Convert the matrix to an array

Next, convert the resulting matrices into arrays.

# predictors
train_x_keras <- train_x %>%
  array_reshape(dim = dim(train_x))
## Warning in normalizePath(path.expand(path), winslash, mustWork): path[1]="C:
## \Users\User\anaconda3\envs\rstudio/python.exe": The system cannot find the file
## specified
test_x_keras <-  test_x %>%
  array_reshape(dim = dim(test_x))

6.3 One Hot Encoding

Next, apply one hot encoding to the target variable.

train_y_keras <- train_y %>% 
  mutate(attrition = as.numeric(attrition)-1) %>%
  data.matrix() %>% 
  to_categorical(num_classes = 2)
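
What the one hot encoding produces can be shown without keras. The sketch below mimics the result of `to_categorical()` for two classes using base R only; the toy vector `y` is an assumption made for illustration.

```r
# Toy target (hypothetical), with the same level order as attrition
y <- factor(c("no", "yes", "yes", "no"), levels = c("no", "yes"))

# factor codes are 1/2, so subtract 1 to get 0/1 class indices
y_int <- as.numeric(y) - 1

# pick row (index + 1) of the 2 x 2 identity matrix for every observation
one_hot <- diag(2)[y_int + 1, ]
colnames(one_hot) <- levels(y)

print(one_hot)
```

Each row contains a single 1, in the first column for `no` and in the second column for `yes`, which is why the second column of the prediction matrix is read as the probability of `yes` later on.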

7 Deep Learning

The next step is to define the architecture. Here I define it with keras using three hidden layers with 256, 128, and 64 units respectively. The hidden layers use the relu activation, and the output layer uses the sigmoid activation.

7.1 Define architecture

tensorflow::tf$random$set_seed(100)
model_empl <- keras_model_sequential()
model_empl %>% 
  layer_dense(input_shape = ncol(train_x_keras),
              units = 256,
              activation = "relu",
              name = "hidden1") %>% 
  layer_dense(units = 128,
              activation = "relu",
              name = "hidden2") %>% 
  layer_dense(units = 64,
              activation = "relu",
              name = "hidden3") %>% 
  layer_dense(units = 2,
              activation = "sigmoid",
              name = "output")

summary(model_empl)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## hidden1 (Dense)                     (None, 256)                     10240       
## ________________________________________________________________________________
## hidden2 (Dense)                     (None, 128)                     32896       
## ________________________________________________________________________________
## hidden3 (Dense)                     (None, 64)                      8256        
## ________________________________________________________________________________
## output (Dense)                      (None, 2)                       130         
## ================================================================================
## Total params: 51,522
## Trainable params: 51,522
## Non-trainable params: 0
## ________________________________________________________________________________
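
The parameter counts in the summary can be checked by hand: a dense layer has (inputs + 1) * units parameters, the +1 being the bias. The sketch below redoes that arithmetic; the input size of 39 is inferred from the first layer (10240 / 256 - 1) rather than stated in the text.

```r
# number of input features, inferred from hidden1: 10240 / 256 - 1 = 39
n_input <- 39

# units of hidden1, hidden2, hidden3 and output
units <- c(hidden1 = 256, hidden2 = 128, hidden3 = 64, output = 2)

# each layer receives the previous layer's units as inputs
inputs <- c(n_input, head(unname(units), -1))

# weights plus one bias per unit
params <- (inputs + 1) * units

print(params)
print(sum(params))
```

The per-layer values and the total agree with the `Param #` column and the `Total params` line above.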

7.2 Compile Model

The next step is to compile the model. The optimizer I use here is adam.

7.2.1 Optimizer adam

I use the adam optimizer with a learning rate of 0.001.

model_empl %>%
  compile(optimizer = optimizer_adam(lr=0.001),
          loss = "binary_crossentropy", # binary classification
          metrics = "accuracy")
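
For reference, binary cross-entropy averages -(y * log(p) + (1 - y) * log(1 - p)) over the observations. The sketch below computes it by hand for two toy predictions; the numbers are made up, and Keras additionally clips probabilities for numerical stability, which is ignored here.

```r
# Toy labels and predicted probabilities of the positive class (hypothetical)
y_true <- c(1, 0)
p_pred <- c(0.9, 0.2)

# binary cross-entropy, averaged over observations
bce <- mean(-(y_true * log(p_pred) + (1 - y_true) * log(1 - p_pred)))

print(bce)
```

Confident and correct predictions give a loss near 0, while confident and wrong predictions are penalised heavily.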

7.3 Training Model

In this step the model is trained using the adam optimizer.

history <- model_empl %>% 
  fit(train_x_keras,
      train_y_keras,
      batch_size = 19,
      epochs = 15)

The accuracy on the training data above is 60.33%.

plot(history)
## `geom_smooth()` using formula 'y ~ x'

7.4 Predict on the Test Data

Next, predict on the test data.

pred_prob <- predict(object = model_empl, x = test_x_keras)

# use a threshold of 0.5; fixing the levels keeps both classes even if only one is predicted
pred_label <- factor(ifelse(pred_prob[,2] > 0.5, yes = "yes", no = "no"), levels = c("no", "yes"))

head(pred_label)
## [1] no no no no no no
## Levels: no yes
confusionMatrix(data = pred_label, reference = as.factor(test_y$attrition), positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  234  39
##        yes  12   8
##                                           
##                Accuracy : 0.8259          
##                  95% CI : (0.7776, 0.8676)
##     No Information Rate : 0.8396          
##     P-Value [Acc > NIR] : 0.7658164       
##                                           
##                   Kappa : 0.1582          
##                                           
##  Mcnemar's Test P-Value : 0.0002719       
##                                           
##             Sensitivity : 0.17021         
##             Specificity : 0.95122         
##          Pos Pred Value : 0.40000         
##          Neg Pred Value : 0.85714         
##              Prevalence : 0.16041         
##          Detection Rate : 0.02730         
##    Detection Prevalence : 0.06826         
##       Balanced Accuracy : 0.56072         
##                                           
##        'Positive' Class : yes             
## 

On the test data, the resulting accuracy is 82.59%.
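
The headline metrics can be recomputed directly from the four cells of the confusion matrix above, which also makes it easier to see how they relate to each other. This is a base R sketch of the arithmetic, not a replacement for `confusionMatrix()`.

```r
# Confusion matrix from the output above (rows = prediction, columns = reference)
cm <- matrix(
  c(234, 12, 39, 8),
  nrow = 2,
  dimnames = list(Prediction = c("no", "yes"), Reference = c("no", "yes"))
)

tp <- cm["yes", "yes"]  # predicted yes, actually yes
tn <- cm["no", "no"]    # predicted no, actually no
fp <- cm["yes", "no"]   # predicted yes, actually no
fn <- cm["no", "yes"]   # predicted no, actually yes

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

# accuracy obtained by always predicting the majority class
no_information_rate <- max(colSums(cm)) / sum(cm)

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity,
              no_information_rate = no_information_rate), 4))
```

The accuracy of 0.8259 is slightly below the no-information rate of 0.8396, and the sensitivity of 0.1702 shows that most employees who actually left are predicted as `no`.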

7.5 Define architecture

tensorflow::tf$random$set_seed(100)
model_empl_sgd <- keras_model_sequential()
model_empl_sgd %>% 
  layer_dense(input_shape = ncol(train_x_keras),
              units = 256,
              activation = "relu",
              name = "hidden1") %>% 
  layer_dense(units = 128,
              activation = "relu",
              name = "hidden2") %>% 
  layer_dense(units = 64,
              activation = "relu",
              name = "hidden3") %>% 
  layer_dense(units = 2,
              activation = "sigmoid",
              name = "output")

summary(model_empl_sgd)
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## hidden1 (Dense)                     (None, 256)                     10240       
## ________________________________________________________________________________
## hidden2 (Dense)                     (None, 128)                     32896       
## ________________________________________________________________________________
## hidden3 (Dense)                     (None, 64)                      8256        
## ________________________________________________________________________________
## output (Dense)                      (None, 2)                       130         
## ================================================================================
## Total params: 51,522
## Trainable params: 51,522
## Non-trainable params: 0
## ________________________________________________________________________________

7.6 Compile Model

The next step is to compile the model. The optimizer I use here is sgd.

7.6.1 Optimizer sgd

I use the sgd optimizer with a learning rate of 0.001.

model_empl_sgd %>%
  compile(optimizer = optimizer_sgd(lr=0.001),
          loss = "binary_crossentropy", # binary classification
          metrics = "accuracy")
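
To give a feel for how the two optimizers differ at the same learning rate, here is a toy sketch that minimises f(w) = 0.01 * w^2 with plain SGD and with Adam. It is a simplified illustration, not the Keras implementation; the toy loss, the 100 steps, and the Adam constants (beta1 = 0.9, beta2 = 0.999, eps = 1e-7) are assumptions chosen for the example.

```r
# gradient of the toy loss f(w) = 0.01 * w^2
grad <- function(w) 0.02 * w

lr <- 0.001
n_steps <- 100

# plain SGD: the step is proportional to the raw gradient
w_sgd <- 1
for (t in seq_len(n_steps)) {
  w_sgd <- w_sgd - lr * grad(w_sgd)
}

# Adam: the step is rescaled by running estimates of the gradient moments
beta1 <- 0.9; beta2 <- 0.999; eps <- 1e-7
w_adam <- 1; m <- 0; v <- 0
for (t in seq_len(n_steps)) {
  g <- grad(w_adam)
  m <- beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
  v <- beta2 * v + (1 - beta2) * g^2      # second moment (mean of squares)
  m_hat <- m / (1 - beta1^t)              # bias correction
  v_hat <- v / (1 - beta2^t)
  w_adam <- w_adam - lr * m_hat / (sqrt(v_hat) + eps)
}

print(c(sgd = w_sgd, adam = w_adam))
```

With such small gradients, SGD barely moves from the starting point, while Adam takes steps of roughly the learning rate regardless of the gradient scale. This may be one reason a learning rate of 0.001 behaves very differently for the two optimizers.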

7.7 Training Model

In this step the model is trained using the sgd optimizer.

history <- model_empl_sgd %>% 
  fit(train_x_keras,
      train_y_keras,
      batch_size = 19,
      epochs = 15)

The accuracy on the training data above is 50.00%.

plot(history)
## `geom_smooth()` using formula 'y ~ x'

7.8 Predict on the Test Data

Next, predict on the test data.

pred_prob_sgd <- predict(object = model_empl_sgd, x = test_x_keras)

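# use a threshold of 0.5 on the SGD model's probabilities
# (pred_prob_sgd; thresholding pred_prob would reuse the Adam model's predictions)
pred_label_sgd <- factor(ifelse(pred_prob_sgd[,2] > 0.5, yes = "yes", no = "no"), levels = c("no", "yes"))

head(pred_label_sgd)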
## [1] no no no no no no
## Levels: no yes
confusionMatrix(data = pred_label_sgd, reference = as.factor(test_y$attrition), positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  234  39
##        yes  12   8
##                                           
##                Accuracy : 0.8259          
##                  95% CI : (0.7776, 0.8676)
##     No Information Rate : 0.8396          
##     P-Value [Acc > NIR] : 0.7658164       
##                                           
##                   Kappa : 0.1582          
##                                           
##  Mcnemar's Test P-Value : 0.0002719       
##                                           
##             Sensitivity : 0.17021         
##             Specificity : 0.95122         
##          Pos Pred Value : 0.40000         
##          Neg Pred Value : 0.85714         
##              Prevalence : 0.16041         
##          Detection Rate : 0.02730         
##    Detection Prevalence : 0.06826         
##       Balanced Accuracy : 0.56072         
##                                           
##        'Positive' Class : yes             
## 

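The output above reports a test accuracy of 82.59%. The confusion matrix is identical, cell for cell, to the one from the Adam model, so this figure most likely reflects the Adam predictions (`pred_prob`) rather than `pred_prob_sgd`, and the SGD model's own test accuracy would need to be recomputed.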

8 Kesimpulan

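The results I obtained for the two optimizers are:

* adam
  - Training accuracy: 60.33%
  - Test accuracy: 82.59%
* sgd
  - Training accuracy: 50.00%
  - Test accuracy: 82.59%

My conclusion is that the adam optimizer gives better accuracy on the training data, while on the test data the reported accuracy is the same for adam and sgd. The test comparison should be read with caution: the reported sgd confusion matrix is identical to the adam one cell for cell, which suggests it reflects the adam predictions rather than the sgd model's own. In both cases the reported 82.59% is also slightly below the no-information rate of 83.96%, with a sensitivity of only 17.02% for the `yes` class.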