In this post I will build a prediction model and evaluate its accuracy using a neural network. For this case I use the dataset from https://github.com/bagasbgy/keras-examples/tree/classification-dense/data/data-clean.csv . It is tabular data describing employees who have or have not experienced attrition.
Before moving on to the neural network, first import all the required libraries.
# import libs
library(neuralnet)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.5 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::compute() masks neuralnet::compute()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:MLmetrics':
##
## MAE, RMSE
## The following object is masked from 'package:purrr':
##
## lift
library(rsample)
library(keras)
The second step is to read the dataset.
empl <- read.csv("dataset/data-clean.csv")
glimpse(empl)
## Rows: 1,470
## Columns: 35
## $ attrition <chr> "yes", "no", "yes", "no", "no", "no", "n...
## $ age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, ...
## $ business_travel <chr> "travel_rarely", "travel_frequently", "t...
## $ daily_rate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, ...
## $ department <chr> "sales", "research_development", "resear...
## $ distance_from_home <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15,...
## $ education <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2...
## $ education_field <chr> "life_sciences", "life_sciences", "other...
## $ employee_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ employee_number <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15...
## $ environment_satisfaction <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2...
## $ gender <chr> "female", "male", "male", "female", "mal...
## $ hourly_rate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, ...
## $ job_involvement <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3...
## $ job_level <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1...
## $ job_role <chr> "sales_executive", "research_scientist",...
## $ job_satisfaction <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4...
## $ marital_status <chr> "single", "married", "single", "married"...
## $ monthly_income <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670...
## $ monthly_rate <int> 19479, 24907, 2396, 23159, 16632, 11864,...
## $ num_companies_worked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0...
## $ over_18 <chr> "y", "y", "y", "y", "y", "y", "y", "y", ...
## $ over_time <chr> "yes", "no", "yes", "yes", "no", "no", "...
## $ percent_salary_hike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, ...
## $ performance_rating <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3...
## $ relationship_satisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3...
## $ standard_hours <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, ...
## $ stock_option_level <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1...
## $ total_working_years <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10,...
## $ training_times_last_year <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2...
## $ work_life_balance <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3...
## $ years_at_company <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, ...
## $ years_in_current_role <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2...
## $ years_since_last_promotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1...
## $ years_with_curr_manager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2...
The next step is to convert the character columns to factors.
empl_clean <- empl %>%
  mutate_if(is.character, as.factor)
Next, check whether the data contains any missing values.
colSums(is.na(empl_clean))
## attrition age
## 0 0
## business_travel daily_rate
## 0 0
## department distance_from_home
## 0 0
## education education_field
## 0 0
## employee_count employee_number
## 0 0
## environment_satisfaction gender
## 0 0
## hourly_rate job_involvement
## 0 0
## job_level job_role
## 0 0
## job_satisfaction marital_status
## 0 0
## monthly_income monthly_rate
## 0 0
## num_companies_worked over_18
## 0 0
## over_time percent_salary_hike
## 0 0
## performance_rating relationship_satisfaction
## 0 0
## standard_hours stock_option_level
## 0 0
## total_working_years training_times_last_year
## 0 0
## work_life_balance years_at_company
## 0 0
## years_in_current_role years_since_last_promotion
## 0 0
## years_with_curr_manager
## 0
The output above shows there are no missing values.
Next, check whether the target classes are balanced. My target here is attrition.
prop.table(table(empl_clean$attrition))
##
## no yes
## 0.8387755 0.1612245
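About 84% of employees did not experience attrition, so the classes are imbalanced. Before encoding, I drop the identifier and near-constant columns (employee_number, employee_count, over_18) along with several features judged less informative for this model.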
empl_clean <- empl_clean %>%
  select(-c(job_level, over_time, employee_count, employee_number, over_18,
            performance_rating, relationship_satisfaction, education, job_involvement))
Next, dummy-encode the predictor variables and recombine them with the target. model.matrix() expands every factor into 0/1 indicator columns; the intercept column it adds is dropped with select(-1).
# dummy-encode the predictors, drop the intercept column, re-attach the target
empl_dummy <- model.matrix(~., empl_clean %>% select(-attrition)) %>%
  as.data.frame() %>%
  select(-1) %>%
  bind_cols(empl_clean %>% select(attrition))

# remove underscores from the generated column names
colnames(empl_dummy) <- str_replace_all(string = colnames(empl_dummy), pattern = "_", replacement = "")
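To see what model.matrix() does, here is a minimal illustration on toy data (the toy object is hypothetical and not part of the pipeline): it expands a factor into 0/1 indicator columns, plus the intercept column that select(-1) removes above.
# toy example only: one factor column becomes indicator columns
toy <- data.frame(gender = factor(c("female", "male", "male")))
model.matrix(~ ., toy)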
The next step is to split the data into 80% training and 20% test sets, stratified by the target.
set.seed(100)
empl_split <- initial_split(data = empl_dummy, prop = 0.8, strata = "attrition")
empl_train <- training(empl_split)
empl_test <- testing(empl_split)
Then run prop.table() again to check the class proportions of the target in the training set.
prop.table(table(empl_train$attrition))
##
## no yes
## 0.8385726 0.1614274
Next, upsample the training data with the upSample() function so both classes are equally represented. Only the training set is resampled; the test set keeps the original class distribution.
set.seed(100)
empl_train_up <- upSample(x = empl_train %>% select(-attrition), y = empl_train$attrition, yname = "attrition")
prop.table(table(empl_train_up$attrition))
##
## no yes
## 0.5 0.5
The next step is to preprocess the data until it matches the input format keras requires: separate the predictors (x) from the target (y) and convert the data frames to matrices.
# predictors
train_x <- empl_train_up %>%
  select(-attrition) %>%
  data.matrix()

test_x <- empl_test %>%
  select(-attrition) %>%
  data.matrix()

# target
train_y <- empl_train_up %>%
  select(attrition)

test_y <- empl_test %>%
  select(attrition)
The next step is to convert the matrices into arrays with array_reshape(), which stores the data in row-major order as keras expects.
# predictors
train_x_keras <- train_x %>%
  array_reshape(dim = dim(train_x))
test_x_keras <- test_x %>%
  array_reshape(dim = dim(test_x))
The next step is to one-hot encode the target variable.
train_y_keras <- train_y %>%
  mutate(attrition = as.numeric(attrition) - 1) %>%
  data.matrix() %>%
  to_categorical(num_classes = 2)
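As a quick illustration of the encoding step (toy input, assuming the keras backend is available): to_categorical() maps integer labels 0/1 to one-hot rows.
# toy example only: labels 0/1 become two indicator columns
to_categorical(c(0, 1, 1, 0), num_classes = 2)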
The next step is to define the architecture. Here I use keras with 3 hidden layers of 256, 128, and 64 units respectively, each with relu activation, and a sigmoid activation on the output layer. (With a two-unit one-hot output, softmax would be the more common choice, but sigmoid paired with binary cross-entropy also works.)
tensorflow::tf$random$set_seed(100)

model_empl <- keras_model_sequential()

model_empl %>%
  layer_dense(input_shape = ncol(train_x_keras),
              units = 256,
              activation = "relu",
              name = "hidden1") %>%
  layer_dense(units = 128,
              activation = "relu",
              name = "hidden2") %>%
  layer_dense(units = 64,
              activation = "relu",
              name = "hidden3") %>%
  layer_dense(units = 2,
              activation = "sigmoid",
              name = "output")
summary(model_empl)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## hidden1 (Dense) (None, 256) 10240
## ________________________________________________________________________________
## hidden2 (Dense) (None, 128) 32896
## ________________________________________________________________________________
## hidden3 (Dense) (None, 64) 8256
## ________________________________________________________________________________
## output (Dense) (None, 2) 130
## ================================================================================
## Total params: 51,522
## Trainable params: 51,522
## Non-trainable params: 0
## ________________________________________________________________________________
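As a sanity check on the parameter counts: hidden1 has (39 inputs + 1 bias) × 256 units = 10,240 parameters, which implies 39 predictor columns after dummy encoding; likewise hidden2 has (256 + 1) × 128 = 32,896, hidden3 has (128 + 1) × 64 = 8,256, and the output layer (64 + 1) × 2 = 130.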
The next step is to compile the model. I use the adam optimizer with a learning rate of 0.001 and binary cross-entropy as the loss.
model_empl %>%
  compile(optimizer = optimizer_adam(lr = 0.001),
          loss = "binary_crossentropy", # binary classification
          metrics = "accuracy")
At this stage, train the compiled model on the upsampled training data.
history <- model_empl %>%
  fit(train_x_keras,
      train_y_keras,
      batch_size = 19,
      epochs = 15)
The final training accuracy is 60.33%.
plot(history)
## `geom_smooth()` using formula 'y ~ x'
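The final training accuracy can also be read programmatically from the history object. A minimal sketch; the metric name is assumed to be "accuracy" here (older keras versions may record it as "acc"):
# last-epoch training accuracy from the fit history
tail(history$metrics$accuracy, 1)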
Next, make predictions on the test data.
pred_prob <- predict(object = model_empl, x = test_x_keras)

# use a 0.5 threshold to turn probabilities into class labels
pred_label <- as.factor(ifelse(pred_prob[, 2] > 0.5, yes = "yes", no = "no"))
head(pred_label)
## [1] no no no no no no
## Levels: no yes
confusionMatrix(data = pred_label, reference = as.factor(test_y$attrition), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 234 39
## yes 12 8
##
## Accuracy : 0.8259
## 95% CI : (0.7776, 0.8676)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 0.7658164
##
## Kappa : 0.1582
##
## Mcnemar's Test P-Value : 0.0002719
##
## Sensitivity : 0.17021
## Specificity : 0.95122
## Pos Pred Value : 0.40000
## Neg Pred Value : 0.85714
## Prevalence : 0.16041
## Detection Rate : 0.02730
## Detection Prevalence : 0.06826
## Balanced Accuracy : 0.56072
##
## 'Positive' Class : yes
##
On the test data, the resulting accuracy is 82.59%. Note that the sensitivity is only 17.02%, so the headline accuracy is largely driven by the majority "no" class.
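As a cross-check, keras can also report the test loss and accuracy directly. A minimal sketch, assuming the test target is one-hot encoded the same way as the training target:
# one-hot encode the test target, mirroring train_y_keras
test_y_keras <- to_categorical(as.numeric(test_y$attrition) - 1, num_classes = 2)

# test loss and accuracy straight from keras
model_empl %>% evaluate(test_x_keras, test_y_keras)
For comparison, I now build the same architecture but train it with the sgd optimizer.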
tensorflow::tf$random$set_seed(100)

model_empl_sgd <- keras_model_sequential()

model_empl_sgd %>%
  layer_dense(input_shape = ncol(train_x_keras),
              units = 256,
              activation = "relu",
              name = "hidden1") %>%
  layer_dense(units = 128,
              activation = "relu",
              name = "hidden2") %>%
  layer_dense(units = 64,
              activation = "relu",
              name = "hidden3") %>%
  layer_dense(units = 2,
              activation = "sigmoid",
              name = "output")
summary(model_empl_sgd)
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## hidden1 (Dense) (None, 256) 10240
## ________________________________________________________________________________
## hidden2 (Dense) (None, 128) 32896
## ________________________________________________________________________________
## hidden3 (Dense) (None, 64) 8256
## ________________________________________________________________________________
## output (Dense) (None, 2) 130
## ================================================================================
## Total params: 51,522
## Trainable params: 51,522
## Non-trainable params: 0
## ________________________________________________________________________________
The next step is to compile this model, this time with the sgd optimizer and a learning rate of 0.001.
model_empl_sgd %>%
  compile(optimizer = optimizer_sgd(lr = 0.001),
          loss = "binary_crossentropy", # binary classification
          metrics = "accuracy")
At this stage, train the model compiled with the sgd optimizer.
history <- model_empl_sgd %>%
  fit(train_x_keras,
      train_y_keras,
      batch_size = 19,
      epochs = 15)
The final training accuracy is 50.00%, which on the balanced (upsampled) training data is no better than chance.
plot(history)
## `geom_smooth()` using formula 'y ~ x'
Next, make predictions on the test data.
pred_prob_sgd <- predict(object = model_empl_sgd, x = test_x_keras)

# use a 0.5 threshold
pred_label_sgd <- as.factor(ifelse(pred_prob_sgd[, 2] > 0.5, yes = "yes", no = "no"))
head(pred_label_sgd)
## [1] no no no no no no
## Levels: no yes
confusionMatrix(data = pred_label_sgd, reference = as.factor(test_y$attrition), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 234 39
## yes 12 8
##
## Accuracy : 0.8259
## 95% CI : (0.7776, 0.8676)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 0.7658164
##
## Kappa : 0.1582
##
## Mcnemar's Test P-Value : 0.0002719
##
## Sensitivity : 0.17021
## Specificity : 0.95122
## Pos Pred Value : 0.40000
## Neg Pred Value : 0.85714
## Prevalence : 0.16041
## Detection Rate : 0.02730
## Detection Prevalence : 0.06826
## Balanced Accuracy : 0.56072
##
## 'Positive' Class : yes
##
On the test data, the resulting accuracy is 82.59%.
To summarize the two optimizers:

* adam: training accuracy 60.33%, test accuracy 82.59%
* sgd: training accuracy 50.00%, test accuracy 82.59%
My conclusion is that the adam optimizer reaches a clearly better training accuracy, while on the test data both optimizers ended up with the same accuracy in this run. Accuracy alone is flattering here, though: with a balanced accuracy of about 56% and a sensitivity of about 17%, both models still struggle to detect the positive attrition class.