Introduction

Deep learning is a framework that is often used when dealing with unstructured data. Therefore, let’s classify images using deep learning with a hard framework. We will use HR Analytics Employee Attrition & Performance data. We will try to find out whether employees will leave / resign with the data we have. Previously we will import the libbrary according to below;

Source: Kaggle

Import Data & Inspection

We’ll call data with the read.csv function below:

Attrition <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")

with head() function we’ll take a quick look at our data below;

head(Attrition)

The data contains:

age : Age of employee
Attrition : Attrition of employee (Yes, No)
BusinessTravel : Frequency of business travel (Non-Travel, Travel_Rarely, Travel_Frequently)
DailyRate : Amount of money a company has to pay employee to work for them for a day
Department : Work Department (Research and Development, Sales, Human Resources)
DistanceFromHome : Distance between company and home
Education : Level of education (1: Below College, 2: College, 3: Bachelor, 4: Master, 5: Doctor)
EducationField : Field of Education (Life Sciences, Medical, Human Resources, Technical Degree, Marketing, Other)
EmployeeCount : Count of employee (always 1)
EmployeeNumber : ID Employee
EnvironmentSatisfaction: Satisfaction of environment score(1: Low, 2: Medium, 3: High, 4: Very High)
Gender : Gender (male/female)
HourlyRate : Amount of money a company has to pay employee to work for them for an hour
JobInvolvement : Level of job involvement (1: Low, 2: Medium, 3: High, 4: Very High)
JobLevel : Level of job (1 - 5)
JobRole : Role of job (Sales Executive, Research Scientist, Laboratory Technician, Manager, Healthcare Representative, Sales Representative, Manufacturing Director, Human Resources, Manager)
JobSatisfaction : Satisfaction of job (1: Low, 2: Medium, 3: High, 4: Very High)
MaritalStatus : Marital Status (Married, Single, Divorced)
MonthlyIncome : Monthly Income
MonthlyRate : Percent of salary of hike
NumCompaniesWorked : Total number of companies have been worked with
Over18 : Employee age over 18 years old (Yes, No)
OverTime : Frequently spent overtime working (Yes, No)
PercentSalaryHike : Percent of salary of hike
PerformanceRating : Level of performance assessment (1: Low, 2: Good, 3: Excellent, 4: Outstanding)
RelationshipSatisfaction: Level of relationship satisfaction (1: Low, 2: Medium, 3: High, 4: Very High)
StandardHours : Standard work hours (always 80)
StockOptionLevel : Stock option level (0 - 3)
TotalWorkingYears : Years of total working
TrainingTimesLastYear : Training times of last year
WorkLifeBalance : Level of work life balance (1: Bad, 2: Good, 3: Better, 4: Best)
YearsAtCompany : Years at company
YearsInCurrentRole : Years in current role
YearsSinceLastPromotion: Years since last promotion
YearsWithCurrManager : Years with current manager

Data Preprocesing

With the command glimpse () we will check whether the data is in accordance with the type or whether there is something that must be changed

glimpse(Attrition)

#> Rows: 1,470
#> Columns: 35
#> $ Age                      <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
#> $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
#> $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
#> $ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
#> $ Department               <chr> "Sales", "Research & Development", "Research …
#> $ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
#> $ Education                <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
#> $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
#> $ EmployeeCount            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ EmployeeNumber           <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
#> $ EnvironmentSatisfaction  <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
#> $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", "…
#> $ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
#> $ JobInvolvement           <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
#> $ JobLevel                 <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
#> $ JobRole                  <chr> "Sales Executive", "Research Scientist", "Lab…
#> $ JobSatisfaction          <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
#> $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "Ma…
#> $ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
#> $ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
#> $ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
#> $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
#> $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
#> $ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
#> $ PerformanceRating        <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
#> $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
#> $ StandardHours            <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
#> $ StockOptionLevel         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
#> $ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
#> $ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
#> $ WorkLifeBalance          <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
#> $ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
#> $ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
#> $ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
#> $ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …

From the function above, we can get information that the data has 1.470 rows with 35 columns. Some of the data types that we will delete include: EmployeeNumber ,over18, EmployeeCount and StandardHours. For data that has a character type, we will change it to a factor.

attrition_clean <- Attrition %>% 
  select(-EmployeeNumber, -Over18, -EmployeeCount, -StandardHours) %>% 
  mutate_if(.predicate = is.character, .funs = as.factor) %>% mutate(Attrition = if_else(Attrition == "Yes", 1, 0),
                                                                     Attrition = as.factor(Attrition))

head(attrition_clean)

prop.table(table(attrition_clean$Attrition))

#> 
#>         0         1 
#> 0.8387755 0.1612245

Next we will check if there are any missing values

colSums(is.na(attrition_clean))

#>                      Age                Attrition           BusinessTravel 
#>                        0                        0                        0 
#>                DailyRate               Department         DistanceFromHome 
#>                        0                        0                        0 
#>                Education           EducationField  EnvironmentSatisfaction 
#>                        0                        0                        0 
#>                   Gender               HourlyRate           JobInvolvement 
#>                        0                        0                        0 
#>                 JobLevel                  JobRole          JobSatisfaction 
#>                        0                        0                        0 
#>            MaritalStatus            MonthlyIncome              MonthlyRate 
#>                        0                        0                        0 
#>       NumCompaniesWorked                 OverTime        PercentSalaryHike 
#>                        0                        0                        0 
#>        PerformanceRating RelationshipSatisfaction         StockOptionLevel 
#>                        0                        0                        0 
#>        TotalWorkingYears    TrainingTimesLastYear          WorkLifeBalance 
#>                        0                        0                        0 
#>           YearsAtCompany       YearsInCurrentRole  YearsSinceLastPromotion 
#>                        0                        0                        0 
#>     YearsWithCurrManager 
#>                        0

From the function above, we can see that there are no missing values

Handling Imbalance Data

# upsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)

attrition_up <- upSample(x = attrition_clean %>% select(-Attrition),
                       y = attrition_clean$Attrition,
                       yname = "Attrition")

prop.table(table(attrition_up$Attrition))

#> 
#>   0   1 
#> 0.5 0.5

clas target has balance

we will do one hot encoding, with the aim of changing the categorical data type to a dummy variable except coloumn target

attrition_ohe <- model.matrix(Attrition ~ .,attrition_up)

We must delete (intercept) and additional column target Attrition

attrition_ohe_clean1 <- 
attrition_ohe %>% 
  as.data.frame() %>% 
  select(-'(Intercept)')

attrition_ohe_clean1$Attrition <- attrition_up$Attrition

After doing One Hot Encoding, there are several new column names that have *_, space* symbols. While the neuralnet model cannot accept a column name that has a symbol, therefore we must remove the symbol

colnames(attrition_ohe_clean1)

#>  [1] "Age"                              "BusinessTravelTravel_Frequently" 
#>  [3] "BusinessTravelTravel_Rarely"      "DailyRate"                       
#>  [5] "DepartmentResearch & Development" "DepartmentSales"                 
#>  [7] "DistanceFromHome"                 "Education"                       
#>  [9] "EducationFieldLife Sciences"      "EducationFieldMarketing"         
#> [11] "EducationFieldMedical"            "EducationFieldOther"             
#> [13] "EducationFieldTechnical Degree"   "EnvironmentSatisfaction"         
#> [15] "GenderMale"                       "HourlyRate"                      
#> [17] "JobInvolvement"                   "JobLevel"                        
#> [19] "JobRoleHuman Resources"           "JobRoleLaboratory Technician"    
#> [21] "JobRoleManager"                   "JobRoleManufacturing Director"   
#> [23] "JobRoleResearch Director"         "JobRoleResearch Scientist"       
#> [25] "JobRoleSales Executive"           "JobRoleSales Representative"     
#> [27] "JobSatisfaction"                  "MaritalStatusMarried"            
#> [29] "MaritalStatusSingle"              "MonthlyIncome"                   
#> [31] "MonthlyRate"                      "NumCompaniesWorked"              
#> [33] "OverTimeYes"                      "PercentSalaryHike"               
#> [35] "PerformanceRating"                "RelationshipSatisfaction"        
#> [37] "StockOptionLevel"                 "TotalWorkingYears"               
#> [39] "TrainingTimesLastYear"            "WorkLifeBalance"                 
#> [41] "YearsAtCompany"                   "YearsInCurrentRole"              
#> [43] "YearsSinceLastPromotion"          "YearsWithCurrManager"            
#> [45] "Attrition"

With prompt gsub we can remove space, & , _

colnames(attrition_ohe_clean1) <- gsub(pattern = "[ &_]", replacement = "", x = colnames(attrition_ohe_clean1))

Exploratory Data Analysis

In this section, we will show some statistics on the distribution of data for each variable in the attrition_clean dataset. By using the summary() function, the following output is produced:

summary(attrition_clean)

#>       Age        Attrition           BusinessTravel   DailyRate     
#>  Min.   :18.00   0:1233    Non-Travel       : 150   Min.   : 102.0  
#>  1st Qu.:30.00   1: 237    Travel_Frequently: 277   1st Qu.: 465.0  
#>  Median :36.00             Travel_Rarely    :1043   Median : 802.0  
#>  Mean   :36.92                                      Mean   : 802.5  
#>  3rd Qu.:43.00                                      3rd Qu.:1157.0  
#>  Max.   :60.00                                      Max.   :1499.0  
#>                                                                     
#>                   Department  DistanceFromHome   Education    
#>  Human Resources       : 63   Min.   : 1.000   Min.   :1.000  
#>  Research & Development:961   1st Qu.: 2.000   1st Qu.:2.000  
#>  Sales                 :446   Median : 7.000   Median :3.000  
#>                               Mean   : 9.193   Mean   :2.913  
#>                               3rd Qu.:14.000   3rd Qu.:4.000  
#>                               Max.   :29.000   Max.   :5.000  
#>                                                               
#>           EducationField EnvironmentSatisfaction    Gender      HourlyRate    
#>  Human Resources : 27    Min.   :1.000           Female:588   Min.   : 30.00  
#>  Life Sciences   :606    1st Qu.:2.000           Male  :882   1st Qu.: 48.00  
#>  Marketing       :159    Median :3.000                        Median : 66.00  
#>  Medical         :464    Mean   :2.722                        Mean   : 65.89  
#>  Other           : 82    3rd Qu.:4.000                        3rd Qu.: 83.75  
#>  Technical Degree:132    Max.   :4.000                        Max.   :100.00  
#>                                                                               
#>  JobInvolvement    JobLevel                          JobRole    JobSatisfaction
#>  Min.   :1.00   Min.   :1.000   Sales Executive          :326   Min.   :1.000  
#>  1st Qu.:2.00   1st Qu.:1.000   Research Scientist       :292   1st Qu.:2.000  
#>  Median :3.00   Median :2.000   Laboratory Technician    :259   Median :3.000  
#>  Mean   :2.73   Mean   :2.064   Manufacturing Director   :145   Mean   :2.729  
#>  3rd Qu.:3.00   3rd Qu.:3.000   Healthcare Representative:131   3rd Qu.:4.000  
#>  Max.   :4.00   Max.   :5.000   Manager                  :102   Max.   :4.000  
#>                                 (Other)                  :215                  
#>   MaritalStatus MonthlyIncome    MonthlyRate    NumCompaniesWorked OverTime  
#>  Divorced:327   Min.   : 1009   Min.   : 2094   Min.   :0.000      No :1054  
#>  Married :673   1st Qu.: 2911   1st Qu.: 8047   1st Qu.:1.000      Yes: 416  
#>  Single  :470   Median : 4919   Median :14236   Median :2.000                
#>                 Mean   : 6503   Mean   :14313   Mean   :2.693                
#>                 3rd Qu.: 8379   3rd Qu.:20462   3rd Qu.:4.000                
#>                 Max.   :19999   Max.   :26999   Max.   :9.000                
#>                                                                              
#>  PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel
#>  Min.   :11.00     Min.   :3.000     Min.   :1.000            Min.   :0.0000  
#>  1st Qu.:12.00     1st Qu.:3.000     1st Qu.:2.000            1st Qu.:0.0000  
#>  Median :14.00     Median :3.000     Median :3.000            Median :1.0000  
#>  Mean   :15.21     Mean   :3.154     Mean   :2.712            Mean   :0.7939  
#>  3rd Qu.:18.00     3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:1.0000  
#>  Max.   :25.00     Max.   :4.000     Max.   :4.000            Max.   :3.0000  
#>                                                                               
#>  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
#>  Min.   : 0.00     Min.   :0.000         Min.   :1.000   Min.   : 0.000  
#>  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000  
#>  Median :10.00     Median :3.000         Median :3.000   Median : 5.000  
#>  Mean   :11.28     Mean   :2.799         Mean   :2.761   Mean   : 7.008  
#>  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000  
#>  Max.   :40.00     Max.   :6.000         Max.   :4.000   Max.   :40.000  
#>                                                                          
#>  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
#>  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
#>  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
#>  Median : 3.000     Median : 1.000          Median : 3.000      
#>  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
#>  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
#>  Max.   :18.000     Max.   :15.000          Max.   :17.000      
#>

Insight:

Mean adn median for age is almost same
Most attrition is NO with value 1233
BusinessTravel most in data is Travel_Rarely
DailyRate for minimum and maximum is different high
Departement count most is Research & Development
DistanceFromHome most far is 29km
mean Education is bachelor degree
EducationField most is Life Sciences with count 606
Satisfaction of environment score mean is 2.722
Gender male is dominan, but gap is not to far
HourlyRate rate mean is 65.89, but minimum hourly rate is 3 more times than maximum hourly rate
Job level involement mean is almost high level
Level job mean is 2
Job role most count is Sales Executive with 326
Job satisfaction mean almost high, almost same as median
Martial status most is married
Monthly income mean is 6503, but minimum monthly income is 20 more times than maximum mounthly income.
Monthly rate hike mean is 14313
NumCompaniesWorked mean is 2.693
Most of employes is no overtime 1054
PercentSalaryHike mean is 15.21
PerformanceRating mean is 3.154
RelationshipSatisfaction mean is 2.712 almost high
StockOptionLevel mean is 0.79
TotalWorkingYears maximum is 40, and mean is 11.28
TrainingTimesLastYear mean is 2.79
WorkLifeBalance mean is 2.76 that mena is almost better (3)
YearsAtCompany maximum 40th years
YearsInCurrentRole mean 4.299
YearsSinceLastPromotion maximum is 15 years
YearsWithCurrManager maximum is 17 years

From the preprocessing data we will explore what we get, including the correlation between variables and their predictors

attrition_BT <- attrition_clean %>% group_by(Attrition, BusinessTravel) %>% summarise(count= n())


ggplot(attrition_BT, mapping = aes(x = count, y = reorder(Attrition, count)))+
  geom_col(aes(fill = BusinessTravel), position = "dodge")+
  scale_fill_brewer(palette = "Blues")+
  labs(
     title = "Total Attrition in HR Analytics Employee Attrition & Performance",
     subtitle = "Attrition Vs Business Travel",
     x = "Count Attrition",
     y = NULL,
     fill = "Business Travel"
   ) + theme_light()

From above plot Travel_rarely is most attrition no and travel_rarely is most attrition yes

attrition_dpt <- attrition_clean %>% group_by(Attrition, Department) %>% summarise(count= n())


ggplot(attrition_dpt, mapping = aes(x = count, y = reorder(Attrition, count)))+
  geom_col(aes(fill = Department), position = "dodge")+
  scale_fill_brewer(palette = "Blues")+
  labs(
     title = "Total Attrition in HR Analytics Employee Attrition & Performance",
     subtitle = "Attrition Vs Department",
     x = "Count Attrition",
     y = NULL,
     fill = "Department"
   ) + theme_light()

From Above Plot, department Research & development is most attrition no and departement Research & development is most attrition yes

attrition_ef <- attrition_clean %>% group_by(Attrition, EducationField) %>% summarise(count= n())


ggplot(attrition_ef, mapping = aes(x = count, y = reorder(Attrition, count)))+
  geom_col(aes(fill = EducationField), position = "dodge")+
  scale_fill_brewer(palette = "Blues")+
  labs(
     title = "Total Attrition in HR Analytics Employee Attrition & Performance",
     subtitle = "Attrition Vs Education Field",
     x = "Count Attrition",
     y = NULL,
     fill = "Education Field"
   ) + theme_light()

From Above Plot, education life sciencis is most attrition no and education life sciencis is most attrition yes

attrition_jr <- attrition_clean %>% group_by(Attrition, JobRole) %>% summarise(count= n())


ggplot(attrition_jr, mapping = aes(x = count, y = reorder(Attrition, count)))+
  geom_col(aes(fill = JobRole), position = "dodge")+
  scale_fill_brewer(palette = "Blues")+
  labs(
     title = "Total Attrition in HR Analytics Employee Attrition & Performance",
     subtitle = "Attrition Vs Job Role",
     x = "Count Attrition",
     y = NULL,
     fill = "Job Role"
   ) + theme_light()

From Above Plot, job role sales executive is most attrition no and than reaserch scientist and job role laboratory technician is most attrition yes

attrition_ms <- attrition_clean %>% group_by(Attrition, MaritalStatus) %>% summarise(count= n())


ggplot(attrition_ms, mapping = aes(x = count, y = reorder(Attrition, count)))+
  geom_col(aes(fill = MaritalStatus), position = "dodge")+
  scale_fill_brewer(palette = "Blues")+
  labs(
     title = "Total Attrition in HR Analytics Employee Attrition & Performance",
     subtitle = "Attrition Vs Marital Status",
     x = "Count Attrition",
     y = NULL,
     fill = "Marital Status"
   ) + theme_light()

From Above Plot, martial status married most attrition no and martial status single most attrition yes

attrition_ms <- attrition_clean %>% group_by(Attrition, OverTime) %>% summarise(count= n())


ggplot(attrition_ms, mapping = aes(x = count, y = reorder(Attrition, count)))+
  geom_col(aes(fill = OverTime), position = "dodge")+
  scale_fill_brewer(palette = "Blues")+
  labs(
     title = "Total Attrition in HR Analytics Employee Attrition & Performance",
     subtitle = "Attrition Vs Over Time",
     x = "Count Attrition",
     y = NULL,
     fill = "Over Time"
   ) + theme_light()

From Above Plot, over time no most attrition no and over time yes most attrition yes

Cross Validation

Perform cross validation using initial_split with a proportion of 80% data for training data

set.seed(100)

# Spliting
attrition_split <- initial_split(data = attrition_ohe_clean1, # nama df
                             prop = 0.8, # proporsi untuk data train
                             strata = "Attrition") # nama label target untuk membuat label target balanced

# Data train
attrition_train <- training(attrition_split)

# Data validation
attrition_test <- testing(attrition_split)

We will cek balance data train

prop.table(table(attrition_train$Attrition))

#> 
#>   0   1 
#> 0.5 0.5

Separate the target-predictors, turn the data into a matrix

#prediktor
train_x <- attrition_train %>% 
  select(-Attrition) %>% 
  as.matrix()

test_x <- attrition_test %>% 
  select(-Attrition) %>% 
  as.matrix()

#target
train_y <- attrition_train %>% 
  select(Attrition) %>% 
  as.matrix()

test_y <- attrition_test %>% 
  select(Attrition) %>% 
  as.matrix()

Keras framework accepts data in array form. So predictor data in matrix form needs to be converted into array form using array_reshape().

# prediktor
train_x_keras <- train_x %>% 
  array_reshape(dim = dim(train_x))

test_x_keras <- test_x %>% 
  array_reshape(dim = dim(test_x))

Convert the target (categorical data) to a One Hot Encoding variable using the to_categorical() function:

train_y_keras <- train_y %>% 
  to_categorical()

test_y_keras <- test_y %>% 
  to_categorical()

Model Building

Define Architecture

The first step we do is build the architecture of our deep learning model. In the following, there are some provisions to help deep learning model architecture using Keras.

step 1. Always prefixed with keras_model_sequential()

# keras initialization
model <- keras_model_sequential()

step 2. Building each layer (Input, Hidden & Output)

# Untuk mengunci random
set.seed(100)
 
# Please type your answer
model %>% 
  layer_dense(input_shape = 44,
              units = 256,
              activation = "relu",
              name = "H1") %>%
  layer_dense(units = 128,
              activation = "relu",
              name = "H2") %>%
  layer_dense(units = 32,
              activation = "relu",
              name = "H3") %>%
  layer_dense(units = 2,
              activation = "sigmoid",
              name = "Out")

summary(model)

#> Model: "sequential"
#> ________________________________________________________________________________
#>  Layer (type)                       Output Shape                    Param #     
#> ================================================================================
#>  H1 (Dense)                         (None, 256)                     11520       
#>  H2 (Dense)                         (None, 128)                     32896       
#>  H3 (Dense)                         (None, 32)                      4128        
#>  Out (Dense)                        (None, 2)                       66          
#> ================================================================================
#> Total params: 48,610
#> Trainable params: 48,610
#> Non-trainable params: 0
#> ________________________________________________________________________________

Insight:

Total params : weight of model 2.122
Trainable params : parameters/connections whose weights can change according to the training process
Non-trainable params: weight/parameter values don’t change or are locked in value

Compile a Model

At the stage of building the architecture, we have not provided an error calculation that will be used. It is at this stage that we will provide a way of calculating errors, using the compile() function.

set.seed(100)

model %>% 
  compile(loss = "binary_crossentropy",
          optimizer = optimizer_adam(learning_rate =0.001),
          metrics = "accuracy")

Fit (Training Model)

set.seed(100)

history <- model %>% 
           fit(x = train_x_keras, 
               y = train_y_keras, 
               batch_size = 5, 
               epochs = 15, 
               verbose = T,
               test_data = list(test_x_keras, test_y_keras))

Plotting Model:

plot(history)

Predict

Make predictions on the test_x_keras data by using the predict() function

# Please type your answer
set.seed(00)
predict_class <- predict(model, test_x_keras) %>% 
  k_argmax() %>%
  as.array() %>% 
  as.factor()

to see the prediction results

predict_class %>% 
  head()

#> [1] 1 1 0 1 1 0
#> Levels: 0 1

Evaluation

Business Question: Predict whether the employee submits attrition yes or no

Kelas positif: no (0)
Kelas negatif: yes (1)

# Please type your answer
caret::confusionMatrix(predict_class, as.factor(attrition_test$Attrition), positive = "0")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0  55  34
#>          1 192 213
#>                                               
#>                Accuracy : 0.5425              
#>                  95% CI : (0.4974, 0.5871)    
#>     No Information Rate : 0.5                 
#>     P-Value [Acc > NIR] : 0.03249             
#>                                               
#>                   Kappa : 0.085               
#>                                               
#>  Mcnemar's Test P-Value : < 0.0000000000000002
#>                                               
#>             Sensitivity : 0.2227              
#>             Specificity : 0.8623              
#>          Pos Pred Value : 0.6180              
#>          Neg Pred Value : 0.5259              
#>              Prevalence : 0.5000              
#>          Detection Rate : 0.1113              
#>    Detection Prevalence : 0.1802              
#>       Balanced Accuracy : 0.5425              
#>                                               
#>        'Positive' Class : 0                   
#>

From evaluation, we have matrix:

FN: The attrition prediction model is yes, even though the actual attrition is no. Risk: the company provides facilities/salary increase/position promotion.

FP: The attrition prediction model is no, even though the actual attrition is yes. Risk: the company loses employees.

The risk that is concerning is if an FP event occurs so take the Precision evaluation metric

Conclusion

The total predictors are 48, but we only use 44 because 4 predictors have unique values that do not vary. In business travel, travel rarely has the highest effect on attrition yes and no. The research & development department has the highest influence on attrition no and yes. In education field life science has the highest influence on attrition no and yes. The sales representative job role has the highest attrition no, the laboratory technician job role has the highest yes attrition. In martial status married it has the highest influence on attrition no, in single martial status it has the highest influence on attrition yes. In overtime no has the highest attrition no, in overtime yes has the highest attrition yes. The hiden layer that we use uses 4 layers. The accuracy of our model at epoch 15. The results of our predictions are 0 and 1, where 0 is no and 1 is yes. From the matrix that we use is precision (Pos Pred Value)with the risk to the company is the loss of employees

Reference

https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

LBB Neural Network & Deep Learning: IBM HR Analytics Employee Attrition & Performance

Muh Amri Sidiq

2024-10-29