Assignment

Build Neural Network model on same:
Steps involved should be:
a) Data Import (Target variable is “Attrition” column)
b) Split the data in Dev & Hold Out sample (70:30)
c) Perform Exploratory Data Analysis
d) Identify columns which are of no use and drop those columns
e) Build Neural Network Model (Development sample)
f) Validate NN model on Hold Out. If need be, improve the model

a) Data Import (Target variable is “Attrition” column)

Importing the data

setwd("C:/Users/aksha/OneDrive/Documents/Shilpi_xtras/Shilpi_extras/GL_BAPI/Rprgm")
read.HR_Attrition_Data <- read.table("HR_Employee_Attrition_Data-3.csv", sep = ",", header = TRUE, stringsAsFactors = TRUE)

library(caret)

## Warning: package 'caret' was built under R version 3.5.3

## Loading required package: lattice

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.5.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(InformationValue)

## Warning: package 'InformationValue' was built under R version 3.5.3

## 
## Attaching package: 'InformationValue'

## The following objects are masked from 'package:caret':
## 
##     confusionMatrix, precision, sensitivity, specificity

glimpse(read.HR_Attrition_Data)

## Observations: 2,940
## Variables: 35
## $ Age                      <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...
## $ Attrition                <fct> Yes, No, Yes, No, No, No, No, No, No,...
## $ BusinessTravel           <fct> Travel_Rarely, Travel_Frequently, Tra...
## $ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ Department               <fct> Sales, Research & Development, Resear...
## $ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education                <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1...
## $ EducationField           <fct> Life Sciences, Life Sciences, Other, ...
## $ EmployeeCount            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmployeeNumber           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ EnvironmentSatisfaction  <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1...
## $ Gender                   <fct> Female, Male, Male, Female, Male, Mal...
## $ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement           <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3...
## $ JobLevel                 <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobRole                  <fct> Sales Executive, Research Scientist, ...
## $ JobSatisfaction          <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3...
## $ MaritalStatus            <fct> Single, Married, Single, Married, Mar...
## $ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ Over18                   <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y...
## $ OverTime                 <fct> Yes, No, Yes, Yes, No, No, Yes, No, N...
## $ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating        <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3...
## $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4...
## $ StandardHours            <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
## $ StockOptionLevel         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance          <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2...
## $ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...

Comments: Data read correctly in R. There are 35 columns and 2940 rows

c) Perform Exploratory Data Analysis

d) Identify columns which are of no use and drop those columns

EDA of Data before splitting into training and test data

sum(is.na(read.HR_Attrition_Data))

## [1] 0

Comments: No NAs in data

str(read.HR_Attrition_Data)

## 'data.frame':    2940 obs. of  35 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ Over18                  : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

Comments: The data contains factor and integer columns.
Of the 35 columns, Attrition is the target variable.
Attrition is unbalanced: the numbers of Yes and No values are unequal.

summary(read.HR_Attrition_Data)

##       Age        Attrition            BusinessTravel   DailyRate     
##  Min.   :18.00   No :2466   Non-Travel       : 300   Min.   : 102.0  
##  1st Qu.:30.00   Yes: 474   Travel_Frequently: 554   1st Qu.: 465.0  
##  Median :36.00              Travel_Rarely    :2086   Median : 802.0  
##  Mean   :36.92                                       Mean   : 802.5  
##  3rd Qu.:43.00                                       3rd Qu.:1157.0  
##  Max.   :60.00                                       Max.   :1499.0  
##                                                                      
##                   Department   DistanceFromHome   Education    
##  Human Resources       : 126   Min.   : 1.000   Min.   :1.000  
##  Research & Development:1922   1st Qu.: 2.000   1st Qu.:2.000  
##  Sales                 : 892   Median : 7.000   Median :3.000  
##                                Mean   : 9.193   Mean   :2.913  
##                                3rd Qu.:14.000   3rd Qu.:4.000  
##                                Max.   :29.000   Max.   :5.000  
##                                                                
##           EducationField EmployeeCount EmployeeNumber  
##  Human Resources :  54   Min.   :1     Min.   :   1.0  
##  Life Sciences   :1212   1st Qu.:1     1st Qu.: 735.8  
##  Marketing       : 318   Median :1     Median :1470.5  
##  Medical         : 928   Mean   :1     Mean   :1470.5  
##  Other           : 164   3rd Qu.:1     3rd Qu.:2205.2  
##  Technical Degree: 264   Max.   :1     Max.   :2940.0  
##                                                        
##  EnvironmentSatisfaction    Gender       HourlyRate     JobInvolvement
##  Min.   :1.000           Female:1176   Min.   : 30.00   Min.   :1.00  
##  1st Qu.:2.000           Male  :1764   1st Qu.: 48.00   1st Qu.:2.00  
##  Median :3.000                         Median : 66.00   Median :3.00  
##  Mean   :2.722                         Mean   : 65.89   Mean   :2.73  
##  3rd Qu.:4.000                         3rd Qu.: 84.00   3rd Qu.:3.00  
##  Max.   :4.000                         Max.   :100.00   Max.   :4.00  
##                                                                       
##     JobLevel                          JobRole    JobSatisfaction
##  Min.   :1.000   Sales Executive          :652   Min.   :1.000  
##  1st Qu.:1.000   Research Scientist       :584   1st Qu.:2.000  
##  Median :2.000   Laboratory Technician    :518   Median :3.000  
##  Mean   :2.064   Manufacturing Director   :290   Mean   :2.729  
##  3rd Qu.:3.000   Healthcare Representative:262   3rd Qu.:4.000  
##  Max.   :5.000   Manager                  :204   Max.   :4.000  
##                  (Other)                  :430                  
##   MaritalStatus  MonthlyIncome    MonthlyRate    NumCompaniesWorked
##  Divorced: 654   Min.   : 1009   Min.   : 2094   Min.   :0.000     
##  Married :1346   1st Qu.: 2911   1st Qu.: 8045   1st Qu.:1.000     
##  Single  : 940   Median : 4919   Median :14236   Median :2.000     
##                  Mean   : 6503   Mean   :14313   Mean   :2.693     
##                  3rd Qu.: 8380   3rd Qu.:20462   3rd Qu.:4.000     
##                  Max.   :19999   Max.   :26999   Max.   :9.000     
##                                                                    
##  Over18   OverTime   PercentSalaryHike PerformanceRating
##  Y:2940   No :2108   Min.   :11.00     Min.   :3.000    
##           Yes: 832   1st Qu.:12.00     1st Qu.:3.000    
##                      Median :14.00     Median :3.000    
##                      Mean   :15.21     Mean   :3.154    
##                      3rd Qu.:18.00     3rd Qu.:3.000    
##                      Max.   :25.00     Max.   :4.000    
##                                                         
##  RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
##  Min.   :1.000            Min.   :80    Min.   :0.0000   Min.   : 0.00    
##  1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000   1st Qu.: 6.00    
##  Median :3.000            Median :80    Median :1.0000   Median :10.00    
##  Mean   :2.712            Mean   :80    Mean   :0.7939   Mean   :11.28    
##  3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000   3rd Qu.:15.00    
##  Max.   :4.000            Max.   :80    Max.   :3.0000   Max.   :40.00    
##                                                                           
##  TrainingTimesLastYear WorkLifeBalance YearsAtCompany   YearsInCurrentRole
##  Min.   :0.000         Min.   :1.000   Min.   : 0.000   Min.   : 0.000    
##  1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000    
##  Median :3.000         Median :3.000   Median : 5.000   Median : 3.000    
##  Mean   :2.799         Mean   :2.761   Mean   : 7.008   Mean   : 4.229    
##  3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000   3rd Qu.: 7.000    
##  Max.   :6.000         Max.   :4.000   Max.   :40.000   Max.   :18.000    
##                                                                           
##  YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 1.000          Median : 3.000      
##  Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :15.000          Max.   :17.000      
##

Comments: From the summary, MonthlyIncome has a wide range (1009 to 19999).
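Neural networks generally train more reliably when inputs are on comparable scales, so a wide-ranging column such as MonthlyIncome is usually rescaled before modeling. The following is a minimal sketch, assuming min-max scaling is acceptable here; `min_max_scale` is a hypothetical helper that is not part of the original analysis, and the sample values are only the minimum, quartiles and maximum from the summary above.

```r
# Hypothetical helper: rescale a numeric vector to the [0, 1] interval.
min_max_scale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column has no spread; return zeros instead of dividing by zero.
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

# Illustrative values only: min, Q1, median, Q3 and max of MonthlyIncome.
income <- c(1009, 2911, 4919, 8380, 19999)
scaled_income <- min_max_scale(income)
print(round(scaled_income, 3))
```

In a Dev / Hold Out setup the minimum and maximum would normally be taken from the development sample and reused on the hold-out sample.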

nearZeroVar(read.HR_Attrition_Data, saveMetrics= TRUE)

##                          freqRatio percentUnique zeroVar   nzv
## Age                       1.012987    1.46258503   FALSE FALSE
## Attrition                 5.202532    0.06802721   FALSE FALSE
## BusinessTravel            3.765343    0.10204082   FALSE FALSE
## DailyRate                 1.200000   30.13605442   FALSE FALSE
## Department                2.154709    0.10204082   FALSE FALSE
## DistanceFromHome          1.014423    0.98639456   FALSE FALSE
## Education                 1.437186    0.17006803   FALSE FALSE
## EducationField            1.306034    0.20408163   FALSE FALSE
## EmployeeCount             0.000000    0.03401361    TRUE  TRUE
## EmployeeNumber            1.000000  100.00000000   FALSE FALSE
## EnvironmentSatisfaction   1.015695    0.13605442   FALSE FALSE
## Gender                    1.500000    0.06802721   FALSE FALSE
## HourlyRate                1.035714    2.41496599   FALSE FALSE
## JobInvolvement            2.314667    0.13605442   FALSE FALSE
## JobLevel                  1.016854    0.17006803   FALSE FALSE
## JobRole                   1.116438    0.30612245   FALSE FALSE
## JobSatisfaction           1.038462    0.13605442   FALSE FALSE
## MaritalStatus             1.431915    0.10204082   FALSE FALSE
## MonthlyIncome             1.333333   45.88435374   FALSE FALSE
## MonthlyRate               1.000000   48.53741497   FALSE FALSE
## NumCompaniesWorked        2.644670    0.34013605   FALSE FALSE
## Over18                    0.000000    0.03401361    TRUE  TRUE
## OverTime                  2.533654    0.06802721   FALSE FALSE
## PercentSalaryHike         1.004785    0.51020408   FALSE FALSE
## PerformanceRating         5.504425    0.06802721   FALSE FALSE
## RelationshipSatisfaction  1.062500    0.13605442   FALSE FALSE
## StandardHours             0.000000    0.03401361    TRUE  TRUE
## StockOptionLevel          1.058725    0.13605442   FALSE FALSE
## TotalWorkingYears         1.616000    1.36054422   FALSE FALSE
## TrainingTimesLastYear     1.114053    0.23809524   FALSE FALSE
## WorkLifeBalance           2.595930    0.13605442   FALSE FALSE
## YearsAtCompany            1.146199    1.25850340   FALSE FALSE
## YearsInCurrentRole        1.524590    0.64625850   FALSE FALSE
## YearsSinceLastPromotion   1.627451    0.54421769   FALSE FALSE
## YearsWithCurrManager      1.307985    0.61224490   FALSE FALSE

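Comments: Removing columns that carry no information.
Drop columns where zeroVar is TRUE (a single value in every row) or percentUnique is 100 (a distinct value in every row).
EmployeeCount, Over18 and StandardHours have a single value; EmployeeNumber is a row identifier. All four are dropped.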
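The column-dropping rule can be sketched in base R: flag columns that hold a single distinct value or a distinct value in every row. This is an illustration under stated assumptions, not the code used in the report; `uninformative_cols` is a hypothetical helper and the toy data frame is made up. On real data the "distinct in every row" test can also catch genuine continuous measurements, so the flagged names should be reviewed before dropping.

```r
# Hypothetical helper: names of columns that are constant or unique per row.
uninformative_cols <- function(df) {
  n_unique <- vapply(df, function(col) length(unique(col)), integer(1))
  names(df)[n_unique == 1 | n_unique == nrow(df)]
}

# Toy data frame standing in for read.HR_Attrition_Data (values are made up).
toy <- data.frame(
  EmployeeNumber = 1:6,
  EmployeeCount  = rep(1L, 6),
  Over18         = factor(rep("Y", 6)),
  StandardHours  = rep(80L, 6),
  Age            = c(41L, 49L, 37L, 33L, 27L, 41L),
  Attrition      = factor(c("Yes", "No", "Yes", "No", "No", "No"))
)

drop_cols <- uninformative_cols(toy)
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(drop_cols)
print(names(toy_clean))
```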

table(read.HR_Attrition_Data$Attrition)

## 
##   No  Yes 
## 2466  474

Comments: The data is unbalanced: 2466 No (about 84%) against 474 Yes (about 16%).
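With classes this uneven, a 70:30 Dev / Hold Out split is usually stratified so that both samples keep roughly the same share of Yes. The sketch below uses only base R; `stratified_dev_index` and the seed are assumptions for illustration, and the target vector is synthetic (it only reproduces the class counts shown above).

```r
# Class counts as reported by table() above.
counts <- c(No = 2466, Yes = 474)
shares <- round(100 * counts / sum(counts), 1)
print(shares)

# Hypothetical helper: sample a share p of the row indices within each class.
stratified_dev_index <- function(y, p = 0.7, seed = 123) {
  set.seed(seed)
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Synthetic target vector with the same class counts as the data.
y <- factor(rep(c("No", "Yes"), times = c(2466, 474)))
dev_idx <- stratified_dev_index(y)
dev_share_yes  <- mean(y[dev_idx] == "Yes")
hold_share_yes <- mean(y[-dev_idx] == "Yes")
print(c(dev = dev_share_yes, holdout = hold_share_yes))
```

On the real data the same index would be used as `read.HR_Attrition_Data[dev_idx, ]` for the development sample and `read.HR_Attrition_Data[-dev_idx, ]` for the hold-out sample.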

Data type wise EDA

read.HR_Attrition_Data %>% select_if(is.numeric) %>%
  glimpse()

## Observations: 2,940
## Variables: 26
## $ Age                      <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...
## $ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education                <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1...
## $ EmployeeCount            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmployeeNumber           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ EnvironmentSatisfaction  <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1...
## $ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement           <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3...
## $ JobLevel                 <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobSatisfaction          <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3...
## $ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating        <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3...
## $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4...
## $ StandardHours            <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
## $ StockOptionLevel         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance          <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2...
## $ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...

read.HR_Attrition_Data %>% select_if(is.factor)%>%
  glimpse()

## Observations: 2,940
## Variables: 9
## $ Attrition      <fct> Yes, No, Yes, No, No, No, No, No, No, No, No, N...
## $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Travel_Rarely...
## $ Department     <fct> Sales, Research & Development, Research & Devel...
## $ EducationField <fct> Life Sciences, Life Sciences, Other, Life Scien...
## $ Gender         <fct> Female, Male, Male, Female, Male, Male, Female,...
## $ JobRole        <fct> Sales Executive, Research Scientist, Laboratory...
## $ MaritalStatus  <fct> Single, Married, Single, Married, Married, Sing...
## $ Over18         <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,...
## $ OverTime       <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, No,...

Comments: Segregating columns based on their data types.

WOE (Weight of Evidence) and IV (Information Value) calculation for categorical variables:

The IV of a predictor is the sum of the per-category IV values in the WOETable output. If the IV is:
Less than 0.02, then the predictor is not useful for modeling (separating the Goods from the Bads).
0.02 to 0.1, then the predictor has only a weak relationship.
0.1 to 0.3, then the predictor has a medium strength relationship.
0.3 or higher, then the predictor has a strong relationship.
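The bands above apply to the total IV of a predictor, that is, the sum of its per-category values. As a minimal sketch, the WOE and IV columns can be recomputed by hand from the GOODS and BADS counts; the example uses the BusinessTravel counts from the WOETable output in this section. `woe_iv` and `iv_strength` are hypothetical helpers written for this illustration, not functions from the InformationValue package.

```r
# Hypothetical helper: per-category WOE and IV from event / non-event counts.
woe_iv <- function(goods, bads) {
  pct_g <- goods / sum(goods)   # share of Attrition = Yes rows in each category
  pct_b <- bads / sum(bads)     # share of Attrition = No rows in each category
  woe <- log(pct_g / pct_b)     # natural log of the ratio of shares
  iv <- (pct_g - pct_b) * woe
  data.frame(PCT_G = pct_g, PCT_B = pct_b, WOE = woe, IV = iv)
}

# Hypothetical helper: map a total IV onto the bands listed above.
iv_strength <- function(total_iv) {
  cut(total_iv,
      breaks = c(-Inf, 0.02, 0.1, 0.3, Inf),
      labels = c("not useful", "weak", "medium", "strong"),
      right = FALSE)
}

# GOODS and BADS counts for BusinessTravel, copied from the WOETable output.
bt <- woe_iv(goods = c(24, 138, 312), bads = c(276, 416, 1774))
total_iv <- sum(bt$IV)
print(round(bt, 4))
print(round(total_iv, 4))
print(as.character(iv_strength(total_iv)))
```

A category with zero Goods or zero Bads would give an infinite WOE in this sketch, so sparse categories need merging or smoothing first.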

Factor variables predictors strength analysis using WOE and IV

Business Travel EDA on whole data

Att=ifelse(read.HR_Attrition_Data$Attrition=="Yes",1,0)
options(scipen = 999, digits = 2)
WOETable(X=read.HR_Attrition_Data$BusinessTravel, Y=Att)

##                 CAT GOODS BADS TOTAL PCT_G PCT_B    WOE     IV
## 1        Non-Travel    24  276   300 0.051  0.11 -0.793 0.0486
## 2 Travel_Frequently   138  416   554 0.291  0.17  0.546 0.0668
## 3     Travel_Rarely   312 1774  2086 0.658  0.72 -0.089 0.0054

Comments: The total IV is about 0.12 (sum of the per-category values), which falls at the lower edge of the medium band.
This predictor is dropped.

Department factor

WOETable(X=read.HR_Attrition_Data$Department, Y=Att)

##                      CAT GOODS BADS TOTAL PCT_G PCT_B   WOE     IV
## 1        Human Resources    24  102   126 0.051 0.041  0.20 0.0019
## 2 Research & Development   266 1656  1922 0.561 0.672 -0.18 0.0198
## 3                  Sales   184  708   892 0.388 0.287  0.30 0.0305

Comments: The total IV is about 0.05, so the predictor has only a weak relationship with Attrition.
This predictor can be ignored.

EducationField factor

WOETable(X=read.HR_Attrition_Data$EducationField, Y=Att)

##                CAT GOODS BADS TOTAL PCT_G PCT_B   WOE     IV
## 1  Human Resources    14   40    54 0.030 0.016  0.60 0.0080
## 2    Life Sciences   178 1034  1212 0.376 0.419 -0.11 0.0048
## 3        Marketing    70  248   318 0.148 0.101  0.38 0.0181
## 4          Medical   126  802   928 0.266 0.325 -0.20 0.0120
## 5            Other    22  142   164 0.046 0.058 -0.22 0.0024
## 6 Technical Degree    64  200   264 0.135 0.081  0.51 0.0275

Comments: The total IV is about 0.07, so the predictor has only a weak relationship with Attrition.
This predictor can be ignored.

Gender factor

WOETable(X=read.HR_Attrition_Data$Gender, Y=Att)

##      CAT GOODS BADS TOTAL PCT_G PCT_B    WOE     IV
## 1 Female   174 1002  1176  0.37  0.41 -0.102 0.0040
## 2   Male   300 1464  1764  0.63  0.59  0.064 0.0025

Comments: The total IV is about 0.007, below 0.02, so the predictor is not useful and can be ignored.

JobRole factor

WOETable(X=read.HR_Attrition_Data$JobRole, Y=Att)

##                         CAT GOODS BADS TOTAL  PCT_G PCT_B    WOE
## 1 Healthcare Representative    18  244   262 0.0380 0.099 -0.958
## 2           Human Resources    24   80   104 0.0506 0.032  0.445
## 3     Laboratory Technician   124  394   518 0.2616 0.160  0.493
## 4                   Manager    10  194   204 0.0211 0.079 -1.316
## 5    Manufacturing Director    20  270   290 0.0422 0.109 -0.954
## 6         Research Director     4  156   160 0.0084 0.063 -2.014
## 7        Research Scientist    94  490   584 0.1983 0.199 -0.002
## 8           Sales Executive   114  538   652 0.2405 0.218  0.097
## 9      Sales Representative    66  100   166 0.1392 0.041  1.234
##           IV
## 1 0.05838892
## 2 0.00809845
## 3 0.05021016
## 4 0.07577324
## 5 0.06416873
## 6 0.11043337
## 7 0.00000077
## 8 0.00217775
## 9 0.12174571

Comments: The total IV is about 0.49, which indicates a strong relationship, so this predictor is considered.

MaritalStatus factor

WOETable(X=read.HR_Attrition_Data$MaritalStatus, Y=Att)

##        CAT GOODS BADS TOTAL PCT_G PCT_B   WOE    IV
## 1 Divorced    66  588   654  0.14  0.24 -0.54 0.053
## 2  Married   168 1178  1346  0.35  0.48 -0.30 0.037
## 3   Single   240  700   940  0.51  0.28  0.58 0.129

Comments: The total IV for MaritalStatus is about 0.22, a medium strength relationship, so this predictor is considered.

OverTime factor

WOETable(X=read.HR_Attrition_Data$OverTime, Y=Att)

##   CAT GOODS BADS TOTAL PCT_G PCT_B   WOE   IV
## 1  No   220 1888  2108  0.46  0.77 -0.50 0.15
## 2 Yes   254  578   832  0.54  0.23  0.83 0.25

Comments: The total IV for OverTime is about 0.40, which indicates a strong relationship, so this predictor is considered.

Numeric Predictors Analysis:

Age EDA

ggplot(data=read.HR_Attrition_Data,aes(x=Age))+
  geom_bar(alpha=0.5,fill="red",color="black") +
  ggtitle("Distribution of Age")

ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=Age)) + geom_boxplot(aes(color=Attrition))

hist(read.HR_Attrition_Data$Age)

Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.
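A boxplot judgement of this kind can be backed by numbers, by comparing the quartiles of the predictor for Attrition = Yes and Attrition = No. The following is a sketch only: `summarise_by_class` is a hypothetical helper, and the ages and labels are made-up values, not rows from the data set.

```r
# Hypothetical helper: quartiles of a numeric predictor for each class.
summarise_by_class <- function(x, y) {
  out <- t(sapply(split(x, y), quantile,
                  probs = c(0.25, 0.5, 0.75), names = FALSE))
  colnames(out) <- c("Q1", "Median", "Q3")
  out
}

# Made-up values for illustration; on the real data pass
# read.HR_Attrition_Data$Age and read.HR_Attrition_Data$Attrition instead.
age <- c(41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 28, 29, 31, 26)
attrition <- factor(c("Yes", "No", "Yes", "No", "No", "No", "No",
                      "No", "No", "No", "Yes", "Yes", "Yes", "Yes"))
age_by_class <- summarise_by_class(age, attrition)
print(age_by_class)
```

Clearly separated medians or interquartile ranges support keeping the predictor; near-identical rows support dropping it.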

DailyRate EDA

ggplot(data=read.HR_Attrition_Data,aes(x=DailyRate))+
  geom_histogram(alpha=0.5,fill="red",color="black") +
  ggtitle("Histogram of DailyRate")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=DailyRate)) + geom_boxplot(aes(color=Attrition))

hist(read.HR_Attrition_Data$DailyRate)

Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be ignored.

DistanceFromHome EDA

ggplot(data=read.HR_Attrition_Data,aes(x=DistanceFromHome))+
  geom_bar(alpha=0.5,fill="red",color="black") +
  ggtitle("Distribution of DistanceFromHome")

ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=DistanceFromHome)) + geom_boxplot(aes(color=Attrition))

hist(read.HR_Attrition_Data$DistanceFromHome)

Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.

Education EDA

ggplot(data=read.HR_Attrition_Data,aes(x=Education))+
  geom_bar(alpha=0.5,fill="red",color="black") +
  ggtitle("Distribution of Education")

ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=Education)) + geom_boxplot(aes(color=Attrition))

hist(read.HR_Attrition_Data$Education)

Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be ignored.

EnvironmentSatisfaction EDA

ggplot(data=read.HR_Attrition_Data,aes(x=EnvironmentSatisfaction))+
  geom_bar(alpha=0.5,fill="red",color="black") +
  ggtitle("Distribution of EnvironmentSatisfaction")

ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=EnvironmentSatisfaction)) + geom_boxplot(aes(color=Attrition))

Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.

HourlyRate EDA

ggplot(data=read.HR_Attrition_Data,aes(x=HourlyRate))+
  geom_bar(alpha=0.5,fill="red",color="black") +
  ggtitle("Distribution of HourlyRate")

ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=HourlyRate)) + geom_boxplot(aes(color=Attrition))

hist(read.HR_Attrition_Data$HourlyRate)

Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be ignored.

JobInvolvement EDA

WOETable(X=as.factor(read.HR_Attrition_Data$JobInvolvement), Y=Att)

##   CAT GOODS BADS TOTAL PCT_G PCT_B   WOE    IV
## 1   1    56  110   166 0.118 0.045  0.97 0.072
## 2   2   142  608   750 0.300 0.247  0.19 0.010
## 3   3   250 1486  1736 0.527 0.603 -0.13 0.010
## 4   4    26  262   288 0.055 0.106 -0.66 0.034

Comments: It is an ordinal variable, hence WOE is calculated.
The total IV is about 0.13, at the lower edge of the medium band; this predictor is dropped.

JobLevel EDA

 WOETable(X=as.factor(read.HR_Attrition_Data$JobLevel), Y=Att)

##   CAT GOODS BADS TOTAL PCT_G PCT_B   WOE     IV
## 1   1   286  800  1086 0.603 0.324  0.62 0.1731
## 2   2   104  964  1068 0.219 0.391 -0.58 0.0991
## 3   3    64  372   436 0.135 0.151 -0.11 0.0018
## 4   4    10  202   212 0.021 0.082 -1.36 0.0825
## 5   5    10  128   138 0.021 0.052 -0.90 0.0277

Comments: It is an ordinal variable, hence WOE is calculated.
The total IV is about 0.38, which indicates a strong relationship; this predictor is considered.

JobSatisfaction EDA

 WOETable(X=as.factor(read.HR_Attrition_Data$JobSatisfaction), Y=Att)

##   CAT GOODS BADS TOTAL PCT_G PCT_B    WOE       IV
## 1   1   132  446   578  0.28  0.18  0.432 0.042136
## 2   2    92  468   560  0.19  0.19  0.022 0.000097
## 3   3   146  738   884  0.31  0.30  0.029 0.000252
## 4   4   104  814   918  0.22  0.33 -0.408 0.045204

Comments: It is an ordinal variable, hence WOE is calculated.
The total IV is about 0.09, a weak relationship; this predictor can be ignored.

MonthlyIncome EDA

ggplot(data=read.HR_Attrition_Data,aes(x=MonthlyIncome))+
  geom_histogram(alpha=0.5,fill="red",color="black") +
  ggtitle("Distribution of MonthlyIncome")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

hist(read.HR_Attrition_Data$MonthlyIncome)

hist(log2(read.HR_Attrition_Data$MonthlyIncome))

ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=MonthlyIncome)) + geom_boxplot(aes(color=Attrition))

Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be well considered.

MonthlyRate EDA

ggplot(data=read.HR_Attrition_Data,aes(x=MonthlyRate))+
  geom_histogram(alpha=0.5,fill="red",color="black") +
  ggtitle("Distribution of MonthlyRate")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

hist(read.HR_Attrition_Data$MonthlyRate)

ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=MonthlyRate)) + geom_boxplot(aes(color=Attrition))

Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.

NumCompaniesWorked EDA

ggplot(data=read.HR_Attrition_Data,aes(x=NumCompaniesWorked))+
  geom_bar(alpha=0.5,fill="red",color="black") +
  ggtitle("Distribution of NumCompaniesWorked")

hist(read.HR_Attrition_Data$NumCompaniesWorked)

ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=NumCompaniesWorked)) + geom_boxplot(aes(color=Attrition))

Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.

PercentSalaryHike EDA

ggplot(data=read.HR_Attrition_Data,aes(x=PercentSalaryHike))+
  geom_bar(alpha=0.5,fill="red",color="black") +
  ggtitle("Distribution of PercentSalaryHike")

hist(read.HR_Attrition_Data$PercentSalaryHike)

boxplot(read.HR_Attrition_Data$PercentSalaryHike,horizontal = TRUE)

ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=PercentSalaryHike)) + geom_boxplot(aes(color=Attrition))

Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.

PerformanceRating EDA

WOETable(X=as.factor(read.HR_Attrition_Data$PerformanceRating), Y=Att)

##   CAT GOODS BADS TOTAL PCT_G PCT_B     WOE        IV
## 1   3   400 2088  2488  0.84  0.85 -0.0034 0.0000095
## 2   4    74  378   452  0.16  0.15  0.0183 0.0000519

Comments: It is an ordinal variable, hence WOE is calculated.
The total IV is about 0.00006, far below 0.02; this predictor can be ignored.

RelationshipSatisfaction EDA

 WOETable(X=as.factor(read.HR_Attrition_Data$RelationshipSatisfaction), Y=Att)

##   CAT GOODS BADS TOTAL PCT_G PCT_B    WOE      IV
## 1   1   114  438   552  0.24  0.18  0.303 0.01906
## 2   2    90  516   606  0.19  0.21 -0.097 0.00188
## 3   3   142  776   918  0.30  0.31 -0.049 0.00074
## 4   4   128  736   864  0.27  0.30 -0.100 0.00284

Comments: It is an ordinal variable, hence WOE is calculated.
The total IV is about 0.025, a weak relationship; this predictor can be ignored.

StockOptionLevel EDA

 WOETable(X=as.factor(read.HR_Attrition_Data$StockOptionLevel), Y=Att)

##   CAT GOODS BADS TOTAL PCT_G PCT_B   WOE      IV
## 1   0   308  954  1262 0.650 0.387  0.52 0.13635
## 2   1   112 1080  1192 0.236 0.438 -0.62 0.12444
## 3   2    24  292   316 0.051 0.118 -0.85 0.05758
## 4   3    30  140   170 0.063 0.057  0.11 0.00071

Comments: It is an ordinal variable, hence WOE is calculated.
The total IV is about 0.32, which indicates a strong relationship; this predictor is considered.

TotalWorkingYears EDA

ggplot(data=read.HR_Attrition_Data,aes(x=TotalWorkingYears))+
  geom_bar(alpha=0.5,fill="red",color="black") +
  ggtitle("Distribution of TotalWorkingYears")

hist(read.HR_Attrition_Data$TotalWorkingYears)

ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=TotalWorkingYears)) + geom_boxplot(aes(color=Attrition))

Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.

TrainingTimesLastYear EDA

 glimpse(read.HR_Attrition_Data$TrainingTimesLastYear)

##  int [1:2940] 0 3 3 3 3 2 3 2 2 3 ...

ggplot(data=read.HR_Attrition_Data,aes(x=TrainingTimesLastYear))+
  geom_bar(alpha=0.5,fill="red",color="black") +
  ggtitle("Distribution of TrainingTimesLastYear")

hist(read.HR_Attrition_Data$TrainingTimesLastYear)

ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=TrainingTimesLastYear)) + geom_boxplot(aes(color=Attrition))

Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.

WorkLifeBalance EDA

WOETable(X=as.factor(read.HR_Attrition_Data$WorkLifeBalance), Y=Att)

Comments: It is an ordinal variable, hence WOE is calculated.
This predictor is retained in the final column list.

YearsAtCompany EDA

ggplot(data=read.HR_Attrition_Data,aes(x=YearsAtCompany))+
  geom_bar(alpha=0.5,fill="red",color="black") +
  ggtitle("Distribution of YearsAtCompany")

hist(read.HR_Attrition_Data$YearsAtCompany)

ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=YearsAtCompany)) + geom_boxplot(aes(color=Attrition))

Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.

YearsInCurrentRole EDA

 ggplot(data=read.HR_Attrition_Data,aes(x=YearsInCurrentRole))+
   geom_bar(alpha=0.5,fill="red",color="black") +
   ggtitle("Distribution of YearsInCurrentRole")

 hist(read.HR_Attrition_Data$YearsInCurrentRole)

 ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=YearsInCurrentRole)) + geom_boxplot(aes(color=Attrition))

Comments: Based on the boxplots for Attrition = Yes and No, this predictor can be considered.

YearsSinceLastPromotion EDA

 ggplot(data=read.HR_Attrition_Data,aes(x=YearsSinceLastPromotion))+
   geom_bar(alpha=0.5,fill="red",color="black") +
   ggtitle("Distribution of YearsSinceLastPromotion")

 hist(read.HR_Attrition_Data$YearsSinceLastPromotion)

 ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=YearsSinceLastPromotion)) + geom_boxplot(aes(color=Attrition))

Comments: Based on the boxplots for Attrition = Yes and No, this predictor can be considered.

YearsWithCurrManager EDA

 ggplot(data=read.HR_Attrition_Data,aes(x=YearsWithCurrManager))+
   geom_bar(alpha=0.5,fill="red",color="black") +
   ggtitle("Distribution of YearsWithCurrManager")

 hist(read.HR_Attrition_Data$YearsWithCurrManager)

 ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=YearsWithCurrManager)) + geom_boxplot(aes(color=Attrition))

Based on the analysis, a new data frame called Newdf is created after dropping the unwanted columns.

Here is the final column list:

Considered Columns:

Age
Attrition
DistanceFromHome
EnvironmentSatisfaction
JobLevel
JobRole
MaritalStatus
MonthlyRate
NumCompaniesWorked
OverTime
StockOptionLevel
TrainingTimesLastYear
WorkLifeBalance
YearsInCurrentRole
YearsSinceLastPromotion
YearsWithCurrManager
MonthlyIncome
PercentSalaryHike
TotalWorkingYears
YearsAtCompany

Dropped columns:

EmployeeCount
Over18
StandardHours
EmployeeNumber
BusinessTravel
Department
EducationField
Gender
DailyRate
Education
HourlyRate
JobInvolvement
JobSatisfaction
PerformanceRating
RelationshipSatisfaction

Newdf<-data.frame( read.HR_Attrition_Data$Attrition
                  ,read.HR_Attrition_Data$MonthlyIncome
                  ,read.HR_Attrition_Data$PercentSalaryHike 
                  ,read.HR_Attrition_Data$TotalWorkingYears
                  ,read.HR_Attrition_Data$YearsAtCompany 
                  ,read.HR_Attrition_Data$Age                                  
                  ,read.HR_Attrition_Data$DistanceFromHome               
                  ,read.HR_Attrition_Data$EnvironmentSatisfaction                        
                  ,read.HR_Attrition_Data$JobLevel                       
                  ,read.HR_Attrition_Data$JobRole                        
                  ,read.HR_Attrition_Data$MaritalStatus                  
                  ,read.HR_Attrition_Data$MonthlyRate                    
                  ,read.HR_Attrition_Data$NumCompaniesWorked             
                  ,read.HR_Attrition_Data$OverTime                       
                  ,read.HR_Attrition_Data$StockOptionLevel               
                  ,read.HR_Attrition_Data$TrainingTimesLastYear          
                  ,read.HR_Attrition_Data$WorkLifeBalance                
                  ,read.HR_Attrition_Data$YearsInCurrentRole             
                  ,read.HR_Attrition_Data$YearsSinceLastPromotion        
                  ,read.HR_Attrition_Data$YearsWithCurrManager
)

Comments: A new data frame called Newdf is created.

Columns based on their data types in Newdf dataframe

Newdf %>% select_if(is.numeric) %>%
  glimpse()

## Observations: 2,940
## Variables: 16
## $ read.HR_Attrition_Data.MonthlyIncome           <int> 5993, 5130, 209...
## $ read.HR_Attrition_Data.PercentSalaryHike       <int> 11, 23, 15, 11,...
## $ read.HR_Attrition_Data.TotalWorkingYears       <int> 8, 10, 7, 8, 6,...
## $ read.HR_Attrition_Data.YearsAtCompany          <int> 6, 10, 0, 8, 2,...
## $ read.HR_Attrition_Data.Age                     <int> 41, 49, 37, 33,...
## $ read.HR_Attrition_Data.DistanceFromHome        <int> 1, 8, 2, 3, 2, ...
## $ read.HR_Attrition_Data.EnvironmentSatisfaction <int> 2, 3, 4, 4, 1, ...
## $ read.HR_Attrition_Data.JobLevel                <int> 2, 2, 1, 1, 1, ...
## $ read.HR_Attrition_Data.MonthlyRate             <int> 19479, 24907, 2...
## $ read.HR_Attrition_Data.NumCompaniesWorked      <int> 8, 1, 6, 1, 9, ...
## $ read.HR_Attrition_Data.StockOptionLevel        <int> 0, 1, 0, 0, 1, ...
## $ read.HR_Attrition_Data.TrainingTimesLastYear   <int> 0, 3, 3, 3, 3, ...
## $ read.HR_Attrition_Data.WorkLifeBalance         <int> 1, 3, 3, 3, 3, ...
## $ read.HR_Attrition_Data.YearsInCurrentRole      <int> 4, 7, 0, 7, 2, ...
## $ read.HR_Attrition_Data.YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, ...
## $ read.HR_Attrition_Data.YearsWithCurrManager    <int> 5, 7, 0, 0, 2, ...

Newdf %>% select_if(is.factor)%>%
  glimpse()

## Observations: 2,940
## Variables: 4
## $ read.HR_Attrition_Data.Attrition     <fct> Yes, No, Yes, No, No, No,...
## $ read.HR_Attrition_Data.JobRole       <fct> Sales Executive, Research...
## $ read.HR_Attrition_Data.MaritalStatus <fct> Single, Married, Single, ...
## $ read.HR_Attrition_Data.OverTime      <fct> Yes, No, Yes, Yes, No, No...

Comments: This data frame covers the whole data set: 16 numeric columns and 4 categorical columns, including the target variable.

b) Split the data in Dev & Hold Out sample (70:30)

set.seed(1212)
s <- sample(c(1:2940), size = 2058)
Newdf.train <- Newdf[s,]
Newdf.test <- Newdf[-s,]
nrow(Newdf.train)

## [1] 2058

nrow(Newdf.test)

## [1] 882

Comments: The train data has 2058 rows
The test data has 882 rows
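
sample() draws rows without regard to the target, so the Yes/No mix in each sample is left to chance. A minimal sketch of a stratified 70:30 split in base R, assuming a toy target vector with the same class totals as this data set (2,466 No and 474 Yes); y and train_idx are illustrative names, not part of the original code:

```r
set.seed(1212)

# Toy target with the same class totals as the attrition data (assumption)
y <- factor(rep(c("No", "Yes"), times = c(2466, 474)))

# Sample 70% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

# Class counts in the development and hold-out samples
table(y[train_idx])
table(y[-train_idx])
```

With this approach the attrition rate is the same in both samples up to rounding, which makes the train and test metrics easier to compare.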

Target class counts in train and test data

table(Newdf.train[,1])

## 
##   No  Yes 
## 1723  335

table(Newdf.test[,1])

## 
##  No Yes 
## 743 139

Categorical values to dummy variables for train data

JobRole.matrix <- model.matrix(~ read.HR_Attrition_Data.JobRole - 1,data = Newdf.train)
Newdf.train <- data.frame(Newdf.train, JobRole.matrix)

MaritalStatus.matrix <- model.matrix(~ read.HR_Attrition_Data.MaritalStatus - 1,data = Newdf.train)
Newdf.train <- data.frame(Newdf.train, MaritalStatus.matrix)

OverTime.matrix <- model.matrix(~ read.HR_Attrition_Data.OverTime - 1,data = Newdf.train)
Newdf.train <- data.frame(Newdf.train, OverTime.matrix)

Comments: So that the data can be scaled, the categorical variables are converted to integer dummy variables.

Newdf.train<-Newdf.train[,-c(10,11,14)]

Comments: Of the 34 columns in the train data, the categorical variables JobRole, MaritalStatus and OverTime are dropped.
This leaves a total of 31 columns in Newdf.train.
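
To see what model.matrix() produces for a single factor, here is a small sketch on a toy data frame. The data frame toy and its values are illustrative only; the point is that "- 1" drops the intercept, so every level gets its own 0/1 column:

```r
# Toy data frame standing in for Newdf.train (illustrative values only)
toy <- data.frame(OverTime = factor(c("Yes", "No", "No", "Yes")))

# "- 1" removes the intercept, so each level becomes its own indicator column
dummies <- model.matrix(~ OverTime - 1, data = toy)
colnames(dummies)   # "OverTimeNo" "OverTimeYes"

# The indicator columns of one factor always sum to 1 in every row,
# so one of them carries no extra information
rowSums(dummies)
```

Keeping all levels, as done here, is harmless for a neural network, but one column per factor could be dropped without losing information.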

Attrition_num<-ifelse(Newdf.train$read.HR_Attrition_Data.Attrition == 'Yes',1,0)
class(Attrition_num)

## [1] "numeric"

Scaling the data for normalization

train.dev.scaled <- scale(Newdf.train[,-1])

Comments: The target column is removed and the remaining columns are scaled.

e) Build Neural Network Model (Development sample)

allVars.train<-colnames(train.dev.scaled)
predictor.Variables<-paste(allVars.train,collapse="+")
form=as.formula(paste("Attrition_num ~",predictor.Variables,collapse="+"))
form

## Attrition_num ~ read.HR_Attrition_Data.MonthlyIncome + read.HR_Attrition_Data.PercentSalaryHike + 
##     read.HR_Attrition_Data.TotalWorkingYears + read.HR_Attrition_Data.YearsAtCompany + 
##     read.HR_Attrition_Data.Age + read.HR_Attrition_Data.DistanceFromHome + 
##     read.HR_Attrition_Data.EnvironmentSatisfaction + read.HR_Attrition_Data.JobLevel + 
##     read.HR_Attrition_Data.MonthlyRate + read.HR_Attrition_Data.NumCompaniesWorked + 
##     read.HR_Attrition_Data.StockOptionLevel + read.HR_Attrition_Data.TrainingTimesLastYear + 
##     read.HR_Attrition_Data.WorkLifeBalance + read.HR_Attrition_Data.YearsInCurrentRole + 
##     read.HR_Attrition_Data.YearsSinceLastPromotion + read.HR_Attrition_Data.YearsWithCurrManager + 
##     read.HR_Attrition_Data.JobRoleHealthcare.Representative + 
##     read.HR_Attrition_Data.JobRoleHuman.Resources + read.HR_Attrition_Data.JobRoleLaboratory.Technician + 
##     read.HR_Attrition_Data.JobRoleManager + read.HR_Attrition_Data.JobRoleManufacturing.Director + 
##     read.HR_Attrition_Data.JobRoleResearch.Director + read.HR_Attrition_Data.JobRoleResearch.Scientist + 
##     read.HR_Attrition_Data.JobRoleSales.Executive + read.HR_Attrition_Data.JobRoleSales.Representative + 
##     read.HR_Attrition_Data.MaritalStatusDivorced + read.HR_Attrition_Data.MaritalStatusMarried + 
##     read.HR_Attrition_Data.MaritalStatusSingle + read.HR_Attrition_Data.OverTimeNo + 
##     read.HR_Attrition_Data.OverTimeYes

Comments: form shows the columns on which the neural network will run.
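
As an aside, base R also offers reformulate() for building a formula from a character vector of column names, which avoids pasting strings by hand. A small sketch with hypothetical predictor names (they are placeholders, not the full column list used above):

```r
# Hypothetical predictor names used only for illustration
predictors <- c("Age", "MonthlyIncome", "OverTimeYes")

# reformulate() builds response ~ term1 + term2 + ... from character vectors
form_alt <- reformulate(termlabels = predictors, response = "Attrition_num")
form_alt
```

In the report the same idea would apply with colnames(train.dev.scaled) as the term labels.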

train.dev.scaled <- cbind(Attrition_num, train.dev.scaled)

Comments: The numeric Attrition column from the train data is added so that the data set includes the target as well as the independent variables before running the neuralnet function.

library(neuralnet)

## Warning: package 'neuralnet' was built under R version 3.5.3

## 
## Attaching package: 'neuralnet'

## The following object is masked from 'package:dplyr':
## 
##     compute

set.seed(1212)
nn1.Attr <- neuralnet(formula=form,
                 data = train.dev.scaled, 
                 hidden = 3,
                 err.fct = "sse",
                 linear.output = FALSE,
                 lifesign = "full",
                 lifesign.step = 2000,
                 threshold = 0.01,
                 stepmax = 200000
)

## hidden: 3 thresh: 0.01 rep: 1/1 steps:

##    2000  min thresh: 0.0559604882956513
##                                                    4000  min thresh: 0.0532541743930464
##                                                    6000  min thresh: 0.0297463576005015
##                                                    8000  min thresh: 0.0230909932692286
##                                                   10000  min thresh: 0.0230909932692286
##                                                   12000  min thresh: 0.0218351879306054
##                                                   14000  min thresh: 0.0135645371330507
##                                                   16000  min thresh: 0.0124392125335601
##                                                   18000  min thresh: 0.0100557167590136
##                                                   20000  min thresh: 0.0100557167590136
##                                                   20209  error: 66.85872 time: 21.54 secs

Plotting the neural network graph

plot(nn1.Attr)

Validating NN model for accuracy

Assigning the Probabilities to Dev Sample

Newdf.train$Prob = nn1.Attr$net.result[[1]]

The distribution of the estimated probabilities

quantile( Newdf.train$Prob, c(0,1,5,10,25,50,75,80,90,95,98,99,100)/100)

##                                                                                                  0% 
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000044 
##                                                                                                  1% 
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000044 
##                                                                                                  5% 
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000054 
##                                                                                                 10% 
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000009767 
##                                                                                                 25% 
## 0.0000000000000000000000000000000000000000000000000000000000000000860700222054994643246239593992186 
##                                                                                                 50% 
## 0.0000000000000000000000000000000681137920984060446939062360982575228263158351182937622070312500000 
##                                                                                                 75% 
## 0.0031410413657090734997068270928366473526693880558013916015625000000000000000000000000000000000000 
##                                                                                                 80% 
## 0.1292709043456795048321339436370180919766426086425781250000000000000000000000000000000000000000000 
##                                                                                                 90% 
## 0.8337076493282028488707169344706926494836807250976562500000000000000000000000000000000000000000000 
##                                                                                                 95% 
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 
##                                                                                                 98% 
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 
##                                                                                                 99% 
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 
##                                                                                                100% 
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

hist(Newdf.train$Prob)

Comments: The quantiles and histogram show that the predicted probabilities are concentrated near 0 and 1, with few values in between.

library(caret)
library(e1071)

## Warning: package 'e1071' was built under R version 3.5.3

Newdf.train$Class = ifelse(Newdf.train$Prob>0.2,1,0)

Comments: Assigning 0 / 1 class based on a threshold of 0.2 for train data

library(ROCR)

## Warning: package 'ROCR' was built under R version 3.5.3

## Loading required package: gplots

## Warning: package 'gplots' was built under R version 3.5.3

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

## 
## Attaching package: 'ROCR'

## The following object is masked from 'package:neuralnet':
## 
##     prediction

library(ineq)
pred <- ROCR::prediction(Newdf.train$Prob, as.numeric(ifelse(Newdf.train$read.HR_Attrition_Data.Attrition=="Yes",1,0)))
perf <- performance(pred, "tpr", "fpr")
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
auc <- performance(pred,"auc"); 
auc <- as.numeric(auc@y.values)
gini = ineq(Newdf.train$Prob, type="Gini")

with( Newdf.train, table(Newdf.train$read.HR_Attrition_Data.Attrition, 
                         as.factor(Newdf.train$Class)  ))

##      
##          0    1
##   No  1595  128
##   Yes   70  265

Comments: Accuracy = (TP + TN) / total
1595 + 265 = 1860
1860 / 2058 = 90.4%
This could be an overfit model too, so it needs to be tested on the test data.
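
The accuracy arithmetic can be checked directly from the confusion matrix, along with sensitivity and specificity, which are more informative than accuracy when the classes are imbalanced. A minimal sketch; cm and the metric names are illustrative, and the counts are copied from the table above:

```r
# Train confusion matrix copied from the output above
# (rows = actual class, columns = predicted class)
cm <- matrix(c(1595, 70, 128, 265), nrow = 2,
             dimnames = list(actual = c("No", "Yes"), predicted = c("0", "1")))

# (TP + TN) / total
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual "Yes" cases predicted as 1
sensitivity <- cm["Yes", "1"] / sum(cm["Yes", ])

# Share of actual "No" cases predicted as 0
specificity <- cm["No", "0"] / sum(cm["No", ])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```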

auc

## [1] 0.87

KS

## [1] 0.72

gini

## [1] 0.85

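Comments: AUC of 0.87 shows it is a good model on the train data.
KS of 0.72 also indicates strong discriminatory power. Note that gini here is computed with ineq() on the predicted scores, so it measures how unequal the scores are; the model Gini derived from the AUC is 2 * 0.87 - 1 = 0.74.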
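
AUC, KS and a model Gini can all be computed from scores and labels directly, which helps in checking what each number means. A minimal base R sketch on toy data, assuming the common convention Gini = 2 * AUC - 1 (this is different from ineq(), which measures inequality within a single vector); all names and values here are illustrative:

```r
# Toy scores and labels (illustrative only, not the attrition data)
score <- c(0.10, 0.40, 0.35, 0.80)
label <- c(0, 0, 1, 1)

n1 <- sum(label == 1)
n0 <- sum(label == 0)

# AUC via the rank-sum (Mann-Whitney) identity
auc_toy <- (sum(rank(score)[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Model Gini derived from AUC
gini_toy <- 2 * auc_toy - 1

# KS: largest gap between the score distributions of the two classes,
# which equals max(TPR - FPR) over all thresholds
ks_toy <- max(ecdf(score[label == 0])(score) - ecdf(score[label == 1])(score))

c(auc = auc_toy, gini = gini_toy, ks = ks_toy)
```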

f) Validate NN model on Hold Out. If need be improvise

table(Newdf.test$read.HR_Attrition_Data.Attrition)

## 
##  No Yes 
## 743 139

Categorical values to dummy variables in test data

JobRole.matrix <- model.matrix(~ read.HR_Attrition_Data.JobRole - 1,data = Newdf.test)
Newdf.test <- data.frame(Newdf.test, JobRole.matrix)

MaritalStatus.matrix <- model.matrix(~ read.HR_Attrition_Data.MaritalStatus - 1,data = Newdf.test)
Newdf.test <- data.frame(Newdf.test, MaritalStatus.matrix)

OverTime.matrix <- model.matrix(~ read.HR_Attrition_Data.OverTime - 1,data = Newdf.test)
Newdf.test <- data.frame(Newdf.test, OverTime.matrix)

Comments: Like train data, test data dummy variables are created.

Dropping categorical data - JobRole,MaritalStatus,OverTime for test data

Newdf.test<-Newdf.test[,-c(10,11,14)]

Comments: Since dummy variables are created, getting rid of unwanted categorical variables

test.scaled <- scale(Newdf.test[,-1])

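Comments: The test data is scaled without the target column. Note that scale() here uses the test sample's own means and standard deviations rather than those of the train data.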
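
A common alternative is to scale the hold-out sample with the means and standard deviations learned from the development sample, so both go through exactly the same transformation. A minimal sketch on toy matrices; train_x, test_x and the other names are illustrative, and this is not what the report's code does:

```r
# Toy matrices standing in for the train and test predictors (illustrative)
train_x <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2,
                  dimnames = list(NULL, c("a", "b")))
test_x  <- matrix(c(2, 5, 15, 50), ncol = 2,
                  dimnames = list(NULL, c("a", "b")))

# Scale the train data and keep the parameters scale() stores as attributes
train_s <- scale(train_x)
ctr <- attr(train_s, "scaled:center")
sdv <- attr(train_s, "scaled:scale")

# Apply the train means and standard deviations to the test data
test_s <- scale(test_x, center = ctr, scale = sdv)
test_s
```

Applied to this report, that would mean taking the attributes from scale(Newdf.train[,-1]) before Attrition_num is bound on, and passing them to scale() for the test columns.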

compute.output = compute(nn1.Attr, test.scaled)
Newdf.test$Predict.score = compute.output$net.result

Comments: Using the compute function, predicted probabilities are calculated for the test data.

quantile(Newdf.test$Predict.score, c(0,1,5,10,25,50,75,80,90,95,98,99,100)/100)

##                                                                                                  0% 
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000044 
##                                                                                                  1% 
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000044 
##                                                                                                  5% 
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000074 
##                                                                                                 10% 
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000046757 
##                                                                                                 25% 
## 0.0000000000000000000000000000000000000000000000000000000000000000534841377913839063906742410381412 
##                                                                                                 50% 
## 0.0000000000000000000000000000000259325333293416769846889691208957628987263888120651245117187500000 
##                                                                                                 75% 
## 0.0005154781439665877696559848075708032411057502031326293945312500000000000000000000000000000000000 
##                                                                                                 80% 
## 0.2108857756868171007269552319485228508710861206054687500000000000000000000000000000000000000000000 
##                                                                                                 90% 
## 0.5626696232398592512069512849848251789808273315429687500000000000000000000000000000000000000000000 
##                                                                                                 95% 
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 
##                                                                                                 98% 
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 
##                                                                                                 99% 
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 
##                                                                                                100% 
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

hist(Newdf.test$Predict.score)

plot(nn1.Attr)

Newdf.test$Class.test = ifelse(Newdf.test$Predict.score>0.2,1,0)

Comments: Assigning 0 / 1 class based on a threshold of 0.2 on test data.

pred.t <- ROCR::prediction(Newdf.test$Predict.score, as.numeric(ifelse(Newdf.test$read.HR_Attrition_Data.Attrition=="Yes",1,0)))
perf.t <- performance(pred.t, "tpr", "fpr")
KS.t <- max(attr(perf.t, 'y.values')[[1]]-attr(perf.t, 'x.values')[[1]])
auc.t <- performance(pred.t,"auc"); 
auc.t <- as.numeric(auc.t@y.values)
gini.t = ineq(Newdf.test$Predict.score, type="Gini")

with( Newdf.test, table(Newdf.test$read.HR_Attrition_Data.Attrition, 
                         as.factor(Newdf.test$Class.test)  ))

##      
##         0   1
##   No  644  99
##   Yes  60  79

Comments: 644 + 79 = 723
Total = 882
Accuracy = 723 / 882 = 82.0%
Accuracy was 90.4% on the train data and 82.0% on the test data, a difference of more than 5 percentage points.
Hence, it is an overfit model.
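
Because most employees in the hold-out sample did not leave, it can help to compare the accuracy with the share of the majority class. A minimal sketch using the counts from the table above; cm_test and the metric names are illustrative:

```r
# Test confusion matrix copied from the output above
# (rows = actual class, columns = predicted class)
cm_test <- matrix(c(644, 60, 99, 79), nrow = 2,
                  dimnames = list(actual = c("No", "Yes"), predicted = c("0", "1")))

test_accuracy <- sum(diag(cm_test)) / sum(cm_test)

# Accuracy of always predicting the majority class ("No")
baseline <- max(rowSums(cm_test)) / sum(cm_test)

# Share of actual leavers that the model flags
test_sensitivity <- cm_test["Yes", "1"] / sum(cm_test["Yes", ])

round(c(accuracy = test_accuracy, baseline = baseline,
        sensitivity = test_sensitivity), 3)
```

On these counts the always-No baseline (about 84%) is slightly above the model's accuracy (about 82%), which is why accuracy alone is a weak guide with an imbalanced target.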

auc.t

## [1] 0.75

KS.t

## [1] 0.45

gini.t

## [1] 0.84

Comments: AUC of 0.75 shows the model is fairly good on the test data, although both AUC and KS are lower than on the train data (0.87 and 0.72).

NN_Assignment_Group6

Group 6

August 5, 2019

Cross-validation on the train data can be used to overcome overfitting
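
A minimal sketch of how k-fold cross-validation could assign the 2058 train rows to folds, in base R. The choice of k = 5 is an assumption (the report does not fix a value), and the loop only prints the fold sizes; the model fit and scoring on each fold are left out:

```r
set.seed(1212)
n <- 2058   # number of rows in the train sample
k <- 5      # number of folds (an assumption; the report does not fix k)

# Randomly assign each row to one of k folds of near-equal size
folds <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  fit_rows      <- which(folds != i)  # rows used to fit the model
  validate_rows <- which(folds == i)  # rows held back to score it
  cat("fold", i, ": fit on", length(fit_rows),
      "rows, validate on", length(validate_rows), "rows\n")
}
```

Averaging a metric such as AUC over the k validation folds gives a steadier estimate than a single split, and settings such as the number of hidden nodes can be compared on that average.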