Build Neural Network model on same:
Steps involved should be:
a) Data Import (Target variable is “Attrition” column)
b) Split the data in Dev & Hold Out sample (70:30)
c) Perform Exploratory Data Analysis
d) Identify columns which are of no use and drop those columns
e) Build Neural Network Model (Development sample)
f) Validate NN model on Hold Out. Improve it if need be
setwd("C:/Users/aksha/OneDrive/Documents/Shilpi_xtras/Shilpi_extras/GL_BAPI/Rprgm")
read.HR_Attrition_Data <- read.table("HR_Employee_Attrition_Data-3.csv", sep = ",", header = T)
library(caret)
## Warning: package 'caret' was built under R version 3.5.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(InformationValue)
## Warning: package 'InformationValue' was built under R version 3.5.3
##
## Attaching package: 'InformationValue'
## The following objects are masked from 'package:caret':
##
## confusionMatrix, precision, sensitivity, specificity
glimpse(read.HR_Attrition_Data)
## Observations: 2,940
## Variables: 35
## $ Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...
## $ Attrition <fct> Yes, No, Yes, No, No, No, No, No, No,...
## $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Tra...
## $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ Department <fct> Sales, Research & Development, Resear...
## $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1...
## $ EducationField <fct> Life Sciences, Life Sciences, Other, ...
## $ EmployeeCount <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmployeeNumber <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ EnvironmentSatisfaction <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1...
## $ Gender <fct> Female, Male, Male, Female, Male, Mal...
## $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3...
## $ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobRole <fct> Sales Executive, Research Scientist, ...
## $ JobSatisfaction <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3...
## $ MaritalStatus <fct> Single, Married, Single, Married, Mar...
## $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ Over18 <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y...
## $ OverTime <fct> Yes, No, Yes, Yes, No, No, Yes, No, N...
## $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3...
## $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4...
## $ StandardHours <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
## $ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2...
## $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...
Comments: Data read correctly in R. There are 35 columns and 2940 rows
sum(is.na(read.HR_Attrition_Data))
## [1] 0
Comments: No NAs in data
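A single total hides which columns would be affected if the count were not zero. A minimal sketch, using a small hypothetical data frame `toy` in place of read.HR_Attrition_Data, of counting missing values per column:

```r
# Toy data frame standing in for read.HR_Attrition_Data (hypothetical values).
toy <- data.frame(Age = c(41, NA, 37), Attrition = c("Yes", "No", NA))

# Count missing values per column instead of one grand total.
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```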
str(read.HR_Attrition_Data)
## 'data.frame': 2940 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 3 4 5 6 7 8 9 10 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
Comments: The data contains factor and integer columns.
Of all 35 variables/columns, Attrition is the target variable.
Attrition is unbalanced, with unequal numbers of Yes and No.
summary(read.HR_Attrition_Data)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 No :2466 Non-Travel : 300 Min. : 102.0
## 1st Qu.:30.00 Yes: 474 Travel_Frequently: 554 1st Qu.: 465.0
## Median :36.00 Travel_Rarely :2086 Median : 802.0
## Mean :36.92 Mean : 802.5
## 3rd Qu.:43.00 3rd Qu.:1157.0
## Max. :60.00 Max. :1499.0
##
## Department DistanceFromHome Education
## Human Resources : 126 Min. : 1.000 Min. :1.000
## Research & Development:1922 1st Qu.: 2.000 1st Qu.:2.000
## Sales : 892 Median : 7.000 Median :3.000
## Mean : 9.193 Mean :2.913
## 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000
##
## EducationField EmployeeCount EmployeeNumber
## Human Resources : 54 Min. :1 Min. : 1.0
## Life Sciences :1212 1st Qu.:1 1st Qu.: 735.8
## Marketing : 318 Median :1 Median :1470.5
## Medical : 928 Mean :1 Mean :1470.5
## Other : 164 3rd Qu.:1 3rd Qu.:2205.2
## Technical Degree: 264 Max. :1 Max. :2940.0
##
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement
## Min. :1.000 Female:1176 Min. : 30.00 Min. :1.00
## 1st Qu.:2.000 Male :1764 1st Qu.: 48.00 1st Qu.:2.00
## Median :3.000 Median : 66.00 Median :3.00
## Mean :2.722 Mean : 65.89 Mean :2.73
## 3rd Qu.:4.000 3rd Qu.: 84.00 3rd Qu.:3.00
## Max. :4.000 Max. :100.00 Max. :4.00
##
## JobLevel JobRole JobSatisfaction
## Min. :1.000 Sales Executive :652 Min. :1.000
## 1st Qu.:1.000 Research Scientist :584 1st Qu.:2.000
## Median :2.000 Laboratory Technician :518 Median :3.000
## Mean :2.064 Manufacturing Director :290 Mean :2.729
## 3rd Qu.:3.000 Healthcare Representative:262 3rd Qu.:4.000
## Max. :5.000 Manager :204 Max. :4.000
## (Other) :430
## MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked
## Divorced: 654 Min. : 1009 Min. : 2094 Min. :0.000
## Married :1346 1st Qu.: 2911 1st Qu.: 8045 1st Qu.:1.000
## Single : 940 Median : 4919 Median :14236 Median :2.000
## Mean : 6503 Mean :14313 Mean :2.693
## 3rd Qu.: 8380 3rd Qu.:20462 3rd Qu.:4.000
## Max. :19999 Max. :26999 Max. :9.000
##
## Over18 OverTime PercentSalaryHike PerformanceRating
## Y:2940 No :2108 Min. :11.00 Min. :3.000
## Yes: 832 1st Qu.:12.00 1st Qu.:3.000
## Median :14.00 Median :3.000
## Mean :15.21 Mean :3.154
## 3rd Qu.:18.00 3rd Qu.:3.000
## Max. :25.00 Max. :4.000
##
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## Min. :1.000 Min. :80 Min. :0.0000 Min. : 0.00
## 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000 1st Qu.: 6.00
## Median :3.000 Median :80 Median :1.0000 Median :10.00
## Mean :2.712 Mean :80 Mean :0.7939 Mean :11.28
## 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000 3rd Qu.:15.00
## Max. :4.000 Max. :80 Max. :3.0000 Max. :40.00
##
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 Median :3.000 Median : 5.000 Median : 3.000
## Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000
##
## YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 1.000 Median : 3.000
## Mean : 2.188 Mean : 4.123
## 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :15.000 Max. :17.000
##
Comments: From the summary, MonthlyIncome has a wide range (1009 to 19999).
nearZeroVar(read.HR_Attrition_Data, saveMetrics= TRUE)
## freqRatio percentUnique zeroVar nzv
## Age 1.012987 1.46258503 FALSE FALSE
## Attrition 5.202532 0.06802721 FALSE FALSE
## BusinessTravel 3.765343 0.10204082 FALSE FALSE
## DailyRate 1.200000 30.13605442 FALSE FALSE
## Department 2.154709 0.10204082 FALSE FALSE
## DistanceFromHome 1.014423 0.98639456 FALSE FALSE
## Education 1.437186 0.17006803 FALSE FALSE
## EducationField 1.306034 0.20408163 FALSE FALSE
## EmployeeCount 0.000000 0.03401361 TRUE TRUE
## EmployeeNumber 1.000000 100.00000000 FALSE FALSE
## EnvironmentSatisfaction 1.015695 0.13605442 FALSE FALSE
## Gender 1.500000 0.06802721 FALSE FALSE
## HourlyRate 1.035714 2.41496599 FALSE FALSE
## JobInvolvement 2.314667 0.13605442 FALSE FALSE
## JobLevel 1.016854 0.17006803 FALSE FALSE
## JobRole 1.116438 0.30612245 FALSE FALSE
## JobSatisfaction 1.038462 0.13605442 FALSE FALSE
## MaritalStatus 1.431915 0.10204082 FALSE FALSE
## MonthlyIncome 1.333333 45.88435374 FALSE FALSE
## MonthlyRate 1.000000 48.53741497 FALSE FALSE
## NumCompaniesWorked 2.644670 0.34013605 FALSE FALSE
## Over18 0.000000 0.03401361 TRUE TRUE
## OverTime 2.533654 0.06802721 FALSE FALSE
## PercentSalaryHike 1.004785 0.51020408 FALSE FALSE
## PerformanceRating 5.504425 0.06802721 FALSE FALSE
## RelationshipSatisfaction 1.062500 0.13605442 FALSE FALSE
## StandardHours 0.000000 0.03401361 TRUE TRUE
## StockOptionLevel 1.058725 0.13605442 FALSE FALSE
## TotalWorkingYears 1.616000 1.36054422 FALSE FALSE
## TrainingTimesLastYear 1.114053 0.23809524 FALSE FALSE
## WorkLifeBalance 2.595930 0.13605442 FALSE FALSE
## YearsAtCompany 1.146199 1.25850340 FALSE FALSE
## YearsInCurrentRole 1.524590 0.64625850 FALSE FALSE
## YearsSinceLastPromotion 1.627451 0.54421769 FALSE FALSE
## YearsWithCurrManager 1.307985 0.61224490 FALSE FALSE
Comments: Removing columns that carry no information.
Drop columns flagged with zeroVar = TRUE (a single value) and columns with percentUnique = 100 (every value unique).
EmployeeNumber, EmployeeCount, Over18 and StandardHours are dropped.
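The drop rule above can be sketched in base R without caret. The data frame `toy` and its values are hypothetical; the check simply counts distinct values per column. Note that an all-unique test would also flag a genuinely continuous predictor, so the flagged columns should be reviewed before dropping.

```r
# Hypothetical miniature of the HR data.
toy <- data.frame(
  EmployeeNumber = 1:6,                       # unique identifier
  EmployeeCount  = rep(1L, 6),                # constant column
  Age            = c(41L, 49L, 37L, 41L, 27L, 49L),
  OverTime       = factor(c("Yes", "No", "Yes", "Yes", "No", "No"))
)

# Number of distinct values in each column.
n_unique <- sapply(toy, function(col) length(unique(col)))

is_constant   <- n_unique == 1          # one value only
is_identifier <- n_unique == nrow(toy)  # every value different

# Keep only the columns that are neither constant nor identifiers.
kept <- toy[, !(is_constant | is_identifier), drop = FALSE]
names(kept)
```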
table(read.HR_Attrition_Data$Attrition)
##
## No Yes
## 2466 474
Comments: Unbalanced nature of data
read.HR_Attrition_Data %>% select_if(is.numeric) %>%
glimpse()
## Observations: 2,940
## Variables: 26
## $ Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...
## $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1...
## $ EmployeeCount <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmployeeNumber <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ EnvironmentSatisfaction <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1...
## $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3...
## $ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobSatisfaction <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3...
## $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3...
## $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4...
## $ StandardHours <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
## $ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2...
## $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...
read.HR_Attrition_Data %>% select_if(is.factor)%>%
glimpse()
## Observations: 2,940
## Variables: 9
## $ Attrition <fct> Yes, No, Yes, No, No, No, No, No, No, No, No, N...
## $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Travel_Rarely...
## $ Department <fct> Sales, Research & Development, Research & Devel...
## $ EducationField <fct> Life Sciences, Life Sciences, Other, Life Scien...
## $ Gender <fct> Female, Male, Male, Female, Male, Male, Female,...
## $ JobRole <fct> Sales Executive, Research Scientist, Laboratory...
## $ MaritalStatus <fct> Single, Married, Single, Married, Married, Sing...
## $ Over18 <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,...
## $ OverTime <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, No,...
Comments: Segregating columns based on their data types.
Rule of thumb for interpreting the Information Value (IV) of a predictor:
Less than 0.02: the predictor is not useful for modeling (separating the Goods from the Bads).
0.02 to 0.1: the predictor has only a weak relationship.
0.1 to 0.3: the predictor has a medium-strength relationship.
0.3 or higher: the predictor has a strong relationship.
Business Travel EDA on whole data
Att=ifelse(read.HR_Attrition_Data$Attrition=="Yes",1,0)
options(scipen = 999, digits = 2)
WOETable(X=read.HR_Attrition_Data$BusinessTravel, Y=Att)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 Non-Travel 24 276 300 0.051 0.11 -0.793 0.0486
## 2 Travel_Frequently 138 416 554 0.291 0.17 0.546 0.0668
## 3 Travel_Rarely 312 1774 2086 0.658 0.72 -0.089 0.0054
Comments: The IV contribution of every category is below 0.1, so the relationship with Attrition is treated as weak.
This predictor can be ignored.
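The IV column in the table is reported per category. The IV of a predictor is conventionally the sum of those contributions, and rule-of-thumb bands are usually read against that total. A minimal sketch in base R, using only the GOODS and BADS counts printed for BusinessTravel (the object names are illustrative, not from the original analysis); the per-category values sum to roughly 0.12:

```r
# Counts copied from the BusinessTravel WOE table above.
counts <- data.frame(
  cat   = c("Non-Travel", "Travel_Frequently", "Travel_Rarely"),
  goods = c(24, 138, 312),    # Attrition == "Yes"
  bads  = c(276, 416, 1774)   # Attrition == "No"
)

# Share of all goods / all bads that fall in each category.
pct_g <- counts$goods / sum(counts$goods)
pct_b <- counts$bads / sum(counts$bads)

# Weight of evidence and per-category IV contribution.
woe       <- log(pct_g / pct_b)
iv_by_cat <- (pct_g - pct_b) * woe

# Predictor-level IV is the sum over categories.
total_iv <- sum(iv_by_cat)
print(round(iv_by_cat, 4))
print(round(total_iv, 4))
```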
WOETable(X=read.HR_Attrition_Data$Department, Y=Att)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 Human Resources 24 102 126 0.051 0.041 0.20 0.0019
## 2 Research & Development 266 1656 1922 0.561 0.672 -0.18 0.0198
## 3 Sales 184 708 892 0.388 0.287 0.30 0.0305
Comments: Since the IV values are quite low, the predictor is not useful for modeling.
This predictor can be ignored.
WOETable(X=read.HR_Attrition_Data$EducationField, Y=Att)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 Human Resources 14 40 54 0.030 0.016 0.60 0.0080
## 2 Life Sciences 178 1034 1212 0.376 0.419 -0.11 0.0048
## 3 Marketing 70 248 318 0.148 0.101 0.38 0.0181
## 4 Medical 126 802 928 0.266 0.325 -0.20 0.0120
## 5 Other 22 142 164 0.046 0.058 -0.22 0.0024
## 6 Technical Degree 64 200 264 0.135 0.081 0.51 0.0275
Comments: Since the IV values are low, the predictor has a weak relationship.
This predictor can be ignored.
WOETable(X=read.HR_Attrition_Data$Gender, Y=Att)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 Female 174 1002 1176 0.37 0.41 -0.102 0.0040
## 2 Male 300 1464 1764 0.63 0.59 0.064 0.0025
Comments: Since the IV values are low, this predictor can be ignored.
WOETable(X=read.HR_Attrition_Data$JobRole, Y=Att)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE
## 1 Healthcare Representative 18 244 262 0.0380 0.099 -0.958
## 2 Human Resources 24 80 104 0.0506 0.032 0.445
## 3 Laboratory Technician 124 394 518 0.2616 0.160 0.493
## 4 Manager 10 194 204 0.0211 0.079 -1.316
## 5 Manufacturing Director 20 270 290 0.0422 0.109 -0.954
## 6 Research Director 4 156 160 0.0084 0.063 -2.014
## 7 Research Scientist 94 490 584 0.1983 0.199 -0.002
## 8 Sales Executive 114 538 652 0.2405 0.218 0.097
## 9 Sales Representative 66 100 166 0.1392 0.041 1.234
## IV
## 1 0.05838892
## 2 0.00809845
## 3 0.05021016
## 4 0.07577324
## 5 0.06416873
## 6 0.11043337
## 7 0.00000077
## 8 0.00217775
## 9 0.12174571
Comments: Since the IV values are moderate, the predictor has medium strength and is considered.
WOETable(X=read.HR_Attrition_Data$MaritalStatus, Y=Att)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 Divorced 66 588 654 0.14 0.24 -0.54 0.053
## 2 Married 168 1178 1346 0.35 0.48 -0.30 0.037
## 3 Single 240 700 940 0.51 0.28 0.58 0.129
Comments: The IV values for MaritalStatus are moderate, so the predictor has medium strength and is considered.
WOETable(X=read.HR_Attrition_Data$OverTime, Y=Att)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 No 220 1888 2108 0.46 0.77 -0.50 0.15
## 2 Yes 254 578 832 0.54 0.23 0.83 0.25
Comments: The IV values for OverTime are moderate, so the predictor has medium strength and is considered.
ggplot(data=read.HR_Attrition_Data,aes(x=Age))+
geom_bar(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of Age")
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=Age)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
hist(read.HR_Attrition_Data$Age)
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.
ggplot(data=read.HR_Attrition_Data,aes(x=DailyRate))+
geom_histogram(alpha=0.5,fill="red",color="black") +
ggtitle("Histogram of DailyRate")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=DailyRate)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
hist(read.HR_Attrition_Data$DailyRate)
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be ignored.
####DistanceFromHome EDA
ggplot(data=read.HR_Attrition_Data,aes(x=DistanceFromHome))+
geom_bar(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of DistanceFromHome")
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=DistanceFromHome)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
hist(read.HR_Attrition_Data$DistanceFromHome)
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.
ggplot(data=read.HR_Attrition_Data,aes(x=Education))+
geom_bar(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of Education")
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=Education)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
hist(read.HR_Attrition_Data$Education)
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be ignored
ggplot(data=read.HR_Attrition_Data,aes(x=EnvironmentSatisfaction))+
geom_bar(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of EnvironmentSatisfaction")
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=EnvironmentSatisfaction)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.
ggplot(data=read.HR_Attrition_Data,aes(x=HourlyRate))+
geom_bar(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of HourlyRate")
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=HourlyRate)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
hist(read.HR_Attrition_Data$HourlyRate)
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be ignored.
WOETable(X=as.factor(read.HR_Attrition_Data$JobInvolvement), Y=Att)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 1 56 110 166 0.118 0.045 0.97 0.072
## 2 2 142 608 750 0.300 0.247 0.19 0.010
## 3 3 250 1486 1736 0.527 0.603 -0.13 0.010
## 4 4 26 262 288 0.055 0.106 -0.66 0.034
Comments: It is an ordinal value, hence WOE is calculated.
Since the IV is low, this predictor can be ignored.
WOETable(X=as.factor(read.HR_Attrition_Data$JobLevel), Y=Att)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 1 286 800 1086 0.603 0.324 0.62 0.1731
## 2 2 104 964 1068 0.219 0.391 -0.58 0.0991
## 3 3 64 372 436 0.135 0.151 -0.11 0.0018
## 4 4 10 202 212 0.021 0.082 -1.36 0.0825
## 5 5 10 128 138 0.021 0.052 -0.90 0.0277
Comments: It is an ordinal value, hence WOE is calculated.
Since the IV is moderate, this predictor can be considered.
WOETable(X=as.factor(read.HR_Attrition_Data$JobSatisfaction), Y=Att)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 1 132 446 578 0.28 0.18 0.432 0.042136
## 2 2 92 468 560 0.19 0.19 0.022 0.000097
## 3 3 146 738 884 0.31 0.30 0.029 0.000252
## 4 4 104 814 918 0.22 0.33 -0.408 0.045204
Comments: It is an ordinal value, hence WOE is calculated.
Since the IV is low, this predictor can be ignored.
ggplot(data=read.HR_Attrition_Data,aes(x=MonthlyIncome))+
geom_histogram(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of MonthlyIncome")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
hist(read.HR_Attrition_Data$MonthlyIncome)
hist(log2(read.HR_Attrition_Data$MonthlyIncome))
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=MonthlyIncome)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be well considered.
ggplot(data=read.HR_Attrition_Data,aes(x=MonthlyRate))+
geom_histogram(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of MonthlyRate")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
hist(read.HR_Attrition_Data$MonthlyRate)
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=MonthlyRate)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.
ggplot(data=read.HR_Attrition_Data,aes(x=NumCompaniesWorked))+
geom_bar(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of NumCompaniesWorked")
hist(read.HR_Attrition_Data$NumCompaniesWorked)
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=NumCompaniesWorked)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.
ggplot(data=read.HR_Attrition_Data,aes(x=PercentSalaryHike))+
geom_bar(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of PercentSalaryHike")
hist(read.HR_Attrition_Data$PercentSalaryHike)
boxplot(read.HR_Attrition_Data$PercentSalaryHike,horizontal = T)
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=PercentSalaryHike)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.
WOETable(X=as.factor(read.HR_Attrition_Data$PerformanceRating), Y=Att)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 3 400 2088 2488 0.84 0.85 -0.0034 0.0000095
## 2 4 74 378 452 0.16 0.15 0.0183 0.0000519
Comments: It is an ordinal value, hence WOE is calculated.
Since the IV is low, this predictor can be ignored.
WOETable(X=as.factor(read.HR_Attrition_Data$RelationshipSatisfaction), Y=Att)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 1 114 438 552 0.24 0.18 0.303 0.01906
## 2 2 90 516 606 0.19 0.21 -0.097 0.00188
## 3 3 142 776 918 0.30 0.31 -0.049 0.00074
## 4 4 128 736 864 0.27 0.30 -0.100 0.00284
Comments: It is an ordinal value, hence WOE is calculated.
Since the IV is low, this predictor can be ignored.
####StockOptionLevel EDA
WOETable(X=as.factor(read.HR_Attrition_Data$StockOptionLevel), Y=Att)
## CAT GOODS BADS TOTAL PCT_G PCT_B WOE IV
## 1 0 308 954 1262 0.650 0.387 0.52 0.13635
## 2 1 112 1080 1192 0.236 0.438 -0.62 0.12444
## 3 2 24 292 316 0.051 0.118 -0.85 0.05758
## 4 3 30 140 170 0.063 0.057 0.11 0.00071
Comments: It is an ordinal value, hence WOE is calculated.
Since the IV is moderate (0.14 and 0.12 for levels 0 and 1), this predictor can be considered.
ggplot(data=read.HR_Attrition_Data,aes(x=TotalWorkingYears))+
geom_bar(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of TotalWorkingYears")
hist(read.HR_Attrition_Data$TotalWorkingYears)
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=TotalWorkingYears)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.
glimpse(read.HR_Attrition_Data$TrainingTimesLastYear)
## int [1:2940] 0 3 3 3 3 2 3 2 2 3 ...
ggplot(data=read.HR_Attrition_Data,aes(x=TrainingTimesLastYear))+
geom_bar(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of TrainingTimesLastYear")
hist(read.HR_Attrition_Data$TrainingTimesLastYear)
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=TrainingTimesLastYear)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.
ggplot(data=read.HR_Attrition_Data,aes(x=YearsAtCompany))+
geom_bar(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of YearsAtCompany")
hist(read.HR_Attrition_Data$YearsAtCompany)
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=YearsAtCompany)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.
ggplot(data=read.HR_Attrition_Data,aes(x=YearsInCurrentRole))+
geom_bar(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of YearsInCurrentRole")
hist(read.HR_Attrition_Data$YearsInCurrentRole)
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=YearsInCurrentRole)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.
ggplot(data=read.HR_Attrition_Data,aes(x=YearsSinceLastPromotion))+
geom_bar(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of YearsSinceLastPromotion")
hist(read.HR_Attrition_Data$YearsSinceLastPromotion)
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=YearsSinceLastPromotion)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
Comments: As per the boxplot for Attrition as Yes and No; this Predictor can be considered.
ggplot(data=read.HR_Attrition_Data,aes(x=YearsWithCurrManager))+
geom_bar(alpha=0.5,fill="red",color="black") +
ggtitle("Norm Distribution of YearsWithCurrManager")
hist(read.HR_Attrition_Data$YearsWithCurrManager)
ggplot(data=read.HR_Attrition_Data,mapping=aes(x=" ",y=YearsWithCurrManager)) + geom_boxplot(aes(color=read.HR_Attrition_Data$Attrition))
Columns considered for modeling (including the target, Attrition):
Age, Attrition, DistanceFromHome, EnvironmentSatisfaction, JobLevel, JobRole, MaritalStatus, MonthlyRate, NumCompaniesWorked, OverTime, StockOptionLevel, TrainingTimesLastYear, WorkLifeBalance, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager, MonthlyIncome, PercentSalaryHike, TotalWorkingYears, YearsAtCompany
Columns dropped (single value or unique identifier):
EmployeeCount, Over18, StandardHours, EmployeeNumber
Columns ignored after EDA:
BusinessTravel, Department, EducationField, Gender, DailyRate, Education, HourlyRate, JobInvolvement, JobSatisfaction, PerformanceRating, RelationshipSatisfaction
Newdf<-data.frame( read.HR_Attrition_Data$Attrition
,read.HR_Attrition_Data$MonthlyIncome
,read.HR_Attrition_Data$PercentSalaryHike
,read.HR_Attrition_Data$TotalWorkingYears
,read.HR_Attrition_Data$YearsAtCompany
,read.HR_Attrition_Data$Age
,read.HR_Attrition_Data$DistanceFromHome
,read.HR_Attrition_Data$EnvironmentSatisfaction
,read.HR_Attrition_Data$JobLevel
,read.HR_Attrition_Data$JobRole
,read.HR_Attrition_Data$MaritalStatus
,read.HR_Attrition_Data$MonthlyRate
,read.HR_Attrition_Data$NumCompaniesWorked
,read.HR_Attrition_Data$OverTime
,read.HR_Attrition_Data$StockOptionLevel
,read.HR_Attrition_Data$TrainingTimesLastYear
,read.HR_Attrition_Data$WorkLifeBalance
,read.HR_Attrition_Data$YearsInCurrentRole
,read.HR_Attrition_Data$YearsSinceLastPromotion
,read.HR_Attrition_Data$YearsWithCurrManager
)
Comments: A new data frame called Newdf is created.
Newdf %>% select_if(is.numeric) %>%
glimpse()
## Observations: 2,940
## Variables: 16
## $ read.HR_Attrition_Data.MonthlyIncome <int> 5993, 5130, 209...
## $ read.HR_Attrition_Data.PercentSalaryHike <int> 11, 23, 15, 11,...
## $ read.HR_Attrition_Data.TotalWorkingYears <int> 8, 10, 7, 8, 6,...
## $ read.HR_Attrition_Data.YearsAtCompany <int> 6, 10, 0, 8, 2,...
## $ read.HR_Attrition_Data.Age <int> 41, 49, 37, 33,...
## $ read.HR_Attrition_Data.DistanceFromHome <int> 1, 8, 2, 3, 2, ...
## $ read.HR_Attrition_Data.EnvironmentSatisfaction <int> 2, 3, 4, 4, 1, ...
## $ read.HR_Attrition_Data.JobLevel <int> 2, 2, 1, 1, 1, ...
## $ read.HR_Attrition_Data.MonthlyRate <int> 19479, 24907, 2...
## $ read.HR_Attrition_Data.NumCompaniesWorked <int> 8, 1, 6, 1, 9, ...
## $ read.HR_Attrition_Data.StockOptionLevel <int> 0, 1, 0, 0, 1, ...
## $ read.HR_Attrition_Data.TrainingTimesLastYear <int> 0, 3, 3, 3, 3, ...
## $ read.HR_Attrition_Data.WorkLifeBalance <int> 1, 3, 3, 3, 3, ...
## $ read.HR_Attrition_Data.YearsInCurrentRole <int> 4, 7, 0, 7, 2, ...
## $ read.HR_Attrition_Data.YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, ...
## $ read.HR_Attrition_Data.YearsWithCurrManager <int> 5, 7, 0, 0, 2, ...
Newdf %>% select_if(is.factor)%>%
glimpse()
## Observations: 2,940
## Variables: 4
## $ read.HR_Attrition_Data.Attrition <fct> Yes, No, Yes, No, No, No,...
## $ read.HR_Attrition_Data.JobRole <fct> Sales Executive, Research...
## $ read.HR_Attrition_Data.MaritalStatus <fct> Single, Married, Single, ...
## $ read.HR_Attrition_Data.OverTime <fct> Yes, No, Yes, Yes, No, No...
Comments: This data frame holds the whole data: 16 numeric columns and 4 categorical columns, including the target variable.
set.seed(1212)
s <- sample(c(1:2940), size = 2058)
Newdf.train <- Newdf[s,]
Newdf.test <- Newdf[-s,]
nrow(Newdf.train)
## [1] 2058
nrow(Newdf.test)
## [1] 882
Comments: The train data has 2058 rows
The test data has 882 rows
table(Newdf.train[,1])
##
## No Yes
## 1723 335
table(Newdf.test[,1])
##
## No Yes
## 743 139
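The simple random split above happens to give about 16% Yes in both samples. With unbalanced data, a stratified split guarantees that the proportions match. A minimal base R sketch, assuming a hypothetical target vector `attrition` with the same class counts as the full data:

```r
set.seed(1212)

# Hypothetical target vector with the same class counts as the data above.
attrition <- factor(rep(c("No", "Yes"), times = c(2466, 474)))

# Sample 70% of the row indices within each class, then combine them.
idx_by_class <- split(seq_along(attrition), attrition)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_y <- attrition[train_idx]
test_y  <- attrition[-train_idx]

print(table(train_y))
print(table(test_y))
```

The same `train_idx` would then be used to subset the rows of the data frame, in the way `s` is used above.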
JobRole.matrix <- model.matrix(~ read.HR_Attrition_Data.JobRole - 1,data = Newdf.train)
Newdf.train <- data.frame(Newdf.train, JobRole.matrix)
MaritalStatus.matrix <- model.matrix(~ read.HR_Attrition_Data.MaritalStatus - 1,data = Newdf.train)
Newdf.train <- data.frame(Newdf.train, MaritalStatus.matrix)
OverTime.matrix <- model.matrix(~ read.HR_Attrition_Data.OverTime - 1,data = Newdf.train)
Newdf.train <- data.frame(Newdf.train, OverTime.matrix)
Comments: So that they can be scaled, the categorical variables are converted to integer dummy variables.
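With `- 1` in the formula, model.matrix keeps one indicator column per level, so the indicators of a factor always sum to 1 (for example OverTimeNo and OverTimeYes carry the same information). A minimal sketch on a hypothetical factor `role`, showing the full encoding and a reduced encoding that drops the first level:

```r
# Hypothetical factor standing in for a column such as JobRole.
role <- factor(c("Manager", "Sales Executive", "Manager", "Research Scientist"))

# Full encoding: one indicator column per level (no intercept).
full <- model.matrix(~ role - 1)

# Reduced encoding: keep the intercept in the formula, then drop its column,
# which leaves one indicator fewer than the number of levels.
reduced <- model.matrix(~ role)[, -1, drop = FALSE]

print(colnames(full))
print(colnames(reduced))
```

Either encoding can be fed to a neural network; the reduced form only removes the redundant column.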
Newdf.train<-Newdf.train[,-c(10,11,14)]
Comments: From a total of 34 columns in the train data, the categorical variables JobRole, MaritalStatus and OverTime are dropped.
This leaves a total of 31 columns in Newdf.train.
Attrition_num<-ifelse(Newdf.train$read.HR_Attrition_Data.Attrition == 'Yes',1,0)
class(Attrition_num)
## [1] "numeric"
train.dev.scaled <- scale(Newdf.train[,-1])
Comments: the Target column is removed and the data is scaled.
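When the hold-out sample is scored later, it is usually scaled with the means and standard deviations learned from the development sample rather than with its own. A minimal sketch, assuming hypothetical data frames `train` and `test` with the same numeric columns:

```r
# Hypothetical development and hold-out data with matching columns.
train <- data.frame(MonthlyIncome = c(2000, 4000, 6000, 8000),
                    Age           = c(25, 35, 45, 55))
test  <- data.frame(MonthlyIncome = c(3000, 9000),
                    Age           = c(30, 60))

# scale() stores the centering and scaling values as attributes.
train_scaled <- scale(train)
train_center <- attr(train_scaled, "scaled:center")
train_scale  <- attr(train_scaled, "scaled:scale")

# Reuse the development-sample statistics on the hold-out sample.
test_scaled <- scale(test, center = train_center, scale = train_scale)
print(test_scaled)
```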
allVars.train<-colnames(train.dev.scaled)
predictor.Variables<-paste(allVars.train,collapse="+")
form=as.formula(paste("Attrition_num ~",predictor.Variables,collapse="+"))
form
## Attrition_num ~ read.HR_Attrition_Data.MonthlyIncome + read.HR_Attrition_Data.PercentSalaryHike +
## read.HR_Attrition_Data.TotalWorkingYears + read.HR_Attrition_Data.YearsAtCompany +
## read.HR_Attrition_Data.Age + read.HR_Attrition_Data.DistanceFromHome +
## read.HR_Attrition_Data.EnvironmentSatisfaction + read.HR_Attrition_Data.JobLevel +
## read.HR_Attrition_Data.MonthlyRate + read.HR_Attrition_Data.NumCompaniesWorked +
## read.HR_Attrition_Data.StockOptionLevel + read.HR_Attrition_Data.TrainingTimesLastYear +
## read.HR_Attrition_Data.WorkLifeBalance + read.HR_Attrition_Data.YearsInCurrentRole +
## read.HR_Attrition_Data.YearsSinceLastPromotion + read.HR_Attrition_Data.YearsWithCurrManager +
## read.HR_Attrition_Data.JobRoleHealthcare.Representative +
## read.HR_Attrition_Data.JobRoleHuman.Resources + read.HR_Attrition_Data.JobRoleLaboratory.Technician +
## read.HR_Attrition_Data.JobRoleManager + read.HR_Attrition_Data.JobRoleManufacturing.Director +
## read.HR_Attrition_Data.JobRoleResearch.Director + read.HR_Attrition_Data.JobRoleResearch.Scientist +
## read.HR_Attrition_Data.JobRoleSales.Executive + read.HR_Attrition_Data.JobRoleSales.Representative +
## read.HR_Attrition_Data.MaritalStatusDivorced + read.HR_Attrition_Data.MaritalStatusMarried +
## read.HR_Attrition_Data.MaritalStatusSingle + read.HR_Attrition_Data.OverTimeNo +
## read.HR_Attrition_Data.OverTimeYes
Comments: form lists the columns on which the neural network will run.
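Comments: The paste() and as.formula() steps above can also be written with reformulate() from the stats package. This is only a sketch: the predictor names below are hypothetical stand-ins for colnames(train.dev.scaled), and `form.sketch` is not used elsewhere in the analysis.

```r
# Hypothetical predictor names standing in for colnames(train.dev.scaled)
predictors <- c("Age", "MonthlyIncome", "OverTimeYes")

# reformulate() builds response ~ term1 + term2 + ... in one call
form.sketch <- reformulate(termlabels = predictors, response = "Attrition_num")
form.sketch
```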
train.dev.scaled <- cbind(Attrition_num, train.dev.scaled)
Comments: The numeric Attrition column is added to the scaled train data so that the data set contains the target as well as the independent variables before running the neuralnet function.
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 3.5.3
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
set.seed(1212)
nn1.Attr <- neuralnet(formula=form,
data = train.dev.scaled,
hidden = 3,
err.fct = "sse",
linear.output = FALSE,
lifesign = "full",
lifesign.step = 2000,
threshold = 0.01,
stepmax = 200000
)
## hidden: 3 thresh: 0.01 rep: 1/1 steps:
## 2000 min thresh: 0.0559604882956513
## 4000 min thresh: 0.0532541743930464
## 6000 min thresh: 0.0297463576005015
## 8000 min thresh: 0.0230909932692286
## 10000 min thresh: 0.0230909932692286
## 12000 min thresh: 0.0218351879306054
## 14000 min thresh: 0.0135645371330507
## 16000 min thresh: 0.0124392125335601
## 18000 min thresh: 0.0100557167590136
## 20000 min thresh: 0.0100557167590136
## 20209 error: 66.85872 time: 21.54 secs
plot(nn1.Attr)
Newdf.train$Prob = nn1.Attr$net.result[[1]]
quantile( Newdf.train$Prob, c(0,1,5,10,25,50,75,80,90,95,98,99,100)/100)
## 0%
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000044
## 1%
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000044
## 5%
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000054
## 10%
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000009767
## 25%
## 0.0000000000000000000000000000000000000000000000000000000000000000860700222054994643246239593992186
## 50%
## 0.0000000000000000000000000000000681137920984060446939062360982575228263158351182937622070312500000
## 75%
## 0.0031410413657090734997068270928366473526693880558013916015625000000000000000000000000000000000000
## 80%
## 0.1292709043456795048321339436370180919766426086425781250000000000000000000000000000000000000000000
## 90%
## 0.8337076493282028488707169344706926494836807250976562500000000000000000000000000000000000000000000
## 95%
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 98%
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 99%
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 100%
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
hist(Newdf.train$Prob)
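Comments: The histogram shows predicted probabilities across the full 0 to 1 range, but the quantiles above show that most of them sit very close to 0 or 1 (75% are below 0.004, and the 95th percentile onward is printed as 1).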
library(caret)
library(e1071)
## Warning: package 'e1071' was built under R version 3.5.3
Newdf.train$Class = ifelse(Newdf.train$Prob>0.2,1,0)
Comments: Assigning class 0 / 1 on the train data using a probability threshold of 0.2.
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.5.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.5.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
##
## Attaching package: 'ROCR'
## The following object is masked from 'package:neuralnet':
##
## prediction
library(ineq)
pred <- ROCR::prediction(Newdf.train$Prob, as.numeric(ifelse(Newdf.train$read.HR_Attrition_Data.Attrition=="Yes",1,0)))
perf <- performance(pred, "tpr", "fpr")
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
auc <- performance(pred,"auc");
auc <- as.numeric(auc@y.values)
gini = ineq(Newdf.train$Prob, type="Gini")
with( Newdf.train, table(Newdf.train$read.HR_Attrition_Data.Attrition,
as.factor(Newdf.train$Class) ))
##
## 0 1
## No 1595 128
## Yes 70 265
Comments: Accuracy = (TP + TN) / total
1595 + 265 = 1860
1860 / 2058 = 90.4%
This could be an overfit model too, so it needs to be tested on the test data.
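Comments: The same arithmetic can be done in R from the confusion matrix counts. The sketch below copies the four train counts from the table above; the names `cm.train`, `accuracy`, `sensitivity` and `specificity` are introduced here for illustration only.

```r
# Train confusion matrix: rows = actual class, columns = predicted class
cm.train <- matrix(c(1595, 70, 128, 265), nrow = 2,
                   dimnames = list(actual = c("No", "Yes"),
                                   predicted = c("0", "1")))

# Accuracy: correctly classified rows over all rows
accuracy <- sum(diag(cm.train)) / sum(cm.train)

# Sensitivity: share of actual "Yes" rows predicted as 1
sensitivity <- cm.train["Yes", "1"] / sum(cm.train["Yes", ])

# Specificity: share of actual "No" rows predicted as 0
specificity <- cm.train["No", "0"] / sum(cm.train["No", ])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

Because only about 16% of the train rows are "Yes", sensitivity and specificity are worth reading alongside accuracy.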
auc
## [1] 0.87
KS
## [1] 0.72
gini
## [1] 0.85
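Comments: AUC of 0.87 shows it is a good model on the train data.
KS of 0.72 also indicates strong discriminatory power. Note that gini here comes from ineq() applied to the predicted probabilities, so it measures how unequal the scores are; the model Gini coefficient derived from AUC would be 2*0.87 - 1 = 0.74.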
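Comments: To make the three measures concrete, the sketch below computes AUC, the AUC-based Gini coefficient (2*AUC - 1) and KS in base R on toy data. The scores and labels are hypothetical, and the base R formulas are an illustration, not the ROCR calls used in this analysis.

```r
# Toy scores and labels (hypothetical, not from the HR data)
score <- c(0.05, 0.10, 0.20, 0.35, 0.60, 0.80, 0.90, 0.95)
label <- c(0,    0,    1,    0,    0,    1,    1,    1)

n_pos <- sum(label == 1)
n_neg <- sum(label == 0)

# AUC via the rank-sum (Mann-Whitney) identity
auc.sketch <- (sum(rank(score)[label == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

# Gini coefficient of the model as a rescaling of AUC
gini.sketch <- 2 * auc.sketch - 1

# KS: largest gap between the score distributions of the two classes
ks.sketch <- max(abs(ecdf(score[label == 1])(score) -
                     ecdf(score[label == 0])(score)))

c(auc = auc.sketch, gini = gini.sketch, ks = ks.sketch)
```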
table(Newdf.test$read.HR_Attrition_Data.Attrition)
##
## No Yes
## 743 139
JobRole.matrix <- model.matrix(~ read.HR_Attrition_Data.JobRole - 1,data = Newdf.test)
Newdf.test <- data.frame(Newdf.test, JobRole.matrix)
MaritalStatus.matrix <- model.matrix(~ read.HR_Attrition_Data.MaritalStatus - 1,data = Newdf.test)
Newdf.test <- data.frame(Newdf.test, MaritalStatus.matrix)
OverTime.matrix <- model.matrix(~ read.HR_Attrition_Data.OverTime - 1,data = Newdf.test)
Newdf.test <- data.frame(Newdf.test, OverTime.matrix)
Comments: Dummy variables are created for the test data in the same way as for the train data.
Newdf.test<-Newdf.test[,-c(10,11,14)]
Comments: Since dummy variables are created, getting rid of unwanted categorical variables
test.scaled <- scale(Newdf.test[,-1])
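Comments: Scaling the test data without the target column. Note that scale() here uses the test data's own means and standard deviations, not the ones computed on the train data.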
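Comments: A common alternative is to scale the test data with the centers and scales of the train data, so that both samples go through the same transformation. The sketch below shows this on toy matrices; the values and the names `train.x`, `test.x`, `train.x.scaled` and `test.x.scaled` are hypothetical, and this was not done in the analysis above.

```r
# Toy train/test matrices (hypothetical values)
train.x <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2,
                  dimnames = list(NULL, c("a", "b")))
test.x  <- matrix(c(2, 3, 15, 25), ncol = 2,
                  dimnames = list(NULL, c("a", "b")))

# scale() stores the column means and standard deviations as attributes
train.x.scaled <- scale(train.x)

# Reuse the train centers and scales when transforming the test data
test.x.scaled <- scale(test.x,
                       center = attr(train.x.scaled, "scaled:center"),
                       scale  = attr(train.x.scaled, "scaled:scale"))
test.x.scaled
```

Applying this to the real data would change the test scores, so the test results reported here would need to be recomputed.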
compute.output = compute(nn1.Attr, test.scaled)
Newdf.test$Predict.score = compute.output$net.result
Comments: Using the compute function, the predicted probability is calculated for the test data.
quantile(Newdf.test$Predict.score, c(0,1,5,10,25,50,75,80,90,95,98,99,100)/100)
## 0%
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000044
## 1%
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000044
## 5%
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000074
## 10%
## 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000046757
## 25%
## 0.0000000000000000000000000000000000000000000000000000000000000000534841377913839063906742410381412
## 50%
## 0.0000000000000000000000000000000259325333293416769846889691208957628987263888120651245117187500000
## 75%
## 0.0005154781439665877696559848075708032411057502031326293945312500000000000000000000000000000000000
## 80%
## 0.2108857756868171007269552319485228508710861206054687500000000000000000000000000000000000000000000
## 90%
## 0.5626696232398592512069512849848251789808273315429687500000000000000000000000000000000000000000000
## 95%
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 98%
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 99%
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 100%
## 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
hist(Newdf.test$Predict.score)
plot(nn1.Attr)
Newdf.test$Class.test = ifelse(Newdf.test$Predict.score>0.2,1,0)
Comments: Assigning class 0 / 1 on the test data using the same probability threshold of 0.2.
pred.t <- ROCR::prediction(Newdf.test$Predict.score, as.numeric(ifelse(Newdf.test$read.HR_Attrition_Data.Attrition=="Yes",1,0)))
perf.t <- performance(pred.t, "tpr", "fpr")
KS.t <- max(attr(perf.t, 'y.values')[[1]]-attr(perf.t, 'x.values')[[1]])
auc.t <- performance(pred.t,"auc");
auc.t <- as.numeric(auc.t@y.values)
gini.t = ineq(Newdf.test$Predict.score, type="Gini")
with( Newdf.test, table(Newdf.test$read.HR_Attrition_Data.Attrition,
as.factor(Newdf.test$Class.test) ))
##
## 0 1
## No 644 99
## Yes 60 79
Comments: 644 + 79 = 723
Total = 882
Accuracy = 723 / 882 = 82.0%
Accuracy is 82.0% on the test data.
Accuracy was 90.4% on the train data and 82.0% on the test data, a difference of more than 5 percentage points.
Hence, it is an overfit model.
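Comments: The train/test gap can be checked in R, together with a simple baseline: the accuracy obtained by predicting "No" for every test row. This is a sketch using the counts printed above; the 5-point cut-off is the rule of thumb used in this report, and the names `acc.train`, `acc.test`, `gap` and `baseline.test` are introduced for illustration.

```r
# Accuracy on each sample, from the confusion matrix counts above
acc.train <- (1595 + 265) / 2058
acc.test  <- (644 + 79) / 882

# Gap between train and test accuracy, in proportion terms
gap <- acc.train - acc.test

# Baseline: share of "No" rows in the test data (always predicting "No")
baseline.test <- 743 / 882

round(c(train = acc.train, test = acc.test,
        gap = gap, baseline = baseline.test), 3)
```

On these counts the test accuracy is slightly below the all-"No" baseline, which is another sign that the model needs improvement before use.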
auc.t
## [1] 0.75
KS.t
## [1] 0.45
gini.t
## [1] 0.84
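Comments: AUC of 0.75 shows the model is still fairly good on the test data, but it is down from 0.87 on the train data, and KS falls from 0.72 to 0.45. This is consistent with the overfitting seen in the accuracy figures.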