In CHAID I, we covered the basics of the model. Today, we will use the same model with two improvement techniques: oversampling to balance the data, and re-categorization of numeric variables into factors.

Basic Settings and Data Import

Let’s begin by loading the required libraries and importing the data set we are going to use for this model.

####### Basic Settings and data import #######

#set working directory
setwd("C:/Users/awani/Desktop/50daysofAnalytics")
options(scipen = 999)

# load required libraries

if (!require("pacman")) install.packages("pacman")
pacman::p_load(partykit, caret, knitr, dplyr, rsample, kableExtra, ggplot2, tidyr, reshape2, purrr, e1071, ROSE)
library(CHAID)

#read data (attrition ships with older rsample releases; newer releases keep it in the modeldata package)
data(attrition)

Understanding Data

Before we proceed any further, it is essential to understand the data well, get it into the correct format and, more importantly, check that it is correct.

Data format correction

CHAID requires all variables to be categorical, either nominal or ordinal. The data set contains a mixture of nominal (factor), ordinal (ordered factor) and continuous (numeric) variables. Hence, we will have to convert the non-factor variables to factors.

#data format correction
str(attrition)
## 'data.frame':    1470 obs. of  31 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human_Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : Ord.factor w/ 5 levels "Below_College"<..: 2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human_Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EnvironmentSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : Ord.factor w/ 4 levels "Low"<"Medium"<..: 3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare_Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : Ord.factor w/ 4 levels "Low"<"Medium"<..: 4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : Ord.factor w/ 4 levels "Low"<"Good"<"Excellent"<..: 3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: Ord.factor w/ 4 levels "Low"<"Medium"<..: 1 4 2 3 4 3 1 2 2 2 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : Ord.factor w/ 4 levels "Bad"<"Good"<"Better"<..: 1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...
#store positions of the nonfactor (numeric) variables
nonfactor = which(sapply(attrition, is.numeric))

#automatically categorize to factors using each variable's own quartiles
for (i in nonfactor)
{
  test = attrition[,i]
  q = quantile(test)   # q[2] = 25%, q[3] = 50%, q[4] = 75%
  # >= 75%: "Very High"; <= 25%: "Very Low"; between 25% and 50%: "Low"; otherwise "High"
  attrition[,i] = as.factor(ifelse(test >= q[4], "Very High",
                                   ifelse(test <= q[2], "Very Low",
                                          ifelse(test > q[2] & test < q[3], "Low", "High"))))
}
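To make the binning rule concrete, here is a minimal sketch in base R. The helper `quartile_bins` and both toy vectors are hypothetical and only mirror the nested `ifelse` logic of the loop above; they are not part of the original analysis.

```r
# Hypothetical helper that mirrors the quartile rule used in the loop above
quartile_bins <- function(v) {
  q <- quantile(v)  # q[2] = 25%, q[3] = 50%, q[4] = 75%
  unname(ifelse(v >= q[4], "Very High",
         ifelse(v <= q[2], "Very Low",
         ifelse(v > q[2] & v < q[3], "Low", "High"))))
}

# Evenly spread toy values: all four labels appear
spread <- 1:9
table(quartile_bins(spread))

# Toy values with many ties: the 50% and 75% quartiles coincide,
# so only "Very High" and "Very Low" survive
tied <- c(0, 0, 0, 0, 1, 1, 1, 2, 3)
table(quartile_bins(tied))
```

Heavily tied variables can therefore end up with fewer than four levels. This is probably why a variable such as StockOptionLevel shows only two categories in the fitted tree further down.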
Exploratory Data Analysis

The frequency distribution of the variable Attrition shows that employees who left (Attrition = Yes) are under-represented in the data. We should use either oversampling or undersampling to correct this imbalance. Frequency plots also give us a quick understanding of how the predictors are distributed.

#### Exploratory Data Analysis ###

#dependent variable
table(attrition$Attrition)
## 
##   No  Yes 
## 1233  237
# Independent Variables

attrition[,-nonfactor] %>%
  gather() %>%                             
  ggplot(aes(value)) +                    
  facet_wrap(~ key, scales = "free") +   
  geom_bar(fill = "blue") 
## Warning: attributes are not identical across measure variables;
## they will be dropped

#visualize converted variables
attrition[,nonfactor] %>%
  gather() %>%                             
  ggplot(aes(value)) +                    
  facet_wrap(~ key, scales = "free") +   
  geom_bar(fill = "blue")
## Warning: attributes are not identical across measure variables;
## they will be dropped

Data Preparation

Let us now prepare our data sets for training and validation of the CHAID model. Since the data is imbalanced, with unequal representation of attrition and non-attrition cases, we use oversampling to balance it.
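To put the imbalance in numbers, here is a minimal base R sketch. The counts are copied from the frequency table above; the variable names are only illustrative.

```r
# Class counts copied from table(attrition$Attrition)
counts <- c(No = 1233, Yes = 237)

# Share of each class in the data
shares <- round(prop.table(counts), 3)
shares

# Majority-to-minority ratio
ratio <- round(counts[["No"]] / counts[["Yes"]], 1)
ratio
```

Roughly 84% of the records are "No", so a model that always predicts "No" would already be right about 84% of the time on the raw data. This is why accuracy alone says little here, and why balancing is worth doing.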

#oversampling
data_corrected = ovun.sample(Attrition ~ ., data = attrition, method = "over", N = 1233*2)$data
table(data_corrected$Attrition)
## 
##   No  Yes 
## 1233 1233
#training and Validation dataset
set.seed(123)
smp_size = floor(0.7 * nrow(data_corrected))
train_ind = sample(seq_len(nrow(data_corrected)), size = smp_size)

train = data_corrected[train_ind, ]
val = data_corrected[-train_ind, ]
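
One caveat with the order used above: because oversampling duplicates "Yes" rows before the split, copies of the same employee can land in both train and val, which tends to make validation results look better than they would on unseen data. A minimal sketch of the alternative order, split first and then oversample the training part only, is shown below. It uses plain random oversampling in base R rather than ovun.sample, and the toy data frame and all variable names are hypothetical.

```r
set.seed(123)

# Hypothetical toy data standing in for `attrition`
toy <- data.frame(
  Attrition = factor(c(rep("No", 80), rep("Yes", 20)), levels = c("No", "Yes")),
  OverTime  = factor(sample(c("No", "Yes"), 100, replace = TRUE))
)

# 1) Split first, so validation rows are never duplicated into training
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.7 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_val   <- toy[-train_idx, ]

# 2) Randomly oversample the minority class in the training part only
counts   <- table(toy_train$Attrition)
minority <- names(counts)[which.min(counts)]
n_extra  <- max(counts) - min(counts)
min_rows <- which(toy_train$Attrition == minority)
extra    <- min_rows[sample.int(length(min_rows), n_extra, replace = TRUE)]
toy_train_bal <- rbind(toy_train, toy_train[extra, ])

table(toy_train_bal$Attrition)  # both classes now have the same count
table(toy_val$Attrition)        # validation keeps the original imbalance
```

The same idea should carry over to ovun.sample by passing the training subset as data and setting N to twice the majority count in that subset. Its seed argument can be used to make the resampling itself reproducible.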

Model Training and Summary

Let us now train CHAID using the training data set we prepared. We will then test model accuracy by using the model to classify attrition in the validation data set.

#model training
chaid_model =  chaid(Attrition ~ ., data = train)

#summary
chaid_model
## 
## Model formula:
## Attrition ~ Age + BusinessTravel + DailyRate + Department + DistanceFromHome + 
##     Education + EducationField + EnvironmentSatisfaction + Gender + 
##     HourlyRate + JobInvolvement + JobLevel + JobRole + JobSatisfaction + 
##     MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + 
##     OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + 
##     StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager
## 
## Fitted party:
## [1] root
## |   [2] OverTime in No
## |   |   [3] StockOptionLevel in Very High
## |   |   |   [4] JobRole in Healthcare_Representative, Human_Resources, Laboratory_Technician, Sales_Executive
## |   |   |   |   [5] JobLevel in High
## |   |   |   |   |   [6] BusinessTravel in Non-Travel, Travel_Rarely: No (n = 88, err = 4.5%)
## |   |   |   |   |   [7] BusinessTravel in Travel_Frequently
## |   |   |   |   |   |   [8] HourlyRate in High, Low, Very Low: No (n = 17, err = 0.0%)
## |   |   |   |   |   |   [9] HourlyRate in Very High: Yes (n = 10, err = 30.0%)
## |   |   |   |   [10] JobLevel in Very High, Very Low
## |   |   |   |   |   [11] RelationshipSatisfaction in Low
## |   |   |   |   |   |   [12] PercentSalaryHike in High, Very High, Very Low
## |   |   |   |   |   |   |   [13] Age in High, Low: No (n = 8, err = 0.0%)
## |   |   |   |   |   |   |   [14] Age in Very High, Very Low: Yes (n = 19, err = 36.8%)
## |   |   |   |   |   |   [15] PercentSalaryHike in Low: Yes (n = 15, err = 6.7%)
## |   |   |   |   |   [16] RelationshipSatisfaction in Medium: No (n = 25, err = 0.0%)
## |   |   |   |   |   [17] RelationshipSatisfaction in High, Very_High
## |   |   |   |   |   |   [18] PercentSalaryHike in High, Low, Very High
## |   |   |   |   |   |   |   [19] Age in High, Low: No (n = 27, err = 0.0%)
## |   |   |   |   |   |   |   [20] Age in Very High, Very Low
## |   |   |   |   |   |   |   |   [21] Education in Below_College, College: No (n = 13, err = 15.4%)
## |   |   |   |   |   |   |   |   [22] Education in Bachelor: Yes (n = 17, err = 29.4%)
## |   |   |   |   |   |   |   |   [23] Education in Master, Doctor: No (n = 8, err = 0.0%)
## |   |   |   |   |   |   [24] PercentSalaryHike in Very Low
## |   |   |   |   |   |   |   [25] JobSatisfaction in Low, Medium: No (n = 8, err = 0.0%)
## |   |   |   |   |   |   |   [26] JobSatisfaction in High, Very_High
## |   |   |   |   |   |   |   |   [27] EnvironmentSatisfaction in Low: Yes (n = 6, err = 16.7%)
## |   |   |   |   |   |   |   |   [28] EnvironmentSatisfaction in Medium: No (n = 4, err = 0.0%)
## |   |   |   |   |   |   |   |   [29] EnvironmentSatisfaction in High, Very_High: Yes (n = 20, err = 20.0%)
## |   |   |   [30] JobRole in Manager, Manufacturing_Director, Research_Director, Research_Scientist
## |   |   |   |   [31] EnvironmentSatisfaction in Low
## |   |   |   |   |   [32] PercentSalaryHike in High, Very Low
## |   |   |   |   |   |   [33] YearsAtCompany in High, Low: No (n = 10, err = 0.0%)
## |   |   |   |   |   |   [34] YearsAtCompany in Very High: No (n = 12, err = 41.7%)
## |   |   |   |   |   |   [35] YearsAtCompany in Very Low: Yes (n = 11, err = 9.1%)
## |   |   |   |   |   [36] PercentSalaryHike in Low, Very High: No (n = 20, err = 0.0%)
## |   |   |   |   [37] EnvironmentSatisfaction in Medium, High, Very_High
## |   |   |   |   |   [38] RelationshipSatisfaction in Low: No (n = 26, err = 11.5%)
## |   |   |   |   |   [39] RelationshipSatisfaction in Medium, High, Very_High: No (n = 121, err = 0.0%)
## |   |   |   [40] JobRole in Sales_Representative
## |   |   |   |   [41] WorkLifeBalance in Bad, Good: Yes (n = 6, err = 33.3%)
## |   |   |   |   [42] WorkLifeBalance in Better: No (n = 13, err = 7.7%)
## |   |   |   |   [43] WorkLifeBalance in Best: Yes (n = 17, err = 0.0%)
## |   |   [44] StockOptionLevel in Very Low
## |   |   |   [45] JobSatisfaction in Low
## |   |   |   |   [46] DistanceFromHome in High, Low, Very High
## |   |   |   |   |   [47] EducationField in Human_Resources, Life_Sciences, Marketing, Medical
## |   |   |   |   |   |   [48] NumCompaniesWorked in High, Very High
## |   |   |   |   |   |   |   [49] MonthlyRate in High, Low, Very High: Yes (n = 61, err = 1.6%)
## |   |   |   |   |   |   |   [50] MonthlyRate in Very Low: No (n = 4, err = 25.0%)
## |   |   |   |   |   |   [51] NumCompaniesWorked in Very Low
## |   |   |   |   |   |   |   [52] PercentSalaryHike in High, Low: Yes (n = 18, err = 22.2%)
## |   |   |   |   |   |   |   [53] PercentSalaryHike in Very High, Very Low: No (n = 12, err = 0.0%)
## |   |   |   |   |   [54] EducationField in Other: No (n = 5, err = 0.0%)
## |   |   |   |   |   [55] EducationField in Technical_Degree: Yes (n = 24, err = 0.0%)
## |   |   |   |   [56] DistanceFromHome in Very Low
## |   |   |   |   |   [57] MonthlyIncome in High, Low, Very High: No (n = 12, err = 0.0%)
## |   |   |   |   |   [58] MonthlyIncome in Very Low: Yes (n = 10, err = 40.0%)
## |   |   |   [59] JobSatisfaction in Medium, High
## |   |   |   |   [60] YearsInCurrentRole in High, Very High
## |   |   |   |   |   [61] BusinessTravel in Non-Travel, Travel_Rarely
## |   |   |   |   |   |   [62] Gender in Female
## |   |   |   |   |   |   |   [63] TrainingTimesLastYear in Very High: No (n = 15, err = 0.0%)
## |   |   |   |   |   |   |   [64] TrainingTimesLastYear in Very Low: No (n = 13, err = 46.2%)
## |   |   |   |   |   |   [65] Gender in Male: No (n = 39, err = 0.0%)
## |   |   |   |   |   [66] BusinessTravel in Travel_Frequently
## |   |   |   |   |   |   [67] HourlyRate in High, Low
## |   |   |   |   |   |   |   [68] RelationshipSatisfaction in Low, Medium, High: Yes (n = 20, err = 0.0%)
## |   |   |   |   |   |   |   [69] RelationshipSatisfaction in Very_High: No (n = 2, err = 0.0%)
## |   |   |   |   |   |   [70] HourlyRate in Very High, Very Low: No (n = 11, err = 18.2%)
## |   |   |   |   [71] YearsInCurrentRole in Very Low
## |   |   |   |   |   [72] JobRole in Healthcare_Representative, Human_Resources, Laboratory_Technician, Research_Scientist, Sales_Executive, Sales_Representative
## |   |   |   |   |   |   [73] PercentSalaryHike in High, Very Low
## |   |   |   |   |   |   |   [74] Age in High, Very High
## |   |   |   |   |   |   |   |   [75] MaritalStatus in Divorced, Married: Yes (n = 15, err = 26.7%)
## |   |   |   |   |   |   |   |   [76] MaritalStatus in Single: No (n = 13, err = 0.0%)
## |   |   |   |   |   |   |   [77] Age in Low, Very Low
## |   |   |   |   |   |   |   |   [78] EnvironmentSatisfaction in Low: Yes (n = 29, err = 0.0%)
## |   |   |   |   |   |   |   |   [79] EnvironmentSatisfaction in Medium, High, Very_High
## |   |   |   |   |   |   |   |   |   [80] BusinessTravel in Non-Travel, Travel_Frequently: Yes (n = 22, err = 0.0%)
## |   |   |   |   |   |   |   |   |   [81] BusinessTravel in Travel_Rarely
## |   |   |   |   |   |   |   |   |   |   [82] Education in Below_College, College: No (n = 6, err = 0.0%)
## |   |   |   |   |   |   |   |   |   |   [83] Education in Bachelor, Master, Doctor
## |   |   |   |   |   |   |   |   |   |   |   [84] MaritalStatus in Divorced, Married: No (n = 7, err = 42.9%)
## |   |   |   |   |   |   |   |   |   |   |   [85] MaritalStatus in Single: Yes (n = 21, err = 4.8%)
## |   |   |   |   |   |   [86] PercentSalaryHike in Low, Very High
## |   |   |   |   |   |   |   [87] HourlyRate in High, Very High: No (n = 12, err = 0.0%)
## |   |   |   |   |   |   |   [88] HourlyRate in Low, Very Low
## |   |   |   |   |   |   |   |   [89] YearsSinceLastPromotion in High, Very High: No (n = 16, err = 43.8%)
## |   |   |   |   |   |   |   |   [90] YearsSinceLastPromotion in Very Low: Yes (n = 9, err = 0.0%)
## |   |   |   |   |   [91] JobRole in Manager, Manufacturing_Director, Research_Director: No (n = 13, err = 0.0%)
## |   |   |   [92] JobSatisfaction in Very_High
## |   |   |   |   [93] YearsAtCompany in High: No (n = 28, err = 0.0%)
## |   |   |   |   [94] YearsAtCompany in Low, Very High, Very Low
## |   |   |   |   |   [95] MonthlyRate in High
## |   |   |   |   |   |   [96] WorkLifeBalance in Bad, Good: Yes (n = 5, err = 40.0%)
## |   |   |   |   |   |   [97] WorkLifeBalance in Better: No (n = 14, err = 0.0%)
## |   |   |   |   |   |   [98] WorkLifeBalance in Best: Yes (n = 5, err = 20.0%)
## |   |   |   |   |   [99] MonthlyRate in Low, Very High
## |   |   |   |   |   |   [100] MonthlyIncome in High, Very Low
## |   |   |   |   |   |   |   [101] DistanceFromHome in High, Low: No (n = 6, err = 0.0%)
## |   |   |   |   |   |   |   [102] DistanceFromHome in Very High, Very Low: Yes (n = 19, err = 5.3%)
## |   |   |   |   |   |   [103] MonthlyIncome in Low: No (n = 10, err = 0.0%)
## |   |   |   |   |   |   [104] MonthlyIncome in Very High: Yes (n = 11, err = 9.1%)
## |   |   |   |   |   [105] MonthlyRate in Very Low: No (n = 18, err = 0.0%)
## |   [106] OverTime in Yes
## |   |   [107] JobLevel in High, Very High
## |   |   |   [108] MaritalStatus in Divorced
## |   |   |   |   [109] JobSatisfaction in Low: Yes (n = 16, err = 37.5%)
## |   |   |   |   [110] JobSatisfaction in Medium, High, Very_High: No (n = 40, err = 2.5%)
## |   |   |   [111] MaritalStatus in Married
## |   |   |   |   [112] JobSatisfaction in Low, Medium
## |   |   |   |   |   [113] NumCompaniesWorked in High, Very Low: No (n = 19, err = 5.3%)
## |   |   |   |   |   [114] NumCompaniesWorked in Very High: No (n = 17, err = 47.1%)
## |   |   |   |   [115] JobSatisfaction in High
## |   |   |   |   |   [116] RelationshipSatisfaction in Low
## |   |   |   |   |   |   [117] YearsWithCurrManager in High, Very Low: Yes (n = 21, err = 0.0%)
## |   |   |   |   |   |   [118] YearsWithCurrManager in Very High: No (n = 2, err = 0.0%)
## |   |   |   |   |   [119] RelationshipSatisfaction in Medium, High
## |   |   |   |   |   |   [120] JobInvolvement in Low: Yes (n = 3, err = 0.0%)
## |   |   |   |   |   |   [121] JobInvolvement in Medium, High, Very_High: No (n = 17, err = 11.8%)
## |   |   |   |   |   [122] RelationshipSatisfaction in Very_High: Yes (n = 17, err = 35.3%)
## |   |   |   |   [123] JobSatisfaction in Very_High: No (n = 36, err = 8.3%)
## |   |   |   [124] MaritalStatus in Single
## |   |   |   |   [125] JobRole in Healthcare_Representative, Human_Resources, Laboratory_Technician, Research_Director, Research_Scientist, Sales_Representative
## |   |   |   |   |   [126] TotalWorkingYears in High, Very Low: No (n = 5, err = 0.0%)
## |   |   |   |   |   [127] TotalWorkingYears in Low, Very High: Yes (n = 19, err = 21.1%)
## |   |   |   |   [128] JobRole in Manager, Manufacturing_Director: No (n = 15, err = 6.7%)
## |   |   |   |   [129] JobRole in Sales_Executive: Yes (n = 88, err = 10.2%)
## |   |   [130] JobLevel in Very Low
## |   |   |   [131] MonthlyIncome in High, Low, Very High
## |   |   |   |   [132] JobInvolvement in Low, Medium
## |   |   |   |   |   [133] Education in Below_College: Yes (n = 8, err = 0.0%)
## |   |   |   |   |   [134] Education in College: No (n = 2, err = 0.0%)
## |   |   |   |   |   [135] Education in Bachelor, Master, Doctor: Yes (n = 34, err = 8.8%)
## |   |   |   |   [136] JobInvolvement in High
## |   |   |   |   |   [137] EducationField in Human_Resources, Life_Sciences, Marketing, Technical_Degree
## |   |   |   |   |   |   [138] JobRole in Healthcare_Representative, Human_Resources, Laboratory_Technician, Manager, Manufacturing_Director, Research_Director, Sales_Executive, Sales_Representative
## |   |   |   |   |   |   |   [139] RelationshipSatisfaction in Low: No (n = 4, err = 50.0%)
## |   |   |   |   |   |   |   [140] RelationshipSatisfaction in Medium, High, Very_High: Yes (n = 27, err = 0.0%)
## |   |   |   |   |   |   [141] JobRole in Research_Scientist: No (n = 9, err = 44.4%)
## |   |   |   |   |   [142] EducationField in Medical, Other: No (n = 12, err = 25.0%)
## |   |   |   |   [143] JobInvolvement in Very_High: No (n = 5, err = 0.0%)
## |   |   |   [144] MonthlyIncome in Very Low
## |   |   |   |   [145] YearsInCurrentRole in High
## |   |   |   |   |   [146] EnvironmentSatisfaction in Low: Yes (n = 4, err = 0.0%)
## |   |   |   |   |   [147] EnvironmentSatisfaction in Medium: No (n = 3, err = 0.0%)
## |   |   |   |   |   [148] EnvironmentSatisfaction in High: Yes (n = 17, err = 11.8%)
## |   |   |   |   |   [149] EnvironmentSatisfaction in Very_High: No (n = 5, err = 0.0%)
## |   |   |   |   [150] YearsInCurrentRole in Very High, Very Low
## |   |   |   |   |   [151] RelationshipSatisfaction in Low, Medium: Yes (n = 95, err = 1.1%)
## |   |   |   |   |   [152] RelationshipSatisfaction in High, Very_High
## |   |   |   |   |   |   [153] YearsAtCompany in High, Very High: Yes (n = 26, err = 11.5%)
## |   |   |   |   |   |   [154] YearsAtCompany in Low: No (n = 2, err = 0.0%)
## |   |   |   |   |   |   [155] YearsAtCompany in Very Low: Yes (n = 112, err = 8.0%)
## 
## Number of inner nodes:    67
## Number of terminal nodes: 88

Prediction and Model Evaluation

The CHAID model with re-categorized numeric variables appears to clearly outperform the one without, reaching about 81% accuracy on the validation set. Also, since the data was balanced, sensitivity and specificity are close to each other.

#prediction
val$pred = predict(chaid_model, val[,-2])

#confusion Matrix (caret expects the predictions first, then the reference labels)
confusionMatrix(data = factor(val$pred), reference = val$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  278  60
##        Yes  79 323
##                                              
##                Accuracy : 0.8122             
##                  95% CI : (0.7821, 0.8397)   
##     No Information Rate : 0.5176             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.6232             
##  Mcnemar's Test P-Value : 0.1268             
##                                              
##             Sensitivity : 0.7787             
##             Specificity : 0.8433             
##          Pos Pred Value : 0.8225             
##          Neg Pred Value : 0.8035             
##              Prevalence : 0.4824             
##          Detection Rate : 0.3757             
##    Detection Prevalence : 0.4568             
##       Balanced Accuracy : 0.8110             
##                                              
##        'Positive' Class : No                 
##
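
As a sanity check, the accuracy and kappa values reported above can be recomputed by hand from the four cell counts. The sketch below is base R only and its variable names are illustrative. Both metrics are unchanged if the table is transposed, so they do not depend on which axis holds the predictions.

```r
# Cell counts copied from the confusion matrix above
# (the order of the two off-diagonal cells does not matter for these metrics)
m <- matrix(c(278, 60, 79, 323), nrow = 2,
            dimnames = list(c("No", "Yes"), c("No", "Yes")))

n <- sum(m)

# Accuracy: share of cases on the diagonal
accuracy <- sum(diag(m)) / n

# Cohen's kappa: agreement beyond what the margins alone would produce
expected  <- sum(rowSums(m) * colSums(m)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa_hat), 4)
# accuracy 0.8122, kappa 0.6232
```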