CHAID, or Chi-square Automatic Interaction Detection, is a decision tree algorithm based on adjusted significance testing. Unlike CART, which uses impurity indices to find splits, CHAID uses the chi-square test of independence. CHAID works with nominal or ordinal data; continuous variables must first be binned into categories, typically with roughly equal numbers of observations in each.
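
As a rough illustration of that binning step, here is a minimal sketch (using a synthetic numeric vector, not part of the original script) of equal-frequency binning with quantile breaks:

# hedged sketch: bin a continuous variable into five categories with
# roughly equal observation counts, so it can be fed to CHAID
set.seed(1)
x = rnorm(100)                                      # synthetic continuous variable
breaks = quantile(x, probs = seq(0, 1, by = 0.2))   # quintile cut points
x_binned = cut(x, breaks = breaks, include.lowest = TRUE, ordered_result = TRUE)
table(x_binned)                                     # ~20 observations per bin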

Basic Settings and Data Import

Let’s begin by loading the required libraries and importing the data set we are going to use for this model.

####### Basic Settings and data import #######

#set working directory
setwd("C:/Users/awani/Desktop/50daysofAnalytics")
options(scipen = 999)

# load required libraries

if (!require("pacman")) install.packages("pacman")
pacman::p_load(partykit, caret, knitr, dplyr, rsample, kableExtra, ggplot2, tidyr, reshape2, purrr, e1071)
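
The CHAID package itself is not on CRAN, so pacman will not pull it in; if it is not already installed, it can usually be obtained from R-Forge (a hedged aside: the repository location may change over time).

# install CHAID from R-Forge if it is not already available (one-time step)
if (!require("CHAID")) install.packages("CHAID", repos = "http://R-Forge.R-project.org")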

library(CHAID)

#read data
data(attrition)

Understanding Data

Before we proceed any further, it is essential to understand the data well, get it into the correct format and, more importantly, check data correctness.

Data format correction

CHAID requires the data to be nominal or ordinal, i.e. categorical. In this data set we have a mixture of nominal (factor), ordinal (ordered factor) and continuous (numeric) variables. Let us first build the model using only the factors. In CHAID II we will convert the numeric variables to factors and measure the impact on model accuracy.

#data format correction
str(attrition)
## 'data.frame':    1470 obs. of  31 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human_Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : Ord.factor w/ 5 levels "Below_College"<..: 2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human_Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EnvironmentSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : Ord.factor w/ 4 levels "Low"<"Medium"<..: 3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare_Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : Ord.factor w/ 4 levels "Low"<"Medium"<..: 4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : Ord.factor w/ 4 levels "Low"<"Good"<"Excellent"<..: 3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: Ord.factor w/ 4 levels "Low"<"Medium"<..: 1 4 2 3 4 3 1 2 2 2 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : Ord.factor w/ 4 levels "Bad"<"Good"<"Better"<..: 1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...
data = select_if(attrition, is.factor)

Exploratory Data Analysis

The frequency distribution of the variable Attrition shows that employees who left (Yes) are under-represented in the data. We should eventually use an over- or under-sampling method to correct this imbalance. A frequency plot also gives us a quick view of the distributions of the independent variables.

#### Exploratory Data Analysis ###

#dependent variable
table(data$Attrition)
## 
##   No  Yes 
## 1233  237
# Independent Variables

# distribution plot
data %>%
  gather() %>%                             
  ggplot(aes(value)) +                    
  facet_wrap(~ key, scales = "free") +   
  geom_bar(fill = "blue") 
## Warning: attributes are not identical across measure variables;
## they will be dropped

Data Preparation

Let us now prepare our data sets for training and validation of the CHAID model. Since the data is imbalanced, with unequal representation of attrition and non-attrition cases, we should either over-sample or under-sample. Under-sampling, although it results in some loss of information, is the better choice here. However, we will skip it for now, use the somewhat biased sample for model training, and measure the impact of using balanced data in CHAID II.
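
For reference only, here is a minimal sketch of what under-sampling could look like with caret's downSample (not applied in this post):

# hedged sketch: balance the classes by down-sampling the majority class
balanced = caret::downSample(x = select(data, -Attrition),
                             y = data$Attrition, yname = "Attrition")
table(balanced$Attrition)   # both classes should end up with 237 rows each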

#training and Validation dataset
set.seed(123)
smp_size = floor(0.7 * nrow(data))
train_ind = sample(seq_len(nrow(data)), size = smp_size)

train = data[train_ind, ]
val = data[-train_ind, ]
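
As an aside, since rsample is already loaded, a stratified 70/30 split that preserves the Attrition class proportions in both partitions could look roughly like this (a sketch, not used for the results below):

# hedged alternative: stratified split on Attrition using rsample
split_obj = initial_split(data, prop = 0.7, strata = "Attrition")
train_strat = training(split_obj)
val_strat = testing(split_obj)
prop.table(table(train_strat$Attrition))   # class shares mirror the full data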

Model Training and Summary

Let us now train a CHAID model using the training data set we prepared. We will then test model accuracy by using the model to classify attrition in the validation data set.

#model training
chaid_model =  chaid(Attrition ~ ., data = train)

#summary
chaid_model
## 
## Model formula:
## Attrition ~ BusinessTravel + Department + Education + EducationField + 
##     EnvironmentSatisfaction + Gender + JobInvolvement + JobRole + 
##     JobSatisfaction + MaritalStatus + OverTime + PerformanceRating + 
##     RelationshipSatisfaction + WorkLifeBalance
## 
## Fitted party:
## [1] root
## |   [2] OverTime in No
## |   |   [3] JobSatisfaction in Low
## |   |   |   [4] MaritalStatus in Divorced, Married
## |   |   |   |   [5] Department in Human_Resources, Sales: No (n = 34, err = 23.5%)
## |   |   |   |   [6] Department in Research_Development: No (n = 64, err = 3.1%)
## |   |   |   [7] MaritalStatus in Single: No (n = 46, err = 32.6%)
## |   |   [8] JobSatisfaction in Medium, High, Very_High
## |   |   |   [9] JobInvolvement in Low: No (n = 33, err = 21.2%)
## |   |   |   [10] JobInvolvement in Medium, High, Very_High: No (n = 556, err = 7.0%)
## |   [11] OverTime in Yes
## |   |   [12] MaritalStatus in Divorced, Married
## |   |   |   [13] BusinessTravel in Non-Travel: No (n = 18, err = 0.0%)
## |   |   |   [14] BusinessTravel in Travel_Frequently: No (n = 35, err = 42.9%)
## |   |   |   [15] BusinessTravel in Travel_Rarely
## |   |   |   |   [16] JobSatisfaction in Low, Medium, High: No (n = 95, err = 26.3%)
## |   |   |   |   [17] JobSatisfaction in Very_High: No (n = 56, err = 8.9%)
## |   |   [18] MaritalStatus in Single
## |   |   |   [19] RelationshipSatisfaction in Low, Medium: Yes (n = 33, err = 33.3%)
## |   |   |   [20] RelationshipSatisfaction in High, Very_High
## |   |   |   |   [21] WorkLifeBalance in Bad, Good: Yes (n = 16, err = 31.2%)
## |   |   |   |   [22] WorkLifeBalance in Better, Best: No (n = 43, err = 25.6%)
## 
## Number of inner nodes:    10
## Number of terminal nodes: 12
#plot
plot(chaid_model, main="CHAID Tree with Attrition data")
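
How aggressively the tree grows can be adjusted through chaid_control(). A hedged sketch of tightening the split criteria is shown below; the parameter values are illustrative, not tuned:

# hedged sketch: stricter significance thresholds and larger minimum node sizes
ctrl = chaid_control(alpha2 = 0.01,   # significance level for merging categories
                     alpha4 = 0.01,   # significance level for splitting a node
                     minsplit = 50,   # minimum node size to attempt a split
                     minbucket = 20)  # minimum terminal node size
chaid_small = chaid(Attrition ~ ., data = train, control = ctrl)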

Prediction and Model Evaluation

The CHAID model seems to do a decent job of classification, with an overall accuracy of about 83%, a sensitivity of about 84% and a more modest specificity of about 58%.

#prediction
val$pred = predict(chaid_model, val[,-1])

#confusion Matrix
confusionMatrix(val$Attrition, factor(val$pred))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  356   8
##        Yes  66  11
##                                           
##                Accuracy : 0.8322          
##                  95% CI : (0.794, 0.8659) 
##     No Information Rate : 0.9569          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1719          
##  Mcnemar's Test P-Value : 0.00000000003446
##                                           
##             Sensitivity : 0.8436          
##             Specificity : 0.5789          
##          Pos Pred Value : 0.9780          
##          Neg Pred Value : 0.1429          
##              Prevalence : 0.9569          
##          Detection Rate : 0.8073          
##    Detection Prevalence : 0.8254          
##       Balanced Accuracy : 0.7113          
##                                           
##        'Positive' Class : No              
##
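
Note that caret has treated "No" as the positive class above, so the high sensitivity mostly reflects how well non-leavers are identified. To focus the metrics on the attrition class, and to follow caret's usual convention of passing the predictions as data and the truth as reference, one could instead call something like this (a sketch, not run here):

# hedged sketch: predictions first, truth as reference, "Yes" as the positive class
confusionMatrix(data = factor(val$pred, levels = levels(val$Attrition)),
                reference = val$Attrition,
                positive = "Yes")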