Purpose of this Analysis

The analysis is based on a dataset which contains data regarding jobs offered to candidates & their having accepted it and the current state of their joining in terms of whether the candidates who joined the job or not.

The analysis will first focus on establishing what are the characteristics of the Candidates who Accepted the Job Offer and Joined the Job. The analysis will also focus on establishing the reverse case, i.e. what are the characteristics of the Candidates who Accepted the Job Offer and Did Not Join.

Based on the data, we will create a model. Using the model, it should be possible to predict whether a Candidate, who Accepts the Job Offer, will actually Join or Not.

Analysis of the available Data

We begin our analysis by loading the data. This data is available in a CSV file named EmployeeJoining.csv. This data is loaded to a Data Frame.

Initially, the entire raw data will be loaded from the file. Later, subsets of this raw data will be created so that we have the right dataset for our analysis.

The dataset contains 12,333 number of data points in 17 attributes.

The Structure of the Data is provided below.

## 'data.frame':    12333 obs. of  17 variables:
##  $ Candidate_Ref           : int  2110407 2112635 2112838 2115021 2115125 2117167 2119124 2121918 2127572 2137866 ...
##  $ DOJ_Extended            : int  1 0 0 0 1 1 1 0 1 0 ...
##  $ Duration_to_Accept_Offer: int  14 18 3 26 1 17 37 NA 16 NA ...
##  $ Notice_Period           : int  30 30 45 30 120 30 30 45 0 30 ...
##  $ Offered_Band            : chr  "E2" "E2" "E2" "E2" ...
##  $ Percent_Hike_Expected   : num  -20.8 50 42.8 42.8 42.6 ...
##  $ Percent_Hike_Offered    : num  13.2 320 42.8 42.8 42.6 ...
##  $ Percent_Difference_CTC  : num  42.9 180 0 0 0 ...
##  $ Joining_Bonus           : chr  "No" "No" "No" "No" ...
##  $ Relocation_Status       : chr  "No" "No" "No" "No" ...
##  $ Gender                  : chr  "Female" "Male" "Male" "Male" ...
##  $ Source                  : chr  "Agency" "Employee Referral" "Agency" "Employee Referral" ...
##  $ Relevant_Experience     : int  7 8 4 4 6 2 7 4 8 4 ...
##  $ LOB                     : chr  "ERS" "INFRA" "INFRA" "INFRA" ...
##  $ Location                : chr  "Noida" "Chennai" "Noida" "Noida" ...
##  $ Age                     : int  34 34 27 34 34 34 32 34 34 34 ...
##  $ Status                  : chr  "Joined" "Joined" "Joined" "Joined" ...

Description of the Data

Provided below a brief description of all the attributes in the uploaded data. The data in the raw form is described. The desirable form as required for analysis is stated. Based on these 2 perspectives, necessary data transformations is conducted.

  1. Candidate_Ref: This is a unique Candidate Reference Number. Candidate Reference Number is composed of digits only. However, this is a non-numeric data and cannot be used in any calculations. So, we will convert this data to Character type of data.

  2. DOJ_Extended: This attribute indicates whether the Date of Joining was extended for the Candidate or not. This field can take 2 values; either 1 or 2. “2” indicates that the Date of Joining was not extended for the Candidate. “1” indicates that the Date of Joining was extended for the Candidate. So, we will convert this data to a Factor.

  3. Duration_to_Accept_Offer: This attribute contains the Number of Days the Candidates took to accept the Job Offer. This attribute contains Numeric Data.

  4. Notice_Period: This attribute contains the Number of Days the Candidates needs as Notice Period towards his/her Current Employer. This attribute contains Numeric Data.

  5. Offered_Band: This attribute contains the Band offered to the Candidates. Every Organisation assigns a Band to each of the Employees. Bands indicate the level of responsibility entrusted to the Employee. Band may be or may not be assigned to the Employees based on Seniority. This field contains Categorical Data. This data will never be used in any calculations. However, this data is used for classification of Employees. Presently, this data is a Factor. We will convert the String Codes to Numeric Codes.

## [1] "E2" "E1" "E3" "E4" "E0" "E6" "E5"
  1. Percent_Hike_Expected: This attribute contains the Percentage hike (increase over Current CTC [Cost to Company]) expected by the Candidates. This attribute contains Numeric Data.

  2. Percent_Hike_Offered: This attribute contains the Percentage hike (increase over Current CTC [Cost to Company]) offered to the Candidates. This attribute contains Numeric Data.

  3. Percent_Difference_CTC: This attribute contains the Percentage Difference between the CTC [Cost to Company] offered to the Candidates and the CTC of the Candidate in his/her Current Company. This attribute contains Numeric Data.

  4. Joining_Bonus: This attribute indicates whether Joining Bonus was given to the Candidate or not. This field can take 2 values; either “Yes” or “No”. “Yes” indicates that Joining Bonus was given to the Candidate. “No” indicates that Joining Bonus was not given to the Candidate. We will replace “Yes” with “1” and “No” with “2”.

  5. Relocation_Status: This attribute indicates whether the Candidate needs to Relocate while Joining the Company or not. This field can take 2 values; either “Yes” or “No”. “Yes” indicates that the Candidate needs to Relocate. “No” indicates that the Candidate does not needs to Relocate. We will replace “Yes” with “1” and “No” with “2”.

  6. Gender: This attribute indicates the Gender of the Candidate. This field can take 2 values; either “Male” or “Female”. We will replace “Male” with “1” and “Female” with “2”.

  7. Source: This attribute indicates the Source through which the Candidate was found. This field presently takes 3 values; either “Direct” or “Agency” or “Employee Referral”. Presently, this data is a Factor. We will convert the String Codes to Numeric Codes.

## [1] "Agency"            "Employee Referral" "Direct"
  1. Relevant_Experience: This attribute contains the Years of Relevant Experience of the Candidates for the Position Considered. This attribute contains Numeric Data.

  2. LOB: This attribute indicates the LOB (Line of Business) for which the Candidate is being hired. This field presently takes 12 values. Presently, this data is a Factor. We will convert the String Codes to Numeric Codes.

##  [1] "ERS"        "INFRA"      "Healthcare" "BFSI"       "CSMP"      
##  [6] "ETS"        "AXON"       "EAS"        "SALES"      "CORP"      
## [11] "BSERV"      "MMS"
  1. Location: This attribute indicates the Location for which the Candidate is being hired. This field presently takes 12 values. Presently, this data is a Factor. We will convert the String Codes to Numeric Codes.
##  [1] "Noida"      "Chennai"    "Gurgaon"    "Bangalore"  "Kolkata"   
##  [6] "Mumbai"     "Hyderabad"  "Cochin"     "Pune"       "Others"    
## [11] "Coimbatore" "Ahmedabad"
  1. Age: This attribute contains the Age in Years of the Candidates. This attribute contains Numeric Data.

  2. Status: This attribute indicates whether the Candidate Joined the Company or not. This field can take 2 values; either “Joined” or “Not Joined”. We will replace “Joined” with “1” and “Not Joined” with “0”. We have the actual data regarding the Candidates. We will use this data to train our model so that it is able to predict whether a Candidate will Join or Not based on the other attributes that we will provide as input.

The modified structure of the data is as follows.

## 'data.frame':    12333 obs. of  17 variables:
##  $ Candidate_Ref           : chr  "2110407" "2112635" "2112838" "2115021" ...
##  $ DOJ_Extended            : Factor w/ 2 levels "1","2": 1 2 2 2 1 1 1 2 1 2 ...
##  $ Duration_to_Accept_Offer: int  14 18 3 26 1 17 37 NA 16 NA ...
##  $ Notice_Period           : int  30 30 45 30 120 30 30 45 0 30 ...
##  $ Offered_Band            : Factor w/ 7 levels "1","2","3","4",..: 3 3 3 3 3 2 3 3 2 2 ...
##  $ Percent_Hike_Expected   : num  -20.8 50 42.8 42.8 42.6 ...
##  $ Percent_Hike_Offered    : num  13.2 320 42.8 42.8 42.6 ...
##  $ Percent_Difference_CTC  : num  42.9 180 0 0 0 ...
##  $ Joining_Bonus           : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Relocation_Status       : Factor w/ 2 levels "1","2": 2 2 2 2 1 2 2 2 2 2 ...
##  $ Gender                  : Factor w/ 2 levels "1","2": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Source                  : Factor w/ 3 levels "1","2","3": 1 3 1 3 3 3 3 3 2 2 ...
##  $ Relevant_Experience     : int  7 8 4 4 6 2 7 4 8 4 ...
##  $ LOB                     : Factor w/ 12 levels "1","2","3","4",..: 12 1 1 1 1 1 1 1 2 12 ...
##  $ Location                : Factor w/ 12 levels "1","2","3","4",..: 11 1 11 11 11 11 11 11 11 11 ...
##  $ Age                     : int  34 34 27 34 34 34 32 34 34 34 ...
##  $ Status                  : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 2 1 2 ...

Removing Missing Information

We will check if there is any missing information in the dataset. There are many techniques for dealing with the missing data. We will try the approach where we will remove the records which have any missing data. After doing so, if we have substantial amount of data left, we will proceed on this strategy.

We will check if the data is absent and/or data is NULL.

Total number of Records = 12,333

Missing Values per Attribute

Candidate_Ref = 0 DOJ_Extended = 0 Duration_to_Accept_Offer = 2,719 Notice_Period = 0 Offered_Band = 0 Percent_Hike_Expected = 747 Percent_Hike_Offered = 596 Percent_Difference_CTC = 851 Joining_Bonus = 0 Relocation_Status = 0 Gender = 0 Source = 0 Relevant_Experience = 0 LOB = 0 Location = 0 Age = 0 Status = 0

We find that maximum number of missing values are for the attribute Duration_to_Accept_Offer. Intuitively, it seems like this attribute contributes significantly towards the decision of joining by the candidate.

We will create 2 datasets with no missing values.

  1. In the first dataset, we will retain all the attributes and only include the records which have no missing values. We will call this DataSet1. So, DataSet1 has 9,011 records.

  2. In the second dataset, we will omit the attribute Duration_to_Accept_Offer and only include the records which have no missing values. We will call this DataSet2. So, DataSet2 has 11,430 records.

We notice that in both the datasets, we have enough data. So, we will stick to this strategy of data cleansing.

Now, let us find the correlation between the attribute Status and the attribute Duration_to_Accept_Offer. This should give an indication whether the attribute Duration_to_Accept_Offer has influence on the decision of the Candidates in Joining after accepting the offer. Though this is not a comprehensive test, this should provide a decent first-hand idea.

Our NULL Hypothesis is that there is NO CORRELATION between the attribute Status and the attribute Duration_to_Accept_Offer.

Correlation Coefficient between Status and Duration_to_Accept_Offer = 0.0626342

## 
##  Pearson's product-moment correlation
## 
## data:  v_temp1$IntStatus and v_temp1$Duration_to_Accept_Offer
## t = 6.1528, df = 9612, p-value = 7.917e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04269795 0.08252054
## sample estimates:
##        cor 
## 0.06263418

The P-value of 7.917088810^{-10} is below the alpha of 0.05 and thus we cannot reject the NULL Hypothesis.

The correlation between Status and Duration_to_Accept_Offer is very weak. So, it indicates that the Duration to Accept Offer does not impact the Candidates decision to Join after accepting the Offer. Also, we can conclude that if we omit the attribute Duration_to_Accept_Offer, our model for predicting Status will not be impacted.

So, from this point, we will use DataSet2 for analysis. DataSet2 has 11,430 records.

Removing Outliers

We will consider the Data Points which are outside the Tukey’s Fences as Outliers.

Outliers can be present in the data broadly because of two reasons. They are as follows.

  1. There are exceptional cases in every event that is observed. In this case, the Outliers are genuine data which have happened in exceptional circumstances and/or genuinely as an exception.

  2. Outliers in data can be present due to error in data capture. In this case, the Outliers are erroneous data.

In either case, Outliers can distort the result of our analysis and have significant negative impact on the accuracy of the Prediction Model. So, we will eliminate all Outliers from the Data.

We will find the Outliers in the Numeric Data in the Data Set. To detect whether Outliers are present or not, we will create the Box and Whisker Plots for the Numeric Data.

Finding Outliers

We compute Tukey’s Fences as follows.

Lower Limit (L) = Q1 - 1.5 * (Q3 - Q1)

Upper Limit (U) = Q3 + 1.5 * (Q3 - Q1)

Here, Q1 is the 1st Quartile and Q3 is the 3rd Quartile.

Any value less than L and greater than U is an Outlier

For Notice Period

Quartile 1 (Q1) = 30

Quartile 3 (Q3) = 60

Lower Limit (L) = -15

Upper Limit (U) = 105

From the above figures, we notice that there are Outliers above the Upper Boundary. Let us examine these records to determine whether it needs eliminating.

The list of top 10 records where Notice Period is above the Upper Limit is as follows.

##    Candidate_Ref Notice_Period Status
## 1        2115125           120      1
## 2        2205496           120      1
## 3        2286200           120      2
## 4        2286297           120      1
## 5        2120995           120      1
## 6        2128469           120      1
## 7        2156459           120      2
## 8        2157132           120      2
## 9        2163132           120      1
## 10       2172869           120      2

We notice that the Maximum Outlier Value is 120. The data is from India. In Indian context, there are many Companies where the Notice Period is 120 days. So, we will not remove these records.

For Relevant Experience

Quartile 1 (Q1) = 3

Quartile 3 (Q3) = 6

Lower Limit (L) = -1.5

Upper Limit (U) = 10.5

From the above figures, we notice that there are Outliers above the Upper Boundary. Let us examine these records to determine whether it needs eliminating.

The list of top 10 records where Relevant Experience is above the Upper Limit is as follows.

##    Candidate_Ref Relevant_Experience Status
## 1        3304208                  26      1
## 2        2469500                  24      1
## 3        2394768                  23      1
## 4        2850618                  23      1
## 5        2368801                  22      1
## 6        2824825                  22      1
## 7        3379637                  22      1
## 8        3607236                  22      1
## 9        2344905                  21      1
## 10       2224425                  20      1

We notice that the Maximum Outlier Value is 26. The data is from India. In Indian context, there are many Candidates with this sort of experience who apply for Jobs. So, we will not remove these records.

For Age

Quartile 1 (Q1) = 27

Quartile 3 (Q3) = 34

Lower Limit (L) = 16.5

Upper Limit (U) = 44.5

From the above figures, we notice that there are Outliers above the Upper Boundary. Let us examine these records to determine whether it needs eliminating.

The list of top 10 records where Age is above the Upper Limit is as follows.

##    Candidate_Ref Age Status
## 1        3225493  62      2
## 2        2635174  60      2
## 3        2394647  51      1
## 4        3211288  51      2
## 5        2464782  50      1
## 6        2469131  50      1
## 7        2604373  50      1
## 8        2618623  50      1
## 9        3304208  50      1
## 10       2322707  49      1

We notice that the Maximum Outlier Value is 62 and there are 2 records where the Candidate’s age is 62 and 60. The data is from India. In Indian context, there can be Candidates in this age bracket who can apply for Jobs. So, we will not remove these records.

For Percent Hike Expected

Quartile 1 (Q1) = 25

Quartile 3 (Q3) = 52.17

Lower Limit (L) = -15.755

Upper Limit (U) = 92.925

From the above figures, we notice that there are Outliers both above the Upper Boundary and below the Lower Boundary. Let us examine these records to determine whether it needs eliminating.

The list of top 10 records where Percent Hike Expected is above the Upper Limit is as follows.

##    Candidate_Ref Percent_Hike_Expected Status
## 1        3258125                359.77      1
## 2        2575776                327.35      1
## 3        2478398                323.08      1
## 4        3434802                314.68      1
## 5        3258078                306.25      1
## 6        3664507                300.00      1
## 7        3311520                294.74      2
## 8        3267878                291.30      2
## 9        3040819                280.95      1
## 10       3379707                266.67      2

We notice that the Maximum Outlier Value is 359.77%. In Indian context, there can be Candidates with this kind of Hike Expectations who apply for Jobs. So, we will not remove these records.

The list of top 10 records where Percent Hike Expected is below the Lower Limit is as follows.

##    Candidate_Ref Percent_Hike_Expected Status
## 1        3258076                -68.83      2
## 2        3304533                -62.50      1
## 3        3685332                -57.43      2
## 4        2850795                -54.04      1
## 5        2850654                -53.50      2
## 6        2920298                -50.00      2
## 7        3807331                -49.58      2
## 8        2569520                -48.25      1
## 9        2483538                -48.05      1
## 10       2403871                -45.81      1

We notice that the Minimum Outlier Value is -68.83%. It is hard to expect candidates to expect less than half their salaries. However, due to current COVID-19 pandemic, this also can be expected. So, we will not remove these records.

For Percent Hike Offered

Quartile 1 (Q1) = 20

Quartile 3 (Q3) = 47.2

Lower Limit (L) = -20.8

Upper Limit (U) = 88

From the above figures, we notice that there are Outliers both above the Upper Boundary and below the Lower Boundary. Let us examine these records to determine whether it needs eliminating.

The list of top 10 records where Percent Hike Offered is above the Upper Limit is as follows.

##    Candidate_Ref Percent_Hike_Offered Status
## 1        3435024               471.43      1
## 2        2535545               414.29      1
## 3        2534384               400.00      1
## 4        2221149               393.33      1
## 5        3250085               375.00      1
## 6        2483174               350.00      1
## 7        2452368               333.33      1
## 8        2112635               320.00      1
## 9        2550678               316.67      1
## 10       2478398               314.53      1

We notice that the Maximum Outlier Value is 471.43%. In Indian context, there can be Candidates with this kind of Hike Offerings. So, we will not remove these records.

The list of top 10 records where Percent Hike Offered is below the Lower Limit is as follows.

##    Candidate_Ref Percent_Hike_Offered Status
## 1        2151180               -60.53      2
## 2        2458879               -60.00      1
## 3        2237763               -55.75      1
## 4        3488402               -54.97      2
## 5        3462905               -54.55      1
## 6        2407721               -53.75      1
## 7        2920298               -52.78      2
## 8        2137866               -51.37      2
## 9        2347819               -50.82      1
## 10       3617498               -50.13      2

We notice that the Minimum Outlier Value is -60.53%. It is hard to expect candidates are offered less than half their salaries. However, due to current COVID-19 pandemic, this also can be expected. So, we will not remove these records.

For Percent Difference CTC

Quartile 1 (Q1) = -9.09

Quartile 3 (Q3) = 0

Lower Limit (L) = -22.725

Upper Limit (U) = 13.635

From the above figures, we notice that there are Outliers both above the Upper Boundary and below the Lower Boundary. Let us examine these records to determine whether it needs eliminating.

The list of top 10 records where Percent Difference CTC is above the Upper Limit is as follows.

##    Candidate_Ref Percent_Difference_CTC Status
## 1        3258076                 300.00      2
## 2        2535545                 300.00      1
## 3        2550678                 272.34      1
## 4        3289177                 240.91      1
## 5        2534384                 233.33      1
## 6        2282545                 227.27      2
## 7        2452368                 225.00      1
## 8        3429009                 218.37      1
## 9        2221149                 196.00      1
## 10       3411311                 195.00      2

We notice that the Maximum Outlier Value is 300%. In Indian context, there can be Candidates with this kind of Hike Offerings. So, we will not remove these records.

The list of top 10 records where Percent Difference CTC is below the Lower Limit is as follows.

##    Candidate_Ref Percent_Difference_CTC Status
## 1        3784642                 -68.95      2
## 2        2458879                 -67.27      1
## 3        3462905                 -66.67      1
## 4        3463034                 -66.00      1
## 5        3434802                 -65.55      1
## 6        2780729                 -64.29      2
## 7        2802356                 -63.64      1
## 8        2151180                 -62.50      2
## 9        3290644                 -62.50      2
## 10       3274543                 -62.50      2

We notice that the Minimum Outlier Value is -68.95%. It is hard to expect candidates are offered less than half their salaries. However, due to current COVID-19 pandemic, this also can be expected. So, we will not remove these records.

Final Data To Analyse and To Use for Creating Model

The summary of the dataset is provided below.

##  Candidate_Ref      DOJ_Extended Notice_Period   Offered_Band
##  Length:11430       1:4868       Min.   :  0.0   1: 911      
##  Class :character   2:6562       1st Qu.: 30.0   2:6525      
##  Mode  :character                Median : 30.0   3:3241      
##                                  Mean   : 37.7   4: 635      
##                                  3rd Qu.: 60.0   5:  94      
##                                  Max.   :120.0   6:  22      
##                                                  7:   2      
##  Percent_Hike_Expected Percent_Hike_Offered Percent_Difference_CTC
##  Min.   :-68.83        Min.   :-60.53       Min.   :-68.950       
##  1st Qu.: 25.00        1st Qu.: 20.00       1st Qu.: -9.090       
##  Median : 38.89        Median : 34.26       Median :  0.000       
##  Mean   : 41.88        Mean   : 37.87       Mean   : -2.088       
##  3rd Qu.: 52.17        3rd Qu.: 47.20       3rd Qu.:  0.000       
##  Max.   :359.77        Max.   :471.43       Max.   :300.000       
##                                                                   
##  Joining_Bonus Relocation_Status Gender   Source   Relevant_Experience
##  1:  511       1: 1389           1:9368   1:3081   Min.   : 0.000     
##  2:10919       2:10041           2:2062   2:6376   1st Qu.: 3.000     
##                                           3:1973   Median : 4.000     
##                                                    Mean   : 4.176     
##                                                    3rd Qu.: 6.000     
##                                                    Max.   :26.000     
##                                                                       
##       LOB          Location         Age        Status  
##  1      :3170   1      :3955   Min.   :20.00   1:8388  
##  12     :2906   11     :3600   1st Qu.:27.00   2:3042  
##  3      :1597   3      :2689   Median :30.00           
##  5      : 883   6      : 454   Mean   :30.31           
##  10     : 753   5      : 235   3rd Qu.:34.00           
##  4      : 651   2      : 174   Max.   :62.00           
##  (Other):1470   (Other): 323

Should we analyse this data?

Now that we have finalised the dataset for analysis, we need answering whether it is really required to analyse this data or not.

We notice that, in this dataset, the number of Candidates who did not join the Company after accepting the Job Offer is 0. This is 0% of the total number of Candidates in the dataset.

This percentage is rather high. So, it makes perfect sense to try to figure out what are the reasons for the Candidates not joining after accepting the Offer.

This scenario leads to considerable amount of wasted expenses for the organisation. So, it makes perfect sense to device a mechanism to reduce this cost. One mechanism to control this cost could be if we had a model which could predict whether a Candidate, who accepts an Offer, will actually Join the Company. Using this model, the Company could decide for which Candidates it should invest it and for which Candidates it should not invest in.

Where should we focus our analysis?

We have established that it is worth analysing this data. Next,we need to figure out what needs analysing. Further, we need to prioritise among the items that needs analysis. We have to focus on the reasons what makes Candidates Join the Company and also what drives Candidates not to Join the Company after accepting the Offer.

We start by trying to figure out which attributes contribute significantly to the Candidate’s decision to join or not to join. In technical terms, this implies that we need to figure out the attributes (or combination of attributes) which have strong correlation - positive or negative - to the attribute Status.

Zero-Variance Predictors

Predictors with a single unique value (also known as “Zero-Variance Predictors”) will cause the model to fail. Since we will be tuning models using resampling methods, a random sample of the training set may result in some predictors with more than one unique value to become a zero-variance predictor. These so-called “near zero-variance predictors” can cause numerical problems during resampling for some models, such as linear regression.

To identify this kind of predictors, two properties need to be examined.

  1. The percentage of unique values in the training set can be calculated for each predictor. Variables with low percentages have a higher probability of becoming a zero-variance predictor during resampling. However, this in itself is not a problem. Binary Predictors are likely to have low percentages and should not be discarded for this simple reason.

  2. The other important criterion to examine is the skewness of the frequency distribution of the variable. If the ratio of most frequent value of a predictor to the second most frequent value is large, the distribution of the predictor may be highly skewed.

If both these criteria are flagged, the predictor may be a near zero-variance predictor. We will use the benchmark that if:

  1. the percentage of unique values is less than 20% and

  2. the ratio of the most frequent to the second most frequent value is greater than 20,

the predictor may cause problem for our model and thus should be removed.

The result of Near Zero-Variance Predictor Analysis is as follows.

## [1] "Joining_Bonus"

As Joining_Bonus is a binary variable, we will not discard it from our model.

Multicollinearity

Models are susceptible to multicollinearity (i.e., high correlations between predictors). Linear models, neural networks and other models can have poor performance in these situations or may generate unstable solutions. Other models, such as classication or regression trees, might be resistant to highly correlated predictors, but multicollinearity may negatively effect interpretability of the model.

To minimize the effect of multicollinearity, we will reduce the number of dimensions. To do this, we compute the correlation matrix of the predictors and use an algorithm to remove the a subset of the problematic predictors such that all of the pairwise correlations are below a threshold. This algorithm finds the minimal set of predictors that can be removed so that the pairwise correlations are below a specific threshold. Note that, if two variables have a high correlation, the algorithm determines which one is involved with the most pairwise correlations and is removed.

The result of our analysis is as follows.

## [1] "No multicollenearity found"

The Correlation Coefficients between Status and the other attributes is as follows.

##                 Source Percent_Difference_CTC               Location 
##           -0.129734150           -0.088806004           -0.075290612 
##   Percent_Hike_Offered                    Age                 Gender 
##           -0.074621355           -0.047143620           -0.035927506 
##          Joining_Bonus  Percent_Hike_Expected           Offered_Band 
##           -0.007664980           -0.002603188            0.056405206 
##                    LOB    Relevant_Experience           DOJ_Extended 
##            0.072817581            0.096124151            0.133148866 
##      Relocation_Status          Notice_Period                 Status 
##            0.223981956            0.241088064            1.000000000

We find that the correlation between Status and the other attributes is very weak. However, this is because Status is a binary variable. The determined correlations definitely point to which attributes have a positive influence on a Candidate Joining and which have a negative influence.

The attributes which have a positive influence are as follows (in the decreasing order of correlation).

  1. Relocation Involved
  2. Date of Joining Extention Provided
  3. Source
  4. Percent Difference in CTC
  5. Percent Hike Offered
  6. Line of Business (LOB)
  7. Age
  8. Percent Hike Expected

The attributes which have a negative influence are as follows (in the decreasing order of correlation).

  1. Joining Bonus
  2. Gender
  3. Location
  4. Offered Band
  5. Relevant Experience
  6. Notice Period

Analysis to Conduct

We will try to answer the following questions.

A. Relocation

  1. Are Candidates, who need relocating, less likely to join?
  2. To which City(ies) Candidates are less likely to relocate?
  3. To which City(ies) Candidates are more likely to relocate?
  4. Does Extention of Joining Date influence Candidate’s (needing Relocation) decision to Join?
  5. Does Percentage Hike Offered influence Candidate’s (needing Relocation) decision to Join?
  6. For which City(ies), Percent Hike Offered need to be higher for Candidates needing Relocation?
  7. For which LOB(s), Percent Hike Offered need to be higher for Candidates needing Relocation?
  8. For which Band(s), Percent Hike Offered need to be higher for Candidates needing Relocation?
  9. Does Gender have an impact on Joining Decision for Candidates needing Relocation?
  10. Does Offered Band have an impact on Joining Decision for Candidates needing Relocation?
  11. Are Older Candidates less likely to Relocate? Does Older Candidates need higher Hike Percentage to Relocate?

B. Source

  1. Candidates from which Source(s) are more likely or less likely join in which City(ies)?
  2. Candidates from which Source(s) are more likely or less likely join in which LOB(s)?
  3. Candidates from which Source(s) are more likely or less likely join in which Band(s)?

C. Hike Offered

  1. Does difference between Hike Expected and Hike Offered influence the Decision to Join?
  2. In which City(ies), Candidates are less likely to Join if there is Significant difference between Hike Expected and Hike Offered?
  3. In which LOB(s), Candidates are less likely to Join if there is Significant difference between Hike Expected and Hike Offered?
  4. In which Band(s), Candidates are less likely to Join if there is Significant difference between Hike Expected and Hike Offered?
  5. What is critical level of difference between Hike Expected and Hike Offered which makes a Candidate Join or not Join?
  6. What is the critical Hike Percentages for different Age Groups?

Analysis

In this chapter, we analyse the data for the queries we established in section 5.2.3.

Question regarding Relocation

In this section, we discuss the analysis questions regarding the Relocation Creteria.

Are Candidates, who need relocating, less likely to join?

We find out the percentage of Candidates who Joined or did not Join based on whether they neededd Relocating or not.

Relocation Status 1 indicates that the Candidate needed Relocating to Join the Company. Relocation Status 2 indicates that the Candidate did not need to Relocate to Join the Company.

Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.

We see that all the Candidates, who needed Relocation, Joined. There are Candidates who did not need relocation and did not join. So, it can be concluded that Relocation is not a criteria because Candidates do not join.

As a result of this finding, all other questions related to Relocation do not need any further analysis.

Question regarding Source

In this section, we discuss the analysis questions regarding the Sources through which the Candidates are obtained.

Candidates from which Source(s) are more likely or less likely to join in which City(ies)?

The Codes for the Source are as follows.

1 - Agency 2 - Direct 3 - Employee Referral

Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.

We see below the Crosstab between Status and Source.

This graph suggests that Candidates who come through Employee Referrals are more likely to Join. Also, this graph suggests that more perentage of Candidates who do not Join come through Agencies. However, to be sure, let us see the volume of Candidates coming through each of the Sources.

We notice that similar number of Candidates come through Agencies and through Employee Referrals. So, we can safely conclude that Employee Referals is a better Source to get Candidates (who are more likely to Join) than through Agencies.

However, more number of Candidates are recruited through Direct Employment than Candidates through Employee Referrals and Candidates through Agencies put together. So, we need to examine these cases more closely. However, before doing that let us wrap up our finding on regarding Candidates recruited through Employee Referrals and Candidates recruited through Agencies.

Candidates recruited through Employee Referrals

Let us examine the Cities from which the Candidates, recruited through Employee Referrals, come from.

The Codes for the Cities (Location) are as follows.

1 - Chennai 2 - Gurgaon 3 - Bangalorel 4 - Kolkata 5 - Mumbai 6 - Hyderabad 7 - Cochin 8 - Pune 9 - Coimbatore 10 - Ahmedabad 11 - Noida 12 - Others

Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.

The graphs shows that 100% of the Candidates coming through Employee Referrals joined when they were recruited for Cochin, Coimbatore, Ahmadabad and Others. All these Cities can be considered as smaller cities. It is safe to consider that the number of Jobs available in these Cities is less (compared to the other Cities in our list).

Let us examine how many of these Candidates needed Relocating.

We notice that majority of these Candidates needed to Relocate to Join the Company.

We examine the Hike Offered to these Candidates.

We see that majority of the Candidates were offered between 30-40% Hike.

Lastly, we examine what was the Hike Expected by these Candidates.

We notice that the Hike Expected by these Candidates was similar to the Hike they were Offered.

So, we can conclude that for Smaller Cities, it is best to get Candidates through Employee Referrals offerring them a hike of 30-40%. We can view this as a win-win situation as Salaries in these Cities would be less ( and thus the Company benefits) and the Candidates get the hike they expect (and thus can save as cost of living would be less).

Candidates recruited through Agencies

Let us examine the Cities from which the Candidates, recruited through Agencies, come from.

The Codes for the Cities (Location) are as follows.

1 - Chennai 2 - Gurgaon 3 - Bangalorel 4 - Kolkata 5 - Mumbai 6 - Hyderabad 7 - Cochin 8 - Pune 9 - Coimbatore 10 - Ahmedabad 11 - Noida 12 - Others

Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.

We notice that Agencies generally operate between 60-70% efficiency. We could consider the case of Coimbatore to be an Outlier.

We already know that all Candidates, who needed Relocating, Joined. So, we can conclude that Agengies are not challenged in finding Candidate when Relocation is essential.

Let us try to establish the reason why Candidates from Agencies do not join. We will examine the Hike Offered to the Candidates and the Hike Expected by the Candidates.

We will examine the Hike Offered to the Candidates versus the Hike Expected by the Candidates.

We notice that in majority of the cases, the Hike Offered to the Candidates was in general lower than the Hike Expected by the Candidates (The Slope of the Regression Line is much less than 45 degrees). So, this couldd be an area for investigation for the Company. For example, the lower offers could be due to quality of the Candidates supplied by the Agencies.

Candidates recruited Directly

Let us examine the Cities from which the Candidates, recruited Directly, come from.

The Codes for the Cities (Location) are as follows.

1 - Chennai 2 - Gurgaon 3 - Bangalorel 4 - Kolkata 5 - Mumbai 6 - Hyderabad 7 - Cochin 8 - Pune 9 - Coimbatore 10 - Ahmedabad 11 - Noida 12 - Others

Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.

The efficiency of Direct Recruitment is between 70-80%. Since recruitment through Direct mean accounts for the maximum number of Candidates, it needs analysing the reason for renege here.

We check the counts per City.

The number of Direct Recruitments in Chennai and Noida are far more than the number of Direct Recruitments in Bangalore. However, the number of reneges in Bangalore is similar to that in Chennai and Noida. So, we examine the case of Bangalore.

The case of Candidates recruited Directly for Bangalore

Let us examine the following.

  1. Hike Expected versus Hike Offered
  2. Gender of the Candidates
  3. Age of the Candidates
Hike Expected versus Hike Offered

We examine the Hike Expected versus the Hike Offered to the Candidates approached through Direct Recruitment for Bangalore who did not join,

We notice that there is a consistency when it comes to making offers to the Candidates. However, in general, Candidates were offered less than their expectation. Let us further examine this by Age and Gender.

We plot the graph of Candiates who came through Direct Recruitment in Bangalore and did not Join.

Gender 1 indicates Male Candidates. Status 2 indicates Female Candidate.

Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.

We notice that the renege percentage is absolutely similar for MAle Candidates and Female Candidates.

We plot the graph of Candiates who came through Direct Recruitment in Bangalore.

Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.

We notice that renege is consistently high for all ages except for age group less than 25 and for age of 34.

It is difficult to interpret the case of age of 34. The first question is why the number of offers made for Candidates aged 34 was so high. We check the LOB for which these Candidates were demanded.

We plot the graph of Candiates who came through Direct Recruitment in Bangalore aged 34.

The Codes for the Line of Business (LOB) are as follows.

1 - INFRA 2 - Healthcare 3 - BFSI 4 - CSMP 5 - ETS 6 - AXON 7 - EAS 8 - SALES 9 - CORP 10 - BSERV 11 - MMS 12 - ERS

Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.

We notice that renege is especially high for EAS LOB. Let us see if this LOB can be shifted to any other location.

We plot the graph of Candiates who came through Direct Recruitment aged 34 for LOB EAS.

The Codes for the Cities (Location) are as follows.

1 - Chennai 2 - Gurgaon 3 - Bangalorel 4 - Kolkata 5 - Mumbai 6 - Hyderabad 7 - Cochin 8 - Pune 9 - Coimbatore 10 - Ahmedabad 11 - Noida 12 - Others

Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.

As the renege in EAS LOB is exceptionally high in Bangalore compared to the other locations, we can conclude that it is better to shut down the EAS LOB in Bangalore and operate from other locations.

Candidates from which Source(s) are more likely or less likely to join in which LOB(s)?

The Codes for the Source are as follows.

1 - Agency 2 - Direct 3 - Employee Referral

Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.

We already know the following.

  1. Conversion Rate of Employee Referral is the highest.
  2. Most of the Candidates are obtained through Direct Recruitment.

Let us examine the Characteristics with respect to the LOBs.

Candidates recruited through Employee Referrals

Let us examine the LOBs for which the Candidates were recruited through Employee Referrals.

The Codes for the Line of Business (LOB) are as follows.

1 - INFRA 2 - Healthcare 3 - BFSI 4 - CSMP 5 - ETS 6 - AXON 7 - EAS 8 - SALES 9 - CORP 10 - BSERV 11 - MMS 12 - ERS

Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.

We see that the levelof renege across all the LOBs for Candidates obtained through Employee Referral is between 10-30. So, there is not much to choose from through there are a few exceptions. However, renege is very low in INFRA LOB. Let us examine the characteristics for this LOB.

We see that the majority of the Candidates who joined were Females.

A quarter of the Candidates were aged 34 years.

We can see that Joining Bonus was given in only 3 cases. So, Joining Bonus is not a motivation for Joining.

Lastly, we see the correlation between Hike Offered versus Hike Expected.

We see that Hike Offered is much lower than the Hike Expected.

We do not find anything exception here except that it is advised to recruit Female Candidates through Employee Referrals in the INFRA LOB in the Lower Bands.

Candidates recruited through Agencies

Let us examine the LOBs for which the Candidates were recruited through Agencies.

The Codes for the Line of Business (LOB) are as follows.

1 - INFRA 2 - Healthcare 3 - BFSI 4 - CSMP 5 - ETS 6 - AXON 7 - EAS 8 - SALES 9 - CORP 10 - BSERV 11 - MMS 12 - ERS

Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.

We notice that the renege percentage is generally very high most of the LOBs. We know that majority of the Candidates are recruited directly. So, it advised to reduce the recruitment through Agencies.

Candidates recruited through Agencies

Let us examine the LOBs for which the Candidates were recruited through Agencies.

The Codes for the Line of Business (LOB) are as follows.

1 - INFRA 2 - Healthcare 3 - BFSI 4 - CSMP 5 - ETS 6 - AXON 7 - EAS 8 - SALES 9 - CORP 10 - BSERV 11 - MMS 12 - ERS

Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.

The data is usual except that BSERV has remarkably low renege. Let us examine the characteristics for this LOB.

We see that the majority of the Candidates who joined were Females.

Almost all the Candidates were aged 34 years.

We can see that Joining Bonus was given in only 1 cases. So, Joining Bonus is not a motivation for Joining.

Lastly, we see the correlation between Hike Offered versus Hike Expected.

From the above charts, we can conclude the the preferred mode of recruitment should be through Direct Recruitment.

Question regarding Hike

In this section, we discuss the analysis questions regarding the Hikes.

Does difference between Hike Expected and Hike Offered influence the Decision to Join?

We have found answer for this question in the previous sections. However, we will look at some data.

First we examine the Hike Offered versus Hike Expected for all the Candidates who did not join.

##      Expected o_m1000_m50 o_m50_0 o_0_10 o_10_50 o_50_100 o_100_200 o_200_1000
## 1 e_m1000_m50           0       2      0       1        0         0          0
## 2     e_m50_0           2      23      4       5        0         0          0
## 3      e_0_10           2      37     95      27        4         5          0
## 4     e_10_50           1     215    126    1409       88        23         12
## 5    e_50_100           0      36     19     385      395        25          3
## 6   e_100_200           0       1      1      14       38        37          1
## 7  e_200_1000           0       0      0       0        0         5          1

If you notice the diagnoal of this data, you will notice that majority of the Candidates were offered the hike that they expected.

So, we can conclude that the compensation package offered is not a reason for renege.

Conclusions from the Analysis

The following can be conclude from the above analysis.

  1. Need for Relocation is not a criteria because of which Candidates do not join.
  2. Candidates coming through Employee Referrals are more likely to join than Candidates coming through Agencies.
  3. For Smaller Cities, it is best to get Candidates through Employee Referrals offerring them a hike of 30-40%.
  4. The Organisation needs examining the Quality of the Candidates supplied by the Agencies.
  5. Shut down the EAS LOB in Bangalore and operate from other locations.
  6. It is advised to recruit Female Candidates through Employee Referrals in the INFRA LOB in the Lower Bands.
  7. It is advised to reduce the recruitment through Agencies.
  8. Preferred mode of recruitment should be through Direct Recruitment.

Creating the Model for predicting Renege

We now create the Model for being able to predict renege based on the data that we have.

Random Forest Model

We split the dataset such that we assign 75% of the data for Training the Model. The rest 25% of the data will be used to Test the Model.

Number of Records in Training Set = 8572 Number of Records in Test Set = 2858

** Summary of Training Set**

##  DOJ_Extended Notice_Period    Offered_Band Percent_Hike_Expected
##  1:3666       Min.   :  0.00   1: 689       Min.   :-68.83       
##  2:4906       1st Qu.: 30.00   2:4875       1st Qu.: 25.00       
##               Median : 30.00   3:2438       Median : 38.89       
##               Mean   : 37.66   4: 479       Mean   : 41.93       
##               3rd Qu.: 60.00   5:  73       3rd Qu.: 52.17       
##               Max.   :120.00   6:  16       Max.   :327.35       
##                                7:   2                            
##  Percent_Hike_Offered Percent_Difference_CTC Joining_Bonus Relocation_Status
##  Min.   :-60.53       Min.   :-67.27         1: 376        1:1035           
##  1st Qu.: 20.00       1st Qu.: -8.89         2:8196        2:7537           
##  Median : 34.45       Median :  0.00                                        
##  Mean   : 38.15       Mean   : -1.94                                        
##  3rd Qu.: 47.54       3rd Qu.:  0.00                                        
##  Max.   :471.43       Max.   :300.00                                        
##                                                                             
##  Gender   Source   Relevant_Experience      LOB          Location   
##  1:7029   1:2322   Min.   : 0.000      1      :2427   1      :2951  
##  2:1543   2:4761   1st Qu.: 3.000      12     :2154   11     :2711  
##           3:1489   Median : 4.000      3      :1195   3      :2022  
##                    Mean   : 4.193      5      : 665   6      : 334  
##                    3rd Qu.: 6.000      10     : 548   5      : 183  
##                    Max.   :26.000      6      : 485   2      : 136  
##                                        (Other):1098   (Other): 235  
##       Age        Status  
##  Min.   :20.00   1:6281  
##  1st Qu.:27.00   2:2291  
##  Median :30.00           
##  Mean   :30.34           
##  3rd Qu.:34.00           
##  Max.   :62.00           
## 

** Summary of Test Set**

##  DOJ_Extended Notice_Period    Offered_Band Percent_Hike_Expected
##  1:1202       Min.   :  0.00   1: 222       Min.   :-57.43       
##  2:1656       1st Qu.: 30.00   2:1650       1st Qu.: 25.00       
##               Median : 30.00   3: 803       Median : 38.89       
##               Mean   : 37.81   4: 156       Mean   : 41.73       
##               3rd Qu.: 60.00   5:  21       3rd Qu.: 52.78       
##               Max.   :120.00   6:   6       Max.   :359.77       
##                                7:   0                            
##  Percent_Hike_Offered Percent_Difference_CTC Joining_Bonus Relocation_Status
##  Min.   :-53.75       Min.   :-68.950        1: 135        1: 354           
##  1st Qu.: 18.42       1st Qu.: -9.510        2:2723        2:2504           
##  Median : 33.93       Median :  0.000                                       
##  Mean   : 37.02       Mean   : -2.532                                       
##  3rd Qu.: 46.60       3rd Qu.:  0.000                                       
##  Max.   :400.00       Max.   :233.330                                       
##                                                                             
##  Gender   Source   Relevant_Experience      LOB         Location   
##  1:2339   1: 759   Min.   : 0.000      12     :752   1      :1004  
##  2: 519   2:1615   1st Qu.: 3.000      1      :743   11     : 889  
##           3: 484   Median : 4.000      3      :402   3      : 667  
##                    Mean   : 4.124      5      :218   6      : 120  
##                    3rd Qu.: 5.000      10     :205   5      :  52  
##                    Max.   :22.000      4      :169   4      :  39  
##                                        (Other):369   (Other):  87  
##       Age        Status  
##  Min.   :22.00   1:2107  
##  1st Qu.:27.00   2: 751  
##  Median :30.00           
##  Mean   :30.22           
##  3rd Qu.:34.00           
##  Max.   :50.00           
## 

Now we create our Model using Random Forest.

## 
## Call:
##  randomForest(formula = Status ~ ., data = v_train, ntree = 1000,      mtry = 2, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 23.12%
## Confusion matrix:
##      1   2 class.error
## 1 5972 309  0.04919599
## 2 1673 618  0.73024880

We now make predictions on the Training Set.

The Accuracy of the Model on the Training Data = 92.2188520765282%

The Confusion Matrix is provides below.

## Confusion Matrix and Statistics
## 
##             
## v_pred_train    1    2
##            1 6272  658
##            2    9 1633
##                                           
##                Accuracy : 0.9222          
##                  95% CI : (0.9163, 0.9278)
##     No Information Rate : 0.7327          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7817          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9986          
##             Specificity : 0.7128          
##          Pos Pred Value : 0.9051          
##          Neg Pred Value : 0.9945          
##              Prevalence : 0.7327          
##          Detection Rate : 0.7317          
##    Detection Prevalence : 0.8084          
##       Balanced Accuracy : 0.8557          
##                                           
##        'Positive' Class : 1               
## 

We now make predictions on the Test Set.

The Accuracy of the Model on the Test Data = 78.2715185444367%

The Confusion Matrix is provides below.

## Confusion Matrix and Statistics
## 
##            
## v_pred_test    1    2
##           1 1995  509
##           2  112  242
##                                           
##                Accuracy : 0.7827          
##                  95% CI : (0.7671, 0.7977)
##     No Information Rate : 0.7372          
##     P-Value [Acc > NIR] : 9.98e-09        
##                                           
##                   Kappa : 0.3242          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9468          
##             Specificity : 0.3222          
##          Pos Pred Value : 0.7967          
##          Neg Pred Value : 0.6836          
##              Prevalence : 0.7372          
##          Detection Rate : 0.6980          
##    Detection Prevalence : 0.8761          
##       Balanced Accuracy : 0.6345          
##                                           
##        'Positive' Class : 1               
## 

We check the variables which are imporant for making the prediction.

##                               1          2 MeanDecreaseAccuracy
## DOJ_Extended           42.81650  36.613110            52.541284
## Notice_Period          49.57371  59.624897            70.716803
## Offered_Band           32.36441  -4.069789            34.966272
## Percent_Hike_Expected  42.24829 -16.764391            36.282926
## Percent_Hike_Offered   47.07039  -2.001910            48.593526
## Percent_Difference_CTC 36.65857  31.512760            51.728554
## Joining_Bonus          19.20148  -7.675910            13.283936
## Relocation_Status      74.91316 104.480451           105.321617
## Gender                  8.70433  -2.919776             5.627725
## Source                 24.51146   3.471047            24.590123
## Relevant_Experience    35.80203  15.007628            47.043338
## LOB                    57.37765  19.618882            66.795492
## Location               35.65446  -5.737479            29.735008
## Age                    18.46604   8.502554            22.224827
##                        MeanDecreaseGini
## DOJ_Extended                   87.36077
## Notice_Period                 214.61865
## Offered_Band                   72.65718
## Percent_Hike_Expected         282.04167
## Percent_Hike_Offered          318.07026
## Percent_Difference_CTC        299.51695
## Joining_Bonus                  23.08735
## Relocation_Status             148.57220
## Gender                         38.04041
## Source                         86.87576
## Relevant_Experience           185.65279
## LOB                           225.63871
## Location                      143.26188
## Age                           191.18886

Random Forest Model using Caret Package

We now create the Random Forest Model using the Caret Package. We use the same Training and Test Data.

## Random Forest 
## 
## 8572 samples
##   15 predictor
##    2 classes: '1', '2' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 8572, 8572, 8572, 8572, 8572, 8572, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa
##    2    1         1    
##    8    1         1    
##   15    1         1    
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

We now make predictions on the Training Set.

The Accuracy of the Model on the Training Data = 100%

The Confusion Matrix is provides below.

## Confusion Matrix and Statistics
## 
##             
## v_pred_train    1    2
##            1 6281    0
##            2    0 2291
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9996, 1)
##     No Information Rate : 0.7327     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.7327     
##          Detection Rate : 0.7327     
##    Detection Prevalence : 0.7327     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 1          
## 

We now make predictions on the Test Set.

The Accuracy of the Model on the Test Data = 100%

The Confusion Matrix is provides below.

## Confusion Matrix and Statistics
## 
##            
## v_pred_test    1    2
##           1 2107    0
##           2    0  751
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9987, 1)
##     No Information Rate : 0.7372     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.7372     
##          Detection Rate : 0.7372     
##    Detection Prevalence : 0.7372     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 1          
##