The analysis is based on a dataset which contains data regarding jobs offered to candidates & their having accepted it and the current state of their joining in terms of whether the candidates who joined the job or not.
The analysis will first focus on establishing what are the characteristics of the Candidates who Accepted the Job Offer and Joined the Job. The analysis will also focus on establishing the reverse case, i.e. what are the characteristics of the Candidates who Accepted the Job Offer and Did Not Join.
Based on the data, we will create a model. Using the model, it should be possible to predict whether a Candidate, who Accepts the Job Offer, will actually Join or Not.
We begin our analysis by loading the data. This data is available in a CSV file named EmployeeJoining.csv. This data is loaded to a Data Frame.
Initially, the entire raw data will be loaded from the file. Later, subsets of this raw data will be created so that we have the right dataset for our analysis.
The dataset contains 12,333 number of data points in 17 attributes.
The Structure of the Data is provided below.
## 'data.frame': 12333 obs. of 17 variables:
## $ Candidate_Ref : int 2110407 2112635 2112838 2115021 2115125 2117167 2119124 2121918 2127572 2137866 ...
## $ DOJ_Extended : int 1 0 0 0 1 1 1 0 1 0 ...
## $ Duration_to_Accept_Offer: int 14 18 3 26 1 17 37 NA 16 NA ...
## $ Notice_Period : int 30 30 45 30 120 30 30 45 0 30 ...
## $ Offered_Band : chr "E2" "E2" "E2" "E2" ...
## $ Percent_Hike_Expected : num -20.8 50 42.8 42.8 42.6 ...
## $ Percent_Hike_Offered : num 13.2 320 42.8 42.8 42.6 ...
## $ Percent_Difference_CTC : num 42.9 180 0 0 0 ...
## $ Joining_Bonus : chr "No" "No" "No" "No" ...
## $ Relocation_Status : chr "No" "No" "No" "No" ...
## $ Gender : chr "Female" "Male" "Male" "Male" ...
## $ Source : chr "Agency" "Employee Referral" "Agency" "Employee Referral" ...
## $ Relevant_Experience : int 7 8 4 4 6 2 7 4 8 4 ...
## $ LOB : chr "ERS" "INFRA" "INFRA" "INFRA" ...
## $ Location : chr "Noida" "Chennai" "Noida" "Noida" ...
## $ Age : int 34 34 27 34 34 34 32 34 34 34 ...
## $ Status : chr "Joined" "Joined" "Joined" "Joined" ...
Provided below a brief description of all the attributes in the uploaded data. The data in the raw form is described. The desirable form as required for analysis is stated. Based on these 2 perspectives, necessary data transformations is conducted.
Candidate_Ref: This is a unique Candidate Reference Number. Candidate Reference Number is composed of digits only. However, this is a non-numeric data and cannot be used in any calculations. So, we will convert this data to Character type of data.
DOJ_Extended: This attribute indicates whether the Date of Joining was extended for the Candidate or not. This field can take 2 values; either 1 or 2. “2” indicates that the Date of Joining was not extended for the Candidate. “1” indicates that the Date of Joining was extended for the Candidate. So, we will convert this data to a Factor.
Duration_to_Accept_Offer: This attribute contains the Number of Days the Candidates took to accept the Job Offer. This attribute contains Numeric Data.
Notice_Period: This attribute contains the Number of Days the Candidates needs as Notice Period towards his/her Current Employer. This attribute contains Numeric Data.
Offered_Band: This attribute contains the Band offered to the Candidates. Every Organisation assigns a Band to each of the Employees. Bands indicate the level of responsibility entrusted to the Employee. Band may be or may not be assigned to the Employees based on Seniority. This field contains Categorical Data. This data will never be used in any calculations. However, this data is used for classification of Employees. Presently, this data is a Factor. We will convert the String Codes to Numeric Codes.
## [1] "E2" "E1" "E3" "E4" "E0" "E6" "E5"
Percent_Hike_Expected: This attribute contains the Percentage hike (increase over Current CTC [Cost to Company]) expected by the Candidates. This attribute contains Numeric Data.
Percent_Hike_Offered: This attribute contains the Percentage hike (increase over Current CTC [Cost to Company]) offered to the Candidates. This attribute contains Numeric Data.
Percent_Difference_CTC: This attribute contains the Percentage Difference between the CTC [Cost to Company] offered to the Candidates and the CTC of the Candidate in his/her Current Company. This attribute contains Numeric Data.
Joining_Bonus: This attribute indicates whether Joining Bonus was given to the Candidate or not. This field can take 2 values; either “Yes” or “No”. “Yes” indicates that Joining Bonus was given to the Candidate. “No” indicates that Joining Bonus was not given to the Candidate. We will replace “Yes” with “1” and “No” with “2”.
Relocation_Status: This attribute indicates whether the Candidate needs to Relocate while Joining the Company or not. This field can take 2 values; either “Yes” or “No”. “Yes” indicates that the Candidate needs to Relocate. “No” indicates that the Candidate does not needs to Relocate. We will replace “Yes” with “1” and “No” with “2”.
Gender: This attribute indicates the Gender of the Candidate. This field can take 2 values; either “Male” or “Female”. We will replace “Male” with “1” and “Female” with “2”.
Source: This attribute indicates the Source through which the Candidate was found. This field presently takes 3 values; either “Direct” or “Agency” or “Employee Referral”. Presently, this data is a Factor. We will convert the String Codes to Numeric Codes.
## [1] "Agency" "Employee Referral" "Direct"
Relevant_Experience: This attribute contains the Years of Relevant Experience of the Candidates for the Position Considered. This attribute contains Numeric Data.
LOB: This attribute indicates the LOB (Line of Business) for which the Candidate is being hired. This field presently takes 12 values. Presently, this data is a Factor. We will convert the String Codes to Numeric Codes.
## [1] "ERS" "INFRA" "Healthcare" "BFSI" "CSMP"
## [6] "ETS" "AXON" "EAS" "SALES" "CORP"
## [11] "BSERV" "MMS"
## [1] "Noida" "Chennai" "Gurgaon" "Bangalore" "Kolkata"
## [6] "Mumbai" "Hyderabad" "Cochin" "Pune" "Others"
## [11] "Coimbatore" "Ahmedabad"
Age: This attribute contains the Age in Years of the Candidates. This attribute contains Numeric Data.
Status: This attribute indicates whether the Candidate Joined the Company or not. This field can take 2 values; either “Joined” or “Not Joined”. We will replace “Joined” with “1” and “Not Joined” with “0”. We have the actual data regarding the Candidates. We will use this data to train our model so that it is able to predict whether a Candidate will Join or Not based on the other attributes that we will provide as input.
The modified structure of the data is as follows.
## 'data.frame': 12333 obs. of 17 variables:
## $ Candidate_Ref : chr "2110407" "2112635" "2112838" "2115021" ...
## $ DOJ_Extended : Factor w/ 2 levels "1","2": 1 2 2 2 1 1 1 2 1 2 ...
## $ Duration_to_Accept_Offer: int 14 18 3 26 1 17 37 NA 16 NA ...
## $ Notice_Period : int 30 30 45 30 120 30 30 45 0 30 ...
## $ Offered_Band : Factor w/ 7 levels "1","2","3","4",..: 3 3 3 3 3 2 3 3 2 2 ...
## $ Percent_Hike_Expected : num -20.8 50 42.8 42.8 42.6 ...
## $ Percent_Hike_Offered : num 13.2 320 42.8 42.8 42.6 ...
## $ Percent_Difference_CTC : num 42.9 180 0 0 0 ...
## $ Joining_Bonus : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ Relocation_Status : Factor w/ 2 levels "1","2": 2 2 2 2 1 2 2 2 2 2 ...
## $ Gender : Factor w/ 2 levels "1","2": 2 1 1 1 1 1 1 1 2 1 ...
## $ Source : Factor w/ 3 levels "1","2","3": 1 3 1 3 3 3 3 3 2 2 ...
## $ Relevant_Experience : int 7 8 4 4 6 2 7 4 8 4 ...
## $ LOB : Factor w/ 12 levels "1","2","3","4",..: 12 1 1 1 1 1 1 1 2 12 ...
## $ Location : Factor w/ 12 levels "1","2","3","4",..: 11 1 11 11 11 11 11 11 11 11 ...
## $ Age : int 34 34 27 34 34 34 32 34 34 34 ...
## $ Status : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 2 1 2 ...
We will check if there is any missing information in the dataset. There are many techniques for dealing with the missing data. We will try the approach where we will remove the records which have any missing data. After doing so, if we have substantial amount of data left, we will proceed on this strategy.
We will check if the data is absent and/or data is NULL.
Total number of Records = 12,333
Candidate_Ref = 0 DOJ_Extended = 0 Duration_to_Accept_Offer = 2,719 Notice_Period = 0 Offered_Band = 0 Percent_Hike_Expected = 747 Percent_Hike_Offered = 596 Percent_Difference_CTC = 851 Joining_Bonus = 0 Relocation_Status = 0 Gender = 0 Source = 0 Relevant_Experience = 0 LOB = 0 Location = 0 Age = 0 Status = 0
We find that maximum number of missing values are for the attribute Duration_to_Accept_Offer. Intuitively, it seems like this attribute contributes significantly towards the decision of joining by the candidate.
We will create 2 datasets with no missing values.
In the first dataset, we will retain all the attributes and only include the records which have no missing values. We will call this DataSet1. So, DataSet1 has 9,011 records.
In the second dataset, we will omit the attribute Duration_to_Accept_Offer and only include the records which have no missing values. We will call this DataSet2. So, DataSet2 has 11,430 records.
We notice that in both the datasets, we have enough data. So, we will stick to this strategy of data cleansing.
Now, let us find the correlation between the attribute Status and the attribute Duration_to_Accept_Offer. This should give an indication whether the attribute Duration_to_Accept_Offer has influence on the decision of the Candidates in Joining after accepting the offer. Though this is not a comprehensive test, this should provide a decent first-hand idea.
Our NULL Hypothesis is that there is NO CORRELATION between the attribute Status and the attribute Duration_to_Accept_Offer.
Correlation Coefficient between Status and Duration_to_Accept_Offer = 0.0626342
##
## Pearson's product-moment correlation
##
## data: v_temp1$IntStatus and v_temp1$Duration_to_Accept_Offer
## t = 6.1528, df = 9612, p-value = 7.917e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04269795 0.08252054
## sample estimates:
## cor
## 0.06263418
The P-value of 7.917088810^{-10} is below the alpha of 0.05 and thus we cannot reject the NULL Hypothesis.
The correlation between Status and Duration_to_Accept_Offer is very weak. So, it indicates that the Duration to Accept Offer does not impact the Candidates decision to Join after accepting the Offer. Also, we can conclude that if we omit the attribute Duration_to_Accept_Offer, our model for predicting Status will not be impacted.
So, from this point, we will use DataSet2 for analysis. DataSet2 has 11,430 records.
We will consider the Data Points which are outside the Tukey’s Fences as Outliers.
Outliers can be present in the data broadly because of two reasons. They are as follows.
There are exceptional cases in every event that is observed. In this case, the Outliers are genuine data which have happened in exceptional circumstances and/or genuinely as an exception.
Outliers in data can be present due to error in data capture. In this case, the Outliers are erroneous data.
In either case, Outliers can distort the result of our analysis and have significant negative impact on the accuracy of the Prediction Model. So, we will eliminate all Outliers from the Data.
We will find the Outliers in the Numeric Data in the Data Set. To detect whether Outliers are present or not, we will create the Box and Whisker Plots for the Numeric Data.
We compute Tukey’s Fences as follows.
Lower Limit (L) = Q1 - 1.5 * (Q3 - Q1)
Upper Limit (U) = Q3 + 1.5 * (Q3 - Q1)
Here, Q1 is the 1st Quartile and Q3 is the 3rd Quartile.
Any value less than L and greater than U is an Outlier
Quartile 1 (Q1) = 30
Quartile 3 (Q3) = 60
Lower Limit (L) = -15
Upper Limit (U) = 105
From the above figures, we notice that there are Outliers above the Upper Boundary. Let us examine these records to determine whether it needs eliminating.
The list of top 10 records where Notice Period is above the Upper Limit is as follows.
## Candidate_Ref Notice_Period Status
## 1 2115125 120 1
## 2 2205496 120 1
## 3 2286200 120 2
## 4 2286297 120 1
## 5 2120995 120 1
## 6 2128469 120 1
## 7 2156459 120 2
## 8 2157132 120 2
## 9 2163132 120 1
## 10 2172869 120 2
We notice that the Maximum Outlier Value is 120. The data is from India. In Indian context, there are many Companies where the Notice Period is 120 days. So, we will not remove these records.
Quartile 1 (Q1) = 3
Quartile 3 (Q3) = 6
Lower Limit (L) = -1.5
Upper Limit (U) = 10.5
From the above figures, we notice that there are Outliers above the Upper Boundary. Let us examine these records to determine whether it needs eliminating.
The list of top 10 records where Relevant Experience is above the Upper Limit is as follows.
## Candidate_Ref Relevant_Experience Status
## 1 3304208 26 1
## 2 2469500 24 1
## 3 2394768 23 1
## 4 2850618 23 1
## 5 2368801 22 1
## 6 2824825 22 1
## 7 3379637 22 1
## 8 3607236 22 1
## 9 2344905 21 1
## 10 2224425 20 1
We notice that the Maximum Outlier Value is 26. The data is from India. In Indian context, there are many Candidates with this sort of experience who apply for Jobs. So, we will not remove these records.
Quartile 1 (Q1) = 27
Quartile 3 (Q3) = 34
Lower Limit (L) = 16.5
Upper Limit (U) = 44.5
From the above figures, we notice that there are Outliers above the Upper Boundary. Let us examine these records to determine whether it needs eliminating.
The list of top 10 records where Age is above the Upper Limit is as follows.
## Candidate_Ref Age Status
## 1 3225493 62 2
## 2 2635174 60 2
## 3 2394647 51 1
## 4 3211288 51 2
## 5 2464782 50 1
## 6 2469131 50 1
## 7 2604373 50 1
## 8 2618623 50 1
## 9 3304208 50 1
## 10 2322707 49 1
We notice that the Maximum Outlier Value is 62 and there are 2 records where the Candidate’s age is 62 and 60. The data is from India. In Indian context, there can be Candidates in this age bracket who can apply for Jobs. So, we will not remove these records.
Quartile 1 (Q1) = 25
Quartile 3 (Q3) = 52.17
Lower Limit (L) = -15.755
Upper Limit (U) = 92.925
From the above figures, we notice that there are Outliers both above the Upper Boundary and below the Lower Boundary. Let us examine these records to determine whether it needs eliminating.
The list of top 10 records where Percent Hike Expected is above the Upper Limit is as follows.
## Candidate_Ref Percent_Hike_Expected Status
## 1 3258125 359.77 1
## 2 2575776 327.35 1
## 3 2478398 323.08 1
## 4 3434802 314.68 1
## 5 3258078 306.25 1
## 6 3664507 300.00 1
## 7 3311520 294.74 2
## 8 3267878 291.30 2
## 9 3040819 280.95 1
## 10 3379707 266.67 2
We notice that the Maximum Outlier Value is 359.77%. In Indian context, there can be Candidates with this kind of Hike Expectations who apply for Jobs. So, we will not remove these records.
The list of top 10 records where Percent Hike Expected is below the Lower Limit is as follows.
## Candidate_Ref Percent_Hike_Expected Status
## 1 3258076 -68.83 2
## 2 3304533 -62.50 1
## 3 3685332 -57.43 2
## 4 2850795 -54.04 1
## 5 2850654 -53.50 2
## 6 2920298 -50.00 2
## 7 3807331 -49.58 2
## 8 2569520 -48.25 1
## 9 2483538 -48.05 1
## 10 2403871 -45.81 1
We notice that the Minimum Outlier Value is -68.83%. It is hard to expect candidates to expect less than half their salaries. However, due to current COVID-19 pandemic, this also can be expected. So, we will not remove these records.
Quartile 1 (Q1) = 20
Quartile 3 (Q3) = 47.2
Lower Limit (L) = -20.8
Upper Limit (U) = 88
From the above figures, we notice that there are Outliers both above the Upper Boundary and below the Lower Boundary. Let us examine these records to determine whether it needs eliminating.
The list of top 10 records where Percent Hike Offered is above the Upper Limit is as follows.
## Candidate_Ref Percent_Hike_Offered Status
## 1 3435024 471.43 1
## 2 2535545 414.29 1
## 3 2534384 400.00 1
## 4 2221149 393.33 1
## 5 3250085 375.00 1
## 6 2483174 350.00 1
## 7 2452368 333.33 1
## 8 2112635 320.00 1
## 9 2550678 316.67 1
## 10 2478398 314.53 1
We notice that the Maximum Outlier Value is 471.43%. In Indian context, there can be Candidates with this kind of Hike Offerings. So, we will not remove these records.
The list of top 10 records where Percent Hike Offered is below the Lower Limit is as follows.
## Candidate_Ref Percent_Hike_Offered Status
## 1 2151180 -60.53 2
## 2 2458879 -60.00 1
## 3 2237763 -55.75 1
## 4 3488402 -54.97 2
## 5 3462905 -54.55 1
## 6 2407721 -53.75 1
## 7 2920298 -52.78 2
## 8 2137866 -51.37 2
## 9 2347819 -50.82 1
## 10 3617498 -50.13 2
We notice that the Minimum Outlier Value is -60.53%. It is hard to expect candidates are offered less than half their salaries. However, due to current COVID-19 pandemic, this also can be expected. So, we will not remove these records.
Quartile 1 (Q1) = -9.09
Quartile 3 (Q3) = 0
Lower Limit (L) = -22.725
Upper Limit (U) = 13.635
From the above figures, we notice that there are Outliers both above the Upper Boundary and below the Lower Boundary. Let us examine these records to determine whether it needs eliminating.
The list of top 10 records where Percent Difference CTC is above the Upper Limit is as follows.
## Candidate_Ref Percent_Difference_CTC Status
## 1 3258076 300.00 2
## 2 2535545 300.00 1
## 3 2550678 272.34 1
## 4 3289177 240.91 1
## 5 2534384 233.33 1
## 6 2282545 227.27 2
## 7 2452368 225.00 1
## 8 3429009 218.37 1
## 9 2221149 196.00 1
## 10 3411311 195.00 2
We notice that the Maximum Outlier Value is 300%. In Indian context, there can be Candidates with this kind of Hike Offerings. So, we will not remove these records.
The list of top 10 records where Percent Difference CTC is below the Lower Limit is as follows.
## Candidate_Ref Percent_Difference_CTC Status
## 1 3784642 -68.95 2
## 2 2458879 -67.27 1
## 3 3462905 -66.67 1
## 4 3463034 -66.00 1
## 5 3434802 -65.55 1
## 6 2780729 -64.29 2
## 7 2802356 -63.64 1
## 8 2151180 -62.50 2
## 9 3290644 -62.50 2
## 10 3274543 -62.50 2
We notice that the Minimum Outlier Value is -68.95%. It is hard to expect candidates are offered less than half their salaries. However, due to current COVID-19 pandemic, this also can be expected. So, we will not remove these records.
The summary of the dataset is provided below.
## Candidate_Ref DOJ_Extended Notice_Period Offered_Band
## Length:11430 1:4868 Min. : 0.0 1: 911
## Class :character 2:6562 1st Qu.: 30.0 2:6525
## Mode :character Median : 30.0 3:3241
## Mean : 37.7 4: 635
## 3rd Qu.: 60.0 5: 94
## Max. :120.0 6: 22
## 7: 2
## Percent_Hike_Expected Percent_Hike_Offered Percent_Difference_CTC
## Min. :-68.83 Min. :-60.53 Min. :-68.950
## 1st Qu.: 25.00 1st Qu.: 20.00 1st Qu.: -9.090
## Median : 38.89 Median : 34.26 Median : 0.000
## Mean : 41.88 Mean : 37.87 Mean : -2.088
## 3rd Qu.: 52.17 3rd Qu.: 47.20 3rd Qu.: 0.000
## Max. :359.77 Max. :471.43 Max. :300.000
##
## Joining_Bonus Relocation_Status Gender Source Relevant_Experience
## 1: 511 1: 1389 1:9368 1:3081 Min. : 0.000
## 2:10919 2:10041 2:2062 2:6376 1st Qu.: 3.000
## 3:1973 Median : 4.000
## Mean : 4.176
## 3rd Qu.: 6.000
## Max. :26.000
##
## LOB Location Age Status
## 1 :3170 1 :3955 Min. :20.00 1:8388
## 12 :2906 11 :3600 1st Qu.:27.00 2:3042
## 3 :1597 3 :2689 Median :30.00
## 5 : 883 6 : 454 Mean :30.31
## 10 : 753 5 : 235 3rd Qu.:34.00
## 4 : 651 2 : 174 Max. :62.00
## (Other):1470 (Other): 323
Now that we have finalised the dataset for analysis, we need answering whether it is really required to analyse this data or not.
We notice that, in this dataset, the number of Candidates who did not join the Company after accepting the Job Offer is 0. This is 0% of the total number of Candidates in the dataset.
This percentage is rather high. So, it makes perfect sense to try to figure out what are the reasons for the Candidates not joining after accepting the Offer.
This scenario leads to considerable amount of wasted expenses for the organisation. So, it makes perfect sense to device a mechanism to reduce this cost. One mechanism to control this cost could be if we had a model which could predict whether a Candidate, who accepts an Offer, will actually Join the Company. Using this model, the Company could decide for which Candidates it should invest it and for which Candidates it should not invest in.
We have established that it is worth analysing this data. Next,we need to figure out what needs analysing. Further, we need to prioritise among the items that needs analysis. We have to focus on the reasons what makes Candidates Join the Company and also what drives Candidates not to Join the Company after accepting the Offer.
We start by trying to figure out which attributes contribute significantly to the Candidate’s decision to join or not to join. In technical terms, this implies that we need to figure out the attributes (or combination of attributes) which have strong correlation - positive or negative - to the attribute Status.
Predictors with a single unique value (also known as “Zero-Variance Predictors”) will cause the model to fail. Since we will be tuning models using resampling methods, a random sample of the training set may result in some predictors with more than one unique value to become a zero-variance predictor. These so-called “near zero-variance predictors” can cause numerical problems during resampling for some models, such as linear regression.
To identify this kind of predictors, two properties need to be examined.
The percentage of unique values in the training set can be calculated for each predictor. Variables with low percentages have a higher probability of becoming a zero-variance predictor during resampling. However, this in itself is not a problem. Binary Predictors are likely to have low percentages and should not be discarded for this simple reason.
The other important criterion to examine is the skewness of the frequency distribution of the variable. If the ratio of most frequent value of a predictor to the second most frequent value is large, the distribution of the predictor may be highly skewed.
If both these criteria are flagged, the predictor may be a near zero-variance predictor. We will use the benchmark that if:
the percentage of unique values is less than 20% and
the ratio of the most frequent to the second most frequent value is greater than 20,
the predictor may cause problem for our model and thus should be removed.
The result of Near Zero-Variance Predictor Analysis is as follows.
## [1] "Joining_Bonus"
As Joining_Bonus is a binary variable, we will not discard it from our model.
Models are susceptible to multicollinearity (i.e., high correlations between predictors). Linear models, neural networks and other models can have poor performance in these situations or may generate unstable solutions. Other models, such as classication or regression trees, might be resistant to highly correlated predictors, but multicollinearity may negatively effect interpretability of the model.
To minimize the effect of multicollinearity, we will reduce the number of dimensions. To do this, we compute the correlation matrix of the predictors and use an algorithm to remove the a subset of the problematic predictors such that all of the pairwise correlations are below a threshold. This algorithm finds the minimal set of predictors that can be removed so that the pairwise correlations are below a specific threshold. Note that, if two variables have a high correlation, the algorithm determines which one is involved with the most pairwise correlations and is removed.
The result of our analysis is as follows.
## [1] "No multicollenearity found"
The Correlation Coefficients between Status and the other attributes is as follows.
## Source Percent_Difference_CTC Location
## -0.129734150 -0.088806004 -0.075290612
## Percent_Hike_Offered Age Gender
## -0.074621355 -0.047143620 -0.035927506
## Joining_Bonus Percent_Hike_Expected Offered_Band
## -0.007664980 -0.002603188 0.056405206
## LOB Relevant_Experience DOJ_Extended
## 0.072817581 0.096124151 0.133148866
## Relocation_Status Notice_Period Status
## 0.223981956 0.241088064 1.000000000
We find that the correlation between Status and the other attributes is very weak. However, this is because Status is a binary variable. The determined correlations definitely point to which attributes have a positive influence on a Candidate Joining and which have a negative influence.
The attributes which have a positive influence are as follows (in the decreasing order of correlation).
The attributes which have a negative influence are as follows (in the decreasing order of correlation).
We will try to answer the following questions.
A. Relocation
B. Source
C. Hike Offered
In this chapter, we analyse the data for the queries we established in section 5.2.3.
In this section, we discuss the analysis questions regarding the Relocation Creteria.
We find out the percentage of Candidates who Joined or did not Join based on whether they neededd Relocating or not.
Relocation Status 1 indicates that the Candidate needed Relocating to Join the Company. Relocation Status 2 indicates that the Candidate did not need to Relocate to Join the Company.
Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.
We see that all the Candidates, who needed Relocation, Joined. There are Candidates who did not need relocation and did not join. So, it can be concluded that Relocation is not a criteria because Candidates do not join.
As a result of this finding, all other questions related to Relocation do not need any further analysis.
In this section, we discuss the analysis questions regarding the Sources through which the Candidates are obtained.
The Codes for the Source are as follows.
1 - Agency 2 - Direct 3 - Employee Referral
Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.
We see below the Crosstab between Status and Source.
This graph suggests that Candidates who come through Employee Referrals are more likely to Join. Also, this graph suggests that more perentage of Candidates who do not Join come through Agencies. However, to be sure, let us see the volume of Candidates coming through each of the Sources.
We notice that similar number of Candidates come through Agencies and through Employee Referrals. So, we can safely conclude that Employee Referals is a better Source to get Candidates (who are more likely to Join) than through Agencies.
However, more number of Candidates are recruited through Direct Employment than Candidates through Employee Referrals and Candidates through Agencies put together. So, we need to examine these cases more closely. However, before doing that let us wrap up our finding on regarding Candidates recruited through Employee Referrals and Candidates recruited through Agencies.
Let us examine the Cities from which the Candidates, recruited through Employee Referrals, come from.
The Codes for the Cities (Location) are as follows.
1 - Chennai 2 - Gurgaon 3 - Bangalorel 4 - Kolkata 5 - Mumbai 6 - Hyderabad 7 - Cochin 8 - Pune 9 - Coimbatore 10 - Ahmedabad 11 - Noida 12 - Others
Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.
The graphs shows that 100% of the Candidates coming through Employee Referrals joined when they were recruited for Cochin, Coimbatore, Ahmadabad and Others. All these Cities can be considered as smaller cities. It is safe to consider that the number of Jobs available in these Cities is less (compared to the other Cities in our list).
Let us examine how many of these Candidates needed Relocating.
We notice that majority of these Candidates needed to Relocate to Join the Company.
We examine the Hike Offered to these Candidates.
We see that majority of the Candidates were offered between 30-40% Hike.
Lastly, we examine what was the Hike Expected by these Candidates.
We notice that the Hike Expected by these Candidates was similar to the Hike they were Offered.
So, we can conclude that for Smaller Cities, it is best to get Candidates through Employee Referrals offerring them a hike of 30-40%. We can view this as a win-win situation as Salaries in these Cities would be less ( and thus the Company benefits) and the Candidates get the hike they expect (and thus can save as cost of living would be less).
Let us examine the Cities from which the Candidates, recruited through Agencies, come from.
The Codes for the Cities (Location) are as follows.
1 - Chennai 2 - Gurgaon 3 - Bangalorel 4 - Kolkata 5 - Mumbai 6 - Hyderabad 7 - Cochin 8 - Pune 9 - Coimbatore 10 - Ahmedabad 11 - Noida 12 - Others
Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.
We notice that Agencies generally operate between 60-70% efficiency. We could consider the case of Coimbatore to be an Outlier.
We already know that all Candidates, who needed Relocating, Joined. So, we can conclude that Agengies are not challenged in finding Candidate when Relocation is essential.
Let us try to establish the reason why Candidates from Agencies do not join. We will examine the Hike Offered to the Candidates and the Hike Expected by the Candidates.
We will examine the Hike Offered to the Candidates versus the Hike Expected by the Candidates.
We notice that in majority of the cases, the Hike Offered to the Candidates was in general lower than the Hike Expected by the Candidates (The Slope of the Regression Line is much less than 45 degrees). So, this couldd be an area for investigation for the Company. For example, the lower offers could be due to quality of the Candidates supplied by the Agencies.
Let us examine the Cities from which the Candidates, recruited Directly, come from.
The Codes for the Cities (Location) are as follows.
1 - Chennai 2 - Gurgaon 3 - Bangalorel 4 - Kolkata 5 - Mumbai 6 - Hyderabad 7 - Cochin 8 - Pune 9 - Coimbatore 10 - Ahmedabad 11 - Noida 12 - Others
Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.
The efficiency of Direct Recruitment is between 70-80%. Since recruitment through Direct mean accounts for the maximum number of Candidates, it needs analysing the reason for renege here.
We check the counts per City.
The number of Direct Recruitments in Chennai and Noida are far more than the number of Direct Recruitments in Bangalore. However, the number of reneges in Bangalore is similar to that in Chennai and Noida. So, we examine the case of Bangalore.
Let us examine the following.
We examine the Hike Expected versus the Hike Offered to the Candidates approached through Direct Recruitment for Bangalore who did not join,
We notice that there is a consistency when it comes to making offers to the Candidates. However, in general, Candidates were offered less than their expectation. Let us further examine this by Age and Gender.
We plot the graph of Candiates who came through Direct Recruitment in Bangalore and did not Join.
Gender 1 indicates Male Candidates. Status 2 indicates Female Candidate.
Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.
We notice that the renege percentage is absolutely similar for MAle Candidates and Female Candidates.
We plot the graph of Candiates who came through Direct Recruitment in Bangalore.
Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.
We notice that renege is consistently high for all ages except for age group less than 25 and for age of 34.
It is difficult to interpret the case of age of 34. The first question is why the number of offers made for Candidates aged 34 was so high. We check the LOB for which these Candidates were demanded.
We plot the graph of Candiates who came through Direct Recruitment in Bangalore aged 34.
The Codes for the Line of Business (LOB) are as follows.
1 - INFRA 2 - Healthcare 3 - BFSI 4 - CSMP 5 - ETS 6 - AXON 7 - EAS 8 - SALES 9 - CORP 10 - BSERV 11 - MMS 12 - ERS
Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.
We notice that renege is especially high for EAS LOB. Let us see if this LOB can be shifted to any other location.
We plot the graph of Candiates who came through Direct Recruitment aged 34 for LOB EAS.
The Codes for the Cities (Location) are as follows.
1 - Chennai 2 - Gurgaon 3 - Bangalorel 4 - Kolkata 5 - Mumbai 6 - Hyderabad 7 - Cochin 8 - Pune 9 - Coimbatore 10 - Ahmedabad 11 - Noida 12 - Others
Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.
As the renege in EAS LOB is exceptionally high in Bangalore compared to the other locations, we can conclude that it is better to shut down the EAS LOB in Bangalore and operate from other locations.
The Codes for the Source are as follows.
1 - Agency 2 - Direct 3 - Employee Referral
Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.
We already know the following.
Let us examine the Characteristics with respect to the LOBs.
Let us examine the LOBs for which the Candidates were recruited through Employee Referrals.
The Codes for the Line of Business (LOB) are as follows.
1 - INFRA 2 - Healthcare 3 - BFSI 4 - CSMP 5 - ETS 6 - AXON 7 - EAS 8 - SALES 9 - CORP 10 - BSERV 11 - MMS 12 - ERS
Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.
We see that the levelof renege across all the LOBs for Candidates obtained through Employee Referral is between 10-30. So, there is not much to choose from through there are a few exceptions. However, renege is very low in INFRA LOB. Let us examine the characteristics for this LOB.
We see that the majority of the Candidates who joined were Females.
A quarter of the Candidates were aged 34 years.
We can see that Joining Bonus was given in only 3 cases. So, Joining Bonus is not a motivation for Joining.
Lastly, we see the correlation between Hike Offered versus Hike Expected.
We see that Hike Offered is much lower than the Hike Expected.
We do not find anything exception here except that it is advised to recruit Female Candidates through Employee Referrals in the INFRA LOB in the Lower Bands.
Let us examine the LOBs for which the Candidates were recruited through Agencies.
The Codes for the Line of Business (LOB) are as follows.
1 - INFRA 2 - Healthcare 3 - BFSI 4 - CSMP 5 - ETS 6 - AXON 7 - EAS 8 - SALES 9 - CORP 10 - BSERV 11 - MMS 12 - ERS
Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.
We notice that the renege percentage is generally very high most of the LOBs. We know that majority of the Candidates are recruited directly. So, it advised to reduce the recruitment through Agencies.
Let us examine the LOBs for which the Candidates were recruited through Agencies.
The Codes for the Line of Business (LOB) are as follows.
1 - INFRA 2 - Healthcare 3 - BFSI 4 - CSMP 5 - ETS 6 - AXON 7 - EAS 8 - SALES 9 - CORP 10 - BSERV 11 - MMS 12 - ERS
Status 1 indicates that the Candidate Joined. Status 2 indicates that the Candidate did not Join.
The data is usual except that BSERV has remarkably low renege. Let us examine the characteristics for this LOB.
We see that the majority of the Candidates who joined were Females.
Almost all the Candidates were aged 34 years.
We can see that Joining Bonus was given in only 1 cases. So, Joining Bonus is not a motivation for Joining.
Lastly, we see the correlation between Hike Offered versus Hike Expected.
From the above charts, we can conclude the the preferred mode of recruitment should be through Direct Recruitment.
In this section, we discuss the analysis questions regarding the Hikes.
We have found answer for this question in the previous sections. However, we will look at some data.
First we examine the Hike Offered versus Hike Expected for all the Candidates who did not join.
## Expected o_m1000_m50 o_m50_0 o_0_10 o_10_50 o_50_100 o_100_200 o_200_1000
## 1 e_m1000_m50 0 2 0 1 0 0 0
## 2 e_m50_0 2 23 4 5 0 0 0
## 3 e_0_10 2 37 95 27 4 5 0
## 4 e_10_50 1 215 126 1409 88 23 12
## 5 e_50_100 0 36 19 385 395 25 3
## 6 e_100_200 0 1 1 14 38 37 1
## 7 e_200_1000 0 0 0 0 0 5 1
If you notice the diagnoal of this data, you will notice that majority of the Candidates were offered the hike that they expected.
So, we can conclude that the compensation package offered is not a reason for renege.
The following can be conclude from the above analysis.
We now create the Model for being able to predict renege based on the data that we have.
We split the dataset such that we assign 75% of the data for Training the Model. The rest 25% of the data will be used to Test the Model.
Number of Records in Training Set = 8572 Number of Records in Test Set = 2858
** Summary of Training Set**
## DOJ_Extended Notice_Period Offered_Band Percent_Hike_Expected
## 1:3666 Min. : 0.00 1: 689 Min. :-68.83
## 2:4906 1st Qu.: 30.00 2:4875 1st Qu.: 25.00
## Median : 30.00 3:2438 Median : 38.89
## Mean : 37.66 4: 479 Mean : 41.93
## 3rd Qu.: 60.00 5: 73 3rd Qu.: 52.17
## Max. :120.00 6: 16 Max. :327.35
## 7: 2
## Percent_Hike_Offered Percent_Difference_CTC Joining_Bonus Relocation_Status
## Min. :-60.53 Min. :-67.27 1: 376 1:1035
## 1st Qu.: 20.00 1st Qu.: -8.89 2:8196 2:7537
## Median : 34.45 Median : 0.00
## Mean : 38.15 Mean : -1.94
## 3rd Qu.: 47.54 3rd Qu.: 0.00
## Max. :471.43 Max. :300.00
##
## Gender Source Relevant_Experience LOB Location
## 1:7029 1:2322 Min. : 0.000 1 :2427 1 :2951
## 2:1543 2:4761 1st Qu.: 3.000 12 :2154 11 :2711
## 3:1489 Median : 4.000 3 :1195 3 :2022
## Mean : 4.193 5 : 665 6 : 334
## 3rd Qu.: 6.000 10 : 548 5 : 183
## Max. :26.000 6 : 485 2 : 136
## (Other):1098 (Other): 235
## Age Status
## Min. :20.00 1:6281
## 1st Qu.:27.00 2:2291
## Median :30.00
## Mean :30.34
## 3rd Qu.:34.00
## Max. :62.00
##
** Summary of Test Set**
## DOJ_Extended Notice_Period Offered_Band Percent_Hike_Expected
## 1:1202 Min. : 0.00 1: 222 Min. :-57.43
## 2:1656 1st Qu.: 30.00 2:1650 1st Qu.: 25.00
## Median : 30.00 3: 803 Median : 38.89
## Mean : 37.81 4: 156 Mean : 41.73
## 3rd Qu.: 60.00 5: 21 3rd Qu.: 52.78
## Max. :120.00 6: 6 Max. :359.77
## 7: 0
## Percent_Hike_Offered Percent_Difference_CTC Joining_Bonus Relocation_Status
## Min. :-53.75 Min. :-68.950 1: 135 1: 354
## 1st Qu.: 18.42 1st Qu.: -9.510 2:2723 2:2504
## Median : 33.93 Median : 0.000
## Mean : 37.02 Mean : -2.532
## 3rd Qu.: 46.60 3rd Qu.: 0.000
## Max. :400.00 Max. :233.330
##
## Gender Source Relevant_Experience LOB Location
## 1:2339 1: 759 Min. : 0.000 12 :752 1 :1004
## 2: 519 2:1615 1st Qu.: 3.000 1 :743 11 : 889
## 3: 484 Median : 4.000 3 :402 3 : 667
## Mean : 4.124 5 :218 6 : 120
## 3rd Qu.: 5.000 10 :205 5 : 52
## Max. :22.000 4 :169 4 : 39
## (Other):369 (Other): 87
## Age Status
## Min. :22.00 1:2107
## 1st Qu.:27.00 2: 751
## Median :30.00
## Mean :30.22
## 3rd Qu.:34.00
## Max. :50.00
##
Now we create our Model using Random Forest.
##
## Call:
## randomForest(formula = Status ~ ., data = v_train, ntree = 1000, mtry = 2, importance = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 23.12%
## Confusion matrix:
## 1 2 class.error
## 1 5972 309 0.04919599
## 2 1673 618 0.73024880
We now make predictions on the Training Set.
The Accuracy of the Model on the Training Data = 92.2188520765282%
The Confusion Matrix is provides below.
## Confusion Matrix and Statistics
##
##
## v_pred_train 1 2
## 1 6272 658
## 2 9 1633
##
## Accuracy : 0.9222
## 95% CI : (0.9163, 0.9278)
## No Information Rate : 0.7327
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7817
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9986
## Specificity : 0.7128
## Pos Pred Value : 0.9051
## Neg Pred Value : 0.9945
## Prevalence : 0.7327
## Detection Rate : 0.7317
## Detection Prevalence : 0.8084
## Balanced Accuracy : 0.8557
##
## 'Positive' Class : 1
##
We now make predictions on the Test Set.
The Accuracy of the Model on the Test Data = 78.2715185444367%
The Confusion Matrix is provides below.
## Confusion Matrix and Statistics
##
##
## v_pred_test 1 2
## 1 1995 509
## 2 112 242
##
## Accuracy : 0.7827
## 95% CI : (0.7671, 0.7977)
## No Information Rate : 0.7372
## P-Value [Acc > NIR] : 9.98e-09
##
## Kappa : 0.3242
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9468
## Specificity : 0.3222
## Pos Pred Value : 0.7967
## Neg Pred Value : 0.6836
## Prevalence : 0.7372
## Detection Rate : 0.6980
## Detection Prevalence : 0.8761
## Balanced Accuracy : 0.6345
##
## 'Positive' Class : 1
##
We check the variables which are imporant for making the prediction.
## 1 2 MeanDecreaseAccuracy
## DOJ_Extended 42.81650 36.613110 52.541284
## Notice_Period 49.57371 59.624897 70.716803
## Offered_Band 32.36441 -4.069789 34.966272
## Percent_Hike_Expected 42.24829 -16.764391 36.282926
## Percent_Hike_Offered 47.07039 -2.001910 48.593526
## Percent_Difference_CTC 36.65857 31.512760 51.728554
## Joining_Bonus 19.20148 -7.675910 13.283936
## Relocation_Status 74.91316 104.480451 105.321617
## Gender 8.70433 -2.919776 5.627725
## Source 24.51146 3.471047 24.590123
## Relevant_Experience 35.80203 15.007628 47.043338
## LOB 57.37765 19.618882 66.795492
## Location 35.65446 -5.737479 29.735008
## Age 18.46604 8.502554 22.224827
## MeanDecreaseGini
## DOJ_Extended 87.36077
## Notice_Period 214.61865
## Offered_Band 72.65718
## Percent_Hike_Expected 282.04167
## Percent_Hike_Offered 318.07026
## Percent_Difference_CTC 299.51695
## Joining_Bonus 23.08735
## Relocation_Status 148.57220
## Gender 38.04041
## Source 86.87576
## Relevant_Experience 185.65279
## LOB 225.63871
## Location 143.26188
## Age 191.18886
We now create the Random Forest Model using the Caret Package. We use the same Training and Test Data.
## Random Forest
##
## 8572 samples
## 15 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8572, 8572, 8572, 8572, 8572, 8572, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 1 1
## 8 1 1
## 15 1 1
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
We now make predictions on the Training Set.
The Accuracy of the Model on the Training Data = 100%
The Confusion Matrix is provides below.
## Confusion Matrix and Statistics
##
##
## v_pred_train 1 2
## 1 6281 0
## 2 0 2291
##
## Accuracy : 1
## 95% CI : (0.9996, 1)
## No Information Rate : 0.7327
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.7327
## Detection Rate : 0.7327
## Detection Prevalence : 0.7327
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 1
##
We now make predictions on the Test Set.
The Accuracy of the Model on the Test Data = 100%
The Confusion Matrix is provides below.
## Confusion Matrix and Statistics
##
##
## v_pred_test 1 2
## 1 2107 0
## 2 0 751
##
## Accuracy : 1
## 95% CI : (0.9987, 1)
## No Information Rate : 0.7372
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.7372
## Detection Rate : 0.7372
## Detection Prevalence : 0.7372
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 1
##