Knowing whether an InMail is answered (accepted or declined) is critical for recruiters, because it lets them manage their candidate pipeline efficiently. Response rates vary across users, and we believe there are best practices for sending InMails that we can recommend to recruiters. However, the effect of those practices may be confounded by factors the recruiter cannot control, such as country and industry. In this analysis, I control for those unmanageable factors with logistic regression, identify which factors contribute to a higher response rate, and finally make recommendations to recruiters.
The data set contains 28 variables and 121727 observations. The average response rate is 27% (22% accepted, 5% declined). 137 observations are missing the sender’s information (profile score, country, company size, company industry, company type), 64 observations are missing the recipient’s information (seniority, profile score, country, company size, company industry, company type), and 14364 observations are missing the recipient function area. I found no particular pattern in the missing values. Therefore, I deleted the observations with NAs in the sender’s or recipient’s information, which leaves 121526 observations, and in the remaining data I recoded missing values of recipient function area as “unknown”. The following tables summarize the data set.
## data0_fac
##
## 28 Variables 121727 Observations
## ---------------------------------------------------------------------------
## MESSAGE_ID
## n missing unique Info Mean .05 .10
## 121727 0 121727 1 1.96e+09 8.50e-315 8.56e-315
## .25 .50 .75 .90 .95
## 8.75e-315 1.03e-314 1.06e-314 1.08e-314 1.09e-314
##
## lowest : 1697397143 1697398443 1697398773 1697400883 1697404033
## highest: 2217538538 2217541948 2217543878 2217544718 2217546668
## ---------------------------------------------------------------------------
## SENDER_MEMBER_ID
## n missing unique Info Mean .05 .10 .25
## 121727 0 54225 1 1.29e+08 2.66e+06 5.42e+06 1.86e+07
## .50 .75 .90 .95
## 8.24e+07 2.06e+08 3.45e+08 4.03e+08
##
## lowest : 6472 7067 7486 7814 9467
## highest: 485919637 486211322 486442930 486454109 486948284
## ---------------------------------------------------------------------------
## RECIPIENT_MEMBER_ID
## n missing unique Info Mean .05 .10 .25
## 121727 0 120720 1 1.29e+08 3.77e+06 8.01e+06 2.57e+07
## .50 .75 .90 .95
## 8.56e+07 2.05e+08 3.28e+08 3.87e+08
##
## lowest : -4 1311 1757 1823 2056
## highest: 485608269 485838981 486151350 486373084 486497557
## ---------------------------------------------------------------------------
## REPLY_STATUS
## n missing unique
## 121727 0 3
##
## ACCEPTED (26466, 22%), DECLINED (6614, 5%)
## PENDING (88647, 73%)
## ---------------------------------------------------------------------------
## INMAIL_TEMPLATE_USED
## n missing unique
## 121727 0 2
##
## N (51090, 42%), Y (70637, 58%)
## ---------------------------------------------------------------------------
## MESSAGE_TYPE
## n missing unique
## 121727 0 2
##
## MULTIPLE (61229, 50%), SINGLE (60498, 50%)
## ---------------------------------------------------------------------------
## SENDER_PROFLE_SCORE
## n missing unique Info Mean .05 .10 .25 .50
## 121590 137 91 0.99 61.6 51 55 59 62
## .75 .90 .95
## 65 68 70
##
## lowest : 0 4 5 6 9, highest: 92 93 95 98 99
## ---------------------------------------------------------------------------
## SENDER_COUNTRY
## n missing unique
## 121590 137 130
##
## lowest : Albania Andorra Angola Antigua and Barbuda Argentina
## highest: Uruguay Venezuela Viet Nam Virgin Islands, U.S. Zimbabwe
## ---------------------------------------------------------------------------
## SENDER_COMPANY_SIZE
## n missing unique
## 121590 137 10
##
## myself only 1-10 11-50 51-200 201-500 501-1000 1001-5000
## Frequency 278 11865 20980 18391 12551 6635 16284
## % 0 10 17 15 10 5 13
## 5001-10000 10001+ unknown
## Frequency 5985 21373 7248
## % 5 18 6
## ---------------------------------------------------------------------------
## SENDER_COMPANY_INDUSTRY
## n missing unique
## 121590 137 144
##
## lowest : accounting airlines/aviation alternative medicine animation apparel and fashion
## highest: warehousing wholesale wine and spirits wireless writing and editing
## ---------------------------------------------------------------------------
## SENDER_COMPANY_TYPE
## n missing unique
## 121590 137 9
##
## 1 public company (38092, 31%)
## 2 educational (387, 0%), 3 self-employed (129, 0%)
## 4 government agency (278, 0%)
## 5 non-profit (1479, 1%), 6 self-owned (841, 1%)
## 7 privately held (69350, 57%)
## 8 partnership (3319, 3%), unknown (7715, 6%)
## ---------------------------------------------------------------------------
## RECIPIENT_DOMAIN
## n missing unique
## 121727 0 5
##
## AOL GMail Hotmail Other Yahoo!
## Frequency 1134 54921 13938 39156 12578
## % 1 45 11 32 10
## ---------------------------------------------------------------------------
## RECIPIENT_FUNCTION_AREA
## n missing unique
## 107363 14364 26
##
## lowest : Accounting Administrative Arts and Design Business Development Community and Social Services
## highest: Quality Assurance Real Estate Research Sales Support
## ---------------------------------------------------------------------------
## RECIPIENT_SENIORITY
## n missing unique
## 121663 64 11
##
## unknown Unpaid Training Entry Senior Manager Director VP
## Frequency 2226 83 830 35099 47185 19203 8097 4423
## % 2 0 1 29 39 16 7 4
## Partner CXO Owner
## Frequency 1124 2455 938
## % 1 2 1
## ---------------------------------------------------------------------------
## RECIPIENT_PROFLE_SCORE
## n missing unique Info Mean .05 .10 .25 .50
## 121663 64 92 1 59.9 43 49 57 61
## .75 .90 .95
## 64 68 72
##
## lowest : 0 5 6 10 11, highest: 96 97 98 99 100
## ---------------------------------------------------------------------------
## RECIPIENT_COUNTRY
## n missing unique
## 121663 64 158
##
## lowest : Afghanistan Algeria Andorra Angola Argentina
## highest: Viet Nam Virgin Islands, British Virgin Islands, U.S. Zambia Zimbabwe
## ---------------------------------------------------------------------------
## RECIPIENT_COMPANY_SIZE
## n missing unique
## 121663 64 10
##
## myself only 1-10 11-50 51-200 201-500 501-1000 1001-5000
## Frequency 329 2869 8441 12123 9156 7120 17915
## % 0 2 7 10 8 6 15
## 5001-10000 10001+ unknown
## Frequency 7926 38988 16796
## % 7 32 14
## ---------------------------------------------------------------------------
## RECIPIENT_COMPANY_INDUSTRY
## n missing unique
## 121663 64 149
##
## lowest : accounting airlines/aviation alternative dispute resolution alternative medicine animation
## highest: warehousing wholesale wine and spirits wireless writing and editing
## ---------------------------------------------------------------------------
## RECIPIENT_COMPANY_TYPE
## n missing unique
## 121663 64 9
##
## 1 public company (49281, 41%)
## 2 educational (1778, 1%)
## 3 self-employed (206, 0%)
## 4 government agency (1725, 1%)
## 5 non-profit (3940, 3%), 6 self-owned (502, 0%)
## 7 privately held (41386, 34%)
## 8 partnership (4192, 3%), unknown (18653, 15%)
## ---------------------------------------------------------------------------
## INMAIL_BODY_WORD_COUNT
## n missing unique Info Mean .05 .10 .25 .50
## 121727 0 459 1 144 55 68 93 128
## .75 .90 .95
## 182 250 287
##
## lowest : 1 6 7 8 9, highest: 652 662 731 747 775
## ---------------------------------------------------------------------------
## INMAIL_SUBJECT_WORD_COUNT
## n missing unique Info Mean
## 121727 0 9 0.98 4.19
##
## 1 2 3 4 5 6 7 8 9
## Frequency 22219 15949 15737 15234 14577 12780 10501 8364 6366
## % 18 13 13 13 12 10 9 7 5
## ---------------------------------------------------------------------------
## INMAIL_ATTACHMENT_COUNT
## n missing unique Info Mean
## 121727 0 6 0.2 0.0777
##
## 0 1 2 3 4 5
## Frequency 113148 7912 523 93 31 20
## % 93 6 0 0 0 0
## ---------------------------------------------------------------------------
## INMAIL_HAS_EMAIL_ADDRESS
## n missing unique Info Sum Mean
## 121727 0 2 0.75 58984 0.485
## ---------------------------------------------------------------------------
## IS_COMPANY_FOLLOWER
## n missing unique Info Sum Mean
## 121727 0 2 0.75 63704 0.523
## ---------------------------------------------------------------------------
## INMAIL_HAS_HYPERLINKS
## n missing unique Info Sum Mean
## 121727 0 2 0.75 61916 0.509
## ---------------------------------------------------------------------------
## HAS_COMMON_GROUPS
## n missing unique Info Sum Mean
## 121727 0 2 0.75 64348 0.529
## ---------------------------------------------------------------------------
## HAS_COMPANY_CONNECTIONS
## n missing unique Info Sum Mean
## 121727 0 2 0.75 61159 0.502
## ---------------------------------------------------------------------------
## response
## n missing unique
## 121727 0 2
##
## Y (33080, 27%), N (88647, 73%)
## ---------------------------------------------------------------------------
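The missing-value handling and the response indicator described before the summary can be sketched on a toy data frame. This is only an illustration under stated assumptions: the column names follow the summary above, but the rows are invented, the object names `toy`, `required` and `clean` are hypothetical, and only two of the sender/recipient columns are checked for brevity.

```r
# Toy stand-in for the InMail data; values are made up for illustration.
toy <- data.frame(
  REPLY_STATUS            = c("ACCEPTED", "PENDING", "DECLINED", "PENDING", "ACCEPTED"),
  SENDER_COUNTRY          = c("United States", NA, "India", "Canada", "India"),
  RECIPIENT_SENIORITY     = c("Senior", "Entry", NA, "Manager", "Director"),
  RECIPIENT_FUNCTION_AREA = c("Sales", "Research", "Support", NA, NA),
  stringsAsFactors = FALSE
)

# Columns whose missing values cause the row to be dropped (a subset, for brevity).
required <- c("SENDER_COUNTRY", "RECIPIENT_SENIORITY")
clean <- toy[complete.cases(toy[, required]), ]

# A missing function area is kept as its own level instead of dropping the row.
clean$RECIPIENT_FUNCTION_AREA[is.na(clean$RECIPIENT_FUNCTION_AREA)] <- "unknown"

# Accepted and declined InMails both count as a response.
clean$response <- factor(
  ifelse(clean$REPLY_STATUS %in% c("ACCEPTED", "DECLINED"), "Y", "N"),
  levels = c("N", "Y")
)

print(clean)
```

Recoding the function area rather than deleting those rows keeps the roughly 12% of observations that lack it, at the cost of one extra factor level.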
Before modeling, I plotted the response against the other variables to gain some intuition. This report shows only a few of the plots; the others can be reproduced by running my code.
library(vcd)    # mosaic()
library(Hmisc)  # cut2()
mosaic(response ~ IS_COMPANY_FOLLOWER, data = data0_fac, shade = TRUE, na.action = na.omit)
mosaic(response ~ RECIPIENT_SENIORITY, data = data0_fac, shade = TRUE, na.action = na.omit)
# Bin body word count into 5 quantile groups before plotting
mosaic(response ~ cut2(INMAIL_BODY_WORD_COUNT, g = 5), data = data0_fac, shade = TRUE, na.action = na.omit)
From the plots above, we can see that:
In this part, I control for the unmanageable factors and identify which manageable factors may contribute to a higher response rate. I fit a logistic regression model with the response indicator as the dependent variable and all variables except the IDs and REPLY_STATUS as predictors. I chose logistic regression because it is more interpretable than most other models, and its coefficients help gauge the importance of each predictor.
Many of the predictors are categorical with many levels, so I first build the design matrix by dummy coding and store it as a sparse matrix for computational efficiency. The design matrix has 121526 rows and 664 columns.
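# Model matrix
library(data.table)  # data1 is a data.table
library(Matrix)      # sparse.model.matrix()
pred_var <- data1[, 5:27, with = FALSE]   # predictor columns only
X <- sparse.model.matrix(~ ., pred_var)   # dummy-coded sparse design matrix
y <- factor(data1$response, levels = c("N", "Y"))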
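To make the dummy coding concrete, here is a small sketch with base R's `model.matrix` on an invented data frame (`toy` and `X_toy` are hypothetical names, and the rows are made up). `sparse.model.matrix` produces the same columns but stores them sparsely.

```r
# Toy predictors; the level names mirror the data set, the rows are invented.
toy <- data.frame(
  MESSAGE_TYPE           = factor(c("MULTIPLE", "SINGLE", "SINGLE", "MULTIPLE")),
  RECIPIENT_SENIORITY    = factor(c("Entry", "Senior", "Manager", "Entry"),
                                  levels = c("Entry", "Senior", "Manager")),
  INMAIL_BODY_WORD_COUNT = c(93, 128, 182, 250)
)

# Treatment (dummy) coding: the first level of each factor is the baseline
# and gets no column; every other level becomes a 0/1 indicator column.
X_toy <- model.matrix(~ ., data = toy)

print(colnames(X_toy))
print(dim(X_toy))
```

Each factor with k levels contributes k - 1 indicator columns, which is how 23 predictors with many country and industry levels expand into several hundred columns.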
Next, I fit the logistic regression model. Because the number of predictors is very large, I use elastic net regularization, which can learn a sparse model in which few of the weights are non-zero, as the Lasso does, while retaining the regularization properties of Ridge.
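For reference, the penalized objective can be written as follows. This is the general form used in the glmnet documentation for binomial models, with \(y_i \in \{0,1\}\), \(x_i\) the \(i\)-th row of the design matrix and \(N\) the number of observations; it is included as a reminder of the roles of \(\alpha\) and \(\lambda\), not as a statement of the exact fit used here (the intercept \(\beta_0\), for instance, can be switched off).

```latex
\[
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i\,(\beta_0 + x_i^{\top}\beta)
 - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big)\Big]
\;+\; \lambda\Big[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\Big]
\]
```

Here \(\alpha = 1\) gives the Lasso penalty, \(\alpha = 0\) gives the Ridge penalty, and \(\lambda\) sets the overall strength of the penalty.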
To choose the elastic net parameters \(\alpha\) and \(\lambda\), I perform 10-fold cross-validation and use AUC as the measure of model performance because the classes are unbalanced. To speed up the computation, the folds are fitted in parallel with foreach.
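As a brief illustration of why AUC suits an unbalanced sample, the sketch below computes AUC from ranks (the Mann-Whitney form) on invented scores. The helper `auc_rank` and the toy vectors are hypothetical and are not part of the analysis code.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic):
# the probability that a randomly chosen positive is scored above
# a randomly chosen negative, counting ties as one half.
auc_rank <- function(score, label) {
  pos <- label == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(score)  # average ranks handle ties
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented example with roughly the class balance of the data (about 27% responses).
label <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0)
score <- c(0.9, 0.2, 0.4, 0.6, 0.5, 0.1, 0.3, 0.35, 0.25, 0.15, 0.05)

auc_model       <- auc_rank(score, label)
auc_constant    <- auc_rank(rep(0.27, length(label)), label)  # same score for everyone
accuracy_all_no <- mean(label == 0)  # accuracy of always predicting "no response"

print(c(auc_model = auc_model, auc_constant = auc_constant,
        accuracy_all_no = accuracy_all_no))
```

A rule that always predicts "no response" reaches about 73% accuracy on such data while carrying no information, whereas its AUC is 0.5, so AUC separates informative models from the trivial one.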
# Parallel
library(glmnet)
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
# Choose alpha by cv, reusing the same fold assignment for every alpha
out_compare <- NULL
foldid <- sample(1:10, size = length(y), replace = TRUE)
for (alpha in seq(0, 1, by = 0.1)) {
  cvfit <- cv.glmnet(X, y, family = "binomial", type.measure = "auc",
                     foldid = foldid, alpha = alpha, intercept = FALSE, parallel = TRUE)
  # lambda.min is the lambda with the best cross-validated AUC for this alpha
  out_compare <- rbind(out_compare, c(alpha, cvfit$lambda.min, max(cvfit$cvm)))
  print(alpha)
}
colnames(out_compare) <- c("alpha", "lambda.min", "cvm")
stopCluster(cl)
# Fit model with chosen alpha and lambda
best <- which.max(out_compare[, "cvm"])
alpha_fin <- out_compare[best, "alpha"]
lambda_fin <- out_compare[best, "lambda.min"]
fit <- glmnet(X, y, family = "binomial", alpha = alpha_fin, intercept = FALSE)
beta.vec <- coef(fit, s = lambda_fin)
beta.vec <- beta.vec[order(beta.vec), ]
alpha lambda.min cvm
[1,] 0.0 0.2398284596 0.6365050
[2,] 0.1 0.0023982846 0.6609083
[3,] 0.2 0.0011991423 0.6617092
[4,] 0.3 0.0007994282 0.6620541
[5,] 0.4 0.0005995711 0.6622058
[6,] 0.5 0.0004796569 0.6623238
[7,] 0.6 0.0003997141 0.6623305
[8,] 0.7 0.0003426121 0.6623637
[9,] 0.8 0.0002997856 0.6624590
[10,] 0.9 0.0002664761 0.6624812
[11,] 1.0 0.0002398285 0.6625114
From the table above, I choose \(\alpha=1\) and \(\lambda=0.0002398285\), which maximize the cross-validated AUC (0.6625). Finally, I fit the regression model on all the data with the chosen parameters and extract the coefficient vector. This model explains a fraction of 0.2120581 (about 21%) of the null deviance.
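The "fraction of null deviance explained" is one minus the ratio of the model deviance to the null deviance. The sketch below shows the calculation with a plain `glm` on simulated data; the data and the names `fit_toy` and `dev_ratio` are invented. glmnet reports a quantity of this form as `dev.ratio`, and I am assuming that is where the figure quoted above comes from.

```r
# Simulated data, for illustration only.
set.seed(1)
n <- 500
x <- rnorm(n)
y_toy <- rbinom(n, 1, plogis(-1 + 0.8 * x))

fit_toy <- glm(y_toy ~ x, family = binomial)

# Fraction of null deviance explained: 1 - residual deviance / null deviance.
dev_ratio <- 1 - fit_toy$deviance / fit_toy$null.deviance
print(dev_ratio)
```

A value near 0 means the predictors add little over the null model, and a value near 1 means the model fits the observed responses almost perfectly.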
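Country, industry and function area are usually fixed when a recruiter sends InMails for a given position, and email domains and the sender’s company information cannot be changed. So here I only examine part of the coefficient vector. From the model coefficients, we can see that, after controlling for the other variables, IS_COMPANY_FOLLOWER has the largest positive coefficient among the manageable factors (0.33), followed by MESSAGE_TYPE (0.22 for SINGLE) and HAS_COMPANY_CONNECTIONS (0.20). In the negative direction, INMAIL_HAS_HYPERLINKS (-0.25) and INMAIL_HAS_EMAIL_ADDRESS (-0.22) have the largest coefficients.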
RECIPIENT_COMPANY_SIZE10001+ RECIPIENT_COMPANY_SIZE1001-5000
[1,] -0.9144342 -0.8897663
RECIPIENT_COMPANY_SIZE201-500 RECIPIENT_COMPANY_SIZE11-50
[1,] -0.8700662 -0.8560641
RECIPIENT_COMPANY_SIZE51-200 RECIPIENT_COMPANY_SIZE501-1000
[1,] -0.8517993 -0.8466014
RECIPIENT_COMPANY_SIZE5001-10000 RECIPIENT_COMPANY_SIZE1-10
[1,] -0.811656 -0.7984325
RECIPIENT_COMPANY_SIZEunknown RECIPIENT_SENIORITYUnpaid
[1,] -0.7855443 -0.6262941
RECIPIENT_SENIORITYTraining RECIPIENT_COMPANY_TYPE3 self-employed
[1,] -0.5531715 -0.4871423
RECIPIENT_SENIORITYEntry RECIPIENT_SENIORITYSenior
[1,] -0.4024081 -0.3043337
RECIPIENT_COMPANY_TYPE8 partnership INMAIL_HAS_HYPERLINKS
[1,] -0.2866305 -0.2480928
RECIPIENT_SENIORITYPartner RECIPIENT_SENIORITYOwner
[1,] -0.2421849 -0.2337428
RECIPIENT_COMPANY_TYPE6 self-owned INMAIL_HAS_EMAIL_ADDRESS
[1,] -0.224414 -0.2221354
RECIPIENT_COMPANY_TYPE5 non-profit RECIPIENT_SENIORITYManager
[1,] -0.2084185 -0.19667
RECIPIENT_SENIORITYCXO RECIPIENT_SENIORITYVP INMAIL_TEMPLATE_USEDY
[1,] -0.1220782 -0.06583554 -0.06334427
RECIPIENT_COMPANY_TYPEunknown RECIPIENT_COMPANY_TYPE7 privately held
[1,] -0.062577 -0.05134828
INMAIL_ATTACHMENT_COUNT RECIPIENT_SENIORITYDirector
[1,] -0.04089416 -0.03007943
RECIPIENT_COMPANY_TYPE4 government agency SENDER_PROFLE_SCORE
[1,] -0.01540637 -0.00678554
INMAIL_BODY_WORD_COUNT (Intercept) INMAIL_SUBJECT_WORD_COUNT
[1,] -0.001336112 0 0.0003207711
HAS_COMMON_GROUPS RECIPIENT_COMPANY_TYPE2 educational
[1,] 0.008842989 0.01967159
RECIPIENT_PROFLE_SCORE HAS_COMPANY_CONNECTIONS MESSAGE_TYPESINGLE
[1,] 0.04635391 0.2048318 0.2179077
IS_COMPANY_FOLLOWER
[1,] 0.3333688
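Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to read. The sketch below does this for a few of the manageable factors, with the values copied from the output above; the vector names `beta` and `odds_ratio` are hypothetical. Because the coefficients come from a penalized fit, they are shrunk toward zero and should be read as rough effect sizes.

```r
# Coefficients copied from the output above (log-odds scale).
beta <- c(
  IS_COMPANY_FOLLOWER      = 0.3333688,
  MESSAGE_TYPESINGLE       = 0.2179077,
  HAS_COMPANY_CONNECTIONS  = 0.2048318,
  INMAIL_TEMPLATE_USEDY    = -0.06334427,
  INMAIL_HAS_EMAIL_ADDRESS = -0.2221354,
  INMAIL_HAS_HYPERLINKS    = -0.2480928
)

# exp(beta) is the multiplicative change in the odds of a response
# when the indicator switches from 0 to 1, other predictors held fixed.
odds_ratio <- exp(beta)
print(round(odds_ratio, 2))
```

For example, the IS_COMPANY_FOLLOWER coefficient of 0.33 corresponds to roughly 1.40 times the odds of a response, while the negative coefficients give odds ratios below 1.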
After carefully examining the coefficient vector, we can make the following recommendations to recruiters:
To further improve the model, we can try the following:
All the code I used can be found here.