Knowing how recipients respond to InMails (accept or decline) is critical for recruiters, because it lets them manage their candidate pipelines efficiently. Response rates vary across users, and we believe there are best practices for sending InMails that we can recommend to recruiters. However, the factors behind those practices may be confounded with unmanageable factors such as country and industry. In this analysis, I control for those unmanageable factors using logistic regression, identify which factors contribute to a higher response rate, and finally make recommendations to recruiters.


I. Data Exploration

The data set contains 28 variables and 121727 observations. The overall response rate is 27% (22% accepted, 5% declined). Sender information (profile score, country, company size, company industry, company type) is missing for 137 observations, recipient information (seniority, profile score, country, company size, company industry, company type) is missing for 64 observations, and recipient function area is missing for 14364 observations. I found no particular pattern in the missingness. I therefore deleted the observations with NAs in sender or recipient information, which leaves 121526 observations, and recoded the missing values of recipient function area as “unknown” in the remaining data. The following tables summarize the data set.

## data0_fac 
## 
##  28  Variables      121727  Observations
## ---------------------------------------------------------------------------
## MESSAGE_ID 
##         n   missing    unique      Info      Mean       .05       .10 
##    121727         0    121727         1  1.96e+09 8.50e-315 8.56e-315 
##       .25       .50       .75       .90       .95 
## 8.75e-315 1.03e-314 1.06e-314 1.08e-314 1.09e-314 
## 
## lowest : 1697397143 1697398443 1697398773 1697400883 1697404033
## highest: 2217538538 2217541948 2217543878 2217544718 2217546668 
## ---------------------------------------------------------------------------
## SENDER_MEMBER_ID 
##        n  missing   unique     Info     Mean      .05      .10      .25 
##   121727        0    54225        1 1.29e+08 2.66e+06 5.42e+06 1.86e+07 
##      .50      .75      .90      .95 
## 8.24e+07 2.06e+08 3.45e+08 4.03e+08 
## 
## lowest :      6472      7067      7486      7814      9467
## highest: 485919637 486211322 486442930 486454109 486948284 
## ---------------------------------------------------------------------------
## RECIPIENT_MEMBER_ID 
##        n  missing   unique     Info     Mean      .05      .10      .25 
##   121727        0   120720        1 1.29e+08 3.77e+06 8.01e+06 2.57e+07 
##      .50      .75      .90      .95 
## 8.56e+07 2.05e+08 3.28e+08 3.87e+08 
## 
## lowest :        -4      1311      1757      1823      2056
## highest: 485608269 485838981 486151350 486373084 486497557 
## ---------------------------------------------------------------------------
## REPLY_STATUS 
##       n missing  unique 
##  121727       0       3 
## 
## ACCEPTED (26466, 22%), DECLINED (6614, 5%) 
## PENDING (88647, 73%) 
## ---------------------------------------------------------------------------
## INMAIL_TEMPLATE_USED 
##       n missing  unique 
##  121727       0       2 
## 
## N (51090, 42%), Y (70637, 58%) 
## ---------------------------------------------------------------------------
## MESSAGE_TYPE 
##       n missing  unique 
##  121727       0       2 
## 
## MULTIPLE (61229, 50%), SINGLE (60498, 50%) 
## ---------------------------------------------------------------------------
## SENDER_PROFLE_SCORE 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##  121590     137      91    0.99    61.6      51      55      59      62 
##     .75     .90     .95 
##      65      68      70 
## 
## lowest :  0  4  5  6  9, highest: 92 93 95 98 99 
## ---------------------------------------------------------------------------
## SENDER_COUNTRY 
##       n missing  unique 
##  121590     137     130 
## 
## lowest : Albania              Andorra              Angola               Antigua and Barbuda  Argentina           
## highest: Uruguay              Venezuela            Viet Nam             Virgin Islands, U.S. Zimbabwe             
## ---------------------------------------------------------------------------
## SENDER_COMPANY_SIZE 
##       n missing  unique 
##  121590     137      10 
## 
##           myself only  1-10 11-50 51-200 201-500 501-1000 1001-5000
## Frequency         278 11865 20980  18391   12551     6635     16284
## %                   0    10    17     15      10        5        13
##           5001-10000 10001+ unknown
## Frequency       5985  21373    7248
## %                  5     18       6
## ---------------------------------------------------------------------------
## SENDER_COMPANY_INDUSTRY 
##       n missing  unique 
##  121590     137     144 
## 
## lowest : accounting           airlines/aviation    alternative medicine animation            apparel and fashion 
## highest: warehousing          wholesale            wine and spirits     wireless             writing and editing  
## ---------------------------------------------------------------------------
## SENDER_COMPANY_TYPE 
##       n missing  unique 
##  121590     137       9 
## 
## 1 public company (38092, 31%) 
## 2 educational (387, 0%), 3 self-employed (129, 0%) 
## 4 government agency (278, 0%) 
## 5 non-profit (1479, 1%), 6 self-owned (841, 1%) 
## 7 privately held (69350, 57%) 
## 8 partnership (3319, 3%), unknown (7715, 6%) 
## ---------------------------------------------------------------------------
## RECIPIENT_DOMAIN 
##       n missing  unique 
##  121727       0       5 
## 
##            AOL GMail Hotmail Other Yahoo!
## Frequency 1134 54921   13938 39156  12578
## %            1    45      11    32     10
## ---------------------------------------------------------------------------
## RECIPIENT_FUNCTION_AREA 
##       n missing  unique 
##  107363   14364      26 
## 
## lowest : Accounting                    Administrative                Arts and Design               Business Development          Community and Social Services
## highest: Quality Assurance             Real Estate                   Research                      Sales                         Support                       
## ---------------------------------------------------------------------------
## RECIPIENT_SENIORITY 
##       n missing  unique 
##  121663      64      11 
## 
##           unknown Unpaid Training Entry Senior Manager Director   VP
## Frequency    2226     83      830 35099  47185   19203     8097 4423
## %               2      0        1    29     39      16        7    4
##           Partner  CXO Owner
## Frequency    1124 2455   938
## %               1    2     1
## ---------------------------------------------------------------------------
## RECIPIENT_PROFLE_SCORE 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##  121663      64      92       1    59.9      43      49      57      61 
##     .75     .90     .95 
##      64      68      72 
## 
## lowest :   0   5   6  10  11, highest:  96  97  98  99 100 
## ---------------------------------------------------------------------------
## RECIPIENT_COUNTRY 
##       n missing  unique 
##  121663      64     158 
## 
## lowest : Afghanistan             Algeria                 Andorra                 Angola                  Argentina              
## highest: Viet Nam                Virgin Islands, British Virgin Islands, U.S.    Zambia                  Zimbabwe                
## ---------------------------------------------------------------------------
## RECIPIENT_COMPANY_SIZE 
##       n missing  unique 
##  121663      64      10 
## 
##           myself only 1-10 11-50 51-200 201-500 501-1000 1001-5000
## Frequency         329 2869  8441  12123    9156     7120     17915
## %                   0    2     7     10       8        6        15
##           5001-10000 10001+ unknown
## Frequency       7926  38988   16796
## %                  7     32      14
## ---------------------------------------------------------------------------
## RECIPIENT_COMPANY_INDUSTRY 
##       n missing  unique 
##  121663      64     149 
## 
## lowest : accounting                     airlines/aviation              alternative dispute resolution alternative medicine           animation                     
## highest: warehousing                    wholesale                      wine and spirits               wireless                       writing and editing            
## ---------------------------------------------------------------------------
## RECIPIENT_COMPANY_TYPE 
##       n missing  unique 
##  121663      64       9 
## 
## 1 public company (49281, 41%) 
## 2 educational (1778, 1%) 
## 3 self-employed (206, 0%) 
## 4 government agency (1725, 1%) 
## 5 non-profit (3940, 3%), 6 self-owned (502, 0%) 
## 7 privately held (41386, 34%) 
## 8 partnership (4192, 3%), unknown (18653, 15%) 
## ---------------------------------------------------------------------------
## INMAIL_BODY_WORD_COUNT 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##  121727       0     459       1     144      55      68      93     128 
##     .75     .90     .95 
##     182     250     287 
## 
## lowest :   1   6   7   8   9, highest: 652 662 731 747 775 
## ---------------------------------------------------------------------------
## INMAIL_SUBJECT_WORD_COUNT 
##       n missing  unique    Info    Mean 
##  121727       0       9    0.98    4.19 
## 
##               1     2     3     4     5     6     7    8    9
## Frequency 22219 15949 15737 15234 14577 12780 10501 8364 6366
## %            18    13    13    13    12    10     9    7    5
## ---------------------------------------------------------------------------
## INMAIL_ATTACHMENT_COUNT 
##       n missing  unique    Info    Mean 
##  121727       0       6     0.2  0.0777 
## 
##                0    1   2  3  4  5
## Frequency 113148 7912 523 93 31 20
## %             93    6   0  0  0  0
## ---------------------------------------------------------------------------
## INMAIL_HAS_EMAIL_ADDRESS 
##       n missing  unique    Info     Sum    Mean 
##  121727       0       2    0.75   58984   0.485 
## ---------------------------------------------------------------------------
## IS_COMPANY_FOLLOWER 
##       n missing  unique    Info     Sum    Mean 
##  121727       0       2    0.75   63704   0.523 
## ---------------------------------------------------------------------------
## INMAIL_HAS_HYPERLINKS 
##       n missing  unique    Info     Sum    Mean 
##  121727       0       2    0.75   61916   0.509 
## ---------------------------------------------------------------------------
## HAS_COMMON_GROUPS 
##       n missing  unique    Info     Sum    Mean 
##  121727       0       2    0.75   64348   0.529 
## ---------------------------------------------------------------------------
## HAS_COMPANY_CONNECTIONS 
##       n missing  unique    Info     Sum    Mean 
##  121727       0       2    0.75   61159   0.502 
## ---------------------------------------------------------------------------
## response 
##       n missing  unique 
##  121727       0       2 
## 
## Y (33080, 27%), N (88647, 73%) 
## ---------------------------------------------------------------------------
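
The missing-value handling described at the start of this section can be sketched on a toy data frame. This is only an illustration under stated assumptions: the values are made up, only two of the sender/recipient columns are used, and the object names `toy`, `required`, and `toy_clean` are hypothetical rather than taken from the original code.

```r
# Toy data standing in for the real data set (values are made up)
toy <- data.frame(
  SENDER_COUNTRY          = c("United States", NA, "India", "Canada"),
  RECIPIENT_SENIORITY     = c("Senior", "Entry", NA, "Manager"),
  RECIPIENT_FUNCTION_AREA = c("Sales", "Research", "Support", NA),
  stringsAsFactors = FALSE
)

# Columns whose NAs cause a row to be dropped (a subset, for illustration)
required <- c("SENDER_COUNTRY", "RECIPIENT_SENIORITY")
toy_clean <- toy[complete.cases(toy[, required]), ]

# Recode the remaining missing function areas as their own level
toy_clean$RECIPIENT_FUNCTION_AREA[is.na(toy_clean$RECIPIENT_FUNCTION_AREA)] <- "unknown"
toy_clean$RECIPIENT_FUNCTION_AREA <- factor(toy_clean$RECIPIENT_FUNCTION_AREA)

print(toy_clean)
```

Keeping “unknown” as a separate factor level lets the model estimate a coefficient for it instead of discarding those rows.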

Before modeling, I plotted the response against the other variables to build some intuition. This report shows only a few of these plots; the rest can be reproduced by running my code.

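library(vcd)    # mosaic()
library(Hmisc)  # cut2()

# Response vs. company-follower flag
mosaic(response ~ IS_COMPANY_FOLLOWER, data = data0_fac, shade = TRUE, na.action = na.omit)
# Response vs. recipient seniority
mosaic(response ~ RECIPIENT_SENIORITY, data = data0_fac, shade = TRUE, na.action = na.omit)
# Response vs. body word count, cut into 5 quantile groups
mosaic(response ~ cut2(INMAIL_BODY_WORD_COUNT, g = 5), data = data0_fac, shade = TRUE, na.action = na.omit)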

From the plots above, we can see that:


II. Logistic Regression with Elastic Net Penalty

In this part, I control for the unmanageable factors and work out which manageable factors may contribute to a higher response rate. I fit a logistic regression model with the response indicator as the dependent variable and all variables except the IDs and REPLY_STATUS as predictors. I chose logistic regression because it is more interpretable than most other models, and its coefficients help assess the importance of each predictor.

Many of the predictors are categorical variables with many levels, so I first create the design matrix by dummy coding and store it as a sparse matrix for computational efficiency. The design matrix has 121526 rows and 664 columns.

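# Model matrix
library(data.table)  # `with = FALSE` below is data.table syntax
library(Matrix)      # sparse.model.matrix()

# Columns 5:27 are the predictors (everything except the IDs, REPLY_STATUS and response)
pred_var <- data1[, 5:27, with = FALSE]

# Dummy-code the factors into a sparse design matrix
X <- sparse.model.matrix(~ ., pred_var)
y <- factor(data1$response, levels = c("N", "Y"))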
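
To see how dummy coding determines the number of columns, here is a minimal sketch on made-up data. The data frame `toy` and its columns `size` and `follower` are hypothetical and only mimic the shape of the real predictors.

```r
library(Matrix)

# Toy predictors (made-up values): one 3-level factor and one numeric flag
toy <- data.frame(
  size     = factor(c("1-10", "11-50", "1-10", "10001+")),
  follower = c(1, 0, 1, 0)
)

# A factor with k levels contributes k - 1 dummy columns;
# the first level is absorbed as the baseline
X_toy <- sparse.model.matrix(~ ., toy)

print(colnames(X_toy))
print(dim(X_toy))  # intercept + 2 dummies + 1 numeric column
```

Because each row has at most one non-zero entry per factor, a sparse representation stores far fewer values than a dense matrix once the factors have well over a hundred levels, as the country and industry variables do here.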

Next, I fit the logistic regression model. Because the number of predictors is very large, I use elastic net regularization, which can learn a sparse model with few non-zero weights, as the lasso does, while retaining the regularization properties of ridge regression.
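
For reference, the elastic net objective for a binomial model can be written as below. The notation is an assumption made here for illustration: \(N\) is the number of observations, \(x_i\) is the \(i\)-th row of the design matrix, \(y_i \in \{0, 1\}\) is the response indicator, and there is no separate intercept term, matching the `intercept=F` setting used in the code.

```latex
\[
\min_{\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\, x_i^{\top}\beta - \log\!\big(1 + e^{x_i^{\top}\beta}\big) \Big]
\;+\;
\lambda\Big[\, \tfrac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 + \alpha\,\lVert \beta \rVert_1 \Big]
\]
```

Setting \(\alpha = 1\) gives the lasso penalty and \(\alpha = 0\) gives the ridge penalty; \(\lambda\) controls the overall strength of the penalty.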

To choose the elastic net parameters \(\alpha\) and \(\lambda\), I perform 10-fold cross-validation and use AUC as the measure of model performance, because the classes are imbalanced. To speed up the computation, I fit the folds in parallel with foreach.
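
As a reminder of what AUC measures, it is the probability that a randomly chosen responder receives a higher predicted score than a randomly chosen non-responder, which is why it does not depend on the share of responders in the sample. The sketch below computes it with the rank-sum formulation on made-up scores; `auc_rank` is a hypothetical helper, not a function from glmnet.

```r
# AUC via the rank-sum (Mann-Whitney) formulation.
# rank() assigns average ranks to ties, so a tied pair counts as half.
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)  # number of positives
  n0 <- sum(label == 0)  # number of negatives
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up labels and scores: 2 positives, 3 negatives
label <- c(0, 0, 1, 0, 1)
score <- c(0.1, 0.4, 0.35, 0.8, 0.9)

# 4 of the 6 positive-negative pairs are ordered correctly
print(auc_rank(score, label))
```

A model that assigns every observation the same score gets an AUC of 0.5 under this definition, regardless of how imbalanced the classes are.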

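# Parallel backend (doParallel also attaches foreach and parallel)
library(glmnet)
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

# Choose alpha by cross-validation
out_compare <- NULL

# Use the same fold assignment for every alpha so the AUCs are comparable
foldid <- sample(1:10, size = length(y), replace = TRUE)

for (alpha in seq(0, 1, by = 0.1)) {
    cvfit <- cv.glmnet(X, y, family = "binomial", type.measure = "auc",
                       foldid = foldid, alpha = alpha, intercept = FALSE,
                       parallel = TRUE)
    # With type.measure = "auc", lambda.min is the lambda with the highest mean CV AUC
    out_compare <- rbind(out_compare, c(alpha, cvfit$lambda.min, max(cvfit$cvm)))
    print(alpha)  # progress indicator
}
colnames(out_compare) <- c("alpha", "lambda.min", "cvm")

stopCluster(cl)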

# Fit model with chosen alpha and lambda
# (row of out_compare with the highest cross-validated AUC)

alpha_fin <- out_compare[which.max(out_compare[,3]),1]
lambda_fin <- out_compare[which.max(out_compare[,3]),2]
fit <- glmnet(X, y, family = "binomial", alpha = alpha_fin, intercept = FALSE)

# Coefficients at the chosen lambda, sorted in increasing order
beta.vec <- coef(fit, s = lambda_fin)
beta.vec <- beta.vec[order(beta.vec),]

      alpha   lambda.min       cvm
 [1,]   0.0 0.2398284596 0.6365050
 [2,]   0.1 0.0023982846 0.6609083
 [3,]   0.2 0.0011991423 0.6617092
 [4,]   0.3 0.0007994282 0.6620541
 [5,]   0.4 0.0005995711 0.6622058
 [6,]   0.5 0.0004796569 0.6623238
 [7,]   0.6 0.0003997141 0.6623305
 [8,]   0.7 0.0003426121 0.6623637
 [9,]   0.8 0.0002997856 0.6624590
[10,]   0.9 0.0002664761 0.6624812
[11,]   1.0 0.0002398285 0.6625114

From the table above, I choose \(\alpha=1\) (a pure lasso penalty) and \(\lambda=0.0002398285\), which together maximize the cross-validated AUC (0.6625). Finally, I fit the regression model on all the data with the chosen parameters and extract the coefficient vector. The model explains about 21% of the null deviance (deviance ratio 0.2120581).
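
The fraction of null deviance explained is one minus the ratio of the model deviance to the null deviance; glmnet stores this quantity in the `dev.ratio` component of a fitted object. The sketch below illustrates the calculation with an ordinary `glm` fit on the built-in `mtcars` data, which is an assumption made purely for illustration and has nothing to do with the InMail data.

```r
# Illustration on a built-in data set, not on the InMail data
fit_toy <- glm(am ~ wt, data = mtcars, family = binomial)

# Fraction of null deviance explained: 1 - residual deviance / null deviance
dev_ratio <- 1 - fit_toy$deviance / fit_toy$null.deviance
print(dev_ratio)
```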

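Country, industry, and function area are usually fixed when a recruiter sends InMails for a given position, and email domains and the sender’s company information cannot be changed. I therefore examine only part of the coefficient vector, shown below in increasing order. From the model coefficients, we can see that, after controlling for the other variables, IS_COMPANY_FOLLOWER has the largest positive effect (0.33), followed by MESSAGE_TYPE (0.22 for SINGLE) and HAS_COMPANY_CONNECTIONS (0.20). Among the InMail content variables, INMAIL_HAS_HYPERLINKS (-0.25) and INMAIL_HAS_EMAIL_ADDRESS (-0.22) have the largest negative effects.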

     RECIPIENT_COMPANY_SIZE10001+ RECIPIENT_COMPANY_SIZE1001-5000
[1,]                   -0.9144342                      -0.8897663
     RECIPIENT_COMPANY_SIZE201-500 RECIPIENT_COMPANY_SIZE11-50
[1,]                    -0.8700662                  -0.8560641
     RECIPIENT_COMPANY_SIZE51-200 RECIPIENT_COMPANY_SIZE501-1000
[1,]                   -0.8517993                     -0.8466014
     RECIPIENT_COMPANY_SIZE5001-10000 RECIPIENT_COMPANY_SIZE1-10
[1,]                        -0.811656                 -0.7984325
     RECIPIENT_COMPANY_SIZEunknown RECIPIENT_SENIORITYUnpaid
[1,]                    -0.7855443                -0.6262941
     RECIPIENT_SENIORITYTraining RECIPIENT_COMPANY_TYPE3 self-employed
[1,]                  -0.5531715                            -0.4871423
     RECIPIENT_SENIORITYEntry RECIPIENT_SENIORITYSenior
[1,]               -0.4024081                -0.3043337
     RECIPIENT_COMPANY_TYPE8 partnership INMAIL_HAS_HYPERLINKS
[1,]                          -0.2866305            -0.2480928
     RECIPIENT_SENIORITYPartner RECIPIENT_SENIORITYOwner
[1,]                 -0.2421849               -0.2337428
     RECIPIENT_COMPANY_TYPE6 self-owned INMAIL_HAS_EMAIL_ADDRESS
[1,]                          -0.224414               -0.2221354
     RECIPIENT_COMPANY_TYPE5 non-profit RECIPIENT_SENIORITYManager
[1,]                         -0.2084185                   -0.19667
     RECIPIENT_SENIORITYCXO RECIPIENT_SENIORITYVP INMAIL_TEMPLATE_USEDY
[1,]             -0.1220782           -0.06583554           -0.06334427
     RECIPIENT_COMPANY_TYPEunknown RECIPIENT_COMPANY_TYPE7 privately held
[1,]                     -0.062577                            -0.05134828
     INMAIL_ATTACHMENT_COUNT RECIPIENT_SENIORITYDirector
[1,]             -0.04089416                 -0.03007943
     RECIPIENT_COMPANY_TYPE4 government agency SENDER_PROFLE_SCORE
[1,]                               -0.01540637         -0.00678554
     INMAIL_BODY_WORD_COUNT (Intercept) INMAIL_SUBJECT_WORD_COUNT
[1,]           -0.001336112           0              0.0003207711
     HAS_COMMON_GROUPS RECIPIENT_COMPANY_TYPE2 educational
[1,]       0.008842989                          0.01967159
     RECIPIENT_PROFLE_SCORE HAS_COMPANY_CONNECTIONS MESSAGE_TYPESINGLE
[1,]             0.04635391               0.2048318          0.2179077
     IS_COMPANY_FOLLOWER
[1,]           0.3333688
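
The coefficients in the table above are on the log-odds scale, so exponentiating them gives odds ratios that are easier to communicate. The sketch below does this for a few coefficients copied from the table; the vector name `beta` is hypothetical, and because the estimates come from a penalized fit they are shrunk toward zero, so the ratios should be read as approximate.

```r
# Selected coefficients copied from the table above (log-odds scale)
beta <- c(
  IS_COMPANY_FOLLOWER      = 0.3333688,
  MESSAGE_TYPESINGLE       = 0.2179077,
  HAS_COMPANY_CONNECTIONS  = 0.2048318,
  INMAIL_HAS_EMAIL_ADDRESS = -0.2221354,
  INMAIL_HAS_HYPERLINKS    = -0.2480928
)

# exp(beta) is the multiplicative change in the odds of a response
# for a one-unit increase in the predictor, holding the others fixed
odds_ratio <- exp(beta)
print(round(odds_ratio, 2))
```

For example, under this model the odds of a response when IS_COMPANY_FOLLOWER is 1 are roughly 1.4 times the odds when it is 0, while a value of 1 for INMAIL_HAS_HYPERLINKS multiplies the odds by roughly 0.78.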


III. Summary

After carefully examining the coefficient vector, we can make the following recommendations to recruiters:

To further improve the model, we can try the following:

All the code I used can be found here.