Last refreshed on: February 22, 2019


Executive Summary

The following is a proof of concept (PoC) on how to leverage data science to pair mentors with mentees.

Introduction

At a recent volunteer event at which a group of volunteers would eventually be paired with mentees, one of the volunteers suggested using a questionnaire to help make more informed pairings. It occurred to me that this was an excellent idea and that the pairing could be done algorithmically.

I’ve been interested in people analytics for some time now and this project was the perfect gateway. So, I set out to gather information related to mentor-mentee pairing. I began on the search engines, obviously lol. I started with DuckDuckGo and Google, searched the forums, then looked for published research. To my surprise, I found very little. I did find a cool gem though: the People Matching Project. Though I learned little about mentor-mentee pairing specifically, I proceeded to contemplate how to accomplish this endeavor. I considered both supervised techniques (e.g. KNN) and unsupervised techniques (e.g. K-means, hierarchical clustering). Given that I wouldn’t have a pre-labeled dataset, unsupervised techniques seemed the way to go.

S/N: The algorithms used by match-making sites would probably be useful for applications such as this, but I was unable to find much about the techniques these services use. That makes sense considering that’s how the match-makers make money… it’s probably proprietary.

Lastly, I took the time to document this to obtain buy-in from the mentorship program leaders, to demonstrate to any mentee interested in STEM how fun and cool a career in our field can be, to explore people analytics, to fulfill a desire to publish something relevant to analytics, and to assist anyone else attempting something similar. That said, if you find a better way to do something, or think a particular technique would be more appropriate or more accurate, please reach out and let’s make it better.

#open_source

Table of contents

  1. The questionnaire
  2. Pre-analysis
  3. The matching process


1. The questionnaire

Like any analytics project, the launch point is data. This endeavor is no different, except that since this was drafted as a proof of concept to obtain buy-in from the organizers of the mentorship program, the mentees had not yet been asked to take a survey. Therefore, I had to fabricate responses to imaginary questions. I achieved this with the random-between function in a popular spreadsheet application, trying to mimic response patterns and arbitrarily choosing cutoffs for subsets of questions and mentees/mentors, intentionally creating groups that responded similarly. The end result, as you’ll see, is a set of somewhat unique respondents. Admittedly, not the best approach. Ideally, I would have created an actual survey, but this is a proof of concept and I didn’t have time to poll people for help.
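To give a flavor of the fabrication, here’s roughly what it looks like in R. This is an illustrative sketch, not the code I actually used; the group sizes and answer ranges here are made up:

#illustrative sketch only: fabricate questionnaire responses with group structure
set.seed(123)
fake_responses <- function(n, low, high) {
  #n respondents answering 25 questions, biased toward the low:high range
  matrix(sample(low:high, n * 25, replace = TRUE)
         , nrow = n
         , dimnames = list(NULL, paste0("q", 1:25)))
}
fake_mentees <- rbind(fake_responses(50, 1, 6)     #one group leans low
                      ,fake_responses(50, 5, 10))  #another leans high
rownames(fake_mentees) <- paste0("student_", 1:100)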

Randomly selected responses to make-believe questions aside, the actual survey questions are vital to the success of the mentor-mentee matching process, so it’s critical that the questions be as independent as possible. For example, there shouldn’t be two or more questions about pets, or multiple questions about the same character trait/experience/interest. A quick way to sanity-check this once real responses arrive is sketched below.
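A minimal sketch, assuming the responses sit in a numeric table called responses (one row per respondent, one column per question; the name and the 0.8 cutoff are mine):

#sketch: flag highly correlated, likely redundant, question pairs
#assumes a numeric table `responses`: one row per respondent, one column per question
cors <- cor(responses)
cors[upper.tri(cors, diag = TRUE)] <- NA      #keep each pair once
which(abs(cors) > 0.8, arr.ind = TRUE)        #question pairs worth reviewing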

For our purposes, I recommend a mix of personality- and interest-based questions on a 1-10 or 1-5 scale, such as:

  1. how often do you get mad?
  2. how often do you try new activities?
  3. how creative are you?
  4. how often do you read?
  5. how difficult do you find it to introduce yourself to other people?
  6. how often do you exercise?

There are a variety of sources online that can help guide which questions to use.

The above questions are not the actual survey questions, merely examples.

As of writing this, 2018-12-04, I still hadn’t presented my idea to the mentorship program’s organizers.


2. Pre-analysis

Data preparation

  • import: loads the questionnaire data into R, the analytic tool used in this analysis
  • clean: remove or impute missing values from the data (certain algorithms can’t process missing values)
  • scale: data across columns must be standardized or scaled, to make the variables comparable
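One note before the code: the chunks in this document call functions from several packages, presumably loaded in a hidden setup chunk in the original R Markdown. For anyone following along, this is the set the code appears to rely on (the package-to-function mapping is my best guess):

#packages the chunks appear to rely on (assumed; the original setup chunk is hidden)
library(VIM)         #aggr(): missing-value plots
library(knitr)       #kable()
library(kableExtra)  #kable_styling()
library(NbClust)     #NbClust()
library(factoextra)  #fviz_cluster(), fviz_nbclust(), fviz_gap_stat(), fviz_dist(), get_dist(), hcut()
library(fields)      #rdist()
library(data.table)  #data.table(), setorder(), %like%, %chin%
library(reshape2)    #melt()
library(ggplot2)     #ggplot() heat map
library(cluster)     #agnes(), pltree(), daisy(), silhouette(), clusGap()
library(purrr)       #map_dbl()
library(gridExtra)   #grid.arrange()
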
#file info
mentee_file <- 'mentee.csv'
mentor_file <- 'mentor.csv'

#import
mentor_table <- read.table(mentor_file, header=TRUE, sep = ',', row.names = 1)
mentee_table <- read.table(mentee_file, header=TRUE, sep = ',', row.names = 1)

Before we clean and scale the data let’s take a glance at the data itself.

Below is a plot of the number of missing values in each column. If there are none then the chart on the left will be blank.

aggr(rbind(mentor_table,mentee_table))

A glimpse at the data itself.

Mentor:

summary(mentor_table[,1:5])%>%
kable()%>%
kable_styling(bootstrap_options=c("striped")
              ,full_width = F
              ,position = "left"
              )
       q1              q2              q3              q4             q5
 Min.   : 1.00   Min.   : 1.00   Min.   : 1.00   Min.   : 1.0   Min.   : 1.0
 1st Qu.: 4.00   1st Qu.: 3.75   1st Qu.: 3.00   1st Qu.: 4.0   1st Qu.: 4.0
 Median : 5.00   Median : 5.00   Median : 5.00   Median : 6.0   Median : 6.0
 Mean   : 5.37   Mean   : 5.18   Mean   : 5.11   Mean   : 5.6   Mean   : 5.6
 3rd Qu.: 7.00   3rd Qu.: 7.00   3rd Qu.: 7.00   3rd Qu.: 7.0   3rd Qu.: 7.0
 Max.   :10.00   Max.   :10.00   Max.   :10.00   Max.   :10.0   Max.   :10.0

Mentee:

summary(mentee_table[,1:5])%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
              ,full_width = F
              ,position = "left"
              )
       q1              q2              q3              q4              q5
 Min.   : 1.00   Min.   : 1.00   Min.   : 1.00   Min.   : 1.00   Min.   : 1.00
 1st Qu.: 3.00   1st Qu.: 3.00   1st Qu.: 3.00   1st Qu.: 3.00   1st Qu.: 2.75
 Median : 5.00   Median : 5.00   Median : 6.00   Median : 5.00   Median : 5.00
 Mean   : 5.38   Mean   : 5.23   Mean   : 5.62   Mean   : 5.47   Mean   : 5.33
 3rd Qu.: 9.00   3rd Qu.: 8.00   3rd Qu.: 8.00   3rd Qu.: 8.00   3rd Qu.: 8.00
 Max.   :10.00   Max.   :10.00   Max.   :10.00   Max.   :10.00   Max.   :10.00

Finally, we clean the data, transform it if necessary, and scale it (there’s nothing to visualize here, so see the code below if interested).

#clean
mentor_table <- na.omit(mentor_table)
mentee_table <- na.omit(mentee_table)

#transformation
#pca: log
#mentor_table <- log(mentor_table)
#mentee_table <- log(mentee_table)

#scale
mentor_table <- scale(mentor_table)
mentee_table <- scale(mentee_table)

One final view after it’s all done.

mentor_table[1:5,1:5]%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
              ,full_width = F
              ,position = "left"
              )
                  q1         q2         q3         q4         q5
mentor_1  -0.6319271 -0.5105877  0.3797449  0.1772811  0.6073692
mentor_2   0.2905942 -0.0778863  0.3797449 -0.7091242  0.1735341
mentor_3  -0.1706664 -0.0778863 -0.0469348  0.6204837  0.6073692
mentor_4   0.2905942  0.7875165  1.2331044  1.0636863 -0.6941362
mentor_5   0.2905942 -0.0778863  0.8064246  1.0636863 -0.2603011
mentee_table[1:5,1:5]%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
              ,full_width = F
              ,position = "left"
              )
                 q1        q2        q3       q4        q5
student_1  0.851372 0.9271339 1.5373847 1.162461 0.8835237
student_2  1.176323 1.5965447 0.8353826 1.162461 0.8835237
student_3  1.501274 1.2618393 1.1863836 1.162461 1.2144315
student_4  1.176323 0.9271339 1.5373847 1.491770 0.8835237
student_5  0.851372 1.2618393 1.1863836 1.491770 1.5453393

3. The matching process

Two types of approaches:

  • supervised
  • unsupervised (Hierarchical Agglomerative Clustering, K-means)

A principal difference between supervised and unsupervised techniques is that the data used in supervised techniques is pre-labeled or pre-classified, which essentially means that the answer is explicitly stated in the training set, while in unsupervised learning the algorithms determine what the answer is. There is plenty of material online about the differences between the two approaches.

Our dataset is not pre-labeled. We don’t (and can’t) know how the mentors and mentees will respond beforehand, and we don’t know beforehand which mentee should be paired with which mentor. Trying to guess their answers before they answer is a different problem: a forecasting problem. This is a matching problem. So, we have to use an unsupervised technique. Perhaps we could apply the K-means algorithm? Unfortunately, there is a flaw with that approach: the algorithm won’t pair mentors with mentees one-to-one. Rather, it’ll group similar mentors and mentees together, and just like that we’re back where we started: which mentees to pair with which mentors?

What if instead we leveraged a simpler method: distance. In the appendix section Better understanding the mentees, I experimented with distance and used a distance plot to map how similar/dissimilar the mentees were to one another. With some tweaks, we can use a distance plot to map how similar/dissimilar each mentee is to each mentor. Voila, a one-to-one match. While not a particularly fancy technique, it is simple and easy to explain/interpret, and for anyone in corporate America, ease of understanding is a big selling point.
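To make “distance” concrete: after scaling, each person is just a vector of 25 numbers, and similarity is the straight-line (Euclidean) distance between two such vectors. A tiny illustration using the tables built in section 2:

#the distance between one mentee and one mentor is just
sqrt(sum((mentee_table["student_1", ] - mentor_table["mentor_1", ])^2))
#the smaller the value, the more similarly the two answered the questionnaire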

Pre-processing

First, have any of the mentees/mentors responded exactly the same as another mentee/mentor? To check, we will compare the dimensions of the responses against the dimensions of the unique responses. If the total and unique row counts are the same then there are no duplicates. The odds seem quite low but it’s worth validating (I tried to compute the odds by hand but gave up after a while).
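For what it’s worth, a back-of-the-envelope birthday-problem estimate, under the (unrealistic) assumption that everyone answers every question uniformly at random, says a duplicate is all but impossible:

#25 questions x 10 choices = 10^25 possible response vectors
#approximate probability that any 2 of 100 respondents collide
-expm1(-choose(100, 2) / 10^25)  #~ 4.95e-22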

#mentor
abc <- rbind(dim(mentor_table)[1],dim(unique(mentor_table))[1])
abc <- cbind(c("Total mentors","Unique mentors"),abc)

#mentee
def <- rbind(dim(mentee_table)[1],dim(unique(mentee_table))[1])
def <- cbind(c("Total mentees","Unique mentees"),def)

abc <- t(rbind(abc,def))
#abc[,c(1,3,2,4)]%>%
abc%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
Total mentors Unique mentors Total mentees Unique mentees
100 100 100 100

Now, we’ll attempt to find a one-to-one match. The number of unique, one-to-one matches is: 0
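The code behind that count is hidden; a sketch of one way such a count might be computed, by looking for mentees whose scaled answers exactly match some mentor’s:

#sketch: count mentees whose (scaled) answers exactly match some mentor's
sum(apply(mentee_table, 1, function(s)
  any(apply(mentor_table, 1, function(m) all(m == s)))))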

Let the pairing begin

I mentioned above that using K-means was flawed: K-means alone doesn’t get us what we want, a one-to-one match. However, the K-means algorithm can help us verify whether our pairings are reasonable. K-means will group similar mentees and mentors, we’ll compute the distance of each mentee to each mentor, and then we’ll combine the cluster data with the distance data. We expect all, or at least a majority, of the matched mentor-mentee pairs to fall in the same cluster.

We’ll compute K then cluster the mentors and mentees. Then we’ll append the cluster number to the combined mentor and mentee questionnaire response data. Finally, we’ll compute the distance.

combined <- rbind(mentor_table,mentee_table)
zyx <- suppressMessages(NbClust(combined, min.nc=2, max.nc=15, method="kmeans"))

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 2 proposed 2 as the best number of clusters 
## * 17 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 2 proposed 5 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
barplot(table(zyx$Best.n[1,])
        ,xlab="Numer of Clusters"
        ,ylab="Number of Criteria"
        ,main="Num of Clusters Chosen by 26 Criteria"
)

Another quick look at the data: plotting the mentors’ pairwise (Manhattan) distances against the mentees’.

mdist <- dist(mentor_table,method = "manhattan")
sdist <- dist(mentee_table,method = "manhattan")

plot(mdist,sdist)

With a value for K, let’s cluster

#k-means
fit.km_combined <- kmeans(combined
                 ,centers=3
                 ,nstart=25
                 ,algorithm="Lloyd"
                 ,iter.max = 1000
           )
fviz_cluster(fit.km_combined, data = combined)

Now, we’ll create the table of distance from mentee to mentor and visually represent the distance with a heat map.

#distance
rdist_boom <- rdist(as.table(combined))

#clean-up
row.names(rdist_boom)<-attributes(combined)$dimnames[[1]]
colnames(rdist_boom)<-attributes(combined)$dimnames[[1]]
##filter out mentees from columns and mentors from rows
rdist_boom <- rdist_boom[row.names(rdist_boom) %like% 'student',colnames(rdist_boom) %like% 'mentor']

#heat map
#source: https://sebastianraschka.com/Articles/heatmaps_in_r.html
#source: https://www.r-bloggers.com/drawing-heatmaps-in-r/
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)
heatmap(rdist_boom,Colv = FALSE,scale = "none",col=my_palette)

Now, let’s attach each mentee’s and mentor’s respective cluster.

#transform to long format
rdist_boom <- melt(rdist_boom)

#mentor cluster
#source: https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right
rdist_boom <- merge(x = rdist_boom
                    , y = melt(as.table(fit.km_combined$cluster))
                    , by.x = "Var2"
                    , by.y = "Var1"
                    , all.x = TRUE)

#student cluster
rdist_boom <- merge(x = rdist_boom
                    , y = melt(as.table(fit.km_combined$cluster))
                    , by = "Var1"
                    , all.x = TRUE)

#assign column names
colnames(rdist_boom)<-c('student','mentor','distance','mentor_cluster','student_cluster')

#convert to data table
rdist_boom <- data.table(rdist_boom)

Another heat map

#heat map
#source: https://www.r-bloggers.com/how-to-create-a-fast-and-easy-heatmap-with-ggplot2/
#source: https://stackoverflow.com/questions/14290364/heatmap-with-values-ggplot2
ggplot(rdist_boom, aes(student, mentor)) +
  geom_tile(aes(fill = distance), color = "white") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  ylab("Mentor") +
  xlab("Student") +
  theme(legend.title = element_text(size = 8),
        legend.text = element_text(size = 8),
        plot.title = element_text(size=10),
        axis.title=element_text(size=8,face="bold"),
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(fill = "Distance")

We will attempt to pair the mentors and the mentees in two different ways.

The first approach works by mentee: for each mentee, sort the mentors by distance. Within this approach we’ll try two methods… methodception lol.

  1. The top mentor
  2. The top three closest mentors
#filter for top n mentors by student
#source: https://stackoverflow.com/questions/16325641/how-to-extract-the-first-n-rows-per-group
#source: https://stackoverflow.com/questions/13685295/sort-a-data-table-fast-by-ascending-descending-order
rdist_boom_0 <- setorder(rdist_boom,student,distance)
rdist_boom_0 <- rdist_boom_0[,.SD[1:3],by=student]

#filter for min distance, ideal mentor
#source: https://stackoverflow.com/questions/16325641/how-to-extract-the-first-n-rows-per-group
#source: https://stackoverflow.com/questions/16573995/subset-by-group-with-data-table
rdist_boom_1 <- rdist_boom[,.SD[distance == min(distance)],by=student]

#method 1: top mentor by mentee
rdist_boom_1[1:3,]%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )  
student mentor distance mentor_cluster student_cluster
student_1 mentor_86 2.747217 3 3
student_2 mentor_96 2.414251 3 3
student_3 mentor_86 2.597768 3 3
#method 2: top n mentors by mentee
rdist_boom_0[1:9,]%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
student mentor distance mentor_cluster student_cluster
student_1 mentor_86 2.747217 3 3
student_1 mentor_85 2.925842 3 3
student_1 mentor_88 2.951561 3 3
student_2 mentor_96 2.414251 3 3
student_2 mentor_80 2.463323 3 3
student_2 mentor_84 2.805245 3 3
student_3 mentor_86 2.597768 3 3
student_3 mentor_87 2.814029 3 3
student_3 mentor_90 2.814439 3 3

Do you see an issue with this approach? In either method 1 (top mentor by mentee) or method 2 (top n mentors by mentee), are all mentors represented? Was each mentor paired with a mentee? Let’s see.

Method 1: Top mentor

foo <- rbind(dim(mentor_table)[1],dim(unique(rdist_boom_1[,list(mentor)]))[1])
foo <- cbind(c("Total mentors","Mentors matched"),foo)

foo1 <- rbind(dim(mentee_table)[1],dim(unique(rdist_boom_1[,list(student)]))[1])
foo1 <- cbind(c("Total mentees","Mentees matched"),foo1)

foo <- t(rbind(foo,foo1))
foo <- foo[,c(1,3,2,4)]

foo %>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
Total mentors Total mentees Mentors matched Mentees matched
100 100 44 100

Method 2: Top n mentors

bar <- rbind(dim(mentor_table)[1],dim(unique(rdist_boom_0[,list(mentor)]))[1])
bar <- cbind(c("Total mentors","Mentors matched"),bar)

bar1 <- rbind(dim(mentee_table)[1],dim(unique(rdist_boom_0[,list(student)]))[1])
bar1 <- cbind(c("Total mentees","Mentees matched"),bar1)

bar <- t(rbind(bar,bar1))
bar <- bar[,c(1,3,2,4)]

bar %>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )  
Total mentors Total mentees Mentors matched Mentees matched
100 100 74 100

As we can see, not all mentors were matched with a mentee. Now, we can debate whether or not a one-to-one match is the appropriate approach. We could argue that unmatched mentors simply shouldn’t be matched with a mentee. However, for the sake of practice and, one of our favorite corporate buzzwords, inclusion, let’s present an alternative approach.

This time we’ll focus on the mentors and sort/filter by the mentees.

Method 3: Top mentee

#reorder columns
#source: http://www.cookbook-r.com/Manipulating_data/Reordering_the_columns_in_a_data_frame/
rdist_boom_2 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_2 <- setorder(rdist_boom_2,mentor,distance)
rdist_boom_2 <- rdist_boom_2[,.SD[distance == min(distance)],by=mentor]

spam <- rbind(dim(mentor_table)[1],dim(unique(rdist_boom_2[,list(mentor)]))[1])
spam <- cbind(c("Total mentors","Mentors matched"),spam)

spam1 <- rbind(dim(mentee_table)[1],dim(unique(rdist_boom_2[,list(student)]))[1])
spam1 <- cbind(c("Total mentees","Mentees matched"),spam1)

spam <- t(rbind(spam,spam1))
spam <- spam[,c(1,3,2,4)]

spam %>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
Total mentors Total mentees Mentors matched Mentees matched
100 100 100 54

Method 4: Top n mentees

rdist_boom_3 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_3 <- setorder(rdist_boom_3,mentor,distance)
rdist_boom_3 <- rdist_boom_3[,.SD[1:3],by=mentor]

spam2 <- rbind(dim(mentor_table)[1],dim(unique(rdist_boom_3[,list(mentor)]))[1])
spam2 <- cbind(c("Total mentors","Mentors matched"),spam2)

spam3 <- rbind(dim(mentee_table)[1],dim(unique(rdist_boom_3[,list(student)]))[1])
spam3 <- cbind(c("Total mentees","Mentees matched"),spam3)

spam2 <- t(rbind(spam2,spam3))
spam2 <- spam2[,c(1,3,2,4)]

spam2 %>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
Total mentors Total mentees Mentors matched Mentees matched
100 100 100 73

A quick discussion. There are two general approaches:

  • Top match
  • Top n match

Personally, I prefer the top match approach. It takes away the guesswork and provides a single suggestion to whoever reads our report. However, neither the mentor-focused nor the mentee-focused approach included all the mentors or mentees. In some applications that’s probably okay: you’d rather provide the best matches, and those who didn’t match can wait until a better-suited mentor/mentee is available.

In an effort to practice further, what if we provided a comprehensive list of matched mentors and mentees? We would take the matches from one of the top-match methods above and continue matching with the leftovers so that no mentor is without a mentee.

It might be useful to create a rank column that ranks each mentor-mentee pair from least distant (most similar) to most distant (least similar).

Since the mentor-focused approach in method 3 had a higher match rate than method 1, we’ll leverage that.

#source: https://stackoverflow.com/questions/15170777/add-a-rank-column-to-a-data-frame
rdist_boom_4 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_4 <- setorder(rdist_boom_4,mentor,distance)
rdist_boom_4 <- rdist_boom_4[,rank:=rank(distance,ties.method="first"),by=mentor]
head(rdist_boom_4)%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
mentor student distance mentor_cluster student_cluster rank
mentor_1 student_41 3.029679 1 1 1
mentor_1 student_33 3.037549 1 1 2
mentor_1 student_34 3.085842 1 1 3
mentor_1 student_37 3.139329 1 1 4
mentor_1 student_30 3.154351 1 1 5
mentor_1 student_46 3.244478 1 1 6

On second thought, why focus on the mentors? What we’re really concerned about is the distance (i.e. how well a mentor and mentee match), right? So, let’s rank by distance without a specific focus on either mentor or mentee.

rdist_boom_5 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_5 <- setorder(rdist_boom_5,distance)
rdist_boom_5 <- rdist_boom_5[,rank:=rank(distance,ties.method="first")]
head(rdist_boom_5)%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
mentor student distance mentor_cluster student_cluster rank
mentor_88 student_16 2.099942 3 3 1
mentor_32 student_53 2.120671 2 2 2
mentor_45 student_73 2.290427 2 2 3
mentor_48 student_71 2.301658 2 2 4
mentor_32 student_52 2.329975 2 2 5
mentor_88 student_7 2.372999 3 3 6

With our dataset ordered by distance (dataset A), we will now loop through dataset A and build a new dataset (dataset B). After each insertion into dataset B, we’ll exclude from dataset A any record containing that mentor or mentee, thus creating a unique list of mentors and mentees.

rdist_boom_6 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_6 <- setorder(rdist_boom_6,distance)
rdist_boom_6 <- rdist_boom_6[,rank_d:=rank(distance,ties.method="first")]
rdist_boom_6 <- rdist_boom_6[,rank_m:=rank(distance,ties.method="first"),by=mentor]
rdist_boom_6 <- rdist_boom_6[,rank_s:=rank(distance,ties.method="first"),by=student]

#source: https://stackoverflow.com/questions/29711377/how-do-i-initialize-an-empty-or-example-data-table
comp_list <- data.table(
  mentor = character()
  ,student = character()
  ,distance = numeric()
  ,mentor_cluster = integer()
  ,student_cluster = integer()
  ,rank_d = integer()
  ,rank_m = integer()
  ,rank_s = integer()
)

#source: https://stackoverflow.com/questions/7494848/standard-way-to-remove-multiple-elements-from-a-dataframe
#tweaked for characters
#`%notin%` <- function(x,y) !(x %chin% y)

#source: https://www.datacamp.com/community/tutorials/tutorial-on-loops-in-r

repeat {
  #keep only rows whose mentor AND mentee are not yet in the final list
  loopy <- rdist_boom_6[!(as.character(rdist_boom_6$mentor)
                                 %chin%
                                 as.character(comp_list$mentor))
                               &
                               !(as.character(rdist_boom_6$student)
                                 %chin%
                                 as.character(comp_list$student)),]
  #the table is sorted by distance, so row 1 is the closest remaining pair;
  #greedily add it to the final list
  comp_list <- rbind(comp_list,loopy[1,])

  #stop once all 100 pairs are made and both columns are duplicate-free
  if (nrow(comp_list)==100 &
      dim(comp_list[,list(mentor)])[1] == dim(unique(comp_list[,list(mentor)]))[1] &
      dim(comp_list[,list(student)])[1] == dim(unique(comp_list[,list(student)]))[1]){
    break
  }
}

comp_list[order(mentor),1:5]%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )  
mentor student distance mentor_cluster student_cluster
mentor_1 student_41 3.029679 1 1
mentor_2 student_45 2.562457 1 1
mentor_3 student_32 2.392411 1 1
mentor_4 student_82 4.100195 1 1
mentor_5 student_42 2.625672 1 1
mentor_6 student_33 3.039905 1 1
mentor_7 student_46 2.662955 1 1
mentor_8 student_43 2.581001 1 1
mentor_9 student_96 4.807285 1 1
mentor_10 student_85 5.520550 1 1
mentor_11 student_83 4.456761 1 1
mentor_12 student_89 5.149710 1 1
mentor_13 student_34 2.896505 1 1
mentor_14 student_88 4.845568 1 1
mentor_15 student_36 2.652919 1 1
mentor_16 student_37 2.807648 1 1
mentor_17 student_98 7.690623 1 1
mentor_18 student_92 5.281072 1 1
mentor_19 student_86 4.362615 1 1
mentor_20 student_28 2.847893 1 1
mentor_21 student_48 3.159907 1 1
mentor_22 student_94 4.473839 1 1
mentor_23 student_49 3.198363 1 1
mentor_24 student_39 2.511293 1 1
mentor_25 student_97 5.304856 1 1
mentor_26 student_74 2.959585 2 2
mentor_27 student_57 2.649959 2 2
mentor_28 student_69 2.464794 2 2
mentor_29 student_62 2.646257 2 2
mentor_30 student_70 2.948451 2 2
mentor_31 student_55 2.684011 2 2
mentor_32 student_53 2.120671 2 2
mentor_33 student_64 2.844501 2 2
mentor_34 student_58 2.899029 2 2
mentor_35 student_72 3.195314 2 2
mentor_36 student_66 3.106294 2 2
mentor_37 student_54 3.513544 2 2
mentor_38 student_68 2.917839 2 2
mentor_39 student_65 2.812174 2 2
mentor_40 student_52 3.274101 2 2
mentor_41 student_56 2.972919 2 2
mentor_42 student_61 2.500779 2 2
mentor_43 student_51 2.478912 2 2
mentor_44 student_60 2.867123 2 2
mentor_45 student_73 2.290427 2 2
mentor_46 student_59 3.794640 2 2
mentor_47 student_63 2.723469 2 2
mentor_48 student_71 2.301658 2 2
mentor_49 student_67 3.122865 2 2
mentor_50 student_75 3.228425 2 2
mentor_51 student_40 2.665241 1 1
mentor_52 student_31 2.664058 1 1
mentor_53 student_44 2.391656 1 1
mentor_54 student_76 5.250236 1 1
mentor_55 student_78 5.219787 1 1
mentor_56 student_84 5.041660 1 1
mentor_57 student_26 3.158960 1 1
mentor_58 student_77 4.077882 1 1
mentor_59 student_30 2.624056 1 1
mentor_60 student_80 4.944278 1 1
mentor_61 student_79 5.193446 1 1
mentor_62 student_93 6.083014 1 1
mentor_63 student_47 3.189654 1 1
mentor_64 student_50 2.715196 1 1
mentor_65 student_87 4.444670 1 1
mentor_66 student_99 4.904524 1 1
mentor_67 student_90 4.292501 1 1
mentor_68 student_35 2.736301 1 1
mentor_69 student_95 4.972855 1 1
mentor_70 student_91 4.347424 1 1
mentor_71 student_27 3.235039 1 1
mentor_72 student_100 4.599006 1 1
mentor_73 student_29 3.290807 1 1
mentor_74 student_81 4.787604 1 1
mentor_75 student_38 2.798083 1 1
mentor_76 student_20 2.846385 3 3
mentor_77 student_15 2.657693 3 3
mentor_78 student_11 2.698564 3 3
mentor_79 student_1 3.077246 3 3
mentor_80 student_14 2.624967 3 3
mentor_81 student_6 3.139111 3 3
mentor_82 student_4 2.814877 3 3
mentor_83 student_7 3.458420 3 3
mentor_84 student_9 2.577503 3 3
mentor_85 student_24 2.901643 3 3
mentor_86 student_19 2.521309 3 3
mentor_87 student_3 2.814029 3 3
mentor_88 student_16 2.099942 3 3
mentor_89 student_17 2.978750 3 3
mentor_90 student_18 2.891187 3 3
mentor_91 student_21 3.337798 3 3
mentor_92 student_12 4.039464 3 3
mentor_93 student_25 3.122153 3 3
mentor_94 student_10 2.729186 3 3
mentor_95 student_8 3.357755 3 3
mentor_96 student_2 2.414251 3 3
mentor_97 student_13 2.564496 3 3
mentor_98 student_5 2.869701 3 3
mentor_99 student_22 3.041030 3 3
mentor_100 student_23 3.511749 3 3

And there we have it. A unique list of mentors and mentees algorithmically paired.


Final Thoughts

If you’ve read this document in its entirety, thank you! I hope you found it useful or, at minimum, entertaining. I’d still like to put this to the test with real people and a real questionnaire.

Updates

2019-01-04: The mentor program directors have decided not to pair the mentees and mentors. Instead, the program will keep a group structure. I did present my prototype and it was well received. I’m still searching for an opportunity to apply this. Until next time.


Appendix

With the matching process complete, I still wanted to practice different techniques. Below you’ll find different exploratory approaches to clustering and matching.

  1. Practice: Mentors
  2. Practice: Mentees

1. Better understanding the mentors

Before we get into the fancy stuff let’s do something simple. Let’s see if any mentors responded the same as other mentors. The odds seem low given that in our example there are 10 possible answers for each of the 25 questions, but you never know.

To accomplish this uniqueness check we’ll compare the dimensions of the mentor table against the dimensions of the unique mentor table. And though it’s simple, the uniqueness check is important because some techniques require unique respondents. If the dimensions are the same then no further analysis is needed. If the dimensions were different, it would require further investigation: which mentors responded the same, and why? Given how low the odds are that any two or more people would respond identically, a duplicate would suggest a data integrity issue (e.g. a data entry error), or perhaps that the respondents completed the survey together on purpose.

rbind(dim(mentor_table),dim(unique(mentor_table)))
##      [,1] [,2]
## [1,]  100   25
## [2,]  100   25

Now, let’s analyze which mentors are similar (i.e. which mentors answered the questionnaire similarly). To accomplish this we will use Hierarchical Agglomerative Clustering (HAC), also known as AGNES (Agglomerative Nesting).
There are several clustering methods:

  • Average
  • Single
  • Complete
  • Ward

Each method will result in a different grouping of mentors. To identify which best fits our data, we’ll compute the agglomerative coefficient (AC), which measures the amount of clustering structure found (values closer to 1 suggest stronger clustering structure). We’ll select the method with the highest AC value.

# methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

# function to compute coefficient
ac <- function(x) {agnes(mentor_table, method = x)$ac}

t(map_dbl(m, ac))%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
average single complete ward
0.5944711 0.2007084 0.7407678 0.922375

Unlike other clustering techniques, hierarchical clustering doesn’t require that we specify a particular number of clusters up front. The algorithm will detect patterns in the data and split the mentors into groups. It does require a clustering method, though; having selected one from the table above, we’ll create the dendrogram.

# Dissimilarity matrix
d <- dist(mentor_table, method = "euclidean")
mentor_hc1 <- agnes(mentor_table, method = "ward")

pltree(mentor_hc1
       , cex = 0.6
       , hang = -1
       , main = "Dendrogram"
       , xlab = 'Mentors'
)

Determining Optimal Clusters

Although hierarchical clustering doesn’t require we specify a particular number of clusters, doing so can be useful to figure out how to group the mentors. There are a number of techniques to assess how many groups of mentors there should be:

  • Elbow Method
  • Average Silhouette Method
  • Gap Statistic Method

Elbow

fviz_nbclust(mentor_table, FUN = hcut, method = "wss")

Average Silhouette

fviz_nbclust(mentor_table, FUN = hcut, method = "silhouette")

Gap Statistic

gap_stat <- clusGap(mentor_table
                    , FUN = hcut
                    , nstart = 25
                    , K.max = 15
                    , B = 500
                    , d.power = 2
                    , spaceH0 = "scaledPCA"
            )
fviz_gap_stat(gap_stat)

Two of the three techniques above recommended 3 so let’s set k = 3.

k <- 3

Which mentors are in each group?

d <- dist(mentor_table, method = "euclidean")
hc5 <- hclust(d, method = "ward.D2" )
plot(hc5, cex = 0.6)
rect.hclust(hc5, k = k, border = 2:5)

How many mentors are in each group?

# Number of members in each cluster
clusters <- cutree(as.hclust(mentor_hc1), k = k)
table(clusters)
## clusters
##  1  2  3 
## 50 25 25
# append the cluster for each mentor
#mentor_table <- cbind(mentor_table,clusters)

Finally, let’s visualize the groups

fviz_cluster(list(data = mentor_table, cluster = clusters)
             ,main='Mentor clusters'
             #,xlab='x label'
             #,ylab='y label'
)


2. Better understanding the mentees

In an effort to explore other techniques and solutions we’ll now test a different clustering technique, K-means.

From “R in Action, Second Edition”, the K-means algorithm (a code sketch follows the list):

  1. Selects K centroids (K rows chosen at random)
  2. Assigns each data point to its closest centroid
  3. Recalculates the centroids as the average of all data points in a cluster (i.e., the centroids are p-length mean vectors, where p is the number of variables)
  4. Reassigns data points to their closest centroids
  5. Continues steps 3 and 4 until the observations are not reassigned or the maximum number of iterations (R uses 10 as a default) is reached
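To make those steps concrete, here’s a minimal hand-rolled sketch of one Lloyd iteration with k = 3 (for illustration only; kmeans() handles all of this, plus the convergence loop, for us):

#one Lloyd iteration by hand, k = 3
set.seed(42)
centers <- mentee_table[sample(nrow(mentee_table), 3), ]       #step 1: pick 3 random rows
closest <- function(x) which.min(colSums((t(centers) - x)^2))  #squared distance to each centroid
clusters <- apply(mentee_table, 1, closest)                    #step 2: assign each mentee
centers <- apply(mentee_table, 2, function(col)                #step 3: recompute centroids
  tapply(col, clusters, mean))
#steps 4 and 5: repeat assignment/recomputation until assignments stop changing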

More information about K-means is widely available online.

Exploratory

We can view how the mentees compare to each other using a distance plot. Knowing whether the mentees are similar to or different from one another helps us better understand the data. For example, if none of the mentees are alike then perhaps all of the mentees responded randomly.

#source
#1: https://uc-r.github.io/kmeans_clustering

distance <- get_dist(mentee_table)
fviz_dist(distance
          , gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07")
          ,show_labels = TRUE
)

Figuring out a value for K

Unlike the hierarchical clustering technique used to better understand the mentors, K-means requires that we specify the number of clusters (the “k”) at the onset. To investigate an appropriate number of clusters we can use the following visuals.

wssplot <- function(data, nc=15, seed=1234){
               wss <- (nrow(data)-1)*sum(apply(data,2,var))
               for (i in 2:nc){
                    set.seed(seed)
                    wss[i] <- sum(kmeans(data, centers=i,iter.max=1000,nstart=25)$withinss)}
                plot(1:nc, wss, type="b", xlab="Number of Clusters",
                     ylab="Within groups sum of squares")}

suppressMessages(wssplot(mentee_table))

nc <- suppressMessages(NbClust(mentee_table, min.nc=2, max.nc=15, method="kmeans"))

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 11 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 2 proposed 7 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
suppressMessages(table(nc$Best.n[1,]))
## 
##  0  1  2  3  4  7 13 14 
##  2  1  6 11  2  2  1  1
barplot(table(nc$Best.n[1,])
        ,xlab="Numer of Clusters"
        ,ylab="Number of Criteria"
        ,main="Number of Clusters Chosen by 26 Criteria"
)

Visualizing the clusters

fit.km_student <- kmeans(mentee_table
                 ,centers=3
                 ,nstart=25
                 ,algorithm="Lloyd"
                 ,iter.max = 1000
           )
#source
#1: https://stats.stackexchange.com/questions/31083/how-to-produce-a-pretty-plot-of-the-results-of-k-means-cluster-analysis

dissE <- daisy(mentee_table)
dE2   <- dissE^2
sk2   <- silhouette(fit.km_student$cl, dE2)
plot(sk2)

#source
#1: https://uc-r.github.io/kmeans_clustering

fviz_cluster(fit.km_student, data = mentee_table)

One final view: multiple plots

#source
#1: https://uc-r.github.io/kmeans_clustering

k3 <- kmeans(mentee_table,centers=3,nstart=25,iter.max=1000,algorithm="Lloyd")
k4 <- kmeans(mentee_table,centers=4,nstart=25,iter.max=1000,algorithm="Lloyd")
k5 <- kmeans(mentee_table,centers=5,nstart=25,iter.max=1000,algorithm="Lloyd")
k6 <- kmeans(mentee_table,centers=6,nstart=25,iter.max=1000,algorithm="Lloyd")

# plots to compare
p2 <- fviz_cluster(k3, geom = "point",  data = mentee_table) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = mentee_table) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = mentee_table) + ggtitle("k = 5")
p5 <- fviz_cluster(k6, geom = "point",  data = mentee_table) + ggtitle("k = 6")

grid.arrange(p2,p3,p4,p5, nrow = 2)