Last refreshed on: February 22, 2019


Executive Summary

The following is a proof of concept (PoC) on how to leverage data science to pair mentors with mentees.

Introduction

At a recent volunteer event at which a group of volunteers would eventually be paired with mentees, one of the volunteers suggested using a questionnaire to help make more informed pairings. It occurred to me that this was an excellent idea and that the pairing could be done algorithmically.

I’ve been interested in people analytics for some time now and this project was the perfect gateway. So, I set out to gather information related to mentor-mentee pairing. I began on the search engines, obviously lol. I started with DuckDuckGo and Google, searched the forums, then looked for published research. To my surprise, I found very little. I did find a cool gem though: the People Matching Project. Though I learned little about mentor-mentee pairing specifically, I proceeded to contemplate how to accomplish this endeavor. I considered both supervised techniques (e.g. KNN) and unsupervised techniques (e.g. K-means, hierarchical clustering). Given that I wouldn’t have a pre-labeled dataset, unsupervised techniques seemed the way to go.

S/N: The algorithms used by match-making sites would probably be useful for applications such as this, but I was unable to find much about the techniques these services use. That makes sense considering that’s how the match-makers make money… it’s probably proprietary.

Lastly, I took the time to document this to obtain buy-in from the mentorship program leaders, to demonstrate to any mentee interested in STEM how fun and cool a career in our field can be, to explore people analytics, to fulfill a desire to publish something relevant to analytics, and to assist anyone else attempting something similar. That said, if you find a better way to do something, or think a particular technique would be more appropriate or more accurate, please reach out and let’s make it better.

#open_source

Table of contents

  1. The questionnaire
  2. Pre-analysis
  3. The matching process


1. The questionnaire

Like any analytics project, the launch point is data. This endeavor is no different, except that since this was drafted as a proof of concept to obtain buy-in from the organizers of the mentorship program, the mentees had not yet been asked to take a survey. Therefore, I had to fabricate responses to imaginary questions. I achieved this with the random-between function in a popular spreadsheet application, trying to mimic response patterns and arbitrarily choosing cutoffs for subsets of questions and mentees/mentors, intentionally creating groups that responded similarly. The end result, as you’ll see, is a set of somewhat unique respondents. Admittedly, not the best approach. Ideally, I would have created an actual survey, but this is a proof of concept and I didn’t have time to poll people for help.
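To give a flavor of the fabrication, here’s roughly what it looks like in R. This is an illustrative sketch, not the code I actually used; the group sizes and answer ranges here are made up:

#illustrative sketch only: fabricate questionnaire responses with group structure
set.seed(123)
fake_responses <- function(n, low, high) {
  #n respondents answering 25 questions, biased toward the low:high range
  matrix(sample(low:high, n * 25, replace = TRUE)
         , nrow = n
         , dimnames = list(NULL, paste0("q", 1:25)))
}
fake_mentees <- rbind(fake_responses(50, 1, 6)     #one group leans low
                      ,fake_responses(50, 5, 10))  #another leans high
rownames(fake_mentees) <- paste0("student_", 1:100)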

Randomly selected responses to make-believe questions aside, the actual survey questions are vital to the success of the mentor-mentee matching process, so it’s critical that the questions be as independent as possible. For example, there shouldn’t be two or more questions about pets, or multiple questions about the same character trait/experience/interest. A quick way to sanity-check this once real responses arrive is sketched below.
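A minimal sketch, assuming the responses sit in a numeric table called responses (one row per respondent, one column per question; the name and the 0.8 cutoff are mine):

#sketch: flag highly correlated, likely redundant, question pairs
#assumes a numeric table `responses`: one row per respondent, one column per question
cors <- cor(responses)
cors[upper.tri(cors, diag = TRUE)] <- NA      #keep each pair once
which(abs(cors) > 0.8, arr.ind = TRUE)        #question pairs worth reviewing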

For our purposes, I recommend a mix of personality- and interest-based questions on a 1-10 or 1-5 scale, such as:

  1. how often do you get mad?
  2. how often do you try new activities?
  3. how creative are you?
  4. how often do you read?
  5. how difficult do you find it to introduce yourself to other people?
  6. how often do you exercise?

There are a variety of sources online that can help guide which questions to use.

The above questions are not the actual survey questions, merely examples.

As of writing this, 2018-12-04, I still hadn’t presented my idea to the mentorship program’s organizers.


2. Pre-analysis

Data preparation

  • import: loads the questionnaire data into R, the analytic tool used in this analysis
  • clean: remove or impute missing values from the data (certain algorithms can’t process missing values)
  • scale: data across columns must be standardized or scaled, to make the variables comparable
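One note before the code: the chunks in this document call functions from several packages, presumably loaded in a hidden setup chunk in the original R Markdown. For anyone following along, this is the set the code appears to rely on (the package-to-function mapping is my best guess):

#packages the chunks appear to rely on (assumed; the original setup chunk is hidden)
library(VIM)         #aggr(): missing-value plots
library(knitr)       #kable()
library(kableExtra)  #kable_styling()
library(NbClust)     #NbClust()
library(factoextra)  #fviz_cluster(), fviz_nbclust(), fviz_gap_stat(), fviz_dist(), get_dist(), hcut()
library(fields)      #rdist()
library(data.table)  #data.table(), setorder(), %like%, %chin%
library(reshape2)    #melt()
library(ggplot2)     #ggplot() heat map
library(cluster)     #agnes(), pltree(), daisy(), silhouette(), clusGap()
library(purrr)       #map_dbl()
library(gridExtra)   #grid.arrange()
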
#file info
mentee_file <- 'mentee.csv'
mentor_file <- 'mentor.csv'

#import
mentor_table <- read.table(mentor_file, header=TRUE, sep = ',', row.names = 1)
mentee_table <- read.table(mentee_file, header=TRUE, sep = ',', row.names = 1)

Before we clean and scale the data let’s take a glance at the data itself.

Below is a plot of the number of missing values in each column. If there are none then the chart on the left will be blank.

aggr(rbind(mentor_table,mentee_table))

A glimpse at the data itself.

Mentor:

summary(mentor_table[,1:5])%>%
kable()%>%
kable_styling(bootstrap_options=c("striped")
              ,full_width = F
              ,position = "left"
              )
       q1              q2              q3              q4             q5
 Min.   : 1.00   Min.   : 1.00   Min.   : 1.00   Min.   : 1.0   Min.   : 1.0
 1st Qu.: 4.00   1st Qu.: 3.75   1st Qu.: 3.00   1st Qu.: 4.0   1st Qu.: 4.0
 Median : 5.00   Median : 5.00   Median : 5.00   Median : 6.0   Median : 6.0
 Mean   : 5.37   Mean   : 5.18   Mean   : 5.11   Mean   : 5.6   Mean   : 5.6
 3rd Qu.: 7.00   3rd Qu.: 7.00   3rd Qu.: 7.00   3rd Qu.: 7.0   3rd Qu.: 7.0
 Max.   :10.00   Max.   :10.00   Max.   :10.00   Max.   :10.0   Max.   :10.0

Mentee:

summary(mentee_table[,1:5])%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
              ,full_width = F
              ,position = "left"
              )
       q1              q2              q3              q4              q5
 Min.   : 1.00   Min.   : 1.00   Min.   : 1.00   Min.   : 1.00   Min.   : 1.00
 1st Qu.: 3.00   1st Qu.: 3.00   1st Qu.: 3.00   1st Qu.: 3.00   1st Qu.: 2.75
 Median : 5.00   Median : 5.00   Median : 6.00   Median : 5.00   Median : 5.00
 Mean   : 5.38   Mean   : 5.23   Mean   : 5.62   Mean   : 5.47   Mean   : 5.33
 3rd Qu.: 9.00   3rd Qu.: 8.00   3rd Qu.: 8.00   3rd Qu.: 8.00   3rd Qu.: 8.00
 Max.   :10.00   Max.   :10.00   Max.   :10.00   Max.   :10.00   Max.   :10.00

Finally, we clean the data, transform it if necessary, and scale it (there’s nothing to visualize here, so see the code below if interested).

#clean
mentor_table <- na.omit(mentor_table)
mentee_table <- na.omit(mentee_table)

#transformation
#pca: log
#mentor_table <- log(mentor_table)
#mentee_table <- log(mentee_table)

#scale
mentor_table <- scale(mentor_table)
mentee_table <- scale(mentee_table)

One final view after it’s all done.

mentor_table[1:5,1:5]%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
              ,full_width = F
              ,position = "left"
              )
                  q1         q2         q3         q4         q5
mentor_1  -0.6319271 -0.5105877  0.3797449  0.1772811  0.6073692
mentor_2   0.2905942 -0.0778863  0.3797449 -0.7091242  0.1735341
mentor_3  -0.1706664 -0.0778863 -0.0469348  0.6204837  0.6073692
mentor_4   0.2905942  0.7875165  1.2331044  1.0636863 -0.6941362
mentor_5   0.2905942 -0.0778863  0.8064246  1.0636863 -0.2603011
mentee_table[1:5,1:5]%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
              ,full_width = F
              ,position = "left"
              )
                 q1        q2        q3       q4        q5
student_1  0.851372 0.9271339 1.5373847 1.162461 0.8835237
student_2  1.176323 1.5965447 0.8353826 1.162461 0.8835237
student_3  1.501274 1.2618393 1.1863836 1.162461 1.2144315
student_4  1.176323 0.9271339 1.5373847 1.491770 0.8835237
student_5  0.851372 1.2618393 1.1863836 1.491770 1.5453393

3. The matching process

Two types of approaches:

  • supervised
  • unsupervised (Hierarchical Agglomerative Clustering, K-means)

A principal difference between supervised and unsupervised techniques is that the data used in supervised techniques is pre-labeled or pre-classified, which essentially means that the answer is explicitly stated in the training set, while in unsupervised learning the algorithms determine what the answer is. There is plenty of material online about the differences between the two approaches.

Our dataset is not pre-labeled. We don’t (and can’t) know how the mentors and mentees will respond beforehand, and we don’t know beforehand which mentee should be paired with which mentor. Trying to guess their answers before they answer is a different problem: a forecasting problem. This is a matching problem. So, we have to use an unsupervised technique. Perhaps we could apply the K-means algorithm? Unfortunately, there is a flaw with that approach: the algorithm won’t pair mentors with mentees one-to-one. Rather, it’ll group similar mentors and mentees together, and just like that we’re back where we started: which mentees to pair with which mentors?

What if instead we leveraged a simpler method: distance. In the appendix section Better understanding the mentees, I experimented with distance and used a distance plot to map how similar/dissimilar the mentees were to one another. With some tweaks, we can use a distance plot to map how similar/dissimilar each mentee is to each mentor. Voila, a one-to-one match. While not a particularly fancy technique, it is simple and easy to explain/interpret, and for anyone in corporate America, ease of understanding is a big selling point.
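To make “distance” concrete: after scaling, each person is just a vector of 25 numbers, and similarity is the straight-line (Euclidean) distance between two such vectors. A tiny illustration using the tables built in section 2:

#the distance between one mentee and one mentor is just
sqrt(sum((mentee_table["student_1", ] - mentor_table["mentor_1", ])^2))
#the smaller the value, the more similarly the two answered the questionnaire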

Pre-processing

First, have any of the mentees/mentors responded exactly the same as another mentee/mentor? To check, we will compare the dimensions of the responses against the dimensions of the unique responses. If the total and unique row counts are the same then there are no duplicates. The odds seem quite low but it’s worth validating (I tried to compute the odds by hand but gave up after a while).
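For what it’s worth, a back-of-the-envelope birthday-problem estimate, under the (unrealistic) assumption that everyone answers every question uniformly at random, says a duplicate is all but impossible:

#25 questions x 10 choices = 10^25 possible response vectors
#approximate probability that any 2 of 100 respondents collide
-expm1(-choose(100, 2) / 10^25)  #~ 4.95e-22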

#mentor
abc <- rbind(dim(mentor_table)[1],dim(unique(mentor_table))[1])
abc <- cbind(c("Total mentors","Unique mentors"),abc)

#mentee
def <- rbind(dim(mentee_table)[1],dim(unique(mentee_table))[1])
def <- cbind(c("Total mentees","Unique mentees"),def)

abc <- t(rbind(abc,def))
#abc[,c(1,3,2,4)]%>%
abc%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
Total mentors Unique mentors Total mentees Unique mentees
100 100 100 100

Now, we’ll attempt to find a one-to-one match. The number of unique, one-to-one matches is: 0
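The code behind that count is hidden; a sketch of one way such a count might be computed, by looking for mentees whose scaled answers exactly match some mentor’s:

#sketch: count mentees whose (scaled) answers exactly match some mentor's
sum(apply(mentee_table, 1, function(s)
  any(apply(mentor_table, 1, function(m) all(m == s)))))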

Let the pairing begin

I mentioned above that using K-means was flawed: K-means alone doesn’t get us what we want, a one-to-one match. However, the K-means algorithm can help us verify whether our pairings are reasonable. K-means will group similar mentees and mentors, we’ll compute the distance of each mentee to each mentor, and then we’ll combine the cluster data with the distance data. We expect all, or at least a majority, of the matched mentor-mentee pairs to fall in the same cluster.

We’ll compute K then cluster the mentors and mentees. Then we’ll append the cluster number to the combined mentor and mentee questionnaire response data. Finally, we’ll compute the distance.

combined <- rbind(mentor_table,mentee_table)
zyx <- suppressMessages(NbClust(combined, min.nc=2, max.nc=15, method="kmeans"))

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 2 proposed 2 as the best number of clusters 
## * 17 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 2 proposed 5 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
barplot(table(zyx$Best.n[1,])
        ,xlab="Numer of Clusters"
        ,ylab="Number of Criteria"
        ,main="Num of Clusters Chosen by 26 Criteria"
)

Another quick look at the data: plotting the mentors’ pairwise (Manhattan) distances against the mentees’.

mdist <- dist(mentor_table,method = "manhattan")
sdist <- dist(mentee_table,method = "manhattan")

plot(mdist,sdist)

With a value for K, let’s cluster

#k-means
fit.km_combined <- kmeans(combined
                 ,centers=3
                 ,nstart=25
                 ,algorithm="Lloyd"
                 ,iter.max = 1000
           )
fviz_cluster(fit.km_combined, data = combined)

Now, we’ll create the table of distance from mentee to mentor and visually represent the distance with a heat map.

#distance
rdist_boom <- rdist(as.table(combined))

#clean-up
row.names(rdist_boom)<-attributes(combined)$dimnames[[1]]
colnames(rdist_boom)<-attributes(combined)$dimnames[[1]]
##filter out mentees from columns and mentors from rows
rdist_boom <- rdist_boom[row.names(rdist_boom) %like% 'student',colnames(rdist_boom) %like% 'mentor']

#heat map
#source: https://sebastianraschka.com/Articles/heatmaps_in_r.html
#source: https://www.r-bloggers.com/drawing-heatmaps-in-r/
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)
heatmap(rdist_boom,Colv = FALSE,scale = "none",col=my_palette)

Now, let’s attach each mentee’s and mentor’s respective cluster.

#transform to long format
rdist_boom <- melt(rdist_boom)

#mentor cluster
#source: https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right
rdist_boom <- merge(x = rdist_boom
                    , y = melt(as.table(fit.km_combined$cluster))
                    , by.x = "Var2"
                    , by.y = "Var1"
                    , all.x = TRUE)

#student cluster
rdist_boom <- merge(x = rdist_boom
                    , y = melt(as.table(fit.km_combined$cluster))
                    , by = "Var1"
                    , all.x = TRUE)

#assign column names
colnames(rdist_boom)<-c('student','mentor','distance','mentor_cluster','student_cluster')

#convert to data table
rdist_boom <- data.table(rdist_boom)

Another heat map

#heat map
#source: https://www.r-bloggers.com/how-to-create-a-fast-and-easy-heatmap-with-ggplot2/
#source: https://stackoverflow.com/questions/14290364/heatmap-with-values-ggplot2
ggplot(rdist_boom, aes(student, mentor)) +
  geom_tile(aes(fill = distance), color = "white") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  ylab("Mentor") +
  xlab("Student") +
  theme(legend.title = element_text(size = 8),
        legend.text = element_text(size = 8),
        plot.title = element_text(size=10),
        axis.title=element_text(size=8,face="bold"),
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(fill = "Distance")

We will attempt to pair the mentors and the mentees in two different ways.

The first approach works by mentee: for each mentee, sort the mentors by distance. Within this approach we’ll try two methods… methodception lol.

  1. The top mentor
  2. The top three closest mentors
#filter for top n mentors by student
#source: https://stackoverflow.com/questions/16325641/how-to-extract-the-first-n-rows-per-group
#source: https://stackoverflow.com/questions/13685295/sort-a-data-table-fast-by-ascending-descending-order
rdist_boom_0 <- setorder(rdist_boom,student,distance)
rdist_boom_0 <- rdist_boom_0[,.SD[1:3],by=student]

#filter for min distance, ideal mentor
#source: https://stackoverflow.com/questions/16325641/how-to-extract-the-first-n-rows-per-group
#source: https://stackoverflow.com/questions/16573995/subset-by-group-with-data-table
rdist_boom_1 <- rdist_boom[,.SD[distance == min(distance)],by=student]

#method 1: top mentor by mentee
rdist_boom_1[1:3,]%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )  
student mentor distance mentor_cluster student_cluster
student_1 mentor_86 2.747217 3 3
student_2 mentor_96 2.414251 3 3
student_3 mentor_86 2.597768 3 3
#method 2: top n mentors by mentee
rdist_boom_0[1:9,]%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
student mentor distance mentor_cluster student_cluster
student_1 mentor_86 2.747217 3 3
student_1 mentor_85 2.925842 3 3
student_1 mentor_88 2.951561 3 3
student_2 mentor_96 2.414251 3 3
student_2 mentor_80 2.463323 3 3
student_2 mentor_84 2.805245 3 3
student_3 mentor_86 2.597768 3 3
student_3 mentor_87 2.814029 3 3
student_3 mentor_90 2.814439 3 3

Do you see an issue with this approach? In either method 1 (top mentor by mentee) or method 2 (top n mentors by mentee), are all mentors represented? Was each mentor paired with a mentee? Let’s see.

Method 1: Top mentor

foo <- rbind(dim(mentor_table)[1],dim(unique(rdist_boom_1[,list(mentor)]))[1])
foo <- cbind(c("Total mentors","Mentors matched"),foo)

foo1 <- rbind(dim(mentee_table)[1],dim(unique(rdist_boom_1[,list(student)]))[1])
foo1 <- cbind(c("Total mentees","Mentees matched"),foo1)

foo <- t(rbind(foo,foo1))
foo <- foo[,c(1,3,2,4)]

foo %>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
Total mentors Total mentees Mentors matched Mentees matched
100 100 44 100

Method 2: Top n mentors

bar <- rbind(dim(mentor_table)[1],dim(unique(rdist_boom_0[,list(mentor)]))[1])
bar <- cbind(c("Total mentors","Mentors matched"),bar)

bar1 <- rbind(dim(mentee_table)[1],dim(unique(rdist_boom_0[,list(student)]))[1])
bar1 <- cbind(c("Total mentees","Mentees matched"),bar1)

bar <- t(rbind(bar,bar1))
bar <- bar[,c(1,3,2,4)]

bar %>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )  
Total mentors Total mentees Mentors matched Mentees matched
100 100 74 100

As we can see, not all mentors were matched with a mentee. Now, we can debate whether or not a one-to-one match is the appropriate approach. We could argue that unmatched mentors simply shouldn’t be matched with a mentee. However, for the sake of practice and, one of our favorite corporate buzzwords, inclusion, let’s present an alternative approach.

This time we’ll focus on the mentors and sort/filter by the mentees.

Method 3: Top mentee

#reorder columns
#source: http://www.cookbook-r.com/Manipulating_data/Reordering_the_columns_in_a_data_frame/
rdist_boom_2 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_2 <- setorder(rdist_boom_2,mentor,distance)
rdist_boom_2 <- rdist_boom_2[,.SD[distance == min(distance)],by=mentor]

spam <- rbind(dim(mentor_table)[1],dim(unique(rdist_boom_2[,list(mentor)]))[1])
spam <- cbind(c("Total mentors","Mentors matched"),spam)

spam1 <- rbind(dim(mentee_table)[1],dim(unique(rdist_boom_2[,list(student)]))[1])
spam1 <- cbind(c("Total mentees","Mentees matched"),spam1)

spam <- t(rbind(spam,spam1))
spam <- spam[,c(1,3,2,4)]

spam %>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
Total mentors Total mentees Mentors matched Mentees matched
100 100 100 54

Method 4: Top n mentees

rdist_boom_3 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_3 <- setorder(rdist_boom_3,mentor,distance)
rdist_boom_3 <- rdist_boom_3[,.SD[1:3],by=mentor]

spam2 <- rbind(dim(mentor_table)[1],dim(unique(rdist_boom_3[,list(mentor)]))[1])
spam2 <- cbind(c("Total mentors","Mentors matched"),spam2)

spam3 <- rbind(dim(mentee_table)[1],dim(unique(rdist_boom_3[,list(student)]))[1])
spam3 <- cbind(c("Total mentees","Mentees matched"),spam3)

spam2 <- t(rbind(spam2,spam3))
spam2 <- spam2[,c(1,3,2,4)]

spam2 %>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
Total mentors Total mentees Mentors matched Mentees matched
100 100 100 73

A quick discussion. There are two general approaches:

  • Top match
  • Top n match

Personally, I prefer the top match approach. It takes away the guesswork and provides a single suggestion to whoever reads our report. However, neither the mentor-focused nor the mentee-focused approach included all the mentors or mentees. In some applications that’s probably okay: you’d rather provide the best matches, and those who didn’t match can wait until a better-suited mentor/mentee is available.

In an effort to practice further, what if we provided a comprehensive list of matched mentors and mentees? We would take the matches from one of the top-match methods above and continue matching with the leftovers so that no mentor is without a mentee.

It might be useful to create a rank column that ranks each mentor-mentee pair from least distant (most similar) to most distant (least similar).

Since the mentor-focused approach in method 3 had a higher match rate than method 1, we’ll leverage that.

#source: https://stackoverflow.com/questions/15170777/add-a-rank-column-to-a-data-frame
rdist_boom_4 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_4 <- setorder(rdist_boom_4,mentor,distance)
rdist_boom_4 <- rdist_boom_4[,rank:=rank(distance,ties.method="first"),by=mentor]
head(rdist_boom_4)%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
mentor student distance mentor_cluster student_cluster rank
mentor_1 student_41 3.029679 1 1 1
mentor_1 student_33 3.037549 1 1 2
mentor_1 student_34 3.085842 1 1 3
mentor_1 student_37 3.139329 1 1 4
mentor_1 student_30 3.154351 1 1 5
mentor_1 student_46 3.244478 1 1 6

On second thought, why focus on the mentors? What we’re really concerned about is the distance (i.e. how well a mentor and mentee match), right? So, let’s rank by distance without a specific focus on either mentor or mentee.

rdist_boom_5 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_5 <- setorder(rdist_boom_5,distance)
rdist_boom_5 <- rdist_boom_5[,rank:=rank(distance,ties.method="first")]
head(rdist_boom_5)%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
mentor student distance mentor_cluster student_cluster rank
mentor_88 student_16 2.099942 3 3 1
mentor_32 student_53 2.120671 2 2 2
mentor_45 student_73 2.290427 2 2 3
mentor_48 student_71 2.301658 2 2 4
mentor_32 student_52 2.329975 2 2 5
mentor_88 student_7 2.372999 3 3 6

With our dataset ordered by distance (dataset A), we will now loop through dataset A and build a new dataset (dataset B). After each insertion into dataset B, we’ll exclude from dataset A any record containing that mentor or mentee, thus creating a unique list of mentors and mentees.

rdist_boom_6 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_6 <- setorder(rdist_boom_6,distance)
rdist_boom_6 <- rdist_boom_6[,rank_d:=rank(distance,ties.method="first")]
rdist_boom_6 <- rdist_boom_6[,rank_m:=rank(distance,ties.method="first"),by=mentor]
rdist_boom_6 <- rdist_boom_6[,rank_s:=rank(distance,ties.method="first"),by=student]

#source: https://stackoverflow.com/questions/29711377/how-do-i-initialize-an-empty-or-example-data-table
comp_list <- data.table(
  mentor = character()
  ,student = character()
  ,distance = numeric()
  ,mentor_cluster = integer()
  ,student_cluster = integer()
  ,rank_d = integer()
  ,rank_m = integer()
  ,rank_s = integer()
)

#source: https://stackoverflow.com/questions/7494848/standard-way-to-remove-multiple-elements-from-a-dataframe
#tweaked for characters
#`%notin%` <- function(x,y) !(x %chin% y)

#source: https://www.datacamp.com/community/tutorials/tutorial-on-loops-in-r

repeat {
  #keep only rows whose mentor AND mentee are not yet in the final list
  loopy <- rdist_boom_6[!(as.character(rdist_boom_6$mentor)
                                 %chin%
                                 as.character(comp_list$mentor))
                               &
                               !(as.character(rdist_boom_6$student)
                                 %chin%
                                 as.character(comp_list$student)),]
  #the table is sorted by distance, so row 1 is the closest remaining pair;
  #greedily add it to the final list
  comp_list <- rbind(comp_list,loopy[1,])

  #stop once all 100 pairs are made and both columns are duplicate-free
  if (nrow(comp_list)==100 &
      dim(comp_list[,list(mentor)])[1] == dim(unique(comp_list[,list(mentor)]))[1] &
      dim(comp_list[,list(student)])[1] == dim(unique(comp_list[,list(student)]))[1]){
    break
  }
}

comp_list[order(mentor),1:5]%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )  
mentor student distance mentor_cluster student_cluster
mentor_1 student_41 3.029679 1 1
mentor_2 student_45 2.562457 1 1
mentor_3 student_32 2.392411 1 1
mentor_4 student_82 4.100195 1 1
mentor_5 student_42 2.625672 1 1
mentor_6 student_33 3.039905 1 1
mentor_7 student_46 2.662955 1 1
mentor_8 student_43 2.581001 1 1
mentor_9 student_96 4.807285 1 1
mentor_10 student_85 5.520550 1 1
mentor_11 student_83 4.456761 1 1
mentor_12 student_89 5.149710 1 1
mentor_13 student_34 2.896505 1 1
mentor_14 student_88 4.845568 1 1
mentor_15 student_36 2.652919 1 1
mentor_16 student_37 2.807648 1 1
mentor_17 student_98 7.690623 1 1
mentor_18 student_92 5.281072 1 1
mentor_19 student_86 4.362615 1 1
mentor_20 student_28 2.847893 1 1
mentor_21 student_48 3.159907 1 1
mentor_22 student_94 4.473839 1 1
mentor_23 student_49 3.198363 1 1
mentor_24 student_39 2.511293 1 1
mentor_25 student_97 5.304856 1 1
mentor_26 student_74 2.959585 2 2
mentor_27 student_57 2.649959 2 2
mentor_28 student_69 2.464794 2 2
mentor_29 student_62 2.646257 2 2
mentor_30 student_70 2.948451 2 2
mentor_31 student_55 2.684011 2 2
mentor_32 student_53 2.120671 2 2
mentor_33 student_64 2.844501 2 2
mentor_34 student_58 2.899029 2 2
mentor_35 student_72 3.195314 2 2
mentor_36 student_66 3.106294 2 2
mentor_37 student_54 3.513544 2 2
mentor_38 student_68 2.917839 2 2
mentor_39 student_65 2.812174 2 2
mentor_40 student_52 3.274101 2 2
mentor_41 student_56 2.972919 2 2
mentor_42 student_61 2.500779 2 2
mentor_43 student_51 2.478912 2 2
mentor_44 student_60 2.867123 2 2
mentor_45 student_73 2.290427 2 2
mentor_46 student_59 3.794640 2 2
mentor_47 student_63 2.723469 2 2
mentor_48 student_71 2.301658 2 2
mentor_49 student_67 3.122865 2 2
mentor_50 student_75 3.228425 2 2
mentor_51 student_40 2.665241 1 1
mentor_52 student_31 2.664058 1 1
mentor_53 student_44 2.391656 1 1
mentor_54 student_76 5.250236 1 1
mentor_55 student_78 5.219787 1 1
mentor_56 student_84 5.041660 1 1
mentor_57 student_26 3.158960 1 1
mentor_58 student_77 4.077882 1 1
mentor_59 student_30 2.624056 1 1
mentor_60 student_80 4.944278 1 1
mentor_61 student_79 5.193446 1 1
mentor_62 student_93 6.083014 1 1
mentor_63 student_47 3.189654 1 1
mentor_64 student_50 2.715196 1 1
mentor_65 student_87 4.444670 1 1
mentor_66 student_99 4.904524 1 1
mentor_67 student_90 4.292501 1 1
mentor_68 student_35 2.736301 1 1
mentor_69 student_95 4.972855 1 1
mentor_70 student_91 4.347424 1 1
mentor_71 student_27 3.235039 1 1
mentor_72 student_100 4.599006 1 1
mentor_73 student_29 3.290807 1 1
mentor_74 student_81 4.787604 1 1
mentor_75 student_38 2.798083 1 1
mentor_76 student_20 2.846385 3 3
mentor_77 student_15 2.657693 3 3
mentor_78 student_11 2.698564 3 3
mentor_79 student_1 3.077246 3 3
mentor_80 student_14 2.624967 3 3
mentor_81 student_6 3.139111 3 3
mentor_82 student_4 2.814877 3 3
mentor_83 student_7 3.458420 3 3
mentor_84 student_9 2.577503 3 3
mentor_85 student_24 2.901643 3 3
mentor_86 student_19 2.521309 3 3
mentor_87 student_3 2.814029 3 3
mentor_88 student_16 2.099942 3 3
mentor_89 student_17 2.978750 3 3
mentor_90 student_18 2.891187 3 3
mentor_91 student_21 3.337798 3 3
mentor_92 student_12 4.039464 3 3
mentor_93 student_25 3.122153 3 3
mentor_94 student_10 2.729186 3 3
mentor_95 student_8 3.357755 3 3
mentor_96 student_2 2.414251 3 3
mentor_97 student_13 2.564496 3 3
mentor_98 student_5 2.869701 3 3
mentor_99 student_22 3.041030 3 3
mentor_100 student_23 3.511749 3 3

And there we have it. A unique list of mentors and mentees algorithmically paired.


Final Thoughts

If you’ve read this document in its entirety, thank you! I hope you found it useful or, at minimum, entertaining. I’d still like to put this to the test with real people and a real questionnaire.

Updates

2019-01-04: The mentor program directors have decided not to pair the mentees and mentors. Instead, the program will keep a group structure. I did present my prototype and it was well received. I’m still searching for an opportunity to apply this. Until next time.


Appendix

With the matching process complete, I still wanted to practice different techniques. Below you’ll find different exploratory approaches to clustering and matching.

  1. Practice: Mentors
  2. Practice: Mentees

1. Better understanding the mentors

Before we get into the fancy stuff let’s do something simple. Let’s see if any mentors responded the same as other mentors. The odds seem low given that in our example there are 10 possible answers for each of the 25 questions, but you never know.

To accomplish this uniqueness check we’ll compare the dimensions of the mentor table against the dimensions of the unique mentor table. And though it’s simple, the uniqueness check is important because some techniques require unique respondents. If the dimensions are the same then no further analysis is needed. If the dimensions were different, it would require further investigation: which mentors responded the same, and why? Given how low the odds are that any two or more people would respond identically, a duplicate would suggest a data integrity issue (e.g. a data entry error), or perhaps that the respondents completed the survey together on purpose.

rbind(dim(mentor_table),dim(unique(mentor_table)))
##      [,1] [,2]
## [1,]  100   25
## [2,]  100   25

Now, let’s analyze which mentors are similar (i.e. which mentors answered the questionnaire similarly). To accomplish this we will use Hierarchical Agglomerative Clustering (HAC), also known as AGNES (Agglomerative Nesting).
There are several clustering methods:

  • Average
  • Single
  • Complete
  • Ward

Each method will result in a different grouping of mentors. To identify which best fits our data, we’ll compute the agglomerative coefficient (AC), which measures the amount of clustering structure found (values closer to 1 suggest stronger clustering structure). We’ll select the method with the highest AC value.

# methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

# function to compute coefficient
ac <- function(x) {agnes(mentor_table, method = x)$ac}

t(map_dbl(m, ac))%>%
  kable()%>%
  kable_styling(bootstrap_options=c("striped", "hover")
                ,full_width = F
                ,position = "left"
                )
average single complete ward
0.5944711 0.2007084 0.7407678 0.922375

Unlike other clustering techniques, hierarchical clustering doesn’t require that we specify a particular number of clusters up front. The algorithm will detect patterns in the data and split the mentors into groups. It does require a clustering method, though; having selected one from the table above, we’ll create the dendrogram.

# Dissimilarity matrix
d <- dist(mentor_table, method = "euclidean")
mentor_hc1 <- agnes(mentor_table, method = "ward")

pltree(mentor_hc1
       , cex = 0.6
       , hang = -1
       , main = "Dendrogram"
       , xlab = 'Mentors'
)

Determining Optimal Clusters

Although hierarchical clustering doesn’t require we specify a particular number of clusters, doing so can be useful to figure out how to group the mentors. There are a number of techniques to assess how many groups of mentors there should be:

  • Elbow Method
  • Average Silhouette Method
  • Gap Statistic Method

Elbow

fviz_nbclust(mentor_table, FUN = hcut, method = "wss")

Average Silhouette

fviz_nbclust(mentor_table, FUN = hcut, method = "silhouette")

Gap Statistic

gap_stat <- clusGap(mentor_table
                    , FUN = hcut
                    , nstart = 25
                    , K.max = 15
                    , B = 500
                    , d.power = 2
                    , spaceH0 = "scaledPCA"
            )
fviz_gap_stat(gap_stat)

Two of the three techniques above recommended 3 so let’s set k = 3.

k <- 3

Which mentors are in each group?

d <- dist(mentor_table, method = "euclidean")
hc5 <- hclust(d, method = "ward.D2" )
plot(hc5, cex = 0.6)
rect.hclust(hc5, k = k, border = 2:5)

How many mentors are in each group?

# Number of members in each cluster
clusters <- cutree(as.hclust(mentor_hc1), k = k)
table(clusters)
## clusters
##  1  2  3 
## 50 25 25
# append the cluster for each mentor
#mentor_table <- cbind(mentor_table,clusters)

Finally, let’s visualize the groups

fviz_cluster(list(data = mentor_table, cluster = clusters)
             ,main='Mentor clusters'
             #,xlab='x label'
             #,ylab='y label'
)


2. Better understanding the mentees

In an effort to explore other techniques and solutions we’ll now test a different clustering technique, K-means.

From “R in Action, Second Edition”, the K-means algorithm (a code sketch follows the list):

  1. Selects K centroids (K rows chosen at random)
  2. Assigns each data point to its closest centroid
  3. Recalculates the centroids as the average of all data points in a cluster (i.e., the centroids are p-length mean vectors, where p is the number of variables)
  4. Reassigns data points to their closest centroids
  5. Continues steps 3 and 4 until the observations are not reassigned or the maximum number of iterations (R uses 10 as a default) is reached
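To make those steps concrete, here’s a minimal hand-rolled sketch of one Lloyd iteration with k = 3 (for illustration only; kmeans() handles all of this, plus the convergence loop, for us):

#one Lloyd iteration by hand, k = 3
set.seed(42)
centers <- mentee_table[sample(nrow(mentee_table), 3), ]       #step 1: pick 3 random rows
closest <- function(x) which.min(colSums((t(centers) - x)^2))  #squared distance to each centroid
clusters <- apply(mentee_table, 1, closest)                    #step 2: assign each mentee
centers <- apply(mentee_table, 2, function(col)                #step 3: recompute centroids
  tapply(col, clusters, mean))
#steps 4 and 5: repeat assignment/recomputation until assignments stop changing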

More information about K-means is widely available online.

Exploratory

We can view how the mentees compare to each other using a distance plot. Knowing whether the mentees are similar to or different from one another helps us better understand the data. For example, if none of the mentees are alike then perhaps all of the mentees responded randomly.

#source
#1: https://uc-r.github.io/kmeans_clustering

distance <- get_dist(mentee_table)
fviz_dist(distance
          , gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07")
          ,show_labels = TRUE
)

Figuring out a value for K

Unlike the hierarchical clustering technique used to better understand the mentors, K-means requires that we specify the number of clusters (the “k”) at the onset. To investigate an appropriate number of clusters we can use the following visuals.

wssplot <- function(data, nc=15, seed=1234){
               wss <- (nrow(data)-1)*sum(apply(data,2,var))
               for (i in 2:nc){
                    set.seed(seed)
                    wss[i] <- sum(kmeans(data, centers=i,iter.max=1000,nstart=25)$withinss)}
                plot(1:nc, wss, type="b", xlab="Number of Clusters",
                     ylab="Within groups sum of squares")}

suppressMessages(wssplot(mentee_table))

nc <- suppressMessages(NbClust(mentee_table, min.nc=2, max.nc=15, method="kmeans"))

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 11 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 2 proposed 7 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
suppressMessages(table(nc$Best.n[1,]))
## 
##  0  1  2  3  4  7 13 14 
##  2  1  6 11  2  2  1  1
barplot(table(nc$Best.n[1,])
        ,xlab="Numer of Clusters"
        ,ylab="Number of Criteria"
        ,main="Number of Clusters Chosen by 26 Criteria"
)

Visualizing the clusters

fit.km_student <- kmeans(mentee_table
                 ,centers=3
                 ,nstart=25
                 ,algorithm="Lloyd"
                 ,iter.max = 1000
           )
#source
#1: https://stats.stackexchange.com/questions/31083/how-to-produce-a-pretty-plot-of-the-results-of-k-means-cluster-analysis

dissE <- daisy(mentee_table)
dE2   <- dissE^2
sk2   <- silhouette(fit.km_student$cl, dE2)
plot(sk2)

#source
#1: https://uc-r.github.io/kmeans_clustering

fviz_cluster(fit.km_student, data = mentee_table)

One final view: multiple plots

#source
#1: https://uc-r.github.io/kmeans_clustering

k3 <- kmeans(mentee_table,centers=3,nstart=25,iter.max=1000,algorithm="Lloyd")
k4 <- kmeans(mentee_table,centers=4,nstart=25,iter.max=1000,algorithm="Lloyd")
k5 <- kmeans(mentee_table,centers=5,nstart=25,iter.max=1000,algorithm="Lloyd")
k6 <- kmeans(mentee_table,centers=6,nstart=25,iter.max=1000,algorithm="Lloyd")

# plots to compare
p2 <- fviz_cluster(k3, geom = "point",  data = mentee_table) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = mentee_table) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = mentee_table) + ggtitle("k = 5")
p5 <- fviz_cluster(k6, geom = "point",  data = mentee_table) + ggtitle("k = 6")

grid.arrange(p2,p3,p4,p5, nrow = 2)