Last refreshed on: February 22, 2019
The following is a proof of concept (PoC) on how to leverage data science to pair mentors with mentees.
At a recent volunteer event, where each volunteer would eventually be paired with a mentee, one of the volunteers suggested using a questionnaire to help make more informed pairings. It occurred to me that this was an excellent idea, and that the pairing could be done algorithmically.
I’ve been interested in people analytics for some time now and this project was the perfect gateway. So, I set out to gather information on mentor-mentee pairing. I began with the search engines, obviously lol: DuckDuckGo and Google. I searched the forums, then looked for published research, and to my surprise found very little. I did find a cool gem though, the People Matching Project. Having learned little about mentor-mentee pairing specifically, I proceeded to contemplate how to accomplish this endeavor myself. I considered both supervised techniques (e.g. KNN) and unsupervised techniques (e.g. K-means, hierarchical clustering). Given that I wouldn’t have a pre-labeled dataset, unsupervised techniques seemed the way to go.
S/N: The algorithms used by match-making sites would probably be useful for applications such as this, but I was unable to find much about the techniques those services use. That makes sense considering it’s how the match-makers make money… it’s probably proprietary.
Last, I took the time to document this to obtain buy-in from the mentorship program leaders, to demonstrate to any mentee interested in STEM how fun and cool a career in our field can be, to explore people analytics, to fulfill a desire to publish something relevant to analytics, and to assist anyone else attempting something similar. That said, if you find a better way to do something, or think a particular technique would be more appropriate or more accurate, please reach out and let’s make it better.
#open_source
Like any analytics project, the launch point is data. This endeavor is no different, except that since this was drafted as a proof of concept to obtain buy-in from the organizers of the mentorship program, the mentees had not yet been asked to take a survey. Therefore, I had to fabricate responses to imaginary questions. I did this with the random-between function in a popular spreadsheet application, trying to mimic response patterns by arbitrarily choosing cutoffs for subsets of questions and of mentees/mentors, intentionally creating groups that responded similarly. The end result, as you’ll see, is a set of fairly distinct respondents. Admittedly, not the best approach; ideally I would have created an actual survey, but this is a proof of concept and I didn’t have time to poll people for help.
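For anyone reproducing this without a spreadsheet, here is a minimal sketch of how similar fake responses could be simulated directly in R; the group sizes, group centers, and seed below are all made up purely for illustration.
#simulate 1-10 answers to 25 questions for 100 respondents, with three groups
#that tend to answer around different values (all numbers here are assumptions)
set.seed(42)
group_sizes   <- c(50, 25, 25)
group_centers <- c(3, 5, 8)
fake_responses <- do.call(rbind, lapply(seq_along(group_sizes), function(g) {
  vals <- round(rnorm(group_sizes[g] * 25, mean = group_centers[g], sd = 2))
  matrix(pmin(pmax(vals, 1), 10), nrow = group_sizes[g], ncol = 25)
}))
colnames(fake_responses) <- paste0("q", 1:25)
rownames(fake_responses) <- paste0("respondent_", seq_len(nrow(fake_responses)))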
Randomly selected responses to make-believe questions aside, the actual survey questions are vital to the success of the mentor-mentee matching process, so it’s critical that the questions be as independent of one another as possible. For example, there shouldn’t be two or more questions about pets, or multiple questions probing the same character trait, experience, or interest.
For our purposes, I recommend a mix of personality- and interest-based questions on a 1-10 or 1-5 scale, such as:
There are a variety of sources online that can help guide which questions to use.
The above questions are not the actual survey questions, merely examples.
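Once real responses exist, one hedged way to check that the questions are reasonably independent is to look at pairwise correlations between the question columns; the `responses` object and the 0.8 cutoff below are hypothetical.
#flag question pairs whose answers are highly correlated; such pairs likely
#measure the same trait and one of them could be dropped
#(`responses` is a hypothetical data frame of numeric survey answers)
q_cor <- cor(responses)
which(abs(q_cor) > 0.8 & upper.tri(q_cor), arr.ind = TRUE)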
As of writing this, 2018-12-04, I still hadn’t presented my idea to the mentorship program’s organizers.
Data preparation
#file info
mentee_file <- 'mentee.csv'
mentor_file <- 'mentor.csv'
#import
mentor_table <- read.table(mentor_file, header=TRUE, sep = ',', row.names = 1)
mentee_table <- read.table(mentee_file, header=TRUE, sep = ',', row.names = 1)
Before we clean and scale the data, let’s take a glance at it.
Below is a plot of the number of missing values in each column. If there are none then the chart on the left will be blank.
aggr(rbind(mentor_table,mentee_table))
A glimpse at the data itself.
Mentor:
summary(mentor_table[,1:5])%>%
kable()%>%
kable_styling(bootstrap_options=c("striped")
,full_width = F
,position = "left"
)
| q1 | q2 | q3 | q4 | q5 |
|---|---|---|---|---|
| Min. : 1.00 | Min. : 1.00 | Min. : 1.00 | Min. : 1.0 | Min. : 1.0 |
| 1st Qu.: 4.00 | 1st Qu.: 3.75 | 1st Qu.: 3.00 | 1st Qu.: 4.0 | 1st Qu.: 4.0 |
| Median : 5.00 | Median : 5.00 | Median : 5.00 | Median : 6.0 | Median : 6.0 |
| Mean : 5.37 | Mean : 5.18 | Mean : 5.11 | Mean : 5.6 | Mean : 5.6 |
| 3rd Qu.: 7.00 | 3rd Qu.: 7.00 | 3rd Qu.: 7.00 | 3rd Qu.: 7.0 | 3rd Qu.: 7.0 |
| Max. :10.00 | Max. :10.00 | Max. :10.00 | Max. :10.0 | Max. :10.0 |
Mentee:
summary(mentee_table[,1:5])%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| q1 | q2 | q3 | q4 | q5 |
|---|---|---|---|---|
| Min. : 1.00 | Min. : 1.00 | Min. : 1.00 | Min. : 1.00 | Min. : 1.00 |
| 1st Qu.: 3.00 | 1st Qu.: 3.00 | 1st Qu.: 3.00 | 1st Qu.: 3.00 | 1st Qu.: 2.75 |
| Median : 5.00 | Median : 5.00 | Median : 6.00 | Median : 5.00 | Median : 5.00 |
| Mean : 5.38 | Mean : 5.23 | Mean : 5.62 | Mean : 5.47 | Mean : 5.33 |
| 3rd Qu.: 9.00 | 3rd Qu.: 8.00 | 3rd Qu.: 8.00 | 3rd Qu.: 8.00 | 3rd Qu.: 8.00 |
| Max. :10.00 | Max. :10.00 | Max. :10.00 | Max. :10.00 | Max. :10.00 |
Finally, we clean the data, transform it if necessary, and scale it (nothing to visualize here, so unhide the code if interested).
#clean
mentor_table <- na.omit(mentor_table)
mentee_table <- na.omit(mentee_table)
#transformation
#pca: log
#mentor_table <- log(mentor_table)
#mentee_table <- log(mentee_table)
#scale
mentor_table <- scale(mentor_table)
mentee_table <- scale(mentee_table)
One final view after it’s all done.
mentor_table[1:5,1:5]%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| q1 | q2 | q3 | q4 | q5 | |
|---|---|---|---|---|---|
| mentor_1 | -0.6319271 | -0.5105877 | 0.3797449 | 0.1772811 | 0.6073692 |
| mentor_2 | 0.2905942 | -0.0778863 | 0.3797449 | -0.7091242 | 0.1735341 |
| mentor_3 | -0.1706664 | -0.0778863 | -0.0469348 | 0.6204837 | 0.6073692 |
| mentor_4 | 0.2905942 | 0.7875165 | 1.2331044 | 1.0636863 | -0.6941362 |
| mentor_5 | 0.2905942 | -0.0778863 | 0.8064246 | 1.0636863 | -0.2603011 |
mentee_table[1:5,1:5]%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| q1 | q2 | q3 | q4 | q5 | |
|---|---|---|---|---|---|
| student_1 | 0.851372 | 0.9271339 | 1.5373847 | 1.162461 | 0.8835237 |
| student_2 | 1.176323 | 1.5965447 | 0.8353826 | 1.162461 | 0.8835237 |
| student_3 | 1.501274 | 1.2618393 | 1.1863836 | 1.162461 | 1.2144315 |
| student_4 | 1.176323 | 0.9271339 | 1.5373847 | 1.491770 | 0.8835237 |
| student_5 | 0.851372 | 1.2618393 | 1.1863836 | 1.491770 | 1.5453393 |
There are two types of approaches: supervised and unsupervised. A principal difference between them is that the data used by supervised techniques is pre-labeled or pre-classified, which essentially means the answer is explicitly stated in the training set, while in unsupervised learning the algorithms determine what the answer is. To read more about the differences between the two approaches click here.
Our dataset is not pre-labeled: we don’t and can’t know how the mentors and mentees will respond beforehand, and we don’t know in advance which mentee should be paired with which mentor. (Trying to guess their answers before they answer is a different problem, a forecasting problem; this is a matching problem.) So we have to use an unsupervised technique. Perhaps we could apply the K-means algorithm? Unfortunately there’s a flaw with that approach: the algorithm won’t pair mentors with mentees one-to-one. Rather, it will group like mentors and mentees together, and just like that we’re back where we started: which mentees to pair with which mentors?
What if instead we leveraged a simpler method: distance. In the appendix section Better understanding the mentees I experimented with distance, using a distance plot to map how similar or dissimilar the mentees were to one another. With some tweaks, we can use the same idea to map how similar or dissimilar each mentee is to each mentor. Voila, a one-to-one match. It’s not a particularly fancy technique, but it is simple and easy to explain and interpret, and for anyone in corporate America ease of understanding is a big selling point.
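To make the idea concrete, here’s a toy calculation using the scaled q1-q5 values for student_1 and mentor_1 from the tables above (the actual matching below uses all 25 questions).
#Euclidean distance between one mentee's and one mentor's scaled responses to
#the first five questions; a smaller distance means more similar answers
student_1 <- c(0.851, 0.927, 1.537, 1.162, 0.884)
mentor_1  <- c(-0.632, -0.511, 0.380, 0.177, 0.607)
sqrt(sum((student_1 - mentor_1)^2))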
First, did any of the mentees or mentors respond exactly the same as another mentee or mentor? To check, we compare the dimensions of the response tables against the dimensions of the unique responses: if the row and column counts match, there are no duplicates. The odds of a duplicate seem quite low, but it’s worth validating (I tried to compute the odds but gave up after a while).
#mentor
abc <- rbind(dim(mentor_table)[1],dim(unique(mentor_table))[1])
abc <- cbind(c("Total mentors","Unique mentors"),abc)
#mentee
def <- rbind(dim(mentee_table)[1],dim(unique(mentee_table))[1])
def <- cbind(c("Total mentees","Unique mentees"),def)
abc <- t(rbind(abc,def))
#abc[,c(1,3,2,4)]%>%
abc%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| Total mentors | Unique mentors | Total mentees | Unique mentees |
|---|---|---|---|
| 100 | 100 | 100 | 100 |
Now we’ll check whether any mentee responded exactly like any mentor, i.e. whether there are any ready-made one-to-one matches. The number of unique, one-to-one matches is: 0
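The chunk behind that count is hidden, so here is one hedged way such a check could be computed, via an inner join on all 25 question columns. Strictly speaking this is best done on the raw responses; by this point the tables have already been scaled separately, so take it as an illustration of the idea.
#sketch: inner-join mentors and mentees on every question column; zero rows
#means no mentee's answers coincide exactly with any mentor's
exact_pairs <- merge(as.data.frame(mentor_table), as.data.frame(mentee_table))
nrow(exact_pairs)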
I mentioned above that using K-means alone was flawed: it doesn’t give us a one-to-one match. However, K-means can help us verify whether our pairings are reasonable. The algorithm will group similar mentees and mentors; we’ll then compute the distance of each mentee to each mentor and combine the cluster data with the distance data. We expect all, or at least a majority, of the matched mentor-mentee pairs to fall in the same cluster.
We’ll compute K then cluster the mentors and mentees. Then we’ll append the cluster number to the combined mentor and mentee questionnaire response data. Finally, we’ll compute the distance.
combined <- rbind(mentor_table,mentee_table)
zyx <- suppressMessages(NbClust(combined, min.nc=2, max.nc=15, method="kmeans"))
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 2 proposed 2 as the best number of clusters
## * 17 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
barplot(table(zyx$Best.n[1,])
,xlab="Number of Clusters"
,ylab="Number of Criteria"
,main="Number of Clusters Chosen by 26 Criteria"
)
Another quick look at the structure of the data before committing to a value for K: plotting the pairwise (Manhattan) distances among mentors against those among mentees.
mdist <- dist(mentor_table,method = "manhattan")
sdist <- dist(mentee_table,method = "manhattan")
plot(mdist,sdist)
With a value for K, let’s cluster
#k-means
fit.km_combined <- kmeans(combined
,centers=3
,nstart=25
,algorithm="Lloyd"
,iter.max = 1000
)
fviz_cluster(fit.km_combined, data = combined)
Now, we’ll create the table of distance from mentee to mentor and visually represent the distance with a heat map.
#distance
rdist_boom <- rdist(as.table(combined))
#clean-up
row.names(rdist_boom)<-attributes(combined)$dimnames[[1]]
colnames(rdist_boom)<-attributes(combined)$dimnames[[1]]
##filter out mentees from columns and mentors from rows
rdist_boom <- rdist_boom[row.names(rdist_boom) %like% 'student',colnames(rdist_boom) %like% 'mentor']
#heat map
#source: https://sebastianraschka.com/Articles/heatmaps_in_r.html
#source: https://www.r-bloggers.com/drawing-heatmaps-in-r/
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)
heatmap(rdist_boom,Colv = FALSE,scale = "none",col=my_palette)
Now, let’s attach each mentor’s and mentee’s cluster assignment.
#transform to long format
rdist_boom <- melt(rdist_boom)
#mentor cluster
#source: https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right
rdist_boom <- merge(x = rdist_boom
, y = melt(as.table(fit.km_combined$cluster))
, by.x = "Var2"
, by.y = "Var1"
, all.x = TRUE)
#student cluster
rdist_boom <- merge(x = rdist_boom
, y = melt(as.table(fit.km_combined$cluster))
, by = "Var1"
, all.x = TRUE)
#assign column names
colnames(rdist_boom)<-c('student','mentor','distance','mentor_cluster','student_cluster')
#convert to data table
rdist_boom <- data.table(rdist_boom)
Another heat map
#heat map
#source: https://www.r-bloggers.com/how-to-create-a-fast-and-easy-heatmap-with-ggplot2/
#source: https://stackoverflow.com/questions/14290364/heatmap-with-values-ggplot2
ggplot(rdist_boom, aes(student, mentor)) +
geom_tile(aes(fill = distance), color = "white") +
scale_fill_gradient(low = "white", high = "steelblue") +
ylab("Mentor") +
xlab("Student") +
theme(legend.title = element_text(size = 8),
legend.text = element_text(size = 8),
plot.title = element_text(size=10),
axis.title=element_text(size=8,face="bold"),
axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(fill = "Distance")
We will attempt to pair the mentors and the mentees in two different ways.
The first approach works mentee by mentee: for each mentee, we sort the mentors by distance. Within this approach we’ll try two variants (methodception lol): the single closest mentor, and the top n closest mentors.
#filter for top n mentors by student
#source: https://stackoverflow.com/questions/16325641/how-to-extract-the-first-n-rows-per-group
#source: https://stackoverflow.com/questions/13685295/sort-a-data-table-fast-by-ascending-descending-order
rdist_boom_0 <- setorder(rdist_boom,student,distance)
rdist_boom_0 <- rdist_boom_0[,.SD[1:3],by=student]
#filter for min distance, ideal mentor
#source: https://stackoverflow.com/questions/16325641/how-to-extract-the-first-n-rows-per-group
#source: https://stackoverflow.com/questions/16573995/subset-by-group-with-data-table
rdist_boom_1 <- rdist_boom[,.SD[distance == min(distance)],by=student]
#method 1: top mentor by mentee
rdist_boom_1[1:3,]%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| student | mentor | distance | mentor_cluster | student_cluster |
|---|---|---|---|---|
| student_1 | mentor_86 | 2.747217 | 3 | 3 |
| student_2 | mentor_96 | 2.414251 | 3 | 3 |
| student_3 | mentor_86 | 2.597768 | 3 | 3 |
#method 2: top n mentors by mentee
rdist_boom_0[1:9,]%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| student | mentor | distance | mentor_cluster | student_cluster |
|---|---|---|---|---|
| student_1 | mentor_86 | 2.747217 | 3 | 3 |
| student_1 | mentor_85 | 2.925842 | 3 | 3 |
| student_1 | mentor_88 | 2.951561 | 3 | 3 |
| student_2 | mentor_96 | 2.414251 | 3 | 3 |
| student_2 | mentor_80 | 2.463323 | 3 | 3 |
| student_2 | mentor_84 | 2.805245 | 3 | 3 |
| student_3 | mentor_86 | 2.597768 | 3 | 3 |
| student_3 | mentor_87 | 2.814029 | 3 | 3 |
| student_3 | mentor_90 | 2.814439 | 3 | 3 |
Do you see an issue with this approach? In either method 1 (top mentor by mentee) or method 2 (top n mentors by mentee), are all mentors represented? Was each mentor paired with a mentee? Let’s see.
Method 1: Top mentor
foo <- rbind(dim(mentor_table)[1],dim(unique(rdist_boom_1[,list(mentor)]))[1])
foo <- cbind(c("Total mentors","Mentors matched"),foo)
foo1 <- rbind(dim(mentee_table)[1],dim(unique(rdist_boom_1[,list(student)]))[1])
foo1 <- cbind(c("Total mentees","Mentees matched"),foo1)
foo <- t(rbind(foo,foo1))
foo <- foo[,c(1,3,2,4)]
foo %>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| Total mentors | Total mentees | Mentors matched | Mentees matched |
|---|---|---|---|
| 100 | 100 | 44 | 100 |
Method 2: Top n mentors
bar <- rbind(dim(mentor_table)[1],dim(unique(rdist_boom_0[,list(mentor)]))[1])
bar <- cbind(c("Total mentors","Mentors matched"),bar)
bar1 <- rbind(dim(mentee_table)[1],dim(unique(rdist_boom_0[,list(student)]))[1])
bar1 <- cbind(c("Total mentees","Mentees matched"),bar1)
bar <- t(rbind(bar,bar1))
bar <- bar[,c(1,3,2,4)]
bar %>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| Total mentors | Total mentees | Mentors matched | Mentees matched |
|---|---|---|---|
| 100 | 100 | 74 | 100 |
As we can see, not all mentors were matched with a mentee. Now, we can debate whether a one-to-one match is even the appropriate approach; we could argue that unmatched mentors simply shouldn’t be matched with a mentee. However, for the sake of practice, and in the name of one of our favorite corporate buzzwords, inclusion, let’s present an alternative approach.
This time we’ll focus on the mentors and sort/filter by the mentees.
Method 3: Top mentee
#reorder columns
#source: http://www.cookbook-r.com/Manipulating_data/Reordering_the_columns_in_a_data_frame/
rdist_boom_2 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_2 <- setorder(rdist_boom_2,mentor,distance)
rdist_boom_2 <- rdist_boom_2[,.SD[distance == min(distance)],by=mentor]
spam <- rbind(dim(mentor_table)[1],dim(unique(rdist_boom_2[,list(mentor)]))[1])
spam <- cbind(c("Total mentors","Mentors matched"),spam)
spam1 <- rbind(dim(mentee_table)[1],dim(unique(rdist_boom_2[,list(student)]))[1])
spam1 <- cbind(c("Total mentees","Mentees matched"),spam1)
spam <- t(rbind(spam,spam1))
spam <- spam[,c(1,3,2,4)]
spam %>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| Total mentors | Total mentees | Mentors matched | Mentees matched |
|---|---|---|---|
| 100 | 100 | 100 | 54 |
Method 4: Top n mentees
rdist_boom_3 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_3 <- setorder(rdist_boom_3,mentor,distance)
rdist_boom_3 <- rdist_boom_3[,.SD[1:3],by=mentor]
spam2 <- rbind(dim(mentor_table)[1],dim(unique(rdist_boom_3[,list(mentor)]))[1])
spam2 <- cbind(c("Total mentors","Mentors matched"),spam2)
spam3 <- rbind(dim(mentee_table)[1],dim(unique(rdist_boom_3[,list(student)]))[1])
spam3 <- cbind(c("Total mentees","Mentees matched"),spam3)
spam2 <- t(rbind(spam2,spam3))
spam2 <- spam2[,c(1,3,2,4)]
spam2 %>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| Total mentors | Total mentees | Mentors matched | Mentees matched |
|---|---|---|---|
| 100 | 100 | 100 | 73 |
A quick discussion: there are two general approaches, returning only the single top match or returning the top n matches.
Personally, I prefer the top-match approach. It takes away the guesswork and provides a single suggestion to whoever reads our report. However, neither the mentor-focused nor the mentee-focused approach included all of the mentors or mentees. In some applications that’s probably okay, because you’d prefer to provide the best matches and let those who didn’t match wait until a better-suited mentor/mentee is available.
In an effort to practice further, what if we provided a comprehensive list of matched mentors and mentees? We would take the matches from one of the top-match methods above and continue matching with the leftovers so that no mentor is without a mentee.
It might be useful to create a rank column that ranks the distance between mentor and mentee from least distant (most similar) to most distant (least similar).
Since the mentor-focused approach in method 3 matched every mentor, unlike method 1, we’ll build on that.
#source: https://stackoverflow.com/questions/15170777/add-a-rank-column-to-a-data-frame
rdist_boom_4 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_4 <- setorder(rdist_boom_4,mentor,distance)
rdist_boom_4 <- rdist_boom_4[,rank:=rank(distance,ties.method="first"),by=mentor]
head(rdist_boom_4)%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| mentor | student | distance | mentor_cluster | student_cluster | rank |
|---|---|---|---|---|---|
| mentor_1 | student_41 | 3.029679 | 1 | 1 | 1 |
| mentor_1 | student_33 | 3.037549 | 1 | 1 | 2 |
| mentor_1 | student_34 | 3.085842 | 1 | 1 | 3 |
| mentor_1 | student_37 | 3.139329 | 1 | 1 | 4 |
| mentor_1 | student_30 | 3.154351 | 1 | 1 | 5 |
| mentor_1 | student_46 | 3.244478 | 1 | 1 | 6 |
On second thought, why focus on the mentors? What we’re really concerned about is the distance (i.e. how well a mentor and mentee match), right? So, let’s rank by distance without a specific focus on either mentor or mentee.
rdist_boom_5 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_5 <- setorder(rdist_boom_5,distance)
rdist_boom_5 <- rdist_boom_5[,rank:=rank(distance,ties.method="first")]
head(rdist_boom_5)%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| mentor | student | distance | mentor_cluster | student_cluster | rank |
|---|---|---|---|---|---|
| mentor_88 | student_16 | 2.099942 | 3 | 3 | 1 |
| mentor_32 | student_53 | 2.120671 | 2 | 2 | 2 |
| mentor_45 | student_73 | 2.290427 | 2 | 2 | 3 |
| mentor_48 | student_71 | 2.301658 | 2 | 2 | 4 |
| mentor_32 | student_52 | 2.329975 | 2 | 2 | 5 |
| mentor_88 | student_7 | 2.372999 | 3 | 3 | 6 |
With our dataset ordered by distance (dataset A), we’ll now loop: take the closest remaining pair from dataset A, insert it into a new dataset (dataset B), and then drop from consideration every record in dataset A containing either that mentor or that mentee. Repeating this until everyone is paired gives a unique list of mentors and mentees.
rdist_boom_6 <- rdist_boom[,c(2,1,3,4,5)]
rdist_boom_6 <- setorder(rdist_boom_6,distance)
rdist_boom_6 <- rdist_boom_6[,rank_d:=rank(distance,ties.method="first")]
rdist_boom_6 <- rdist_boom_6[,rank_m:=rank(distance,ties.method="first"),by=mentor]
rdist_boom_6 <- rdist_boom_6[,rank_s:=rank(distance,ties.method="first"),by=student]
#source: https://stackoverflow.com/questions/29711377/how-do-i-initialize-an-empty-or-example-data-table
comp_list <- data.table(
mentor = character()
,student = character()
,distance = numeric()
,mentor_cluster = integer()
,student_cluster = integer()
,rank_d = integer()
,rank_m = integer()
,rank_s = integer()
)
#source: https://stackoverflow.com/questions/7494848/standard-way-to-remove-multiple-elements-from-a-dataframe
#tweaked for characters
#`%notin%` <- function(x,y) !(x %chin% y)
#source: https://www.datacamp.com/community/tutorials/tutorial-on-loops-in-r
repeat {
loopy <- rdist_boom_6[!(as.character(rdist_boom_6$mentor)
%chin%
as.character(comp_list$mentor))
&
!(as.character(rdist_boom_6$student)
%chin%
as.character(comp_list$student)),]
comp_list <- rbind(comp_list,loopy[1,])
if (nrow(comp_list)==100 &
dim(comp_list[,list(mentor)])[1] == dim(unique(comp_list[,list(mentor)]))[1] &
dim(comp_list[,list(student)])[1] == dim(unique(comp_list[,list(student)]))[1]){
break
}
}
comp_list[order(mentor),1:5]%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| mentor | student | distance | mentor_cluster | student_cluster |
|---|---|---|---|---|
| mentor_1 | student_41 | 3.029679 | 1 | 1 |
| mentor_2 | student_45 | 2.562457 | 1 | 1 |
| mentor_3 | student_32 | 2.392411 | 1 | 1 |
| mentor_4 | student_82 | 4.100195 | 1 | 1 |
| mentor_5 | student_42 | 2.625672 | 1 | 1 |
| mentor_6 | student_33 | 3.039905 | 1 | 1 |
| mentor_7 | student_46 | 2.662955 | 1 | 1 |
| mentor_8 | student_43 | 2.581001 | 1 | 1 |
| mentor_9 | student_96 | 4.807285 | 1 | 1 |
| mentor_10 | student_85 | 5.520550 | 1 | 1 |
| mentor_11 | student_83 | 4.456761 | 1 | 1 |
| mentor_12 | student_89 | 5.149710 | 1 | 1 |
| mentor_13 | student_34 | 2.896505 | 1 | 1 |
| mentor_14 | student_88 | 4.845568 | 1 | 1 |
| mentor_15 | student_36 | 2.652919 | 1 | 1 |
| mentor_16 | student_37 | 2.807648 | 1 | 1 |
| mentor_17 | student_98 | 7.690623 | 1 | 1 |
| mentor_18 | student_92 | 5.281072 | 1 | 1 |
| mentor_19 | student_86 | 4.362615 | 1 | 1 |
| mentor_20 | student_28 | 2.847893 | 1 | 1 |
| mentor_21 | student_48 | 3.159907 | 1 | 1 |
| mentor_22 | student_94 | 4.473839 | 1 | 1 |
| mentor_23 | student_49 | 3.198363 | 1 | 1 |
| mentor_24 | student_39 | 2.511293 | 1 | 1 |
| mentor_25 | student_97 | 5.304856 | 1 | 1 |
| mentor_26 | student_74 | 2.959585 | 2 | 2 |
| mentor_27 | student_57 | 2.649959 | 2 | 2 |
| mentor_28 | student_69 | 2.464794 | 2 | 2 |
| mentor_29 | student_62 | 2.646257 | 2 | 2 |
| mentor_30 | student_70 | 2.948451 | 2 | 2 |
| mentor_31 | student_55 | 2.684011 | 2 | 2 |
| mentor_32 | student_53 | 2.120671 | 2 | 2 |
| mentor_33 | student_64 | 2.844501 | 2 | 2 |
| mentor_34 | student_58 | 2.899029 | 2 | 2 |
| mentor_35 | student_72 | 3.195314 | 2 | 2 |
| mentor_36 | student_66 | 3.106294 | 2 | 2 |
| mentor_37 | student_54 | 3.513544 | 2 | 2 |
| mentor_38 | student_68 | 2.917839 | 2 | 2 |
| mentor_39 | student_65 | 2.812174 | 2 | 2 |
| mentor_40 | student_52 | 3.274101 | 2 | 2 |
| mentor_41 | student_56 | 2.972919 | 2 | 2 |
| mentor_42 | student_61 | 2.500779 | 2 | 2 |
| mentor_43 | student_51 | 2.478912 | 2 | 2 |
| mentor_44 | student_60 | 2.867123 | 2 | 2 |
| mentor_45 | student_73 | 2.290427 | 2 | 2 |
| mentor_46 | student_59 | 3.794640 | 2 | 2 |
| mentor_47 | student_63 | 2.723469 | 2 | 2 |
| mentor_48 | student_71 | 2.301658 | 2 | 2 |
| mentor_49 | student_67 | 3.122865 | 2 | 2 |
| mentor_50 | student_75 | 3.228425 | 2 | 2 |
| mentor_51 | student_40 | 2.665241 | 1 | 1 |
| mentor_52 | student_31 | 2.664058 | 1 | 1 |
| mentor_53 | student_44 | 2.391656 | 1 | 1 |
| mentor_54 | student_76 | 5.250236 | 1 | 1 |
| mentor_55 | student_78 | 5.219787 | 1 | 1 |
| mentor_56 | student_84 | 5.041660 | 1 | 1 |
| mentor_57 | student_26 | 3.158960 | 1 | 1 |
| mentor_58 | student_77 | 4.077882 | 1 | 1 |
| mentor_59 | student_30 | 2.624056 | 1 | 1 |
| mentor_60 | student_80 | 4.944278 | 1 | 1 |
| mentor_61 | student_79 | 5.193446 | 1 | 1 |
| mentor_62 | student_93 | 6.083014 | 1 | 1 |
| mentor_63 | student_47 | 3.189654 | 1 | 1 |
| mentor_64 | student_50 | 2.715196 | 1 | 1 |
| mentor_65 | student_87 | 4.444670 | 1 | 1 |
| mentor_66 | student_99 | 4.904524 | 1 | 1 |
| mentor_67 | student_90 | 4.292501 | 1 | 1 |
| mentor_68 | student_35 | 2.736301 | 1 | 1 |
| mentor_69 | student_95 | 4.972855 | 1 | 1 |
| mentor_70 | student_91 | 4.347424 | 1 | 1 |
| mentor_71 | student_27 | 3.235039 | 1 | 1 |
| mentor_72 | student_100 | 4.599006 | 1 | 1 |
| mentor_73 | student_29 | 3.290807 | 1 | 1 |
| mentor_74 | student_81 | 4.787604 | 1 | 1 |
| mentor_75 | student_38 | 2.798083 | 1 | 1 |
| mentor_76 | student_20 | 2.846385 | 3 | 3 |
| mentor_77 | student_15 | 2.657693 | 3 | 3 |
| mentor_78 | student_11 | 2.698564 | 3 | 3 |
| mentor_79 | student_1 | 3.077246 | 3 | 3 |
| mentor_80 | student_14 | 2.624967 | 3 | 3 |
| mentor_81 | student_6 | 3.139111 | 3 | 3 |
| mentor_82 | student_4 | 2.814877 | 3 | 3 |
| mentor_83 | student_7 | 3.458420 | 3 | 3 |
| mentor_84 | student_9 | 2.577503 | 3 | 3 |
| mentor_85 | student_24 | 2.901643 | 3 | 3 |
| mentor_86 | student_19 | 2.521309 | 3 | 3 |
| mentor_87 | student_3 | 2.814029 | 3 | 3 |
| mentor_88 | student_16 | 2.099942 | 3 | 3 |
| mentor_89 | student_17 | 2.978750 | 3 | 3 |
| mentor_90 | student_18 | 2.891187 | 3 | 3 |
| mentor_91 | student_21 | 3.337798 | 3 | 3 |
| mentor_92 | student_12 | 4.039464 | 3 | 3 |
| mentor_93 | student_25 | 3.122153 | 3 | 3 |
| mentor_94 | student_10 | 2.729186 | 3 | 3 |
| mentor_95 | student_8 | 3.357755 | 3 | 3 |
| mentor_96 | student_2 | 2.414251 | 3 | 3 |
| mentor_97 | student_13 | 2.564496 | 3 | 3 |
| mentor_98 | student_5 | 2.869701 | 3 | 3 |
| mentor_99 | student_22 | 3.041030 | 3 | 3 |
| mentor_100 | student_23 | 3.511749 | 3 | 3 |
And there we have it. A unique list of mentors and mentees algorithmically paired.
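One simple way to quantify the verification idea from earlier, that matched mentors and mentees should land in the same K-means cluster (scanning the table above, every pair does):
#share of matched pairs whose mentor and mentee fall in the same cluster
comp_list[, .(same_cluster_share = mean(mentor_cluster == student_cluster))]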
If you’ve read this document in its entirety, thank you! I hope you find it useful or, at minimum, entertaining. I’d still like to put this to the test with real people and a real questionnaire.
2019-01-04: The mentor program directors have decided not to pair the mentees and mentors. Instead, the program will keep a group structure. I did present my prototype and it was well received. I’m still searching for an opportunity to apply this. Until next time.
With the matching process complete, I still wanted to practice different techniques. Below you’ll find several exploratory approaches to clustering and matching.
Before we get into the fancy stuff let’s do something simple. Let’s see if any mentors responded the same as other mentors. The odds seem low given that in our example there are 10 possible answers for each of the 25 questions, but you never know.
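For a rough sense of just how low, here’s a back-of-envelope calculation; it assumes answers are uniformly random and independent, which the simulated data certainly is not, so treat it purely as an illustration.
#with 25 questions and 10 options there are 10^25 possible response vectors, so
#the chance that at least one pair of 100 respondents coincides is roughly
#choose(100, 2) / 10^25 (a birthday-problem style approximation)
choose(100, 2) / 10^25   #= 4.95e-22, effectively impossible by chance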
To accomplish this uniqueness check we’ll compare the dimensions of the mentor table against the dimensions of the unique mentor table. Though simple, the check matters because some techniques require unique respondents. If the dimensions are the same, no further analysis is needed. If they differed, it would require further investigation: which mentors responded the same, and why? Since the odds are low that any two people would respond identically, duplicates might point to a data integrity issue (e.g. a data entry error), or perhaps respondents completed the survey together on purpose.
rbind(dim(mentor_table),dim(unique(mentor_table)))
## [,1] [,2]
## [1,] 100 25
## [2,] 100 25
Now, let’s analyze which mentors are similar (i.e. which mentors answered the questionnaire similarly). To accomplish this we will use Hierarchical Agglomerative Clustering (HAC) or AGNES (Agglomerative Nesting) - source
There are several agglomerative clustering (linkage) methods, e.g. average, single, complete, and Ward. Each method will result in a different grouping of mentors. To identify which best fits our data we’ll compute the agglomerative coefficient (AC), which measures the amount of clustering structure found (values closer to 1 suggest strong clustering structure), and select the method with the highest AC value.
# methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
# function to compute coefficient
ac <- function(x) {agnes(mentor_table, method = x)$ac}
t(map_dbl(m, ac))%>%
kable()%>%
kable_styling(bootstrap_options=c("striped", "hover")
,full_width = F
,position = "left"
)
| average | single | complete | ward |
|---|---|---|---|
| 0.5944711 | 0.2007084 | 0.7407678 | 0.922375 |
Unlike some other clustering techniques, hierarchical clustering doesn’t require that we specify the number of clusters up front; the algorithm will detect patterns in the data and split the mentors into nested groups. It does require a linkage method, though. Using the best method from the table above, we’ll create the dendrogram.
# Dissimilarity matrix
d <- dist(mentor_table, method = "euclidean")
mentor_hc1 <- agnes(mentor_table, method = "ward")
pltree(mentor_hc1
, cex = 0.6
, hang = -1
, main = "Dendrogram"
, xlab = 'Mentors'
)
Although hierarchical clustering doesn’t require us to specify a particular number of clusters, doing so can be useful for deciding how to group the mentors. There are a number of techniques to assess how many groups of mentors there should be:
Elbow
fviz_nbclust(mentor_table, FUN = hcut, method = "wss")
Average Silhouette
fviz_nbclust(mentor_table, FUN = hcut, method = "silhouette")
Gap Statistic
gap_stat <- clusGap(mentor_table
, FUN = hcut
, nstart = 25
, K.max = 15
, B = 500
, d.power = 2
, spaceH0 = "scaledPCA"
)
fviz_gap_stat(gap_stat)
Two of the three techniques above recommended 3 so let’s set k = 3.
k <- 3
d <- dist(mentor_table, method = "euclidean")
hc5 <- hclust(d, method = "ward.D2" )
plot(hc5, cex = 0.6)
rect.hclust(hc5, k = k, border = 2:5)
# Number of members in each cluster
clusters <- cutree(as.hclust(mentor_hc1), k = k)
table(clusters)
## clusters
## 1 2 3
## 50 25 25
# append the cluster for each mentor
#mentor_table <- cbind(mentor_table,clusters)
fviz_cluster(list(data = mentor_table, cluster = clusters)
,main='Mentor clusters'
#,xlab='x label'
#,ylab='y label'
)
In an effort to explore other solutions we’ll now test a different clustering technique: K-means.
From “R in Action, Second Edition”:
- Selects K centroids (K rows chosen at random)
- Assigns each data point to its closest centroid
- Recalculates the centroids as the average of all data points in a cluster (i.e., the centroids are p-length mean vectors, where p is the number of variables)
- Assigns data points to their closest centroids
- Continues steps 3 and 4 until the observations are not reassigned or the maximum number of iterations (R uses 10 as a default) is reached
More information about K-means can be found here and here
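To make those steps concrete, here is a minimal, illustrative Lloyd-style K-means loop. It is a rough sketch (fixed iteration count, no empty-cluster handling, assumed seed), not the kmeans() function used elsewhere in this document.
#toy Lloyd-style K-means, for illustration only
toy_kmeans <- function(X, k, iters = 10, seed = 1) {
  set.seed(seed)                                       #assumed seed
  centers <- X[sample(nrow(X), k), , drop = FALSE]     #1. pick k random rows as centroids
  for (i in seq_len(iters)) {                          #5. repeat a fixed number of times
    d <- as.matrix(dist(rbind(centers, X)))[1:k, -(1:k)]   #centroid-to-point distances
    assignment <- apply(d, 2, which.min)               #2./4. nearest centroid per point
    centers <- t(sapply(seq_len(k), function(j)        #3. recompute centroids as cluster means
      colMeans(X[assignment == j, , drop = FALSE])))
  }
  list(cluster = assignment, centers = centers)
}
#e.g. toy_kmeans(mentee_table, k = 3)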
We can view how the mentees compare to each other using a distance plot (more info here or here). Knowing whether the mentees are similar to or different from one another helps us better understand the data. For example, if none of the mentees are alike then perhaps all of the mentees responded randomly.
#source
#1: https://uc-r.github.io/kmeans_clustering
distance <- get_dist(mentee_table)
fviz_dist(distance
, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07")
,show_labels = TRUE
)
Unlike the hierarchical clustering technique used to better understand the mentors, K-means requires that we specify the number of clusters (the “k”) at the outset. To investigate what an appropriate number of clusters would be, we can use the following visuals.
wssplot <- function(data, nc=15, seed=1234){ #default seed assumed for reproducibility
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i,iter.max=1000,nstart=25)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}
suppressMessages(wssplot(mentee_table))
nc <- suppressMessages(NbClust(mentee_table, min.nc=2, max.nc=15, method="kmeans"))
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 11 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
suppressMessages(table(nc$Best.n[1,]))
##
## 0 1 2 3 4 7 13 14
## 2 1 6 11 2 2 1 1
barplot(table(nc$Best.n[1,])
,xlab="Number of Clusters"
,ylab="Number of Criteria"
,main="Number of Clusters Chosen by 26 Criteria"
)
fit.km_student <- kmeans(mentee_table
,centers=3
,nstart=25
,algorithm="Lloyd"
,iter.max = 1000
)
#source
#1: https://stats.stackexchange.com/questions/31083/how-to-produce-a-pretty-plot-of-the-results-of-k-means-cluster-analysis
dissE <- daisy(mentee_table)
dE2 <- dissE^2
sk2 <- silhouette(fit.km_student$cl, dE2)
plot(sk2)
#source
#1: https://uc-r.github.io/kmeans_clustering
fviz_cluster(fit.km_student, data = mentee_table)
One final view: multiple plots
#source
#1: https://uc-r.github.io/kmeans_clustering
k3 <- kmeans(mentee_table,centers=3,nstart=25,iter.max=1000,algorithm="Lloyd")
k4 <- kmeans(mentee_table,centers=4,nstart=25,iter.max=1000,algorithm="Lloyd")
k5 <- kmeans(mentee_table,centers=5,nstart=25,iter.max=1000,algorithm="Lloyd")
k6 <- kmeans(mentee_table,centers=6,nstart=25,iter.max=1000,algorithm="Lloyd")
# plots to compare
p2 <- fviz_cluster(k3, geom = "point", data = mentee_table) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = mentee_table) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = mentee_table) + ggtitle("k = 5")
p5 <- fviz_cluster(k6, geom = "point", data = mentee_table) + ggtitle("k = 6")
grid.arrange(p2,p3,p4,p5, nrow = 2)