Assigning students to capstone projects is challenging because we want every student to be interested in their assigned project from the start. Some projects, however, will appeal to many students, and others will appeal to very few. Our goal is to assign students to capstones they are interested in to the greatest extent possible while still ensuring that every capstone, even an unpopular one, has at least 3 students assigned to it.
I describe and demonstrate a method for measuring students’ preferences for working on different capstones, a way to collect the data, and an algorithm for mapping the set of students’ preferences to capstone assignments. This approach should allow the students to decide on their preferences more easily and report them accurately. The matching algorithm emphasizes forming groups of students with similar preferences across the entire set of projects, and it ensures that even the least popular projects have students who are interested in them. First I describe the data and provide a simulated example dataset; then I discuss and demonstrate the matching algorithm.
I propose that we ask the students to rate each project using a five-point Likert scale rating, a number from 1 to 5, where:

- 5 means “extremely interested in working on this project”,
- 4 means “very interested in working on this project”,
- 3 means “somewhat interested in working on this project”,
- 2 means “only a little interested in working on this project”,
- 1 means “not at all interested in working on this project”.
This system has advantages, and some disadvantages, relative to other ways of measuring students’ preferences, such as ranking all of the proposals. First, Likert scales are less cognitively demanding than ranking: students only have to decide which projects belong in each of the five bins. That is, students decide which projects deserve a 1 without having to rank-order the projects that get a 1. Second, Likert scales should be less susceptible to (though not entirely free of) strategic voting. Under a rank-order system, students have an incentive to understate their interest in less popular projects to maximize their chance of being assigned to a more popular one. While this incentive exists for Likert scales as well, there is less room for this kind of maneuvering. Third, Likert scales are absolute evaluations rather than relative ones: a student can give as many 5s as she wants, whereas under a rank-order system she can rank only one project first. That makes it easier to measure the similarity between two students across the entire set of projects, and we can leverage this information to construct capstone groups of students whose preferences are as similar to each other as possible. The disadvantage of Likert scales is that students cannot single out a top choice or order the projects within a Likert category. If we employ Likert scales, we are trying to maximize the number of students who are happy with their assignment rather than the number of students who are assigned to their top choice.
If we use Likert scale evaluations, we should organize the data into a format in which individual students comprise the rows, capstone projects comprise the columns, and the Likert ratings fill the cells. The data should look like this:
| student | project1 | project2 | project3 | project4 | project5 |
|---|---|---|---|---|---|
| al-Habib, Muhsin | 3 | 3 | 1 | 5 | 4 |
| Ho, Julia Isabel | 1 | 4 | 2 | 5 | 1 |
| Olguin, Alejandra | 1 | 3 | 1 | 4 | 2 |
| Mcphaul, Kayla | 1 | 2 | 2 | 4 | 2 |
| Mark, Mikala | 1 | 3 | 1 | 3 | 4 |
| al-Amara, Zuhaira | 1 | 4 | 3 | 5 | 1 |
| Lester, Caleb | 1 | 2 | 1 | 5 | 2 |
| Smith, Charlotte | 3 | 4 | 2 | 4 | 3 |
| Guerrero, Kimberly | 3 | 5 | 3 | 2 | 3 |
| Lang, Blain | 1 | 2 | 2 | 3 | 5 |
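Because every student rates every project on the same absolute scale, the similarity between two students can be computed directly from their rows. As a minimal sketch, here is the Euclidean distance between the first two students in the table above:

# Likert ratings for the first two students in the example table
al.habib <- c(3, 3, 1, 5, 4)
ho       <- c(1, 4, 2, 5, 1)
# Euclidean distance across all five projects;
# smaller values indicate more similar preferences
sqrt(sum((al.habib - ho)^2))  # sqrt(15), about 3.87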
I recommend that we use an online form, such as Google Forms, to collect the students’ ratings of each project and organize them directly into a dataset with this format.
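As a sketch of that step, assuming the form responses are exported to a CSV file (the file name and column order here are hypothetical), the data could be read into R like this:

# Hypothetical export of the form responses: one row per student,
# one column per project
data <- read.csv("capstone_ratings.csv", stringsAsFactors = FALSE)
colnames(data) <- c("student", paste("project", 1:(ncol(data) - 1), sep = ""))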
To demonstrate the matching algorithm, I generate data in the format described above. I make the data more realistic by allowing some projects to have higher average evaluations than others, and by correlating the responses across projects.
To construct these data I need the Matrix, MASS, randomNames, and summarytools packages:
library(Matrix)       # nearPD()
library(MASS)         # mvrnorm()
library(randomNames)  # randomNames()
library(summarytools) # dfSummary()
I start by setting the number of students (N) to 60, the number of capstone projects (P) to 20, and the maximum Likert category (J) to 5. These parameters are tunable.
N <- 60
P <- 20
J <- 5
Capstone projects, as they are presented to students early in the fall semester, vary in quality, and this quality accounts for the projects’ different levels of popularity. I take quality as P draws from the standard normal distribution:
project.quality <- rnorm(P)
I also generate a correlation matrix across projects. This matrix is symmetric, has 1s on the diagonal, and its off-diagonal elements contain the correlation between the project represented by the row and the project represented by the column. I build this matrix and draw the correlations from a uniform distribution between -0.7 and 0.7:
Sigma <- diag(P)
for(p in 1:(P-1)){
  for(q in (p+1):P){
    Sigma[p,q] <- runif(1, min=-.7, max=.7)
    Sigma[q,p] <- Sigma[p,q]
  }
}
The Sigma matrix, however, might not be positive-definite, which I require for simulating evaluations. The nearPD() function from the Matrix package finds the closest positive-definite symmetric matrix to Sigma:
Sigma <- nearPD(Sigma, doSym = TRUE)$mat
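As a quick sanity check (not part of the pipeline), the smallest eigenvalue of the adjusted matrix should now be non-negative:

# all eigenvalues should be (weakly) positive after nearPD()
min(eigen(as.matrix(Sigma), symmetric = TRUE, only.values = TRUE)$values)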
I next generate continuous evaluations for each student with regard to each project. I draw these evaluations from the multivariate normal distribution, where the mean evaluation of each project is given by the project.quality vector and the correlations between projects are given by Sigma. I draw the evaluations using the mvrnorm() function from the MASS package.
data <- mvrnorm(n=N, mu=project.quality, Sigma=Sigma)
The data matrix contains continuous variables, and we need these to be five-point Likert scales instead. I compute quantiles of all the values in data at 20-percent intervals and replace the lowest cutpoint with -Inf so that the minimum value falls in the first category. I then use these cutpoints to convert each variable into the categories 1 through 5. Because the cutpoints are fixed across variables, and the variables have different means, some variables will tend to receive higher categories than others.
cutpoints <- quantile(c(data), probs = seq(0, 1, by=.2))
cutpoints[1] <- -Inf
data <- apply(data, 2, FUN=function(x){
  cut(x, breaks = cutpoints, labels = FALSE, right = TRUE)
})
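To verify the conversion (just a check, not part of the data-generating process), every cell should now contain an integer from 1 to 5, and the projects should differ in their average ratings:

range(data)       # should be 1 and 5
colMeans(data)    # average Likert rating of each project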
Finally, I use the randomNames() function from the randomNames package to generate some realistic names to include in the data, and I rename the variables.
data <- data.frame(randomNames::randomNames(N), data,
stringsAsFactors = FALSE)
colnames(data) <- c("student", paste("project", 1:P, sep=""))
The simulated capstone-evaluation data is complete. Here is a summary table of the ratings, constructed with the dfSummary() function from the summarytools package.
dfSummary(data[,-1], plain.ascii = FALSE, style = "grid",
graph.magnif = 0.75, valid.col = FALSE,
tmp.img.dir = "/tmp", headings = FALSE)
| No | Variable | Stats / Values | Freq of rating 1 (% of valid) | Missing |
|---|---|---|---|---|
| 1 | project1 | Mean (sd) : 3 (1.3) | 10 (16.7%) | 0 |
| 2 | project2 | Mean (sd) : 3.4 (1.4) | 7 (11.7%) | 0 |
| 3 | project3 | Mean (sd) : 3.9 (1.1) | 3 (5.0%) | 0 |
| 4 | project4 | Mean (sd) : 2.4 (1.2) | 18 (30.0%) | 0 |
| 5 | project5 | Mean (sd) : 2.6 (1.3) | 16 (26.7%) | 0 |
| 6 | project6 | Mean (sd) : 4 (1.1) | 1 (1.7%) | 0 |
| 7 | project7 | Mean (sd) : 2.2 (1.1) | 21 (35.0%) | 0 |
| 8 | project8 | Mean (sd) : 3.2 (1.4) | 9 (15.0%) | 0 |
| 9 | project9 | Mean (sd) : 3.1 (1.2) | 5 (8.3%) | 0 |
| 10 | project10 | Mean (sd) : 2 (1) | 22 (36.7%) | 0 |
| 11 | project11 | Mean (sd) : 2.4 (1.3) | 19 (31.7%) | 0 |
| 12 | project12 | Mean (sd) : 2 (1.2) | 26 (43.3%) | 0 |
| 13 | project13 | Mean (sd) : 4 (1) | 1 (1.7%) | 0 |
| 14 | project14 | Mean (sd) : 4.4 (1) | 2 (3.3%) | 0 |
| 15 | project15 | Mean (sd) : 2.8 (1.5) | 14 (23.3%) | 0 |
| 16 | project16 | Mean (sd) : 4.2 (1) | 1 (1.7%) | 0 |
| 17 | project17 | Mean (sd) : 2.8 (1.3) | 15 (25.0%) | 0 |
| 18 | project18 | Mean (sd) : 3.4 (1.3) | 6 (10.0%) | 0 |
| 19 | project19 | Mean (sd) : 2.1 (1) | 19 (31.7%) | 0 |
| 20 | project20 | Mean (sd) : 2.1 (1.2) | 25 (41.7%) | 0 |
The algorithm proceeds according to the following steps. I list the steps here and provide the code to implement each step below.
1. We collect data in the format described above. We remove the variable containing students’ names and save it as a separate object.
2. We perform k-means clustering on the Likert-scale data, setting the number of clusters equal to the number of capstone projects. We extract the centroids (the value of each variable for a point at the center of each cluster). We name the matrix of centroid values C.
3. We match each centroid to a capstone project by taking these steps:
    a. We identify the capstone with the lowest average evaluation across centroids. (We start with the least popular capstone because we can achieve greater gains in overall satisfaction by selecting students who are willing to take on the projects that few others want. In contrast, the most popular capstones have many students who would be happy to do the project, so there is less of a cost if some of these students are assigned to different projects.)
    b. For this capstone, we find the centroid with the highest evaluation. We match this centroid to this capstone project.
    c. We replace the datapoints for the matched centroid and capstone with missing values in C (to remove them from subsequent calculations). We return to (a) and repeat until every centroid is matched to a capstone.
4. We match each centroid to three students by taking these steps:
    a. We calculate the Euclidean distance between each student and each centroid. (It is straightforward, however, to change the distance to city block, Mahalanobis, or any other distance measure.) We name this distance matrix D. Here, and in the following steps, students are identified only by their row numbers in D.
    b. We identify the centroid with the largest average distance across students.
    c. We identify the three students closest to this centroid, breaking ties randomly. We match these students to this centroid.
    d. We replace the rows for the matched students and the column for the matched centroid with missing values in D (to remove them from subsequent calculations). We return to (b) and repeat until every centroid is matched; at that point fewer than three students remain unmatched.
5. If the number of students is not divisible by 3, we will have either one or two students remaining unmatched:
    a. If one student remains, we assign her as a fourth member of the group whose centroid is closest to her.
    b. If two students remain, we assign the student listed first in the original data to her closest centroid. We assign the student listed second to her closest centroid as long as it is not the centroid the first student was just assigned to; if it is, we assign her to the centroid with the second-smallest distance.
6. We merge the capstone-centroid matches with the student-centroid matches, by centroid. We drop the centroid column, leaving data that match students to capstone projects. Finally, we replace the numerical codes for students with the students’ names that we extracted in step 1.
I construct helper functions to do parts of the work necessary to implement the matching algorithm. These functions take as input a dataset in the format described in the first part of this document, with the names variable removed.
The makeCentroids() function creates the centroids, as in step 2 of the matching algorithm.
makeCentroids <- function(data){
  # cluster students into as many groups as there are projects
  student.kmeans <- kmeans(data, centers=ncol(data), iter.max = 50)
  # return the matrix of cluster centers (one row per centroid)
  centroids <- student.kmeans$centers
  return(centroids)
}
The assignProjects() function matches the project with the lowest average evaluation across centroids to the centroid that evaluates it most highly, and breaks ties, as in step 3, parts a and b.
assignProjects <- function(centroids){
  # the project with the lowest average evaluation across centroids
  m <- which.min(colMeans(centroids, na.rm = TRUE))
  # the centroid(s) that rate this project most highly
  max.centroid <- which(centroids[,m] == max(centroids[,m], na.rm=TRUE))
  if(length(max.centroid) > 1){
    # break ties in favor of the tied centroid with the lowest average
    # evaluation across all projects
    d <- centroids[max.centroid,]
    g <- rowMeans(d, na.rm=TRUE)
    max.centroid <- max.centroid[which(g == min(g, na.rm=TRUE))]
    # any remaining ties are broken at random
    if(length(max.centroid) > 1){
      max.centroid <- max.centroid[sample(1:length(max.centroid), 1)]
    }
  }
  return(list(centroid = max.centroid, project = m))
}
The distStudents() function calculates Euclidean distance between each student and each centroid, as in step 4, part a.
distStudents <- function(data, centroids){
  # rows index students, columns index centroids
  groups <- matrix(NA, nrow(data), nrow(centroids))
  for(i in 1:nrow(data)){
    for(p in 1:nrow(centroids)){
      # Euclidean distance between student i and centroid p
      groups[i,p] <- sqrt(sum((data[i,] - centroids[p,])^2))
    }
  }
  return(groups)
}
The assignStudents() function matches each centroid to three students, as in step 4, parts b and c.
assignStudents <- function(groups){
  # the unmatched centroid that is, on average, farthest from the students
  avg.dist <- colMeans(groups, na.rm=TRUE)
  centroid <- which.max(avg.dist)
  # the three students closest to that centroid, breaking ties at random
  students <- which(rank(groups[,centroid], ties.method = "random") %in% 1:3)
  return(list(centroid = centroid, students = students))
}
The main function, capstoneMatch(), uses the helper functions to implement the entire matching algorithm. It takes as input a dataset in the format described in the first part of this document and requires the variable containing student names to be the first column. It outputs a dataset in which each row represents one capstone project and the students assigned to that project are listed in the columns.
capstoneMatch <- function(data){
  names <- data[,1]
  data <- data[,-1]
  # first, find the centroids
  centroids <- makeCentroids(data)
  # second, match the centroids to projects
  proj.data <- data.frame()
  cent <- centroids
  while(!all(is.na(cent))){
    proj <- assignProjects(cent)
    proj.data <- rbind(proj, proj.data)
    cent[proj$centroid,] <- NA
    cent[,proj$project] <- NA
  }
  rownames(proj.data) <- NULL
  # third, match the centroids to groups of 3 students each
  student.data <- data.frame()
  group.set <- distStudents(data, centroids)
  groups <- group.set
  # keep going while more than two students remain unmatched
  while(length(na.omit(rowMeans(groups, na.rm=TRUE))) > 2){
    stud <- assignStudents(groups)
    stud.row <- c(stud$centroid, stud$students)
    student.data <- rbind(student.data, stud.row)
    groups[stud$students,] <- NA
    groups[,stud$centroid] <- NA
  }
  names(student.data) <- c("centroid", "student1", "student2", "student3")
  # if necessary, assign excess students as a fourth member of a group:
  # if there's 1 extra, assign her to her closest centroid;
  # if there are 2 extra, assign both to their closest centroids, but not the same one
  assigned.students <- c(student.data$student1,
                         student.data$student2,
                         student.data$student3)
  extra <- which(!(1:nrow(data) %in% assigned.students))
  if(length(extra)!=0){
    extra.group <- group.set[extra,]
    student.data$student4 <- NA
  }
  if(length(extra)==1){
    # place the student in the row whose centroid is her closest centroid
    extra.assign <- which.min(extra.group)
    student.data$student4[student.data$centroid == extra.assign] <- extra
  }
  if(length(extra)==2){
    extra.assign <- which.min(extra.group[1,])
    student.data$student4[student.data$centroid == extra.assign] <- extra[1]
    # block that centroid for the second student
    extra.group[2,extra.assign] <- NA
    extra.assign <- which.min(extra.group[2,])
    student.data$student4[student.data$centroid == extra.assign] <- extra[2]
  }
  # then write the students' names instead of numbers
  student.data$student1 <- names[student.data$student1]
  student.data$student2 <- names[student.data$student2]
  student.data$student3 <- names[student.data$student3]
  if(length(extra)!=0) student.data$student4 <- names[student.data$student4]
  # fourth, create data with project and 3 students. Output
  output <- merge(proj.data, student.data, by="centroid")
  output <- output[-1]
  return(output)
}
A matching can be considered “good” if students are happy in general with their assignments. But that assessment can be tricky: if a capstone is unpopular in general, the students who are assigned to it might be unhappy despite the fact that they favored that capstone more strongly than other students. A matching needs to be assessed in relative terms: how much more strongly does an assigned group rate a capstone than the average across students?
To see the quality of a matching, I wrote a function evalMatch() that takes the original data and the matches provided by capstoneMatch() and provides three sets of output:
1. A simple frequency table of the evaluations that students have given the projects to which they’ve been assigned. Good matchings minimize low ratings and maximize high ratings.
2. A table that lists, for every capstone, the mean rating of that project for the students assigned to it, the mean rating across all students, and the difference between these means.
3. The average of the differences reported in (2).
The function is as follows:
evalMatch <- function(data, output){
  require(tidyverse)
  # reshape the original ratings to long format: one row per student-project pair
  data2 <- data
  colnames(data2) <- 1:ncol(data2) - 1
  data2 <- data2 %>%
    gather(-"0", key = "project", value = "evaluation") %>%
    rename(name = "0") %>%
    mutate(project = as.numeric(project))
  # reshape the matches to long format: one row per assigned student
  output2 <- output %>%
    gather(-"project", key = "studentnumber", value = "name") %>%
    dplyr::select(-studentnumber)
  # attach each assigned student's rating of her assigned project
  eval <- left_join(output2, data2, by=c("name", "project")) %>%
    na.omit()
  # (1) frequency table of assigned students' ratings
  res1 <- table(eval$evaluation)
  # (2) mean rating per project: assigned group vs. all students
  data3 <- data2 %>%
    group_by(project) %>%
    summarise(overall.mean.eval = mean(evaluation))
  res2 <- eval %>%
    group_by(project) %>%
    summarise(assigned.mean.eval = mean(evaluation)) %>%
    full_join(data3, by="project") %>%
    mutate(difference = assigned.mean.eval - overall.mean.eval)
  # (3) average of the differences
  res3 <- mean(res2$difference)
  return(list(`Students evaluations of assigned projects` = res1,
              `Average evaluation of each project overall and for the group` = res2,
              `Average difference between overall and group evaluations` = res3))
}
Once the data have been collected and stored in the correct format, as they have been in the data object, all I have to do to generate the matches is pass this dataset to capstoneMatch().
match <- capstoneMatch(data)
The matches are as follows:
knitr::kable(match)
| project | student1 | student2 | student3 |
|---|---|---|---|
| 17 | Gonzalez, Jessica | Stevens, Benton | al-Ayub, Muammar |
| 12 | Vialpando, Cody | Moquino, Francine | al-Jamil, Muna |
| 1 | el-Pour, Shabaan | Smith, Lindzey | al-Tabatabai, Abdul Jaleel |
| 6 | al-Lone, Sabaaha | Recinos, Luis | Kerr, Alicia |
| 2 | Giles, Tavaris | Ali, Marlyssa | al-Attar, Iffat |
| 4 | Simpson, Phaezon | Theurer, Rebekka | Guillory, Amanda |
| 13 | Kell, Tayler | Parran, Ngirchoeang | Gadison, Elizabeth |
| 18 | Yohe-Ironwing, Janice | al-Amir, Khaalida | Hong, Manya |
| 9 | Prado, Christopher | al-Jafari, Nu’ma | Vigil, Cayla |
| 15 | Murphy, Michael | Chavez, Larissa | el-Younes, Zuhra |
| 10 | al-Sharif, Raamiz | el-Pour, Jaad | Turner, Alonzo |
| 14 | Rodriguez, Israel | Nixon, Jared | Bean, Andrew |
| 7 | Martinez, Victoria | Johnson, Alex | Eddington, Justin |
| 5 | Rambat, Michael | Cheng, Parnia | Nolte, Mariah |
| 16 | Desersa, Rhiannon | Alexander, Jason | Mccarthy, Austin |
| 19 | Cruz, Briana | Holiday, Taylor | Mathe, Christopher |
| 3 | el-Hallal, Khadeeja | Parlin-Three Sti, Brittanyclaire | al-Halim, Rashaad |
| 11 | Martinez, Destiny | Vigil, Idalia | Carruthers, Dameion |
| 8 | al-Rehmann, Mishaari | Hampton, Harshwinder | al-Sultana, Irfaan |
| 20 | Dominguez, Miryea | Parker, Thalir | Delgado Calderon, Savannah |
But how good are these matches? A quick way to find out is to use the evalMatch() function.
evalMatch(data, match)
## $`Students evaluations of assigned projects`
##
## 1 2 3 4 5
## 2 5 8 17 28
##
## $`Average evaluation of each project overall and for the group`
## # A tibble: 20 x 4
## project assigned.mean.eval overall.mean.eval difference
## <dbl> <dbl> <dbl> <dbl>
## 1 1 3.33 2.98 0.35
## 2 2 4 3.4 0.6
## 3 3 4.33 3.93 0.400
## 4 4 4.33 2.37 1.97
## 5 5 2.33 2.6 -0.267
## 6 6 5 3.97 1.03
## 7 7 3.33 2.22 1.12
## 8 8 4.67 3.17 1.5
## 9 9 4.67 3.1 1.57
## 10 10 3.67 2.02 1.65
## 11 11 4.33 2.42 1.92
## 12 12 3.33 2.05 1.28
## 13 13 4 4.03 -0.0333
## 14 14 5 4.4 0.600
## 15 15 5 2.83 2.17
## 16 16 5 4.23 0.767
## 17 17 4 2.77 1.23
## 18 18 4.67 3.35 1.32
## 19 19 3 2.08 0.917
## 20 20 3.33 2.08 1.25
##
## $`Average difference between overall and group evaluations`
## [1] 1.066667
Of the 60 students, we were able to place 28 with a project they rated as 5, and another 17 with a project they rated as 4. Two students got stuck with projects they rated 1. That happened because several of the capstone projects had very low ratings overall: project 10 was rated only 2.02 out of 5, on average, and projects 12, 19, and 20 were rated 2.05, 2.08, and 2.08, respectively. While the algorithm does its best to avoid assigning students to projects they are not at all interested in, doing so may be unavoidable when students rate many projects poorly.
In the table, every capstone is listed along with its average rating among the students assigned to the project and its average rating across all students. Project 13 ended up with a group evaluation of 4, while the cohort average was slightly higher at 4.03. Project 5 ended up with a group evaluation of 2.33, noticeably less than the cohort average of 2.6. Every other capstone, however, ended up with students who rated the project more highly than average. In some cases the improvement is dramatic: project 15 was rated only 2.83 overall, but was rated 5 by all three group members.
Overall, the matching algorithm improved the assigned groups’ average evaluation of their projects by 1.07 categories on the 5-point Likert scale, relative to the cohort average, which is what we would expect from a randomly selected group.
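To make the "relative to a randomly selected group" interpretation concrete, one quick check (a sketch, not part of the method itself) is to shuffle the student names across the matched groups and re-run evalMatch(); the average difference for such a random assignment should be close to zero.

# Shuffle students across the existing groups to approximate a random assignment
set.seed(2019)  # arbitrary seed, for reproducibility of the check
random.match <- match
random.match[, c("student1", "student2", "student3")] <-
  matrix(sample(data$student), nrow = nrow(random.match), ncol = 3)
# the third element of the output is the average difference,
# which should be near 0 for a random assignment
evalMatch(data, random.match)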