Matching the 2020 Summer MSDS Students to Capstones

Introduction

In this document, I describe the process of matching the MSDS students to capstones. First I clean the raw data from the Qualtrics survey. Then I report on the distribution of rankings for each capstone. Finally I assign students to capstones, and report on the rankings that each student gave to their assigned capstone.

In this report, we require the following packages:

library(knitr)
library(summarytools)
library(lpSolve)
library(lubridate)
library(tidyverse)

The Qualtrics Survey

I wrote a survey for students to report their rankings of 7 capstones, and disseminated it using this link: https://virginia.az1.qualtrics.com/jfe/form/SV_9BNkWQU5NwR2ZJb. The survey consists of two questions. First, students provide their name. Second, they rank all 7 capstones from the one they most want to work on (1) to the one they least want to work on (7). The survey looks like this:

The desktop and mobile versions of the survey allow a student to drag the different capstones to positions higher and lower on this list. When they do so, numbers appear next to each capstone, with the capstone on top labeled 1. The capstones on top represent the students’ top preferences.

Cleaning the Raw Qualtrics Data

I downloaded the raw data from the Qualtrics website in CSV format and loaded it into R:

data <- read_csv("Online+MSDS+Capstones,+take+2_June+1,+2020_09.31.csv")

## Parsed with column specification:
## cols(
##   .default = col_character()
## )

## See spec(...) for full column specifications.

Because the raw data uses numeric codes for capstones, we also input the names of the capstones in the order they are recognized by the Qualtrics survey:

capnames <- c("Using Twitter to Analyze Regional Attitudes Towards COVID Over Time",
              "Collecting and Analyzing Candidate Speeches During the 2020 Presidential Election",
              "Forecasting the Outcome of the 2020 Presidential Election",
              "The Effect of Vote-By-Mail Policies on the 2020 Presidential Election Results",
              "Examining the Impact of Tax Policy and Politics on Economic Performance",
              "A Medical Capstone on a Topic of the Group's Choice",
              "Using Open-Source Satellite Data to Address an Interesting Question of the Group's Choice")

We need to remove the first two rows of metadata and isolate the columns that refer to the students’ rankings, which all begin with the letter “Q”. We also keep the date and time at which each set of responses was submitted, so that we can handle students who submitted more than one set of rankings.

data <- data[-c(1,2),] %>%
  dplyr::select(RecordedDate, starts_with("Q"))

It is possible that some students submitted more than one set of rankings. For these students, we keep only the most recent rankings:

data <- data %>%
  mutate(RecordedDate = ymd_hms(RecordedDate)) %>%
  group_by(Q1) %>%
  slice(which.max(RecordedDate)) %>%
  ungroup() %>%
  select(-RecordedDate)
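As a small illustration of this deduplication rule, the sketch below keeps only the latest submission per student in a made-up data frame. The student names, timestamps, and the ranking column `Q2_1` are hypothetical, and the sketch uses base R instead of the dplyr pipeline above, so it shows the idea rather than the code used for this report.

```r
# Hypothetical responses: "Student A" submitted twice
toy <- data.frame(
  RecordedDate = as.POSIXct(c("2020-05-30 10:00:00",
                              "2020-05-31 09:00:00",
                              "2020-05-30 12:00:00"), tz = "UTC"),
  Q1   = c("Student A", "Student A", "Student B"),
  Q2_1 = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Sort by submission time, then keep the last row seen for each student
toy <- toy[order(toy$RecordedDate), ]
latest <- toy[!duplicated(toy$Q1, fromLast = TRUE), ]
latest
```

In this toy case, Student A's earlier response (with `Q2_1` equal to 3) is dropped and the later one (with `Q2_1` equal to 1) is kept.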

The data at this point are coded as character. We convert the student names (Q1) to a factor and every ranking column to numeric class, then label the columns with the capstone names:

data <- data %>%
  mutate(Q1 = as.factor(Q1)) %>%
  mutate_if(is.character,as.numeric) 
colnames(data) <- c("student", capnames)
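Because the matching step assumes that each student ranked every capstone exactly once, it may be worth checking the cleaned data for incomplete or repeated ranks. The following is a minimal sketch of such a check on a made-up data frame; the object `toy` and its column names are hypothetical, and this check was not part of the original workflow.

```r
# Hypothetical cleaned rankings: 3 students, 4 capstones
toy <- data.frame(student = c("A", "B", "C"),
                  cap1 = c(1, 2, 4),
                  cap2 = c(2, 1, 3),
                  cap3 = c(3, 4, 2),
                  cap4 = c(4, 3, 2),   # student C used rank 2 twice
                  stringsAsFactors = FALSE)

ranks <- as.matrix(toy[, -1])
C <- ncol(ranks)

# A row is valid if it has no missing values and, once sorted, reads 1, 2, ..., C
is_valid <- apply(ranks, 1, function(r) !anyNA(r) && all(sort(r) == seq_len(C)))

# Students whose rankings need a second look
toy$student[!is_valid]
```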

We save these data as a CSV:

write_csv(data, "student_rankings2020.csv")

Understanding the Rankings

To better understand the distribution of students’ rankings for each capstone we create a data frame that places the capstones in the columns and orders these columns from the lowest to the highest average rank:

capstone.ranks <- data[,-1]
capstone.ranks <- capstone.ranks[,order(colMeans(capstone.ranks))]

The following table lists the capstones from most popular, at the top, to least popular, on the bottom. For each capstone, the table lists the mean and standard deviation of the students’ rankings, as well as minimum, median, maximum, and interquartile range. The bar graph on the right is a histogram of the rankings: high bars to the left indicate a lot of high rankings (1, 2, etc.) and high bars to the right indicate a lot of low rankings (6, 7, etc.).

dfSummary(capstone.ranks, plain.ascii = FALSE, style = "grid", 
          graph.magnif = 0.75, valid.col = FALSE, 
          tmp.img.dir = "/tmp", headings = FALSE)

No	Variable	Stats / Values	Freqs (% of Valid)
1	Using Open-Source Satellite Data to Address an Interesting Question of the Group’s Choice [numeric]	Mean (sd) : 2.9 (1.4) min < med < max: 1 < 3 < 6 IQR (CV) : 2 (0.5)	1 : 2 (13.3%) 2 : 5 (33.3%) 3 : 3 (20.0%) 4 : 3 (20.0%) 5 : 1 ( 6.7%) 6 : 1 ( 6.7%)
2	Forecasting the Outcome of the 2020 Presidential Election [numeric]	Mean (sd) : 3.6 (1.9) min < med < max: 1 < 3 < 7 IQR (CV) : 3 (0.5)	1 : 2 (13.3%) 2 : 3 (20.0%) 3 : 3 (20.0%) 4 : 2 (13.3%) 5 : 3 (20.0%) 7 : 2 (13.3%)
3	Collecting and Analyzing Candidate Speeches During the 2020 Presidential Election [numeric]	Mean (sd) : 3.8 (2) min < med < max: 1 < 4 < 7 IQR (CV) : 3.5 (0.5)	1 : 3 (20.0%) 2 : 2 (13.3%) 3 : 1 ( 6.7%) 4 : 3 (20.0%) 5 : 2 (13.3%) 6 : 3 (20.0%) 7 : 1 ( 6.7%)
4	A Medical Capstone on a Topic of the Group’s Choice [numeric]	Mean (sd) : 3.9 (2.1) min < med < max: 1 < 4 < 7 IQR (CV) : 2.5 (0.5)	1 : 3 (20.0%) 2 : 1 ( 6.7%) 3 : 3 (20.0%) 4 : 1 ( 6.7%) 5 : 4 (26.7%) 6 : 1 ( 6.7%) 7 : 2 (13.3%)
5	Using Twitter to Analyze Regional Attitudes Towards COVID Over Time [numeric]	Mean (sd) : 4 (2.1) min < med < max: 1 < 4 < 7 IQR (CV) : 3.5 (0.5)	1 : 3 (20.0%) 2 : 1 ( 6.7%) 3 : 3 (20.0%) 4 : 1 ( 6.7%) 5 : 1 ( 6.7%) 6 : 5 (33.3%) 7 : 1 ( 6.7%)
6	The Effect of Vote-By-Mail Policies on the 2020 Presidential Election Results [numeric]	Mean (sd) : 4.6 (2.2) min < med < max: 1 < 5 < 7 IQR (CV) : 3.5 (0.5)	1 : 2 (13.3%) 2 : 1 ( 6.7%) 3 : 2 (13.3%) 4 : 1 ( 6.7%) 5 : 3 (20.0%) 6 : 2 (13.3%) 7 : 4 (26.7%)
7	Examining the Impact of Tax Policy and Politics on Economic Performance [numeric]	Mean (sd) : 5.2 (1.8) min < med < max: 2 < 6 < 7 IQR (CV) : 3 (0.3)	2 : 2 (13.3%) 4 : 4 (26.7%) 5 : 1 ( 6.7%) 6 : 3 (20.0%) 7 : 5 (33.3%)

Next we count, for every capstone, the number of students who ranked the capstone as their first, second, third, through last choice:

capstone.ranks2 <- capstone.ranks %>%
  gather(colnames(capstone.ranks), key="capstone", value="rank") %>%
  group_by(capstone) %>%
  dplyr::summarize(`Ranked 1st` = sum(rank==1),
            `Ranked 2nd` = sum(rank==2),
            `Ranked 3rd` = sum(rank==3),
            `Ranked 4th` = sum(rank==4),
            `Ranked 5th` = sum(rank==5),
            `Ranked 6th` = sum(rank==6),
            `Ranked 7th` = sum(rank==7)) %>%
  arrange(desc(`Ranked 1st`))
kable(capstone.ranks2)

capstone	Ranked 1st	Ranked 2nd	Ranked 3rd	Ranked 4th	Ranked 5th	Ranked 6th	Ranked 7th
A Medical Capstone on a Topic of the Group’s Choice	3	1	3	1	4	1	2
Collecting and Analyzing Candidate Speeches During the 2020 Presidential Election	3	2	1	3	2	3	1
Using Twitter to Analyze Regional Attitudes Towards COVID Over Time	3	1	3	1	1	5	1
Forecasting the Outcome of the 2020 Presidential Election	2	3	3	2	3	0	2
The Effect of Vote-By-Mail Policies on the 2020 Presidential Election Results	2	1	2	1	3	2	4
Using Open-Source Satellite Data to Address an Interesting Question of the Group’s Choice	2	5	3	3	1	1	0
Examining the Impact of Tax Policy and Politics on Economic Performance	0	2	0	4	1	3	5

We can also use these data to get a sense of the correlations between capstones and of whether there exist clusters of capstones that attract interest from the same students. We build a Euclidean distance matrix between the capstones, and pass this distance matrix to a multidimensional scaling algorithm with two dimensions:

d <- dist(t(capstone.ranks)) 
fit <- cmdscale(d,eig=TRUE, k=2)

Next we plot the capstones in two-dimensional space.

x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, 
     xlab="Dimension 1 (Political - Medical?)", 
     ylab="Dimension 2",
     main="A Map of Our Capstones", 
     xlim = c(-10, 15))
text(x, y, labels = row.names(t(capstone.ranks)), cex=.7, pos=3)
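To make the role of the transpose clearer: `dist()` measures distances between rows, so the rankings are transposed to make the capstones, rather than the students, the units being compared. The sketch below runs the same two calls on a made-up ranking matrix; the values and the names `capA`, `capB`, and `capC` are hypothetical.

```r
# Hypothetical rankings: 4 students (rows) by 3 capstones (columns)
toy <- cbind(capA = c(1, 1, 2, 3),
             capB = c(2, 2, 1, 1),
             capC = c(3, 3, 3, 2))

# dist() works on rows, so transpose to get capstone-by-capstone distances
d_toy <- dist(t(toy))
d_toy

# Classical MDS places the 3 capstones in 2 dimensions
fit_toy <- cmdscale(d_toy, eig = TRUE, k = 2)
fit_toy$points
```

Capstones that the same students rank similarly end up with small distances and are drawn close together on the map.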

The Method for Matching Students to Capstones

I define an \((N \times C)\) matrix \(R\), where \(N\) is the number of students, \(C\) is the number of capstones, and each element \(r_{nc}\) is the rank that student \(n\) has given to capstone \(c\). We define variables \(X_{nc}\), \(\forall n \in \{1,2,\ldots,N\}\) and \(\forall c \in \{1,2,\ldots,C\}\) that are equal to 1 if student \(n\) is assigned to capstone \(c\), and 0 otherwise.

We define an objective function \[ F = \sum_{n=1}^N \sum_{c=1}^C r_{nc}X_{nc}, \] that we minimize with respect to the variables \(X_{nc}\).

To state the problem less formally: we are trying to assign students to capstones in a way that minimizes the sum total of the ranks the students have given to the capstones they’ve been assigned to. If we are able to assign all \(N\) students to their most preferred capstone, then all of the students’ rankings are 1, and \(F = N\). If any students are assigned to a capstone other than their most preferred capstone, then \(F > N\). We are trying to choose the assignments \(X_{nc}\) so that \(F\) is as close to \(N\) as possible, subject to two constraints:

(\(L_s\)) Every student must be assigned to one, and only one, capstone, and
(\(L_c\)) Every capstone must have either zero, three, or four students.

The student constraint \(L_s\) can be expressed with this equation: \[ L_s: \sum_{c=1}^C X_{nc} = 1 \quad \forall n \in \{1,2,\ldots,N\}. \] In other words, the sum of all assignments across capstones for each student must equal 1. The capstone constraint \(L_c\) can be expressed as \[ L_c: \sum_{n=1}^N X_{nc} \in \{0,3,4\} \quad \forall c \in \{1,2,\ldots,C\}, \] which means the sum of all assignments across students for each capstone must be either 0, 3, or 4.
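As a worked illustration of the objective and the two constraints, the sketch below evaluates \(F\), \(L_s\), and \(L_c\) for one candidate assignment. The matrices `R` and `X` are made up (6 students, 3 capstones) and only mirror the notation above.

```r
# Hypothetical rankings R: 6 students (rows) by 3 capstones (columns)
R <- rbind(c(1, 2, 3),
           c(1, 3, 2),
           c(2, 1, 3),
           c(1, 2, 3),
           c(3, 1, 2),
           c(2, 1, 3))

# Candidate assignment X: students 1, 2, 4 to capstone 1; students 3, 5, 6 to capstone 2
X <- rbind(c(1, 0, 0),
           c(1, 0, 0),
           c(0, 1, 0),
           c(1, 0, 0),
           c(0, 1, 0),
           c(0, 1, 0))

# Objective: total of the ranks students gave to their assigned capstones
F_value <- sum(R * X)

# L_s: every student is assigned to exactly one capstone
Ls_ok <- all(rowSums(X) == 1)

# L_c: every capstone has zero, three, or four students
Lc_ok <- all(colSums(X) %in% c(0, 3, 4))

c(F = F_value, N = nrow(R))
```

Here every student receives their first choice, so \(F = N = 6\), and the third capstone simply goes unfilled, which \(L_c\) allows.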

`sortinghat()`

I wrote a function as a wrapper around functions from the lpSolveAPI package (make.lp(), set.objfn(), solve(), and others) to perform this optimization. It takes as input data in which the rows represent students, the columns represent capstones, and the cells contain rankings. The data cannot include a column for student IDs.

sortinghat <- function(X){
  
  require(tidyverse)
  require(lpSolveAPI)
  
  N <- nrow(X)
  C <- ncol(X)
  
  # Build constraint matrix
  data <- expand_grid(student = 1:N, capstone = 1:C)
  for(n in 1:N){
    data <- mutate(data, x = (student == n))
    colnames(data)[ncol(data)] <- paste(c("student",n), collapse ="")
  }
  for(i in 1:C){
    data <- mutate(data, x = (capstone == i))
    colnames(data)[ncol(data)] <- paste(c("capstone",i), collapse ="")
  }
  data <- select(data, -student, -capstone)
  data <- t(data)
  sumcap <- matrix(0, N, C)
  sumcap <- rbind(sumcap, -1 * diag(C))
  data <- cbind(data, sumcap)
  
  # Make an LP solve model
  lpmodel <- make.lp(nrow(data), ncol(data))
  for(i in 1:ncol(data)){
    set.column(lpmodel, i, data[,i])
  }
  
  # Build objective function 
  set.objfn(lpmodel, obj = c(c(t(X)), rep(0, C)))
  
  # Set constraints right-hand side
  set.rhs(lpmodel, b = c(rep(1, N), rep(0, C)))
  
  # Set constraint types
  set.constr.type(lpmodel, types = rep("=", N+C))
  
  # Set the sum variables as semi-continuous, bounded
  set.semicont(lpmodel, columns = c((N*C + 1):(N*C + C)))
  set.bounds(lpmodel, 
             lower = c(rep(0, N*C), rep(3,C)), 
             upper = c(rep(1, N*C), rep(4,C)))
  
  # Solve the LP model
  lp.control(lpmodel, sense = "min") 
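  status <- solve(lpmodel)
  # solve() returns 0 when an optimal solution has been found
  if(status != 0) stop("No optimal solution found (lpSolveAPI status ", status, ")")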
  results <- matrix(get.variables(lpmodel)[1:(N*C)], N, C, byrow=TRUE)
  return(results)
}
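Reading the code above, the capstone constraint \(L_c\) appears to be encoded with one auxiliary variable per capstone, which I will call \(s_c\) here (the name is mine and does not appear in the code). Under that reading, the model passed to lpSolveAPI is roughly \[ \min \sum_{n=1}^N \sum_{c=1}^C r_{nc}X_{nc} \quad \text{subject to} \quad \sum_{c=1}^C X_{nc} = 1 \;\; \forall n, \qquad \sum_{n=1}^N X_{nc} - s_c = 0 \;\; \forall c, \] with \(0 \le X_{nc} \le 1\) and each \(s_c\) declared semi-continuous with bounds 3 and 4, so that either \(s_c = 0\) or \(3 \le s_c \le 4\). The first \(N\) rows of the constraint matrix correspond to \(L_s\), and the last \(C\) rows tie each \(s_c\) to the number of students in capstone \(c\). Note that the code bounds the \(X_{nc}\) between 0 and 1 rather than declaring them binary, so it is worth confirming that the returned matrix contains only zeros and ones.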

The data frame needs to place students in the rows and capstones in the columns, which is how we cleaned the data. But we need to remove the student name variable, which we save as a separate object, and we need to coerce the data to matrix class. We pass the data to sortinghat():

students <- data$student
matches <- sortinghat(as.matrix(data[,-1]))

## Loading required package: lpSolveAPI
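Before using the solver output, one might confirm that it really is a valid assignment. The sketch below is one way to do that; the helper `check_matches()` and the matrix `toy_matches` are hypothetical and stand in for the `matches` object returned by sortinghat().

```r
# Hypothetical solver output for 6 students and 3 capstones
toy_matches <- rbind(c(1, 0, 0),
                     c(1, 0, 0),
                     c(0, 1, 0),
                     c(1, 0, 0),
                     c(0, 1, 0),
                     c(0, 1, 0))

# Hypothetical helper: checks an assignment matrix, allowing for
# small floating-point error in the solver's output
check_matches <- function(m, tol = 1e-8) {
  c(binary      = all(abs(m - round(m)) < tol) && all(round(m) %in% c(0, 1)),
    one_each    = all(abs(rowSums(m) - 1) < tol),
    valid_sizes = all(round(colSums(m)) %in% c(0, 3, 4)))
}

check_matches(toy_matches)

# A fractional solution would be flagged by the "binary" check
fractional <- toy_matches
fractional[1, 1:2] <- 0.5
check_matches(fractional)
```

If the `binary` check fails on real output, the assignment for that student is ambiguous and the model would need the assignment variables to be declared as binary.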

The matches are expressed in binary format. To make these results easier to use, we include the student names and collapse the data to one column for the matches.

results <- data.frame(student = students, 
                      capstone = colnames(data[,-1])[apply(matches, 1, which.max)],
                      stringsAsFactors = FALSE)
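To show how the collapse from binary columns to a single capstone label works, here is a small sketch on made-up inputs. The matrix `toy_matches` and the capstone names in `toy_caps` are hypothetical.

```r
# Hypothetical binary assignment: 3 students (rows) by 3 capstones (columns)
toy_matches <- rbind(c(0, 1, 0),
                     c(1, 0, 0),
                     c(0, 0, 1))
toy_caps <- c("Capstone A", "Capstone B", "Capstone C")   # hypothetical names

# which.max() returns the position of the 1 in each row,
# which is then used to look up the capstone name
assigned <- toy_caps[apply(toy_matches, 1, which.max)]
assigned
# "Capstone B" "Capstone A" "Capstone C"
```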

How Happy are the Students with These Assignments?

We merge these matches with the rankings so that we can see how highly each student ranked the capstone to which they’ve been assigned:

final.assign.df <- data %>%
  gather(-student, key = "capstone", value = "rank") %>%
  right_join(results, by = c("capstone", "student")) %>%
  select(student, capstone, rank)

The final data is as follows:

kable(arrange(final.assign.df[,-3], capstone))

student	capstone
Brooke Williams1	A Medical Capstone on a Topic of the Group’s Choice
Cory Yemen	A Medical Capstone on a Topic of the Group’s Choice
Kevin Lennon	A Medical Capstone on a Topic of the Group’s Choice
Kevin Finity	Collecting and Analyzing Candidate Speeches During the 2020 Presidential Election
Maxwell McGaw	Collecting and Analyzing Candidate Speeches During the 2020 Presidential Election
Ramit Garg	Collecting and Analyzing Candidate Speeches During the 2020 Presidential Election
Ben Rogers	Forecasting the Outcome of the 2020 Presidential Election
Chad Sopata	Forecasting the Outcome of the 2020 Presidential Election
Matt Thomas	Forecasting the Outcome of the 2020 Presidential Election
Jordan Bales	Using Open-Source Satellite Data to Address an Interesting Question of the Group’s Choice
Liam Mulcahy	Using Open-Source Satellite Data to Address an Interesting Question of the Group’s Choice
Will Carruthers	Using Open-Source Satellite Data to Address an Interesting Question of the Group’s Choice
Cullen Baker	Using Twitter to Analyze Regional Attitudes Towards COVID Over Time
Jae Hyun Lee	Using Twitter to Analyze Regional Attitudes Towards COVID Over Time
Jason Lwin	Using Twitter to Analyze Regional Attitudes Towards COVID Over Time

In general, students are very happy with their matches, as the average ranking across students for the capstones to which they’ve been assigned is 1.2. The worst ranking is 2. The overall distribution of the rankings is illustrated below:

g <- ggplot(final.assign.df, aes(x=rank)) +
        geom_histogram(binwidth=1, col="red", fill="blue", alpha=.2) +
        xlab("Students' ranking of their assigned capstone") +
        ylab("Number of students") +
        theme(legend.position = "none") +
        scale_x_continuous(breaks=1:max(final.assign.df$rank)) +
        geom_text(stat='count', aes(label=after_stat(count)), vjust=-.5)
g
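The summary figures quoted above (the average and the worst rank) can be computed directly from the `rank` column. The sketch below shows the calls on a made-up data frame; `toy_assign` and its values are hypothetical and are not the students' actual results.

```r
# Hypothetical assigned ranks for five students
toy_assign <- data.frame(student = c("A", "B", "C", "D", "E"),
                         rank    = c(1, 1, 2, 1, 3),
                         stringsAsFactors = FALSE)

avg_rank   <- mean(toy_assign$rank)   # average rank of the assigned capstones
worst_rank <- max(toy_assign$rank)    # worst rank any student received
rank_table <- table(toy_assign$rank)  # number of students at each rank

avg_rank
worst_rank
rank_table
```

Applied to the real data, the same calls would use `final.assign.df$rank` in place of `toy_assign$rank`.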

table(final.assign.df$capstone)

## 
##                                       A Medical Capstone on a Topic of the Group's Choice 
##                                                                                         3 
##         Collecting and Analyzing Candidate Speeches During the 2020 Presidential Election 
##                                                                                         3 
##                                 Forecasting the Outcome of the 2020 Presidential Election 
##                                                                                         3 
## Using Open-Source Satellite Data to Address an Interesting Question of the Group's Choice 
##                                                                                         3 
##                       Using Twitter to Analyze Regional Attitudes Towards COVID Over Time 
##                                                                                         3

Finally, we save these matches in a separate CSV file.

write_csv(final.assign.df, "capstone_assignments.csv")