In this document, I describe the process of matching the MSDS students to capstones. First I clean the raw data from the Qualtrics survey. Then I report on the distribution of rankings for each capstone. Finally I assign students to capstones, and report on the rankings that each student gave to their assigned capstone.
In this report, we require the following packages:
library(knitr)
library(summarytools)
library(lpSolve)
library(lubridate)
library(tidyverse)
library(plotly)
library(DT)
We wrote a survey for students to report their rankings of all of the capstones, and disseminated it using this link: https://virginia.az1.qualtrics.com/jfe/form/SV_4VNIwLR2rBXtCwm. The survey consists of three questions. First, students provide their UVA computing ID and full name. Then they rank all of the capstones from the one they most want to work on (1) to the one they least want to work on. The survey looks like this:
The desktop and mobile versions of the survey allow a student to drag the different capstones to positions higher and lower on this list. When they do so, numbers appear next to each capstone, with the capstone on top labeled 1. The capstones on top represent the students’ top preferences.
We downloaded the raw data from the Qualtrics website in CSV format and loaded it into R:
data <- read_csv("MSDS+Residential+Capstones+Fall+2023_September+26,+2023_10.09.csv")
## Rows: 65 Columns: 39
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (39): StartDate, EndDate, Status, IPAddress, Progress, Duration (in seco...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Because the raw data uses numeric codes for capstones, we also input the names of the capstones in the order they are recognized by the Qualtrics survey:
capnames <- c(
"Recognition of Human Trafficking with Data Science",
"Develop an Enterprise Architecture Assistant to Improve the Business Decision-making Process",
"Plan-EO: Mapping the bugs that make kids sick in tropical regions of the globe",
"Metadata-to-Concept: Semi-automatic extraction of entities from image metadata on Wikimedia Commons",
"Federal Health GenAI Quality of Care Explorer",
"An ML approach to optimizing project planning for the ALMA Observatory",
"Imputation of Missing Continuous Glucose Monitoring Sensor Data: Best Methods and Identification of Limits",
"Video analysis of parent-child interactions",
"Leveraging Natural Language Processing for Focused Ultrasound Research Taxonomy Development",
"Identifying Illegal AIS Tagged Fishing Nets",
"Deep Learning-based EEG Analysis for Sleep Apnea Detection",
"Comparative Analysis of the Graduate Engineering Schools and Determination of Metrics for Improved Rankings",
"ForHumanity Auditing AI System",
"Generative AI to Facilitate Legislative Knowledge Management",
"Prediction on Federal Purchasing Behaviors on Small Business",
"Inclusive & Bias-Free Curriculum Checklist on the Hand50 video content",
"Exploring predictability of winter California hydroclimate with deep learning",
"Developing a path to net zero carbon emissions in Virginia",
"Open-Source Tool for Analyzing US Government’s Technology Procurement",
"Expanding AI Audits to Include Instruments: Accountability, Measurements, and Data in Motion Capture Technology"
)
We need to remove the first two rows of metadata, and isolate the columns that refer to the students’ rankings, which all begin with the letter “Q”. We also save the time and date each set of responses was submitted to address the students who submitted more than one set of rankings.
data <- data[-c(1,2),] %>%
dplyr::select(RecordedDate, starts_with("Q"))
It is possible that some students submitted more than one set of rankings. For these students, we keep only the most recent rankings:
data <- data %>%
mutate(RecordedDate = ymd_hms(RecordedDate),
Q1 = str_to_lower(Q1)) %>%
group_by(Q1) %>%
slice(which.max(RecordedDate)) %>%
ungroup() %>%
select(-RecordedDate) %>%
filter(Q1 != "test")
The data at this point are coded as character. We convert every column to numeric class:
data <- data %>%
mutate(Q1 = as.factor(Q1),
Q3 = as.factor(Q3)) %>%
mutate_if(is.character,as.numeric)
colnames(data) <- c("student", "fullname", capnames)
if(!all(is.na(data[is.na(data)]))){
data[is.na(data)] <- length(capnames)
}
We save these data as a CSV:
write_csv(data, file="student_rankings.csv")
To better understand the distribution of students’ rankings for each capstone we create a data frame that places the capstones in the columns and orders these columns from the lowest to the highest average rank:
capstone.ranks <- data[,-c(1,2)]
capstone.ranks <- capstone.ranks[,order(colMeans(capstone.ranks))]
The following table lists the capstones from most popular, at the top, to least popular, on the bottom. For each capstone, the table lists the mean and standard deviation of the students’ rankings, as well as minimum, median, maximum, and interquartile range. The bar graph on the right is a histogram of the rankings: high bars to the left indicate a lot of high rankings and high bars to the right indicate a lot of low rankings:
dfSummary(capstone.ranks, plain.ascii = FALSE, style = "grid",
graph.magnif = 0.75, valid.col = FALSE,
tmp.img.dir = "/tmp", headings = FALSE)
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
|---|---|---|---|---|---|
| 1 | An ML approach to optimizing project planning for the ALMA
Observatory [numeric] |
Mean (sd) : 6.1 (5.1) min < med < max: 1 < 4 < 20 IQR (CV) : 5.8 (0.8) |
15 distinct values | 0 (0.0%) |
|
| 2 | Identifying Illegal AIS Tagged Fishing Nets [numeric] |
Mean (sd) : 6.8 (5.5) min < med < max: 1 < 4 < 20 IQR (CV) : 8.8 (0.8) |
17 distinct values | 0 (0.0%) |
|
| 3 | Plan-EO: Mapping the bugs that make kids sick in tropical regions of
the globe [numeric] |
Mean (sd) : 7 (4.4) min < med < max: 1 < 7 < 17 IQR (CV) : 7 (0.6) |
16 distinct values | 0 (0.0%) |
|
| 4 | Recognition of Human Trafficking with Data Science [numeric] |
Mean (sd) : 7.5 (5.1) min < med < max: 1 < 7 < 20 IQR (CV) : 6 (0.7) |
18 distinct values | 0 (0.0%) |
|
| 5 | Exploring predictability of winter California hydroclimate with deep
learning [numeric] |
Mean (sd) : 8.4 (5.4) min < med < max: 1 < 7.5 < 19 IQR (CV) : 6.8 (0.6) |
17 distinct values | 0 (0.0%) |
|
| 6 | Deep Learning-based EEG Analysis for Sleep Apnea Detection [numeric] |
Mean (sd) : 8.5 (5.9) min < med < max: 1 < 8 < 20 IQR (CV) : 10.8 (0.7) |
20 distinct values | 0 (0.0%) |
|
| 7 | Prediction on Federal Purchasing Behaviors on Small Business [numeric] |
Mean (sd) : 9.1 (5.7) min < med < max: 1 < 9 < 20 IQR (CV) : 10 (0.6) |
20 distinct values | 0 (0.0%) |
|
| 8 | Federal Health GenAI Quality of Care Explorer [numeric] |
Mean (sd) : 9.7 (6) min < med < max: 1 < 9.5 < 20 IQR (CV) : 10 (0.6) |
19 distinct values | 0 (0.0%) |
|
| 9 | Leveraging Natural Language Processing for Focused Ultrasound
Research Taxonomy Development [numeric] |
Mean (sd) : 10 (4.2) min < med < max: 1 < 10 < 18 IQR (CV) : 7.8 (0.4) |
18 distinct values | 0 (0.0%) |
|
| 10 | Developing a path to net zero carbon emissions in Virginia [numeric] |
Mean (sd) : 10.1 (6.2) min < med < max: 1 < 9 < 20 IQR (CV) : 11 (0.6) |
19 distinct values | 0 (0.0%) |
|
| 11 | Develop an Enterprise Architecture Assistant to Improve the Business
Decision-making Process [numeric] |
Mean (sd) : 10.3 (4.9) min < med < max: 1 < 10.5 < 20 IQR (CV) : 7 (0.5) |
19 distinct values | 0 (0.0%) |
|
| 12 | Generative AI to Facilitate Legislative Knowledge Management [numeric] |
Mean (sd) : 10.6 (6.7) min < med < max: 1 < 11.5 < 20 IQR (CV) : 11.8 (0.6) |
20 distinct values | 0 (0.0%) |
|
| 13 | Metadata-to-Concept: Semi-automatic extraction of entities from
image metadata on Wikimedia Commons [numeric] |
Mean (sd) : 10.7 (5) min < med < max: 1 < 11 < 20 IQR (CV) : 6 (0.5) |
17 distinct values | 0 (0.0%) |
|
| 14 | Imputation of Missing Continuous Glucose Monitoring Sensor Data:
Best Methods and Identification of Limits [numeric] |
Mean (sd) : 10.9 (5) min < med < max: 1 < 12 < 20 IQR (CV) : 7.8 (0.5) |
20 distinct values | 0 (0.0%) |
|
| 15 | Video analysis of parent-child interactions [numeric] |
Mean (sd) : 11.1 (4.8) min < med < max: 1 < 11.5 < 20 IQR (CV) : 6 (0.4) |
19 distinct values | 0 (0.0%) |
|
| 16 | Comparative Analysis of the Graduate Engineering Schools and
Determination of Metrics for Improved Rankings [numeric] |
Mean (sd) : 12.4 (5) min < med < max: 1 < 13 < 20 IQR (CV) : 8 (0.4) |
19 distinct values | 0 (0.0%) |
|
| 17 | ForHumanity Auditing AI System [numeric] |
Mean (sd) : 12.4 (5.2) min < med < max: 1 < 13.5 < 20 IQR (CV) : 7.8 (0.4) |
19 distinct values | 0 (0.0%) |
|
| 18 | Open-Source Tool for Analyzing US Government’s Technology
Procurement [numeric] |
Mean (sd) : 13.2 (5) min < med < max: 1 < 13 < 20 IQR (CV) : 8 (0.4) |
17 distinct values | 0 (0.0%) |
|
| 19 | Expanding AI Audits to Include Instruments: Accountability,
Measurements, and Data in Motion Capture Technology [numeric] |
Mean (sd) : 15.9 (5) min < med < max: 1 < 18 < 20 IQR (CV) : 5 (0.3) |
17 distinct values | 0 (0.0%) |
|
| 20 | Inclusive & Bias-Free Curriculum Checklist on the Hand50 video
content [numeric] |
Mean (sd) : 16.3 (3.8) min < med < max: 1 < 17 < 20 IQR (CV) : 3.8 (0.2) |
13 distinct values | 0 (0.0%) |
Next we count, for every capstone, the number of students who ranked the capstone as their first, second, third, fourth, and fifth choice:
capstone.ranks2 <- capstone.ranks %>%
gather(colnames(capstone.ranks), key="capstone", value="rank") %>%
group_by(capstone) %>%
dplyr::summarize(`Mean rank` = round(mean(rank),2),
`Ranked 1st` = sum(rank==1),
`Ranked 2nd` = sum(rank==2),
`Ranked 3rd` = sum(rank==3),
`Ranked 4th` = sum(rank==4),
`Ranked 5th` = sum(rank==5)) %>%
arrange(desc(`Ranked 1st`))
datatable(capstone.ranks2)
We can also use this data to get a sense of the correlations between capstones and whether there exists clusters of capstones which get interest from the same students. We build a Euclidean distance matrix between the capstones, and pass this distance matrix to a multidimensional scaling algorithm with two dimensions:
d <- dist(t(capstone.ranks))
fit <- cmdscale(d,eig=TRUE, k=2)
Next we plot the capstones in two-dimensional space.
capstone.ranks2 <- capstone.ranks2 %>%
mutate(x = fit$points[,1],
y = fit$points[,2])
g <- ggplot(capstone.ranks2, aes(capstone=capstone, x=x, y=y, color=`Mean rank`)) +
geom_point() +
xlab("Dimension 1") +
ylab("Dimension 2") +
ggtitle("A Map of Our Capstones") +
scale_color_gradient(low='red', high='blue')
ggplotly(g)
Sometimes, as a result of particular academic collaborations and
programs, a student in the MSDS program is obligated to be on a
particular capstone. We can hard-wire these assignments into the
matching algorithm by fixing the assignment variable at the coordinates
associated with that student and capstone to 1. The coordinates can be
extracted from capnames and data with the
following function:
hardwire.fun <- function(data, hardwire){
x <- which(data$student %in% hardwire$student)
y <- which(capnames %in% hardwire$capstone)
return(cbind(x,y))
}
To program these hardwired assignments, create a dataframe with two
columns: student for the student’s UVA computing ID, and
capstone (as exactly listed in the capnames list above) for
the capstone to which this student must be assigned. Create one row for
every students with a pre-determined assignment:
We use the hardwire.fun() function to convert these
assignment to numeric data for the matching algorithm to use below:
I define an \((N \times C)\) matrix \(R\), where \(N\) is the number of students, \(C\) is the number of capstones, and each element \(r_{nc}\) is the rank that student \(n\) has given to capstone \(c\). We define variables \(X_{nc}\), \(\forall n \in \{1,2,. . . ,N\}\) and \(\forall c \in \{1,2,. . . ,C\}\) that are equal to 1 if student \(n\) is assigned to capstone \(c\), and 0 otherwise.
We define an objective function \[ F = \sum_{n=1}^N \sum_{c=1}^C r_{nc}X_{nc}, \] that we minimize with respect to the variables \(X_{nc}\).
To state the problem less formally: we are trying to assign students to capstones in a way that minimizes the sum total of the ranks the students have given to the capstones they’ve been assigned to. If we are able to assign all \(N\) students to their most preferred capstone, then all of the students’ rankings are 1, and \(F = N\). If any students are assigned to a capstone other than their most preferred capstone, then \(F > N\). We are trying to choose the assignments \(X_{nc}\) such that \(F\) is as close as possible to \(N\) as it can be given the constraints we deal with, which are that
(\(L_s\)) Every student must be assigned to one, and only one, capstone, and
(\(L_c\)) Every capstone must have either zero, three, or four students.
The student-constraint \(L_s\) can be expressed with this equation: \[ L_s: \sum_{c=1}^C X_{nc} = 1. \] In other words, the sum of all assignments across capstones for a student must equal 1. The capstone-constraint \(L_c\) can be expressed as \[ L_c: \sum_{n=1}^N X_{nc} \in \{0,3,4\}, \] which means the sum of all assignments across students for a capstone must be either 0, 3, or 4.
sortinghat()I wrote a function as a wrapper for the functions in the
lpSolveAPI package to perform this optimization. It takes
as input data in which the rows represent students, the columns
represent capstones, and the cells contain rankings. The data cannot
include a column for student IDs.
sortinghat <- function(X, hardwire=NULL){
require(tidyverse)
require(lpSolveAPI)
N <- nrow(X)
C <- ncol(X)
# Build constraint matrix
data <- expand_grid(student = 1:N, capstone = 1:C)
for(n in 1:N){
data <- mutate(data, x = (student == n))
colnames(data)[ncol(data)] <- paste(c("student",n), collapse ="")
}
for(i in 1:C){
data <- mutate(data, x = (capstone == i))
colnames(data)[ncol(data)] <- paste(c("capstone",i), collapse ="")
}
data <- select(data, -student, -capstone)
data <- t(data)
sumcap <- matrix(0, N, C)
sumcap <- rbind(sumcap, -1 * diag(C))
data <- cbind(data, sumcap)
# Make an LP solve model
lpmodel <- make.lp(nrow(data), ncol(data))
for(i in 1:ncol(data)){
set.column(lpmodel, i, data[,i])
}
# Build objective function
set.objfn(lpmodel, obj = c(c(t(X)), rep(0, C)))
# Set constraints right-hand side
set.rhs(lpmodel, b = c(rep(1, N), rep(0, C)))
# Set constraint types
set.constr.type(lpmodel, types = rep("=", N+C))
# Set hardwired assignments
if(!is.null(hardwire)){
for(j in 1:nrow(hardwire)){
cons <- rep(0, (N+1)*C)
cons[(hardwire[j,1]-1)*C + hardwire[j,2]] <- 1
add.constraint(lpmodel, cons, type="=", 1)
}
}
# Set the sum variables as semi-continuous, bounded
set.semicont(lpmodel, columns = c((N*C + 1):(N*C + C)))
set.bounds(lpmodel,
lower = c(rep(0, N*C), rep(3,C)),
upper = c(rep(1, N*C), rep(4,C)))
# Solve the LP model
lp.control(lpmodel, sense = "min")
solve(lpmodel)
results <- matrix(get.variables(lpmodel)[1:(N*C)], N, C, byrow=TRUE)
return(results)
}
The data frame needs to place students in the rows and capstones in
the columns, which is how we cleaned the data. But we need to remove the
student name variable, which we save as a separate object, and we need
to coerce the data to matrix class. We pass the data to
sortinghat():
students <- data$fullname
matches <-sortinghat(as.matrix(data[,-c(1,2)]))
## Loading required package: lpSolveAPI
#matches <-sortinghat(as.matrix(data[,-c(1,2)]),
# hardwire = hardwire)
The matches are expressed in binary format. To make these results easier to use, we include the student names and collapse the data to one column for the matches.
results <- data.frame(fullname = students,
capstone = colnames(data[,-c(1,2)])[apply(matches, 1, which.max)],
stringsAsFactors = FALSE)
We merge these matches with the rankings so that we can see how highly each student ranked the capstone to which they’ve been assigned:
final.assign.df <- data %>%
gather(-student, -fullname, key = "capstone", value = "rank") %>%
right_join(results, by = c("capstone", "fullname")) %>%
select(student, fullname, capstone, rank) %>%
mutate(email = paste(student, "@virginia.edu", sep="")) %>%
select(fullname, email, capstone, rank)
The final data is as follows:
datatable(arrange(final.assign.df, capstone))
In general, students are very happy with their matches, as the average ranking across students for the capstones to which they’ve been assigned is 1.5. The worst ranking is 4. The overall distribution of the rankings is illustrated below:
g <- ggplot(final.assign.df, aes(x=rank)) +
geom_histogram(binwidth=1, col="red", fill="blue", alpha=.2) +
xlab("Students' ranking of their assigned capstone") +
ylab("Number of students") +
theme(legend.position = "none") +
scale_x_continuous(breaks=1:max(final.assign.df$rank)) +
geom_text(stat='count', aes(label=after_stat(count)), vjust=-.5)
g
Here is a list of the capstones we will assign students to, with the total enrollment by capstone:
datatable(final.assign.df %>%
group_by(capstone) %>%
summarize(`Number of students` = n()))
The following capstones were dropped:
dropcap <- capnames[!is.element(capnames, unique(final.assign.df$capstone))]
datatable(
data.frame(`Capstones we will not be assigning students to` = dropcap,
check.names=FALSE)
)
Finally, we save these matches in a separate CSV file.
write_csv(final.assign.df, file="capstone_assignments.csv")