This is a small example of matching treatment individuals as closely as possible to other individuals in a larger non-treatment group. This is used in sociological research as a statistical replacement for a paired experiment.
With more complicated data sets propensity score matching would be used. The nature of this data set, however, was amenable to finding control individuals based on similarity in Euclidean distance.
In this example there are students with several categorical variables and then two different scores on the same examination. Treatment students are matched exactly to a subset of all students based on the categorical variables, then from that subset the one individual with the smallest Euclidean distance for the two exam scores is chosen as the control for that treatment individual.
The client provided an Excel spreadsheet that was subsequently broken into two separate .csv files. As long as thes files are in the same folder as this code, this will read properly.
treatment <- read.csv("treatment.csv")
all <- read.csv("all.csv")
Here we generate a list with a separate data set for each individual in the treatment group. The data set has all of the individuals who match the target individual on school, grade, gender, and SES. Ethnicity was not included because it resulted in several individuals with no matches.
# Initialize output data
test <- list(rep(NA, nrow(treatment)))
test.length <- rep(NA, nrow(treatment))
# For each treatment individual
for(i in 1:nrow(treatment)){
# Find all matches...
test[[i]] <- merge(treatment[i,],
all,
# based on these variables
by.x = c("Base.School", "Grade", "Gender", "SES..Free...Reduced."),
by.y = c("Base.School", "Grade", "Gender", "SES..Free...Reduced.Lunch.")
)
# Create a binary variable that is 1 for treatment
test[[i]]$Group <- rep(0, nrow(test[[i]]))
test[[i]][test[[i]]$Student.ID == test[[i]]$Student..ID, "Group"] <- 1
# Subset of variables of interest.Change this if you change the matching variables above.
test[[i]] <- test[[i]][, c(9, 11:13)]
# Sort so the treatment individual is the first row, which helps later on
test[[i]] <- test[[i]][order(test[[i]]$Group, decreasing = TRUE),]
test.length[i] <- nrow(test[[i]])
}
This condition tests that every treatment individual has at least one possible match.
sum(test.length < 2)
## [1] 0
Given all exact matches for the school, grade, gender, and SES variables, this then finds the individual with the smallest Euclidean distance for the two test scores to the treatment individual.
# Initialize output data
ids <- data.frame(treatment = rep(NA, length(test)), control = rep(NA, length(test)))
for(i in 1:length(test)){
# Calculate Euclidean distance between all points
dist.mat <- dist(test[[i]],
method = "euclidean",
diag = FALSE,
upper = TRUE
)
# Convert to matrix for subsetting
my.column <- as.matrix(dist.mat)[-1, 1]
# Find smallest distance
sm.dist <- min(my.column)
# Get treatment ID
ids$treatment[i] <- test[[i]]$Student.ID[1]
# Get control ID by finding individual with smallest distance to treatment ind
ids$control[i] <- test[[i]]$Student.ID[which(sm.dist == as.matrix(dist.mat)[, 1])]
}
This is a list of the treatment individual IDs with the respective control IDs.
ids