R Notebook

This is a small example of matching treatment individuals as closely as possible to other individuals in a larger non-treatment group. This is used in sociological research as a statistical replacement for a paired experiment.

With more complicated data sets propensity score matching would be used. The nature of this data set, however, was amenable to finding control individuals based on similarity in Euclidean distance.

In this example there are students with several categorical variables and then two different scores on the same examination. Treatment students are matched exactly to a subset of all students based on the categorical variables, then from that subset the one individual with the smallest Euclidean distance for the two exam scores is chosen as the control for that treatment individual.

Read in the data

The client provided an Excel spreadsheet that was subsequently broken into two separate .csv files. As long as thes files are in the same folder as this code, this will read properly.

treatment <- read.csv("treatment.csv")
all <- read.csv("all.csv")

Manipulate data

Here we generate a list with a separate data set for each individual in the treatment group. The data set has all of the individuals who match the target individual on school, grade, gender, and SES. Ethnicity was not included because it resulted in several individuals with no matches.

# Initialize output data
test <- list(rep(NA, nrow(treatment)))
test.length <- rep(NA, nrow(treatment))

# For each treatment individual
for(i in 1:nrow(treatment)){ 
  # Find all matches...
  test[[i]] <- merge(treatment[i,],
    all,
    # based on these variables
    by.x = c("Base.School", "Grade", "Gender", "SES..Free...Reduced."),
    by.y = c("Base.School", "Grade", "Gender", "SES..Free...Reduced.Lunch.")
  )
  # Create a binary variable that is 1 for treatment
  test[[i]]$Group <- rep(0, nrow(test[[i]]))
  test[[i]][test[[i]]$Student.ID == test[[i]]$Student..ID, "Group"] <- 1
  # Subset of variables of interest.Change this if you change the matching variables above.
  test[[i]] <- test[[i]][, c(9, 11:13)]
  # Sort so the treatment individual is the first row, which helps later on
  test[[i]] <- test[[i]][order(test[[i]]$Group, decreasing = TRUE),]
  test.length[i] <- nrow(test[[i]])
}

This condition tests that every treatment individual has at least one possible match.

sum(test.length < 2)

## [1] 0

Find matches

Given all exact matches for the school, grade, gender, and SES variables, this then finds the individual with the smallest Euclidean distance for the two test scores to the treatment individual.

# Initialize output data
ids <- data.frame(treatment = rep(NA, length(test)), control = rep(NA, length(test)))

for(i in 1:length(test)){
  # Calculate Euclidean distance between all points
  dist.mat <- dist(test[[i]], 
    method = "euclidean", 
    diag = FALSE, 
    upper = TRUE
    )
  # Convert to matrix for subsetting
  my.column <- as.matrix(dist.mat)[-1, 1]
  # Find smallest distance
  sm.dist <- min(my.column)
  # Get treatment ID
  ids$treatment[i] <- test[[i]]$Student.ID[1]
  # Get control ID by finding individual with smallest distance to treatment ind
  ids$control[i] <- test[[i]]$Student.ID[which(sm.dist == as.matrix(dist.mat)[, 1])]
}

Results

This is a list of the treatment individual IDs with the respective control IDs.

ids