Problem

I have 12 lists of names arranged in columns. Let’s say the columns are N_1-12. The names are deduped, and therefore unique within columns, but not necessarily across columns, i.e. names in but names For N_3 could be in N_1 or N-2. I have to get a count of unique names that did not occur previously. so for N_1 the number will be the number of unique names in N_1. For N_2 the number is the unique names in N_2 that doesn’t appear in N_1. The number in N_3 is the count of unique names that don’t appear in N_1 or N_2.

Solution

First let’s make some data

set.seed(100) # Fix the randomization

data <- data.frame(c1 = sample(letters, size = 10, replace = T),
              c2 = sample(letters, size = 10, replace = T),
              c3 = sample(letters, size = 10, replace = T),
              stringsAsFactors = F)

Here’s how the data look

library(knitr)
kable(data, format = "markdown")
c1 c2 c3
i q n
g w s
o h n
b k t
m t k
m r e
v f u
j j w
o j o
e r h

Ok, now the goal is to come up with a vector of of length 3, where the first element is the number of unique values in c1, the second number is the number of unique values in c2 that did not occur in c1, and the third number is the number of unique values in c3 that did not occur in c1 or c2. We’ll call this vector final.counts.

We’ll loop over the columns of data to get final.counts. For each iteration of the loop, we’ll count the number of valid (new) unique values in the column i that did not occur in previous columns. For column 1, this is simple (just count the number of unique values in the column).

For columns > 1, we’ll do this in three steps:

# Set up
n.columns <- ncol(data)
final.counts <- rep(NA, n.columns)
unique.vals.ls <- vector("list", n.columns)  # Create a list which will contain the new unique values

# Loop over columns
for(i in 1:n.columns) {
  
  data.vec <- data[,i]  # Get the data for column i

  if(i == 1) { # If column is 1, just get the unique values in column 1
    
    new.unique.vals <- unique(data.vec) # Get the unique values in column 1
    
  }
  
  
  if(i > 1) { # If column is greater than 1...
   
  # Step 1: Get the candidate unique values in column i 
    candidate.unique.vals <- unique(data.vec)
    
  # Step 2: Get old unique values (from previous columns) to exclude  
    old.unique.vals <- unique(unlist(data[,1:(i - 1)]))
    
  # Step 3: Remove old unique values from candidate unique values with set.diff
    new.unique.vals <- setdiff(candidate.unique.vals, old.unique.vals)
    
  }
  
  # Finished! Now we just need to count the number of unique values and write the result to final.counts
  
  n.new.unique <- length(new.unique.vals) # Count the number of unique values
  final.counts[i] <- n.new.unique # Write count to final.counts
  unique.vals.ls[[i]] <- sort(new.unique.vals) # Write new unique values to unique.vals.ls

}

Here’s the final result:

final.counts
## [1] 8 7 3

This means that in column 1, there were 8 unique values. In column 2, there were 7 unique values that did not occur in column 1. Finally, in column 3, there were 3 unique values that did not occur in columns 1 or 2.

We can also see the specific unique values found in each column in unique.vals.ls

unique.vals.ls
## [[1]]
## [1] "b" "e" "g" "i" "j" "m" "o" "v"
## 
## [[2]]
## [1] "f" "h" "k" "q" "r" "t" "w"
## 
## [[3]]
## [1] "n" "s" "u"

Success!