Recursively counting unique values in columns

Solution

First let’s make some data

set.seed(100) # Fix the randomization

data <- data.frame(c1 = sample(letters, size = 10, replace = T),
              c2 = sample(letters, size = 10, replace = T),
              c3 = sample(letters, size = 10, replace = T),
              stringsAsFactors = F)

Here’s how the data look

library(knitr)
kable(data, format = "markdown")

c1	c2	c3
i	q	n
g	w	s
o	h	n
b	k	t
m	t	k
m	r	e
v	f	u
j	j	w
o	j	o
e	r	h

Ok, now the goal is to come up with a vector of of length 3, where the first element is the number of unique values in c1, the second number is the number of unique values in c2 that did not occur in c1, and the third number is the number of unique values in c3 that did not occur in c1 or c2. We’ll call this vector final.counts.

We’ll loop over the columns of data to get final.counts. For each iteration of the loop, we’ll count the number of valid (new) unique values in the column i that did not occur in previous columns. For column 1, this is simple (just count the number of unique values in the column).

For columns > 1, we’ll do this in three steps:

Step 1: Create a vector of candidate unique values in column i [candidate.unique.vals]
Step 2: Determine the old unique values occuring in all previous columns 1 through (i - 1) [old.unique.vals]
Step 3: Remove the old unique values (from Step 2) from the candiate unique values (from Step 1) [new.unique.vals]

# Set up
n.columns <- ncol(data)
final.counts <- rep(NA, n.columns)
unique.vals.ls <- vector("list", n.columns)  # Create a list which will contain the new unique values

# Loop over columns
for(i in 1:n.columns) {
  
  data.vec <- data[,i]  # Get the data for column i

  if(i == 1) { # If column is 1, just get the unique values in column 1
    
    new.unique.vals <- unique(data.vec) # Get the unique values in column 1
    
  }
  
  
  if(i > 1) { # If column is greater than 1...
   
  # Step 1: Get the candidate unique values in column i 
    candidate.unique.vals <- unique(data.vec)
    
  # Step 2: Get old unique values (from previous columns) to exclude  
    old.unique.vals <- unique(unlist(data[,1:(i - 1)]))
    
  # Step 3: Remove old unique values from candidate unique values with set.diff
    new.unique.vals <- setdiff(candidate.unique.vals, old.unique.vals)
    
  }
  
  # Finished! Now we just need to count the number of unique values and write the result to final.counts
  
  n.new.unique <- length(new.unique.vals) # Count the number of unique values
  final.counts[i] <- n.new.unique # Write count to final.counts
  unique.vals.ls[[i]] <- sort(new.unique.vals) # Write new unique values to unique.vals.ls

}

Here’s the final result:

final.counts

## [1] 8 7 3

This means that in column 1, there were 8 unique values. In column 2, there were 7 unique values that did not occur in column 1. Finally, in column 3, there were 3 unique values that did not occur in columns 1 or 2.

We can also see the specific unique values found in each column in unique.vals.ls

unique.vals.ls

## [[1]]
## [1] "b" "e" "g" "i" "j" "m" "o" "v"
## 
## [[2]]
## [1] "f" "h" "k" "q" "r" "t" "w"
## 
## [[3]]
## [1] "n" "s" "u"

Success!

Recursively counting unique values in columns

Nathaniel Phillips

1 June 2016

Problem

Solution