This document demonstrates a simple way to look for rows that are duplicated across data sets.

Loading the data

Suppose you have these two data sets:

# For this example, instead of reading in a CSV file, we'll read in data from text in the script.
data1 <- read.table(header = TRUE, text = '
 label value
     A     4
     A     2
     B     1
     B     3
')

data2 <- read.table(header = TRUE, text = '
 label value
     A     2
     B     3
     C     8
')

Print them out:

data1
#>   label value
#> 1     A     4
#> 2     A     2
#> 3     B     1
#> 4     B     3

data2
#>   label value
#> 1     A     2
#> 2     B     3
#> 3     C     8

Checking for duplicates

In order to check for duplicates, we’ll do the following:

  1. Drop all duplicates within each data set. This is to make sure we don’t accidentally flag duplicates that occur within a data set (we’re looking for duplicates across data sets).
  2. Combine all the data sets together into a new data set.
  3. Check for duplicates in the combined data set.

# Drop duplicates within a data set
data1 <- unique(data1)
data2 <- unique(data2)

# Combine the data sets
data_all <- rbind(data1, data2)
data_all
#>   label value
#> 1     A     4
#> 2     A     2
#> 3     B     1
#> 4     B     3
#> 5     A     2
#> 6     B     3
#> 7     C     8

# Find the indices of any duplicated rows
dup_idx <- duplicated(data_all)

# Get the duplicated rows
dup_rows <- data_all[dup_idx, ]
dup_rows
#>   label value
#> 5     A     2
#> 6     B     3
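
Note that duplicated() flags only the second and later occurrences of a row, so dup_rows holds one copy of each duplicate and does not say which data sets it came from. As a rough sketch, assuming you want every copy along with its origin, you could tag the rows before combining them and use the fromLast argument of duplicated(). The source column and the names data1u, data2u, key_cols, and all_copies are made up for this illustration:

```r
# Rebuild the example data so this block runs on its own
data1 <- read.table(header = TRUE, text = '
 label value
     A     4
     A     2
     B     1
     B     3
')
data2 <- read.table(header = TRUE, text = '
 label value
     A     2
     B     3
     C     8
')

# Tag each row with the data set it came from (hypothetical "source" column)
data1u <- cbind(unique(data1), source = "data1")
data2u <- cbind(unique(data2), source = "data2")
data_all <- rbind(data1u, data2u)

# Compare only the original columns, so the source tag does not hide matches
key_cols <- c("label", "value")
is_dup <- duplicated(data_all[key_cols]) |
  duplicated(data_all[key_cols], fromLast = TRUE)

# Every copy of every duplicated row, with its source
all_copies <- data_all[is_dup, ]
all_copies
```

With the example data this keeps four rows: the A 2 and B 3 rows, once from each data set.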

We’ve found two duplicated rows across the two data sets. They can be printed in a nicer-looking table in this R Markdown document with the kable() function from the knitr package:

library(knitr)
kable(dup_rows, row.names = FALSE)

label   value
A           2
B           3

We could write it to a file with:

write.csv(dup_rows, file = "duprows.csv")
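
By default, write.csv() also writes the row names (here 5 and 6) as an unnamed first column. If you don't want them in the file, a minimal sketch, assuming a temporary file in place of "duprows.csv" and a hand-built copy of dup_rows so the block runs on its own, would be:

```r
# Stand-in for the dup_rows data frame found above
dup_rows <- data.frame(label = c("A", "B"), value = c(2L, 3L))

# Temporary path used only for this illustration
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE leaves the row names out of the file
write.csv(dup_rows, file = out_file, row.names = FALSE)

# Read the file back to confirm that only label and value were written
check <- read.csv(out_file)
names(check)
```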

A few notes: