Finding duplicate rows

This document demonstrates a simple way to look for rows that are duplicated across data sets.

Loading the data

Suppose you have these two data sets:

# For this example, instead of reading in a CSV file, we'll read in data from the console.
data1 <- read.table(header = TRUE, text = '
 label value
     A     4
     A     2
     B     1
     B     3
')

data2 <- read.table(header = TRUE, text = '
 label value
     A     2
     B     3
     C     8
')

Print them out:

data1
#>   label value
#> 1     A     4
#> 2     A     2
#> 3     B     1
#> 4     B     3

data2
#>   label value
#> 1     A     2
#> 2     B     3
#> 3     C     8

Checking for duplicates

In order to check for duplicates, we’ll do the following:

1. Drop all duplicates within each data set. This is to make sure we don’t accidentally flag duplicates that occur within a data set (we’re looking for duplicates across data sets).
2. Combine all the data sets together into a new data set.
3. Check for duplicates in the combined data set.

# Drop duplicates within a data set
data1 <- unique(data1)
data2 <- unique(data2)

# Combine the data sets
data_all <- rbind(data1, data2)
data_all
#>   label value
#> 1     A     4
#> 2     A     2
#> 3     B     1
#> 4     B     3
#> 5     A     2
#> 6     B     3
#> 7     C     8

# Find the indices of any duplicated rows
dup_idx <- duplicated(data_all)

# Get the duplicated rows
dup_rows <- data_all[dup_idx, ]
dup_rows
#>   label value
#> 5     A     2
#> 6     B     3

We’ve found two duplicated rows across the two data sets. They can be printed in a nicer-looking table in this R Markdown document with the kable() function from the knitr package:

library(knitr)
kable(dup_rows, row.names = FALSE)

label	value
A	2
B	3

We could write it to a file with:

write.csv(dup_rows, file = "duprows.csv")
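
Note that dup_rows above holds only rows 5 and 6, because duplicated() flags the second and later occurrences of a row and leaves the first occurrence unmarked. If you also want to see the first copies, one possible approach is to scan from both ends using the fromLast argument. The following is a minimal sketch that rebuilds the combined data set by hand so it runs on its own; the names in_dup and all_copies are illustrative and not part of the original example.

```r
# Rebuild the combined data set from the example (rows of data1, then data2)
data_all <- data.frame(
  label = c("A", "A", "B", "B", "A", "B", "C"),
  value = c(4, 2, 1, 3, 2, 3, 8)
)

# duplicated() flags only the second and later occurrences of a row;
# scanning from the end as well (fromLast = TRUE) flags the first ones too
in_dup <- duplicated(data_all) | duplicated(data_all, fromLast = TRUE)

# Every row that has at least one identical copy elsewhere
all_copies <- data_all[in_dup, ]
all_copies
#>   label value
#> 2     A     2
#> 4     B     3
#> 5     A     2
#> 6     B     3
```

The row names (2, 4, 5, 6) show where each copy sits in the combined data set.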

A few notes:

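If there are several data sets and you want to find which of them each duplicate row appears in, that will require a little more work, such as recording where each row came from before combining.
If there are other columns that you want to ignore when looking for duplicate rows, that’s simple: pass only the columns you want to compare to duplicated(), then use the result to index the full data set.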
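
Both notes can be illustrated with one sketch. It assumes you are free to add a helper column to each data set: the column name source and the variables tagged, key_cols, and dup_sources are hypothetical names chosen for this illustration. The source column records where each row came from, and it is then left out of the comparison so it does not make otherwise identical rows look different.

```r
# Same example data as above, built directly so the sketch is self-contained
data1 <- data.frame(label = c("A", "A", "B", "B"), value = c(4, 2, 1, 3))
data2 <- data.frame(label = c("A", "B", "C"), value = c(2, 3, 8))

# Drop duplicates within each data set, then tag each row with its origin.
# 'source' is a hypothetical helper column, not part of the original data.
data1u <- unique(data1)
data2u <- unique(data2)
data1u$source <- "data1"
data2u$source <- "data2"
tagged <- rbind(data1u, data2u)

# Compare only on these columns; 'source' is ignored in the comparison
key_cols <- c("label", "value")
keys <- tagged[, key_cols, drop = FALSE]

# Flag every copy of a duplicated row, not just the later ones
in_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# Keep all columns, and sort so copies of the same row sit together
dup_sources <- tagged[in_dup, ]
dup_sources <- dup_sources[order(dup_sources$label, dup_sources$value), ]
dup_sources
#>   label value source
#> 2     A     2  data1
#> 5     A     2  data2
#> 4     B     3  data1
#> 6     B     3  data2
```

The same pattern should extend to more than two data sets by tagging each one before the rbind() call.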