This document demonstrates a simple way to look for rows that are duplicated across data sets.
Suppose you have these two data sets:
# For this example, instead of reading in a CSV file, we'll read the data from inline text.
data1 <- read.table(header = TRUE, text = '
label value
A 4
A 2
B 1
B 3
')
data2 <- read.table(header = TRUE, text = '
label value
A 2
B 3
C 8
')
Print them out:
data1
#> label value
#> 1 A 4
#> 2 A 2
#> 3 B 1
#> 4 B 3
data2
#> label value
#> 1 A 2
#> 2 B 3
#> 3 C 8
In order to check for duplicates, we’ll do the following:
# Drop duplicates within each data set, so only rows repeated across the two sets remain
data1 <- unique(data1)
data2 <- unique(data2)
# Combine the data sets
data_all <- rbind(data1, data2)
data_all
#> label value
#> 1 A 4
#> 2 A 2
#> 3 B 1
#> 4 B 3
#> 5 A 2
#> 6 B 3
#> 7 C 8
# Flag every row that repeats an earlier row (duplicated() returns a logical vector)
dup_idx <- duplicated(data_all)
# Get the duplicated rows
dup_rows <- data_all[dup_idx, ]
dup_rows
#> label value
#> 5 A 2
#> 6 B 3
We’ve found two rows that appear in both data sets. They can be printed in a nicer-looking table in this R Markdown document with the kable() function from the knitr package:
library(knitr)
kable(dup_rows, row.names = FALSE)
| label | value |
|---|---|
| A | 2 |
| B | 3 |
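The unique() step earlier matters more than it may look. The following is a minimal sketch, using hypothetical data frames d1 and d2 that are not part of the example above, of what happens when a row is repeated inside one data set and the step is skipped:

```r
# Hypothetical data: d1 contains the row "A 2" twice; d2 shares no rows with d1.
d1 <- data.frame(label = c("A", "A", "B"), value = c(2, 2, 1))
d2 <- data.frame(label = "C", value = 8)

# Without unique(), the repeat inside d1 is reported as a duplicate,
# even though the two data sets have nothing in common.
naive <- rbind(d1, d2)
naive[duplicated(naive), ]
#>   label value
#> 2     A     2

# With unique() applied first, no rows are reported.
clean <- rbind(unique(d1), unique(d2))
nrow(clean[duplicated(clean), ])
#> [1] 0
```

Because each data set is de-duplicated before the rbind(), any row that duplicated() flags afterwards must occur in both data sets.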
We could write the duplicated rows to a file with:
write.csv(dup_rows, file = "duprows.csv", row.names = FALSE)
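By default, write.csv() also writes the row names (here 5 and 6, the positions in the combined data) as an unnamed first column. The sketch below shows the difference by reading the file back; the data frame is rebuilt by hand and the temporary file path is an assumption made so the snippet runs on its own:

```r
# Same rows as dup_rows in the example above, rebuilt so this runs standalone.
dup_rows <- data.frame(label = c("A", "B"), value = c(2, 3))
rownames(dup_rows) <- c("5", "6")

# Temporary file, used here instead of "duprows.csv" to avoid touching the working directory.
out_file <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an extra, unnamed column,
# which read.csv() then labels "X".
write.csv(dup_rows, file = out_file)
names(read.csv(out_file))
#> [1] "X"     "label" "value"

# With row.names = FALSE, only the data columns are written.
write.csv(dup_rows, file = out_file, row.names = FALSE)
names(read.csv(out_file))
#> [1] "label" "value"
```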
A few notes: