Identifying Duplicate Entries

Prologue

  • Often times we have an identification (ID) column in a data frame where we would like most, if not all, entries or elements to be unique.
  • These ID’s may correspond to participants from survey, biological samples from experiment etc.
  • Here, we will learn to check for, retrieve, and remove selected duplicate entries from data frames
  • Let’s first create a data frame for us to use in this session. The DUP entries represent our duplicate ID’s.
# set seed for reproducibility
set.seed(888)

# Define vectors
col1 <- sample(toupper(letters)[1:10], size=10, replace=FALSE)
col2 <- rnorm(n=10, mean=1, sd=2)

# Create data frame
df <- data.frame("ID"=col1, "var"=col2, stringsAsFactors=FALSE)

# Insert duplicate entries
df[sample(dim(df)[1], size=2, replace=FALSE), 1] = "DUP"

df
##     ID          var
## 1    A  0.497953244
## 2  DUP -3.332907941
## 3    J  2.173720708
## 4    E  2.261233903
## 5    G  3.207179121
## 6    H  1.634802218
## 7    B  3.711467970
## 8    F  2.912318154
## 9    C  1.811114410
## 10 DUP -0.001118704

Check for duplicates

# Tabulate ID's
tab <- table(df$ID)
tab
## 
##   A   B   C DUP   E   F   G   H   J 
##   1   1   1   2   1   1   1   1   1
# Convert tabulation to data frame
tab_df <- data.frame(tab)
tab_df
##   Var1 Freq
## 1    A    1
## 2    B    1
## 3    C    1
## 4  DUP    2
## 5    E    1
## 6    F    1
## 7    G    1
## 8    H    1
## 9    J    1
  • Notice that the frequency (Freq column) for each unique entry is computed after converting the tabulation to data frame format.
  • Next we tabulate the frequency
tab_freq <- table(tab_df$Freq)
tab_freq
## 
## 1 2 
## 8 1
  • We can see that 8 entries appear once, while 1 entry appear twice.
  • Therefore, there is one duplicate entry in the ID column.

Retrieving duplicate entries

  • Next step is to retrieve the rows with the duplicate entries for further scrutiny.
# Retrieve duplicate ID using the subset() function
dup <- subset(tab_df, subset=(tab_df$Freq > 1), )
dup
##   Var1 Freq
## 4  DUP    2
df_dup <- df[df$ID %in% dup$Var1, ]
df_dup
##     ID          var
## 2  DUP -3.332907941
## 10 DUP -0.001118704

Removing selected duplicate entries

  • After scrutising the duplicate entries, you may decide to choose one or several out of all the duplicates to represent the duplicates.
  • Say, we want to select just the first duplicate entry in this case.
# Identify row numbers of duplicate ID's in the orginal data frame
# The collapse="|" argument is useful when you have >1 ID's that have duplicates. Here, we only have 1 ID that has duplicates
index_dup <- grep(paste(dup$Var1, sep="", collapse="|"), df$ID)
index_dup
## [1]  2 10
# Select the row number to remove. In this case we would like to remove the first duplicate ID
# We do this using logicals where TRUE represents the entry to remove and FALSE otherwise
index_rm <- index_dup[c(FALSE, TRUE)]
index_rm
## [1] 10
# Remove selected entry using subsetting
df_new <- df[-index_rm, ]
df_new
##    ID        var
## 1   A  0.4979532
## 2 DUP -3.3329079
## 3   J  2.1737207
## 4   E  2.2612339
## 5   G  3.2071791
## 6   H  1.6348022
## 7   B  3.7114680
## 8   F  2.9123182
## 9   C  1.8111144