Identifying Duplicate Entries
Prologue
- Often times we have an identification (ID) column in a data frame where we would like most, if not all, entries or elements to be unique.
- These ID’s may correspond to participants from survey, biological samples from experiment etc.
- Here, we will learn to check for, retrieve, and remove selected duplicate entries from data frames
- Let’s first create a data frame for us to use in this session. The DUP entries represent our duplicate ID’s.
# set seed for reproducibility
set.seed(888)
# Define vectors
col1 <- sample(toupper(letters)[1:10], size=10, replace=FALSE)
col2 <- rnorm(n=10, mean=1, sd=2)
# Create data frame
df <- data.frame("ID"=col1, "var"=col2, stringsAsFactors=FALSE)
# Insert duplicate entries
df[sample(dim(df)[1], size=2, replace=FALSE), 1] = "DUP"
df
## ID var
## 1 A 0.497953244
## 2 DUP -3.332907941
## 3 J 2.173720708
## 4 E 2.261233903
## 5 G 3.207179121
## 6 H 1.634802218
## 7 B 3.711467970
## 8 F 2.912318154
## 9 C 1.811114410
## 10 DUP -0.001118704
Check for duplicates
# Tabulate ID's
tab <- table(df$ID)
tab
##
## A B C DUP E F G H J
## 1 1 1 2 1 1 1 1 1
# Convert tabulation to data frame
tab_df <- data.frame(tab)
tab_df
## Var1 Freq
## 1 A 1
## 2 B 1
## 3 C 1
## 4 DUP 2
## 5 E 1
## 6 F 1
## 7 G 1
## 8 H 1
## 9 J 1
- Notice that the frequency (Freq column) for each unique entry is computed after converting the tabulation to data frame format.
- Next we tabulate the frequency
tab_freq <- table(tab_df$Freq)
tab_freq
##
## 1 2
## 8 1
- We can see that 8 entries appear once, while 1 entry appear twice.
- Therefore, there is one duplicate entry in the ID column.
Retrieving duplicate entries
- Next step is to retrieve the rows with the duplicate entries for further scrutiny.
# Retrieve duplicate ID using the subset() function
dup <- subset(tab_df, subset=(tab_df$Freq > 1), )
dup
## Var1 Freq
## 4 DUP 2
df_dup <- df[df$ID %in% dup$Var1, ]
df_dup
## ID var
## 2 DUP -3.332907941
## 10 DUP -0.001118704
Removing selected duplicate entries
- After scrutising the duplicate entries, you may decide to choose one or several out of all the duplicates to represent the duplicates.
- Say, we want to select just the first duplicate entry in this case.
# Identify row numbers of duplicate ID's in the orginal data frame
# The collapse="|" argument is useful when you have >1 ID's that have duplicates. Here, we only have 1 ID that has duplicates
index_dup <- grep(paste(dup$Var1, sep="", collapse="|"), df$ID)
index_dup
## [1] 2 10
# Select the row number to remove. In this case we would like to remove the first duplicate ID
# We do this using logicals where TRUE represents the entry to remove and FALSE otherwise
index_rm <- index_dup[c(FALSE, TRUE)]
index_rm
## [1] 10
# Remove selected entry using subsetting
df_new <- df[-index_rm, ]
df_new
## ID var
## 1 A 0.4979532
## 2 DUP -3.3329079
## 3 J 2.1737207
## 4 E 2.2612339
## 5 G 3.2071791
## 6 H 1.6348022
## 7 B 3.7114680
## 8 F 2.9123182
## 9 C 1.8111144