Identifying Duplicate Entries

Prologue

Often times we have an identification (ID) column in a data frame where we would like most, if not all, entries or elements to be unique.
These ID’s may correspond to participants from survey, biological samples from experiment etc.
Here, we will learn to check for, retrieve, and remove selected duplicate entries from data frames
Let’s first create a data frame for us to use in this session. The DUP entries represent our duplicate ID’s.

# set seed for reproducibility
set.seed(888)

# Define vectors
col1 <- sample(toupper(letters)[1:10], size=10, replace=FALSE)
col2 <- rnorm(n=10, mean=1, sd=2)

# Create data frame
df <- data.frame("ID"=col1, "var"=col2, stringsAsFactors=FALSE)

# Insert duplicate entries
df[sample(dim(df)[1], size=2, replace=FALSE), 1] = "DUP"

df

##     ID          var
## 1    A  0.497953244
## 2  DUP -3.332907941
## 3    J  2.173720708
## 4    E  2.261233903
## 5    G  3.207179121
## 6    H  1.634802218
## 7    B  3.711467970
## 8    F  2.912318154
## 9    C  1.811114410
## 10 DUP -0.001118704

Check for duplicates

# Tabulate ID's
tab <- table(df$ID)
tab

## 
##   A   B   C DUP   E   F   G   H   J 
##   1   1   1   2   1   1   1   1   1

# Convert tabulation to data frame
tab_df <- data.frame(tab)
tab_df

##   Var1 Freq
## 1    A    1
## 2    B    1
## 3    C    1
## 4  DUP    2
## 5    E    1
## 6    F    1
## 7    G    1
## 8    H    1
## 9    J    1

Notice that the frequency (Freq column) for each unique entry is computed after converting the tabulation to data frame format.
Next we tabulate the frequency

tab_freq <- table(tab_df$Freq)
tab_freq

## 
## 1 2 
## 8 1

We can see that 8 entries appear once, while 1 entry appear twice.
Therefore, there is one duplicate entry in the ID column.

Retrieving duplicate entries

Next step is to retrieve the rows with the duplicate entries for further scrutiny.

# Retrieve duplicate ID using the subset() function
dup <- subset(tab_df, subset=(tab_df$Freq > 1), )
dup

##   Var1 Freq
## 4  DUP    2

df_dup <- df[df$ID %in% dup$Var1, ]
df_dup

##     ID          var
## 2  DUP -3.332907941
## 10 DUP -0.001118704

Removing selected duplicate entries

After scrutising the duplicate entries, you may decide to choose one or several out of all the duplicates to represent the duplicates.
Say, we want to select just the first duplicate entry in this case.

# Identify row numbers of duplicate ID's in the orginal data frame
# The collapse="|" argument is useful when you have >1 ID's that have duplicates. Here, we only have 1 ID that has duplicates
index_dup <- grep(paste(dup$Var1, sep="", collapse="|"), df$ID)
index_dup

## [1]  2 10

# Select the row number to remove. In this case we would like to remove the first duplicate ID
# We do this using logicals where TRUE represents the entry to remove and FALSE otherwise
index_rm <- index_dup[c(FALSE, TRUE)]
index_rm

## [1] 10

# Remove selected entry using subsetting
df_new <- df[-index_rm, ]
df_new

##    ID        var
## 1   A  0.4979532
## 2 DUP -3.3329079
## 3   J  2.1737207
## 4   E  2.2612339
## 5   G  3.2071791
## 6   H  1.6348022
## 7   B  3.7114680
## 8   F  2.9123182
## 9   C  1.8111144

For more info on subsetting data frames, refer to Subsetting Data Frames.