Background

Original Reddit Link

I’m currently working on the project where I have about multiple CSV files need to match all combinations of events people attended by matching their customer id. They have been to multiple events. I need the events matched in a new column separated by comma. There is multiple data sets that need to be compared. Vlookup in excel was freezing due to the amount of data. How can I do this in R? I have tidyverse installed. I believe the rbind() function is needed.

Example data

Example data

library(knitr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Generate Sample Data
df1 <- data.frame(LastName = c("Robson", "Robson", "Dixon", "Grimes"),
                  FirstName = c("Jonson", "Jonson", "Darryl", "Rick"),
                  CustID = c(23019, 23019, 23036, 23050),
                  Event = c("NP5", "RTE3", "TWD1", "TWD1"),
                  stringsAsFactors = FALSE)

df2 <- data.frame(LastName = c("Robson", "Robson", "Grimes", "Dixon", "Dixon"),
                  FirstName = c("Jonson", "Jonson", "Lori", "Darryl", "Darryl"),
                  CustID = c(23019, 23019, 23051, 23036, 23036),
                  Event = c("GRT2", "RTE4", "TWD1", "TWD1", "TWD2"),
                  stringsAsFactors = FALSE)

# Combine the data frames
df <- rbind(df1, df2)
knitr::kable(df)
LastName FirstName CustID Event
Robson Jonson 23019 NP5
Robson Jonson 23019 RTE3
Dixon Darryl 23036 TWD1
Grimes Rick 23050 TWD1
Robson Jonson 23019 GRT2
Robson Jonson 23019 RTE4
Grimes Lori 23051 TWD1
Dixon Darryl 23036 TWD1
Dixon Darryl 23036 TWD2
df_summary <- df %>% group_by(LastName, FirstName, CustID) %>%
  summarize(NumEventsUnique = length(unique(Event))) %>%
  arrange(CustID)

knitr::kable(df_summary)
LastName FirstName CustID NumEventsUnique
Robson Jonson 23019 4
Dixon Darryl 23036 2
Grimes Rick 23050 1
Grimes Lori 23051 1

Now, the hard part is done. Although I wouldn’t put multiple events in a single column, it’s a fairly easy task to accomplish.

Using a base R approach.

df_summary$Events <- character(nrow(df_summary))
for (i in 1:nrow(df_summary)) {
  df_summary[i, ]$Events <- paste(df[df_summary[i, ]$CustID == df$CustID, ]$Event, collapse = ", ")
}

knitr::kable(df_summary)
LastName FirstName CustID NumEventsUnique Events
Robson Jonson 23019 4 NP5, RTE3, GRT2, RTE4
Dixon Darryl 23036 2 TWD1, TWD1, TWD2
Grimes Rick 23050 1 TWD1
Grimes Lori 23051 1 TWD1

Using a Tidyverse Approach

df_summary$Events <- character(nrow(df_summary))
df_summary %>% mutate(Events = paste(df[CustID == df$CustID, ]$Event, collapse = ", "))