I’m currently working on the project where I have about multiple CSV files need to match all combinations of events people attended by matching their customer id. They have been to multiple events. I need the events matched in a new column separated by comma. There is multiple data sets that need to be compared. Vlookup in excel was freezing due to the amount of data. How can I do this in R? I have tidyverse installed. I believe the rbind() function is needed.
Example data
library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Generate Sample Data
df1 <- data.frame(LastName = c("Robson", "Robson", "Dixon", "Grimes"),
FirstName = c("Jonson", "Jonson", "Darryl", "Rick"),
CustID = c(23019, 23019, 23036, 23050),
Event = c("NP5", "RTE3", "TWD1", "TWD1"),
stringsAsFactors = FALSE)
df2 <- data.frame(LastName = c("Robson", "Robson", "Grimes", "Dixon", "Dixon"),
FirstName = c("Jonson", "Jonson", "Lori", "Darryl", "Darryl"),
CustID = c(23019, 23019, 23051, 23036, 23036),
Event = c("GRT2", "RTE4", "TWD1", "TWD1", "TWD2"),
stringsAsFactors = FALSE)
# Combine the data frames
df <- rbind(df1, df2)
knitr::kable(df)
LastName | FirstName | CustID | Event |
---|---|---|---|
Robson | Jonson | 23019 | NP5 |
Robson | Jonson | 23019 | RTE3 |
Dixon | Darryl | 23036 | TWD1 |
Grimes | Rick | 23050 | TWD1 |
Robson | Jonson | 23019 | GRT2 |
Robson | Jonson | 23019 | RTE4 |
Grimes | Lori | 23051 | TWD1 |
Dixon | Darryl | 23036 | TWD1 |
Dixon | Darryl | 23036 | TWD2 |
df_summary <- df %>% group_by(LastName, FirstName, CustID) %>%
summarize(NumEventsUnique = length(unique(Event))) %>%
arrange(CustID)
knitr::kable(df_summary)
LastName | FirstName | CustID | NumEventsUnique |
---|---|---|---|
Robson | Jonson | 23019 | 4 |
Dixon | Darryl | 23036 | 2 |
Grimes | Rick | 23050 | 1 |
Grimes | Lori | 23051 | 1 |
Now, the hard part is done. Although I wouldn’t put multiple events in a single column, it’s a fairly easy task to accomplish.
base
R approach.df_summary$Events <- character(nrow(df_summary))
for (i in 1:nrow(df_summary)) {
df_summary[i, ]$Events <- paste(df[df_summary[i, ]$CustID == df$CustID, ]$Event, collapse = ", ")
}
knitr::kable(df_summary)
LastName | FirstName | CustID | NumEventsUnique | Events |
---|---|---|---|---|
Robson | Jonson | 23019 | 4 | NP5, RTE3, GRT2, RTE4 |
Dixon | Darryl | 23036 | 2 | TWD1, TWD1, TWD2 |
Grimes | Rick | 23050 | 1 | TWD1 |
Grimes | Lori | 23051 | 1 | TWD1 |
df_summary$Events <- character(nrow(df_summary))
df_summary %>% mutate(Events = paste(df[CustID == df$CustID, ]$Event, collapse = ", "))