Translate a comma-seperated field into 1-hot columns. One hot is easy enough, but what if there are multiple values in one field - e.g. multiple product or account names. Here’s two ways how to: first in dplyr style, second in data.table style.

Test Data

Here’s the data

data <- data.table(products = c("P2","P1,P2","P1,P2","","P2,P3","P1,P3"), 
                   accounts = c("A,D,E","","E,B","G,D,F","E,B","C"))
kable(data, format="markdown")

products	accounts
P2	A,D,E
P1,P2
P1,P2	E,B
	G,D,F
P2,P3	E,B
P1,P3	C

A la dplyr

Split symbolic field making the data longer

First, split on the comma and make the data longer. A unique row number is added temporarily.

gathered <- data.frame(transpose( strsplit(data$products,",") ), stringsAsFactors = F) %>%
  mutate(rowno = seq(nrow(data))) %>%
  gather(key, products, -rowno, na.rm=T) %>%
  select(-key) %>% arrange(rowno)
kable(gathered,format="markdown")

rowno	products
1	P2
2	P1
2	P2
3	P1
3	P2
5	P2
5	P3
6	P1
6	P3

Standard one-hot encoding

Now dummy this out using standard tools. Model.matrix would probably work as well.

dummied <- cbind(gathered, predict(dummyVars(~products, data=gathered), gathered, sep="_"))
# Note sep or levelsOnly dont seem to work
kable(dummied,format="markdown")

rowno	products	productsP1	productsP2	productsP3
1	P2	0	1	0
2	P1	1	0	0
2	P2	0	1	0
3	P1	1	0	0
3	P2	0	1	0
5	P2	0	1	0
5	P3	0	0	1
6	P1	1	0	0
6	P3	0	0	1

Group the encoded rows by original row number

Then, finally, group things back together using the row number provided earlier. For empty values, no row would be generated, so we left join it with the row numbers and replace NA by 0. Now this result can just be cbinded to the orginal data.

summarized <- merge(data.table(rowno=seq(nrow(data))),
                    group_by(dummied, rowno) %>% summarise_at(.cols = 3:ncol(dummied),.funs = sum),
                    by="rowno", all.x=T) %>% select(-rowno)
summarized[is.na(summarized)] <- 0
kable(summarized,format="markdown")

productsP1	productsP2	productsP3
0	1	0
1	1	0
1	1	0
0	0	0
0	1	1
1	0	1

Bringing it all together

createDummiesMultiple <- function(values, label)
{
  gathered <- data.frame(transpose( strsplit(values,",") ), stringsAsFactors = F) %>%
    mutate(rowno = seq(length(values))) %>%
    gather(key, dummies, -rowno, na.rm=T) %>%
    select(-key) %>% 
    arrange(rowno)
  dummied <- cbind(gathered, predict(dummyVars(~dummies, data=gathered), gathered))
  summarized <- merge(data.table(rowno=seq(length(values))),
                      group_by(dummied, rowno) %>% 
                        summarise_at(.cols = 3:ncol(dummied),.funs = sum),
                      by="rowno", all.x=T) %>% 
    select(-rowno)
  summarized[is.na(summarized)] <- 0
  # ugly way to set names of dummy columns - some args to dummyVars not working as expected
  setnames(summarized,
           sapply(names(summarized), function(x){
             return (paste(label, substr(x, 1+nchar("dummies"), nchar(x)), sep="."))}))
  return(summarized)
}

final <-
  cbind(data, 
        createDummiesMultiple(data$products, "prd"),
        createDummiesMultiple(data$accounts, "accnt"))

kable(final,format="markdown")

products	accounts	prd.P1	prd.P2	prd.P3	accnt.A	accnt.B	accnt.C	accnt.D	accnt.E	accnt.F	accnt.G
P2	A,D,E	0	1	0	1	0	0	1	1	0	0
P1,P2		1	1	0	0	0	0	0	0	0	0
P1,P2	E,B	1	1	0	0	1	0	0	1	0	0
	G,D,F	0	0	0	0	0	0	1	0	1	1
P2,P3	E,B	0	1	1	0	1	0	0	1	0	0
P1,P3	C	1	0	1	0	0	1	0	0	0	0

Data.table style

An alternative is to bind multiple data tables - splitting the strings as before but each string becomes a little data table all by itself, which are then bulk-merged using rbind. With data.table and rbindlist, this should be a lost faster.

library(data.table)

dummied2 <- rbindlist(lapply(strsplit(data$products,","), function(x){
  df <- data.table(t(rep(1, 1+length(x))))
  setnames(df, c("dummy",x)) # dummy added for when there are no splits
  return(df)
}),use.names = T, fill=T)

kable(dummied2,format="markdown")

dummy	P2	P1	P3
1	1	NA	NA
1	1	1	NA
1	1	1	NA
1	NA	NA	NA
1	1	NA	1
1	NA	1	1

An interesting idea is to create an additional variable with just one symbol, but the most frequent one - frequency defined globally.

sortedLabels <- sort(colSums(select(dummied2, -dummy), na.rm=T), decreasing = T)
dummied2 <- dummied2[, names(sortedLabels), with=F]
dummied2$mostFrequentValue <- names(sortedLabels)[apply(dummied2, 1, 
                                                          function(x) {return(which(!is.na(x))[1])})]

kable(dummied2,format="markdown")

P2	P1	P3	mostFrequentValue
1	NA	NA	P2
1	1	NA	P2
1	1	NA	P2
NA	NA	NA	NA
1	NA	1	P2
NA	1	1	P1

So if we generalize this just a little bit, we get the same results but much faster because of the efficience of data.table operations.

The below code is also resilient against multiple occurences of the same value in a comma seperated string. It also only does the realy hot-one encoding for the unique values of the field (for efficiency). The second function (createDummiesMultipleWithDominantValue) returns a data table with the unique values as the first element, so ready for left-join by the caller.

createDummiesMultiple2 <- function(values)
{
  mlist <- rbindlist(lapply(lapply(lapply(lapply(strsplit(values,","), tolower), trimws), unique), function(x){
    df <- data.table(t(rep(1, 1+length(x))))
    setnames(df, make.names(c("dummy",x)))
    return(df)
  }),use.names = T, fill=T)[, dummy:=NULL]
  mlist[is.na(mlist)] <- 0
  setnames(mlist, make.names(names(mlist)))
  return(mlist[,sort(names(mlist)),with=F])
}

createDummiesMultipleWithDominantValue <- function(data, colName, threshold=0)
{
  groupedByCol <- data %>% group_by_(colName) %>% summarise(n = n())
  oneHot <- createDummiesMultiple2(groupedByCol[[1]])
  sortedLabels <- sort(colSums(oneHot * groupedByCol$n, na.rm=T), decreasing = T)
  oneHot <- oneHot[, names(sortedLabels), with=F]
  oneHot[[".dominantValue"]] <- 
    names(sortedLabels)[apply(oneHot, 1, function(x) {return(which(x!=0)[1])})]
  oneHot <- cbind(select(groupedByCol, -n), 
                  oneHot[, c(names(sortedLabels)[which(sortedLabels >= threshold)], ".dominantValue"), with=F])
  setnames(oneHot, c(colName, paste(colName, names(oneHot)[2:ncol(oneHot)], sep=".")))
  setkeyv(oneHot, colName)
  
  return(oneHot)
} 

final2 <- data %>% 
  merge(createDummiesMultipleWithDominantValue(data, "products", 3), all.x = T, by = "products") %>%
  merge(createDummiesMultipleWithDominantValue(data, "accounts"), all.x = T, by = "accounts")

kable(final2,format="markdown")

accounts	products	products.p2	products.p1	products..dominantValue	accounts.e	accounts.b	accounts.d	accounts.a	accounts.c	accounts.f	accounts.g	accounts..dominantValue
	P1,P2	1	1	p2	0	0	0	0	0	0	0	NA
A,D,E	P2	1	0	p2	1	0	1	1	0	0	0	e
C	P1,P3	0	1	p1	0	0	0	0	1	0	0	c
E,B	P1,P2	1	1	p2	1	1	0	0	0	0	0	e
E,B	P2,P3	1	0	p2	1	1	0	0	0	0	0	e
G,D,F		0	0	NA	0	0	1	0	0	1	1	d

Dummy Vars

Otto