Translate a comma-seperated field into 1-hot columns. One hot is easy enough, but what if there are multiple values in one field - e.g. multiple product or account names. Here’s two ways how to: first in dplyr style, second in data.table style.
Here’s the data
data <- data.table(products = c("P2","P1,P2","P1,P2","","P2,P3","P1,P3"),
accounts = c("A,D,E","","E,B","G,D,F","E,B","C"))
kable(data, format="markdown")
| products | accounts |
|---|---|
| P2 | A,D,E |
| P1,P2 | |
| P1,P2 | E,B |
| G,D,F | |
| P2,P3 | E,B |
| P1,P3 | C |
First, split on the comma and make the data longer. A unique row number is added temporarily.
gathered <- data.frame(transpose( strsplit(data$products,",") ), stringsAsFactors = F) %>%
mutate(rowno = seq(nrow(data))) %>%
gather(key, products, -rowno, na.rm=T) %>%
select(-key) %>% arrange(rowno)
kable(gathered,format="markdown")
| rowno | products |
|---|---|
| 1 | P2 |
| 2 | P1 |
| 2 | P2 |
| 3 | P1 |
| 3 | P2 |
| 5 | P2 |
| 5 | P3 |
| 6 | P1 |
| 6 | P3 |
Now dummy this out using standard tools. Model.matrix would probably work as well.
dummied <- cbind(gathered, predict(dummyVars(~products, data=gathered), gathered, sep="_"))
# Note sep or levelsOnly dont seem to work
kable(dummied,format="markdown")
| rowno | products | productsP1 | productsP2 | productsP3 |
|---|---|---|---|---|
| 1 | P2 | 0 | 1 | 0 |
| 2 | P1 | 1 | 0 | 0 |
| 2 | P2 | 0 | 1 | 0 |
| 3 | P1 | 1 | 0 | 0 |
| 3 | P2 | 0 | 1 | 0 |
| 5 | P2 | 0 | 1 | 0 |
| 5 | P3 | 0 | 0 | 1 |
| 6 | P1 | 1 | 0 | 0 |
| 6 | P3 | 0 | 0 | 1 |
Then, finally, group things back together using the row number provided earlier. For empty values, no row would be generated, so we left join it with the row numbers and replace NA by 0. Now this result can just be cbinded to the orginal data.
summarized <- merge(data.table(rowno=seq(nrow(data))),
group_by(dummied, rowno) %>% summarise_at(.cols = 3:ncol(dummied),.funs = sum),
by="rowno", all.x=T) %>% select(-rowno)
summarized[is.na(summarized)] <- 0
kable(summarized,format="markdown")
| productsP1 | productsP2 | productsP3 |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
createDummiesMultiple <- function(values, label)
{
gathered <- data.frame(transpose( strsplit(values,",") ), stringsAsFactors = F) %>%
mutate(rowno = seq(length(values))) %>%
gather(key, dummies, -rowno, na.rm=T) %>%
select(-key) %>%
arrange(rowno)
dummied <- cbind(gathered, predict(dummyVars(~dummies, data=gathered), gathered))
summarized <- merge(data.table(rowno=seq(length(values))),
group_by(dummied, rowno) %>%
summarise_at(.cols = 3:ncol(dummied),.funs = sum),
by="rowno", all.x=T) %>%
select(-rowno)
summarized[is.na(summarized)] <- 0
# ugly way to set names of dummy columns - some args to dummyVars not working as expected
setnames(summarized,
sapply(names(summarized), function(x){
return (paste(label, substr(x, 1+nchar("dummies"), nchar(x)), sep="."))}))
return(summarized)
}
final <-
cbind(data,
createDummiesMultiple(data$products, "prd"),
createDummiesMultiple(data$accounts, "accnt"))
kable(final,format="markdown")
| products | accounts | prd.P1 | prd.P2 | prd.P3 | accnt.A | accnt.B | accnt.C | accnt.D | accnt.E | accnt.F | accnt.G |
|---|---|---|---|---|---|---|---|---|---|---|---|
| P2 | A,D,E | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| P1,P2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| P1,P2 | E,B | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| G,D,F | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | |
| P2,P3 | E,B | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| P1,P3 | C | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
An alternative is to bind multiple data tables - splitting the strings as before but each string becomes a little data table all by itself, which are then bulk-merged using rbind. With data.table and rbindlist, this should be a lost faster.
library(data.table)
dummied2 <- rbindlist(lapply(strsplit(data$products,","), function(x){
df <- data.table(t(rep(1, 1+length(x))))
setnames(df, c("dummy",x)) # dummy added for when there are no splits
return(df)
}),use.names = T, fill=T)
kable(dummied2,format="markdown")
| dummy | P2 | P1 | P3 |
|---|---|---|---|
| 1 | 1 | NA | NA |
| 1 | 1 | 1 | NA |
| 1 | 1 | 1 | NA |
| 1 | NA | NA | NA |
| 1 | 1 | NA | 1 |
| 1 | NA | 1 | 1 |
An interesting idea is to create an additional variable with just one symbol, but the most frequent one - frequency defined globally.
sortedLabels <- sort(colSums(select(dummied2, -dummy), na.rm=T), decreasing = T)
dummied2 <- dummied2[, names(sortedLabels), with=F]
dummied2$mostFrequentValue <- names(sortedLabels)[apply(dummied2, 1,
function(x) {return(which(!is.na(x))[1])})]
kable(dummied2,format="markdown")
| P2 | P1 | P3 | mostFrequentValue |
|---|---|---|---|
| 1 | NA | NA | P2 |
| 1 | 1 | NA | P2 |
| 1 | 1 | NA | P2 |
| NA | NA | NA | NA |
| 1 | NA | 1 | P2 |
| NA | 1 | 1 | P1 |
So if we generalize this just a little bit, we get the same results but much faster because of the efficience of data.table operations.
The below code is also resilient against multiple occurences of the same value in a comma seperated string. It also only does the realy hot-one encoding for the unique values of the field (for efficiency). The second function (createDummiesMultipleWithDominantValue) returns a data table with the unique values as the first element, so ready for left-join by the caller.
createDummiesMultiple2 <- function(values)
{
mlist <- rbindlist(lapply(lapply(lapply(lapply(strsplit(values,","), tolower), trimws), unique), function(x){
df <- data.table(t(rep(1, 1+length(x))))
setnames(df, make.names(c("dummy",x)))
return(df)
}),use.names = T, fill=T)[, dummy:=NULL]
mlist[is.na(mlist)] <- 0
setnames(mlist, make.names(names(mlist)))
return(mlist[,sort(names(mlist)),with=F])
}
createDummiesMultipleWithDominantValue <- function(data, colName, threshold=0)
{
groupedByCol <- data %>% group_by_(colName) %>% summarise(n = n())
oneHot <- createDummiesMultiple2(groupedByCol[[1]])
sortedLabels <- sort(colSums(oneHot * groupedByCol$n, na.rm=T), decreasing = T)
oneHot <- oneHot[, names(sortedLabels), with=F]
oneHot[[".dominantValue"]] <-
names(sortedLabels)[apply(oneHot, 1, function(x) {return(which(x!=0)[1])})]
oneHot <- cbind(select(groupedByCol, -n),
oneHot[, c(names(sortedLabels)[which(sortedLabels >= threshold)], ".dominantValue"), with=F])
setnames(oneHot, c(colName, paste(colName, names(oneHot)[2:ncol(oneHot)], sep=".")))
setkeyv(oneHot, colName)
return(oneHot)
}
final2 <- data %>%
merge(createDummiesMultipleWithDominantValue(data, "products", 3), all.x = T, by = "products") %>%
merge(createDummiesMultipleWithDominantValue(data, "accounts"), all.x = T, by = "accounts")
kable(final2,format="markdown")
| accounts | products | products.p2 | products.p1 | products..dominantValue | accounts.e | accounts.b | accounts.d | accounts.a | accounts.c | accounts.f | accounts.g | accounts..dominantValue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P1,P2 | 1 | 1 | p2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | |
| A,D,E | P2 | 1 | 0 | p2 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | e |
| C | P1,P3 | 0 | 1 | p1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | c |
| E,B | P1,P2 | 1 | 1 | p2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | e |
| E,B | P2,P3 | 1 | 0 | p2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | e |
| G,D,F | 0 | 0 | NA | 0 | 0 | 1 | 0 | 0 | 1 | 1 | d |