Query String Unrolling

When processing web application log data, I typically get decently delimited files that have sub-fields embedded in a main field. For example (completely made up):

20141012 12:34:23;www.host.com;asd42s329;o1=a1&o2=a2&o3=a3;200 OK

This parses into TIME, HOST, ID, QUERYSTRING;HTTP RESPONSE

The QUERYSTRING field needs sub-parsing. Here’s one method, which depends on a few packages

Build up the sample set:

suppressPackageStartupMessages(require(tidyr))
suppressPackageStartupMessages(require(dplyr))
suppressPackageStartupMessages(require(stringr))

obs = data.frame(id=seq(1:4), 
                 host=letters[seq(1:4)],
                 query=c("o1=a1&o2=a2&o3=a3", 
                         "o1=b1&o2=b2", 
                         "o3=c3&o1=c1",        #note order
                         "o1=d1&o2=d2&o3=d3"))

Obs would be the result of a read.csv() command or some similar technique

##   id host             query
## 1  1    a o1=a1&o2=a2&o3=a3
## 2  2    b       o1=b1&o2=b2
## 3  3    c       o3=c3&o1=c1
## 4  4    d o1=d1&o2=d2&o3=d3

x = str_split(obs$query, "&")
df = data.frame(id = rep(obs$id, sapply(x, length)),
                host=rep(obs$host, sapply(x, length)),
                onvp=unlist(x),
                stringsAsFactors=F)

df_temp = df %>% separate(onvp, c("var", "value")) %>% arrange(id)
pretty_sure_tidy = df_temp %>% spread(var,value)

And the pretty_sure_tidy data frame has the query-string unrolled.

##   id host o1   o2   o3
## 1  1    a a1   a2   a3
## 2  2    b b1   b2 <NA>
## 3  3    c c1 <NA>   c3
## 4  4    d d1   d2   d3

If you aren’t a tidyr user, reshape2 works

#can do the same thing as above with a transform and then a dcast
suppressPackageStartupMessages(require(reshape2))
df_tr = transform(df, var = sub("=.*", "", onvp), value = sub(".*=", "", onvp))
casted = dcast(data = df_tr, id + host ~ var, value.var="value")

Query String Unrolling

Brent Schneeman

October 9, 2014