When processing web application log data, I typically get decently delimited files that have sub-fields embedded in a main field. For example (completely made up):
20141012 12:34:23;www.host.com;asd42s329;o1=a1&o2=a2&o3=a3;200 OK
This parses into TIME, HOST, ID, QUERYSTRING;HTTP RESPONSE
The QUERYSTRING field needs sub-parsing. Here’s one method, which depends on a few packages
Build up the sample set:
suppressPackageStartupMessages(require(tidyr))
suppressPackageStartupMessages(require(dplyr))
suppressPackageStartupMessages(require(stringr))
obs = data.frame(id=seq(1:4),
host=letters[seq(1:4)],
query=c("o1=a1&o2=a2&o3=a3",
"o1=b1&o2=b2",
"o3=c3&o1=c1", #note order
"o1=d1&o2=d2&o3=d3"))
Obs would be the result of a read.csv() command or some similar technique
## id host query
## 1 1 a o1=a1&o2=a2&o3=a3
## 2 2 b o1=b1&o2=b2
## 3 3 c o3=c3&o1=c1
## 4 4 d o1=d1&o2=d2&o3=d3
x = str_split(obs$query, "&")
df = data.frame(id = rep(obs$id, sapply(x, length)),
host=rep(obs$host, sapply(x, length)),
onvp=unlist(x),
stringsAsFactors=F)
df_temp = df %>% separate(onvp, c("var", "value")) %>% arrange(id)
pretty_sure_tidy = df_temp %>% spread(var,value)
And the pretty_sure_tidy data frame has the query-string unrolled.
## id host o1 o2 o3
## 1 1 a a1 a2 a3
## 2 2 b b1 b2 <NA>
## 3 3 c c1 <NA> c3
## 4 4 d d1 d2 d3
If you aren’t a tidyr user, reshape2 works
#can do the same thing as above with a transform and then a dcast
suppressPackageStartupMessages(require(reshape2))
df_tr = transform(df, var = sub("=.*", "", onvp), value = sub(".*=", "", onvp))
casted = dcast(data = df_tr, id + host ~ var, value.var="value")