In this example, I want to extract the opening times from the text using regular expression. The time format is iregular, such as “Monday: 5:30 PM – 7:15 PM, 5:30 – 7:15 PM”, “Closed”, “Opening 24 hours”, or “” (i.e., nothing). The target output is a tidy data frame, in the format of “id,open,close”. If any id has multiple opening-closing times in the day, multiple lines will be generated for that id.
This is the input data:
id <- letters[1:6]
openHour <- c("Monday: 5:30 PM – 7:15 PM, 5:30 PM – 7:15 PM", "Monday: 5:30 PM – 7:15 PM", "Monday: 5:30 PM – 7:15 PM, 5:30 – 7:15 PM, 9:00 PM - 12:00 AM", "", "Monday: Closed", "Monday: Open 24 hours")
dfInput <- data.frame(id = id, Monday = openHour)
dfInput
This is the targeted result:
dfResult <- data.frame(id = c("a","b","c", "e", "f", "a", "c", "c"), open = c("17:30","17:30","17:30","0:00","0:00","17:30","17:30","21:00"), close = c("19:15","19:15","19:15","0:00","23:59","19:15","19:15","0:00"), stringsAsFactors = F)
dfResult
These are the packages that are loaded and used here.
require(dplyr)
require(stringr)
require(data.table)
The first step is to tidy up the data, which includes the following functions: replacing non-ASCII characters (“–”) removing “Monday:” replacing irregular formats of “Closed” and “Open 24 hours”.
We are going to replace “-” with “T”, “Closed” with “12:00 AM T 12:00 AM”, and “Open 24 hours” with “12:00 AM T 11:59 PM”. Note that “-” is replaced by “T”, as “T” is never used in the time format here, and “T” means “TO”.
After the processing, we are going to extract the time formats. There are two types of time formats, namely “5:30 PM T 7:15 PM” and “5:30 T 7:15 PM”. Regular expressions are very powerful to express the patterns.
listExtract <- dfInput$Monday %>% str_replace_all("[^[:alnum:] :]", replacement = "T") %>%
str_replace_all("Monday: ", replacement = "") %>%
str_replace_all("Closed", replacement = "12:00 AM T 12:00 AM") %>%
str_replace_all("Open 24 hours", replacement = "12:00 AM T 11:59 PM") %>%
str_extract_all("[:digit:]+:[:digit:]+ [:alpha:]* *T [:digit:]+:[:digit:]+ [:alpha:]*")
listExtract
[[1]]
[1] "5:30 PM T 7:15 PMT" "5:30 PM T 7:15 PM"
[[2]]
[1] "5:30 PM T 7:15 PM"
[[3]]
[1] "5:30 PM T 7:15 PMT" "5:30 T 7:15 PMT" "9:00 PM T 12:00 AM"
[[4]]
character(0)
[[5]]
[1] "12:00 AM T 12:00 AM"
[[6]]
[1] "12:00 AM T 11:59 PM"
After extracting the times, the result is a list of character vectors that have different lengths (from 0 to 3). The list will be converted into a data frame.
dfExtract <- plyr::ldply(listExtract, rbind) %>%
cbind(dfInput$id) %>%
rename(id = `dfInput$id`, time1=`1`, time2=`2`, time3 =`3`)
dfExtract
The data frame dfExtract consists of many NA values. Moreover, the format is different from the objective format. Using melt function to adjust the format.
dfExtract <- dfExtract %>%
data.table::melt(id="id") %>%
select(id, value) %>%
filter(!is.na(value))
attributes are not identical across measure variables; they will be dropped
dfExtract
The final step is to extract the opening and closing times from the value column. The new functions are used here. Function of addAMPM is to transform the format of “5:30 – 7:15 PM” into “5:30 PM – 7:15 PM”. The other function, called reformatTime, is to transform the time format into 24-hour. For example, “7:00 AM” is transformed into “7:00”, and “7:00 PM” into “19:00”.
Note the format of hour:minute. We want to use the uniform 2-digit minute format, so the C-style sprintf function is used here.
# a function to reformat time: 7:00 AM as 7:00
# 7:00 PM as 19:00
reformatTime <- function(time){
newTime <- strptime(time,format="%I:%M %p")
paste0(hour(newTime), ":", sprintf("%02d",minute(newTime)))
}
# Turn "5:30 – 7:15 PM" into "5:30 PM – 7:15 PM"
# If the input is "5:30 PM – 7:15 PM", the output is the same as input
addAMPM <- function(time){
# if it is in the format of 5:30 – 7:15 PM
if(!grepl(x = time, pattern = ".*[AP]M.*[AP]M")){
apm <- str_extract_all(time, pattern = "[AP]M")[[1]]
# apm: "PM"
splitVec <- str_split(time, pattern = "T")[[1]]
# splitVec: "5:30 ", "7:15 PM"
newTime <- paste0(splitVec[1], apm, " T", splitVec[2])
newTime
}
else{
return(time)
}
}
# the vectorized version of addAMPM
vecAddAMPM <- Vectorize(addAMPM, SIMPLIFY = T)
# Extracting the open and close times.
dfExtract <- dfExtract %>%
mutate(value = vecAddAMPM(value)) %>%
mutate(open = reformatTime(str_extract(value, pattern = "^[:digit:]+:[:digit:]+ [:alpha:]*"))) %>%
mutate(close = reformatTime(str_extract(value, pattern = "[:digit:]+:[:digit:]+ [:alpha:]*$"))) %>%
select(-value)
dfExtract
LS0tDQp0aXRsZTogIkV4dHJhY3RpbmcgdGhlIHRpbWVzIHVzaW5nIFJlZ3VsYXIgRXhwcmVzc2lvbiINCm91dHB1dDogaHRtbF9ub3RlYm9vaw0KLS0tDQoNCkluIHRoaXMgZXhhbXBsZSwgSSB3YW50IHRvIGV4dHJhY3QgdGhlIG9wZW5pbmcgdGltZXMgZnJvbSB0aGUgdGV4dCB1c2luZyByZWd1bGFyIGV4cHJlc3Npb24uIFRoZSB0aW1lIGZvcm1hdCBpcyBpcmVndWxhciwgc3VjaCBhcyAiTW9uZGF5OiA1OjMwIFBNIOKAkyA3OjE1IFBNLCA1OjMwIOKAkyA3OjE1IFBNIiwgIkNsb3NlZCIsICJPcGVuaW5nIDI0IGhvdXJzIiwgb3IgIiIgKGkuZS4sIG5vdGhpbmcpLiBUaGUgdGFyZ2V0IG91dHB1dCBpcyBhIHRpZHkgZGF0YSBmcmFtZSwgaW4gdGhlIGZvcm1hdCBvZiAiaWQsb3BlbixjbG9zZSIuIElmIGFueSBpZCBoYXMgbXVsdGlwbGUgb3BlbmluZy1jbG9zaW5nIHRpbWVzIGluIHRoZSBkYXksIG11bHRpcGxlIGxpbmVzIHdpbGwgYmUgZ2VuZXJhdGVkIGZvciB0aGF0IGlkLg0KDQpUaGlzIGlzIHRoZSBpbnB1dCBkYXRhOg0KYGBge3J9DQppZCA8LSBsZXR0ZXJzWzE6Nl0NCm9wZW5Ib3VyIDwtIGMoIk1vbmRheTogNTozMCBQTSDigJMgNzoxNSBQTSwgNTozMCBQTSDigJMgNzoxNSBQTSIsICJNb25kYXk6IDU6MzAgUE0g4oCTIDc6MTUgUE0iLCAiTW9uZGF5OiA1OjMwIFBNIOKAkyA3OjE1IFBNLCA1OjMwIOKAkyA3OjE1IFBNLCA5OjAwIFBNIC0gMTI6MDAgQU0iLCAiIiwgIk1vbmRheTogQ2xvc2VkIiwgIk1vbmRheTogT3BlbiAyNCBob3VycyIpDQpkZklucHV0IDwtIGRhdGEuZnJhbWUoaWQgPSBpZCwgTW9uZGF5ID0gb3BlbkhvdXIpDQpkZklucHV0DQpgYGANCg0KVGhpcyBpcyB0aGUgdGFyZ2V0ZWQgcmVzdWx0Og0KYGBge3J9DQpkZlJlc3VsdCA8LSBkYXRhLmZyYW1lKGlkID0gYygiYSIsImIiLCJjIiwgImUiLCAiZiIsICJhIiwgImMiLCAiYyIpLCBvcGVuID0gYygiMTc6MzAiLCIxNzozMCIsIjE3OjMwIiwiMDowMCIsIjA6MDAiLCIxNzozMCIsIjE3OjMwIiwiMjE6MDAiKSwgY2xvc2UgPSBjKCIxOToxNSIsIjE5OjE1IiwiMTk6MTUiLCIwOjAwIiwiMjM6NTkiLCIxOToxNSIsIjE5OjE1IiwiMDowMCIpLCBzdHJpbmdzQXNGYWN0b3JzID0gRikNCmRmUmVzdWx0DQpgYGANCg0KVGhlc2UgYXJlIHRoZSBwYWNrYWdlcyB0aGF0IGFyZSBsb2FkZWQgYW5kIHVzZWQgaGVyZS4NCmBgYHtyfQ0KcmVxdWlyZShkcGx5cikNCnJlcXVpcmUoc3RyaW5ncikNCnJlcXVpcmUoZGF0YS50YWJsZSkNCmBgYA0KDQpUaGUgZmlyc3Qgc3RlcCBpcyB0byB0aWR5IHVwIHRoZSBkYXRhLCB3aGljaCBpbmNsdWRlcyB0aGUgZm9sbG93aW5nIGZ1bmN0aW9uczoNCnJlcGxhY2luZyBub24tQVNDSUkgY2hhcmFjdGVycyAoIuKAkyIpDQpyZW1vdmluZyAiTW9uZGF5OiAiDQpyZXBsYWNpbmcgaXJyZWd1bGFyIGZvcm1hdHMgb2YgIkNsb3NlZCIgYW5kICJPcGVuIDI0IGhvdXJzIi4NCg0KV2UgYXJlIGdvaW5nIHRvIHJlcGxhY2UgIi0iIHdpdGggIlQiLCAiQ2xvc2VkIiB3aXRoICIxMjowMCBBTSBUIDEyOjAwIEFNIiwgYW5kICJPcGVuIDI0IGhvdXJzIiB3aXRoICIxMjowMCBBTSBUIDExOjU5IFBNIi4gTm90ZSB0aGF0ICItIiBpcyByZXBsYWNlZCBieSAiVCIsIGFzICJUIiBpcyBuZXZlciB1c2VkIGluIHRoZSB0aW1lIGZvcm1hdCBoZXJlLCBhbmQgIlQiIG1lYW5zICJUTyIuDQoNCkFmdGVyIHRoZSBwcm9jZXNzaW5nLCB3ZSBhcmUgZ29pbmcgdG8gZXh0cmFjdCB0aGUgdGltZSBmb3JtYXRzLiBUaGVyZSBhcmUgdHdvIHR5cGVzIG9mIHRpbWUgZm9ybWF0cywgbmFtZWx5ICI1OjMwIFBNIFQgNzoxNSBQTSIgYW5kICI1OjMwIFQgNzoxNSBQTSIuIFJlZ3VsYXIgZXhwcmVzc2lvbnMgYXJlIHZlcnkgcG93ZXJmdWwgdG8gZXhwcmVzcyB0aGUgcGF0dGVybnMuDQoNCmBgYHtyfQ0KbGlzdEV4dHJhY3QgPC0gZGZJbnB1dCRNb25kYXkgJT4lIHN0cl9yZXBsYWNlX2FsbCgiW15bOmFsbnVtOl0gOl0iLCByZXBsYWNlbWVudCA9ICJUIikgJT4lDQogICAgc3RyX3JlcGxhY2VfYWxsKCJNb25kYXk6ICIsIHJlcGxhY2VtZW50ID0gIiIpICU+JQ0KICAgIHN0cl9yZXBsYWNlX2FsbCgiQ2xvc2VkIiwgcmVwbGFjZW1lbnQgPSAiMTI6MDAgQU0gVCAxMjowMCBBTSIpICU+JQ0KICAgIHN0cl9yZXBsYWNlX2FsbCgiT3BlbiAyNCBob3VycyIsIHJlcGxhY2VtZW50ID0gIjEyOjAwIEFNIFQgMTE6NTkgUE0iKSAlPiUNCiAgICBzdHJfZXh0cmFjdF9hbGwoIls6ZGlnaXQ6XSs6WzpkaWdpdDpdKyBbOmFscGhhOl0qICpUIFs6ZGlnaXQ6XSs6WzpkaWdpdDpdKyBbOmFscGhhOl0qIikNCmxpc3RFeHRyYWN0DQpgYGANCg0KQWZ0ZXIgZXh0cmFjdGluZyB0aGUgdGltZXMsIHRoZSByZXN1bHQgaXMgYSBsaXN0IG9mIGNoYXJhY3RlciB2ZWN0b3JzIHRoYXQgaGF2ZSBkaWZmZXJlbnQgbGVuZ3RocyAoZnJvbSAwIHRvIDMpLiBUaGUgbGlzdCB3aWxsIGJlIGNvbnZlcnRlZCBpbnRvIGEgZGF0YSBmcmFtZS4NCmBgYHtyfQ0KZGZFeHRyYWN0IDwtIHBseXI6OmxkcGx5KGxpc3RFeHRyYWN0LCByYmluZCkgJT4lIA0KICAgIGNiaW5kKGRmSW5wdXQkaWQpICU+JQ0KICAgIHJlbmFtZShpZCA9IGBkZklucHV0JGlkYCwgdGltZTE9YDFgLCB0aW1lMj1gMmAsIHRpbWUzID1gM2ApDQpkZkV4dHJhY3QNCmBgYA0KVGhlIGRhdGEgZnJhbWUgZGZFeHRyYWN0IGNvbnNpc3RzIG9mIG1hbnkgTkEgdmFsdWVzLiBNb3Jlb3ZlciwgdGhlIGZvcm1hdCBpcyBkaWZmZXJlbnQgZnJvbSB0aGUgb2JqZWN0aXZlIGZvcm1hdC4gVXNpbmcgbWVsdCBmdW5jdGlvbiB0byBhZGp1c3QgdGhlIGZvcm1hdC4NCmBgYHtyfQ0KZGZFeHRyYWN0IDwtIGRmRXh0cmFjdCAlPiUgDQogICAgZGF0YS50YWJsZTo6bWVsdChpZD0iaWQiKSAlPiUgDQogICAgc2VsZWN0KGlkLCB2YWx1ZSkgJT4lDQogICAgZmlsdGVyKCFpcy5uYSh2YWx1ZSkpDQpkZkV4dHJhY3QNCmBgYA0KVGhlIGZpbmFsIHN0ZXAgaXMgdG8gZXh0cmFjdCB0aGUgb3BlbmluZyBhbmQgY2xvc2luZyB0aW1lcyBmcm9tIHRoZSB2YWx1ZSBjb2x1bW4uIFRoZSBuZXcgZnVuY3Rpb25zIGFyZSB1c2VkIGhlcmUuIEZ1bmN0aW9uIG9mIGFkZEFNUE0gaXMgdG8gdHJhbnNmb3JtIHRoZSBmb3JtYXQgb2YgIjU6MzAg4oCTIDc6MTUgUE0iIGludG8gIjU6MzAgUE0g4oCTIDc6MTUgUE0iLiBUaGUgb3RoZXIgZnVuY3Rpb24sIGNhbGxlZCByZWZvcm1hdFRpbWUsIGlzIHRvIHRyYW5zZm9ybSB0aGUgdGltZSBmb3JtYXQgaW50byAyNC1ob3VyLiBGb3IgZXhhbXBsZSwgIjc6MDAgQU0iIGlzIHRyYW5zZm9ybWVkIGludG8gIjc6MDAiLCBhbmQgIjc6MDAgUE0iIGludG8gIjE5OjAwIi4NCg0KTm90ZSB0aGUgZm9ybWF0IG9mIGhvdXI6bWludXRlLiBXZSB3YW50IHRvIHVzZSB0aGUgdW5pZm9ybSAyLWRpZ2l0IG1pbnV0ZSBmb3JtYXQsIHNvIHRoZSBDLXN0eWxlIHNwcmludGYgZnVuY3Rpb24gaXMgdXNlZCBoZXJlLg0KYGBge3J9DQojIGEgZnVuY3Rpb24gdG8gcmVmb3JtYXQgdGltZTogNzowMCBBTSBhcyA3OjAwDQojIDc6MDAgUE0gYXMgMTk6MDANCnJlZm9ybWF0VGltZSA8LSBmdW5jdGlvbih0aW1lKXsNCiAgICBuZXdUaW1lIDwtIHN0cnB0aW1lKHRpbWUsZm9ybWF0PSIlSTolTSAlcCIpDQogICAgcGFzdGUwKGhvdXIobmV3VGltZSksICI6Iiwgc3ByaW50ZigiJTAyZCIsbWludXRlKG5ld1RpbWUpKSkNCn0NCg0KIyBUdXJuICI1OjMwIOKAkyA3OjE1IFBNIiBpbnRvICI1OjMwIFBNIOKAkyA3OjE1IFBNIg0KIyBJZiB0aGUgaW5wdXQgaXMgIjU6MzAgUE0g4oCTIDc6MTUgUE0iLCB0aGUgb3V0cHV0IGlzIHRoZSBzYW1lIGFzIGlucHV0DQphZGRBTVBNIDwtIGZ1bmN0aW9uKHRpbWUpew0KICAgICMgaWYgaXQgaXMgaW4gdGhlIGZvcm1hdCBvZiA1OjMwIOKAkyA3OjE1IFBNDQogICAgaWYoIWdyZXBsKHggPSB0aW1lLCBwYXR0ZXJuID0gIi4qW0FQXU0uKltBUF1NIikpew0KICAgICAgICBhcG0gPC0gc3RyX2V4dHJhY3RfYWxsKHRpbWUsIHBhdHRlcm4gPSAiW0FQXU0iKVtbMV1dDQogICAgICAgICMgYXBtOiAiUE0iDQogICAgICAgIHNwbGl0VmVjIDwtIHN0cl9zcGxpdCh0aW1lLCBwYXR0ZXJuID0gIlQiKVtbMV1dDQogICAgICAgICMgc3BsaXRWZWM6ICI1OjMwICIsICI3OjE1IFBNIg0KICAgICAgICBuZXdUaW1lIDwtIHBhc3RlMChzcGxpdFZlY1sxXSwgYXBtLCAiIFQiLCBzcGxpdFZlY1syXSkNCiAgICAgICAgbmV3VGltZQ0KICAgIH0NCiAgICBlbHNlew0KICAgICAgICByZXR1cm4odGltZSkNCiAgICB9DQp9DQoNCiMgdGhlIHZlY3Rvcml6ZWQgdmVyc2lvbiBvZiBhZGRBTVBNDQp2ZWNBZGRBTVBNIDwtIFZlY3Rvcml6ZShhZGRBTVBNLCBTSU1QTElGWSA9IFQpDQoNCiMgRXh0cmFjdGluZyB0aGUgb3BlbiBhbmQgY2xvc2UgdGltZXMuDQpkZkV4dHJhY3QgPC0gZGZFeHRyYWN0ICU+JQ0KICAgIG11dGF0ZSh2YWx1ZSA9IHZlY0FkZEFNUE0odmFsdWUpKSAlPiUNCiAgICBtdXRhdGUob3BlbiA9IHJlZm9ybWF0VGltZShzdHJfZXh0cmFjdCh2YWx1ZSwgcGF0dGVybiA9ICJeWzpkaWdpdDpdKzpbOmRpZ2l0Ol0rIFs6YWxwaGE6XSoiKSkpICU+JQ0KICAgIG11dGF0ZShjbG9zZSA9IHJlZm9ybWF0VGltZShzdHJfZXh0cmFjdCh2YWx1ZSwgcGF0dGVybiA9ICJbOmRpZ2l0Ol0rOls6ZGlnaXQ6XSsgWzphbHBoYTpdKiQiKSkpICU+JQ0KICAgIHNlbGVjdCgtdmFsdWUpDQpkZkV4dHJhY3QNCmBgYA0K