REGEX
Besides ordinary characters, which simply match themselves, regexes have some special characters that do something different from their literal job. For instance, `.` matches any single character (except a newline). When a string includes one of them and we want its literal meaning, we should escape it by adding a `\`. This is the case with the stringr package, which uses strings to represent regexes. Strings already use `\` for their own escapes, and regexes need another `\` to escape. Thus, when we want to match a special character like `.` literally in a regex pattern, we write two backslashes: one `\` is needed for escaping in the regex and another one is needed for escaping in the string: `\\.`, `\\n`, `\\t`, … .
The interesting point is that we need four backslashes to match a single literal `\` in a regex pattern: `\\\\`.
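The double-escaping rule can be checked with a small sketch. It uses Python's `re` module as a stand-in (an assumption; these notes are about stringr), because ordinary Python string literals treat `\` the same way, so the same pattern strings apply. The sample texts are made up for illustration.

```python
import re

# In an ordinary (non-raw) string literal, "\\." holds the two characters \ and .
# so the regex engine receives \. and matches only a literal dot.
text = "a.b axb"
literal_dot = re.findall("a\\.b", text)   # ['a.b']
any_char = re.findall("a.b", text)        # ['a.b', 'axb']

# Four backslashes in the literal become two in the string,
# which the regex engine reads as one literal backslash.
path = "C:\\temp\\file"                   # the string contains two backslashes
backslashes = re.findall("\\\\", path)

print(literal_dot)
print(any_char)
print(len(backslashes))                   # 2
```

The unescaped `.` also matches `axb`, which is exactly the surprise the escaping is meant to prevent.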
I list the most common regexes here:
| regex | Function |
|---|---|
| . | any single character (except a newline) |
| ^ | the start of the string |
| $ | the end of the string |
| \d | any digit |
| \D | any non-digit |
| \s | any whitespace |
| \S | any non-whitespace |
| \w | any word character (letter, digit, or underscore) |
| \W | any non-word character |
| [abc] | a or b or c |
| [^abc] | anything except a, b, or c |
| ? | 0 or 1 |
| * | 0 or more |
| + | 1 or more |
| {n,m} | between n and m repetitions |
| \b | a word boundary |
| \B | a non-word boundary (any position that is not a word boundary) |
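
A few rows of the table can be tried out directly. The sketch below uses Python's `re` module as a stand-in (an assumption, not stringr itself) with raw strings, so a single `\` is enough; in an R string the same patterns would need the doubled backslash. All sample strings are invented for illustration.

```python
import re

# + is "1 or more": each run of digits is one match.
digit_runs = re.findall(r"\d+", "Order 66 left at 10:45")    # ['66', '10', '45']

# * is "0 or more", so a lone "a" matches ab*; + needs at least one b.
star_ok = re.fullmatch(r"ab*", "a") is not None              # True
plus_ok = re.fullmatch(r"ab+", "a") is not None              # False

# ? is "0 or 1": the u is optional.
spellings = re.findall(r"colou?r", "color colour")           # ['color', 'colour']

# {n,m} between word boundaries: whole numbers with 2 or 3 digits.
two_or_three = re.findall(r"\b\d{2,3}\b", "7 42 123 4567")   # ['42', '123']

# [abc] matches one character from the set, [^abc] one character outside it.
in_set = re.findall(r"[abc]", "cab ride")                    # ['c', 'a', 'b']
not_in_set = re.findall(r"[^abc ]", "cab ride")              # ['r', 'i', 'd', 'e']

print(digit_runs, star_ok, plus_ok, spellings, two_or_three, in_set, not_in_set)
```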
Some basic definitions and rules
The definition of a word in regex: A word is defined differently for regex purposes than in linguistics. Here, a word is any sequence of digits, letters, or underscores! This is the definition that \w and \b rely on.
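To see how this definition differs from the everyday notion of a word, here is a minimal sketch using Python's `re` module (an assumed stand-in for stringr) on a made-up sentence.

```python
import re

sentence = "don't stop_now 42 times"

# \w+ grabs maximal runs of letters, digits, and underscores.
# The apostrophe is not a word character, so "don't" is split in two,
# while "stop_now" stays together and "42" counts as a word.
tokens = re.findall(r"\w+", sentence)

print(tokens)   # ['don', 't', 'stop_now', '42', 'times']
```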
Greedy and lazy regexes:
Repetition operators (such as the Kleene star `*` and the Kleene plus `+`) are greedy. This means that they match as much of a string as they can. In other words, `+` and `*` expand to cover the longest string that still lets the whole pattern match.
To make them non-greedy, or lazy, so that they match as little as possible, we should put a `?` after them: `*?`, `+?`, `{n,m}?`.
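The difference is easiest to see side by side. The following sketch uses Python's `re` module as a stand-in (an assumption; the quantifier syntax is the same) on an invented HTML-like string.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last > it can reach, so there is one long match.
greedy = re.findall(r"<.+>", text)

# Lazy: .+? stops at the first > it can reach, so each tag matches separately.
lazy = re.findall(r"<.+?>", text)

# The same applies to counted repetition.
greedy_count = re.match(r"\d{2,4}", "12345").group()    # '1234'
lazy_count = re.match(r"\d{2,4}?", "12345").group()     # '12'

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(greedy_count, lazy_count)
```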
Grouping and backreferences
We can group some rules with () and then refer back to what they matched using \1. \1 here refers to the text matched by the first group, \2 refers to the 2nd group, and so on. Sometimes we want to group a rule but not refer to it later. In this case, we should put a ?: before the rule inside the (). Some examples:
"They are (cats|dogs). Children are playing with \\1"
"(?:some|a few) (people|cats) like some \\1", here \1 refers to (people|cats), not to (?:some|a few).
So:
Capturing group: (abc)
Non-capturing group: (?:abc)
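The two example patterns can be run as they are written, doubled backslash included, because an ordinary Python string literal escapes `\` the same way. This is a sketch with Python's `re` module as an assumed stand-in for stringr; the test sentences are made up.

```python
import re

# \1 must repeat whatever the first group actually matched.
pattern = "They are (cats|dogs). Children are playing with \\1"
same = re.search(pattern, "They are cats. Children are playing with cats")
different = re.search(pattern, "They are cats. Children are playing with dogs")

# (?:...) groups without capturing, so (people|cats) is group 1.
pattern2 = "(?:some|a few) (people|cats) like some \\1"
m = re.search(pattern2, "a few cats like some cats")

print(same is not None)        # True
print(different is not None)   # False
print(m.groups())              # ('cats',)
```

Only one group shows up in `m.groups()`, which confirms that the non-capturing group does not take a number.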
Lookahead assertions
There will be times when we need to predict the future: look ahead in the text to see if some pattern matches, but not advance the match cursor, so that we can then deal with the pattern if it occurs. In order to do that, we need to use lookahead assertions, which belong to the family of lookaround assertions. (?= pattern) is true if pattern occurs right after the current position; the preceding regex then matches, but pattern itself is not included in the match. It is called positive lookahead. The negative lookahead, (?! pattern), is true if the pattern does not match at that position. Some examples:
"\\d(?=px)" matches the 2 and the 4 in "1pt 2px 3em 4px".
"\\d(?!px)" matches the 1 and the 3 in "1pt 2px 3em 4px".
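Both assertions can be checked on a string such as "1pt 2px 3em 4px". The sketch below uses Python's `re` module as a stand-in (an assumption; the lookahead syntax is the same) and also shows that the lookahead text is not part of the match.

```python
import re

text = "1pt 2px 3em 4px"

# Positive lookahead: a digit only if "px" follows immediately.
followed_by_px = re.findall(r"\d(?=px)", text)

# Negative lookahead: a digit only if "px" does not follow.
not_followed_by_px = re.findall(r"\d(?!px)", text)

# The assertion is zero-width: the match covers the digit alone.
first = re.search(r"\d(?=px)", text)

print(followed_by_px)                # ['2', '4']
print(not_followed_by_px)            # ['1', '3']
print(first.group(), first.span())   # 2 (4, 5)
```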
Disjunction and precedence
You can use alternation to pick between two or more alternative patterns by using |. | means this one or the other. Thus, (gupp(y|ies)) means guppy or guppies, while (guppy|ies) means guppy or ies.
Here we should pay attention to the precedence hierarchy. Operators higher in the hierarchy are applied first. From highest to lowest, the order is: parentheses (), counters (*, +, ?, {}), sequences and anchors, and finally disjunction |.
This means that () is applied before all other rules, and | is applied last. Therefore, putting y|ies in parentheses makes the regex choose between y and ies only, not, for instance, between guppy and ies.
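The guppy example can be verified with full-string matching. This is a sketch using Python's `re` module as an assumed stand-in for stringr; `candidates` is an illustrative name, and the last two checks rely on the standard rule that counters bind more tightly than sequences.

```python
import re

candidates = ["guppy", "guppies", "ies"]

# Parentheses limit the alternation to the suffix: gupp + (y or ies).
grouped = [re.fullmatch(r"gupp(y|ies)", s) is not None for s in candidates]

# Without them, | splits the whole pattern: "guppy" or "ies".
ungrouped = [re.fullmatch(r"guppy|ies", s) is not None for s in candidates]

# Counters bind tighter than sequences: the* repeats only the e,
# while (the)* repeats the whole group.
star_on_e = re.fullmatch(r"the*", "thethe") is not None        # False
star_on_group = re.fullmatch(r"(the)*", "thethe") is not None  # True

print(grouped)     # [True, True, False]
print(ungrouped)   # [True, False, True]
print(star_on_e, star_on_group)
```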