library(stringr);

\b - Word boundary

From what I have understood, a word can consist of letters, digits and the underscore.

*Reference: http://www.rexegg.com/regex-boundaries.html*
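
Before the examples, here is a quick sanity check of which characters count as word characters. This is only a sketch: the sample string and the variable names are made up for illustration, and it relies on \w being the character class that \b is defined in terms of.

```r
library(stringr)

# Hypothetical sample: a letter, a digit, an underscore, then a space, "?" and "-"
sample_chars <- "a1_ ?-"

# \w matches a single word character; everything else is skipped
word_chars <- str_extract_all(sample_chars, "\\w")[[1]]
word_chars
## [1] "a" "1" "_"
```

Only the letter, the digit and the underscore come back; the space, the "?" and the "-" are non-word characters.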

In the first example, I match a run of any characters (.+) that has to end at a word boundary. Because .+ is greedy, the match extends up to the last word boundary in the string.

example.one <- "Hello, how r you ?";
str_extract(example.one, ".+\\b");
## [1] "Hello, how r you"

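Note: The pattern didn’t pick up the “ ?” at the end, as neither the space nor the “?” is a word character, so the last word boundary is right after “you”. To prove the point, I will add an underscore at the end. Now the match includes the “?” too, because the underscore “_” is a word character and the end of the string right after it is a word boundary.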

example.one <- "Hello, how r you ?_";
str_extract(example.one, ".+\\b");
## [1] "Hello, how r you ?_"
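
One way to see where the boundaries fall is to replace every zero-width \b match with a visible marker. The helper below, mark_boundaries, is a hypothetical name introduced only for this sketch; it assumes str_replace_all accepts zero-width matches, which is how I understand stringr to behave.

```r
library(stringr)

# Hypothetical helper: put a "|" at every word boundary position
mark_boundaries <- function(x) str_replace_all(x, "\\b", "|")

marked_plain <- mark_boundaries("Hello, how r you ?")
marked_plain
## [1] "|Hello|, |how| |r| |you| ?"

marked_underscore <- mark_boundaries("Hello, how r you ?_")
marked_underscore
## [1] "|Hello|, |how| |r| |you| ?|_|"
```

In the first string the last marker sits right after “you”, which is where the greedy .+\b match stops. In the second string the trailing underscore adds a boundary at the very end, so the match can run through the whole string.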

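Well, what is a word boundary? It is a zero-width position that has a word character (letter/digit/underscore) on one side and a non-word character, or the start or end of the string, on the other. When I use \b at the beginning of a pattern, right before a letter, the match has to start at a position whose left side is not a word character and whose right side is a word character.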

Let me tweak my string a bit.

example.one <- "_hello, how r you ?_";

If I try to find the part of the string where a word starts with “h”:

unlist(str_extract_all(example.one, "\\bh.+"));
## [1] "how r you ?_"

If you look at the above result, the regular expression:

  1. Doesn’t match “hello”:
    Because to the left of “hello” there is an underscore, which is also a word character, so there is no word boundary before the “h” in “hello”.
  2. Matches “how”:
    Because in front of the “h” in “how” there is a space, and a space is not a word character. Hence there is a word boundary before this “h”.
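
The two cases above can be checked in isolation with str_detect. The candidate strings here are invented for illustration; each one puts a different character (or nothing) in front of the “h”.

```r
library(stringr)

# Hypothetical inputs: underscore, space, nothing, and a digit before the "h"
candidates <- c("_hello", " hello", "hello", "9hello")

# TRUE only where the "h" sits right after a word boundary
starts_at_boundary <- str_detect(candidates, "\\bh")
starts_at_boundary
## [1] FALSE  TRUE  TRUE FALSE
```

The underscore and the digit are word characters, so there is no boundary before the “h” in those strings. A space, or the start of the string, does give a boundary.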

Extending my understanding to the pattern \b[some word]\b:

For my example, if I search for \bh\b, I will not find any match, as there is no “h” in the given string that has a word boundary on both its left and its right.

The “h” in “how” has a word boundary on the left (a space on the left) but has an “o” on the right, which is a word character.

unlist(str_extract_all(example.one, "\\bh\\b"));
## character(0)

If I tweak my string a bit and introduce a non-word character after the “h” in “how”, you can see that the pattern will match the “h”:

example.one <- "_hello, h.ow r you ?_";
unlist(str_extract_all(example.one, "\\bh\\b"));
## [1] "h"
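
The same idea carries over to whole words. As a small sketch, with a sentence made up for illustration, \bhow\b only picks the occurrences of “how” that stand on their own:

```r
library(stringr)

# Hypothetical sentence with "how" standalone, inside "somehow", and inside "how_to"
sentence <- "how now, somehow how_to how"

# Both boundaries must hold, so "somehow" and "how_to" are skipped
whole_words <- str_extract_all(sentence, "\\bhow\\b")[[1]]
whole_words
## [1] "how" "how"
```

“somehow” fails on the left boundary (the “e” is a word character) and “how_to” fails on the right boundary (the underscore is a word character), so only the first and last “how” are returned.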